L3S members Avishek Anand and Ralph Ewerth as well as the Visual Analytics Research Group of the German National Library of Science and Technology (TIB) received the "Best Paper Award" for their contribution "Understanding, Categorizing and Predicting Semantic Image-Text Relations" at this year's ACM International Conference on Multimedia Retrieval (ACM ICMR). The conference took place from 10 to 13 June 2019 in Ottawa, Canada. A total of 84 papers in the "Full Paper" category were submitted for review, of which 26 were selected for presentation at the conference.
The paper extends the current state of the art on image-text relations by a further dimension. So far, image-text combinations have been characterized using the two metrics "Cross-modal Mutual Information" (CMI) ("How many objects/persons do image and text have in common?") and "Semantic Correlation" (SC) ("How much interpretation and context do image and text share?"). The winning paper now adds a third: the status relation of image and text. This relation describes whether both modalities (text and image) are equally important in conveying information or whether one of them plays the dominant role.
The paper further shows how these three metrics can be used to derive a categorization of semantic image-text classes, which allows image-text pairs to be classified (automatically) according to their type. The authors took an interdisciplinary approach, taking up research results from communication science and transferring them to the field of multimedia information retrieval.
The authors present a system based on deep neural networks ("deep learning") that can automatically determine these image-text metrics and classes. To train such networks and to support future research, an (almost completely) automatically generated dataset is made publicly available.
Applications for this work can be found, for example, in the field of web-based learning or in schools: here, user-specific or topic-specific content can be filtered or sorted according to relevance. Potentially, however, the results can be applied to many different tasks in the context of multimodal information (generation of image descriptions, automatic question answering, search engines, etc.), as they provide a deeper insight into the interplay of image and text from a computer science perspective.