Abstract
We explore methods to enrich the diversity of captions associated with pictures for learning improved visual-semantic embeddings (VSE) in cross-modal retrieval. In the spirit of "A picture is worth a thousand words", it would take dozens of sentences to parallel each picture's content adequately. In practice, however, real-world multimodal datasets tend to provide only a few (typically five) descriptions per image. For cross-modal retrieval, the resulting lack of diversity and coverage prevents systems from capturing the fine-grained inter-modal dependencies and intra-modal diversities in the shared VSE space. Because encoder-decoder architectures in neural machine translation (NMT) have the capacity to enrich both monolingual and multilingual textual diversity, we propose a novel framework that leverages multimodal neural machine translation (MMT) to perform forward and backward translations based on salient visual objects. The additional text-image pairs generated in this way enable training improved monolingual cross-modal retrieval (English-Image) and multilingual cross-modal retrieval (English-Image and German-Image) models. Experimental results show that the proposed framework can substantially and consistently improve the performance of state-of-the-art models on multiple datasets. The results also suggest that models with multilingual VSE outperform models with monolingual VSE.
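The augmentation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Translator` type, the `augment_pairs` function, and the toy lookup-table translators are hypothetical stand-ins for trained MMT models, and the rule of keeping only unseen paraphrases is an assumption made for the example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for a multimodal translation model: it takes a caption
# and the salient object labels of the image and returns a translated caption.
Translator = Callable[[str, List[str]], str]

Pair = Tuple[str, str, str]  # (image_id, language, caption)


def augment_pairs(
    captions: Dict[str, List[str]],
    objects: Dict[str, List[str]],
    forward: Translator,
    backward: Translator,
) -> List[Pair]:
    """Return original English pairs plus German and round-trip English pairs."""
    pairs: List[Pair] = []
    for image_id, caps in captions.items():
        seen = set(caps)
        labels = objects.get(image_id, [])
        for cap in caps:
            pairs.append((image_id, "en", cap))
            german = forward(cap, labels)          # English -> German
            pairs.append((image_id, "de", german))
            paraphrase = backward(german, labels)  # German -> English
            if paraphrase not in seen:             # assumption: keep new wordings only
                seen.add(paraphrase)
                pairs.append((image_id, "en", paraphrase))
    return pairs


# Toy translators backed by lookup tables; they ignore the object labels,
# which a real MMT model would condition on.
def toy_forward(caption: str, labels: List[str]) -> str:
    return {"a dog runs": "ein Hund rennt"}.get(caption, caption)


def toy_backward(caption: str, labels: List[str]) -> str:
    return {"ein Hund rennt": "a dog is running"}.get(caption, caption)


pairs = augment_pairs(
    {"img1": ["a dog runs"]}, {"img1": ["dog"]}, toy_forward, toy_backward
)
for pair in pairs:
    print(pair)
```

In this sketch each original caption yields one German-Image pair and, when the round trip produces a new wording, one extra English-Image pair, so the same pool could feed either a monolingual or a multilingual retrieval model.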
Original language | English |
---|---|
Title of host publication | Proceedings of the 2020 International Conference on Multimedia Retrieval |
Editors | Klaus Schoeffmann, Phoebe Chen, Noel E. O’Connor |
Place of Publication | New York NY USA |
Publisher | Association for Computing Machinery (ACM) |
Pages | 53-62 |
Number of pages | 10 |
ISBN (Electronic) | 9781450370875 |
Publication status | Published - Jun 2020 |
Event | ACM International Conference on Multimedia Retrieval 2020 - Dublin, Ireland; Duration: 26 Oct 2020 → 29 Oct 2020; Conference number: 10th; Proceedings: https://dl.acm.org/doi/proceedings/10.1145/3372278; Website: http://www.icmr2020.org |
Publication series
Name | ICMR 2020 - Proceedings of the 2020 International Conference on Multimedia Retrieval |
---|---|
Conference
Conference | ACM International Conference on Multimedia Retrieval 2020 |
---|---|
Abbreviated title | ICMR 2020 |
Country/Territory | Ireland |
City | Dublin |
Period | 26/10/20 → 29/10/20 |
Keywords
- Cross-modal retrieval
- Joint embedding
- Multilingual multimodal representation
- Multimodal machine translation