Forward and backward multimodal NMT for improved monolingual and multilingual cross-modal retrieval

Po-Yao Huang, Xiaojun Chang, Alexander Hauptmann, Eduard Hovy

Research output: Chapter in Book/Report/Conference proceeding > Conference Paper > Research > peer-review

Abstract

We explore methods to enrich the diversity of captions associated with pictures for learning improved visual-semantic embeddings (VSE) in cross-modal retrieval. In the spirit of "A picture is worth a thousand words", it would take dozens of sentences to parallel each picture's content adequately. In practice, however, real-world multimodal datasets tend to provide only a few (typically five) descriptions per image. For cross-modal retrieval, the resulting lack of diversity and coverage prevents systems from capturing the fine-grained inter-modal dependencies and intra-modal diversities in the shared VSE space. Because the encoder-decoder architectures in neural machine translation (NMT) have the capacity to enrich both monolingual and multilingual textual diversity, we propose a novel framework that leverages multimodal neural machine translation (MMT) to perform forward and backward translations based on salient visual objects. These translations generate additional text-image pairs, which enable training improved monolingual cross-modal retrieval (English-Image) and multilingual cross-modal retrieval (English-Image and German-Image) models. Experimental results show that the proposed framework can substantially and consistently improve the performance of state-of-the-art models on multiple datasets. The results also suggest that models with multilingual VSE outperform models with monolingual VSE.

Original language: English
Title of host publication: Proceedings of the 2020 International Conference on Multimedia Retrieval
Editors: Klaus Schoeffmann, Phoebe Chen, Noel E. O’Connor
Place of Publication: New York NY USA
Publisher: Association for Computing Machinery (ACM)
Pages: 53-62
Number of pages: 10
ISBN (Electronic): 9781450370875
DOIs: https://doi.org/10.1145/3372278.3390674
Publication status: Published - Jun 2020
Event: ACM International Conference on Multimedia Retrieval 2020 - Dublin, Ireland
Duration: 26 Oct 2020 to 29 Oct 2020
Conference number: 10th
https://dl.acm.org/doi/proceedings/10.1145/3372278 (Proceedings)
http://www.icmr2020.org (Website)

Publication series

Name: ICMR 2020 - Proceedings of the 2020 International Conference on Multimedia Retrieval

Conference

Conference: ACM International Conference on Multimedia Retrieval 2020
Abbreviated title: ICMR 2020
Country: Ireland
City: Dublin
Period: 26/10/20 to 29/10/20

Keywords

  • Cross-modal retrieval
  • Joint embedding
  • Multilingual multimodal representation
  • Multimodal machine translation

Cite this

Huang, P-Y., Chang, X., Hauptmann, A., & Hovy, E. (2020). Forward and backward multimodal NMT for improved monolingual and multilingual cross-modal retrieval. In K. Schoeffmann, P. Chen, & N. E. O’Connor (Eds.), Proceedings of the 2020 International Conference on Multimedia Retrieval (pp. 53-62). (ICMR 2020 - Proceedings of the 2020 International Conference on Multimedia Retrieval). Association for Computing Machinery (ACM). https://doi.org/10.1145/3372278.3390674