Multi-head attention with diversity for learning grounded multilingual multimodal representations

Po-Yao Huang, Xiaojun Chang, Alexander Hauptmann

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review

1 Citation (Scopus)

Abstract

With the aim of promoting and understanding the multilingual version of image search, we leverage visual object detection and propose a model with diverse multi-head attention to learn grounded multilingual multimodal representations. Specifically, our model attends to different types of textual semantics in two languages and to visual objects for fine-grained alignments between sentences and images. We introduce a new objective function that explicitly encourages attention diversity to learn an improved visual-semantic embedding space. We evaluate our model on the German-Image and English-Image matching tasks on the Multi30K dataset, and on the Semantic Textual Similarity task with English descriptions of visual content. Results show that our model yields a significant performance gain over other methods on all three tasks.
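To make the idea of "attention diversity" concrete, the following is a minimal illustrative sketch, not the objective function from the paper. It assumes that each head is a learned query vector that pools a set of token or object features, and that diversity is measured as the mean pairwise cosine similarity between the pooled head embeddings. The names `multi_head_pool` and `diversity_penalty`, the scaled dot-product scoring, and the cosine-based penalty are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_pool(features, head_queries):
    """Pool n feature vectors into k head embeddings (hypothetical helper).

    features:     (n, d) array, e.g. word or detected-object features.
    head_queries: (k, d) array, one query vector per attention head.
    Returns (k, d) head embeddings and (k, n) attention weights.
    """
    d = features.shape[1]
    scores = head_queries @ features.T / np.sqrt(d)  # (k, n)
    weights = softmax(scores, axis=1)                # each row sums to 1
    return weights @ features, weights


def diversity_penalty(head_embeddings):
    """Mean cosine similarity between distinct heads (assumed penalty).

    Lower values mean the heads produce more dissimilar embeddings.
    Requires at least two heads.
    """
    norms = np.linalg.norm(head_embeddings, axis=1, keepdims=True)
    unit = head_embeddings / np.maximum(norms, 1e-12)
    sim = unit @ unit.T                              # (k, k) cosine matrix
    k = sim.shape[0]
    off_diagonal = sim[~np.eye(k, dtype=bool)]       # drop self-similarity
    return float(off_diagonal.mean())


# Toy data: 6 feature vectors of dimension 8, pooled by 3 heads.
rng = np.random.default_rng(0)
features = rng.normal(size=(6, 8))
queries = rng.normal(size=(3, 8))

heads, weights = multi_head_pool(features, queries)
penalty = diversity_penalty(heads)
```

In a training setup of this kind, a term such as `penalty` would typically be added, with some weight, to a matching loss between sentence and image embeddings, so that the heads are pushed toward attending to different content.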

Original language: English
Title of host publication: EMNLP-IJCNLP 2019, 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing
Subtitle of host publication: Proceedings of the Conference
Editors: Jing Jiang, Vincent Ng, Xiaojun Wan
Place of Publication: Stroudsburg, PA, USA
Publisher: Association for Computational Linguistics (ACL)
Pages: 1461-1467
Number of pages: 7
ISBN (Electronic): 9781950737901
DOI: https://doi.org/10.18653/v1/D19-1154
Publication status: Published - Nov 2019
Event: Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning 2019 - Hong Kong, China
Duration: 3 Nov 2019 to 7 Nov 2019
Conference number: 9th
https://www.emnlp-ijcnlp2019.org (Website)
https://www.aclweb.org/anthology/volumes/D19-1/ (Proceedings)

Conference

Conference: Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning 2019
Abbreviated title: EMNLP-IJCNLP 2019
Country: China
City: Hong Kong
Period: 3/11/19 to 7/11/19

Cite this

Huang, P-Y., Chang, X., & Hauptmann, A. (2019). Multi-head attention with diversity for learning grounded multilingual multimodal representations. In J. Jiang, V. Ng, & X. Wan (Eds.), EMNLP-IJCNLP 2019, 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing: Proceedings of the Conference (pp. 1461-1467). [D19-1154] Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/D19-1154