Annotation efficient cross-modal retrieval with adversarial attentive alignment

Po-Yao Huang, Guoliang Kang, Wenhe Liu, Xiaojun Chang, Alexander G. Hauptmann

Research output: Chapter in Book/Report/Conference proceeding, Conference Paper, Research, peer-review

3 Citations (Scopus)

Abstract

Visual-semantic embeddings are central to many multimedia applications such as cross-modal retrieval between visual data and natural language descriptions. Conventionally, learning a joint embedding space relies on large parallel multimodal corpora. Since massive human annotation is expensive to obtain, there is strong motivation to develop versatile algorithms that learn from large corpora with fewer annotations. In this paper, we propose a novel framework that leverages automatically extracted regional semantics from un-annotated images as additional weak supervision for learning visual-semantic embeddings. The proposed model employs adversarial attentive alignments to close the inherent heterogeneous gaps between the annotated and un-annotated portions of the visual and textual domains. To demonstrate its superiority, we conduct extensive experiments on sparsely annotated multimodal corpora. The experimental results show that the proposed model outperforms state-of-the-art visual-semantic embedding models by a significant margin on cross-modal retrieval tasks on the sparse Flickr30k and MS-COCO datasets. It is also worth noting that, despite using only 20% of the annotations, the proposed model achieves competitive performance (Recall at 10 > 80.0% for 1K and > 70.0% for 5K text-to-image retrieval) compared with the benchmarks trained on the complete annotations.
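
For readers unfamiliar with the metric quoted above, Recall at K in text-to-image retrieval is commonly computed as the fraction of text queries whose matching image appears among the K highest-scoring candidates. The following is a minimal sketch in Python, assuming exactly one ground-truth image per query and a precomputed query-by-image similarity matrix. The function name `recall_at_k` and the toy scores are illustrative assumptions and do not come from the paper.

```python
from typing import Sequence


def recall_at_k(
    similarity: Sequence[Sequence[float]],
    ground_truth: Sequence[int],
    k: int,
) -> float:
    """Fraction of queries whose ground-truth candidate ranks in the top k.

    similarity[i][j] is the score between text query i and image j
    (higher means more similar). ground_truth[i] is the index of the
    single matching image for query i (an assumption of this sketch).
    """
    if len(similarity) != len(ground_truth):
        raise ValueError("one ground-truth index is required per query")
    if not ground_truth:
        raise ValueError("at least one query is required")

    hits = 0
    for scores, target in zip(similarity, ground_truth):
        # Zero-based rank: how many candidates score strictly higher
        # than the ground-truth candidate.
        rank = sum(1 for s in scores if s > scores[target])
        if rank < k:
            hits += 1
    return hits / len(ground_truth)


# Toy example (made-up scores): 3 text queries against 3 images.
toy_similarity = [
    [0.9, 0.1, 0.3],  # query 0, matching image 0 ranks first
    [0.2, 0.4, 0.8],  # query 1, matching image 1 ranks second
    [0.5, 0.6, 0.1],  # query 2, matching image 2 ranks third
]
toy_ground_truth = [0, 1, 2]

for k in (1, 2, 3):
    print(f"Recall at {k}: {recall_at_k(toy_similarity, toy_ground_truth, k):.3f}")
```

Under this definition, a reported Recall at 10 above 80.0% means that for more than four in five text queries the correct image is among the ten top-ranked candidates.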

Original language: English
Title of host publication: Proceedings of the 27th ACM International Conference on Multimedia
Editors: Guillaume Gravier, Hayley Hung, Chong-Wah Ngo, Wei Tsang Ooi
Place of Publication: New York, NY, USA
Publisher: Association for Computing Machinery (ACM)
Pages: 1758-1767
Number of pages: 10
ISBN (Electronic): 9781450368896, 9781450367936
Publication status: Published - 2019
Event: ACM International Conference on Multimedia 2019 - Nice, France
Duration: 21 Oct 2019 to 25 Oct 2019
Conference number: 27th
https://dl.acm.org/doi/proceedings/10.1145/3343031

Conference

Conference: ACM International Conference on Multimedia 2019
Abbreviated title: MM 2019
Country: France
City: Nice
Period: 21/10/19 to 25/10/19

Keywords

  • Adversarial Learning
  • Annotation Efficiency
  • Cross-modal Retrieval
  • Joint Embedding
