Annotation efficient cross-modal retrieval with adversarial attentive alignment

Po-Yao Huang, Guoliang Kang, Wenhe Liu, Xiaojun Chang, Alexander G. Hauptmann

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review

Abstract

Visual-semantic embeddings are central to many multimedia applications such as cross-modal retrieval between visual data and natural language descriptions. Conventionally, learning a joint embedding space relies on large parallel multimodal corpora. Since massive human annotation is expensive to obtain, there is strong motivation to develop versatile algorithms that learn from large corpora with fewer annotations. In this paper, we propose a novel framework that leverages automatically extracted regional semantics from un-annotated images as additional weak supervision to learn visual-semantic embeddings. The proposed model employs adversarial attentive alignments to close the inherent heterogeneous gaps between annotated and un-annotated portions of the visual and textual domains. To demonstrate its superiority, we conduct extensive experiments on sparsely annotated multimodal corpora. The experimental results show that the proposed model outperforms state-of-the-art visual-semantic embedding models by a significant margin for cross-modal retrieval tasks on the sparse Flickr30k and MS-COCO datasets. It is also worth noting that, despite using only 20% of the annotations, the proposed model achieves competitive performance (Recall at 10 > 80.0% for 1K and > 70.0% for 5K text-to-image retrieval) compared to benchmarks trained with the complete annotations.
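The abstract reports Recall at 10 for text-to-image retrieval over a learned joint embedding space. As an illustrative point of reference only (this is not the authors' code; the use of NumPy, L2-normalised 512-dimensional embeddings, and the convention that caption i is paired with image i are all assumptions), the sketch below shows how Recall@K is typically computed for such a retrieval benchmark.

```python
# Minimal sketch of Recall@K for text-to-image retrieval in a shared
# embedding space. Assumes caption i is paired with image i and that
# both embedding sets are already L2-normalised.
import numpy as np

def recall_at_k(text_emb: np.ndarray, image_emb: np.ndarray, k: int = 10) -> float:
    """text_emb: (N, d) caption embeddings; image_emb: (N, d) image embeddings."""
    sims = text_emb @ image_emb.T                      # cosine similarities, shape (N, N)
    ranked = np.argsort(-sims, axis=1)                 # image indices sorted by similarity
    hits = (ranked[:, :k] == np.arange(len(sims))[:, None]).any(axis=1)
    return float(hits.mean())                          # fraction of captions whose paired
                                                       # image appears in the top k

# Example usage: random embeddings stand in for a trained joint embedding.
rng = np.random.default_rng(0)
t = rng.normal(size=(1000, 512)); t /= np.linalg.norm(t, axis=1, keepdims=True)
v = rng.normal(size=(1000, 512)); v /= np.linalg.norm(v, axis=1, keepdims=True)
print(f"R@10 = {recall_at_k(t, v, k=10):.3f}")
```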

Original language: English
Title of host publication: Proceedings of the 27th ACM International Conference on Multimedia
Editors: Guillaume Gravier, Hayley Hung, Chong-Wah Ngo, Wei Tsang Ooi
Place of Publication: New York NY USA
Publisher: Association for Computing Machinery (ACM)
Pages: 1758-1767
Number of pages: 10
ISBN (Electronic): 9781450368896, 9781450367936
DOIs: https://doi.org/10.1145/3343031.3350894
Publication status: Published - 2019
Event: ACM International Conference on Multimedia 2019 - Nice, France
Duration: 21 Oct 2019 - 25 Oct 2019
Conference number: 27th
https://acmmm.org/

Conference

Conference: ACM International Conference on Multimedia 2019
Abbreviated title: MM 2019
Country: France
City: Nice
Period: 21/10/19 - 25/10/19
Internet address: https://acmmm.org/

Keywords

  • Adversarial Learning
  • Annotation Efficiency
  • Cross-modal Retrieval
  • Joint Embedding

Cite this

Huang, P-Y., Kang, G., Liu, W., Chang, X., & Hauptmann, A. G. (2019). Annotation efficient cross-modal retrieval with adversarial attentive alignment. In G. Gravier, H. Hung, C-W. Ngo, & W. Tsang Ooi (Eds.), Proceedings of the 27th ACM International Conference on Multimedia (pp. 1758-1767). New York NY USA: Association for Computing Machinery (ACM). https://doi.org/10.1145/3343031.3350894
Huang, Po-Yao ; Kang, Guoliang ; Liu, Wenhe ; Chang, Xiaojun ; Hauptmann, Alexander G. / Annotation efficient cross-modal retrieval with adversarial attentive alignment. Proceedings of the 27th ACM International Conference on Multimedia. editor / Guillaume Gravier ; Hayley Hung ; Chong-Wah Ngo ; Wei Tsang Ooi. New York NY USA : Association for Computing Machinery (ACM), 2019. pp. 1758-1767
@inproceedings{beb56653c5884c0d81d6abdd6678f495,
title = "Annotation efficient cross-modal retrieval with adversarial attentive alignment",
abstract = "Visual-semantic embeddings are central to many multimedia applications such as cross-modal retrieval between visual data and natural language descriptions. Conventionally, learning a joint embedding space relies on large parallel multimodal corpora. Since massive human annotation is expensive to obtain, there is a strong motivation in developing versatile algorithms to learn from large corpora with fewer annotations. In this paper, we propose a novel framework to leverage automatically extracted regional semantics from un-annotated images as additional weak supervision to learn visual-semantic embeddings. The proposed model employs adversarial attentive alignments to close the inherent heterogeneous gaps between annotated and un-annotated portions of visual and textual domains. To demonstrate its superiority, we conduct extensive experiments on sparsely annotated multimodal corpora. The experimental results show that the proposed model outperforms state-of-the-art visual-semantic embedding models by a significant margin for cross-modal retrieval tasks on the sparse Flickr30k and MS-COCO datasets. It is also worth noting that, despite using only 20{\%} of the annotations, the proposed model can achieve competitive performance (Recall at 10 > 80.0{\%} for 1K and > 70.0{\%} for 5K text-to-image retrieval) compared to the benchmarks trained with the complete annotations.",
keywords = "Adversarial Learning, Annotation Efficiency, Cross-modal Retrieval, Joint Embedding",
author = "Po-Yao Huang and Guoliang Kang and Wenhe Liu and Xiaojun Chang and Hauptmann, {Alexander G.}",
year = "2019",
doi = "10.1145/3343031.3350894",
language = "English",
pages = "1758--1767",
editor = "Gravier, {Guillaume} and Hung, {Hayley} and Ngo, {Chong-Wah} and {Tsang Ooi}, {Wei}",
booktitle = "Proceedings of the 27th ACM International Conference on Multimedia",
publisher = "Association for Computing Machinery (ACM)",
address = "United States of America",

}

Huang, P-Y, Kang, G, Liu, W, Chang, X & Hauptmann, AG 2019, Annotation efficient cross-modal retrieval with adversarial attentive alignment. in G Gravier, H Hung, C-W Ngo & W Tsang Ooi (eds), Proceedings of the 27th ACM International Conference on Multimedia. Association for Computing Machinery (ACM), New York NY USA, pp. 1758-1767, ACM International Conference on Multimedia 2019, Nice, France, 21/10/19. https://doi.org/10.1145/3343031.3350894

Annotation efficient cross-modal retrieval with adversarial attentive alignment. / Huang, Po-Yao; Kang, Guoliang; Liu, Wenhe; Chang, Xiaojun; Hauptmann, Alexander G.

Proceedings of the 27th ACM International Conference on Multimedia. ed. / Guillaume Gravier; Hayley Hung; Chong-Wah Ngo; Wei Tsang Ooi. New York NY USA : Association for Computing Machinery (ACM), 2019. p. 1758-1767.

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review

TY - GEN

T1 - Annotation efficient cross-modal retrieval with adversarial attentive alignment

AU - Huang, Po-Yao

AU - Kang, Guoliang

AU - Liu, Wenhe

AU - Chang, Xiaojun

AU - Hauptmann, Alexander G.

PY - 2019

Y1 - 2019

N2 - Visual-semantic embeddings are central to many multimedia applications such as cross-modal retrieval between visual data and natural language descriptions. Conventionally, learning a joint embedding space relies on large parallel multimodal corpora. Since massive human annotation is expensive to obtain, there is a strong motivation in developing versatile algorithms to learn from large corpora with fewer annotations. In this paper, we propose a novel framework to leverage automatically extracted regional semantics from un-annotated images as additional weak supervision to learn visual-semantic embeddings. The proposed model employs adversarial attentive alignments to close the inherent heterogeneous gaps between annotated and un-annotated portions of visual and textual domains. To demonstrate its superiority, we conduct extensive experiments on sparsely annotated multimodal corpora. The experimental results show that the proposed model outperforms state-of-the-art visual-semantic embedding models by a significant margin for cross-modal retrieval tasks on the sparse Flickr30k and MS-COCO datasets. It is also worth noting that, despite using only 20% of the annotations, the proposed model can achieve competitive performance (Recall at 10 > 80.0% for 1K and > 70.0% for 5K text-to-image retrieval) compared to the benchmarks trained with the complete annotations.

AB - Visual-semantic embeddings are central to many multimedia applications such as cross-modal retrieval between visual data and natural language descriptions. Conventionally, learning a joint embedding space relies on large parallel multimodal corpora. Since massive human annotation is expensive to obtain, there is a strong motivation in developing versatile algorithms to learn from large corpora with fewer annotations. In this paper, we propose a novel framework to leverage automatically extracted regional semantics from un-annotated images as additional weak supervision to learn visual-semantic embeddings. The proposed model employs adversarial attentive alignments to close the inherent heterogeneous gaps between annotated and un-annotated portions of visual and textual domains. To demonstrate its superiority, we conduct extensive experiments on sparsely annotated multimodal corpora. The experimental results show that the proposed model outperforms state-of-the-art visual-semantic embedding models by a significant margin for cross-modal retrieval tasks on the sparse Flickr30k and MS-COCO datasets. It is also worth noting that, despite using only 20% of the annotations, the proposed model can achieve competitive performance (Recall at 10 > 80.0% for 1K and > 70.0% for 5K text-to-image retrieval) compared to the benchmarks trained with the complete annotations.

KW - Adversarial Learning

KW - Annotation Efficiency

KW - Cross-modal Retrieval

KW - Joint Embedding

UR - http://www.scopus.com/inward/record.url?scp=85074835405&partnerID=8YFLogxK

U2 - 10.1145/3343031.3350894

DO - 10.1145/3343031.3350894

M3 - Conference Paper

SP - 1758

EP - 1767

BT - Proceedings of the 27th ACM International Conference on Multimedia

A2 - Gravier, Guillaume

A2 - Hung, Hayley

A2 - Ngo, Chong-Wah

A2 - Tsang Ooi, Wei

PB - Association for Computing Machinery (ACM)

CY - New York NY USA

ER -

Huang P-Y, Kang G, Liu W, Chang X, Hauptmann AG. Annotation efficient cross-modal retrieval with adversarial attentive alignment. In Gravier G, Hung H, Ngo C-W, Tsang Ooi W, editors, Proceedings of the 27th ACM International Conference on Multimedia. New York NY USA: Association for Computing Machinery (ACM). 2019. p. 1758-1767 https://doi.org/10.1145/3343031.3350894