Learning to collocate neural modules for image captioning

Xu Yang, Hanwang Zhang, Jianfei Cai

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review


Abstract

We do not speak word by word from scratch; our brain quickly structures a pattern like "sth do sth at someplace" and then fills in the detailed description. To endow existing encoder-decoder image captioners with such human-like reasoning, we propose a novel framework, learning to Collocate Neural Modules (CNM), to generate the "inner pattern" connecting the visual encoder and the language decoder. Unlike the widely used neural module networks in visual Q&A, where the language (i.e., the question) is fully observable, CNM for captioning is more challenging because the language is being generated and is thus only partially observable. To this end, we make the following technical contributions for CNM training: 1) a compact module design, with one module for function words and three for visual content words (e.g., noun, adjective, and verb); 2) soft module fusion and multi-step module execution, which robustify visual reasoning under partial observation; 3) a linguistic loss that keeps the module controller faithful to part-of-speech collocations (e.g., an adjective precedes a noun). Extensive experiments on the challenging MS-COCO image captioning benchmark validate the effectiveness of our CNM image captioner. In particular, CNM achieves a new state-of-the-art 127.9 CIDEr-D on the Karpathy split and a single-model 126.0 c40 on the official server. CNM is also robust to few training samples: e.g., when trained with only one sentence per image, CNM halves the performance loss compared to a strong baseline.
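
As a rough illustration of contributions 2) and 3) above, a minimal PyTorch sketch of soft module fusion and the linguistic loss follows. All names (SoftModuleFusion, linguistic_loss), layer choices, dimensions, and the exact loss form are illustrative assumptions, not the paper's released implementation: a controller soft-weights four word-type modules (function word, noun, adjective, verb), and the loss nudges those weights toward each generated word's ground-truth part-of-speech class.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SoftModuleFusion(nn.Module):
        """Illustrative sketch: a controller soft-fuses four neural modules
        (function word / noun / adjective / verb) at each decoding step,
        since the sentence is only partially observable while generating.
        Module internals and sizes here are assumptions, not the paper's."""

        NUM_MODULES = 4  # one function-word module + three visual content modules

        def __init__(self, d_hidden=512, d_feat=2048):
            super().__init__()
            # Stand-in modules; the real ones are richer sub-networks.
            self.word_modules = nn.ModuleList(
                nn.Linear(d_feat, d_hidden) for _ in range(self.NUM_MODULES)
            )
            # Controller maps the decoder state to soft weights over modules.
            self.controller = nn.Linear(d_hidden, self.NUM_MODULES)

        def forward(self, decoder_state, visual_feat):
            # decoder_state: (B, d_hidden); visual_feat: (B, d_feat)
            weights = F.softmax(self.controller(decoder_state), dim=-1)   # (B, 4)
            outs = torch.stack(
                [m(visual_feat) for m in self.word_modules], dim=1
            )                                                             # (B, 4, d_hidden)
            fused = (weights.unsqueeze(-1) * outs).sum(dim=1)             # soft fusion
            return fused, weights

    def linguistic_loss(module_weights, pos_tags):
        """Assumed form: cross-entropy pushing the controller's soft module
        weights toward the ground-truth part-of-speech class of the word
        being generated (0 = function word, 1 = noun, 2 = adjective, 3 = verb)."""
        return F.nll_loss(torch.log(module_weights + 1e-8), pos_tags)

    # Toy usage: two decoding steps producing a noun and a function word.
    fusion = SoftModuleFusion()
    state = torch.randn(2, 512)    # decoder hidden states
    feat = torch.randn(2, 2048)    # pooled visual features
    fused, w = fusion(state, feat)
    loss = linguistic_loss(w, torch.tensor([1, 0]))
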
Original language: English
Title of host publication: Proceedings - IEEE International Conference on Computer Vision, ICCV 2019
Editors: In So Kweon, Nikos Paragios, Ming-Hsuan Yang, Svetlana Lazebnik
Place of Publication: Piscataway NJ USA
Publisher: IEEE, Institute of Electrical and Electronics Engineers
Pages: 4250-4260
Number of pages: 11
ISBN (Electronic): 9781728148038
ISBN (Print): 9781728148045
DOIs: https://doi.org/10.1109/ICCV.2019.00435
Publication status: Published - 2019
Externally published: Yes
Event: IEEE International Conference on Computer Vision 2019 - Seoul, Korea, Republic of (South)
Duration: 27 Oct 2019 - 2 Nov 2019
Conference number: 17th
http://iccv2019.thecvf.com/

Publication series

Name: Proceedings of the IEEE International Conference on Computer Vision
Publisher: IEEE, Institute of Electrical and Electronics Engineers
Volume: 2019-October
ISSN (Print): 1550-5499
ISSN (Electronic): 2380-7504

Conference

Conference: IEEE International Conference on Computer Vision 2019
Abbreviated title: ICCV 2019
Country: Korea, Republic of (South)
City: Seoul
Period: 27/10/19 - 2/11/19
Internet address: http://iccv2019.thecvf.com/

Cite this

Yang, X., Zhang, H., & Cai, J. (2019). Learning to collocate neural modules for image captioning. In I. S. Kweon, N. Paragios, M-H. Yang, & S. Lazebnik (Eds.), Proceedings - IEEE International Conference on Computer Vision, ICCV 2019 (pp. 4250-4260). (Proceedings of the IEEE International Conference on Computer Vision; Vol. 2019-October). IEEE, Institute of Electrical and Electronics Engineers. https://doi.org/10.1109/ICCV.2019.00435