Auto-Parsing Network for Image Captioning and Visual Question Answering

Xu Yang, Chongyang Gao, Hanwang Zhang, Jianfei Cai

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review

25 Citations (Scopus)

Abstract

We propose an Auto-Parsing Network (APN) that discovers and exploits the hidden tree structures of the input data to improve the effectiveness of Transformer-based vision-language systems. Specifically, we impose a Probabilistic Graphical Model (PGM), parameterized by the attention operations, on each self-attention layer to incorporate a sparsity assumption. We use this PGM to softly segment an input sequence into a few clusters, where each cluster can be treated as the parent of the entities inside it. By stacking these PGM-constrained self-attention layers, the clusters in a lower layer compose into a new sequence, and the PGM in a higher layer further segments this sequence. Iteratively, a sparse tree is implicitly parsed, and this tree's hierarchical knowledge is incorporated into the transformed embeddings, which can be used to solve the target vision-language tasks. We show that APN can strengthen Transformer-based networks on two major vision-language tasks: captioning and Visual Question Answering. We also develop a PGM probability-based parsing algorithm that reveals the hidden structure of the input during inference.
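The abstract does not give the exact form of the PGM, so the following is only a minimal illustrative sketch of the general idea of softly segmenting a sequence inside a self-attention layer. It is not the authors' implementation. It assumes (hypothetically) that the probability of two neighbouring tokens sharing a cluster is the sigmoid of their attention score, and that the probability of two distant tokens sharing a cluster is the product of the neighbour probabilities between them. The function names `soft_segment_mask` and `constrained_self_attention` are invented for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def soft_segment_mask(link_prob):
    """Build a soft same-cluster matrix from neighbour link probabilities.

    link_prob[i] is the ASSUMED probability that tokens i and i+1 belong to
    the same cluster. The returned matrix C has C[i, j] equal to the product
    of the link probabilities lying between positions i and j, so C[i, i] = 1.
    """
    log_p = np.log(np.clip(link_prob, 1e-9, 1.0))
    cum = np.concatenate([[0.0], np.cumsum(log_p)])
    # cum is non-increasing, so |cum[i] - cum[j]| is minus the log-product.
    return np.exp(-np.abs(cum[:, None] - cum[None, :]))


def constrained_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention re-weighted by a soft segmentation mask."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Assumption: neighbour link probability = sigmoid(score of i -> i+1).
    link_prob = 1.0 / (1.0 + np.exp(-np.diagonal(scores, offset=1)))
    mask = soft_segment_mask(link_prob)
    attn = softmax(scores) * mask
    attn = attn / attn.sum(axis=-1, keepdims=True)  # renormalise each row
    return attn @ v, mask


# Usage on random data: 5 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, mask = constrained_self_attention(x, w_q, w_k, w_v)
```

Under these assumptions, entries of `mask` close to 1 mark token pairs that are likely to sit under the same parent, and stacking such layers would correspond loosely to the iterative segmentation the abstract describes.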

Original language: English
Title of host publication: Proceedings - 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021
Editors: Dima Damen, Tal Hassner, Chris Pal, Yoichi Sato
Place of Publication: Piscataway NJ USA
Publisher: IEEE, Institute of Electrical and Electronics Engineers
Pages: 2177-2187
Number of pages: 11
ISBN (Electronic): 9781665428125
ISBN (Print): 9781665428132
Publication status: Published - 2021
Event: IEEE International Conference on Computer Vision 2021 - Online, United States of America
Duration: 11 Oct 2021 to 17 Oct 2021
https://iccv2021.thecvf.com/home (Website)
https://ieeexplore.ieee.org/xpl/conhome/9709627/proceeding (Proceedings)

Publication series

Name: Proceedings of the IEEE International Conference on Computer Vision
Publisher: IEEE, Institute of Electrical and Electronics Engineers
ISSN (Print): 1550-5499
ISSN (Electronic): 2380-7504

Conference

Conference: IEEE International Conference on Computer Vision 2021
Abbreviated title: ICCV 2021
Country/Territory: United States of America
City: Online
Period: 11/10/21 to 17/10/21