TY - JOUR
T1 - Multi-level, multi-modal interactions for visual question answering over text in images
AU - Chen, Jincai
AU - Zhang, Sheng
AU - Zeng, Jiangfeng
AU - Zou, Fuhao
AU - Li, Yuan-Fang
AU - Liu, Tao
AU - Lu, Ping
N1 - Funding Information:
This work was supported by the National Natural Science Foundation of China under Grants No. 61672246, No. 61272068, No. 61672254, and No. 62102159, the Program for HUST Academic Frontier Youth Team, the Natural Science Foundation of Hubei Province under Grant No. 2020CFB492, and the Humanities and Social Science Fund of the Ministry of Education of China under Grant No. 21YJC870002. In addition, we gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
Publisher Copyright:
© 2021, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
PY - 2022/7
Y1 - 2022/7
AB - Visual scenes containing text in the TextVQA task require a simultaneous understanding of images, questions, and the text in images to reason about answers. However, most existing cross-modal tasks involve only two modalities, so there are few methods for modeling interactions across three modalities. To bridge this gap, we propose cross- and intra-modal interaction modules for multiple (more than two) modalities, in which scaled dot-product attention is applied to model inter- and intra-modal relationships. In addition, we introduce guidance information to help the attention mechanism learn a more accurate relationship distribution. We construct a Multi-level Complete Interaction (MLCI) model for the TextVQA task by stacking multiple blocks composed of the proposed interaction modules, and we design a multi-level feature joint prediction approach that exploits the output representations of each block in a complementary way to predict answers. Experimental results on the TextVQA dataset show that our model achieves a 5.42% improvement in accuracy over the baseline. Extensive ablation studies provide a comprehensive analysis of the proposed method. Our code is publicly available at https://github.com/zhangshengHust/mlci.
KW - Multi-level feature fusion
KW - Multi-modal feature interaction
KW - Optical character recognition
KW - Self-attention mechanism
KW - Visual question answering
UR - http://www.scopus.com/inward/record.url?scp=85119827107&partnerID=8YFLogxK
U2 - 10.1007/s11280-021-00976-2
DO - 10.1007/s11280-021-00976-2
M3 - Article
AN - SCOPUS:85119827107
SN - 1386-145X
VL - 25
SP - 1607
EP - 1623
JO - World Wide Web
JF - World Wide Web
ER -
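
Note: the abstract above describes cross- and intra-modal interaction modules built on scaled dot-product attention. The sketch below is only a minimal illustration of that general mechanism, assuming a PyTorch implementation; the class and variable names (CrossModalAttention, x_query, x_context) are hypothetical, and the authors' actual MLCI code is the repository linked in the record (https://github.com/zhangshengHust/mlci).

import math
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    # Scaled dot-product attention where queries come from one modality
    # (e.g. question words) and keys/values from another (e.g. OCR tokens).
    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.scale = math.sqrt(dim)

    def forward(self, x_query, x_context):
        # x_query:   (batch, n_q, dim) features of the modality being updated
        # x_context: (batch, n_c, dim) features of the modality attended to
        q = self.q_proj(x_query)
        k = self.k_proj(x_context)
        v = self.v_proj(x_context)
        # attention distribution over context tokens for every query token
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)
        return attn @ v  # (batch, n_q, dim) context-aware query features

# toy usage: question features attending to OCR-token features
attn = CrossModalAttention(dim=768)
question_feats = torch.randn(2, 20, 768)
ocr_feats = torch.randn(2, 50, 768)
fused = attn(question_feats, ocr_feats)

Setting x_query and x_context to the same tensor recovers the intra-modal (self-attention) case; the paper's guidance information and block stacking are not reproduced here.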