Multi-level, multi-modal interactions for visual question answering over text in images

Jincai Chen, Sheng Zhang, Jiangfeng Zeng, Fuhao Zou, Yuan-Fang Li, Tao Liu, Ping Lu

Research output: Contribution to journal › Article › Research › peer-review

Abstract

Visual scenes containing text in the TextVQA task require a simultaneous understanding of images, questions, and the text in images in order to infer answers. However, most existing cross-modal tasks involve only two modalities, so there are few methods for modeling interactions across three modalities. To bridge this gap, we propose cross- and intra-modal interaction modules for multiple (more than two) modalities, in which scaled dot-product attention is applied to model inter- and intra-modal relationships. In addition, we introduce guidance information to help the attention mechanism learn a more accurate relationship distribution. We construct a Multi-level Complete Interaction (MLCI) model for the TextVQA task by stacking multiple blocks composed of our proposed interaction modules, and we design a multi-level feature joint prediction approach that exploits the output representations of each block in a complementary way to predict answers. Experimental results on the TextVQA dataset show that our model obtains a 5.42% improvement in accuracy over the baseline. Extensive ablation studies are carried out for a comprehensive analysis of the proposed method. Our code is publicly available at https://github.com/zhangshengHust/mlci.
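The cross-modal interaction module described in the abstract is built on scaled dot-product attention between modalities. The sketch below illustrates that core operation for a single modality pair (e.g., question words attending to OCR-token features); it is a minimal, hypothetical PyTorch illustration only, and omits the guidance term, the intra-modal variant, and the multi-level block stacking. Class and variable names are assumptions, not taken from the released MLCI code at the repository linked above.

```python
import torch
from torch import nn


class CrossModalAttention(nn.Module):
    """Scaled dot-product attention from a query modality to a context modality.

    Illustrative sketch only: the projection layout and dimensions are
    assumptions, not the authors' released implementation.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, query_feats: torch.Tensor, context_feats: torch.Tensor) -> torch.Tensor:
        # query_feats:   (batch, n_q, dim), e.g. question-word features
        # context_feats: (batch, n_c, dim), e.g. OCR-token or visual-object features
        q = self.q_proj(query_feats)
        k = self.k_proj(context_feats)
        v = self.v_proj(context_feats)
        # Attention weights over the context modality for every query element.
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        # Aggregate context information into the query modality's representation.
        return attn @ v


if __name__ == "__main__":
    # Toy usage: 14 question tokens attending to 50 OCR tokens.
    batch, n_question, n_ocr, dim = 2, 14, 50, 768
    question = torch.randn(batch, n_question, dim)
    ocr = torch.randn(batch, n_ocr, dim)
    fused = CrossModalAttention(dim)(question, ocr)
    print(fused.shape)  # torch.Size([2, 14, 768])
```

In the full model described in the abstract, modules of this kind would be instantiated for every modality pair (and for each modality with itself) and stacked into blocks, with the per-block outputs combined for answer prediction; see the linked repository for the actual implementation.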

Original language: English
Pages (from-to): 1607–1623
Number of pages: 17
Journal: World Wide Web
Volume: 25
DOIs
Publication status: Published - Jul 2022

Keywords

  • Multi-level feature fusion
  • Multi-modal feature interaction
  • Optical character recognition
  • Self-attention mechanism
  • Visual question answering