Video-based visual question answering (V-VQA) remains challenging at the intersection of vision and language. In this paper, we propose a novel architecture, the Generalized Pyramid Co-attention with Learnable Aggregation Net (GPC), to address two central problems: 1) how to apply co-attention to the V-VQA task given the complex and diverse content of videos; and 2) how to aggregate frame-level (or word-level) features without destroying the feature distributions and temporal information. To solve the first problem, we propose a Generalized Pyramid Co-attention structure with a novel diversity learning module that explicitly encourages attention accuracy and diversity. We first instantiate it as a Multi-path Pyramid Co-attention (MPC) module to capture diverse features. We then observe that each attention branch of the original co-attention mechanism does not interact with the others, which results in coarse attention maps. We therefore extend the MPC structure to a Cascaded Pyramid Transformer Co-attention (CPTC) module, in which co-attention is replaced with transformer co-attention. To solve the second problem, we propose a new learnable aggregation method with a set of evidence gates. It aggregates adaptively weighted frame-level (or word-level) features to extract rich semantic context from the video (or question). The evidence gates then select the most relevant signals, which serve as the evidence used to predict the answer. Extensive experiments on two V-VQA datasets, TGIF-QA and TVQA, show that both MPC and CPTC achieve state-of-the-art performance, and that CPTC performs better under various settings and metrics. Code and models have been released at: https://github.com/lixiangpengcs/LAD-Net-for-VideoQA.
- Cascaded pyramid transformer co-attention
- Diversity learning
- Learnable aggregation
- Video question answering
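
To make the idea of learnable aggregation with evidence gates more concrete, the following is a minimal sketch and not the released implementation. It assumes that frame-level features are pooled with softmax weights from a learned scoring vector, and that a sigmoid gate then scales the pooled context vector element-wise. The function name `gated_aggregate` and the parameters `w_score`, `w_gate`, and `b_gate` are hypothetical, and the random arrays stand in for learned weights and real video features.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_aggregate(frames, w_score, w_gate, b_gate):
    """Pool frame-level features and apply a gate (illustrative assumption).

    frames:  (T, D) frame-level (or word-level) features
    w_score: (D,)   hypothetical scoring vector producing one weight per frame
    w_gate:  (D, D) hypothetical gate projection
    b_gate:  (D,)   hypothetical gate bias
    """
    weights = softmax(frames @ w_score)         # (T,), non-negative, sums to 1
    context = weights @ frames                  # (D,), adaptively weighted sum
    gate = sigmoid(context @ w_gate + b_gate)   # (D,), each entry in (0, 1)
    evidence = gate * context                   # gated context vector
    return evidence, weights, context


# Random stand-ins for learned parameters and extracted features.
rng = np.random.default_rng(0)
T, D = 8, 16
frames = rng.standard_normal((T, D))
w_score = rng.standard_normal(D)
w_gate = 0.1 * rng.standard_normal((D, D))
b_gate = np.zeros(D)

evidence, weights, context = gated_aggregate(frames, w_score, w_gate, b_gate)
```

In this sketch the aggregation weights depend on the features themselves, so the pooling is adaptive rather than a fixed mean, and the gate can only attenuate each dimension of the pooled vector. How the actual model computes its weights and gates is specified in the paper and the released code, not here.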