TY - JOUR
T1 - Neural multimodal cooperative learning toward micro-video understanding
AU - Wei, Yinwei
AU - Wang, Xiang
AU - Guan, Weili
AU - Nie, Liqiang
AU - Lin, Zhouchen
AU - Chen, Baoquan
N1 - Funding Information:
This work was supported in part by the National Basic Research Program of China (973 Program) under Grant 2015CB352502, in part by the National Natural Science Foundation of China under Grant 61772310, Grant 61702300, and Grant 61702302, in part by the Project of Thousand Youth Talents 2016, and in part by the Tencent AI Lab Rhino-Bird Joint Research Program under Grant JR201805. The work of Z. Lin was supported in part by the National Natural Science Foundation (NSF) of China under Grant 61625301 and Grant 61731018 and in part by Microsoft Research Asia.
Funding Information:
Manuscript received August 5, 2018; revised February 12, 2019 and May 30, 2019; accepted June 7, 2019. Date of publication July 1, 2019; date of current version September 12, 2019. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Lucio Marcenaro. (Corresponding author: Liqiang Nie.) Y. Wei, L. Nie, and B. Chen are with the College of Computer Science and Technology, Shandong University, Qingdao 266237, China (e-mail: [email protected]; [email protected]; [email protected]).
Publisher Copyright:
© 1992-2012 IEEE.
PY - 2019/7/1
Y1 - 2019/7/1
N2 - The prevailing characteristics of micro-videos limit the descriptive power of each individual modality. Several pioneering efforts on micro-video representation are restricted to implicitly exploring the consistency among different modalities while ignoring their complementarity. In this paper, we focus on how to explicitly separate the consistent features and the complementary features from the mixed information and harness their combination to improve the expressiveness of each modality. Toward this end, we present a neural multimodal cooperative learning (NMCL) model that splits the consistent component and the complementary component via a novel relation-aware attention mechanism. Specifically, the computed attention score measures the correlation between the features extracted from different modalities. A threshold is then learned for each modality to distinguish the consistent features from the complementary ones according to this score. Thereafter, we integrate the consistent parts to enhance the representations and supplement the complementary ones to reinforce the information in each modality. To address redundant information, which may cause overfitting and is hard to distinguish, we devise an attention network to dynamically capture the features closely related to the category and output a discriminative representation for prediction. Experimental results on a real-world micro-video dataset show that NMCL outperforms the state-of-the-art methods. Further studies verify the effectiveness and the cooperative effects brought by the attentive mechanism.
AB - The prevailing characteristics of micro-videos limit the descriptive power of each individual modality. Several pioneering efforts on micro-video representation are restricted to implicitly exploring the consistency among different modalities while ignoring their complementarity. In this paper, we focus on how to explicitly separate the consistent features and the complementary features from the mixed information and harness their combination to improve the expressiveness of each modality. Toward this end, we present a neural multimodal cooperative learning (NMCL) model that splits the consistent component and the complementary component via a novel relation-aware attention mechanism. Specifically, the computed attention score measures the correlation between the features extracted from different modalities. A threshold is then learned for each modality to distinguish the consistent features from the complementary ones according to this score. Thereafter, we integrate the consistent parts to enhance the representations and supplement the complementary ones to reinforce the information in each modality. To address redundant information, which may cause overfitting and is hard to distinguish, we devise an attention network to dynamically capture the features closely related to the category and output a discriminative representation for prediction. Experimental results on a real-world micro-video dataset show that NMCL outperforms the state-of-the-art methods. Further studies verify the effectiveness and the cooperative effects brought by the attentive mechanism.
KW - attention model
KW - consistency and complementarity
KW - cooperative learning
KW - venue category estimation
UR - http://www.scopus.com/inward/record.url?scp=85072509532&partnerID=8YFLogxK
U2 - 10.1109/TIP.2019.2923608
DO - 10.1109/TIP.2019.2923608
M3 - Article
C2 - 31265394
AN - SCOPUS:85072509532
SN - 1057-7149
VL - 29
SP - 1
EP - 14
JO - IEEE Transactions on Image Processing
JF - IEEE Transactions on Image Processing
ER -