DNN multimodal fusion techniques for predicting video sentiment

Jennifer Williams, Ramona Comanescu, Oana Radu, Leimin Tian

Research output: Chapter in Book/Report/Conference proceeding › Conference paper › Research


We present our work on sentiment prediction using the benchmark MOSI dataset from the CMU-MultimodalDataSDK. Previous work on multimodal sentiment analysis has focused on input-level feature fusion or decision-level fusion. Here, we propose an intermediate-level feature fusion, which merges weights from each modality (audio, video, and text) during training, followed by additional training on the merged representation. Moreover, we tested principal component analysis (PCA) for feature selection. We found that applying PCA increases unimodal performance, and that multimodal fusion outperforms unimodal models. Our experiments show that our proposed intermediate-level feature fusion outperforms other fusion techniques, achieving the best performance with an overall binary accuracy of 74.0% on the video+text modalities. Our work also improves feature selection for unimodal sentiment analysis, while proposing a novel and effective multimodal fusion architecture for this task.
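To make the fusion idea concrete, the following is a minimal sketch of intermediate-level fusion with PCA preprocessing. It is not the authors' architecture: the feature dimensions, layer sizes, PCA component count, and the use of PyTorch and scikit-learn are all illustrative assumptions. The sketch only shows where the merge happens, between the unimodal hidden layers and the additional fused layers that are trained afterwards.

import numpy as np
import torch
import torch.nn as nn
from sklearn.decomposition import PCA

# Toy per-clip feature matrices; the dimensions are illustrative, not MOSI's.
rng = np.random.default_rng(0)
n_clips = 32
audio_raw = rng.standard_normal((n_clips, 74))
video_raw = rng.standard_normal((n_clips, 35))
text_raw = rng.standard_normal((n_clips, 300))

def pca_reduce(x, k=16):
    # PCA feature selection, as in the abstract; k is an assumed component count.
    return torch.tensor(PCA(n_components=k).fit_transform(x), dtype=torch.float32)

audio, video, text = (pca_reduce(m) for m in (audio_raw, video_raw, text_raw))

class UnimodalEncoder(nn.Module):
    # Learns a hidden representation for a single modality.
    def __init__(self, in_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())

    def forward(self, x):
        return self.net(x)

class IntermediateFusion(nn.Module):
    # Merges the unimodal hidden layers (not the raw inputs, as in
    # input-level fusion, and not the per-modality predictions, as in
    # decision-level fusion), then trains additional layers on top.
    def __init__(self, in_dims, hidden_dim=64):
        super().__init__()
        self.encoders = nn.ModuleList([UnimodalEncoder(d, hidden_dim) for d in in_dims])
        self.head = nn.Sequential(
            nn.Linear(hidden_dim * len(in_dims), hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # one logit for binary sentiment
        )

    def forward(self, inputs):
        merged = torch.cat([enc(x) for enc, x in zip(self.encoders, inputs)], dim=-1)
        return self.head(merged)

model = IntermediateFusion(in_dims=[16, 16, 16])
logits = model([audio, video, text])  # shape: (32, 1)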
Original language: English
Title of host publication: ACL 2018 - First Grand Challenge and Workshop on Human Multimodal Language (Challenge-HML)
Subtitle of host publication: Proceedings of the Workshop - July 20, 2018, Melbourne, Australia
Editors: Amir Zadeh, Louis-Philippe Morency, Paul Pu Liang, Soujanya Poria, Erik Cambria, Stefan Scherer
Place of publication: Stroudsburg, PA, USA
Publisher: Association for Computational Linguistics (ACL)
Number of pages: 9
ISBN (electronic): 9781948087469
Publication status: Published - 2018
Externally published: Yes
Event: Grand Challenge and Workshop on Human Multimodal Language 2018 - Melbourne, Australia
Duration: 20 Jul 2018 - 20 Jul 2018


Conference: Grand Challenge and Workshop on Human Multimodal Language 2018
Abbreviated title: Challenge-HML 2018
