DNN multimodal fusion techniques for predicting video sentiment

Jennifer Williams, Ramona Comanescu, Oana Radu, Leimin Tian

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research

Abstract

We present our work on sentiment prediction using the benchmark MOSI dataset from the CMU-MultimodalDataSDK. Previous work on multimodal sentiment analysis has focused on input-level feature fusion or decision-level fusion. Here, we propose an intermediate-level feature fusion, which merges weights from each modality (audio, video, and text) during training, followed by additional joint training. Moreover, we tested principal component analysis (PCA) for feature selection. We found that applying PCA increases unimodal performance, and that multimodal fusion outperforms unimodal models. Our experiments show that our proposed intermediate-level feature fusion outperforms other fusion techniques, achieving the best performance with an overall binary accuracy of 74.0% on the video+text modalities. Our work also improves feature selection for unimodal sentiment analysis, while proposing a novel and effective multimodal fusion architecture for this task.
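
The abstract is the only technical description on this record, so here is a minimal Python sketch (not the authors' implementation) of the two ideas it describes: PCA-based feature selection per modality, and intermediate-level fusion that merges per-modality hidden representations before further joint training. All feature dimensions, layer sizes, and names (pca_reduce, ModalityEncoder, IntermediateFusion) are illustrative assumptions, with PyTorch and scikit-learn standing in for whatever framework the paper used.

import torch
import torch.nn as nn
from sklearn.decomposition import PCA

# (1) PCA feature selection: fit on the training split of one modality's
# features and reuse the fitted PCA to transform dev/test data.
# The number of components is an assumption, not a value from the paper.
def pca_reduce(train_features, n_components=20):
    pca = PCA(n_components=n_components)
    reduced = pca.fit_transform(train_features)
    return reduced, pca

# (2) Intermediate-level fusion: each modality gets its own small subnetwork,
# whose hidden output (rather than raw input or final decision) is fused.
class ModalityEncoder(nn.Module):
    def __init__(self, in_dim, hidden_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())

    def forward(self, x):
        return self.net(x)

class IntermediateFusion(nn.Module):
    """Concatenates per-modality hidden states, then trains further layers jointly."""
    def __init__(self, encoders, hidden_dim=32):
        super().__init__()
        self.encoders = nn.ModuleList(encoders)
        self.head = nn.Sequential(
            nn.Linear(hidden_dim * len(encoders), 32),
            nn.ReLU(),
            nn.Linear(32, 1),  # single logit for binary sentiment
        )

    def forward(self, inputs):
        # inputs: one tensor per modality, in the same order as the encoders
        hidden = [enc(x) for enc, x in zip(self.encoders, inputs)]
        return self.head(torch.cat(hidden, dim=-1))

# Example wiring for the paper's best pairing (video+text); input sizes here
# assume both modalities were reduced to 20 PCA components.
video_encoder = ModalityEncoder(in_dim=20)
text_encoder = ModalityEncoder(in_dim=20)
model = IntermediateFusion([video_encoder, text_encoder])

In this sketch, each encoder could first be trained inside a unimodal classifier and its learned weights carried over before the fused head is trained, approximating the "merges weights from each modality during training with subsequent additional training" described in the abstract.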
Original language: English
Title of host publication: ACL 2018 - First Grand Challenge and Workshop on Human Multimodal Language (Challenge-HML)
Subtitle of host publication: Proceedings of the Workshop - July 20, 2018, Melbourne, Australia
Editors: Amir Zadeh, Louis-Philippe Morency, Paul Pu Liang, Soujanya Poria, Erik Cambria, Stefan Scherer
Place of publication: Stroudsburg, PA, USA
Publisher: Association for Computational Linguistics (ACL)
Pages: 64-72
Number of pages: 9
ISBN (electronic): 9781948087469
Publication status: Published - 2018
Externally published: Yes
Event: Grand Challenge and Workshop on Human Multimodal Language 2018 - Melbourne, Australia
Duration: 20 Jul 2018 - 20 Jul 2018
http://multicomp.cs.cmu.edu/acl2018multimodalchallenge/

Conference

Conference: Grand Challenge and Workshop on Human Multimodal Language 2018
Abbreviated title: Challenge-HML 2018
Country: Australia
City: Melbourne
Period: 20/07/18 - 20/07/18
Internet address: http://multicomp.cs.cmu.edu/acl2018multimodalchallenge/

Cite this

Williams, J., Comanescu, R., Radu, O., & Tian, L. (2018). DNN multimodal fusion techniques for predicting video sentiment. In A. Zadeh, L.-P. Morency, P. Pu Liang, S. Poria, E. Cambria, & S. Scherer (Eds.), ACL 2018 - First Grand Challenge and Workshop on Human Multimodal Language (Challenge-HML): Proceedings of the Workshop - July 20, 2018, Melbourne, Australia (pp. 64-72). Stroudsburg, PA, USA: Association for Computational Linguistics (ACL).