Recognizing emotions in spoken dialogue with hierarchically fused acoustic and lexical features

Leimin Tian, Johanna Moore, Catherine Lai

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research

Abstract

Automatic emotion recognition is vital for building natural and engaging human-computer interaction systems. Combining information from multiple modalities typically improves emotion recognition performance. In previous work, features from different modalities have generally been fused at the same level with two types of fusion strategies: Feature-Level fusion, which concatenates feature sets before recognition; and Decision-Level fusion, which makes the final decision based on outputs of the unimodal models. However, different features may describe data at different time scales or have different levels of abstraction. Cognitive Science research also indicates that when perceiving emotions, humans use information from different modalities at different cognitive levels and time steps. Therefore, we propose a Hierarchical fusion strategy for multimodal emotion recognition, which incorporates global or more abstract features at higher levels of its knowledge-inspired structure. We build multimodal emotion recognition models combining state-of-the-art acoustic and lexical features to study the performance of the proposed Hierarchical fusion. Experiments on two emotion databases of spoken dialogue show that this fusion strategy consistently outperforms both Feature-Level and Decision-Level fusion. The multimodal emotion recognition models using the Hierarchical fusion strategy achieved state-of-the-art performance on recognizing emotions in both spontaneous and acted dialogue.
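
As a rough illustrative sketch only (not the authors' implementation; the use of PyTorch, all layer sizes, and the exact placement of the more abstract lexical features are assumptions), the three fusion strategies contrasted in the abstract might be expressed as follows:

import torch
import torch.nn as nn

class FeatureLevelFusion(nn.Module):
    """Feature-Level fusion: concatenate acoustic and lexical feature sequences
    before a single shared LSTM (assumes the lexical features are frame-aligned)."""
    def __init__(self, acoustic_dim, lexical_dim, hidden_dim, num_emotions):
        super().__init__()
        self.lstm = nn.LSTM(acoustic_dim + lexical_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_emotions)

    def forward(self, acoustic, lexical):
        # acoustic: (batch, time, acoustic_dim); lexical: (batch, time, lexical_dim)
        _, (h, _) = self.lstm(torch.cat([acoustic, lexical], dim=-1))
        return self.classifier(h[-1])

class DecisionLevelFusion(nn.Module):
    """Decision-Level fusion: independent unimodal models whose outputs are
    combined (here by simple averaging) to make the final decision."""
    def __init__(self, acoustic_dim, lexical_dim, hidden_dim, num_emotions):
        super().__init__()
        self.acoustic_lstm = nn.LSTM(acoustic_dim, hidden_dim, batch_first=True)
        self.lexical_lstm = nn.LSTM(lexical_dim, hidden_dim, batch_first=True)
        self.acoustic_head = nn.Linear(hidden_dim, num_emotions)
        self.lexical_head = nn.Linear(hidden_dim, num_emotions)

    def forward(self, acoustic, lexical):
        _, (ha, _) = self.acoustic_lstm(acoustic)
        _, (hl, _) = self.lexical_lstm(lexical)
        return 0.5 * (self.acoustic_head(ha[-1]) + self.lexical_head(hl[-1]))

class HierarchicalFusion(nn.Module):
    """Hierarchical fusion: a lower LSTM reads frame-level acoustic features,
    and a more abstract utterance-level lexical summary is injected only at a
    higher layer of the model."""
    def __init__(self, acoustic_dim, lexical_dim, hidden_dim, num_emotions):
        super().__init__()
        self.lower_lstm = nn.LSTM(acoustic_dim, hidden_dim, batch_first=True)
        self.upper = nn.Sequential(
            nn.Linear(hidden_dim + lexical_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, num_emotions),
        )

    def forward(self, acoustic, lexical_utterance):
        # lexical_utterance: (batch, lexical_dim), an utterance-level summary
        _, (h, _) = self.lower_lstm(acoustic)
        return self.upper(torch.cat([h[-1], lexical_utterance], dim=-1))

# Example with random tensors: 4 utterances, 100 frames, 40-dim acoustic and
# 300-dim lexical features, 4 emotion classes (all dimensions hypothetical).
acoustic = torch.randn(4, 100, 40)
lexical_seq = torch.randn(4, 100, 300)
lexical_utt = torch.randn(4, 300)
print(FeatureLevelFusion(40, 300, 64, 4)(acoustic, lexical_seq).shape)   # torch.Size([4, 4])
print(DecisionLevelFusion(40, 300, 64, 4)(acoustic, lexical_seq).shape)  # torch.Size([4, 4])
print(HierarchicalFusion(40, 300, 64, 4)(acoustic, lexical_utt).shape)   # torch.Size([4, 4])

The key difference is where the lexical information enters: at the input (Feature-Level), at the output (Decision-Level), or above the lower acoustic layer (Hierarchical), which corresponds to the abstract's idea of incorporating global or more abstract features at higher levels of the structure.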

Original language: English
Title of host publication: 2016 IEEE Workshop on Spoken Language Technology - SLT 2016 Proceedings
Subtitle of host publication: December 13–16, 2016, San Diego, California, U.S.A.
Editors: Dilek Hakkani-Tur, Julia Hirschberg, Douglas Reynolds, Frank Seide, Zheng-Hua Tan, Dan Povey
Place of Publication: Piscataway, NJ, USA
Publisher: IEEE, Institute of Electrical and Electronics Engineers
Pages: 565-572
Number of pages: 8
ISBN (Electronic): 9781509049035, 9781509049028
ISBN (Print): 9781509049042
DOI: 10.1109/SLT.2016.7846319
Publication status: Published - 2016
Externally published: Yes
Event: IEEE Workshop on Spoken Language Technology 2016 - San Diego, United States of America
Duration: 13 Dec 2016 – 16 Dec 2016
https://www2.securecms.com/SLT2016//Default.asp

Conference

Conference: IEEE Workshop on Spoken Language Technology 2016
Abbreviated title: SLT 2016
Country: United States of America
City: San Diego
Period: 13/12/16 – 16/12/16
Internet address: https://www2.securecms.com/SLT2016//Default.asp

Keywords

  • Dialogue
  • Emotion recognition
  • Human-computer interaction
  • LSTM
  • Modality fusion

Cite this

Tian, L., Moore, J., & Lai, C. (2016). Recognizing emotions in spoken dialogue with hierarchically fused acoustic and lexical features. In D. Hakkani-Tur, J. Hirschberg, D. Reynolds, F. Seide, Z. Hua Tan, & D. Povey (Eds.), 2016 IEEE Workshop on Spoken Language Technology - SLT 2016 Proceedings: December 13–16, 2016 San Diego, California, U.S.A. (pp. 565-572). [7846319] Piscataway NJ USA: IEEE, Institute of Electrical and Electronics Engineers. https://doi.org/10.1109/SLT.2016.7846319