Abstract
Automatic emotion recognition is vital for building natural and engaging human-computer interaction systems. Combining information from multiple modalities typically improves emotion recognition performance. In previous work, features from different modalities have generally been fused at the same level, using one of two fusion strategies: Feature-Level fusion, which concatenates feature sets before recognition, and Decision-Level fusion, which makes the final decision based on the outputs of the unimodal models. However, different features may describe data at different time scales or levels of abstraction. Research in cognitive science also indicates that when perceiving emotions, humans use information from different modalities at different cognitive levels and time steps. We therefore propose a Hierarchical fusion strategy for multimodal emotion recognition, which incorporates global or more abstract features at higher levels of its knowledge-inspired structure. We build multimodal emotion recognition models that combine state-of-the-art acoustic and lexical features to study the performance of the proposed Hierarchical fusion. Experiments on two emotion databases of spoken dialogue show that this fusion strategy consistently outperforms both Feature-Level and Decision-Level fusion. Multimodal emotion recognition models using the Hierarchical fusion strategy achieve state-of-the-art performance in recognizing emotions in both spontaneous and acted dialogue.
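The three fusion strategies contrasted in the abstract can be sketched with toy linear models. This is a minimal conceptual sketch only: all shapes, weights, and the choice of a linear classifier are illustrative assumptions, not the paper's architecture (which uses LSTMs over state-of-the-art acoustic and lexical features).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-utterance features (dimensions are illustrative, not the paper's).
acoustic = rng.normal(size=(4, 6))   # e.g. acoustic feature vectors
lexical = rng.normal(size=(4, 5))    # e.g. lexical feature vectors

def score(x, w):
    """Stand-in unimodal classifier: fixed linear layer plus sigmoid."""
    return 1.0 / (1.0 + np.exp(-(x @ w)))

w_ac = rng.normal(size=6)
w_lex = rng.normal(size=5)
w_cat = rng.normal(size=11)

# Feature-Level fusion: concatenate the feature sets, then classify once.
feat_level = score(np.concatenate([acoustic, lexical], axis=1), w_cat)

# Decision-Level fusion: classify each modality, then combine the outputs.
dec_level = 0.5 * (score(acoustic, w_ac) + score(lexical, w_lex))

# Hierarchical fusion (sketch): process one modality at a lower level,
# then inject the other, more abstract features at a higher level.
hidden = np.tanh(acoustic @ rng.normal(size=(6, 3)))   # lower level
upper = np.concatenate([hidden, lexical], axis=1)      # higher level
hier = score(upper, rng.normal(size=8))

print(feat_level.shape, dec_level.shape, hier.shape)
```

The key structural difference: Feature-Level fusion merges modalities before any modeling, Decision-Level fusion merges only final outputs, while the Hierarchical sketch lets one modality's intermediate representation be combined with the other modality's features partway up the model.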
| Original language | English |
|---|---|
| Title of host publication | 2016 IEEE Workshop on Spoken Language Technology - SLT 2016 Proceedings |
| Subtitle of host publication | December 13–16, 2016 San Diego, California, U.S.A. |
| Editors | Dilek Hakkani-Tur, Julia Hirschberg, Douglas Reynolds, Frank Seide, Zheng Hua Tan, Dan Povey |
| Place of Publication | Piscataway NJ USA |
| Publisher | IEEE, Institute of Electrical and Electronics Engineers |
| Pages | 565-572 |
| Number of pages | 8 |
| ISBN (Electronic) | 9781509049035, 9781509049028 |
| ISBN (Print) | 9781509049042 |
| DOIs | |
| Publication status | Published - 2016 |
| Externally published | Yes |
| Event | IEEE Workshop on Spoken Language Technology 2016, San Diego, United States of America. Duration: 13 Dec 2016 → 16 Dec 2016. https://www2.securecms.com/SLT2016//Default.asp |
Conference
| Conference | IEEE Workshop on Spoken Language Technology 2016 |
|---|---|
| Abbreviated title | SLT 2016 |
| Country/Territory | United States of America |
| City | San Diego |
| Period | 13/12/16 → 16/12/16 |
| Internet address | https://www2.securecms.com/SLT2016//Default.asp |
Keywords
- Dialogue
- Emotion recognition
- Human-computer interaction
- LSTM
- Modality fusion