TY - JOUR
T1 - Combining cross-modal knowledge transfer and semi-supervised learning for speech emotion recognition
AU - Zhang, Sheng
AU - Chen, Min
AU - Chen, Jincai
AU - Li, Yuan Fang
AU - Wu, Yiling
AU - Li, Minglei
AU - Zhu, Chuanbo
N1 - Funding Information:
This work was supported by the National Natural Science Foundation of China under Grant Nos. 61672246 and 61272068, and by the Technology Innovation Project of Hubei Province of China under Grant No. 2019AHB061. All authors approved the version of the manuscript to be published.
Publisher Copyright:
© 2021 Elsevier B.V.
PY - 2021/10/11
Y1 - 2021/10/11
N2 - Speech emotion recognition is an important task with a wide range of applications. However, progress in speech emotion recognition is limited by the lack of large, high-quality labeled speech datasets, due to the high annotation cost and the inherent ambiguity of emotion labels. The recent emergence of large-scale video data makes it possible to obtain massive amounts of speech data, albeit unlabeled. To exploit this unlabeled data, previous works have explored semi-supervised learning methods on various tasks. However, noisy pseudo-labels remain a challenge for these methods. In this work, to alleviate this issue, we propose a new architecture that integrates cross-modal knowledge transfer from the visual to the audio modality into a semi-supervised learning method with consistency regularization. We posit that introducing visual emotional knowledge through cross-modal transfer can increase the diversity and accuracy of pseudo-labels and improve the robustness of the model. To combine knowledge from cross-modal transfer and semi-supervised learning, we design two fusion algorithms, i.e., weighted fusion and consistent & random. Our experiments on the CH-SIMS and IEMOCAP datasets show that our method can effectively use additional unlabeled audio-visual data to surpass state-of-the-art results.
AB - Speech emotion recognition is an important task with a wide range of applications. However, progress in speech emotion recognition is limited by the lack of large, high-quality labeled speech datasets, due to the high annotation cost and the inherent ambiguity of emotion labels. The recent emergence of large-scale video data makes it possible to obtain massive amounts of speech data, albeit unlabeled. To exploit this unlabeled data, previous works have explored semi-supervised learning methods on various tasks. However, noisy pseudo-labels remain a challenge for these methods. In this work, to alleviate this issue, we propose a new architecture that integrates cross-modal knowledge transfer from the visual to the audio modality into a semi-supervised learning method with consistency regularization. We posit that introducing visual emotional knowledge through cross-modal transfer can increase the diversity and accuracy of pseudo-labels and improve the robustness of the model. To combine knowledge from cross-modal transfer and semi-supervised learning, we design two fusion algorithms, i.e., weighted fusion and consistent & random. Our experiments on the CH-SIMS and IEMOCAP datasets show that our method can effectively use additional unlabeled audio-visual data to surpass state-of-the-art results.
KW - Cross-modal knowledge transfer
KW - Semi-supervised learning
KW - Speech emotion recognition
UR - http://www.scopus.com/inward/record.url?scp=85111911468&partnerID=8YFLogxK
U2 - 10.1016/j.knosys.2021.107340
DO - 10.1016/j.knosys.2021.107340
M3 - Article
AN - SCOPUS:85111911468
SN - 0950-7051
VL - 229
JO - Knowledge-Based Systems
JF - Knowledge-Based Systems
M1 - 107340
ER -