Abstract
Pre-trained speech encoders have been central to pushing state-of-the-art results across various speech understanding and generation tasks. Nonetheless, the capabilities of these encoders in low-resource settings have yet to be thoroughly explored. To address this, we conduct a comprehensive set of experiments with three representative state-of-the-art encoders (Wav2vec2, WavLM, Whisper) across seven speech understanding and generation tasks in the low-resource setting. We provide quantitative and qualitative analyses of task performance, convergence speed, and the representational properties of the encoders. We observe a connection between the pre-training protocols of these encoders and the way they capture information in their internal layers. In particular, the Whisper encoder exhibits the strongest low-resource capabilities on content-driven tasks, in terms of both performance and convergence speed.
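The layer-wise probing setup the abstract describes can be illustrated with a short sketch. The snippet below is a hypothetical example using the Hugging Face Transformers library, not the authors' code: the checkpoint names and the `layerwise_states` helper are assumptions. It extracts per-layer hidden states from the three encoders, the kind of representations one would feed to lightweight task heads in a low-resource probe.

```python
# A minimal layer-probing sketch (illustrative only; checkpoints are assumptions).
import numpy as np
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2Model, WavLMModel, WhisperModel


def layerwise_states(model, feature_extractor, waveform, sr=16_000):
    """Return the tuple of per-layer hidden states for one utterance."""
    inputs = feature_extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states  # (embedding output, layer 1, ..., layer N)


# One second of silence stands in for a real low-resource utterance.
wave = np.zeros(16_000, dtype=np.float32)

encoders = {
    "wav2vec2": (Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base"),
                 AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")),
    "wavlm":    (WavLMModel.from_pretrained("microsoft/wavlm-base"),
                 AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base")),
    # Whisper is an encoder-decoder model; only its encoder is probed here.
    "whisper":  (WhisperModel.from_pretrained("openai/whisper-base").get_encoder(),
                 AutoFeatureExtractor.from_pretrained("openai/whisper-base")),
}

for name, (model, fe) in encoders.items():
    states = layerwise_states(model.eval(), fe, wave)
    print(name, len(states), states[-1].shape)
```

Each tuple of hidden states has one entry per encoder layer plus the initial embedding output, so per-layer probes can be trained and compared directly across the three architectures.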
Original language | English |
---|---|
Title of host publication | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2023 |
Editors | Naomi Harte, Julie Berndsen, Gareth Jones |
Place of Publication | Dublin, Ireland
Publisher | International Speech Communication Association (ISCA) |
Pages | 1498-1502 |
Number of pages | 5 |
DOIs | |
Publication status | Published - 2023 |
Event | Annual Conference of the International Speech Communication Association 2023 (24th) - Dublin, Ireland, 20 Aug 2023 → 24 Aug 2023
Conference
Conference | Annual Conference of the International Speech Communication Association 2023 |
---|---|
Abbreviated title | Interspeech 2023 |
Country/Territory | Ireland |
City | Dublin |
Period | 20/08/23 → 24/08/23 |
Internet address | https://interspeech2023.org/ (Website); https://www.isca-speech.org/archive/interspeech_2023/index.html (Proceedings)
Keywords
- low-resource setting
- speech encoders
- speech understanding