Abstract
Pre-trained speech Transformers have facilitated great success across various speech processing tasks. However, fine-tuning these encoders for downstream tasks requires sufficiently large training data to converge or to achieve state-of-the-art performance. In the text domain, this has been partly attributed to the sub-optimality of the representation space in pre-trained Transformers. In this work, we take a sober look into pre-trained speech encoders and rewire their representation space without requiring any task-specific labels. Our method utilises a neutrally synthesised version of the audio inputs along with frame masking to construct positive pairs for contrastive self-supervised learning. When used to augment the wav2vec 2 encoder, we observe consistent improvement of isotropy in the representation space. Our experiments on six speech processing tasks exhibit a significant convergence speedup during task fine-tuning as well as consistent task improvement, especially in low-resource settings.
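For concreteness, below is a minimal sketch (not the authors' implementation) of the kind of objective the abstract describes: an InfoNCE-style contrastive loss that treats an utterance and its neutrally synthesised, frame-masked counterpart as a positive pair, together with a rough isotropy probe based on the mean cosine similarity of random embedding pairs. All function names, the pooling assumption, and the 768-dimensional stand-in tensors are illustrative assumptions.

```python
# Hedged sketch of a contrastive "rewiring" objective and an isotropy probe.
# Not the paper's code; shapes and names are assumptions for illustration.
import torch
import torch.nn.functional as F


def info_nce_loss(z_orig, z_synth, temperature=0.1):
    """Contrastive loss where (original, neutrally synthesised + masked)
    pairs are positives and other items in the batch are negatives.

    z_orig, z_synth: (batch, dim) pooled encoder representations.
    """
    z_orig = F.normalize(z_orig, dim=-1)
    z_synth = F.normalize(z_synth, dim=-1)
    logits = z_orig @ z_synth.t() / temperature          # (batch, batch) similarities
    targets = torch.arange(z_orig.size(0), device=z_orig.device)
    return F.cross_entropy(logits, targets)


def mean_pairwise_cosine(embeddings, n_pairs=10_000):
    """Crude isotropy probe: a lower mean cosine similarity between random
    embedding pairs suggests a more isotropic representation space."""
    n = embeddings.size(0)
    idx_a = torch.randint(0, n, (n_pairs,))
    idx_b = torch.randint(0, n, (n_pairs,))
    a = F.normalize(embeddings[idx_a], dim=-1)
    b = F.normalize(embeddings[idx_b], dim=-1)
    return (a * b).sum(dim=-1).mean()


if __name__ == "__main__":
    # Stand-in tensors in place of pooled wav2vec 2 representations.
    batch, dim = 32, 768
    z_orig = torch.randn(batch, dim)
    z_synth = torch.randn(batch, dim)
    print("contrastive loss:", info_nce_loss(z_orig, z_synth).item())
    print("mean pairwise cosine:", mean_pairwise_cosine(torch.randn(1000, dim)).item())
```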
Original language | English |
---|---|
Title of host publication | Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing |
Editors | Yoav Goldberg, Zornitsa Kozareva, Yue Zhang |
Place of Publication | Stroudsburg PA USA |
Publisher | Association for Computational Linguistics (ACL) |
Pages | 1952–1959 |
Number of pages | 8 |
Publication status | Published - Dec 2022 |
Event | Empirical Methods in Natural Language Processing 2022 - Abu Dhabi, United Arab Emirates |
Duration | 7 Dec 2022 → 11 Dec 2022 |
Proceedings | https://preview.aclanthology.org/emnlp-22-ingestion/volumes/2022.emnlp-main/ |
Website | https://2022.emnlp.org/ |
Conference
Conference | Empirical Methods in Natural Language Processing 2022 |
---|---|
Abbreviated title | EMNLP 2022 |
Country/Territory | United Arab Emirates |
City | Abu Dhabi |
Period | 7/12/22 → 11/12/22 |
Internet address | https://2022.emnlp.org/ |