Abstract
Obtaining high-quality parallel corpora is of paramount importance for training NMT systems. However, as many language pairs lack adequate gold-standard training data, a popular approach has been to mine so-called "pseudo-parallel" sentences from paired documents in two languages. In this paper, we outline some drawbacks of current methods that rely on an embedding similarity threshold, and propose a heuristic method in their place. Our method involves translating both halves of a paired corpus before mining, and then performing a majority vote on sentence pairs mined in three ways: after translating documents in language x to language y, after translating language y to x, and using the original documents in languages x and y. We demonstrate success with this novel approach on the Tatoeba similarity search benchmark in 64 low-resource languages, and on NMT in Kazakh and Gujarati. We also uncover the effect of resource-related factors (i.e., how much monolingual/bilingual data is available for a given language) on the optimal choice of bitext mining method, demonstrating that there is currently no one-size-fits-all approach for this task. We make the code and data used in our experiments publicly available.
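
The majority vote described in the abstract could be sketched roughly as follows. This is only an illustration and not the authors' released code: the function name `majority_vote`, the `min_votes=2` rule (a pair is kept if at least two of the three mining runs propose it), and the toy index pairs are all assumptions, since the abstract does not specify these details.

```python
from collections import Counter


def majority_vote(*mined_sets, min_votes=2):
    """Keep sentence pairs proposed by at least `min_votes` mining runs.

    Each element of `mined_sets` is an iterable of (x_index, y_index) pairs.
    The 2-of-3 threshold is an assumption, not taken from the paper.
    """
    votes = Counter()
    for pairs in mined_sets:
        votes.update(set(pairs))  # each run casts at most one vote per pair
    return {pair for pair, count in votes.items() if count >= min_votes}


# Hypothetical mined pairs: (sentence index in x document, index in y document)
mined_original = {(0, 0), (1, 2), (3, 3)}  # mined from the original x and y documents
mined_x_to_y = {(0, 0), (1, 1), (3, 3)}    # mined after translating x into y
mined_y_to_x = {(0, 0), (1, 2), (2, 4)}    # mined after translating y into x

kept = majority_vote(mined_original, mined_x_to_y, mined_y_to_x)
print(sorted(kept))
```

In this toy example, pairs proposed by only one of the three runs are discarded, which is how the vote replaces a fixed embedding similarity threshold as the filter.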
Original language | English |
---|---|
Title of host publication | Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021) |
Editors | Reinhard Rapp, Serge Sharoff, Pierre Zweigenbaum |
Place of Publication | Shoumen BULGARIA |
Publisher | Association for Computational Linguistics (ACL) |
Pages | 46-59 |
Number of pages | 14 |
ISBN (Electronic) | 9789544520762 |
Publication status | Published - 2021 |
Externally published | Yes |
Event | Workshop on Building and Using Comparable Corpora 2021, United States of America; Duration: 6 Sept 2021 → 6 Sept 2021; Conference number: 14th; https://aclanthology.org/2021.bucc-1.0/ (Proceedings); https://comparable.limsi.fr/bucc2021/ (Website) |
Conference
Conference | Workshop on Building and Using Comparable Corpora 2021 |
---|---|
Abbreviated title | BUCC 2021 |
Country/Territory | United States of America |
Period | 6/09/21 → 6/09/21 |