Majority voting with bidirectional pre-translation for bitext retrieval

Alexander Jones, Derry Tanti Wijaya

Research output: Chapter in Book/Report/Conference proceeding; Conference Paper; Research; peer-review

Abstract

Obtaining high-quality parallel corpora is of paramount importance for training NMT systems. However, as many language pairs lack adequate gold-standard training data, a popular approach has been to mine so-called "pseudo-parallel" sentences from paired documents in two languages. In this paper, we outline some drawbacks of current methods that rely on an embedding similarity threshold, and propose a heuristic method in their place. Our method involves translating both halves of a paired corpus before mining, and then performing a majority vote on sentence pairs mined in three ways: after translating documents in language x to language y, after translating language y to x, and using the original documents in languages x and y. We demonstrate success with this novel approach on the Tatoeba similarity search benchmark in 64 low-resource languages, and on NMT in Kazakh and Gujarati. We also uncover the effect of resource-related factors (i.e. how much monolingual/bilingual data is available for a given language) on the optimal choice of bitext mining method, demonstrating that there is currently no one-size-fits-all approach for this task. We make the code and data used in our experiments publicly available.
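
The majority vote described in the abstract can be illustrated with a minimal sketch. This is not the authors' released code: the function name `majority_vote`, the representation of a mined pair as a tuple of sentence indices into the original documents, and the toy inputs are all assumptions made for illustration. It only assumes that each of the three mining runs yields a set of candidate pairs and that a pair is kept when more than half of the runs agree on it.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

# Hypothetical representation: (index of sentence in x, index of sentence in y).
Pair = Tuple[int, int]


def majority_vote(*mined_sets: Iterable[Pair]) -> Set[Pair]:
    """Keep the pairs proposed by a strict majority of the mining runs."""
    votes: Counter = Counter()
    for pairs in mined_sets:
        # Deduplicate within a run so each run casts at most one vote per pair.
        votes.update(set(pairs))
    threshold = len(mined_sets) // 2 + 1  # 2 of 3 for the three-way setup
    return {pair for pair, count in votes.items() if count >= threshold}


# Toy inputs standing in for the three mining runs named in the abstract.
mined_original = {(0, 0), (1, 1), (2, 3)}  # original documents in x and y
mined_x_to_y = {(0, 0), (1, 1), (2, 2)}    # after translating x to y
mined_y_to_x = {(0, 0), (2, 3), (4, 4)}    # after translating y to x

kept = majority_vote(mined_original, mined_x_to_y, mined_y_to_x)
print(sorted(kept))
```

With these toy inputs, the pairs (2, 2) and (4, 4) are dropped because only one run proposed each of them. How the paper mines the candidate pairs in each run is not shown here.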
Original language: English
Title of host publication: Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021)
Editors: Reinhard Rapp, Serge Sharoff, Pierre Zweigenbaum
Place of Publication: Shoumen, Bulgaria
Publisher: Association for Computational Linguistics (ACL)
Pages: 46-59
Number of pages: 14
ISBN (Electronic): 9789544520762
Publication status: Published - 2021
Externally published: Yes
Event: Workshop on Building and Using Comparable Corpora 2021, United States of America
Duration: 6 Sept 2021 to 6 Sept 2021
Conference number: 14th
https://aclanthology.org/2021.bucc-1.0/ (Proceedings)
https://comparable.limsi.fr/bucc2021/ (Website)

Conference

Conference: Workshop on Building and Using Comparable Corpora 2021
Abbreviated title: BUCC 2021
Country/Territory: United States of America
Period: 6/09/21 to 6/09/21