Dynamic Programming Encoding for subword segmentation in neural machine translation

Xuanli He, Reza Haffari, Mohammad Norouzi

Research output: Chapter in Book/Report/Conference proceeding > Conference Paper > Research > peer-review

Abstract

This paper introduces Dynamic Programming Encoding (DPE), a new segmentation algorithm for tokenizing sentences into subword units. We view the subword segmentation of output sentences as a latent variable that should be marginalized out for learning and inference. A mixed character-subword transformer is proposed, which enables exact log marginal likelihood estimation and exact MAP inference to find target segmentations with maximum posterior probability. DPE uses a lightweight mixed character-subword transformer as a means of pre-processing parallel data to segment output sentences using dynamic programming. Empirical results on machine translation suggest that DPE is effective for segmenting output sentences and can be combined with BPE dropout for stochastic segmentation of source sentences. DPE achieves an average improvement of 0.9 BLEU over BPE (Sennrich et al., 2016) and an average improvement of 0.55 BLEU over BPE dropout (Provilkov et al., 2019) on several WMT datasets including English <=> (German, Romanian, Estonian, Finnish, Hungarian).
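
The dynamic programming idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function `segment`, the callable `log_prob(prefix, piece)`, and the toy probability table are hypothetical names introduced here for illustration. The sketch assumes that the score of a subword depends only on the characters generated before it, which is what allows both the sum over all segmentations (log marginal likelihood) and the best single segmentation (MAP) to be computed exactly by one left-to-right pass. In DPE that score would come from the mixed character-subword transformer, conditioned on the source sentence; here a fixed unigram table stands in for it.

```python
import math
from typing import Callable, List, Set, Tuple


def logsumexp(values: List[float]) -> float:
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def segment(
    text: str,
    vocab: Set[str],
    log_prob: Callable[[str, str], float],
) -> Tuple[float, List[str]]:
    """Return (log marginal likelihood, MAP segmentation) of `text`.

    log_prob(prefix, piece) is a hypothetical scorer: the log-probability
    of emitting `piece` after the character prefix `prefix`.
    """
    n = len(text)
    max_len = max(len(v) for v in vocab)
    log_marginal = [-math.inf] * (n + 1)  # sum over segmentations of text[:j]
    best = [-math.inf] * (n + 1)          # best single segmentation of text[:j]
    back = [0] * (n + 1)                  # start index of the last piece on the best path
    log_marginal[0] = 0.0
    best[0] = 0.0

    for j in range(1, n + 1):
        scores = []
        for i in range(max(0, j - max_len), j):
            piece = text[i:j]
            if piece not in vocab:
                continue
            lp = log_prob(text[:i], piece)
            scores.append(log_marginal[i] + lp)
            if best[i] + lp > best[j]:
                best[j] = best[i] + lp
                back[j] = i
        if scores:
            log_marginal[j] = logsumexp(scores)

    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")

    # Follow back-pointers from the end to recover the MAP segmentation.
    pieces = []
    j = n
    while j > 0:
        i = back[j]
        pieces.append(text[i:j])
        j = i
    pieces.reverse()
    return log_marginal[n], pieces


# Toy stand-in for the model: a unigram table that ignores the prefix.
TOY_PROBS = {"a": 0.2, "b": 0.2, "c": 0.2, "ab": 0.3, "bc": 0.1}


def toy_log_prob(prefix: str, piece: str) -> float:
    return math.log(TOY_PROBS[piece])


log_z, map_pieces = segment("abc", set(TOY_PROBS), toy_log_prob)
```

For the toy table, the three valid segmentations of "abc" are a|b|c, ab|c and a|bc, so `log_z` is the log of the sum of their probabilities and `map_pieces` is the highest-scoring of the three.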
Original language: English
Title of host publication: ACL 2020 - The 58th Annual Meeting of the Association for Computational Linguistics
Subtitle of host publication: Proceedings of the Conference
Editors: Joyce Chai, Natalie Schluter, Joel Tetreault
Place of Publication: Stroudsburg PA USA
Publisher: Association for Computational Linguistics (ACL)
Pages: 3042–3051
Number of pages: 10
ISBN (Electronic): 9781952148255
Publication status: Published - 2020
Event: Annual Meeting of the Association of Computational Linguistics 2020 - Virtual, Seattle, United States of America
Duration: 5 Jul 2020 to 10 Jul 2020
Conference number: 58th
https://www.aclweb.org/anthology/volumes/2020.acl-main/

Conference

Conference: Annual Meeting of the Association of Computational Linguistics 2020
Abbreviated title: ACL 2020
Country/Territory: United States of America
City: Seattle
Period: 5/07/20 to 10/07/20
