Show some love to your n-grams: a bit of progress and stronger n-gram language modeling baselines

Ehsan Shareghi, Daniela Gerz, Ivan Vulić, Anna Korhonen

Research output: Chapter in Book/Report/Conference proceeding, Conference Paper, Research, peer-review

3 Citations (Scopus)

Abstract

In recent years neural language models (LMs) have set state-of-the-art performance for several benchmarking datasets. While the reasons for their success and their computational demand are well-documented, a comparison between neural models and more recent developments in n-gram models is neglected. In this paper, we examine the recent progress in n-gram literature, running experiments on 50 languages covering all morphological language families. Experimental results illustrate that a simple extension of Modified Kneser-Ney outperforms an LSTM language model on 42 languages while a word-level Bayesian n-gram LM (Shareghi et al., 2017) outperforms the character-aware neural model (Kim et al., 2016) on average across all languages, and its extension which explicitly injects linguistic knowledge (Gerz et al., 2018a) on 8 languages. Further experiments on larger Europarl datasets for 3 languages indicate that neural architectures are able to outperform computationally much cheaper n-gram models: n-gram training is up to 15,000× quicker. Our experiments illustrate that standalone n-gram models lend themselves as natural choices for resource-lean or morphologically rich languages, while the recent progress has significantly improved their accuracy.

Original language: English
Title of host publication: NAACL 2019, The 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Subtitle of host publication: Proceedings of the Conference Vol. 1 (Long and Short Papers), June 2 - June 7, 2019
Editors: Christy Doran, Thamar Solorio
Place of Publication: Stroudsburg PA USA
Publisher: Association for Computational Linguistics (ACL)
Pages: 4113-4118
Number of pages: 6
Volume: 1
ISBN (Electronic): 9781950737130
Publication status: Published - Jun 2019
Event: North American Association for Computational Linguistics 2019: Human Language Technologies - Minneapolis, United States of America
Duration: 2 Jun 2019 to 7 Jun 2019
https://naacl2019.org/

Conference

Conference: North American Association for Computational Linguistics 2019
Abbreviated title: NAACL HLT 2019
Country: United States of America
City: Minneapolis
Period: 2/06/19 to 7/06/19
