That'll do fine! A coarse lexical resource for English-Hindi MT, using polylingual topic models

Diptesh Kanojia, Aditya Joshi, Pushpak Bhattacharyya, Mark James Carman

    Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research

    1 Citation (Scopus)

    Abstract

    Parallel corpora are often injected with bilingual lexical resources for improved Indian language machine translation (MT). In the absence of such lexical resources, multilingual topic models have been used to create coarse lexical resources in the past, using a Cartesian product approach. Our results show that for morphologically rich languages like Hindi, the Cartesian product approach is detrimental for MT. We then present a novel 'sentential' approach to use this coarse lexical resource from a multilingual topic model. Our coarse lexical resource, when injected with a parallel corpus, outperforms a system trained using a parallel corpus and a good quality lexical resource. As demonstrated by the quality of our coarse lexical resource and its benefit to MT, we believe that our sentential approach to create such a resource will help MT for resource-constrained languages.

    Original language: English
    Title of host publication: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)
    Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Stelios Piperidis
    Place of Publication: Paris, France
    Publisher: European Language Resources Association (ELRA)
    Pages: 2199-2203
    Number of pages: 5
    ISBN (Print): 9782951740891
    Publication status: Published - 2016
    Event: International Conference on Language Resources and Evaluation 2016 - Portoroz, Slovenia
    Duration: 23 May 2016 to 28 May 2016
    Conference number: 10th
    http://www.lrec-conf.org/proceedings/lrec2016/index.html (Proceedings)

    Conference

    Conference: International Conference on Language Resources and Evaluation 2016
    Abbreviated title: LREC 2016
    Country: Slovenia
    City: Portoroz
    Period: 23/05/16 to 28/05/16

    Keywords

    • Coarse dictionary
    • Machine translation
    • Statistical machine translation
    • Topic models
