Challenge dataset of cognates and false friend pairs from Indian languages

Diptesh Kanojia, Pushpak Bhattacharyya, Malhar Kulkarni, Gholamreza Haffari

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review

Abstract

Cognates are present in multiple variants of the same text across different languages (e.g., “hund” in German and “hound” in English both mean “dog”). They pose a challenge to various Natural Language Processing (NLP) applications such as Machine Translation, Cross-lingual Sense Disambiguation, Computational Phylogenetics, and Information Retrieval. A possible solution to address this challenge is to identify cognates across language pairs. In this paper, we describe the creation of two cognate datasets for twelve Indian languages, namely Sanskrit, Hindi, Assamese, Oriya, Kannada, Gujarati, Tamil, Telugu, Punjabi, Bengali, Marathi, and Malayalam. We digitize the cognate data from an Indian language cognate dictionary and utilize linked Indian language Wordnets to generate cognate sets. Additionally, we use the Wordnet data to create a false friends dataset for eleven language pairs. We evaluate the efficacy of our dataset using previously available baseline cognate detection approaches, perform a manual evaluation with the help of lexicographers, and release the curated gold-standard dataset with this paper.
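The abstract does not say which baseline cognate detection approaches were used. As a rough illustration only, the distinction between cognates and false friends can be sketched with a generic orthographic-similarity check: word pairs that look alike and share a concept are treated as cognates, while look-alike pairs with different meanings are treated as false friends. The function names, the `same_concept` flag (standing in for a shared entry in linked Wordnets), and the 0.5 threshold are all assumptions made for this sketch, not details from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalized orthographic similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def label_pair(a: str, b: str, same_concept: bool, threshold: float = 0.5) -> str:
    """Hypothetical labelling rule; the threshold is an arbitrary assumption.

    same_concept is assumed to come from an external resource, such as
    two words being linked to the same concept in linked Wordnets.
    """
    if similarity(a, b) < threshold:
        return "unrelated"
    return "cognate" if same_concept else "false friend"


# The abstract's own example: similar spelling, same meaning.
print(label_pair("hund", "hound", same_concept=True))
# English "gift" vs German "Gift" (poison): same spelling, different meaning.
print(label_pair("gift", "gift", same_concept=False))
```

A character-level measure like this only makes sense when both words are in the same script, so data in different Indian scripts would need transliteration to a common form first.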

Original language: English
Title of host publication: LREC 2020 - Twelfth International Conference on Language Resources and Evaluation, Conference Proceedings
Editors: Nicoletta Calzolari, Frederic Bechet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Place of Publication: Paris, France
Publisher: European Language Resources Association (ELRA)
Pages: 3096-3102
Number of pages: 7
ISBN (Electronic): 9791095546344
Publication status: Published - 2020
Event: International Conference on Language Resources and Evaluation 2020 - Marseille, France
Duration: 11 May 2020 to 16 May 2020
Conference number: 12th
https://lrec2020.lrec-conf.org/en/ (Website)

Conference

Conference: International Conference on Language Resources and Evaluation 2020
Abbreviated title: LREC 2020
Country: France
City: Marseille
Period: 11/05/20 to 16/05/20

Keywords

  • Cognate dataset
  • Cognate sets
  • False friends
  • Gold data
  • Indian languages
  • True cognates
