Cognate identification to improve phylogenetic trees for Indian languages

Diptesh Kanojia, Pushpak Bhattacharyya, Malhar Kulkarni, Gholemreza Haffari

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review

1 Citation (Scopus)

Abstract

Cognates are present in multiple variants of the same text across different languages. Computational phylogenetics uses algorithms and techniques to analyze these variants and infer phylogenetic trees, each a hypothesized representation that depends on the output of the computational algorithm used. In our work, we detect cognates among four Indian languages, namely Hindi, Marathi, Punjabi, and Sanskrit, to help build cognate sets for phylogenetic inference. Cognate detection supports phylogenetic inference by helping to isolate diachronic sound changes and thus detect words of common origin. Phylogenetic trees are generally inferred automatically from a cognate set that has been manually annotated with the help of a lexicographer. Our work creates cognate sets for each language pair and infers phylogenetic trees in a Bayesian framework using the maximum likelihood method. We also implement our work in an online interface that infers phylogenetic trees from automatically detected cognate sets. The interface creates phylogenetic trees from textual data provided as input. It lets a lexicographer enter data manually, edit the data based on their expert opinion, and eventually create phylogenetic trees using various algorithms, including our approach to automatically creating cognate sets. We go on to discuss the nuances of detecting cognates in these Indian languages and the categorization of cognate words into “Tatasama” and “Tadbhava” words.
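The abstract does not say which features or model the authors use to detect cognates, so the following is only a minimal sketch of the general idea, not the paper's method. It assumes a plain normalized edit distance over rough romanized word forms, with a hypothetical threshold of 0.5, and turns the pairwise cognate decisions into a language-by-language distance table of the kind a tree-inference step could consume. The word list, the threshold, and all function names are illustrative assumptions.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer word's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


THRESHOLD = 0.5  # hypothetical cut-off, not taken from the paper


def is_cognate_candidate(a: str, b: str, threshold: float = THRESHOLD) -> bool:
    """Flag a word pair as a cognate candidate when the forms are close enough."""
    return normalized_distance(a, b) <= threshold


# Illustrative, roughly romanized forms; an assumption for this sketch only.
WORDLIST = {
    "hand": {"Hindi": "haath", "Marathi": "haat",
             "Punjabi": "hatth", "Sanskrit": "hasta"},
    "fire": {"Hindi": "aag", "Marathi": "aag",
             "Punjabi": "agg", "Sanskrit": "agni"},
}


def language_distances(wordlist):
    """Fraction of shared concepts whose word pair is NOT flagged as cognate."""
    langs = sorted({lang for forms in wordlist.values() for lang in forms})
    dist = {}
    for l1, l2 in combinations(langs, 2):
        shared = [c for c, forms in wordlist.items() if l1 in forms and l2 in forms]
        if not shared:
            continue  # no common concepts, so no distance can be estimated
        misses = sum(
            not is_cognate_candidate(wordlist[c][l1], wordlist[c][l2])
            for c in shared
        )
        dist[(l1, l2)] = misses / len(shared)
    return dist


if __name__ == "__main__":
    for pair, d in language_distances(WORDLIST).items():
        print(pair, round(d, 2))
```

Surface string similarity is a deliberately crude stand-in here: it cannot separate true cognates from loanwords or chance resemblances, which is part of why lexicographer review of the resulting sets remains useful.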

Original language: English
Title of host publication: CODS-COMAD 2019
Subtitle of host publication: Proceedings of the 6th ACM IKDD CoDS and 24th COMAD, January 3 - 5, 2019, Kolkata, India
Editors: Raghu Krishnapuram, Parag Singla
Place of Publication: New York NY USA
Publisher: Association for Computing Machinery (ACM)
Pages: 297-300
Number of pages: 4
ISBN (Electronic): 9781450362078
DOIs: https://doi.org/10.1145/3297001.3297045
Publication status: Published - 2019
Event: ACM India Joint International Conference on Data Science and Management of Data (CoDS-COMAD) 2019 - Kolkata, India
Duration: 3 Jan 2019 to 5 Jan 2019
Conference number: 6th & 24th
https://cods-comad.in/2019/index.html

Conference

Conference: ACM India Joint International Conference on Data Science and Management of Data (CoDS-COMAD) 2019
Abbreviated title: ACM iKDD CoDS and COMAD
Country: India
City: Kolkata
Period: 3/01/19 to 5/01/19
Internet address: https://cods-comad.in/2019/index.html

Keywords

  • Cognate Detection
  • Cognate Identification
  • Computational Phylogenetics
  • Historical Linguistics
  • Indian Languages
  • Natural Language Processing
  • Phylogenetic Tree Generation
  • Phylogenetics

Cite this

Kanojia, D., Bhattacharyya, P., Kulkarni, M., & Haffari, G. (2019). Cognate identification to improve phylogenetic trees for Indian languages. In R. Krishnapuram & P. Singla (Eds.), CODS-COMAD 2019: Proceedings of the 6th ACM IKDD CoDS and 24th COMAD, January 3 - 5, 2019, Kolkata, India (pp. 297-300). [150] Association for Computing Machinery (ACM). https://doi.org/10.1145/3297001.3297045