Cognate identification to improve phylogenetic trees for Indian languages

Diptesh Kanojia, Pushpak Bhattacharyya, Malhar Kulkarni, Gholemreza Haffari

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review

Abstract

Cognates are present in multiple variants of the same text across different languages. Computational phylogenetics uses algorithms and techniques to analyze these variants and infer phylogenetic trees as a hypothesized representation, based on the output of the computational algorithm used. In our work, we detect cognates among a few Indian languages, namely Hindi, Marathi, Punjabi, and Sanskrit, to help build cognate sets for phylogenetic inference. Cognate detection aids phylogenetic inference by isolating diachronic sound changes and thus detecting words of a common origin. Phylogenetic trees are generally inferred automatically from a cognate set that has been manually annotated with the help of a lexicographer. Our work creates cognate sets for each language pair and infers phylogenetic trees based on a Bayesian framework using the maximum likelihood method. We also integrate our work into an online interface that infers phylogenetic trees from automatically detected cognate sets. The interface creates phylogenetic trees from textual data provided as input. It lets a lexicographer enter data manually, edit the data based on their expert opinion, and eventually create phylogenetic trees using various algorithms, including our work on automatically creating cognate sets. We go on to discuss the nuances of detecting cognates in these Indian languages and the categorization of cognate words into “Tatasama” and “Tadbhava” words.
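
As a rough illustration of what automatic cognate detection over language pairs can look like, the sketch below flags candidate cognates using normalized edit distance between romanized word forms. This is a generic orthographic-similarity baseline, not the method used in the paper. The function names, the 0.5 threshold, and the tiny word list (informal romanizations of the word for "hand") are all assumptions made for illustration.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 minus edit distance over the longer length."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def candidate_cognates(wordlists, threshold=0.5):
    """Return {(lang1, lang2): [(concept, word1, word2, score), ...]}.

    `wordlists` maps language -> {concept: word}. A pair of words for the
    same concept is kept when its similarity reaches `threshold`
    (a hypothetical cut-off, not a value taken from the paper).
    """
    result = {}
    for lang1, lang2 in combinations(sorted(wordlists), 2):
        shared = sorted(set(wordlists[lang1]) & set(wordlists[lang2]))
        hits = []
        for concept in shared:
            w1, w2 = wordlists[lang1][concept], wordlists[lang2][concept]
            score = similarity(w1, w2)
            if score >= threshold:
                hits.append((concept, w1, w2, score))
        result[(lang1, lang2)] = hits
    return result


# Illustrative input only: informal romanizations, one concept per language.
WORDLISTS = {
    "Sanskrit": {"hand": "hasta"},
    "Hindi": {"hand": "haath"},
    "Marathi": {"hand": "haat"},
    "Punjabi": {"hand": "hatth"},
}

if __name__ == "__main__":
    for pair, hits in candidate_cognates(WORDLISTS).items():
        print(pair, hits)
```

Surface similarity of this kind tends to miss cognates whose forms have diverged through sound change, which is one reason the abstract stresses isolating diachronic sound changes rather than relying on string overlap alone.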

Original language: English
Title of host publication: CODS-COMAD 2019
Subtitle of host publication: Proceedings of the 6th ACM IKDD CoDS and 24th COMAD, January 3 - 5, 2019, Kolkata, India
Editors: Raghu Krishnapuram, Parag Singla
Place of Publication: New York NY USA
Publisher: Association for Computing Machinery (ACM)
Pages: 297-300
Number of pages: 4
ISBN (Electronic): 9781450362078
DOIs: https://doi.org/10.1145/3297001.3297045
Publication status: Published - 2019
Event: ACM India Joint International Conference on Data Science and Management of Data (CoDS-COMAD) 2019 - Kolkata, India
Duration: 3 Jan 2019 to 5 Jan 2019
Conference number: 6th & 24th
https://cods-comad.in/2019/index.html

Conference

Conference: ACM India Joint International Conference on Data Science and Management of Data (CoDS-COMAD) 2019
Abbreviated title: ACM iKDD CoDS and COMAD
Country: India
City: Kolkata
Period: 3/01/19 to 5/01/19
Internet address: https://cods-comad.in/2019/index.html

Keywords

  • Cognate Detection
  • Cognate Identification
  • Computational Phylogenetics
  • Historical Linguistics
  • Indian Languages
  • Natural Language Processing
  • Phylogenetic Tree Generation
  • Phylogenetics

Cite this

Kanojia, D., Bhattacharyya, P., Kulkarni, M., & Haffari, G. (2019). Cognate identification to improve phylogenetic trees for Indian languages. In R. Krishnapuram, & P. Singla (Eds.), CODS-COMAD 2019 : Proceedings of the 6th ACM IKDD CoDS and 24th COMAD, January 3 - 5, 2019, Kolkata, India (pp. 297-300). [150] New York NY USA: Association for Computing Machinery (ACM). https://doi.org/10.1145/3297001.3297045
Kanojia, Diptesh ; Bhattacharyya, Pushpak ; Kulkarni, Malhar ; Haffari, Gholemreza. / Cognate identification to improve phylogenetic trees for Indian languages. CODS-COMAD 2019 : Proceedings of the 6th ACM IKDD CoDS and 24th COMAD, January 3 - 5, 2019, Kolkata, India. editor / Raghu Krishnapuram ; Parag Singla. New York NY USA : Association for Computing Machinery (ACM), 2019. pp. 297-300
@inproceedings{4d82f8ce1fa342ee980f3a23fc20649d,
title = "Cognate identification to improve phylogenetic trees for Indian languages",
abstract = "Cognates are present in multiple variants of the same text across different languages. Computational Phylogenetics uses algorithms and techniques to analyze these variants and infer phylogenetic trees for a hypothesized accurate representation based on the output of the computational algorithm used. In our work, we detect cognates among a few Indian languages namely Hindi, Marathi, Punjabi, and Sanskrit for helping build cognate sets for phylogenetic inference. Cognate detection helps phylogenetic inference by helping isolate diachronic sound changes and thus detect the words of a common origin. A cognate set manually annotated with the help of a lexicographer is generally used to automatically infer phylogenetic trees. Our work creates cognate sets of each language pair and infers phylogenetic trees based on a bayesian framework using the Maximum likelihood method. We also implement our work to an online interface and infer phylogenetic trees based on automatically detected cognate sets. The online interface helps create phylogenetic trees based on the textual data provided as an input. It helps a lexicographer provide manual input of data, edit the data based on their expert opinion and eventually create phylogenetic trees based on various algorithms including our work on automatically creating cognate sets. We go on to discuss the nuances in detection cognates with respect to these Indian languages and also discuss the categorization of Cognate words i.e., “Tatasama” and “Tadbhava” words.",
keywords = "Cognate Detection, Cognate Identification, Computational Phylogenetics, Historical Linguistics, Indian Languages, Natural Language Processing, Phylogenetic Tree Generation, Phylogenetics",
author = "Diptesh Kanojia and Pushpak Bhattacharyya and Malhar Kulkarni and Gholemreza Haffari",
year = "2019",
doi = "10.1145/3297001.3297045",
language = "English",
pages = "297--300",
editor = "Raghu Krishnapuram and Parag Singla",
booktitle = "CODS-COMAD 2019",
publisher = "Association for Computing Machinery (ACM)",
address = "United States of America",

}

Kanojia, D, Bhattacharyya, P, Kulkarni, M & Haffari, G 2019, Cognate identification to improve phylogenetic trees for Indian languages. in R Krishnapuram & P Singla (eds), CODS-COMAD 2019 : Proceedings of the 6th ACM IKDD CoDS and 24th COMAD, January 3 - 5, 2019, Kolkata, India., 150, Association for Computing Machinery (ACM), New York NY USA, pp. 297-300, ACM India Joint International Conference on Data Science and Management of Data (CoDS-COMAD) 2019 , Kolkata, India, 3/01/19. https://doi.org/10.1145/3297001.3297045

Cognate identification to improve phylogenetic trees for Indian languages. / Kanojia, Diptesh; Bhattacharyya, Pushpak; Kulkarni, Malhar; Haffari, Gholemreza.

CODS-COMAD 2019 : Proceedings of the 6th ACM IKDD CoDS and 24th COMAD, January 3 - 5, 2019, Kolkata, India. ed. / Raghu Krishnapuram; Parag Singla. New York NY USA : Association for Computing Machinery (ACM), 2019. p. 297-300 150.

TY - GEN

T1 - Cognate identification to improve phylogenetic trees for Indian languages

AU - Kanojia, Diptesh

AU - Bhattacharyya, Pushpak

AU - Kulkarni, Malhar

AU - Haffari, Gholemreza

PY - 2019

Y1 - 2019

AB - Cognates are present in multiple variants of the same text across different languages. Computational Phylogenetics uses algorithms and techniques to analyze these variants and infer phylogenetic trees for a hypothesized accurate representation based on the output of the computational algorithm used. In our work, we detect cognates among a few Indian languages namely Hindi, Marathi, Punjabi, and Sanskrit for helping build cognate sets for phylogenetic inference. Cognate detection helps phylogenetic inference by helping isolate diachronic sound changes and thus detect the words of a common origin. A cognate set manually annotated with the help of a lexicographer is generally used to automatically infer phylogenetic trees. Our work creates cognate sets of each language pair and infers phylogenetic trees based on a bayesian framework using the Maximum likelihood method. We also implement our work to an online interface and infer phylogenetic trees based on automatically detected cognate sets. The online interface helps create phylogenetic trees based on the textual data provided as an input. It helps a lexicographer provide manual input of data, edit the data based on their expert opinion and eventually create phylogenetic trees based on various algorithms including our work on automatically creating cognate sets. We go on to discuss the nuances in detection cognates with respect to these Indian languages and also discuss the categorization of Cognate words i.e., “Tatasama” and “Tadbhava” words.

KW - Cognate Detection

KW - Cognate Identification

KW - Computational Phylogenetics

KW - Historical Linguistics

KW - Indian Languages

KW - Natural Language Processing

KW - Phylogenetic Tree Generation

KW - Phylogenetics

UR - http://www.scopus.com/inward/record.url?scp=85061122467&partnerID=8YFLogxK

U2 - 10.1145/3297001.3297045

DO - 10.1145/3297001.3297045

M3 - Conference Paper

SP - 297

EP - 300

BT - CODS-COMAD 2019

A2 - Krishnapuram, Raghu

A2 - Singla, Parag

PB - Association for Computing Machinery (ACM)

CY - New York NY USA

ER -

Kanojia D, Bhattacharyya P, Kulkarni M, Haffari G. Cognate identification to improve phylogenetic trees for Indian languages. In Krishnapuram R, Singla P, editors, CODS-COMAD 2019 : Proceedings of the 6th ACM IKDD CoDS and 24th COMAD, January 3 - 5, 2019, Kolkata, India. New York NY USA: Association for Computing Machinery (ACM). 2019. p. 297-300. 150 https://doi.org/10.1145/3297001.3297045