Exploiting parse structures for native language identification

Sze-Meng Wong, Mark Dras

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research

65 Citations (Scopus)


Attempts to profile authors according to their characteristics extracted from textual data, including native language, have drawn attention in recent years, via various machine learning approaches utilising mostly lexical features. Drawing on the idea of contrastive analysis, which postulates that syntactic errors in a text are to some extent influenced by the native language of an author, this paper explores the usefulness of syntactic features for native language identification. We take two types of parse substructure as features (horizontal slices of trees, and the more general feature schemas from discriminative parse reranking) and show that using this kind of syntactic feature results in an accuracy score in classification of seven native languages of around 80%, an error reduction of more than 30%.
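As a rough illustration of the first feature type, the sketch below counts the production rules in a bracketed parse tree. It assumes that a "horizontal slice" corresponds to a single context-free production (a parent label and its immediate children), and that parses arrive as Penn-Treebank-style bracketed strings. The function names `parse_bracketed` and `production_rules` are hypothetical and not taken from the paper. The sketch does not cover the parse reranking feature schemas or the classifier itself.

```python
from collections import Counter


def parse_bracketed(text):
    """Parse a bracketed tree string into nested (label, children) tuples.

    Hypothetical helper: assumes well-formed Penn-Treebank-style input.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())       # nested constituent
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()


def production_rules(tree, lexicalised=False):
    """Count productions such as 'S -> NP VP' in a parsed tree.

    With lexicalised=False (an assumption made here), rules that rewrite
    a part-of-speech tag to a word are skipped, so the features stay
    purely syntactic.
    """
    label, children = tree
    rules = Counter()
    child_labels = [c[0] if isinstance(c, tuple) else c for c in children]
    is_preterminal = all(isinstance(c, str) for c in children)
    if lexicalised or not is_preterminal:
        rules[f"{label} -> {' '.join(child_labels)}"] += 1
    for child in children:
        if isinstance(child, tuple):
            rules += production_rules(child, lexicalised)
    return rules


tree = parse_bracketed("(S (NP (PRP I)) (VP (VBP like) (NP (NNS apples))))")
features = production_rules(tree)
for rule, count in sorted(features.items()):
    print(count, rule)
```

Counts of this kind, gathered per document, could then be passed as a feature vector to any standard classifier. Which classifier and feature selection the authors used is not stated in this record.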

Original language: English
Title of host publication: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2011)
Editors: Regina Barzilay, Mark Johnson
Place of Publication: Stroudsburg, PA, USA
Publisher: Association for Computing Machinery (ACM)
Number of pages: 11
ISBN (Print): 9781937284114
Publication status: Published - 2011
Externally published: Yes
Event: Empirical Methods in Natural Language Processing 2011 - John McIntyre Conference Centre, Edinburgh, United Kingdom
Duration: 27 Jul 2011 to 29 Jul 2011
https://www.aclweb.org/anthology/volumes/D11-1/ (Proceedings)


Conference: Empirical Methods in Natural Language Processing 2011
Abbreviated title: EMNLP 2011
Country/Territory: United Kingdom
Other: SIGDAT, the Association for Computational Linguistics' special interest group on linguistic data and corpus-based approaches to NLP, invites participation in EMNLP 2011, Conference on Empirical Methods in Natural Language Processing.

The conference will be held on July 27-29 (Wed–Fri) at the John McIntyre Conference Centre, Edinburgh, UK. Workshops will be held on July 30-31 (Sat-Sun) at the Informatics Forum, Edinburgh.


  • Native language identification
