Topic Modeling for Native Language Identification

Sze-Meng Jojo Wong, Mark Dras, Mark Johnson

Research output: Chapter in Book/Report/Conference proceeding > Conference Paper > Research


Native language identification (NLI) is the task of determining the native language of an author writing in a second language. Several pieces of earlier work have found that features such as function words, part-of-speech n-grams and syntactic structure are helpful in NLI, perhaps representing characteristic errors of different native language speakers. This paper looks at the idea of using Latent Dirichlet Allocation as a feature clustering technique over lexical features to see whether there is any evidence that these smaller-scale features do cluster into more coherent latent factors, and investigates their effect in a classification task. We find that although (not unexpectedly) classification accuracy decreases, there is some evidence of coherent clustering, which could help with much larger syntactic feature spaces.
Original language: English
Title of host publication: Proceedings of the Australasian Language Technology Association Workshop (ALTA 2011)
Editors: Diego Molla, David Martinez
Place of Publication: Canberra, Australia
Publisher: Association for Computational Linguistics (ACL)
Number of pages: 10
Publication status: Published - 1 Dec 2011
Externally published: Yes
Event: Australasian Language Technology Association Workshop 2011 - Australian National University, Canberra, Australia
Duration: 1 Dec 2011 to 2 Dec 2011
Conference number: 9th (Proceedings)


Conference: Australasian Language Technology Association Workshop 2011
Abbreviated title: ALTAW 2011
Other: This year, the Australasian Language Technology Workshop (ALTA) was held at the Australian National University (ANU) in Canberra on Thursday 1st and Friday 2nd of December 2011. This event was the ninth annual installment of the ALTA Workshop in its most-recent incarnation, and the continuation of an annual workshop series that has existed under various guises since the early 90s.

The ALTA Workshop was held in conjunction with langfest 2011, which included the 2nd combined conference of the Applied Linguistics Association of Australia (ALAA) and the Applied Linguistics Association of New Zealand (ALANZ), as well as the 42nd Annual Conference of the Australian Linguistics Society (ALS) and the 16th Australasian Document Computing Symposium (ADCS 2011).

  • Topic models
  • Native language identification
