Abstract
Native language identification (NLI) is the task of determining the native language of an author writing in a second language. Earlier work has found that features such as function words, part-of-speech n-grams and syntactic structure are helpful in NLI, perhaps because they reflect errors characteristic of speakers of different native languages. This paper examines Latent Dirichlet Allocation (LDA) as a feature clustering technique over lexical features, to see whether there is any evidence that these smaller-scale features cluster into more coherent latent factors, and investigates their effect in a classification task. We find that although (not unexpectedly) classification accuracy decreases, there is some evidence of coherent clustering, which could help with much larger syntactic feature spaces.
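The idea of using Latent Dirichlet Allocation to compress lexical features into a few latent factors can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy texts, the six-word function-word vocabulary, the hyperparameters, and the choice of a collapsed Gibbs sampler are not taken from the paper, whose actual features and tooling are not described in this record.

```python
import random

def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA; returns per-document topic proportions."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # tokens of word w in topic k
    topic_total = [0] * n_topics                              # all tokens in topic k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed document-topic proportions serve as the reduced feature vectors.
    return [
        [(doc_topic[d][t] + alpha) / (len(doc) + n_topics * alpha)
         for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]

# Hypothetical toy data: each "document" is a bag of function words.
vocab = ["the", "a", "of", "in", "on", "that"]
word_id = {w: i for i, w in enumerate(vocab)}
texts = [
    "the of the in the of",
    "a on a that a on",
    "the in of the of in",
    "that a on on a that",
]
docs = [[word_id[w] for w in t.split()] for t in texts]
features = lda_gibbs(docs, n_topics=2, vocab_size=len(vocab))
```

Each row of `features` is a two-dimensional topic-proportion vector that replaces the original six-dimensional word-count vector. In a setup like the one the abstract describes, such reduced vectors would then be passed to a classifier in place of the raw lexical features.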
Original language | English |
---|---|
Title of host publication | Proceedings of the Australasian Language Technology Association Workshop (ALTA 2011) |
Editors | Diego Molla, David Martinez |
Place of Publication | Canberra, Australia |
Publisher | Association for Computational Linguistics (ACL) |
Pages | 115-124 |
Number of pages | 10 |
Publication status | Published - 1 Dec 2011 |
Externally published | Yes |
Event | Australasian Language Technology Association Workshop 2011, Australian National University, Canberra, Australia. Duration: 1 Dec 2011 → 2 Dec 2011. Conference number: 9th. Proceedings: https://www.aclweb.org/anthology/events/alta-2011/ |
Conference
Conference | Australasian Language Technology Association Workshop 2011 |
---|---|
Abbreviated title | ALTAW 2011 |
Country/Territory | Australia |
City | Canberra |
Period | 1/12/11 → 2/12/11 |
Other | This year, the Australasian Language Technology Workshop (ALTA) was held at the Australian National University (ANU) in Canberra on Thursday 1st and Friday 2nd of December 2011. This event was the ninth annual installment of the ALTA Workshop in its most recent incarnation, and the continuation of an annual workshop series that has existed under various guises since the early 1990s. The ALTA Workshop was held in conjunction with langfest 2011, which included the 2nd combined conference of the Applied Linguistics Association of Australia (ALAA) and the Applied Linguistics Association of New Zealand (ALANZ), as well as the 42nd Annual Conference of the Australian Linguistics Society (ALS) and the 16th Australasian Document Computing Symposium (ADCS 2011). |
Keywords
- Topic models
- Native language identification