Constrained sequence classification for lexical disambiguation

Tran The Truyen, Dinh Q. Phung, Svetha Venkatesh

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review

7 Citations (Scopus)


This paper addresses lexical ambiguity, focusing on a particular problem known as accent prediction: given an accentless sequence, the correct accents must be restored. This can be modelled as a sequence classification problem to which variants of Markov chains can be applied. Although the state space is large (roughly the vocabulary size), it is highly constrained when conditioned on the observed data. We investigate several methods, including Powered Product-of-N-grams, Structured Perceptron and Conditional Random Fields (CRFs). We show empirically, for Vietnamese, that these methods are fairly robust and efficient. Second-order CRFs achieve the best results, with about 94% term accuracy.
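The constraint described in the abstract can be illustrated with a minimal sketch. It is not one of the models evaluated in the paper: it swaps in a plain first-order bigram model with add-alpha smoothing, trained on a three-sentence toy corpus invented for this example. The function names (`strip_accents`, `train`, `restore`) and the parameter `alpha` are hypothetical. The point it shows is that, once the accentless token is observed, decoding only has to consider the few accented forms that map to that token, rather than the whole vocabulary.

```python
import math
import unicodedata
from collections import defaultdict


def strip_accents(word):
    """Map an accented word to its accentless form (the observation)."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # The letter đ/Đ has no canonical decomposition, so handle it explicitly.
    return stripped.replace("đ", "d").replace("Đ", "D")


def train(sentences):
    """Collect candidate sets and bigram counts from accented sentences."""
    candidates = defaultdict(set)  # accentless form -> accented forms seen
    context = defaultdict(int)     # count of each word used as a left context
    bigram = defaultdict(int)      # count of each (previous, current) pair
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        for prev, cur in zip(words, words[1:]):
            candidates[strip_accents(cur)].add(cur)
            context[prev] += 1
            bigram[(prev, cur)] += 1
    return candidates, context, bigram


def restore(tokens, candidates, context, bigram, alpha=0.1):
    """Viterbi decoding restricted to the candidates of each accentless token."""
    vocab_size = sum(len(forms) for forms in candidates.values())

    def score(prev, cur):
        # Add-alpha smoothed log P(cur | prev); an assumption of this sketch.
        numerator = bigram.get((prev, cur), 0) + alpha
        denominator = context.get(prev, 0) + alpha * vocab_size
        return math.log(numerator / denominator)

    best = {"<s>": (0.0, [])}  # state -> (log score, best path ending there)
    for token in tokens:
        # Unknown tokens are passed through unchanged.
        options = candidates.get(token, {token})
        best = {
            cur: max(
                ((s + score(prev, cur), path + [cur])
                 for prev, (s, path) in best.items()),
                key=lambda item: item[0],
            )
            for cur in options
        }
    return max(best.values(), key=lambda item: item[0])[1]


# Toy corpus, invented for illustration only.
corpus = ["tôi đi học", "tôi ăn cơm", "tối nay tôi đi ngủ"]
candidates, context, bigram = train(corpus)
restored = restore("toi nay toi di hoc".split(), candidates, context, bigram)
print(" ".join(restored))
```

In this toy setting the token `toi` has two candidates (`tôi`, `tối`) and every other token has one, so each decoding step compares at most two states. A real system would need a large corpus and a stronger model, such as the second-order CRF the abstract reports as most accurate.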

Original language: English
Title of host publication: PRICAI 2008
Subtitle of host publication: Trends in Artificial Intelligence - 10th Pacific Rim International Conference on Artificial Intelligence, Proceedings
Number of pages: 12
Publication status: Published - 1 Dec 2008
Externally published: Yes
Event: Pacific Rim International Conference on Artificial Intelligence 2008 - Hanoi, Vietnam
Duration: 15 Dec 2008 to 19 Dec 2008
Conference number: 10th

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 5351 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Conference: Pacific Rim International Conference on Artificial Intelligence 2008
Abbreviated title: PRICAI-2008


  • Conditional random fields
  • Constrained sequence classification
  • Lexical disambiguation
  • Vietnamese accent restoration
