TY - JOUR
T1 - Vector representation based on a supervised codebook for Nepali documents classification
AU - Sitaula, Chiranjibi
AU - Basnet, Anish
AU - Aryal, Sunil
N1 - Publisher Copyright:
© Copyright 2021 Sitaula et al.
Copyright:
Copyright 2021 Elsevier B.V., All rights reserved.
PY - 2021/3/3
Y1 - 2021/3/3
N2 - Document representation with outlier tokens degrades classification performance due to the uncertain orientation of such tokens. Most existing document representation methods in different languages, including Nepali, ignore strategies to filter such tokens out of documents before learning their representations. In this article, we propose a novel document representation method based on a supervised codebook to represent Nepali documents, where our codebook contains only semantic tokens without outliers. Our codebook is domain-specific, as it is based on tokens in a given corpus that have higher similarities with the class labels in the corpus. Our method adopts a simple yet prominent representation method for each word, called probability-based word embedding. To show the efficacy of our method, we evaluate its performance in the document classification task using a Support Vector Machine and validate it against widely used document representation methods such as Bag of Words, Latent Dirichlet Allocation, Long Short-Term Memory, Word2Vec and Bidirectional Encoder Representations from Transformers, using four Nepali text datasets (denoted A1, A2, A3 and A4). The experimental results show that our method produces state-of-the-art classification performance (77.46% accuracy on A1, 67.53% on A2, 80.54% on A3 and 89.58% on A4) compared to widely used existing document representation methods. It yields the best classification accuracy on three datasets (A1, A2 and A3) and a comparable accuracy on the fourth (A4). Furthermore, we introduce the largest Nepali document dataset (A4), called the NepaliLinguistic dataset, to the linguistic community.
AB - Document representation with outlier tokens degrades classification performance due to the uncertain orientation of such tokens. Most existing document representation methods in different languages, including Nepali, ignore strategies to filter such tokens out of documents before learning their representations. In this article, we propose a novel document representation method based on a supervised codebook to represent Nepali documents, where our codebook contains only semantic tokens without outliers. Our codebook is domain-specific, as it is based on tokens in a given corpus that have higher similarities with the class labels in the corpus. Our method adopts a simple yet prominent representation method for each word, called probability-based word embedding. To show the efficacy of our method, we evaluate its performance in the document classification task using a Support Vector Machine and validate it against widely used document representation methods such as Bag of Words, Latent Dirichlet Allocation, Long Short-Term Memory, Word2Vec and Bidirectional Encoder Representations from Transformers, using four Nepali text datasets (denoted A1, A2, A3 and A4). The experimental results show that our method produces state-of-the-art classification performance (77.46% accuracy on A1, 67.53% on A2, 80.54% on A3 and 89.58% on A4) compared to widely used existing document representation methods. It yields the best classification accuracy on three datasets (A1, A2 and A3) and a comparable accuracy on the fourth (A4). Furthermore, we introduce the largest Nepali document dataset (A4), called the NepaliLinguistic dataset, to the linguistic community.
KW - Classification
KW - Codebook
KW - Feature extraction
KW - Machine learning
KW - Nepali documents
KW - Text classification
UR - http://www.scopus.com/inward/record.url?scp=85102826060&partnerID=8YFLogxK
U2 - 10.7717/peerj-cs.412
DO - 10.7717/peerj-cs.412
M3 - Article
C2 - 33817053
AN - SCOPUS:85102826060
VL - 7
JO - PeerJ Computer Science
JF - PeerJ Computer Science
SN - 2376-5992
ER -