Statistical compression-based models for text classification

Vidya Saikrishna, David L. Dowe, Sid Ray

    Research output: Chapter in Book/Report/Conference proceeding > Conference Paper > Research

    2 Citations (Scopus)

    Abstract

    Text classification is the task of assigning predefined categories to text documents, and it is a common machine learning problem. Statistical text classification methods, which use machine learning to learn classification rules, are known to be particularly successful at this task. In this research project we reformulate the text classification problem with a sound methodology based on a statistical data compression technique, the Minimum Message Length (MML) principle. To model the data sequences we use Probabilistic Finite State Automata (PFSAs). We propose two approaches to text classification using MML-PFSAs. We have tested both approaches on the Enron spam dataset and recorded the results of our empirical evaluation in terms of the well-known classification measures, i.e. recall, precision, accuracy and error. The results indicate good classification accuracy, comparable with that of state-of-the-art classifiers.
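The four measures named in the abstract can be illustrated with a minimal sketch. It assumes the standard binary-classification definitions with spam as the positive class; the names `ConfusionCounts` and `classification_measures` are hypothetical, and the counts in the usage lines are made up for illustration and are not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from a binary classifier, with spam assumed to be the positive class."""
    tp: int  # spam correctly labelled spam
    fp: int  # non-spam wrongly labelled spam
    tn: int  # non-spam correctly labelled non-spam
    fn: int  # spam wrongly labelled non-spam


def classification_measures(c: ConfusionCounts) -> dict:
    """Return recall, precision, accuracy and error for the given counts."""
    total = c.tp + c.fp + c.tn + c.fn
    if total == 0:
        raise ValueError("at least one classified document is required")
    # Recall: fraction of actual spam that was detected.
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    # Precision: fraction of documents labelled spam that really were spam.
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    # Accuracy: fraction of all documents labelled correctly.
    accuracy = (c.tp + c.tn) / total
    # Error: fraction of all documents labelled incorrectly.
    error = (c.fp + c.fn) / total
    return {
        "recall": recall,
        "precision": precision,
        "accuracy": accuracy,
        "error": error,
    }


# Illustrative counts only (not taken from the paper).
example = ConfusionCounts(tp=40, fp=10, tn=45, fn=5)
measures = classification_measures(example)
```

Accuracy and error always sum to one under these definitions, so reporting both is mainly a convenience for comparison with other classifiers.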

    Original language: English
    Title of host publication: Proceedings on 5th International Conference on Eco-Friendly Computing and Communication Systems, ICECCS 2016
    Editors: Vinod Prasad, Ashutosh Kumar Singh, Jimson Mathew
    Place of Publication: Piscataway USA
    Publisher: IEEE, Institute of Electrical and Electronics Engineers
    Pages: 1-6
    Number of pages: 6
    ISBN (Electronic): 9781509043590
    ISBN (Print): 9781509043560
    Publication status: Published - 5 Apr 2017
    Event: 5th International Conference on Eco-Friendly Computing and Communication Systems, ICECCS 2016 - Bhopal, Madhya Pradesh, India
    Duration: 8 Nov 2016 to 9 Nov 2016

    Conference

    Conference: 5th International Conference on Eco-Friendly Computing and Communication Systems, ICECCS 2016
    Country/Territory: India
    City: Bhopal, Madhya Pradesh
    Period: 8/11/16 to 9/11/16

    Keywords

    • Minimum Message Length (MML)
    • Probabilistic Finite State Automaton (PFSA)
    • Spam Filtering
