Statistical compression-based models for text classification

Vidya Saikrishna, David L. Dowe, Sid Ray

    Research output: Chapter in Book/Report/Conference proceeding > Conference Paper > Research

    Abstract

    Text classification is the task of assigning predefined categories to text documents, and it is a common machine learning problem. Statistical text classification, which uses machine learning methods to learn classification rules, is known to be particularly successful at this task. In this research project we attempt to recast the text classification problem with a sound methodology based on a statistical data compression technique, the Minimum Message Length (MML) principle. To model the data sequences we use Probabilistic Finite State Automata (PFSAs). We propose two approaches to text classification using MML-PFSAs. We have tested both approaches on the Enron spam dataset and recorded the results of our empirical evaluation in terms of the well-known classification measures of recall, precision, accuracy and error. The results indicate good classification accuracy, comparable with that of state-of-the-art classifiers.
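
    The four evaluation measures named in the abstract can be sketched as follows. This is a minimal illustration using the standard textbook definitions for a binary spam/ham task, not the authors' evaluation code; the names `ConfusionCounts` and `classification_measures` and the example counts are hypothetical, and treating spam as the positive class is an assumption.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion counts (assumption: spam is the positive class)."""
    tp: int  # spam correctly labelled spam
    fp: int  # ham wrongly labelled spam
    fn: int  # spam wrongly labelled ham
    tn: int  # ham correctly labelled ham


def classification_measures(c: ConfusionCounts) -> Dict[str, float]:
    """Return recall, precision, accuracy and error for the given counts."""
    total = c.tp + c.fp + c.fn + c.tn
    if total == 0:
        raise ValueError("no documents were classified")
    # Recall: fraction of actual spam that was caught.
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    # Precision: fraction of messages labelled spam that really were spam.
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    # Accuracy: fraction of all messages labelled correctly; error is its complement.
    accuracy = (c.tp + c.tn) / total
    return {
        "recall": recall,
        "precision": precision,
        "accuracy": accuracy,
        "error": 1.0 - accuracy,
    }


if __name__ == "__main__":
    # Made-up counts for illustration only; these are not results from the paper.
    example = ConfusionCounts(tp=40, fp=10, fn=5, tn=45)
    for name, value in classification_measures(example).items():
        print(f"{name}: {value:.3f}")
```

    Reporting error alongside accuracy is redundant in the binary case (error = 1 - accuracy), but the sketch returns both because the abstract lists both.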

    Original language: English
    Title of host publication: Proceedings on 5th International Conference on Eco-Friendly Computing and Communication Systems, ICECCS 2016
    Editors: Vinod Prasad, Ashutosh Kumar Singh, Jimson Mathew
    Place of Publication: Piscataway USA
    Publisher: IEEE, Institute of Electrical and Electronics Engineers
    Pages: 1-6
    Number of pages: 6
    ISBN (Electronic): 9781509043590
    ISBN (Print): 9781509043560
    DOIs: https://doi.org/10.1109/Eco-friendly.2016.7893212
    Publication status: Published - 5 Apr 2017
    Event: 5th International Conference on Eco-Friendly Computing and Communication Systems, ICECCS 2016 - Bhopal, Madhya Pradesh, India
    Duration: 8 Nov 2016 to 9 Nov 2016

    Conference

    Conference: 5th International Conference on Eco-Friendly Computing and Communication Systems, ICECCS 2016
    Country: India
    City: Bhopal, Madhya Pradesh
    Period: 8/11/16 to 9/11/16

    Keywords

    • Minimum Message Length (MML)
    • Probabilistic Finite State Automaton (PFSA)
    • Spam Filtering

    Cite this

    Saikrishna, V., Dowe, D. L., & Ray, S. (2017). Statistical compression-based models for text classification. In V. Prasad, A. K. Singh, & J. Mathew (Eds.), Proceedings on 5th International Conference on Eco-Friendly Computing and Communication Systems, ICECCS 2016 (pp. 1-6). [7893212] Piscataway USA: IEEE, Institute of Electrical and Electronics Engineers. https://doi.org/10.1109/Eco-friendly.2016.7893212
    @inproceedings{1672d305ec7e4223b9d827794ae14133,
    title = "Statistical compression-based models for text classification",
    abstract = "Text classification is the task of assigning predefined categories to text documents, and it is a common machine learning problem. Statistical text classification, which uses machine learning methods to learn classification rules, is known to be particularly successful at this task. In this research project we attempt to recast the text classification problem with a sound methodology based on a statistical data compression technique, the Minimum Message Length (MML) principle. To model the data sequences we use Probabilistic Finite State Automata (PFSAs). We propose two approaches to text classification using MML-PFSAs. We have tested both approaches on the Enron spam dataset and recorded the results of our empirical evaluation in terms of the well-known classification measures of recall, precision, accuracy and error. The results indicate good classification accuracy, comparable with that of state-of-the-art classifiers.",
    keywords = "Minimum Message Length (MML), Probabilistic Finite State Automaton (PFSA), Spam Filtering",
    author = "Vidya Saikrishna and Dowe, {David L.} and Sid Ray",
    year = "2017",
    month = "4",
    day = "5",
    doi = "10.1109/Eco-friendly.2016.7893212",
    language = "English",
    isbn = "9781509043560",
    pages = "1--6",
    editor = "Prasad, Vinod and Singh, {Ashutosh Kumar} and Mathew, Jimson",
    booktitle = "Proceedings on 5th International Conference on Eco-Friendly Computing and Communication Systems, ICECCS 2016",
    publisher = "IEEE, Institute of Electrical and Electronics Engineers",
    address = "United States of America",

    }

    TY - GEN

    T1 - Statistical compression-based models for text classification

    AU - Saikrishna, Vidya

    AU - Dowe, David L.

    AU - Ray, Sid

    PY - 2017/4/5

    Y1 - 2017/4/5

    AB - Text classification is the task of assigning predefined categories to text documents, and it is a common machine learning problem. Statistical text classification, which uses machine learning methods to learn classification rules, is known to be particularly successful at this task. In this research project we attempt to recast the text classification problem with a sound methodology based on a statistical data compression technique, the Minimum Message Length (MML) principle. To model the data sequences we use Probabilistic Finite State Automata (PFSAs). We propose two approaches to text classification using MML-PFSAs. We have tested both approaches on the Enron spam dataset and recorded the results of our empirical evaluation in terms of the well-known classification measures of recall, precision, accuracy and error. The results indicate good classification accuracy, comparable with that of state-of-the-art classifiers.

    KW - Minimum Message Length (MML)

    KW - Probabilistic Finite State Automaton (PFSA)

    KW - Spam Filtering

    UR - http://www.scopus.com/inward/record.url?scp=85019005497&partnerID=8YFLogxK

    U2 - 10.1109/Eco-friendly.2016.7893212

    DO - 10.1109/Eco-friendly.2016.7893212

    M3 - Conference Paper

    SN - 9781509043560

    SP - 1

    EP - 6

    BT - Proceedings on 5th International Conference on Eco-Friendly Computing and Communication Systems, ICECCS 2016

    A2 - Prasad, Vinod

    A2 - Singh, Ashutosh Kumar

    A2 - Mathew, Jimson

    PB - IEEE, Institute of Electrical and Electronics Engineers

    CY - Piscataway USA

    ER -
