Sample-based attribute selective AnDE for large data

Shenglei Chen, Ana M. Martinez, Geoffrey I. Webb, Limin Wang

    Research output: Contribution to journal › Article › Research › peer-review

    29 Citations (Scopus)

    Abstract

    More and more applications have come with large data sets in the past decade. However, existing algorithms are not guaranteed to scale well on large data. Averaged n-Dependence Estimators (AnDE) allows for flexible learning from out-of-core data by varying the value of n (the number of super parents). Hence, AnDE is especially appropriate for learning from large data. In this paper, we propose a sample-based attribute selection technique for AnDE. It needs one more pass through the training data, in which a multitude of approximate AnDE models are built and efficiently assessed by leave-one-out cross validation. The use of a sample reduces the training time. Experiments on 15 large data sets demonstrate that the proposed technique significantly reduces AnDE's error at the cost of a modest increase in training time. This efficient and scalable out-of-core approach delivers superior or comparable performance to typical in-core Bayesian network classifiers.

    Original language: English
    Article number: 7565579
    Pages (from-to): 172-185
    Number of pages: 14
    Journal: IEEE Transactions on Knowledge and Data Engineering
    Volume: 29
    Issue number: 1
    Publication status: Published - 1 Jan 2017

    Keywords

    • Bayesian network classifiers
    • Large data
    • Classification learning
    • Attribute selection
    • Averaged n-dependence estimators (AnDE)
    • Leave-one-out cross validation
