Sample-based attribute selective AnDE for large data

Shenglei Chen, Ana M. Martinez, Geoffrey I. Webb, Limin Wang

    Research output: Contribution to journal › Article › Research › peer-review

    Abstract

    Over the past decade, a growing number of applications have involved large data sets, yet existing algorithms are not guaranteed to scale well to such data. Averaged n-Dependence Estimators (AnDE) support flexible learning from out-of-core data by varying the value of n (the number of super parents), which makes AnDE especially appropriate for learning from large data. In this paper, we propose a sample-based attribute selection technique for AnDE. It requires one additional pass through the training data, in which a multitude of approximate AnDE models are built and efficiently assessed by leave-one-out cross validation. Using a sample reduces the training time. Experiments on 15 large data sets demonstrate that the proposed technique significantly reduces AnDE's error at the cost of a modest increase in training time. This efficient and scalable out-of-core approach delivers superior or comparable performance to typical in-core Bayesian network classifiers.
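
The idea of assessing many nested models by leave-one-out cross validation on a sample can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the simplest member of the family, AnDE with n = 0 (naive Bayes), categorical attributes held in memory, Laplace smoothing, and candidate models that are the prefixes of a caller-supplied attribute ordering. The function name `loo_errors_nested` and its parameters are hypothetical.

```python
import math
import random
from collections import Counter


def loo_errors_nested(rows, labels, order, sample_size, seed=0):
    """Leave-one-out error of naive Bayes (AnDE with n = 0) for every prefix
    of `order`, estimated on a random sample of the training instances.

    Assumption: nested prefixes of `order` stand in for the candidate models;
    the attribute ordering itself is supplied by the caller.
    """
    n = len(rows)
    classes = sorted(set(labels))
    class_counts = Counter(labels)
    # joint[(a, v, c)] = number of rows with attribute a == v and class c
    joint = Counter()
    values = [set() for _ in rows[0]]
    for row, c in zip(rows, labels):
        for a, v in enumerate(row):
            joint[(a, v, c)] += 1
            values[a].add(v)

    rng = random.Random(seed)
    sample = rng.sample(range(n), min(sample_size, n))
    errors = [0] * len(order)
    for i in sample:
        row, true_c = rows[i], labels[i]
        # Start from the smoothed class prior with instance i's count removed.
        score = {}
        for c in classes:
            held = 1 if c == true_c else 0
            score[c] = math.log(
                (class_counts[c] - held + 1) / (n - 1 + len(classes))
            )
        # Add one attribute at a time, so every prefix model is scored
        # while visiting the instance only once.
        for k, a in enumerate(order):
            for c in classes:
                held = 1 if c == true_c else 0
                num = joint[(a, row[a], c)] - held + 1
                den = class_counts[c] - held + len(values[a])
                score[c] += math.log(num / den)
            pred = max(classes, key=lambda c: score[c])
            errors[k] += pred != true_c
    return [e / len(sample) for e in errors]
```

Because the model is a table of counts, holding out an instance only requires subtracting its own counts, so no model is retrained. A caller would then keep the prefix `order[:k + 1]` whose estimated error is lowest.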

    Original language: English
    Article number: 7565579
    Pages (from-to): 172-185
    Number of pages: 14
    Journal: IEEE Transactions on Knowledge and Data Engineering
    Volume: 29
    Issue number: 1
    DOI: 10.1109/TKDE.2016.2608881
    Publication status: Published - 1 Jan 2017

    Keywords

    • Bayesian network classifiers
    • Large data
    • Classification learning
    • Attribute selection
    • Averaged n-dependence estimators (AnDE)
    • Leave-one-out cross validation

    Cite this

    Chen, Shenglei; Martinez, Ana M.; Webb, Geoffrey I.; Wang, Limin. Sample-based attribute selective AnDE for large data. In: IEEE Transactions on Knowledge and Data Engineering. 2017; Vol. 29, No. 1. pp. 172-185.
    @article{6cbdf79aa7e44b4c83d46f28745a783d,
    title = "Sample-based attribute selective AnDE for large data",
    abstract = "More and more applications have come with large data sets in the past decade. However, existing algorithms cannot guarantee to scale well on large data. Averaged n-Dependence Estimators (AnDE) allows for flexible learning from out-of-core data, by varying the value of n (number of super parents). Hence, AnDE is especially appropriate for large data learning. In this paper, we propose a sample-based attribute selection technique for AnDE. It needs one more pass through the training data, in which a multitude of approximate AnDE models are built and efficiently assessed by leave-one-out cross validation. The use of a sample reduces the training time. Experiments on 15 large data sets demonstrate that the proposed technique significantly reduces AnDE's error at the cost of a modest increase in training time. This efficient and scalable out-of-core approach delivers superior or comparable performance to typical in-core Bayesian network classifiers.",
    keywords = "Bayesian network classifiers, Large data, Classification learning, Attribute selection, Averaged n-dependence estimators (AnDE), Leave-one-out cross validation",
    author = "Shenglei Chen and Martinez, {Ana M.} and Webb, {Geoffrey I.} and Limin Wang",
    year = "2017",
    month = "1",
    day = "1",
    doi = "10.1109/TKDE.2016.2608881",
    language = "English",
    volume = "29",
    pages = "172--185",
    journal = "IEEE Transactions on Knowledge and Data Engineering",
    issn = "1041-4347",
    publisher = "IEEE, Institute of Electrical and Electronics Engineers",
    number = "1",

    }

    TY - JOUR
    T1 - Sample-based attribute selective AnDE for large data
    AU - Chen, Shenglei
    AU - Martinez, Ana M.
    AU - Webb, Geoffrey I.
    AU - Wang, Limin
    PY - 2017/1/1
    Y1 - 2017/1/1
    AB - More and more applications have come with large data sets in the past decade. However, existing algorithms cannot guarantee to scale well on large data. Averaged n-Dependence Estimators (AnDE) allows for flexible learning from out-of-core data, by varying the value of n (number of super parents). Hence, AnDE is especially appropriate for large data learning. In this paper, we propose a sample-based attribute selection technique for AnDE. It needs one more pass through the training data, in which a multitude of approximate AnDE models are built and efficiently assessed by leave-one-out cross validation. The use of a sample reduces the training time. Experiments on 15 large data sets demonstrate that the proposed technique significantly reduces AnDE's error at the cost of a modest increase in training time. This efficient and scalable out-of-core approach delivers superior or comparable performance to typical in-core Bayesian network classifiers.
    KW - Bayesian network classifiers
    KW - Large data
    KW - Classification learning
    KW - Attribute selection
    KW - Averaged n-dependence estimators (AnDE)
    KW - Leave-one-out cross validation
    UR - http://www.scopus.com/inward/record.url?scp=85027493553&partnerID=8YFLogxK
    U2 - 10.1109/TKDE.2016.2608881
    DO - 10.1109/TKDE.2016.2608881
    M3 - Article
    VL - 29
    SP - 172
    EP - 185
    JO - IEEE Transactions on Knowledge and Data Engineering
    JF - IEEE Transactions on Knowledge and Data Engineering
    SN - 1041-4347
    IS - 1
    M1 - 7565579
    ER -