Traditional data mining algorithms are commonly based on fully labeled data, which are often difficult to obtain in practice. In recent years, positive-unlabeled (PU) learning has emerged as a useful technique that addresses this issue by allowing algorithms to learn from positive and unlabeled data only, relaxing the requirement for fully labeled data. Existing PU learning algorithms based on Bayesian classifiers, including PNB and PAODE, have been successfully applied to a variety of classification problems; however, their empirical performance is limited by the attribute independence assumption. To tackle positive-unlabeled learning tasks with higher-order attribute dependence, we propose a novel PU learning algorithm in this study, termed PAnDE, which extends the AnDE (Averaged n-Dependence Estimators) algorithm under the 'selected completely at random' assumption. We performed benchmarking tests comparing PAnDE with PNB (based on Naive Bayes) and PAODE (based on Averaged One-Dependence Estimators) on 20 UCI datasets and three additional real-world (human protein glycosylation) datasets. The results demonstrate that PAnDE outperformed PNB and PAODE, highlighting the predictive power of PAnDE and its scalability across a range of real-world applications.
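As background, the 'selected completely at random' (SCAR) assumption mentioned above states that labeled positives are drawn uniformly from all positives, so the probability an example is labeled, p(s=1|x), equals a constant c times p(y=1|x). The minimal sketch below illustrates the Elkan-Noto style calibration this enables; it is a hypothetical illustration of the SCAR idea only, not the paper's PAnDE algorithm, and all function names are our own.

```python
# Hypothetical sketch of PU posterior calibration under the SCAR
# assumption: p(s=1|x) = c * p(y=1|x), where c = p(s=1 | y=1).
# A "traditional" classifier is first trained to separate labeled
# positives from unlabeled examples, producing p(s=1|x) scores.

def estimate_c(probs_on_labeled_positives):
    """Estimate c = p(s=1 | y=1) as the average predicted
    labeling probability over known positive examples."""
    probs = probs_on_labeled_positives
    return sum(probs) / len(probs)

def pu_posterior(prob_labeled, c):
    """Convert p(s=1|x) into the true-class posterior p(y=1|x),
    clipped to the valid probability range."""
    return min(prob_labeled / c, 1.0)

# Toy usage with made-up classifier scores on labeled positives:
c = estimate_c([0.45, 0.55, 0.50])   # c = 0.5
print(pu_posterior(0.25, c))          # p(y=1|x) = 0.5
```

Under SCAR, any classifier whose scores approximate p(s=1|x), including Bayesian models such as AnDE, can be rescaled this way to recover class posteriors from positive and unlabeled data alone.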
Number of pages: 11
Journal: ICIC Express Letters, Part B: Applications
Publication status: Published - Sep 2017
- Averaged n-dependence estimators
- Bayesian classification
- Positive unlabeled learning