Abstract
Logistic regression (LR) is a workhorse for classification in industry, but it requires a predefined set of classes. The model therefore fails when the class labels are not known in advance, a problem we term label-drift classification. The label-drift classification problem occurs naturally in many applications, especially in streaming settings where the incoming data may contain samples labeled with classes that have not been seen before. Additionally, in the era of big data, traditional LR methods may become impractical because of their running-time cost. In this paper, we introduce a novel variant of LR, namely one-pass logistic regression (OLR), to offer a principled treatment of label-drift and large-scale classification. To handle large-scale classification for big data, we further extend OLR to a distributed setting for parallelization, termed sparkling OLR (Spark-OLR). We demonstrate the scalability of our proposed methods on large-scale datasets with more than one hundred million data points. The experimental results show that the predictive performance of our methods is comparable to or better than that of state-of-the-art baselines, whilst the execution time is an order of magnitude faster. In addition, OLR and Spark-OLR are invariant to data shuffling and have no hyperparameters to tune, which significantly benefits data practitioners and removes the need for costly cross-validation on big data to select optimal hyperparameters.
Original language | English |
---|---|
Title of host publication | Proceedings - 16th IEEE International Conference on Data Mining, ICDM 2016 |
Editors | Francesco Bonchi, Josep Domingo-Ferrer, Ricardo Baeza-Yates, Zhi-Hua Zhou, Xindong Wu |
Place of Publication | Los Alamitos CA USA |
Publisher | IEEE, Institute of Electrical and Electronics Engineers |
Pages | 1113-1118 |
Number of pages | 6 |
ISBN (Print) | 9781509054725 |
Publication status | Published - 2016 |
Externally published | Yes |
Event | IEEE International Conference on Data Mining 2016, Barcelona, Catalonia, Spain. Duration: 12 Dec 2016 → 15 Dec 2016. Conference number: 16th. http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=7837023 (IEEE Conference Proceedings) |
Conference
Conference | IEEE International Conference on Data Mining 2016 |
---|---|
Abbreviated title | ICDM 2016 |
Country/Territory | Spain |
City | Barcelona, Catalonia |
Period | 12/12/16 → 15/12/16 |
Keywords
- Apache Spark
- Distributed system
- Label-drift
- Large-scale classification
- Logistic regression