One-pass logistic regression for label-drift and large-scale classification on distributed systems

Vu Nguyen, Tu Dinh Nguyen, Trung Le, Svetha Venkatesh, Dinh Phung

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review

5 Citations (Scopus)

Abstract

Logistic regression (LR) is the workhorse for classification in industry, but it requires a predefined set of classes. The model therefore fails when the class labels are not known in advance, a problem we term label-drift classification. The label-drift classification problem occurs naturally in many applications, especially in streaming settings where the incoming data may contain samples labeled with classes that have not been seen before. Additionally, in the wave of big data, traditional LR methods may fail because of their running-time cost. In this paper, we introduce a novel variant of LR, namely one-pass logistic regression (OLR), to offer a principled treatment of label-drift and large-scale classification. To handle large-scale classification for big data, we further extend OLR to a distributed setting for parallelization, termed sparkling OLR (Spark-OLR). We demonstrate the scalability of our proposed methods on large-scale datasets with more than one hundred million data points. The experimental results show that the predictive performance of our methods is comparable to or better than that of state-of-the-art baselines, whilst the execution time is an order of magnitude faster. In addition, OLR and Spark-OLR are invariant to data shuffling and have no hyperparameters to tune, which significantly benefits data practitioners and overcomes the curse of cross-validating on big data to select optimal hyperparameters.

Original language: English
Title of host publication: Proceedings - 16th IEEE International Conference on Data Mining, ICDM 2016
Editors: Francesco Bonchi, Josep Domingo-Ferrer, Ricardo Baeza-Yates, Zhi-Hua Zhou, Xindong Wu
Place of Publication: Los Alamitos CA USA
Publisher: IEEE, Institute of Electrical and Electronics Engineers
Pages: 1113-1118
Number of pages: 6
ISBN (Print): 9781509054725
Publication status: Published - 2016
Externally published: Yes
Event: IEEE International Conference on Data Mining 2016 - Barcelona, Catalonia, Spain
Duration: 12 Dec 2016 to 15 Dec 2016
Conference number: 16th
http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=7837023 (IEEE Conference Proceedings)

Conference

Conference: IEEE International Conference on Data Mining 2016
Abbreviated title: ICDM 2016
Country: Spain
City: Barcelona, Catalonia
Period: 12/12/16 to 15/12/16

Keywords

  • Apache Spark
  • Distributed system
  • Label-drift
  • Large-scale classification
  • Logistic regression
