Distributed data augmented support vector machine on Spark

Tu Dinh Nguyen, Vu Nguyen, Trung Le, Dinh Phung

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review

10 Citations (Scopus)


Support vector machines (SVMs) are widely used for classification in machine learning and data mining tasks. However, they have traditionally been applied to small- to medium-sized datasets. The recent need to scale with data size has drawn research attention to new methods and implementations that let SVMs operate at scale. Distributed SVMs are relatively new and have only recently been studied, and a distributed implementation of SVM with data augmentation has not yet been developed. This paper introduces a distributed data augmentation implementation for SVM on Apache Spark, a recent, advanced, and popular platform for distributed computing that is widely employed in both research and industry. We term our implementation the sparkling vector machine (SkVM); it supports both classification and regression tasks by scanning through the data exactly once. In addition, we develop a framework to handle data with new classes arriving in an online classification setting, where new data points can have labels that have not previously been seen, a problem we term label-drift classification. We demonstrate the scalability of our proposed method on large-scale datasets with more than one hundred million data points. The experimental results show that the predictive performance of our method is comparable to or better than that of the baselines, whilst its execution time is faster by an order of magnitude.

Original language: English
Title of host publication: 2016 23rd International Conference on Pattern Recognition (ICPR 2016)
Editors: Larry Davis, Alberto Del Bimbo, Brian C. Lovell
Place of Publication: Piscataway NJ USA
Publisher: IEEE, Institute of Electrical and Electronics Engineers
Number of pages: 6
ISBN (Electronic): 9781509048472
ISBN (Print): 9781509048489
Publication status: Published - 2016
Externally published: Yes
Event: International Conference on Pattern Recognition 2016 - Cancun, Mexico
Duration: 4 Dec 2016 to 8 Dec 2016
Conference number: 23rd
https://ieeexplore.ieee.org/xpl/conhome/7893644/proceeding (Proceedings)


Conference: International Conference on Pattern Recognition 2016
Abbreviated title: ICPR 2016


  • Apache Spark
  • Big data
  • Distributed computing
  • Large-scale classification
  • Support vector machine
