Stratified sampling for extreme multi-label data

Maximillian Merrillees, Lan Du

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review

1 Citation (Scopus)


Extreme multi-label classification (XML) is becoming increasingly relevant in the era of big data. Yet there is no method for effectively generating stratified partitions of XML datasets. Instead, researchers typically rely on provided train-test splits that (1) are not always representative of the entire dataset and (2) are missing many of the labels. This can lead to poor generalization and unreliable performance estimates, as has been established in the binary and multi-class settings. This paper therefore presents a new and simple algorithm that can efficiently generate stratified partitions of XML datasets with millions of unique labels. We also examine the label distributions of prevailing benchmark splits and investigate the issues that arise from using unrepresentative subsets of data for model development. The results highlight the difficulty of stratifying XML data and demonstrate the importance of using stratified partitions for training and evaluation.
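
The concern about splits that omit labels can be checked directly. The following is a minimal Python sketch of such a diagnostic, not the algorithm presented in the paper: the function name `split_label_report` and the representation of each example as a set of labels are assumptions made for illustration. It reports which labels are absent from either side of a split and how far each label's test share deviates, on average, from the overall test fraction.

```python
from collections import Counter


def split_label_report(train_labels, test_labels):
    """Summarise how well a train/test split covers the label set.

    Hypothetical helper for illustration only. Each argument is a list
    of label sets, one set per example.
    """
    # Count how many examples carry each label on each side of the split.
    train_counts = Counter(l for labels in train_labels for l in labels)
    test_counts = Counter(l for labels in test_labels for l in labels)
    all_labels = set(train_counts) | set(test_counts)

    n_total = len(train_labels) + len(test_labels)
    # Fraction of examples assigned to the test side.
    target = len(test_labels) / n_total if n_total else 0.0

    # Labels that never appear on one side of the split.
    missing_from_test = sorted(l for l in all_labels if test_counts[l] == 0)
    missing_from_train = sorted(l for l in all_labels if train_counts[l] == 0)

    # Mean absolute gap between each label's test share and the target.
    if all_labels:
        mean_deviation = sum(
            abs(test_counts[l] / (train_counts[l] + test_counts[l]) - target)
            for l in all_labels
        ) / len(all_labels)
    else:
        mean_deviation = 0.0

    return {
        "target_test_fraction": target,
        "missing_from_test": missing_from_test,
        "missing_from_train": missing_from_train,
        "mean_deviation": mean_deviation,
    }


# Toy data: three training examples and one test example.
train = [{"a", "b"}, {"a"}, {"c"}]
test = [{"a"}]
report = split_label_report(train, test)
print(report["missing_from_test"])
```

In a well-stratified split both missing lists would be empty and the mean deviation would be close to zero. A label attached to only one example necessarily falls on a single side of the split, so such labels always appear in one of the missing lists.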

Original language: English
Title of host publication: 25th Pacific-Asia Conference, PAKDD 2021 Virtual Event, May 11–14, 2021 Proceedings, Part II
Editors: Kamal Karlapalem, Hong Cheng, Naren Ramakrishnan, R. K. Agrawal, P. Krishna Reddy, Jaideep Srivastava, Tanmoy Chakraborty
Place of Publication: Cham, Switzerland
Number of pages: 12
ISBN (Electronic): 9783030757656
ISBN (Print): 9783030757649
Publication status: Published - 2021
Event: Pacific-Asia Conference on Knowledge Discovery and Data Mining 2021 - Virtual, Delhi, India
Duration: 11 May 2021 to 14 May 2021
Conference number: 25th

Publication series

Name: Lecture Notes in Computer Science
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Conference: Pacific-Asia Conference on Knowledge Discovery and Data Mining 2021
Abbreviated title: PAKDD 2021

  • Extreme multi-label learning
  • Stratified sampling
  • XML
