Abstract
In many scientific tasks we are interested in finding correlations in our data. This raises several questions: how to reliably and interpretably measure correlation within a multivariate set of attributes, how to do so without making assumptions about the data distribution or the type of correlation, and how to search efficiently for the most correlated attribute sets. We answer these questions for discovery tasks with categorical data. In particular, we propose a corrected-for-chance, consistent, and efficient estimator for normalized total correlation, in order to obtain a reliable, interpretable, and non-parametric measure of correlation over multivariate sets. For the discovery of the top-k correlated sets, we derive an effective algorithmic framework based on a tight bounding function. This framework offers exact, approximate, and heuristic search. Empirical evaluation shows that even for small sample sizes the estimator leads to low-regret optimization outcomes, while the algorithms are shown to be highly effective for both large and high-dimensional data. Through a case study we confirm that our discovery framework identifies interesting and meaningful correlations.
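For readers unfamiliar with the quantity named in the abstract, the following is a minimal sketch of the plain plug-in (empirical) total correlation for categorical columns, i.e. the sum of marginal entropies minus the joint entropy. It is not the corrected-for-chance estimator proposed in the paper, which this record does not spell out. The function names are hypothetical, and the normalization shown (dividing by the upper bound, the sum of marginal entropies minus the largest one) is an assumption chosen because it maps the score into [0, 1]; the paper's exact definition may differ.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Plug-in Shannon entropy (in bits) of a sequence of hashable values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def total_correlation(columns):
    """Plug-in total correlation: sum of marginal entropies minus joint entropy."""
    joint = list(zip(*columns))  # one tuple per data row
    return sum(entropy(col) for col in columns) - entropy(joint)


def normalized_total_correlation(columns):
    """Total correlation divided by its upper bound (assumed normalization).

    The bound sum_i H(X_i) - max_i H(X_i) holds because the joint entropy
    is at least as large as every marginal entropy.
    """
    marginals = [entropy(col) for col in columns]
    upper = sum(marginals) - max(marginals)
    if upper == 0:  # at most one non-constant column: nothing to correlate
        return 0.0
    joint = list(zip(*columns))
    return (sum(marginals) - entropy(joint)) / upper


# Usage: two identical columns are maximally correlated,
# two columns that vary independently are not correlated at all.
x = ["a", "b", "a", "b"]
print(normalized_total_correlation([x, x]))
print(normalized_total_correlation([["a", "a", "b", "b"], ["c", "d", "c", "d"]]))
```

On small samples the plug-in estimate tends to overstate correlation, because the empirical joint entropy is underestimated more strongly than the marginals. This is the kind of chance effect that a corrected-for-chance estimator is meant to remove.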
Original language | English |
---|---|
Title of host publication | Proceedings - 19th IEEE International Conference on Data Mining, ICDM 2019 |
Editors | Jianyong Wang, Kyuseok Shim, Xindong Wu |
Place of Publication | Piscataway NJ USA |
Publisher | IEEE, Institute of Electrical and Electronics Engineers |
Pages | 1252-1257 |
Number of pages | 6 |
ISBN (Electronic) | 9781728146034, 9781728146041 |
ISBN (Print) | 9781728146058 |
Publication status | Published - 2019 |
Event | IEEE International Conference on Data Mining 2019 - Beijing, China; Duration: 8 Nov 2019 → 11 Nov 2019; Conference number: 19th; http://icdm2019.bigke.org/ |
Publication series
Name | Proceedings - IEEE International Conference on Data Mining, ICDM |
---|---|
Publisher | IEEE, Institute of Electrical and Electronics Engineers |
Volume | 2019-November |
ISSN (Print) | 1550-4786 |
ISSN (Electronic) | 2374-8486 |
Conference
Conference | IEEE International Conference on Data Mining 2019 |
---|---|
Abbreviated title | ICDM 2019 |
Country | China |
City | Beijing |
Period | 8 Nov 2019 → 11 Nov 2019 |
Internet address | http://icdm2019.bigke.org/ |
Keywords
- Branch-and-bound
- Information theory
- Knowledge discovery
- Optimization
- Total correlation