Discovering reliable correlations in categorical data

Panagiotis Mandros, Mario Boley, Jilles Vreeken

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review

1 Citation (Scopus)

Abstract

In many scientific tasks we are interested in finding correlations in our data. This raises many questions, such as how to reliably and interpretably measure correlation between a multivariate set of attributes, how to do so without having to make assumptions about the data distribution or the type of correlation, and how to search efficiently for the most correlated attribute sets. We answer these questions for discovery tasks with categorical data. In particular, we propose a corrected-for-chance, consistent, and efficient estimator for normalized total correlation, in order to obtain a reliable, interpretable, and non-parametric measure of correlation over multivariate sets. For the discovery of the top-k correlated sets, we derive an effective algorithmic framework based on a tight bounding function. This framework offers exact, approximate, and heuristic search. Empirical evaluation shows that, even for small sample sizes, the estimator leads to low-regret optimization outcomes, while the algorithms are shown to be highly effective for both large and high-dimensional data. Through a case study we confirm that our discovery framework identifies interesting and meaningful correlations.
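
As a rough illustration of the quantity named in the abstract, the sketch below computes a plain plug-in (empirical) estimate of total correlation for categorical columns, T(X) = sum_i H(X_i) - H(X), and divides it by the standard upper bound sum_i H(X_i) - max_i H(X_i). The normalization shown and the function names are assumptions made for illustration; the abstract does not state the paper's exact definition, and the chance correction it proposes is not implemented here.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Plug-in Shannon entropy (in bits) of a sequence of hashable values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def total_correlation(columns):
    """Plug-in total correlation: sum of marginal entropies minus joint entropy."""
    joint = list(zip(*columns))  # one tuple per data row
    return sum(entropy(col) for col in columns) - entropy(joint)


def normalized_total_correlation(columns):
    """Total correlation divided by the bound sum_i H(X_i) - max_i H(X_i).

    This normalization is an assumption for illustration only. No correction
    for chance is applied, so small samples will tend to yield inflated scores.
    """
    marginals = [entropy(col) for col in columns]
    bound = sum(marginals) - max(marginals)
    if bound == 0:
        return 0.0  # every column but at most one is constant
    return total_correlation(columns) / bound


if __name__ == "__main__":
    a = ["x", "x", "y", "y"]
    b = ["u", "u", "v", "v"]  # a relabelled copy of a
    c = ["p", "q", "p", "q"]  # empirically independent of a
    print(normalized_total_correlation([a, b]))
    print(normalized_total_correlation([a, c]))
```

On this toy data the score is at its maximum for the relabelled copy and zero for the independent pair. The plug-in estimate is biased upward on finite samples, which is the kind of problem a corrected-for-chance estimator is meant to address.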

Original language: English
Title of host publication: Proceedings - 19th IEEE International Conference on Data Mining, ICDM 2019
Editors: Jianyong Wang, Kyuseok Shim, Xindong Wu
Place of Publication: Piscataway NJ USA
Publisher: IEEE, Institute of Electrical and Electronics Engineers
Pages: 1252-1257
Number of pages: 6
ISBN (Electronic): 9781728146034, 9781728146041
ISBN (Print): 9781728146058
Publication status: Published - 2019
Event: IEEE International Conference on Data Mining 2019 - Beijing, China
Duration: 8 Nov 2019 to 11 Nov 2019
Conference number: 19th
http://icdm2019.bigke.org/
https://ieeexplore.ieee.org/xpl/conhome/8961330/proceeding (Proceedings)

Publication series

Name: Proceedings - IEEE International Conference on Data Mining, ICDM
Publisher: IEEE, Institute of Electrical and Electronics Engineers
Volume: 2019-November
ISSN (Print): 1550-4786
ISSN (Electronic): 2374-8486

Conference

Conference: IEEE International Conference on Data Mining 2019
Abbreviated title: ICDM 2019
Country/Territory: China
City: Beijing
Period: 8/11/19 to 11/11/19

Keywords

  • Branch-and-bound
  • Information theory
  • Knowledge discovery
  • Optimization
  • Total correlation
