Abstract
Discovering associations among variables is an important data mining task. The associations can be considered as statistical dependencies among random variables, expressed as the structure of an underlying probabilistic graphical model. Current methods for graphical model structure discovery either do not scale well to datasets with large sample sizes, or suffer from high false discovery rates when the number of dimensions is much larger than the sample size. In this paper, we propose a scalable and statistically efficient approach to graphical model structure discovery for multivariate data involving continuous variables. Our approach uses a minimum message length (MML)-based objective, for which we design a greedy algorithm that incrementally adds to the graphical model the edges that most improve the MML-based score. We present extensive empirical results on synthetic data with varying sample sizes, numbers of variables, clique structures and inverse correlation coefficients, and show that our method outperforms strong baselines in both speed and the accuracy of the predicted associations among the random variables in the graphical model. We also report that our method performs well on AML and BRCA cancer data and on other real-life datasets.
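The greedy edge-addition idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's MML objective or algorithm: as labeled assumptions, it swaps in a BIC-style penalty of (1/2) log n per edge, and it restricts the graph to a forest so that the log-likelihood gain of an edge has the closed form -(n/2) log(1 - r²), where r is the sample correlation of the two variables. The function name `greedy_forest_edges` is hypothetical.

```python
import numpy as np


def greedy_forest_edges(X):
    """Greedily add edges to a Gaussian forest while a penalised score improves.

    Assumptions (not from the paper): a BIC-style penalty stands in for the
    MML-based score, and the graph is restricted to a forest (no cycles).
    """
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)   # sample correlation matrix
    penalty = 0.5 * np.log(n)          # BIC cost of one extra edge parameter

    # Score every candidate edge; keep only those with a positive net gain.
    candidates = []
    for i in range(p):
        for j in range(i + 1, p):
            r2 = min(R[i, j] ** 2, 1.0 - 1e-12)  # guard against log(0)
            gain = -0.5 * n * np.log1p(-r2) - penalty
            if gain > 0:
                candidates.append((gain, i, j))

    # In a forest the gain of an edge does not depend on the other edges,
    # so one descending sort is equivalent to picking the best edge each step.
    candidates.sort(reverse=True)

    # Union-find keeps track of connected components to reject cycles.
    parent = list(range(p))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    edges = []
    for gain, i, j in candidates:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            edges.append((i, j))
    return edges
```

Calling `greedy_forest_edges(X)` on an `(n, p)` array of continuous observations returns a list of index pairs `(i, j)` with `i < j`. The forest restriction is only what keeps this sketch short; the approach described in the abstract is not stated to be limited to forests.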
| Original language | English |
|---|---|
| Article number | 64 |
| Number of pages | 17 |
| Journal | SN Computer Science |
| Volume | 1 |
| Issue number | 2 |
| DOIs | |
| Publication status | Published - 7 Feb 2021 |
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs)
- SDG 3: Good Health and Well-being
Keywords
- Associations
- Minimum message length
- Gaussian graphical models