Beyond clustering: Sub-DAG discovery for categorising documents

Ramakrishna B. Bairi, Mark Carman, Ganesh Ramakrishnan

Research output: Chapter in Book/Report/Conference proceeding, Conference Paper, Research, peer-review

Abstract

We study the problem of generating DAG-structured category hierarchies over a given set of documents associated with "importance" scores. An example application is automatically generating Wikipedia disambiguation pages for a set of articles that have click counts associated with them. Unlike previous works, which focus on clustering the set of documents using the category hierarchy as features, we directly pose the problem as that of finding a DAG-structured generative model that has maximum likelihood of generating the observed "importance" scores for each document, where documents are modeled as the leaf nodes in the DAG structure. Desirable properties of the categories in the inferred DAG-structured hierarchy include document coverage and category relevance, each of which, we show, is naturally modeled by our generative model. We propose two different algorithms for estimating the model parameters: one models the DAG as a Bayesian network and estimates its parameters via Gibbs sampling; the other estimates the path probabilities using the Expectation Maximization algorithm. We empirically evaluate our method on the problem of automatically generating Wikipedia disambiguation pages, using human-generated clusterings as the ground truth. We find that our framework improves upon the baselines according to the F1 score and entropy, which are standard metrics for evaluating hierarchical clustering.
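The second estimation strategy named in the abstract (path probabilities via Expectation Maximization) can be illustrated with a small sketch. This is not the authors' implementation: it assumes that a path's probability factorises into per-edge probabilities, that importance scores behave like counts of draws that ended at each document, and that the DAG is small enough to enumerate all root-to-leaf paths. The function names (`enumerate_paths`, `em_edge_probabilities`, `leaf_marginals`) and the toy DAG are hypothetical.

```python
from collections import defaultdict


def enumerate_paths(children, root, leaves):
    """Return every root-to-leaf path as a tuple of node names."""
    paths, stack = [], [(root,)]
    while stack:
        path = stack.pop()
        node = path[-1]
        if node in leaves:
            paths.append(path)
            continue
        for child in children.get(node, []):
            stack.append(path + (child,))
    return paths


def path_probability(path, theta):
    """Probability of a path as the product of its edge probabilities."""
    prob = 1.0
    for edge in zip(path, path[1:]):
        prob *= theta[edge]
    return prob


def em_edge_probabilities(children, root, scores, iterations=50):
    """Fit edge probabilities so that leaf marginals explain the scores.

    Assumption (not from the paper): each unit of importance is one draw
    of a root-to-leaf path, and only the leaf it ends at is observed.
    """
    paths = enumerate_paths(children, root, set(scores))
    # Start from a uniform choice among each node's children.
    theta = {(u, v): 1.0 / len(vs) for u, vs in children.items() for v in vs}
    for _ in range(iterations):
        # E-step: split each document's score over the paths that reach it,
        # in proportion to the current path probabilities.
        probs = {p: path_probability(p, theta) for p in paths}
        leaf_mass = defaultdict(float)
        for p, prob in probs.items():
            leaf_mass[p[-1]] += prob
        edge_count = defaultdict(float)
        for p, prob in probs.items():
            if leaf_mass[p[-1]] == 0.0:
                continue
            weight = scores[p[-1]] * prob / leaf_mass[p[-1]]
            for edge in zip(p, p[1:]):
                edge_count[edge] += weight
        # M-step: renormalise expected edge counts per parent node.
        for u, vs in children.items():
            norm = sum(edge_count[(u, v)] for v in vs)
            if norm > 0.0:
                for v in vs:
                    theta[(u, v)] = edge_count[(u, v)] / norm
    return theta


def leaf_marginals(children, root, scores, theta):
    """Probability of ending at each document under the fitted model."""
    marginals = defaultdict(float)
    for p in enumerate_paths(children, root, set(scores)):
        marginals[p[-1]] += path_probability(p, theta)
    return dict(marginals)


# Toy DAG: document d2 belongs to both categories A and B.
toy_children = {"root": ["A", "B"], "A": ["d1", "d2"], "B": ["d2", "d3"]}
toy_scores = {"d1": 30, "d2": 50, "d3": 20}
toy_theta = em_edge_probabilities(toy_children, "root", toy_scores)
toy_marginals = leaf_marginals(toy_children, "root", toy_scores, toy_theta)
```

On this toy DAG the fitted leaf marginals should match the normalised scores (0.3, 0.5, 0.2), because the structure is flexible enough to reproduce them exactly. Path enumeration is exponential in the worst case, so a larger hierarchy would need a dynamic-programming pass over the DAG instead.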

Original language: English
Title of host publication: Proceedings of the 25th ACM International Conference on Information and Knowledge Management (CIKM 2016)
Subtitle of host publication: October 24-28, 2016, Indianapolis, IN, USA
Editors: Kavita Ganesan, Chase Geigle, Xia Ning
Place of Publication: New York, New York
Publisher: Association for Computing Machinery (ACM)
Pages: 801-810
Number of pages: 10
ISBN (Electronic): 9781450340731
Publication status: Published - 24 Oct 2016
Event: ACM International Conference on Information and Knowledge Management 2016 - Indianapolis, United States of America
Duration: 24 Oct 2016 to 28 Oct 2016
Conference number: 25th
https://dl.acm.org/doi/proceedings/10.1145/2983323

Conference

Conference: ACM International Conference on Information and Knowledge Management 2016
Abbreviated title: CIKM 2016
Country: United States of America
City: Indianapolis
Period: 24/10/16 to 28/10/16