Bibliographic analysis on research publications using authors, categorical labels and the citation network

Kar Wai Lim, Wray Buntine

    Research output: Contribution to journal > Article > Research > peer-review

    Abstract

    Bibliographic analysis considers the author’s research areas, the citation network and the paper content, among other things. In this paper, we combine these three in a topic model that produces a bibliographic model of authors, topics and documents, using a non-parametric extension of a combination of the Poisson mixed-topic link model and the author-topic model. This gives rise to the Citation Network Topic Model (CNTM). We propose a novel and efficient inference algorithm for the CNTM to explore subsets of research publications from CiteSeerX. The publication datasets are organised into three corpora, totalling about 168k publications with about 62k authors. The queried datasets are made available online. On three publicly available corpora, in addition to the queried datasets, our proposed model demonstrates improved performance in both model fitting and document clustering compared to several baselines. Moreover, our model allows extraction of additional useful knowledge from the corpora, such as the visualisation of the author-topics network. Additionally, we propose a simple method to incorporate supervision into topic modelling to achieve further improvement on the clustering task.
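
As a rough illustration of the link component named in the abstract: in a Poisson mixed-topic link model, the number of citations between two documents is typically treated as Poisson-distributed, with a rate that grows as their topic proportions overlap. The sketch below assumes the rate form sum_k eta[k] * theta_i[k] * theta_j[k]. The function names, the weights `eta` and the toy topic vectors are hypothetical and not taken from the paper, and the CNTM itself adds author and non-parametric components that are not shown here.

```python
import math


def link_rate(theta_i, theta_j, eta):
    """Poisson rate for a citation between two documents.

    Assumed form (not taken from the paper): sum over topics k of
    eta[k] * theta_i[k] * theta_j[k], where theta_* are topic proportions
    and eta[k] is a hypothetical per-topic link weight.
    """
    return sum(e * a * b for e, a, b in zip(eta, theta_i, theta_j))


def prob_at_least_one_link(rate):
    """P(X >= 1) for X ~ Poisson(rate)."""
    return 1.0 - math.exp(-rate)


# Toy topic proportions over two topics (made-up numbers for illustration).
theta_a = [0.8, 0.2]
theta_b = [0.7, 0.3]  # similar topic mix to document a
theta_c = [0.1, 0.9]  # mostly a different topic
eta = [2.0, 2.0]

rate_ab = link_rate(theta_a, theta_b, eta)
rate_ac = link_rate(theta_a, theta_c, eta)
p_ab = prob_at_least_one_link(rate_ab)
p_ac = prob_at_least_one_link(rate_ac)
```

Under this assumption, documents a and b, which share a dominant topic, get a higher expected citation count than a and c.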

    Original language: English
    Pages (from-to): 185-213
    Number of pages: 29
    Journal: Machine Learning
    Volume: 103
    Issue number: 2
    DOI: 10.1007/s10994-016-5554-z
    Publication status: Published - 1 May 2016

    Keywords

    • Bibliographic analysis
    • Topic model
    • Bayesian non-parametric
    • Author-citation network

    Cite this

    @article{eb0e8f22ff76461c97998516cbf8ca8f,
    title = "Bibliographic analysis on research publications using authors, categorical labels and the citation network",
    abstract = "Bibliographic analysis considers the author’s research areas, the citation network and the paper content among other things. In this paper, we combine these three in a topic model that produces a bibliographic model of authors, topics and documents, using a non-parametric extension of a combination of the Poisson mixed-topic link model and the author-topic model. This gives rise to the Citation Network Topic Model (CNTM). We propose a novel and efficient inference algorithm for the CNTM to explore subsets of research publications from CiteSeerX. The publication datasets are organised into three corpora, totalling to about 168k publications with about 62k authors. The queried datasets are made available online. In three publicly available corpora in addition to the queried datasets, our proposed model demonstrates an improved performance in both model fitting and document clustering, compared to several baselines. Moreover, our model allows extraction of additional useful knowledge from the corpora, such as the visualisation of the author-topics network. Additionally, we propose a simple method to incorporate supervision into topic modelling to achieve further improvement on the clustering task.",
    keywords = "Bibliographic analysis, Topic model, Bayesian non-parametric, Author-citation network",
    author = "Lim, {Kar Wai} and Buntine, Wray",
    year = "2016",
    month = "5",
    day = "1",
    doi = "10.1007/s10994-016-5554-z",
    language = "English",
    volume = "103",
    pages = "185--213",
    journal = "Machine Learning",
    issn = "0885-6125",
    publisher = "Springer",
    number = "2",
    }

    Bibliographic analysis on research publications using authors, categorical labels and the citation network. / Lim, Kar Wai; Buntine, Wray.

    In: Machine Learning, Vol. 103, No. 2, 01.05.2016, p. 185-213.

    TY - JOUR

    T1 - Bibliographic analysis on research publications using authors, categorical labels and the citation network

    AU - Lim, Kar Wai

    AU - Buntine, Wray

    PY - 2016/5/1

    Y1 - 2016/5/1

    AB - Bibliographic analysis considers the author’s research areas, the citation network and the paper content among other things. In this paper, we combine these three in a topic model that produces a bibliographic model of authors, topics and documents, using a non-parametric extension of a combination of the Poisson mixed-topic link model and the author-topic model. This gives rise to the Citation Network Topic Model (CNTM). We propose a novel and efficient inference algorithm for the CNTM to explore subsets of research publications from CiteSeerX. The publication datasets are organised into three corpora, totalling to about 168k publications with about 62k authors. The queried datasets are made available online. In three publicly available corpora in addition to the queried datasets, our proposed model demonstrates an improved performance in both model fitting and document clustering, compared to several baselines. Moreover, our model allows extraction of additional useful knowledge from the corpora, such as the visualisation of the author-topics network. Additionally, we propose a simple method to incorporate supervision into topic modelling to achieve further improvement on the clustering task.

    KW - Bibliographic analysis

    KW - Topic model

    KW - Bayesian non-parametric

    KW - Author-citation network

    UR - http://www.scopus.com/inward/record.url?scp=84960377007&partnerID=8YFLogxK

    U2 - 10.1007/s10994-016-5554-z

    DO - 10.1007/s10994-016-5554-z

    M3 - Article

    VL - 103

    SP - 185

    EP - 213

    JO - Machine Learning

    JF - Machine Learning

    SN - 0885-6125

    IS - 2

    ER -