Nonparametric Bayesian topic modelling with the hierarchical Pitman–Yor processes

Kar Wai Lim, Wray Buntine, Changyou Chen, Lan Du

    Research output: Contribution to journal, Article, Research, peer-review

    Abstract

    The Dirichlet process and its extension, the Pitman–Yor process, are stochastic processes that take probability distributions as a parameter. These processes can be stacked up to form a hierarchical nonparametric Bayesian model. In this article, we present efficient methods for the use of these processes in this hierarchical context, and apply them to latent variable models for text analytics. In particular, we propose a general framework for designing these Bayesian models, which are called topic models in the computer science community. We then propose a specific nonparametric Bayesian topic model for modelling text from social media. We focus on tweets (posts on Twitter) in this article due to their ease of access. We find that our nonparametric model performs better than existing parametric models in both goodness of fit and real world applications.
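The Pitman–Yor process named in the abstract is commonly illustrated through its two-parameter Chinese restaurant process representation. The following is a minimal sketch of that textbook construction, not the inference method developed in the article. The function name `pitman_yor_crp` and the parameter names `discount` and `concentration` are illustrative assumptions. Setting `discount` to 0 recovers the seating scheme of the Dirichlet process.

```python
import random


def pitman_yor_crp(n_customers, discount, concentration, seed=0):
    """Seat customers with the two-parameter Chinese restaurant process.

    Returns a list of table sizes. Names and defaults are illustrative
    assumptions, not taken from the article.
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must lie in [0, 1)")
    if concentration <= -discount:
        raise ValueError("concentration must exceed -discount")

    rng = random.Random(seed)
    table_sizes = []
    for n_seated in range(n_customers):
        # Existing table k has weight (size_k - discount); a new table has
        # weight (concentration + discount * number_of_tables). These
        # weights sum to n_seated + concentration.
        u = rng.random() * (n_seated + concentration)
        chosen = len(table_sizes)  # fall through to a new table
        cumulative = 0.0
        for k, size in enumerate(table_sizes):
            cumulative += size - discount
            if u < cumulative:
                chosen = k
                break
        if chosen == len(table_sizes):
            table_sizes.append(1)
        else:
            table_sizes[chosen] += 1
    return table_sizes


if __name__ == "__main__":
    sizes = pitman_yor_crp(1000, discount=0.5, concentration=1.0, seed=1)
    print(len(sizes), "tables; largest has", max(sizes), "customers")
```

A larger `discount` tends to produce more small tables, which gives the power-law behaviour that makes the process a natural fit for word frequencies in text.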

    Original language: English
    Pages (from-to): 172-191
    Number of pages: 20
    Journal: International Journal of Approximate Reasoning
    Volume: 78
    DOI: 10.1016/j.ijar.2016.07.007
    Publication status: Published - 1 Nov 2016

    Keywords

    • Bayesian nonparametric methods
    • Markov chain Monte Carlo
    • Topic models
    • Hierarchical Pitman-Yor processes

    Cite this

    @article{639c8f9ae7de4448b8f33b132476bc0b,
    title = "Nonparametric Bayesian topic modelling with the hierarchical Pitman–Yor processes",
    keywords = "Bayesian nonparametric methods, Markov chain Monte Carlo, Topic models, Hierarchical Pitman-Yor processes",
    author = "Lim, {Kar Wai} and Wray Buntine and Changyou Chen and Lan Du",
    year = "2016",
    month = "11",
    day = "1",
    doi = "10.1016/j.ijar.2016.07.007",
    language = "English",
    volume = "78",
    pages = "172--191",
    journal = "International Journal of Approximate Reasoning",
    issn = "0888-613X",
    publisher = "Elsevier",

    }

    Nonparametric Bayesian topic modelling with the hierarchical Pitman–Yor processes. / Lim, Kar Wai; Buntine, Wray; Chen, Changyou; Du, Lan.

    In: International Journal of Approximate Reasoning, Vol. 78, 01.11.2016, p. 172-191.

    TY - JOUR

    T1 - Nonparametric Bayesian topic modelling with the hierarchical Pitman–Yor processes

    AU - Lim, Kar Wai

    AU - Buntine, Wray

    AU - Chen, Changyou

    AU - Du, Lan

    PY - 2016/11/1

    Y1 - 2016/11/1

    KW - Bayesian nonparametric methods

    KW - Markov chain Monte Carlo

    KW - Topic models

    KW - Hierarchical Pitman-Yor processes

    UR - http://www.scopus.com/inward/record.url?scp=84979608871&partnerID=8YFLogxK

    U2 - 10.1016/j.ijar.2016.07.007

    DO - 10.1016/j.ijar.2016.07.007

    M3 - Article

    VL - 78

    SP - 172

    EP - 191

    JO - International Journal of Approximate Reasoning

    JF - International Journal of Approximate Reasoning

    SN - 0888-613X

    ER -