Leveraging external information in topic modelling

He Zhao, Lan Du, Wray Buntine, Gang Liu

    Research output: Contribution to journal › Article › Research › peer-review

    Abstract

    Besides the text content, documents usually come with rich sets of meta-information, such as categories of documents and semantic/syntactic features of words, like those encoded in word embeddings. Incorporating such meta-information directly into the generative process of topic models can improve modelling accuracy and topic quality, especially in the case where the word-occurrence information in the training data is insufficient. In this article, we present a topic model called MetaLDA, which is able to leverage either document or word meta-information, or both of them jointly, in the generative process. With two data augmentation techniques, we can derive an efficient Gibbs sampling algorithm, which benefits from the fully local conjugacy of the model. Moreover, the algorithm is favoured by the sparsity of the meta-information. Extensive experiments on several real-world datasets demonstrate that our model achieves superior performance in terms of both perplexity and topic quality, particularly in handling sparse texts. In addition, our model runs significantly faster than other models using meta-information.
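
    The core idea in the abstract, shaping each document's Dirichlet prior over topics with its meta-information, can be illustrated concretely. The sketch below is hypothetical Python (not the authors' released implementation): binary document labels F and positive label-topic weights lam combine multiplicatively into a document-specific prior alpha[d, k] = prod_l lam[l, k] ** F[d, l], which an otherwise standard collapsed Gibbs sampler for LDA then consumes. The full model also learns lam, via the paper's two data-augmentation techniques; this sketch fixes lam to random draws and omits the word-side meta-information.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy sizes: documents, vocabulary, topics, binary document labels.
    D, V, K, L = 6, 20, 3, 2
    F = rng.integers(0, 2, size=(D, L))                 # doc meta-information (binary)
    lam = rng.gamma(shape=2.0, scale=0.5, size=(L, K))  # label-topic weights (learned in the paper, fixed here)
    alpha = np.prod(lam[None, :, :] ** F[:, :, None], axis=1)  # (D, K) doc-specific Dirichlet priors
    beta = 0.01                                         # symmetric topic-word prior

    # Random toy corpus: each document is an array of word ids.
    docs = [rng.integers(0, V, size=rng.integers(10, 30)) for _ in range(D)]

    # Standard collapsed Gibbs sampling for LDA, using the meta-informed alpha.
    z = [rng.integers(0, K, size=len(doc)) for doc in docs]
    ndk = np.zeros((D, K)); nkv = np.zeros((K, V)); nk = np.zeros(K)
    for d, doc in enumerate(docs):
        for i, v in enumerate(doc):
            ndk[d, z[d][i]] += 1; nkv[z[d][i], v] += 1; nk[z[d][i]] += 1

    for it in range(200):
        for d, doc in enumerate(docs):
            for i, v in enumerate(doc):
                k = z[d][i]
                ndk[d, k] -= 1; nkv[k, v] -= 1; nk[k] -= 1
                # Conditional for this token's topic:
                # (n_dk + alpha_dk) * (n_kv + beta) / (n_k + V * beta)
                p = (ndk[d] + alpha[d]) * (nkv[:, v] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkv[k, v] += 1; nk[k] += 1

    # Documents sharing labels share prior mass over topics via alpha.
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    print(np.round(theta, 2))

    Because alpha is multiplicative in the label weights, a document with no active labels falls back to a flat prior of ones, and a sparse label matrix keeps the prior computation cheap, consistent with the abstract's remark that the algorithm benefits from sparse meta-information.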

    Language: English
    Pages: 1-33
    Number of pages: 33
    Journal: Knowledge and Information Systems
    DOI: 10.1007/s10115-018-1213-y
    Publication status: Accepted/In press - 12 May 2018

    Keywords

    • Data augmentation
    • Gibbs sampling
    • Latent Dirichlet allocation
    • Side information

    Cite this

    @article{01892296fead42779388dcfed3b78832,
      title     = "Leveraging external information in topic modelling",
      abstract  = "Besides the text content, documents usually come with rich sets of meta-information, such as categories of documents and semantic/syntactic features of words, like those encoded in word embeddings. Incorporating such meta-information directly into the generative process of topic models can improve modelling accuracy and topic quality, especially in the case where the word-occurrence information in the training data is insufficient. In this article, we present a topic model called MetaLDA, which is able to leverage either document or word meta-information, or both of them jointly, in the generative process. With two data augmentation techniques, we can derive an efficient Gibbs sampling algorithm, which benefits from the fully local conjugacy of the model. Moreover, the algorithm is favoured by the sparsity of the meta-information. Extensive experiments on several real-world datasets demonstrate that our model achieves superior performance in terms of both perplexity and topic quality, particularly in handling sparse texts. In addition, our model runs significantly faster than other models using meta-information.",
      keywords  = "Data augmentation, Gibbs sampling, Latent Dirichlet allocation, Side information",
      author    = "He Zhao and Lan Du and Wray Buntine and Gang Liu",
      year      = "2018",
      month     = "5",
      day       = "12",
      doi       = "10.1007/s10115-018-1213-y",
      language  = "English",
      pages     = "1--33",
      journal   = "Knowledge and Information Systems",
      issn      = "0219-1377",
      publisher = "Springer-Verlag London Ltd.",
    }

    Leveraging external information in topic modelling. / Zhao, He; Du, Lan; Buntine, Wray; Liu, Gang.

    In: Knowledge and Information Systems, 12.05.2018, p. 1-33.

    Research output: Contribution to journal › Article › Research › peer-review

    TY - JOUR
    T1 - Leveraging external information in topic modelling
    AU - Zhao, He
    AU - Du, Lan
    AU - Buntine, Wray
    AU - Liu, Gang
    PY - 2018/5/12
    Y1 - 2018/5/12
    N2 - Besides the text content, documents usually come with rich sets of meta-information, such as categories of documents and semantic/syntactic features of words, like those encoded in word embeddings. Incorporating such meta-information directly into the generative process of topic models can improve modelling accuracy and topic quality, especially in the case where the word-occurrence information in the training data is insufficient. In this article, we present a topic model called MetaLDA, which is able to leverage either document or word meta-information, or both of them jointly, in the generative process. With two data augmentation techniques, we can derive an efficient Gibbs sampling algorithm, which benefits from the fully local conjugacy of the model. Moreover, the algorithm is favoured by the sparsity of the meta-information. Extensive experiments on several real-world datasets demonstrate that our model achieves superior performance in terms of both perplexity and topic quality, particularly in handling sparse texts. In addition, our model runs significantly faster than other models using meta-information.
    AB - Besides the text content, documents usually come with rich sets of meta-information, such as categories of documents and semantic/syntactic features of words, like those encoded in word embeddings. Incorporating such meta-information directly into the generative process of topic models can improve modelling accuracy and topic quality, especially in the case where the word-occurrence information in the training data is insufficient. In this article, we present a topic model called MetaLDA, which is able to leverage either document or word meta-information, or both of them jointly, in the generative process. With two data augmentation techniques, we can derive an efficient Gibbs sampling algorithm, which benefits from the fully local conjugacy of the model. Moreover, the algorithm is favoured by the sparsity of the meta-information. Extensive experiments on several real-world datasets demonstrate that our model achieves superior performance in terms of both perplexity and topic quality, particularly in handling sparse texts. In addition, our model runs significantly faster than other models using meta-information.
    KW - Data augmentation
    KW - Gibbs sampling
    KW - Latent Dirichlet allocation
    KW - Side information
    UR - http://www.scopus.com/inward/record.url?scp=85046794939&partnerID=8YFLogxK
    U2 - 10.1007/s10115-018-1213-y
    DO - 10.1007/s10115-018-1213-y
    M3 - Article
    SP - 1
    EP - 33
    JO - Knowledge and Information Systems
    T2 - Knowledge and Information Systems
    JF - Knowledge and Information Systems
    SN - 0219-1377
    ER -