Semiparametric latent topic modeling on consumer-generated corpora

Dominic B. Dayta, Erniel B. Barrios

Research output: Contribution to journal, Article, Research, peer-review

Abstract

Common topic modeling methods generally suffer from overfitting, which diminishes predictive performance, and have difficulty reconstructing sparse topic structures in which only a few critical words aid interpretation. Considering the text typically contained in customer feedback, this paper proposes a semiparametric topic model that uses a two-step approach: (1) nonnegative matrix factorization recovers topic distributions from word co-occurrences, and (2) semiparametric regression identifies the factors driving the expression of particular topics in the documents, given auxiliary information such as location, time of writing, and other features of the author. This approach provides a generative model that can be used to predict topics in new documents from these auxiliary variables, and it is demonstrated to identify topics accurately even for documents limited in length or vocabulary size. In an application to real customer feedback, the topics provided by our model are shown to be as interpretable and as useful for downstream analysis tasks as those produced by current legacy methods.
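
The two-step idea described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy document-word matrix, the `location` and `time` covariates, the multiplicative-update NMF, and the truncated-linear spline standing in for the nonparametric component are all hypothetical choices made for the example.

```python
import numpy as np

def nmf(V, k, n_iter=500, seed=0, eps=1e-9):
    """Approximate V (docs x words) by W (docs x topics) @ H (topics x words)
    using multiplicative updates for the Frobenius loss."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], k)) + eps
    H = rng.random((k, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def spline_basis(t, knots):
    """Truncated-linear basis [t, (t - c1)+, (t - c2)+, ...] for the smooth part."""
    return np.column_stack([t] + [np.clip(t - c, 0.0, None) for c in knots])

# Toy corpus (assumed): 6 documents x 6 words, two groups of co-occurring words.
V = np.array([
    [3, 2, 1, 0, 0, 0],
    [2, 3, 1, 0, 0, 0],
    [1, 2, 3, 0, 0, 0],
    [0, 0, 0, 3, 1, 2],
    [0, 0, 0, 2, 3, 1],
    [0, 0, 0, 1, 2, 3],
], dtype=float)

# Hypothetical auxiliary variables: a binary location flag and a time in [0, 1].
location = np.array([0, 0, 0, 1, 1, 1], dtype=float)
time = np.array([0.1, 0.5, 0.9, 0.2, 0.6, 0.8])
knots = [0.5]

# Step 1: recover topics from word co-occurrences with NMF.
W, H = nmf(V, k=2)
theta = W / W.sum(axis=1, keepdims=True)  # per-document topic shares

# Step 2: partially linear regression of one topic share on the covariates:
# a linear term for location plus a spline in time (assumed stand-in estimator).
X = np.column_stack([np.ones(len(time)), location, spline_basis(time, knots)])
y = theta[:, 0]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta

# Predicted share of topic 0 for a new document, from auxiliary variables alone.
# Note: a plain least-squares fit is not constrained to lie in [0, 1].
x_new = np.concatenate([[1.0, 1.0], spline_basis(np.array([0.4]), knots)[0]])
pred = float(x_new @ beta)
print(theta.round(3))
print(pred)
```

In this sketch the rows of `H` play the role of topic-word weights and `theta` holds each document's topic shares; the regression in step 2 is what lets a topic share be predicted for a new document before its text is examined.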

Original language: English
Number of pages: 23
Journal: Annals of Data Science
Publication status: Accepted/In press - 2025

Keywords

  • Customer complaint
  • Latent Dirichlet allocation
  • Nonnegative matrix factorization
  • Semiparametric regression
  • Topic modelling
