TY - JOUR
T1 - Semiparametric latent topic modeling on consumer-generated corpora
AU - Dayta, Dominic B.
AU - Barrios, Erniel B.
N1 - Publisher Copyright: © The Author(s) 2025.
PY - 2025
Y1 - 2025
AB - Common topic modeling methods tend to overfit, which diminishes predictive performance, and they struggle to reconstruct sparse topic structures in which only a few critical words aid interpretation. Considering the text typically contained in customer feedback, this paper proposes a semiparametric topic model that uses a two-step approach: (1) nonnegative matrix factorization recovers topic distributions from word co-occurrences; and (2) semiparametric regression identifies the factors driving the expression of particular topics in the documents, given auxiliary information such as location, time of writing, and other features of the author. This approach provides a generative model that can predict topics in new documents from these auxiliary variables, and it is demonstrated to identify topics accurately even for documents limited in length or vocabulary size. In an application to real customer feedback, the topics produced by our model are shown to be as interpretable and as useful for downstream analysis tasks as those produced by current legacy methods.
KW - Customer complaint
KW - Latent Dirichlet allocation
KW - Nonnegative matrix factorization
KW - Semiparametric regression
KW - Topic modelling
UR - http://www.scopus.com/inward/record.url?scp=85217249324&partnerID=8YFLogxK
U2 - 10.1007/s40745-025-00587-y
DO - 10.1007/s40745-025-00587-y
M3 - Article
AN - SCOPUS:85217249324
SN - 2198-5804
JO - Annals of Data Science
JF - Annals of Data Science
ER -