A left-to-right algorithm for likelihood estimation in gamma-poisson factor analysis

Joan Capdevila, Jesús Cerquides, Jordi Torres, François Petitjean, Wray Buntine

Research output: Chapter in Book/Report/Conference proceedingConference PaperResearchpeer-review

Abstract

Computing the probability of unseen documents is a natural evaluation task in topic modeling. Previous work has addressed this problem for the well-known Latent Dirichlet Allocation (LDA) model. However, the same problem for a more general class of topic models, referred here to as Gamma-Poisson Factor Analysis (GaP-FA), remains unexplored, which hampers a fair comparison between models. Recent findings on the exact marginal likelihood of GaP-FA enable the derivation of a closed-form expression. In this paper, we show that its exact computation grows exponentially with the number of topics and non-zero words in a document, thus being only solvable for relatively small models and short documents. Experimentation in various corpus also indicates that existing methods in the literature are unlikely to accurately estimate this probability. With that in mind, we propose L2R, a left-to-right sequential sampler that decomposes the document probability into a product of conditionals and estimates them separately. We then proceed by confirming that our estimator converges and is unbiased for both small and large collections. Code related to this paper is available at: https://github.com/jcapde/L2R, https://doi.org/10.7910/DVN/GDTAAC.

Original languageEnglish
Title of host publicationMachine Learning and Knowledge Discovery in Databases
Subtitle of host publicationEuropean Conference, ECML PKDD 2018 Dublin, Ireland, September 10–14, 2018 Proceedings, Part II
EditorsMichele Berlingerio, Francesco Bonchi, Thomas Gärtner, Neil Hurley, Georgiana Ifrim
Place of PublicationCham Switzerland
PublisherSpringer
Pages638-654
Number of pages17
ISBN (Electronic)9783030109288
ISBN (Print)9783030109271
DOIs
Publication statusPublished - 2019
EventEuropean Conference on Machine Learning European Conference on Principles and Practice of Knowledge Discovery in Databases 2018 - Dublin, Ireland
Duration: 10 Sep 201814 Sep 2018
http://www.ecmlpkdd2018.org/

Publication series

NameLecture Notes in Computer Science
PublisherSpringer
Volume11052
ISSN (Print)0302-9743
ISSN (Electronic)1611-3349

Conference

ConferenceEuropean Conference on Machine Learning European Conference on Principles and Practice of Knowledge Discovery in Databases 2018
Abbreviated titleECML-PKDD 2018
CountryIreland
CityDublin
Period10/09/1814/09/18
Internet address

Keywords

  • Estimation methods
  • Factor analysis
  • Gamma-poisson
  • Importance sampling
  • Left-to-right
  • Topic models

Cite this

Capdevila, J., Cerquides, J., Torres, J., Petitjean, F., & Buntine, W. (2019). A left-to-right algorithm for likelihood estimation in gamma-poisson factor analysis. In M. Berlingerio, F. Bonchi, T. Gärtner, N. Hurley, & G. Ifrim (Eds.), Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2018 Dublin, Ireland, September 10–14, 2018 Proceedings, Part II (pp. 638-654). (Lecture Notes in Computer Science ; Vol. 11052 ). Cham Switzerland: Springer. https://doi.org/10.1007/978-3-030-10928-8_38
Capdevila, Joan ; Cerquides, Jesús ; Torres, Jordi ; Petitjean, François ; Buntine, Wray. / A left-to-right algorithm for likelihood estimation in gamma-poisson factor analysis. Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2018 Dublin, Ireland, September 10–14, 2018 Proceedings, Part II. editor / Michele Berlingerio ; Francesco Bonchi ; Thomas Gärtner ; Neil Hurley ; Georgiana Ifrim. Cham Switzerland : Springer, 2019. pp. 638-654 (Lecture Notes in Computer Science ).
@inproceedings{0e09b9581a9a43ec9266c898c3ed3f0e,
title = "A left-to-right algorithm for likelihood estimation in gamma-poisson factor analysis",
abstract = "Computing the probability of unseen documents is a natural evaluation task in topic modeling. Previous work has addressed this problem for the well-known Latent Dirichlet Allocation (LDA) model. However, the same problem for a more general class of topic models, referred here to as Gamma-Poisson Factor Analysis (GaP-FA), remains unexplored, which hampers a fair comparison between models. Recent findings on the exact marginal likelihood of GaP-FA enable the derivation of a closed-form expression. In this paper, we show that its exact computation grows exponentially with the number of topics and non-zero words in a document, thus being only solvable for relatively small models and short documents. Experimentation in various corpus also indicates that existing methods in the literature are unlikely to accurately estimate this probability. With that in mind, we propose L2R, a left-to-right sequential sampler that decomposes the document probability into a product of conditionals and estimates them separately. We then proceed by confirming that our estimator converges and is unbiased for both small and large collections. Code related to this paper is available at: https://github.com/jcapde/L2R, https://doi.org/10.7910/DVN/GDTAAC.",
keywords = "Estimation methods, Factor analysis, Gamma-poisson, Importance sampling, Left-to-right, Topic models",
author = "Joan Capdevila and Jes{\'u}s Cerquides and Jordi Torres and Fran{\cc}ois Petitjean and Wray Buntine",
year = "2019",
doi = "10.1007/978-3-030-10928-8_38",
language = "English",
isbn = "9783030109271",
series = "Lecture Notes in Computer Science",
publisher = "Springer",
pages = "638--654",
editor = "Michele Berlingerio and Francesco Bonchi and Thomas G{\"a}rtner and Neil Hurley and Georgiana Ifrim",
booktitle = "Machine Learning and Knowledge Discovery in Databases",

}

Capdevila, J, Cerquides, J, Torres, J, Petitjean, F & Buntine, W 2019, A left-to-right algorithm for likelihood estimation in gamma-poisson factor analysis. in M Berlingerio, F Bonchi, T Gärtner, N Hurley & G Ifrim (eds), Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2018 Dublin, Ireland, September 10–14, 2018 Proceedings, Part II. Lecture Notes in Computer Science , vol. 11052 , Springer, Cham Switzerland, pp. 638-654, European Conference on Machine Learning European Conference on Principles and Practice of Knowledge Discovery in Databases 2018, Dublin, Ireland, 10/09/18. https://doi.org/10.1007/978-3-030-10928-8_38

A left-to-right algorithm for likelihood estimation in gamma-poisson factor analysis. / Capdevila, Joan; Cerquides, Jesús; Torres, Jordi; Petitjean, François; Buntine, Wray.

Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2018 Dublin, Ireland, September 10–14, 2018 Proceedings, Part II. ed. / Michele Berlingerio; Francesco Bonchi; Thomas Gärtner; Neil Hurley; Georgiana Ifrim. Cham Switzerland : Springer, 2019. p. 638-654 (Lecture Notes in Computer Science ; Vol. 11052 ).

Research output: Chapter in Book/Report/Conference proceedingConference PaperResearchpeer-review

TY - GEN

T1 - A left-to-right algorithm for likelihood estimation in gamma-poisson factor analysis

AU - Capdevila, Joan

AU - Cerquides, Jesús

AU - Torres, Jordi

AU - Petitjean, François

AU - Buntine, Wray

PY - 2019

Y1 - 2019

N2 - Computing the probability of unseen documents is a natural evaluation task in topic modeling. Previous work has addressed this problem for the well-known Latent Dirichlet Allocation (LDA) model. However, the same problem for a more general class of topic models, referred here to as Gamma-Poisson Factor Analysis (GaP-FA), remains unexplored, which hampers a fair comparison between models. Recent findings on the exact marginal likelihood of GaP-FA enable the derivation of a closed-form expression. In this paper, we show that its exact computation grows exponentially with the number of topics and non-zero words in a document, thus being only solvable for relatively small models and short documents. Experimentation in various corpus also indicates that existing methods in the literature are unlikely to accurately estimate this probability. With that in mind, we propose L2R, a left-to-right sequential sampler that decomposes the document probability into a product of conditionals and estimates them separately. We then proceed by confirming that our estimator converges and is unbiased for both small and large collections. Code related to this paper is available at: https://github.com/jcapde/L2R, https://doi.org/10.7910/DVN/GDTAAC.

AB - Computing the probability of unseen documents is a natural evaluation task in topic modeling. Previous work has addressed this problem for the well-known Latent Dirichlet Allocation (LDA) model. However, the same problem for a more general class of topic models, referred here to as Gamma-Poisson Factor Analysis (GaP-FA), remains unexplored, which hampers a fair comparison between models. Recent findings on the exact marginal likelihood of GaP-FA enable the derivation of a closed-form expression. In this paper, we show that its exact computation grows exponentially with the number of topics and non-zero words in a document, thus being only solvable for relatively small models and short documents. Experimentation in various corpus also indicates that existing methods in the literature are unlikely to accurately estimate this probability. With that in mind, we propose L2R, a left-to-right sequential sampler that decomposes the document probability into a product of conditionals and estimates them separately. We then proceed by confirming that our estimator converges and is unbiased for both small and large collections. Code related to this paper is available at: https://github.com/jcapde/L2R, https://doi.org/10.7910/DVN/GDTAAC.

KW - Estimation methods

KW - Factor analysis

KW - Gamma-poisson

KW - Importance sampling

KW - Left-to-right

KW - Topic models

UR - http://www.scopus.com/inward/record.url?scp=85061159201&partnerID=8YFLogxK

U2 - 10.1007/978-3-030-10928-8_38

DO - 10.1007/978-3-030-10928-8_38

M3 - Conference Paper

SN - 9783030109271

T3 - Lecture Notes in Computer Science

SP - 638

EP - 654

BT - Machine Learning and Knowledge Discovery in Databases

A2 - Berlingerio, Michele

A2 - Bonchi, Francesco

A2 - Gärtner, Thomas

A2 - Hurley, Neil

A2 - Ifrim, Georgiana

PB - Springer

CY - Cham Switzerland

ER -

Capdevila J, Cerquides J, Torres J, Petitjean F, Buntine W. A left-to-right algorithm for likelihood estimation in gamma-poisson factor analysis. In Berlingerio M, Bonchi F, Gärtner T, Hurley N, Ifrim G, editors, Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2018 Dublin, Ireland, September 10–14, 2018 Proceedings, Part II. Cham Switzerland: Springer. 2019. p. 638-654. (Lecture Notes in Computer Science ). https://doi.org/10.1007/978-3-030-10928-8_38