### Abstract

Computing the probability of unseen documents is a natural evaluation task in topic modeling. Previous work has addressed this problem for the well-known Latent Dirichlet Allocation (LDA) model. However, the same problem for a more general class of topic models, referred here to as Gamma-Poisson Factor Analysis (GaP-FA), remains unexplored, which hampers a fair comparison between models. Recent findings on the exact marginal likelihood of GaP-FA enable the derivation of a closed-form expression. In this paper, we show that its exact computation grows exponentially with the number of topics and non-zero words in a document, thus being only solvable for relatively small models and short documents. Experimentation in various corpus also indicates that existing methods in the literature are unlikely to accurately estimate this probability. With that in mind, we propose L2R, a left-to-right sequential sampler that decomposes the document probability into a product of conditionals and estimates them separately. We then proceed by confirming that our estimator converges and is unbiased for both small and large collections. Code related to this paper is available at: https://github.com/jcapde/L2R, https://doi.org/10.7910/DVN/GDTAAC.

Original language | English |
---|---|

Title of host publication | Machine Learning and Knowledge Discovery in Databases |

Subtitle of host publication | European Conference, ECML PKDD 2018 Dublin, Ireland, September 10–14, 2018 Proceedings, Part II |

Editors | Michele Berlingerio, Francesco Bonchi, Thomas Gärtner, Neil Hurley, Georgiana Ifrim |

Place of Publication | Cham Switzerland |

Publisher | Springer |

Pages | 638-654 |

Number of pages | 17 |

ISBN (Electronic) | 9783030109288 |

ISBN (Print) | 9783030109271 |

DOIs | |

Publication status | Published - 2019 |

Event | European Conference on Machine Learning European Conference on Principles and Practice of Knowledge Discovery in Databases: ECML-PKDD 2018 - Dublin, Ireland Duration: 10 Sep 2018 → 14 Sep 2018 http://www.ecmlpkdd2018.org/ |

### Publication series

Name | Lecture Notes in Computer Science |
---|---|

Publisher | Springer |

Volume | 11052 |

ISSN (Print) | 0302-9743 |

ISSN (Electronic) | 1611-3349 |

### Conference

Conference | European Conference on Machine Learning European Conference on Principles and Practice of Knowledge Discovery in Databases: ECML-PKDD 2018 |
---|---|

Abbreviated title | ECML-PKDD 2018 |

Country | Ireland |

City | Dublin |

Period | 10/09/18 → 14/09/18 |

Internet address |

### Keywords

- Estimation methods
- Factor analysis
- Gamma-poisson
- Importance sampling
- Left-to-right
- Topic models

### Cite this

*Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2018 Dublin, Ireland, September 10–14, 2018 Proceedings, Part II*(pp. 638-654). (Lecture Notes in Computer Science ; Vol. 11052 ). Cham Switzerland: Springer. https://doi.org/10.1007/978-3-030-10928-8_38

}

*Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2018 Dublin, Ireland, September 10–14, 2018 Proceedings, Part II.*Lecture Notes in Computer Science , vol. 11052 , Springer, Cham Switzerland, pp. 638-654, European Conference on Machine Learning European Conference on Principles and Practice of Knowledge Discovery in Databases: ECML-PKDD 2018, Dublin, Ireland, 10/09/18. https://doi.org/10.1007/978-3-030-10928-8_38

**A left-to-right algorithm for likelihood estimation in gamma-poisson factor analysis.** / Capdevila, Joan; Cerquides, Jesús; Torres, Jordi; Petitjean, François; Buntine, Wray.

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review

TY - GEN

T1 - A left-to-right algorithm for likelihood estimation in gamma-poisson factor analysis

AU - Capdevila, Joan

AU - Cerquides, Jesús

AU - Torres, Jordi

AU - Petitjean, François

AU - Buntine, Wray

PY - 2019

Y1 - 2019

N2 - Computing the probability of unseen documents is a natural evaluation task in topic modeling. Previous work has addressed this problem for the well-known Latent Dirichlet Allocation (LDA) model. However, the same problem for a more general class of topic models, referred here to as Gamma-Poisson Factor Analysis (GaP-FA), remains unexplored, which hampers a fair comparison between models. Recent findings on the exact marginal likelihood of GaP-FA enable the derivation of a closed-form expression. In this paper, we show that its exact computation grows exponentially with the number of topics and non-zero words in a document, thus being only solvable for relatively small models and short documents. Experimentation in various corpus also indicates that existing methods in the literature are unlikely to accurately estimate this probability. With that in mind, we propose L2R, a left-to-right sequential sampler that decomposes the document probability into a product of conditionals and estimates them separately. We then proceed by confirming that our estimator converges and is unbiased for both small and large collections. Code related to this paper is available at: https://github.com/jcapde/L2R, https://doi.org/10.7910/DVN/GDTAAC.

AB - Computing the probability of unseen documents is a natural evaluation task in topic modeling. Previous work has addressed this problem for the well-known Latent Dirichlet Allocation (LDA) model. However, the same problem for a more general class of topic models, referred here to as Gamma-Poisson Factor Analysis (GaP-FA), remains unexplored, which hampers a fair comparison between models. Recent findings on the exact marginal likelihood of GaP-FA enable the derivation of a closed-form expression. In this paper, we show that its exact computation grows exponentially with the number of topics and non-zero words in a document, thus being only solvable for relatively small models and short documents. Experimentation in various corpus also indicates that existing methods in the literature are unlikely to accurately estimate this probability. With that in mind, we propose L2R, a left-to-right sequential sampler that decomposes the document probability into a product of conditionals and estimates them separately. We then proceed by confirming that our estimator converges and is unbiased for both small and large collections. Code related to this paper is available at: https://github.com/jcapde/L2R, https://doi.org/10.7910/DVN/GDTAAC.

KW - Estimation methods

KW - Factor analysis

KW - Gamma-poisson

KW - Importance sampling

KW - Left-to-right

KW - Topic models

UR - http://www.scopus.com/inward/record.url?scp=85061159201&partnerID=8YFLogxK

U2 - 10.1007/978-3-030-10928-8_38

DO - 10.1007/978-3-030-10928-8_38

M3 - Conference Paper

SN - 9783030109271

T3 - Lecture Notes in Computer Science

SP - 638

EP - 654

BT - Machine Learning and Knowledge Discovery in Databases

A2 - Berlingerio, Michele

A2 - Bonchi, Francesco

A2 - Gärtner, Thomas

A2 - Hurley, Neil

A2 - Ifrim, Georgiana

PB - Springer

CY - Cham Switzerland

ER -