### Abstract

Computing the probability of unseen documents is a natural evaluation task in topic modeling. Previous work has addressed this problem for the well-known Latent Dirichlet Allocation (LDA) model. However, the same problem for a more general class of topic models, referred to here as Gamma-Poisson Factor Analysis (GaP-FA), remains unexplored, which hampers a fair comparison between models. Recent findings on the exact marginal likelihood of GaP-FA enable the derivation of a closed-form expression. In this paper, we show that the cost of computing it exactly grows exponentially with the number of topics and non-zero words in a document, so it is tractable only for relatively small models and short documents. Experiments on various corpora also indicate that existing methods in the literature are unlikely to estimate this probability accurately. With that in mind, we propose L2R, a left-to-right sequential sampler that decomposes the document probability into a product of conditionals and estimates them separately. We then confirm that our estimator converges and is unbiased for both small and large collections. Code related to this paper is available at: https://github.com/jcapde/L2R, https://doi.org/10.7910/DVN/GDTAAC.
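The left-to-right idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (that is in the linked repository); it assumes a deliberately simplified model with a single latent factor, `z ~ Gamma(a, rate=b)` and counts `y_p ~ Poisson(phi_p * z)`, chosen because the posterior of `z` given the entries seen so far is then a Gamma distribution and the exact marginal is available for comparison. The function names, `phi`, `a`, and `b` are hypothetical and introduced only for illustration.

```python
import math
import random


def log_poisson(y, rate):
    """Log of the Poisson pmf at count y with mean `rate` (rate > 0)."""
    return y * math.log(rate) - rate - math.lgamma(y + 1)


def exact_log_marginal(y, phi, a, b):
    """Closed-form log p(y) for the toy one-factor model (assumption, see lead-in)."""
    total_y, total_phi = sum(y), sum(phi)
    log_p = sum(y_p * math.log(phi_p) - math.lgamma(y_p + 1)
                for y_p, phi_p in zip(y, phi))
    log_p += a * math.log(b) - math.lgamma(a)
    log_p += math.lgamma(a + total_y) - (a + total_y) * math.log(b + total_phi)
    return log_p


def l2r_log_marginal(y, phi, a, b, num_samples=20000, seed=0):
    """Left-to-right estimate: log p(y) = sum_p log p(y_p | y_<p),
    with each conditional estimated separately by Monte Carlo."""
    rng = random.Random(seed)
    shape, rate = a, b  # parameters of p(z | entries seen so far)
    log_p = 0.0
    for y_p, phi_p in zip(y, phi):
        # p(y_p | y_<p) = E[Poisson(y_p; phi_p * z)] under z ~ p(z | y_<p)
        acc = 0.0
        for _ in range(num_samples):
            z = rng.gammavariate(shape, 1.0 / rate)  # second argument is a scale
            acc += math.exp(log_poisson(y_p, phi_p * z))
        log_p += math.log(acc / num_samples)
        # Conjugate update, valid only for this one-factor toy model
        shape += y_p
        rate += phi_p
    return log_p


if __name__ == "__main__":
    counts = [2, 0, 1, 3]            # toy document as word counts
    weights = [0.5, 0.2, 0.1, 0.2]   # hypothetical per-word factor loadings
    print("L2R estimate:", l2r_log_marginal(counts, weights, a=2.0, b=1.0))
    print("exact value: ", exact_log_marginal(counts, weights, a=2.0, b=1.0))
```

With several topics the posterior over the latent factors is no longer a simple Gamma distribution, so the exact sampling step used here would have to be replaced by an approximate sampler; the sketch only shows the decomposition into conditionals.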

Original language | English |
---|---|
Title of host publication | Machine Learning and Knowledge Discovery in Databases |
Subtitle of host publication | European Conference, ECML PKDD 2018 Dublin, Ireland, September 10–14, 2018 Proceedings, Part II |
Editors | Michele Berlingerio, Francesco Bonchi, Thomas Gärtner, Neil Hurley, Georgiana Ifrim |
Place of Publication | Cham, Switzerland |
Publisher | Springer |
Pages | 638-654 |
Number of pages | 17 |
ISBN (Electronic) | 9783030109288 |
ISBN (Print) | 9783030109271 |
DOIs | https://doi.org/10.1007/978-3-030-10928-8_38 |
Publication status | Published - 2019 |
Event | European Conference on Machine Learning European Conference on Principles and Practice of Knowledge Discovery in Databases: ECML-PKDD 2018, Dublin, Ireland. Duration: 10 Sep 2018 → 14 Sep 2018. http://www.ecmlpkdd2018.org/ |

### Publication series

Name | Lecture Notes in Computer Science |
---|---|
Publisher | Springer |
Volume | 11052 |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |

### Conference

Conference | European Conference on Machine Learning European Conference on Principles and Practice of Knowledge Discovery in Databases: ECML-PKDD 2018 |
---|---|
Abbreviated title | ECML-PKDD 2018 |
Country | Ireland |
City | Dublin |
Period | 10/09/18 → 14/09/18 |
Internet address | http://www.ecmlpkdd2018.org/ |

### Keywords

- Estimation methods
- Factor analysis
- Gamma-Poisson
- Importance sampling
- Left-to-right
- Topic models

### Cite this

*Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2018 Dublin, Ireland, September 10–14, 2018 Proceedings, Part II* (pp. 638-654). (Lecture Notes in Computer Science; Vol. 11052). Springer. https://doi.org/10.1007/978-3-030-10928-8_38