Predicting good configurations for GitHub and Stack Overflow topic models

Christoph Treude, Markus Wagner

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review

33 Citations (Scopus)

Abstract

Software repositories contain large amounts of textual data, ranging from source code comments and issue descriptions to questions, answers, and comments on Stack Overflow. To make sense of this textual data, topic modelling is frequently used as a text-mining tool for the discovery of hidden semantic structures in text bodies. Latent Dirichlet allocation (LDA) is a commonly used topic model that aims to explain the structure of a corpus by grouping texts. LDA requires multiple parameters to work well, and there are only rough and sometimes conflicting guidelines available on how these parameters should be set. In this paper, we contribute (i) a broad study of parameters to arrive at good local optima for GitHub and Stack Overflow text corpora, (ii) an a-posteriori characterisation of text corpora related to eight programming languages, and (iii) an analysis of corpus feature importance via per-corpus LDA configuration. We find that (1) popular rules of thumb for topic modelling parameter configuration are not applicable to the corpora used in our experiments, (2) corpora sampled from GitHub and Stack Overflow have different characteristics and require different configurations to achieve good model fit, and (3) we can predict good configurations for unseen corpora reliably. These findings support researchers and practitioners in efficiently determining suitable configurations for topic modelling when analysing textual data contained in software repositories.
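As an illustration of what "configuration" means here, the following is a minimal sketch of LDA fitted by collapsed Gibbs sampling in plain Python. The abstract does not name the parameters or the LDA implementation used in the paper; the sketch assumes the three parameters that are usually tuned (the number of topics k, the document-topic prior alpha, and the topic-word prior beta). The function name fit_lda, the toy corpus, and the two example configurations are hypothetical and are not taken from the paper.

```python
import random


def fit_lda(docs, k, alpha, beta, iterations=50, seed=0):
    """Fit LDA by collapsed Gibbs sampling (illustrative sketch only).

    docs  : list of tokenised documents (lists of strings)
    k     : number of topics (assumed parameter)
    alpha : symmetric document-topic prior, must be > 0 (assumed parameter)
    beta  : symmetric topic-word prior, must be > 0 (assumed parameter)
    Returns (theta, phi, vocab): per-document topic distributions,
    per-topic word distributions, and the sorted vocabulary.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    v = len(vocab)

    doc_topic = [[0] * k for _ in docs]        # tokens in doc d assigned to topic t
    topic_word = [[0] * v for _ in range(k)]   # times word w is assigned to topic t
    topic_total = [0] * k                      # tokens assigned to topic t overall
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(k)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Unnormalised conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + v * beta)
                    for t in range(k)
                ]
                z = rng.choices(range(k), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1

    # Smoothed point estimates from the final counts.
    theta = [
        [(doc_topic[d][t] + alpha) / (len(doc) + k * alpha) for t in range(k)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[t][w] + beta) / (topic_total[t] + v * beta) for w in range(v)]
        for t in range(k)
    ]
    return theta, phi, vocab


# Hypothetical toy corpus: two version-control texts, two Python-error texts.
docs = [
    "git merge branch commit conflict".split(),
    "git commit push branch remote".split(),
    "python list index error exception".split(),
    "python exception traceback error list".split(),
]

# Two hypothetical configurations; different settings can yield different fits.
for k, alpha, beta in [(2, 0.1, 0.01), (3, 1.0, 0.1)]:
    theta, phi, vocab = fit_lda(docs, k, alpha, beta)
    print(k, alpha, beta, [[round(p, 2) for p in row] for row in theta])
```

Comparing the printed document-topic distributions across the two settings gives a small-scale impression of why the choice of configuration matters. A real study would score each configuration with a model-fit measure on held-out text, which this sketch omits.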

Original language: English
Title of host publication: Proceedings - 2019 IEEE/ACM 16th International Conference on Mining Software Repositories, MSR 2019
Editors: Bram Adams, Sonia Haiduc
Place of Publication: Piscataway NJ USA
Publisher: IEEE, Institute of Electrical and Electronics Engineers
Pages: 84-95
Number of pages: 12
ISBN (Electronic): 9781728134123
ISBN (Print): 9781728133706
Publication status: Published - 2019
Externally published: Yes
Event: IEEE International Working Conference on Mining Software Repositories 2019 - Montreal, Canada
Duration: 26 May 2019 to 27 May 2019
Conference number: 16th
https://conf.researchr.org/home/msr-2019
https://ieeexplore.ieee.org/xpl/conhome/8804710/proceeding (Proceedings)

Publication series

Name: IEEE International Working Conference on Mining Software Repositories
Publisher: IEEE, Institute of Electrical and Electronics Engineers
Volume: 2019-May
ISSN (Print): 2160-1852
ISSN (Electronic): 2160-1860

Conference

Conference: IEEE International Working Conference on Mining Software Repositories 2019
Abbreviated title: MSR 2019
Country/Territory: Canada
City: Montreal
Period: 26/05/19 to 27/05/19

Keywords

  • Algorithm portfolio
  • Corpus features
  • Topic modelling
