Automated configuration bug report prediction using text mining

Xin Xia, David Lo, Weiwei Qiu, Xingen Wang, Bo Zhou

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review

31 Citations (Scopus)


Configuration bugs are one of the dominant causes of software failures. Previous studies show that a configuration bug could cause huge financial losses in a software system. The importance of configuration bugs has attracted various research studies, e.g., to detect, diagnose, and fix configuration bugs. Given a bug report, an approach that can identify whether the bug is a configuration bug could help developers reduce debugging effort. We refer to this problem as configuration bug report prediction. To address this problem, we develop a new automated framework that applies text mining technologies to the natural-language description of bug reports to train a statistical model on historical bug reports with known labels (i.e., configuration or non-configuration); the statistical model is then used to predict a label for a new bug report. Developers could apply our model to automatically predict labels of bug reports to improve their productivity. Our tool first applies feature selection techniques (e.g., information gain and chi-square) to pre-process the textual information in bug reports, and then applies various text mining techniques (e.g., naive Bayes, SVM, and naive Bayes multinomial) to build statistical models. We evaluate our solution on 5 bug report datasets: accumulo, activemq, camel, flume, and wicket. We show that naive Bayes multinomial with information gain achieves the best performance. On average across the 5 projects, its accuracy, configuration F-measure, and non-configuration F-measure are 0.811, 0.450, and 0.880, respectively. We also compare our solution with the method proposed by Arshad et al. The results show that our proposed approach, which uses naive Bayes multinomial with information gain, on average improves the accuracy, configuration F-measure, and non-configuration F-measure scores of Arshad et al.'s method by 8.34%, 103.7%, and 4.24%, respectively.
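The pipeline described in the abstract (information-gain feature selection followed by a naive Bayes multinomial classifier) can be illustrated with a minimal sketch. This is not the authors' tool: the function names, the whitespace tokenizer, the Laplace smoothing, the value of `k`, and the five toy bug reports are all assumptions made purely for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: simple lowercase whitespace tokenization, alphabetic tokens only.
    return [t for t in text.lower().split() if t.isalpha()]

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def information_gain(docs, labels, term):
    # Gain of the binary feature "term occurs in the report" w.r.t. the label.
    base = entropy(list(Counter(labels).values()))
    with_t = [y for d, y in zip(docs, labels) if term in d]
    without_t = [y for d, y in zip(docs, labels) if term not in d]
    cond = 0.0
    for part in (with_t, without_t):
        if part:
            cond += len(part) / len(labels) * entropy(list(Counter(part).values()))
    return base - cond

def select_terms(docs, labels, k):
    # Keep the k terms with the highest information gain.
    vocab = sorted({t for d in docs for t in d})
    ranked = sorted(vocab, key=lambda t: -information_gain(docs, labels, t))
    return set(ranked[:k])

def train_nbm(docs, labels, terms):
    # Multinomial naive Bayes with Laplace (add-one) smoothing.
    classes = sorted(set(labels))
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    counts = {c: Counter() for c in classes}
    for d, y in zip(docs, labels):
        counts[y].update(t for t in d if t in terms)
    loglik = {}
    for c in classes:
        total = sum(counts[c].values()) + len(terms)
        loglik[c] = {t: math.log((counts[c][t] + 1) / total) for t in terms}
    return prior, loglik

def predict(model, text):
    prior, loglik = model
    tokens = tokenize(text)
    scores = {
        c: prior[c] + sum(loglik[c][t] for t in tokens if t in loglik[c])
        for c in prior
    }
    return max(scores, key=scores.get)

# Toy, invented bug reports (not taken from the paper's datasets).
reports = [
    ("server fails to start when the port property is missing in config file", "configuration"),
    ("wrong default value for timeout setting in the config file", "configuration"),
    ("null pointer exception when parsing an empty message", "non-configuration"),
    ("memory leak in the connection pool after many requests", "non-configuration"),
    ("race condition causes duplicate message delivery", "non-configuration"),
]
docs = [tokenize(text) for text, _ in reports]
labels = [label for _, label in reports]

terms = select_terms(docs, labels, k=5)
model = train_nbm(docs, labels, terms)

print(predict(model, "broker ignores the timeout property in config file"))
print(predict(model, "duplicate message after reconnect"))
```

Selecting features before training keeps only the terms whose presence is most informative about the label, which is the role the abstract assigns to information gain and chi-square. A real evaluation would additionally need held-out data and per-class F-measures, as reported above.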

Original language: English
Title of host publication: Proceedings - IEEE 38th Annual International Computers, Software and Applications Conference, COMPSAC 2014
Subtitle of host publication: 27–29 July 2014 Västerås, Sweden
Editors: Carl Chang, Yan Gao, Ali Hurson, Mihhail Matskin, Bruce McMillin, Yasuo Okabe, Cristina Seceleanu, Kenichi Yoshida
Place of Publication: Piscataway NJ USA
Publisher: IEEE, Institute of Electrical and Electronics Engineers
Number of pages: 10
ISBN (Print): 9781479935741
Publication status: Published - 2014
Externally published: Yes
Event: International Computer Software and Applications Conference 2014 - Västerås, Sweden
Duration: 27 Jul 2014 – 29 Jul 2014
Conference number: 38th (Proceedings)


Conference: International Computer Software and Applications Conference 2014
Abbreviated title: COMPSAC 2014


  • Configuration Bug
  • Data Mining
  • Feature Selection
