Mining Software Defects: Should We Consider Affected Releases?

Suraj Yatish, Jirayus Jiarpakdee, Patanamon Thongtanunam, Chakkrit Tantithamthavorn

Research output: Chapter in Book/Report/Conference proceeding; Conference Paper; Research; peer-review

60 Citations (Scopus)


With the rise of the Mining Software Repositories (MSR) field, defect datasets extracted from software repositories play a foundational role in many empirical studies related to software quality. At the core of defect data preparation is the identification of post-release defects. Prior studies leverage many heuristics (e.g., keywords and issue IDs) to identify post-release defects. However, such a heuristic approach is based on several assumptions, which pose common threats to the validity of many studies. In this paper, we set out to investigate the nature of the difference between defect datasets generated by the heuristic approach and those generated by the realistic approach, which leverages the earliest affected release that is realistically estimated by a software development team for a given defect. In addition, we investigate the impact of defect identification approaches on the predictive accuracy and the ranking of defective modules that are produced by defect models. Through a case study of defect datasets of 32 releases, we find that the heuristic approach has a large impact on both defect count datasets and binary defect datasets. Surprisingly, we find that the heuristic approach has a minimal impact on defect count models, suggesting that future work should not be too concerned about defect count models that are constructed using heuristic defect datasets. On the other hand, using defect datasets generated by the realistic approach leads to an improvement in the predictive accuracy of defect classification models.

Original language: English
Title of host publication: Proceedings - 2019 IEEE/ACM 41st International Conference on Software Engineering, ICSE 2019
Editors: Tevfik Bultan, Jon Whittle
Place of Publication: Piscataway NJ USA
Publisher: IEEE, Institute of Electrical and Electronics Engineers
Number of pages: 12
ISBN (Electronic): 9781728108698
ISBN (Print): 978172808704
Publication status: Published - 2019
Event: International Conference on Software Engineering 2019 - Fairmont The Queen Elizabeth Hotel, Montreal, Canada
Duration: 25 May 2019 to 31 May 2019
Conference number: 41st (Proceedings)


Conference: International Conference on Software Engineering 2019
Abbreviated title: ICSE 2019


  • Defect Prediction Models
  • Empirical Software Engineering
  • Mining Software Repositories
  • Software Quality
