Search based training data selection for cross project defect prediction

Seyedrebvar Hosseini, Burak Turhan, Mika Mantyla

Research output: Chapter in Book/Report/Conference proceeding; Conference Paper; Research; peer-review

3 Citations (Scopus)

Abstract

Context: Previous studies have shown that steered training data or dataset selection can lead to better performance for cross project defect prediction (CPDP). On the other hand, data quality is an issue to consider in CPDP.

Aim: We aim at utilising the Nearest Neighbor (NN)-Filter, embedded in a genetic algorithm, to generate evolving training datasets to tackle CPDP, while accounting for potential noise in defect labels.

Method: We propose a new search based training data (i.e., instance) selection approach for CPDP called GIS (Genetic Instance Selection) that looks for solutions that optimize a combined measure of F-Measure and GMean on a validation set generated by the (NN)-Filter. The genetic operations consider the similarities in features and address possible noise in assigned defect labels. We use 13 datasets from the PROMISE repository to compare the performance of GIS with benchmark CPDP methods, namely the (NN)-Filter and naive CPDP, as well as with within project defect prediction (WPDP).

Results: Our results show that GIS is significantly better than the (NN)-Filter in terms of F-Measure (p-value 0.001, Cohen’s d = 0.697) and GMean (p-value 0.001, Cohen’s d = 0.946). It also outperforms the naive CPDP approach in terms of F-Measure (p-value 0.001, Cohen’s d = 0.753) and GMean (p-value 0.001, Cohen’s d = 0.994). In addition, the performance of our approach is better than that of WPDP, again considering F-Measure (p-value 0.001, Cohen’s d = 0.227) and GMean (p-value 0.001, Cohen’s d = 0.595) values.

Conclusions: We conclude that search based instance selection is a promising way to tackle CPDP. In particular, the performance comparison with the within project scenario encourages further investigation of our approach. However, the performance of GIS is based on high recall at the expense of low precision. Using different optimization goals, e.g., targeting high precision, is a direction for future investigation.
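The two building blocks named in the abstract, the (NN)-Filter and a combined F-Measure/GMean objective, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the Euclidean distance, the default k = 10, the definition of GMean as the geometric mean of recall and specificity, and the product used to combine the two measures are all assumptions made for illustration; the abstract does not specify any of them.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = defective, 0 = clean)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def g_mean(tp, fp, fn, tn):
    """ASSUMPTION: geometric mean of recall and specificity."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(recall * specificity)


def fitness(y_true, y_pred):
    """ASSUMPTION: combine the two measures by multiplying them."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return f_measure(tp, fp, fn) * g_mean(tp, fp, fn, tn)


def nn_filter(train_X, test_X, k=10):
    """Indices of training rows among the k nearest to any test row.

    ASSUMPTION: Euclidean distance on raw feature vectors, k = 10.
    """
    selected = set()
    for x in test_X:
        ranked = sorted(range(len(train_X)),
                        key=lambda i: math.dist(train_X[i], x))
        selected.update(ranked[:k])
    return sorted(selected)
```

In a search based setting, a function such as `fitness` would score each candidate training set by the predictions its model makes on the validation instances returned by `nn_filter`. Swapping in a precision-oriented objective, as the conclusions suggest, would only require changing `fitness`.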

Original language: English
Title of host publication: Proceedings of the The 12th International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE 2016)
Subtitle of host publication: Ciudad Real, Spain — September 09 - 09, 2016
Editors: Andriy Miranskyy, Hongyu Zhang
Place of Publication: New York, NY, USA
Publisher: Association for Computing Machinery (ACM)
Number of pages: 10
ISBN (Print): 9781450347723
DOIs: https://doi.org/10.1145/2972958.2972964
Publication status: Published - 2016
Externally published: Yes
Event: International Conference on Predictive Models and Data Analytics in Software Engineering 2016 - Ciudad Real, Spain
Duration: 7 Sep 2016 - 7 Sep 2016
Conference number: 12th
http://promisedata.org/2016/

Conference

Conference: International Conference on Predictive Models and Data Analytics in Software Engineering 2016
Abbreviated title: PROMISE 2016
Country: Spain
City: Ciudad Real
Period: 7/09/16 - 7/09/16

Keywords

  • Cross project defect prediction
  • Genetic algorithms
  • Instance selection
  • Search based optimization
  • Training data selection

Cite this

Hosseini, S., Turhan, B., & Mantyla, M. (2016). Search based training data selection for cross project defect prediction. In A. Miranskyy, & H. Zhang (Eds.), Proceedings of the The 12th International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE 2016): Ciudad Real, Spain — September 09 - 09, 2016 [2972964] Association for Computing Machinery (ACM). https://doi.org/10.1145/2972958.2972964