A benchmark study on the effectiveness of search-based data selection and feature selection for cross project defect prediction

Seyedrebvar Hosseini, Burak Turhan, Mika Mäntylä

Research output: Contribution to journal › Article › Research › peer-review

Abstract

Context: Previous studies have shown that steered training data or dataset selection can lead to better performance for cross project defect prediction (CPDP). Feature selection and data quality are further issues to consider in CPDP.

Objective: We aim to use the Nearest Neighbor (NN)-Filter, embedded in a genetic algorithm, to produce validation sets for generating evolving training datasets to tackle CPDP while accounting for potential noise in defect labels. We also investigate the impact of using different feature sets.
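For orientation, the NN-Filter idea can be sketched as follows. This is a minimal illustration of the filter as it is commonly described (keep the cross-project instances that are among the k nearest neighbours of some target instance, by Euclidean distance). It is not the authors' implementation, and it does not show how the filter is embedded in the genetic algorithm. The function name `nn_filter`, the default `k=10`, and the toy data are assumptions.

```python
import math

def nn_filter(candidates, targets, k=10):
    """Return sorted indices of candidate rows that are among the k nearest
    neighbours (Euclidean distance) of at least one target row."""
    selected = set()
    for t in targets:
        # Rank every candidate row by its distance to this target row.
        ranked = sorted(range(len(candidates)),
                        key=lambda i: math.dist(candidates[i], t))
        selected.update(ranked[:k])
    return sorted(selected)

# Hypothetical two-feature metric vectors, for illustration only.
candidates = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (9.0, 9.0)]
targets = [(0.9, 1.2), (8.0, 8.5)]
picked = nn_filter(candidates, targets, k=1)
print(picked)
```

With `k=1` each target row contributes its single closest candidate, so the union contains at most one index per target row.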

Method: We extend our proposed approach, Genetic Instance Selection (GIS), by incorporating feature selection into its setting. We use 41 releases of 11 multi-version projects to assess the performance of GIS in comparison with benchmark CPDP approaches (NN-Filter and Naive-CPDP) and within-project approaches (Cross-Validation (CV) and Previous Releases (PR)). To assess the impact of feature sets, we use two sets of features, SCM+OO+LOC (all) and CK+LOC (ckloc), as well as iterative info-gain subsetting (IG) for feature selection.
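The abstract does not spell out the iterative subsetting procedure, so the sketch below only illustrates the underlying quantity: the information gain of a discrete feature with respect to the defect label, H(Y) - H(Y | X), used here to rank features. The feature names and data are hypothetical, and continuous metrics would need to be discretized first.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def info_gain(feature_values, labels):
    """Information gain of a discrete feature: H(Y) - H(Y | X)."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the labels within each feature-value group.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Hypothetical binary features and defect labels (1 = defective).
labels = [1, 1, 0, 0]
features = {"loc_high": [1, 1, 0, 0], "noise": [1, 0, 1, 0]}
ranking = sorted(features,
                 key=lambda name: info_gain(features[name], labels),
                 reverse=True)
print(ranking)
```

A feature that separates the classes perfectly reaches the full label entropy, while a feature that is independent of the label has a gain of zero.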

Results: The GIS variant with info-gain feature selection is significantly better than NN-Filter (all, ckloc, IG) in terms of F1 (p-values ≪ 0.001, Cohen's d = {0.621, 0.845, 0.762}) and G (p-values ≪ 0.001, Cohen's d = {0.899, 1.114, 1.056}), and than Naive-CPDP (all, ckloc, IG) in terms of F1 (p-values ≪ 0.001, Cohen's d = {0.743, 0.865, 0.789}) and G (p-values ≪ 0.001, Cohen's d = {1.027, 1.119, 1.050}). Overall, the performance of GIS is comparable to that of the within-project defect prediction (WPDP) benchmarks, i.e. CV and PR. In the multiple comparisons test, all variants of GIS belong to the top-ranking group of approaches.
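As a reading aid for the measures reported above, the following sketch computes F1, G, and Cohen's d. The abstract does not define G; the code assumes one common definition in the defect prediction literature, the harmonic mean of recall and specificity. Cohen's d is computed with the pooled standard deviation. All counts and scores are made up for illustration and are not results from the study.

```python
import math
from statistics import mean, variance

def f1_and_g(tp, fp, tn, fn):
    """F1 and G from confusion-matrix counts.
    Assumes every denominator is non-zero (no guard for empty classes)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    # Assumed definition: harmonic mean of recall and specificity.
    g = 2 * recall * specificity / (recall + specificity)
    return f1, g

def cohens_d(a, b):
    """Cohen's d for two samples, using the pooled sample variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)

# Hypothetical confusion matrix and per-release scores.
f1, g = f1_and_g(tp=30, fp=20, tn=40, fn=10)
d = cohens_d([0.5, 0.6, 0.7], [0.3, 0.4, 0.5])
print(round(f1, 3), round(g, 3), round(d, 3))
```

By the usual rule of thumb, d values around 0.5 are read as medium effects and values of 0.8 or more as large.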

Conclusions: We conclude that datasets obtained from search-based approaches combined with feature selection techniques are a promising way to tackle CPDP. In particular, the performance comparison with the within-project scenario encourages further investigation of our approach. However, the performance of GIS is based on high recall at the expense of a loss in precision. Using different optimization goals, other validation datasets, and other feature selection techniques are possible future directions to investigate.

Original language: English
Pages (from-to): 296-312
Number of pages: 17
Journal: Information and Software Technology
Volume: 95
DOI: 10.1016/j.infsof.2017.06.004
Publication status: Published - Mar 2018
Externally published: Yes

Keywords

  • Cross project defect prediction
  • Genetic algorithms
  • Instance selection
  • Search based optimization
  • Training data selection

Cite this

@article{cefaaf23027341dbb6d6c53203520a61,
title = "A benchmark study on the effectiveness of search-based data selection and feature selection for cross project defect prediction",
keywords = "Cross project defect prediction, Genetic algorithms, Instance selection, Search based optimization, Training data selection",
author = "Seyedrebvar Hosseini and Burak Turhan and Mika M{\"a}ntyl{\"a}",
year = "2018",
month = "3",
doi = "10.1016/j.infsof.2017.06.004",
language = "English",
volume = "95",
pages = "296--312",
journal = "Information and Software Technology",
issn = "0950-5849",
publisher = "Elsevier",

}

A benchmark study on the effectiveness of search-based data selection and feature selection for cross project defect prediction. / Hosseini, Seyedrebvar; Turhan, Burak; Mäntylä, Mika.

In: Information and Software Technology, Vol. 95, 03.2018, p. 296-312.


TY - JOUR

T1 - A benchmark study on the effectiveness of search-based data selection and feature selection for cross project defect prediction

AU - Hosseini, Seyedrebvar

AU - Turhan, Burak

AU - Mäntylä, Mika

PY - 2018/3

Y1 - 2018/3

KW - Cross project defect prediction

KW - Genetic algorithms

KW - Instance selection

KW - Search based optimization

KW - Training data selection

UR - http://www.scopus.com/inward/record.url?scp=85021334527&partnerID=8YFLogxK

U2 - 10.1016/j.infsof.2017.06.004

DO - 10.1016/j.infsof.2017.06.004

M3 - Article

VL - 95

SP - 296

EP - 312

JO - Information and Software Technology

JF - Information and Software Technology

SN - 0950-5849

ER -