A two-phase transfer learning model for cross-project defect prediction

Chao Liu, Dan Yang, Xin Xia, Meng Yan, Xiaohong Zhang

Research output: Contribution to journal › Article › Research › peer-review

16 Citations (Scopus)

Abstract

Context: Previous studies have shown that TCA+, a transfer learning model proposed by Nam et al., can significantly improve the performance of cross-project defect prediction (CPDP). TCA+ achieves this improvement by reducing the difference in data distribution between the source (training data) and target (testing data) projects. However, TCA+ is unstable: its performance varies widely depending on which source project is used to build the prediction model. In practice, it is hard to choose a suitable source project.

Objective: To address this limitation of TCA+, we propose a two-phase transfer learning model (TPTL) for CPDP.

Method: In the first phase, we propose a source project estimator (SPE) to automatically choose, from the candidates, two source projects with the highest distribution similarity to a target project. Next, two source projects that are estimated to achieve the highest values of F1-score and cost-effectiveness are selected. In the second phase, we leverage TCA+ to build two prediction models based on the two selected projects and combine their prediction results to further improve prediction performance.

Results: We evaluate TPTL on 42 defect datasets from the PROMISE repository and compare it with two versions of TCA+ (TCA+_Rnd, which randomly selects one source project, and TCA+_All, which uses all alternative source projects); TDS, a related source project selection model proposed by Herbold; a state-of-the-art CPDP model leveraging a log transformation (LT) method; and Dycom, a transfer learning model with a better form of TCA. Experimental results show that, on average across the 42 datasets, TPTL improves on these baseline models by 19%, 5%, 36%, 27%, and 11%, respectively, in terms of F1-score, and by 64%, 92%, 71%, 11%, and 66%, respectively, in terms of cost-effectiveness.

Conclusion: The proposed TPTL model can solve the instability problem of TCA+ and shows substantial improvements over the state-of-the-art and related CPDP models.
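As a rough illustration of the two phases described in the abstract, the following Python sketch ranks candidate source projects, trains one model per selected project, and averages their outputs. It is not the authors' implementation. The similarity measure (distance between feature means), the nearest-centroid scorer standing in for TCA+, the averaging rule, and all function names are assumptions made for illustration only; the abstract does not specify any of them.

```python
import math
from statistics import mean


def feature_means(rows):
    """Column-wise mean of a project's metric vectors."""
    return [mean(col) for col in zip(*rows)]


def similarity(source_rows, target_rows):
    """Assumed similarity proxy: negative distance between feature means."""
    return -math.dist(feature_means(source_rows), feature_means(target_rows))


def select_sources(candidates, target_rows, k=2):
    """Phase 1 (sketch): keep the k candidates most similar to the target."""
    ranked = sorted(
        candidates,
        key=lambda name: similarity(candidates[name][0], target_rows),
        reverse=True,
    )
    return ranked[:k]


def train_centroid_model(rows, labels):
    """Placeholder for TCA+: centroids of defective (1) and clean (0) rows.

    Assumes both classes are present in the source project.
    """
    defective = [r for r, y in zip(rows, labels) if y == 1]
    clean = [r for r, y in zip(rows, labels) if y == 0]
    return feature_means(defective), feature_means(clean)


def score(model, row):
    """Score in [0, 1]; higher means closer to the defective centroid."""
    c_def, c_clean = model
    d_def, d_clean = math.dist(row, c_def), math.dist(row, c_clean)
    total = d_def + d_clean
    return 0.5 if total == 0 else d_clean / total


def tptl_predict(candidates, target_rows, threshold=0.5):
    """Phase 2 (sketch): one model per selected source, scores averaged."""
    chosen = select_sources(candidates, target_rows)
    models = [train_centroid_model(*candidates[name]) for name in chosen]
    preds = []
    for row in target_rows:
        avg = mean(score(m, row) for m in models)
        preds.append(1 if avg >= threshold else 0)
    return chosen, preds


# Toy data (hypothetical): project name -> (metric rows, defect labels).
candidates = {
    "A": ([[1, 1], [2, 2], [9, 9], [10, 10]], [0, 0, 1, 1]),
    "B": ([[1, 2], [2, 1], [9, 10], [10, 9]], [0, 0, 1, 1]),
    "C": ([[100, 100], [101, 101], [200, 200], [201, 201]], [0, 0, 1, 1]),
}
target = [[1.5, 1.5], [9.5, 9.5]]
chosen, preds = tptl_predict(candidates, target)
print(chosen, preds)
```

In this toy setup, project C lies far from the target in feature space and is excluded in the first phase, so only A and B contribute to the combined prediction. A real replication would replace the placeholder scorer with TCA+ and the similarity proxy with the paper's SPE.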

Original language: English
Pages (from-to): 125-136
Number of pages: 12
Journal: Information and Software Technology
Volume: 107
Publication status: Published - Mar 2019

Keywords

  • Cross-Project prediction
  • Defect prediction
  • Source project selection
  • Transfer learning
