A two-phase transfer learning model for cross-project defect prediction

Chao Liu, Dan Yang, Xin Xia, Meng Yan, Xiaohong Zhang

Research output: Contribution to journalArticleResearchpeer-review

Abstract

Context: Previous studies have shown that a transfer learning model, TCA+ proposed by Nam et al., can significantly improve the performance of cross-project defect prediction (CPDP). TCA+ achieves the improvement by reducing data distribution difference between source (training data) and target (testing data) projects. However, TCA+ is unstable, i.e., its performance varies largely when using different source projects to build prediction models. In practice, it is hard to choose a suitable source project to build the prediction model. Objective: To address the limitation of TCA+, we propose a two-phase transfer learning model (TPTL) for CPDP. Method: In the first phase, we propose a source project estimator (SPE) to automatically choose two source projects with the highest distribution similarity to a target project from candidates. Next, two source projects that are estimated to achieve the highest values of F1-score and cost-effectiveness are selected. In the second phase, we leverage TCA+ to build two prediction models based on the two selected projects and combine their prediction results to further improve the prediction performance. Results: We evaluate TPTL on 42 defect datasets from PROMISE repository, and compare it with two versions of TCA+ (TCA+_Rnd, randomly selecting one source project; TCA+_All, using all alternative source projects), a related source project selection model TDS proposed by Herbold, a state-of-the-art CPDP model leveraging a log transformation (LT) method, and a transfer learning model Dycom with better form of TCA. Experiment results show that, on average across 42 datasets, TPTL respectively improves these baseline models by 19%, 5%, 36%, 27%, and 11% in terms of F1-score; by 64%, 92%, 71%, 11%, and 66% in terms of cost-effectiveness. Conclusion: The proposed TPTL model can solve the instability problem of TCA+, showing substantial improvements over the state-of-the-art and related CPDP models.

Original languageEnglish
Pages (from-to)125-136
Number of pages12
JournalInformation and Software Technology
Volume107
DOIs
Publication statusPublished - Mar 2019

Keywords

  • Cross-Project prediction
  • Defect prediction
  • Source project selection
  • Transfer learning

Cite this

Liu, Chao ; Yang, Dan ; Xia, Xin ; Yan, Meng ; Zhang, Xiaohong. / A two-phase transfer learning model for cross-project defect prediction. In: Information and Software Technology. 2019 ; Vol. 107. pp. 125-136.
@article{d952fa26b85a41ac803a0cd41d0d5ddd,
title = "A two-phase transfer learning model for cross-project defect prediction",
abstract = "Context: Previous studies have shown that a transfer learning model, TCA+ proposed by Nam et al., can significantly improve the performance of cross-project defect prediction (CPDP). TCA+ achieves the improvement by reducing data distribution difference between source (training data) and target (testing data) projects. However, TCA+ is unstable, i.e., its performance varies largely when using different source projects to build prediction models. In practice, it is hard to choose a suitable source project to build the prediction model. Objective: To address the limitation of TCA+, we propose a two-phase transfer learning model (TPTL) for CPDP. Method: In the first phase, we propose a source project estimator (SPE) to automatically choose two source projects with the highest distribution similarity to a target project from candidates. Next, two source projects that are estimated to achieve the highest values of F1-score and cost-effectiveness are selected. In the second phase, we leverage TCA+ to build two prediction models based on the two selected projects and combine their prediction results to further improve the prediction performance. Results: We evaluate TPTL on 42 defect datasets from PROMISE repository, and compare it with two versions of TCA+ (TCA+_Rnd, randomly selecting one source project; TCA+_All, using all alternative source projects), a related source project selection model TDS proposed by Herbold, a state-of-the-art CPDP model leveraging a log transformation (LT) method, and a transfer learning model Dycom with better form of TCA. Experiment results show that, on average across 42 datasets, TPTL respectively improves these baseline models by 19{\%}, 5{\%}, 36{\%}, 27{\%}, and 11{\%} in terms of F1-score; by 64{\%}, 92{\%}, 71{\%}, 11{\%}, and 66{\%} in terms of cost-effectiveness. Conclusion: The proposed TPTL model can solve the instability problem of TCA+, showing substantial improvements over the state-of-the-art and related CPDP models.",
keywords = "Cross-Project prediction, Defect prediction, Source project selection, Transfer learning",
author = "Chao Liu and Dan Yang and Xin Xia and Meng Yan and Xiaohong Zhang",
year = "2019",
month = "3",
doi = "10.1016/j.infsof.2018.11.005",
language = "English",
volume = "107",
pages = "125--136",
journal = "Information and Software Technology",
issn = "0950-5849",
publisher = "Elsevier",

}

A two-phase transfer learning model for cross-project defect prediction. / Liu, Chao; Yang, Dan; Xia, Xin; Yan, Meng; Zhang, Xiaohong.

In: Information and Software Technology, Vol. 107, 03.2019, p. 125-136.

Research output: Contribution to journalArticleResearchpeer-review

TY - JOUR

T1 - A two-phase transfer learning model for cross-project defect prediction

AU - Liu, Chao

AU - Yang, Dan

AU - Xia, Xin

AU - Yan, Meng

AU - Zhang, Xiaohong

PY - 2019/3

Y1 - 2019/3

N2 - Context: Previous studies have shown that a transfer learning model, TCA+ proposed by Nam et al., can significantly improve the performance of cross-project defect prediction (CPDP). TCA+ achieves the improvement by reducing data distribution difference between source (training data) and target (testing data) projects. However, TCA+ is unstable, i.e., its performance varies largely when using different source projects to build prediction models. In practice, it is hard to choose a suitable source project to build the prediction model. Objective: To address the limitation of TCA+, we propose a two-phase transfer learning model (TPTL) for CPDP. Method: In the first phase, we propose a source project estimator (SPE) to automatically choose two source projects with the highest distribution similarity to a target project from candidates. Next, two source projects that are estimated to achieve the highest values of F1-score and cost-effectiveness are selected. In the second phase, we leverage TCA+ to build two prediction models based on the two selected projects and combine their prediction results to further improve the prediction performance. Results: We evaluate TPTL on 42 defect datasets from PROMISE repository, and compare it with two versions of TCA+ (TCA+_Rnd, randomly selecting one source project; TCA+_All, using all alternative source projects), a related source project selection model TDS proposed by Herbold, a state-of-the-art CPDP model leveraging a log transformation (LT) method, and a transfer learning model Dycom with better form of TCA. Experiment results show that, on average across 42 datasets, TPTL respectively improves these baseline models by 19%, 5%, 36%, 27%, and 11% in terms of F1-score; by 64%, 92%, 71%, 11%, and 66% in terms of cost-effectiveness. Conclusion: The proposed TPTL model can solve the instability problem of TCA+, showing substantial improvements over the state-of-the-art and related CPDP models.

AB - Context: Previous studies have shown that a transfer learning model, TCA+ proposed by Nam et al., can significantly improve the performance of cross-project defect prediction (CPDP). TCA+ achieves the improvement by reducing data distribution difference between source (training data) and target (testing data) projects. However, TCA+ is unstable, i.e., its performance varies largely when using different source projects to build prediction models. In practice, it is hard to choose a suitable source project to build the prediction model. Objective: To address the limitation of TCA+, we propose a two-phase transfer learning model (TPTL) for CPDP. Method: In the first phase, we propose a source project estimator (SPE) to automatically choose two source projects with the highest distribution similarity to a target project from candidates. Next, two source projects that are estimated to achieve the highest values of F1-score and cost-effectiveness are selected. In the second phase, we leverage TCA+ to build two prediction models based on the two selected projects and combine their prediction results to further improve the prediction performance. Results: We evaluate TPTL on 42 defect datasets from PROMISE repository, and compare it with two versions of TCA+ (TCA+_Rnd, randomly selecting one source project; TCA+_All, using all alternative source projects), a related source project selection model TDS proposed by Herbold, a state-of-the-art CPDP model leveraging a log transformation (LT) method, and a transfer learning model Dycom with better form of TCA. Experiment results show that, on average across 42 datasets, TPTL respectively improves these baseline models by 19%, 5%, 36%, 27%, and 11% in terms of F1-score; by 64%, 92%, 71%, 11%, and 66% in terms of cost-effectiveness. Conclusion: The proposed TPTL model can solve the instability problem of TCA+, showing substantial improvements over the state-of-the-art and related CPDP models.

KW - Cross-Project prediction

KW - Defect prediction

KW - Source project selection

KW - Transfer learning

UR - http://www.scopus.com/inward/record.url?scp=85056828109&partnerID=8YFLogxK

U2 - 10.1016/j.infsof.2018.11.005

DO - 10.1016/j.infsof.2018.11.005

M3 - Article

VL - 107

SP - 125

EP - 136

JO - Information and Software Technology

JF - Information and Software Technology

SN - 0950-5849

ER -