The impact of data merging on the interpretation of cross-project Just-In-Time defect models

Dayi Lin, Chakkrit Tantithamthavorn, Ahmed E. Hassan

Research output: Contribution to journal › Article › Research › peer-review

9 Citations (Scopus)

Abstract

Just-In-Time (JIT) defect models are classification models that identify the code commits that are likely to introduce defects. Cross-project JIT models have been introduced to address the suboptimal performance of JIT models when historical data is limited. However, many studies built cross-project JIT models using a pool of mixed data from multiple projects (i.e., data merging), assuming that the properties of defect-introducing commits of a project are similar to those of the other projects, which is likely not true. In this paper, we set out to investigate the interpretation of JIT defect models that are built from individual project data and from a pool of mixed project data, with and without consideration of project-level variances. Through a case study of 20 datasets of open-source projects, we found that (1) the interpretation of JIT models that are built from individual projects varies among projects; and (2) the project-level variances cannot be captured by a JIT model that is trained from a pool of mixed data from multiple projects without considering project-level variances (i.e., a global JIT model). On the other hand, a mixed-effect JIT model that considers project-level variances represents the different interpretations better, without sacrificing performance, especially when the contexts of projects are considered. The results hold for different mixed-effect learning algorithms. When the goal is to derive a sound interpretation of cross-project JIT models, we suggest that practitioners and researchers opt for a mixed-effect modelling approach that considers individual projects and contexts.
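As a rough illustration of finding (2), the sketch below contrasts per-project logistic regressions with a single global model trained on merged data. It is not the authors' experimental setup: the data are synthetic, the project names and the single commit metric are hypothetical, and the hand-written `fit_logistic` helper is an assumption made to keep the example self-contained. Separate per-project fits stand in here for project-aware modelling; an actual mixed-effect model would instead estimate shared fixed effects together with project-level random effects.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, steps=500, lr=1.0):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by gradient ascent on the mean log-likelihood."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


def simulate_project(rng, slope, n=300):
    """Synthetic commits: x is a standardized commit metric, y marks a defect-introducing commit."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [1 if rng.random() < sigmoid(-1.0 + slope * x) else 0 for x in xs]
    return xs, ys


rng = random.Random(0)

# Hypothetical projects in which the same metric relates to defects differently.
true_slopes = {"proj_a": 2.0, "proj_b": 0.0, "proj_c": -1.0}
data = {name: simulate_project(rng, s) for name, s in true_slopes.items()}

# Local models: one coefficient per project.
local_slopes = {name: fit_logistic(xs, ys)[1] for name, (xs, ys) in data.items()}

# Global model: data merging, a single coefficient for all projects.
pooled_xs = [x for xs, _ in data.values() for x in xs]
pooled_ys = [y for _, ys in data.values() for y in ys]
global_slope = fit_logistic(pooled_xs, pooled_ys)[1]

for name, slope in local_slopes.items():
    print(f"{name}: local slope = {slope:.2f}")
print(f"global (merged) slope = {global_slope:.2f}")
```

Because the global model has only one coefficient for the metric, it reports a single compromise value and cannot express that the metric's relationship with defect-proneness differs, or even reverses, between the simulated projects.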

Original language: English
Pages (from-to): 2969-2986
Number of pages: 19
Journal: IEEE Transactions on Software Engineering
Volume: 48
Issue number: 8
Publication status: Published - 1 Aug 2022

Keywords

  • Context modeling
  • Cross-Project Defect Prediction
  • Data Merging
  • Data models
  • Just-In-Time Defect Prediction
  • Measurement
  • Merging
  • Mixed-Effect Model
  • Planning
  • Predictive models
  • Training
