Characterizing and identifying reverted commits

Meng Yan, Xin Xia, David Lo, Ahmed E. Hassan, Shanping Li

Research output: Contribution to journal > Article > Research > peer-review

Abstract

In practice, a popular and coarse-grained approach for recovering from a problematic commit is to revert it (i.e., undoing the change). However, reverted commits could induce some issues for software development, such as impeding the development progress and increasing the difficulty for maintenance. In order to mitigate these issues, we set out to explore the following central question: can we characterize and identify which commits will be reverted? In this paper, we characterize commits using 27 commit features and build an identification model to identify commits that will be reverted. We first identify reverted commits by analyzing commit messages and comparing the changed content, and extract 27 commit features that can be divided into three dimensions, namely change, developer and message, respectively. Then, we build an identification model (e.g., random forest) based on the extracted features. To evaluate the effectiveness of our proposed model, we perform an empirical study on ten open source projects including a total of 125,241 commits. Our experimental results show that our model outperforms two baselines in terms of AUC-ROC and cost-effectiveness (i.e., percentage of detected reverted commits when inspecting 20% of total changed LOC). In terms of the average performance across the ten studied projects, our model achieves an AUC-ROC of 0.756 and a cost-effectiveness of 0.746, significantly improving the baselines by substantial margins. In addition, we found that “developer” is the most discriminative dimension among the three dimensions of features for the identification of reverted commits. However, using all the three dimensions of commit features leads to better performance.
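
The cost-effectiveness measure described in the abstract (the share of reverted commits found when inspecting 20% of the total changed LOC) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that commits are inspected in descending order of predicted score and that inspection stops before the first commit that would exceed the LOC budget; the paper may rank or cut off differently. The function name `cost_effectiveness` and its parameters are hypothetical.

```python
from typing import Sequence


def cost_effectiveness(
    scores: Sequence[float],
    churn: Sequence[int],
    reverted: Sequence[int],
    budget: float = 0.20,
) -> float:
    """Share of reverted commits found within a changed-LOC inspection budget.

    scores   -- predicted likelihood that each commit will be reverted
    churn    -- changed LOC of each commit
    reverted -- 1 if the commit was actually reverted, else 0
    budget   -- fraction of total changed LOC that may be inspected
    """
    total_loc = sum(churn)
    total_reverted = sum(reverted)
    if total_loc == 0 or total_reverted == 0:
        return 0.0

    # Assumption: inspect the highest-scored commits first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    limit = budget * total_loc
    inspected_loc = 0
    found = 0
    for i in order:
        # Assumption: stop before the commit that would exceed the budget.
        if inspected_loc + churn[i] > limit:
            break
        inspected_loc += churn[i]
        found += reverted[i]
    return found / total_reverted


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    scores = [0.9, 0.8, 0.1, 0.3]
    churn = [10, 5, 50, 35]
    reverted = [1, 0, 1, 0]
    print(cost_effectiveness(scores, churn, reverted))
```

With the toy data, the budget is 20 of 100 changed LOC, which covers the two highest-scored commits and finds one of the two reverted commits.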

Original language: English
Pages (from-to): 2171-2208
Number of pages: 38
Journal: Empirical Software Engineering
Volume: 24
Issue number: 4
DOI: 10.1007/s10664-019-09688-8
Publication status: Published - Aug 2019

Keywords

  • Empirical study
  • Feature engineering
  • Identification model
  • Reverted commits

Cite this

Yan, Meng ; Xia, Xin ; Lo, David ; Hassan, Ahmed E. ; Li, Shanping. / Characterizing and identifying reverted commits. In: Empirical Software Engineering. 2019 ; Vol. 24, No. 4. pp. 2171-2208.
@article{cb45270a19d5433b9ba3993b5af0f06b,
title = "Characterizing and identifying reverted commits",
abstract = "In practice, a popular and coarse-grained approach for recovering from a problematic commit is to revert it (i.e., undoing the change). However, reverted commits could induce some issues for software development, such as impeding the development progress and increasing the difficulty for maintenance. In order to mitigate these issues, we set out to explore the following central question: can we characterize and identify which commits will be reverted? In this paper, we characterize commits using 27 commit features and build an identification model to identify commits that will be reverted. We first identify reverted commits by analyzing commit messages and comparing the changed content, and extract 27 commit features that can be divided into three dimensions, namely change, developer and message, respectively. Then, we build an identification model (e.g., random forest) based on the extracted features. To evaluate the effectiveness of our proposed model, we perform an empirical study on ten open source projects including a total of 125,241 commits. Our experimental results show that our model outperforms two baselines in terms of AUC-ROC and cost-effectiveness (i.e., percentage of detected reverted commits when inspecting 20{\%} of total changed LOC). In terms of the average performance across the ten studied projects, our model achieves an AUC-ROC of 0.756 and a cost-effectiveness of 0.746, significantly improving the baselines by substantial margins. In addition, we found that “developer” is the most discriminative dimension among the three dimensions of features for the identification of reverted commits. However, using all the three dimensions of commit features leads to better performance.",
keywords = "Empirical study, Feature engineering, Identification model, Reverted commits",
author = "Yan, Meng and Xia, Xin and Lo, David and Hassan, {Ahmed E.} and Li, Shanping",
year = "2019",
month = "8",
doi = "10.1007/s10664-019-09688-8",
language = "English",
volume = "24",
pages = "2171--2208",
journal = "Empirical Software Engineering",
issn = "1382-3256",
publisher = "Springer-Verlag London Ltd.",
number = "4",
}

Characterizing and identifying reverted commits. / Yan, Meng; Xia, Xin; Lo, David; Hassan, Ahmed E.; Li, Shanping.

In: Empirical Software Engineering, Vol. 24, No. 4, 08.2019, p. 2171-2208.

TY - JOUR
T1 - Characterizing and identifying reverted commits
AU - Yan, Meng
AU - Xia, Xin
AU - Lo, David
AU - Hassan, Ahmed E.
AU - Li, Shanping
PY - 2019/8
Y1 - 2019/8
N2 - In practice, a popular and coarse-grained approach for recovering from a problematic commit is to revert it (i.e., undoing the change). However, reverted commits could induce some issues for software development, such as impeding the development progress and increasing the difficulty for maintenance. In order to mitigate these issues, we set out to explore the following central question: can we characterize and identify which commits will be reverted? In this paper, we characterize commits using 27 commit features and build an identification model to identify commits that will be reverted. We first identify reverted commits by analyzing commit messages and comparing the changed content, and extract 27 commit features that can be divided into three dimensions, namely change, developer and message, respectively. Then, we build an identification model (e.g., random forest) based on the extracted features. To evaluate the effectiveness of our proposed model, we perform an empirical study on ten open source projects including a total of 125,241 commits. Our experimental results show that our model outperforms two baselines in terms of AUC-ROC and cost-effectiveness (i.e., percentage of detected reverted commits when inspecting 20% of total changed LOC). In terms of the average performance across the ten studied projects, our model achieves an AUC-ROC of 0.756 and a cost-effectiveness of 0.746, significantly improving the baselines by substantial margins. In addition, we found that “developer” is the most discriminative dimension among the three dimensions of features for the identification of reverted commits. However, using all the three dimensions of commit features leads to better performance.

KW - Empirical study
KW - Feature engineering
KW - Identification model
KW - Reverted commits
UR - http://www.scopus.com/inward/record.url?scp=85062728420&partnerID=8YFLogxK
U2 - 10.1007/s10664-019-09688-8
DO - 10.1007/s10664-019-09688-8
M3 - Article
VL - 24
SP - 2171
EP - 2208
JO - Empirical Software Engineering
JF - Empirical Software Engineering
SN - 1382-3256
IS - 4
ER -