TY - JOUR
T1 - Context-aware Retrieval-based Deep Commit Message Generation
AU - Wang, Haoye
AU - Xia, Xin
AU - Lo, David
AU - He, Qiang
AU - Wang, Xinyu
AU - Grundy, John
N1 - Funding Information:
This research was partially supported by the Australian Research Council’s Discovery Early Career Researcher Award (DECRA) funding scheme (DE200100021), the ARC Laureate Fellowship funding scheme (FL190100035), an ARC Discovery grant (DP200100020), the Key-Area Research and Development Program of Guangdong Province (No. 2020B0101100005), the Key Research and Development Program of Zhejiang Province (No. 2021C01014), and the National Research Foundation, Singapore under its Industry Alignment Fund – Prepositioning (IAF-PP) Funding Initiative. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation, Singapore. Authors’ addresses: H. Wang and X. Wang, Cao Guangbiao building 405, Yuquan campus, Zhejiang University, China 310063; emails: {why_, wangxinyu}@zju.edu.cn; X. Xia (corresponding author) and J. Grundy, Building 6, 29 Ancora Imparo Way, Clayton Campus, Monash University VIC 3800; emails: {xin.xia, john.grundy}@monash.edu, [email protected]; D. Lo, School of Information Systems, Singapore Management University, 80 Stamford Road, Singapore 178902; email: [email protected]; Q. He, John St, Hawthorn VIC 3122, Australia; email: [email protected]. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. © 2021 Association for Computing Machinery.
1049-331X/2021/07-ART56 $15.00 https://doi.org/10.1145/3464689
Publisher Copyright:
© 2021 ACM.
PY - 2021/10
Y1 - 2021/10
AB - Commit messages recorded in version control systems contain valuable information for software development, maintenance, and comprehension. Unfortunately, developers often commit code with empty or poor-quality commit messages. To address this issue, several studies have proposed approaches to generate commit messages from commit diffs. Recent studies use neural machine translation algorithms to translate git diffs into commit messages and have achieved some promising results. However, these learning-based methods tend to generate high-frequency words but ignore low-frequency ones. In addition, they suffer from exposure bias, which leads to a gap between the training and testing phases. In this article, we propose CoRec to address these two limitations. Specifically, we first train a context-aware encoder-decoder model that randomly selects the previous output of the decoder or the embedding vector of a ground truth word as context to make the model gradually aware of previous alignment choices. Given a diff for testing, the trained model is reused to retrieve the most similar diff from the training set. Finally, we use the retrieved diff to guide the probability distribution for the final generated vocabulary. Our method combines the advantages of both information retrieval and neural machine translation. We evaluate CoRec on a dataset from Liu et al. and a large-scale dataset crawled from 10K popular Java repositories on GitHub. Our experimental results show that CoRec significantly outperforms the state-of-the-art method NNGen by 19% on average in terms of BLEU.
KW - Commit message generation
KW - information retrieval
KW - neural machine translation
UR - https://www.scopus.com/pages/publications/85112049240
U2 - 10.1145/3464689
DO - 10.1145/3464689
M3 - Article
AN - SCOPUS:85112049240
SN - 1049-331X
VL - 30
JO - ACM Transactions on Software Engineering and Methodology
JF - ACM Transactions on Software Engineering and Methodology
IS - 4
M1 - 56
ER -