Detecting code clones with graph neural network and flow-augmented abstract syntax tree

Wenhan Wang, Ge Li, Bo Ma, Xin Xia, Zhi Jin

Research output: Chapter in Book/Report/Conference proceeding > Conference Paper > Research > peer-review

173 Citations (Scopus)

Abstract

Code clones are pairs of semantically similar code fragments that may be syntactically similar or different. Detecting code clones can help reduce the cost of software maintenance and prevent bugs. Numerous approaches to detecting code clones have been proposed, but most of them focus on syntactic clones and do not work well on semantic clones with different syntactic features. To detect semantic clones, researchers have adopted deep learning for code clone detection to automatically learn latent semantic features from data. In particular, to leverage grammar information, several approaches have used abstract syntax trees (ASTs) as input and achieved significant progress on code clone benchmarks in various programming languages. However, these AST-based approaches still cannot fully leverage the structural information of code fragments, especially semantic information such as control flow and data flow. To leverage control and data flow information, in this paper we build a graph representation of programs called the flow-augmented abstract syntax tree (FA-AST). We construct the FA-AST by augmenting original ASTs with explicit control and data flow edges. We then apply two different types of graph neural networks (GNNs) to the FA-AST to measure the similarity of code pairs. To the best of our knowledge, we are the first to apply graph neural networks to the domain of code clone detection. We apply our FA-AST and graph neural networks to two Java datasets: Google Code Jam and BigCloneBench. Our approach outperforms the state-of-the-art approaches on both the Google Code Jam and BigCloneBench tasks.
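
The idea of augmenting an AST with explicit flow edges can be illustrated with a small Python sketch. This is not the authors' implementation: the `Node` class, the `build_fa_ast` function, and the edge-type names (`child`, `parent`, `next_stmt`, `while_exec`, `while_next`, `cond_true`, `cond_false`, `next_use`) are assumptions chosen for illustration, and the abstract does not specify which edge types the paper uses. The sketch turns a toy AST into a typed edge list of the kind a graph neural network could consume.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """A toy AST node (hypothetical, for illustration only)."""
    kind: str
    value: Optional[str] = None
    children: list = field(default_factory=list)


def build_fa_ast(root):
    """Return (labels, edges); edges are (src_id, dst_id, edge_type) triples."""
    labels = []
    edges = []
    last_use = {}  # variable name -> node id of its previous occurrence

    def visit(node):
        idx = len(labels)  # node ids follow pre-order traversal
        labels.append(node.kind if node.value is None
                      else f"{node.kind}:{node.value}")

        # Assumed data-flow edge: link each variable occurrence to the next.
        if node.kind == "Name":
            if node.value in last_use:
                edges.append((last_use[node.value], idx, "next_use"))
            last_use[node.value] = idx

        # Plain AST structure, stored in both directions.
        child_ids = []
        for child in node.children:
            cid = visit(child)
            child_ids.append(cid)
            edges.append((idx, cid, "child"))
            edges.append((cid, idx, "parent"))

        # Assumed control-flow edges for a few statement kinds.
        if node.kind == "Block":
            for a, b in zip(child_ids, child_ids[1:]):
                edges.append((a, b, "next_stmt"))
        elif node.kind == "While" and len(child_ids) == 2:
            cond, body = child_ids
            edges.append((cond, body, "while_exec"))
            edges.append((body, cond, "while_next"))
        elif node.kind == "If" and len(child_ids) >= 2:
            edges.append((child_ids[0], child_ids[1], "cond_true"))
            if len(child_ids) == 3:
                edges.append((child_ids[0], child_ids[2], "cond_false"))
        return idx

    visit(root)
    return labels, edges


# Toy AST for:  x = 0; while (x < 10) { x = x + 1; }
program = Node("Block", children=[
    Node("Assign", children=[Node("Name", "x"), Node("Lit", "0")]),
    Node("While", children=[
        Node("Less", children=[Node("Name", "x"), Node("Lit", "10")]),
        Node("Block", children=[
            Node("Assign", children=[
                Node("Name", "x"),
                Node("Add", children=[Node("Name", "x"), Node("Lit", "1")]),
            ]),
        ]),
    ]),
])

labels, edges = build_fa_ast(program)
for src, dst, etype in edges:
    if etype not in ("child", "parent"):
        print(f"{labels[src]} -[{etype}]-> {labels[dst]}")
```

In a setup like this, a clone detector would encode the two graphs of a code pair with a GNN and compare the resulting vectors; that step depends on the chosen network and is left out of the sketch.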

Original language: English
Title of host publication: Proceedings of the 2020 IEEE 27th International Conference on Software Analysis, Evolution, and Reengineering
Editors: Kostas Kontogiannis, Foutse Khomh, Alexander Chatzigeorgiou, Marios-Eleftherios Fokaefs, Minghui Zhou
Place of Publication: Piscataway NJ USA
Publisher: IEEE, Institute of Electrical and Electronics Engineers
Pages: 261-271
Number of pages: 11
ISBN (Electronic): 9781728151434
ISBN (Print): 9781728151441
Publication status: Published - 2020
Event: IEEE International Conference on Software Analysis, Evolution, and Reengineering 2020 - London, Canada
Duration: 18 Feb 2020 to 21 Feb 2020
Conference number: 27th
https://saner2020.csd.uwo.ca (Website)
https://ieeexplore.ieee.org/xpl/conhome/9040394/proceeding (Proceedings)

Conference

Conference: IEEE International Conference on Software Analysis, Evolution, and Reengineering 2020
Abbreviated title: SANER 2020
Country/Territory: Canada
City: London
Period: 18/02/20 to 21/02/20

Keywords

  • clone detection
  • control flow
  • data flow
  • deep learning
  • graph neural network
