TY - JOUR
T1 - VulExplainer
T2 - A transformer-based hierarchical distillation for explaining vulnerability types
AU - Fu, Michael
AU - Nguyen, Van
AU - Tantithamthavorn, Chakkrit Kla
AU - Le, Trung
AU - Phung, Dinh
N1 - Publisher Copyright:
IEEE
PY - 2023/10/1
Y1 - 2023/10/1
N2 - Deep learning-based vulnerability prediction approaches have been proposed to help under-resourced security practitioners detect vulnerable functions. However, security practitioners still do not know what type of vulnerability (i.e., CWE-ID) a given prediction corresponds to. Thus, an approach to explain the type of vulnerability for a given prediction is imperative. In this paper, we propose VulExplainer, an approach to explain the type of vulnerabilities. We frame VulExplainer as a vulnerability classification task. However, vulnerabilities have diverse characteristics (i.e., CWE-IDs) and the number of labeled samples per CWE-ID is highly imbalanced (a highly imbalanced multi-class classification problem), which often leads to inaccurate predictions. Thus, we introduce a Transformer-based hierarchical distillation for software vulnerability classification to address the highly imbalanced types of software vulnerabilities. Specifically, we split the complex label distribution into sub-distributions based on CWE abstract types (i.e., categorizations that group similar CWE-IDs), so that similar CWE-IDs are grouped together and each group has a more balanced label distribution. We train a TextCNN teacher on each of the simplified distributions; however, each teacher performs well only within its own group. Thus, we build a Transformer student model that generalizes the performance of the TextCNN teachers through our hierarchical knowledge distillation framework. Through an extensive evaluation on 8,636 real-world vulnerabilities, our approach outperforms all baselines by 5%-29%. The results also demonstrate that our approach can be applied to Transformer-based architectures such as CodeBERT, GraphCodeBERT, and CodeGPT. Moreover, our method maintains compatibility with any Transformer-based model without requiring architectural modifications; it only adds a special distillation token to the input. These results highlight our significant contributions towards the fundamental and practical problem of explaining software vulnerabilities.
AB - Deep learning-based vulnerability prediction approaches have been proposed to help under-resourced security practitioners detect vulnerable functions. However, security practitioners still do not know what type of vulnerability (i.e., CWE-ID) a given prediction corresponds to. Thus, an approach to explain the type of vulnerability for a given prediction is imperative. In this paper, we propose VulExplainer, an approach to explain the type of vulnerabilities. We frame VulExplainer as a vulnerability classification task. However, vulnerabilities have diverse characteristics (i.e., CWE-IDs) and the number of labeled samples per CWE-ID is highly imbalanced (a highly imbalanced multi-class classification problem), which often leads to inaccurate predictions. Thus, we introduce a Transformer-based hierarchical distillation for software vulnerability classification to address the highly imbalanced types of software vulnerabilities. Specifically, we split the complex label distribution into sub-distributions based on CWE abstract types (i.e., categorizations that group similar CWE-IDs), so that similar CWE-IDs are grouped together and each group has a more balanced label distribution. We train a TextCNN teacher on each of the simplified distributions; however, each teacher performs well only within its own group. Thus, we build a Transformer student model that generalizes the performance of the TextCNN teachers through our hierarchical knowledge distillation framework. Through an extensive evaluation on 8,636 real-world vulnerabilities, our approach outperforms all baselines by 5%-29%. The results also demonstrate that our approach can be applied to Transformer-based architectures such as CodeBERT, GraphCodeBERT, and CodeGPT. Moreover, our method maintains compatibility with any Transformer-based model without requiring architectural modifications; it only adds a special distillation token to the input. These results highlight our significant contributions towards the fundamental and practical problem of explaining software vulnerabilities.
KW - Codes
KW - Data models
KW - Security
KW - Software
KW - software security
KW - Software vulnerability
KW - Task analysis
KW - Transformers
UR - http://www.scopus.com/inward/record.url?scp=85168263380&partnerID=8YFLogxK
U2 - 10.1109/TSE.2023.3305244
DO - 10.1109/TSE.2023.3305244
M3 - Article
AN - SCOPUS:85168263380
SN - 0098-5589
VL - 49
SP - 4550
EP - 4565
JO - IEEE Transactions on Software Engineering
JF - IEEE Transactions on Software Engineering
IS - 10
ER -