TY - JOUR
T1 - A novel generative adversarial network for improving crash severity modeling with imbalanced data
AU - Chen, Junlan
AU - Pu, Ziyuan
AU - Zheng, Nan
AU - Wen, Xiao
AU - Ding, Hongliang
AU - Guo, Xiucheng
N1 - Funding Information:
The research is supported by the Key Laboratory of Transport Industry of Comprehensive Transportation Theory (Nanjing Modern Multimodal Transportation Laboratory), Ministry of Transport, PRC: [Grant No. MTF2023002]. We thank Dr. Yingheng Zhang and Dr. Chi Wei for useful comments that significantly improved the presentation of this article.
Publisher Copyright:
© 2024 Elsevier Ltd
PY - 2024/7
Y1 - 2024/7
N2 - Traffic crash data are often highly imbalanced, with a large majority of non-fatal crashes and only a small number of fatal crashes. This imbalance poses a challenge for crash severity modeling, especially for classifying and interpreting fatal crashes with very limited samples. To address it, data resampling techniques such as under-sampling and over-sampling are commonly used to rebalance the number of samples across the categories of the dataset. However, most traditional and existing deep learning-based resampling methods, e.g., the synthetic minority oversampling technique (SMOTE) and Generative Adversarial Networks (GAN), struggle to handle both continuous and discrete risk factors in traffic crash datasets, since they are built upon smooth and continuous functions that are not applicable to discrete variables. Although some resampling methods can handle both continuous and discrete variables, they may suffer from mode collapse associated with sparse discrete risk factors, so that the diversity of the underlying data distribution cannot be captured because repetitive and similar samples are oversampled. To address these issues, the current study proposes a traffic crash data generation method based on the Conditional Tabular GAN (CTGAN) to rebalance crash datasets and improve the performance of crash severity classification and interpretation. Experiments are designed to evaluate the contribution of the synthetic data to crash severity classification, the distribution consistency between synthetic and benchmark datasets, and the parameter recovery (i.e., the accuracy of parameter estimation and probability prediction) of various resampling strategies. A 4-year real-world dataset collected in Washington State, U.S., and Monte Carlo simulations are used in the experiments.
The results indicate that crash severity modeling using synthetic data generated by mixed resampling that combines CTGAN and random under-sampling (CTGAN-RU) outperforms all baseline methods. In addition, the proposed deep generative method maintains distribution consistency and achieves accurate parameter recovery. This study provides insights for traffic safety researchers and engineers into crash severity modeling, especially when handling imbalanced crash data of various types.
KW - Crash severity modeling
KW - Deep learning
KW - Generative adversarial networks
KW - Imbalanced data
KW - Statistical model
UR - http://www.scopus.com/inward/record.url?scp=85193837088&partnerID=8YFLogxK
U2 - 10.1016/j.trc.2024.104642
DO - 10.1016/j.trc.2024.104642
M3 - Article
AN - SCOPUS:85193837088
SN - 0968-090X
VL - 164
JO - Transportation Research Part C: Emerging Technologies
JF - Transportation Research Part C: Emerging Technologies
M1 - 104642
ER -