Feature selection and machine learning approach for carotid atherosclerosis in asymptomatic adults

Tao Liang, Qiao-Li Wang, Xiao-Qin Liu, Zhen Zhou, Scott Lowe, Zi-Heng Chen, Chen-Yu Sun

Research output: Contribution to journal › Article › Research › peer-review


Objective: The presence of carotid atherosclerosis reflects the overall atherosclerotic load and may predict cardiovascular and cerebrovascular accidents. Studies have reported risk factors for carotid atherosclerosis. However, few practical models have been established to predict carotid atherosclerosis risk. Thus, this study was conducted to investigate important features of carotid atherosclerosis and to propose a machine learning-based method for predicting carotid atherosclerosis in asymptomatic adults. Methods: A cross-sectional study was conducted using routine medical check-up data of individuals from January 2019 to January 2020. Pearson’s correlation analysis was performed to assess correlations among the features. Features were then selected using Python’s feature-selection library and analyzed with three algorithms. Multiple machine learning algorithms, including Decision Tree, Random Forest, and Logistic Regression (LR), were used to predict the risk of carotid atherosclerotic plaques, and their precision, accuracy, recall, F1-score, and area under the curve were compared. Results: A total of 150 individuals were enrolled in this study, of whom 30 (20%) were found to have carotid atherosclerotic plaques. Sex, age, body mass index, total cholesterol, systolic blood pressure (SBP), and carbohydrate antigen 724 (CA724) were independently correlated with carotid atherosclerosis. Serum pepsinogen I and pepsinogen II levels were not correlated with carotid intima-media thickness or pulse wave velocity. SBP, diastolic blood pressure, age, low-density lipoprotein, pepsinogen I, pepsinogen II, body mass index, waist, CA724, and uric acid together accounted for a cumulative importance of 0.9, and SBP was the most important feature for carotid atherosclerosis. The LR algorithm achieved a precision of 0.92, recall of 0.91, F1-score of 0.9, and area under the curve of 0.95, showing the best performance in predicting the presence or absence of carotid atherosclerosis in asymptomatic adults.
The code for the analysis in this article was uploaded to GitHub (https://github.com/ganbingliangyi/machine-learning). Conclusions: SBP was the most important feature in the feature ranking, and the LR algorithm showed the best performance in predicting the presence or absence of carotid atherosclerosis in asymptomatic adults.
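The two quantitative ideas in the abstract, keeping features up to a cumulative importance of 0.9 and comparing classifiers by precision, recall, F1-score, accuracy, and area under the curve, can be illustrated with a small standard-library sketch. This is not the authors' code (that is in the GitHub repository above). The function names, the importance values, and the toy labels and scores are all hypothetical and are not study data.

```python
from typing import Dict, List, Sequence, Tuple


def select_by_cumulative_importance(
    importances: Dict[str, float], threshold: float = 0.9
) -> List[Tuple[str, float]]:
    """Rank features by importance and keep the smallest leading set whose
    normalized importances sum to at least `threshold`.
    Returns (feature, cumulative importance) pairs, most important first."""
    total = sum(importances.values())
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    kept: List[Tuple[str, float]] = []
    cumulative = 0.0
    for name, value in ranked:
        cumulative += value / total
        kept.append((name, cumulative))
        if cumulative >= threshold:
            break
    return kept


def classification_metrics(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Dict[str, float]:
    """Precision, recall, F1-score, and accuracy for binary labels (1 = plaque)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed as the probability that a random
    positive case is scored above a random negative case (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical importances in arbitrary units (NOT values from the study).
toy_importances = {"SBP": 35, "age": 25, "BMI": 20, "LDL": 12, "uric_acid": 5, "feature_x": 3}
selected = select_by_cumulative_importance(toy_importances, threshold=0.9)
print([name for name, _ in selected])

# Hypothetical labels and predicted probabilities (NOT study data).
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]
toy_preds = [1 if s >= 0.5 else 0 for s in toy_scores]
print(classification_metrics(toy_labels, toy_preds))
print(roc_auc(toy_labels, toy_scores))
```

In practice, the importances would come from a fitted model and the scores from each classifier's predicted probabilities on held-out data. The 0.5 decision cutoff used here is an assumption, not something stated in the abstract.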
Original language: English
Article number: 22
Number of pages: 7
Journal: Medical Data Mining
Issue number: 4
Publication status: Published - 2022
Externally published: Yes
