TY - JOUR
T1 - Predicting factors for survival of breast cancer patients using machine learning techniques
AU - Ganggayah, Mogana Darshini
AU - Taib, Nur Aishah
AU - Har, Yip Cheng
AU - Lio, Pietro
AU - Dhillon, Sarinder Kaur
N1 - Funding Information:
The University of Malaya Prototype Research Grant Scheme (PR001-2017A) and the High Impact Research (HIR) Grant (UM.C/HIR/MOHE/06) from the Ministry of Higher Education, Malaysia funded this study by providing facilities for data management in the University Malaya Medical Centre and the Data Science and Bioinformatics Laboratory, University of Malaya.
Publisher Copyright:
© 2019 The Author(s).
PY - 2019
Y1 - 2019
N2 - Background: Breast cancer is one of the most common diseases in women worldwide. Many studies have been conducted to predict survival indicators; however, most of these analyses were performed using basic statistical methods. As an alternative, this study used machine learning techniques to build models for detecting and visualising significant prognostic indicators of breast cancer survival rate. Methods: A large hospital-based breast cancer dataset retrieved from the University Malaya Medical Centre, Kuala Lumpur, Malaysia (n = 8066), with diagnosis information between 1993 and 2016, was used in this study. The dataset contained 23 predictor variables and one dependent variable, which referred to the survival status of the patients (alive or dead). To determine the significant prognostic factors of breast cancer survival rate, prediction models were built using decision tree, random forest, neural networks, extreme boost, logistic regression, and support vector machine. Next, the dataset was clustered based on the receptor status of breast cancer patients identified via immunohistochemistry to perform advanced modelling using random forest. Subsequently, the important variables were ranked via variable selection methods in random forest. Finally, decision trees were built and validation was performed using survival analysis. Results: In terms of both model accuracy and calibration measure, all algorithms produced close outcomes, with the lowest obtained from decision tree (accuracy = 79.8%) and the highest from random forest (accuracy = 82.7%). The important variables identified in this study were cancer stage classification, tumour size, number of total axillary lymph nodes removed, number of positive lymph nodes, types of primary treatment, and methods of diagnosis. Conclusion: Interestingly, the various machine learning algorithms used in this study yielded close accuracy; hence, these methods could be used as alternative predictive tools in breast cancer survival studies, particularly in the Asian region. The important prognostic factors influencing the survival rate of breast cancer identified in this study, which were validated by survival curves, are useful and could be translated into decision support tools in the medical domain.
AB - Background: Breast cancer is one of the most common diseases in women worldwide. Many studies have been conducted to predict survival indicators; however, most of these analyses were performed using basic statistical methods. As an alternative, this study used machine learning techniques to build models for detecting and visualising significant prognostic indicators of breast cancer survival rate. Methods: A large hospital-based breast cancer dataset retrieved from the University Malaya Medical Centre, Kuala Lumpur, Malaysia (n = 8066), with diagnosis information between 1993 and 2016, was used in this study. The dataset contained 23 predictor variables and one dependent variable, which referred to the survival status of the patients (alive or dead). To determine the significant prognostic factors of breast cancer survival rate, prediction models were built using decision tree, random forest, neural networks, extreme boost, logistic regression, and support vector machine. Next, the dataset was clustered based on the receptor status of breast cancer patients identified via immunohistochemistry to perform advanced modelling using random forest. Subsequently, the important variables were ranked via variable selection methods in random forest. Finally, decision trees were built and validation was performed using survival analysis. Results: In terms of both model accuracy and calibration measure, all algorithms produced close outcomes, with the lowest obtained from decision tree (accuracy = 79.8%) and the highest from random forest (accuracy = 82.7%). The important variables identified in this study were cancer stage classification, tumour size, number of total axillary lymph nodes removed, number of positive lymph nodes, types of primary treatment, and methods of diagnosis. Conclusion: Interestingly, the various machine learning algorithms used in this study yielded close accuracy; hence, these methods could be used as alternative predictive tools in breast cancer survival studies, particularly in the Asian region. The important prognostic factors influencing the survival rate of breast cancer identified in this study, which were validated by survival curves, are useful and could be translated into decision support tools in the medical domain.
KW - Data science
KW - Decision tree
KW - Factors influencing survival of breast cancer
KW - Machine learning
KW - Random forest
UR - http://www.scopus.com/inward/record.url?scp=85063385470&partnerID=8YFLogxK
U2 - 10.1186/s12911-019-0801-4
DO - 10.1186/s12911-019-0801-4
M3 - Article
C2 - 30902088
AN - SCOPUS:85063385470
SN - 1472-6947
VL - 19
JO - BMC Medical Informatics and Decision Making
JF - BMC Medical Informatics and Decision Making
M1 - 48
ER -