Abstract
Single amino acid polymorphisms (SAPs) are the most abundant form of known genetic variation associated with human diseases. It is of great interest to study the sequence-structure-function relationship underlying SAPs. In this work, we collected human variant data from three databases and divided them into three categories, i.e. cancer somatic mutations (CSM), Mendelian disease-related variants (SVD) and neutral polymorphisms (SVP). We built support vector machine (SVM) classifiers to predict these three classes of SAPs, using the optimal features selected by a random forest algorithm. Initially, 280 sequence-derived and structural features were extracted from the curated datasets, from which 18 optimal candidate features were then selected by random forest. Furthermore, we performed a stepwise feature selection to identify the characteristic sequence and structural features that are important for predicting each SAP class. As a result, our predictors achieved a prediction accuracy (ACC) of 84.97, 96.93, 86.98 and 88.24, for the three classes, CSM, SVD and SVP, respectively. Performance comparison with previously developed tools such as SIFT, SNAP and Polyphen2 indicates that our method provides favorable performance, with higher sensitivity scores and Matthews correlation coefficients (MCC). These results indicate that the prediction performance of SAP classifiers can be effectively improved by feature selection. Moreover, dividing SAPs into three categories and constructing accurate SVM-based classifiers for each class provide a practically useful way of investigating the difference between Mendelian disease-related variants and cancer somatic mutations.
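The evaluation measures named in the abstract (ACC, sensitivity and MCC) can be illustrated with a minimal sketch that uses their standard textbook definitions for a binary classifier. The function name `binary_metrics` and the confusion-matrix counts below are hypothetical and chosen only for illustration; they are not taken from the paper, and the authors' exact evaluation protocol is not described in this record.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return ACC, sensitivity and MCC from binary confusion-matrix counts.

    Hypothetical helper for illustration; uses the standard definitions,
    not necessarily the exact procedure used in the paper.
    """
    total = tp + fp + tn + fn
    # Accuracy: fraction of all predictions that are correct.
    acc = (tp + tn) / total if total else 0.0
    # Sensitivity (recall): fraction of true positives that are recovered.
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    # Matthews correlation coefficient; defined as 0 when the denominator is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "Sensitivity": sens, "MCC": mcc}


if __name__ == "__main__":
    # Illustrative counts only (assumed, not from the paper).
    print(binary_metrics(tp=90, fp=10, tn=80, fn=20))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which makes it less sensitive to class imbalance; this is one reason it is commonly reported alongside accuracy and sensitivity for variant predictors.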
Original language | English |
---|---|
Title of host publication | 2011 IEEE Conference on Systems Biology 2011 |
Editors | L Chen, X S Zhang, Y Wang |
Place of Publication | China |
Publisher | IEEE, Institute of Electrical and Electronics Engineers |
Pages | 18 - 26 |
Number of pages | 9 |
ISBN (Print) | 9781457716669 |
Publication status | Published - 2011 |
Event | International Conference on Computational Systems Biology (ISB) 2011 - Zhuhai, China; Duration: 2 Sept 2011 → 4 Sept 2011; Conference number: 5th |
Conference
Conference | International Conference on Computational Systems Biology (ISB) 2011 |
---|---|
Abbreviated title | ISB 2011 |
Country/Territory | China |
City | Zhuhai |
Period | 2/09/11 → 4/09/11 |