FunSAV: predicting the functional effect of single amino acid variants using a two-stage random forest model

Mingjun Wang, Xing-Ming Zhao, Kazuhiro Takemoto, Haisong Xu, Tatsuya Akutsu, Jiangning Song

Research output: Contribution to journal › Article › Research › peer-review

33 Citations (Scopus)

Abstract

Single amino acid variants (SAVs) are the most abundant form of known genetic variation associated with human disease. Successful prediction of the functional impact of SAVs from sequences can thus improve our understanding of why a SAV may be associated with a particular disease. In this work, we constructed a structural dataset containing 679 high-quality protein structures with 2,048 SAVs by collecting human genetic variant data from multiple resources and dividing the variants into two categories: disease-associated and neutral. We built a two-stage random forest (RF) model, termed FunSAV, to predict the functional effect of SAVs by combining sequence, structure and residue-contact network features with additional features that were not explored in previous studies. Importantly, a two-step feature selection procedure was proposed to select the most informative features that contribute to the prediction of disease association of SAVs. In cross-validation experiments on the benchmark dataset, FunSAV achieved an area under the curve (AUC) of 0.882, which is competitive with, and in some cases better than, existing tools including SIFT, SNAP, Polyphen2, PANTHER, nsSNPAnalyzer and PhD-SNP.
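
For readers unfamiliar with the AUC metric reported above, the following is a minimal sketch of how an AUC can be computed from classifier scores. It is not the authors' evaluation code: the function name `auc`, the label encoding (1 for disease-associated, 0 for neutral) and the toy scores are all assumptions made for illustration. It uses the standard interpretation of AUC as the fraction of positive/negative pairs in which the positive example receives the higher score, with ties counted as half.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form)."""
    # Split scores by class; the 1/0 encoding is an assumption for this sketch.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count correctly ordered positive/negative pairs; a tie counts as half.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy, made-up scores (not data from the paper).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
toy_auc = auc(labels, scores)
```

Under this reading, an AUC of 0.882 would mean that a randomly chosen disease-associated variant is scored above a randomly chosen neutral variant roughly 88% of the time; an AUC of 0.5 corresponds to random ranking.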
Original language: English
Article number: e43847
Number of pages: 14
Journal: PLoS ONE
Volume: 7
Issue number: 8
Publication status: Published - 2012
