TY - JOUR
T1 - TransEFVP
T2 - A Two-Stage Approach for the Prediction of Human Pathogenic Variants Based on Protein Sequence Embedding Fusion
AU - Yan, Zihao
AU - Ge, Fang
AU - Liu, Yan
AU - Zhang, Yumeng
AU - Li, Fuyi
AU - Song, Jiangning
AU - Yu, Dong-Jun
N1 - Funding Information:
This work was supported by the National Natural Science Foundation of China (62372234 and 62072243), the Natural Science Foundation of Jiangsu (BK20201304), and the Natural Science Research Start-up Foundation of Recruiting Talents of Nanjing University of Posts and Telecommunications (Grant No. NY223062).
Publisher Copyright:
© 2024 American Chemical Society.
PY - 2024/2/26
Y1 - 2024/2/26
AB - Studying the effect of single amino acid variations (SAVs) on protein structure and function is integral to advancing our understanding of molecular processes, evolutionary biology, and disease mechanisms. Screening for deleterious variants is a crucial issue in precision medicine. Here, we propose a novel computational approach, TransEFVP, based on large-scale protein language model embeddings and a transformer-based neural network to predict disease-associated SAVs. The model adopts a two-stage architecture: the first stage fuses different feature embeddings through a transformer encoder, and the second stage employs a support vector machine model to quantify the pathogenicity of SAVs after dimensionality reduction. On blind test data, TransEFVP achieves a Matthews correlation coefficient of 0.751, an F1-score of 0.846, and an area under the receiver operating characteristic curve of 0.871, outperforming existing state-of-the-art methods. The benchmark results demonstrate that TransEFVP can be explored as an accurate and effective SAV pathogenicity prediction method. The data and code for TransEFVP are available at https://github.com/yzh9607/TransEFVP/tree/master for academic use.
UR - http://www.scopus.com/inward/record.url?scp=85186170645&partnerID=8YFLogxK
U2 - 10.1021/acs.jcim.3c02019
DO - 10.1021/acs.jcim.3c02019
M3 - Article
C2 - 38334115
AN - SCOPUS:85186170645
SN - 1549-9596
VL - 64
SP - 1407
EP - 1418
JO - Journal of Chemical Information and Modeling
JF - Journal of Chemical Information and Modeling
IS - 4
ER -