TransEFVP: A Two-Stage Approach for the Prediction of Human Pathogenic Variants Based on Protein Sequence Embedding Fusion

Zihao Yan, Fang Ge, Yan Liu, Yumeng Zhang, Fuyi Li, Jiangning Song, Dong-Jun Yu

Research output: Contribution to journalArticleResearchpeer-review

2 Citations (Scopus)

Abstract

Studying the effect of single amino acid variations (SAVs) on protein structure and function is integral to advancing our understanding of molecular processes, evolutionary biology, and disease mechanisms. Screening for deleterious variants is one of the crucial issues in precision medicine. Here, we propose a novel computational approach, TransEFVP, based on large-scale protein language model embeddings and a transformer-based neural network to predict disease-associated SAVs. The model adopts a two-stage architecture: the first stage is designed to fuse different feature embeddings through a transformer encoder. In the second stage, a support vector machine model is employed to quantify the pathogenicity of SAVs after dimensionality reduction. The prediction performance of TransEFVP on blind test data achieves a Matthews correlation coefficient of 0.751, an F1-score of 0.846, and an area under the receiver operating characteristic curve of 0.871, higher than the existing state-of-the-art methods. The benchmark results demonstrate that TransEFVP can be explored as an accurate and effective SAV pathogenicity prediction method. The data and codes for TransEFVP are available at https://github.com/yzh9607/TransEFVP/tree/master for academic use.

Original languageEnglish
Pages (from-to)1407-1418
Number of pages12
JournalJournal of Chemical Information and Modeling
Volume64
Issue number4
DOIs
Publication statusPublished - 26 Feb 2024

Cite this