Identifying Protein-Nucleotide Binding Residues via Grouped Multi-task Learning and Pre-trained Protein Language Models

Jiashun Wu, Yan Liu, Ying Zhang, Xiaoyu Wang, He Yan, Yiheng Zhu, Jiangning Song, Dong-Jun Yu

Research output: Contribution to journal › Article › Research › peer-review

Abstract

The accurate identification of protein-nucleotide binding residues is crucial for protein function annotation and drug discovery. Numerous computational methods have been proposed to predict these binding residues, achieving remarkable performance. However, due to the limited availability and high variability of nucleotides, predicting binding residues for diverse nucleotides remains a significant challenge. To address this challenge, we propose NucGMTL, a new grouped deep multi-task learning approach designed for predicting binding residues of all observed nucleotides in the BioLiP database. NucGMTL leverages pre-trained protein language models to generate robust sequence embeddings and incorporates multi-scale learning along with scale-based self-attention mechanisms to capture a broader range of feature dependencies. To effectively harness the shared binding patterns across various nucleotides, deep multi-task learning is used to distill common representations, taking advantage of auxiliary information from similar nucleotides selected through task grouping. Performance evaluation on benchmark data sets shows that NucGMTL achieves an average area under the precision-recall curve (AUPRC) of 0.594, surpassing other state-of-the-art methods. Further analyses indicate that the main advantage of NucGMTL lies in its effective integration of grouped multi-task learning and pre-trained protein language models. The data set and source code are freely accessible at: https://github.com/jerry1984Y/NucGMTL.
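
The abstract reports performance as an average AUPRC. As a minimal sketch of how such a score can be computed for per-residue predictions, the Python below implements the step-wise average-precision estimate of the area under the precision-recall curve. The function name `average_precision`, the toy labels and scores, and the simplified handling of tied scores are assumptions for illustration; this is not the evaluation code from the NucGMTL repository, and the abstract does not state which AUPRC estimator the authors used.

```python
def average_precision(labels, scores):
    """Step-wise estimate of the area under the precision-recall curve.

    labels: 0/1 per residue (1 = binding residue).
    scores: predicted binding score per residue (higher = more likely binding).
    Simplification (assumption): tied scores are ranked in input order.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")

    # Rank residues from the highest to the lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    true_positives = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            true_positives += 1
            # Precision at the rank where recall increases by 1 / n_pos.
            total += true_positives / rank
    return total / n_pos


# Hypothetical toy data: four residues, two of which are binding residues.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_auprc = average_precision(toy_labels, toy_scores)
print(round(toy_auprc, 4))
```

Under this sketch, an average over several nucleotide types would be the arithmetic mean of the per-nucleotide values returned by the function.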

Original language: English
Pages (from-to): 1040-1052
Number of pages: 13
Journal: Journal of Chemical Information and Modeling
Volume: 65
Issue number: 2
Publication status: Published - 2025
