TY - JOUR
T1 - Digerati – A multipath parallel hybrid deep learning framework for the identification of mycobacterial PE/PPE proteins
AU - Li, Fuyi
AU - Guo, Xudong
AU - Bi, Yue
AU - Jia, Runchang
AU - Pitt, Miranda E.
AU - Pan, Shirui
AU - Li, Shuqin
AU - Gasser, Robin B.
AU - Coin, Lachlan JM
AU - Song, Jiangning
N1 - Funding Information:
This work is supported by the National Natural Science Foundation of China (grant 62202388), the National Key Research and Development Program of China (No. 2022YFF1000104), the Qin Chuangyuan Innovation and Entrepreneurship Talent Project, Shaanxi Province, China (grant QCYRCXM-2022-230), and Talent Research Funding at Northwest A&F University (No. Z1090222021).
Publisher Copyright:
© 2023 Elsevier Ltd
PY - 2023/9
Y1 - 2023/9
N2 - The genome of Mycobacterium tuberculosis contains a relatively high percentage (10%) of genes that are poorly characterised because of their highly repetitive nature and high GC content. Some of these genes encode proteins of the PE/PPE family, which are thought to be involved in host-pathogen interactions, virulence, and disease pathogenicity. Members of this family are genetically divergent and challenging to both identify and classify using conventional computational tools. Thus, advanced in silico methods are needed to efficiently identify proteins of this family for subsequent functional annotation. In this study, we developed the first deep learning-based approach, termed Digerati, for the rapid and accurate identification of PE and PPE family proteins. Digerati was built upon a multipath parallel hybrid deep learning framework, which combines multi-layer convolutional neural networks with bidirectional long short-term memory and a self-attention module to effectively learn the higher-order feature representations of PE/PPE proteins. Empirical studies demonstrated that Digerati achieved significantly better performance (∼18–20%) than alignment-based approaches, including BLASTP, PHMMER, and HHsuite, in both prediction accuracy and speed. Digerati is anticipated to facilitate community-wide efforts to conduct high-throughput identification and analysis of PE/PPE family members. The webserver and source code of Digerati are publicly available at http://web.unimelb-bioinfortools.cloud.edu.au/Digerati/.
AB - The genome of Mycobacterium tuberculosis contains a relatively high percentage (10%) of genes that are poorly characterised because of their highly repetitive nature and high GC content. Some of these genes encode proteins of the PE/PPE family, which are thought to be involved in host-pathogen interactions, virulence, and disease pathogenicity. Members of this family are genetically divergent and challenging to both identify and classify using conventional computational tools. Thus, advanced in silico methods are needed to efficiently identify proteins of this family for subsequent functional annotation. In this study, we developed the first deep learning-based approach, termed Digerati, for the rapid and accurate identification of PE and PPE family proteins. Digerati was built upon a multipath parallel hybrid deep learning framework, which combines multi-layer convolutional neural networks with bidirectional long short-term memory and a self-attention module to effectively learn the higher-order feature representations of PE/PPE proteins. Empirical studies demonstrated that Digerati achieved significantly better performance (∼18–20%) than alignment-based approaches, including BLASTP, PHMMER, and HHsuite, in both prediction accuracy and speed. Digerati is anticipated to facilitate community-wide efforts to conduct high-throughput identification and analysis of PE/PPE family members. The webserver and source code of Digerati are publicly available at http://web.unimelb-bioinfortools.cloud.edu.au/Digerati/.
KW - Bioinformatics
KW - Deep learning
KW - Mycobacterial
KW - PE/PPE protein
KW - Sequence analysis
UR - https://www.scopus.com/pages/publications/85162897295
U2 - 10.1016/j.compbiomed.2023.107155
DO - 10.1016/j.compbiomed.2023.107155
M3 - Article
C2 - 37356289
AN - SCOPUS:85162897295
SN - 0010-4825
VL - 163
JO - Computers in Biology and Medicine
JF - Computers in Biology and Medicine
M1 - 107155
ER -