Twenty years of bioinformatics research for protease-specific substrate and cleavage site prediction: a comprehensive revisit and benchmarking of existing methods

Fuyi Li, Yanan Wang, Chen Li, Tatiana T. Marquez-Lago, Andre Leier, Neil D. Rawlings, Gholamreza Haffari, Jerico Revote, Tatsuya Akutsu, Kuo-Chen Chou, Anthony W. Purcell, Robert N. Pike, Geoffrey I. Webb, A. Ian Smith, Trevor Lithgow, Roger J. Daly, James C. Whisstock, Jiangning Song

Research output: Contribution to journal, Article, Research, peer-review

Abstract

The roles of proteolytic cleavage have been intensively investigated and discussed during the past two decades. This irreversible chemical process has been frequently reported to influence a number of crucial biological processes (BPs), such as cell cycle, protein regulation and inflammation. A number of advanced studies have been published aiming at deciphering the mechanisms of proteolytic cleavage. Given its significance and the large number of functionally enriched substrates targeted by specific proteases, many computational approaches have been established for accurate prediction of protease-specific substrates and their cleavage sites. Consequently, there is an urgent need to systematically assess the state-of-the-art computational approaches for protease-specific cleavage site prediction to further advance the existing methodologies and to improve the prediction performance. With this goal in mind, in this article, we carefully evaluated a total of 19 computational methods (including 8 scoring function-based methods and 11 machine learning-based methods) in terms of their underlying algorithm, calculated features, performance evaluation and software usability. Then, extensive independent tests were performed to assess the robustness and scalability of the reviewed methods using our carefully prepared independent test data sets with 3641 cleavage sites (specific to 10 proteases). The comparative experimental results demonstrate that PROSPERous is the most accurate generic method for predicting eight protease-specific cleavage sites, while GPS-CCD and LabCaS outperformed other predictors for calpain-specific cleavage sites. Based on our review, we then outlined some potential ways to improve the prediction performance and ease the computational burden by applying ensemble learning, deep learning, positive unlabeled learning and parallel and distributed computing techniques. 
We anticipate that our study will serve as a practical and useful guide for interested readers to further advance next-generation bioinformatics tools for protease-specific cleavage site prediction.
Original language: English
Article number: bby077
Number of pages: 17
Journal: Briefings in Bioinformatics
DOI: 10.1093/bib/bby077
Publication status: Accepted/In press - 2018

Keywords

  • protease
  • substrate specificity
  • substrate cleavage
  • bioinformatics
  • sequence analysis
  • machine learning
  • prediction model

Cite this

@article{1f7220556471479a8e702e68e04cd5e0,
title = "Twenty years of bioinformatics research for protease-specific substrate and cleavage site prediction: a comprehensive revisit and benchmarking of existing methods",
abstract = "The roles of proteolytic cleavage have been intensively investigated and discussed during the past two decades. This irreversible chemical process has been frequently reported to influence a number of crucial biological processes (BPs), such as cell cycle, protein regulation and inflammation. A number of advanced studies have been published aiming at deciphering the mechanisms of proteolytic cleavage. Given its significance and the large number of functionally enriched substrates targeted by specific proteases, many computational approaches have been established for accurate prediction of protease-specific substrates and their cleavage sites. Consequently, there is an urgent need to systematically assess the state-of-the-art computational approaches for protease-specific cleavage site prediction to further advance the existing methodologies and to improve the prediction performance. With this goal in mind, in this article, we carefully evaluated a total of 19 computational methods (including 8 scoring function-based methods and 11 machine learning-based methods) in terms of their underlying algorithm, calculated features, performance evaluation and software usability. Then, extensive independent tests were performed to assess the robustness and scalability of the reviewed methods using our carefully prepared independent test data sets with 3641 cleavage sites (specific to 10 proteases). The comparative experimental results demonstrate that PROSPERous is the most accurate generic method for predicting eight protease-specific cleavage sites, while GPS-CCD and LabCaS outperformed other predictors for calpain-specific cleavage sites. Based on our review, we then outlined some potential ways to improve the prediction performance and ease the computational burden by applying ensemble learning, deep learning, positive unlabeled learning and parallel and distributed computing techniques. 
We anticipate that our study will serve as a practical and useful guide for interested readers to further advance next-generation bioinformatics tools for protease-specific cleavage site prediction.",
keywords = "protease, substrate specificity, substrate cleavage, bioinformatics, sequence analysis, machine learning, prediction model",
author = "Fuyi Li and Yanan Wang and Chen Li and Marquez-Lago, {Tatiana T.} and Andre Leier and Rawlings, {Neil D.} and Gholamreza Haffari and Jerico Revote and Tatsuya Akutsu and Kuo-Chen Chou and Purcell, {Anthony W.} and Pike, {Robert N.} and Webb, {Geoffrey I.} and Smith, {A. Ian} and Trevor Lithgow and Daly, {Roger J.} and Whisstock, {James C.} and Jiangning Song",
year = "2018",
doi = "10.1093/bib/bby077",
language = "English",
journal = "Briefings in Bioinformatics",
issn = "1467-5463",
publisher = "Oxford Univ Press",
}

TY - JOUR
T1 - Twenty years of bioinformatics research for protease-specific substrate and cleavage site prediction
T2 - a comprehensive revisit and benchmarking of existing methods
AU - Li, Fuyi
AU - Wang, Yanan
AU - Li, Chen
AU - Marquez-Lago, Tatiana T.
AU - Leier, Andre
AU - Rawlings, Neil D.
AU - Haffari, Gholamreza
AU - Revote, Jerico
AU - Akutsu, Tatsuya
AU - Chou, Kuo-Chen
AU - Purcell, Anthony W.
AU - Pike, Robert N.
AU - Webb, Geoffrey I.
AU - Smith, A. Ian
AU - Lithgow, Trevor
AU - Daly, Roger J.
AU - Whisstock, James C.
AU - Song, Jiangning
PY - 2018
Y1 - 2018
AB - The roles of proteolytic cleavage have been intensively investigated and discussed during the past two decades. This irreversible chemical process has been frequently reported to influence a number of crucial biological processes (BPs), such as cell cycle, protein regulation and inflammation. A number of advanced studies have been published aiming at deciphering the mechanisms of proteolytic cleavage. Given its significance and the large number of functionally enriched substrates targeted by specific proteases, many computational approaches have been established for accurate prediction of protease-specific substrates and their cleavage sites. Consequently, there is an urgent need to systematically assess the state-of-the-art computational approaches for protease-specific cleavage site prediction to further advance the existing methodologies and to improve the prediction performance. With this goal in mind, in this article, we carefully evaluated a total of 19 computational methods (including 8 scoring function-based methods and 11 machine learning-based methods) in terms of their underlying algorithm, calculated features, performance evaluation and software usability. Then, extensive independent tests were performed to assess the robustness and scalability of the reviewed methods using our carefully prepared independent test data sets with 3641 cleavage sites (specific to 10 proteases). The comparative experimental results demonstrate that PROSPERous is the most accurate generic method for predicting eight protease-specific cleavage sites, while GPS-CCD and LabCaS outperformed other predictors for calpain-specific cleavage sites. Based on our review, we then outlined some potential ways to improve the prediction performance and ease the computational burden by applying ensemble learning, deep learning, positive unlabeled learning and parallel and distributed computing techniques. 
We anticipate that our study will serve as a practical and useful guide for interested readers to further advance next-generation bioinformatics tools for protease-specific cleavage site prediction.

KW - protease
KW - substrate specificity
KW - substrate cleavage
KW - bioinformatics
KW - sequence analysis
KW - machine learning
KW - prediction model
U2 - 10.1093/bib/bby077
DO - 10.1093/bib/bby077
M3 - Article
JO - Briefings in Bioinformatics
JF - Briefings in Bioinformatics
SN - 1467-5463
M1 - bby077
ER -