Computational analysis and prediction of lysine malonylation sites by exploiting informative features in an integrative machine-learning framework

Yanju Zhang, Ruopeng Xie, Jiawei Wang, Andre Leier, Tatiana T. Marquez-Lago, Tatsuya Akutsu, Geoffrey I. Webb, Kuo-Chen Chou, Jiangning Song

Research output: Contribution to journalArticleResearchpeer-review

Abstract

As a newly discovered post-translational modification (PTM), lysine malonylation (Kmal) regulates a myriad of cellular processes from prokaryotes to eukaryotes and has important implications in human diseases. Despite its functional significance, computational methods to accurately identify malonylation sites are still lacking and urgently needed. In particular, there is currently no comprehensive analysis and assessment of different features and machine learning (ML) methods that are required for constructing the necessary prediction models. Here, we review, analyze and compare 11 different feature encoding methods, with the goal of extracting key patterns and characteristics from residue sequences of Kmal sites. We identify optimized feature sets, with which four commonly used ML methods (random forest, support vector machines, K-nearest neighbor and logistic regression) and one recently proposed [Light Gradient Boosting Machine (LightGBM)] are trained on data from three species, namely, Escherichia coli, Mus musculus and Homo sapiens, and compared using randomized 10-fold cross-validation tests. We show that integration of the single method-based models through ensemble learning further improves the prediction performance and model robustness on the independent test. When compared to the existing state-of-the-art predictor, MaloPred, the optimal ensemble models were more accurate for all three species (AUC: 0.930, 0.923 and 0.944 for E. coli, M. musculus and H. sapiens, respectively). Using the ensemble models, we developed an accessible online predictor, kmal-sp, available at http://kmalsp.erc.monash.edu.ezproxy.lib.monash.edu.au/. We hope that this comprehensive survey and the proposed strategy for building more accurate models can serve as a useful guide for inspiring future developments of computational methods for PTM site prediction, expedite the discovery of new malonylation and other PTM types and facilitate hypothesis-driven experimental validation of novel malonylated substrates and malonylation sites.
Original languageEnglish
Article numberbby079
Number of pages15
JournalBriefings in Bioinformatics
DOIs
Publication statusAccepted/In press - 2019

Keywords

  • lysine malonylation
  • computational prediction
  • feature encoding methods
  • machine learning
  • ensemble learning
  • light gradient boosting machine

Cite this

@article{85b0fbccd96d49319f34aa918d71fb8e,
title = "Computational analysis and prediction of lysine malonylation sites by exploiting informative features in an integrative machine-learning framework",
abstract = "As a newly discovered post-translational modification (PTM), lysine malonylation (Kmal) regulates a myriad of cellular processes from prokaryotes to eukaryotes and has important implications in human diseases. Despite its functional significance, computational methods to accurately identify malonylation sites are still lacking and urgently needed. In particular, there is currently no comprehensive analysis and assessment of different features and machine learning (ML) methods that are required for constructing the necessary prediction models. Here, we review, analyze and compare 11 different feature encoding methods, with the goal of extracting key patterns and characteristics from residue sequences of Kmal sites. We identify optimized feature sets, with which four commonly used ML methods (random forest, support vector machines, K-nearest neighbor and logistic regression) and one recently proposed [Light Gradient Boosting Machine (LightGBM)] are trained on data from three species, namely, Escherichia coli, Mus musculus and Homo sapiens, and compared using randomized 10-fold cross-validation tests. We show that integration of the single method-based models through ensemble learning further improves the prediction performance and model robustness on the independent test. When compared to the existing state-of-the-art predictor, MaloPred, the optimal ensemble models were more accurate for all three species (AUC: 0.930, 0.923 and 0.944 for E. coli, M. musculus and H. sapiens, respectively). Using the ensemble models, we developed an accessible online predictor, kmal-sp, available at http://kmalsp.erc.monash.edu.ezproxy.lib.monash.edu.au/. We hope that this comprehensive survey and the proposed strategy for building more accurate models can serve as a useful guide for inspiring future developments of computational methods for PTM site prediction, expedite the discovery of new malonylation and other PTM types and facilitate hypothesis-driven experimental validation of novel malonylated substrates and malonylation sites.",
keywords = "lysine malonylation, computational prediction, feature encoding methods, machine learning, ensemble learning, light gradient boosting machine",
author = "Yanju Zhang and Ruopeng Xie and Jiawei Wang and Andre Leier and Marquez-Lago, {Tatiana T.} and Tatsuya Akutsu and Webb, {Geoffrey I.} and Kuo-Chen Chou and Jiangning Song",
year = "2019",
doi = "10.1093/bib/bby079",
language = "English",
journal = "Briefings in Bioinformatics",
issn = "1467-5463",
publisher = "Oxford Univ Press",

}

Computational analysis and prediction of lysine malonylation sites by exploiting informative features in an integrative machine-learning framework. / Zhang, Yanju; Xie, Ruopeng; Wang, Jiawei ; Leier, Andre; Marquez-Lago, Tatiana T.; Akutsu, Tatsuya; Webb, Geoffrey I.; Chou, Kuo-Chen; Song, Jiangning.

In: Briefings in Bioinformatics, 2019.

Research output: Contribution to journalArticleResearchpeer-review

TY - JOUR

T1 - Computational analysis and prediction of lysine malonylation sites by exploiting informative features in an integrative machine-learning framework

AU - Zhang, Yanju

AU - Xie, Ruopeng

AU - Wang, Jiawei

AU - Leier, Andre

AU - Marquez-Lago, Tatiana T.

AU - Akutsu, Tatsuya

AU - Webb, Geoffrey I.

AU - Chou, Kuo-Chen

AU - Song, Jiangning

PY - 2019

Y1 - 2019

N2 - As a newly discovered post-translational modification (PTM), lysine malonylation (Kmal) regulates a myriad of cellular processes from prokaryotes to eukaryotes and has important implications in human diseases. Despite its functional significance, computational methods to accurately identify malonylation sites are still lacking and urgently needed. In particular, there is currently no comprehensive analysis and assessment of different features and machine learning (ML) methods that are required for constructing the necessary prediction models. Here, we review, analyze and compare 11 different feature encoding methods, with the goal of extracting key patterns and characteristics from residue sequences of Kmal sites. We identify optimized feature sets, with which four commonly used ML methods (random forest, support vector machines, K-nearest neighbor and logistic regression) and one recently proposed [Light Gradient Boosting Machine (LightGBM)] are trained on data from three species, namely, Escherichia coli, Mus musculus and Homo sapiens, and compared using randomized 10-fold cross-validation tests. We show that integration of the single method-based models through ensemble learning further improves the prediction performance and model robustness on the independent test. When compared to the existing state-of-the-art predictor, MaloPred, the optimal ensemble models were more accurate for all three species (AUC: 0.930, 0.923 and 0.944 for E. coli, M. musculus and H. sapiens, respectively). Using the ensemble models, we developed an accessible online predictor, kmal-sp, available at http://kmalsp.erc.monash.edu.ezproxy.lib.monash.edu.au/. We hope that this comprehensive survey and the proposed strategy for building more accurate models can serve as a useful guide for inspiring future developments of computational methods for PTM site prediction, expedite the discovery of new malonylation and other PTM types and facilitate hypothesis-driven experimental validation of novel malonylated substrates and malonylation sites.

AB - As a newly discovered post-translational modification (PTM), lysine malonylation (Kmal) regulates a myriad of cellular processes from prokaryotes to eukaryotes and has important implications in human diseases. Despite its functional significance, computational methods to accurately identify malonylation sites are still lacking and urgently needed. In particular, there is currently no comprehensive analysis and assessment of different features and machine learning (ML) methods that are required for constructing the necessary prediction models. Here, we review, analyze and compare 11 different feature encoding methods, with the goal of extracting key patterns and characteristics from residue sequences of Kmal sites. We identify optimized feature sets, with which four commonly used ML methods (random forest, support vector machines, K-nearest neighbor and logistic regression) and one recently proposed [Light Gradient Boosting Machine (LightGBM)] are trained on data from three species, namely, Escherichia coli, Mus musculus and Homo sapiens, and compared using randomized 10-fold cross-validation tests. We show that integration of the single method-based models through ensemble learning further improves the prediction performance and model robustness on the independent test. When compared to the existing state-of-the-art predictor, MaloPred, the optimal ensemble models were more accurate for all three species (AUC: 0.930, 0.923 and 0.944 for E. coli, M. musculus and H. sapiens, respectively). Using the ensemble models, we developed an accessible online predictor, kmal-sp, available at http://kmalsp.erc.monash.edu.ezproxy.lib.monash.edu.au/. We hope that this comprehensive survey and the proposed strategy for building more accurate models can serve as a useful guide for inspiring future developments of computational methods for PTM site prediction, expedite the discovery of new malonylation and other PTM types and facilitate hypothesis-driven experimental validation of novel malonylated substrates and malonylation sites.

KW - lysine malonylation

KW - computational prediction

KW - feature encoding methods

KW - machine learning

KW - ensemble learning

KW - light gradient boosting machine

U2 - 10.1093/bib/bby079

DO - 10.1093/bib/bby079

M3 - Article

JO - Briefings in Bioinformatics

JF - Briefings in Bioinformatics

SN - 1467-5463

M1 - bby079

ER -