PeNGaRoo, a combined gradient boosting and ensemble learning framework for predicting non-classical secreted proteins

Yanju Zhang, Sha Yu, Ruopeng Xie, Jiahui Li, Andre Leier, Tatiana T. Marquez-Lago, Tatsuya Akutsu, A. Ian Smith, Zongyuan Ge, Jiawei Wang, Trevor Lithgow, Jiangning Song

Research output: Contribution to journal, Article, Research, peer-review

Abstract

Motivation
Gram-positive bacteria have developed secretion systems to transport proteins across their cell wall, a process that plays an important role during host infection. These secretion mechanisms have also been harnessed for therapeutic purposes in many biotechnology applications. Accordingly, identifying the features that select a protein for efficient secretion from these microorganisms has become an important task. Among secreted proteins, “non-classical” secreted proteins are difficult to identify because they lack discernible signal peptide sequences and can use diverse secretion pathways. Several computational methods have been developed to facilitate the discovery of such proteins; however, they are based on either simulated or limited experimental datasets, and they often train models on basic features in a simple, coarse-grained manner. The availability of more experimentally validated datasets, advanced feature engineering techniques and novel machine learning approaches creates new opportunities to develop improved predictors of “non-classical” secreted proteins from sequence data.
Results
In this work, we first constructed a high-quality dataset of experimentally verified “non-classical” secreted proteins, which we then used to create benchmark datasets. Using these benchmark datasets, we comprehensively analyzed a wide range of features and assessed their individual performance. Subsequently, we developed a two-layer LightGBM ensemble model that integrates several single-feature-based models into an overall prediction framework. LightGBM, a gradient boosting machine, served as the learning algorithm, and its parameters were optimized with a particle swarm optimization (PSO) strategy. All single-feature-based LightGBM models were then integrated into a unified ensemble model to further improve predictive performance. The final ensemble model achieved an accuracy of 0.900, an F-value of 0.903, a Matthews correlation coefficient of 0.803 and an area under the curve of 0.963, outperforming previous state-of-the-art predictors on the independent test. Based on this optimal ensemble model, we further developed an accessible online predictor, PeNGaRoo. We believe this web server, together with our proposed methodology, will expedite the discovery of non-classically secreted effector proteins in Gram-positive bacteria and further inspire the development of next-generation predictors.
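For readers unfamiliar with the four reported measures, the following is a minimal sketch of how such values are typically computed for a binary classifier. It is not the authors' evaluation code. It assumes that "F-value" means the F1 score (the harmonic mean of precision and recall) and that the area under the curve refers to the ROC curve; the function names and the example counts are hypothetical and are not taken from the paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, F-value (assumed to be F1) and Matthews correlation
    coefficient computed from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_value = 2 * precision * recall / (precision + recall)
    else:
        f_value = 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "f_value": f_value, "mcc": mcc}


def roc_auc(scores, labels):
    """ROC AUC as the fraction of (positive, negative) pairs in which the
    positive example receives the higher score; ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical counts and scores, for illustration only.
example = binary_metrics(tp=45, fp=5, tn=40, fn=10)
example_auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

The Matthews correlation coefficient uses all four cells of the confusion matrix, so it remains informative when the positive and negative classes differ in size, which is why it is often reported alongside accuracy.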
Original language: English
Article number: btz629
Number of pages: 8
Journal: Bioinformatics
DOI: 10.1093/bioinformatics/btz629
Publication status: Accepted/In press - 8 Aug 2019

Cite this

Zhang, Yanju ; Yu, Sha ; Xie, Ruopeng ; Li, Jiahui ; Leier, Andre ; Marquez-Lago, Tatiana T. ; Akutsu, Tatsuya ; Smith, A. Ian ; Ge, Zongyuan ; Wang, Jiawei ; Lithgow, Trevor ; Song, Jiangning. / PeNGaRoo, a combined gradient boosting and ensemble learning framework for predicting non-classical secreted proteins. In: Bioinformatics. 2019.
@article{9a53b1d838764365a7b41064727571e0,
title = "PeNGaRoo, a combined gradient boosting and ensemble learning framework for predicting non-classical secreted proteins",
author = "Yanju Zhang and Sha Yu and Ruopeng Xie and Jiahui Li and Andre Leier and Marquez-Lago, {Tatiana T.} and Tatsuya Akutsu and Smith, {A. Ian} and Zongyuan Ge and Jiawei Wang and Trevor Lithgow and Jiangning Song",
year = "2019",
month = "8",
day = "8",
doi = "10.1093/bioinformatics/btz629",
language = "English",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press, USA",

}
