GlycoMine: a machine learning-based approach for predicting N-, C- and O-linked glycosylation in the human proteome

Fuyi Li, Chen Li, Mingjun Wang, Geoffrey I Webb, Yang Zhang, James C Whisstock, Jiangning Song

Research output: Contribution to journalArticleResearchpeer-review

50 Citations (Scopus)

Abstract

MOTIVATION: Glycosylation is a ubiquitous type of protein post-translational modification (PTM) in eukaryotic cells, which plays vital roles in various biological processes (BPs) such as cellular communication, ligand recognition and subcellular recognition. It is estimated that >50 of the entire human proteome is glycosylated. However, it is still a significant challenge to identify glycosylation sites, which requires expensive/laborious experimental research. Thus, bioinformatics approaches that can predict the glycan occupancy at specific sequons in protein sequences would be useful for understanding and utilizing this important PTM. RESULTS: In this study, we present a novel bioinformatics tool called GlycoMine, which is a comprehensive tool for the systematic in silico identification of C-linked, N-linked, and O-linked glycosylation sites in the human proteome. GlycoMine was developed using the random forest algorithm and evaluated based on a well-prepared up-to-date benchmark dataset that encompasses all three types of glycosylation sites, which was curated from multiple public resources. Heterogeneous sequences and functional features were derived from various sources, and subjected to further two-step feature selection to characterize a condensed subset of optimal features that contributed most to the type-specific prediction of glycosylation sites. Five-fold cross-validation and independent tests show that this approach significantly improved the prediction performance compared with four existing prediction tools: NetNGlyc, NetOGlyc, EnsembleGly and GPP. We demonstrated that this tool could identify candidate glycosylation sites in case study proteins and applied it to identify many high-confidence glycosylation target proteins by screening the entire human proteome.
Original languageEnglish
Pages (from-to)1411 - 1419
Number of pages9
JournalBioinformatics
Volume31
Issue number9
DOIs
Publication statusPublished - 2015

Cite this

@article{0f2ff545fe274ca5a28bb0318c4e6aed,
title = "GlycoMine: a machine learning-based approach for predicting N-, C- and O-linked glycosylation in the human proteome",
abstract = "MOTIVATION: Glycosylation is a ubiquitous type of protein post-translational modification (PTM) in eukaryotic cells, which plays vital roles in various biological processes (BPs) such as cellular communication, ligand recognition and subcellular recognition. It is estimated that >50 of the entire human proteome is glycosylated. However, it is still a significant challenge to identify glycosylation sites, which requires expensive/laborious experimental research. Thus, bioinformatics approaches that can predict the glycan occupancy at specific sequons in protein sequences would be useful for understanding and utilizing this important PTM. RESULTS: In this study, we present a novel bioinformatics tool called GlycoMine, which is a comprehensive tool for the systematic in silico identification of C-linked, N-linked, and O-linked glycosylation sites in the human proteome. GlycoMine was developed using the random forest algorithm and evaluated based on a well-prepared up-to-date benchmark dataset that encompasses all three types of glycosylation sites, which was curated from multiple public resources. Heterogeneous sequences and functional features were derived from various sources, and subjected to further two-step feature selection to characterize a condensed subset of optimal features that contributed most to the type-specific prediction of glycosylation sites. Five-fold cross-validation and independent tests show that this approach significantly improved the prediction performance compared with four existing prediction tools: NetNGlyc, NetOGlyc, EnsembleGly and GPP. We demonstrated that this tool could identify candidate glycosylation sites in case study proteins and applied it to identify many high-confidence glycosylation target proteins by screening the entire human proteome.",
author = "Fuyi Li and Chen Li and Mingjun Wang and Webb, {Geoffrey I} and Yang Zhang and Whisstock, {James C} and Jiangning Song",
year = "2015",
doi = "10.1093/bioinformatics/btu852",
language = "English",
volume = "31",
pages = "1411 -- 1419",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press, USA",
number = "9",

}

GlycoMine: a machine learning-based approach for predicting N-, C- and O-linked glycosylation in the human proteome. / Li, Fuyi; Li, Chen; Wang, Mingjun; Webb, Geoffrey I; Zhang, Yang; Whisstock, James C; Song, Jiangning.

In: Bioinformatics, Vol. 31, No. 9, 2015, p. 1411 - 1419.

Research output: Contribution to journalArticleResearchpeer-review

TY - JOUR

T1 - GlycoMine: a machine learning-based approach for predicting N-, C- and O-linked glycosylation in the human proteome

AU - Li, Fuyi

AU - Li, Chen

AU - Wang, Mingjun

AU - Webb, Geoffrey I

AU - Zhang, Yang

AU - Whisstock, James C

AU - Song, Jiangning

PY - 2015

Y1 - 2015

N2 - MOTIVATION: Glycosylation is a ubiquitous type of protein post-translational modification (PTM) in eukaryotic cells, which plays vital roles in various biological processes (BPs) such as cellular communication, ligand recognition and subcellular recognition. It is estimated that >50 of the entire human proteome is glycosylated. However, it is still a significant challenge to identify glycosylation sites, which requires expensive/laborious experimental research. Thus, bioinformatics approaches that can predict the glycan occupancy at specific sequons in protein sequences would be useful for understanding and utilizing this important PTM. RESULTS: In this study, we present a novel bioinformatics tool called GlycoMine, which is a comprehensive tool for the systematic in silico identification of C-linked, N-linked, and O-linked glycosylation sites in the human proteome. GlycoMine was developed using the random forest algorithm and evaluated based on a well-prepared up-to-date benchmark dataset that encompasses all three types of glycosylation sites, which was curated from multiple public resources. Heterogeneous sequences and functional features were derived from various sources, and subjected to further two-step feature selection to characterize a condensed subset of optimal features that contributed most to the type-specific prediction of glycosylation sites. Five-fold cross-validation and independent tests show that this approach significantly improved the prediction performance compared with four existing prediction tools: NetNGlyc, NetOGlyc, EnsembleGly and GPP. We demonstrated that this tool could identify candidate glycosylation sites in case study proteins and applied it to identify many high-confidence glycosylation target proteins by screening the entire human proteome.

AB - MOTIVATION: Glycosylation is a ubiquitous type of protein post-translational modification (PTM) in eukaryotic cells, which plays vital roles in various biological processes (BPs) such as cellular communication, ligand recognition and subcellular recognition. It is estimated that >50 of the entire human proteome is glycosylated. However, it is still a significant challenge to identify glycosylation sites, which requires expensive/laborious experimental research. Thus, bioinformatics approaches that can predict the glycan occupancy at specific sequons in protein sequences would be useful for understanding and utilizing this important PTM. RESULTS: In this study, we present a novel bioinformatics tool called GlycoMine, which is a comprehensive tool for the systematic in silico identification of C-linked, N-linked, and O-linked glycosylation sites in the human proteome. GlycoMine was developed using the random forest algorithm and evaluated based on a well-prepared up-to-date benchmark dataset that encompasses all three types of glycosylation sites, which was curated from multiple public resources. Heterogeneous sequences and functional features were derived from various sources, and subjected to further two-step feature selection to characterize a condensed subset of optimal features that contributed most to the type-specific prediction of glycosylation sites. Five-fold cross-validation and independent tests show that this approach significantly improved the prediction performance compared with four existing prediction tools: NetNGlyc, NetOGlyc, EnsembleGly and GPP. We demonstrated that this tool could identify candidate glycosylation sites in case study proteins and applied it to identify many high-confidence glycosylation target proteins by screening the entire human proteome.

UR - http://bioinformatics.oxfordjournals.org/content/31/9/1411.full.pdf+html

U2 - 10.1093/bioinformatics/btu852

DO - 10.1093/bioinformatics/btu852

M3 - Article

VL - 31

SP - 1411

EP - 1419

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - 9

ER -