Reducing overfitting in predicting intrinsically unstructured proteins

Pengfei Han, Xiuzhen Zhang, Raymond S. Norton, Zhiping Feng

Research output: Chapter in Book/Report/Conference proceeding > Conference Paper > Other

Abstract

Intrinsically unstructured or disordered proteins are proteins that lack a fixed 3-D structure globally or contain long disordered regions. Predicting disordered regions has attracted significant research attention recently. In developing a decision-tree-based disordered region predictor, we note that many previous predictors that use the 20 amino acid compositions as training parameters tend to overfit the data. In this paper we propose to alleviate overfitting in the prediction of intrinsically unstructured proteins by reducing the number of input parameters. We also compare this approach with the random forest model, which is inherently tolerant to overfitting. Our experiments suggest that reducing the 20 amino acid compositions to 4 groups according to amino acid properties can reduce overfitting in the decision tree model. Alternatively, ensemble-learning techniques such as random forest are inherently more tolerant to this kind of overfitting and are promising candidates for disordered region prediction.
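The input reduction described in the abstract can be illustrated with a minimal Python sketch. It computes the 20-value amino acid composition of a sequence window and then collapses it into 4 grouped values. The abstract does not state which grouping the authors used, so the partition below (hydrophobic, polar, positive, negative) and the names `GROUPS`, `composition`, and `grouped_composition` are assumptions chosen only for illustration.

```python
from collections import Counter

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical 4-group partition by side-chain property. This is an
# assumption for illustration; the paper's actual grouping is not given
# in the abstract.
GROUPS = {
    "hydrophobic": "AVLIMFWPG",
    "polar": "STCYNQ",
    "positive": "KRH",
    "negative": "DE",
}


def composition(window):
    """Return the fraction of each standard amino acid in a sequence window."""
    counts = Counter(window)
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("window contains no standard amino acids")
    return {a: counts[a] / total for a in AMINO_ACIDS}


def grouped_composition(window):
    """Collapse the 20 composition values into one value per property group."""
    comp = composition(window)
    return {
        name: sum(comp[a] for a in members)
        for name, members in GROUPS.items()
    }


if __name__ == "__main__":
    window = "MKKDEEAS"
    print(len(composition(window)), "features before reduction")
    print(grouped_composition(window))
```

With fewer input features, a decision tree has fewer ways to split on noise in individual residue frequencies, which is the intuition behind the reduction; whether a particular grouping helps would need to be checked on held-out data.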

Original language: English
Title of host publication: Advances in Knowledge Discovery and Data Mining - 11th Pacific-Asia Conference, PAKDD 2007, Proceedings
Pages: 515-522
Number of pages: 8
Volume: 4426 LNAI
Publication status: Published - 2007
Externally published: Yes
Event: Pacific-Asia Conference on Knowledge Discovery and Data Mining 2007 - Nanjing, China
Duration: 22 May 2007 to 25 May 2007
Conference number: 11th

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 4426 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: Pacific-Asia Conference on Knowledge Discovery and Data Mining 2007
Abbreviated title: PAKDD'07
Country: China
City: Nanjing
Period: 22/05/07 to 25/05/07

Keywords

  • Amino acid composition
  • Decision tree
  • Disordered region
  • Intrinsically unstructured proteins
  • Overfitting
  • Random forest

Cite this

Han, P., Zhang, X., Norton, R. S., & Feng, Z. (2007). Reducing overfitting in predicting intrinsically unstructured proteins. In Advances in Knowledge Discovery and Data Mining - 11th Pacific-Asia Conference, PAKDD 2007, Proceedings (Vol. 4426 LNAI, pp. 515-522). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 4426 LNAI).
Han, Pengfei ; Zhang, Xiuzhen ; Norton, Raymond S. ; Feng, Zhiping. / Reducing overfitting in predicting intrinsically unstructured proteins. Advances in Knowledge Discovery and Data Mining - 11th Pacific-Asia Conference, PAKDD 2007, Proceedings. Vol. 4426 LNAI 2007. pp. 515-522 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).
@inproceedings{05f3584953b14f51a301029937dbe231,
title = "Reducing overfitting in predicting intrinsically unstructured proteins",
abstract = "Intrinsically unstructured or disordered proteins are proteins that lack a fixed 3-D structure globally or contain long disordered regions. Predicting disordered regions has attracted significant research attention recently. In developing a decision-tree-based disordered region predictor, we note that many previous predictors that use the 20 amino acid compositions as training parameters tend to overfit the data. In this paper we propose to alleviate overfitting in the prediction of intrinsically unstructured proteins by reducing the number of input parameters. We also compare this approach with the random forest model, which is inherently tolerant to overfitting. Our experiments suggest that reducing the 20 amino acid compositions to 4 groups according to amino acid properties can reduce overfitting in the decision tree model. Alternatively, ensemble-learning techniques such as random forest are inherently more tolerant to this kind of overfitting and are promising candidates for disordered region prediction.",
keywords = "Amino acid composition, Decision tree, Disordered region, Intrinsically unstructured proteins, Overfitting, Random forest",
author = "Pengfei Han and Xiuzhen Zhang and Norton, {Raymond S.} and Zhiping Feng",
year = "2007",
language = "English",
isbn = "9783540717003",
volume = "4426 LNAI",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
pages = "515--522",
booktitle = "Advances in Knowledge Discovery and Data Mining - 11th Pacific-Asia Conference, PAKDD 2007, Proceedings",

}

Han, P, Zhang, X, Norton, RS & Feng, Z 2007, Reducing overfitting in predicting intrinsically unstructured proteins. in Advances in Knowledge Discovery and Data Mining - 11th Pacific-Asia Conference, PAKDD 2007, Proceedings. vol. 4426 LNAI, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 4426 LNAI, pp. 515-522, Pacific-Asia Conference on Knowledge Discovery and Data Mining 2007, Nanjing, China, 22/05/07.

Reducing overfitting in predicting intrinsically unstructured proteins. / Han, Pengfei; Zhang, Xiuzhen; Norton, Raymond S.; Feng, Zhiping.

Advances in Knowledge Discovery and Data Mining - 11th Pacific-Asia Conference, PAKDD 2007, Proceedings. Vol. 4426 LNAI 2007. p. 515-522 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 4426 LNAI).

TY - GEN

T1 - Reducing overfitting in predicting intrinsically unstructured proteins

AU - Han, Pengfei

AU - Zhang, Xiuzhen

AU - Norton, Raymond S.

AU - Feng, Zhiping

PY - 2007

Y1 - 2007

AB - Intrinsically unstructured or disordered proteins are proteins that lack a fixed 3-D structure globally or contain long disordered regions. Predicting disordered regions has attracted significant research attention recently. In developing a decision-tree-based disordered region predictor, we note that many previous predictors that use the 20 amino acid compositions as training parameters tend to overfit the data. In this paper we propose to alleviate overfitting in the prediction of intrinsically unstructured proteins by reducing the number of input parameters. We also compare this approach with the random forest model, which is inherently tolerant to overfitting. Our experiments suggest that reducing the 20 amino acid compositions to 4 groups according to amino acid properties can reduce overfitting in the decision tree model. Alternatively, ensemble-learning techniques such as random forest are inherently more tolerant to this kind of overfitting and are promising candidates for disordered region prediction.

KW - Amino acid composition

KW - Decision tree

KW - Disordered region

KW - Intrinsically unstructured proteins

KW - Overfitting

KW - Random forest

UR - http://www.scopus.com/inward/record.url?scp=38149094114&partnerID=8YFLogxK

M3 - Conference Paper

SN - 9783540717003

VL - 4426 LNAI

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 515

EP - 522

BT - Advances in Knowledge Discovery and Data Mining - 11th Pacific-Asia Conference, PAKDD 2007, Proceedings

ER -

Han P, Zhang X, Norton RS, Feng Z. Reducing overfitting in predicting intrinsically unstructured proteins. In Advances in Knowledge Discovery and Data Mining - 11th Pacific-Asia Conference, PAKDD 2007, Proceedings. Vol. 4426 LNAI. 2007. p. 515-522. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).