Reducing overfitting in predicting intrinsically unstructured proteins

Pengfei Han, Xiuzhen Zhang, Raymond S. Norton, Zhiping Feng

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Other

1 Citation (Scopus)


Intrinsically unstructured or disordered proteins are proteins that lack a fixed 3-D structure globally or contain long disordered regions. Predicting disordered regions has attracted significant research attention recently. In developing a decision-tree-based disordered region predictor, we note that many previous predictors that use the 20 amino acid compositions as training parameters tend to overfit the data. In this paper we propose to alleviate overfitting in the prediction of intrinsically unstructured proteins by reducing the number of input parameters. We also compare this approach with the random forest model, which is inherently tolerant to overfitting. Our experiments suggest that reducing the 20 amino acid compositions to 4 groups according to amino acid properties can reduce overfitting in the decision tree model. Alternatively, ensemble-learning techniques such as random forest are inherently more tolerant to this kind of overfitting and can be promising candidates for disordered region prediction.
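The feature reduction described in the abstract can be illustrated with a minimal sketch. The abstract does not list the four groups, so the `PROPERTY_GROUPS` mapping below (nonpolar, polar, acidic, basic) is an assumed, textbook-style grouping and not necessarily the one used in the paper; the function names are likewise hypothetical.

```python
from collections import Counter

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# ASSUMPTION: an illustrative grouping by side-chain property.
# The abstract does not state which four groups the authors used.
PROPERTY_GROUPS = {
    "nonpolar": "AVLIMFWPG",
    "polar": "STCYNQ",
    "acidic": "DE",
    "basic": "KRH",
}


def composition(window):
    """Return the 20-dimensional amino acid composition of a sequence window."""
    counts = Counter(window.upper())
    # Only standard residues contribute to the total.
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return {a: 0.0 for a in AMINO_ACIDS}
    return {a: counts[a] / total for a in AMINO_ACIDS}


def grouped_composition(window):
    """Collapse the 20 compositions into 4 group-level fractions."""
    comp = composition(window)
    return {
        name: sum(comp[a] for a in members)
        for name, members in PROPERTY_GROUPS.items()
    }


if __name__ == "__main__":
    # Hypothetical window; a real predictor would slide this over a protein.
    print(grouped_composition("DEKRAS"))
```

Each window then yields 4 input features instead of 20, which is the kind of reduction the abstract credits with lowering overfitting in a single decision tree. The resulting vectors could be passed to any decision tree or random forest implementation.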

Original language: English
Title of host publication: Advances in Knowledge Discovery and Data Mining - 11th Pacific-Asia Conference, PAKDD 2007, Proceedings
Number of pages: 8
Volume: 4426 LNAI
Publication status: Published - 2007
Externally published: Yes
Event: Pacific-Asia Conference on Knowledge Discovery and Data Mining 2007 - Nanjing, China
Duration: 22 May 2007 to 25 May 2007
Conference number: 11th (Proceedings)

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 4426 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Conference: Pacific-Asia Conference on Knowledge Discovery and Data Mining 2007
Abbreviated title: PAKDD 2007


  • Amino acid composition
  • Decision tree
  • Disordered region
  • Intrinsically unstructured proteins
  • Overfitting
  • Random forest
