Abstract
Intrinsically unstructured or disordered proteins are proteins that lack fixed 3-D structure globally or contain long disordered regions. Predicting disordered regions has attracted significant research attention recently. In developing a decision-tree-based disordered region predictor, we note that many previous predictors that use the 20 amino acid compositions as training parameters tend to overfit the data. In this paper we propose to alleviate overfitting in the prediction of intrinsically unstructured proteins by reducing the number of input parameters. We also compare this approach with the random forest model, which is inherently tolerant to overfitting. Our experiments suggest that reducing the 20 amino acid compositions to 4 groups according to amino acid properties can reduce overfitting in the decision tree model. Alternatively, ensemble-learning techniques like random forest are inherently more tolerant to this kind of overfitting and can be promising candidates for disordered region prediction.
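
The feature reduction described in the abstract can be illustrated with a minimal sketch. The abstract does not state which four groups were used, so the grouping below (nonpolar, polar uncharged, positively charged, negatively charged) is an assumed textbook classification, not necessarily the one in the paper. The function names `composition` and `grouped_composition` are likewise hypothetical.

```python
from collections import Counter

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# ASSUMPTION: a common textbook grouping by side-chain property.
# The paper's actual four groups are not given in the abstract.
GROUPS = {
    "nonpolar": "AVLIMFWPG",
    "polar": "STCYNQ",
    "positive": "KRH",
    "negative": "DE",
}


def composition(window):
    """Return the fraction of each of the 20 amino acids in a sequence window."""
    if not window:
        raise ValueError("window must not be empty")
    counts = Counter(window)
    n = len(window)
    return {aa: counts[aa] / n for aa in AMINO_ACIDS}


def grouped_composition(window):
    """Collapse the 20 composition values into 4 group-level values."""
    comp = composition(window)
    return {
        name: sum(comp[aa] for aa in members)
        for name, members in GROUPS.items()
    }


# Example: a short, made-up window of residues.
features_20 = composition("AAKD")
features_4 = grouped_composition("AAKD")
```

With this reduction a classifier sees 4 inputs per window instead of 20, which leaves a decision tree fewer parameters on which to fit noise in the training data.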
Original language | English |
---|---|
Title of host publication | Advances in Knowledge Discovery and Data Mining - 11th Pacific-Asia Conference, PAKDD 2007, Proceedings |
Pages | 515-522 |
Number of pages | 8 |
Volume | 4426 LNAI |
Publication status | Published - 2007 |
Externally published | Yes |
Event | Pacific-Asia Conference on Knowledge Discovery and Data Mining 2007 - Nanjing, China; Duration: 22 May 2007 → 25 May 2007; Conference number: 11th; https://link.springer.com/book/10.1007/978-3-540-71701-0 (Proceedings) |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 4426 LNAI |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Conference
Conference | Pacific-Asia Conference on Knowledge Discovery and Data Mining 2007 |
---|---|
Abbreviated title | PAKDD 2007 |
Country/Territory | China |
City | Nanjing |
Period | 22/05/07 → 25/05/07 |
Keywords
- Amino acid composition
- Decision tree
- Disordered region
- Intrinsically unstructured proteins
- Overfitting
- Random forest