iLearn

an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data

Zhen Chen, Pei Zhao, Fuyi Li, Tatiana T. Marquez-Lago, Andre Leier, Jerico Revote, Yan Zhu, David R. Powell, Tatsuya Akutsu, Geoffrey I. Webb, Kuo-Chen Chou, A. Ian Smith, Roger J. Daly, Jian Li, Jiangning Song

Research output: Contribution to journal › Article › Research › peer-review

Abstract

With the explosive growth of biological sequences generated in the post-genomic era, one of the most challenging problems in bioinformatics and computational biology is to computationally characterize sequences, structures and functions in an efficient, accurate and high-throughput manner. A number of online web servers and stand-alone tools have been developed to address this to date; however, all these tools have their limitations and drawbacks in terms of their effectiveness, user-friendliness and capacity. Here, we present iLearn, a comprehensive and versatile Python-based toolkit, integrating the functionality of feature extraction, clustering, normalization, selection, dimensionality reduction, predictor construction, best descriptor/model selection, ensemble learning and results visualization for DNA, RNA and protein sequences. iLearn was designed for users that only want to upload their data set and select the functions they need calculated from it, while all necessary procedures and optimal settings are completed automatically by the software. iLearn includes a variety of descriptors for DNA, RNA and proteins, and four feature output formats are supported so as to facilitate direct output usage or communication with other computational tools. In total, iLearn encompasses 16 different types of feature clustering, selection, normalization and dimensionality reduction algorithms, and five commonly used machine-learning algorithms, thereby greatly facilitating feature analysis and predictor construction. iLearn is made freely available via an online web server and a stand-alone toolkit.
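The abstract lists feature extraction followed by normalization as the first stages of the workflow. The sketch below illustrates those two stages in plain Python. It is not iLearn's code or API: the abstract does not name individual descriptors, so k-mer composition is chosen here only as a common example of a sequence descriptor, z-score scaling as a common normalization, and the function names `kmer_composition` and `zscore_columns` are hypothetical.

```python
from itertools import product
from statistics import mean, pstdev


def kmer_composition(seq, k=2, alphabet="ACGT"):
    """Return k-mer frequencies in a fixed order (hypothetical helper, not iLearn's API)."""
    seq = seq.upper()
    # Enumerate every possible k-mer so each sequence maps to the same feature order.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing symbols outside the alphabet, e.g. N
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def zscore_columns(matrix):
    """Z-score each feature column; constant columns are mapped to zeros."""
    scaled = []
    for col in zip(*matrix):
        mu, sigma = mean(col), pstdev(col)
        scaled.append([(v - mu) / sigma if sigma else 0.0 for v in col])
    # Transpose back so that each row is one sequence again.
    return [list(row) for row in zip(*scaled)]


# Toy DNA sequences, made up for illustration only.
sequences = ["ACGTACGT", "AAAAAAAA", "ACGNACGT"]
features = [kmer_composition(s, k=2) for s in sequences]
normalized = zscore_columns(features)
```

Each sequence becomes a fixed-length numeric vector (16 values for dinucleotides over a four-letter alphabet), which is the form that downstream selection, dimensionality reduction and classification steps typically expect.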
Original language: English
Article number: bbz041
Number of pages: 11
Journal: Briefings in Bioinformatics
DOI: 10.1093/bib/bbz041
Publication status: Accepted/In press - 24 Apr 2019

Keywords

  • bioinformatics
  • integrated platform
  • sequence analysis
  • machine learning
  • automated modeling
  • data clustering
  • feature selection
  • biomedical data mining

Cite this

@article{48415c1c2c164db9a0a7131c8ebf5905,
title = "iLearn: an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data",
keywords = "bioinformatics, integrated platform, sequence analysis, machine learning, automated modeling, data clustering, feature selection, biomedical data mining",
author = "Zhen Chen and Pei Zhao and Fuyi Li and Marquez-Lago, {Tatiana T.} and Andre Leier and Jerico Revote and Yan Zhu and Powell, {David R.} and Tatsuya Akutsu and Webb, {Geoffrey I.} and Kuo-Chen Chou and Smith, {A. Ian} and Daly, {Roger J.} and Jian Li and Jiangning Song",
year = "2019",
month = "4",
day = "24",
doi = "10.1093/bib/bbz041",
language = "English",
journal = "Briefings in Bioinformatics",
issn = "1467-5463",
publisher = "Oxford Univ Press",

}

iLearn: an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data. / Chen, Zhen; Zhao, Pei; Li, Fuyi; Marquez-Lago, Tatiana T.; Leier, Andre; Revote, Jerico; Zhu, Yan; Powell, David R.; Akutsu, Tatsuya; Webb, Geoffrey I.; Chou, Kuo-Chen; Smith, A. Ian; Daly, Roger J.; Li, Jian; Song, Jiangning.

In: Briefings in Bioinformatics, 24.04.2019.

TY - JOUR

T1 - iLearn

T2 - an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data

AU - Chen, Zhen

AU - Zhao, Pei

AU - Li, Fuyi

AU - Marquez-Lago, Tatiana T.

AU - Leier, Andre

AU - Revote, Jerico

AU - Zhu, Yan

AU - Powell, David R.

AU - Akutsu, Tatsuya

AU - Webb, Geoffrey I.

AU - Chou, Kuo-Chen

AU - Smith, A. Ian

AU - Daly, Roger J.

AU - Li, Jian

AU - Song, Jiangning

PY - 2019/4/24

Y1 - 2019/4/24

KW - bioinformatics

KW - integrated platform

KW - sequence analysis

KW - machine learning

KW - automated modeling

KW - data clustering

KW - feature selection

KW - biomedical data mining

U2 - 10.1093/bib/bbz041

DO - 10.1093/bib/bbz041

M3 - Article

JO - Briefings in Bioinformatics

JF - Briefings in Bioinformatics

SN - 1467-5463

M1 - bbz041

ER -