Sample subset optimization techniques for imbalanced and ensemble learning problems in bioinformatics applications

Pengyi Yang, Paul D Yoo, Juanita Isabelle Esther Fernando, Bing B Zhou, Zili Zhang, Albert Y Zomaya

Research output: Contribution to journal, Article, Research, peer-review

Abstract

Data sampling is a widely used technique in a broad range of machine learning problems. Traditional sampling approaches generally rely on random resampling from a given dataset. However, these approaches do not take into consideration additional information, such as sample quality and usefulness. We recently proposed a data sampling technique, called sample subset optimization (SSO). The SSO technique relies on a cross-validation procedure for identifying and selecting the most useful samples as subsets. In this paper, we describe the application of SSO techniques to imbalanced and ensemble learning problems, respectively. For imbalanced learning, the SSO technique is employed as an under-sampling technique for identifying a subset of highly discriminative samples in the majority class. In ensemble learning, the SSO technique is utilized as a generic ensemble technique where multiple optimized subsets of samples from each class are selected for building an ensemble classifier. We demonstrate the utilities and advantages of the proposed techniques on a variety of bioinformatics applications where class imbalance, small sample size, and noisy data are prevalent.
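As an illustration of the under-sampling idea described in the abstract, the following is a minimal sketch, not the authors' SSO implementation. It assumes a plain random search over candidate majority-class subsets in place of whatever optimizer the paper uses, and a nearest-centroid classifier as a placeholder learner. All function names and parameters (`select_majority_subset`, `k`, `trials`, `seed`) are hypothetical. Each majority-class sample is scored by the average cross-validated balanced accuracy of the candidate subsets it took part in, and the highest-scoring samples are kept.

```python
import random


def centroid(rows):
    """Column-wise mean of a non-empty list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def balanced_accuracy(c_maj, c_min, val_maj, val_min):
    """Balanced accuracy of a nearest-centroid rule on one validation fold."""
    maj_hits = sum(sq_dist(r, c_maj) <= sq_dist(r, c_min) for r in val_maj)
    min_hits = sum(sq_dist(r, c_min) < sq_dist(r, c_maj) for r in val_min)
    return 0.5 * (maj_hits / len(val_maj) + min_hits / len(val_min))


def select_majority_subset(majority, minority, k=3, trials=30, seed=0):
    """Return sorted indices of the majority samples with the best CV scores.

    Assumption: random search stands in for the paper's optimizer.
    Requires len(majority) >= k and len(minority) >= k with k >= 2.
    """
    rng = random.Random(seed)
    totals = [0.0] * len(majority)
    counts = [0] * len(majority)
    maj_idx = list(range(len(majority)))
    min_idx = list(range(len(minority)))
    rng.shuffle(maj_idx)
    rng.shuffle(min_idx)
    for fold in range(k):
        # Hold out every k-th sample of each class as the validation fold.
        val_maj, val_min = maj_idx[fold::k], min_idx[fold::k]
        train_maj = [i for i in maj_idx if i not in val_maj]
        train_min = [i for i in min_idx if i not in val_min]
        c_min = centroid([minority[i] for i in train_min])
        size = min(len(train_min), len(train_maj))
        for _ in range(trials):
            # Candidate balanced subset of the majority class.
            subset = rng.sample(train_maj, size)
            c_maj = centroid([majority[i] for i in subset])
            score = balanced_accuracy(
                c_maj, c_min,
                [majority[i] for i in val_maj],
                [minority[i] for i in val_min],
            )
            # Credit every sample in the subset with the fold score.
            for i in subset:
                totals[i] += score
                counts[i] += 1
    avg = [t / c if c else 0.0 for t, c in zip(totals, counts)]
    ranked = sorted(range(len(majority)), key=lambda i: avg[i], reverse=True)
    return sorted(ranked[:len(minority)])


# Toy imbalanced data: 30 majority points, 6 minority points (synthetic).
data_rng = random.Random(1)
majority = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(30)]
minority = [[data_rng.gauss(3, 1), data_rng.gauss(3, 1)] for _ in range(6)]
kept = select_majority_subset(majority, minority)
```

Here `kept` holds as many majority-class indices as there are minority samples, so the reduced training set is balanced. Repeating the selection with different seeds would give several subsets, which is one way the ensemble variant mentioned in the abstract could be approximated.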
Original language: English
Pages (from-to): 445 - 455
Number of pages: 11
Journal: IEEE Transactions on Cybernetics
Volume: 44
Issue number: 3
DOI: 10.1109/TCYB.2013.2257480
Publication status: Published - 2014

Cite this

Yang, Pengyi; Yoo, Paul D; Fernando, Juanita Isabelle Esther; Zhou, Bing B; Zhang, Zili; Zomaya, Albert Y. / Sample subset optimization techniques for imbalanced and ensemble learning problems in bioinformatics applications. In: IEEE Transactions on Cybernetics. 2014; Vol. 44, No. 3. pp. 445 - 455.
@article{d0be3268b7604f5c82b5985c0f48146c,
title = "Sample subset optimization techniques for imbalanced and ensemble learning problems in bioinformatics applications",
author = "Pengyi Yang and Yoo, {Paul D} and Fernando, {Juanita Isabelle Esther} and Zhou, {Bing B} and Zili Zhang and Zomaya, {Albert Y}",
year = "2014",
doi = "10.1109/TCYB.2013.2257480",
language = "English",
volume = "44",
pages = "445 -- 455",
journal = "IEEE Transactions on Cybernetics",
issn = "2168-2267",
number = "3",

}

TY - JOUR
T1 - Sample subset optimization techniques for imbalanced and ensemble learning problems in bioinformatics applications
AU - Yang, Pengyi
AU - Yoo, Paul D
AU - Fernando, Juanita Isabelle Esther
AU - Zhou, Bing B
AU - Zhang, Zili
AU - Zomaya, Albert Y
PY - 2014
Y1 - 2014
UR - http://goo.gl/wGXgbM
U2 - 10.1109/TCYB.2013.2257480
DO - 10.1109/TCYB.2013.2257480
M3 - Article
VL - 44
SP - 445
EP - 455
JO - IEEE Transactions on Cybernetics
JF - IEEE Transactions on Cybernetics
SN - 2168-2267
IS - 3
ER -