TY - JOUR
T1 - Estimating change-points in biological sequences via the cross-entropy method
AU - Evans, Gareth
AU - Sofronov, Georgy
AU - Keith, Jonathan
AU - Kroese, Dirk
PY - 2011
Y1 - 2011
N2 - The genomes of complex organisms, including the human genome, are known to vary in GC content along their length. That is, they vary in the local proportion of the nucleotides G and C, as opposed to the nucleotides A and T. Changes in GC content are often abrupt, producing well-defined regions.
We model DNA sequences as a multiple change-point process in which the sequence is separated into segments by an unknown number of change-points, with each segment supposed to have been generated by a different process. Multiple change-point problems are important in many biological applications, particularly in the analysis of DNA sequences. Multiple change-point problems also arise in segmentation of protein sequences according to hydrophobicity.
We use the Cross-Entropy method to estimate the positions of the change-points. Parameters of the process for each segment are approximated with maximum likelihood estimates. Numerical experiments illustrate the effectiveness of the approach. We obtain estimates of the locations of change-points in artificially generated sequences and compare the accuracy of these estimates with those obtained via other methods such as IsoFinder (Oliver et al. in Nucl. Acids Res. 32:W283-W292, 2004) and Markov Chain Monte Carlo. Lastly, we provide examples with real data sets to illustrate the usefulness of our method.
AB - The genomes of complex organisms, including the human genome, are known to vary in GC content along their length. That is, they vary in the local proportion of the nucleotides G and C, as opposed to the nucleotides A and T. Changes in GC content are often abrupt, producing well-defined regions.
We model DNA sequences as a multiple change-point process in which the sequence is separated into segments by an unknown number of change-points, with each segment supposed to have been generated by a different process. Multiple change-point problems are important in many biological applications, particularly in the analysis of DNA sequences. Multiple change-point problems also arise in segmentation of protein sequences according to hydrophobicity.
We use the Cross-Entropy method to estimate the positions of the change-points. Parameters of the process for each segment are approximated with maximum likelihood estimates. Numerical experiments illustrate the effectiveness of the approach. We obtain estimates of the locations of change-points in artificially generated sequences and compare the accuracy of these estimates with those obtained via other methods such as IsoFinder (Oliver et al. in Nucl. Acids Res. 32:W283-W292, 2004) and Markov Chain Monte Carlo. Lastly, we provide examples with real data sets to illustrate the usefulness of our method.
UR - http://www.springerlink.com/content/h308768j870hn57l/fulltext.pdf
UR - https://www.scopus.com/pages/publications/80052018747
U2 - 10.1007/s10479-010-0687-0
DO - 10.1007/s10479-010-0687-0
M3 - Article
SN - 0254-5330
VL - 189
SP - 155
EP - 165
JO - Annals of Operations Research
JF - Annals of Operations Research
IS - 1
ER -