TY - JOUR
T1 - Detecting and correcting systematic variation in large-scale RNA sequencing data
AU - Li, Sheng
AU - Labaj, Pawel P.
AU - Zumbo, Paul
AU - Sykacek, Peter
AU - Shi, Wei
AU - Shi, Leming
AU - Phan, John
AU - Wu, Po Yen
AU - Wang, May
AU - Wang, Charles
AU - Thierry-Mieg, Danielle
AU - Thierry-Mieg, Jean
AU - Kreil, David P.
AU - Mason, Christopher E.
N1 - Publisher Copyright: © 2014 Nature America, Inc.
PY - 2014/9
Y1 - 2014/9
N2 - High-throughput RNA sequencing (RNA-seq) enables comprehensive scans of entire transcriptomes, but best practices for analyzing RNA-seq data have not been fully defined, particularly for data collected with multiple sequencing platforms or at multiple sites. Here we used standardized RNA samples with built-in controls to examine sources of error in large-scale RNA-seq studies and their impact on the detection of differentially expressed genes (DEGs). Analysis of variations in guanine-cytosine content, gene coverage, sequencing error rate and insert size allowed identification of decreased reproducibility across sites. Moreover, commonly used methods for normalization (cqn, EDASeq, RUV2, sva, PEER) varied in their ability to remove these systematic biases, depending on sample complexity and initial data quality. Normalization methods that combine data from genes across sites are strongly recommended to identify and remove site-specific effects and can substantially improve RNA-seq studies.
UR - http://www.scopus.com/inward/record.url?scp=84909587930&partnerID=8YFLogxK
U2 - 10.1038/nbt.3000
DO - 10.1038/nbt.3000
M3 - Article
C2 - 25150837
AN - SCOPUS:84909587930
SN - 1087-0156
VL - 32
SP - 888
EP - 895
JO - Nature Biotechnology
JF - Nature Biotechnology
IS - 9
ER -