New study finds RNA-seq data greatly affected by bias and systematic errors

When examining large datasets processed from four different studies, researchers at Johns Hopkins and Brown Universities found RNA-seq data to be affected by errors similar to those that originally plagued the development of microarrays as a tool for analyzing gene expression. Microarrays, the technology that first permitted genome-wide measurement of gene expression, had problems due to unwanted sources of variability. While these problems have been mitigated after years of statistical methodology development, this study found that RNA-seq data show unwanted and obscuring variability similar to what was first observed in microarrays.

The authors report on commonly observed data distortions that demonstrate the need for data normalization. In particular, they found that GC-content has a strong sample-specific effect on gene expression measurements that, if left uncorrected, leads to false positives in downstream results. To remove these unwanted sources of variation, they developed a normalization procedure for RNA-seq data that greatly improves precision without affecting accuracy. Their conditional quantile normalization (CQN) algorithm combines robust generalized regression, to remove systematic bias introduced by deterministic features such as GC-content, with quantile normalization, to correct for global distortions.
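The two ingredients described above, removing a sample-specific GC-content trend and then quantile normalizing, can be illustrated with a small simulation. What follows is a minimal sketch and not the authors' CQN implementation: it swaps in an ordinary least-squares polynomial fit (`np.polyfit`) for the robust generalized regression named in the paper, and the function names `remove_gc_trend` and `quantile_normalize`, the polynomial degree, and all simulated values are assumptions made purely for illustration.

```python
import numpy as np

def remove_gc_trend(log_expr, gc, degree=2):
    """Subtract a per-sample polynomial GC trend from log expression.

    log_expr: array of shape (genes, samples); gc: array of shape (genes,).
    Assumption: a least-squares polynomial stands in for the robust
    regression used by the real method.
    """
    corrected = np.empty_like(log_expr, dtype=float)
    for j in range(log_expr.shape[1]):
        coeffs = np.polyfit(gc, log_expr[:, j], degree)
        trend = np.polyval(coeffs, gc)
        # Remove the fitted trend but keep the sample's overall mean level.
        corrected[:, j] = log_expr[:, j] - trend + trend.mean()
    return corrected

def quantile_normalize(x):
    """Give every column (sample) the same empirical distribution."""
    order = np.argsort(x, axis=0)
    ranks = np.argsort(order, axis=0)            # rank of each gene within its sample
    reference = np.sort(x, axis=0).mean(axis=1)  # mean of the sorted columns
    return reference[ranks]

# Simulated data (hypothetical values): shared true expression plus a
# GC effect whose slope differs from sample to sample.
rng = np.random.default_rng(0)
n_genes, n_samples = 500, 4
gc = rng.uniform(0.3, 0.7, n_genes)
true_log_expr = rng.normal(5.0, 1.0, n_genes)
slopes = np.array([-4.0, -1.0, 2.0, 5.0])
observed = (true_log_expr[:, None]
            + slopes[None, :] * (gc[:, None] - 0.5)
            + rng.normal(0.0, 0.1, (n_genes, n_samples)))

normalized = quantile_normalize(remove_gc_trend(observed, gc))

# Between-sample differences should no longer track GC-content.
before = np.corrcoef(gc, observed[:, 3] - observed[:, 0])[0, 1]
after = np.corrcoef(gc, normalized[:, 3] - normalized[:, 0])[0, 1]
print(f"correlation of sample difference with GC: before={before:.2f}, after={after:.2f}")
```

In this toy setting the apparent expression difference between two samples is driven almost entirely by GC-content before correction, which is the kind of artifact that could surface as false positives in a differential expression analysis. Note that this simplified per-sample fit also removes any genuine gene-level association between expression and GC-content, and it ignores other deterministic features such as gene length.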

Hansen KD, Irizarry RA, and Wu Z (2011) Removing Technical Variability in RNA-Seq Data Using Conditional Quantile Normalization. Collection of Biostatistics Research Archive [Epub ahead of print]. [article]