Major technology-related artifacts and biases affect RNA-Seq expression data. Normalization is therefore essential to ensure accurate inference of expression levels and reliable downstream analyses. Researchers at UC Berkeley focus on biases related to GC-content and demonstrate the existence of strong sample-specific GC-content effects on RNA-Seq read counts, which can substantially bias differential expression analysis.
They propose three simple within-lane gene-level GC-content normalization approaches and assess their performance on two different RNA-Seq datasets, involving different species and experimental designs. Their methods are compared to state-of-the-art normalization procedures in terms of bias and mean squared error for expression fold-change estimation and in terms of Type I error and p-value distributions for tests of differential expression.
The normalization methods proposed in this article are implemented in the open-source Bioconductor R package EDASeq. The resulting normalized counts (or raw counts and associated normalization offsets) can then be supplied seamlessly to other R packages for differential expression analysis, such as DESeq or edgeR.
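
To give a feel for what a within-lane, gene-level GC-content adjustment can look like, here is a minimal Python sketch. It is an illustration only, not the EDASeq API (EDASeq is an R package) and not necessarily any of the three approaches from the paper. It assumes a simple scheme in which genes from a single lane are grouped into equal-frequency GC-content bins and each bin is rescaled to a common median. The function name `gc_median_normalize` and the `n_bins` parameter are hypothetical.

```python
import numpy as np


def gc_median_normalize(counts, gc_content, n_bins=10):
    """Rescale one lane's gene counts so that every GC-content bin
    ends up with the same median count (illustrative sketch only)."""
    counts = np.asarray(counts, dtype=float)
    gc_content = np.asarray(gc_content, dtype=float)

    # Equal-frequency bin edges taken from the GC-content quantiles.
    edges = np.quantile(gc_content, np.linspace(0.0, 1.0, n_bins + 1))
    # Assign each gene to a bin (0 .. n_bins - 1) using the interior edges.
    bins = np.digitize(gc_content, edges[1:-1], right=True)

    # Median count within each occupied bin.
    medians = {b: np.median(counts[bins == b]) for b in np.unique(bins)}
    positive = [m for m in medians.values() if m > 0]

    normalized = counts.copy()
    if not positive:
        # Nothing to rescale if every bin median is zero.
        return normalized

    # Common target: the median of the positive per-bin medians.
    target = np.median(positive)
    for b, m in medians.items():
        if m > 0:
            normalized[bins == b] *= target / m
    return normalized


# Toy lane: high-GC genes have systematically inflated counts.
toy_counts = [10, 20, 30, 100, 200, 300]
toy_gc = [0.30, 0.32, 0.34, 0.60, 0.62, 0.64]
toy_normalized = gc_median_normalize(toy_counts, toy_gc, n_bins=2)
```

In this toy lane the low-GC and high-GC genes land in separate bins, so after rescaling both bins share the same median and the GC-dependent shift is removed. Real data would call for more bins and for handling of zero counts, which this sketch leaves untouched.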
- Risso D, Schwartz K, Sherlock G, Dudoit S. (2011) GC-Content Normalization for RNA-Seq Data. U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 291.