RNA-Seq is a widely used method for studying the behavior of genes under different biological conditions. An essential step in an RNA-Seq study is normalization, in which raw data are adjusted to account for factors that prevent direct comparison of expression measures. Errors in normalization can have a significant impact on downstream analysis, for example by inflating the number of false positives in differential expression analysis. An underemphasized aspect of normalization is the set of assumptions on which each method relies, and how strongly the validity of those assumptions affects the method's performance.
Here, researchers from Carnegie Mellon University explain how assumptions provide the link between raw RNA-Seq read counts and meaningful measures of gene expression. They examine normalization methods from the perspective of their assumptions, as an understanding of methodological assumptions is necessary for choosing methods appropriate for the data at hand. Furthermore, they discuss why normalization methods perform poorly when their assumptions are violated and how this causes problems in subsequent analysis. To analyze a biological experiment, researchers must select a normalization method whose assumptions are met and that produces a meaningful measure of expression for the given experiment.
Use of negative controls with shift in expression
Two genes are investigated for differential expression between condition A and condition B. A negative control is used for normalization (this could be a gene known not to be differentially expressed, or a spike-in control). (A) Both non-control genes are up-regulated under condition B versus condition A, having twice the expression under condition B. Being a negative control, the control has the same expression under both conditions. (B) In the RNA-Seq experiment, the same number of molecules is sequenced from each sample. As the control has a smaller share of the mRNA in condition B, there are fewer control molecules in the sample for condition B. (C) Variability leads to differences in the total read count for the two samples. The share of reads aligned to the control equals the control's share of the mRNA. (D) The control should have the same expression in both conditions, so normalization is performed to equalize the normalized read count for the control, resulting in normalized read counts that reflect the correct mRNA/cell levels. (E) Because the normalized counts correctly reflect mRNA/cell, the observed fold change agrees with the truth.
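The arithmetic behind panels (A) through (E) can be traced in a minimal Python sketch. All molecule counts, total read counts, and function names below are hypothetical, chosen only for illustration, and sampling noise is ignored so that each read count equals its expected value.

```python
# Hypothetical mRNA molecules per cell (panel A). The two genes double
# from condition A to condition B; the negative control does not change.
true_expression = {
    "A": {"gene1": 10, "gene2": 20, "control": 10},
    "B": {"gene1": 20, "gene2": 40, "control": 10},
}

# Hypothetical total read counts per sample; they differ (panel C).
library_size = {"A": 4000, "B": 7700}


def expected_reads(expression, total_reads):
    """Split the total reads among genes in proportion to their mRNA share."""
    total_mrna = sum(expression.values())
    return {gene: total_reads * x / total_mrna for gene, x in expression.items()}


def normalize_by_control(reads, control="control"):
    """Rescale a sample so that the control's normalized count is 1 (panel D)."""
    scale = reads[control]
    return {gene: count / scale for gene, count in reads.items()}


reads = {
    cond: expected_reads(true_expression[cond], library_size[cond])
    for cond in ("A", "B")
}
normalized = {cond: normalize_by_control(reads[cond]) for cond in ("A", "B")}

# Fold change B vs. A, before and after normalization (panel E).
raw_fold_change = {
    gene: reads["B"][gene] / reads["A"][gene] for gene in ("gene1", "gene2")
}
fold_change = {
    gene: normalized["B"][gene] / normalized["A"][gene]
    for gene in ("gene1", "gene2")
}

print("raw fold change:", raw_fold_change)
print("normalized fold change:", fold_change)
```

With these illustrative numbers, the raw counts suggest a fold change of 2.2 for both genes, because the two samples differ in total read count and in the control's share of the mRNA. After scaling each sample by its control count, the fold change is 2, matching the true doubling. The result depends entirely on the assumption that the control really is expressed equally in both conditions.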