Accurate identification and effective removal of unwanted variation are essential for deriving meaningful biological results from RNA sequencing (RNA-seq) data, especially when the data come from large and complex studies. Using RNA-seq data from The Cancer Genome Atlas (TCGA), researchers from the Walter and Eliza Hall Institute of Medical Research examined several sources of unwanted variation and demonstrated how these can significantly compromise downstream analyses, including cancer subtype identification, the association between gene expression and survival outcomes, and gene co-expression analysis. They propose a strategy, called pseudo-replicates of pseudo-samples (PRPS), for deploying their recently developed normalization method, removing unwanted variation III (RUV-III), to remove the variation caused by library size, tumor purity and batch effects in TCGA RNA-seq data. The researchers illustrate the value of their approach by comparing it to the standard TCGA normalizations on several TCGA RNA-seq datasets. RUV-III with PRPS can also be used to integrate and normalize other large transcriptomic datasets coming from multiple laboratories or platforms.
Unwanted variation in individual TCGA RNA-seq datasets
a, Illustrative examples showing data with and without unwanted variation. In data with unwanted variation, the first five PCs correlate strongly with that variation (top left); in data without it, the correlation is low (bottom left). The histograms show Spearman correlations and log2 F-statistics between individual genes and different sources of unwanted variation. Data with large library size and tumor purity variation show high Spearman correlations between individual gene expression and this variation. Data with plate effects exhibit high F-statistics obtained from ANOVA of individual gene expression with plate as the factor. In contrast, data without such unwanted variation show low Spearman correlations and F-statistics.
b, Distribution of (log2) library size, colored by year, for the individual TCGA cancer types. The year information was not available for the LAML RNA-seq study. The library sizes are calculated after removing lowly expressed genes for each cancer type.
c, R² obtained from linear regression between library size (first panel), tumor purity (second panel) or RLE medians (third panel) and the first PC, the first two PCs, and so on cumulatively up to the first five PCs, in the raw count, FPKM and FPKM.UQ normalized datasets. The fourth panel shows the vector correlation between the first five PCs, taken cumulatively, and plates in the datasets. Ideally, there should be no significant association between the PCs and sources of unwanted variation. Gray indicates cancer types whose samples were profiled on a single plate.
d, Spearman correlation coefficients between individual gene expression levels and library size (first panel), tumor purity (second panel) and the RLE medians (third panel) in the datasets. The fourth panel shows log2 F-statistics obtained from ANOVA of gene expression levels with plate as the factor. Plates with fewer than three samples were excluded from the analyses. ANOVA was not possible for cancer types whose samples were profiled on a single plate.
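The per-gene diagnostics named in the caption can be illustrated with a minimal Python sketch. This is not the authors' code: the function names (`ranks`, `spearman`, `anova_f`, `rle_medians`), the `min_samples` parameter and the toy numbers are all hypothetical. RLE medians are computed here using the usual definition of relative log expression (each gene's log expression minus its median across samples, then the median per sample), which is assumed to match what the figure uses.

```python
import math
from statistics import mean, median

def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def anova_f(values, groups, min_samples=3):
    """One-way ANOVA F-statistic of `values` by `groups` (e.g. plates).

    Groups with fewer than `min_samples` members are dropped, mirroring
    the caption's exclusion of plates with fewer than three samples.
    Returns None when fewer than two groups remain (single-plate case).
    """
    by_group = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)
    kept = [vs for vs in by_group.values() if len(vs) >= min_samples]
    if len(kept) < 2:
        return None
    pooled = [v for vs in kept for v in vs]
    grand, k, n = mean(pooled), len(kept), len(pooled)
    ss_between = sum(len(vs) * (mean(vs) - grand) ** 2 for vs in kept)
    ss_within = sum((v - mean(vs)) ** 2 for vs in kept for v in vs)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def rle_medians(log_expr):
    """Per-sample RLE medians; `log_expr` is a list of genes, each a list
    of log2 expression values with one entry per sample."""
    deviations = [[v - median(gene) for v in gene] for gene in log_expr]
    n_samples = len(log_expr[0])
    return [median([gene[s] for gene in deviations]) for s in range(n_samples)]

# Toy data (made up): one gene measured in six samples on two plates.
gene = [5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
log_library_size = [20.1, 20.5, 21.0, 21.2, 21.9, 22.3]
plates = ["A", "A", "A", "B", "B", "B"]

rho = spearman(gene, log_library_size)   # 1.0: expression tracks library size
f_stat = anova_f(gene, plates)           # 13.5
log2_f = math.log2(f_stat)               # the scale plotted in the histograms
rle = rle_medians([gene, [2.0, 3.0, 4.0, 5.0, 6.0, 7.0]])
```

In this toy example the gene's expression rises monotonically with library size and differs between plates, so both statistics are large, which is the pattern the caption attributes to data with unwanted variation. In practice these statistics would be computed for every gene and summarized as distributions, and a library such as SciPy would normally replace the hand-written helpers.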