Differential Expression Analysis of RNA-Seq Data

from BioCompare by Josh P. Roberts

Knowledge of gene expression is crucial to understanding the molecular underpinnings of biology and medicine. As our ability to query the transcriptome grows—as new instrumentation comes online and becomes refined, along with the techniques necessary to use it—what was once the dominion of a few is becoming a workhorse in diverse labs.

“The era of RNA-Seq is definitely here,” says Christopher Mason, assistant professor at Weill Cornell Medical College. “With RNA-Seq, especially for certain medium and high expressers, you get all of the same specificity of an array, plus greater sensitivity. You also get, not just expression by gene, by exon, by junction, but … also … SNP [single nucleotide polymorphism] information. Since it’s actual sequence data, you can look at the genetic variation present. You can look for things like gene fusion events or allele-specific expression. You can look for other rearrangements or new transcribed regions that are, by definition, novel, so they wouldn’t be on an array.” Mason further comments that the remaining RNA-Seq reads sometimes come from another species, which some might consider a “contaminant” but which might also provide interesting and valuable information. In short, “You get a real wealth of information from the RNA-Seq data.”

RNA-Seq for differential expression analysis

RNA-Seq builds on next-generation DNA sequencing (NGS), taking as its input RNA that has been reverse-transcribed into cDNA and (usually) PCR-amplified. It’s most often used for a more global profile of RNA, because studies in which the researcher is looking for specific targets are perhaps better accomplished with arrays, or quantitative real-time PCR. Alternatively, some will limit the targets (and therefore the expense) by selecting the RNA to be queried before doing the RNA-Seq.

Although virtually any NGS instrument can be used, “Illumina is the clear leader in this, and they’ve got a really well-established infrastructure for RNA sequencing,” observes Mason. For applications requiring quantification, such as differential gene expression (DE) analysis, “RNA-Seq is a gene-counting platform. So you want as many reads as possible for as cheap as possible—that’s unequivocally Illumina right now.”

Counting the number of times a given sequence has been read in a given sample is only the first step in DE. The data then need to be normalized, fed through a statistical model and tested to see whether they meet the criteria for DE.
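
As a rough illustration of that first counting step, the sketch below tallies reads per gene and rescales the tallies to counts per million. It assumes an upstream aligner has already assigned each read to a gene, and the gene labels and function names are hypothetical. Counts per million is only a simple depth correction chosen here for illustration, not a method named in this article.

```python
from collections import Counter

def count_reads(gene_assignments):
    """Tally how many reads were assigned to each gene in one sample."""
    return Counter(gene_assignments)

def counts_per_million(counts):
    """Rescale raw counts so that each sample sums to one million."""
    total = sum(counts.values())
    return {gene: c * 1_000_000 / total for gene, c in counts.items()}

# Hypothetical read-to-gene assignments for a single tiny sample.
sample_reads = ["geneA", "geneB", "geneA", "geneC",
                "geneA", "geneB", "geneA", "geneA"]

raw = count_reads(sample_reads)   # geneA: 5, geneB: 2, geneC: 1
cpm = counts_per_million(raw)     # geneA: 5 of 8 reads = 625,000 per million
```

Scaling by total reads only corrects for depth. It does not address the composition effects or the statistical testing that the later steps require.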


Many sources of variability in RNA-Seq are inherent in NGS itself, such as the fact that coverage across the genome may not be uniform, and that more reads map to longer genes. Because DE analysis compares the same gene across samples, these types of bias usually can be ignored: they tend to affect all samples equally.

A more worrisome source of variability relates to the fact that sequencing depth—the number of mapped reads—typically differs sample to sample. Yet because RNA-Seq documents the relative abundance of transcripts, even samples sequenced to equal depth may show skewed results if the proportion of highly expressed genes differs among the samples. It is therefore imperative to normalize the counts to a stable measure that does not change (or changes very little). Various software algorithms offer ways to do that—for example, using the average of many genes’ reads as a metric.
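
One widely used normalization of this kind is the median-of-ratios approach popularized by the DESeq software. The sketch below is a simplified illustration of that idea, not the code of any particular package, and the gene names and counts are invented. Each gene's geometric mean across samples serves as the reference, and each sample's scaling factor is the median of its ratios to that reference, so a few strongly changed genes have little influence.

```python
import math
from statistics import median

def size_factors(counts):
    """counts maps gene -> list of raw counts, one entry per sample."""
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts.values():
        if any(c <= 0 for c in row):
            continue  # genes with a zero count have no usable geometric mean
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # The median ratio per sample is robust to a few highly changed genes.
    return [math.exp(median(r)) for r in log_ratios]

def normalize(counts):
    """Divide every raw count by its sample's size factor."""
    factors = size_factors(counts)
    return {g: [c / f for c, f in zip(row, factors)]
            for g, row in counts.items()}

# Invented data: sample 2 was sequenced twice as deeply as sample 1,
# and geneX is genuinely ten times more abundant in sample 2.
counts = {
    "geneA": [100, 200],
    "geneB": [50, 100],
    "geneC": [30, 60],
    "geneD": [10, 20],
    "geneX": [20, 400],
}
factors = size_factors(counts)
normalized = normalize(counts)
```

In this toy data the unchanged genes end up with equal normalized counts in both samples, while geneX keeps its tenfold difference.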

Low count levels, resulting from low expression levels, are inherently noisy—the difference between 2, 4 and 6 reads may be the same as the difference between 200, 202 and 204 reads, but the variance of the former is comparatively very high. “It’s a simple question of statistics,” Mason says.
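
The point can be checked with the numbers in that example. The short sketch below uses the coefficient of variation (standard deviation divided by mean) as one simple way to express noise relative to signal; the choice of that measure is ours, not Mason's.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation expressed as a fraction of the mean."""
    return stdev(values) / mean(values)

low = [2, 4, 6]
high = [200, 202, 204]

# Both sets have the same absolute spread (standard deviation of 2).
sd_low = stdev(low)
sd_high = stdev(high)

# Relative to the mean, the low counts are far noisier.
cv_low = coefficient_of_variation(low)    # 2 / 4 = 0.5
cv_high = coefficient_of_variation(high)  # 2 / 202, roughly 0.01
```

The same two-read spread is half of the mean at low counts and about one percent of the mean at high counts.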

Even for highly expressed genes, though, Mason cannot overstress the importance of running replicate samples. For these, it’s not so much technical variance, “because we know that RNA-Seq is technically tight, especially for high and medium expressers,” but biological variance. That is, you want to make sure you didn’t grab the outlier, the one sample that was exposed to some unknown chemical.

To avoid getting into the weeds of statistical algorithms, Type I (false positive) and Type II (false negative) error and the like, the simplest thing to do is “just look at things which are expressed at high levels and high fold change—which is often what people just did with microarrays anyway,” says Mason.
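
A filter of that kind might look like the sketch below. The thresholds (a mean of at least 100 normalized reads and at least a twofold change), the pseudocount, the function name and the example data are all hypothetical choices made for illustration. This is a screening shortcut, not a statistical test.

```python
import math

def high_confidence_genes(norm_counts, min_mean=100.0,
                          min_abs_log2fc=1.0, pseudocount=1.0):
    """Keep genes that are both well expressed and strongly changed.

    norm_counts maps gene -> (control_replicates, treated_replicates),
    each a list of normalized counts. All thresholds are illustrative.
    """
    hits = {}
    for gene, (control, treated) in norm_counts.items():
        mean_c = sum(control) / len(control)
        mean_t = sum(treated) / len(treated)
        # The pseudocount avoids division by zero for unexpressed genes.
        log2fc = math.log2((mean_t + pseudocount) / (mean_c + pseudocount))
        if max(mean_c, mean_t) >= min_mean and abs(log2fc) >= min_abs_log2fc:
            hits[gene] = log2fc
    return hits

# Invented normalized counts, three replicates per condition.
example = {
    "geneA": ([500, 520, 480], [2100, 1900, 2000]),    # high, about 4x up
    "geneB": ([2, 4, 6], [10, 14, 12]),                # big change, low counts
    "geneC": ([1000, 1050, 950], [1100, 1000, 1050]),  # high, barely changed
}
hits = high_confidence_genes(example)  # only geneA passes both thresholds
```

Here geneB is dropped despite its threefold change because its counts are too low to trust, and geneC is dropped because its change is negligible.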

56,000 genes and counting

Although this may yield only the low-hanging fruit, there’s plenty of it still to grab when using the GENCODE (www.gencodegenes.org) dataset, with 56,000 genes, including long noncoding RNAs and pseudogenes. Good biomarkers (clinically relevant genes whose expression changes in statistically significant ways) have been identified among the non-protein-coding genes in GENCODE in neuroblastoma, toxicogenomic and pharmacogenomic studies, Mason points out. “There’s an opportunity for the next few years to have completely low-hanging fruit on the other half of the transcriptome that no one has looked at before.”

The two most important factors in using RNA-Seq to find DE are a sufficient number of replicates (generally at least three) and sufficient depth of sequencing. Between 40 and 50 million reads gives “a really good look and a pretty thorough look at the transcriptome,” says Mason, adding that “there’s always a cost-benefit trade-off at some point, where you [have] detected the vast majority of the genes you’ll see and you [are getting] a marginal gain for the remainder of your depth.”
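
The diminishing return Mason describes can be illustrated with a toy model. The sketch below assumes that reads are sampled independently, so the chance of seeing at least one read from a transcript follows a Poisson approximation. The transcript abundances are invented to span several orders of magnitude and are not drawn from any real dataset, so only the shape of the curve is meaningful, not the specific numbers.

```python
import math

def detection_probability(depth, fraction):
    """Poisson chance of at least one read from a transcript that
    accounts for `fraction` of all reads, at `depth` total reads."""
    return 1.0 - math.exp(-depth * fraction)

def expected_detected(depth, fractions):
    """Expected number of genes with at least one read."""
    return sum(detection_probability(depth, f) for f in fractions)

# Invented abundances for a subset of genes: a few abundant ones,
# many moderate ones and many rare ones.
fractions = [1e-4] * 100 + [1e-6] * 1000 + [1e-7] * 1000

depths = [10e6, 20e6, 30e6, 40e6, 50e6]
detected = [expected_detected(d, fractions) for d in depths]

# Additional genes gained by each extra 10 million reads.
gains = [later - earlier for earlier, later in zip(detected, detected[1:])]

for d, n in zip(depths, detected):
    print(f"{d / 1e6:.0f}M reads: about {n:.0f} of {len(fractions)} genes")
```

Under these assumptions each additional 10 million reads detects fewer new genes than the previous 10 million did, which is the trade-off between cost and benefit in the quote above.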

There are myriad free and commercial software products for the various stages of DE analysis with RNA-Seq. For the novice, choosing among them is largely a question of ease of use, Mason says. “Do you have time to play around with the command prompts? Or do you just want to click on something with a nice [interface] and then get your data processed right away?”
