Transcriptomics: From Microarrays to RNA-Seq

by Josh P. Roberts at Biocompare

When next-gen sequencing exploded onto the scene, it brought in its wake a host of innovations. Among these is the deep sequencing of RNA (RNA-Seq), which is giving unprecedented breadth and depth to our understanding of the way cells develop, regulate themselves and each other, and respond to their environment. Although the study of cellular RNA is not new, the scale on which researchers now undertake transcriptomic investigations, and many of the questions they can now ask, would not have been possible with earlier technologies.

The transcriptome is commonly defined as “the complete set of transcripts in a cell, and their quantity, for a specific developmental stage or physiological condition.”[1]

Transcriptomics can address all or a segment of the transcriptome, from normal or diseased single cells or tissues. Depending on the techniques employed, it can be used, for example, to catalog and annotate a cell’s RNA complement, including coding as well as non-coding transcripts; query the structure of the genes that gave rise to them, including exon/intron boundaries, transcription start sites, splicing patterns and even gene fusion events; and aid in mapping interactive networks. Perhaps most familiarly, transcriptomics can be used to determine how expression patterns of these transcripts change under differing conditions (such as disease or drug treatment) and to search for, and validate, biomarkers.


Microarrays have been the stalwarts of high-throughput, genome-wide expression investigations since the mid-1990s. The process starts with extracting RNA; reverse-transcribing it into cDNA; and amplifying, fluorescently labeling and fragmenting the cDNA clones. These clones (“targets”) are then exposed to the microarray (called a “chip”), on which are found ordered sets of DNA probes. In some platforms, two sets of differently labeled samples compete with each other for binding to the probes, with the color indicating which one (or both) was bound to each spot. Other platforms query a single set of labeled targets at a time, with the intensity of the spot indicating how much target is bound.
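As a rough illustration of the two-color readout described above, the relative binding of the two samples at each spot is often summarized as a log2 ratio of the two channel intensities. The sketch below assumes hypothetical spot names and intensity values, and a pseudocount of 1 to avoid taking the log of zero; real pipelines also apply background correction and normalization, which are omitted here.

```python
import math

def log2_ratio(red_intensity, green_intensity, pseudocount=1.0):
    """Return log2(red / green) for one spot.

    The pseudocount is an illustrative assumption that keeps the ratio
    defined when a channel reads zero.
    """
    return math.log2((red_intensity + pseudocount) / (green_intensity + pseudocount))

# Hypothetical (red, green) intensities for three spots on a chip.
spots = {
    "gene_a": (4095.0, 1023.0),  # red sample binds more strongly
    "gene_b": (511.0, 511.0),    # both samples bind equally
    "gene_c": (255.0, 2047.0),   # green sample binds more strongly
}

ratios = {name: log2_ratio(red, green) for name, (red, green) in spots.items()}

for name, ratio in ratios.items():
    print(f"{name}: log2(red/green) = {ratio:+.2f}")
```

A positive value means the red-labeled sample dominated the spot, a negative value means the green-labeled sample did, and a value near zero means both bound about equally.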

Such chips may feature probes that represent the entire genome of an organism or a smaller subset, such as exons, miRNA or single nucleotide polymorphisms (SNPs). “Everything revolves around hybridization—the same as all the way back to Southern blots—which is a highly specific, highly sensitive type of approach. They’re easy to call—no errors,” says Eric Hoffman, director of the Center for Genetic Medicine Research at the Children’s National Medical Center in the District of Columbia.

For more defined questions, like searching for genetic modifiers that modulate the severity of Duchenne muscular dystrophy, Hoffman designed a custom array. “Then you can focus on polymorphisms that change proteins. You have a narrower window, but you’re asking a narrower question, and you know what you’re doing to try to interpret it,” he explains.

One of the biggest limitations is that existing knowledge of the genome is necessary to fabricate a microarray: Transcripts that are not in a database, such as unknown non-coding RNA or RNA from an unsequenced organism, will not be detected. In this sense, microarrays are not a tool for discovery. Microarrays also suffer from a limited dynamic range, meaning that a choice generally must be made to focus on either highly abundant or rare transcripts; the latter are notoriously difficult, because the analog nature of the signal makes it hard to quantify low-expressed species. And, unless a specialized array designed for the purpose is being used, microarrays cannot distinguish splice variants, allele-specific variants or the like. Similarly, it may be problematic to distinguish among highly related RNA species because of cross-reactivity.


RNA-Seq has largely solved these problems. Because every transcript in a sample is sequenced multiple times, the entire transcriptome can be queried, down to the individual base, whether or not a reference genome is available. As the cost of obtaining the data continues to plummet, RNA-Seq is fast becoming the preferred route to genome-wide expression studies: It can detect expression-level differences spanning five orders of magnitude (“five logs”), and it captures exon and allele usage variants and non-coding RNA equally well.
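To make the dynamic-range figure concrete: the range is the base-10 logarithm of the ratio between the highest and lowest detectable expression levels. The short sketch below uses made-up read counts chosen purely for illustration; they are not data from any experiment.

```python
import math

def dynamic_range_logs(values):
    """Return log10(max / min) over the positive values, i.e. the number
    of orders of magnitude the measurements span."""
    positive = [v for v in values if v > 0]
    return math.log10(max(positive) / min(positive))

# Hypothetical per-transcript read counts from a single sample.
read_counts = [2, 15, 480, 9_300, 200_000]

span = dynamic_range_logs(read_counts)
print(f"Counts span about {span:.1f} orders of magnitude")
```

Here the rarest transcript has 2 reads and the most abundant has 200,000, a 100,000-fold difference, or five logs.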

Depending on the platform, transcripts are read in increments of about 36 to 400 bases. Except for classes of small RNA such as miRNA, which are short enough to be read whole, these reads then need to be collated and assembled back into the full-length transcripts they came from, either with or without reference to a genome (called “site-aware” or “de novo,” respectively). Along the way, the software makes a host of decisions about how to normalize and extrapolate quantities, how to handle ambiguous calls and the like.
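One example of the normalization decisions mentioned above is correcting for transcript length, since a longer transcript yields more reads at the same abundance. The sketch below computes transcripts per million (TPM), one widely used measure; it is not necessarily what any particular package does by default. The transcript names, read counts and lengths are hypothetical.

```python
def tpm(counts, lengths_bp):
    """Compute transcripts per million (TPM).

    counts:     mapping of transcript name -> number of mapped reads
    lengths_bp: mapping of transcript name -> transcript length in bases
    """
    # Reads per base: removes the advantage longer transcripts have.
    rates = {name: counts[name] / lengths_bp[name] for name in counts}
    total_rate = sum(rates.values())
    # Rescale so the values for one sample sum to one million.
    return {name: rate / total_rate * 1_000_000 for name, rate in rates.items()}

# Hypothetical mapped-read counts and transcript lengths.
counts = {"tx1": 100, "tx2": 200, "tx3": 100}
lengths = {"tx1": 1000, "tx2": 4000, "tx3": 500}

tpm_values = tpm(counts, lengths)

for name, value in sorted(tpm_values.items()):
    print(f"{name}: {value:,.0f} TPM")
```

Note that tx2 has the most raw reads but the lowest TPM, because it is also the longest transcript; ranking by raw counts alone would overstate its abundance.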

The amount of data coming out of an RNA-Seq experiment can be staggering. Although that can be a boon, it is also a curse. Much more expansive, complex and time-consuming bioinformatics capabilities are required to process the data and extract meaningful results. This can be mitigated somewhat by pre-selecting what is fed into the experiment (for example, by selecting or depleting poly-A RNA). Still, orders of magnitude more data are generated than with microarrays.

Sage words

“A lot of RNA profiling experiments run into problems with experimental design and interpretation,” Hoffman cautions. “Just because you can do RNA-Seq and get a lot more data, at a lot greater sensitivity, over a greater dynamic range, doesn’t solve that problem.”


[1] Wang, Z., Gerstein, M., Snyder, M., “RNA-Seq: a revolutionary tool for transcriptomics,” Nature Reviews Genetics, 10(1), 57-63, 2009.
