Genome Annotation with RNA-Seq


RNA-Seq’s high resolution and accuracy in estimating transcript abundance have been thoroughly demonstrated, so much so that it is being heralded as a possible replacement for microarray-based gene expression technology. There is another important application for RNA-Seq, however: the improvement of existing genome annotations, and even the possibility of complete de novo genome annotation.

Improving current genome annotations is a topic that has been discussed before on the RNA-Seq Blog. See these posts from earlier this year:

Jan 13 – RNA-Seq Datasets Improving Genome Annotation in Plants, Animals, Bacteria

Jan 7 – Improvements to Ensembl include a de novo RNA-seq gene annotation pipeline

Now, researchers at UC Berkeley and the Broad Institute have developed a novel approach termed “reference annotation based transcript (RABT) assembly”. They claim that it is a “pure” assembler and that it does not use information about the structure and content of coding genes, or other external input (e.g. ESTs), during the assembly.

However, a problem exists with using RNA-Seq for annotation. Genes that are expressed at a low level will be represented by few reads and may be only partially covered. This means that naive assembly methods will fail to reconstruct the majority of full-length transcripts.
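To get a feel for why low-expression genes end up only partially covered, here is a rough back-of-the-envelope sketch. It is not taken from the paper: it assumes reads land uniformly at random along a transcript and ignores edge effects, which gives the classic Lander-Waterman-style approximation that the covered fraction is about 1 - exp(-N·r/L) for N reads of length r on a transcript of length L. The function name, the 75 bp read length and the 2,000 bp transcript length are illustrative assumptions only.

```python
import math


def expected_covered_fraction(num_reads, read_length, transcript_length):
    """Approximate fraction of a transcript covered by at least one read.

    Assumes reads start uniformly at random along the transcript and
    ignores edge effects (a Lander-Waterman-style approximation).
    """
    if read_length <= 0 or transcript_length <= 0:
        raise ValueError("read_length and transcript_length must be positive")
    if num_reads < 0:
        raise ValueError("num_reads must be non-negative")
    # Average depth: total sequenced bases divided by transcript length.
    depth = num_reads * read_length / transcript_length
    # Probability that a given base is hit by at least one read.
    return 1.0 - math.exp(-depth)


if __name__ == "__main__":
    # Hypothetical numbers: 75 bp reads on a 2,000 bp transcript.
    for n in (5, 20, 100, 500):
        frac = expected_covered_fraction(n, 75, 2000)
        print(f"{n:>4} reads -> roughly {frac:.0%} of the transcript covered")
```

Under these assumptions, a transcript sampled by only a handful of reads leaves most of its bases uncovered, so an assembler working from the reads alone has gaps it cannot bridge.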

(Read how their method overcomes this problem…)

Availability: The methods described in this paper are implemented in the Cufflinks suite of software for RNA-Seq, freely available from

  • Roberts A, Pimentel H, Trapnell C, Pachter L. (2011) Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics [Epub ahead of print]. [abstract]