Genome-guided Trinity for Gene Structure Annotation (Beta)

from the Broad Institute and the Hebrew University of Jerusalem

A primary use of RNA-Seq is to identify transcribed regions of a genome and to reconstruct the structures of transcripts, including alternatively spliced variants. Current state-of-the-art methods for genome-based transcript reconstruction align RNA-Seq reads to the genome using spliced (intron-aware) aligners, and then assemble the alignments to reconstruct transcript structures (e.g., Cufflinks, Scripture). We refer to this as the align-reads then assemble-alignments approach. Trinity supports an alternative, hybrid approach to genome-based transcript reconstruction that combines RNA-Seq read alignments to the genome with de novo assembly of the RNA-Seq reads and assembly of the resulting transcript alignments. This alternative approach involves four major steps: align-reads, assemble-reads, align-transcripts, then assemble-transcript_alignments. Specifically, the process involves:

  • align-reads: GSNAP is used to align reads to the genome sequence. Reads are then partitioned into read-covered regions of the genome.
  • assemble-reads: Trinity is used to assemble the RNA-Seq reads in each partition. This can be done in a massively parallel manner, typically requires little RAM compared to whole de novo RNA-Seq assemblies, and can be executed on standard hardware.
  • align-transcripts: The Trinity-assembled transcripts are aligned back to the genome using GMAP, as part of the PASA software pipeline.
  • assemble-transcript_alignments: The transcript alignments are assembled by PASA into complete transcript structures, resolving alternatively spliced transcript structures.
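
To make the partitioning idea in the align-reads step concrete, here is a minimal Python sketch that groups aligned reads into read-covered regions by merging overlapping alignment spans on each chromosome. It is an illustration only, not Trinity's actual implementation: the function name partition_reads, the max_gap parameter, and the tuple layout of the input are assumptions made for this example, and each alignment is treated as a single span from its leftmost to its rightmost aligned base (so a spliced read keeps its exons in one region).

```python
from collections import defaultdict


def partition_reads(alignments, max_gap=0):
    """Group aligned reads into read-covered regions (hypothetical helper).

    alignments: iterable of (read_id, chrom, start, end) tuples using
                0-based, half-open coordinates.
    max_gap:    reads whose start is within this many bases of the current
                region's end are merged into it. With max_gap=0, overlapping
                and directly adjacent reads are merged.
    Returns a list of dicts with keys: chrom, start, end, reads.
    """
    by_chrom = defaultdict(list)
    for read_id, chrom, start, end in alignments:
        by_chrom[chrom].append((start, end, read_id))

    regions = []
    for chrom in sorted(by_chrom):
        current = None
        # Sweep left to right so each read either extends the open region
        # or starts a new one.
        for start, end, read_id in sorted(by_chrom[chrom]):
            if current is not None and start <= current["end"] + max_gap:
                current["end"] = max(current["end"], end)
                current["reads"].append(read_id)
            else:
                current = {"chrom": chrom, "start": start,
                           "end": end, "reads": [read_id]}
                regions.append(current)
    return regions


example_alignments = [
    ("r1", "chr1", 100, 150),
    ("r2", "chr1", 140, 190),
    ("r3", "chr1", 500, 550),
    ("r4", "chr2", 10, 60),
]
example_regions = partition_reads(example_alignments)
for region in example_regions:
    print(region["chrom"], region["start"], region["end"], region["reads"])
```

Each resulting region holds an independent set of reads, which is what allows the assemble-reads step to run one small assembly per partition in parallel.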

We’ve found this system to be highly effective for annotation of diverse eukaryotic genomes, from the compact genomes of microbial eukaryotes to the more expansive genomes of plants and vertebrates. The resulting transcript structures are provided in popular file formats for downstream analysis, including visualization (e.g., BED for IGV), expression analysis (GTF for Tuxedo), or coding gene identification (GFF3 for EVidenceModeler, GTF for TransDecoder).
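
As an illustration of how a multi-exon transcript structure maps onto one of these formats, the following Python sketch writes a single transcript as a 12-column BED line of the kind IGV can display. It is a sketch under stated assumptions, not the pipeline's own exporter: the function name transcript_to_bed12 and the example transcript are hypothetical, exon coordinates are assumed to be 0-based and half-open, and the thickStart/thickEnd columns are simply set to the transcript bounds because no coding region is known here.

```python
def transcript_to_bed12(chrom, name, strand, exons, score=0):
    """Format one transcript as a BED12 line (hypothetical helper).

    exons: list of (start, end) tuples, 0-based half-open, non-overlapping.
    """
    exons = sorted(exons)
    tx_start = exons[0][0]
    tx_end = exons[-1][1]
    # BED12 stores exons as block sizes plus starts relative to chromStart.
    block_sizes = ",".join(str(end - start) for start, end in exons)
    block_starts = ",".join(str(start - tx_start) for start, _ in exons)
    fields = [
        chrom, tx_start, tx_end, name, score, strand,
        tx_start, tx_end,   # thickStart / thickEnd: no CDS known, use bounds
        "0",                # itemRgb: default color
        len(exons), block_sizes, block_starts,
    ]
    return "\t".join(str(field) for field in fields)


# Hypothetical two-exon transcript on the forward strand.
bed_line = transcript_to_bed12("chr1", "tx1", "+", [(1000, 1200), (1500, 1700)])
print(bed_line)
```

The gap between the two blocks (positions 1200 to 1500) is what a genome browser draws as the intron.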
