Identifying full-length transcripts entirely from short-read RNA sequencing (RNA-seq) data remains a challenge in genome annotation. Here researchers from the University of California at Berkeley describe the Generalized RNA Integration Tool, or GRIT, an automated pipeline for genome annotation that integrates RNA-seq and gene-boundary data sets.
Applying GRIT to Drosophila melanogaster short-read RNA-seq, cap analysis of gene expression (CAGE), and poly(A)-site-seq data collected for the modENCODE project, they recovered the vast majority of previously annotated transcripts and doubled the total number of transcripts cataloged. They found that 20% of protein-coding genes encode multiple protein-localization signals and that, in 20-day-old adult fly heads, genes with multiple polyadenylation sites are more common than genes with alternative splicing or alternative promoters. GRIT demonstrates 30% higher precision and recall than the most widely used transcript-assembly tools. GRIT will facilitate the automated generation of high-quality genome annotations without the need for extensive manual annotation.
Availability – All software associated with this project and the pipelines run to generate these annotations are available for download at http://grit-bio.org/