Improvements to Ensembl include a de novo RNA-seq gene annotation pipeline

Scientists at the Ensembl project have focused on continued improvements over the past year. A major development effort has been the optimization of their new annotation pipeline, which uses only RNA-seq data as input to create transcript models. The refined RNA-seq annotation pipeline was used to annotate the zebrafish zv9 assembly, and earlier versions of the pipeline were used to annotate human, worm and fly data for the RNA-seq Genome Annotation Assessment Project (RGASP) 1.2. The zv8 assembly provided a platform for much of the pipeline's development, and the Ensembl website now displays a number of informative DAS tracks, including transcript models built from a range of tissues and expression information in the form of intron alignments.

An article recently published in the journal Nucleic Acids Research gives an overview of the new data and features added to Ensembl since the previous report. It also details changes to the integrated analysis procedures that are designed to maximize the value of new and emerging technologies such as RNA-seq and ChIP-seq.

About the Ensembl Project

The Ensembl project seeks to enable genomic science by providing high quality, integrated annotation on chordate and selected eukaryotic genomes within a consistent and accessible infrastructure. All supported species include comprehensive, evidence-based gene annotations and a selected set of genomes includes additional data focused on variation, comparative, evolutionary, functional and regulatory annotation. The most advanced resources are provided for key species including human, mouse, rat and zebrafish reflecting the popularity and importance of these species in biomedical research. As of Ensembl release 59 (August 2010), 56 species are supported of which 5 have been added in the past year. Since our previous report, we have substantially improved the presentation and integration of both data of disease relevance and the regulatory state of different cell types.

Ensembl data and source code are provided freely to all users.

Flicek P et al. (2011) Ensembl 2011. Nucl Acids Res 39(suppl 1), D800-06.