This lecture is by Simon White of Ensembl and is 17 minutes long.

In it, he explains how we can use RNA-Seq data to annotate genomes, including building complete gene sets from scratch, adding novel genes to existing annotations, improving and validating existing annotations and analysing the UTRs of genes. He explains the methods involved in RNA-Seq genome annotation.

He discusses the impact that this has had on the annotation of already annotated genomes, using the example of zebrafish, and on creating annotation for new genomes, in this case the Tasmanian devil.

Click here to watch the video

Click here to view the slides

What are the RNA-Seq models in Ensembl, and how were they determined? How does RNA-Seq data contribute to Ensembl gene sets? Can I upload my own RNA-Seq data to Ensembl? Answers to these questions and more…

Ensembl gene annotation provides a comprehensive catalogue of transcripts aligned to the reference sequence. It relies on publicly available species-specific and orthologous transcripts, plus their inferred protein sequences. The accuracy of gene models improves as the species-specific component increases, which can be achieved cost-effectively using RNA-Seq. Two zebrafish gene annotations are presented in Ensembl version 62, built on the Zv9 reference sequence.

Firstly, RNA-Seq data from five tissues and seven developmental stages were assembled into 25,748 gene models. A 3′ end capture and sequencing protocol was developed to predict the 3′ ends of transcripts and 46.1% of the original models were subsequently refined.

The latest release of Ensembl (release 62) includes RNA-Seq data from Illumina’s Human BodyMap 2.0 project. The data cover 16 human tissue types: adrenal, adipose, brain, breast, colon, heart, kidney, liver, lung, lymph, ovary, prostate, skeletal muscle, testes, thyroid, and white blood cells. For each tissue, the raw reads were aligned to the genome, and exons were then linked into tissue-specific transcript models using the reads that span an exon-exon boundary.
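The exon-linking step described above can be illustrated with a small sketch. This is not the Ensembl pipeline itself: the function name `link_exons`, the `(start, end)` coordinate tuples, and the `min_support` threshold are assumptions made for illustration. It only shows the general idea of chaining exons wherever a read spans the boundary between them.

```python
from collections import defaultdict


def link_exons(exons, junction_reads, min_support=1):
    """Chain exons into candidate transcript models (illustrative only).

    exons: iterable of (start, end) tuples on one strand of one sequence.
    junction_reads: iterable of (donor_end, acceptor_start) pairs, one per
        read that spans an exon-exon boundary.
    min_support: hypothetical threshold, the minimum number of reads
        required before a junction is trusted.
    """
    # Count how many reads support each junction.
    support = defaultdict(int)
    for donor_end, acceptor_start in junction_reads:
        support[(donor_end, acceptor_start)] += 1

    exons = sorted(set(exons))
    next_exons = defaultdict(list)  # exon -> exons it is spliced to
    has_upstream = set()            # exons reached by some junction

    # Connect exon A to exon B when a supported junction joins A's end
    # to B's start.
    for (donor_end, acceptor_start), n_reads in support.items():
        if n_reads < min_support:
            continue
        for a in exons:
            if a[1] != donor_end:
                continue
            for b in exons:
                if b[0] == acceptor_start and b[0] > a[1]:
                    next_exons[a].append(b)
                    has_upstream.add(b)

    models = []

    def extend(path):
        # Follow every supported junction out of the last exon in the path.
        successors = sorted(next_exons.get(path[-1], []))
        if not successors:
            models.append(path)
            return
        for nxt in successors:
            extend(path + [nxt])

    # Start a model at every exon that no junction leads into.
    for exon in exons:
        if exon not in has_upstream:
            extend([exon])
    return models


# Made-up coordinates: three linked exons, one exon-skipping read,
# and one exon with no junction support.
example_exons = [(100, 200), (300, 400), (500, 600), (900, 1000)]
example_reads = [(200, 300), (200, 300), (400, 500), (200, 500)]
for model in link_exons(example_exons, example_reads):
    print(model)
```

Raising `min_support` drops junctions seen in only a few reads, which in this sketch splits weakly supported models into shorter ones.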

You can view these data in the Region in Detail view. Click on ‘Configure this page’ and choose ‘RNA-Seq’ at the left of the main panel. Enable any or all of the 32 tracks and then close the configuration panel. Of the 32 available tracks, 16 are tissue ‘gene model’ tracks and 16 are ‘intron’ tracks.

Ensembl Blog Post

 

Over the past year, the scientists at the Ensembl project have focused on continued improvements. A major development effort has been the optimization of the new annotation pipeline, which uses only RNA-seq data as input to create transcript models. The refined RNA-seq annotation pipeline was used to annotate the zebrafish Zv9 assembly, and earlier versions were used to annotate human, worm and fly data for the RNA-seq Genome Annotation Assessment Project (RGASP) 1.2. The Zv8 assembly provided a platform for much of the pipeline's development. The Ensembl website now displays a number of informative DAS tracks, including transcript models built from a range of tissues and expression information in the form of intron alignments.

Zebrafish RNASeq models (Zebrafish)
A new set of gene models made using RNA-Seq data from 9 tissues. The gene set has exon and transcript level RPKM values for each tissue stored as supporting features. The models are an additional gene set to the Ensembl build and are stored in the otherfeatures database.

Gorilla patched gene set (Gorilla)
Patch gorilla gene set using Ensembl (e57) human translation to include new genes, extend existing genes and merge previously split adjacent genes.
Patch gorilla otherfeatures database with RNA-Seq gene models from Illumina transcriptome analysis of a Western lowland gorilla.

WormBase WS210 (C. elegans)
An Ensembl build based on the WS210 WormBase C. elegans freeze using the r58 code. The main feature is a number of new transcripts built with the help of RNA-seq data. The functional genomics database will include imported mappings for Agilent and Affymetrix expression arrays, as well as a remapped set using the functional genomics pipeline.
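The zebrafish RNA-Seq models above store exon- and transcript-level RPKM values (reads per kilobase of feature per million mapped reads). A minimal sketch of the standard RPKM calculation follows. The function name `rpkm` and the example numbers are made up for illustration, and the sketch is not taken from the Ensembl code.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads.

    read_count: reads assigned to the exon or transcript.
    feature_length_bp: length of the exon or transcript in base pairs.
    total_mapped_reads: all mapped reads in the sample (e.g. one tissue).
    """
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    # Equivalent to read_count / (length / 1e3) / (total / 1e6).
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# Made-up example: 500 reads on a 2,000 bp transcript in a sample
# with 20 million mapped reads.
print(rpkm(500, 2000, 20_000_000))  # 12.5
```

Because the value is normalised by both feature length and sequencing depth, it can be compared between exons of different sizes and between tissues sequenced to different depths.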
