Researchers at the Wellcome Trust Sanger Institute present the differential expression transcript counting technique (DeTCT), a purely quantitative digital gene expression package that covers both sample processing and analysis. It begins with tissue samples and produces a text or HTML table comprising genomic coordinates representing the 3′ ends of genes, raw and normalised counts, and a fold change in transcript abundance between two conditions with an associated p-value. The simplified library preparation and analysis protocol incorporates a sample indexing system, which allows large numbers of samples and replicates to be processed and sequenced. The genomic coordinates can be compared to existing gene annotation, but they also identify unannotated genomic regions showing an alteration in polyA+ transcript number.
DeTCT pipeline workflow. Between 9 and 11 pairs of mutant and normal zebrafish embryos were collected from one clutch and RNA was extracted. (a) Following DNaseI treatment and chemical fragmentation, molecules representing the 3′ end of transcripts were enriched by pulldown using an anchored biotinylated oligo dT primer attached to streptavidin magnetic beads (orange line). An RNA oligo matching part of the Illumina read 2 adapter (purple line) was ligated onto the 5′ end, and the RNA was eluted and annealed to an oligo comprising a partial Illumina read 1 adapter (dark blue line) followed by 12 random bases (beige line), then an eight-base indexing sequence specific to each sample (light blue line) and finally a 14-base anchored polyT sequence (grey line). After reverse transcription, the Illumina adapter sequences were completed during library amplification. Libraries were quantified, pooled in equimolar amounts and sequenced on an Illumina HiSeq 2500. (b) After decoding the indexing sequence, the trimmed zebrafish sequences (read 1 in green and read 2 in red) were mapped to the reference genome and duplicate reads were flagged. (c) The coordinate representing the transcript counting 3′ end (TC 3′ end) was predicted using the base immediately 3′ of the polyT sequence in read 1 (green dashed arrow and green curved line). After calling peaks using all mapped read 2s, the resulting counts were associated with their respective sample (red curved line). The count data were used to identify differential transcript abundance between conditions using DESeq2 and reported as a fold change with an adjusted p-value. The TC 3′ ends were matched to the closest Ensembl transcript 3′ ends on the same strand (black line). Gene list tables were produced and ordered by the lowest adjusted p-value. These gene lists were filtered for genes showing differential transcript abundance using the adjusted p-value and the proximity of the TC 3′ end to the Ensembl gene end (typically an adjusted p-value <= 0.05 and a TC 3′ end between -100 and +5000 bases from the Ensembl gene end).
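Two steps in the workflow above lend themselves to a short illustration: splitting read 1 into its 12 random bases, eight-base sample index and 14-base polyT stretch, and applying the typical gene-list filter. The Python sketch below is illustrative only and is not taken from the DeTCT source code. The function names, the dictionary keys (`gene`, `padj`, `distance`), the exact-match polyT check and the sign convention for `distance` are all assumptions made for this example.

```python
from typing import NamedTuple, Optional

# Segment lengths taken from the read 1 oligo described in the workflow.
RANDOM_LEN = 12  # random bases
INDEX_LEN = 8    # sample-specific indexing sequence
POLYT_LEN = 14   # anchored polyT sequence


class Read1Parts(NamedTuple):
    random_bases: str
    sample_index: str
    insert: str  # starts at the base immediately 3' of the polyT


def split_read1(seq: str) -> Optional[Read1Parts]:
    """Split a read 1 sequence into its parts, or return None if it does not fit.

    Assumption: the polyT stretch must match exactly; a production
    pipeline may well tolerate mismatches.
    """
    polyt_start = RANDOM_LEN + INDEX_LEN
    polyt_end = polyt_start + POLYT_LEN
    if len(seq) <= polyt_end:
        return None  # no base left 3' of the polyT
    if seq[polyt_start:polyt_end] != "T" * POLYT_LEN:
        return None
    return Read1Parts(seq[:RANDOM_LEN], seq[RANDOM_LEN:polyt_start], seq[polyt_end:])


def filter_gene_list(rows, max_padj=0.05, min_dist=-100, max_dist=5000):
    """Keep rows passing both thresholds, ordered by ascending adjusted p-value.

    Each row is a dict with hypothetical keys 'gene', 'padj' and 'distance',
    where 'distance' is the signed offset in bases between the TC 3' end and
    the Ensembl gene end (sign convention assumed, not given in the text).
    Rows with a missing adjusted p-value (None) are dropped.
    """
    kept = [
        row for row in rows
        if row["padj"] is not None
        and row["padj"] <= max_padj
        and min_dist <= row["distance"] <= max_dist
    ]
    return sorted(kept, key=lambda row: row["padj"])
```

With the default thresholds, a row with an adjusted p-value of 0.001 but a distance of 6000 bases is discarded, while a row at exactly 0.05 and -100 bases is kept, since both bounds are treated as inclusive here.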
This method can be implemented on polyadenylated RNA from any organism with an annotated reference genome and in any laboratory with access to Illumina sequencing.
Availability – The source code for the DeTCT pipeline is available on GitHub: https://github.com/iansealy/DETCT