Researchers at the University of Colorado, Boulder have developed a fast and simple algorithm to detect nascent RNA transcription in global nuclear run-on sequencing (GRO-seq) data. GRO-seq is a relatively new protocol that captures nascent transcripts from actively engaged polymerase, providing a direct readout of bona fide transcription. Most traditional assays, such as RNA-seq, measure steady-state RNA levels, which are affected by transcription, post-transcriptional processing, and RNA stability. GRO-seq data, however, present unique analysis challenges that are only beginning to be addressed.
A schematic showing how contig length and coverage statistics discriminate active from inactive nascent transcription.
Regions of active transcription contain many long contigs (positive length, not drawn to scale) with significant read coverage (labeled in blue), interspersed with short regions of no coverage. Coverage statistics comprise the mean, median, mode and variance of reads (black bars) across a contig (see Table S1). In segments with no reads, a gap (labeled in green) is defined by a negative length value, and all coverage statistics are set to zero. For our algorithm, reads (grey bars) are represented by only their 5′ position (black points). Therefore a contig is also a continuous region where every base has at least one read’s 5′ end at that position. Consequently, small gaps between contigs have a high probability of being in an active call.
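The contig and gap representation described in the caption above can be illustrated with a minimal Python sketch. This is not the FStitch implementation: the function name segment_contigs, the dictionary layout, and the use of population variance are assumptions made for illustration only. It treats each read as its 5′ position, calls a run of covered bases a contig (positive length), and calls a run of uncovered bases a gap (negative length, all statistics zero).

```python
from collections import Counter
from statistics import mean, median, mode, pvariance


def segment_contigs(five_prime_positions, start, end):
    """Split the half-open interval [start, end) into contigs and gaps.

    Hypothetical helper: a contig is a maximal run of bases that each
    have at least one read 5' end; a gap is a maximal run with none.
    Requires Python 3.8+ (statistics.mode returns the first mode on ties).
    """
    # Number of read 5' ends observed at each base inside the interval.
    counts = Counter(p for p in five_prime_positions if start <= p < end)
    segments = []
    pos = start
    while pos < end:
        covered = counts[pos] > 0
        run_start = pos
        # Extend the run while the covered/uncovered status is unchanged.
        while pos < end and (counts[pos] > 0) == covered:
            pos += 1
        length = pos - run_start
        if covered:
            depths = [counts[i] for i in range(run_start, pos)]
            segments.append({
                "start": run_start,
                "length": length,  # contigs carry a positive length
                "mean": mean(depths),
                "median": median(depths),
                "mode": mode(depths),
                "variance": pvariance(depths),
            })
        else:
            segments.append({
                "start": run_start,
                "length": -length,  # gaps carry a negative length
                "mean": 0, "median": 0, "mode": 0, "variance": 0,
            })
    return segments


if __name__ == "__main__":
    reads = [10, 10, 11, 12, 15, 16, 16, 16]
    for seg in segment_contigs(reads, start=10, end=18):
        print(seg)
```

Each resulting segment is a small feature vector (signed length plus coverage statistics), which is the kind of input a downstream classifier could score.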
Here the researchers also describe a new algorithm, Fast Read Stitcher (FStitch), that takes advantage of two popular machine-learning techniques, hidden Markov models (HMMs) and logistic regression, to classify which regions of the genome are transcribed. Given a small user-defined training set, the algorithm is accurate, robust to varying read depth, annotation agnostic, and fast. Analysis of GRO-seq data without a priori need for annotation uncovers surprising new insights into several aspects of the transcription process.
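To make the pairing of logistic regression with an HMM more concrete, here is a minimal Python sketch of one plausible way the two can be combined. It should not be read as FStitch's actual model: the function names, the symmetric "stay" transition probability, the uniform start distribution, and the choice to use the logistic output directly as the emission score are all assumptions made for illustration. A logistic function scores each segment's feature vector as P(active), and a two-state Viterbi pass then picks the most likely inactive/active path, which tends to smooth over isolated low-scoring segments.

```python
import math


def logistic(weights, bias, features):
    """P(active | features) under a logistic regression model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def viterbi_two_state(p_active, stay=0.9):
    """Most likely path of states (0 = inactive, 1 = active).

    p_active: per-segment P(active), e.g. from logistic().
    stay: assumed probability of remaining in the same state.
    """
    if not p_active:
        return []
    eps = 1e-12  # guards against log(0)
    log_trans = [
        [math.log(stay), math.log(1.0 - stay)],
        [math.log(1.0 - stay), math.log(stay)],
    ]

    def log_emit(state, p):
        q = p if state == 1 else 1.0 - p
        return math.log(max(q, eps))

    # Uniform start distribution over the two states (an assumption).
    score = [math.log(0.5) + log_emit(s, p_active[0]) for s in (0, 1)]
    back = []
    for p in p_active[1:]:
        new_score, pointers = [], []
        for s in (0, 1):
            best_prev = max((0, 1), key=lambda r: score[r] + log_trans[r][s])
            pointers.append(best_prev)
            new_score.append(
                score[best_prev] + log_trans[best_prev][s] + log_emit(s, p)
            )
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    scores = [0.9, 0.95, 0.4, 0.9, 0.1, 0.05, 0.1]
    print(viterbi_two_state(scores))
```

In this toy input the third segment scores below 0.5 on its own, yet the transition penalty keeps it inside the surrounding active region, which mirrors the idea that small gaps between contigs are likely to fall within an active call.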
FStitch requires little training data and is robust to low levels of GRO-seq read coverage.
(A) Classification accuracy utilizing successively decreasing amounts of training data to learn feature vector weights, for the polynomial (d = 2 and c = 0; blue and teal) and linear (d = 1 and c = 0; green and red) kernels. (B) Classification accuracy with successively lower sequencing depth (dataset size). In this case, we trained on 5% of all available chromosome 1 labels and tested on 50 different subsamples of the curated dataset. TP = true positive rate and FN = false negative rate.
Availability – The open-source software and a comprehensive manual are freely downloadable at http://dowell.colorado.edu.