Experimental procedures for preparing RNA-seq and single-cell (sc) RNA-seq libraries are based on assumptions regarding their underlying enzymatic reactions. Here, researchers at the University of Warwick show that the validity of these assumptions varies within libraries: coverage by sequencing reads along and between transcripts exhibits characteristic, protocol-dependent biases. To understand the mechanistic basis of this bias, they present an integrated modeling framework that infers the relationship between enzyme reactions during library preparation and the characteristic coverage patterns observed for different protocols. Analysis of new and existing (sc)RNA-seq data from six different library preparation protocols reveals that polymerase processivity is the mechanistic origin of coverage biases. The researchers apply their framework to demonstrate that lowering incubation temperature increases processivity, yield, and (sc)RNA-seq sensitivity in all protocols. They also provide correction factors based on their model for increasing the accuracy of transcript quantification in existing samples prepared at standard temperatures. Taken together, these findings improve the ability of (sc)RNA-seq libraries to accurately reflect in vivo transcript abundances.
cDNA Conversion Yields Biases of RNA-Seq Coverage
(A) Library preparation for next-generation sequencing involves reverse transcription and second-strand synthesis, followed by fragmentation. Depending on the protocol, synthesis starts and ends at certain points for the first strand (s1 and e1, respectively) and for the second strand (s2 and e2, respectively).
(B) The original mRNA (olive) is thus often non-uniformly represented by double-stranded cDNA (orange), which biases detection by RNA-seq (blue).
(C) RNA-seq coverage along transcripts for different datasets. Sequencing reads were mapped to murine, non-overlapping RefSeq transcripts without isoforms. All detected transcripts (∼10,000) were ordered from shortest (top) to longest (bottom), were adjusted to have identical length, and were divided into 20 bins each. The percentage of reads in each bin is color coded for each transcript (see legend). The distribution of transcript lengths is shown on log scale on the left. This distribution corresponds to the Wold dataset but is representative of the others, subject to minor variations due to different numbers of detected transcripts.
(D) Simplified models/scenarios of RNA-seq library preparation outcomes based on priming strategy and synthesis success.
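The length normalization and binning described for panel (C) can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names `binned_coverage` and `coverage_matrix`, the dictionary input format, and the use of 0-based read start positions along each transcript are all assumptions made for illustration.

```python
import numpy as np


def binned_coverage(read_positions, transcript_length, n_bins=20):
    """Return the percentage of reads falling in each of n_bins equal-width
    bins along a transcript rescaled to unit length.

    read_positions: hypothetical 0-based read start coordinates on the transcript.
    """
    positions = np.asarray(read_positions, dtype=float)
    # Rescale to [0, 1) so transcripts of different lengths become comparable.
    relative = positions / transcript_length
    counts, _ = np.histogram(relative, bins=n_bins, range=(0.0, 1.0))
    total = counts.sum()
    if total == 0:
        # No reads detected: return an all-zero row rather than dividing by zero.
        return np.zeros(n_bins)
    return 100.0 * counts / total


def coverage_matrix(transcripts, n_bins=20):
    """Stack per-transcript bin percentages, ordered shortest (top) to longest (bottom).

    transcripts: list of dicts with assumed keys "length" and "reads".
    """
    ordered = sorted(transcripts, key=lambda t: t["length"])
    return np.vstack(
        [binned_coverage(t["reads"], t["length"], n_bins) for t in ordered]
    )
```

Each row of the resulting matrix corresponds to one transcript and each column to one bin, so the matrix could be drawn as a color-coded heatmap of the kind the caption describes.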