A simple, bias-independent fragmentation model to compute the expected coverage profile across a transcript

RNA sequencing (RNA-seq) has emerged as the method of choice for measuring the expression of RNAs in a given cell population. In most RNA-seq technologies, sequencing the full length of RNA molecules requires fragmenting them into smaller pieces. Nonuniform sequencing coverage across a genomic feature has been a concern in RNA-seq and is attributed to biases for certain fragments during library preparation and sequencing.

To investigate the coverage expected from fragmentation alone, University of Vienna researchers developed a simple fragmentation model that is independent of bias from the experimental method and is not specific to the transcript sequence. Essentially, they enumerated all configurations for maximal placement of fragments of a given length, F, on a transcript of length T to represent every possible fragmentation pattern, and from these they computed the expected coverage profile across the transcript. The researchers extended this model to incorporate general empirical attributes such as read length, fragment length distribution, and the number of molecules of the transcript. They further introduced the fragment starting-point, fragment coverage, and read coverage profiles. They found that the expected profiles are not uniform, and that the variability of coverage across a transcript is influenced by the ratio of fragment length to transcript length, the ratio of read length to fragment length, the fragment length distribution, and the number of molecules. Finally, the researchers explored a potential application of the model: using simulations, they showed that it is possible to correctly estimate the transcript copy number for any transcript in the RNA-seq experiment.

The fragment placement pattern space for various transcript lengths


(A) Fragments 3 and 4 bases long placed on transcripts of lengths 3–5 bases. (B) Fragments 3 and 4 bases long placed on a transcript 10 bases long. Each row represents a unique fragmentation pattern, in which fragments are placed until the remaining positions on the transcript permit no further placement of a fragment. The computed starting-point profile (SPP) and fragment coverage profile (FCP) are, respectively, the sum of fragment starting-points (boxes shaded red) at each position and the sum of fragments covering each position; both are shown under each pattern space. Sections bordered in green show pattern spaces of shorter transcripts found within longer transcripts. Dashed green borders show the pattern spaces for transcripts shorter than the fragment length.
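
The enumeration described in the caption can be tried out on small cases with a short script. The following is a minimal sketch, not the authors' implementation. It assumes that fragments within one pattern do not overlap, that a single fragment length F is used per pattern space, and that a pattern is "maximal" when every uncovered gap is shorter than F. The function names `maximal_placements` and `profiles` are hypothetical and chosen for illustration only.

```python
from typing import Iterator, List, Tuple


def maximal_placements(T: int, F: int) -> Iterator[Tuple[int, ...]]:
    """Yield every maximal placement of length-F fragments on a length-T transcript.

    Each placement is a tuple of 0-based fragment start positions.
    Assumption (not taken from the paper's text): fragments do not overlap,
    and a placement is maximal when no uncovered gap can host another fragment.
    """

    def extend(pos: int, starts: List[int]) -> Iterator[Tuple[int, ...]]:
        # pos is the first position after the last placed fragment.
        # The next fragment may start after a gap of 0..F-1 bases;
        # a gap of F or more would leave room for another fragment.
        for gap in range(F):
            s = pos + gap
            if s + F <= T:
                yield from extend(s + F, starts + [s])
        # The pattern may end here only if the remaining tail is too short
        # to host a further fragment.
        if T - pos < F:
            yield tuple(starts)

    yield from extend(0, [])


def profiles(T: int, F: int) -> Tuple[int, List[int], List[int]]:
    """Return (number of patterns, SPP, FCP) as raw counts over all patterns."""
    spp = [0] * T  # how many fragments start at each position
    fcp = [0] * T  # how many fragments cover each position
    n_patterns = 0
    for starts in maximal_placements(T, F):
        n_patterns += 1
        for s in starts:
            spp[s] += 1
            for p in range(s, s + F):
                fcp[p] += 1
    return n_patterns, spp, fcp


if __name__ == "__main__":
    print(profiles(5, 3))  # (3, [1, 1, 1, 0, 0], [1, 2, 3, 2, 1])
```

Under these assumptions, even a transcript of 5 bases with fragments of 3 bases gives profiles that are not flat: starting-points are restricted to the first three positions, and coverage peaks in the middle of the transcript. Dividing the counts by the number of patterns would give per-pattern averages, which assumes every pattern is equally likely.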

Prakash C, Von Haeseler A. (2016) An Enumerative Combinatorics Model for Fragmentation Patterns in RNA Sequencing Provides Insights into Nonuniformity of the Expected Fragment Starting-Point and Coverage Profile. J Comput Biol [Epub ahead of print].
