RNA-seq has been widely used for genome-wide expression profiling. RNA-seq data typically consists of tens of millions of short sequenced reads from different transcripts. However, due to sequence similarity among genes and among isoforms, the source of a given read is often ambiguous. Existing approaches for estimating expression levels from RNA-seq reads tend to compromise between accuracy and computational cost.
Researchers at Harvard Medical School have introduced a new approach for quantifying transcript abundance from RNA-seq data. EMSAR (Estimation by Mappability-based Segmentation And Reclustering) groups reads according to the set of transcripts to which they are mapped, then finds maximum likelihood estimates using a joint Poisson model for each optimal set of transcript segments. The method uses nearly all mapped reads, including those mapped to multiple genes. With efficient transcriptome indexing based on modified suffix arrays, EMSAR minimizes CPU time and memory use while achieving accuracy comparable to the best existing methods.
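The grouping step can be illustrated with a minimal Python sketch. This is not EMSAR's implementation, which relies on suffix-array indexing; the function names, the dictionary input format and the toy alignments are all hypothetical. The sketch counts reads per distinct combination of transcripts and then uses a simple union-find to merge transcripts that share any reads into sequence-sharing sets.

```python
from collections import Counter


def count_reads_by_transcript_set(alignments):
    """Count reads per distinct combination of transcripts.

    alignments: dict mapping a read id to the transcript ids it maps to
    (a hypothetical input format chosen for this sketch).
    """
    counts = Counter()
    for transcripts in alignments.values():
        counts[frozenset(transcripts)] += 1
    return counts


def sequence_sharing_sets(transcript_sets):
    """Merge transcripts that co-occur in any combination (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for combo in transcript_sets:
        members = list(combo)
        if not members:
            continue
        root = find(members[0])
        for t in members[1:]:
            parent[find(t)] = root

    groups = {}
    for t in list(parent):
        groups.setdefault(find(t), set()).add(t)
    return sorted(sorted(g) for g in groups.values())


# Toy data (invented): A, B, C are isoforms of one gene, D and E are other genes.
toy_alignments = {
    "r1": ["A", "B", "C"],
    "r2": ["A", "B"],
    "r3": ["A", "B"],
    "r4": ["C", "D"],  # a read shared between two genes
    "r5": ["D"],
    "r6": ["E"],
}
toy_counts = count_reads_by_transcript_set(toy_alignments)
toy_sets = sequence_sharing_sets(toy_counts)
print(toy_counts[frozenset({"A", "B"})], toy_sets)
```

In this toy case the read shared by C and D pulls transcripts from two genes into one set, while E stays in a set of its own, so the two sets could be estimated independently.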
Principles and overall flow of EMSAR. An illustration of the key elements of EMSAR for single-end RNA-seq. Gene 1 has three splice isoforms and gene 2 has one. The two genes share some sequences, indicated by the yellow reads that are mapped to two locations. The RNA-seq reads are colored according to which transcripts share the read sequence. Each read count (X_1, …, X_6) is the number of RNA-seq reads in the same 'segment', that is, reads shared by the same combination of transcripts. The length of each segment (L_1, …, L_6) corresponds to the number of possible distinct virtual reads in that group. The read count of each segment depends on the total expression level of the isoforms associated with it multiplied by the length of the segment. Transcripts are grouped into sequence-sharing sets such that transcripts associated with the same segment belong to the same set. A sequence-sharing set may contain transcripts from one or more genes. In this illustration, the top set (set 1) represents the transcripts shown on the left side. The four expression quantities (e_A, …, e_D) are estimated by considering the six segments simultaneously within this set.
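To make the joint Poisson model concrete, here is a small Python sketch for a single sequence-sharing set. It assumes each read count X_s is Poisson distributed with mean L_s times the summed expression of the transcripts associated with segment s. The text does not say how EMSAR maximizes the likelihood, so the expectation-maximization update below is a standard technique swapped in for illustration. The function name, the input format and the toy numbers are hypothetical, and the estimates are in reads per virtual read position, not a normalized unit such as FPKM.

```python
def estimate_expression(segments, n_iter=500):
    """Maximum likelihood estimates under X_s ~ Poisson(L_s * sum of e_t).

    segments: list of (transcript_ids, length, read_count) tuples, with
    positive lengths (a hypothetical format chosen for this sketch).
    Returns a dict mapping each transcript id to its estimated level.
    """
    transcripts = sorted({t for ts, _, _ in segments for t in ts})

    # Total segment length in which each transcript takes part.
    total_len = {t: 0.0 for t in transcripts}
    for ts, length, _ in segments:
        for t in ts:
            total_len[t] += length

    e = {t: 1.0 for t in transcripts}  # arbitrary positive starting point
    for _ in range(n_iter):
        # E-step: split each segment's reads among its transcripts
        # in proportion to the current expression estimates.
        assigned = {t: 0.0 for t in transcripts}
        for ts, _, count in segments:
            denom = sum(e[t] for t in ts)
            if denom == 0.0:
                continue
            for t in ts:
                assigned[t] += count * e[t] / denom
        # M-step: reads assigned to a transcript divided by its total length.
        e = {t: assigned[t] / total_len[t] for t in transcripts}
    return e


# Toy set with two transcripts; counts were constructed from e_A = 2, e_B = 3.
toy_segments = [
    (("A",), 100, 200),      # segment unique to A: 100 * 2
    (("A", "B"), 50, 250),   # segment shared by A and B: 50 * (2 + 3)
    (("B",), 100, 300),      # segment unique to B: 100 * 3
]
toy_estimates = estimate_expression(toy_segments)
print(toy_estimates)
```

Because the toy counts were generated without noise from e_A = 2 and e_B = 3, the estimates should approach those values. With real data the counts are noisy, and the shared segment is what couples the two estimates together.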
Availability – EMSAR is available at https://github.com/parklab/emsar