Mix2 (rd. “mixquare”) yields highly accurate concentration estimates for gene isoforms by adapting to the positional coverage bias in RNA-Seq data. This leads to higher accuracy in the detection of differential expression of genes and gene isoforms. Mix2 enables repeatable concentration estimates across multiple library preparations and sequencing facilities and can be used as an explorative tool to investigate the positional biases present in RNA-Seq data. Mix2 is highly efficient and runs significantly faster than current state-of-the-art RNA-Seq data analysis tools.
Introduction
Fragment bias in RNA-Seq poses a serious challenge to the accurate quantification of gene isoforms. Mix2 makes no assumptions about coverage bias but fits for each gene isoform a mixture model to the data (Fig. 1). Mix2 can therefore, for instance, accurately represent the 5’ bias, as shown in Fig. 1 (a and b), whereas Cufflinks is restricted to the uniform distribution (Fig. 1c).
Figure 1 | Exemplary representation for positional fragment bias over a 2000 bps transcript modeled with a mixture of 8 normal distributions. (a) the green curve shows the combined probability density function over the whole transcript, while the blue curves show the individual mixture distributions. (b) and (c) panels display fragment distributions in a locus with two transcripts sharing one junction, as modeled by Mix2 or Cufflinks. Long and short transcripts start at 5000 and 5500 bp from the beginning of the locus, and are 2000 and 1000 bp long, respectively. The junction spans the 6000 – 6499 bp region.
Experiments and results
Mix2 was tested on the publicly available MicroArray Quality Control (MAQC) [1] and Association of Biomolecular Resource Facilities (ABRF) [2] datasets, containing RNA-Seq data from multiple sequencing facilities and library preparations which started with differently degraded RNA. For the Universal Human Reference (UHR) RNA and the Human Brain Reference (HBR) RNA samples in MAQC and ABRF, qPCR concentration measurements of around 1000 gene isoforms are available, which are regarded as the ground truth. In our experiments we compare the Mix2 software to the widely used Cufflinks [3] and to PennSeq [4], which has been shown to outperform a large number of methods for the estimation of isoform concentrations.
Figure 2 | Correlation between qPCR and FPKM values on MAQC data for Mix2 , Cufflinks, and PennSeq. (a) and (b) Correlation for one lane of UHR RNA data. The ellipses at the bottom highlight transcripts which were not detected. (c) Average R2 value over all 7 lanes of UHR and HBR RNA.
Accurate concentration estimates of gene isoforms
Our experiments show considerably better correlation between the qPCR and FPKM (expected Fragments Per Kilobase of transcript per Million fragments mapped [3]) values for the Mix2 model than for Cufflinks with bias correction [3, 5] and PennSeq [4]. Cufflinks fails to detect 117 transcripts or 16 % of the complete test set in one lane of UHR RNA data (Fig. 2b). Similarly, the average qPCR value of -1.86, for undetected transcripts, is relatively high in comparison to the average for the complete test set, which is -1.02. Hence, Cufflinks fails to detect a large number of highly abundant transcripts. The number of undetected transcripts for Mix2 is considerably smaller and equals 24, which is 3 % of the complete test set (Fig. 2a). In addition, the average qPCR value of -2.55 for Mix2 is noticeably smaller than that for Cufflinks, hence Mix2 fails to detect only transcripts of low abundance. Finally, the correlation between qPCR and FPKM values as measured by R2 is considerably higher for Mix2 than for Cufflinks and PennSeq (Fig. 2c).
2. Li, S. et al (2014). Multi-platform assessment of transcriptome profiling using RNA-seq in the ABRF next-generation sequencing study. Nature Biotechnology 32, 915-925. [abstract]
3. Trapnell, C. et al (2010). Transcript assembly and quantification by RNA-Seq reveals unnatotated transcripts and isoform switching during cell differentiation. Nature Biotechnology 28, 511-515. [abstract]
4. Hu, Y. et al (2014). PennSeq: accurate isoform-specific gene expression quantification in RNA-Seq by modeling non-uniform read distribution. Nucleic Acids Research 42:e20. [article]
5. Roberts, A. et al (2011). Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology 12, R22. [abstract]