Current computational methods for estimating transcript abundance from RNA-seq data can lead to hundreds of false-positive results. Researchers from the Dana-Farber Cancer Institute show that these systematic errors stem largely from a failure to model fragment GC content bias. Sample-specific biases associated with fragment sequence features lead to misidentification of transcript isoforms. The researchers introduce alpine, a method for estimating sample-specific bias-corrected transcript abundance. By incorporating fragment sequence features, alpine greatly increases the accuracy of transcript abundance estimates, enabling a fourfold reduction in the number of false positives for reported changes in expression compared with Cufflinks. Using simulated data, they also show that alpine retains the ability to discover true positives, similar to other approaches.
Modeling and correcting fragment sequence bias
(a) Boxplot comparing reduction in mean squared error (MSE) in predicting coverage for different bias models for all 8 samples and 64 transcripts (n = 512). The models are (i) read start: the Cufflinks VLMM for read starts; (ii) the mseq model for read starts; (iii) GC: a model using fragment GC content; (iv) GC+str: as in GC plus additional terms for stretches of high GC; (v) all: the VLMM for read starts in addition to the terms in GC+str. The models that included fragment GC doubled the reduction in MSE, compared to the read start models. (b) Predicted coverage plots for bias models on GEO BC011380 (raw coverage in blue, test-set-predicted coverage in black). (c,d) Volcano plots of differential transcript expression across centers for genes with two isoforms (orange: Benjamini–Hochberg-adjusted P values less than 1%; red: Bonferroni FWER rate less than 1%). (e,f) FPKM estimates for two isoforms of BASP1 across centers by Cufflinks (e) and alpine (f). (g) ROC curves for a simulation of a confounded design, with fragment sequence bias drawn from 30 GEUVADIS samples. The “gold” ROC curve indicates the sensitivity and specificity using the true underlying fragment counts, without fragment sequence bias and with known transcript assignment for fragments. (h) The partial area under the curve (AUC) for panel g, considering false-positive rate in [0, 0.2], scaled to take values between 0 and 1.
Availability – The method is available as an R/Bioconductor package that includes data visualization tools useful for bias discovery: http://bioconductor.org/packages/alpine