A sensitive approach to quantitative analysis of transcriptional regulation in diploid organisms is the analysis of allelic imbalance (AI) in RNA sequencing (RNA-seq) data. A near-universal practice in such studies is to prepare and sequence only one library per RNA sample. Harvard Medical School researchers present theoretical and experimental evidence that data from a single RNA-seq library are insufficient for reliable quantification of the contribution of technical noise to the observed AI signal; consequently, reliance on a one-replicate experimental design can lead to unaccounted-for variation in error rates in allele-specific analysis. The researchers have developed a computational approach, Qllelic, that accurately accounts for technical noise by making use of replicate RNA-seq libraries. Testing on new and existing datasets shows that applying Qllelic greatly decreases the false positive rate in allele-specific analysis while conserving appropriate signal, and thus greatly improves the reproducibility of AI estimates. The researchers explore sources of technical overdispersion in the observed AI signal and conclude by discussing the design of RNA-seq studies that address two biologically important questions: quantification of transcriptome-wide AI in one sample, and differential analysis of allele-specific expression between samples.
Different combinations of signal and noise parameters result in indistinguishable observed distributions of AI values
(a) Two simulated parametrizations (left) of true AI signal (AItrue; solid line) and noise (dashed line) that combine to produce overlapping observed AI values (right; red and blue, respectively). These and similar observations are indistinguishable by Mann–Whitney–Wilcoxon and Kolmogorov–Smirnov tests; see Supplementary Note S1. AI distributions are shown for allelic coverage 100. The noise distribution is shown at AItrue = 0.5. Both signal and noise are modeled using beta-binomial distributions; the following parameters are shown: [ρSignal = 0.001, ρNoise = 0.1] and [ρSignal = 0.1, ρNoise = 0.001]; simulation sample size 500,000. (b) Quantile–Quantile (QQ) plot for the distributions set by parametrizations 1 and 2 from panel (a). Quantiles were taken from 0 to 1, with step 0.01.
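The simulation in the caption can be approximated with a short sketch. This is not the authors' code. It assumes a mean/overdispersion form of the beta-binomial (alpha = mean(1 − ρ)/ρ, beta = (1 − mean)(1 − ρ)/ρ), and it assumes that signal and noise combine hierarchically: AItrue is drawn around 0.5 with ρSignal, then the observed AI is drawn around AItrue with ρNoise. The function names are hypothetical; the coverage, ρ values, sample size, and quantile grid come from the caption.

```python
import numpy as np

COVERAGE = 100        # allelic coverage stated in the caption
N_SAMPLES = 500_000   # simulation sample size stated in the caption


def beta_binomial_fraction(rng, coverage, mean, rho, size=None):
    """Draw beta-binomial counts and return them as allelic fractions.

    Assumed parametrization: alpha = mean * (1 - rho) / rho,
    beta = (1 - mean) * (1 - rho) / rho.
    """
    # Keep the mean strictly inside (0, 1) so both beta shapes stay positive.
    mean = np.clip(mean, 1e-6, 1.0 - 1e-6)
    scale = (1.0 - rho) / rho
    p = rng.beta(mean * scale, (1.0 - mean) * scale, size=size)
    return rng.binomial(coverage, p) / coverage


def simulate_observed_ai(rho_signal, rho_noise, n=N_SAMPLES, seed=0):
    """Assumed two-stage model: signal around 0.5, then noise around AItrue."""
    rng = np.random.default_rng(seed)
    ai_true = beta_binomial_fraction(rng, COVERAGE, 0.5, rho_signal, size=n)
    return beta_binomial_fraction(rng, COVERAGE, ai_true, rho_noise)


# The two parametrizations listed in the caption.
obs_1 = simulate_observed_ai(rho_signal=0.001, rho_noise=0.1, seed=1)
obs_2 = simulate_observed_ai(rho_signal=0.1, rho_noise=0.001, seed=2)

# Quantiles from 0 to 1 with step 0.01, as used for the QQ plot in panel (b).
probs = np.linspace(0.0, 1.0, 101)
q_1 = np.quantile(obs_1, probs)
q_2 = np.quantile(obs_2, probs)

print("variances:", obs_1.var(), obs_2.var())
print("largest interior quantile gap:", np.abs(q_1 - q_2)[1:-1].max())
```

Plotting `q_1` against `q_2` gives a QQ plot in the spirit of panel (b). Under this assumed combination rule the variance of the observed AI is symmetric in ρSignal and ρNoise, which is one reason swapping the two values yields nearly the same observed distribution.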
Availability – The AI estimation tools described here are implemented in two parts. The data processing steps from read alignment to allelic counts were reimplemented as ASEReadCounter* (github.com/gimelbrantlab/asereadcounter_star). Calculation of the quality correction constant (QCC), estimation of confidence intervals, and differential AI analysis are implemented in the Qllelic tool set (github.com/gimelbrantlab/Qllelic).