Many genomics data analyses involve measurements made on a multivariate response. For example, alternative splicing can produce multiple expressed isoforms from the same primary transcript. In some situations the total expression of a gene does not change (e.g., between normal and disease states), but differences in the relative ratios of expressed isoforms may have significant phenotypic consequences or prognostic value. Similarly, knowledge of single nucleotide polymorphisms (SNPs) that affect splicing, so-called splicing quantitative trait loci (sQTL), will help to characterize the effects of genetic variation on gene expression. RNA sequencing (RNA-seq) provides an attractive toolbox for unraveling alternative splicing outcomes, and fast, accurate methods for transcript quantification have recently become available.
Researchers at the Swiss Institute of Bioinformatics have developed a statistical framework based on the Dirichlet-multinomial distribution that uses these quantifications to discover changes in isoform usage between conditions, as well as SNPs that affect splicing outcome. The Dirichlet-multinomial model naturally accounts for differential gene expression without losing information about overall gene abundance. Because it models the isoforms of a gene jointly, it can also account for their correlated nature. The main challenge in this approach is obtaining robust estimates of the model parameters when the number of replicates is limited. The researchers address this by sharing information across genes and show that the method improves on existing approaches in terms of standard statistical performance metrics. The framework is applicable to other multivariate scenarios, such as Poly-A-seq, or settings where beta-binomial models have been applied (e.g., differential DNA methylation).
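To make the model concrete, here is a minimal Python sketch of the Dirichlet-multinomial log-probability for the feature counts of a single gene in a single sample. It is an illustration under stated assumptions, not the DRIMSeq implementation: the function name `dm_log_pmf` is hypothetical, and the distribution is parametrized by feature proportions and a single concentration value (written γ+ in the figure caption), so that the Dirichlet parameter of each feature is its proportion times the concentration.

```python
import math


def dm_log_pmf(counts, proportions, concentration):
    """Log-probability of feature counts under a Dirichlet-multinomial model.

    counts        : observed counts for the features (e.g. isoforms) of one gene
    proportions   : expected feature proportions, summing to 1
    concentration : total concentration (gamma+); smaller values mean more
                    variability between replicates (higher dispersion)

    Hypothetical helper for illustration only; not part of DRIMSeq.
    """
    total = sum(counts)
    # Dirichlet parameter for each feature: proportion * concentration.
    alphas = [p * concentration for p in proportions]
    # Multinomial coefficient: total! / prod(count_j!).
    log_coef = math.lgamma(total + 1) - sum(math.lgamma(y + 1) for y in counts)
    # Normalizing term that depends only on the total concentration.
    log_norm = math.lgamma(concentration) - math.lgamma(total + concentration)
    # Feature-specific terms.
    log_terms = sum(
        math.lgamma(y + a) - math.lgamma(a) for y, a in zip(counts, alphas)
    )
    return log_coef + log_norm + log_terms
```

As the concentration grows, the distribution approaches an ordinary multinomial with the same proportions, which is one way to see that the extra parameter captures variability between replicates beyond multinomial sampling noise.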
In the first scenario, all genes have the same (common) dispersion, and in the second, each gene has a different (genewise) dispersion. All genes have an expression of 1000 and either 3 or 10 features, with the same proportions, estimated from kallisto counts from the Kim et al. data set. For each scenario, the dispersion (common, or genewise with and without moderation toward the common dispersion) is estimated by maximum likelihood using the dirmult R package, the raw profile likelihood, and the Cox-Reid adjusted profile likelihood. A: Median absolute error of the concentration γ+ estimates. B: Median raw error of the γ+ estimates. C: False positive (FP) rate at a p-value threshold of 0.05 for the null two-group comparisons based on the likelihood ratio statistics. The dashed line indicates the 0.05 level. The gray boxplot shows the FP rates obtained when the true concentration values are used in the inference.
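The simulation setup described in the caption can be sketched in a few lines of Python. This is a simplified illustration rather than the procedure used in the study: the proportions `[0.5, 0.3, 0.2]`, the concentration of 100, the number of replicates, and all function names are assumptions chosen for the example, and the estimator is a plain grid search over a plug-in likelihood (proportions fixed at their pooled estimates), not the profile likelihood or the Cox-Reid adjusted profile likelihood.

```python
import math
import random


def simulate_dm_counts(total, proportions, concentration, rng):
    """Draw one Dirichlet-multinomial count vector with a fixed total."""
    # Dirichlet draw via normalized gamma variates.
    gammas = [rng.gammavariate(p * concentration, 1.0) for p in proportions]
    scale = sum(gammas)
    probs = [g / scale for g in gammas]
    # Multinomial draw: assign each of the `total` units to a feature.
    draws = rng.choices(range(len(proportions)), weights=probs, k=total)
    counts = [0] * len(proportions)
    for j in draws:
        counts[j] += 1
    return counts


def dm_log_lik(samples, proportions, concentration):
    """Dirichlet-multinomial log-likelihood of several replicates.

    The multinomial coefficient is omitted because it does not depend on
    the concentration and so does not affect the maximization.
    """
    alphas = [p * concentration for p in proportions]
    log_lik = 0.0
    for counts in samples:
        total = sum(counts)
        log_lik += math.lgamma(concentration) - math.lgamma(total + concentration)
        log_lik += sum(
            math.lgamma(y + a) - math.lgamma(a) for y, a in zip(counts, alphas)
        )
    return log_lik


def estimate_concentration(samples, grid):
    """Grid-search estimate of the concentration (plug-in proportions).

    Assumes every feature has a nonzero pooled count.
    """
    feature_totals = [sum(column) for column in zip(*samples)]
    grand_total = sum(feature_totals)
    proportions = [t / grand_total for t in feature_totals]
    return max(grid, key=lambda g: dm_log_lik(samples, proportions, g))


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed so the run is reproducible
    # Assumed example values: 20 replicates, expression 1000, 3 features.
    samples = [
        simulate_dm_counts(1000, [0.5, 0.3, 0.2], 100.0, rng) for _ in range(20)
    ]
    grid = [10 ** (k / 10) for k in range(0, 41)]  # 1 to 10,000 on a log scale
    print("estimated concentration:", estimate_concentration(samples, grid))
```

Repeating this for many simulated genes and comparing the estimates with the true concentration would give error summaries of the kind shown in panels A and B. With only a few replicates per gene the estimates become much noisier, which is the situation that motivates sharing information across genes.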
Availability – This method is available as a Bioconductor R package called DRIMSeq: https://www.bioconductor.org/packages/release/bioc/html/DRIMSeq.html