Large-scale sequencing of cDNA (RNA-seq) has been a boon to the quantitative analysis of transcriptomes. A notable application is the detection of changes in transcript usage between experimental conditions. For example, discovery of pathological alternative splicing may allow the development of new treatments or better management of patients. From an analysis perspective, there are several ways to approach RNA-seq data to unravel differential transcript usage, such as annotation-based exon-level counting, differential analysis of the 'percent spliced in' measure or quantitative analysis of assembled transcripts.
A team led by researchers at the University of Zurich set out to compare and contrast current state-of-the-art methods, as well as to suggest improvements to commonly used workflows. They assessed the performance of representative workflows using synthetic data and explored the effect of using non-standard counting bin definitions as input to a state-of-the-art inference engine (DEXSeq).
They found that, although the canonical counting provided the best results overall, several non-canonical approaches were as good or better in specific aspects and most counting approaches outperformed the evaluated event- and assembly-based methods. The researchers show that an incomplete annotation catalog can have a detrimental effect on the ability to detect differential transcript usage in transcriptomes with few isoforms per gene and that isoform-level pre-filtering can considerably improve false discovery rate (FDR) control. Count-based methods generally perform well in detection of differential transcript usage. Controlling the FDR at the imposed threshold is difficult, mainly in complex organisms, but can be improved by pre-filtering of the annotation catalog.
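To make the idea of isoform-level pre-filtering concrete, here is a minimal sketch in Python. It assumes that filtering is done on relative abundance: an isoform is kept only if it accounts for at least some minimum fraction of its gene's total abundance in at least one sample. The function name `prefilter_isoforms`, the input layout and the 5% cutoff are illustrative assumptions, not details taken from the study.

```python
from collections import defaultdict


def prefilter_isoforms(abundance, gene_of, min_fraction=0.05):
    """Return the isoforms that pass a relative-abundance filter.

    abundance:    dict mapping isoform id -> list of per-sample abundance estimates
    gene_of:      dict mapping isoform id -> gene id
    min_fraction: hypothetical cutoff; an isoform is kept if its share of the
                  gene total reaches this value in at least one sample
    """
    if not abundance:
        return []

    n_samples = len(next(iter(abundance.values())))

    # Sum the abundance of all isoforms of each gene, sample by sample.
    gene_totals = defaultdict(lambda: [0.0] * n_samples)
    for isoform, values in abundance.items():
        totals = gene_totals[gene_of[isoform]]
        for i, value in enumerate(values):
            totals[i] += value

    kept = []
    for isoform, values in abundance.items():
        totals = gene_totals[gene_of[isoform]]
        # Relative abundance of this isoform within its gene, per sample.
        fractions = [v / t if t > 0 else 0.0 for v, t in zip(values, totals)]
        if max(fractions) >= min_fraction:
            kept.append(isoform)
    return kept


# Toy example with one gene, three isoforms and two samples.
abundance = {"tx1": [90.0, 80.0], "tx2": [9.0, 19.0], "tx3": [1.0, 1.0]}
gene_of = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneA"}
print(prefilter_isoforms(abundance, gene_of))  # ['tx1', 'tx2']
```

In a workflow of this kind, the reduced isoform list would then be used to trim the annotation catalog before the counting bins are defined.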
Overall performance of the evaluated methods. The three circles for each method indicate the observed false discovery rate (FDR) and true positive rate (TPR) when the gene-wise q-values are thresholded at three commonly used thresholds: 0.01, 0.05 and 0.1. Ideally each circle should fall to the left of the corresponding vertical line, since this would indicate that the FDR is controlled at the imposed level. A circle is filled if the FDR is controlled and open otherwise. The total number of genes (n) as well as the number affected by differential isoform usage (n.ds) are given in the panel headers. Overall, only cuffdiff manages to control the FDR, but at the cost of a reduction in power (TPR). The FDR control is worse in the human simulation than in the fruit fly simulation, potentially due to the larger number of isoforms in the human transcriptome.
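The quantities plotted in the figure can be computed from gene-wise q-values and the known truth of a simulation. The sketch below assumes that a gene is called significant when its q-value is at or below the threshold, and that the FDR is defined as zero when no gene is called. The function name `observed_fdr_tpr` and the toy data are illustrative, not taken from the study.

```python
def observed_fdr_tpr(q_values, truly_changed, threshold):
    """Observed FDR and TPR when gene-wise q-values are cut at `threshold`.

    q_values:      dict mapping gene id -> q-value
    truly_changed: set of gene ids with simulated differential isoform usage
    """
    # Assumption: a gene is called if its q-value is <= threshold.
    called = {gene for gene, q in q_values.items() if q <= threshold}
    true_positives = len(called & truly_changed)
    false_positives = len(called) - true_positives

    # Fraction of called genes that are false; 0.0 by convention if none are called.
    fdr = false_positives / len(called) if called else 0.0
    # Fraction of truly changed genes that were called.
    tpr = true_positives / len(truly_changed) if truly_changed else 0.0
    return fdr, tpr


# Toy data: five genes, three of which truly change isoform usage.
q_values = {"g1": 0.001, "g2": 0.03, "g3": 0.04, "g4": 0.08, "g5": 0.5}
truly_changed = {"g1", "g2", "g4"}

for threshold in (0.01, 0.05, 0.1):
    fdr, tpr = observed_fdr_tpr(q_values, truly_changed, threshold)
    # A filled circle in the figure corresponds to controlled == True.
    controlled = fdr <= threshold
    print(f"threshold={threshold}: FDR={fdr:.3f}, TPR={tpr:.3f}, controlled={controlled}")
```

In this toy data the FDR is controlled only at the 0.01 threshold. At 0.05 and 0.1 the single false call (g3) pushes the observed FDR above the imposed level, which would correspond to open circles in the figure.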