The recently introduced Kallisto pseudoaligner has radically simplified the quantification of transcripts in RNA-sequencing experiments. However, as with all computational advances, reproducibility across experiments requires attention to detail. The elegant approach of Kallisto reduces dependencies, but researchers at the Keck School of Medicine of USC noted differences in quantification between versions of Kallisto, and both upstream preparation and downstream interpretation benefit from an environment that enforces a requirement for equivalent processing when comparing groups of samples.
Therefore, they created the Artemis and TxDbLite R packages to meet these needs and to ease cloud-scale deployment of the above. TxDbLite extracts structured information directly from source FASTA files with per-contig metadata, while Artemis enforces versioning of the derived indices and annotations, to ensure tight coupling of inputs and outputs while minimizing external dependencies. The two packages are combined in Illumina’s BaseSpace cloud computing environment to offer a massively parallel and distributed quantification step for power users, loosely coupled to biologically informative downstream analyses via gene set analysis (with special focus on Reactome annotations for ENSEMBL transcriptomes). Previous work has revealed that filtering transcriptomes to exclude lowly-expressed isoforms can improve statistical power, while more-complete transcriptome assemblies improve sensitivity in detecting differential transcript usage. Based on earlier work by Bourgon et. al, 2010, the researchers included this type of filtering for both gene- and transcript-level analyses within Artemis. For reproducible and versioned downstream analysis of results, they focused their efforts on ENSEMBL and Reactome integration within the qusage framework, adapted to take advantage of the parallel and distributed environment in Illumina’s BaseSpace cloud platform. The researchers show that quantification and interpretation of repetitive sequence element transcription is eased in both basic and clinical studies by just-in-time annotation and visualization. The option to retain pseudoBAM output for structural variant detection and annotation, while not insignificant in its demand for computation and storage, nonetheless provides a middle ground between de novo transcriptome assembly and routine quantification while consuming a fraction of the resources.
Customization of the Qusage algorithm accelerated its computational speed improving its performance on average by 1.34X.
Finally, they describe common use cases where investigators are better served by cloud-based computing platforms such as BaseSpace due to inherent efficiencies of scale and enlightened common self-interest. Their experiences suggest a common reference point for methods development, evaluation, and experimental interpretation.