The usual analysis of RNA sequencing (RNA-seq) reads is based on an existing reference genome and annotated gene models. When no reference is available for the sequenced species, the alternatives are to use a reference genome from a related species or to reconstruct transcript sequences by de novo assembly. Researchers also face many options for RNA-seq data processing, with limited information on how their decisions will affect the final outcome. Using both a diploid and a polyploid species with a distant reference genome, University of Tennessee researchers tested how different tools at various steps of a typical RNA-seq analysis workflow influence the recovery of useful processed data for downstream analysis.
At the preprocessing step, the researchers found that error correction has a strong influence on de novo assembly but not on mapping results. After trimming, a greater percentage of reads could be used in downstream analysis when gentle quality trimming was performed with Skewer instead of strict quality trimming with Trimmomatic. This availability of reads correlated with the size, quality, and completeness of the de novo assemblies and with the number of mapped reads. When reads were mapped to a reference genome from a related species, the outcome improved significantly with mapping software that tolerates greater sequence divergence, such as Stampy or GSNAP.
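The comparison of read availability after trimming can be illustrated with a minimal Python sketch. It is not the researchers' code: the function names `count_fastq_reads` and `percent_retained` are hypothetical, and the sketch assumes standard four-line FASTQ records, optionally gzip-compressed.

```python
import gzip


def count_fastq_reads(path):
    """Count records in a FASTQ file, assuming exactly 4 lines per record."""
    # Assumption: files ending in .gz are gzip-compressed, others are plain text.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def percent_retained(raw_reads, trimmed_reads):
    """Percentage of raw reads still available after trimming."""
    if raw_reads <= 0:
        raise ValueError("raw read count must be positive")
    return 100.0 * trimmed_reads / raw_reads
```

Calling `percent_retained(count_fastq_reads(raw), count_fastq_reads(trimmed))` on the raw and trimmed files of each trimming strategy would give one retention figure per strategy, which is the kind of quantity compared above.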
Schematic view of the RNA-seq pipeline followed in this work
(A) Samples were obtained from roots of the diploid Vaccinium arboreum (VA) and the tetraploid Vaccinium corymbosum (VC) grown at either pH 4.5 or 6.5 and sequenced. (B) Paired-end (PE) Illumina reads were either error corrected (cor; black lines) or not (Uc) and trimmed for removal of adapters and either low-quality bases (trimm; red crosses) or not (skewer). (C) Each set of reads was subjected to two de novo transcriptome assembly methods (assembling two individual samples and merging the results, or assembling four combined samples) with three assemblers, followed by redundancy reduction with the CD-HIT and RapClust clustering methods. Metrics were computed at every step. Trinity transcriptomes were further annotated, and their CD-HIT clusters were used for mapping (underlined). (D) Transcripts were mapped to a diploid VC genome with GMAP to obtain mapping metrics, while short reads were mapped to either the genome or a transcriptome using multiple read aligners to obtain read counts.
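To give a sense of how many assembly conditions such a design produces, the sketch below enumerates the options named in the caption. It assumes the options are fully crossed, which the caption suggests but does not state outright. Only Trinity is named among the three assemblers, so `assembler_2` and `assembler_3` are placeholders, and the strategy labels and the function name `enumerate_conditions` are likewise hypothetical.

```python
from itertools import product

# Option labels follow the figure caption where it provides them.
ERROR_CORRECTION = ["cor", "Uc"]
TRIMMING = ["trimm", "skewer"]
# Hypothetical labels for the two assembly methods described in panel (C).
ASSEMBLY_STRATEGY = ["individual_merged", "combined"]
# Only Trinity is named in the caption; the other two names are placeholders.
ASSEMBLERS = ["Trinity", "assembler_2", "assembler_3"]
CLUSTERING = ["CD-HIT", "RapClust"]


def enumerate_conditions():
    """Return every combination of preprocessing, assembly, and clustering options."""
    keys = ["correction", "trimming", "strategy", "assembler", "clustering"]
    options = [ERROR_CORRECTION, TRIMMING, ASSEMBLY_STRATEGY, ASSEMBLERS, CLUSTERING]
    return [dict(zip(keys, combo)) for combo in product(*options)]


conditions = enumerate_conditions()
```

Under the fully crossed assumption this gives 2 × 2 × 2 × 3 × 2 = 48 conditions per species, which shows why computing metrics at every step matters for comparing them.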
Careful selection of bioinformatics software tools for RNA-seq data analysis can maximize the quality of de novo assemblies and the number of reads available for downstream analysis.