With the decreased cost of RNA-Seq, an increasing number of non-model organisms have been sequenced. Due to the lack of reference genomes, de novo transcriptome assembly is required. However, there is limited systematic research evaluating the quality of de novo transcriptome assemblies and how the assembly quality influences downstream analysis.
Purdue University researchers used two authentic RNA-Seq datasets from Arabidopsis thaliana and produced transcriptome assemblies with eight programs (BinPacker, Bridger, IDBA-tran, Oases-Velvet, SOAPdenovo-Trans, SSP, Trans-ABySS and Trinity) across a series of k-mer sizes (from 25 to 71). They measured assembly quality in terms of reference genome base and gene coverage, transcriptome assembly base coverage, number of chimeras, and number of recovered full-length transcripts. SOAPdenovo-Trans performed best in base coverage, while Trans-ABySS performed best in gene coverage and number of recovered full-length transcripts. In terms of chimeric sequences, BinPacker and Oases-Velvet were the worst, while IDBA-tran, SOAPdenovo-Trans, Trans-ABySS and Trinity produced fewer chimeras across all single k-mer assemblies. In differential gene expression analysis, about 70% of the significantly differentially expressed genes (DEGs) were the same whether the reference genome or the de novo assemblies were used. The researchers further identified four reasons for the differences in significant DEGs between reference genome and de novo transcriptome assemblies: incomplete annotation, exon level differences, transcript fragmentation, and incorrect gene annotation. On that basis, they suggest that de novo assembly is beneficial even when a reference genome is available.
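To make the "about 70% the same" comparison concrete, here is a minimal Python sketch of how the overlap between two lists of significant DEGs could be computed. It is not the researchers' pipeline: the function name `deg_overlap`, the gene IDs, and the choice to express the overlap as a fraction of the reference-based DEG list are all assumptions made for illustration.

```python
def deg_overlap(reference_degs, denovo_degs):
    """Return the shared genes and the fraction of reference-based DEGs
    that are also significant in the de novo analysis.

    Assumption: the denominator is the reference-based DEG list; the
    summary does not say which denominator the study used.
    """
    reference = set(reference_degs)
    denovo = set(denovo_degs)
    shared = reference & denovo
    fraction = len(shared) / len(reference) if reference else 0.0
    return shared, fraction


# Hypothetical gene IDs, not taken from the study.
reference_degs = ["geneA", "geneB", "geneC", "geneD"]
denovo_degs = ["geneB", "geneC", "geneD", "geneE"]

shared, fraction = deg_overlap(reference_degs, denovo_degs)
print(sorted(shared), fraction)
```

Genes that fall outside the shared set (here `geneA` and `geneE`) are the ones worth inspecting by hand, since they are where issues such as incomplete annotation or transcript fragmentation would show up.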
Differences between reference genome and de novo assembly based analyses
The upper panel in each figure shows the depth of aligned bases for a region of the reference genome. The lower panel shows the gene model from the reference genome annotation and the predicted transcripts from de novo assemblies. Predicted transcripts are shown only for selected, relevant de novo assemblies. (a) Incomplete annotation: AT4G14365. UTR regions are clearly present in the de novo assembled transcripts, so the predicted transcripts are longer than the gene model from the reference genome. (b) Exon level differences: AT1G64790. The second to last exon and neighboring regions are highly expressed. De novo assembly programs assembled this region (highlighted in grey) as a distinct transcript. (c) Fragmentation: AT3G45860. The first four exons and the last three exons are consistently assembled as separate transcripts. (d) Incorrect gene model: AT3G27330. The left region with low expression contains a glycosyl transferase domain, and the more highly expressed region on the right matches a zinc finger motif.