RNA-seq enables gene expression profiling in selected spatiotemporal windows and yields massive sequence information with relatively low cost and time investment, even for non-model species. However, considerable room remains for optimizing its workflow to take full advantage of continuously growing sequencing capacity.
Researchers at the RIKEN Center for Life Science Technologies performed transcriptome sequencing of the Madagascar ground gecko (Paroedura picta) on the Illumina platform. To take advantage of increased read lengths of >150 nt, they demonstrated a shortened RNA fragmentation time, which resulted in a dramatic shift of the insert size distribution.
The output reads were assembled de novo to reconstruct transcript sequences. To evaluate the products of multiple de novo assembly runs incorporating reads that differed in RNA source, read length, and insert size, the researchers introduced a new reference gene set, core vertebrate genes (CVG), consisting of 233 genes that are shared as one-to-one orthologs by all vertebrate genomes examined (29 species). The completeness assessment performed with the computational pipelines CEGMA and BUSCO, referring to CVG, demonstrated higher accuracy and resolution than with the gene set previously established for this purpose. As a result of the assessment with CVG, the researchers derived the most comprehensive transcript sequence set of the Madagascar ground gecko by assembling individual libraries and then clustering the assembled sequences based on their overall similarity.
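The "one-to-one ortholog in every species" criterion can be illustrated with a minimal sketch. It assumes ortholog groups are available as a mapping from a group ID to (species, gene) pairs; the function name, group IDs, and species labels are hypothetical toy values. It also omits the tunicate outgroup check that the actual CVG selection used to exclude genes with vertebrate-specific paralogs.

```python
from collections import Counter


def select_one_to_one(groups, species):
    """Return IDs of ortholog groups with exactly one gene in every listed species.

    groups  : dict mapping group ID -> list of (species, gene) pairs (assumed layout)
    species : iterable of species that must each contribute exactly one gene
    """
    required = set(species)
    selected = []
    for group_id, members in groups.items():
        # Count how many genes each species contributes to this group.
        counts = Counter(sp for sp, _gene in members)
        # Counter returns 0 for absent species, so missing genes fail the check too.
        if all(counts[sp] == 1 for sp in required):
            selected.append(group_id)
    return selected


# Toy, made-up input (not real eggNOG data).
toy_groups = {
    "OG1": [("spA", "a1"), ("spB", "b1"), ("spC", "c1")],
    "OG2": [("spA", "a2"), ("spA", "a3"), ("spB", "b2"), ("spC", "c2")],  # duplicated in spA
    "OG3": [("spA", "a4"), ("spB", "b3")],  # absent from spC
}

print(select_one_to_one(toy_groups, ["spA", "spB", "spC"]))
```

In this toy input only OG1 passes: OG2 is rejected because of the extra copy in one species, and OG3 because one species lacks the gene.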
Core Vertebrate Genes (CVG). (a) Flowchart showing the selection procedure of the CVG from the chordate ortholog groups of eggNOG v4.0 (ChorNOGs). The 26 core species were specified by eggNOG. Components of the CVG are shown in Additional file 5. (b) Taxonomic ranges of CEG (on a light blue background) and CVG (on a magenta background). The CEG consists of the six species with asterisks, and the CVG set for CEGMA consists of the eight species in magenta. Tunicate orthologs were used as an outgroup in order to distinguish one-to-one orthologs conserved in vertebrates from those with additional paralogs duplicated in the vertebrate lineage. Those with no additional vertebrate paralog were included in CVG. (c) Completeness scores of the transcriptome assemblies assessed by CEGMA referring to the 248 CEGs and 233 CVGs. The scores indicate the proportions of genes recognized as ‘complete’ in individual assemblies by CEGMA out of 248 CEGs and 233 CVGs. See Additional file 8 for the results of an equivalent assessment with BUSCO.
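The completeness score described in the caption is a simple proportion: the number of reference genes reported as 'complete', divided by the size of the reference set (248 for CEG, 233 for CVG). The sketch below shows that arithmetic on toy values. It assumes the set of 'complete' gene IDs has already been extracted from a CEGMA or BUSCO report; the function name and gene IDs are hypothetical, and no real tool output is parsed here.

```python
def completeness_score(complete_ids, reference_ids):
    """Fraction of reference genes that were recognized as 'complete'.

    complete_ids  : IDs reported as complete for one assembly (assumed pre-parsed)
    reference_ids : IDs of the full reference gene set (e.g. the CVG list)
    """
    reference = set(reference_ids)
    if not reference:
        raise ValueError("reference gene set must not be empty")
    # Only count hits that actually belong to the reference set.
    found = reference & set(complete_ids)
    return len(found) / len(reference)


# Toy, made-up IDs standing in for a reference gene set and one assembly's hits.
toy_reference = ["gene1", "gene2", "gene3", "gene4"]
toy_complete = ["gene1", "gene2", "gene4", "unrelated_gene"]

score = completeness_score(toy_complete, toy_reference)
print(f"{score:.0%} of reference genes complete")
```

Because the denominator is fixed by the reference set, scores computed this way are comparable across assemblies only when the same gene set is used for all of them.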
These results provide several insights into optimizing the de novo RNA-seq workflow, including the coordination between library insert size and read length, which manifested in improved connectivity of assemblies. The approach and the assembly assessment with CVG demonstrated here should be applicable to transcriptome analysis of other species as well as to whole-genome analyses.