For plant species with unsequenced genomes, cDNA contigs created by de novo assembly of RNA-Seq reads are used as reference sequences for comparative analysis of RNA-Seq datasets and the detection of differentially expressed genes (DEGs). Redundancies in such contigs are evident in previous RNA-Seq studies, and such redundancies can lead to difficulties in subsequent analysis. Nevertheless, the effects of removing redundancy from contig assemblies on comparative RNA-Seq analysis have not been evaluated.
Here researchers from the Tokyo University of Agriculture and Technology describe a method for removing redundancy from raw contigs that were primarily created by de novo assembly of Arabidopsis thaliana RNA-Seq reads. Specifically, the contigs with the highest bit scores were selected from the raw contigs by a homology search against the gene dataset in the TAIR10 database. Two existing methods for removing redundancy, one based on contig length and the other on clustering analysis, were also used to eliminate redundancies from the raw contigs. Contig number was reduced most effectively by the method based on homology search. In a comparative analysis of RNA-Seq datasets, DEGs detected in contigs that underwent redundancy removal via the homology search method showed the highest identity to the DEGs detected when the TAIR10 gene dataset was used as an exact reference. Redundancy in raw contigs could also be removed by a homology search against integrated protein datasets from several plant species other than A. thaliana. DEGs detected using contigs that underwent such redundancy removal also showed high homology to DEGs detected using the TAIR10 gene dataset.
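The selection step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes the homology search results are available in the standard 12-column BLAST tabular format (query ID in column 1, subject ID in column 2, bit score in column 12), and it assumes that one contig is kept per reference gene, namely the one with the highest bit score. The function name `best_contig_per_gene` and the contig IDs are hypothetical.

```python
import csv
import io


def best_contig_per_gene(blast_tsv):
    """Return {gene_id: contig_id}, keeping the highest-bit-score contig per gene.

    Assumes tab-separated BLAST tabular output with the default 12 columns,
    where contigs are the queries and reference genes are the subjects.
    Ties are resolved in favor of the hit that appears first.
    """
    best = {}  # gene_id -> (contig_id, bit_score)
    for row in csv.reader(blast_tsv, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        contig_id, gene_id, bit_score = row[0], row[1], float(row[11])
        current = best.get(gene_id)
        if current is None or bit_score > current[1]:
            best[gene_id] = (contig_id, bit_score)
    return {gene: contig for gene, (contig, _) in best.items()}


# Made-up hits for illustration only: two contigs match the same gene.
rows = [
    ["contig_1", "AT1G01010.1", "98.5", "500", "7", "0", "1", "500", "1", "500", "0.0", "850"],
    ["contig_2", "AT1G01010.1", "99.8", "650", "1", "0", "1", "650", "1", "650", "0.0", "1200"],
    ["contig_3", "AT1G01020.1", "97.0", "350", "10", "0", "1", "350", "1", "350", "1e-180", "640"],
]
hits = "\n".join("\t".join(r) for r in rows) + "\n"

selected = best_contig_per_gene(io.StringIO(hits))
print(selected)  # {'AT1G01010.1': 'contig_2', 'AT1G01020.1': 'contig_3'}
```

In this toy input, contig_1 and contig_2 are redundant with respect to the same gene, so only contig_2 is retained. Contigs with no hit never appear in the table and are therefore dropped under this assumption.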
Accuracy of detection of differentially expressed genes (DEGs) in reference sequences. Shown are the numbers of DEGs or differentially expressed contigs (DECs). The gene dataset indicates DEGs in all panels. (a) Raw contigs: DECs in raw contigs, (b) Longest contigs: DECs in longest contigs, (c) Clustered contigs: DECs in clustered contigs, (d) Annotated contigs: DECs in annotated contigs, (e) Plant DB contigs: DECs in Plant DB contigs, (f) PlantClust50 contigs: DECs in PlantClust50 contigs.