Next-generation sequencing techniques revolutionized the study of RNA expression by permitting whole-transcriptome analysis. However, sequencing reads generated from nested and multi-copy genes are often either misassigned or discarded, which greatly reduces both quantification accuracy and gene coverage.
Here, researchers at the Université de Sherbrooke present CoCo, a read assignment pipeline that takes into account the multitude of overlapping and repetitive genes in the transcriptome of higher eukaryotes. CoCo uses a modified annotation file that highlights nested genes, and it distributes multimapped reads between repeated sequences in proportion to their uniquely assigned reads. CoCo salvages over 15% of discarded aligned RNA-seq reads and significantly changes the abundance estimates for both coding and non-coding RNA, as validated by PCR and bedgraph comparisons.
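The idea of a modified annotation that exposes nested genes can be illustrated with a small interval operation. The following is a minimal sketch, not CoCo's actual code: the function name `cut_gaps` and the half-open coordinate convention are assumptions made for illustration. It removes the regions occupied by nested genes from a host gene feature, so that reads falling in those regions are no longer ambiguous between host and nested gene.

```python
def cut_gaps(feature, nested):
    """Subtract nested-gene intervals from one host feature.

    feature: (start, end) of a host gene feature, half-open coordinates.
    nested:  list of (start, end) intervals of nested genes.
    Returns the pieces of the host feature left after the gaps are cut.
    Hypothetical helper for illustration only, not part of CoCo.
    """
    start, end = feature
    pieces = []
    cursor = start  # leftmost position of the host feature not yet handled
    for n_start, n_end in sorted(nested):
        # Skip nested genes that do not overlap the remaining host region.
        if n_end <= cursor or n_start >= end:
            continue
        # Keep the host segment that lies before this nested gene.
        if n_start > cursor:
            pieces.append((cursor, n_start))
        cursor = max(cursor, n_end)
    # Keep whatever remains after the last nested gene.
    if cursor < end:
        pieces.append((cursor, end))
    return pieces
```

For example, `cut_gaps((100, 1000), [(300, 400)])` leaves two host segments on either side of the nested gene.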
CoCo read correction scheme for nested and multimapped genes
(A) Representation of a standard gene annotation used for depicting a genetic locus containing one host gene and three nested genes. The dashed lines indicate introns while the dark blue boxes indicate exons. (B) Representation of the gene annotation produced using the correct_annotation module of CoCo, showing a gap in the retained intron over the first nested gene. (C) Examples of potential read pairs overlapping the different features, and of multimapped read pairs. (D) Comparison of the read pair assignment using the standard and CoCo pipelines for each of the read pairs illustrated in C. The reads that are differentially assigned by CoCo are highlighted in red. (E) Comparison of the read count estimates by the standard and the CoCo pipelines, based on the assignments listed in D. (F) Flow chart of the CoCo pipeline. Pre-processing and alignment steps are shown before the correct_count module is applied. The correct_count module then assigns reads with Subread’s featureCounts using the gapped CoCo annotation (built with the correct_annotation module). Read pairs resulting in multiple alignments are considered separately and distributed among their candidate genes in proportion to the uniquely assigned read pairs.
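The proportional distribution of multimapped read pairs described in panel F can be sketched as follows. This is an illustrative approximation under stated assumptions, not the CoCo implementation: the function name `distribute_multimapped`, the input shapes, and the even-split fallback for genes with no uniquely assigned read pairs are all hypothetical.

```python
def distribute_multimapped(unique_counts, multimapped_groups):
    """Share multimapped read pairs among genes by their unique counts.

    unique_counts:      dict mapping gene -> uniquely assigned read pairs.
    multimapped_groups: list of (genes, n_pairs), where genes is a tuple of
                        the genes a set of n_pairs read pairs aligns to.
    Returns a dict mapping gene -> corrected (possibly fractional) count.
    Hypothetical helper for illustration only, not part of CoCo.
    """
    counts = {gene: float(c) for gene, c in unique_counts.items()}
    for genes, n_pairs in multimapped_groups:
        total_unique = sum(unique_counts.get(g, 0) for g in genes)
        for g in genes:
            if total_unique > 0:
                # Weight each gene by its fraction of the unique read pairs.
                share = unique_counts.get(g, 0) / total_unique
            else:
                # Assumption: split evenly when no gene has unique support.
                share = 1 / len(genes)
            counts[g] = counts.get(g, 0.0) + n_pairs * share
    return counts


corrected = distribute_multimapped(
    {"geneA": 30, "geneB": 10},
    [(("geneA", "geneB"), 8)],
)
```

In this toy case, geneA holds three quarters of the unique read pairs, so it receives six of the eight multimapped pairs and geneB receives two.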
Availability: The CoCo software is an open source package written in Python and available from http://gitlabscottgroup.med.usherbrooke.ca/scott-group/coco.