RNA-Seq technology has revolutionized transcriptome characterization, not only by accurately quantifying gene expression but also by identifying novel transcripts such as chimeric fusion transcripts. These ‘fusion’ or ‘chimeric’ transcripts have improved the diagnosis and prognosis of several tumors and have led to the development of novel therapeutic regimens. Fusion transcript detection is currently performed by several software packages that rely primarily on sequence alignment algorithms. Aligning sequencing reads from fusion transcript loci in cancer genomes can be highly challenging because genomic alterations induce incorrect mapping, which limits the performance of alignment-based fusion transcript detection methods. Here, researchers at the University of Nebraska Medical Center developed a novel alignment-free method, ChimeRScope, that accurately predicts fusion transcripts based on the gene fingerprint (k-mer) profiles of RNA-Seq paired-end reads. Results on published datasets and in-house cancer cell line datasets, followed by experimental validation, demonstrate that ChimeRScope consistently outperforms other popular methods irrespective of read length and sequencing depth. More importantly, results on the researchers' in-house datasets show that ChimeRScope is capable of identifying novel fusion transcripts with potential oncogenic functions.
ChimeRScope strategy for identifying FESRs: an example
A k-mer library is created by first (A) generating the k-mer profiles for all genes. Next, all k-mer profiles are compared so that, for each possible k-mer, the list of genes containing that k-mer can be quickly identified. (B) A snapshot of a hypothetical k-mer library. (C) The Circos map on the right illustrates how ChimeRScope determines an FESR. A discordant paired-end read (100 bp × 2) that fails stringent alignment against the reference genome is plotted in a circular layout, with each nucleobase type represented by a unique color. (D) Four different variations of each k-mer in the read (e.g. the highlighted region shows the 11th 17-mer of read 1, spanning bases 11 to 27) are created and searched against the k-mer library to obtain (E) a list of gene IDs that use the corresponding k-mer as a fingerprint. Each block represents a k-mer and each color represents a unique gene ID. For example, four genes (G1: red, G2: green, G3: yellow and G4: orange) are associated with the 11th 17-mer (from the 11th to the 27th nucleotide, highlighted in gray) and two genes (G1 and G4) are associated with the 29th 17-mer (highlighted in light yellow). (F) A complete graph is drawn for all eight matched genes. Each vertex in the complete graph represents a unique gene, with the size of the vertex proportional to the overall fingerprint score for that gene.
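The library lookup in panels A, D and E can be illustrated with a minimal Python sketch. It is not ChimeRScope's actual implementation. The function names (`kmers`, `build_library`, `variations`, `genes_per_kmer`) are hypothetical. The caption does not spell out the four k-mer variations, so the sketch assumes they are the forward, reverse, complement and reverse-complement orientations. The fingerprint score and the graph step in panel F are not modeled.

```python
from collections import defaultdict

K = 17  # k-mer length used in the figure's example

# Translation table for the DNA complement (assumes an A/C/G/T alphabet only).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def kmers(seq, k=K):
    """Yield (1-based index, k-mer) for every window of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield i + 1, seq[i:i + k]


def build_library(gene_sequences, k=K):
    """Map each k-mer to the set of gene IDs whose sequence contains it."""
    library = defaultdict(set)
    for gene_id, seq in gene_sequences.items():
        for _, kmer in kmers(seq, k):
            library[kmer].add(gene_id)
    return library


def variations(kmer):
    """Return four orientations of a k-mer.

    ASSUMPTION: forward, reverse, complement, reverse complement.
    The source only says "four different variations".
    """
    comp = kmer.translate(_COMPLEMENT)
    return (kmer, kmer[::-1], comp, comp[::-1])


def genes_per_kmer(read, library, k=K):
    """For each k-mer position in the read, collect the genes matched by
    any of its variations. Positions with no match are omitted."""
    hits = {}
    for idx, kmer in kmers(read, k):
        genes = set()
        for variant in variations(kmer):
            genes |= library.get(variant, set())
        if genes:
            hits[idx] = genes
    return hits
```

With 1-based indexing, the i-th 17-mer spans bases i to i + 16, which matches the caption's 11th 17-mer covering bases 11 to 27. A 100 bp read therefore yields 84 overlapping 17-mers, each of which is looked up in the library.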