Chimeric transcripts are commonly defined as transcripts linking two or more different genes in the genome. They can be explained by various biological mechanisms such as genomic rearrangement, read-through or trans-splicing, but also by technical or biological artefacts. Several studies have shown their importance in cancer, cell pluripotency and motility. Many programs have recently been developed to identify chimeras from Illumina RNA-seq data (mostly fusion genes in cancer). However, the outputs of different programs on the same dataset can be widely inconsistent and tend to include many false positives. Other issues with existing evaluations include simulated datasets restricted to fusion genes, real datasets with limited numbers of validated cases, inconsistent results between simulated and real datasets, and assessment at the gene level rather than the junction level.
Here researchers from the Centre for Genomic Regulation (CRG), Barcelona, present ChimPipe, a modular and easy-to-use method to reliably identify chimeras from paired-end Illumina RNA-seq data. They have also produced realistic simulated datasets for three different read lengths, and enhanced two gold-standard cancer datasets by associating exact junction points with validated gene fusions. Benchmarking ChimPipe together with four other state-of-the-art tools on these data showed ChimPipe to be the top program at identifying exact junction coordinates for both kinds of datasets, and the one with the best trade-off between sensitivity and precision. Applied to 106 ENCODE human RNA-seq datasets, ChimPipe identified 137 high-confidence chimeras connecting the protein-coding sequences of their parent genes. In subsequent experiments, three out of four predicted chimeras could be validated, two of which were recurrently expressed in a large majority of the samples. Cloning and sequencing of the three cases revealed several new chimeric transcript structures, three of which have the potential to encode a chimeric protein, for which the researchers hypothesized a new role.
The ChimPipe method
(A) RNA-seq reads are first mapped to the genome and transcriptome using the GEMtools RNA-seq pipeline, and the reads that do not map this way are passed to the GEM RNA-mapper to find reads that split-map to different chromosomes or strands. (B) The split-reads from these two mapping steps are then gathered and passed on to the ChimSplice module, which derives consensus junctions, each with an expression estimate calculated as the number of staggered split-reads supporting it. The ChimPE module can then associate each chimeric junction found by ChimSplice with its discordant paired-end (PE) reads, splitting them into those consistent and those inconsistent with the junction. (C) The ChimFilter module then applies a series of filters to the chimeric junctions obtained up to this point in order to discard false positives, leading to (D) a set of reliable chimeric junctions with which it associates several pieces of information, such as a category (readthrough, intrachromosomal, inverted, interstrand, or interchromosomal) and the supporting evidence in terms of the number of staggered split-reads and the number of consistent PE reads, among others.
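To make the junction categories and the two kinds of supporting evidence more concrete, here is a minimal Python sketch. It is not ChimPipe's code: the `Site` class, the function names, the 150 kb read-through distance, the reading of "staggered" as "distinct split-read start positions", and the filter thresholds are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    """One side of a chimeric junction (hypothetical structure)."""
    chrom: str
    pos: int      # genomic coordinate of the splice site
    strand: str   # "+" or "-"


def classify_junction(donor, acceptor, max_readthrough_dist=150_000):
    """Assign one of the five categories named in the caption.

    The decision rules and the distance cutoff are assumptions,
    not values taken from the article.
    """
    if donor.chrom != acceptor.chrom:
        return "interchromosomal"
    if donor.strand != acceptor.strand:
        return "interstrand"
    # Same chromosome and strand: is the donor upstream of the acceptor
    # in the direction of transcription?
    if donor.strand == "+":
        in_order = donor.pos < acceptor.pos
    else:
        in_order = donor.pos > acceptor.pos
    if not in_order:
        return "inverted"
    distance = abs(acceptor.pos - donor.pos)
    if distance <= max_readthrough_dist:
        return "readthrough"
    return "intrachromosomal"


def count_staggered(split_read_starts):
    """Count split-reads with distinct start positions (assumed definition)."""
    return len(set(split_read_starts))


def passes_filter(n_staggered, n_consistent_pe, min_staggered=2, min_pe=1):
    """Toy support filter; both thresholds are hypothetical."""
    return n_staggered >= min_staggered and n_consistent_pe >= min_pe
```

For example, under these assumptions a junction between `Site("chr1", 1000, "+")` and `Site("chr2", 500, "+")` would be labelled interchromosomal, and five split-reads starting at only two distinct positions would count as two staggered reads.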
ChimPipe combines spanning and paired-end RNA-seq reads to detect any kind of chimera, including read-throughs, and shows an excellent trade-off between sensitivity and precision. The chimeras found by ChimPipe can be validated in vitro with high accuracy.
Availability – The ChimPipe program is available at: https://github.com/Chimera-tools/ChimPipe