Transcripts are frequently modified by structural variations, which lead to either a fused transcript of two genes (known as a fusion gene) or an insertion of intergenic sequence into a transcript. These modifications, called transcriptomic structural variants (TSVs), can drastically change the downstream product. Detecting TSVs, especially in tumor sequencing where they are known to occur frequently, is an important and challenging computational problem. It is made even more challenging because often only RNA-seq measurements are available.
Carnegie Mellon University researchers introduce SQUID, a novel algorithm and its implementation, to accurately predict both fusion-gene and non-fusion-gene TSVs from RNA-seq alignments. SQUID takes the unique approach of attempting to reconstruct an underlying genome sequence that best explains the observed RNA-seq reads. By unifying concordant and discordant read alignments in one model, SQUID achieves high sensitivity with far fewer false positives than other approaches. The researchers detect TSVs in TCGA tumor samples using SQUID and observe that non-fusion-gene TSVs are more likely to be intra-chromosomal than fusion-gene TSVs. They also quantify the propensity for breakpoint partners to be reused. They identify several novel TSVs involving tumor suppressor genes, which may lead to loss of function in the corresponding genes and play a role in tumorigenesis.
Overview of the SQUID algorithm
Based on the alignments of RNA-seq reads to the reference genome, SQUID partitions the genome into segments, connects the endpoints of the segments to indicate their actual adjacency in the transcript, and finally reorders the endpoints along the most reliable path. Each edge in the final path that comes from discordant read alignments represents a TSV.
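The idea of rearranging segments so that as many read-supported connections as possible become consistent can be illustrated with a toy sketch. It is not SQUID's implementation: it ignores segment orientation, searches all orderings by brute force (feasible only for a handful of segments), and uses hypothetical names (`satisfied_weight`, `best_arrangement`, `call_tsvs`) and made-up edge weights. Segments are numbered in reference order, and an edge `(u, v)` with weight `w` means `w` reads suggest that segment `u` precedes segment `v` in the transcript.

```python
from itertools import permutations


def satisfied_weight(order, edges):
    """Total weight of edges (u, v) for which u comes before v in `order`."""
    position = {seg: i for i, seg in enumerate(order)}
    return sum(w for (u, v), w in edges.items() if position[u] < position[v])


def best_arrangement(num_segments, edges):
    """Brute-force search for the segment order with the most satisfied weight.

    Starts from the reference order and only switches on a strict
    improvement, so ties are resolved in favor of the reference.
    """
    reference = tuple(range(num_segments))
    best_order = reference
    best_score = satisfied_weight(reference, edges)
    for order in permutations(range(num_segments)):
        score = satisfied_weight(order, edges)
        if score > best_score:
            best_order, best_score = order, score
    return best_order, best_score


def call_tsvs(order, edges):
    """Edges that are discordant in the reference (u > v) but satisfied in `order`."""
    position = {seg: i for i, seg in enumerate(order)}
    return sorted(
        (u, v) for (u, v) in edges if u > v and position[u] < position[v]
    )


# Made-up read support: edge (2, 1) is discordant with the reference order
# 0, 1, 2, 3 and outweighs the conflicting concordant edge (1, 2).
edges = {
    (0, 1): 10,
    (1, 2): 2,
    (2, 3): 10,
    (0, 2): 5,
    (1, 3): 6,
    (2, 1): 8,
}

order, score = best_arrangement(4, edges)
tsvs = call_tsvs(order, edges)
print("arrangement:", order, "score:", score)
print("predicted TSV edges:", tsvs)
```

In this example the reference order satisfies a total weight of 33, while placing segment 2 before segment 1 satisfies 39, so the discordant edge `(2, 1)` is reported as the single TSV. A low-weight discordant edge would instead be left unsatisfied and treated as noise, which is how a unified model of concordant and discordant reads can suppress false positives.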