RNA-Seq has become the method of choice for quantifying genes and exons, discovering novel transcripts, and detecting fusion genes. However, reliable variant identification from RNA-Seq data remains challenging because of the complexity of the transcriptome, the difficulty of accurately mapping reads that span exon boundaries, and the biases introduced during sequencing library preparation.
Researchers at the Mayo Clinic developed RVboost, a novel method designed specifically for RNA variant prioritization. RVboost uses several attributes unique to RNA library preparation, sequencing, and RNA-Seq data analysis. It employs a boosting method to train a model of “good quality” variants using common variants from HapMap, then prioritizes and calls RNA variants based on the trained model. The researchers packaged RVboost in a comprehensive workflow that integrates tools for variant calling, annotation, and filtering.
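The idea of training on known common variants and then ranking all calls by model score can be illustrated with a minimal sketch. This is not RVboost's implementation: the text above does not specify the boosting algorithm or feature set, so the example swaps in a generic AdaBoost with one-feature threshold rules (decision stumps) written in plain Python. The two features, the toy values, and the labeling rule (calls at HapMap sites labeled +1, all other calls treated as -1) are assumptions made for illustration only.

```python
import math

def train_stump(X, y, w):
    """Find the single-feature threshold rule with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                preds = [sign if row[j] >= thr else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def stump_predict(stump, row):
    """Apply a stump (err, feature index, threshold, sign) to one variant."""
    _, j, thr, sign = stump
    return sign if row[j] >= thr else -sign

def adaboost(X, y, rounds=10):
    """Generic AdaBoost: reweight the training variants after each stump."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        # Clamp the error so the stump weight (alpha) stays finite.
        err = min(max(stump[0], 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, stump))
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def score(model, row):
    """Weighted vote of all stumps; higher means more like the training positives."""
    return sum(alpha * stump_predict(stump, row) for alpha, stump in model)

# Hypothetical features per variant call: [call quality, alt-allele read fraction].
# Assumed labels: +1 = call coincides with a HapMap site, -1 = any other call.
X_train = [[50, 0.45], [60, 0.50], [55, 0.40], [12, 0.10], [15, 0.08], [10, 0.12]]
y_train = [1, 1, 1, -1, -1, -1]

model = adaboost(X_train, y_train)

# Prioritize new candidate calls by model score, best first.
candidates = {"varA": [58, 0.48], "varB": [11, 0.09]}
ranked = sorted(candidates, key=lambda name: score(model, candidates[name]), reverse=True)
```

In this sketch the score is used only to order candidates; choosing a score cutoff for the final call set would be a separate decision that the text above does not describe.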
RVboost consistently outperforms Variant Quality Score Recalibration (VQSR) from the Genome Analysis Tool Kit (GATK) and the RNA-Seq variant calling pipeline SNPiR across 12 RNA-Seq samples, using variants from paired exome sequencing data as ground truth. Several RNA-Seq-specific attributes were identified as critical for differentiating true from false variants, including the distance of the variant position to exon boundaries and the percentage of variant-supporting reads in which the variant falls within the first 6 base pairs of the read. The latter identifies false variants introduced by random hexamer priming during library construction.
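To make the two RNA-Seq-specific attributes concrete, here is a minimal sketch of how they could be computed for a single variant. The function names, the exon interval representation, and the 0-based read offsets are hypothetical choices for illustration; how RVboost itself derives these values is not described here.

```python
def distance_to_exon_boundary(pos, exons):
    """Smallest distance from a variant position to any exon start or end.

    `exons` is a list of (start, end) coordinates on the same chromosome.
    """
    return min(min(abs(pos - start), abs(pos - end)) for start, end in exons)

def fraction_in_read_start(offsets, n_bases=6):
    """Fraction of variant-supporting reads with the variant in the first n_bases.

    `offsets` holds the 0-based position of the variant within each
    supporting read, counted from the start of the read.
    """
    if not offsets:
        return 0.0
    return sum(1 for off in offsets if off < n_bases) / len(offsets)

# Toy example: a variant 3 bp past the end of the first exon.
exons = [(100, 200), (300, 400)]
dist = distance_to_exon_boundary(203, exons)        # 3

# Four of eight supporting reads carry the variant in their first 6 bases.
offsets = [0, 2, 5, 6, 40, 70, 10, 3]
frac = fraction_in_read_start(offsets)              # 0.5
```

A small boundary distance or a high fraction would mark a call as more suspicious, in line with the mapping and hexamer-priming artifacts described above.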
Availability and Implementation: The RVboost package runs on Mac and Linux environments.
Software and user manual are available at: http://bioinformaticstools.mayo.edu/research/rvboost/