Finding a suitable library size to call variants in RNA-Seq

RNA sequencing allows the study of both gene expression changes and transcribed mutations, providing a highly effective way to gain insight into cancer biology. When planning the sequencing of a large cohort of samples, library size is a fundamental factor affecting both the overall cost and the quality of the results. Researchers from the Walter and Eliza Hall Institute of Medical Research specifically address how overall library size influences the detection of somatic mutations in RNA-seq data in two acute myeloid leukaemia datasets.

The researchers simulated shallower sequencing depths by downsampling 45 acute myeloid leukaemia samples (100 bp PE) from the Leucegene project, all originally sequenced at high depth. On these samples they compared the sensitivity of six methods in recovering validated mutations; each method pairs one of three popular callers (MuTect, VarScan, and VarDict) with one of two filtering strategies. Sensitivity fell incrementally across simulated libraries of 80M, 50M, 40M, 30M and 20M fragments, with the largest loss detected with fewer than 30M fragments (sensitivity below 90%, average loss of 7%). Sensitivity in recovering insertions and deletions varied markedly between callers, with VarDict showing the highest sensitivity (60%). Single nucleotide variant sensitivity was relatively consistent across methods, apart from MuTect, whose default filters need adjustment when used on RNA-Seq data.

The researchers also analysed 136 RNA-Seq samples from the TCGA-LAML cohort (50 bp PE) and assessed the change in sensitivity between the initial libraries (average 59M fragments) and the same libraries downsampled to 40M fragments. For single nucleotide variants in recurrently mutated myeloid genes they found comparable performance, with a 6% average loss in sensitivity at 40M fragments.
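The downsampling comparison described above can be pictured with a short Python sketch. It is only an illustration under stated assumptions, not the authors' pipeline: the function names, the random subsampling of fragment identifiers, and the `(chrom, pos, ref, alt)` variant key are all hypothetical choices made for this example.

```python
import random


def downsample_fraction(initial_fragments, target_fragments):
    """Fraction of fragments to keep so a library shrinks to the target size.

    Returns 1.0 when the library is already at or below the target.
    """
    if initial_fragments <= 0:
        raise ValueError("initial_fragments must be positive")
    return min(1.0, target_fragments / initial_fragments)


def downsample(fragment_ids, target_fragments, seed=0):
    """Randomly keep `target_fragments` fragments (hypothetical in-memory version).

    A paired-end fragment is kept or dropped as a whole, so both mates stay together.
    """
    fragment_ids = list(fragment_ids)
    if target_fragments >= len(fragment_ids):
        return fragment_ids
    rng = random.Random(seed)  # fixed seed so a run can be reproduced
    return rng.sample(fragment_ids, target_fragments)


def sensitivity(called_variants, truth_variants):
    """Share of validated (truth-set) variants that the caller recovered."""
    truth = set(truth_variants)
    if not truth:
        raise ValueError("truth set is empty")
    return len(truth & set(called_variants)) / len(truth)


# Toy data: variants keyed by (chrom, pos, ref, alt); the values are made up.
truth = [("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")]
calls_after_downsampling = [("chr1", 100, "A", "T")]

print(downsample_fraction(80_000_000, 40_000_000))
print(sensitivity(calls_after_downsampling, truth))
```

Repeating the downsampling with different seeds at each target size, then calling variants on each subsample, would give a spread of sensitivities per library size rather than a single value.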

Sensitivity by total depth at a variant site

Fig. 4

(a) Sensitivity as a function of the total depth at a variant site for the TCGA-LAML SNVs in Set1 (all SNVs after removing intronic and intergenic variants), combining the initial and 40M libraries and adopting callers with default-filters. (b) Sensitivity as a function of the total depth at a variant site using the Leucegene samples. The sensitivity is computed using the variants in the truth set, combining the calls from all downsampling runs, and using both types of filters. (c) Median with maximum and minimum sensitivity in recovering the SNVs in the truth set using the Leucegene samples. Only SNVs with total depth ≥ d are considered as called. The sensitivity by depth is computed for each starting library size (colours) and using annotation-filters. Each estimated median sensitivity (and minimum and maximum) is the median across random downsampling runs at the same library size. The red dotted lines represent the 80%, 90% and 95% sensitivity thresholds.
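The depth rule in panel c can be sketched in Python as follows. This is an illustrative reading of the caption rather than the authors' code: the function names and the dictionary layout (truth-set variant mapped to its total depth in a given downsampled library) are assumptions made for the example.

```python
from statistics import median


def sensitivity_at_depth(truth_depths, called_variants, min_depth):
    """Sensitivity when a variant counts as called only if its total depth >= min_depth.

    truth_depths: dict mapping each truth-set variant to the total depth at its site.
    called_variants: variants reported by the caller in the same library.
    """
    if not truth_depths:
        raise ValueError("truth set is empty")
    called = set(called_variants)
    recovered = sum(
        1
        for variant, depth in truth_depths.items()
        if variant in called and depth >= min_depth
    )
    return recovered / len(truth_depths)


def summarise_runs(sensitivities):
    """Median, minimum and maximum sensitivity across runs."""
    return median(sensitivities), min(sensitivities), max(sensitivities)


# Toy example with made-up depths for three truth-set variants.
truth_depths = {"v1": 5, "v2": 20, "v3": 50}
called = {"v1", "v2", "v3"}

for d in (1, 10, 30):
    print(d, sensitivity_at_depth(truth_depths, called, d))

print(summarise_runs([0.8, 0.9, 1.0]))
```

Raising the depth threshold can only lower or preserve the computed sensitivity, which is why curves of this kind decrease as d grows.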

Between 30M and 40M paired-end (PE) reads of 100 bp are needed to recover 90-95% of the initial variants in recurrently mutated myeloid genes. To extend this result to another cancer type, the researchers suggest first exploring the characteristics of its mutations and gene expression patterns.

Quaglieri A, Flensburg C, Speed TP, Majewski IJ. (2020) Finding a suitable library size to call variants in RNA-Seq. BMC Bioinformatics 21(1):553.
