Transcriptome sequencing (RNA-seq) is widely used to detect gene rearrangements and quantitate gene expression in acute lymphoblastic leukemia (ALL), but its utility and accuracy in identifying copy number variations (CNVs) have not been well described. CNV information inferred from RNA-seq can be highly informative for guiding disease classification and risk stratification in ALL because of the high incidence of aneuploid subtypes within this disease.
A team led by researchers at Charles University, Czech Republic, has developed RNAseqCNV, a method to detect large-scale CNVs from RNA-seq data. The research team used models based on normalized gene expression and minor allele frequency (MAF) to classify arm-level CNVs with high accuracy in ALL (99.1% overall and 98.3% for non-diploid chromosome arms), and the models were further validated with excellent performance in acute myeloid leukemia (accuracy 99.8% overall and 99.4% for non-diploid chromosome arms). RNAseqCNV outperforms alternative RNA-seq-based algorithms in calling CNVs in the ALL dataset, especially in samples with a high proportion of CNVs. The CNV calls were highly concordant with DNA-based CNV results and more reliable than conventional cytogenetic-based karyotypes. RNAseqCNV provides a method to robustly identify copy number alterations in the absence of DNA-based analyses, further enhancing the utility of RNA-seq to classify ALL subtypes.
Workflow of RNAseqCNV
Per-gene read counts and SNV information are the input for the analysis. Each sample is analyzed against standard samples (in-built or user-provided). The DESeq2 variance stabilizing transformation (VST) is used to normalize gene expression levels. The log2 fold change per gene is then calculated against the median expression of that gene across the cohort. Each gene is assigned a weight according to the (in-built or user-provided) weight matrix. Subsequently, weighted per-chromosome quantiles are used to generate boxplots. High-quality SNVs are kept based on sequencing depth and whether they are recorded in the dbSNP database. Since only heterozygous SNVs are informative for calling CNVs, only SNVs with MAF between 0.1 and 0.9 are kept by default. Weighted boxplots and MAF density curves are visualized at the chromosome and arm level and serve as predictors for CNV calling. Diploid arms are estimated in the first model and used to adjust the expression data for correct prediction of CNVs in the second model. The sex of the sample is estimated from gene expression on chromosome Y.
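To make three of the workflow steps more concrete, the following is a minimal Python sketch, not the RNAseqCNV implementation (the package itself is written in R). It assumes expression values have already been normalized to a log2-like scale, so the per-gene log2 fold change reduces to a subtraction from the cohort median. The function names, the SNV record fields (depth, maf, in_dbsnp), the depth cutoff of 20 reads, and the inclusive treatment of the MAF bounds are all hypothetical choices made for illustration; only the 0.1 to 0.9 MAF window comes from the description above.

```python
from statistics import median


def log2_fold_changes(sample, cohort):
    """Per-gene difference between one sample and the cohort median.

    Assumption: values are already normalized on a log2-like scale,
    so the log2 fold change is a simple subtraction.
    """
    return {gene: value - median(cohort[gene]) for gene, value in sample.items()}


def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative weight reaches fraction q of the total.

    One simple definition of a weighted quantile, used here to show how
    gene weights can shape the box of a per-chromosome boxplot.
    """
    pairs = sorted(zip(values, weights))
    total = sum(weight for _, weight in pairs)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= q * total:
            return value
    return pairs[-1][0]


def informative_snvs(snvs, min_depth=20, maf_low=0.1, maf_high=0.9):
    """Keep SNVs that are well covered, known to dbSNP, and likely heterozygous.

    min_depth is a hypothetical cutoff; the MAF window follows the
    default range quoted in the article (bounds treated as inclusive here).
    """
    return [
        snv for snv in snvs
        if snv["depth"] >= min_depth
        and snv["in_dbsnp"]
        and maf_low <= snv["maf"] <= maf_high
    ]
```

In this sketch, genes with larger weights pull the weighted quartiles toward their own fold changes, which is the intuition behind weighting genes before drawing per-chromosome boxplots; the surviving heterozygous SNVs would then feed the MAF density curves.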
Availability – RNAseqCNV and the tutorial are freely available from https://github.com/honzee/RNAseqCNV.
Bařinka J, Hu Z, Wang L, Wheeler DA, Rahbarinia D, McLeod C, Gu Z, Mullighan CG. (2022) RNAseqCNV: analysis of large-scale copy number variations from RNA-seq data. Leukemia [Epub ahead of print].