Virus expression detection reveals RNA-sequencing contamination in TCGA

Contamination of reagents and cross-contamination across samples are long-recognized issues in molecular biology laboratories. While often innocuous, contamination can lead to inaccurate results. Cantalupo et al., for example, found HeLa-derived human papillomavirus 18 (H-HPV18) in several of The Cancer Genome Atlas (TCGA) RNA-sequencing samples. This work motivated researchers at the University of North Carolina at Chapel Hill to assess a greater number of samples and to determine the origin of possible contamination using viral sequences. To detect viruses with high specificity, they developed a publicly available workflow, VirDetect, which detects virus and laboratory vector sequences in RNA-seq samples. The researchers applied VirDetect to 9143 RNA-seq samples sequenced at one TCGA sequencing center (covering 28 of 33 cancer types) over 5 years.

The researchers confirmed that H-HPV18 was present in many samples and determined that viral transcripts from H-HPV18 significantly co-occurred with those from xenotropic mouse leukemia virus-related virus (XMRV). Using laboratory metadata and viral transcription, they determined that the likely contaminant was a pool of cell lines known as the “common reference”. This pool was sequenced alongside TCGA RNA-seq samples as a control to monitor quality across technology transitions (i.e., microarray to GAII to HiSeq) and to link RNA-seq to previous-generation microarrays, which routinely used the “common reference”. One of the cell lines in the pool was a laboratory isolate of MCF-7, which they discovered was infected with XMRV; another constituent of the pool was likely HeLa cells.
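One way to check whether two viruses co-occur across samples more often than chance would predict is a one-sided Fisher's exact test on a 2x2 table of sample counts. The sketch below is only an illustration: the summary above does not say which statistical test the authors used, the function name is hypothetical, and the counts in the usage example are invented rather than taken from the paper.

```python
from math import comb


def fisher_exact_greater(both: int, only_a: int, only_b: int, neither: int) -> float:
    """One-sided Fisher's exact test p-value for co-occurrence.

    Arguments are sample counts: positive for both viruses, only virus A,
    only virus B, and neither. Returns the probability of seeing at least
    `both` double-positive samples if the two viruses were independent.
    """
    n = both + only_a + only_b + neither
    a_total = both + only_a   # samples positive for virus A
    b_total = both + only_b   # samples positive for virus B
    denom = comb(n, b_total)
    max_both = min(a_total, b_total)
    # Sum the hypergeometric tail from the observed overlap upward.
    tail = sum(
        comb(a_total, k) * comb(n - a_total, b_total - k)
        for k in range(both, max_both + 1)
    )
    return tail / denom


if __name__ == "__main__":
    # Hypothetical counts, for illustration only (not from the paper).
    p = fisher_exact_greater(both=12, only_a=8, only_b=5, neither=975)
    print(f"one-sided p-value: {p:.3g}")
```

A small p-value here would indicate that the two viruses appear together in more samples than independent occurrence would explain, which is the kind of pattern expected if both came from a single contaminating source.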

VirDetect workflow and performance


Figure: (a, b) VirDetect workflow diagram: (a) VirDetect alignment steps; (b) virus genome preparation steps. (c) Number of reads mapping to the viral genome for human (left) and low-complexity (right) simulated reads (100 simulated samples, each with 1,000,000 human reads and 1000 low-complexity reads). (d, e) Simulated viral reads (100 simulated samples with 1000 reads each) with 0–10 mutations in the first read pair: (d) sensitivity, measured as the percentage of reads that mapped to the viral genomes; (e) positive predictive value (PPV), measured as the number of true positives (simulated viral reads that mapped to the correct viral genomes) divided by the sum of true positives and false positives.
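The two metrics defined in the caption can be written out as a short calculation. This is a minimal sketch with hypothetical function names and invented read counts; it follows the caption's definitions and is not code from VirDetect itself.

```python
def sensitivity_percent(mapped_viral_reads: int, total_viral_reads: int) -> float:
    """Percentage of simulated viral reads that mapped to the viral genomes."""
    if total_viral_reads <= 0:
        raise ValueError("total_viral_reads must be positive")
    return 100.0 * mapped_viral_reads / total_viral_reads


def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """PPV = TP / (TP + FP).

    A true positive is a simulated viral read mapped to the correct viral
    genome; a false positive is one mapped to a different viral genome.
    """
    mapped = true_positives + false_positives
    if mapped == 0:
        raise ValueError("no mapped reads, so PPV is undefined")
    return true_positives / mapped


if __name__ == "__main__":
    # Invented example: 1000 simulated viral reads, 950 of which mapped,
    # 940 to the correct genome and 10 to a wrong one.
    print(sensitivity_percent(950, 1000))
    print(positive_predictive_value(940, 10))
```

Keeping the two metrics separate makes the trade-off visible: a pipeline can map nearly every viral read (high sensitivity) while still assigning some of them to the wrong genome (lower PPV).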

Altogether, these findings indicate a multi-step contamination process. First, MCF-7 was infected with an XMRV. Second, this infected cell line was added to a pool of cell lines that also contained HeLa. Finally, RNA from this pool contaminated several TCGA tumor samples, most likely during library construction. Thus, the human tumors with H-HPV18 or XMRV reads were likely not themselves infected with H-HPV18 or XMRV.


Selitsky SR, Marron D, Hollern D, Mose LE, Hoadley KA, Jones C, Parker JS, Dittmer DP, Perou CM. (2020) Virus expression detection reveals RNA-sequencing contamination in TCGA. BMC Genomics 21(1):79.
