Researchers find viral contamination of cancer RNA-Seq samples in The Cancer Genome Atlas database

High-throughput RNA sequencing followed by computational analysis has vastly accelerated the identification of viral and other pathogenic sequences in clinical samples, but cross-contamination during sample processing remains a major problem that can lead to erroneous conclusions.

Researchers from the National Heart, Lung, and Blood Institute at the NIH found sequences of HPV38, a virus not previously associated with endometrial cancer, specifically in RNA-Seq samples from endometrial cancer patients in TCGA. However, multiple lines of evidence suggest possible cross-contamination in these samples, which were processed together in the same batch.

The researchers found HPV38, a cutaneous form of HPV associated with skin cancer, in 32 of 168 endometrial cancer samples. In 12 of the HPV38+ samples, they observed at least one read pair that mapped to both the human and HPV38 genomes, indicative of viral integration into host DNA, something not previously demonstrated for HPV38. However, the expression levels of HPV38 transcripts were relatively low, and all 32 HPV38+ samples belonged to the same experimental batch of 40 samples, whereas none of the other 128 endometrial carcinoma samples were HPV38+, raising doubts about the significance of the HPV38 association. Moreover, the HPV38+ samples all contained the same 10 novel single nucleotide variations (SNVs), leading the researchers to hypothesize that one patient was infected with this new isolate of HPV38, which had integrated into that patient's genome, and that material from this sample may have cross-contaminated other TCGA samples within batch #228.
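To give a sense of how unlikely that batch pattern would be by chance, here is a minimal sketch of a one-sided hypergeometric (Fisher-style) tail probability using the counts reported above: 168 samples, 32 of them HPV38+, and a batch of 40 that contains all 32. The function name and the choice of test are assumptions made for illustration; the article does not say which statistic, if any, the authors used.

```python
from math import comb


def batch_enrichment_p(total, positives, batch_size, positives_in_batch):
    """Probability of seeing at least `positives_in_batch` positive samples
    in a batch of `batch_size`, if `positives` positive samples were spread
    at random among `total` samples (hypergeometric upper tail).

    Hypothetical helper for illustration, not the authors' method.
    """
    denom = comb(total, batch_size)
    upper = min(positives, batch_size)
    tail = sum(
        comb(positives, k) * comb(total - positives, batch_size - k)
        for k in range(positives_in_batch, upper + 1)
    )
    return tail / denom


# Counts taken from the article: all 32 HPV38+ samples sit in one batch of 40.
p_value = batch_enrichment_p(total=168, positives=32,
                             batch_size=40, positives_in_batch=32)
print(f"P(all 32 positives land in one batch of 40 by chance) = {p_value:.3g}")
```

A vanishingly small probability here says only that positivity tracks the batch. It does not by itself distinguish contamination from some other batch-specific cause, which is why the shared SNVs and low expression matter as additional evidence.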


Computational pipeline for identifying viral sequences in NGS data.

Despite this potential cross-contamination, the researchers' data indicate that they have detected a new isolate of HPV38 that appears to be integrated into the human genome.

Based on this analysis, the research team proposes guidelines to examine batch effect, virus expression level, and SNVs as part of NGS data analysis for evaluating the significance of viral/pathogen sequences in clinical samples.
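The three checks named in these guidelines could be combined into a simple per-virus summary, sketched below. Everything in it is an assumption for illustration: the `ViralHit` record, the `contamination_flags` function, the SNV labels, and especially the `low_expression_rpm` cutoff, which is an arbitrary placeholder and not a threshold from the paper.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ViralHit:
    """One virus-positive sample (hypothetical record layout)."""
    sample_id: str
    batch: str
    viral_reads_per_million: float
    snvs: frozenset  # viral SNVs called in this sample, e.g. {"100A>G"}


def contamination_flags(hits, low_expression_rpm=1.0):
    """Summarise batch effect, expression level, and shared SNVs.

    `low_expression_rpm` is an arbitrary illustrative cutoff.
    """
    if not hits:
        raise ValueError("need at least one virus-positive sample")
    # Batch effect: how concentrated are the positives in a single batch?
    top_batch, top_count = Counter(h.batch for h in hits).most_common(1)[0]
    # SNVs: variants present in every positive sample hint at a single source.
    shared = frozenset.intersection(*(h.snvs for h in hits))
    return {
        "dominant_batch": top_batch,
        "fraction_in_dominant_batch": top_count / len(hits),
        "shared_snvs": shared,
        "all_low_expression": all(
            h.viral_reads_per_million < low_expression_rpm for h in hits
        ),
    }


# Toy data, invented purely to show the call pattern.
toy_hits = [
    ViralHit("s1", "A", 0.2, frozenset({"100A>G", "250C>T"})),
    ViralHit("s2", "A", 0.5, frozenset({"100A>G", "250C>T"})),
    ViralHit("s3", "B", 0.1, frozenset({"100A>G"})),
]
print(contamination_flags(toy_hits))
```

Positives concentrated in one batch, uniformly low viral expression, and an identical set of SNVs across samples would together point toward a single contaminating source rather than independent infections.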

Kazemian M, Ren M, Lin JX, Liao W, Spolski R, Leonard WJ. (2015) Possible HPV38 contamination of endometrial cancer RNA-Seq samples in The Cancer Genome Atlas database. J Virol [Epub ahead of print].
