In the era of big data in bioinformatics, large data volumes and high-performance computing enable researchers to re-analyze publicly available datasets at an unprecedented scale. A growing number of studies implicate the microbiome in both normal human physiology and a wide range of diseases. RNA sequencing (RNA-seq) is commonly used to infer global eukaryotic gene expression patterns under defined conditions, including human disease-related contexts, but its generic nature also enables the detection of microbial and viral transcripts.
Researchers from Helmholtz Zentrum München have developed a bioinformatic pipeline that screens existing human RNA-seq datasets for microbial and viral reads by re-inspecting the fraction of reads that does not map to the human genome. They validated this approach by recapitulating outcomes from six independent controlled infection experiments in cell line models and by comparing it with an alternative metatranscriptomic mapping strategy. The researchers then applied the pipeline to close to 150 terabytes of publicly available raw RNA-seq data from >17,000 samples from >400 studies relevant to human disease, using state-of-the-art high-performance computing systems. The results of this large-scale re-analysis are made available in the MetaMap resource.
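The two-stage logic described above (set aside human-mapping reads, then classify what is left) can be sketched in a few lines of Python. This is only a toy illustration, not the MetaMap implementation: the function names, the tiny reference strings and the exact-substring matching are all hypothetical stand-ins for a real aligner and a real metagenomic classifier.

```python
from collections import Counter

def screen_reads(reads, is_human, classify):
    """Two-stage screen: count human-mapping reads, classify the rest."""
    counts = Counter()
    for read in reads:
        if is_human(read):
            counts["human"] += 1
        else:
            # classify() returns a label such as "bacterial" or "viral",
            # or None when the read cannot be assigned.
            counts[classify(read) or "unclassified"] += 1
    return counts

# Hypothetical toy references; a real pipeline would use genome indices.
HUMAN_REF = "ACGTACGTTTGACCA"
MICROBE_REFS = {
    "bacterial": "GGGCCCGGGAAATTT",
    "viral": "TTTTAAAACCCCGGGG",
}

def toy_is_human(read):
    # Stand-in for alignment: exact substring match against the reference.
    return read in HUMAN_REF

def toy_classify(read):
    # Stand-in for metafeature classification of non-human reads.
    for label, ref in MICROBE_REFS.items():
        if read in ref:
            return label
    return None

reads = ["ACGTACG", "GGGCCC", "AAAACCCC", "CATCATCAT", "TTGACCA"]
result = screen_reads(reads, toy_is_human, toy_classify)
print(dict(result))
```

With these five toy reads, two match the human reference, one each matches the bacterial and viral references, and one matches nothing and is counted as unclassified.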
Schematic illustrates the MetaMap pipeline
Over 400 projects from studies relevant to human disease were identified in the SRA database. Over 500 billion RNA-seq reads were downloaded and first filtered by mapping them onto the human genome; the remaining reads then underwent metafeature classification. Of all reads, 90.7% mapped to the human genome, while 0.03%, 0.20% and 0.39% were assigned to archaeal, bacterial and viral metafeatures, respectively. A further 8.6% of all reads remained non-discriminative at the species level (“unclassified”).
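To put these fractions in proportion, the short calculation below derives the non-human share and the portion of it that received an archaeal, bacterial or viral assignment. It uses only the percentages quoted above; the variable names are illustrative. The reported values sum to slightly less than 100%, and the text does not say where the remainder falls (rounding is a plausible but unconfirmed explanation).

```python
# Read fractions as reported, in percent of all reads.
fractions = {
    "human": 90.7,
    "archaeal": 0.03,
    "bacterial": 0.20,
    "viral": 0.39,
    "unclassified": 8.6,
}

total = sum(fractions.values())
non_human = 100.0 - fractions["human"]
assigned = fractions["archaeal"] + fractions["bacterial"] + fractions["viral"]

# Share of the non-human fraction that was assigned to a metafeature.
assigned_share_of_non_human = assigned / non_human * 100.0

print(f"sum of reported fractions: {total:.2f}%")
print(f"non-human reads: {non_human:.1f}% of all reads")
print(f"assigned to archaea, bacteria or viruses: {assigned:.2f}% of all reads")
print(f"assigned share of non-human reads: {assigned_share_of_non_human:.1f}%")
```

In other words, only about one in fifteen non-human reads could be assigned at the species level; the large majority of the non-human fraction stayed unclassified.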
These results demonstrate that common human RNA-seq data, including those archived in public repositories, might contain valuable information to correlate microbial and viral detection patterns with diverse diseases.
Availability – Code to process new datasets and perform statistical analyses is available at https://github.com/theislab/MetaMap.