Although metatranscriptomics, the study of the activity of diverse microbial populations based on RNA-seq data, is rapidly growing in popularity, biologists have limited options for analyzing this type of data. Current approaches for processing metatranscriptomes rely either on restricted databases and a dedicated computing cluster, or on metagenome-based methods that have not been fully evaluated for metatranscriptomic datasets. Researchers at the University of California, Davis created a new bioinformatics pipeline, SAMSA, designed specifically for metatranscriptome dataset analysis, which runs in conjunction with Metagenome-RAST (MG-RAST) servers. Intended for researchers with relatively little bioinformatics experience, SAMSA breaks down metatranscriptome transcription activity levels by organism or transcript function, and is fully open source. The researchers used this new tool to evaluate best practices for sequencing stool metatranscriptomes.
Working with the MG-RAST annotation server, the researchers constructed the Simple Annotation of Metatranscriptomes by Sequence Analysis (SAMSA) software package, a complete pipeline for the analysis of gut microbiome data. SAMSA can summarize and evaluate raw annotation results, identifying abundant species and significant functional differences between metatranscriptomes. Using pilot data and simulated subsets, they determined experimental requirements for fecal gut metatranscriptomes. Sequences need to be either long reads (longer than 100 bp) or joined paired-end reads. Each sample needs 40-50 million raw sequences, which can be expected to yield the 5-10 million annotated reads necessary for accurate abundance measures. The researchers also demonstrated that ribosomal RNA depletion does not deplete rRNA equally across all species within a sample, and that any remaining rRNA sequences should be discarded. Using publicly available metatranscriptome data in which rRNA was not depleted, they demonstrated that overall organism transcriptional activity can be measured using mRNA counts. They also detected significant differences between control and experimental groups in both organism transcriptional activity and specific cellular functions.
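The idea of measuring organism activity from mRNA counts after discarding leftover rRNA can be illustrated with a minimal Python sketch. This is not SAMSA's own code: the function name `relative_activity`, the (organism, function) input format, and the simple text match used to spot rRNA annotations are all assumptions made for illustration.

```python
from collections import Counter


def relative_activity(annotations):
    """Return each organism's share of non-rRNA annotated reads.

    `annotations` is a hypothetical iterable of (organism, function)
    pairs, one per annotated read. This format is an assumption and
    is not the layout of real MG-RAST or SAMSA output files.
    """
    # Discard reads annotated as ribosomal RNA, then count per organism.
    counts = Counter(
        organism
        for organism, function in annotations
        if "ribosomal rna" not in function.lower()
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders organisms from most to least active.
    return {organism: n / total for organism, n in counts.most_common()}


# Toy input, invented purely to show the calculation.
reads = [
    ("Bacteroides fragilis", "DNA-directed RNA polymerase subunit beta"),
    ("Bacteroides fragilis", "16S ribosomal RNA"),
    ("Escherichia coli", "Elongation factor Tu"),
    ("Bacteroides fragilis", "Chaperone protein DnaK"),
    ("Escherichia coli", "23S ribosomal RNA"),
]
activity = relative_activity(reads)
```

In this toy input, two of the five reads are rRNA and are dropped, so the shares are computed over the three remaining mRNA reads. A real analysis would also need the statistical comparison between control and experimental groups, which this sketch does not attempt.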
The SAMSA pipeline
This organizational chart shows the flow of data through the pipeline, beginning with raw reads at the top of the chart and ending with the graphical output of the results at the bottom. Note that blue boxes denote intermediate generated output files, red boxes denote Python scripts, orange boxes denote R scripts, and green boxes denote external reference databases.
By making this new pipeline publicly available, the developers have created a powerful new tool for metatranscriptomics research, offering greater insight into the activity of diverse microbial communities. They further recommend that stool metatranscriptomes be ribodepleted and sequenced in a 100 bp paired-end format with a minimum of 40 million reads per sample.
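As a rough illustration, the sequencing recommendation above could be encoded as a simple pre-flight check on a planned run. The function name `meets_recommendation` and its parameters are hypothetical and are not part of SAMSA. The thresholds are the ones stated in the text.

```python
# Thresholds taken from the recommendation in the text.
MIN_READ_LENGTH_BP = 100
MIN_RAW_READS = 40_000_000


def meets_recommendation(read_length_bp, paired_end, raw_reads, ribodepleted):
    """Check a planned run against the stated stool metatranscriptome guidance.

    Hypothetical helper for illustration only; all four arguments
    describe a single sample.
    """
    return (
        ribodepleted
        and paired_end
        and read_length_bp >= MIN_READ_LENGTH_BP
        and raw_reads >= MIN_RAW_READS
    )


ok_run = meets_recommendation(100, True, 45_000_000, True)
shallow_run = meets_recommendation(100, True, 20_000_000, True)
```

Here `ok_run` is `True` and `shallow_run` is `False`, because 20 million reads falls below the 40 million minimum.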
Availability – All components and tools used in the SAMSA pipeline, as well as documentation files, are freely available from GitHub at http://github.com/transcript/SAMSA