The rapid growth of single-cell RNA-seq studies (scRNA-seq) demands efficient data storage, processing, and analysis. Big-data technology provides a framework that facilitates the comprehensive discovery of biological signals from inter-institutional scRNA-seq datasets. There are several strategies to solve the stochastic and heterogeneous single-cell transcriptome signal. Researchers at the Baylor Institute for Immunology Research now propose a workflow that accounts for the unique characteristics of scRNA-seq data and primary objectives of single-cell studies.
Inter-institutional single-cell RNA-seq datasets are aligned against their genomes at the Hadoop layer. Read counts are resolved into gene “on” or “off” status at the normalization layer. Differential expression, co-expression, and other applications are developed based on gene “on” or “off” status instead of gene expression. Biology in the resulting gene list is verified by GSEA, GO-term enrichment analysis, DAVID functional analysis or other tools. GSEA, gene set enrichment analysis; GO, gene ontology; DAVID, database for annotation, visualization and integrated discovery.