Single-cell transcriptomics provides an unprecedented opportunity to study, in great detail, the transcriptional stochasticity and cellular heterogeneity that are crucial for maintaining cell functions and that shape disease progression and treatment response. Such stochasticity and heterogeneity are typically masked in bulk-cell studies. Recent single-cell applications have covered a broad range of tissues, stem cell lines, and cell populations with clinical backgrounds. scRNA-seq is one of the most promising technologies for single-cell transcriptomics. Nevertheless, it also poses major challenges in data management, query, and analysis, largely stemming from the aforementioned big-data characteristics.
Number of papers/datasets addressing single-cell data and big data.
There are five ‘V’s to consider for scRNA-seq data:
(1) Volume. NGS data has become one of the largest big-data domains in terms of data acquisition, storage, and distribution. Like bulk-cell RNA-seq and other NGS-based studies, scRNA-seq generates a high volume of raw sequencing data and high-dimensional transformed expression data. Moreover, because cell populations are heterogeneous, a typical scRNA-seq study incorporates hundreds or even thousands of cells, which adds a few orders of magnitude to the data volume.
(2) Velocity. As noted above, scRNA-seq produces a larger data volume than bulk-cell RNA-seq. Consequently, high data-transfer bandwidth, parallel algorithms, and high-performance computers are required to generate and process the data.
(3) Variety. An scRNA-seq study may combine data from different single-cell isolation chips, protocols, and research environments. Normalizing these datasets so that they are comparable becomes a major issue.
(4) Variability. The transcriptional activity of a living cell is dynamic rather than static. scRNA-seq therefore captures a snapshot of single cells in seemingly homogeneous populations that in fact vary significantly from one cell to another. Substantial variability in the scRNA-seq signal arises from biological sources, including transcriptional stochasticity and cellular heterogeneity, that cannot be investigated in bulk-cell studies. As a result, scRNA-seq data exhibit significantly larger variance than bulk-cell RNA-seq data. Resolving this biological variability is the main goal of single-cell transcriptomics research.
(5) Veracity. An scRNA-seq workflow consists of sequential steps: target cell isolation, RNA extraction, fragmentation, reverse transcription, cDNA amplification, sequencing, alignment, and read counting. Every step introduces biases and artifacts that may significantly affect the coverage, accuracy, and timeliness of transcript expression measurements and thus interfere with the proper characterization and quantification of transcripts. It is therefore critical to control data quality before including the datasets in a meaningful global study.
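The Variety and Veracity points above can be illustrated with a minimal Python sketch: filter out low-quality cells, then rescale each remaining cell to a common library size so that cells from different runs are roughly comparable. This is not a method prescribed by the text. The function name `qc_filter_and_normalize`, the thresholds `min_counts` and `min_genes`, and the counts-per-million plus log transform are all illustrative assumptions; real studies would tune the thresholds per protocol and would likely use more elaborate normalization.

```python
import numpy as np


def qc_filter_and_normalize(counts, min_counts=1000, min_genes=200, scale=1e6):
    """Filter low-quality cells, then library-size normalize the rest.

    counts: 2-D array of raw read counts, shape (cells, genes).
    min_counts, min_genes: hypothetical QC thresholds (assumptions).
    Returns (keep_mask, log_normalized) where log_normalized has one
    row per retained cell.
    """
    counts = np.asarray(counts, dtype=float)

    # Per-cell quality metrics: total reads and number of detected genes.
    library_size = counts.sum(axis=1)
    genes_detected = (counts > 0).sum(axis=1)

    # Keep cells that pass both thresholds; also drop empty cells so the
    # division below is always well defined.
    keep = (
        (library_size >= min_counts)
        & (genes_detected >= min_genes)
        & (library_size > 0)
    )
    kept = counts[keep]

    # Rescale every retained cell to the same total (counts per million
    # when scale=1e6), then apply log(1 + x) to compress the dynamic range.
    normalized = kept / kept.sum(axis=1, keepdims=True) * scale
    return keep, np.log1p(normalized)
```

Calling the function on a small cells-by-genes matrix returns a boolean mask of retained cells together with their log-normalized expression values; the mask can be used to carry cell annotations through the same filter.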