Droplet-based 3’ single-cell RNA-sequencing (scRNA-seq) methods have proved transformational in characterizing cellular diversity and generating valuable hypotheses throughout biology. Researchers from the University of Texas Southwestern Medical Center outline a common problem with 3’ scRNA-seq datasets: genes documented to be expressed by other methods are either completely missing or dramatically under-represented, which compromises the discovery of cell types, states, and genetic mechanisms. The researchers show that this problem stems from three main sources of sequencing read loss: (1) reads mapping immediately 3’ of known gene boundaries because of poor 3’ UTR annotation; (2) intronic reads stemming from unannotated exons or pre-mRNA; and (3) reads discarded because of gene overlaps. Each of these issues affects the detection of thousands of genes, even in the well-characterized mouse and human genomes, rendering downstream analysis partially or fully blind to their expression. They outline a simple three-step solution to recover the missing gene expression data: compiling a hybrid pre-mRNA reference to retrieve intronic reads, resolving read loss caused by gene overlaps by removing readthrough and premature-start transcripts, and redefining 3’ gene boundaries to capture reads falsely classified as intergenic. Using mouse brain and human peripheral blood datasets, they demonstrate that this approach dramatically increases the amount of sequencing data included in downstream analysis, revealing 20-50% more genes per cell and incorporating 15-20% more sequencing reads than standard solutions. These improvements reveal previously missing, biologically relevant cell types, states, and marker genes in the mouse brain and human blood profiling data.
Finally, the researchers provide scRNA-seq-optimized transcriptomic references for human and mouse data, as well as a simple algorithmic implementation of these solutions that can be applied to both thoroughly and poorly annotated genomes. These results demonstrate that optimizing the sequencing read mapping step can significantly improve both the analysis resolution and the biological insight gained from scRNA-seq. Moreover, this approach warrants a fresh look at preceding analyses of this popular and scalable cellular profiling technology.
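The boundary-redefinition step can be illustrated with a minimal sketch. This is not the researchers' implementation: the `Gene` record, the `extend_three_prime` function, the 10 kb default (borrowed from the window used in the figure analysis), and the rule of stopping one base short of the nearest same-strand downstream gene are all assumptions made for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record; coordinates are 1-based and inclusive."""
    name: str
    chrom: str
    start: int
    end: int
    strand: str  # "+" or "-"


def extend_three_prime(genes, max_extension=10_000, gap=1):
    """Return copies of `genes` whose 3' end is extended by up to
    `max_extension` bp, stopping `gap` bp short of the nearest downstream
    gene on the same chromosome and strand (assumed collision rule)."""
    extended = []
    for g in genes:
        neighbors = [
            o for o in genes
            if o is not g and o.chrom == g.chrom and o.strand == g.strand
        ]
        if g.strand == "+":
            # On the plus strand the 3' end is the larger coordinate.
            limit = g.end + max_extension
            for o in neighbors:
                if o.start > g.end:
                    limit = min(limit, o.start - gap)
            extended.append(replace(g, end=max(g.end, limit)))
        else:
            # On the minus strand the 3' end is the smaller coordinate.
            limit = max(1, g.start - max_extension)
            for o in neighbors:
                if o.end < g.start:
                    limit = max(limit, o.end + gap)
            extended.append(replace(g, start=min(g.start, limit)))
    return extended


# Hypothetical coordinates, for illustration only.
example = [
    Gene("GeneA", "chr1", 1_000, 2_000, "+"),
    Gene("GeneB", "chr1", 5_000, 6_000, "+"),
    Gene("GeneC", "chr1", 20_000, 21_000, "-"),
]
for gene in extend_three_prime(example):
    print(gene.name, gene.start, gene.end)
```

Capping the extension at the next same-strand gene reflects the point made above that overlapping genes cause reads to be discarded, so an extension that creates a new overlap would trade one source of read loss for another.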
a. scRNA-seq-based profiling of a mouse brain center that regulates physiology, the median preoptic nucleus (MnPO). 10x Genomics 3’ transcriptomic analysis of MnPO neurons (n = 906) mapped to an exonic transcriptomic reference reveals 13 neuron types. Data are shown in a tSNE embedding.
b. Example of a gene detected by scRNA-seq (Nxph4), with sequencing read mapping at its genomic locus. The majority of sequencing reads map to known exons of Nxph4 and are therefore registered (blue) and included in downstream analysis. Discarded reads (red) map to non-exonic regions or are antisense to the gene and are therefore excluded. Inset violin plot: scRNA-seq analysis detects Nxph4 expression in several MnPO neuron types (cell-type-specific log-transformed expression of Nxph4 in MnPO neuron types, with cell-type identity color-coded as in Fig. 1a). Inset micrograph: in situ hybridization of Nxph4 expression in the MnPO (scale bar: 150 µm; posterior MnPO outlined with a white dashed line; data from the Allen Brain Atlas Mouse ISH dataset).
c. Example of a gene (B4galnt2) not detected by scRNA-seq because its reads map to introns. Inset violin plot: gene expression is not detected in any of the MnPO neuron types. Inset micrograph: in situ hybridization of B4galnt2 expression in the MnPO.
d. Example of a gene (Gpr165) not detected by scRNA-seq because its reads map to the intergenic region 3’ of the known end of the gene. Inset violin plot: gene expression is not detected by scRNA-seq in any of the MnPO neuron types. Inset micrograph: in situ hybridization of Gpr165 expression in the MnPO.
e. Proportion of uniquely mapped sequencing reads according to mapping site (exonic, intronic, or intergenic) for the mouse brain (MnPO, left) and human peripheral blood mononuclear cell (PBMC, right) datasets.
f. Intronic and intergenic reads constitute a promising source for recovering missing gene expression data in scRNA-seq analysis. Number of detected genes in the mouse brain (MnPO, left) and human PBMC (right) datasets when downstream analysis includes reads mapping to exons only, to exons and introns, to exons and intergenic regions within 10 kb of known 3’ ends of genes, or to all three sources.
g. Human and mouse genes according to the dominant source of sequencing read data. Genes are classified as ‘exonic dominant’, ‘intronic dominant’, or ‘3’ intergenic dominant’ if more than 50% of their sequencing reads map to their exons, to their introns, or within 10 kb of their 3’ end, respectively. Mixed genes have less than 50% of their reads stemming from any of the three regions.
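The classification rule in panel g can be written as a short sketch. The function name `classify_gene`, the per-gene read counts passed to it, the `"undetected"` label for genes with no reads, and the treatment of an exact 50% share as mixed are assumptions for illustration; the caption specifies only the "more than 50%" and "less than 50%" cases.

```python
def classify_gene(exonic, intronic, intergenic_3p):
    """Classify a gene by the region contributing more than 50% of its reads.

    Arguments are read counts for the gene's exons, its introns, and the
    window within 10 kb of its 3' end (counts are assumed to be precomputed).
    """
    total = exonic + intronic + intergenic_3p
    if total == 0:
        # Assumption: genes with no reads are reported separately.
        return "undetected"
    shares = {
        "exonic dominant": exonic / total,
        "intronic dominant": intronic / total,
        "3' intergenic dominant": intergenic_3p / total,
    }
    for label, share in shares.items():
        if share > 0.5:
            return label
    # No region exceeds 50% of reads (an exact 50% share is treated as mixed).
    return "mixed"


# Hypothetical read counts, for illustration only.
print(classify_gene(80, 15, 5))
print(classify_gene(40, 30, 30))
```

Because the three shares sum to one, at most one region can exceed 50%, so the order in which the regions are checked does not affect the result.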