Conventional scRNA-seq expression analyses rely on the availability of a high-quality genome annotation. Yet, as researchers show here with scRNA-seq experiments and analyses spanning human, mouse, chicken, naked mole rat, gray mouse lemur and sea urchin, genome annotations are often incomplete, particularly for organisms that are not routinely studied. To overcome this hurdle, researchers at Cornell University created a scRNA-seq analysis routine that recovers biologically relevant transcriptional activity beyond the scope of the best available genome annotation by analyzing any region of the genome for which transcriptional products are detected. Their tool generates a single-cell expression matrix for all transcriptionally active regions (TARs), performs single-cell TAR expression analysis to identify biologically significant TARs, and then annotates those TARs using gene homology analysis. Because single-cell expression analysis acts as a filter, annotation effort is directed to biologically significant transcripts, which uncovers biology that annotation-dependent scRNA-seq analysis would otherwise miss.
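One way to picture the split between annotated TARs (aTARs) and unannotated TARs (uTARs) is as an interval-overlap test against the existing gene annotation. The following is a minimal sketch of that idea only, not the authors' pipeline: the `Interval` class, the `label_tars` function, the strand-aware overlap rule and the example coordinates are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A genomic interval (hypothetical helper type, not from the tool)."""
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    strand: str  # "+" or "-"


def overlaps(a: Interval, b: Interval) -> bool:
    # Assumption: two features overlap only if they share chromosome and
    # strand and their coordinate ranges intersect.
    return (
        a.chrom == b.chrom
        and a.strand == b.strand
        and a.start < b.end
        and b.start < a.end
    )


def label_tars(tars, genes):
    """Label each TAR "aTAR" if it overlaps any annotated gene, else "uTAR".

    This is a simple quadratic scan, adequate for an illustration; a real
    analysis would use an interval index.
    """
    return {
        tar: "aTAR" if any(overlaps(tar, gene) for gene in genes) else "uTAR"
        for tar in tars
    }


# Made-up coordinates, for illustration only.
genes = [Interval("chr22", 1000, 5000, "+")]
tars = [
    Interval("chr22", 1200, 1800, "+"),  # inside the annotated gene
    Interval("chr22", 9000, 9500, "+"),  # outside any annotation
    Interval("chr22", 1200, 1800, "-"),  # same position, opposite strand
]
labels = label_tars(tars, genes)
for tar, label in labels.items():
    print(tar.chrom, tar.start, tar.end, tar.strand, label)
```

Under these assumptions, only the first region counts as annotated; the other two would be carried forward as uTARs for expression analysis.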
Generating de novo features based on genome coverage
a Workflow to generate TARs and to identify biologically meaningful uTARs. b Total genome assembly sequence length for human (hg38 and hg16), mouse (mm10), chicken (GRCg6a), gray mouse lemur (Mmur_3.0), naked mole rat (HetGla_1.0), and sea urchin (Spur_4.2). c Total number of annotated transcripts in existing annotations, normalized to the assembly sequence length, for human (hg38 GENCODE v30, hg16 RefSeq), mouse (GENCODE vM21), chicken (GRCg6a Ensembl v96), gray mouse lemur (Mmur_3.0 RefSeq), naked mole rat (HetGla_1.0 RefSeq), and sea urchin (Spur_4.2 RefSeq). d Relative number of unique scRNA-seq reads outside of gene annotations that are contained in uTARs, shown per cell as violin plots (3849 cells in hg38 and hg19, 6113 in mouse, 14008 in chicken, 6321 in lemur, 2657 in naked mole rat, 2658 in sea urchin). Mean values (black dots) and two standard deviations above and below the mean (black bars) are shown. e Relative number of unique scRNA-seq reads outside of gene annotations for human genome assemblies and annotations released at different times (3849 cells). f Example of groHMM-defined aTAR (red) and uTAR (maroon) features along hg16 chr22, with RefSeq hg16 gene annotations shown in blue. Sense-strand coverage is plotted in black and antisense-strand coverage in gray (natural-log scale).
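The per-cell summary in panel d can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: it reads the plotted quantity as the fraction of each cell's unique reads outside gene annotations that fall within uTARs, it uses the sample standard deviation, and the function names, cell names and counts are hypothetical.

```python
from statistics import mean, stdev


def utar_fraction_per_cell(utar_counts, outside_counts):
    """Per-cell fraction of out-of-annotation unique reads that fall in uTARs.

    Both arguments map a cell barcode to a read count. Cells with no reads
    outside gene annotations are skipped to avoid dividing by zero.
    """
    return {
        cell: utar_counts.get(cell, 0) / total
        for cell, total in outside_counts.items()
        if total > 0
    }


def mean_and_band(values, k=2.0):
    """Return (mean, mean - k*SD, mean + k*SD) using the sample SD."""
    m = mean(values)
    s = stdev(values)
    return m, m - k * s, m + k * s


# Made-up counts, for illustration only.
utar_counts = {"cellA": 50, "cellB": 70, "cellC": 90}
outside_counts = {"cellA": 100, "cellB": 100, "cellC": 100, "cellD": 0}

fractions = utar_fraction_per_cell(utar_counts, outside_counts)
center, lower, upper = mean_and_band(list(fractions.values()))
print(fractions)
print(center, lower, upper)
```

Note that a band of two standard deviations around the mean can extend past 0 or 1 for a bounded fraction, so it describes spread rather than a range of possible values.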
Availability – The Snakemake pipeline, along with the bash and R scripts used in the TAR-scRNA-seq tool, is available at https://github.com/fw262/TAR-scRNA-seq and is archived on Zenodo (https://doi.org/10.5281/zenodo.4567436).