Differences in transcription start sites (TSSs) and transcription end sites (TESs) among gene isoforms can affect the stability, localization, and translation efficiency of mRNA. Isoforms allow a single gene to perform diverse functions across cell types, and isoform dynamics allow those functions to change over time. However, methods that efficiently identify and quantify RNA isoforms genome-wide in single cells are still lacking.
Researchers from Sun Yat-Sen University have developed single cell RNA Cap And Tail sequencing (scRCAT-seq), a method that demarcates the boundaries of isoforms using short-read sequencing, with higher efficiency and lower cost than existing long-read sequencing methods. Combined with machine learning algorithms, scRCAT-seq demarcates RNA transcripts with unprecedented accuracy. The researchers identified hundreds of previously uncharacterized transcripts and thousands of alternative transcripts for known genes, revealed cell-type-specific isoforms in various cell types across different species, and generated a cell atlas of isoform dynamics during the development of retinal cones.
Overview of scRCAT-seq
(a) Schematic of the scRCAT-seq method. Full-length cDNA was synthesized by template-switching reverse transcription, amplified by PCR, and tagmented with Tn5 transposases. The TAG added to both ends contains the UMI (unique molecular identifier) and CI (cell identifier). Both 5′ and 3′ ends of the cDNA were captured and amplified by PCR, producing indexed libraries for pooled sequencing. Sequencing data were processed, and transcription start sites (TSSs) and transcription end sites (TESs) were identified using machine learning models. CS1: common sequence 1; CS2: common sequence 2; TSO: template-switching oligo; T30: 30 repeating T bases. (b) Schematic of the machine learning models. Features were collected based on characteristics of the peaks, including the read distribution, motifs related to real TSSs/TESs, and sequence features related to internal false-positive signals, and were used to train random forest (RF), logistic regression (LR), support vector machine (SVM), and k-nearest neighbors (KNN) models. (c) Gene body coverage of scRCAT-seq reads derived from DRG (n = 18). Shown is the mean coverage of reads, shaded by 95% confidence intervals. (d) Accuracy in identifying authentic TSSs and TESs with different machine learning models. Error bars represent standard deviation of the mean (n = 3). (e) Distance of the identified TSSs/TESs to those annotated in hg38. TSSs/TESs were identified from the scRCAT-seq peaks derived from hESC with the RF model. (f) Pie chart illustrating the distribution of the identified TSSs in hESC relative to the TSSs in the FANTOM5 database. The total number of TSS peaks identified after optimization by the machine learning models is indicated under the pie chart. (g) Pie chart illustrating the distribution of the identified TESs in hESC relative to the TESs in PolyA_DB3. Source data are provided as a Source data file.
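The comparison in the panel on distances to annotated sites can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the input layout (chromosome name mapped to a list of positions), and the 100-base window are assumptions made for illustration, and strand is ignored for brevity.

```python
from bisect import bisect_left


def nearest_annotated_distance(site, annotated_sorted):
    """Signed distance in bases from `site` to the nearest annotated position.

    `annotated_sorted` must be a non-empty list sorted in ascending order.
    A negative value means the identified site lies before the annotation.
    """
    if not annotated_sorted:
        raise ValueError("no annotated sites to compare against")
    i = bisect_left(annotated_sorted, site)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = annotated_sorted[max(i - 1, 0):i + 1]
    nearest = min(candidates, key=lambda pos: abs(site - pos))
    return site - nearest


def distances_to_annotation(identified, annotated):
    """Distances for every identified site, grouped by chromosome.

    Both arguments are hypothetical dicts of the form {chrom: [positions]}.
    Chromosomes without any annotated site are skipped.
    """
    distances = []
    for chrom, sites in identified.items():
        reference = sorted(annotated.get(chrom, []))
        if not reference:
            continue
        distances.extend(nearest_annotated_distance(s, reference) for s in sites)
    return distances


def fraction_within(distances, window=100):
    """Share of sites whose absolute distance is at most `window` bases."""
    if not distances:
        return 0.0
    return sum(abs(d) <= window for d in distances) / len(distances)


# Toy positions, invented purely to exercise the functions.
identified_tss = {"chr1": [480, 1000, 50], "chr2": [300]}
annotated_tss = {"chr1": [900, 100, 500], "chr2": [310]}

dists = distances_to_annotation(identified_tss, annotated_tss)
print(dists)
print(fraction_within(dists, window=100))
```

A histogram of the returned distances would give a plot comparable in spirit to the one described in the caption. A strand-aware version would keep separate position lists per strand and flip the sign on the minus strand.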
Availability – All custom computer code used in this study is freely available at https://github.com/huyoujinlab/scRCAT-seq.