The majority of mammalian genes encode multiple transcript isoforms that result from differential promoter use, changes in exonic splicing, and alternative 3’ end choice. Detecting and quantifying transcript isoforms across tissues, cell types, and species has been extremely challenging because transcripts are much longer than the short reads normally used for RNA-seq. By contrast, long-read RNA-seq (LR-RNA-seq) gives the complete structure of most transcripts. A team led by researchers at the University of California, Irvine sequenced 264 PacBio LR-RNA-seq libraries totaling over 1 billion circular consensus sequencing (CCS) reads for 81 unique human and mouse samples. The researchers detected at least one full-length transcript from 87.7% of annotated human protein-coding genes and a total of 200,000 full-length transcripts, 40% of which have novel exon junction chains.
To capture and compute on the three sources of transcript structure diversity, the researchers developed a gene and transcript annotation framework that uses triplets representing the transcript start site, exon junction chain, and transcript end site of each transcript. Using triplets in a simplex representation demonstrates how promoter selection, splice pattern, and 3’ processing are deployed across human tissues, with nearly half of multi-transcript protein-coding genes showing a clear bias toward one of the three diversity mechanisms. The predominantly expressed transcript changes across samples for 74% of protein-coding genes. In evolutionary terms, the human and mouse transcriptomes are globally similar in types of transcript structure diversity, yet among individual orthologous gene pairs, more than half (57.8%) show substantial differences in mechanism of diversification in matching tissues. This initial large-scale survey of human and mouse long-read transcriptomes provides a foundation for further analyses of alternative transcript usage, and is complemented by short-read and microRNA data on the same samples and by epigenome data elsewhere in the ENCODE4 collection.
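To make the triplet idea concrete, the sketch below places a gene on a three-way simplex from the number of distinct start sites, exon junction chains, and end sites among its transcripts. It is a minimal illustration under stated assumptions, not the researchers' method: the `Transcript` class, the identifiers, the plain count normalization, and the 0.5 bias threshold are all hypothetical choices made for this example, and the published framework may weight the three components differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    """One transcript described as a (start site, junction chain, end site) triplet."""
    tss: str  # transcript start site ID (hypothetical naming)
    ec: str   # exon junction chain ID (hypothetical naming)
    tes: str  # transcript end site ID (hypothetical naming)


def simplex_coords(transcripts):
    """Return (tss, ec, tes) proportions that sum to 1 for one gene.

    Assumption: each coordinate is the count of distinct features of that
    type divided by the total count across all three types.
    """
    n_tss = len({t.tss for t in transcripts})
    n_ec = len({t.ec for t in transcripts})
    n_tes = len({t.tes for t in transcripts})
    total = n_tss + n_ec + n_tes
    return (n_tss / total, n_ec / total, n_tes / total)


def dominant_mechanism(coords, threshold=0.5):
    """Label a gene by its largest coordinate, or 'mixed' if none exceeds the threshold."""
    labels = ("tss", "splicing", "tes")
    best = max(range(3), key=lambda i: coords[i])
    return labels[best] if coords[best] > threshold else "mixed"


# Toy gene: four start sites sharing one junction chain and one end site.
gene_a = [
    Transcript("tss1", "ec1", "tes1"),
    Transcript("tss2", "ec1", "tes1"),
    Transcript("tss3", "ec1", "tes1"),
    Transcript("tss4", "ec1", "tes1"),
]
coords_a = simplex_coords(gene_a)
label_a = dominant_mechanism(coords_a)
```

For this toy gene the coordinates are (4/6, 1/6, 1/6), so it falls near the start-site corner of the simplex and is labeled `tss`. A gene with similar numbers of all three feature types would sit near the center and be labeled `mixed`.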
Overview of the ENCODE4 RNA datasets
a, Overview of the sampled tissues and number of libraries from each tissue in the ENCODE human LR-RNA-seq dataset. b, Percentage of GENCODE v40 polyA genes, by gene biotype, detected at > 0 TPM, >= 1 TPM, and >= 100 TPM in at least one ENCODE short-read RNA-seq library from samples that match the LR-RNA-seq. c, Number of samples in which each GENCODE v40 gene is detected at >= 1 TPM in the ENCODE short-read RNA-seq dataset from samples that match the LR-RNA-seq. d, Data processing pipeline for the LR-RNA-seq data. e, Percentage of GENCODE v40 polyA genes, by gene biotype, detected at > 0 TPM, >= 1 TPM, and >= 100 TPM in at least one ENCODE human LR-RNA-seq library. f, Number of samples in which each GENCODE v40 gene is detected at >= 1 TPM in the ENCODE human LR-RNA-seq dataset. g, Boxplot of TPM of polyA genes at the indicated rank in each human LR-RNA-seq library. Not significant (no stars) P > 0.05; *P <= 0.05, **P <= 0.01, ***P <= 0.001, ****P <= 0.0001; Wilcoxon rank-sum test.
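The detection percentages in panels b and e can be illustrated with a toy calculation: a gene counts as detected at a threshold if at least one library reaches it. The sketch below is an assumption-laden example, not the ENCODE pipeline; the function name `percent_detected`, the gene names, and the TPM values are invented for illustration.

```python
def percent_detected(tpm_by_gene, threshold, strict=False):
    """Percentage of genes that pass the TPM threshold in at least one library.

    tpm_by_gene maps a gene ID to its TPM values, one per library.
    strict=True uses '>' (as for the '> 0 TPM' cutoff); otherwise '>='.
    """
    def passes(value):
        return value > threshold if strict else value >= threshold

    detected = sum(
        1 for values in tpm_by_gene.values() if any(passes(v) for v in values)
    )
    return 100.0 * detected / len(tpm_by_gene)


# Hypothetical TPM values for four genes across three libraries.
toy_tpm = {
    "gene1": [0.0, 0.0, 0.0],
    "gene2": [0.5, 0.0, 2.0],
    "gene3": [150.0, 3.0, 0.0],
    "gene4": [0.2, 0.0, 0.0],
}

pct_any = percent_detected(toy_tpm, 0, strict=True)  # > 0 TPM
pct_1 = percent_detected(toy_tpm, 1)                 # >= 1 TPM
pct_100 = percent_detected(toy_tpm, 100)             # >= 100 TPM
```

With these toy values, three of four genes pass the > 0 TPM cutoff, two pass >= 1 TPM, and one passes >= 100 TPM.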