The advent of high-throughput RNA sequencing (RNA-seq) has revealed that eukaryotic genomes encode transcriptomes of unprecedented size. However, transcriptome maps remain incomplete, partly because most were reconstructed from RNA-seq reads that lack strand orientation (known as unstranded reads) and precise transcript boundary information. Methods that expand the usability of unstranded RNA-seq data by predicting the orientation of the reads and precisely determining the boundaries of assembled transcripts could significantly improve the quality of the resulting transcriptome maps.
Researchers from Hanyang University have developed a high-performing transcriptome assembly pipeline, called CAFE, that significantly improves assemblies built from stranded and/or unstranded RNA-seq data by orienting unstranded reads using maximum likelihood estimation and by integrating information about transcription start sites and cleavage and polyadenylation sites. Applying CAFE to large-scale transcriptomic data comprising 230 billion RNA-seq reads from the ENCODE and Human BodyMap 2.0 Projects, The Cancer Genome Atlas (TCGA), and GTEx, the researchers predicted the directions of about 220 billion unstranded reads. This led to more accurate transcriptome maps, comparable to the manually curated map, and to a comprehensive lncRNA catalogue that includes thousands of novel lncRNAs. The pipeline should help not only to build comprehensive, precise transcriptome maps from complex genomes but also to expand the universe of non-coding genomes.
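To make the idea of orienting unstranded reads by maximum likelihood more concrete, here is a minimal Python sketch. It is not CAFE's implementation: it assumes a simple per-locus Bernoulli model in which stranded reads give the maximum likelihood estimate of the plus-strand fraction, and an unstranded read at that locus is assigned to the more likely strand only when the estimate is confident enough. The function names, the locus keys, and the min_confidence threshold are all hypothetical and chosen for illustration.

```python
from collections import Counter


def estimate_plus_fraction(stranded_reads):
    """Per-locus maximum likelihood estimate of P(strand = '+').

    stranded_reads: iterable of (locus, strand) pairs, strand in {'+', '-'}.
    Under a Bernoulli model the MLE is simply n_plus / (n_plus + n_minus).
    """
    counts = {}
    for locus, strand in stranded_reads:
        counts.setdefault(locus, Counter())[strand] += 1
    return {
        locus: c["+"] / (c["+"] + c["-"])
        for locus, c in counts.items()
        if c["+"] + c["-"] > 0
    }


def orient_read(locus, plus_fraction, min_confidence=0.8):
    """Assign a strand to an unstranded read at `locus`.

    Returns '+' or '-' when the estimated strand probability reaches
    min_confidence (a hypothetical threshold), otherwise '.' (undetermined).
    """
    p = plus_fraction.get(locus)
    if p is None:
        return "."  # no stranded evidence at this locus
    if p >= min_confidence:
        return "+"
    if 1.0 - p >= min_confidence:
        return "-"
    return "."  # strands too evenly balanced to call


# Toy data: geneA is mostly '+', geneB is all '-', geneC is balanced.
stranded = (
    [("geneA", "+")] * 9 + [("geneA", "-")]
    + [("geneB", "-")] * 5
    + [("geneC", "+"), ("geneC", "-")]
)
fractions = estimate_plus_fraction(stranded)
calls = {g: orient_read(g, fractions) for g in ("geneA", "geneB", "geneC", "geneD")}
```

In this toy example, reads at geneA and geneB receive a strand call, while reads at geneC (balanced evidence) and geneD (no stranded evidence) stay undetermined. A real pipeline would work with genomic coordinates and alignments rather than gene labels.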
Comprehensive human transcriptome map
(A) A schematic flow for the reconstruction of the BIGTranscriptome map using large-scale RNA-seq samples from human cell lines, ENCODE, and Human BodyMap 2.0 Projects. (B) Accuracies of unstranded (blue) and RPD assemblies (mint) from the ENCODE and Human BodyMap 2.0 Projects. (C) Sensitivities (red) and specificities (blue) of unstranded assemblies (solid line box) and RPD assemblies (dotted line box) are shown in box plots. The unstranded RNA-seq data are from GTEx (14 tissues) and the TCGA Project (5 tumour types). The numbers (n) indicate the sample numbers in each group. CRBL: brain cerebellum, CTX: brain cortex, FCTX: brain frontal cortex, HPC: brain hippocampus, HTH: brain hypothalamus, ESO: esophagus – mucosa, PAN: pancreas, PRO: prostate, ESCA: esophageal carcinoma, HNSC: head and neck squamous cell carcinoma, LIHC: liver hepatocellular carcinoma, LUAD: lung adenocarcinoma, and LUSC: lung squamous cell carcinoma. (D) Accuracies of BIGTranscriptome and MiTranscriptome at the base and intron levels based on four different sets of annotations (RefSeq, manual and automatic GENCODE, PacBio, and EST), and a combined set of annotations. SN: sensitivity and SP: specificity. (E-F) Maximum entropy scores of the putative splice donor sites (E) and of putative splice acceptor sites (F). Blue lines are from BIGTranscriptome, green lines are from the PacBio assembly, and orange lines are from MiTranscriptome. (G) The fraction of TFBSs upstream of the 5’ end of BIGTranscriptome transcripts (blue) compared to those of MiTranscriptome (orange), GENCODE (automatic) (black), and the PacBio assembly (green). (H) The fraction of the closest poly(A) signals, AAUAAA, in the region just upstream of the 3’ end of BIGTranscriptome annotations (blue) compared to those of MiTranscriptome (orange), GENCODE (automatic) (black), and the PacBio assembly (green).
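The base-level sensitivity (SN) and specificity (SP) reported in the figure can be illustrated with a short Python sketch. It assumes the convention common in assembly benchmarking, where sensitivity is the fraction of reference bases covered by the assembly and specificity is the fraction of assembled bases supported by the reference; the article does not state which definition was used, so this is an assumption. The function names and the (chromosome, strand, start, end) interval format with half-open coordinates are hypothetical.

```python
def covered_bases(intervals):
    """Expand (chrom, strand, start, end) half-open intervals into a set of bases."""
    bases = set()
    for chrom, strand, start, end in intervals:
        bases.update((chrom, strand, pos) for pos in range(start, end))
    return bases


def base_level_accuracy(reference, assembly):
    """Return (sensitivity, specificity) at the base level.

    sensitivity = shared bases / reference bases
    specificity = shared bases / assembled bases
    Strand is part of each base's identity, so a wrongly oriented
    transcript contributes no shared bases.
    """
    ref = covered_bases(reference)
    asm = covered_bases(assembly)
    shared = len(ref & asm)
    sensitivity = shared / len(ref) if ref else 0.0
    specificity = shared / len(asm) if asm else 0.0
    return sensitivity, specificity


# Toy example: one reference exon and three candidate assemblies.
reference = [("chr1", "+", 100, 200)]
shifted = base_level_accuracy(reference, [("chr1", "+", 150, 250)])
partial = base_level_accuracy(reference, [("chr1", "+", 100, 150)])
wrong_strand = base_level_accuracy(reference, [("chr1", "-", 100, 200)])
```

Here the shifted exon scores 0.5 on both measures, the partial exon is fully specific but only half sensitive, and the wrong-strand exon scores zero on both, which shows why correct read orientation matters for these metrics. Expanding intervals into sets is only practical for toy inputs; genome-scale comparisons would use interval arithmetic instead.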
Availability – The source code and a detailed manual for CAFE are available at: http://big.hanyang.ac.kr/CAFE