Expression of different forms of genes in human tissues has been mapped at unprecedented depth using emerging sequencing technology
Research on RNA diversity in human tissues, led by scientists from the New York Genome Center and the Broad Institute, is described in a recent study published in Nature. When the genetic code is transcribed to RNA, one gene typically produces several different forms of RNA molecules, or transcripts, with different functions. While this phenomenon has been known for decades, the catalog of human transcripts has remained incomplete.
“Equipped with the latest sequencing technology, we were able to read segments of over one thousand nucleotides, compared to less than one hundred with standard approaches,” says Dr. Beryl Cummings, one of the leaders of the project and formerly a postdoctoral fellow at the Broad Institute. “Importantly, we were able to do this at a scale of over 80 samples from many tissues, which led to the discovery of tens of thousands of novel transcripts,” she adds.
The researchers used their data to characterize how genetic and environmental differences can manifest in differences in the transcriptome.
“Genetic differences between individuals can affect how genes are regulated. We were able to describe with a finer resolution than before how transcript structures are affected. This helps to understand molecular underpinnings of variants that contribute to disease risk,” explains Dr. Dafni Glinos from the New York Genome Center and co-first author of the study.
“We believe the discoveries, data, and tools we present pave the way for a new era of transcriptome research. About a decade ago, high-throughput analysis of small DNA or RNA segments revolutionized genomics. I think we’re at the cusp of a new revolution with long-read sequencing,” says Professor Tuuli Lappalainen from the New York Genome Center and one of the leaders of the study.
LORALS pipeline development and aligning statistics
A) Pipeline for allele-specific analysis. Raw long reads are first aligned to the genome using minimap2. This alignment is used to correct the phase of some of the heterozygous variants in the whole-genome sequencing VCF. The corrected VCF is then used to generate personalized genome reference files, against which the raw reads are aligned again using minimap2. The raw reads are also aligned to the transcriptome using minimap2. The VCF file, the genome-aligned reads, and the transcriptome-aligned reads are then fed into LORALS for allelic analysis.
B) Percentage of switched haplotypes per donor, as informed by the long-read data. For this, all samples from the same donor were merged to harmonize the files.
C) Percentage of haplotype-specific reads, calculated as reads having a higher mapping score when using a personalized genome reference.
D) Delta, calculated as the difference in the start position of the aligned read between the genome-aligned and the personalized-genome-aligned reads. Reads with Delta = 0 are not shown.
E) Reference ratio for the samples in this study sequenced using Illumina technology and ONT technology aligned with two different approaches.
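The quantities in panels C, D and E can be illustrated with a minimal Python sketch. This is not the LORALS implementation: the `Alignment` class and the three function names are hypothetical, and several details are assumptions, namely which mapping score is compared, that Delta is taken as the genome start minus the personalized-genome start, and that the reference ratio is reference reads divided by all reads at a heterozygous site.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Alignment:
    """Hypothetical summary of one read's alignment to one reference."""
    start: int  # leftmost aligned position of the read on the reference
    score: int  # mapping score; which score is compared is an assumption


def percent_haplotype_specific(genome: Dict[str, Alignment],
                               personalized: Dict[str, Alignment]) -> float:
    """Percentage of reads scoring higher on the personalized reference (panel C)."""
    shared = genome.keys() & personalized.keys()
    if not shared:
        return 0.0
    better = sum(1 for read_id in shared
                 if personalized[read_id].score > genome[read_id].score)
    return 100.0 * better / len(shared)


def start_deltas(genome: Dict[str, Alignment],
                 personalized: Dict[str, Alignment]) -> Dict[str, int]:
    """Per-read Delta in start position, dropping reads with Delta = 0 (panel D).

    Assumed sign convention: genome start minus personalized-genome start.
    """
    deltas = {}
    for read_id in genome.keys() & personalized.keys():
        delta = genome[read_id].start - personalized[read_id].start
        if delta != 0:
            deltas[read_id] = delta
    return deltas


def reference_ratio(ref_reads: int, alt_reads: int) -> Optional[float]:
    """Assumed definition for panel E: ref / (ref + alt); None if no coverage."""
    total = ref_reads + alt_reads
    return ref_reads / total if total else None
```

In practice the per-read start positions and scores would come from the two minimap2 alignments of the same raw reads, keyed by read name. Only reads present in both alignments are compared here, which is a simplification.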
Source – New York Genome Center
Availability – All original code is released as part of a software package, https://github.com/LappalainenLab/lorals. General scripts are available at https://github.com/LappalainenLab/lorals_paper_code