A multi-institution research group jointly led by Professor Jan Prins and UNC CS alumna Professor Jinze Liu at the University of Kentucky is developing algorithms and software to analyze RNA using high-throughput sequencers. The software is in use in multiple research efforts, including The Cancer Genome Atlas (TCGA) program.
TCGA is a national, NIH-sponsored project to characterize the genetic basis of different cancers and to classify the myriad ways in which changes in a cell’s DNA can lead to cancer’s defining characteristic of unregulated cell growth. While most participants in this program analyze the DNA sequence of thousands of different tumor samples, the focus at UNC is on analyzing the RNA.
To understand the UNC focus, a CS analogy might be helpful. The genome (DNA) is often described as the “program” controlling the cell. In “execution” of the program, short sequences of DNA are transcribed, creating RNA molecules that have regulatory effects in the cell, or are translated to proteins making up the basic machinery of the cell. In turn, small molecules produced in response to conditions throughout the cell regulate the locus and frequency of DNA transcription of the genome, controlling execution. Most TCGA efforts analyze tumor genomes to identify mutations not found in healthy cells. In other words, they identify errors in the program that appear to be associated with uncontrolled cell growth. In contrast, UNC’s project examines the transcriptome (RNA) harvested from tumor samples to understand how execution of the genomic program is changed in tumor cells.
RNA molecules generally degrade quickly within the cell, so RNA transcripts extracted from a cell provide an execution snapshot. RNA transcripts are easily converted back to DNA molecules that are broken into short fragments, which are then sequenced using high-throughput DNA sequence analyzers, a process referred to as RNA-seq. For the TCGA RNA-seq protocol, fragments are around 200 nucleotides in length, and 50 nucleotides are sequenced from each end of a fragment. For a given tumor sample, 100-150 million fragments might be sequenced, each yielding two sequences, or reads, of length 50, for several hundred million reads in total. To obtain insight into transcriptome changes in cancer cells, we need to determine the genomic origin of all the reads and compare the locus and abundance of transcription, either between healthy and tumor cells or between different types of tumors.
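The read counts above follow from simple arithmetic, sketched here in Python. The function name and its parameters are illustrative only, and the sketch assumes that each fragment yields exactly two reads (one from each end), which is how the figures in the text are read here.

```python
def paired_end_yield(fragments, fragment_length=200, read_length=50):
    """Estimate the output of a paired-end run (illustrative helper).

    Assumes one read is taken from each end of every fragment.
    Returns (reads, sequenced_bases, unread_bases_per_fragment).
    """
    reads = 2 * fragments                      # one read per fragment end
    sequenced_bases = reads * read_length      # total nucleotides read
    # Nucleotides in the middle of each fragment that are never sequenced.
    unread_per_fragment = max(fragment_length - 2 * read_length, 0)
    return reads, sequenced_bases, unread_per_fragment


low = paired_end_yield(100_000_000)
high = paired_end_yield(150_000_000)
```

Under these assumptions, 100-150 million fragments give 200-300 million reads, and roughly the middle 100 nucleotides of each 200-nucleotide fragment go unread.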
Transcriptome analysis has seen intense activity over the past few years as RNA-seq enables unprecedented visibility into the transcriptome. The joint research group, funded by grants from NSF and NIH, has developed MapSplice, FDM, DiffSplice and other methods in daily use by the TCGA project and by other researchers around the world.
The MapSplice method determines the genomic origin of reads in an RNA-seq dataset. In principle, locating a sequence of length 50 in the 3.2 billion nucleotide reference genome can be accomplished very efficiently. In practice, three complications arise. First, the human reference genome will differ from any specific individual’s genome in many ways, and the reads themselves may contain sequencing errors, so approximate matching techniques are needed. Second, the human genome is hardly a random sequence – it contains many similar sequences repeated over and over, the result of ancient duplication events and evolution, giving rise to ambiguous matches. The final, and most difficult, challenge is that RNA transcripts may reflect “splicing” in which sections of the transcript are excised to yield the final observed RNA sequence. To identify the genomic origin of such sequences we require an alignment to the reference genome that can include gaps, reflecting splices. Taken together, these complications make it possible to align almost any read in multiple ways. Biological cues and constraints can help, but the biggest help comes from the abundance of data itself. We can observe candidate alignments of all reads in the data simultaneously to discover splices and variations from the reference genome that are consistently supported, and use this information to determine the correct genomic origin.
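To make the idea of a gapped alignment concrete, here is a toy Python sketch. It is not the MapSplice algorithm: it uses exact string matching only (no mismatches), allows at most one gap per read, and every name in it (`find_all`, `spliced_alignments`, `min_anchor`, `max_gap`) is hypothetical. It only shows how a read that matches nowhere as a whole can still be placed once a gap is allowed, and how repeated sequence yields several candidate placements.

```python
from typing import List, Tuple


def find_all(genome: str, pattern: str) -> List[int]:
    """Return every start position where pattern occurs exactly in genome."""
    positions = []
    start = genome.find(pattern)
    while start != -1:
        positions.append(start)
        start = genome.find(pattern, start + 1)
    return positions


def spliced_alignments(genome: str, read: str,
                       min_anchor: int = 3,
                       max_gap: int = 50) -> List[Tuple[int, int, int]]:
    """List candidate alignments as (start, split, gap) tuples.

    start: genome position of the first read base
    split: number of read bases placed before the gap
    gap:   number of genome bases skipped (0 means unspliced)
    """
    # Unspliced candidates: the whole read matches contiguously.
    candidates = [(pos, len(read), 0) for pos in find_all(genome, read)]

    # Spliced candidates: cut the read in two, keeping at least
    # min_anchor bases on each side, and place the halves apart.
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        for left_pos in find_all(genome, left):
            left_end = left_pos + split
            for right_pos in find_all(genome, right):
                gap = right_pos - left_end
                if 1 <= gap <= max_gap:
                    candidates.append((left_pos, split, gap))
    return candidates


# Toy genome: flank + exon "ACGTAC" + intron "GTTTTAG" + exon "CCATGG" + flank.
GENOME = "TT" + "ACGTAC" + "GTTTTAG" + "CCATGG" + "TT"
# Toy read spanning the junction: last 4 bases of exon 1, first 4 of exon 2.
JUNCTION_READ = "GTAC" + "CCAT"
```

With these toy inputs, `spliced_alignments(GENOME, JUNCTION_READ)` returns the single candidate `(4, 4, 7)`: the read starts at genome position 4, four bases fall before the gap, and the gap skips the seven intron bases. A read drawn from repeated sequence, such as `"ACGT"` against the genome `"ACGTTTACGT"`, returns two unspliced candidates, which is the ambiguity the paragraph describes. Resolving such ties by pooling evidence across all reads is the part this sketch leaves out.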
Within the UNC TCGA project, MapSplice has identified “gene fusions” that are the result of broken sections of DNA being re-incorporated in the wrong location. This may happen relatively often genome-wide, but the transcriptome view can detect whether such alterations may affect transcription in genes that control cell proliferation. MapSplice also predicted the presence of “circular RNA” in which splicing aberrations cause a transcript to link back to itself. Circular RNA does not degrade easily, and consequently affects regulation; its association with certain tumor suppressor genes suggests a role in cancer. The FDM and DiffSplice methods compare transcriptomes between samples using algorithms that observe differences in read alignments between RNA-seq datasets. Comparing tumor and normal RNA-seq datasets with FDM and DiffSplice revealed splicing and abundance changes in the transcription of key tumor suppressor genes as a result of certain mutations in this area of the genome.
Uncovering the method by which specific DNA mutations can lead to uncontrolled cell growth suggests possible ways in which specific tumor cells may be identified and neutralized, as well as possible therapies to offset transcriptional changes. Meanwhile, sequencing technologies are advancing rapidly, promising more detailed insight. However, more data also presents additional challenges. Even with 99.5% alignment accuracy, given 100 million reads, half a million might be aligned incorrectly and could give rise to many incorrect conclusions. So boon or curse – these are interesting times for computer scientists, indeed.
Source – UNC News & Notes