Automated transcript discovery and quantification from high-throughput RNA sequencing (RNA-seq) data are important tasks in next-generation sequencing (NGS) research. However, these tasks are challenging because complete splicing isoform variants must be inferred from short reads that only partially cover them, which introduces uncertainty. Here, researchers from Tsinghua University address this problem by explicitly reducing the uncertainty caused by this missing information. In their approach, the RNA-seq procedure that transforms transcripts into short reads is treated as an information transmission process. The data uncertainties are then substantially reduced by exploiting the concept of information transduction capacity from information theory. Experimental results from analyses of simulated datasets and of RNA-seq datasets from cell lines and tissues demonstrate the advantages of their method over state-of-the-art competitors.
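To make the information-transmission view concrete, the following minimal sketch computes the mutual information between a transcript variable T and a read variable R for a toy discrete channel. The function names, the two-isoform setup, and all probabilities are assumptions chosen for illustration. This is not the MaxInfo objective or implementation, only the textbook quantity I(T;R) = H(R) - H(R|T).

```python
import math


def entropy(probs):
    """Shannon entropy in bits of a discrete distribution (list of probabilities)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def mutual_information(p_t, p_r_given_t):
    """Mutual information I(T;R) = H(R) - H(R|T) in bits.

    p_t          -- prior over transcripts, length n
    p_r_given_t  -- n rows; row i is the distribution over read classes
                    produced by transcript i (hypothetical toy channel)
    """
    n = len(p_t)
    m = len(p_r_given_t[0])
    # Marginal over read classes: p(r) = sum_t p(t) * p(r | t)
    p_r = [sum(p_t[i] * p_r_given_t[i][j] for i in range(n)) for j in range(m)]
    # Conditional entropy: H(R|T) = sum_t p(t) * H(R | T = t)
    h_r_given_t = sum(p_t[i] * entropy(p_r_given_t[i]) for i in range(n))
    return entropy(p_r) - h_r_given_t


# Two equally abundant isoforms (assumed prior).
prior = [0.5, 0.5]

# Reads identify their isoform perfectly: the channel transmits 1 bit.
distinct = [[1.0, 0.0], [0.0, 1.0]]

# Reads fall only in shared regions: the channel transmits nothing.
shared = [[0.5, 0.5], [0.5, 0.5]]

# Reads are mostly, but not always, isoform-specific.
noisy = [[0.9, 0.1], [0.1, 0.9]]

for name, channel in [("distinct", distinct), ("shared", shared), ("noisy", noisy)]:
    print(name, round(mutual_information(prior, channel), 3))
```

The two extreme cases show why reads that cover only shared exons leave the isoform of origin uncertain, while isoform-specific reads remove that uncertainty.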
Overview of MaxInfo
(A) RNA-seq procedures viewed from the perspective of information transduction. The RNA-seq procedures form a coding channel that transmits information from the source to the receiver. At the two ends of the channel, isoforms are the signal source and short reads are the encoded signal that is received. (B) Algorithmic gene prediction and candidate isoform reconstruction. For illustration purposes, two genes (A and B) are located on the genome and delimited by the read distribution. Within gene A, eight subexons are identified and used as nodes to construct a directed graph. A source node and a sink node are added to the graph to mark the start and end exons of a putative isoform. (C) Information transduction capacity model. H(T) and H(R) represent the entropies of transcripts and reads, respectively. I(T; R) is the mutual information, which measures the information content shared by the transcripts and the reads. A probabilistic graphical model (in the rectangle) depicts the read generation procedure from transcripts (T) to RNA-seq data (R). The two read nodes indicate a pair of reads (paired-end). In the graphical model, S and L represent the starting position along the transcript and the length of the fragment, respectively, and Q describes the match quality of the read alignment.
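The graph construction in panel (B) can be illustrated with a small sketch: subexons are nodes, observed junctions are directed edges, and every path from the added source node to the added sink node is a candidate isoform. The node labels (`SRC`, `SNK`, `e1` to `e4`), the edge list, and the function name are hypothetical, and exhaustive path enumeration is used here only for clarity; the caption does not state how MaxInfo selects candidates.

```python
def enumerate_candidate_isoforms(edges, source="SRC", sink="SNK"):
    """List every source-to-sink path in a directed acyclic graph.

    edges  -- iterable of (from_node, to_node) pairs; assumed acyclic,
              which holds when edges always point downstream on the genome
    Returns a list of paths, each a list of subexon labels with the
    artificial source and sink nodes removed.
    """
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)

    paths = []

    def walk(node, visited_subexons):
        if node == sink:
            paths.append(visited_subexons)
            return
        for nxt in adjacency.get(node, []):
            if nxt == sink:
                walk(nxt, visited_subexons)
            else:
                walk(nxt, visited_subexons + [nxt])

    walk(source, [])
    return paths


# Toy gene with four subexons (made-up example, not from the figure).
toy_edges = [
    ("SRC", "e1"),
    ("e1", "e2"),
    ("e1", "e3"),   # junction that skips e2
    ("e2", "e3"),
    ("e3", "e4"),
    ("e3", "SNK"),  # an isoform may end at e3
    ("e4", "SNK"),
]

for isoform in enumerate_candidate_isoforms(toy_edges):
    print(" -> ".join(isoform))
```

The number of paths can grow exponentially with the number of alternative junctions, so a practical tool would need to prune or score candidates rather than list them all.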
Availability – The MaxInfo software package is available at http://maxinfo.sourceforge.net for public use, and the source code is included.