Improved transcript discovery by reducing the uncertainty in partially observed short reads

Automated transcript discovery and quantification from high-throughput RNA sequencing (RNA-seq) data are important tasks in next-generation sequencing (NGS) research. These tasks are challenging, however, because complete splicing isoforms must be inferred from short reads that each cover only part of a transcript. Here, researchers from Tsinghua University address this problem by explicitly reducing the uncertainty that this missing information introduces. In their approach, the RNA-seq procedure that transforms transcripts into short reads is treated as an information transmission process, and the uncertainty in the data is substantially reduced by exploiting the information transduction capacity, a concept drawn from information theory. Experimental results on simulated datasets and on RNA-seq datasets from cell lines and tissues demonstrate the advantages of their method over state-of-the-art competitors.
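To make the "information transmission" view concrete, the sketch below computes the mutual information between a discrete source T (transcripts) and a receiver R (reads) from a joint probability table. It is only an illustration of the textbook quantity I(T;R) = H(T) + H(R) - H(T,R), not the MaxInfo implementation; the function names and the toy probability tables are assumptions made for this example.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def mutual_information(joint):
    """I(T;R) in bits, where joint[t][r] = P(T = t, R = r).

    The table is hypothetical toy input, not real RNA-seq data.
    """
    p_t = [sum(row) for row in joint]        # marginal over transcripts
    p_r = [sum(col) for col in zip(*joint)]  # marginal over reads
    h_joint = entropy([p for row in joint for p in row])
    return entropy(p_t) + entropy(p_r) - h_joint


# Toy channel 1: every read identifies its transcript unambiguously.
noiseless = [[0.5, 0.0],
             [0.0, 0.5]]

# Toy channel 2: reads carry no information about their transcript.
uninformative = [[0.25, 0.25],
                 [0.25, 0.25]]

print(mutual_information(noiseless))
print(mutual_information(uninformative))
```

In the first toy channel the reads fully determine the transcript, so the mutual information equals the source entropy; in the second it drops to zero. Real short reads sit between these extremes, which is where the uncertainty discussed above comes from.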

Overview of MaxInfo


(A) The RNA-seq procedure viewed as information transduction. The RNA-seq procedure forms a coding channel that transmits information from the source to the receiver: at the two ends of the channel, isoforms are the signal source and short reads are the encoded output. (B) Algorithmic gene prediction and candidate isoform reconstruction. For illustration, two genes (A and B) are located on the genome and delimited by the read distribution. Within gene A, eight subexons are identified and used as nodes to construct a directed graph. A source node and a sink node are added to the graph to mark the start and end exons of a putative isoform. (C) Information transduction capacity model. H(T) and H(R) represent the entropies of transcripts and reads, respectively. I(T;R) is the mutual information, which measures the information content shared by the transcripts and the reads. A probabilistic graphical model (in the rectangle) depicts how RNA-seq data (R) are generated from transcripts (T). The pair of read symbols in the figure indicates paired-end reads. In the graphical model, S and L represent the starting position along the transcript and the length of the fragment, respectively, and Q describes the match quality of the read alignment.
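The candidate isoform reconstruction in panel (B) can be sketched as path enumeration on a directed acyclic graph: each path from the source node to the sink node is one putative isoform. The code below is a minimal illustration under that assumption, not the MaxInfo algorithm itself; the function name, the node labels, and the toy edge list are hypothetical.

```python
def enumerate_isoforms(edges, source="source", sink="sink"):
    """Return every source-to-sink path as a list of subexon labels.

    Assumes the graph is acyclic, as it is when subexons are ordered
    by genomic coordinate. The source and sink labels themselves are
    left out of the returned paths.
    """
    graph = {}
    for upstream, downstream in edges:
        graph.setdefault(upstream, []).append(downstream)

    isoforms = []

    def walk(node, path):
        if node == sink:
            isoforms.append(path)
            return
        for successor in graph.get(node, []):
            # Append the successor unless it is the sink marker.
            next_path = path if successor == sink else path + [successor]
            walk(successor, next_path)

    walk(source, [])
    return isoforms


# Hypothetical toy gene with three subexons; e2 can be skipped.
toy_edges = [
    ("source", "e1"),
    ("e1", "e2"),
    ("e1", "e3"),
    ("e2", "e3"),
    ("e3", "sink"),
]

for isoform in enumerate_isoforms(toy_edges):
    print(" -> ".join(isoform))
```

The number of paths can grow quickly with the number of subexons, so an exhaustive enumeration like this one is only practical for small graphs; it is shown here purely to illustrate what "candidate isoform" means in graph terms.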

Availability – The MaxInfo software package, including its source code, is publicly available at http://maxinfo.sourceforge.net.

Deng Y, Bao F, Yang Y, Ji X, Du M, Zhang Z, Wang M, Dai Q. (2017) Information transduction capacity reduces the uncertainties in annotation-free isoform discovery and quantification. Nucleic Acids Res 45(15):e143. [article]
