The recent success of deep learning techniques in machine learning and artificial intelligence has stimulated a great deal of interest among bioinformaticians, who now wish to bring the power of deep learning to bear on a host of bioinformatics problems. Deep learning is ideally suited for biological problems that require automatic or hierarchical feature representation for biological data when prior knowledge is limited. In this work, University of Tokyo researchers address the sequence-specific bias correction problem for RNA-seq data using Recurrent Neural Networks (RNNs) to model nucleotide sequences without pre-determining sequence structures. The sequence-specific bias of a read is then calculated from the sequence probabilities estimated by the RNNs and used in the estimation of gene abundance.
The researchers explore the application of two popular RNN recurrent units for this task and demonstrate that RNN-based approaches provide a flexible way to model nucleotide sequences without knowledge of predetermined sequence structures. These experiments show that training an RNN-based nucleotide sequence model is efficient and that RNN-based bias correction methods compare well with the state-of-the-art sequence-specific bias correction method on the commonly used MAQC-III data set.
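To make the step from sequence probabilities to a per-read bias more concrete, here is a minimal Python sketch. It assumes the bias of a read is the ratio of its sequence probability under a model trained on foreground sequences to its probability under a model trained on background sequences. The summary above does not give the exact formula, so the ratio form, the function name `bias_weight`, and the example log-probabilities are all illustrative assumptions rather than the researchers' implementation.

```python
import math


def bias_weight(fg_logprob: float, bg_logprob: float) -> float:
    """Assumed bias weight: P_foreground(seq) / P_background(seq).

    Both arguments are natural-log probabilities, so the ratio is
    computed as exp(difference) to avoid underflow on long sequences.
    """
    return math.exp(fg_logprob - bg_logprob)


# Hypothetical log-probabilities for one read's surrounding sequence.
fg_logprob = math.log(0.02)  # under the foreground sequence model
bg_logprob = math.log(0.01)  # under the background sequence model

weight = bias_weight(fg_logprob, bg_logprob)
print(f"bias weight: {weight:.3f}")
```

Under this assumption, a weight above 1 means the sequence context is seen at read positions more often than at randomly offset positions, so reads with that context would be down-weighted when estimating abundance.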
RNN-based sequence-specific bias correction pipeline
(a) The pipeline of the RNN-based bias correction method for gene expression estimation. (b) An example of foreground and background sequences. Foreground sequences are extracted surrounding the read start-end positions, and background sequences are extracted by randomly offsetting the selected read start-end positions. (c) Training RNN sequence models on foreground and background sequences. The probability of the sequence is calculated with the RNN prediction scores in the red rectangle, excluding the initial position.
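Panels (b) and (c) can be sketched in a few lines of Python. The example below extracts a foreground window around a read position, draws a background window at a randomly offset position, and scores a sequence by summing the log of the predicted probability of each base given its prefix, skipping the initial position as the caption describes. The window sizes, the maximum offset, the toy transcript, and every function name are hypothetical, and `uniform_model` is only a stand-in for a trained RNN.

```python
import math
import random


def extract_window(transcript, pos, left, right):
    """Return transcript[pos - left : pos + right], or None if out of range."""
    start, end = pos - left, pos + right
    if start < 0 or end > len(transcript):
        return None
    return transcript[start:end]


def background_position(pos, max_offset, rng):
    """Shift a read position by a random non-zero offset in [-max_offset, max_offset]."""
    offset = 0
    while offset == 0:
        offset = rng.randint(-max_offset, max_offset)
    return pos + offset


def sequence_log_prob(seq, next_base_probs):
    """Sum log P(seq[t] | seq[:t]) for t >= 1; the initial position is not scored."""
    total = 0.0
    for t in range(1, len(seq)):
        probs = next_base_probs(seq[:t])  # mapping: base -> probability
        total += math.log(probs[seq[t]])
    return total


def uniform_model(prefix):
    """Stand-in for a trained RNN: every base is equally likely."""
    return {base: 0.25 for base in "ACGT"}


rng = random.Random(0)
transcript = "ACGTACGTTGCAACGTTAGC"  # toy transcript, 20 bases
read_start = 10                      # hypothetical read start position

foreground = extract_window(transcript, read_start, 3, 3)
background = extract_window(
    transcript, background_position(read_start, 4, rng), 3, 3
)

print(foreground, sequence_log_prob(foreground, uniform_model))
print(background, sequence_log_prob(background, uniform_model))
```

With a real model, `next_base_probs` would return the RNN's softmax output after reading the prefix, and separate models would be trained on the foreground and background sets.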