RNA-Sequencing has been the leading technology for quantifying the expression of thousands of genes simultaneously. The data analysis of an RNA-Seq experiment starts by aligning short reads to a reference genome or transcriptome, or to a reconstructed transcriptome. However, current aligners lack the sensitivity to distinguish reads that come from homologous regions of a genome. One group of these homologous regions is the paralogous pseudogenes. Pseudogenes arise from the duplication of a set of protein-coding genes and have been regarded as degraded paralogs in the genome because of their loss of functionality. Recent studies, however, have provided evidence supporting novel regulatory roles for them in biological processes. With growing interest in quantifying the expression level of pseudogenes in different tissues and cell lines, it is critical to have a sensitive method that can correctly align ambiguous reads and accurately estimate expression levels among homologous genes.
Previously, in PseudoLasso, UCLA computer scientists proposed a linear regression approach that learns read alignment behaviors and leverages this knowledge for abundance estimation and alignment correction. In this work, they extend PseudoLasso by grouping homologous genomic regions into communities with a community detection algorithm and then building a separate linear regression model for each community. The results show that this approach retains the same accuracy as the original PseudoLasso. By breaking the genome into smaller homologous communities, the growth of the running time with respect to the number of genes improves from quadratic to linear.
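The grouping step can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes genes are linked when they share at least `min_shared` k-mers, and it uses plain connected components (via union-find) as a simple stand-in for a real community detection algorithm. The function names, the threshold, and the toy sequences are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def homologous_communities(genes, k=4, min_shared=1):
    """Group genes that share at least min_shared k-mers.

    Assumption: connected components stand in for community detection.
    """
    parent = {name: name for name in genes}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    features = {name: kmers(seq, k) for name, seq in genes.items()}
    # Toy all-pairs comparison; fine for a handful of genes only.
    for a, b in combinations(genes, 2):
        if len(features[a] & features[b]) >= min_shared:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for name in genes:
        groups[find(name)].append(name)
    return sorted(sorted(members) for members in groups.values())


# Made-up sequences: geneA and pseudoA share a long prefix, geneB is unrelated.
genes = {
    "geneA": "ACGTACGTAA",
    "pseudoA": "ACGTACGTTT",
    "geneB": "GGGCCCGGGC",
}
communities = homologous_communities(genes, k=6)
```

Once genes are partitioned this way, a model only needs to relate the genes inside one community to each other, so each regression problem stays small regardless of how many genes the genome contains.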
The enhanced framework of PseudoLasso
In the training stage, gene features are represented by all substrings of the gene sequences. These features are used to identify homologous loci through a community detection approach. For the same list of genes, paired-end short reads are simulated at different coverage levels and aligned back to the reference genome, providing candidate models that describe the read distribution of experimental data. These distribution matrices are normalized and split according to the homologous community partition. The read counts of genes in validation or experimental data are estimated through these models. The two tasks in the training stage are indicated by different colors: steps with blue arrows describe the first task, feature generation and community detection; steps with purple arrows describe the second task, read distribution computation within each community. In the validation stage, reads from other experiments are first aligned to the reference genome. The alignment profiles are matched to the best normalized matrices from the training stage using k-nearest neighbor classification. The predicted read counts are then optimized by solving a non-negative least squares problem.
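The validation stage described above can be sketched in a few lines of Python for a single community. This is an illustration under stated assumptions, not the published method: each training model is assumed to store a `profile` (the fraction of aligned reads per locus) and a column-normalized `matrix` whose column j describes where reads from gene j end up; matching uses a single nearest neighbor, and the non-negative least squares problem is solved with plain projected gradient descent. All names and numbers are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def nearest_model(profile, models):
    """Pick the training model whose profile is closest (1-nearest neighbor)."""
    return min(models, key=lambda m: math.dist(profile, m["profile"]))


def nnls_projected_gradient(D, y, steps=2000):
    """Minimize ||D x - y||^2 subject to x >= 0 by projected gradient descent."""
    Dt = [list(col) for col in zip(*D)]
    # The squared Frobenius norm bounds the largest eigenvalue of D^T D,
    # so 1 / bound is a safe step size.
    step = 1.0 / sum(v * v for row in D for v in row)
    x = [0.0] * len(Dt)
    for _ in range(steps):
        residual = [p - t for p, t in zip(matvec(D, x), y)]
        gradient = matvec(Dt, residual)
        # Gradient step followed by projection onto the non-negative orthant.
        x = [max(0.0, xi - step * g) for xi, g in zip(x, gradient)]
    return x


# Hypothetical training models for one community of two loci.
models = [
    {"coverage": 5, "profile": [0.50, 0.50], "matrix": [[0.6, 0.4], [0.4, 0.6]]},
    {"coverage": 30, "profile": [0.63, 0.37], "matrix": [[0.8, 0.3], [0.2, 0.7]]},
]

observed = [95.0, 55.0]  # reads aligned to each locus (made-up numbers)
profile = [count / sum(observed) for count in observed]

model = nearest_model(profile, models)
estimated = nnls_projected_gradient(model["matrix"], observed)
```

The non-negativity constraint matters because read counts cannot be negative; an unconstrained least squares fit could assign a negative count to one homolog to compensate for another.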