Long intergenic non-coding RNAs (lincRNAs) are non-coding transcripts longer than 200 nucleotides that do not overlap protein-coding sequences. Such elements are known to be expressed in a tissue-specific manner and to play a widespread role in gene regulation across thousands of genomic loci. However, very little is known about the mechanisms by which these RNA elements arise during evolution, especially given their poor conservation across species. It has been proposed that lincRNAs might arise from pseudogenes. To test this systematically, researchers at the Johannes Gutenberg University of Mainz developed a novel method that searches for remnants of protein-coding sequences within lincRNA transcripts; the hypothesis is that the biogenesis of lincRNAs can be traced back to protein-coding genes or to later transposon/retrotransposon insertions. Applying this method, they found 203 human lincRNA genes with regions significantly similar to protein-coding sequences. The method also provides a visualization tool to trace the evolutionary biogenesis of lincRNAs with respect to protein-coding genes by sequence divergence. The researchers then showed the expression correlation between lincRNAs and their identified parental protein-coding genes using public RNA-seq repositories, hinting at novel gene regulatory relationships. In summary, they have developed a novel computational methodology to study non-coding gene sequences, which can be applied to identify the evolutionary biogenesis and function of lincRNAs.
Methodology for finding remnants of protein-coding sequences within lincRNA
Flow chart for the alignment of protein-coding genes (amino acid sequences) with non-coding genes (DNA sequences), with the following steps: determination of the longest common sub-string, followed by extension in both directions until a zero-probability event is found. The process iterates on unaligned regions. Then, a single block is selected with stretches of indels and/or mismatches shorter than 10 amino acids. Finally, the boundaries between different frames are optimized by maximizing the alignment score, defined according to the authors' scoring matrix.
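The first step of the flow chart, finding the longest common sub-string between a protein and a lincRNA, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the standard genetic code, considers only the three forward reading frames, and uses exact residue matching. The function names (`translate`, `longest_common_substring`, `best_seed`) are hypothetical, and the later steps (probabilistic extension, block selection, frame-boundary optimization with the scoring matrix) are not reproduced here.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G (assumption:
# the standard code is appropriate for the transcripts being compared).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame):
    """Translate a DNA string in a forward frame (0, 1 or 2).

    Stop codons become '*'; codons with unknown bases become 'X'.
    """
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def longest_common_substring(a, b):
    """Return (length, start_in_a, start_in_b) of the longest exact shared run.

    Classic dynamic programming over suffix-match lengths; stop codons and
    unknown residues never count as matches.
    """
    best_len, end_a, end_b = 0, 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1] and a[i - 1] not in "*X":
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, end_a, end_b = curr[j], i, j
        prev = curr
    return best_len, end_a - best_len, end_b - best_len


def best_seed(protein, dna):
    """Find the longest exact protein/DNA match over the three forward frames."""
    best = None
    for frame in range(3):
        peptide = translate(dna, frame)
        length, p_start, t_start = longest_common_substring(protein, peptide)
        if best is None or length > best["length"]:
            best = {
                "frame": frame,
                "length": length,
                "protein_start": p_start,          # 0-based residue index
                "dna_start": frame + 3 * t_start,  # 0-based nucleotide index
                "match": protein[p_start:p_start + length],
            }
    return best
```

For example, `best_seed("MKTAYIAKQR", "CCAAAACCGCGTATATTGCGTGA")` locates the six-residue seed `KTAYIA` in frame 2, starting at nucleotide 2 of the DNA sequence. A seed of this kind would then be the starting point for the extension step described in the flow chart.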
Availability – The open software for obtaining the lincRNA set predictions (Class I and II, 413 lincRNAs) and for aligning each lincRNA against its corresponding protein-coding gene is available in the following GitHub repository: https://github.com/swttalyan/protsInLincRNA