Comprehensive assembly of “unmappable” reads from human RNA-Seq data

Crucial parts of the genome including genes encoding microRNAs and noncoding RNAs went unnoticed for years, and even now, despite extensive annotation and assembly of the human genome, RNA-sequencing continues to yield millions of unmappable and thus uncharacterized reads.

Here, researchers from the NHLBI examined >300 billion reads from 536 normal donors and 1,873 patients encompassing 21 cancer types, identified ~300 million such uncharacterized reads, and, using a distinctive approach, de novo assembled 2,550 novel human transcripts, which mainly represent long noncoding RNAs.

Of these, 230 exhibited relatively specific expression or non-expression in certain cancer types, making them potential markers for those cancers, whereas 183 exhibited tissue specificity. Moreover, the researchers used lentiviral-mediated expression of three selected transcripts that had higher expression in normal donors than in cancer patients and found that each inhibited the growth of HepG2 cells. This analysis provides a comprehensive and unbiased resource of unmapped human transcripts and reveals their associations with specific cancers, providing potentially important new genes for therapeutic targeting.


Characterizing unmapped sequences

A. Data processing strategy for identifying missed transcripts. Sequencing reads from cancer and normal samples were mapped to the human genome and transcriptome. Abundant reads (e.g., polyA, polyC, ribosomal RNAs, phage) and low‐quality reads were discarded, and reads that mapped to known viral and bacterial sequences were removed. The remaining unmapped reads were pooled and de novo assembled to obtain previously missed transcripts. The newly assembled transcripts were annotated by their over‐ or under‐representation in each cancer and the presence of histone marks in their genomic loci. Of the two illustrated transcripts, one was expressed only in cancer and the other in both cancer and normal tissues.

B. Cancer types (inner donut) and matching normal tissue (outer donut). The numbers of samples are in parentheses. The abbreviations for the different cancer types are in Table EV1A.

C, D. Distribution of high‐quality unmapped sequencing reads across all cancer (C) and normal (D) samples after screening as described in (A).
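
The screening step described in panel A (discarding abundant, low‐quality, and contaminant reads before assembly) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the 80% single-base threshold for flagging polyA/polyC-like reads, the mean Phred cutoff of 20, and the `contaminant_ids` set (standing in for reads that an aligner matched to ribosomal, phage, viral, or bacterial sequences) are all assumptions chosen for illustration.

```python
from collections import Counter


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a FASTQ quality string (Phred+33 encoding assumed)."""
    return sum(ord(c) - offset for c in quality) / len(quality)


def is_low_complexity(seq: str, max_fraction: float = 0.8) -> bool:
    """Flag reads dominated by one base (e.g., polyA or polyC).

    The 0.8 threshold is a hypothetical value, not taken from the paper.
    """
    top_count = Counter(seq.upper()).most_common(1)[0][1]
    return top_count / len(seq) >= max_fraction


def screen_unmapped_reads(reads, min_mean_quality=20.0, contaminant_ids=frozenset()):
    """Keep unmapped reads that pass all screens.

    reads: iterable of (read_id, sequence, quality) tuples.
    contaminant_ids: hypothetical set of read IDs that a separate alignment
    step matched to ribosomal, phage, viral, or bacterial references.
    """
    kept = []
    for read_id, seq, qual in reads:
        if not seq or len(seq) != len(qual):
            continue  # malformed record
        if read_id in contaminant_ids:
            continue  # matched a known contaminant or abundant sequence
        if is_low_complexity(seq):
            continue  # polyA/polyC-like read
        if mean_phred(qual) < min_mean_quality:
            continue  # low-quality read
        kept.append((read_id, seq, qual))
    return kept


# Toy input: only r1 should survive every screen.
example_reads = [
    ("r1", "ACGTACGTAGCTAGCTAGGA", "I" * 20),  # passes
    ("r2", "A" * 20, "I" * 20),                # polyA
    ("r3", "ACGTTGCAACGTTGCAACGT", "#" * 20),  # low quality
    ("r4", "GATTACAGATTACAGATTAC", "I" * 20),  # flagged as contaminant
]
surviving = screen_unmapped_reads(example_reads, contaminant_ids={"r4"})
print([read_id for read_id, _, _ in surviving])
```

In a real analysis the reads that survive such a screen would then be pooled and passed to a de novo assembler, as the legend describes; that step is outside the scope of this sketch.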

Kazemian M, Ren M, Lin JX, Liao W, Spolski R, Leonard WJ. (2015) Comprehensive assembly of novel transcripts from unmapped human RNA-Seq data and their association with cancer. Mol Syst Biol 11(8):826.
