Owing largely to advances in next-generation sequencing (NGS), the volume of NGS data is growing rapidly. Among the many NGS applications, one of the most commonly used, ‘RNA sequencing (RNA-seq)’, is rapidly replacing microarray-based techniques in laboratories around the world. As more of these techniques are standardized, allowing technicians to perform experiments with minimal hands-on time and reduced experimental and operator-dependent biases, the bottleneck has become clearly visible: data analysis. Further complicating the matter, increasing evidence suggests that most of the genome is transcribed into RNA, yet the majority of these RNAs are not translated into proteins. RNAs that are not translated into proteins are called ‘noncoding RNAs (ncRNAs)’. Although some time has passed since the discovery of ncRNAs, their annotations remain poor, which makes the analysis of RNA-seq data challenging.
Here, researchers from Goethe University examine the current limitations of RNA-seq analysis using case studies focused on detecting novel transcripts and examining their characteristics. They then validate the presence of novel transcripts using biological experiments, showing that novel transcripts can be accurately identified when a series of filters is applied. The authors conclude that novel transcripts identified from RNA-seq data must be examined carefully before proceeding to biological experiments.
Analysis of novel transcripts. (A) Numbers of transcripts listed in the Ensembl 77 annotation, identified (‘annotated’), novel isoforms and intergenic transcripts. (B) Numbers of transcripts identified above each threshold for FPKM values. (C) Percent occupancy of repetitive elements relative to the length of each transcript. The box-and-whisker plot shows the distribution for each transcript category under the specified threshold for FPKM values. (D) Percent distributions of seven major classes of repetitive elements: ‘Simple_repeat’, including microsatellites; ‘LINE’, long interspersed nuclear elements; ‘Low_complexity’, low complexity repeats; ‘DNA’, DNA repeat elements; ‘Satellite’, satellite repeats; ‘LTR’, long terminal repeat elements, including retroposons; and ‘SINE’, short interspersed nuclear elements, including Alu elements.
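The two quantities in panels (B) and (C) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the half-open transcript-relative coordinates, the strict "greater than" cut-off and the example values are all assumptions made for illustration only.

```python
from typing import Dict, Iterable, List, Tuple

# Assumption: repeats are given as half-open [start, end) intervals
# in transcript-relative coordinates.
Interval = Tuple[int, int]


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals so no base is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def repeat_occupancy_percent(transcript_length: int,
                             repeats: Iterable[Interval]) -> float:
    """Percent of a transcript's length covered by repetitive elements."""
    if transcript_length <= 0:
        raise ValueError("transcript_length must be positive")
    covered = 0
    for start, end in merge_intervals(repeats):
        # Clip each merged interval to the transcript boundaries.
        start, end = max(start, 0), min(end, transcript_length)
        if end > start:
            covered += end - start
    return 100.0 * covered / transcript_length


def count_above_thresholds(fpkm_by_transcript: Dict[str, float],
                           thresholds: Iterable[float]) -> Dict[float, int]:
    """Count transcripts whose FPKM is strictly above each threshold.

    Assumption: "above" is read as a strict comparison.
    """
    values = list(fpkm_by_transcript.values())
    return {t: sum(1 for v in values if v > t) for t in thresholds}


if __name__ == "__main__":
    # Hypothetical transcript IDs and FPKM values.
    fpkm = {"t1": 0.5, "t2": 1.2, "t3": 12.0}
    print(count_above_thresholds(fpkm, [0.1, 1.0, 10.0]))
    print(repeat_occupancy_percent(1000, [(0, 100), (50, 200), (900, 1100)]))
```

Merging the intervals before summing prevents overlapping repeat annotations from inflating the occupancy. In practice the repeat coordinates would come from a repeat annotation track rather than hand-written tuples.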