Annotation of complex genomes has improved significantly, but many annotations remain far from complete. Several published transcript assembly programmes were tested on RNA-sequencing (RNA-seq) data to determine how effectively they identify novel genes and thereby improve the rice genome annotation. Each programme tested had at least one shortcoming: it did not identify all transcripts suggested by the RNA-seq data, was CPU intensive, or lacked documentation or software updates.
Flowchart of the processes for gene identification by Tiling Assembly.
To overcome these shortcomings, researchers at the University of Nevada, Las Vegas developed a heuristic ab initio transcript assembly algorithm, Tiling Assembly, which identifies genes from short-read and junction alignments. Tiling Assembly was compared with Cufflinks to evaluate its gene-finding capabilities. Additionally, a pipeline was developed to eliminate false-positive gene identifications caused by noise or repetitive regions in the genome. By combining Tiling Assembly and Cufflinks, 767 unannotated genes were identified in the rice genome, showing that the two programmes together are highly efficient for novel gene identification. The researchers also demonstrated that Tiling Assembly can accurately determine transcription start sites by comparing the Tiling Assembly genes with their corresponding full-length cDNA. They applied their pipeline to additional organisms and identified numerous unannotated genes, demonstrating that Tiling Assembly is an organism-independent tool for genome annotation.
Comparison of the potential novel genes identified by Tiling Assembly and Cufflinks with those published in the authors' previous study. (A) There were 92 genes identified in the previous study that were not classified as novel genes by Tiling Assembly or Cufflinks. While all of them were identified by Tiling Assembly or Cufflinks, slight changes in their length disqualified them from the category of potential novel genes. Of the 767 unannotated genes identified in this study, 306 were not identified in the previous study, demonstrating that the new pipeline is superior to the one previously reported. (B) While all of the Clustering Algorithm genes were also found by Tiling Assembly, slight changes in their identification disqualified five from the category of potential novel genes.
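The general idea of identifying genes from short-read and junction alignments can be illustrated with a minimal Python sketch. This is not the authors' Tiling Assembly implementation (which is written in Perl); the function names `merge_reads` and `group_tiles`, the `max_gap` parameter, and the coordinates are all hypothetical, and the heuristics of the published algorithm are not reproduced. The sketch assumes reads are given as 1-based inclusive intervals and junctions as (chromosome, last base of upstream exon, first base of downstream exon). It merges overlapping reads into contiguous "tiles" and then groups tiles connected by a junction into one gene candidate.

```python
from collections import defaultdict


def merge_reads(reads, max_gap=0):
    """Merge overlapping or adjacent read intervals into tiles.

    reads: iterable of (chrom, start, end), 1-based inclusive.
    max_gap: hypothetical tolerance; reads separated by at most this
             many uncovered bases are still merged into one tile.
    """
    tiles = []
    for chrom, start, end in sorted(reads):
        if tiles and tiles[-1][0] == chrom and start <= tiles[-1][2] + max_gap + 1:
            last = tiles[-1]
            tiles[-1] = (chrom, last[1], max(last[2], end))
        else:
            tiles.append((chrom, start, end))
    return tiles


def find_tile(tiles, chrom, pos):
    """Return the index of the tile covering pos, or None."""
    for i, (c, s, e) in enumerate(tiles):
        if c == chrom and s <= pos <= e:
            return i
    return None


def group_tiles(tiles, junctions):
    """Group tiles linked by splice junctions into gene candidates.

    junctions: iterable of (chrom, left, right), where left and right
               are the exonic bases flanking the spliced-out intron.
    """
    parent = list(range(len(tiles)))  # union-find over tile indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for chrom, left, right in junctions:
        a = find_tile(tiles, chrom, left)
        b = find_tile(tiles, chrom, right)
        if a is not None and b is not None:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for i, tile in enumerate(tiles):
        groups[find(i)].append(tile)
    return sorted(groups.values())


# Made-up example data, for illustration only.
reads = [("chr1", 100, 150), ("chr1", 140, 200),
         ("chr1", 500, 560), ("chr1", 900, 950)]
junctions = [("chr1", 200, 500)]

tiles = merge_reads(reads)
candidates = group_tiles(tiles, junctions)
print(candidates)
```

In this toy input the first two reads overlap and form one tile, the junction links that tile to the tile at 500-560, and the unlinked tile at 900-950 is reported as a separate single-exon candidate. A real pipeline would additionally need strand handling, coverage thresholds, and filtering of repetitive regions, none of which is attempted here.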
Availability – Project home page: http://shenlab.sols.unlv.edu/shenlab/
Operating systems: Platform independent
Programming language: Perl
Other requirements: MySQL or SQLite