Rice (Oryza sativa) is one of the world's most important crops. Its genome has been available for over 10 years and has undergone several rounds of annotation. To search for protein-level evidence, University of Liverpool researchers built a comprehensive search database combining transcripts from 29 public RNA sequencing datasets, officially predicted genes from Ensembl Plants, and common contaminants. They then re-analysed nine publicly accessible rice proteomics datasets. In total, they identified 420K peptide spectrum matches from 47K peptides and 8,187 protein groups. Of these peptides, 4,168 were initially classed as putative novel peptides (not matching official genes). After a strict filtration scheme to rule out other possible explanations, 1,584 high-confidence novel peptides remained. The novel peptides clustered into 692 genomic loci where the results suggest annotation improvements. Of the novel peptides, 80% had an ortholog match in the curated protein sequence set from at least one other plant species. Among the peptides clustering in intergenic regions (and thus potentially representing new genes), 101 loci were identified, of which 43 had a high-confidence hit for a protein domain. These results can be displayed as tracks on the Ensembl genome browser or other browsers supporting Track Hubs, to support re-annotation of the rice genome.
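The summary does not say how the novel peptides were grouped into loci. As a rough illustration only, the sketch below merges peptides mapped to the same chromosome into one locus when they lie within a fixed distance of each other. The `PeptideHit` type, the `cluster_into_loci` function, and the 1,000 bp `max_gap` threshold are all hypothetical and are not taken from the study.

```python
from collections import defaultdict
from typing import NamedTuple


class PeptideHit(NamedTuple):
    """A peptide mapped to genome coordinates (hypothetical record type)."""
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    sequence: str


def cluster_into_loci(hits, max_gap=1000):
    """Group hits on the same chromosome into loci.

    A hit joins the current locus if it starts no more than `max_gap`
    bases after the furthest end seen so far in that locus. The
    threshold is an assumption chosen for illustration.
    """
    by_chrom = defaultdict(list)
    for hit in hits:
        by_chrom[hit.chrom].append(hit)

    loci = []
    for chrom in sorted(by_chrom):
        current = []
        current_end = None
        for hit in sorted(by_chrom[chrom], key=lambda h: (h.start, h.end)):
            # Start a new locus when the gap to the current one is too large.
            if current and hit.start - current_end > max_gap:
                loci.append(current)
                current = []
                current_end = None
            current.append(hit)
            current_end = hit.end if current_end is None else max(current_end, hit.end)
        if current:
            loci.append(current)
    return loci


example_hits = [
    PeptideHit("chr1", 100, 130, "ACDEFGHIKL"),
    PeptideHit("chr1", 500, 530, "MNPQRSTVWY"),
    PeptideHit("chr1", 5000, 5030, "LLKKDDEEGG"),
    PeptideHit("chr2", 100, 130, "GGSSTTPPAA"),
]
example_loci = cluster_into_loci(example_hits)
```

With these made-up coordinates, the first two `chr1` peptides fall into one locus, while the distant `chr1` peptide and the `chr2` peptide each form their own. A real pipeline would probably also take strand and overlapping gene models into account.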
Identification of novel peptides in annotations from other plants
The heatmap shows a hierarchical analysis of the final novel peptides mapped against the proteins encoded by the 44 plant genomes from Ensembl (red = positive match, white = no match from BLASTp, allowing no gaps and at most one mismatch). The novel peptides are divided into two groups, intergenic (upper panel) and intragenic (lower panel), and are ranked by peptide length for the hierarchical analysis.
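The matching criterion in the caption (no gaps, at most one mismatch) can be illustrated with a plain sliding-window comparison. This is a simplified stand-in for the BLASTp search described in the caption, not a reimplementation of it. The function names and the toy sequences are hypothetical.

```python
def matches_ungapped(peptide, protein, max_mismatches=1):
    """Return True if `peptide` aligns somewhere in `protein` with no gaps
    and at most `max_mismatches` substitutions."""
    n = len(peptide)
    if n == 0:
        return False
    for i in range(len(protein) - n + 1):
        mismatches = 0
        for a, b in zip(peptide, protein[i:i + n]):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break
        else:
            # Inner loop finished without exceeding the mismatch limit.
            return True
    return False


def presence_matrix(peptides, proteomes):
    """Build a peptide-by-species table of booleans (True = match).

    `proteomes` maps a species name to a list of protein sequences.
    Peptides are ordered by length, loosely mirroring the ranking
    mentioned in the caption.
    """
    ordered = sorted(peptides, key=len)
    return {
        pep: {
            species: any(matches_ungapped(pep, prot) for prot in proteins)
            for species, proteins in proteomes.items()
        }
        for pep in ordered
    }


# Toy data, invented purely for illustration.
toy_peptides = ["ACDEFGHIK", "LMNPQR"]
toy_proteomes = {
    "species_a": ["MMACDEFGHIKMM"],            # exact match to first peptide
    "species_b": ["MMACDEYGHIKMM", "LMNAAR"],  # one mismatch; two mismatches
    "species_c": ["ACDEGHIK"],                 # would need a gap, so no match
}
toy_matrix = presence_matrix(toy_peptides, toy_proteomes)
```

Each `True` in `toy_matrix` corresponds to a red cell in a heatmap of this kind and each `False` to a white cell. At the scale of 44 proteomes, a dedicated search tool would be far more practical than this quadratic scan.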