from Science Network WA by Teresa Belcher
SOFTWARE developed by a WA PhD student is quickly and accurately locating genes and correcting gene sequences in disease-causing fungi.
Alison Testa from Curtin University’s Centre for Crop and Disease Management (CCDM) created CodingQuarry, gene-prediction software that makes finding fungal genes much quicker and more reliable.
Ms Testa says the CCDM is interested in finding important genes in fungi that allow fungal pathogens to infect their hosts, in this case crop plants.
“Ultimately this will lead to reducing the economic impact of major crop diseases for the Australian agricultural industry and is also broadly relevant across other areas of fungal genomics research, not just in crop protection,” she says.
“A lot of the work we do involves sequencing fungal genomes in order to find the location of genes important to pathogenicity.
“It’s really important that we get that step right otherwise we can waste time in the lab or come to incorrect conclusions in our downstream work.”
Ms Testa says they wanted to reliably incorporate existing large experimental datasets on gene expression (RNA-seq) into their gene predictions and build a program specifically designed for fungi, which have in the past been difficult to work with.
Net Blotch considered in software tests
CodingQuarry uses a novel method to do this by combining two techniques: hidden-Markov-model prediction and alignment of RNA-seq transcriptome sequences.
CodingQuarry flow diagram. Examples are shown of correct annotations of coding sequences (A) and a typical CodingQuarry input: assembled transcripts aligned to the genome (B). The stages used within CodingQuarry to predict coding sequences are shown (C-G). First, coding sequences are predicted from transcript sequences (introns are removed) using a generalised hidden Markov model (GHMM) (C). Possible prediction errors after this step are coloured red, and notes show how these are identified (D). These error-prone predicted genes are discarded (E), and regions are selected for prediction from genome sequence (F). The resulting prediction is output by CodingQuarry (G), which merges the retained predictions from transcript sequences (E) with the predictions from selected areas of the genome sequence (F). Sections of the example genome sequence and annotations have been labelled i-x in each part of the diagram (A-G), and marked with vertical dotted lines. These sections are labelled to facilitate in-text references to the diagram in the Implementation section of this manuscript. Labels i-x correspond to the same genome sections through A-G.
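The discard-and-merge stages described in the caption (E to G) can be illustrated with a short Python sketch. This is not CodingQuarry's code: the `Prediction` class, the `flagged` field and both function names are hypothetical, the GHMM prediction itself is not implemented, and error flags are simply supplied as input. It only shows the idea of keeping trusted transcript-based predictions and filling the remaining genome regions with genome-based ones.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Prediction:
    """Hypothetical record for one predicted coding sequence."""
    start: int             # inclusive genome coordinate
    end: int               # exclusive genome coordinate
    source: str            # "transcript" or "genome"
    flagged: bool = False  # True if marked as a possible prediction error


def uncovered_regions(genome_length: int,
                      kept: List[Prediction]) -> List[Tuple[int, int]]:
    """Return genome intervals not covered by any retained prediction."""
    regions = []
    cursor = 0
    for p in sorted(kept, key=lambda p: p.start):
        if p.start > cursor:
            regions.append((cursor, p.start))
        cursor = max(cursor, p.end)
    if cursor < genome_length:
        regions.append((cursor, genome_length))
    return regions


def merge_predictions(genome_length: int,
                      transcript_preds: List[Prediction],
                      genome_preds: List[Prediction]) -> List[Prediction]:
    """Discard flagged transcript predictions, then add genome-based
    predictions that fall entirely inside the uncovered regions."""
    kept = [p for p in transcript_preds if not p.flagged]
    gaps = uncovered_regions(genome_length, kept)
    filled = [
        g for g in genome_preds
        if any(s <= g.start and g.end <= e for s, e in gaps)
    ]
    return sorted(kept + filled, key=lambda p: p.start)


# Toy example: one trusted and one flagged transcript-based prediction.
transcript_preds = [
    Prediction(100, 200, "transcript"),
    Prediction(300, 400, "transcript", flagged=True),
]
genome_preds = [
    Prediction(150, 180, "genome"),  # overlaps a trusted prediction
    Prediction(310, 390, "genome"),  # replaces the discarded prediction
    Prediction(600, 700, "genome"),  # region with no transcript evidence
]
merged = merge_predictions(1000, transcript_preds, genome_preds)
```

In this toy run the flagged transcript-based prediction is dropped, and the genome-based prediction covering the same region takes its place alongside the one from the region without transcript evidence.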
One disease the team have been working with is Pyrenophora Net Blotch, an important barley disease caused by the pathogen Pyrenophora teres f. teres.
“It has around 13,000 genes so you can get a feeling for why it’s important to have software to automate this normally time-consuming process,” Ms Testa says.
“Using RNA-seq data with CodingQuarry, we have located one thousand new genes and made corrections to a few thousand of the known genes in Pyrenophora Net Blotch.
“We can now pass on that information with a lot more confidence to experimentalists at the CCDM and find out how this and other pathogens are causing disease at a molecular level,” she says.
Ms Testa says that after benchmarking on the model genomes of baker’s yeast and fission yeast, the software was able to predict gene locations with more than 90 per cent accuracy.
“That’s five per cent better than the competing software,” she says.
“In terms of time saved, if you were going to manually correct genes based on RNA-seq, it takes months and months and is very labour-intensive, whereas using CodingQuarry, the same outcome can be achieved in about 10 minutes.”
Source – Science Network WA
Availability – CodingQuarry is freely available (https://sourceforge.net/projects/codingquarry/), and suitable for incorporation into genome annotation pipelines.