Operon prediction in prokaryotes is critical not only for understanding the regulation of endogenous gene expression, but also for exogenous targeting of genes using newly developed tools such as CRISPR-based gene modulation. A number of methods have used transcriptomics data to predict operons, based on the premise that contiguous genes in an operon will be expressed at similar levels. While these methods have shown promising results, most do not address the uncertainty caused by technical variability between experiments, which is especially relevant when little data is available. In addition, many existing methods do not let users set how stringently a pair of adjacent genes is evaluated for operon membership.
Researchers at Sandia National Laboratories have developed OperonSEQer, a set of machine learning algorithms that use the statistic and p-value from a non-parametric analysis of variance test (Kruskal-Wallis) to determine the likelihood that two adjacent genes are expressed from the same RNA molecule. The researchers implement a voting system that lets users choose the stringency of operon calls, depending on whether the priority is high recall or high specificity. In addition, they provide the code so that users can retrain the algorithms and re-establish hyperparameters on any data they choose, allowing the method to be expanded as additional data is generated. By comparing their predictions to publicly available long-read sequencing data, the researchers show that their approach detects operon pairs that are missed by current methods. OperonSEQer therefore improves on existing methods in terms of accuracy, flexibility, and adaptability.
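The voting idea described above can be illustrated with a minimal sketch. It assumes each trained model returns a yes/no call for a gene pair and that the user sets a minimum number of agreeing models; the function name `call_operon_pair`, the `min_votes` parameter, and the model labels are hypothetical and are not taken from the OperonSEQer code.

```python
from typing import Mapping


def call_operon_pair(votes: Mapping[str, bool], min_votes: int) -> bool:
    """Call a gene pair an operon pair if at least `min_votes` models agree.

    `votes` maps a (hypothetical) model label to that model's yes/no call.
    A low threshold favors recall; a high threshold favors specificity.
    """
    if not 1 <= min_votes <= len(votes):
        raise ValueError("min_votes must be between 1 and the number of models")
    # True counts as 1 and False as 0, so the sum is the number of yes votes.
    return sum(votes.values()) >= min_votes


# Illustrative calls from six placeholder models for one gene pair.
example_votes = {
    "model_1": True,
    "model_2": True,
    "model_3": False,
    "model_4": True,
    "model_5": False,
    "model_6": False,
}

lenient_call = call_operon_pair(example_votes, min_votes=2)  # recall-oriented
strict_call = call_operon_pair(example_votes, min_votes=5)   # specificity-oriented
```

With the same three yes votes, the lenient threshold accepts the pair and the strict threshold rejects it, which is the trade-off the stringency setting is meant to expose.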
Schematic of the method for determining similarity of RNA-seq signal between two adjacent genes
(A) Identification of an operon pair requires at least one of the two genes to be detectably expressed, and significant signal in the intergenic space. Idealized data are shown on the left, hypothetical real-world data in the middle, and actual data on the right. (B) Use of the Kruskal-Wallis statistic and p-value for pairwise comparisons of gene A, gene B, and the intergenic (I) region, as well as for the 3-way comparison. A50 and B50 denote the 50 bp windows within genes A and B that lie 50 bp away from the intergenic region. These windows were used for comparison, rather than the full gene bodies, to minimize incorporation of the technical variability seen across the gene body. These values, along with the intergenic distance, serve as features for training the operon prediction model.
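The feature construction in panel (B) can be sketched as follows. This is an illustrative reconstruction, not the OperonSEQer implementation: it assumes per-base coverage values for the A50, I, and B50 windows are already available as lists, the feature names are hypothetical, and the Kruskal-Wallis test is written out in plain Python (with tie correction and a chi-square approximation for the p-value) so the example has no dependencies.

```python
import math
from typing import Dict, Sequence


def kruskal_wallis(*groups: Sequence[float]) -> tuple:
    """Return (H, p) for 2 or 3 groups using the chi-square approximation."""
    k = len(groups)
    if k not in (2, 3) or any(len(g) == 0 for g in groups):
        raise ValueError("expected 2 or 3 non-empty groups")
    pooled = sorted((v, gi) for gi, g in enumerate(groups) for v in g)
    n = len(pooled)
    rank_sums = [0.0] * k
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j for tied values
        for _, gi in pooled[i:j]:
            rank_sums[gi] += avg_rank
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    correction = 1.0 - tie_term / (n ** 3 - n)
    if correction == 0.0:  # every value identical: no evidence of a difference
        return 0.0, 1.0
    h = 12.0 / (n * (n + 1)) * sum(
        rs ** 2 / len(g) for rs, g in zip(rank_sums, groups)
    ) - 3.0 * (n + 1)
    h = max(h / correction, 0.0)
    # Chi-square survival function in closed form for df = 1 and df = 2.
    p = math.erfc(math.sqrt(h / 2.0)) if k == 2 else math.exp(-h / 2.0)
    return h, p


def pair_features(a50, intergenic, b50, intergenic_distance) -> Dict[str, float]:
    """Build the statistic/p-value features for one adjacent gene pair."""
    comparisons = {
        "A_vs_I": (a50, intergenic),
        "I_vs_B": (intergenic, b50),
        "A_vs_B": (a50, b50),
        "A_vs_I_vs_B": (a50, intergenic, b50),
    }
    features = {"intergenic_distance": float(intergenic_distance)}
    for name, groups in comparisons.items():
        h, p = kruskal_wallis(*groups)
        features[name + "_H"] = h
        features[name + "_p"] = p
    return features


# Made-up coverage values: A50 and I look alike, B50 is clearly higher.
example = pair_features([10, 12, 11, 13], [9, 11, 12, 10], [30, 28, 31, 29], 85)
```

In this toy input the A-versus-B comparison yields a smaller p-value than the A-versus-I comparison, reflecting that B50 coverage differs from the other two windows. In practice a library routine such as `scipy.stats.kruskal` would likely replace the hand-written test.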