The study focuses on ambiguous genes, i.e. genes whose expression is difficult to estimate from data produced by next-generation sequencing technologies. Researchers at the Poznan University of Life Sciences concentrated on RNA sequencing (RNA-Seq) experiments performed on the Illumina platform. Identifying such genes and understanding what makes them difficult is crucial, because they may be involved in some diseases. Misleading expression estimates could contribute to a misunderstanding of the cause of certain diseases, which in turn could lead to inappropriate treatment. The researchers hypothesized that ambiguous genes are difficult to map because of their complex structure. They therefore ran the RNA-Seq analysis with several different mappers (aligners) and looked for genes whose expression measurements differed between them. Such genes were identified using a generalized linear model with two factors: the mapper and the groups introduced by the experiment. A large proportion of the ambiguous genes are pseudogenes. The high sequence similarity of pseudogenes to functional genes may indicate problems in alignment procedures. In addition, a predictive analysis assessed how well difficult genes perform in classification: the effectiveness of classifying samples into specific groups was compared using the expression of difficult and non-difficult genes as covariates. In almost all cases considered, ambiguous genes had less predictive power.
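The two-factor model mentioned above can be written down schematically. The summary does not give the link function, the error distribution, or whether an interaction term was included, so the form below is only a sketch under assumptions: a log link for read counts, additive effects, and sum-to-zero constraints on the mapper effects. Here $y_{gsm}$ is the count for gene $g$ in sample $s$ as reported by mapper $m$, $k(s)$ is the experimental group of sample $s$, and all symbols are introduced for illustration only.

```latex
\log \mathbb{E}\left[ y_{gsm} \right]
  = \mu_g + \alpha_{g,m} + \gamma_{g,k(s)},
\qquad
H_0:\ \alpha_{g,1} = \alpha_{g,2} = \dots = \alpha_{g,M} = 0
```

Under this reading, a gene would be a candidate ambiguous gene when the mapper effects $\alpha_{g,m}$ differ significantly from zero, that is, when the choice of aligner changes the estimated expression even after accounting for the experimental group.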
Percentage of misclassified samples for each dataset
For each dataset, and for a number of predictors equal to 1/3, 1/2 and 2/3 of the number of samples, violin plots show 10 simulations of the joint “ensemble” classifier. The base classifiers combined in the ensemble were a support vector machine, a random forest, neural networks and rpart. Color indicates the difficulty case: green is the “no” case, using differentially expressed genes (DEG) that are not difficult; red is the “yes” case, using DEG that are difficult.
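The quantity plotted in the figure, the percentage of misclassified samples from a joint classifier, can be illustrated with a small sketch. The summary does not say how the base classifiers were combined, so majority voting is an assumption here, as is rounding the predictor counts down. The function names and the toy labels are hypothetical, and the per-classifier predictions are hard-coded stand-ins for real model output.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine label lists (one per base classifier) into one label per sample.

    Assumption: the ensemble uses a simple majority vote. Ties go to the
    label that was seen first among the tied ones.
    """
    n_samples = len(predictions[0])
    combined = []
    for i in range(n_samples):
        votes = Counter(p[i] for p in predictions)
        combined.append(votes.most_common(1)[0][0])
    return combined


def misclassification_percent(true_labels, predicted_labels):
    """Percentage of samples whose predicted label differs from the truth."""
    errors = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * errors / len(true_labels)


def predictor_counts(n_samples):
    """Predictor counts at 1/3, 1/2 and 2/3 of the sample size (rounded down)."""
    return [n_samples // 3, n_samples // 2, (2 * n_samples) // 3]


# Toy example: six samples, four base classifiers (made-up predictions).
true_labels = ["A", "A", "B", "B", "A", "B"]
svm_pred = ["A", "A", "B", "A", "A", "B"]
forest_pred = ["A", "B", "B", "A", "A", "B"]
nnet_pred = ["A", "A", "B", "A", "B", "B"]
tree_pred = ["A", "A", "A", "B", "A", "B"]

ensemble_pred = majority_vote([svm_pred, forest_pred, nnet_pred, tree_pred])
error_pct = misclassification_percent(true_labels, ensemble_pred)
print(ensemble_pred, round(error_pct, 2))
```

Repeating such a computation over several simulations, once with difficult genes and once with non-difficult genes as predictors, would give the two distributions that each pair of violins compares.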