Sequencing costs currently prohibit the application of single-cell mRNA-seq to many biological and clinical analyses. Targeted single-cell mRNA-sequencing reduces sequencing costs by profiling reduced gene sets that capture biological information with a minimal number of genes. Caltech researchers have developed an active learning method that identifies minimal but highly informative gene sets that enable the identification of cell types, physiological states and genetic perturbations in single-cell data using a small number of genes. The active feature selection procedure generates minimal gene sets from single-cell data by employing an active support vector machine (ActiveSVM) classifier. The researchers demonstrate that ActiveSVM feature selection identifies gene sets that enable ~90% cell-type classification accuracy across, for example, cell atlas and disease-characterization datasets. The discovery of small but highly informative gene sets should enable reductions in the number of measurements necessary for application of single-cell mRNA-seq to clinical tests, therapeutic discovery and genetic screens.
Description of ActiveSVM feature selection
At the nth step, an n-dimensional (n-D) SVM is trained using only the n genes already selected, and a set number of the cells it misclassifies are chosen; this is the cell selection step. In the gene selection step, these least classifiable cells are taken as the training set. On this training set, N – n SVMs are trained, each of dimension n + 1, where N is the total number of candidate genes: n dimensions are the genes already selected and the last dimension is one of the N – n previously unselected genes. This yields N – n weight vectors, one for each unselected gene, together with N – n margin rotation angles θ, each measured between one of the new weight vectors and the original weight vector w of the n-D SVM. The gene with the maximum margin rotation is selected for the next round.
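The gene selection step can be illustrated with a minimal sketch. It assumes the (n + 1)-D weight vectors have already been fitted by some SVM trainer, and it assumes the n-D weight vector w is compared with each new vector by padding w with a zero in the added dimension; the text above does not specify how the two dimensionalities are reconciled. The function names, gene labels and numbers are hypothetical and serve only as illustration.

```python
import math


def rotation_angle(w_old, w_new):
    """Angle in radians between the n-D weights and an (n+1)-D weight vector.

    Assumption: w_old is zero-padded in the new gene's dimension so that the
    two vectors can be compared in the same space.
    """
    if len(w_new) != len(w_old) + 1:
        raise ValueError("w_new must have exactly one more dimension than w_old")
    padded = list(w_old) + [0.0]
    dot = sum(a * b for a, b in zip(padded, w_new))
    norm_old = math.sqrt(sum(a * a for a in padded))
    norm_new = math.sqrt(sum(b * b for b in w_new))
    if norm_old == 0.0 or norm_new == 0.0:
        # Convention chosen for this sketch: the angle is undefined for a
        # zero vector, so report no rotation.
        return 0.0
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_theta = max(-1.0, min(1.0, dot / (norm_old * norm_new)))
    return math.acos(cos_theta)


def select_next_gene(w_old, candidate_weights):
    """Return the candidate gene with the largest rotation, plus all angles.

    candidate_weights maps each unselected gene to the weight vector of the
    (n+1)-D SVM trained with that gene as the extra dimension.
    """
    angles = {gene: rotation_angle(w_old, w) for gene, w in candidate_weights.items()}
    best = max(angles, key=angles.get)
    return best, angles


# Hypothetical example: two genes already selected (n = 2), three candidates.
w = [1.0, 0.0]
candidates = {
    "gene_a": [1.0, 0.0, 0.1],  # new gene carries little weight
    "gene_b": [1.0, 0.0, 1.0],  # new gene carries as much weight as the old ones
    "gene_c": [0.9, 0.1, 0.0],  # new gene carries no weight
}
best, angles = select_next_gene(w, candidates)
print(best, {gene: round(theta, 3) for gene, theta in angles.items()})
```

In this toy case the candidate whose added dimension receives the largest relative weight rotates the margin the most and is the one selected. A full implementation would also need the cell selection step and an SVM trainer, both of which are omitted here.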