Individual sample heterogeneity is one of the biggest obstacles to biomarker identification for complex diseases such as cancers. Current statistical models for identifying differentially expressed genes between disease and control groups often overlook the substantial heterogeneity among human samples. Meanwhile, traditional nonparametric tests, although distribution-free and robust to heterogeneity, discard detailed information in the data and sacrifice statistical power.
Here, researchers from the University of Southern California propose an empirical likelihood ratio test with a mean-variance relationship constraint (ELTSeq) for the differential expression analysis of RNA sequencing (RNA-seq) data. As a distribution-free nonparametric model, ELTSeq handles individual heterogeneity by estimating an empirical probability for each observation without making any assumption about the read-count distribution. It also incorporates a constraint for read-count overdispersion, which is widely observed in RNA-seq data. When handling heterogeneous groups, ELTSeq demonstrates a significant improvement over existing methods such as edgeR, DESeq, t-tests, Wilcoxon tests and the classic empirical likelihood ratio test. It will significantly advance transcriptomic studies of cancers and other complex diseases.
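To make the idea of "an empirical probability for each observation" concrete, here is a minimal Python sketch of the classic one-sample empirical likelihood ratio test for a mean. It is not the ELTSeq implementation: it omits the two-group comparison and the mean-variance constraint, and the function name `el_ratio_test_mean` and the toy read counts are assumptions made purely for illustration. Each observation receives a weight p_i = 1 / (n (1 + λ (x_i − μ0))), where λ is chosen so that the weighted mean equals the hypothesized mean μ0.

```python
import math

def el_ratio_test_mean(x, mu0, tol=1e-12, max_iter=200):
    """Classic empirical likelihood ratio test of H0: mean == mu0.

    Illustrative sketch only (hypothetical helper, not ELTSeq itself).
    Returns (statistic, p_value, weights).
    """
    z = [xi - mu0 for xi in x]
    zmin, zmax = min(z), max(z)
    if not (zmin < 0 < zmax):
        raise ValueError("mu0 must lie strictly inside the range of the data")

    # All weights are positive only if 1 + lam * z_i > 0 for every i,
    # which restricts lam to the open interval (-1/zmax, -1/zmin).
    lo, hi = -1.0 / zmax, -1.0 / zmin
    eps = 1e-10 * (hi - lo)
    lo, hi = lo + eps, hi - eps

    def g(lam):
        # Estimating equation for lam; strictly decreasing on (lo, hi).
        return sum(zi / (1.0 + lam * zi) for zi in z)

    # Bisection: g > 0 means the root lies to the right of mid.
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    lam = 0.5 * (lo + hi)

    n = len(x)
    weights = [1.0 / (n * (1.0 + lam * zi)) for zi in z]
    # -2 log(likelihood ratio); clipped at 0 to absorb rounding error.
    stat = max(2.0 * sum(math.log1p(lam * zi) for zi in z), 0.0)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value, weights

# Toy read counts for one gene (made-up numbers, for illustration only).
counts = [12, 15, 9, 30, 14, 11, 48, 13]
stat, p_value, weights = el_ratio_test_mean(counts, mu0=10.0)
print(stat, p_value)
```

The weights show how the approach accommodates heterogeneous samples without a parametric read-count model: observations far from the hypothesized mean are down- or up-weighted individually rather than being forced to fit a fixed distribution. The p-value relies on the standard asymptotic chi-square calibration, which can be rough for very small samples.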
Performance of different methods for tumor classification
(A) Silhouette plots based on the results of K-means clustering (K = 8 for the eight tumor types). The union of the top 20 differentially expressed (DE) genes across all 28 pairwise comparisons was used as features to cluster the samples in the testing set. (B) Scatter plots of the eight types of tumor samples in the testing set, using the first two principal components calculated from the union of the top 20 DE genes. (C) Box plot of the pairwise classification accuracy of K-means (K = 2) clustering using the top 20 DE genes identified by each method.
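The feature set described in the caption (the union of the top 20 DE genes over every pair of tumor types) could be assembled roughly as follows. This is a sketch under stated assumptions: `union_of_top_genes` and the `rank_genes` callback are hypothetical names, and `rank_genes(a, b)` stands in for whichever DE method is being evaluated, returning gene identifiers ordered from most to least significant for that pair.

```python
from itertools import combinations

def union_of_top_genes(tumor_types, rank_genes, top_n=20):
    """Collect the union of the top_n DE genes over all pairwise comparisons.

    rank_genes(a, b) is a hypothetical callback that returns gene IDs
    ranked by DE significance for tumor types a and b.
    Returns (sorted feature list, number of pairwise comparisons).
    """
    pairs = list(combinations(tumor_types, 2))
    features = set()
    for a, b in pairs:
        features.update(rank_genes(a, b)[:top_n])
    return sorted(features), len(pairs)
```

With eight tumor types there are C(8, 2) = 28 pairs, matching the 28 pairwise comparisons in the caption. The resulting feature list contains at most 28 × 20 genes, and fewer whenever the same gene ranks highly in more than one comparison.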
Availability – ELTSeq is available at http://www-rcf.usc.edu/~liangche/software.html.