clusterExperiment and RSEC – A Bioconductor package and framework for clustering of single-cell and other large gene expression datasets

Clustering of genes and/or samples is a common task in gene expression analysis. The goals of clustering vary, but an important scenario is finding biologically meaningful subtypes within the samples. This application is particularly appropriate when there are large numbers of samples, as in many human disease studies. With the increasing popularity of single-cell transcriptome sequencing (RNA-Seq), many more controlled experiments on model organisms are similarly creating large gene expression datasets with the goal of detecting previously unknown heterogeneity within cells. When searching for novel subtypes, it is common to run many clustering algorithms and to rely on subsampling and ensemble methods to improve robustness.

Researchers from Weill Cornell Medicine introduce a Bioconductor R package, clusterExperiment, that implements a general and flexible strategy they call Resampling-based Sequential Ensemble Clustering (RSEC). RSEC lets the user easily create multiple, competing clusterings of the data based on different techniques and associated tuning parameters, including easy integration of resampling and sequential clustering, and then provides methods for consolidating the multiple clusterings into a final consensus clustering. The package is modular and allows the user to apply the individual components of the RSEC procedure separately, i.e., apply multiple clustering algorithms, create a consensus clustering or choose tuning parameters, and merge clusters. Additionally, clusterExperiment provides a variety of visualization tools for the clustering process, as well as methods for identifying possible cluster signatures or biomarkers.
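To make the resampling idea concrete, here is a minimal Python sketch. It is not the package's implementation (clusterExperiment is written in R), and the function names `gap_clusters` and `subsample_coclustering`, the toy 1-D clustering rule, and the default parameter values are all assumptions chosen for illustration. The sketch clusters repeated random subsamples and records, for each pair of samples, how often the pair landed in the same cluster out of the times both were drawn.

```python
import random
from itertools import combinations


def gap_clusters(values, k):
    """Toy 1-D clustering (illustrative only): sort the values and
    start a new cluster at each of the k-1 largest gaps."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    # Position j marks the gap between sorted items j-1 and j.
    largest = sorted(
        range(1, len(order)),
        key=lambda j: values[order[j]] - values[order[j - 1]],
        reverse=True,
    )[: k - 1]
    cuts = set(largest)
    labels = [0] * len(values)
    current = 0
    for pos, idx in enumerate(order):
        if pos in cuts:
            current += 1
        labels[idx] = current
    return labels


def subsample_coclustering(values, k, n_resamples=50, fraction=0.7, seed=0):
    """Proportion of subsamples in which each pair was clustered together,
    out of the subsamples that contained both members of the pair."""
    rng = random.Random(seed)
    n = len(values)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    size = min(n, max(k, int(fraction * n)))
    for _ in range(n_resamples):
        subset = rng.sample(range(n), size)
        labels = gap_clusters([values[i] for i in subset], k)
        for (a, la), (b, lb) in combinations(zip(subset, labels), 2):
            sampled[a][b] += 1
            sampled[b][a] += 1
            if la == lb:
                together[a][b] += 1
                together[b][a] += 1
    prop = [[float("nan")] * n for _ in range(n)]
    for i in range(n):
        prop[i][i] = 1.0  # a sample always co-clusters with itself
        for j in range(n):
            if i != j and sampled[i][j]:
                prop[i][j] = together[i][j] / sampled[i][j]
    return prop
```

Pairs that cluster together in nearly every subsample are stable; pairs whose proportion sits near the middle are the ones an ensemble approach treats with caution. Any real clustering routine could stand in for the toy `gap_clusters` here.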

Main steps of RSEC workflow


(a) shows a diagram of the steps of the workflow, while (b)-(d) demonstrate these steps on the olfactory epithelium dataset. (b) The clusterMany step produces many clusterings from the different combinations of algorithms and tuning parameters. These clusterings are displayed using the plotClusters function. Each column of the plot corresponds to a sample and each row to a clustering from the clusterMany step. The samples in each row are color-coded by their cluster assignment in that clustering; samples that are not assigned to a cluster are left white. Colors are matched across clusterings (rows) so that clusters containing similar samples receive similar colors. The consensus clustering obtained from the makeConsensus step is shown below the individual clusterings. (c) The makeConsensus step finds a consensus clustering across the clusterMany clusterings based on the co-occurrence of samples in these clusterings. The heatmap of the matrix of co-occurrence proportions is plotted using the plotCoClustering function. The resulting cluster assignments from makeConsensus are color-coded above the matrix, as are the assignments from the next step, mergeClusters. (d) The makeDendrogram step creates a hierarchy of the consensus clusters, and similar clusters in sister nodes are then merged with mergeClusters. The hierarchy of clusters from makeDendrogram is plotted here with the plotDendrogram function, with merged nodes indicated by dashed lines. The makeConsensus clusters and the resulting mergeClusters clusters are shown as color-coded blocks below the dendrogram, sized according to the number of samples in each cluster.
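The co-occurrence idea behind panel (c) can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm used by makeConsensus: the function names `cooccurrence` and `consensus`, the use of -1 to mark unassigned samples, the choice to count only clusterings in which both samples are assigned, and the grouping by connected components above a threshold are all assumptions made for the example.

```python
def cooccurrence(clusterings):
    """Proportion of clusterings in which samples i and j share a cluster,
    counting only clusterings where both are assigned (label != -1)."""
    n = len(clusterings[0])
    prop = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            both = [(c[i], c[j]) for c in clusterings
                    if c[i] != -1 and c[j] != -1]
            if both:
                prop[i][j] = sum(a == b for a, b in both) / len(both)
    return prop


def consensus(prop, proportion=0.7, min_size=2):
    """Group samples linked by co-occurrence >= proportion (connected
    components); groups smaller than min_size stay unassigned (-1)."""
    n = len(prop)
    labels = [-1] * n
    seen = set()
    next_label = 0
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], []
        while stack:
            i = stack.pop()
            members.append(i)
            for j in range(n):
                if j not in seen and prop[i][j] >= proportion:
                    seen.add(j)
                    stack.append(j)
        if len(members) >= min_size:
            for i in members:
                labels[i] = next_label
            next_label += 1
    return labels


# Three hypothetical clusterings of six samples; -1 means "unassigned".
clusterings = [
    [0, 0, 0, 1, 1, -1],
    [0, 0, 1, 1, 1, -1],
    [1, 1, 1, 0, 0, 0],
]
result = consensus(cooccurrence(clusterings))
print(result)  # [0, 0, -1, 1, 1, 1]
```

In this toy input, sample 2 co-occurs with samples 0 and 1 in only two of three clusterings (below the 0.7 threshold), so it is left unassigned, which mirrors the white, unassigned samples described for panel (b).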

Availability – The R package clusterExperiment is publicly available through the Bioconductor Project, with a detailed manual (vignette) and well-documented help pages for each function. https://www.bioconductor.org/packages/release/bioc/vignettes/clusterExperiment/inst/doc/clusterExperimentTutorial.html

Risso D, Purvis L, Fletcher RB, Das D, Ngai J, Dudoit S, et al. (2018) clusterExperiment and RSEC: A Bioconductor package and framework for clustering of single-cell and other large gene expression datasets. PLoS Comput Biol 14(9): e1006378. [article]
