Single-cell RNA sequencing (scRNA-seq) allows researchers to collect large catalogues detailing the transcriptomes of individual cells. Unsupervised clustering is central to the analysis of these data because it is used to identify putative cell types, but it poses many challenges. Wellcome Sanger Institute researchers discuss why clustering is a difficult problem from a computational point of view and which aspects of the data make it so. They also consider the difficulties of interpreting and annotating the identified clusters biologically.
Clustering methods for scRNA-seq
Representation of different clustering approaches for single-cell RNA sequencing (scRNA-seq) using the Deng data set of early mouse embryo development. a | True clusters, as defined by the authors, are based on the developmental stage. b | k-means separates cells into k = 5 groups. Because k-means assumes equal-sized clusters, the larger group of blastocysts is split from the other cell groups before the 8-cell and 16-cell stages are separated from each other. c | Complete-linkage hierarchical clustering creates a hierarchy of cells that can be cut at different levels (the result for k = 5 is indicated by the coloured bars at the bottom). Cutting farther down the tree would reveal finer substructures within the clusters. d,e | Louvain community detection is applied to a shared-nearest-neighbour graph connecting the cells and finds tightly connected communities in the graph (number of nearest neighbours used to construct the graph is five for part d and ten for part e). Increasing the number of neighbours when constructing the cell–cell graph indirectly decreases the resolution of graph-based clustering. Each clustering algorithm was implemented in R (igraph for parts d and e) and applied to the first two principal components (PCs) of the data.
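As a rough illustration of the approaches in parts b and c, the following Python sketch runs k-means and complete-linkage hierarchical clustering on a handful of made-up two-dimensional points standing in for the first two PCs. It is not the R code used for the figure: the toy coordinates, the function names kmeans and complete_linkage, and the farthest-point seeding of k-means (chosen only to keep the run deterministic) are all assumptions made for this example. Louvain community detection on a shared-nearest-neighbour graph is not covered here.

```python
import math

# Made-up 2D coordinates standing in for the first two PCs (not the Deng data).
group_a = [(0.0, 0.0), (0.5, 0.2), (0.2, 0.6), (0.7, 0.7)]
group_b = [(4.0, 0.0), (4.4, 0.3), (3.8, 0.5), (4.2, 0.8)]
group_c = [(5.0, 9.0), (5.3, 9.4), (4.8, 9.6), (5.5, 9.1)]
points = group_a + group_b + group_c


def kmeans(points, k, n_iter=20):
    """Lloyd's algorithm; returns one cluster label per point."""
    # Assumption: deterministic farthest-point seeding instead of random starts.
    centres = [points[0]]
    while len(centres) < k:
        centres.append(max(points, key=lambda p: min(math.dist(p, c) for c in centres)))
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centre.
        labels = [min(range(k), key=lambda c: math.dist(p, centres[c])) for p in points]
        # Update step: move each centre to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centre
                centres[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def complete_linkage(points, k):
    """Agglomerative clustering with complete linkage, cut at k clusters."""
    clusters = [[i] for i in range(len(points))]  # start with one cluster per point

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))  # merge the closest pair (j > i)

    labels = [0] * len(points)
    for lab, members in enumerate(clusters):
        for idx in members:
            labels[idx] = lab
    return labels


km3 = kmeans(points, k=3)
hc3 = complete_linkage(points, k=3)
hc2 = complete_linkage(points, k=2)  # cutting higher up the tree

print("k-means, k = 3:          ", km3)
print("complete linkage, k = 3: ", hc3)
print("complete linkage, k = 2: ", hc2)
```

On these toy points, both methods recover the three groups at k = 3, and cutting the hierarchy at k = 2 merges the two nearby groups. This is the reverse of the finer substructure that appears when the tree is cut farther down.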