Assessing Dissimilarity Measures for Sample-Based Hierarchical Clustering of RNA Sequencing Data Using Plasmode Datasets

Sample- and gene-based hierarchical cluster analyses have been widely adopted as tools for exploring gene expression data in high-throughput experiments. Gene expression values (read counts) generated by RNA sequencing technology (RNA-seq) are discrete variables with special statistical properties, such as over-dispersion and right-skewness. Read counts are also subject to technology artifacts, such as differences in sequencing depth. This poses a challenge to finding distance measures suitable for hierarchical clustering. Normalization and transformation procedures have been proposed to favor the use of Euclidean and correlation-based distances, and novel model-based dissimilarities that account for RNA-seq data characteristics have also been proposed. The adequacy of dissimilarity measures has so far been assessed using parametric simulations or exemplar datasets, which may limit the scope of the conclusions.

Here, researchers at Michigan State University propose simulating realistic conditions by creating plasmode datasets (datasets built from real experimental data so that the underlying structure is known) to assess the adequacy of dissimilarity measures for sample-based hierarchical clustering of RNA-seq data.


Algorithm used to generate plasmodes from the Bottomly dataset.

Consistent results were obtained using plasmode datasets based on RNA-seq experiments conducted under widely different conditions. Dissimilarity measures based on Euclidean distance that relied only on data normalization or data standardization did not reliably represent the expected hierarchical structure. Conversely, using a Poisson-based dissimilarity, a rank-correlation-based dissimilarity, or an appropriate data transformation resulted in dendrograms that resemble the expected hierarchical structure. Plasmode datasets can be generated for a wide range of scenarios upon which dissimilarity measures can be evaluated for sample-based hierarchical clustering analysis. The researchers show different ways of generating such plasmodes and apply them to the problem of selecting a suitable dissimilarity measure.
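To make the contrast between distance choices concrete, here is a minimal Python sketch. It is not the authors' code and does not reproduce their results. It assumes a rank-correlation dissimilarity of the form 1 minus Spearman's rho, and it uses three invented toy samples to show a single effect: a difference in sequencing depth alone can dominate Euclidean distance on raw counts, while a rank-based dissimilarity is unaffected by it. All function and variable names are hypothetical.

```python
import math

def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_dissimilarity(x, y):
    """Assumed form: 1 - Spearman's rho (0 means identical rank order)."""
    return 1.0 - pearson(ranks(x), ranks(y))

def euclidean(x, y):
    """Plain Euclidean distance between two count vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Invented toy read counts for six transcripts (illustration only).
sample_a = [10, 200, 35, 0, 8, 1500]
sample_b = [3 * c for c in sample_a]   # same expression profile, 3x the depth
sample_c = [1500, 8, 0, 35, 200, 10]   # different profile, same depth as a

d_ab_rank = spearman_dissimilarity(sample_a, sample_b)
d_ac_rank = spearman_dissimilarity(sample_a, sample_c)
d_ab_euc = euclidean(sample_a, sample_b)
d_ac_euc = euclidean(sample_a, sample_c)

print("rank-based:", d_ab_rank, d_ac_rank)
print("Euclidean (raw counts):", d_ab_euc, d_ac_euc)
```

In this toy case the rank-based dissimilarity places sample_a closest to sample_b, which shares its expression profile, whereas Euclidean distance on raw counts places sample_a closer to sample_c. Either dissimilarity could then be assembled into a pairwise matrix and passed to a complete-linkage hierarchical clustering routine.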


Typical dendrograms obtained for plasmode datasets from Bottomly experimental data with two dissimilarity measures under three scenarios. Dendrograms obtained using complete linkage hierarchical clustering based on Poisson dissimilarity (poi) are presented in the left column (a, c and e), and dendrograms based on Euclidean distance calculated from raw normalized data (rnr) are presented in the right column (b, d and f). The rows correspond to three scenarios with different percentages of differentially expressed (DE) transcripts: 1) DE[100%] (a and b), 2) DE[10%]+nonDE[90%] (c and d), and 3) DE[20%]+nonDE[80%] (e and f). Sample labels correspond to main treatment (A or B) and flowcell number (4, 6 or 7). Dendrograms based on poi separate samples according to the expected sources of variation: in (a), with only DE transcripts, samples are arranged in two separate groups following treatment labels; in (c), with a predominant number of non-DE transcripts, the group structure is dominated by flowcell characteristics in addition to main treatment; and in (e), an in-between scenario, the dendrogram presents an intermediate group structure. Dendrograms based on rnr do not resemble any expected configuration.
Reeb PD, Bramardi SJ, Steibel JP (2015) Assessing Dissimilarity Measures for Sample-Based Hierarchical Clustering of RNA Sequencing Data Using Plasmode Datasets. PLoS ONE 10(7): e0132310. [article]
