RNA extraction method, read length and sequencing layout (single-end versus paired-end) contribute strongly to variation between RNA-Seq samples

Sequencing-based gene expression methods like RNA-sequencing (RNA-seq) have become increasingly common, but it is often claimed that results obtained in different studies are not comparable owing to the influence of laboratory batch effects, differences in RNA extraction and sequencing library preparation methods and bioinformatics processing pipelines. It would be unfortunate if different experiments were in fact incomparable, as there is great promise in data fusion and meta-analysis applied to sequencing data sets.

Now, a team led by researchers at the Royal Institute of Technology has compared reported gene expression measurements for ostensibly similar samples (specifically, human brain, heart and kidney samples) in several different RNA-seq studies to assess their overall consistency and to examine the factors contributing most to systematic differences. The same comparisons were also performed after preprocessing all data in a consistent way, eliminating potential bias from bioinformatics pipelines. The researchers conclude that published human tissue RNA-seq expression measurements appear relatively consistent in the sense that samples cluster by tissue rather than laboratory of origin given simple preprocessing transformations. The article is supplemented by a detailed walkthrough with embedded R code and figures.


Analyses of the 11 data sets with published precomputed FPKM/RPKM values (n = 13,078) for brain, heart and kidney samples from four different studies.

Key points

  • Publicly available data sets with precomputed RNA expression levels are not comparable in their untransformed state in the sense that samples from the same tissues obtained in different experiments do not cluster by tissue.

  • Logarithmic transformation improves clustering of samples in principal components 2 and 3, while principal component 1 still seems to be dominated by study-specific factors.

  • RNA extraction method, read length and sequencing layout (single-end versus paired-end) contribute strongly to variation between samples.

  • Removal of known batch effects is essential for clustering based on tissue type.

  • Reprocessing raw data avoids loss of expression information because of gene identifier matching issues but does not serve to improve clustering.
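To make the point about logarithmic transformation concrete, here is a minimal sketch in Python. It is not the authors' analysis: the study used principal component analysis on published FPKM/RPKM values (with R code in the supplementary walkthrough), whereas this sketch swaps in a simple sample-to-sample Pearson correlation as a stand-in. The sample names, the four-gene expression matrix and the pseudocount of 1 are all hypothetical values chosen only for illustration.

```python
import math

def log_transform(matrix, pseudocount=1.0):
    """Return log2(value + pseudocount) for each entry of a samples x genes matrix."""
    return [[math.log2(v + pseudocount) for v in row] for row in matrix]

def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def correlation_matrix(matrix):
    """All pairwise sample-to-sample correlations."""
    return [[pearson(a, b) for b in matrix] for a in matrix]

# Hypothetical toy data (not from the paper): rows are samples, columns are genes.
# Gene 1 is very highly expressed everywhere and its level depends on the study;
# genes 2 to 4 carry the tissue signal at much lower expression levels.
samples = ["brain_studyA", "brain_studyB", "heart_studyA", "heart_studyB"]
fpkm = [
    [10000.0, 50.0, 2.0, 1.0],
    [200.0, 40.0, 3.0, 0.0],
    [12000.0, 1.0, 60.0, 30.0],
    [300.0, 2.0, 45.0, 25.0],
]

raw_corr = correlation_matrix(fpkm)
log_corr = correlation_matrix(log_transform(fpkm))

# Compare how similar brain_studyA looks to each other sample, before and after.
for label, corr in (("raw", raw_corr), ("log2(x + 1)", log_corr)):
    for name, r in zip(samples[1:], corr[0][1:]):
        print(f"{label:12s} brain_studyA vs {name}: r = {r:.3f}")
```

In this toy matrix the untransformed values make brain_studyA look most similar to heart_studyA, because the single highly expressed gene dominates the correlation. After the log transform, brain_studyA is more similar to brain_studyB than to either heart sample. Real data sets are far noisier, and the article reports that study-specific factors still dominate the first principal component even after log transformation.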

Danielsson F, James T, Gomez-Cabrero D, Huss M. (2015) Assessing the consistency of public human tissue RNA-seq data sets. Brief Bioinform [Epub ahead of print].


