YARN – Tissue-aware RNA-Seq processing and normalization for heterogeneous and sparse data

Although ultrahigh-throughput RNA-Sequencing has become the dominant technology for genome-wide transcriptional profiling, the vast majority of RNA-Seq studies profile only tens of samples, and most analytical pipelines are optimized for these smaller studies. However, projects are generating ever-larger data sets comprising RNA-Seq data from hundreds or thousands of samples, often collected at multiple centers and from diverse tissues. These complex data sets present significant analytical challenges due to batch and tissue effects, but they also provide the opportunity to revisit the assumptions and methods used to preprocess, normalize, and filter RNA-Seq data, which are critical first steps for any subsequent analysis. Researchers from the Dana-Farber Cancer Institute found that analysis of large RNA-Seq data sets requires both careful quality control and explicit accounting for the sparsity that arises from the heterogeneity intrinsic to multi-group studies.

Preprocessing workflow for large, heterogeneous RNA-Seq data sets, as applied to the GTEx data

The boxes on the right show the number of samples, genes, and tissue types at each step. First, samples were filtered using principal coordinate analysis (PCoA) with Y-chromosome genes to test for correct annotation of the sex of each sample. Second, PCoA was used to group or separate samples derived from related tissue regions. Third, genes were filtered to select a normalization gene set that preserves robust, tissue-dependent expression. Finally, the data were normalized using a global count distribution method to support cross-tissue comparison while minimizing within-group variability.
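The first step of the workflow, checking sex annotation with PCoA on Y-chromosome genes, can be illustrated with a minimal Python sketch. This is not YARN's code (YARN is an R package) and it does not reproduce the authors' exact procedure. The function names `pcoa` and `flag_sex_mismatches`, the use of Euclidean distance on log2 counts, the midpoint split along the first axis, and the toy count matrix are all assumptions made for illustration.

```python
import numpy as np


def pcoa(dist, n_axes=2):
    """Classical PCoA (metric MDS) from a square distance matrix."""
    n = dist.shape[0]
    # Double-center the squared distances to obtain a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    # Keep the axes with the largest eigenvalues.
    order = np.argsort(eigvals)[::-1][:n_axes]
    kept = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(kept)


def flag_sex_mismatches(y_counts, annotated_sex):
    """Return indices of samples whose annotated sex disagrees with
    the group suggested by the first PCoA axis of Y-gene expression.

    y_counts: samples x Y-chromosome-genes count matrix (hypothetical input)
    annotated_sex: sequence of "M" / "F" labels, one per sample
    """
    logged = np.log2(y_counts + 1.0)
    # Pairwise Euclidean distances between samples (an assumed metric).
    diff = logged[:, None, :] - logged[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    axis1 = pcoa(dist, n_axes=1)[:, 0]
    # Assumed rule: split the samples at the midpoint of the first axis.
    high_side = axis1 > (axis1.min() + axis1.max()) / 2.0
    # The sign of an eigenvector is arbitrary, so decide which side is
    # male-like by comparing mean Y-gene expression on the two sides.
    mean_expr = logged.mean(axis=1)
    high_is_male = mean_expr[high_side].mean() > mean_expr[~high_side].mean()
    inferred = np.where(high_side == high_is_male, "M", "F")
    return [i for i, (a, b) in enumerate(zip(inferred, annotated_sex)) if a != b]


# Toy data: six samples, three Y-chromosome genes (made-up counts).
y_counts = np.array([
    [210, 160, 310],
    [190, 140, 280],
    [0, 1, 0],
    [1, 0, 2],
    [205, 150, 295],  # male-like expression, annotated "F" below
    [0, 0, 1],
])
annotated_sex = ["M", "M", "F", "F", "F", "F"]

mismatches = flag_sex_mismatches(y_counts, annotated_sex)
print(mismatches)
```

In this toy example the fifth sample (index 4) is the only one flagged, because its Y-gene counts cluster with the male-annotated samples. In practice, flagged samples would be inspected rather than dropped automatically, and a real analysis would work from a curated gene annotation rather than a hand-built matrix.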

Availability: YARN, an R package implementing the authors' method for large-scale RNA-Seq normalization and preprocessing, is available at bioconductor.org/packages/yarn.

Paulson J, Chen CY, Lopes-Ramos CM, Kuijjer ML, Platig J, Sonawane AR, Fagny M, Glass K, Quackenbush J. (2016) Tissue-aware RNA-Seq processing and normalization for heterogeneous and sparse data. bioRxiv [Epub ahead of print].
