High-throughput sequencing technologies, such as the Illumina HiSeq, are powerful new tools for investigating a wide range of biological and medical problems. The massive and complex data sets produced by the sequencers create a need for statistical and computational methods that can handle the analysis and management of the data. Data normalization is one of the most crucial steps of data processing, and it must be carefully considered because it has a profound effect on the results of the analysis.
In this work, researchers from the Poznan University of Life Sciences focus on a comprehensive comparison of five normalization methods related to sequencing depth, widely used for transcriptome sequencing (RNA-seq) data, and their impact on the results of gene expression analysis.
The five methods are Trimmed Mean of M-values (TMM) and Upper Quartile (UQ), both implemented in the edgeR Bioconductor package; Median (DES), implemented in the DESeq Bioconductor package; Quantile (EBS), implemented in the EBSeq Bioconductor package; and PoissonSeq (PS) normalization, implemented in the PoissonSeq package. All packages are available from CRAN (http://cran.r-project.org/web/packages) and Bioconductor (http://www.bioconductor.org/packages/release/bioc).
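To make the idea of a depth-related normalization concrete, the following is a minimal Python sketch of a median-of-ratios size factor calculation, the general approach behind the Median (DES) method. It is a simplified illustration rather than the DESeq implementation itself; the function names `median_of_ratios_size_factors` and `normalize` are hypothetical, and the input is assumed to be a plain list of per-gene raw counts with one column per sample.

```python
import math
from statistics import median


def median_of_ratios_size_factors(counts):
    """Estimate one size factor per sample (simplified median-of-ratios).

    counts: list of genes, each a list of raw counts, one entry per sample.
    """
    n_samples = len(counts[0])
    # Keep only genes with a positive count in every sample, so logs are defined.
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no gene has positive counts in every sample")
    # The per-gene geometric mean across samples serves as a pseudo-reference.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        # Log ratio of each gene's count in sample j to the pseudo-reference.
        log_ratios = [math.log(row[j]) - g for row, g in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


def normalize(counts, factors):
    """Divide each sample's counts by that sample's size factor."""
    return [[c / f for c, f in zip(row, factors)] for row in counts]


# Toy data (made up): sample 2 was sequenced twice as deeply as sample 1.
toy_counts = [[10, 20], [100, 200], [5, 10], [0, 3]]
toy_factors = median_of_ratios_size_factors(toy_counts)
toy_normalized = normalize(toy_counts, toy_factors)
```

In this toy case the second size factor is twice the first, so after normalization the genes expressed in both samples have equal values across the two samples. The other methods in the comparison differ mainly in which summary statistic they use to estimate the per-sample scaling.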
Based on this study, they suggest a universal workflow that can be applied to select the optimal normalization procedure for any particular data set. The workflow includes calculating the bias and variance values for the control genes, the sensitivity and specificity of the methods, and the classification errors, as well as generating diagnostic plots. Combining this information facilitates the selection of the most appropriate normalization method for the studied data sets and shows which methods can be used interchangeably.
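Two of the workflow's ingredients can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' procedure: sensitivity and specificity follow their standard definitions, while the bias and variance of control genes are assumed here to be the mean and variance of their log2 fold changes between two conditions (control genes are expected not to change, so values near zero are better). The function names and the pseudocount are hypothetical.

```python
import math
from statistics import mean, pvariance


def sensitivity_specificity(called, truth, all_genes):
    """Standard sensitivity and specificity of a set of called DEGs.

    called: genes reported as differentially expressed after normalization.
    truth: genes known (or assumed) to be truly differentially expressed.
    all_genes: every gene that was tested.
    """
    called, truth, all_genes = set(called), set(truth), set(all_genes)
    tp = len(called & truth)
    fn = len(truth - called)
    fp = len(called - truth)
    tn = len(all_genes - called - truth)
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec


def control_gene_bias_variance(norm_a, norm_b, pseudo=1.0):
    """Assumed definition: mean and variance of control-gene log2 fold changes.

    norm_a, norm_b: mean normalized expression of each control gene in
    condition A and condition B. The pseudocount avoids division by zero.
    """
    lfc = [math.log2((b + pseudo) / (a + pseudo)) for a, b in zip(norm_a, norm_b)]
    return mean(lfc), pvariance(lfc)


# Toy example with made-up gene labels.
genes = {"g1", "g2", "g3", "g4", "g5", "g6"}
sens, spec = sensitivity_specificity({"g1", "g2", "g5"}, {"g1", "g2", "g3"}, genes)
```

Computing these quantities once per normalization method gives a small table of scores that can be compared side by side, which is the spirit of the selection step described above.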
Bar plots of the differentially expressed genes (DEGs) at specified levels of count abundance in all studied data sets. The x-axis shows the normalization methods, and the y-axis shows the number of DEGs identified after each normalization procedure. The bar colours represent groups of genes at a particular level of expression.