High-throughput sequencing technologies, such as the Illumina HiSeq, are powerful new tools for investigating a wide range of biological and medical problems. The massive and complex data sets produced by sequencers create a need for statistical and computational methods that can handle the analysis and management of these data. Data normalization is one of the most crucial steps of data processing, and it must be considered carefully because it has a profound effect on the results of the analysis.
Researchers at the Poznan University of Life Sciences conducted a comprehensive comparison of five normalization methods related to sequencing depth, widely used for transcriptome sequencing (RNA-seq) data, and their impact on the results of gene expression analysis.
Bar plots of the differentially expressed genes (DEGs) at specified levels of count abundance in all studied data sets. The x-axis shows the normalization methods, and the y-axis shows the number of DEGs determined after each normalization procedure. The bar colours represent groups of genes at a particular level of expression.
Based on this study, they suggest a universal workflow that can be applied to select the optimal normalization procedure for any particular data set. The workflow includes calculation of bias and variance values for the control genes, the sensitivity and specificity of the methods, and classification errors, as well as generation of diagnostic plots. Its steps are:
- normalize the data using each of the methods under consideration,
- calculate the “bias” and “variance” and rank the methods based on these values,
- after each normalization, perform differential expression analysis and determine the list of DEGs found with each normalization method,
- select a subset of genes that can serve as positive and negative controls to investigate the sensitivity and specificity of normalization methods and rank the methods based on these criteria,
- calculate the mean prediction error, as a percentage, obtained with the chosen classifiers for the DEGs found by each normalization method, and rank the methods on this value,
- draw Venn diagrams or balloon plots based on the number of differentially expressed genes and rank the methods based on the number of DEGs they have in common, and
- based on the summary of ranks, choose the most appropriate normalization method for the investigated data set.
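The first, second, and last steps of the list can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three scaling methods shown are common depth-related normalizations and are not necessarily the five compared in the study, the "bias" and "variance" formulas are simplified stand-ins for the study's definitions, and the count matrix and the choice of control genes are simulated.

```python
import numpy as np

def total_count(counts):
    """Scale each sample (column) to the mean library size."""
    lib = counts.sum(axis=0)
    return counts / lib * lib.mean()

def upper_quartile(counts):
    """Scale each sample by the 75th percentile of its non-zero counts."""
    uq = np.array([np.percentile(col[col > 0], 75) for col in counts.T])
    return counts / uq * uq.mean()

def median_of_ratios(counts):
    """Scale each sample by its median ratio to the per-gene geometric mean."""
    expressed = (counts > 0).all(axis=1)          # genes seen in every sample
    log_counts = np.log(counts[expressed])
    log_ref = log_counts.mean(axis=1, keepdims=True)
    size = np.exp(np.median(log_counts - log_ref, axis=0))
    return counts / size

def bias_and_variance(norm, control_idx):
    """ASSUMED definitions, for illustration only: control genes should be
    constant across samples, so 'bias' is the mean absolute deviation of
    their log2 expression from the per-gene mean, and 'variance' is the
    mean per-gene sample variance of that log2 expression."""
    logs = np.log2(norm[control_idx] + 1.0)
    gene_mean = logs.mean(axis=1, keepdims=True)
    bias = np.abs(logs - gene_mean).mean()
    variance = logs.var(axis=1, ddof=1).mean()
    return bias, variance

def rank_methods(scores):
    """Rank methods on each criterion (lower is better) and sum the ranks."""
    names = list(scores)
    table = np.array([scores[n] for n in names])
    ranks = table.argsort(axis=0).argsort(axis=0) + 1   # 1 = best per column
    totals = ranks.sum(axis=1)
    return sorted(zip(names, totals.tolist()), key=lambda item: item[1])

# Simulated data (hypothetical): 500 genes, 6 samples of unequal depth.
rng = np.random.default_rng(0)
base = rng.gamma(shape=2.0, scale=50.0, size=(500, 1))
depth = np.array([0.5, 1.0, 1.5, 0.8, 1.2, 2.0])
counts = rng.poisson(base * depth).astype(float)
control_idx = np.arange(50)   # hypothetical control genes

methods = {
    "none": lambda c: c,      # unnormalized baseline for comparison
    "total_count": total_count,
    "upper_quartile": upper_quartile,
    "median_of_ratios": median_of_ratios,
}
scores = {name: bias_and_variance(fn(counts), control_idx)
          for name, fn in methods.items()}
ranking = rank_methods(scores)
for name, total in ranking:
    print(name, total)
```

In a real analysis the same rank table would gain further columns (sensitivity, specificity, classification error, DEG overlap) before the ranks are summed.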
Combining the above information facilitates the selection of the most appropriate normalization method for the studied data sets and determines which methods can be used interchangeably.
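A small sketch of how DEG lists might be compared across methods follows. The gene identifiers, method names, and control sets are toy placeholders, and the Jaccard index is one possible overlap measure chosen here for illustration; the study itself refers to Venn diagrams and balloon plots.

```python
from itertools import combinations

def sensitivity_specificity(deg_set, positives, negatives):
    """Sensitivity: share of positive controls called DEG.
    Specificity: share of negative controls NOT called DEG."""
    true_pos = len(deg_set & positives)
    false_pos = len(deg_set & negatives)
    return true_pos / len(positives), 1 - false_pos / len(negatives)

def jaccard(a, b):
    """Overlap of two DEG lists: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Toy DEG lists (hypothetical), one per normalization method.
deg_lists = {
    "method_A": {"g1", "g2", "g3", "g4"},
    "method_B": {"g1", "g2", "g3", "g5"},
    "method_C": {"g1", "g6", "g7"},
}
positives = {"g1", "g2", "g3"}   # hypothetical positive controls
negatives = {"g6", "g7", "g8"}   # hypothetical negative controls

performance = {name: sensitivity_specificity(degs, positives, negatives)
               for name, degs in deg_lists.items()}
overlap = {(a, b): jaccard(deg_lists[a], deg_lists[b])
           for a, b in combinations(deg_lists, 2)}

for pair, score in sorted(overlap.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

Pairs of methods with a high overlap and similar sensitivity and specificity are candidates for being used interchangeably; where to draw that line is left to the analyst.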