With the surge of next-generation high-throughput technologies, RNA-seq data is playing an increasingly important role in disease diagnosis, where normalization is assumed to be an essential step for producing comparable samples. Recent studies have proposed a range of normalization methods to remove various technical biases in RNA sequencing. However, no previous study has evaluated the impact of normalization on RNA-seq disease diagnosis.
In this study, Fordham University researchers investigate this problem by analyzing structured big data: RNA-seq data acquired from the TCGA portal, chosen for its popularity in RNA-seq disease diagnosis. They propose a novel normalization effect test algorithm, a diagnostic index (d-index), and data entropy to analyze and evaluate the impact of normalization on RNA-seq disease diagnosis using state-of-the-art machine learning models. Furthermore, the researchers present an original visualization analysis to compare the performance of normalized data with that of raw data. They found that normalized data generally yields diagnostic performance equivalent to, or even lower than, that of the corresponding raw data. Moreover, some normalization approaches (e.g. RPKM) even have negative effects on disease diagnosis. Raw data, on the other hand, seems able to decipher pathological status at least as well as, and sometimes better than, normalized data. Their visualization analysis also shows that some normalization methods introduce ‘outliers’, which unavoidably decrease sample detectability in diagnosis. More importantly, their data entropy analysis shows that normalized data usually has entropy values equivalent to or lower than those of raw data, and that data with high entropy values tends to achieve better diagnosis than data with low entropy values. In addition, the researchers found that high-dimensional imbalance (HDI) data is unaffected by any normalization procedure in diagnosis: it causes almost all machine learning models to fail, recognizing only the majority type regardless of whether the data is raw or normalized. These results suggest that normalized data may not offer statistically significant advantages over its raw form in disease diagnosis. They further imply that normalization, or at least some normalization procedures, may not be indispensable in RNA-seq disease diagnosis.
Instead, raw data may perform better because it captures more of the original transcriptome patterns under different pathological conditions.
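To make two of the quantities mentioned above more concrete, here is a minimal Python sketch. It is not the researchers' implementation. The `rpkm` function follows the standard RPKM definition (reads per kilobase of transcript per million mapped reads). The summary does not say how the study defines data entropy, so `shannon_entropy` is only an assumed stand-in that treats a sample's expression values as a probability distribution. The function names, the genes-by-samples matrix layout, and the toy counts are all hypothetical.

```python
import numpy as np


def rpkm(counts, gene_lengths_bp):
    """Standard RPKM: count * 1e9 / (gene length in bp * total mapped reads).

    counts: genes x samples matrix of raw read counts (assumed layout).
    gene_lengths_bp: one length in base pairs per gene.
    Assumes every sample has at least one mapped read.
    """
    counts = np.asarray(counts, dtype=float)
    lengths = np.asarray(gene_lengths_bp, dtype=float)
    library_size = counts.sum(axis=0, keepdims=True)  # total reads per sample
    return counts * 1e9 / (lengths[:, None] * library_size)


def shannon_entropy(values):
    """Shannon entropy (bits) of non-negative values rescaled to sum to 1.

    Assumption: used here only as a stand-in for the study's data entropy,
    whose exact definition is not given in this summary.
    """
    v = np.asarray(values, dtype=float)
    p = v / v.sum()
    p = p[p > 0]  # 0 * log(0) is taken as 0
    return float(-(p * np.log2(p)).sum())


# Made-up toy data: 3 genes x 2 samples, sample 2 sequenced twice as deeply.
counts = np.array([[100, 200],
                   [300, 600],
                   [600, 1200]])
lengths = np.array([1000, 2000, 3000])

normalized = rpkm(counts, lengths)
raw_entropy = shannon_entropy(counts[:, 0])
normalized_entropy = shannon_entropy(normalized[:, 0])
```

In this toy matrix the second sample is simply the first one sequenced at twice the depth, so RPKM maps both columns to the same values. The counts are invented for illustration and say nothing about the study's findings. Comparing `raw_entropy` with `normalized_entropy` across real samples would require the entropy definition the researchers actually used.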