Tissue heterogeneity as a widespread phenomenon in publicly available gene expression datasets

Lack of reproducibility in gene expression studies is a serious issue being actively addressed by the biomedical research community. Besides established factors such as batch effects and incorrect sample annotations, researchers at the Medical University of Innsbruck and the Technical University of Munich recently reported tissue heterogeneity, a consequence of unintentionally profiling cells that originate from tissues other than the tissue of interest, as a source of variance. Although tissue heterogeneity exacerbates irreproducibility, its prevalence in gene expression data has remained unknown. Here, the researchers systematically analyse 2 667 publicly available gene expression datasets covering 76 576 samples. Using two independent data compendia and a reproducible, open-source software pipeline, they find that tissue heterogeneity affects between 1% and 40% of samples, depending on the tissue type. They discover both cases of severe heterogeneity, which may be caused by mistakes in annotation or sample handling, and cases of moderate heterogeneity, which are likely caused by tissue infiltration or sample contamination. Their analysis establishes tissue heterogeneity as a widespread phenomenon in publicly available gene expression datasets and as an important source of variance that should not be ignored. Consequently, the researchers advocate applying quality-control methods such as BioQC to detect tissue heterogeneity before mining or analysing gene expression data.

Tissue heterogeneity in gene expression studies from GEO and ARCHS4

Tissue heterogeneity in gene expression studies from GEO and ARCHS4. (A) Fraction of heterogeneous samples per tissue. Error bars show 95% confidence intervals derived by bootstrapping (n=1 000). (B) Confusion matrix of tissues with absolute counts. "Reference tissue" is the annotated tissue; "detected signature" refers to the other tissue signatures that BioQC detected in these samples. For tiles boxed with dashed lines, one or more query signatures have been removed because they correlate with the reference signature.
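The caption mentions 95% confidence intervals obtained by bootstrapping with 1 000 resamples. As a rough illustration of how such an interval for a fraction of heterogeneous samples could be computed, here is a minimal sketch. It assumes a percentile bootstrap, which the source does not specify, and the function name `bootstrap_fraction_ci` and the example flags are hypothetical, not taken from the authors' pipeline.

```python
import random


def bootstrap_fraction_ci(flags, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the fraction of True flags.

    flags  -- list of booleans, one per sample (True = flagged as heterogeneous)
    n_boot -- number of bootstrap resamples (1 000 in the figure caption)
    alpha  -- 1 minus the confidence level (0.05 gives a 95% interval)
    Returns (observed_fraction, lower_bound, upper_bound).
    """
    if not flags:
        raise ValueError("flags must not be empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(flags)
    # Resample the samples with replacement and record the flagged fraction.
    fractions = sorted(
        sum(rng.choices(flags, k=n)) / n for _ in range(n_boot)
    )
    # Read the interval bounds off the sorted bootstrap distribution.
    lower = fractions[int((alpha / 2) * n_boot)]
    upper = fractions[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return sum(flags) / n, lower, upper


# Made-up example: 20 of 100 samples of one tissue flagged as heterogeneous.
example_flags = [True] * 20 + [False] * 80
fraction, ci_low, ci_high = bootstrap_fraction_ci(example_flags)
print(f"fraction={fraction:.2f}, 95% CI=({ci_low:.2f}, {ci_high:.2f})")
```

Resampling whole samples with replacement keeps the sketch independent of any distributional assumption, which is convenient when the number of samples per tissue varies widely.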

Sturm G, List M, Zhang JD. (2021) Tissue heterogeneity is prevalent in gene expression studies. NAR Genomics and Bioinformatics 3(3), lqab077.
