More than 110,000 publications have used microarrays to decipher phenotype-associated genes, clinical biomarkers and gene functions. Microarrays rely on digitally assaying the fluorescence signals of arrays. In this study, researchers at the University of Michigan Medical School retrospectively constructed raw images for 37,724 published microarray datasets and developed deep learning algorithms to detect systematic defects automatically. They report that an alarming 26.73% of the microarray-based studies are affected by serious imaging defects. By literature mining, the researchers found that publications associated with the affected microarrays reported disproportionately more biological discoveries on genes in the contaminated areas than on other genes. Of the gene-level conclusions reported in these publications, 28.82% were based on measurements that fall within a contaminated area, indicating severe, systematic problems caused by such contamination. The researchers have released the problematic published datasets they identified, the affected genes and the imputed arrays, along with software tools for scanning for such contamination, which they say will become essential for future studies that scrutinize and critically analyze microarray data.
Overview of the workflow of the algorithm
(A) U-Net model structure. (B) Partition of images into training, validation and test sets. (C) Image preprocessing, U-Net model training and result evaluation. (D) Examples of model output compared with human labels: the model detects both the defective areas that human labeling identified and those it initially missed. White regions in the output/label images indicate the defective areas. (E) Dice coefficients for the five folds of cross-validation: 0.6054, 0.6356, 0.6493, 0.5722 and 0.6244.
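For readers unfamiliar with the metric in panel (E), the Dice coefficient measures the overlap between a predicted defect mask and a human-labeled mask as twice the number of shared defect pixels divided by the total number of defect pixels in both masks. The following is a minimal sketch, not the authors' evaluation code: the function name `dice_coefficient`, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration. Only the five fold scores are taken from the caption.

```python
def dice_coefficient(pred, label):
    """Dice overlap between two binary masks given as nested lists of 0/1.

    Hypothetical helper for illustration; 1 marks a defective pixel.
    """
    flat_pred = [p for row in pred for p in row]
    flat_label = [l for row in label for l in row]
    if len(flat_pred) != len(flat_label):
        raise ValueError("masks must have the same number of pixels")

    # Pixels marked defective in both masks.
    intersection = sum(1 for p, l in zip(flat_pred, flat_label) if p and l)
    total = sum(flat_pred) + sum(flat_label)

    # Assumed convention: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Toy 2x3 masks (made up for this example): 3 predicted pixels,
# 3 labeled pixels, 2 of them shared, so Dice = 2*2 / (3+3).
pred_mask = [[0, 1, 1],
             [0, 1, 0]]
label_mask = [[0, 1, 0],
              [0, 1, 1]]
toy_dice = dice_coefficient(pred_mask, label_mask)

# Per-fold Dice coefficients as reported in panel (E).
fold_scores = [0.6054, 0.6356, 0.6493, 0.5722, 0.6244]
mean_dice = sum(fold_scores) / len(fold_scores)

print(f"toy Dice: {toy_dice:.4f}")
print(f"mean Dice across folds: {mean_dice:.4f}")
```

Averaging the five reported folds gives a mean Dice of roughly 0.617. A score of 1 would mean the model's mask and the human label coincide exactly, and 0 would mean they share no defect pixels.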