RNA sequencing (RNA-seq) and microarrays are two transcriptomics techniques for quantifying transcribed genes and their isoforms. Here, researchers from the Luxembourg Institute of Health compare the latest Affymetrix HTA 2.0 microarray with Illumina 2000 RNA-seq for the analysis of patient samples: normal lung epithelium tissue and squamous cell carcinoma lung tumours. Protein-coding mRNAs and long non-coding RNAs (lncRNAs) were included in the study.
Both platforms performed equally well for protein-coding RNAs; however, the stochastic variability was higher for the sequencing data than for the microarray data. This reduced the number of differentially expressed genes, and of genes with predictive potential, found by RNA-seq compared with the microarray. Analysis of this variability revealed a lack of reads for short and low-abundance genes; lncRNAs, being shorter and less abundant, were especially susceptible to this issue. A major difference between the two platforms was uncovered by the analysis of alternatively spliced genes. Investigation of differential exon abundance showed insufficient reads for many exons and exon junctions in RNA-seq, while detection on the array platform was more stable. Nevertheless, the researchers identified 207 genes that undergo alternative splicing and were consistently detected by both techniques.
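To illustrate why short, low-abundance transcripts are more exposed to sampling noise, here is a minimal sketch. It is not the researchers' analysis: it assumes a toy model in which the expected read count scales with transcript length times abundance, and in which counts are Poisson distributed, so that the coefficient of variation is 1 / sqrt(mean). The function names, the `reads_per_kb` scaling factor, and the example transcripts are all hypothetical.

```python
import math


def expected_reads(length_bp: float, abundance: float, reads_per_kb: float = 10.0) -> float:
    """Expected read count under a toy model: reads scale with length x abundance.

    `reads_per_kb` is a hypothetical depth factor (reads per kilobase per unit abundance).
    """
    return reads_per_kb * (length_bp / 1000.0) * abundance


def poisson_cv(mean_count: float) -> float:
    """Coefficient of variation of a Poisson count: sd / mean = 1 / sqrt(mean)."""
    if mean_count <= 0:
        raise ValueError("mean_count must be positive")
    return 1.0 / math.sqrt(mean_count)


# Hypothetical transcripts: a long, abundant mRNA and a short, rare lncRNA.
mrna_reads = expected_reads(length_bp=4000, abundance=25.0)
lncrna_reads = expected_reads(length_bp=500, abundance=0.8)

for name, reads in [("long abundant mRNA", mrna_reads), ("short rare lncRNA", lncrna_reads)]:
    print(f"{name}: expected reads = {reads:.1f}, Poisson CV = {poisson_cv(reads):.3f}")
```

Under these assumptions the short, rare transcript receives only a handful of reads, so its relative sampling noise is many times larger than that of the long, abundant one. Real RNA-seq counts are typically overdispersed relative to Poisson, so this sketch gives a lower bound on the noise rather than an estimate of it.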
Differentially expressed genes identified by the platforms
Evolution of the number of significant genes identified with variable FDR thresholds (a), using edgeR and limma with voom correction for analysis of RNA-seq data, and using limma for HTA data. Solid lines show unpaired analyses, while dotted lines show analyses paired by patient. Differentially expressed protein-coding mRNAs (b) and lncRNAs (c) were obtained by unpaired differential expression analysis using edgeR for RNA-seq and limma for HTA (FDR < 0.01) and are represented as proportional Euler-Venn diagrams. The lists of differentially expressed genes were confirmed against the top 25% most significant genes detected in the LUSC-TCGA dataset: 4569 protein-coding genes (FDR < 10⁻¹⁸) and 111 lncRNAs (FDR < 10⁻⁸) were used. Evolution of the Jaccard index for coding mRNAs with variable FDR thresholds (d) between the two platforms (violet) shows monotonic behaviour. Similarity between the TCGA validation gene list and each of the platforms, RNA-seq (red) and HTA (blue), shows HTA slightly outperforming RNA-seq (marked by an arrow).
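The Jaccard index in panel (d) measures the overlap between two gene lists as the size of their intersection divided by the size of their union. A minimal sketch of how such a curve could be computed across FDR thresholds follows. The gene names and FDR values are invented for illustration, and the helper functions are hypothetical rather than taken from the study.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index |A & B| / |A | B|; defined here as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def significant(fdr_by_gene: dict, threshold: float) -> set:
    """Genes whose FDR falls strictly below the threshold."""
    return {gene for gene, fdr in fdr_by_gene.items() if fdr < threshold}


def jaccard_curve(fdr_a: dict, fdr_b: dict, thresholds: list) -> list:
    """Jaccard index between the two significant-gene sets at each FDR threshold."""
    return [
        (t, jaccard(significant(fdr_a, t), significant(fdr_b, t)))
        for t in thresholds
    ]


# Invented per-gene FDR values for two platforms (illustration only).
rnaseq_fdr = {"GENE_A": 1e-6, "GENE_B": 0.003, "GENE_C": 0.04, "GENE_D": 0.2}
hta_fdr = {"GENE_A": 1e-8, "GENE_B": 0.02, "GENE_C": 0.008, "GENE_D": 0.5}

curve = jaccard_curve(rnaseq_fdr, hta_fdr, [0.001, 0.01, 0.05])
for threshold, index in curve:
    print(f"FDR < {threshold}: Jaccard = {index:.3f}")
```

The same `jaccard` function can compare either platform's list against an external validation list, which is how the similarity to the TCGA gene list described above could be scored.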
Although the results of gene expression analysis were highly consistent between the Human Transcriptome Array and RNA-seq platforms, the analysis of alternative splicing produced discordant results. The researchers concluded that modern microarrays can still outperform sequencing for standard analysis of gene expression in terms of reproducibility and cost.