In recent years, RNA-seq is emerging as a powerful technology in estimation of gene and/or transcript expression, and RPKM (Reads Per Kilobase per Million reads) is widely used to represent the relative abundance of mRNAs for a gene. In general, the methods for gene quantification can be largely divided into two categories: transcript-based approach and ‘union exon’-based approach. Transcript-based approach is intrinsically more difficult because different isoforms of the gene typically have a high proportion of genomic overlap. On the other hand, ‘union exon’-based approach method is much simpler and thus widely used in RNA-seq gene quantification. Biologically, a gene is expressed in one or more transcript isoforms. Therefore, transcript-based approach is logistically more meaningful than ‘union exon’-based approach. Despite the fact that gene quantification is a fundamental task in most RNA-seq studies, however, it remains unclear whether ‘union exon’-based approach for RNA-seq gene quantification is a good practice or not.
Researchers at Pfizer Worldwide Research & Development carried out a side-by-side comparison of ‘union exon’-based approach and transcript-based method in RNA-seq gene quantification. It was found that the gene expression levels are significantly underestimated by ‘union exon’-based approach, and the average of RPKM from ‘union exons’-based method is less than 50% of the mean expression obtained from transcript-based approach. The difference between the two approaches is primarily affected by the number of transcripts in a gene. The researchers performed differential analysis at both gene and transcript levels, respectively, and found more insights, such as isoform switches, are gained from isoform differential analysis. The accuracy of isoform quantification would improve if the read coverage pattern and exon-exon spanning reads are taken into account and incorporated into EM (Expectation Maximization) algorithm. This investigation discourages the use of ‘union exons’-based approach in gene quantification despite its simplicity.
Analysis workflow and illustration of different methods for gene quantification. A) The overall mapping and counting workflow. B) A hypothetical gene, its two isoforms and read coverage profile. Assuming that the sum of mapped reads from all genes is 1 million, and each small and large exon is 1kb and 2kb long, respectively. C) featureCounts results, denoted as fc_rpkm. After exon flattening, the ‘union exons’ are 2 kb, 1 kb and 2 kb long, respectively. The calculated RPKM is 6.4. D) RSEM results, denoted as rsem_txSum_rpkm. Mapped reads are first distributed to individual isoforms, and the expressions for the two isoforms are 2 RPKM and 6 RPKM, respectively. Accordingly, the entire gene expression is 8 RPKM. Note in Fig 1D, the reads distributed to individual isoforms can be added up first, and then the sum of reads is divided by the length of union exons to give rise to rsem_rpkm.