Biologists often use RNA-Sequencing (RNA-Seq) to identify a limited number of genes for subsequent validation, and one important factor for candidate gene selection is the fold-change in expression between two groups. However, RNA-Seq produces a wide range of read counts per gene, and genes with a low coverage of reads can produce artificially high fold-change values.
In this paper, researchers from the City of Hope National Medical Center present a solution to this problem: adding a factor between 0.01 and 1 to normalized expression values. This conclusion is based upon analysis of a large cohort of paired tumor and normal samples from patients with lung adenocarcinomas as well as a small, two-group cell line dataset. The optimal factor to add to normalized expression values is chosen by testing a range of factors on 1) the number of genes or transcripts whose expression is effectively censored (using three different alignment algorithms) and 2) the potential level of bias introduced by the factor (defined by comparing unadjusted gene lists). The robustness of these trends is also tested by comparing multiple mRNA quantification and differential expression algorithms. The relationship between RPKM cutoff and concordance between gene lists produced using different statistical methods can be complicated, but this study emphasizes that simple statistical analysis (amenable to the use of rounded RPKM values) provides results of at least equal quality to those of popular algorithms for RNA-Seq differential expression.
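As a rough illustration of why adding a small factor helps, the following Python sketch compares log2 fold-changes with and without an offset. It is not the sRAP implementation: the function name `log2_fold_change`, the example RPKM values, and the offset of 0.1 (picked from within the 0.01 to 1 range described above) are all assumptions made for this example.

```python
import math


def log2_fold_change(rpkm_a, rpkm_b, offset=0.1):
    """Return log2((rpkm_a + offset) / (rpkm_b + offset)).

    Hypothetical helper for illustration only. The default offset of 0.1
    is an assumed value inside the 0.01 to 1 range, not a recommendation.
    """
    if offset <= 0:
        raise ValueError("offset must be positive")
    if rpkm_a < 0 or rpkm_b < 0:
        raise ValueError("expression values must be non-negative")
    return math.log2((rpkm_a + offset) / (rpkm_b + offset))


# Made-up normalized expression values (RPKM) for two genes in two groups.
low_coverage_gene = (0.02, 0.001)   # both groups barely detected
high_coverage_gene = (40.0, 2.0)    # both groups well covered

# Both genes have the same raw 20-fold ratio before any offset is added.
raw_low = math.log2(low_coverage_gene[0] / low_coverage_gene[1])
raw_high = math.log2(high_coverage_gene[0] / high_coverage_gene[1])

# After adding the offset, only the well-covered gene keeps a large value.
adj_low = log2_fold_change(*low_coverage_gene)
adj_high = log2_fold_change(*high_coverage_gene)

print(f"low coverage:  raw={raw_low:.2f}  adjusted={adj_low:.2f}")
print(f"high coverage: raw={raw_high:.2f}  adjusted={adj_high:.2f}")
```

In this toy case both genes start with a raw log2 fold-change of about 4.3, but the offset shrinks the low-coverage gene to roughly 0.25 while leaving the well-covered gene almost unchanged. A larger offset censors more low-expression genes, and a smaller one censors fewer, which is the trade-off the factor selection has to balance.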
Availability – The strategies discussed in this paper have been implemented as part of the Simplified RNA-Seq Analysis Pipeline (sRAP), which is an R package that is available at: http://www.bioconductor.org/packages/release/bioc/html/sRAP.html