RNA-Seq is a powerful transcriptomics tool for mammalian cell culture process development. Successful RNA-Seq data analysis requires a high-quality reference for read mapping and gene expression quantification. Currently, there are two public genome references for Chinese hamster ovary (CHO) cells, the predominant mammalian cell line in the biopharmaceutical industry.
In this study, Amgen scientists compared these two references by analyzing 60 RNA-Seq samples from a variety of CHO cell culture conditions. Among the 20,891 genes common to both references, 31.5% differed in quantification by more than 7.1%, which implies that the two references define these genes differently. The scientists propose a framework to quantify this difference using two metrics, Consistency and Stringency. Consistency accounts for the average quantification difference between the two references over all samples, and Stringency accounts for the sample-specific effect on the quantification result. These two metrics can be used to identify candidate genes for future gene model improvement and to assess the reliability of differentially expressed genes identified by RNA-Seq data analysis. Until a more comprehensive genome reference for CHO cells emerges, the strategy proposed in this study can enable more robust transcriptome analysis of CHO cell RNA-Seq data.
Comparison of two genome references
(A) Comparison of the gene content of the two genome references. Reference 1: CriGri_1.0 assembly and its 2014 annotation; Reference 2: C_griseus_v1.0 assembly and its 2014 annotation. First track (from center): composition of all genes in reference 1 (left) and reference 2 (right). Each reference contains genes with a specific gene symbol (blue and black for reference 1, blue and green for reference 2) and genes without a specific gene symbol (red and grey for reference 1, red and cyan for reference 2). Red and blue indicate genes shared between the two references. Second track: same as the first track, except that genes without a complete parent-child relationship are removed when building the genome references (as indicated by white blocks in the sectors). Third track: same as the second track, except that genes not in the KEGG pathway database are removed.
(B) Strategy to compare the two genome references by analyzing RNA-Seq data from CHO cells. RNA-Seq reads from each sample are mapped to both references. For each gene, the log2 ratio of the read counts obtained with the two references is calculated. Using multiple samples from a variety of conditions, the log2 ratio of a specific gene can be classified into one of four distribution patterns.
(C) Quantification of the distribution pattern using two metrics, Consistency and Stringency. μ and σ are the mean and the standard deviation, respectively, of the log2 ratio over all samples.
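The per-gene calculation described in panels (B) and (C) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the pseudocount of 1 (added to avoid taking the log of zero), the direction of the ratio (reference 1 over reference 2), and the use of the sample standard deviation are all assumptions, and the read counts are made up. The exact formulas that turn μ and σ into Consistency and Stringency are not given in this summary, so the sketch stops at μ and σ.

```python
import math
import statistics

# Assumption: a pseudocount keeps log2 defined when a gene has zero reads.
PSEUDOCOUNT = 1.0


def log2_ratios(counts_ref1, counts_ref2, pseudocount=PSEUDOCOUNT):
    """Per-sample log2 ratio of read counts for one gene (reference 1 / reference 2)."""
    if len(counts_ref1) != len(counts_ref2):
        raise ValueError("both references need exactly one count per sample")
    return [
        math.log2((c1 + pseudocount) / (c2 + pseudocount))
        for c1, c2 in zip(counts_ref1, counts_ref2)
    ]


def summarize_gene(counts_ref1, counts_ref2, pseudocount=PSEUDOCOUNT):
    """Return (mu, sigma): mean and sample standard deviation of the log2 ratios."""
    ratios = log2_ratios(counts_ref1, counts_ref2, pseudocount)
    mu = statistics.fmean(ratios)
    # statistics.stdev needs at least two values; fall back to 0.0 for one sample.
    sigma = statistics.stdev(ratios) if len(ratios) > 1 else 0.0
    return mu, sigma


# Hypothetical read counts for one gene across four samples.
ref1_counts = [99, 199, 399, 799]
ref2_counts = [49, 99, 199, 399]

mu, sigma = summarize_gene(ref1_counts, ref2_counts)
print(f"mu = {mu:.3f}, sigma = {sigma:.3f}")
```

In this made-up example every sample gives the same twofold ratio, so μ is far from zero while σ is zero: the two references disagree, but by a constant factor across samples. One plausible reading of the figure is that genes with μ near zero and small σ are quantified equivalently by both references, while a large σ flags a sample-dependent discrepancy.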