Cancer is a complex and heterogeneous disease that poses a major challenge for clinical management and research. One of the key goals of cancer research is to identify biomarkers that can help diagnose, classify, prognosticate, and treat cancer patients. Biomarkers are molecules or characteristics that can be measured or detected in biological samples, such as blood, tissue, or urine, and can provide information about the presence, stage, or progression of cancer, as well as the response or resistance to therapy.
However, finding reliable and specific biomarkers for cancer is not an easy task, as cancer is caused by multiple genetic and epigenetic alterations that affect the expression and function of thousands of genes and proteins. Moreover, cancer is a dynamic and evolving process that can vary across different regions of the same tumor, as well as between different patients with the same type of cancer. Therefore, there is a need for more comprehensive and sensitive methods to capture the molecular diversity and complexity of cancer.
One of the most promising methods to achieve this goal is RNA sequencing (RNA-seq), a technique that uses high-throughput sequencing to measure the abundance and sequence of RNA molecules in a given sample. RNA-seq can provide a global and largely unbiased view of the transcriptome, the complete set of RNA transcripts produced by the genome, including coding and non-coding RNAs, such as messenger RNAs (mRNAs), microRNAs (miRNAs), long non-coding RNAs (lncRNAs), and circular RNAs (circRNAs), provided that the library preparation captures the RNA class of interest. RNA-seq can also detect novel transcripts, splice variants, fusion genes, and mutations that may be associated with cancer.
In this article, we will discuss how to use RNA-seq data to identify novel biomarkers for cancer diagnosis and treatment, and provide some examples of successful applications of this approach in different types of cancer.
Steps to Identify Novel Biomarkers from RNA-seq Data
The general workflow to identify novel biomarkers from RNA-seq data consists of the following steps:
- Sample preparation and sequencing: The first step is to collect and process the biological samples of interest, such as tumor tissue, blood, or other fluids, from cancer patients and healthy controls. The samples are then subjected to RNA extraction, quality control, library preparation, and sequencing using an appropriate platform and protocol. The sequencing output is a set of short reads that represent the RNA molecules in the sample.
- Read alignment and quantification: The next step is to map the reads to a reference genome or transcriptome using a suitable alignment tool, such as the splice-aware aligners STAR and HISAT2, or Bowtie2 when aligning to a transcriptome. Alignment assigns each read to genomic coordinates, and the reads overlapping each gene or transcript are then counted to estimate its expression level in the sample. The expression level can be reported using different normalized metrics, such as counts per million (CPM), fragments per kilobase of transcript per million mapped reads (FPKM), or transcripts per million (TPM).
- Differential expression analysis: The third step is to compare the expression levels of genes or transcripts between different groups of samples, such as cancer versus normal, or treated versus untreated, using a statistical method, such as DESeq2, edgeR, or limma. DESeq2 and edgeR take raw read counts as input and apply their own normalization, so CPM, FPKM, or TPM values should not be passed to them. The differential expression analysis results in a list of genes or transcripts that are significantly upregulated or downregulated in one group compared to another, along with their fold changes, p-values, and false discovery rates (FDRs).
- Functional annotation and enrichment analysis: The fourth step is to annotate and classify the differentially expressed genes or transcripts according to their biological functions, pathways, and interactions, using various databases and tools, such as Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), Reactome, or STRING. The functional annotation and enrichment analysis can help to understand the biological processes and mechanisms that are involved in cancer development and progression, and to identify potential targets for therapy.
- Biomarker selection and validation: The final step is to select and validate the most promising candidates for biomarkers from the differentially expressed genes or transcripts, based on their biological relevance, clinical significance, and technical feasibility. The biomarker selection can be guided by several criteria, such as the magnitude and direction of expression change, the consistency and specificity across different samples and cancer types, the correlation with clinical outcomes, and the availability of detection methods. The biomarker validation can be performed by using independent cohorts of samples, different platforms or techniques, such as quantitative real-time PCR (qRT-PCR), immunohistochemistry (IHC), or enzyme-linked immunosorbent assay (ELISA), and functional assays, such as knockdown, overexpression, or inhibition experiments.
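The arithmetic behind the quantification, differential expression, and enrichment steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a replacement for DESeq2, edgeR, or a dedicated enrichment tool: the function names, the gene names (GENE_A and so on), the counts, the transcript lengths, and the p-values are all hypothetical and chosen only to show how CPM, TPM, a log2 fold change, a Benjamini-Hochberg FDR adjustment, and a hypergeometric over-representation test are computed.

```python
import math


def cpm(counts):
    """Counts per million: scale each raw count by the library size."""
    total = sum(counts.values())
    return {gene: c / total * 1e6 for gene, c in counts.items()}


def tpm(counts, lengths_kb):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rate = {gene: counts[gene] / lengths_kb[gene] for gene in counts}
    total = sum(rate.values())
    return {gene: r / total * 1e6 for gene, r in rate.items()}


def log2_fold_change(group_a, group_b, pseudocount=1.0):
    """log2 ratio of group means; the pseudocount avoids division by zero."""
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    return math.log2((mean_a + pseudocount) / (mean_b + pseudocount))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def enrichment_pvalue(n_universe, n_in_set, n_selected, n_overlap):
    """One-sided hypergeometric test: P(overlap >= n_overlap)."""
    upper = min(n_in_set, n_selected)
    total = math.comb(n_universe, n_selected)
    tail = sum(
        math.comb(n_in_set, k) * math.comb(n_universe - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy inputs: every value below is invented for illustration.
counts = {"GENE_A": 500, "GENE_B": 300, "GENE_C": 200}
lengths_kb = {"GENE_A": 2.0, "GENE_B": 1.0, "GENE_C": 0.5}

print(cpm(counts))
print(tpm(counts, lengths_kb))
print(log2_fold_change([7.0, 7.0], [1.0, 1.0]))  # prints 2.0
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
print(enrichment_pvalue(n_universe=10, n_in_set=3, n_selected=3, n_overlap=2))
```

In the toy example, GENE_A has the highest CPM but the lowest TPM because it is also the longest transcript, which is why length-aware metrics are preferred when comparing different genes within a sample. In a real analysis the p-values would come from a statistical model fitted to raw counts, and the gene universe for the enrichment test would be the set of genes that were actually tested.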
Examples of Novel Biomarkers Identified from RNA-seq Data
RNA-seq has been widely used to identify novel biomarkers for cancer diagnosis and treatment in various types of cancer, such as breast cancer, lung cancer, colorectal cancer, and hepatocellular carcinoma. Here are some examples of novel biomarkers identified from RNA-seq data in breast, lung, and colorectal cancer:
- Breast cancer: Breast cancer is the most commonly diagnosed cancer and the leading cause of cancer death among women worldwide, with an estimated 2.3 million new cases and 685,000 deaths in 2020. Breast cancer is a heterogeneous disease that can be classified into different subtypes based on the expression of hormone receptors (estrogen receptor, ER; progesterone receptor, PR) and human epidermal growth factor receptor 2 (HER2), as well as other molecular features, such as gene expression profiles and mutations. These subtypes have different clinical behaviors and responses to therapy, and therefore require different biomarkers for diagnosis and treatment. RNA-seq has been used to identify novel biomarkers for breast cancer subtypes, such as:
- GREB1: GREB1 is a gene that encodes a protein that is involved in the regulation of ER signaling and cell proliferation. GREB1 was found to be highly expressed in ER-positive breast cancer, and to be associated with poor prognosis and resistance to endocrine therapy. GREB1 was also shown to be a potential therapeutic target for ER-positive breast cancer, as its inhibition reduced tumor growth and enhanced the sensitivity to tamoxifen, a drug that blocks ER activity.
- LINC00963: LINC00963 is a lncRNA that is involved in the regulation of cell cycle and apoptosis. LINC00963 was found to be overexpressed in triple-negative breast cancer (TNBC), a subtype of breast cancer that lacks ER, PR, and HER2 expression, and has a poor prognosis and limited treatment options. LINC00963 was also shown to be a potential therapeutic target for TNBC, as its knockdown induced cell cycle arrest and apoptosis, and suppressed tumor growth and metastasis.
- Lung cancer: Lung cancer is the second most commonly diagnosed cancer and the leading cause of cancer death worldwide, with an estimated 2.2 million new cases and 1.8 million deaths in 2020. Lung cancer is mainly divided into two types: non-small cell lung cancer (NSCLC), which accounts for about 85% of all cases, and small cell lung cancer (SCLC), which accounts for about 15% of all cases. NSCLC can be further classified into different subtypes based on the histology and molecular features, such as adenocarcinoma, squamous cell carcinoma, and large cell carcinoma. These subtypes have different clinical outcomes and responses to therapy, and therefore require different biomarkers for diagnosis and treatment. RNA-seq has been used to identify novel biomarkers for lung cancer subtypes, such as:
- SOX17: SOX17 is a gene that encodes a transcription factor that is involved in the regulation of cell differentiation and development. SOX17 was found to be downregulated in lung adenocarcinoma, with higher SOX17 expression associated with better survival and response to chemotherapy. SOX17 was also proposed as a potential therapeutic target for lung adenocarcinoma, as restoring its expression inhibited cell proliferation, migration, and invasion, and induced apoptosis and senescence.
- LINC01133: LINC01133 is a lncRNA that is involved in the regulation of cell growth and metabolism. LINC01133 was found to be upregulated in lung squamous cell carcinoma, and to be associated with poor prognosis and resistance to cisplatin, a drug that induces DNA damage and cell death. LINC01133 was also shown to be a potential therapeutic target for lung squamous cell carcinoma, as its knockdown reduced cell viability, glycolysis, and oxidative phosphorylation, and increased cisplatin sensitivity.
- Colorectal cancer: Colorectal cancer is the third most common and the second most deadly cancer among both men and women worldwide, with an estimated 1.9 million new cases and 0.9 million deaths in 2020. Colorectal cancer is a heterogeneous disease that can be classified into different subtypes based on the molecular features, such as microsatellite instability (MSI), chromosomal instability (CIN), CpG island methylator phenotype (CIMP), and mutations in genes such as KRAS, BRAF, and PIK3CA. These subtypes have different clinical behaviors and responses to therapy, and therefore require different biomarkers for diagnosis and treatment. RNA-seq has been used to identify novel biomarkers for colorectal cancer subtypes, such as:
- MSI-H: MSI-H is a subtype of colorectal cancer that is characterized by a high level of microsatellite instability, a condition in which defective DNA mismatch repair allows short repetitive sequences (microsatellites) to accumulate length changes during replication. MSI-H accounts for about 15% of all colorectal cancers, and is associated with better prognosis and response to immunotherapy, a type of treatment that stimulates the immune system to fight cancer cells. Candidate biomarkers for MSI-H colorectal cancer studied with RNA-seq include:
- MLH1: MLH1 is a gene that encodes a protein that is involved in the repair of DNA mismatches. MLH1 is frequently silenced by promoter methylation in MSI-H colorectal cancer, and its loss leads to increased microsatellite instability and tumorigenesis. MLH1 is a potential diagnostic and prognostic biomarker for MSI-H colorectal cancer, as its expression level can distinguish MSI-H from other subtypes, and is correlated with survival and recurrence.
- LINC01234: LINC01234 is a lncRNA that is involved in the regulation of immune response and inflammation. LINC01234 is upregulated in MSI-H colorectal cancer, and is associated with higher infiltration of immune cells, such as T cells, B cells, and natural killer cells, in the tumor microenvironment. LINC01234 is a potential predictive and therapeutic biomarker for MSI-H colorectal cancer, as its expression level can predict the response to immunotherapy, and its knockdown can reduce immune activation and tumor growth.
- CIN: CIN is a subtype of colorectal cancer that is characterized by a high level of chromosomal instability, a condition in which the number or structure of chromosomes in the cells is abnormal. CIN accounts for about 65% of all colorectal cancers, and is associated with worse prognosis and resistance to chemotherapy, a type of treatment that uses drugs to kill cancer cells. RNA-seq has been used to identify novel biomarkers for CIN colorectal cancer, such as:
- CCAT1: CCAT1 is a gene that encodes a lncRNA that is involved in the regulation of cell proliferation and invasion. CCAT1 is overexpressed in CIN colorectal cancer, and is associated with poor survival and metastasis. CCAT1 is also a potential therapeutic target for CIN colorectal cancer, as its inhibition can suppress the growth and migration of cancer cells, and enhance the sensitivity to chemotherapy.
- SOX2: SOX2 is a gene that encodes a transcription factor that is involved in the regulation of stem cell maintenance and differentiation. SOX2 is amplified in CIN colorectal cancer, and is associated with stemness and aggressiveness. SOX2 is also a potential therapeutic target for CIN colorectal cancer, as its knockdown can induce differentiation and apoptosis of cancer stem cells, and reduce the tumorigenicity and chemoresistance.
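Several of the examples above rest on the claim that a candidate's expression level can distinguish one subtype from the others. During validation, that claim is commonly quantified with the area under the ROC curve (AUC). The sketch below computes the AUC directly from its rank-based definition; the function name and the expression values are hypothetical and serve only to illustrate the calculation for a marker that is assumed to be higher in the subtype of interest.

```python
def roc_auc(scores_positive, scores_negative):
    """AUC as the probability that a random positive sample scores higher
    than a random negative sample, with ties counted as one half."""
    wins = 0.0
    for p in scores_positive:
        for n in scores_negative:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))


# Hypothetical expression values (e.g. TPM) for one candidate marker.
subtype_samples = [12.1, 9.8, 15.3, 7.2]
other_samples = [3.4, 8.0, 2.9, 5.1]

print(roc_auc(subtype_samples, other_samples))  # prints 0.9375
```

An AUC near 1.0 indicates that the marker separates the two groups well, while 0.5 corresponds to chance. For a marker that is lower in the subtype of interest, such as a silenced gene, the same function can be applied with the two groups swapped. A value obtained on the discovery cohort is optimistic, so it should be recomputed on an independent cohort.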
In this article, we have discussed how to use RNA-seq data to identify novel biomarkers for cancer diagnosis and treatment, and provided some examples of successful applications of this approach in different types of cancer, such as breast cancer, lung cancer, and colorectal cancer. RNA-seq is a powerful and versatile technique that can reveal the molecular diversity and complexity of cancer, and help to discover new biomarkers that can improve the clinical management and research of cancer. However, RNA-seq also faces some challenges and limitations, such as the high cost, the large amount of data, the variability and noise, and the need for validation and integration with other data sources. Therefore, there is still room for improvement and innovation in the field of RNA-seq and cancer biomarker discovery.