In an effort to address a major challenge when analyzing large single-cell RNA-sequencing datasets, researchers from The University of Texas MD Anderson Cancer Center have developed a new computational technique to accurately differentiate between data from cancer cells and the variety of normal cells found within tumor samples. The work was published today in Nature Biotechnology.
The new tool, dubbed CopyKAT (copy number karyotyping of aneuploid tumors), allows researchers to more easily examine the complex data obtained from large single-cell RNA-sequencing experiments, which deliver gene expression data from many thousands of individual cells.
CopyKAT uses that gene expression data to look for aneuploidy, or the presence of abnormal chromosome numbers, which is common in most cancers, said study senior author Nicholas Navin, Ph.D., associate professor of Genetics and Bioinformatics & Computational Biology. The tool also helps to identify distinct subpopulations, or clones, within the cancer cells.
“We developed CopyKAT as a tool to infer genetic information from the transcriptome data. By applying this tool to several datasets, we showed that we could unambiguously identify, with about 99% accuracy, tumor cells versus the other immune or stromal cells present in a mixed tumor sample,” Navin said. “We could then go one step further to discover the subclones present and understand their genetic differences.”
Historically, tumors have been studied as a mixture of all cells present, many of which are not cancerous. The advent of single-cell RNA sequencing in recent years has enabled researchers to analyze tumors at much greater resolution, examining the gene expression of each individual cell to develop a picture of the tumor landscape, including the surrounding microenvironment.
However, it’s not easy to distinguish between cancer cells and normal cells without a reliable computational approach, Navin explained. Former postdoctoral fellow Ruli Gao, Ph.D., now assistant professor of Cardiovascular Sciences at Houston Methodist Research Institute, developed the CopyKAT algorithms, which improve upon older techniques by increasing accuracy and adjusting for the newest generation of single-cell RNA-sequencing data.
Overview of the CopyKAT analysis workflow
a, The CopyKAT workflow takes a unique molecular identifier (UMI) count matrix as input, orders genes by their genomic positions, applies a log-Freeman–Tukey transformation to the raw counts to stabilize variance, and smooths outliers using a polynomial dynamic linear model (DLM). b, A subset of normal cells is defined using integrative clustering and a Gaussian mixture model (GMM) to infer the copy number (CN) baseline. c, Relative gene expression values in single cells are used for Markov chain Monte Carlo (MCMC) segmentation, and segments are merged by Kolmogorov–Smirnov (KS) testing. d, Aneuploid tumor and normal cell clusters are classified using normal cell enrichment and GMM distribution tests. e, The clonal substructure of tumor cells is delineated by clustering, and subclones are used for differential expression analysis.
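As an illustration of panel a and the relative expression values used in panel c, the short Python sketch below orders genes by genomic position, applies a log-Freeman–Tukey transformation to raw counts, and subtracts a normal-cell baseline. It is a minimal sketch under stated assumptions, not the CopyKAT implementation: the function names, the toy gene labels and counts, and the use of the natural logarithm are all hypothetical choices made for this example, and the baseline is simply supplied by hand rather than inferred by clustering and a GMM as in panel b.

```python
import math


def log_freeman_tukey(count):
    """Log of the Freeman-Tukey transform sqrt(x) + sqrt(x + 1) for one count.

    The natural log is an assumption made for this sketch.
    """
    return math.log(math.sqrt(count) + math.sqrt(count + 1))


def order_genes_by_position(gene_positions):
    """Return gene names sorted by (chromosome index, start coordinate)."""
    return sorted(gene_positions, key=lambda gene: gene_positions[gene])


def relative_expression(cell_counts, baseline_counts, gene_order):
    """Transformed expression of one cell minus a baseline, in genomic order."""
    return [
        log_freeman_tukey(cell_counts[gene]) - log_freeman_tukey(baseline_counts[gene])
        for gene in gene_order
    ]


# Hypothetical toy data: three genes on two chromosomes.
gene_positions = {"GENE_C": (2, 500), "GENE_A": (1, 100), "GENE_B": (1, 900)}
tumor_cell = {"GENE_A": 8, "GENE_B": 8, "GENE_C": 0}
normal_baseline = {"GENE_A": 3, "GENE_B": 3, "GENE_C": 3}  # assumed, not inferred

gene_order = order_genes_by_position(gene_positions)
profile = relative_expression(tumor_cell, normal_baseline, gene_order)

for gene, value in zip(gene_order, profile):
    print(f"{gene}: {value:+.3f}")
```

In this toy profile, the two genes on the first chromosome sit above the baseline and the gene on the second sits below it. A pattern like that, sustained across many neighboring genes, is the kind of signal the segmentation step in panel c is described as picking up.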
The team first benchmarked its tool by comparing results to whole-genome sequencing data, which showed high accuracy in predicting copy number changes. In three additional datasets from pancreatic cancer, triple-negative breast cancer and anaplastic thyroid cancer, the researchers showed that CopyKAT was accurate in distinguishing between tumor cells and normal cells in mixed samples.
These analyses were made possible through collaborations with Stephen Y. Lai, M.D., Ph.D., professor of Head and Neck Surgery, as well as Stacy Moulder, M.D., professor of Breast Medical Oncology, and the Breast Cancer Moon Shot®, part of MD Anderson’s Moon Shots Program®, a collaborative effort to rapidly develop scientific discoveries into meaningful clinical advances that save patients’ lives.
In analyzing these samples, the researchers also showed the tool is effective in identifying subpopulations of cancer cells within the tumor based on copy number differences, as confirmed by experiments in triple-negative breast cancers.
“By using CopyKAT, we were able to identify rare subpopulations within triple-negative breast cancers that have unique genetic alterations not widely reported, including those with potential therapeutic implications,” Gao said. “We hope this tool will be useful to the research community to make the most of their single-cell RNA-sequencing data and to drive new discoveries in cancer.”
The authors note that, because it relies on detecting aneuploidy, the tool is not applicable to the study of all cancer types. Aneuploidy is relatively rare in pediatric and hematologic cancers, for example.
Availability – The tool is freely available to researchers at: https://github.com/navinlabcode/copykat.