A novel technique for advanced data processing and analysis

Statisticians at the National University of Singapore (NUS) have introduced a new technique that marks a significant step forward in capturing and analysing complex data patterns more effectively and accurately. This could pave the way for advancements in various fields of research such as single-cell RNA sequencing.

The team, led by Associate Professor Yao Zhigang together with Research Fellow Su Jiaji from the Department of Statistics and Data Science under the NUS Faculty of Science, pioneered a novel method for effectively estimating low-dimensional manifolds, a generalisation and abstraction of the notion of a curved surface, hidden within high-dimensional data. This approach not only achieves cutting-edge estimation accuracy and convergence rates but also enhances computational efficiency through the utilisation of deep Generative Adversarial Networks (GANs).

This research was done in collaboration with Professor Yau Shing-Tung at Tsinghua University. Their findings were published as a methodology paper in PNAS on 24 January 2024.

Overcoming the challenges of data analysis

The introduction of manifold fitting represents a significant advancement in data processing and analysis, addressing the shortcomings of previous approaches.

Conventional approaches in data processing and analysis tend to oversimplify the representation of data, and hence often fail to capture the intricate patterns present in high-dimensional data spaces, such as image databases, genomics, social media data, financial data, and data gathered via Internet of Things (IoT) sensor networks.

Manifold learning techniques have been developed to overcome these challenges by focusing on the intrinsic geometric structures of the data. However, existing manifold learning methods lack robustness and often give rise to inaccuracies and inefficiencies in data analysis. The NUS team came up with a novel technique to address this gap.

“By accurately fitting manifolds, we can reduce data dimensionality while preserving crucial information, including the underlying geometric structure. This represents a major leap in data analysis, enhancing both accuracy and efficiency. By providing a solution that overcomes the limitations of previous methods, our research paves the way for enhanced data analysis and offers valuable insights for diverse applications in the scientific community,” said Assoc Prof Yao.

Illustration of fitting the latent manifold using the Cycle Generative Adversarial Network (CycleGAN)

CycleGAN is a deep learning technique for unsupervised image-to-image translation. In the real world, data, such as the images shown in panel (a), are often high-dimensional vectors. These vectors typically reside around a low-dimensional latent manifold, depicted by the black dotted curve in panel (b). The CycleGAN framework, detailed in panel (c), effectively learns to estimate this latent manifold (illustrated as the red curve in panel (b)). This advancement facilitates nonlinear interpolation and denoising within the high-dimensional ambient space (panel (d)), offering significant improvements in data processing and analysis.
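The idea of noisy data sitting around a low-dimensional manifold can be illustrated with a toy example. The sketch below is not the CycleGAN method from the paper: it assumes the latent manifold is known to be a circle in the plane, fits that circle by ordinary least squares, and denoises each point by projecting it onto the fitted curve. The function names, noise level and sample size are illustrative choices only.

```python
import numpy as np

def fit_circle(points):
    """Least-squares circle fit; returns the centre (2,) and the radius."""
    x, y = points[:, 0], points[:, 1]
    # Solve x^2 + y^2 = 2*a*x + 2*b*y + c for (a, b, c) in the least-squares sense.
    A = np.column_stack([2 * x, 2 * y, np.ones_like(x)])
    rhs = x**2 + y**2
    (a, b, c), *_ = np.linalg.lstsq(A, rhs, rcond=None)
    centre = np.array([a, b])
    radius = np.sqrt(c + a**2 + b**2)
    return centre, radius

def project_to_circle(points, centre, radius):
    """Move each point to the nearest point on the fitted circle."""
    offsets = points - centre
    norms = np.linalg.norm(offsets, axis=1, keepdims=True)
    return centre + radius * offsets / norms

def mean_distance_to_true_circle(points):
    """Average distance from the points to the true unit circle."""
    return np.mean(np.abs(np.linalg.norm(points, axis=1) - 1.0))

# Toy data (assumption): points on the unit circle plus Gaussian noise.
rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 2.0 * np.pi, 500)
clean = np.column_stack([np.cos(angles), np.sin(angles)])
noisy = clean + rng.normal(scale=0.1, size=clean.shape)

centre, radius = fit_circle(noisy)
denoised = project_to_circle(noisy, centre, radius)

print("mean distance before:", mean_distance_to_true_circle(noisy))
print("mean distance after: ", mean_distance_to_true_circle(denoised))
```

In this toy setting the shape of the manifold is given in advance. The harder problem described in the article is estimating a manifold whose shape is unknown and which sits in a space of far higher dimension.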

Applications in RNA sequencing and biodata analysis

The novel manifold fitting method has potential applications in areas such as RNA sequencing and the processing of biological data. Single-cell RNA sequencing (scRNA-seq) data is inherently noisy, with disruptions stemming from biological variability and technical inaccuracies that can skew gene expression analysis and complicate assessments of cell similarity, especially within diverse populations. Traditional methods, including advanced deep learning techniques, often falter in precisely delineating cell relationships amidst this pervasive noise. In response, the NUS researchers introduced an innovative pipeline framework aimed at refining clustering accuracy and enhancing data visualisation in scRNA-seq research.

Manifold fitting can also be integrated with deep learning to create a unified, low-dimensional representation of multi-modal biological data. This integration is expected to enhance the precision and effectiveness of disease prediction models, particularly for complex neurological conditions. By reducing data dimensionality while maintaining its essential features, the method can offer a more holistic view of disease mechanisms, which would advance the field of personalised medicine.

Looking ahead, Assoc Prof Yao’s research team is developing a new framework to process even more complex data, such as single-cell RNA sequence data, while continuing to collaborate with the YMSC team. This ongoing work promises to revolutionise the approach to the reduction and processing of complex datasets, potentially offering new insights into a range of scientific fields.

Source: National University of Singapore

Yao Z, Su J, Yau ST. (2024) Manifold fitting with CycleGAN. PNAS 121(5):e2311436121.
