scSiameseClu Achieves SOTA Performance in Unsupervised Single-Cell Clustering, Validated on 7 Datasets

A clustering tool that can truly preserve cell differences

A research team from the Chinese Academy of Sciences, Northeast Agricultural University, the University of Macau, and Jilin University jointly proposed a novel Siamese clustering framework, scSiameseClu, for interpreting single-cell RNA-seq data. This framework can effectively alleviate the problem of representation collapse, achieve clearer classification of cell populations, and provide a powerful tool for the analysis of scRNA-seq data.

Life science research has long focused on the "population" level. Traditional bulk RNA sequencing (bulk RNA-seq) measures the average gene expression of a group of cells, which means that the characteristics of rare cells can be masked. Researchers now increasingly want to hear the voice of the "single" cell.

Single-cell RNA sequencing (scRNA-seq) is the technology that makes this possible. It captures the comprehensive gene expression information of an individual cell amid the noise of a cell population, revealing complex characteristics that would otherwise stay hidden. Making sense of this information requires a key step, cell clustering, in which cells are grouped according to the similarity of their gene expression. This process is full of challenges.

scRNA-seq data is characterized by high noise, high sparsity, and high dimensionality. Even the most effective current graph neural network (GNN) methods suffer from "insufficient graph construction" and "representation collapse." As shown in the figure below, for both the deep-learning-based scNAME and the GNN-based scGNN, the cell embeddings gradually converge toward one another, which indicates representation collapse to varying degrees. In other words, the field still lacks a clustering tool that can truly preserve the differences between cells.

Similarity distribution of cell embeddings of scNAME and scGNN on the same dataset
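One simple way to see what such a similarity distribution measures is to compute the cosine similarity between every pair of cell embeddings. The sketch below is only an illustration on synthetic data: the function name, the embedding sizes, and the way the "collapsed" embeddings are simulated are all assumptions, not taken from the paper. When embeddings collapse, the pairwise similarities pile up near 1.

```python
import numpy as np

def pairwise_cosine_similarities(embeddings, eps=1e-12):
    """Return the cosine similarity of every distinct pair of cells."""
    z = np.asarray(embeddings, dtype=float)
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + eps)
    sim = z @ z.T
    rows, cols = np.triu_indices(z.shape[0], k=1)  # skip self-pairs and duplicates
    return sim[rows, cols]

rng = np.random.default_rng(0)
# Hypothetical embeddings: 200 cells, 16 latent dimensions.
spread_out = rng.normal(size=(200, 16))
# Simulated "collapsed" embeddings: every cell sits near the same point.
collapsed = np.ones((200, 16)) + 0.05 * rng.normal(size=(200, 16))

spread_sims = pairwise_cosine_similarities(spread_out)
collapsed_sims = pairwise_cosine_similarities(collapsed)
print("mean similarity, spread-out embeddings:", spread_sims.mean())
print("mean similarity, collapsed embeddings:", collapsed_sims.mean())
```

A histogram of these values is one way to obtain a plot of the kind described in the caption above.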

To solve this dilemma, a research team from the Chinese Academy of Sciences, Northeast Agricultural University, the University of Macau, and Jilin University jointly proposed a novel Siamese clustering framework, scSiameseClu, for interpreting single-cell RNA-seq data. It aims to capture and refine complex intercellular information while learning discriminative and robust representations at the gene and cell feature levels. The framework integrates three key modules: dual augmentation, Siamese fusion, and optimal transport clustering. Through this design, scSiameseClu can effectively alleviate the problem of representation collapse, achieve clearer classification of cell populations, and provide a powerful tool for the analysis of scRNA-seq data.

The related research, titled "scSiameseClu: A Siamese Clustering Framework for Interpreting single-cell RNA Sequencing Data," was accepted to IJCAI 2025, and the preprint is available on arXiv.

Research highlights:

* scSiameseClu can capture complex information from gene expression and cell graphs to learn discriminative and robust cell embeddings, improving clustering results and downstream tasks;

* Key modules are introduced to construct a complete framework of "augmentation - fusion - clustering";

* scSiameseClu outperforms the state-of-the-art (SOTA) methods in clustering and other biological tasks.

Paper address: https://go.hyper.ai/00BhP

Seven real datasets covering multiple tissues and species

To comprehensively evaluate the performance of scSiameseClu, the research team conducted experiments on seven real scRNA-seq datasets. Genes expressed in fewer than three cells were filtered out. The data was then normalized and log-transformed (logTPM), and highly variable genes were selected according to predefined mean and dispersion thresholds. The preprocessed datasets consist of three mouse samples and four human samples, covering a variety of tissues (e.g., retina, lung, liver, kidney, and pancreas), with different numbers of genes, numbers of cell types, and sparsity rates. The figure below gives an overview of the datasets used.
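The preprocessing steps described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the team's pipeline: the function name, the scaling factor `target_sum`, the threshold values, and the use of variance/mean as the dispersion measure are all assumptions, since the article only says that the thresholds were "predefined."

```python
import numpy as np

def preprocess(counts, min_cells=3, target_sum=1e4,
               min_mean=0.0125, max_mean=3.0, min_disp=0.5):
    """Filter, normalize, log-transform, and select highly variable genes.

    counts: raw count matrix of shape (cells, genes).
    Returns the processed matrix and the original indices of the kept genes.
    All threshold defaults are hypothetical placeholders.
    """
    counts = np.asarray(counts, dtype=float)

    # 1) Drop genes expressed in fewer than `min_cells` cells.
    keep = (counts > 0).sum(axis=0) >= min_cells
    x = counts[:, keep]

    # 2) Normalize each cell to the same total count.
    lib_size = x.sum(axis=1, keepdims=True)
    lib_size[lib_size == 0] = 1.0
    x = x / lib_size * target_sum

    # 3) Log-transform.
    x = np.log1p(x)

    # 4) Select highly variable genes by mean and dispersion thresholds.
    mean = x.mean(axis=0)
    disp = np.where(mean > 0, x.var(axis=0) / np.maximum(mean, 1e-12), 0.0)
    hvg = (mean >= min_mean) & (mean <= max_mean) & (disp >= min_disp)

    return x[:, hvg], np.flatnonzero(keep)[hvg]
```

In practice, libraries such as Scanpy provide these steps with more careful dispersion normalization; the version here only shows the order of operations.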

Overview of seven scRNA-seq datasets

Three modules of the Siamese clustering framework

The scSiameseClu proposed by the research team is a Siamese clustering framework based on an augmented graph autoencoder. This framework includes three modules: (i) Dual Augmentation Module; (ii) Siamese Fusion Module; (iii) Optimal Transport Clustering strategy for self-supervised learning.

Overview of the scSiameseClu architecture

Dual Augmentation Module

In this study, the dual augmentation module combines gene expression augmentation with cell graph augmentation. To improve the model's robustness to noise and its ability to generalize across datasets, the research team added Gaussian noise to the expression data to simulate the natural fluctuations of gene expression, which improves robustness at the gene level. At the cell level, edge perturbation and graph diffusion each generate an augmented adjacency matrix, so that the cell graph is processed from different but complementary perspectives and the model can capture diverse interactions between cells.
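A minimal sketch of the three augmentations follows. The article does not give the exact formulas, so the details are assumptions: edge perturbation is implemented here as random edge dropping, graph diffusion as the closed-form personalized PageRank kernel, and the parameters `sigma`, `drop_prob`, and `alpha` are hypothetical.

```python
import numpy as np

def add_gaussian_noise(x, sigma=0.1, rng=None):
    """Gene-level augmentation: add zero-mean Gaussian noise to expression."""
    rng = np.random.default_rng() if rng is None else rng
    return x + rng.normal(0.0, sigma, size=x.shape)

def perturb_edges(adj, drop_prob=0.1, rng=None):
    """Graph augmentation 1: randomly drop edges of an undirected cell graph."""
    rng = np.random.default_rng() if rng is None else rng
    upper = np.triu(adj, k=1)                     # each undirected edge once
    mask = rng.random(upper.shape) >= drop_prob   # True = keep the edge
    kept = upper * mask
    return kept + kept.T                          # restore symmetry

def ppr_diffusion(adj, alpha=0.2):
    """Graph augmentation 2: personalized PageRank diffusion,
    alpha * (I - (1 - alpha) * A_norm)^-1, with A_norm the symmetrically
    normalized adjacency matrix including self-loops."""
    n = adj.shape[0]
    a = adj + np.eye(n)
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    a_norm = a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return alpha * np.linalg.inv(np.eye(n) - (1.0 - alpha) * a_norm)
```

Edge dropping gives a sparser, local view of the graph, while diffusion gives a dense view that also links cells several hops apart, which is one reading of "different but complementary perspectives."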

Siamese Fusion Module

The Siamese Fusion Module (SFM) is the core innovation of scSiameseClu. It combines two strategies: "cross-correlation refinement" and "adaptive information fusion." The former builds autoencoders that process the augmented gene expression matrix and the augmented cell graph matrix separately, then aligns and fuses them in the latent space. The latter integrates cell relationships through embedding aggregation, autocorrelation learning, and dynamic recombination, filtering out redundant information while retaining discriminative features in the latent space. Together they allow the model to learn robust and meaningful representations, which improves clustering performance and avoids representation collapse.
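To make the idea of aligning two views through their cross-correlation more concrete, here is a small sketch of one common formulation: compute the cosine similarity between the cells of the two views and penalize its distance from the identity matrix. This is an assumption about the general technique, not the paper's exact loss, and the function name is hypothetical.

```python
import numpy as np

def cross_correlation_loss(z1, z2, eps=1e-8):
    """Mean squared distance between the cross-view cell similarity matrix
    and the identity matrix.

    z1, z2: embeddings of the same cells from two augmented views,
    each of shape (cells, latent_dim).
    """
    z1 = z1 / (np.linalg.norm(z1, axis=1, keepdims=True) + eps)
    z2 = z2 / (np.linalg.norm(z2, axis=1, keepdims=True) + eps)
    sim = z1 @ z2.T                      # (cells, cells) cosine similarities
    n = sim.shape[0]
    # Diagonal -> 1: each cell agrees with itself across views.
    # Off-diagonal -> 0: different cells stay distinguishable.
    return np.mean((sim - np.eye(n)) ** 2)

# Perfectly distinct cells give a loss near 0.
distinct = np.eye(4)
print(cross_correlation_loss(distinct, distinct))

# Fully collapsed embeddings (all cells identical) are heavily penalized.
collapsed = np.ones((4, 3))
print(cross_correlation_loss(collapsed, collapsed))
```

The off-diagonal penalty is what discourages collapse in this formulation: if all cells map to the same point, every off-diagonal similarity equals 1 and the loss is large.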

In addition, the framework introduces a propagation regularization term that uses the Jensen-Shannon divergence to constrain the consistency between the original embedding and the embedding after graph propagation, alleviating the over-smoothing problem of graph neural networks while maintaining information flow.
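A rough sketch of such a regularizer is shown below. It assumes that embeddings are turned into per-cell probability distributions with a softmax and that propagation is a single multiplication by a normalized adjacency matrix `a_norm`; both choices, and the function names, are illustrative assumptions rather than details from the article.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax, shifted for numerical stability."""
    shifted = x - x.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def js_divergence(p, q, eps=1e-12):
    """Mean Jensen-Shannon divergence between matching rows of p and q."""
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log((p + eps) / (m + eps)), axis=1)
    kl_qm = np.sum(q * np.log((q + eps) / (m + eps)), axis=1)
    return np.mean(0.5 * (kl_pm + kl_qm))

def propagation_regularizer(z, a_norm):
    """Penalize disagreement between embeddings before and after one
    propagation step over the cell graph."""
    return js_divergence(softmax(z), softmax(a_norm @ z))
```

The Jensen-Shannon divergence is symmetric and bounded above by log 2, so the term stays well behaved even when the two distributions differ strongly.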

Optimal Transport Clustering

The research team first uses the Student's t-distribution to calculate the similarity between cells and cluster centers, and then aligns and corrects the predicted distribution with the Sinkhorn algorithm, which keeps the cluster distribution balanced and avoids the collapse problem.
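The two steps can be sketched as follows. The soft assignment uses the Student's t kernel familiar from deep embedded clustering, and the balancing step uses plain Sinkhorn-Knopp row and column scaling toward uniform cluster sizes. The degrees of freedom `nu`, the uniform target, and the iteration count are assumptions; the article does not specify the team's exact cost or regularization.

```python
import numpy as np

def student_t_assignment(z, centers, nu=1.0):
    """Soft assignment of each cell to each cluster center."""
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    q = (1.0 + d2 / nu) ** (-(nu + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def sinkhorn_balance(q, n_iters=100):
    """Rescale q so every row sums to 1 and every cluster receives
    (approximately) the same total mass, n_cells / n_clusters."""
    p = np.asarray(q, dtype=float).copy()
    n, k = p.shape
    for _ in range(n_iters):
        p = p / p.sum(axis=0, keepdims=True) / k   # columns sum to 1/k
        p = p / p.sum(axis=1, keepdims=True) / n   # rows sum to 1/n
    return p * n                                   # rows sum to 1

rng = np.random.default_rng(0)
z = rng.normal(size=(30, 2))          # hypothetical cell embeddings
centers = rng.normal(size=(3, 2))     # hypothetical cluster centers

q = student_t_assignment(z, centers)
p = sinkhorn_balance(q)
print("cluster mass before balancing:", q.sum(axis=0))
print("cluster mass after balancing: ", p.sum(axis=0))
```

Because the balanced target cannot assign every cell to the same cluster, training the predicted distribution toward it rules out the degenerate single-cluster solution.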

Multiple experiments verify the performance of the scSiameseClu framework

The clustering performance of scSiameseClu is backed by extensive experimental validation. First, the team ran a comprehensive comparison with mainstream methods. It selected nine current state-of-the-art baseline models, including traditional clustering methods, methods based on deep neural networks, and clustering methods based on graph neural networks. Using the seven real datasets described above, it evaluated them with three widely recognized clustering metrics: ACC (accuracy), NMI (normalized mutual information), and ARI (adjusted Rand index).
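For readers unfamiliar with these metrics, the sketch below computes all three from a contingency table. It is a didactic version under stated assumptions: ACC searches all label permutations, which is only practical for a handful of clusters (real code would typically use the Hungarian algorithm), and NMI is normalized by the arithmetic mean of the two entropies, a choice the article does not specify. The label vectors are made up.

```python
import itertools
from math import comb

import numpy as np

def contingency(y_true, y_pred):
    """Count table: rows are true classes, columns are predicted clusters."""
    _, t_idx = np.unique(np.asarray(y_true).ravel(), return_inverse=True)
    _, p_idx = np.unique(np.asarray(y_pred).ravel(), return_inverse=True)
    table = np.zeros((t_idx.max() + 1, p_idx.max() + 1), dtype=int)
    np.add.at(table, (t_idx, p_idx), 1)
    return table

def clustering_accuracy(y_true, y_pred):
    """ACC: accuracy under the best one-to-one cluster-to-class mapping."""
    table = contingency(y_true, y_pred)
    k = max(table.shape)
    padded = np.zeros((k, k), dtype=int)
    padded[:table.shape[0], :table.shape[1]] = table
    best = max(sum(padded[i, perm[i]] for i in range(k))
               for perm in itertools.permutations(range(k)))
    return best / table.sum()

def adjusted_rand_index(y_true, y_pred):
    """ARI: pair-counting agreement corrected for chance."""
    table = contingency(y_true, y_pred)
    sum_cells = sum(comb(int(v), 2) for v in table.ravel())
    sum_rows = sum(comb(int(v), 2) for v in table.sum(axis=1))
    sum_cols = sum(comb(int(v), 2) for v in table.sum(axis=0))
    expected = sum_rows * sum_cols / comb(int(table.sum()), 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

def normalized_mutual_info(y_true, y_pred):
    """NMI: mutual information divided by the mean of the two entropies."""
    pij = contingency(y_true, y_pred) / len(np.asarray(y_true).ravel())
    pi, pj = pij.sum(axis=1), pij.sum(axis=0)
    nz = pij > 0
    mi = np.sum(pij[nz] * np.log(pij[nz] / np.outer(pi, pj)[nz]))
    entropy = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
    denom = 0.5 * (entropy(pi) + entropy(pj))
    return 1.0 if denom == 0 else mi / denom

# Made-up labels: the first prediction is perfect up to renaming clusters,
# the second misplaces one cell.
y_true = [0, 0, 1, 1, 2, 2]
renamed = [1, 1, 0, 0, 2, 2]
one_error = [0, 0, 1, 1, 1, 2]
for name, pred in [("renamed", renamed), ("one_error", one_error)]:
    print(name, clustering_accuracy(y_true, pred),
          adjusted_rand_index(y_true, pred),
          normalized_mutual_info(y_true, pred))
```

All three metrics ignore the arbitrary numbering of clusters, which is why the renamed prediction scores as a perfect match.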

The results show that scSiameseClu has a clear advantage on these three metrics. It not only achieves higher overall scores but also performs stably across datasets. For example, the visual comparison on the human hepatocyte dataset shows that, compared with the other baseline models, scSiameseClu produces well-separated clusters with clear boundaries and effectively distinguishes different cell types.

Visualization results of scSiameseClu and four typical benchmark methods on human hepatocytes

Second, for downstream tasks, the research team performed cell type annotation. On the human pancreas dataset, they used the Seurat tool to identify differentially expressed genes and marker genes, and compared the top 50 marker genes identified by scSiameseClu and other methods with the gold standard. The results showed that the similarity of most clusters exceeded 90% and that the clusters corresponded accurately to known cell types. The model could also identify the marker genes of each cluster.
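The article does not say how this similarity is defined. One plausible reading, shown here purely as an assumption, is the fraction of a cluster's top marker genes that also appear in the gold-standard marker list for the matching cell type. The gene lists below are short made-up examples (with `GENE_X` as a placeholder), not data from the study.

```python
def marker_overlap(predicted_markers, reference_markers):
    """Fraction of predicted marker genes found in the reference set."""
    predicted = set(predicted_markers)
    if not predicted:
        return 0.0
    return len(predicted & set(reference_markers)) / len(predicted)

# Hypothetical example: 3 of the 4 predicted markers match the reference.
predicted = ["INS", "IAPP", "MAFA", "GENE_X"]
reference = ["INS", "IAPP", "MAFA", "PDX1"]
print(marker_overlap(predicted, reference))  # 0.75
```

With top-50 lists, a value above 0.9 under this definition would mean that at least 46 of the 50 genes are shared with the reference.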

Further cell classification experiments also showed that scSiameseClu outperformed the baseline models in multiple metrics such as accuracy and F1 score, verifying its advantages in revealing cell heterogeneity and type discrimination.

Overlap between differentially expressed genes and gold standard cell types

Comparison of classification performance

Finally, in the ablation experiment on the Shekhar mouse retinal cell dataset, the research team removed key components of scSiameseClu (including the SFM loss, ZINB loss, and OTC loss) and compared each variant with the complete model to evaluate the contribution of each module. The results showed that every component significantly improves performance and that removing any one of them degrades the results. Breaking the SFM module down further, performance dropped when cell-related refinement, latent-related refinement, propagation regularization, or reconstruction loss was removed individually. The full scSiameseClu model, with all components included, showed a significant improvement over these ablated variants, indicating that it effectively integrates gene and cell information.

Ablation experiment on the Shekhar mouse retinal cell dataset

Computational biology enters a new era of rapid development

From the perspective of computational biology, scSiameseClu applies computer science methods such as dual augmentation, Siamese fusion, and optimal transport clustering to effectively address the long-standing problem of analyzing cell heterogeneity. It is not only a new clustering tool but also one of many emerging attempts to deeply integrate computational methods with the life sciences. With the rapid development of artificial intelligence algorithms and biology, new results continue to emerge.

A research team led by Professor Yang Zhang at the National University of Singapore proposed DRfold2, a high-precision RNA structure prediction framework based on deep learning. It integrates a pre-trained RNA composite language model (RCLM) and a denoising structure module for end-to-end RNA structure prediction. The related results, titled "Ab initio RNA structure prediction with composite language model and denoised end-to-end learning," have been published on the preprint platform bioRxiv.

Paper address: https://www.biorxiv.org/content/10.1101/2025.03.05.641632v1

A research team from Baylor College of Medicine in the United States proposed DeepMVP, a deep-learning-based framework for predicting protein post-translational modifications (PTMs). It integrates the high-quality PTMAtlas dataset for accurate prediction of PTM sites and of changes caused by missense variants. The related results, titled "DeepMVP: deep learning models trained on high-quality data accurately predict PTM sites and variant-induced alterations," were published in Nature Methods.

Paper address: https://www.nature.com/articles/s41592-025-02797-x

This article is from the WeChat public account "HyperAI Super Neuro". The author is Paida Xinghang. It is published by 36Kr with authorization.

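The views expressed in this article are solely those of the author. The 36Kr platform provides information storage services only.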