
Breaking through the limitations of traditional multi-modal integration, MIT and ETH Zurich propose the APOLLO framework, achieving a clear separation of shared and modality-specific information.

HyperAI · 2026-03-04 17:14
Unlocking the precise decoupling and mechanistic interpretation of single-cell multimodal data

A joint research team from the Massachusetts Institute of Technology (MIT) and the Swiss Federal Institute of Technology in Zurich (ETH Zurich) has proposed a computational framework called APOLLO, an autoencoder that learns a partially overlapping latent space through latent optimization. By explicitly modeling shared information and modality-specific information, it offers a practical path toward more comprehensive and accurate analysis of cell states and their regulatory logic.

In the field of single-cell biology research, the rapid development of measurement technologies is continuously expanding the boundaries of scientific exploration. Breakthroughs in areas such as multiplex imaging, single-cell RNA sequencing (scRNA-seq), single-cell assay for transposase-accessible chromatin sequencing (scATAC-seq), and protein abundance detection have enabled researchers to conduct panoramic observations of individual cells from multiple dimensions, including transcriptional regulation, chromatin state, protein expression, and morphological structure. These multi-modal data interpret the genetic code of life from different levels, and their complementary integration provides unprecedented opportunities for revealing cell heterogeneity and exploring disease mechanisms.

However, current analysis methods still have significant limitations when dealing with these high-throughput data. The mainstream strategy often involves analyzing each modality separately and then making comparisons, which is not only inefficient but also difficult to capture the deep associations between modalities. Another type of method integrates multi-modal data into the same latent space through representation learning, but often confuses shared information with modality-specific information, blurring the unique contributions of each dimension to cell function.

This problem is particularly prominent in the integrated analysis of paired scATAC-seq and scRNA-seq data. Traditional methods often coarsen chromatin accessibility to the gene level for comparison with gene expression. Although this approach simplifies the problem, it may discard fine structural information at the chromatin level and is only applicable to data types with relatively consistent features. More complex integration methods such as linear models and generative adversarial networks either have difficulty adapting to unstructured data such as imaging or perform poorly in separating shared and specific information, making it difficult to meet the growing demand for multi-modal data analysis of large biological sample libraries.
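As a concrete illustration of the gene-level coarsening described above, the sketch below sums peak-level scATAC counts into per-gene activity scores; the counts and the peak-to-gene assignment are invented for illustration, and real pipelines use genomic coordinates rather than an explicit mapping list.

```python
import numpy as np

# Toy example of gene-level coarsening of scATAC-seq data: peak counts
# (cells x peaks) are summed per gene, discarding the fine peak-level
# structure the article mentions. All values below are invented.
peak_counts = np.array([
    [2, 0, 1, 3, 0],
    [0, 1, 0, 2, 4],
    [1, 1, 2, 0, 0],
])
peak_to_gene = ["geneA", "geneA", "geneB", "geneB", "geneB"]

genes = sorted(set(peak_to_gene))
gene_activity = np.zeros((peak_counts.shape[0], len(genes)), dtype=int)
for j, gene in enumerate(peak_to_gene):
    gene_activity[:, genes.index(gene)] += peak_counts[:, j]

print(gene_activity)  # one "gene activity" score per cell and gene
```

The coarsened matrix is directly comparable to gene expression, which is exactly why the fine chromatin structure is lost in the process.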

Therefore, with the continuous evolution of single-cell technologies and the rapid growth of data scale, how to efficiently and automatically integrate multi-modal data while clearly decoupling shared information and modality-specific information has become a core challenge in current single-cell biology.

In response to this challenge, the joint research team from MIT and ETH Zurich proposed a general deep learning computational framework called APOLLO (Autoencoder with a Partially Overlapping Latent space learned through Latent Optimization). This framework provides a feasible technical path for more comprehensive and accurate analysis of cell states and their regulatory logic by explicitly modeling shared information and modality-specific information.

The relevant research results, titled "Partially shared multi-modal embedding learns holistic representation of cell state", have been published in Nature Computational Science.

Research Highlights:

* This study proposes a general deep learning framework, APOLLO, that can automatically and explicitly decouple "shared information" and "modality-specific information" in multi-modal data.

* APOLLO learns a partially overlapping latent space by equipping each modality with an autoencoder and adopting a two-step training strategy, thereby effectively identifying and distinguishing the biological signals captured jointly by multiple modalities.

* APOLLO can reveal the associations between differences in protein subcellular localization and the morphology of different cell compartments, thus extending the analysis from pure omics data to the field of spatial morphology.

Paper Link: https://www.nature.com/articles/s43588-025-00948-w

Dataset: Comprehensive Validation Covering Sequencing and Imaging

To comprehensively evaluate the performance of the APOLLO framework, the research used multiple publicly available multi-modal single-cell datasets covering two types of technologies: sequencing and imaging.

In terms of sequencing data, the researchers first used paired single-cell transcriptome (scRNA-seq) and chromatin accessibility (scATAC-seq) data measured by the SHARE-seq technology to verify whether APOLLO can automatically identify and distinguish the gene activities captured jointly by the transcriptome and chromatin accessibility, as well as those captured by only one of the modalities.

Secondly, the researchers used paired scRNA-seq and cell surface protein abundance data obtained by CITE-seq to further test the applicability of the model to sequencing data. This CITE-seq dataset is derived from mouse spleens and lymph nodes and contains two sets of wild-type mouse samples processed through independent experiments. It can not only be used to evaluate the ability to distinguish cell types but also reveals the experimental batch effects caused by different mouse individual sources.

In terms of imaging data, the researchers introduced a multiplex imaging dataset of human peripheral blood mononuclear cells (PBMCs), covering a total of 32,345 cells from 40 patients. The diagnostic results are divided into four categories: healthy, meningioma, glioma, and head and neck tumors. For each patient, two sets of imaging data based on different antibody combinations were collected: one set uses DAPI to label chromatin and combines with CD4, CD8, and CD16 antibody staining; the other set also uses DAPI staining but combines with lamin, CD3, and γH2AX antibody staining.

Through testing with this dataset, it was found that APOLLO can identify the cell state information shared by the two modalities in chromatin structure and protein localization, as well as the morphological features captured by only one modality. In addition, by combining additional cell staining markers such as microtubules and endoplasmic reticulum, the research also used the multiplex imaging data from the Human Protein Atlas (HPA) to prove that APOLLO can be used to reveal the associations between differences in protein subcellular localization and the morphology of different cell compartments.

The APOLLO Model: An Autoencoder with a Latent Optimization Strategy

Existing multi-modal integration methods often conflate shared and modality-specific information. The APOLLO framework proposed in this study, an autoencoder that learns a partially overlapping latent space through latent optimization, is designed to learn and decouple the two automatically. Unlike conventional autoencoders that align all latent dimensions across modalities, APOLLO aligns only a subset of latent dimensions and reserves the remaining dimensions for each modality's specific information, building the separation of shared and specific signals directly into the model design.

In terms of the model architecture, APOLLO equips each data modality with an autoencoder and can introduce additional decoders according to task requirements. The encoders and decoders use neural network structures adapted to specific modalities. For example, convolutional networks are used for imaging data, and fully connected networks are used for gene expression data to fully capture the data characteristics of each modality. The latent space is clearly divided into two parts: shared latent features and modality-specific latent features. The dimension of the shared latent space is usually set to be much larger than that of the modality-specific space to ensure sufficient representation of cross-modal shared information.
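A minimal sketch of this partial overlap, assuming toy dimensions (the actual sizes are task-dependent and not taken from the paper): each modality gets one full latent vector per cell, only the leading shared block is compared across modalities, and the trailing block remains free to encode modality-private signal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: the shared block is deliberately larger than the
# modality-specific block, mirroring the design described above.
N_SHARED, N_SPECIFIC = 16, 4
n_cells = 5

# One free latent vector per cell and per modality, as in latent optimization.
z_rna = rng.normal(size=(n_cells, N_SHARED + N_SPECIFIC))
z_atac = rng.normal(size=(n_cells, N_SHARED + N_SPECIFIC))

def split(z):
    """Partition a latent matrix into shared and modality-specific blocks."""
    return z[:, :N_SHARED], z[:, N_SHARED:]

shared_rna, specific_rna = split(z_rna)
shared_atac, specific_atac = split(z_atac)

# An alignment penalty is applied ONLY to the shared block; the specific
# blocks are never compared across modalities.
align_loss = np.mean((shared_rna - shared_atac) ** 2)
```

Because only the shared block enters the cross-modal penalty, gradients never force modality-specific dimensions to agree, which is what keeps the two kinds of information separated.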

As shown in the figure below, the training process of APOLLO is divided into two steps: The first step focuses on the training of each modality's decoder while synchronously updating the latent space. The core goal is to enable the decoder to accurately reconstruct the input data from the latent space. If the task requires enhancing the representation of shared information and achieving cross-modal prediction, two additional decoders will be introduced to map the shared latent space to each modality respectively, and the training will be completed by minimizing the reconstruction loss.

The second step trains the modality-specific encoders. Each encoder maps its data modality to the corresponding latent space by minimizing the mean squared error against the latents optimized in the first step, which allows the model to infer latent embeddings for samples not seen during training and ensures good generalization.
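The two-step recipe can be sketched with linear maps standing in for the paper's neural networks (a toy single-modality example; all sizes are invented): step one jointly optimizes the decoder and the free per-sample latents by gradient descent on the reconstruction loss, and step two freezes those latents and fits an encoder to them by minimizing mean squared error.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data with a known low-dimensional structure (linear stand-in).
n, d_x, d_z = 50, 8, 3
X = rng.normal(size=(n, d_z)) @ rng.normal(size=(d_z, d_x))

# Step 1: latent optimization -- jointly update the decoder W and the
# free per-sample latents Z to minimize the reconstruction loss.
Z = rng.normal(size=(n, d_z))
W = rng.normal(size=(d_z, d_x))
lr = 0.1
for _ in range(5000):
    R = Z @ W - X              # reconstruction residual
    gW = (Z.T @ R) / n
    gZ = (R @ W.T) / n
    W -= lr * gW
    Z -= lr * gZ

# Step 2: freeze the optimized latents and fit an encoder X -> Z by
# minimizing mean squared error (closed form for a linear encoder).
E, *_ = np.linalg.lstsq(X, Z, rcond=None)

recon_err = np.mean((Z @ W - X) ** 2)  # step-1 objective
enc_err = np.mean((X @ E - Z) ** 2)    # step-2 objective
```

Treating the latents as free parameters in step one is what distinguishes latent optimization from a standard autoencoder, where latents are always produced by the encoder; the encoder here is fitted only afterwards, to enable inference on held-out samples.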

The two-step training process adopted by APOLLO

In terms of model validation, the research first tested the decoupling performance of APOLLO on five simulated datasets with known true latent structures. The results show that the model can maintain stable performance regardless of the dependencies between shared and specific latent features. Further validation on real data shows that the explicit learning of partial information sharing by APOLLO can not only decouple multi-modal information but also achieve accurate cross-modal prediction, such as predicting undetected proteins from chromatin imaging.

Overall, APOLLO effectively decouples and interprets the shared and modality-specific information in multi-modal datasets by learning a partially shared latent space, providing a general framework for exploring biological mechanisms.

Surpassing Traditional Multi-modal Integration Frameworks for a More Comprehensive Understanding of Cell States

To comprehensively evaluate the generality and core advantages of the APOLLO model, the research designed a series of experiments around five directions: paired sequencing data integration, chromatin and protein imaging integration, cross-modal prediction, morphological feature recognition, and exploration of protein subcellular localization.

In the integration of paired sequencing data, SHARE-seq experiments show that adding the modality-specific space to the shared space can significantly improve the accuracy of cell type classification, proving that the specific space can capture biological information not included in the shared space.

The interpretation of the latent space shows that the RNA-specific space is enriched with genes related to the cell cycle, the ATAC-specific space is enriched with chromatin open regions related to transcriptional regulation, and the shared space is enriched with known transcription factors and regulatory pathways, verifying the biological significance of the decoupling results. In the CITE-seq experiment, APOLLO successfully separates cell types and batch effects into the shared space and the RNA-specific space, while existing integration methods cannot achieve such decoupling, highlighting the unique advantages of the model in the integration of sequencing data.

Differentially expressed genes in each latent space

Applying APOLLO to the CITE-seq dataset

For imaging data, APOLLO can accurately reconstruct the cell imaging of patients not involved in training. In the cross-modal task of predicting undetected proteins from chromatin, APOLLO significantly outperforms traditional image inpainting methods. Downstream phenotype classification shows that the classification accuracy based on predicted protein imaging is similar to that of real imaging, with the prediction effect of the CD3 protein being the best, confirming that the prediction results can be effectively used for biological discovery.
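Cross-modal prediction of this kind can be sketched as follows, with random linear maps standing in for the trained encoder of the observed modality and the extra decoder that maps the shared latent block to the missing modality (all dimensions are illustrative, not from the paper).

```python
import numpy as np

rng = np.random.default_rng(2)

# Linear stand-ins for trained networks: an encoder for modality A
# (e.g. chromatin imaging) and the shared-space decoder for modality B
# (e.g. an unmeasured protein channel). Sizes are invented.
N_SHARED, N_SPECIFIC = 16, 4
enc_a = rng.normal(size=(32, N_SHARED + N_SPECIFIC))  # A -> full latent
dec_b = rng.normal(size=(N_SHARED, 24))               # shared -> B

def predict_b_from_a(x_a):
    z = x_a @ enc_a               # full latent inferred from modality A
    z_shared = z[:, :N_SHARED]    # keep only the shared block
    return z_shared @ dec_b       # cross-modal prediction of modality B

x_a = rng.normal(size=(5, 32))
x_b_pred = predict_b_from_a(x_a)
```

Only the shared block is passed to the second decoder, so the prediction uses exactly the information that both modalities are assumed to carry; the A-specific block contributes nothing, by construction.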

Comparison of the model's prediction performance with different input types

In the morphological feature recognition task, the shared space mainly captures chromatin morphological features (such as nuclear area and heterochromatin volume), while protein-specific features such as γH2AX focus count only exist in the corresponding specific space. Feature ablation experiments show that removing this feature will significantly reduce the accuracy of phenotype classification, further verifying the accuracy of the decoupling.

Heatmaps showing the proportion of each representative morphological feature

In the exploration of protein subcellular localization, applying APOLLO to the U2OS cell imaging data, it was found that the differences in protein nuclear localization can be captured by the characteristics of different cell compartments. For example, the nuclear localization of DDB1 is related to the morphology of the endoplasmic reticulum and microtubules, while CLNS1A is only related to the nuclear morphology. This result indicates that the model can be extended to various imaging combinations, providing a new perspective for analyzing the association between protein localization and cell morphology.

APOLLO links protein subcellular localization to the morphology of different cell compartments

Putting Single-cell Multi-modal Data Integration into Practice

Single-cell multi-modal data integration is becoming a core technological direction for analyzing cell heterogeneity, revealing disease mechanisms, and promoting the development of precision medicine, and it has currently attracted wide attention from the global academic community.

For example, the scMTR-seq technology developed by the Peter Rugg-Gunn team at the Babraham Institute of the University of Cambridge has achieved the simultaneous capture of six histone modifications and the entire transcriptome at the single-cell level for the first time, overcoming a decade-long technical bottleneck in the field of epigenetics research.

  • Paper Title: Combinatorial profiling of multiple histone modifications and transcriptome in single cells using scMTR-seq
  • Paper Link: https://www.science.org/doi/10.1126/sciadv.adu3308

The CellFuse framework proposed by the research team at Stanford University constructs a shared embedding space based on supervised contrastive learning and is specifically designed for multi-modal integration scenarios with limited feature overlap. It can achieve accurate prediction of cell types and seamless integration across modalities and experimental conditions. Tests on multiple datasets such as healthy PBMCs, bone marrow, CAR-T treated lymphoma, and tumor tissues show that this framework outperforms existing methods in terms of integration quality and operating efficiency.

  • Paper Title: CellFuse Enables Multi-modal Integration of Single-cell and Spatial Proteomics data
  • Paper Link: https://doi.org/10.1101/2025.07.23.665976

At the same time, leading global biotechnology and healthcare companies are accelerating the deployment of single-cell multi-modal data integration technologies, focusing on core scenarios such as clinical translation, drug development, and precision diagnosis and treatment, and promoting the transformation of cutting-edge research results into practical applications. The German company BioNTech has applied this technology to tumor immunotherapy and personalized vaccine development. By integrating single-cell