
When AI reads the entire "book of genes" for the first time, what can a single-cell large model with one billion parameters do?

新智元 (New Intelligence Yuan) · 2026-03-18 15:41
scLong: A single-cell model with 1 billion parameters that incorporates whole-genome and GO knowledge to enhance multi-task performance.

[Introduction] The single-cell foundation model scLong with one billion parameters no longer focuses only on a few highly expressed genes. Instead, it incorporates nearly 28,000 genes in a cell into the modeling and combines the biological knowledge of Gene Ontology (GO) to understand a more complete gene context.

In the field of single-cell transcriptomics, researchers hope to read the cell state and regulatory relationships from the gene expression of each cell, and even predict what will happen to the cell when a certain gene is knocked out or a certain drug is added.

In the past few years, foundation models have begun to enter this field, showing strong transferability. However, for a long time, existing methods often only focus on a small number of highly expressed genes to save computation, ignoring a large number of lowly expressed or even non-expressed genes. At the same time, there is a lack of systematic integration of external gene function knowledge. This not only leads to the loss of important regulatory signals but also makes the model "see only the trees but not the forest" in complex biological processes.

Recently, a joint team from institutions such as MBZUAI and the University of California, San Diego (UC San Diego) published a research result, scLong, in Nature Communications.

Paper link: https://www.nature.com/articles/s41467-026-69102-y

This is a single-cell foundation model with one billion parameters. It is pre-trained on approximately 48 million cells, capable of modeling approximately 27,874 genes across the entire human transcriptome and integrating the structured biological knowledge provided by GO (Gene Ontology) into the model.

The paper report shows that scLong outperforms existing single-cell foundation models and various task-specific models in multiple tasks such as genetic perturbation prediction, chemical perturbation prediction, cancer drug response prediction, and gene regulatory network inference.

Research Background

Why does the single-cell field need a "longer" model?

Because a cell is not determined by only a few "star genes". Many existing models perform self-attention on only about 1,500 to 2,000 highly expressed genes. This does save computing power, but at the cost of excluding a large number of lowly expressed genes.

Although these lowly expressed genes are "not very loud", they often act as regulatory switches, signal fine-tuners, and even play a key role in rare cell types, stress responses, and disease progression.

In other words, many past models are more like reading "abstracts" rather than "full texts".

Another problem is that relying solely on the expression matrix itself, the model may not truly understand "what this gene does".

Gene Ontology provides exactly this kind of structured knowledge about genes' biological processes, molecular functions, and cellular components. Many past models mainly "figure things out by themselves" from the data without explicitly using this mature biological prior, so they remain limited in understanding functional associations, regulatory relationships, and generalization across conditions.

So, what scLong wants to do is straightforward: not only look at all the genes but also "understand" them.

Read a cell as a whole sentence

If we use natural language as an analogy, the core idea of scLong is very vivid: read the entire gene expression profile of a cell as a very long and complex sentence.

In this "sentence", each "word" is not an ordinary word but a combination of a gene ID and its expression value. The model first uses an expression encoder to map the numerical expression level into a vector, then uses a gene encoder to generate a biologically meaningful representation for each gene; summing the two gives the initial representation of this "word".
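The token construction just described can be sketched in a few lines of numpy. This is a toy illustration: the dimensions, the random embedding table, and the linear expression encoder are all made-up stand-ins for the paper's learned networks.

```python
import numpy as np

rng = np.random.default_rng(0)

N_GENES, D = 27_874, 16  # full-transcriptome vocabulary, toy embedding width

# Hypothetical parameters: a gene-ID embedding table and a linear
# expression encoder (scLong's actual encoders are learned networks).
gene_table = rng.normal(size=(N_GENES, D))
w_expr = rng.normal(size=D)

def token_embeddings(gene_ids, expr_values):
    """Each 'word' = gene-ID embedding + encoded expression value."""
    gene_vec = gene_table[gene_ids]                     # gene encoder output, (n, D)
    expr_vec = np.outer(np.log1p(expr_values), w_expr)  # expression encoder output, (n, D)
    return gene_vec + expr_vec                          # summed initial token

# Toy "cell": three genes with raw counts (one of them zero)
ids = np.array([10, 2000, 27873])
counts = np.array([0.0, 5.0, 120.0])
tokens = token_embeddings(ids, counts)
print(tokens.shape)  # (3, 16)
```

Note that a zero count still produces a valid token (the pure gene-ID embedding), which is consistent with the article's later point that zeros are kept in the model rather than discarded.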

Subsequently, the model allows these genes to "see each other" through a context encoder, thereby learning the context relationship between genes in the current cell.

The most interesting thing here is that scLong does not simply discard the lowly expressed genes. It adopts a dual-encoder design: a larger Performer encoder for highly expressed genes and a smaller Performer encoder for lowly expressed genes, with a full-length Performer finally integrating all genes. This preserves context information across the entire genome as much as possible while striking a balance between computational cost and modeling ability.
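The routing logic of that dual-encoder design might look like the following sketch. A trivial mean-mixing function stands in for the real Performer blocks (which use linear attention); `top_k` and all sizes are illustrative, not the paper's values.

```python
import numpy as np

def toy_mixer(x, strength):
    """Stand-in for a Performer block: residual mixing with the mean token.
    (The real model uses learned linear-attention layers.)"""
    return x + strength * x.mean(axis=0, keepdims=True)

def dual_encoder(tokens, expr, top_k):
    """Route the top_k most-expressed genes to the larger encoder,
    the rest to the smaller one, then fuse all genes in one full pass."""
    order = np.argsort(-expr)
    hi, lo = order[:top_k], order[top_k:]
    out = np.empty_like(tokens)
    out[hi] = toy_mixer(tokens[hi], strength=0.5)  # "larger" encoder
    out[lo] = toy_mixer(tokens[lo], strength=0.1)  # "smaller" encoder
    return toy_mixer(out, strength=0.2)            # full-length fusion pass

rng = np.random.default_rng(1)
n = 5000
tokens = rng.normal(size=(n, 16))
expr = rng.exponential(size=n)
fused = dual_encoder(tokens, expr, top_k=2000)
print(fused.shape)  # (5000, 16)
```

The design point the sketch captures: every gene, lowly expressed or not, passes through some encoder and participates in the final full-length pass, so no gene is dropped from the context.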

Furthermore, scLong also incorporates the GO knowledge graph. The research team first constructs a gene graph based on the shared GO annotations of genes:

If two genes are similar enough in biological processes, molecular functions, or cellular localization, they will be connected;

Then, a graph convolutional network (GCN) is used to learn the gene representation.

In this way, the model not only knows "how much this gene is expressed in this cell" but also knows "what functions and which genes this gene is usually related to". This is equivalent to adding a layer of background knowledge to each "word".
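A minimal sketch of the GO-graph construction and one GCN propagation step, using a handful of hypothetical genes and GO term IDs; the Jaccard threshold, feature width, and random weights here are illustrative, not the paper's.

```python
import numpy as np

# Hypothetical GO annotations: gene -> set of GO term IDs
go = {
    "TP53":  {"GO:0006915", "GO:0008285", "GO:0005634"},
    "MDM2":  {"GO:0006915", "GO:0005634"},
    "GAPDH": {"GO:0006096", "GO:0005829"},
    "PKM":   {"GO:0006096", "GO:0005829"},
}
genes = list(go)
n = len(genes)

def jaccard(a, b):
    return len(a & b) / len(a | b)

# Connect genes whose GO-annotation overlap passes a similarity threshold
A = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j and jaccard(go[genes[i]], go[genes[j]]) >= 0.5:
            A[i, j] = 1.0

# One GCN layer with symmetric normalization: H' = ReLU(D^{-1/2} (A+I) D^{-1/2} H W)
A_hat = A + np.eye(n)
d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
norm = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]

rng = np.random.default_rng(2)
H = rng.normal(size=(n, 8))           # initial gene features
W = rng.normal(size=(8, 8))           # weight matrix (learned in practice, random here)
H_next = np.maximum(norm @ H @ W, 0)  # propagated gene representations
print(H_next.shape)  # (4, 8)
```

In this toy graph, TP53 and MDM2 end up connected through shared apoptosis/nucleus annotations, and the two glycolysis genes connect to each other, so each gene's representation mixes in its functional neighbors.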

In terms of pre-training, scLong uses an approach similar to BERT: randomly masks a part of the expression values and lets the model reconstruct them.
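The masked-reconstruction objective can be illustrated as follows. The "model" here is a trivial mean predictor, purely to show where the loss is computed; scLong's actual decoder is a learned network over the encoder output, and the 15% mask rate is an assumption borrowed from BERT.

```python
import numpy as np

rng = np.random.default_rng(3)

expr = rng.poisson(2.0, size=100).astype(float)  # toy expression vector for one cell
mask = rng.random(100) < 0.15                    # BERT-style: mask ~15% of positions

masked_input = expr.copy()
masked_input[mask] = -1.0                        # sentinel replacing the true values

# Stand-in "model": predict the mean of the visible positions.
pred = masked_input.copy()
pred[mask] = masked_input[~mask].mean()

# Reconstruction loss is computed only at the masked positions
loss = np.mean((pred[mask] - expr[mask]) ** 2)
print(round(float(loss), 3))
```

The key property, as in BERT, is that the loss only touches masked positions, forcing the model to infer hidden expression values from the surrounding gene context.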

The research team pre-trains on approximately 48 million human cells from 1,618 single-cell datasets spanning more than 50 tissue types and 27,874 genes, including both protein-coding and non-coding genes. For the single-cell field, this is equivalent to letting the model first "read through the corpus" of a large number of real cells and then tackle various downstream tasks.

There is also a very noteworthy design: scLong even models zero expression as information. Because zero does not necessarily mean "meaningless"; it may represent "the expression is too low to be detected" or "this gene is indeed turned off in this cell".

The former may correspond to a weak but real biological signal, while the latter may precisely reveal a certain cell identity or regulatory state. For single-cell data, this idea of "treating absence as information" is very important.

From Gene Perturbation to Drug Response

Genetic Perturbation Prediction: Better at Guessing Unseen Perturbations

In the genetic perturbation task, the model needs to predict post-perturbation expression changes from the cell's pre-perturbation expression and the perturbation conditions.

The paper uses the Norman dataset for evaluation and particularly focuses on the model's generalization ability to unseen perturbation combinations. The results show that scLong outperforms Geneformer, scGPT, scFoundation, UCE, and task-specific models such as GEARS, ALM, and the simple baseline No-Change in most scenarios. Especially in the more difficult Seen 0/1 and Seen 0/2 scenarios, the advantage of scLong is more obvious. For example, in the Seen 0/1 scenario, the Pearson correlation coefficient of scLong reaches 0.625, higher than 0.561 of GEARS; in the Seen 0/2 scenario, the MSE of scLong is 0.170, also better than most baselines.
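The two headline metrics in these comparisons, MSE and Pearson correlation between predicted and measured expression changes, are straightforward to compute. The values below are made-up toy numbers, not data from the paper.

```python
import numpy as np

def evaluate_perturbation(pred, true):
    """MSE and Pearson correlation between predicted and measured
    post-perturbation expression changes."""
    mse = float(np.mean((pred - true) ** 2))
    pearson = float(np.corrcoef(pred, true)[0, 1])
    return mse, pearson

# Toy per-gene expression deltas for one perturbation condition
true_delta = np.array([1.2, -0.5, 0.0, 2.1, -1.0])
pred_delta = np.array([1.0, -0.4, 0.1, 1.8, -0.9])
mse, r = evaluate_perturbation(pred_delta, true_delta)
print(round(mse, 3), round(r, 3))
```

In practice these metrics are typically computed per perturbation condition over the affected genes and then averaged, which is why small per-scenario differences (0.625 vs 0.561) can be meaningful.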

Moreover, scLong is also better than GEARS in identifying two types of genetic interactions, synergy and suppressor, in double-gene perturbations.

This means that it can not only predict "how much it will change" but also come closer to understanding "how these genes work together".

Chemical Perturbation Prediction: Let the Model "Test" New Drugs First

In the chemical perturbation task, the model takes the drug molecular graph, dosage, and cell line information as input and outputs the gene expression after perturbation. The paper evaluates scLong on the L1000 subset, and the results show that scLong significantly outperforms Geneformer, scGPT, scFoundation, UCE, and the task-specific model DeepCE in terms of RMSE, Spearman/Pearson correlation, and Top-100 accuracy indicators.

In other words, when facing a new compound, scLong is better at predicting what state it will push the cells into.

Cancer Drug Response Prediction: Better Understanding of Cancer Cells and Combination Therapy

In the cancer drug response prediction task, the model needs to predict the drug efficacy based on the drug structure and the cancer cell expression profile. The paper reports on the DeepCDR dataset that the Pearson correlation coefficient of scLong reaches 0.878, higher than 0.852 of Geneformer, 0.867 of scFoundation, 0.837 of DeepCDR, and 0.746 of the linear model.

More interestingly, the research team also upgrades the problem to drug combination prediction: Will the same cancer cell line have a better response when facing the combination of two drugs?

On the out-of-distribution test set, the AUROC of scLong reaches 0.652, also exceeding that of various foundation models and task models. This shows that it can not only evaluate single drugs but also provide clues in more complex combination therapy scenarios.

Gene Regulatory Network and Batch Integration: Not Only Prediction but also "Organizing Knowledge"

In the gene regulatory network (GRN) inference task, scLong starts from the similarity between gene representations to reconstruct who regulates whom.

The results show that its AUPR ratio (relative to a random baseline, so values above 1 indicate better-than-chance edge recovery) reaches 1.35, significantly better than Geneformer, scGPT, scFoundation, UCE, DeepSEM, GENIE3, and a baseline that uses the GO graph directly.

That is to say, what scLong learns is not a "memorized" GO network but a relationship graph closer to the real biological system after combining specific cell data.
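Inferring edges from representation similarity, as described above, reduces to something like this sketch. The embeddings, clustering structure, and threshold are illustrative; the paper's evaluation scores ranked edges against curated networks with AUPR rather than hard-thresholding.

```python
import numpy as np

def cosine_sim_matrix(E):
    """Pairwise cosine similarity between gene representation rows."""
    norm = E / np.linalg.norm(E, axis=1, keepdims=True)
    return norm @ norm.T

rng = np.random.default_rng(4)
# Toy gene embeddings: genes 0-2 share one direction, genes 3-4 another,
# mimicking two co-regulated modules
base_a, base_b = rng.normal(size=8), rng.normal(size=8)
E = np.stack([base_a + 0.1 * rng.normal(size=8) for _ in range(3)]
             + [base_b + 0.1 * rng.normal(size=8) for _ in range(2)])

S = cosine_sim_matrix(E)
# Predict an edge where similarity exceeds a threshold (ignoring self-pairs)
pred_edges = (S > 0.8) & ~np.eye(5, dtype=bool)
print(S.shape)  # (5, 5)
```

Because the embeddings here come out of cell-data training rather than the GO graph alone, the recovered edges can differ from the GO prior, which is exactly the point the article makes about scLong not just "memorizing" GO.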

In the zero-shot batch integration task, scLong achieves a batch ASW of 0.96 on the pancreas dataset, exceeding Raw, HVG, scVI, and other foundation models.

It is worth noting that scLong was neither pre-trained nor fine-tuned on this dataset, yet it still outperforms scVI, which was trained specifically on it, a sign of strong transferability.

Finally, the ablation experiments also provide strong support: the performance will decline after removing the modeling of lowly expressed genes or the GO graph. This shows that the improvement of scLong is not accidental but precisely comes from "looking at all genes" and "introducing biological knowledge".

Summary of Core Highlights

From "looking at a few genes" to "looking at the whole genome": It incorporates approximately 28,000 genes into the context modeling instead of only focusing on highly expressed genes.

Truly embed biological knowledge into the model: GO is no longer just an annotation table but participates in the core structure of gene representation learning.

Strong transferability brought by large-scale pre-training: Pre-training on 48 million cells enables the model to perform stably on multiple downstream tasks.

Not just "bigger" but "more understanding of biology": The most important inspiration of the paper is not the number of parameters itself but the proof that lowly expressed/non-expressed genes and structured prior knowledge are very crucial for single-cell foundation models.

Actual Application Prospects

From an application perspective, the potential shown by scLong is quite clear.

First, in gene perturbation and function research, it can help researchers predict the transcriptome changes that may be brought about by knockout, overexpression, and combination perturbations more quickly, thereby reducing the cost of a large number of wet experiments.

Second, in drug discovery and precision medicine, it can predict chemical perturbations and cancer drug responses, providing computational support for candidate drug screening, combination drug design, and personalized treatment.

Third, at the level of systems biology, it can also assist in reconstructing gene regulatory networks, understanding cell state transitions, and providing more stable cell representations in multi-batch data integration. The paper authors also point out that such a model is expected to further promote precision medicine, drug development, and cell biology research.

In the long run, scLong represents a very noteworthy direction: Single-cell foundation models should not just apply Transformer to biological data but should embrace both "global context" and "domain knowledge".

When the model can both "read the whole book of genes" and understand the position of each gene in biology, it is more likely to truly become a general intelligent tool in life science.

Reference: https://www.nature.com/articles/s41467-026-69102-y

This article is from the WeChat official account "New Intelligence Yuan", edited by LRST, and published by 36Kr with authorization.