
Apple overturns the table, discards AlphaFold's core modules, and ushers in the era of "generative AI" for protein folding

新智元 2025-09-28 07:56
The first protein folding model based on the Transformer module.

Protein folding has always been a core challenge in computational biology and has a profound impact on fields such as drug development.

If protein folding is viewed as a generative task analogous to those in vision, the amino acid sequence plays the role of the "prompt", and the model's output is the three-dimensional coordinates of the atoms.

Inspired by this idea, the researchers built a general-purpose yet powerful architecture out of standard Transformer blocks and adaptive layers: SimpleFold.

Paper link: https://arxiv.org/abs/2509.18480

What are the differences between SimpleFold and classic protein folding models such as AlphaFold2?

AlphaFold2 and RoseTTAFold2 integrate complex, highly specialized architectural components such as triangular updates, pairwise representations, and multiple sequence alignments (MSA).

These designs effectively "hard-code" our existing understanding of the structure-generation mechanism into the model, rather than letting the model learn how to generate structures from the data itself.

SimpleFold proposes a brand-new idea:

No triangular updates, no pairwise representations, and no MSA: built entirely on general-purpose Transformers and flow matching, it maps a protein sequence directly to its complete three-dimensional atomic structure (see Figure 1).

SimpleFold

The first protein folding model based on Transformer modules

Flow matching treats generation as a journey that unfolds over time, integrating a trajectory with an ordinary differential equation (ODE). It is like developing a photograph: the noise is gradually "developed" into a clear structure.

SimpleFold replicates this journey in protein folding:

The input is the amino acid sequence acting as the "prompt", and the output is a three-dimensional "photo" of all atoms, closely mirroring "text-to-image" or "text-to-3D" tasks in vision.
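
To make the analogy concrete, below is a minimal sketch of how a flow-matching sampler integrates an ODE from pure noise to atomic coordinates. The `velocity_model` callable, its signature, and the fixed-step Euler scheme are illustrative assumptions, not SimpleFold's actual implementation.

```python
import torch

@torch.no_grad()
def sample_structure(velocity_model, seq_embedding, num_atoms, num_steps=100):
    """Integrate the flow-matching ODE dx/dt = v(x, t | sequence) with a
    plain Euler scheme, 'developing' Gaussian noise into 3D coordinates."""
    x = torch.randn(num_atoms, 3)                  # t = 0: pure noise
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = torch.full((1,), step * dt)            # current time in [0, 1)
        v = velocity_model(x, t, seq_embedding)    # predicted velocity field, shape (num_atoms, 3)
        x = x + v * dt                             # Euler update toward t = 1
    return x                                       # t = 1: predicted atom coordinates
```

Once the initial noise is drawn, sampling is just a deterministic integration; higher-order ODE solvers or more steps can be swapped in for accuracy.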

Since AlphaFold2, components such as triangular updates and the interaction between single (per-residue) and pairwise representations have been widely used in protein folding models, but whether these designs are actually necessary has remained inconclusive.

SimpleFold makes bold innovations in design, using only general Transformer modules to build the architecture (see Figure 5 for comparison).

The SimpleFold architecture consists of three parts: a lightweight atomic encoder, a heavyweight residue backbone, and a lightweight atomic decoder (see Figure 2).

This "fine - coarse - fine" hierarchical approach first examines the micro - level, then grasps the overall situation, and finally supplements the details, finding a good balance between speed and accuracy.

Unlike previous methods, SimpleFold uses no pairwise representations and does not rely on attention initialization from MSA or PLM features.

Compared with works that rely on equivariant architectures, SimpleFold is built entirely on non-equivariant Transformers.

To handle the rotational symmetry of protein structures, the researchers introduce SO(3) data augmentation during training: the target structure is randomly rotated, and the model is left to learn the symmetry on its own.
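
A minimal sketch of what such SO(3) augmentation can look like: draw a uniformly random rotation (here via a normalized Gaussian quaternion) and apply it to the target coordinates before the training loss is computed. The centering step and the exact sampling scheme are assumptions for illustration.

```python
import numpy as np

def random_rotation_matrix(rng: np.random.Generator) -> np.ndarray:
    """Uniformly random rotation in SO(3), from a normalized Gaussian quaternion."""
    q = rng.normal(size=4)
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def augment_structure(coords: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Randomly rotate an (N, 3) array of target atom coordinates about its centroid."""
    center = coords.mean(axis=0, keepdims=True)
    return (coords - center) @ random_rotation_matrix(rng).T + center
```

Because the Transformer itself is not rotation-equivariant, seeing every training structure in many random orientations is what teaches the model that orientation carries no information.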

Experimental evaluation

To study the scalability of the SimpleFold framework on protein folding tasks, the researchers trained a series of SimpleFold models at different scales (100M, 360M, 700M, 1.1B, 1.6B, and 3B parameters).

Scaling up is not just a matter of adding parameters: as the model grows, the atomic encoder, the decoder, and the residue backbone network are all upgraded in step (see Table 5 for details).

During training, the researchers borrow a strategy from AlphaFold2: each protein is replicated Bc times on each GPU, a different time step t is sampled for each replica, and gradients are accumulated across Bp proteins (see Table 6 for specific settings).

Experiments show that, compared with simply drawing random proteins to form a batch, this strategy yields more stable gradients and better model performance.
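
Below is a hedged sketch of what one such training step could look like: each protein's target coordinates are replicated Bc times with independently sampled time steps, a linear noise-to-data interpolant supplies the flow-matching velocity target, and gradients are accumulated over Bp proteins before the optimizer step. The linear path and MSE velocity loss follow standard conditional flow matching; SimpleFold's exact schedule, loss weighting, and model signature may differ.

```python
import torch
import torch.nn.functional as F

def accumulate_step(model, optimizer, proteins, Bc=8):
    """One optimizer step: gradients accumulated over Bp proteins, each replicated Bc times."""
    optimizer.zero_grad()
    Bp = len(proteins)
    for coords, seq_emb in proteins:                 # coords: (A, 3) target structure
        x1 = coords.unsqueeze(0).expand(Bc, -1, -1)  # replicate the protein Bc times
        x0 = torch.randn_like(x1)                    # Gaussian noise source
        t = torch.rand(Bc, 1, 1)                     # a different time step per replica
        xt = (1 - t) * x0 + t * x1                   # linear interpolant between noise and data
        target_v = x1 - x0                           # velocity of the linear path
        pred_v = model(xt, t, seq_emb)               # placeholder model signature
        loss = F.mse_loss(pred_v, target_v) / Bp     # average across the Bp accumulated proteins
        loss.backward()                              # accumulate gradients
    optimizer.step()
```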

Researchers evaluated the performance of SimpleFold on two widely used protein structure prediction benchmarks, CAMEO22 and CASP14.

Both benchmarks place high demands on generalization, robustness, and atomic-level accuracy.

Table 1 summarizes the evaluation results on CASP14 and CAMEO22.

The researchers divided the models into two categories by how they extract protein sequence information: methods based on MSA retrieval (such as RoseTTAFold, RoseTTAFold2, and AlphaFold2) and methods based on protein language models (PLMs) (such as ESMFold and OmegaFold).

In addition, they flagged each baseline according to whether its training objective is generative (diffusion, flow matching, or autoregression) or a direct structural regression.

Interestingly, when AlphaFold2 and ESMFold are fine-tuned into the flow-matching models AlphaFlow and ESMFlow, their overall metrics fall below those of the original regression models.

The researchers attribute this to the fact that protein folding benchmarks such as CAMEO22 and CASP14 usually provide only a single "ground truth" structure per target, which favors regression models that make deterministic point predictions.

Despite its simple architecture, SimpleFold still performs very well.

On both benchmarks, SimpleFold consistently outperforms ESMFlow, a fellow flow-matching method built on ESM embeddings.

On CAMEO22, SimpleFold performs on par with current state-of-the-art models (such as ESMFold, RoseTTAFold2, and AlphaFold2).

More importantly, without triangular attention or MSA, SimpleFold reaches more than 95% of the performance of RF2/AF2 on most metrics.

On the more challenging CASP14, SimpleFold even surpasses ESMFold.

SimpleFold also shows a smaller score drop across benchmarks, indicating that it generalizes robustly without MSA and can handle more complex structure prediction tasks.

Researchers also reported the performance of SimpleFold models of different scales.

Even the smallest model, SimpleFold-100M, reaches more than 90% of ESMFold's performance on CAMEO22, further demonstrating the feasibility of building protein folding models from general-purpose architectural modules.

As the model scale increases, SimpleFold's performance keeps improving across metrics, indicating that a general, scalable architecture design carries significant advantages for folding tasks.

The gains from scaling are especially pronounced on the more challenging CASP14.

Figure 3(a) shows an example structure colored by predicted pLDDT values, where red and orange indicate low confidence and blue indicates high confidence.

It can be seen that SimpleFold is more confident in predicting most secondary structures, but shows some uncertainty in flexible loop regions.

Figures 3(b) and (c) show a comparative analysis of predicted pLDDT against the actual LDDT-Cα.
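
For reference, LDDT-Cα measures how well local Cα-Cα distances are preserved in the prediction; pLDDT is the model's own per-residue estimate of that score, which is why the two can be compared directly. Below is a simplified sketch of the metric (15 Å inclusion radius, 0.5/1/2/4 Å thresholds); the official implementation adds details such as stereochemistry checks that are omitted here.

```python
import numpy as np

def lddt_ca(pred_ca: np.ndarray, true_ca: np.ndarray,
            cutoff: float = 15.0, thresholds=(0.5, 1.0, 2.0, 4.0)) -> float:
    """Simplified LDDT over Cα atoms: the fraction of local distances preserved
    within each threshold, averaged over thresholds."""
    d_true = np.linalg.norm(true_ca[:, None] - true_ca[None, :], axis=-1)
    d_pred = np.linalg.norm(pred_ca[:, None] - pred_ca[None, :], axis=-1)
    n = len(true_ca)
    local = (d_true < cutoff) & ~np.eye(n, dtype=bool)   # local pairs, excluding self-pairs
    diff = np.abs(d_true - d_pred)[local]
    return float(np.mean([(diff < t).mean() for t in thresholds]))
```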

The structure ensemble generation ability of SimpleFold

The advantage of using a generative target is that SimpleFold can directly model the structure distribution instead of only outputting a single "final draft".

Therefore, for the same amino acid sequence it can generate not only a single deterministic structure but also an ensemble of multiple distinct conformations.
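
Since the stochasticity lives in the initial noise, an ensemble is obtained simply by running the sampler several times on the same sequence. One simple way to quantify how diverse the resulting conformations are is the mean pairwise Cα RMSD after rigid superposition (Kabsch alignment), sketched below as an illustrative helper rather than the metric set used in the paper.

```python
import numpy as np

def kabsch_rmsd(P: np.ndarray, Q: np.ndarray) -> float:
    """RMSD between two (N, 3) coordinate sets after optimal rigid superposition."""
    Pc, Qc = P - P.mean(0), Q - Q.mean(0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T      # optimal rotation mapping Pc onto Qc
    return float(np.sqrt(((Pc @ R.T - Qc) ** 2).sum(axis=-1).mean()))

def ensemble_diversity(conformations: list) -> float:
    """Mean pairwise Cα RMSD across a sampled structure ensemble."""
    rmsds = [kabsch_rmsd(a, b)
             for i, a in enumerate(conformations)
             for b in conformations[i + 1:]]
    return float(np.mean(rmsds))
```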

To verify this ability of SimpleFold, researchers tested it on the ATLAS dataset.

This dataset is used to evaluate the generation of molecular dynamics (MD) structure ensembles and contains the all-atom MD simulation structures of 1390 proteins.

Table 2 shows the comparison results of SimpleFold and multiple baseline models on ATLAS (see Table 9 for SimpleFold models of different scales).

The metrics used give a comprehensive assessment of ensemble quality, covering flexibility prediction, distributional accuracy, and ensemble observables.

As shown in Table 2, SimpleFold consistently outperforms ESMFlow-MD, which likewise relies on ESM representations, across multiple evaluation metrics.

Meanwhile, on key observables such as exposed residues and mutual information matrices, SimpleFold also outperforms AlphaFlow-MD; capturing these observables helps reveal the "cryptic pockets" that matter in drug discovery.

Researchers also evaluated the ability of SimpleFold to model the structures of proteins that naturally have multiple conformational states.

As shown in Table 3, SimpleFold achieves the best current performance on the Apo/holo dataset, clearly surpassing strong MSA-based methods such as AlphaFlow.

On the Fold-switch dataset, SimpleFold performs as well as or even better than ESMFlow.

Overall, the performance of SimpleFold improves as the model scale increases, further demonstrating the great potential of this framework in protein structure ensemble generation.

The scaling effect in protein folding

To study the scaling effect of SimpleFold in protein folding tasks, researchers trained multiple model versions with parameters ranging from 100 million to 3 billion.

All models use the full pretraining data, including the PDB, the SwissProt subset of AFDB, and the filtered AFESM set.

Figure 4(a)-(d) show the impact of model scale on the performance of folding tasks (see also Figure 1(d)).

The results show that larger models perform better as training resources (FLOPs and iterations) increase.

This demonstrates that SimpleFold scales well and points to a feasible path for deploying general generative models at scale in biology.

The researchers also studied how expanding the training data affects model performance, training the SimpleFold-700M model on datasets of different sizes.

As shown in Figure 4(e)-(f), model performance after 400,000 iterations keeps improving as the number of unique structures in the training data increases.

These results show that a simple, scalable folding model can continue to benefit from the ever-growing pool of experimental and model-generated structural data.

Author introduction

Yuyang Wang