NVIDIA has achieved a breakthrough in atomic-level protein design, generating proteins of up to 800 residues with high precision.
The research team at NVIDIA, in collaboration with Mila, the Quebec Artificial Intelligence Institute in Canada, has proposed La-Proteina, an atomistic protein design method based on partially latent flow matching. It can effectively combine explicit backbone modeling with fixed-size latent representations for each residue to capture sequence and atomic side-chain information, addressing the key challenge of dimensional variability in explicit side-chain representations during protein generation.
Designing novel proteins with specific structures and functions holds great promise in fields such as drug development and bioengineering, but achieving this goal is no easy feat. In particular, capturing the relationship between protein sequences and structures has long been a central challenge in de novo protein design.
Most previous methods separated the design of protein sequences and structures: for example, they would first generate a sequence and then fold it, or design the backbone first and then determine the sequence. Precisely modeling the joint distribution of protein sequences and full-atom structures, which is needed for fine control of functional sites and for crucial design tasks such as atomistic motif scaffolding, remains highly challenging. It requires handling discrete sequences alongside continuous coordinates, as well as side-chain dimensions that vary with the sequence.
Against this backdrop, the NVIDIA and Mila team proposed La-Proteina, which combines explicit backbone modeling with fixed-size per-residue latent representations of sequence and atomic side-chain information, addressing the challenge of variable side-chain dimensionality and bringing a new breakthrough to the field of protein design.
The relevant research findings were published on arXiv under the title "La-Proteina: Atomistic Protein Generation via Partially Latent Flow Matching."
Research Highlights:
* A partially latent flow matching framework, La-Proteina, was proposed, specifically designed for the joint generation of protein sequences and fully atomistic structures. It effectively combines explicit backbone modeling with fixed-size latent representations for each residue to capture sequences and atomic-level side chains.
* In extensive benchmark experiments, La-Proteina achieved state-of-the-art (SOTA) performance in unconditional protein generation, capable of generating diverse, co-designable, and structurally valid fully atomistic proteins of up to 800 residues.
* The research successfully applied La-Proteina to indexed and non-indexed atomic motif scaffold design, two important conditional protein design tasks, demonstrating that the model outperforms previous full-atom generators.
Paper Link:
https://go.hyper.ai/3csT5
Datasets: Training Data for Unconditional Models and the Protein Data Representation
This research used two datasets for training unconditional models:
One is an AFDB dataset clustered with Foldseek, derived by screening and clustering the AlphaFold Database (AFDB); the clustering combined sequence and structural information and initially yielded approximately 3 million unique samples. These were then filtered by multiple criteria: an average pLDDT score of at least 80, protein lengths between 32 and 512 residues, a coil ratio below 50%, no more than 20 consecutive coil residues, and, to correct the low β-sheet content of model-generated proteins, the required presence of β-sheets. This screening left approximately 550,000 protein samples. The carefully filtered dataset enables the model to generate proteins with more balanced structural features, in particular a higher β-sheet content.
The other is a customized AFDB subset for long-sequence training: the researchers kept samples with an average pLDDT of at least 70 and lengths between 384 and 896 residues, and after clustering obtained more than 4 million clusters for training. By focusing on longer proteins, this subset supports training for long-sequence generation.
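As a rough illustration, the two screening steps above can be expressed as simple per-sample filters; the field names below (such as avg_plddt and coil_fraction) are hypothetical, since the paper does not describe its filtering pipeline in code:

```python
# Hypothetical per-sample filters reflecting the screening criteria described above.
# Field names (avg_plddt, length, coil_fraction, max_consecutive_coil, has_beta_sheet)
# are illustrative assumptions, not the paper's actual data schema.
from dataclasses import dataclass

@dataclass
class Sample:
    avg_plddt: float            # mean predicted LDDT over all residues
    length: int                 # number of residues
    coil_fraction: float        # fraction of residues assigned to coil
    max_consecutive_coil: int   # longest run of consecutive coil residues
    has_beta_sheet: bool        # whether any beta-sheet is present

def keep_for_standard_training(s: Sample) -> bool:
    """Filter for the Foldseek-clustered AFDB subset (~550k samples after screening)."""
    return (s.avg_plddt >= 80 and 32 <= s.length <= 512 and s.coil_fraction < 0.5
            and s.max_consecutive_coil <= 20 and s.has_beta_sheet)

def keep_for_long_training(s: Sample) -> bool:
    """Filter for the long-sequence AFDB subset."""
    return s.avg_plddt >= 70 and 384 <= s.length <= 896
```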
In addition, protein data combines sequence information (20 residue types) with 3D structural information, stored uniformly in the Atom37 representation. Atom37 defines a standardized superset of 37 potential atoms for each residue, so the structure of a protein with L residues can be stored as a tensor of shape [L, 37, 3], with the relevant coordinate subset selected according to each residue's type.
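A minimal sketch of this representation, assuming the standard Atom37 atom ordering (N, CA, C, CB, O, ...) and showing only two residue types for brevity, might look as follows:

```python
# A minimal numpy sketch of the Atom37 representation: a fixed [L, 37, 3] coordinate
# tensor plus a per-residue mask selecting the atoms that actually exist for each
# residue type. This is an illustrative fragment, not a library API.
import numpy as np

ATOM37_NAMES = ["N", "CA", "C", "CB", "O"]  # first 5 of the 37 standardized atom slots

# Which of the 37 slots are present, by residue type (only two residues shown here).
RESIDUE_ATOM_MASK = {
    "GLY": [1, 1, 1, 0, 1] + [0] * 32,   # glycine has no CB
    "ALA": [1, 1, 1, 1, 1] + [0] * 32,
}

L = 2
coords = np.zeros((L, 37, 3), dtype=np.float32)                         # [L, 37, 3]
sequence = ["GLY", "ALA"]
mask = np.array([RESIDUE_ATOM_MASK[r] for r in sequence], dtype=bool)   # [L, 37]

present_coords = coords[mask]   # only the atoms that exist for these residue types
ca_coords = coords[:, 1]        # Cα occupies slot 1 in the Atom37 ordering -> [L, 3]
```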
This standardization provides a unified way to store and represent the structural information of different residue types, laying the foundation for the model to process them uniformly. The large scale of AFDB, in turn, supplies abundant training samples, helping the model learn a broader range of protein sequence and structural features and improving its performance and generalization. Trained on these data, the model can better capture the relationship between protein sequences and structures and achieve more accurate design.
La-Proteina: The Innovative Architecture and Training Mechanism of the Atomistic Protein Design Model
La-Proteina is an innovative model for atomistic protein design. Its core design revolves around "partially latent representation" to address the complex challenges in the generation of fully atomistic structures.
At the design level, full-atom structure generation must simultaneously handle the large-scale backbone, the amino acid types, and side chains whose dimensionality varies with the amino acid. La-Proteina therefore encodes the atomic-level details and residue type of each residue into a fixed-length continuous latent space, while maintaining explicit backbone modeling through the α-carbon coordinates.
This design brings multiple advantages. It avoids mixed continuous-categorical modeling in the main generative component, allowing a fully continuous flow matching method to generate the latent variables efficiently and to build on advances in high-performance backbone modeling. At the same time, explicit backbone modeling allows different generation schedules to be set for the global α-carbon backbone and the residue-level atomic details, which is key to the model's high performance and also improves scalability, enabling generation of large proteins of up to 800 residues. This hybrid approach is the core reason it outperforms fully latent modeling frameworks.
In terms of its composition, as shown in the figure below, the core of La-Proteina consists of three neural networks: an encoder, a decoder, and a denoiser. They share a Transformer core architecture based on a biased attention mechanism.
Among them, the encoder maps the input protein (sequence and structure) to latent variables; its initial sequence representation includes the raw atomic coordinates, side-chain and backbone torsion angles, and residue types, while its initial pairwise representation includes relative sequence separations, pairwise distances, and relative orientations between residues. The decoder reconstructs the complete protein from the per-residue 8-dimensional latent variables and the α-carbon coordinates. The denoiser network predicts the velocity field that transports samples from the standard Gaussian reference distribution to the target data distribution, conditioning directly on the interpolation times in its Transformer blocks.
Composition of La-Proteina
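To make these interfaces concrete, below is a minimal PyTorch sketch of the encoder and decoder signatures implied by the description above; the hidden sizes, plain self-attention trunk, and module names are illustrative assumptions rather than the paper's actual biased-attention architecture, and the denoiser network is omitted for brevity:

```python
# Illustrative encoder/decoder interfaces: per-residue 8-dim latents plus explicit
# Cα coordinates. Shapes follow the article's description; everything else is assumed.
import torch
import torch.nn as nn

L, LATENT_DIM, HIDDEN = 128, 8, 256  # residues, latent size per residue, model width

class Encoder(nn.Module):
    """Maps a full-atom protein (sequence + structure) to per-residue latents."""
    def __init__(self):
        super().__init__()
        # Per-residue input: 37 atoms x 3 coords flattened + 20-dim one-hot residue type.
        self.proj = nn.Linear(37 * 3 + 20, HIDDEN)
        self.trunk = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(HIDDEN, 8, batch_first=True), 2)
        self.to_latent = nn.Linear(HIDDEN, 2 * LATENT_DIM)  # mean and log-variance

    def forward(self, atom37, res_type):
        h = self.proj(torch.cat([atom37.flatten(-2), res_type], dim=-1))
        mu, logvar = self.to_latent(self.trunk(h)).chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterized z

class Decoder(nn.Module):
    """Reconstructs all-atom coordinates and residue types from (z, Cα coords)."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(LATENT_DIM + 3, HIDDEN)
        self.trunk = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(HIDDEN, 8, batch_first=True), 2)
        self.to_atoms = nn.Linear(HIDDEN, 37 * 3)
        self.to_seq = nn.Linear(HIDDEN, 20)

    def forward(self, z, ca_coords):
        h = self.trunk(self.proj(torch.cat([z, ca_coords], dim=-1)))
        return self.to_atoms(h).view(*h.shape[:2], 37, 3), self.to_seq(h)

# Usage sketch: z = Encoder()(atom37, res_type)
#               atoms, seq_logits = Decoder()(z, atom37[:, :, 1])  # slot 1 = Cα
```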
In terms of training, La-Proteina adopts a two-stage training strategy.
In the first stage, a conditional variational autoencoder (VAE) is trained. The encoder maps the input protein to latent variables, and the decoder reconstructs the protein based on the latent variables and α-carbon atom coordinates. The entire VAE is optimized by maximizing the β-weighted evidence lower bound (ELBO). For the above modeling choices, the reconstruction term can be simplified to the cross-entropy loss of the sequence and the squared L2 loss of the structure.
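Schematically, the stage-one objective described above (sequence cross-entropy plus squared-L2 structure reconstruction, with a β-weighted KL term against a standard Gaussian prior) might be implemented as follows; the function signature and the β value are illustrative:

```python
# A minimal sketch of a beta-weighted ELBO loss consistent with the description above.
import torch
import torch.nn.functional as F

def vae_loss(seq_logits, true_seq, pred_atoms, true_atoms, mu, logvar, beta=0.1):
    # Reconstruction: cross-entropy on residue types ...
    ce = F.cross_entropy(seq_logits.transpose(1, 2), true_seq)   # logits [B, 20, L], targets [B, L]
    # ... plus squared L2 on atomic coordinates.
    l2 = ((pred_atoms - true_atoms) ** 2).sum(dim=(-1, -2)).mean()
    # KL divergence between the per-residue latent posterior N(mu, sigma^2) and N(0, I).
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()
    return ce + l2 + beta * kl
```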
In the second stage, the flow matching model is optimized to approximate the target distribution. The denoiser network is trained by minimizing the conditional flow matching (CFM) objective. The use of two separate interpolation times, tx and tz, is a key design in this stage. This setting allows different integration schedules to be used for the α-carbon atom coordinates and latent variables during the inference process, effectively enhancing the model's performance.
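As a sketch of this stage, the snippet below pairs a conditional flow matching loss that draws independent interpolation times tx and tz with an Euler sampler in which the α-carbon coordinates and the latents follow separate inference schedules; the linear interpolant, the denoiser call signature, and the default schedules are assumptions rather than the paper's exact formulation:

```python
# Illustrative CFM training loss and two-schedule Euler sampler.
import torch

def cfm_loss(denoiser, x1_ca, z1):
    """x1_ca: clean Cα coords [B, L, 3]; z1: clean per-residue latents [B, L, 8]."""
    B = x1_ca.shape[0]
    t_x = torch.rand(B, 1, 1)                                 # independent times for coords ...
    t_z = torch.rand(B, 1, 1)                                 # ... and latents
    x0, z0 = torch.randn_like(x1_ca), torch.randn_like(z1)    # Gaussian reference samples
    x_t = (1 - t_x) * x0 + t_x * x1_ca                        # linear interpolants
    z_t = (1 - t_z) * z0 + t_z * z1
    v_x, v_z = denoiser(x_t, z_t, t_x, t_z)                   # predicted velocity fields
    # Regression targets are the velocities of each interpolation path.
    return ((v_x - (x1_ca - x0)) ** 2).mean() + ((v_z - (z1 - z0)) ** 2).mean()

@torch.no_grad()
def sample(denoiser, L, steps=100, schedule_x=None, schedule_z=None):
    """Euler integration; schedule_x / schedule_z let the two variable groups
    follow different time schedules during inference."""
    ts = torch.linspace(0, 1, steps + 1)
    schedule_x = schedule_x or (lambda t: t)                  # identity schedules by default
    schedule_z = schedule_z or (lambda t: t)
    x, z = torch.randn(1, L, 3), torch.randn(1, L, 8)
    for i in range(steps):
        t_x, t_z = schedule_x(ts[i]), schedule_z(ts[i])
        dt_x = schedule_x(ts[i + 1]) - t_x
        dt_z = schedule_z(ts[i + 1]) - t_z
        v_x, v_z = denoiser(x, z, t_x.view(1, 1, 1), t_z.view(1, 1, 1))
        x, z = x + dt_x * v_x, z + dt_z * v_z                 # per-variable Euler step sizes
    return x, z
```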
Through such design and training, La-Proteina can efficiently learn the joint distribution of protein sequences and full-atom structures, providing strong technical support for atomistic protein design.
Experimental Results: La-Proteina Significantly Leads in Four Types of Experiments
To verify the performance of La-Proteina, the research team conducted a series of experiments in two major directions: unconditional atomistic protein generation and atomic motif scaffold design, comprehensively evaluating the model's performance in different scenarios.
In the unconditional atomistic protein generation experiments, as shown in the figure below, the research team compared two variants of La-Proteina (with and without triangular multiplication layers) against several publicly available full-atom generation baselines, such as P(all-atom), APM, and PLAID. Evaluation metrics included all-atom co-designability, diversity, novelty, and standard designability.
The results showed that the two variants of La-Proteina outperformed all baseline methods in terms of full-atom co-designability, design ability, and diversity, and were also highly competitive in terms of novelty.
Performance of La-Proteina in Generating Unconditional Long Chains
Notably, La-Proteina without the triangular multiplication layers achieves state-of-the-art performance while remaining highly scalable. In contrast, P(all-atom), the second-best performer, can only handle short proteins due to its reliance on computationally expensive triangular update layers.
The research team also demonstrated the scalability of La-Proteina to large full-atom structures: trained on the AFDB dataset containing approximately 46 million samples, the model performed best at generating proteins with more than 500 residues, a length range in which other full-atom baselines often failed to produce valid samples.
In the biophysical analysis, structural validity was evaluated with the MolProbity tool. Structures generated by La-Proteina scored significantly better than those of all baseline methods, indicating that they are more physically realistic and closer to real proteins. Visualizing the side-chain dihedral angle distributions against PDB and AFDB references further showed that La-Proteina accurately models the conformational space of amino acid rotamers, whereas the baselines often deviated from the references, either missing modes or populating unrealistic angle regions.
La-Proteina Has Higher Structural Validity Than Existing Full-Atom Generation Baselines
In the atomic motif scaffolding experiments, the research team evaluated the model on the task of generating a protein structure that precisely supports a predefined motif specified at the atomic level. The experiments covered four evaluation settings: all-atom and tip-atom motif scaffolding, each in an indexed and a non-indexed version.
The results showed that under all four settings, La-Proteina significantly outperformed the only comparable full-atom baseline method, Protpardelle, and was able to successfully solve most benchmark tasks. Especially for motifs composed of three or more different residue segments, the non-indexed version of La-Proteina performed better than the indexed version, possibly because fixing the positions of multiple segments limits the model's flexibility in exploring different structural solutions.
Research Breakthroughs and Innovative Practices in the Field of Atomistic Protein Design
In the field of protein design, the research direction of atomistic protein design represented by La-Proteina has attracted wide attention from academia and industry. Many universities and enterprises have achieved important research breakthroughs and innovative practices in this field.
In academia, some research teams are working to improve the performance and scalability of protein generation models. For example, a team from NVIDIA, in collaboration with Mila, the Quebec Artificial Intelligence Institute, the University of Montreal, and the Massachusetts Institute of Technology, developed Proteina, which demonstrated the scalability of flow-based protein structure generation models by training on the large-scale AlphaFold Database (AFDB).
Some research has utilized diffusion models in protein design. Early diffusion-based protein generators such as RFDiffusion and Chroma focused on backbone generation. Subsequent research further expanded the application scope of diffusion models in protein design, such as diffusion on the SO(3) manifold and Euclidean flow matching.
Some research teams also focus on the joint modeling of protein sequences and structures. For example, ProtComposer, jointly launched by NVIDIA and the Massachusetts Institute of Technology, uses auxiliary statistical models and 3D primitives to generate protein structures. Some works handle full-atom structures by jointly modeling the protein backbone and sequence or using latent variable models. In addition, language models have also been applied to protein design. Some methods focus on protein sequences, while others tokenize structural information and jointly model sequences and structures.
In industry, Cradle, a Dutch biotech company, focuses on using artificial intelligence to simplify the protein design process: by running a wet lab that accumulates billions of protein sequences and data points to train a proprietary generative AI model, it makes protein design and optimization more accessible. Xaira Therapeutics, a US AI pharmaceutical company, leverages its strengths in advanced machine learning research, large-scale data generation, and therapy development to create molecules tailored to specific indications. Other companies are likewise combining protein design with artificial intelligence and machine learning to improve the efficiency and accuracy of protein design.
The research breakthroughs of these universities and the innovative practices of these enterprises provide rich experience and technical support for the development of the protein design field, driving the continuous progress of this field. With the continuous advancement of technology, it is believed that protein design will play an important role in more fields in the future.
Reference Articles:
1. https://mp.weixin.qq.com/s/7r69S3XpNMjemo3EiXzNeQ
2. https://mp.weixin.qq.com/s/DrZEdsb1SqSSkv_hbrp3TA
This article is from the WeChat official account "HyperAI Super Neural", author: Tian Xiaoyao. Republished by 36Kr with permission.