AI-driven quantum refinement: Carnegie Mellon University and collaborators propose AQuaRef, the first method to refine all-atom protein models with quantum-mechanical constraints
A joint research team from Carnegie Mellon University, the University of Wrocław in Poland, the University of Florida, and other institutions has proposed AQuaRef, an AI-driven quantum refinement method. It is built on the AIMNet2 machine learning interatomic potential, custom-trained for the refinement task. At a computational cost close to that of classical force fields, it closely approximates the results of quantum-mechanical calculations, opening a new technical path to all-atom quantum refinement of biological macromolecules.
To understand the molecular mechanisms of life processes, one must first visualize the three-dimensional structures of biological macromolecules. Determining structures at atomic resolution is the core task of structural biology and an essential foundation for understanding protein function, revealing mechanisms of genetic regulation, and developing targeted drugs. Key biological processes such as enzyme catalysis, the transmission of genetic information by nucleic acids, and antigen recognition by antibodies can only be explained with accurate structural models.
Cryo-electron microscopy and X-ray crystallography are currently the main experimental techniques for determining the structures of biological macromolecules, and they have produced a large body of high-resolution structural data. In recent years, computational prediction methods such as AlphaFold and RoseTTAFold have also made significant progress, providing efficient tools for structural modeling. However, for discovering unknown structure types and resolving complex interactions, experimental structure determination remains irreplaceable. In the experimental workflow, atomic model refinement is a key step near the end, with the goal of producing a molecular model that obeys stereochemical rules and fits the experimental data as closely as possible. Mainstream refinement software, such as the CCP4 and Phenix suites, relies mainly on stereochemical constraints drawn from standard databases to keep bond lengths and bond angles reasonable and to reduce steric clashes.
However, this type of constraint system has clear limitations. It mainly targets covalent geometry and lacks a systematic description of important non-covalent interactions such as hydrogen bonds and π-stacking. At low resolution, this can cause the model to deviate from the true chemical state. When a structure contains new ligands or unusual linkages, parameters must be defined manually before refinement can proceed. In addition, legitimate geometric deviations caused by the local chemical environment may be misjudged as anomalies and forcibly corrected. In theory, quantum-mechanical methods describe intermolecular interactions more accurately, but biological macromolecules usually contain thousands or even tens of thousands of atoms, making full quantum calculations extremely expensive. Most existing studies have therefore been limited to local regions such as ligand binding sites.
To address this problem, the team proposed AQuaRef. Built on an AIMNet2 machine learning interatomic potential that was custom-trained for refinement, it extends quantum refinement from local regions to the whole macromolecular model at a computational cost close to that of classical force fields.
The study, titled "AQuaRef: machine learning accelerated quantum refinement of protein structures", was published in Nature Communications.
Research Highlights:
* AQuaRef is based on the AIMNet2 machine learning interatomic potential and achieves, for the first time, quantum refinement of complete all-atom protein models.
* In tests on 61 low-resolution X-ray and cryo-electron microscopy models, AQuaRef performed better on 57.
* For the short hydrogen bonds in the DJ-1 and YajL proteins, AQuaRef determines proton positions consistent with the experimental evidence without manual intervention.
Paper Link: https://www.nature.com/articles/s41467-025-64313-1
A 1-million-sample dataset for training a machine learning potential for polypeptides
The study set out to parameterize a machine learning potential for polypeptide systems. The dataset therefore had to cover three dimensions systematically: chemical composition, conformational space, and intermolecular interactions.
For the chemical dimension, the researchers built a database of small peptides as SMILES strings, covering the 20 standard amino acids, 11 protonation states, 3 N-terminal modifications, and 4 C-terminal modifications. On this basis, all single-residue peptides and dipeptides were enumerated, and a random selection of tripeptides and tetrapeptides was added. Polypeptides containing disulfide bonds and their selenium analogs were generated as well. To cover the conformational space thoroughly, the researchers used the OpenEye Omega software for dense torsion angle sampling without restricting the chiral centers, so that the model applies to polypeptides with D-, L-, and mixed stereochemistry.
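The enumeration step described above can be illustrated with a minimal sketch. It assumes placeholder labels for the terminal modifications, because the article gives only their counts, and it ignores protonation states and SMILES generation entirely. It shows the combinatorics, not the authors' pipeline.

```python
from itertools import product

# The 20 standard amino acids as one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical placeholder labels: the article states only the counts
# (3 N-terminal and 4 C-terminal modifications), not their identities.
N_CAPS = ("ncap1", "ncap2", "ncap3")
C_CAPS = ("ccap1", "ccap2", "ccap3", "ccap4")


def enumerate_peptides(length):
    """Yield every (n_cap, sequence, c_cap) combination for one peptide length."""
    for residues in product(AMINO_ACIDS, repeat=length):
        sequence = "".join(residues)
        for n_cap, c_cap in product(N_CAPS, C_CAPS):
            yield n_cap, sequence, c_cap


single = list(enumerate_peptides(1))      # single-residue peptides
dipeptides = list(enumerate_peptides(2))  # all dipeptides
print(len(single), len(dipeptides))
```

Exhaustive enumeration stays cheap at these lengths, but the count grows by a factor of 20 per added residue, which is consistent with the article's statement that tripeptides and tetrapeptides were only sampled.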
Complexes of 2 to 4 peptide fragments were also constructed, with their relative orientations randomized to simulate intermolecular interactions. The entire data generation process made no reference to natural sequences or experimental structures, to avoid potential data leakage. To keep the computational cost under control, the total number of atoms (including hydrogens) in every peptide and complex was limited to 120.
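A rough sketch of how randomly oriented complexes could be assembled under the 120-atom cap is given below. The function names, the translation range `max_shift`, and the absence of any overlap check are all assumptions made for illustration; the article does not describe how the orientations were actually drawn.

```python
import math
import random

MAX_ATOMS = 120  # cap stated in the article, hydrogens included


def random_rotation(rng):
    """Return a uniformly random 3x3 rotation matrix built from a random unit quaternion."""
    u1, u2, u3 = rng.random(), rng.random(), rng.random()
    a, b = math.sqrt(1.0 - u1), math.sqrt(u1)
    x = a * math.sin(2 * math.pi * u2)
    y = a * math.cos(2 * math.pi * u2)
    z = b * math.sin(2 * math.pi * u3)
    w = b * math.cos(2 * math.pi * u3)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]


def place_fragment(coords, rng, max_shift=5.0):
    """Rotate a fragment about its centroid, then translate it by a random offset."""
    n = len(coords)
    centroid = [sum(c[i] for c in coords) / n for i in range(3)]
    rot = random_rotation(rng)
    shift = [rng.uniform(-max_shift, max_shift) for _ in range(3)]
    placed = []
    for c in coords:
        d = [c[i] - centroid[i] for i in range(3)]
        placed.append(tuple(
            sum(rot[i][j] * d[j] for j in range(3)) + centroid[i] + shift[i]
            for i in range(3)
        ))
    return placed


def build_complex(fragments, rng):
    """Combine 2 to 4 fragments; return None if the atom cap is exceeded."""
    if not 2 <= len(fragments) <= 4:
        raise ValueError("a complex needs 2 to 4 fragments")
    if sum(len(f) for f in fragments) > MAX_ATOMS:
        return None
    return [place_fragment(f, rng) for f in fragments]
```

Because each fragment is moved as a rigid body, its internal geometry is preserved; a real pipeline would also need to reject placements in which fragments overlap.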
After obtaining the initial conformations, the researchers ran molecular dynamics simulations with the GFN-FF force field to sample non-equilibrium structures. Cartesian coordinate constraints kept the overall configuration close to the initial input, while the torsional and intermolecular degrees of freedom were left free.
An active learning strategy based on query-by-committee was then introduced. First, 500,000 initial samples were randomly selected to train an ensemble of 4 models. Four rounds of iteration followed: in each round, structures were screened by the ensemble's uncertainty in predicted energies and atomic forces, and the high-uncertainty structures were computed with DFT and added to the training set. The last round also used uncertainty-guided optimization, which preferentially selected boundary structures with high prediction uncertainty but low energy. The result was a training set of approximately 1 million samples, averaging about 42 atoms each.
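The selection step of query-by-committee can be sketched as follows. This is a toy illustration, not the authors' code: the committee members are hypothetical functions of a single number, the uncertainty is measured on the energy only (the article also uses atomic forces), and the threshold value is arbitrary.

```python
import statistics


def committee_uncertainty(predictions):
    """Disagreement of the committee: population standard deviation of its predictions."""
    return statistics.pstdev(predictions)


def select_for_labeling(pool, committee, threshold):
    """Return the pool items on which the committee disagrees by more than threshold."""
    selected = []
    for structure in pool:
        energies = [model(structure) for model in committee]
        if committee_uncertainty(energies) > threshold:
            selected.append(structure)
    return selected


# Toy committee of 4 "models" (hypothetical stand-ins for trained networks).
committee = [
    lambda x: x * x,
    lambda x: x * x + 0.01,
    lambda x: x * x - 0.01,
    lambda x: x * x + 0.1 * max(0.0, x - 2.0),  # diverges only for x > 2
]
pool = [0.5, 1.0, 1.5, 2.5, 3.0]
to_label = select_for_labeling(pool, committee, threshold=0.02)
print(to_label)  # the structures that would be sent to DFT in this round
```

In each round, the selected structures would be labeled with the reference method and added to the training set before the committee is retrained.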
In addition to the theoretically generated data, the researchers selected experimental structures from the RCSB and EMDB databases for model validation. The selection criteria were: protein-only, single-conformation models; 1,000 to 10,000 non-hydrogen atoms; a resolution of 2.5 to 4 Å; a MolProbity clashscore below 50; and bond length and bond angle deviations of no more than 4 times the standard values.
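The selection criteria above translate naturally into a filter. The sketch below assumes a hypothetical `ModelStats` record with illustrative field names, treats the ranges as inclusive, and expresses the bond and angle criterion as a ratio to the standard value; none of these details are specified in the article.

```python
from dataclasses import dataclass


@dataclass
class ModelStats:
    """Hypothetical per-entry summary; the field names are illustrative only."""
    protein_only: bool
    single_conformation: bool
    n_heavy_atoms: int
    resolution: float             # in Å
    clashscore: float             # MolProbity clashscore
    bond_deviation_ratio: float   # bond length deviation relative to the standard value
    angle_deviation_ratio: float  # bond angle deviation relative to the standard value


def passes_screen(m):
    """Apply the selection criteria listed in the article (bounds assumed inclusive)."""
    return (
        m.protein_only
        and m.single_conformation
        and 1000 <= m.n_heavy_atoms <= 10000
        and 2.5 <= m.resolution <= 4.0
        and m.clashscore < 50
        and m.bond_deviation_ratio <= 4
        and m.angle_deviation_ratio <= 4
    )


candidate = ModelStats(True, True, 4200, 3.1, 12.0, 1.3, 1.8)
print(passes_screen(candidate))
```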
AQuaRef: An AI-driven quantum refinement method for macromolecular systems
AQuaRef first checks the completeness of the input atomic model. The program tries to add any missing atoms automatically, but this step can introduce new steric clashes, especially when the original model contains no hydrogen atoms. If the missing atoms are essential ones, such as backbone atoms, quantum refinement cannot proceed. If obvious steric clashes or severe geometric anomalies are detected, a quick geometry regularization with standard stereochemical constraints is run first, to remove the problems while moving the atoms as little as possible.
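The decision logic described above can be sketched as a small planning function. The step names, the function signature, and the two-residue template table are hypothetical, only heavy atoms are considered, and clash detection is reduced to a flag supplied by the caller; this is not the Q|R implementation.

```python
BACKBONE_ATOMS = {"N", "CA", "C", "O"}

# Heavy-atom templates for two residue types only, to keep the sketch short.
EXPECTED_HEAVY_ATOMS = {
    "GLY": {"N", "CA", "C", "O"},
    "ALA": {"N", "CA", "C", "O", "CB"},
}


def plan_preprocessing(residues, has_severe_clashes):
    """Return the ordered steps to run before and including quantum refinement.

    residues: list of (residue_name, set_of_atom_names_present) tuples.
    Raises ValueError if any backbone atom is missing.
    """
    needs_completion = False
    for number, (name, present) in enumerate(residues, start=1):
        missing = EXPECTED_HEAVY_ATOMS[name] - present
        if missing & BACKBONE_ATOMS:
            raise ValueError(f"residue {number} ({name}) lacks backbone atoms")
        if missing:
            needs_completion = True

    steps = []
    if needs_completion:
        steps.append("complete_missing_atoms")
    if has_severe_clashes:
        steps.append("regularize_geometry")
    steps.append("quantum_refinement")
    return steps


print(plan_preprocessing([("ALA", {"N", "CA", "C", "O"})], has_severe_clashes=False))
```

Since atom completion can itself create clashes, a real workflow would evaluate the clash condition after completion rather than take it as an input.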
For crystallographic data, refinement must also account for unit cell symmetry and periodic interactions. The program expands the model into a supercell using the space group symmetry operators and then truncates it, keeping only the symmetry copies within a set distance of the atoms of the main copy. This step is usually unnecessary for cryo-electron microscopy structures.
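The expand-and-truncate idea can be sketched for a simplified case. The code assumes an orthorhombic cell (so fractional coordinates convert to Cartesian by simple scaling), assumes the first symmetry operator is the identity, considers only the neighboring cells one lattice step away, and keeps individual atoms rather than whole copies or residues. All of these are simplifications chosen for illustration.

```python
import math
from itertools import product


def expand_and_truncate(frac_coords, sym_ops, cell_lengths, cutoff):
    """Return Cartesian coordinates of symmetry-copy atoms close to the main copy.

    frac_coords:  fractional coordinates of the main copy
    sym_ops:      list of (3x3 rotation, translation) pairs in fractional space;
                  the first entry is assumed to be the identity
    cell_lengths: (a, b, c) of an orthorhombic cell in Å (simplifying assumption)
    cutoff:       keep copy atoms within this distance (Å) of any main-copy atom
    """
    def to_cart(f):
        return tuple(f[i] * cell_lengths[i] for i in range(3))

    main = [to_cart(f) for f in frac_coords]
    kept = []
    for op_index, (rot, trans) in enumerate(sym_ops):
        for shift in product((-1, 0, 1), repeat=3):
            if op_index == 0 and shift == (0, 0, 0):
                continue  # this combination is the main copy itself
            for f in frac_coords:
                g = [
                    sum(rot[i][j] * f[j] for j in range(3)) + trans[i] + shift[i]
                    for i in range(3)
                ]
                atom = to_cart(g)
                if any(math.dist(atom, m) <= cutoff for m in main):
                    kept.append(atom)
    return kept


# Space group P1 (identity only): two atoms near opposite faces of a 10 Å cubic cell.
identity = ([[1, 0, 0], [0, 1, 0], [0, 0, 1]], (0.0, 0.0, 0.0))
atoms = [(0.05, 0.5, 0.5), (0.95, 0.5, 0.5)]
neighbors = expand_and_truncate(atoms, [identity], (10.0, 10.0, 10.0), cutoff=2.0)
print(len(neighbors))
```

In this toy case the two atoms are 9 Å apart inside the cell but only 1 Å from each other's periodic images, which is exactly the kind of contact that would be invisible without the expansion.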
After atom completion and model expansion, the system enters the standard refinement workflow of the Q|R software package. The core architecture of AQuaRef is essentially the same as that of the base AIMNet2 model, with several key adjustments for the structure refinement task.
First, the model does not compute long-range Coulomb and dispersion interactions explicitly; instead, it is trained directly to reproduce the total DFT-D4 energy. The reason is that under the CPCM implicit solvent model, the Coulomb interaction is difficult to estimate accurately from partial atomic charges, and long-range interactions are already strongly screened by the polarizable continuum. In addition, the long-range dispersion term beyond the 5 Å cutoff radius contributes very little to the atomic forces that matter during refinement, so it can be neglected without loss of accuracy.
Second, the model adds an explicit short-range exponential repulsion term taken from GFN1-xTB, which makes it more stable on structures with steric clashes. The training targets are the energies, atomic forces, and Hirshfeld partial atomic charges calculated at the B97M-D4/def2-QZVP level. Training starts from randomly initialized weights, with a batch size of 256 and 1.5 million training steps in total. The remaining hyperparameters follow the original AIMNet2 settings.
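To illustrate why an explicit repulsion term helps with clashing structures, the sketch below implements a generic single-exponential pair repulsion, E = sum over pairs of A * exp(-b * r). The functional form and the values of `amplitude` and `decay` are hypothetical placeholders, not the GFN1-xTB parameterization, which the article does not spell out.

```python
import math


def exponential_repulsion(coords, amplitude=100.0, decay=3.0):
    """Energy and per-atom forces for E = sum_{i<j} A * exp(-b * r_ij).

    amplitude (A) and decay (b) are illustrative values, not fitted parameters.
    Atoms must not coincide exactly, since the force direction is undefined at r = 0.
    """
    n = len(coords)
    energy = 0.0
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            delta = [coords[i][k] - coords[j][k] for k in range(3)]
            r = math.sqrt(sum(d * d for d in delta))
            pair_energy = amplitude * math.exp(-decay * r)
            energy += pair_energy
            # -dE/dr = A * b * exp(-b * r), directed along the i-j axis
            magnitude = decay * pair_energy
            for k in range(3):
                forces[i][k] += magnitude * delta[k] / r
                forces[j][k] -= magnitude * delta[k] / r
    return energy, forces


energy, forces = exponential_repulsion([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)])
print(energy, forces)
```

The force grows steeply as two atoms approach each other and always points along the line separating them, so a minimizer started from a clashing model is pushed toward a clash-free geometry.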
In terms of computational efficiency, as shown in the figure below, both the time to compute energies and atomic forces in the AIMNet2 framework and the peak GPU memory usage grow linearly, O(N), with the number of atoms in the system. For a protein system of approximately 100,000 atoms, a single-point energy and force calculation takes only about 0.5 seconds, and a single NVIDIA H100 GPU with 80 GB of memory can handle models of up to approximately 180,000 atoms.
The computational scaling law of the AIMNet2 machine learning interatomic potential function in AQuaRef
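The linear scaling quoted above allows a back-of-the-envelope runtime estimate. The sketch assumes that the cost is strictly proportional to the atom count, anchored at the article's figure of about 0.5 seconds for 100,000 atoms; any fixed overhead is ignored, so the numbers are rough extrapolations, not measurements.

```python
REFERENCE_ATOMS = 100_000   # system size quoted in the article
REFERENCE_SECONDS = 0.5     # approximate time per energy and force evaluation


def estimate_seconds(n_atoms):
    """Estimated time for one energy and force evaluation under pure O(N) scaling."""
    return REFERENCE_SECONDS * n_atoms / REFERENCE_ATOMS


for n in (10_000, 100_000, 180_000):
    print(n, estimate_seconds(n))
```

Under this assumption, the largest model quoted for a single 80 GB GPU, about 180,000 atoms, would need a little under one second per evaluation.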
Validation on 41 cryo-electron microscopy and 20 X-ray models shows local structural differences of up to 2 Å from standard refinement
To evaluate AQuaRef, the researchers built a test set of 41 cryo-electron microscopy models, 20 low-resolution X-ray models, and 10 ultra-high-resolution X-ray models. Each of the 61 low-resolution models (41 cryo-electron microscopy and 20 X-ray) has a corresponding high-resolution homologous reference structure. Three sets of constraints were compared during refinement: AIMNet2 quantum constraints (i.e., AQuaRef), standard geometric constraints, and standard constraints supplemented with additional constraints such as hydrogen bonds and secondary structure.
The results are shown in the figure below. After quantum refinement, the low-resolution models score significantly better than those from traditional constraint methods on geometric indicators such as the MolProbity score and the Ramachandran plot Z-score, while the fit to the experimental data remains essentially the same. For X-ray structures, overfitting is slightly reduced (the gap between Rwork and Rfree is smaller). For cryo-electron microscopy structures, CCmask decreases slightly while the EMRinger score remains essentially unchanged. Combined with the overall improvement in geometric quality, this suggests that overfitting may be reduced there as well.
Adding extra geometric constraints to the standard constraints also improves model quality, but AQuaRef still yields more reasonable geometry and models that are closer to the high-resolution references. In some cases, the local difference between structures refined with standard constraints and with quantum refinement reaches 2 Å.
Optimization results of 41 cryo-electron microscopy models and 20 X-ray models
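The gap between Rwork and Rfree used above as an overfitting indicator follows from the standard crystallographic R-factor, R = Σ| |Fobs| − |Fcalc| | / Σ|Fobs|, computed separately on the reflections used in refinement (work set) and on those held out (free set). The sketch below uses made-up amplitudes and omits the scale factor between observed and calculated amplitudes, so it illustrates the definition rather than reproducing any refinement program.

```python
def r_factor(f_obs, f_calc):
    """Crystallographic R-factor: sum(| |Fobs| - |Fcalc| |) / sum(|Fobs|)."""
    numerator = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    denominator = sum(abs(o) for o in f_obs)
    return numerator / denominator


def overfitting_gap(work, free):
    """Rfree - Rwork, where work and free are (f_obs, f_calc) pairs of lists."""
    return r_factor(*free) - r_factor(*work)


# Made-up structure factor amplitudes, for illustration only.
work_set = ([100.0, 50.0, 25.0], [90.0, 55.0, 25.0])
free_set = ([80.0, 40.0], [70.0, 50.0])
print(r_factor(*work_set), r_factor(*free_set), overfitting_gap(work_set, free_set))
```

A model that fits the work reflections much better than the held-out ones has a large gap, which is why a smaller gap after refinement is read as reduced overfitting.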
The study also compared AQuaRef with several mainstream refinement methods: AMBER, Rosetta, and REFMAC5 for X-ray data, and Servalcat for cryo-electron microscopy data. The results are shown in the figure below. Overall, AQuaRef achieves a slightly better Rfree and the lowest degree of overfitting. Against Servalcat, the EMRinger scores are comparable, but Servalcat reaches a slightly higher CCmask.
In terms of geometric quality, AQuaRef performs similarly to Rosetta and significantly better than REFMAC5 and Servalcat. Rosetta agrees slightly better overall with the reference models, which may be related to the larger convergence radius of its non-gradient optimization strategy. Both AQuaRef and Rosetta produce reasonable hydrogen bond geometry, followed by AMBER, while REFMAC5 and Servalcat largely fail to reproduce these details.
Optimization results of AQuaRef, Servalcat (SE), REFMAC5 (RE), AMBER (AM), and Rosetta for 61 low-resolution models
To test how AQuaRef handles protonation states in short hydrogen bond systems, the researchers used the Parkinson's disease-related protein DJ-1 and its homolog YajL as examples. Traditional refinement is biased by the stereochemical constraints in the database and often drives bond lengths away from their true values. When AQuaRef refinement starts from a symmetric, doubly protonated structure, the resulting proton positions and bond geometry agree with those of unconstrained refinement, whereas traditional constraints pull the bond length toward the database's standard value for the non-protonated state. When the experimental data are truncated to 2 Å resolution, which substantially reduces atomic detail, AQuaRef still recovers a structure almost identical to that from the original 1.15 Å data, while traditional constrained refinement deviates further from the true configuration. AQuaRef places the proton on the Oδ2 atom of residue D24 in DJ-1, a result supported by both energy calculations and difference electron density maps.
Bond distance analysis in wild-type DJ-1
In YajL, the AQuaRef results for the two short hydrogen bonds at E14/D23 also agree with unconstrained refinement. They indicate that the proton is shared between D23 and E14, which is typical of a low-barrier hydrogen bond, unlike DJ-1, where the proton sits mainly on a single oxygen atom. The energy profile from AIMNet2 shows a relatively flat potential energy surface, meaning that the proton position can be adjusted freely under the constraints of the experimental data. The difference electron density map also shows significant peaks above 3σ near the hydrogen atoms, providing further evidence for this interpretation.
Energy distribution diagram along the hydrogen bond