Reaching a 97% F1 score, MOFSeq-LMM from Princeton University and collaborators efficiently predicts whether MOFs can be synthesized
A joint research team from Princeton University and the Colorado School of Mines has proposed an efficient machine-learning prediction method. It uses a large language model to predict free energy directly from the structural sequences of MOFs, significantly reducing computational cost and enabling high-throughput, scalable thermodynamic evaluation of MOFs.
Metal–Organic Frameworks (MOFs) have shown great potential in applications such as gas storage, separation, catalysis, and drug delivery due to their highly tunable pore structures and rich chemical functionalities. However, the vast design space of MOFs encompasses trillions of possible combinations of building blocks, and relying solely on experimental exploration is extremely inefficient.
To accelerate MOF discovery, computational workflows have emerged that aim to generate new MOFs, predict their properties, and ultimately lead to synthesis. The main challenge in this process is the low "screening-to-synthesis" conversion rate, which largely stems from uncertainty about whether computer-generated MOFs can actually be synthesized. For example, of the thousands of computational MOF screening studies published so far, only about a dozen have been accompanied by MOF synthesis.
Free energy is an important indicator of the thermodynamic stability and synthetic feasibility of MOFs. However, traditional computational methods are too costly for large-scale MOF datasets and cannot support rapid screening. To address this challenge, a joint research team from Princeton University and the Colorado School of Mines proposed an efficient prediction method based on machine learning. By using a large language model (LLM) to predict free energy directly from the structural sequences of MOFs, the method significantly reduces computational cost and enables high-throughput, scalable thermodynamic evaluation of MOFs. The model also generalizes well without retraining: when judging whether the free energy of a MOF is above or below the empirical synthetic feasibility threshold, it achieves an F1 score of 97%.
The research results, titled "Highly Accurate and Fast Prediction of MOF Free Energy via Machine Learning," have been published in the Journal of the American Chemical Society (JACS), an ACS Publications journal.
Research Highlights:
* Using this free energy prediction model, researchers can accurately reproduce the results of full molecular simulations without retraining, and thereby judge the synthetic feasibility of MOFs.
* Work that previously took a lot of time in the laboratory or through molecular simulations now takes negligible time.
* This method provides a feasible approach for using machine-learning free energy prediction as an early or late screening tool in performance-based computational MOF screening.
Paper link: https://pubs.acs.org/doi/10.1021/jacs.5c13960
MOFMinE: Covering 1 Million MOF Prototypes
To support model training, the research team constructed a large-scale MOF dataset called MOFMinE, which covers approximately 1 million MOF prototypes. It records the entire construction process, from building block selection through topological template mapping to functional modification, as shown in the following figure:
Overview of the construction and characterization of the MOFMinE dataset, containing approximately 1 million structures
Construction Method
The dataset is generated with the ToBaCCo-3.0 platform. Each MOF is built by mapping its constituent building units onto a topological template that has been scaled to match the size of those units. The template guides the spatial arrangement and connectivity of the building units in the MOF unit cell. ToBaCCo building units are classified by where they are mapped: node-type building blocks (NBBs) are mapped to the vertices of the template, and edge-type building blocks (EBBs) are mapped to its edges. NBBs can be inorganic or organic. Inorganic NBBs correspond to the secondary building units (SBUs) of MOFs, while organic NBBs combine with EBBs to form MOF linkers.
Data Scale and Diversity
MOFMinE contains 1,393 topological templates, 27 inorganic NBBs, 14 organic NBBs, and 19 basic EBBs, and covers 13 functional modifications, ensuring chemical and topological diversity. Across the database, the void fraction ranges from 0.01 to 0.99, the geometric surface area (GSA) from 26 to 8,382 m²/g, and the largest pore diameter (LPD) from 2.6 to 127.7 Å, covering a wide range of the MOF structural space.
Free Energy Subset
Of these 1 million MOF prototypes, free energy data are available for a subset of 65,574 structures. This subset contains 379 topological templates, 6 inorganic NBBs, 11 organic NBBs, and 12 basic EBBs, with 13 functional modifications. Its pore properties span a void fraction (Vf) of 0.01 to 0.97, a GSA of 38 to 7,304 m²/g, and an LPD of 2.6 to 87.8 Å. This subset is used to fine-tune and test the LLM's free energy prediction.
MOFSeq-LMM Model for Efficient Prediction of MOF Free Energy
Building on the MOFMinE dataset, the research team constructed the MOFSeq-LMM framework to predict MOF free energy efficiently and to enable end-to-end, data-driven design from structure to properties. The core idea is to convert the structural information of a MOF into a machine-readable sequence representation (MOFSeq) and feed it to a large language model for learning and prediction, thereby significantly reducing computational cost while retaining physicochemical information.
MOFSeq Representation
To overcome the limitations of existing representation strategies and make full use of large language models for broad MOF property prediction, the researchers developed MOFSeq. This new string-based sequence representation is both compact and highly informative: it encodes the local and global structural features of a MOF in a form that a language model can process efficiently and at scale.
In MOFSeq, local information mainly comprises the atomic composition of the building units and their internal connectivity, while global information mainly comprises high-level descriptions of the building units and the connection patterns between them. Local information is obtained with the MOFid tool, and global information comes from ToBaCCo-3.0, as shown in the following figure:
Schematic diagram of MOFSeq
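To make the idea of a combined local and global string concrete, here is a minimal sketch. The `MofRecord` class, the field labels, the separator, and the whitespace tokenization are all hypothetical stand-ins chosen for illustration; the actual MOFSeq layout is defined in the paper and is not reproduced here. The 2,000-token limit is the input length stated in the article.

```python
from dataclasses import dataclass


@dataclass
class MofRecord:
    """Hypothetical container for the two information sources described above."""
    local_info: str   # e.g. a MOFid-style string (atomic composition, connectivity)
    global_info: str  # e.g. a ToBaCCo-style MOFname (topology, building units)


def to_sequence(record: MofRecord, sep: str = " | ") -> str:
    """Join local and global descriptors into one string for a language model.

    The labels, separator, and field order are illustrative assumptions,
    not the published MOFSeq format.
    """
    return f"local: {record.local_info}{sep}global: {record.global_info}"


def truncate_tokens(sequence: str, max_tokens: int = 2000) -> list[str]:
    """Split on whitespace and keep at most `max_tokens` tokens.

    A real pipeline would use the model's own tokenizer; whitespace splitting
    only stands in for it here.
    """
    return sequence.split()[:max_tokens]


record = MofRecord(local_info="<MOFid string>", global_info="<MOFname string>")
tokens = truncate_tokens(to_sequence(record))
```

Keeping the two sources in clearly separated fields would let an ablation drop either one without changing the rest of the pipeline.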
MOF Database Construction and Data Processing
Once the MOFMinE prototypes have been generated by ToBaCCo as described above, they are optimized using the UFF4MOF force field in LAMMPS (version October 29, 2020) to obtain the final MOF structures.
The dataset generated by ToBaCCo-3.0 represents each MOF only by its MOFname and the corresponding CIF file. However, MOFSeq requires both the MOFname and the MOFid. To obtain the MOFid, the researchers used the MOFid generator developed by Bucior et al., which produces both a MOFid and a MOFkey from the CIF structure of a MOF.
Finally, the 793,079 MOFSeq pre-training samples are divided into a training set of 634,463, a validation set of 79,308, and a test set of 79,308. The 54,443 MOFSeq fine-tuning samples are divided into a training set of 43,554, a validation set of 5,444, and a test set of 5,445.
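The reported split sizes correspond to roughly an 80/10/10 division, which the short check below confirms from the article's numbers. The `random_split` helper is a hypothetical illustration: the article does not say how samples were assigned to each split, so the seeded random shuffle is an assumption.

```python
import random

# Split sizes as reported in the article.
REPORTED_SPLITS = {
    "pre-training": {"total": 793_079, "train": 634_463, "val": 79_308, "test": 79_308},
    "fine-tuning": {"total": 54_443, "train": 43_554, "val": 5_444, "test": 5_445},
}


def split_fractions(entry):
    """Return each split's share of the total, checking that the parts add up."""
    parts = {name: entry[name] for name in ("train", "val", "test")}
    if sum(parts.values()) != entry["total"]:
        raise ValueError("split sizes do not sum to the reported total")
    return {name: size / entry["total"] for name, size in parts.items()}


def random_split(items, n_train, n_val, seed=0):
    """Shuffle a copy of `items` and cut it into train/val/test lists.

    Random shuffling with a fixed seed is an assumption; the article does not
    describe the assignment procedure.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


for stage, entry in REPORTED_SPLITS.items():
    fractions = split_fractions(entry)
    print(stage, {name: round(frac, 3) for name, frac in fractions.items()})
```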
LLM-Prop Model Design
Based on the MOFSeq representation, the research team adopted LLM-Prop, a large language model designed specifically for material property prediction. LLM-Prop has a moderate size of about 35 million parameters, which balances learning capacity against computational efficiency. The model's input length is set to 2,000 tokens, which can accommodate the structural sequence of most MOFs. Through the attention mechanism, the model can adaptively capture how different building blocks and topologies in the sequence influence the free energy, forming a representation in which global and local features interact.
Pre-training and Fine-tuning
* Pre-training stage:
The researchers trained LLM-Prop to predict the strain energy of MOFs from the MOFSeq representation. Strain energy was chosen because it is cheap to compute and highly correlated with free energy. Dropout rates of 0.2 and 0.5 were tested during pre-training, and a dropout rate of 0.2 performed better in both pre-training and downstream tasks. The MOFSeq input length was set to 2,000 tokens.
* Fine-tuning stage:
The settings are the same as in pre-training, except that the model's target is changed to free energy and the number of training epochs is increased to 200. LLM-Prop is designed as a lightweight model, about 1/2000 the size of Llama 2, and prioritizes computational efficiency. This design involves a trade-off: compared with fine-tuning larger LLMs (such as Llama 2 or GPT-2), LLM-Prop needs more training epochs to reach high performance, but its small size keeps training feasible and efficient.
MOF Synthesizability Prediction Reaches 97%
After training the MOFSeq-LMM model, the research team systematically evaluated its performance in free energy prediction, synthetic feasibility judgment, and screening of polymorphic MOFs. The results not only confirm the model's high accuracy but also highlight its potential for high-throughput MOF design and screening.
Free Energy Prediction Performance
First, the team evaluated the free energy prediction performance of LLM-Prop on unseen MOF samples. The model predicts the free energy with a mean absolute error (MAE) of 0.789 kJ/mol per MOF atom and an R² of 0.990, as shown in Figure b below. In other words, its predictions are close to the reference values for most MOF samples.
In the pre-training stage, the model was trained on strain energy data and achieved an MAE of 0.623 kJ/mol per MOF atom and an R² of 0.965, as shown in Figure a below. The high R² at this stage indicates that strain energy data provide useful preliminary information for free energy prediction, supporting the team's pre-training strategy. Further analysis shows that strain energy (the pre-training target) is highly correlated with free energy (the fine-tuning target), which demonstrates the value of strain energy as a low-cost proxy in model training.
Performance of the research method in MOF free energy prediction
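For reference, the two metrics quoted above can be computed as in the sketch below. It assumes that R² denotes the usual coefficient of determination; the article does not state which definition was used, and the numbers in the example are toy values, not data from the study.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between reference and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values for illustration only (not taken from the paper).
reference = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 3.0, 3.5]

mae = mean_absolute_error(reference, predicted)
r2 = r_squared(reference, predicted)
```

MAE is reported in the units of the target (here kJ/mol per MOF atom), while R² is dimensionless, which is why the article quotes both.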
Ablation Experiment Results
To understand where the model's performance comes from, the team conducted a systematic ablation experiment that examined the separate effects of local features, global features, and pre-training on free energy prediction. The results are as follows:
Results of the ablation experiment
* With only local features:
With pre-training, the MAE decreased from 1.242 to 1.168 kJ/mol per MOF atom and the R² increased from 0.971 to 0.974, indicating that pre-training improves the model's generalization when only local features are available.
* With only global features:
Performance is noticeably better than with local features alone: the MAE falls below 1.0 kJ/mol per MOF atom and the R² rises to about 0.980. Pre-training has little effect in this case (the MAE decreased from 0.994 to 0.989 kJ/mol per MOF atom and the R² increased from 0.979 to 0.980), indicating that global features carry more task-relevant information on their own and benefit less from pre-training.
* Combining local and global features:
With pre-training, the model achieved its best performance, an MAE of 0.789 kJ/mol per MOF atom and an R² of 0.990, showing that combining the two types of features is crucial for improving prediction accuracy.
These ablation results show that the combination of global and local features in MOFSeq and the pre-training strategy are the key contributors to the model's predictive ability.
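The ablation numbers above can be compared directly with a few lines of code. The sketch below only restates the values reported in the article and computes the relative MAE reduction from pre-training, which comes to roughly 6% for local features and roughly 0.5% for global features. The table layout and the `pretraining_gain` helper are illustrative choices.

```python
# (features, pre-trained?, MAE in kJ/mol per MOF atom, R²) as reported in the article.
ABLATION = [
    ("local", False, 1.242, 0.971),
    ("local", True, 1.168, 0.974),
    ("global", False, 0.994, 0.979),
    ("global", True, 0.989, 0.980),
    ("local+global", True, 0.789, 0.990),
]


def pretraining_gain(features):
    """Relative MAE reduction (percent) from pre-training for one feature set.

    Raises KeyError if the article does not report both the pre-trained and
    the non-pre-trained result for that feature set.
    """
    mae = {pretrained: value for name, pretrained, value, _ in ABLATION if name == features}
    return 100.0 * (mae[False] - mae[True]) / mae[False]


for features in ("local", "global"):
    print(f"{features}: {pretraining_gain(features):.2f}% lower MAE with pre-training")

# Configuration with the lowest reported MAE.
best = min(ABLATION, key=lambda row: row[2])
```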
Synthetic Feasibility Judgment
In industrial applications, the more critical task is to judge whether a MOF can be synthesized, rather than to determine the absolute value of its free energy. The research team applied a threshold of 4.4 kJ/mol per MOF atom to ΔL_MFFL (an indicator based on free energy correction) to perform binary classification of the synthetic feasibility of MOFs. The experimental results are shown in the following figure:
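A minimal sketch of how a threshold-based binary classification can be scored with F1 is given below. The 4.4 threshold is taken from the article. Treating values at or below the threshold as "feasible" is an assumption, and the reference and predicted values are toy numbers, not data from the study.

```python
THRESHOLD = 4.4  # kJ/mol per MOF atom, as stated in the article


def is_feasible(value, threshold=THRESHOLD):
    """Label a MOF as feasible when its indicator is at or below the threshold.

    The direction and inclusiveness of the comparison are assumptions.
    """
    return value <= threshold


def f1_score(true_labels, pred_labels):
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy indicator values for illustration only.
reference = [3.0, 4.0, 5.0, 6.0, 4.3]
predicted = [3.2, 4.6, 5.1, 4.2, 4.1]

true_labels = [is_feasible(v) for v in reference]
pred_labels = [is_feasible(v) for v in predicted]
score = f1_score(true_labels, pred_labels)
```

Because only the side of the threshold matters, a prediction can have a noticeable numerical error and still be classified correctly, which helps explain how a regression model with a nonzero MAE can reach a high F1 score.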