
Autonomously generating new materials: scientists achieve inverse design of gallium-containing materials with a Bayesian optimization framework, and the optimized candidates are 100% unique and novel

HyperAI (超神经) · 2026-05-27 18:10
Accelerate the design of next-generation computer chips and electronic materials

A research team led by Flinders University, in collaboration with Khalifa University in the United Arab Emirates, has proposed a machine learning-guided Bayesian Optimization (BO) framework. The framework performs inverse design of gallium-based compositions with preset electronic properties while keeping them chemically plausible. Analysis of the optimized results shows that the generated materials are 100% unique and novel relative to the training data, and that the SMACT validity of the candidates improves significantly in the 1.5–2.5 eV band gap range.

In the modern semiconductor industry, the limits of material performance are constantly being pushed further. From high-efficiency photovoltaic devices to high-brightness light-emitting diodes (LEDs), and on to high-frequency communication and quantum information systems, almost all key technologies rely on one underlying capability: precise control of a material's electronic structure, and of its band gap in particular. This goal has long been difficult to achieve with traditional materials science methods.

The reason is that the electronic properties of materials are not simply determined by a single element but are jointly affected by complex chemical bonding, crystal structure, electronic orbital hybridization, and the synergistic effect of multiple elements. Among many material systems, gallium-based semiconductors occupy a unique position. The excellent chemical diversity and multivalent characteristics of gallium enable it to exhibit a series of adjustable electronic properties from wide band gap to narrow band gap.

Gallium-containing compounds have become an important foundation for key optoelectronic and energy conversion technologies such as high-efficiency solar cells, high-brightness LEDs, and high-frequency communication devices. They are also emerging as candidate materials for flexible, biocompatible, and implantable electronic systems. However, despite decades of research, the discovery of new gallium-containing materials with specific target electronic properties still relies largely on empirical exploration. The main obstacles are the vast compositional design space and the high computational cost of first-principles calculations.

In this context, a research team led by Flinders University, in collaboration with Khalifa University in the United Arab Emirates, has proposed a machine learning-guided Bayesian Optimization (BO) framework. The framework performs inverse design of gallium-based compositions with preset electronic properties while keeping them chemically plausible.

With this unified framework, the system can autonomously generate new, chemically valid gallium-containing materials with a band gap tunable over 0.5–3.5 eV, an energy range that matters for applications such as solar energy, photonics, and power electronics. The Bayesian optimization process adaptively steers the search toward the region with the highest "expected improvement". Analysis of the optimized results shows that the generated materials are 100% unique and novel relative to the training data, and that the SMACT validity of the candidates improves significantly in the 1.5–2.5 eV band gap range.

The research, titled "Bayesian Optimization-Guided Discovery of Gallium-Containing Semiconductors with Targeted Band Gaps", has been published in ACS Materials Letters (ACS Publications).

Research Highlights:

* The new framework can accelerate inverse materials design under real chemical constraints, providing an alternative to the traditional screening approach centered on DFT (Density Functional Theory).

* The new framework not only covers the chemically reasonable region efficiently but also maintains high novelty and compositional diversity relative to existing databases.

* The research breaks through the limitations of traditional static property prediction and promotes semiconductor discovery towards a data-driven generative research paradigm.

Paper Address: https://pubs.acs.org/doi/10.1021/acsmaterialslett.5c01482

Dataset: Constructing a Chemical Learning Space from Real Material Databases

This study used the NOMAD and Materials Project databases to train the model. The data comprise the chemical composition of each material and its corresponding experimental band gap value, for compounds such as Ga₄P₄, GaAs, GaN, and Ga₂O₃. The initial dataset contains 2,530 records of material compositions and their band gaps.

To ensure data quality, the study deleted samples with missing values in the "composition" or "band_gap" columns, and also removed non-physical or negative band gap values and duplicate records. The pymatgen software package was used to standardize the chemical formula strings so that chemically equivalent entries were merged, and the band gap unit was converted uniformly from joules to electron volts (eV). In the end, 1,578 valid compositions were retained for modeling. In the preprocessed dataset, the band gap ranges from 0.0 to 5.92 eV, with a mean of about 1.8 eV and a standard deviation of 1.6 eV.
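The cleaning steps described above can be sketched in a few lines. This is a minimal illustration and not the authors' pipeline: the `clean_records` helper and the sample rows are hypothetical, and stripping whitespace stands in for the formula normalization that the study performed with pymatgen. Only the column names "composition" and "band_gap" come from the article.

```python
import math

J_PER_EV = 1.602176634e-19  # joules per electron volt (exact SI value)

def clean_records(records):
    """Drop incomplete, non-physical, and duplicate rows; convert J to eV."""
    seen = set()
    cleaned = []
    for row in records:
        comp = row.get("composition")
        gap_j = row.get("band_gap")
        if not comp or gap_j is None or math.isnan(gap_j):
            continue  # missing value in either column
        if gap_j < 0:
            continue  # negative band gaps are non-physical
        key = comp.replace(" ", "")  # stand-in for real formula normalization
        if key in seen:
            continue  # duplicate composition
        seen.add(key)
        cleaned.append({"composition": key, "band_gap_eV": gap_j / J_PER_EV})
    return cleaned

# Illustrative rows only; these are not records from the study's dataset.
raw = [
    {"composition": "GaN", "band_gap": 3.4 * J_PER_EV},
    {"composition": "Ga N", "band_gap": 3.4 * J_PER_EV},  # duplicate once normalized
    {"composition": "GaAs", "band_gap": -1.0e-19},        # negative value
    {"composition": None, "band_gap": 2.0e-19},           # missing composition
    {"composition": "Ga2O3", "band_gap": None},           # missing band gap
]
cleaned = clean_records(raw)
```

In this toy input only the first GaN row survives, which mirrors how the study's filters reduced 2,530 raw records to 1,578 compositions.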

The study further filtered the compositions, retaining only compounds whose elements belong to a predefined set of atomic numbers, so that the work stayed focused on the gallium-based material system. Several additional features were also constructed, including:

* The number of elements in each chemical formula

* The length of the chemical formula string

* A binary indicator of the presence or absence of gallium
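The three features listed above are simple enough to compute directly from a formula string. The sketch below assumes flat ASCII formulas without brackets; `parse_formula` and `simple_features` are hypothetical helper names, and the feature keys are chosen for illustration rather than taken from the paper.

```python
import re

# An element symbol followed by an optional integer or decimal amount.
ELEMENT_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*\.?\d*)")

def parse_formula(formula):
    """Parse a flat formula such as 'Ga2O3' into {element: amount}."""
    amounts = {}
    for symbol, count in ELEMENT_TOKEN.findall(formula):
        amounts[symbol] = amounts.get(symbol, 0.0) + (float(count) if count else 1.0)
    return amounts

def simple_features(formula):
    """Return the three composition-level features described in the text."""
    amounts = parse_formula(formula)
    return {
        "n_elements": len(amounts),      # number of distinct elements
        "formula_length": len(formula),  # length of the formula string
        "has_ga": int("Ga" in amounts),  # 1 if gallium is present, else 0
    }

features = simple_features("Ga2O3")
```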

The dataset was then randomly divided into a training set and a test set at a ratio of 8:2. The split was made at the "composition level" so that chemically similar compounds would not appear in both sets. Five-fold cross-validation was also used to evaluate the robustness of the model under different data splits.
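One way to realize such a split is to assign whole groups of related formulas to a single side. The article does not say how the groups were defined, so the sketch below assumes that a group is a chemical system (the set of elements in the formula); `chemical_system`, `group_split`, and `group_kfold` are hypothetical helpers.

```python
import random
import re

def chemical_system(formula):
    """Group key: sorted element symbols, e.g. 'Ga2O3' -> 'Ga-O'."""
    return "-".join(sorted(set(re.findall(r"[A-Z][a-z]?", formula))))

def group_split(formulas, test_fraction=0.2, seed=0):
    """Put whole chemical systems into train or test so none straddles both."""
    groups = {}
    for f in formulas:
        groups.setdefault(chemical_system(f), []).append(f)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    target = test_fraction * len(formulas)
    train, test = [], []
    for key in keys:
        # Fill the test set until it reaches the target size, then fill train.
        (test if len(test) < target else train).extend(groups[key])
    return train, test

def group_kfold(formulas, k=5, seed=0):
    """Yield (train, validation) lists, keeping each system inside one fold."""
    keys = sorted({chemical_system(f) for f in formulas})
    random.Random(seed).shuffle(keys)
    fold_of = {key: i % k for i, key in enumerate(keys)}
    for fold in range(k):
        val = [f for f in formulas if fold_of[chemical_system(f)] == fold]
        train = [f for f in formulas if fold_of[chemical_system(f)] != fold]
        yield train, val

formulas = ["GaN", "Ga2O3", "GaAs", "GaP", "Ga4P4", "GaSb",
            "Ga2S3", "GaS", "GaSe", "Ga2Se3"]
train, test = group_split(formulas)
```

Because groups differ in size, the realized test fraction is only approximately 20%; that is the usual price of keeping related compositions together.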

Framework: Collaborative Design of Machine Learning and Bayesian Optimization

This study proposes a chemically constrained Bayesian Optimization (BO) framework, as shown in the following figure. First, a gradient boosting regression model trained on a gallium-based compound dataset predicts the band gap. Next, Bayesian optimization iteratively explores the constrained composition space. Finally, the generated candidates are screened for chemical validity, novelty, and uniqueness with the SMACT and pymatgen tools, to identify the best-performing gallium-based compounds that have not been explored before.

Machine learning-guided workflow for the discovery of gallium-based compounds

Prediction Model Layer

The study systematically evaluated 8 machine learning regression algorithms, including linear models, support vector regression, random forests, gradient boosting, and K-Nearest Neighbors (KNN). The results show that the nonlinear models are significantly better than the linear models overall, indicating a strong nonlinear relationship between material composition and band gap. Among them, the KNN model performs the best, with an R² of 0.812 and better error indicators than other models.

Among all the candidate models, KNN was finally selected as the surrogate model for Bayesian optimization, because it interpolates well locally and performs stably across different random splits. Unlike tree-based ensemble models, KNN preserves neighborhood relationships in the composition feature space, which is crucial for recognizing similarity between materials with similar element ratios.

In the context of Bayesian optimization, this "local preservation ability" is particularly important because the optimization search often focuses on potential areas near known high-quality candidates. Therefore, the non-parametric and locally adaptive characteristics of KNN can provide a smoother and more reliable search guide for the optimizer and still maintain high computational efficiency in the sparsely sampled material space.
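The local interpolation behaviour of KNN is easy to see in a few lines. The following is a minimal sketch of uniform-weight k-nearest-neighbour regression with Euclidean distance. The value of k, the distance metric, and the toy two-dimensional feature vectors are assumptions for illustration; the article does not report the hyperparameters of the final model.

```python
import math

def knn_predict(train_X, train_y, query, k=3):
    """Predict by averaging the targets of the k nearest training points."""
    # Sort by distance to the query and keep the k closest (distance, target) pairs.
    nearest = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Toy feature vectors and band gaps (eV), purely illustrative.
train_X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
train_y = [1.0, 2.0, 3.0, 10.0]
prediction = knn_predict(train_X, train_y, (0.1, 0.1), k=3)
```

The distant point at (5.0, 5.0) has no influence on the prediction, which is the sense in which the surrogate only "looks" at chemically nearby compositions.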

Bayesian Optimization (BO) Module

This BO workflow uses the KNN surrogate model to guide the search for gallium-containing compositions with the target band gap. It balances "exploration" and "exploitation" through the "Expected Improvement" acquisition function, and in this way generates candidate stoichiometries in the gallium-centered composition space.
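For reference, the standard closed form of Expected Improvement under a Gaussian predictive distribution can be written as below. This is a sketch under stated assumptions: the quantity being minimized is taken to be a loss such as the distance between the predicted and target band gap, and the predictive mean `mu` and standard deviation `sigma` are treated as given. The article does not describe how uncertainty is obtained from the KNN surrogate, and `xi` is an optional exploration margin that is not mentioned in the source.

```python
import math

def expected_improvement(mu, sigma, best, xi=0.0):
    """EI for minimization: E[max(best - Y - xi, 0)] with Y ~ N(mu, sigma^2)."""
    if sigma <= 0.0:
        # No uncertainty: the improvement is deterministic.
        return max(best - mu - xi, 0.0)
    z = (best - mu - xi) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    return (best - mu - xi) * cdf + sigma * pdf

# Two candidates with the same predicted loss but different uncertainty.
ei_uncertain = expected_improvement(mu=1.0, sigma=1.0, best=1.0)
ei_confident = expected_improvement(mu=1.0, sigma=0.5, best=1.0)
```

The first term rewards candidates predicted to beat the current best (exploitation), while the second grows with `sigma` and rewards uncertain regions (exploration).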

The system imposes several constraints: each composition contains at most 4 elements and must meet a minimum gallium content requirement, which keeps the candidate materials relevant to the gallium-based research topic.
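A constraint check of this kind might look like the sketch below. The four-element limit comes from the article, but the minimum gallium fraction is not reported there, so `min_ga_fraction=0.1` is a hypothetical placeholder, as is the function name `satisfies_constraints`.

```python
def satisfies_constraints(fractions, max_elements=4, min_ga_fraction=0.1):
    """Check element count and gallium content for {element: atomic fraction}.

    min_ga_fraction is a hypothetical threshold, not a value from the paper.
    """
    present = {el: x for el, x in fractions.items() if x > 0}
    if not present or len(present) > max_elements:
        return False
    total = sum(present.values())  # normalize in case fractions do not sum to 1
    return present.get("Ga", 0.0) / total >= min_ga_fraction

# One of the candidate stoichiometries quoted in the article.
candidate = {"Ga": 0.51, "As": 0.16, "N": 0.24, "Sb": 0.10}
ok = satisfies_constraints(candidate)
```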

Chemical Constraint Filtering Layer

All generated candidates must pass verification by the SMACT tool, which checks constraints such as charge balance, plausibility of oxidation states, and electronegativity consistency. This ensures that the generated materials are not only valid in the mathematical search space but also chemically realizable.
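The idea behind these checks can be illustrated without the SMACT package itself. The sketch below is not SMACT's API: it is a simplified stand-in that tests integer stoichiometries for charge neutrality and for cations being less electronegative than anions. The oxidation-state table is a small illustrative subset, and the electronegativities are standard Pauling values.

```python
from itertools import product

# Small illustrative tables; the real tool draws on much fuller data.
OXIDATION_STATES = {
    "Ga": (1, 3), "O": (-2,), "N": (-3,), "S": (-2,),
    "Se": (-2,), "As": (-3,), "F": (-1,),
}
PAULING_EN = {
    "Ga": 1.81, "O": 3.44, "N": 3.04, "S": 2.58,
    "Se": 2.55, "As": 2.18, "F": 3.98,
}

def is_plausible(counts):
    """True if some oxidation-state assignment is neutral and EN-ordered."""
    elements = list(counts)
    for states in product(*(OXIDATION_STATES[el] for el in elements)):
        if sum(q * counts[el] for q, el in zip(states, elements)) != 0:
            continue  # this assignment is not charge neutral
        cations = [PAULING_EN[el] for q, el in zip(states, elements) if q > 0]
        anions = [PAULING_EN[el] for q, el in zip(states, elements) if q < 0]
        # Every cation should be less electronegative than every anion.
        if cations and anions and max(cations) < min(anions):
            return True
    return False

ga2o3_ok = is_plausible({"Ga": 2, "O": 3})  # Ga(3+) x2 balances O(2-) x3
gao_ok = is_plausible({"Ga": 1, "O": 1})    # neither Ga(1+) nor Ga(3+) balances O(2-)
```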

In addition, the framework also combines Explainable Artificial Intelligence (XAI) methods and uses SHAP to analyze the model decision logic, making material prediction shift from a "black box" to an "explainable system".

Accelerating Inverse Materials Design under Real Chemical Constraints

The researchers designed a series of experiments to evaluate and analyze the performance, structural characteristics, interpretability, and chemical validity of the model:

Model Performance Evaluation

The KNN model performs stably in cross-validation, with an R² of about 0.60 ± 0.07 and an RMSE of about 1.02 eV, indicating that the model generalizes reasonably well in a sparse chemical space.

The feature importance analysis in the following figure shows that melting point, electronegativity range, and electronegativity deviation are the key factors affecting band gap prediction, and all three are closely related to bond strength and charge transfer behavior in materials. As the electronegativity difference increases, the band gap shows a downward trend, while an increase in melting point and cohesive energy corresponds to a larger band gap. This trend is highly consistent with the traditional understanding of semiconductor physics.

The most important features in the final KNN model. The bar chart represents the relative contribution of each feature to the model split gain. The higher the value, the more significant the impact

Ability to Learn Real Chemical Rules from Data

During the generation stage, Bayesian optimization proposed a total of 1,025 candidate gallium-containing compositions, of which only 38 (about 3.7%) passed the SMACT screening, indicating that the chemical feasibility constraints are very strict. These valid materials are mainly concentrated in the 2.0–2.5 eV range, which suggests that this region favors medium band gap semiconductors with both ionic and covalent bonding character. These results are consistent with known systems such as Ga₂O₃ (≈4.8 eV) and Ga₂S₃ (≈2.5 eV).

The BO search also tends to cluster around known gallium-containing chemical families (such as Ga–O, Ga–N, Ga–As/Sb) and proposes new intermediate stoichiometries within these regions, for example Ga₀.₅₁As₀.₁₆N₀.₂₄Sb₀.₁₀ and Ga₀.₁₇₁Sb₀.₁₇₅O₀.₃₆₇F₀.₂₈₆.

For wide band gap materials (>3.0 eV), the algorithm tends to favor oxygen-rich compounds because the strong Ga–O bond helps to widen the band gap, while a lower band gap (about 1.5–2.0 eV) is usually achieved by replacing oxygen with sulfur, selenium, or phosphorus, which introduces stronger p–p interactions. These trends are highly consistent with existing experimental observations, indicating that the model has been able to "implicitly learn" real chemical rules from the data.

Ability to Capture Real "Structure-Property Relationships"

To confirm that the generated gallium-containing components correspond to "physically realizable" materials, the research team used the Chemeleon-dng model developed by Park et al. to predict their crystal prototypes, as shown in the following figure:

Representative crystal structures of the generated gallium (Ga) compounds

The candidate compositions verified by SMACT show chemically reasonable coordination environments, mainly gallium-centered tetrahedral and octahedral coordination, which is highly consistent with known crystal prototypes such as Ga₂O₃, GaN, and GaSe. The surrogate model also reproduces the band gap hierarchy expected from empirical rules: oxides at 3.5–4.8 eV, chalcogenides at 1.8–2.6 eV, and pnictides at about 1.2–2.0 eV. In other words, oxide band gap > chalcogenide band gap > pnictide band gap.

This result indicates that this Bayesian optimization workflow has been able to effectively capture the real "structure-property relationship".

It is worth noting that none of the 38 valid compositions that passed verification duplicates an existing known material, which further shows that the generated results combine "novelty" with "chemical self-consistency".
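The article does not define its uniqueness and novelty metrics, so the sketch below uses definitions that are common for generative models and should be read as an assumption: uniqueness is the share of distinct entries among the generated candidates, and novelty is the share of distinct candidates that are absent from the reference set. The pass rate uses the counts reported above (38 of 1,025).

```python
def uniqueness(generated):
    """Fraction of generated formulas that are distinct from one another."""
    return len(set(generated)) / len(generated)

def novelty(generated, known):
    """Fraction of distinct generated formulas not found in the known set."""
    unique = set(generated)
    return len(unique - set(known)) / len(unique)

known = ["GaN", "GaAs", "Ga2O3"]

# The two candidate stoichiometries quoted in the article, as ASCII strings.
generated = ["Ga0.51As0.16N0.24Sb0.10", "Ga0.171Sb0.175O0.367F0.286"]
u, n = uniqueness(generated), novelty(generated, known)

# A contrasting toy list with one repeat and one already-known formula.
repeated = ["GaN", "GaN", "GaP"]
u_rep, n_rep = uniqueness(repeated), novelty(repeated, known)

# SMACT pass rate from the counts reported in the article.
pass_rate_percent = round(100 * 38 / 1025, 1)
```

In practice the formulas would be normalized to a canonical form before comparison, otherwise the same composition written two ways would count as novel.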

DFT Verification

The researchers further conducted DFT verification. The following table compares the "model-predicted band gap" with the "DFT-calculated band gap" for 10 compositions that passed the SMACT verification, and lists the corresponding band gap types.

Comparison between the model-predicted band gap and the DFT-calculated band gap for ten compositions that passed the SMACT verification

Overall, the mean absolute error (MAE) is 0.890 eV, the root mean square error (RMSE) is 1.158 eV, and the median absolute error is 0.784 eV. Although these deviations are not negligible, the predictions are still useful for early-stage screening in materials discovery. More importantly, none of the materials that passed verification appears in the known databases, indicating high novelty.
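The three error metrics quoted above are computed as follows. The numbers in this sketch are toy values chosen for easy arithmetic and are not the ten compositions from the paper's table; `error_metrics` is a hypothetical helper name.

```python
import math
import statistics

def error_metrics(predicted, reference):
    """Return MAE, RMSE, and median absolute error for paired values."""
    abs_err = [abs(p - r) for p, r in zip(predicted, reference)]
    return {
        "mae": statistics.fmean(abs_err),
        "rmse": math.sqrt(statistics.fmean(e * e for e in abs_err)),
        "median_ae": statistics.median(abs_err),
    }

# Toy band gaps in eV: surrogate predictions versus reference calculations.
predicted = [2.0, 1.5, 3.0, 2.5]
reference = [1.0, 1.5, 2.5, 4.5]
metrics = error_metrics(predicted, reference)
```

RMSE weights large errors more heavily than MAE, so a gap between the two, as in the reported 1.158 eV versus 0.890 eV, points to a few candidates with comparatively large deviations rather than a uniform offset.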

Conclusion

Overall, this study demonstrates a new material design paradigm for gallium-containing semiconductors: through the synergistic effect of machine learning modeling, Bayesian optimization search, and chemical constraint screening, an automated generation path from "data" to "new materials" is achieved.

From an industrial perspective, this method has potential value for photovoltaic material design, light-emitting device development, and wide band gap semiconductor research. Especially in the context of the rapid development of new-generation power electronics and optoelectronic devices, the demand for materials with controllable band gaps is rapidly increasing, and AI-driven material design methods are expected to become a key tool to accelerate this process.

Furthermore, the significance of this framework is not limited to the gallium system. Its methodology can also be extended to indium, tin, and even lead-free semiconductor systems, providing a general path for the rational design of complex multi-component compounds. It signals that materials science is moving from "empirical trial and error" to a new stage of "algorithmic generation", with artificial intelligence becoming a core bridge between chemical rules and materials discovery.

References:
https://techxplore.com/news/2026-05-ai-discovery-gen-chips-electronic.html
https://pubs.acs.org/doi/10.1021/acsmaterialslett.5c01482

This article is from the WeChat official account