
Another bottleneck of AI4S has been overcome: two AIs "arguing" boosts the deployment success rate of scientific research code to over 95%

QbitAI, 2026-01-13 18:18
Deploying 50,000 scientific computing tools in one day might be the real starting point of AI4S.

Over the past few decades, the field of scientific computing has accumulated an unprecedented number of open-source software tools.

From bioinformatics and chemical simulation to materials computing, physical simulation, and engineering design, almost every academic discipline has developed its own tool ecosystem.

On platforms like GitHub, thousands of code repositories claim to be usable for scientific research.

However, a long-standing problem has never been systematically resolved: the vast majority of scientific software remains "published" rather than "directly executable".

Today's high-performance intelligent agents are expected to improve this situation and help users get open-source projects running quickly.

In real scientific research practice, teams often spend days or even weeks repeatedly resolving compilation failures, dependency conflicts, and system incompatibilities before a tool will "barely run" locally.

Such an environment depends heavily on personal experience, is often temporary and non-portable, and is hard for others to reproduce or reuse. Each researcher and each laboratory manually maintains its own running environment instead of working on shared, reproducible execution infrastructure.

The problem with this model is not just inefficiency.

More crucially, it structurally limits three aspects of scientific software: reproducibility, large-scale evaluation, and systematic integration.

Even though containerization, cloud computing, and HPC platforms have significantly lowered the barrier to computing power, this "deployment bottleneck" persists and has long restricted the usability of scientific software.

With the rise of AI for Science (AI4S), this problem has been further magnified.

In the new scientific research paradigm, AI systems no longer merely output predictions; they need to interact closely with real scientific tools: invoking solvers, executing simulation programs, running analysis pipelines, and processing real data.

In this context, whether a tool "can really run" is no longer an engineering detail but a fundamental question.

This problem is even more acute in Agentic Science scenarios.

If a tool depends on an implicit environment and its execution is highly fragile, an agent's plans cannot actually be carried out, execution failures cannot be analyzed structurally, and they certainly cannot be converted into learnable execution trajectories.

From this perspective, whether a tool is deployment-ready has become a structural bottleneck restricting the large-scale development of AI4S and Agentic Science.

Based on these observations, the research team gradually formed a judgment: the problem with scientific software is not a lack of tools but the absence of shared infrastructure that can systematically turn tools into executable facts.

Deploy-Master was proposed in this context.

In the real world, deployment is not an isolated step but a continuous chain: whether a tool can be discovered, whether it is correctly understood, whether an environment can be built, and whether it can really be executed. Deploy-Master is designed around this chain as a one-stop automated workflow centered on execution.

Search Agent: Searching through millions of repositories

In large-scale scenarios, the first challenge of deployment is not building but discovery. If the candidate tool set itself carries a systematic bias, all subsequent automation will magnify that bias.

To address this, the team started from 91 scientific and engineering fields to construct a disciplinary space covering the actual application scenarios of AI4S. They used language models to expand search keywords and conducted large-scale searches on GitHub and the public web.

The repositories returned by the initial recall serve as "anchors" and are iteratively expanded through signals such as dependency relationships, citations, shared contributors, and documentation links, avoiding the blind spots of keyword-only search.
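Mechanically, this kind of expansion amounts to a breadth-first walk over repository relationships. A minimal sketch in Python, assuming a hypothetical `fetch_neighbors` helper that wraps the dependency, citation, contributor, and documentation-link lookups; the helper name and the round limit are illustrative, not taken from Deploy-Master itself:

```python
from collections import deque

def expand_candidates(seed_repos, fetch_neighbors, max_rounds=3):
    """Iteratively grow the candidate set from anchor repositories.

    seed_repos      -- repositories returned by the keyword search
    fetch_neighbors -- hypothetical callable returning repos linked to a given
                       repo via dependencies, citations, shared contributors,
                       or documentation links
    """
    seen = set(seed_repos)
    frontier = deque(seed_repos)

    for _ in range(max_rounds):
        next_frontier = deque()
        while frontier:
            repo = frontier.popleft()
            for neighbor in fetch_neighbors(repo):
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier

    return seen
```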

The team then eliminated obviously non-executable repositories with structural heuristic rules and had the Agent make a semantic judgment on whether each remaining repository constitutes an executable scientific tool.

Through this multi-stage funnel, the team narrowed roughly 500,000 initial repositories down to 52,550 scientific tool candidates that entered the automatic deployment pipeline. The significance of this step is not only the screening itself but also the first structured description of the scale and boundaries of the real-world scientific tool landscape.
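The funnel can be pictured as a chain of progressively more expensive filters. A minimal sketch, assuming two hypothetical predicates (a cheap structural check and an expensive language-model judgment); the actual rules and prompts used by the team are not published in this article:

```python
def run_funnel(candidates, stages):
    """Apply progressively more expensive filters, reporting survivors per stage.

    stages -- ordered list of (name, keep) pairs, e.g.
              [("structural heuristics", passes_structural_heuristics),
               ("LLM semantic check",    llm_judges_executable_tool)]
              where each `keep` is a hypothetical predicate on a repository.
    """
    surviving = list(candidates)
    for name, keep in stages:
        surviving = [repo for repo in surviving if keep(repo)]
        print(f"after {name}: {len(surviving)} candidates remain")
    return surviving
```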

Build Agent: Dual-model debate

In the build stage, the team is not dealing with a world of "clear instruction manuals". The build information in a large number of scientific software repositories is fragmented, incomplete, and sometimes contradictory.

README files may be outdated, existing Dockerfiles may not reflect the current state of the code, and key dependencies often exist only in the author's local environment.

The Build Agent systematically traverses the build clues in a repository and performs supplementary information retrieval when necessary to generate an initial build plan.

Early experiments showed that relying on a single model to generate build specifications achieved a success rate of only 50%–60%. The failures stemmed mainly from the large number of implicit, unstated assumptions in the build information.

To address this, Deploy-Master introduced a dual-model review-and-debate mechanism: one model proposes a build specification, while the other independently reviews it, actively looks for potential inconsistencies, missing dependencies, or environmental assumptions, and suggests corrections.

Through multiple rounds of interaction, the two models keep revising the plan until a stable, executable build specification is formed. This mechanism raised the overall success rate to over 95%.
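Conceptually the mechanism is a propose-review loop between two models. A minimal sketch, where `proposer` and `reviewer` are hypothetical wrappers around two separate model calls; the real prompts, stopping criteria, and model choices are not described here:

```python
def negotiate_build_spec(repo_context, proposer, reviewer, max_rounds=5):
    """Dual-model loop: one model drafts a build spec, the other critiques it.

    proposer(repo_context, feedback) -> build spec (e.g. a Dockerfile draft)
    reviewer(repo_context, spec)     -> (approved: bool, feedback: str)
    Both callables are hypothetical wrappers around separate model calls.
    """
    feedback = None
    spec = None
    for _ in range(max_rounds):
        spec = proposer(repo_context, feedback)
        approved, feedback = reviewer(repo_context, spec)
        if approved:
            return spec   # reviewer found no remaining inconsistencies
    return spec           # fall back to the last draft after max_rounds
```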

Each tool is finally verified through a minimal executable command.

Only tools that pass execution verification are considered successfully deployed; they are then further structured, registered, and published on Bohr and SciencePedia so that they can be used directly or invoked by other Agents (such as SciMaster).
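In practice, "passing execution verification" can be as lightweight as building the container image and running one smoke-test command inside it. A minimal sketch using the Docker CLI via `subprocess`; the image tag, example command, and timeout are illustrative and not Deploy-Master's actual settings:

```python
import subprocess

def verify_tool(build_dir, smoke_cmd, tag="sci-tool:candidate", timeout=600):
    """Build the container and run a minimal executable command inside it.

    Returns True only if both the image build and the smoke-test command succeed.
    """
    build = subprocess.run(
        ["docker", "build", "-t", tag, build_dir],
        capture_output=True, timeout=timeout,
    )
    if build.returncode != 0:
        return False

    run = subprocess.run(
        ["docker", "run", "--rm", tag] + smoke_cmd,
        capture_output=True, timeout=timeout,
    )
    return run.returncode == 0

# Hypothetical usage: check that the tool's CLI at least reports its version.
# verify_tool("./repo", ["mytool", "--version"])
```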

The distribution of build times shows that large-scale deployment is not a "uniform" process.

Although most tools can be built in about 7 minutes, the overall distribution has a pronounced long tail. Some tools contain only lightweight scripts or interpreted code and build easily, while others involve complex compilation, deep dependency trees, and system-level library configuration, and take significantly longer to build.

This difference does not block the overall process, but it does determine the cost structure of deployment at scale.

Among the 50,112 successfully deployed tools, the team observed a highly heterogeneous language distribution.

The tools cover more than 170 programming languages, with Python accounting for the largest share, followed by C/C++, Notebook-based tools, R, Java, and others.

The deployment success rate for the vast majority of languages remains stably high. The few languages with lower success rates are concentrated in scenarios that rely on complex compilation chains or system-level libraries, such as C/C++, Fortran, and some R tools.

This does not mean these languages are "inherently harder to deploy"; rather, their toolchains are more tightly coupled to the underlying environment, which magnifies the uncertainties in build specifications.

From a deployment perspective, the language itself is not the decisive factor; the strength of environmental coupling is.

Across the 2,438 failed build attempts, the team systematically tallied the failure reasons. The results showed that failures were not evenly distributed but highly concentrated in a few types of problems.

The main source of failure is build-process errors, including build steps inconsistent with the current state of the repository, missing key dependencies, and compiler or system-library mismatches.

This type of failure far outnumbers resource shortages, network errors, or permission issues. Resource-related errors did occur during the high-concurrency stage, which directly drove the team's subsequent improvements to scheduling and isolation.
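As an illustration of how such a tally might be produced, the sketch below buckets build logs into categories by pattern matching; the patterns and category names are illustrative guesses, not the team's actual taxonomy:

```python
import re
from collections import Counter

# Illustrative patterns only; the real classifier is not described in the article.
FAILURE_PATTERNS = [
    ("missing dependency",    r"ModuleNotFoundError|No package .* found|cannot find -l"),
    ("compiler/system libs",  r"gcc: error|undefined reference|glibc"),
    ("resource exhaustion",   r"No space left on device|Killed|OOM"),
    ("network",               r"Connection timed out|Could not resolve host"),
]

def classify_failures(logs):
    """Count failed builds per category based on their log text."""
    counts = Counter()
    for log in logs:
        category = "other build error"
        for name, pattern in FAILURE_PATTERNS:
            if re.search(pattern, log):
                category = name
                break
        counts[category] += 1
    return counts
```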

This further shows that in large-scale deployment, failures should not be treated as exceptions but as signals through which the system exposes problems and then self-corrects.

Through a unified execution infrastructure, the team was able to systematically observe how scientific software behaves when deployed in a real environment: which steps are most likely to fail, which implicit assumptions are most frequently triggered, and which toolchains are most likely to magnify uncertainties.

This observability is itself one of the foundations Deploy-Master hopes to establish. It turns the perception that "scientific software is difficult to deploy" from an empirical judgment into an engineering object that can be quantified, analyzed, and continuously improved.

From runnable tools to the execution foundation of Agentic Science

The direct output of Deploy-Master is a collection of tens of thousands of execution-verified tools. More importantly, it provides a long-missing basic premise for community Agents and various Master Agents.

For an Agent, tool invocation is not an abstract action but an execution process that must actually succeed in a real environment.

Only when tools are uniformly built, verified, and registered as executable capabilities does an Agent truly have a stable action space, and only then can the closed loop between planning, execution, and learning be established. This also enables community Agents from different sources to share the same set of execution-verified tool capabilities instead of each maintaining its own fragile, non-reproducible running environment.

The significance of this methodology is not limited to scientific computing. Scientific tools are often considered the hardest case for automated deployment: complex dependencies, strong system coupling, incomplete documentation, and high sensitivity to the environment.

If, even in this "hardest scenario", an execution-centered design can reliably produce runnable tools at the scale of tens of thousands, then the conclusion is clear: the problem lies not in the type of tool but in whether an execution-centered infrastructure has been established.

This judgment also applies to the broader software tool ecosystem: engineering tools, data-processing systems, professional software, and even various forms of Agent tooling. As long as a tool ultimately has to be executed, its deployment cannot sidestep the reality of "imperfect information".

Deploy-Master has not solved every problem. Heterogeneous hardware, distributed computing, semantic-level I/O interfaces, and closed-loop integration with physical experiment systems remain challenges for the future.

But one thing is clear: in the era of Agentic Science, execution is not an ancillary step after reasoning but the precondition for all other capabilities.

When "whether a tool can run" is no longer a default assumption but a systematically verified fact, scientific intelligent agents will truly start to have a basis for interacting with the real world. And Deploy - Master is an attempt to move towards this execution reality.

This article is from the WeChat official account "QbitAI", written by DP Technology, and is republished by 36Kr with permission.