DeepAgent and DeepSearch have both topped the charts, and both point to the same emerging open-source project: openJiuwen.
Since the beginning of 2026, the hottest thing in the AI world has been a lobster named Clawdbot.
From Clawdbot to OpenClaw, two name changes couldn't dampen people's enthusiasm for it. A collective desire is emerging worldwide: people want a more advanced, more general, and more reliable super-intelligent agent.
Over the past year, agents have appeared in an endless stream; 2025 was even dubbed the "Year of AI Agents". Measuring an agent's real strength means looking at both its general problem-solving ability and its core specialized capabilities in vertical fields. The GAIA general-intelligence benchmark and the BrowseComp-Plus in-depth research benchmark speak more plainly than any conceptual debate.
Last year, the agent from the startup Manus became wildly popular, pushing the GAIA leaderboard into the spotlight; since then, nearly every agent has tried to place on it. BrowseComp-Plus, which focuses on in-depth research and web-browsing capabilities, has likewise become the core arena for agents' search abilities thanks to its strict evaluation criteria.
Recently, while reviewing the two leaderboards, we found new breakthroughs at the top of both: DeepAgent and DeepSearch, built on the emerging open-source project openJiuwen, have topped GAIA and BrowseComp-Plus respectively.
DeepAgent Tops the GAIA List
DeepAgent, built on openJiuwen, topped the GAIA list with a score of 91.69%, overtaking NVIDIA Nemotron and a host of leading agents worldwide.
List link: https://gaia-benchmark-leaderboard.hf.space/
- Competing on the GAIA List: Facing the Biggest Challenge for Agents
The GAIA list is not one that favors large models.
GAIA is an evaluation benchmark jointly developed by Meta and Hugging Face specifically for general-purpose agent capabilities. It covers 12 core capabilities, including long-term task planning, multi-modal understanding, tool invocation, complex reasoning, and execution robustness, across three difficulty levels (Levels 1-3), with Level 3 tasks approaching the difficulty of work done by skilled humans. The evaluation uses a closed test set and automated scoring to assess an agent's overall capability comprehensively and strictly.
According to the GAIA overview on Hugging Face, human participants average a success rate of about 92% on this benchmark, while GPT-4 reaches only about 15% even with the help of plugins.
GAIA's evaluation design has several distinctive traits. It differs sharply from traditional AI benchmarks and filters out plenty of agents that merely "seem very smart".
1. Real-world difficulty: tasks demand not just language understanding but reasoning, planning, multi-modal processing, tool invocation, and actual execution, approaching the work agents must complete in real scenarios.
2. Human interpretability: although the tasks are hard for AI, they are conceptually clear and verifiable for humans, which makes the results more reliable and the human-machine gap easier to measure.
3. Non-gameability: GAIA weighs the quality of the entire execution process. A correct answer requires completing the full task; brute-force shortcuts are useless.
openJiuwen-deepagent's chart-topping 91.69% comes within a hair of the roughly 92% scored by human participants on GAIA.
This result means it has built system-level advantages across planning, execution stability, tool coordination, multi-modal understanding, and task completion, showing that general-purpose agents can now execute tasks at near-human levels.
Actual performance of DeepAgent. Task: Automatically analyze and purchase ingredients based on a YouTube cooking video.
A typical browser-use task shows DeepAgent's "execution ceiling" at a glance.
The user issues a single instruction; DeepAgent parses the YouTube cooking video and automatically identifies the ingredient list. It then searches for each item on an e-commerce site, adds them to the cart, and compares and verifies prices in real time. Once all the ingredients are ready, the agent hands control back to the user for payment confirmation. The whole flow runs seamlessly, demonstrating stable execution in real, complex scenarios.
- Behind DeepAgent: The Capabilities That Unlocked the Top Spot
DeepAgent's GAIA win is no accident: its design targeted the list's hardest demands from day one. A high GAIA score requires meeting several strict conditions simultaneously:
It can understand natural-language tasks that are vague, long-chained, and multi-constrained.
It can plan in multiple steps rather than execute linearly.
It can reliably invoke tools, access web pages, process files, and execute code.
It can self-correct when a step fails or information is missing, instead of crashing or hallucinating.
Three core designs reveal the secret behind DeepAgent's "dominance" of the GAIA list.
1. Agent Dynamic Self-evolution Engine: From "Linear Execution" to "Closed-loop Autonomy"
In real tasks, the agent receives natural-language instructions and must structure them, breaking vague requirements into actionable steps. During execution, it must also adjust the plan dynamically based on real-time feedback so tasks complete smoothly in a changing environment.
To this end, DeepAgent runs two closed loops at once: "planning-execution" and "observation-reflection". Beyond structuring and decomposing natural-language instructions, it acts like a commander with a monitoring room, continuously reviewing execution results as it operates. The moment it senses an environmental anomaly or a logical deviation, the system triggers a local rollback and self-repair, keeping the agent out of the classic failure mode of charging ahead until it slams into a wall.
Meanwhile, building on openJiuwen's agent self-evolution capability, DeepAgent fits its core engine with an evolvable external memory module as a "digital brain". This is not simple data storage but a cognitive center with self-healing ability: it diagnoses the root cause of execution errors, closes the correction loop through the external memory's feedback mechanism, and autonomously generates optimization strategies that keep improving subsequent executions.
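The article doesn't publish openJiuwen's interfaces, so the following is only a minimal sketch of the dual closed-loop pattern described above: an outer planning-execution loop paired with an observation-reflection check that rolls back and retries a failed step while recording a lesson in external memory. All names and the toy failure model are hypothetical.

```python
# Minimal sketch of the dual closed-loop pattern; not openJiuwen's real API.
import random
from dataclasses import dataclass, field

@dataclass
class Step:
    name: str
    attempts: int = 0

@dataclass
class Memory:
    """Evolvable external memory: keeps lessons learned from failed executions."""
    lessons: list = field(default_factory=list)

def execute(step: Step) -> bool:
    """Toy executor: steps fail randomly to simulate a changing environment."""
    step.attempts += 1
    return random.random() > 0.3

def reflect(step: Step, ok: bool, memory: Memory) -> None:
    """Observation-reflection loop: diagnose a failure and store the lesson."""
    if not ok:
        memory.lessons.append(f"{step.name} failed on attempt {step.attempts}")

def run(steps: list[Step], max_attempts: int = 5) -> Memory:
    """Planning-execution loop with local rollback and self-repair."""
    memory = Memory()
    i = 0
    while i < len(steps):
        ok = execute(steps[i])
        reflect(steps[i], ok, memory)   # review every result, not just errors
        if ok:
            i += 1                      # advance only on verified success
        elif steps[i].attempts >= max_attempts:
            raise RuntimeError(f"unrecoverable step: {steps[i].name}")
        # otherwise stay on the same step: local rollback + retry (self-repair)
    return memory

memory = run([Step("parse instruction"), Step("search web"), Step("fill cart")])
print(memory.lessons)
```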
2. Multi-level Context Engine: Ensuring the Agent's Cognitive Consistency
In high-difficulty tasks like GAIA's, the real challenge is often whether the agent keeps reasoning over reliable information. DeepAgent therefore designed a context system that is hierarchically integrated, fully traceable, and consistent over the long run. It stores conversation records, project knowledge, domain rules, and entity relationships in separate layers, dynamically linking them into a structured whole, and attaches a source-evidence chain to each reasoning step so the output stays interpretable.
Meanwhile, drawing on openJiuwen's context-compression capability, it promptly compresses and offloads irrelevant context during long-running tasks, letting the agent stay internally consistent and credible instead of drifting further and further off course.
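As a rough illustration of the layered design, here is a sketch of a multi-level context store that evicts the least-relevant entries once a token budget is exceeded, with every entry carrying an evidence source. The layer names, budget, and relevance scores are assumptions for illustration, not openJiuwen's documented scheme.

```python
# Illustrative layered context store with budget-driven compression.
from collections import defaultdict

class LayeredContext:
    """Multi-level context store: separate layers, evidence-tagged entries."""
    LAYERS = ("conversation", "project_knowledge", "domain_rules", "entities")

    def __init__(self, token_budget: int = 4000):
        self.token_budget = token_budget
        self.items = defaultdict(list)   # layer -> [(text, relevance, evidence)]

    def add(self, layer: str, text: str, relevance: float, evidence: str):
        """Each entry carries an evidence source so reasoning stays traceable."""
        assert layer in self.LAYERS
        self.items[layer].append((text, relevance, evidence))

    def size(self) -> int:
        # Crude token estimate: roughly one token per four characters.
        return sum(len(t) // 4 for entries in self.items.values()
                   for t, _, _ in entries)

    def compress(self):
        """Offload the least-relevant entries until the budget is met."""
        while self.size() > self.token_budget:
            layer, idx = min(
                ((l, i) for l in self.items for i in range(len(self.items[l]))),
                key=lambda p: self.items[p[0]][p[1]][1],  # lowest relevance first
            )
            self.items[layer].pop(idx)   # a real system would archive, not drop

ctx = LayeredContext(token_budget=40)
ctx.add("conversation", "User asked for ACME's Q3 revenue.", 0.9, "msg#12")
ctx.add("domain_rules", "Revenue figures must cite the 10-Q filing.", 0.8, "rules.md")
ctx.add("conversation", "Unrelated small talk about the weather. " * 5, 0.1, "msg#3")
ctx.compress()   # the low-relevance small talk is offloaded first
```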
3. Asynchronous Tool Orchestration Bus: Unified Scheduling and Reliable Execution of Heterogeneous Tools
With a complex toolchain, messy API calls are often what brings a system down. The agent must dispatch work to different expert modules like a scheduling team, each doing its part, while also invoking external tools and systems at the right moments to keep execution efficient and reliable.
GAIA's tasks involve many real-environment operations. Rather than "outsourcing" these capabilities piecemeal to independent tools, DeepAgent abstracts external APIs, systems, and databases into standardized capability nodes behind a unified tool gateway and orchestration mechanism. The bus supports high-concurrency asynchronous scheduling and makes every tool invocation controllable, traceable, and replayable, so the execution process can be reviewed and audited for reliability.
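To make the idea concrete, here is a minimal sketch of an asynchronous tool bus: capability nodes register behind one gateway, calls run concurrently, and every invocation is appended to a replayable trace. The registry and trace format are illustrative assumptions, not openJiuwen's actual protocol.

```python
# Illustrative async tool bus with a traceable, replayable call log.
import asyncio, time

class ToolBus:
    """Unified gateway: heterogeneous tools become standardized capability nodes."""
    def __init__(self):
        self.nodes = {}   # name -> async callable
        self.trace = []   # replayable record of every invocation

    def register(self, name, fn):
        self.nodes[name] = fn

    async def call(self, name, **kwargs):
        start = time.monotonic()
        try:
            result = await self.nodes[name](**kwargs)
            self.trace.append({"tool": name, "args": kwargs, "ok": True,
                               "latency_s": round(time.monotonic() - start, 3)})
            return result
        except Exception as exc:
            self.trace.append({"tool": name, "args": kwargs, "ok": False,
                               "error": repr(exc)})
            raise

async def web_search(query: str) -> str:   # stand-in capability node
    await asyncio.sleep(0.1)               # simulate network latency
    return f"results for {query!r}"

async def main():
    bus = ToolBus()
    bus.register("web_search", web_search)
    # High-concurrency asynchronous scheduling: several calls in flight at once.
    results = await asyncio.gather(
        bus.call("web_search", query="GAIA benchmark"),
        bus.call("web_search", query="openJiuwen"),
    )
    print(results)
    print(bus.trace)   # reviewable, auditable call log

asyncio.run(main())
```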
In GAIA's real-environment operations, DeepAgent dispatches tasks with the precision of an expert scheduling team, converting each tool's output into stable scoring power.
Across the whole task pipeline, these capabilities work like an unlocked skill tree, together letting the agent score steadily on GAIA's complex tasks. At the inflection point where agents enter the productivity era, what really sets the ceiling is not the model but the depth of the agent's capability design.
DeepSearch Tops the BrowseComp-Plus List
DeepSearch, built on openJiuwen, topped the BrowseComp-Plus list with an accuracy of 80%.
List link: https://huggingface.co/spaces/Tevatron/BrowseComp-Plus
- The BrowseComp-Plus List: Tackling the Core Test of In-depth Search
BrowseComp-Plus is the industry's core authoritative benchmark for measuring agents' in-depth search, research, and web-browsing capabilities. An upgraded version of OpenAI's BrowseComp benchmark, it covers core capabilities such as multi-hop retrieval, cross-source information integration, retrieval-reasoning planning, and web-content understanding, testing an agent's practical ability to mine useful information from a large corpus, screen out distractors, and form accurate answers.
BrowseComp-Plus's scoring mechanism is rigorous:
1. It builds the test environment on a fixed, human-verified corpus. Each question comes with human-verified supporting documents and hard distractor documents, completely avoiding the evaluation drift caused by the live web's dynamism.
2. It takes strict accuracy as the core scoring dimension, supplemented by retrieval-efficiency metrics for a comprehensive judgment. The standardized, automated scoring pipeline involves no manual intervention.
3. Results are verifiable. Because the corpus is fixed and human-verified, every answer has a clear traceability basis, making results reproducible and auditable and keeping the evaluation maximally fair.
Thanks to this professional evaluation design, the BrowseComp-Plus list has become a key reference for top global institutions testing the real strength of in-depth search agents. openJiuwen-deepsearch's chart-topping 80% accuracy means it has built core technical advantages in multi-hop in-depth search, cross-source information integration, distractor screening, and web-content understanding, marking a breakthrough in agents' practical ability in in-depth search and web interaction.
- Behind DeepSearch: A Benchmark-setting Engine for In-depth Research
Searching in the real world often means:
Multiple rounds of questioning and repeated verification
Cross-source information comparison and traceability
Large amounts of noise and misleading interference
Long-chain reasoning and closed-loop evidence construction
By building three core engines, DeepSearch models a complex query as a state space. Through dynamic expansion and exploration, it thinks from multiple perspectives like a human expert, with every search action generated from the real-time state.
1. Entity Cognition Engine: Automatically identify key entities and establish a traceable state-evolution history
The first step of in-depth research is understanding the problem's structure: identify the key entities and link them through clue reference relationships. The system extracts core variables such as people, institutions, and events, builds reference relationships between clues, and continuously tracks each entity's state-evolution trajectory.
Based on openJiuwen's context engine, the system models the problem state as a continuously updatable structured context. Each search action triggers an incremental state update, keeping entity relationships and reasoning progress consistent and traceable.
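A minimal sketch of what such entity state tracking could look like: each observed fact incrementally updates a structured entity record with a traceable history and links back to its source clues. The data model is a hypothetical illustration, not openJiuwen's schema.

```python
# Illustrative entity state tracking with a traceable evolution history.
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str
    kind: str                                     # e.g. "person", "org", "event"
    history: list = field(default_factory=list)   # traceable state evolution
    sources: set = field(default_factory=set)     # reference links to clues

class EntityGraph:
    """Structured, continuously updatable problem state."""
    def __init__(self):
        self.entities: dict[str, Entity] = {}

    def observe(self, name: str, kind: str, fact: str, source: str):
        """Each search result triggers an incremental update of entity state."""
        entity = self.entities.setdefault(name, Entity(name, kind))
        entity.history.append({"fact": fact, "source": source})
        entity.sources.add(source)

graph = EntityGraph()
graph.observe("ACME Corp", "org", "founded in 1998", "doc#4")
graph.observe("ACME Corp", "org", "acquired ByteWidgets in 2019", "doc#11")
print(graph.entities["ACME Corp"].history)   # full evolution trajectory
```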
2. Parallel Reasoning Path Management: Decompose complex problems into multi-branch reasoning paths and dynamically maintain an action pool
Facing complex multi-hop problems, DeepSearch does not advance along a single path the way traditional retrieval does. Instead, it builds a multi-angle reasoning tree: under openJiuwen's multi-workflow control mechanism, it concurrently explores multiple candidate solution paths, maintains a dynamically expanding action pool, and concentrates resources on high-potential paths, greatly improving retrieval efficiency.
The system can hold multiple candidate paths at once, explore different information sources concurrently, and continuously evaluate each path's value. Through a probability-sampling mechanism, it preferentially executes high-value paths while low-value paths are naturally marginalized. This keeps its exploration stable even in complex environments.
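A toy sketch of the action-pool idea: candidate paths carry value estimates, sampling is proportional to value so high-potential paths are explored more often, and observed evidence quality is folded back into the estimates. The path descriptions, scores, and update rule are illustrative assumptions.

```python
# Illustrative action pool with probability sampling over reasoning paths.
import random

class ActionPool:
    """Dynamically maintained pool of candidate reasoning paths."""
    def __init__(self):
        self.paths = []   # each path: [description, value estimate]

    def add(self, description: str, value: float):
        self.paths.append([description, value])

    def sample(self):
        """Probability sampling: high-value paths are chosen more often."""
        weights = [value for _, value in self.paths]
        return random.choices(self.paths, weights=weights, k=1)[0]

    def update(self, path, reward: float, lr: float = 0.5):
        """Fold observed evidence quality back into the path's value estimate."""
        path[1] = (1 - lr) * path[1] + lr * reward

pool = ActionPool()
pool.add("search press releases for the acquisition date", 0.6)
pool.add("search founder interview transcripts", 0.3)
pool.add("browse unrelated forum gossip", 0.1)

for _ in range(5):
    path = pool.sample()
    reward = 0.9 if "press releases" in path[0] else 0.2   # toy evidence score
    pool.update(path, reward)        # low-value paths fade out on their own

print(pool.paths)
```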