A new breakthrough in self-evolving agents: Meta unveils Dr. Zero, whose complex reasoning and search capabilities emerge spontaneously
There has been new progress in self-evolving intelligent agents.
Recently, Meta Superintelligence Labs and the University of Illinois at Urbana-Champaign (UIUC) jointly proposed the Dr. Zero framework, which enables agents to self-evolve efficiently with zero training data.
The framework reportedly addresses the problems that multi-turn search agents face in data-free self-evolution, such as limited question diversity and the heavy computing cost of multi-step reasoning and tool use.
The research team proposed "Hop-Relative Policy Optimization (HRPO)", which clusters structurally similar questions to construct a robust group-level baseline, preserving training effectiveness while avoiding the expensive nested sampling otherwise required during self-evolution.
Experiments show that on complex question-answering tasks, without any manually annotated data, the framework outperforms the fully supervised baseline by up to 14.1%, demonstrating the strong potential of search-augmented models on advanced reasoning tasks.
The results also show that, given a well-designed architecture and reward mechanism, agents can develop complex reasoning and search abilities entirely on their own, without human-annotated data. This offers a new approach to training models in data-scarce settings.
The Problem of Data Scarcity in AI Self-Evolution
Training a powerful model usually requires large amounts of high-quality, manually annotated data. In tasks involving complex reasoning and multi-step search in particular, obtaining accurate annotations is both time-consuming and extremely costly. Although the concept of "adaptive language agents", which aim to improve themselves through iterative learning, has been around for a long time, mainstream methods still fall short of true self-evolution: they rely heavily on large numbers of carefully written human questions or labels as prompts to drive exploration. This dependence on human intervention limits AI's ability to explore unknown territory.
To break through this limitation, the academic community has begun to explore data-free self-evolution, that is, allowing models to generate and solve problems autonomously, thereby constructing synthetic training data. However, there are still huge challenges in moving from the laboratory to real-world applications.
An ideal self-evolution framework would let AI improve its performance in a steadily compounding way through proposer-solver co-evolution, without any annotated datasets.
Figure | Adaptive training framework (Huang et al., 2025a), training the proposer and solver through iterative supervised minimization.
Most current self-evolution research focuses on specific fields with clear definitions and closed rules, such as mathematics and programming. In these fields, even with limited data diversity, models can make good progress.
In the open domain, however, the situation is completely different. Models tend to generate simple single-hop questions that lack challenge, while multi-step reasoning and search-tool use consume enormous amounts of compute; optimizing a model through large numbers of blind trial-and-error attempts would make the cost unbearable.
The core problem Dr. Zero tries to solve is therefore how to let AI self-evolve efficiently and with high quality in a complex, open world, without relying on manually curated data.
Dr. Zero: A "Zero-Data" Self-Evolving Learning System
Dr. Zero is not just a model but a self-improving learning system. Its core design mainly includes three aspects.
1. Proposer-Solver Co-Evolution
The framework contains two core roles: the proposer and the solver. Both are played by large language models and co-evolve during the training process.
Figure | Dr. Zero self-evolution feedback loop. Guided by the solver's feedback, the proposer synthesizes verifiable and challenging queries, continuously enhancing the solver's search and reasoning abilities.
The proposer's job is not only to write questions: it actively explores the open domain with external search engines to generate diverse, structurally complex questions. More importantly, as training progresses, the proposer optimizes its own strategy based on rewards, producing new questions that are more complex and challenging yet still verifiable.
The solver's job is to retrieve information with external search engines and answer those questions. It is trained on the synthetic questions produced by the proposer, continuously refining its reasoning and its use of search tools. As the solver improves, it in turn forces the proposer to find trickier angles for new questions.
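To make the division of labor concrete, here is a minimal sketch of the two roles and the data they exchange. The QAPair fields, the prompt wording, and the callable signatures are assumptions made for illustration; they mirror the interface described above, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable

SearchFn = Callable[[str], list[str]]   # query -> retrieved passages
GenerateFn = Callable[[str], str]       # prompt -> model completion

@dataclass
class QAPair:
    question: str   # proposer-written, possibly multi-hop
    answer: str     # short answer grounded in retrieved evidence, hence verifiable
    hops: int       # estimated reasoning/search depth (used later for HRPO grouping)

def propose_one(generate: GenerateFn, search: SearchFn) -> QAPair:
    """Proposer: explore the open domain via search, then synthesize a challenging,
    verifiable question-answer pair from the retrieved evidence."""
    evidence = search(generate("Write an exploratory search query about any open-domain topic."))
    question = generate(f"Compose a multi-hop question answerable only from this evidence: {evidence}")
    answer = generate(f"Give the short answer to '{question}' using only: {evidence}")
    return QAPair(question=question, answer=answer, hops=2)  # hop count fixed here for brevity

def solve_one(generate: GenerateFn, search: SearchFn, question: str) -> str:
    """Solver: retrieve supporting evidence, then reason to a final answer."""
    evidence = search(question)
    return generate(f"Answer the question using the evidence.\nQ: {question}\nEvidence: {evidence}")
```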
Figure | The iterative reward dynamics between the proposer and the solver in Dr. Zero. The baseline reward decreases steadily over iterations, reflecting the co-evolution between the two models: when one model improves, it naturally lowers the other's starting reward, which in turn drives that model's continued self-optimization through reinforcement learning.
2. Hop-Relative Policy Optimization
When letting AI self-evolve, the biggest obstacle is often compute. Traditional reinforcement learning methods such as GRPO rely on "nested sampling" to evaluate the quality of a question accurately, that is, generating multiple candidate questions for the same prompt. HRPO sidesteps this problem.
Traditional methods are computationally heavy, and when facing structurally diverse open-domain questions, a global baseline is unstable to estimate. HRPO instead clusters structurally similar questions (for example, by the number of reasoning "hops") and builds group-level baselines. The model no longer needs to generate many repeated questions per prompt: it generates a single question for each prompt and obtains a robust estimate by comparing against the other questions in the same group. This avoids expensive nested sampling, cutting computational cost substantially while preserving training quality.
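Based only on this description, the hop-grouped baseline might be computed roughly as follows. The grouping key (hop count) follows the article; the mean and standard-deviation normalization is an assumption in the spirit of group-relative methods, not the paper's exact formula.

```python
from collections import defaultdict
from statistics import mean, pstdev

def hop_relative_advantages(rewards: list[float], hops: list[int]) -> list[float]:
    """One generated question per prompt; each question's reward is compared against
    the other questions in the same hop group (same reasoning depth), rather than
    against extra samples of the same prompt, so no nested sampling is needed."""
    groups = defaultdict(list)
    for r, h in zip(rewards, hops):
        groups[h].append(r)
    advantages = []
    for r, h in zip(rewards, hops):
        mu = mean(groups[h])
        sigma = pstdev(groups[h]) or 1.0   # guard against zero spread in tiny groups
        advantages.append((r - mu) / sigma)
    return advantages

# Example: a batch of six proposed questions with hop counts 1, 2, and 3.
adv = hop_relative_advantages(
    rewards=[0.1, 0.3, 0.7, 0.5, 0.9, 0.4],
    hops=[1, 1, 2, 2, 3, 3],
)
```

Because every question contributes to its own group's baseline, a single generation per prompt suffices, which is where the saving over nested sampling comes from.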
3. Difficulty-Guided Reward Mechanism
How does Dr. Zero get the proposer to generate high-quality, difficult questions? Through a carefully designed difficulty-guided reward mechanism.
The reward mechanism encourages the proposer to generate complex, multi-hop queries that are difficult yet verifiable through a search engine, rather than simple single-hop questions. It rewards questions for being challenging while ensuring that their answers can be objectively checked against the information the search engine returns, avoiding open-ended or subjective questions that cannot be evaluated.
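The article does not spell out the exact reward formula, so the sketch below only illustrates the general idea: tie the proposer's reward to the solver's empirical success rate and gate it on verifiability. The proposer_reward function and its inputs are hypothetical.

```python
def proposer_reward(solver_successes: list[bool], verifiable: bool) -> float:
    """Illustrative difficulty-guided reward: zero for questions whose answers cannot
    be checked against retrieved evidence; otherwise higher when the current solver
    finds the question hard, so rewards on easy questions shrink as the solver improves."""
    if not verifiable:
        return 0.0
    solve_rate = sum(solver_successes) / max(len(solver_successes), 1)
    if solve_rate == 0.0:          # never solved: likely ill-posed, so no credit
        return 0.0
    return 1.0 - solve_rate        # hard-but-solvable questions earn the most

# Example: 2 of 8 solver attempts succeed on a verifiable question -> reward 0.75
r = proposer_reward([True, False, False, True, False, False, False, False], verifiable=True)
```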
As a scalable and efficient framework, Dr. Zero iteratively improves the proposer and the solver through data-free self-evolution, in the cycle sketched below. In each iteration, the proposer generates a batch of question-answer pairs with heterogeneous hop structures. Guided by the solver's feedback, the proposer is optimized with HRPO to produce verifiable, diverse, and challenging queries, while the solver uses the generated data to improve its search and reasoning abilities through GRPO. This alternating optimization forms a symbiotic feedback loop: as the solver gets stronger, the rewards for simple queries shrink, forcing the proposer to explore more complex reasoning paths to maximize its return.
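Putting the pieces together, the alternating cycle might look like the following sketch. The propose_batch, rollout_solver, hrpo_update, and grpo_update callables are hypothetical placeholders for the components described above (and sketched earlier), not functions from the paper.

```python
from typing import Any, Callable

def self_evolve(
    proposer: Any,
    solver: Any,
    propose_batch: Callable,    # proposer -> batch of QA pairs with mixed hop counts
    rollout_solver: Callable,   # (solver, qa) -> success statistics used as feedback
    hrpo_update: Callable,      # (proposer, qa_pairs, feedback) -> updated proposer
    grpo_update: Callable,      # (solver, qa_pairs, feedback) -> updated solver
    iterations: int = 3,
):
    """Alternate HRPO updates for the proposer with GRPO updates for the solver."""
    for _ in range(iterations):
        # 1. Proposer synthesizes a batch of verifiable QA pairs (one per prompt).
        qa_pairs = propose_batch(proposer)
        # 2. Solver attempts every question; its success statistics become the
        #    proposer's feedback (see the reward sketch above).
        feedback = [rollout_solver(solver, qa) for qa in qa_pairs]
        # 3. Proposer update via HRPO, with advantages from the hop-group baseline.
        proposer = hrpo_update(proposer, qa_pairs, feedback)
        # 4. Solver update via GRPO on the same synthetic data.
        solver = grpo_update(solver, qa_pairs, feedback)
    return proposer, solver
```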
Data-Free Evolution Beats Supervised Training
To evaluate Dr. Zero's search and reasoning abilities, the experiments cover a range of open-domain question-answering scenarios, forming a comprehensive benchmark suite.
It includes single-hop tasks such as NQ (Natural Questions) and TriviaQA, which mainly test the model's ability to retrieve and answer a single fact accurately, and multi-hop tasks such as HotpotQA, MuSiQue, and 2WikiMQA, which require multi-round search, information synthesis, and coherent reasoning, placing far higher demands on the agent's interaction and deep-understanding abilities.
Figure | The performance of Dr. Zero trained with different generated question distributions.
Based on the above evaluation, the research team drew the following conclusions:
1. Performance comparable to or better than the supervised baseline
After multiple rounds of self-evolution, Dr. Zero matches or exceeds fully supervised search-agent baselines trained on manually annotated data (such as Search-R1) across multiple open-domain question-answering benchmarks, with gains of up to 14.1% on some tasks. The results show that the performance reached through data-free evolution is reliable and robust.
2. Far exceeds other data-free baselines
Compared with existing data-free methods such as the self-questioning language model SQLM and the self-evolving reasoning model R-Zero, Dr. Zero performs best on every task, with average improvements of 39.9% over SQLM and 27.3% over R-Zero; the gap is especially pronounced on complex multi-hop tasks. The questions produced by its difficulty-guided reward mechanism yield an average improvement of 83.3% over the optimized R-Zero*, highlighting its particular strength in fostering complex reasoning abilities.
3. Significant scale effect, verifying the scalability of the framework
The research team also observed a clear model-scale effect. The 7B-parameter model performs particularly well on complex multi-hop reasoning datasets such as 2WikiMQA, with a significant relative improvement of 7.67%. This suggests that the Dr. Zero framework scales well: larger models can exploit the self-evolution mechanism more effectively to handle more complex, intertwined search and reasoning tasks.
This article is from the WeChat official account "Academic Headlines" (ID: SciTouTiao), author: Wang Yueran, published by 36Kr with authorization.