Three papers have clearly outlined the dilemmas of the Agent's first year.
In 2025, the capital markets dubbed it the "Year of Agents".
Multi-Agent applications such as Manus, Lovart, and Fellou drew enormous attention for their high degree of automation and strong generalization, and Xiao Hong's line, "More Intelligence, Less Structure", struck a deep chord.
Most of these star companies adopt a joint multi-Agent architecture, where completing a task involves many tool calls and often long waits. Under their influence, two iron laws seem to have taken hold in the Agent industry: first, a single Agent's capabilities are limited, and multi-Agent collaboration is what cracks complex problems; second, if the results fall short, add more Tokens and tool calls and performance will naturally improve.
However, a research report titled "Measuring Agents in Production", published by UC Berkeley in December, presents a parallel universe that runs completely counter to the star companies' narrative.
The Berkeley team surveyed 306 front-line practitioners and studied 20 cases in depth (including large banks such as Intesa Sanpaolo). To avoid bias, the paper deliberately filtered out projects still at the slideware or demo stage and only examined systems that had been deployed and were generating real value.
The results show that real-world production systems are far more conservative than anything in the lab. You could even say they are all "cowards".
68% of production-level Agents have their execution steps strictly capped at 10 or fewer. Only 16.7% allow dozens of steps, and a mere 6.7% impose no limit at all.
To simplify tool use and reduce risk, enterprises dare not let Agents call the underlying production APIs directly. Teams usually build an abstraction layer (wrapper APIs) between the Agent and the real environment. For example, if the underlying system needs 3 interface calls to look up a user, engineers wrap them into one large interface for the Agent: one step replaces three.
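A minimal sketch of that wrapper pattern, assuming a purely hypothetical user-lookup scenario (none of these function names or data shapes come from the report):

```python
# Hypothetical wrapper layer: the Agent sees one coarse-grained tool instead
# of the three underlying production APIs. All names and fields are made up.

def _get_profile(user_id: str) -> dict:
    # In a real system this would hit the user-profile API.
    return {"id": user_id, "name": "Jane Doe"}

def _get_orders(user_id: str) -> list[dict]:
    # In a real system this would hit the order-history API.
    return [{"order_id": "A-1001", "status": "shipped"}]

def _get_tickets(user_id: str) -> list[dict]:
    # In a real system this would hit the support-ticket API.
    return [{"ticket_id": "T-42", "open": True}]

def lookup_user(user_id: str) -> dict:
    """The single tool exposed to the Agent: one call replaces three API hits."""
    return {
        "profile": _get_profile(user_id),
        "orders": _get_orders(user_id),
        "open_tickets": _get_tickets(user_id),
    }
```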
80% of the in-depth interview cases use "structured control flow", meaning the task flowchart is drawn by humans and the AI only fills in the blanks inside that fixed framework.
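Here is roughly what that looks like in code: the branches and the step cap are hand-written, and a stand-in llm() call only fills fixed slots. The ticket-handling scenario and every name in it are invented for illustration.

```python
# Hand-written control flow: humans draw the flowchart, the model fills blanks.
# `llm` is a stand-in for whatever completion API a team actually uses.

MAX_STEPS = 10  # hard cap, in line with the "within 10 steps" finding above

def llm(prompt: str) -> str:
    # Placeholder returning canned answers so the sketch runs end to end.
    return "refund" if "Classify" in prompt else "Drafted reply."

def handle_ticket(ticket_text: str) -> str:
    steps = 0

    # Slot 1: classification (the model fills the blank).
    category = llm(f"Classify this ticket as 'refund' or 'other':\n{ticket_text}")
    steps += 1

    # The branch is chosen by hand-written code, not by the model.
    if category.strip().lower() == "refund":
        draft = llm(f"Draft a refund confirmation for:\n{ticket_text}")
    else:
        draft = llm(f"Draft a polite triage reply for:\n{ticket_text}")
    steps += 1

    assert steps <= MAX_STEPS
    return draft
```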
The paper's data show that 12% of deployed systems have a Prompt longer than 10,000 Tokens; these Agents run inside pipelines driven by extremely rigid System Prompts, some stretching to tens of thousands of words.
Today's successful cases are essentially a "tireless intern with reading comprehension" slotted into a strict SOP. Compared with hard-coded SaaS, it can understand vague intent and shows a certain flexibility, but that is all.
Why is the reality so harsh?
In November and December, DeepMind published two papers in quick succession that read like a pathology report for the dismal picture in the Berkeley study, because they directly disproved two core assumptions of the Agent community.
Through experiments and data, they showed that the magical era in which such abilities simply emerge from the models on their own has not yet arrived. We remain in an engineering era of hard coding, tight control, and pipelines.
01
The Collapse of the Tower of Babel, More Agents ≠ Better Performance
DeepMind's first paper shattered the myth that "more Agents means stronger performance" with 180 controlled experimental configurations.
Over the past year, architects fantasized that if one model is not smart enough, they could simply use a pile of them: let GPT-5 play the product manager, a team of Claude models play the programmers, and a group of Gemini models handle testing. Run it like a company, assemble a virtual team of a dozen-plus PhD-level AIs, and what problem could possibly remain unsolved?
However, DeepMind's paper "Towards a Science of Scaling Agent Systems" showed this to be a fantasy. They built what may be the largest-scale experiment in Agent history.
The experiment tested five mainstream Agent architectures (two of them are sketched in code after this list), including:
● Single-Agent System (SAS): one Agent completes the entire task (e.g., the ReAct architecture)
● Independent multi-Agent architecture: a group of Agents work on the same task in parallel with no communication in between, and the results are aggregated afterwards, usually to cancel out hallucinations
● Decentralized multi-Agent architecture: Agents debate and discuss peer-to-peer via protocols such as A2A and finally converge on a result
● Centralized Agent architecture: a commander Agent distributes tasks and verifies results
● Hybrid Agent architecture: usually a combination of the centralized and decentralized forms, where the worker Agents at the bottom talk to each other while a supervisor also assigns tasks
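As a rough illustration only, here are skeletal Python sketches of the independent and centralized topologies; agent() is a placeholder for any single model call, and nothing here reproduces the paper's actual harness.

```python
# Schematic skeletons of two of the tested topologies (not the paper's setup).
from collections import Counter

def agent(prompt: str) -> str:
    return "42"  # placeholder answer so the sketch runs

def independent_multi_agent(task: str, n: int = 3) -> str:
    """Independent: n Agents answer in parallel, results aggregated by vote."""
    answers = [agent(task) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

def centralized(task: str, workers: int = 3) -> str:
    """Centralized: a commander splits the task, workers execute, commander merges."""
    subtasks = agent(f"Split into {workers} subtasks:\n{task}").splitlines()[:workers]
    partials = [agent(f"Solve subtask: {s}") for s in subtasks]
    return agent("Verify and merge these partial results:\n" + "\n".join(partials))
```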
The models tested were popular offerings from three top-tier labs: OpenAI, Google, and Anthropic. Four common Agent benchmarks were then used to measure how the different combinations performed: financial analysis (Finance-Agent), web browsing (BrowseComp-Plus), game planning (PlanCraft), and office workflows (Workbench).
These factors yielded more than 180 combinations, and this large-scale, controlled comparison surfaced some basic laws of Agent design.
1. Tool-Collaboration Trade-off
In open and complex tasks, simply increasing the number of Agents will only make the system "dumber".
In an environment like PlanCraft, which resembles Minecraft, introducing multi-Agent collaboration not only failed to help but caused a significant decline; Anthropic's model, for example, dropped 35.0% once collaboration was introduced. The reason is the "coordination tax": every Agent has to understand the interfaces, maintain context, and process results, and once the number of tools crosses a threshold, the cost of passing information around exceeds the benefit of working in parallel.
The Tokens are all spent on reading instructions and sitting in meetings, leaving none for actual work.
2. Capacity Saturation Effect
When the accuracy of a single Agent exceeds 45%, introducing multi-Agent collaboration often brings diminishing or even negative returns.
The logic behind this is simple: for a problem like 1 + 1 = 2, one Agent can solve it correctly, and having three Agents discuss it for a day won't make any difference.
3. Error Amplification Topology
This may be the key reason why, past capacity saturation, multi-Agent setups not only cost more but can also produce worse results.
Intuitively, we assume that having, say, 3 Agents vote on the answer should correct errors and lower the error rate. Yet the paper finds that in the independent multi-Agent architecture, errors are more likely to be amplified.
The paper quantifies this with an error amplification factor. In the independent multi-Agent architecture, that factor is 17.2, meaning that if a single Agent's error rate is 5%, the independent multi-Agent system's error rate can reach 86% (5% × 17.2).
The logic is simple: there is no cross-validation mechanism. Each Agent reaches its conclusion along its own reasoning path, errors self-reinforce inside each context, and voting merely combines three wrong answers.
This is the "Tower of Babel Effect". Three mediocre Agents can't match one brilliant one.
Based on these three observations, DeepMind finally proposed a mixed-effects model.
Translated, the formula is approximately as follows:
Final effect = (Individual intelligence + Strength in numbers) - (Chaos caused by more people + Communication noise + Cognitive burden of tools)
If the loss from the latter three terms outweighs the gain from having more Agents, the multi-Agent system fails.
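In symbols, the prose formula above can be sketched roughly as follows (the variable names are ours; the paper's exact functional form and coefficients are not reproduced here):

```latex
% Schematic only: mirrors the prose formula above, not the paper's model.
\text{Final effect} \;\approx\;
\underbrace{C_{\text{individual}} + G_{\text{numbers}}}_{\text{gains}}
\;-\;
\underbrace{\bigl(T_{\text{coordination}} + N_{\text{communication}} + B_{\text{tools}}\bigr)}_{\text{penalties}}
```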
In the paper, this formula predicts with 87% accuracy which Agent architecture is optimal for a given task, based on task attributes (such as the number of tools and how decomposable the task is) and the model's capability.
Different multi-Agent architectures perform very differently depending on task complexity. In PlanCraft, every architecture failed; in web retrieval, the advantage is unclear and errors may be amplified; in general office work, only the decentralized setup is slightly better, while the other architectures lag behind a single Agent.
It is worth noting that only in tasks like financial analysis do multi-Agent setups bring an across-the-board improvement, with the centralized architecture in particular lifting results by up to 81%.
That is because the boundaries of financial analysis tasks are extremely clear and the SOP is extremely well defined. An analysis task can be broken down into: reading the financial statements -> extracting data -> calculating ratios -> generating a summary. Each Agent only has to fill in the blanks within this fixed framework and never needs to do complex, creative planning, which is exactly where a centralized multi-Agent architecture becomes useful.
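Rendered as code, that SOP is just a fixed, human-defined pipeline in which each stage is a narrow blank for a model to fill; call_model and the stage names are hypothetical stubs, not anything from the paper.

```python
# The financial-analysis SOP from the text as a fixed, human-defined pipeline.
# Each stage is a narrow "blank" for a model to fill; there is no dynamic planning.

def call_model(role: str, payload: str) -> str:
    # Placeholder for a real model call so the sketch runs end to end.
    return f"[{role} output for: {payload[:40]}...]"

def financial_analysis(report_text: str) -> str:
    statements = call_model("read-statements", report_text)  # read financial statements
    figures = call_model("extract-figures", statements)      # extract data
    ratios = call_model("compute-ratios", figures)            # calculate ratios
    summary = call_model("write-summary", ratios)             # generate summary
    return summary
```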
This shows that even today's most powerful LLMs have not yet developed an emergent ability to organize a division of labor on their own. They can only handle easily parallelizable divide-and-conquer tasks (such as financial analysis) or consensus-based, fault-tolerant tasks (such as multi-path search).
For the centralized architecture with a coordinator, the intelligence ceiling is the commander's context-processing capacity. Without artificial, hard-coded tool stratification (i.e., grouping tools so that each commander sees only one group), a single commander cannot handle a complex tool library well enough to issue sensible instructions and break tasks down properly.
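A minimal sketch of what such hard-coded stratification could look like; the tool groups and routing rule below are invented, and a real system might route with keywords, metadata, or a classifier.

```python
# Hypothetical hard-coded tool stratification: each sub-commander is handed
# only one group of tools, so no single context has to hold the whole library.

TOOL_GROUPS = {
    "data": ["sql_query", "fetch_report", "read_sheet"],
    "documents": ["search_docs", "summarize_doc"],
    "messaging": ["send_email", "post_slack"],
}

def route(subtask: str) -> str:
    # Hand-written routing rule; deliberately simple for the sketch.
    if "email" in subtask or "notify" in subtask:
        return "messaging"
    if "report" in subtask or "table" in subtask:
        return "data"
    return "documents"

def tools_for(subtask: str) -> list[str]:
    """Return only the tool names the chosen sub-commander is allowed to see."""
    return TOOL_GROUPS[route(subtask)]
```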
In this situation, if a multi-Agent system is to fulfill its original purpose, namely handling complex long-chain tasks, manually arranged task-breakdown SOPs remain the unavoidable path for now.
Expecting to throw a pile of Agents together and let them evolve a hierarchy on their own has been shown to be unworkable, at least in this paper.
This is also the point of Anthropic's recently launched Skills: it lowers the context-processing burden of tool use so the model can do a better job of breaking tasks down and verifying results.
02
The Limitations of Reasoning, More Budget ≠ Effective Scaling
Since "increasing the number of Agents" doesn't work, can we "be more patient"?
After the release of OpenAI's o1, Test-time Compute became a hot topic. People firmly believed that as long as we give Agents more time to think, letting them search and reason over and over, they will surely find the way, right?
Many papers have already poked holes in this, but another DeepMind paper from November, "Budget-Aware Tool-Use Enables Effective Agent Scaling", aims the rebuttal squarely at Agents.
In that paper, researchers found that if you simply raise an Agent's tool-call budget, for example letting it search the web 100 times instead of 10, performance does not grow linearly; it quickly hits a ceiling.
For a standard ReAct Agent, doubling the budget raises accuracy by only 0.2 percentage points. With a budget of 100, the model uses an average of just 14.24 searches and 1.36 page views; roughly 85% of the budget is never touched.
This shows that Agents have no sense of what they don't know, nor of how much budget they still have available.
When the model goes down a wrong path (say, searching for a paper title that doesn't exist), it has no concept of opportunity cost: even with unlimited compute, it just digs deeper into the wrong pit. Worse, an overly long context scatters its attention and performance can actually decline; after a dozen-odd searches, it gets lost in the pile of useless results it has generated.
To solve this problem, DeepMind proposed BATS (Budget-Aware Test-time Scaling).
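Before getting into BATS itself, here is a rough sketch of the general idea it motivates: make the remaining budget explicit in the Agent's context so it can weigh "search again" against "answer now". This is not the paper's algorithm; every name and the SEARCH/FINAL protocol below are assumptions for illustration.

```python
# Budget-aware loop (illustrative only): the remaining tool-call budget is
# shown to the model at every step instead of being hidden from it.

def llm(prompt: str) -> str:
    return "FINAL: best answer found so far"  # stub so the sketch runs

def search(query: str) -> str:
    return f"results for {query!r}"  # stub tool

def budget_aware_agent(question: str, budget: int = 10) -> str:
    context = question
    for remaining in range(budget, 0, -1):
        decision = llm(
            f"{context}\n\nRemaining tool calls: {remaining}. "
            "Reply 'SEARCH: <query>' to spend one, or 'FINAL: <answer>' to stop."
        )
        if decision.startswith("FINAL:"):
            return decision[len("FINAL:"):].strip()
        context += "\n" + search(decision[len("SEARCH:"):].strip())
    return llm(context + "\n\nBudget exhausted; give your best final answer.")
```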