
Just as Skills are becoming popular, Agents with zero Skills are already here...

量子位 (QbitAI) | 2026-01-26 19:37
Self-evolving Agent

Just as Skills became a huge hit, a new Agent paradigm has emerged to stir things up...

There's absolutely no need for Skills, nor is there a need to rummage through projects on GitHub or search for tools. Simply present your requirements to the Agent, and it can create its own tools while working.

Yes, it needs no human assistance at all. Nobody has to hand the AI its tools or hold the ladder for it.

Whenever it encounters a task that requires specific tools during work, the Agent can directly "evolve" the necessary tools on its own.

With Gemini 3 Pro as its backbone, it sits near the very top of the notoriously difficult HLE (Humanity's Last Exam) evaluation, second only to the agent built on GPT5.2-Pro.

On several highly challenging evaluation sets, its scores run nearly 20 points above the official, as-yet-unpublished tool-use results.

Moreover, it achieved this in a single run.

This is from a newly published research paper.

An Agent That Can Create Its Own Tools

I discovered this paper because I came across a demo a few days ago.

At first glance, it seemed like an ordinary interaction scenario: a user had a task requirement and sent a prompt to the Agent.

For the class of 2023, find the states with an ACT participation rate of 50% or higher and an average composite score of 20 or above, and report the proportion of students in those states who met the science benchmark.

Then the Agent started analyzing the task, planning, and selecting potential tools.

So far, everything seemed normal.

To be honest, the task chosen in this demo doesn't seem ideal. It's too open-ended, and it doesn't seem like existing tools can solve it in one go. It probably requires multiple iterative conversations.

Sure enough, a problem arose. The available tools were insufficient, and the Agent couldn't proceed.

Wait a minute...

Why is it starting to create its own tools? And it can even fix them when there are errors?

This is quite surreal. It's like watching an orangutan at the zoo: one moment it's lying there peeling a banana, the next it springs up and starts making fire by rubbing sticks together.

I quickly dug out the paper and examined it from start to finish.

After reading it, I discovered a bunch of even more startling details.

Across five evaluation sets, with only one chance to answer each question, this Agent created 128 tools!

Yes, it started from scratch and built up to 128 tools one by one.

A genuinely hard-mode start.

To make matters worse, the researchers threw it straight into the extremely challenging HLE (Humanity's Last Exam), where it had to compete with other powerful Agents built on GPT, Claude, and Gemini.

However, something unexpected happened.

When faced with questions it couldn't handle, this Agent actually started creating its own "weapons."

It continued to fight against the tasks and create tools along the way.

After completing over two thousand questions in the HLE, it had quietly accumulated 97 "swords" (tools).

That's not all. Carrying those ninety-odd tools, it then headed into more diverse benchmarks: DeepSearchQA, FinSearchComp, and XBench.

It continued to create tools and level up by solving tasks.

When it reached nearly 4000 questions, it suddenly stopped creating tools.

The trend is evident in the curve: rapid growth in the early stage, followed by clearly diminishing marginal returns.

Finally, the number of tools stabilized at 128.

It's as if it knew that these tools were sufficient.

[Figure: cumulative number of tools created as the number of processed queries increases]

This is very crucial. It indicates that the previously created tools are not randomly made but are truly reusable.

So, when it accumulated 128 tools, the Agent suddenly realized that the existing tools could cover most new tasks, and there was no need to continue expanding.

The following graph makes this even clearer: a comparison of the Agent's performance under two strategies, where ZS means starting with zero tools and WS means the knowledge-transfer strategy that carries tools across datasets in the order mentioned earlier.

Under the WS strategy, it's obvious that the more existing tools there are, the fewer new tools are needed. In the last two stages of XBench, the number of new tools even dropped to zero.

The following graph is even more interesting. It shows the top 50 tools that this Agent likes to use the most.

The top-ranked tool is "web search," leading by a large margin.

The following tools are also very familiar: content retrieval, calculator, file download, academic paper search, PDF processing...

It's exactly the same as human work habits. These are all general-purpose basic tools. Moreover, their reuse rate is extremely high, and the Matthew effect is very obvious.

It seems that this Agent may not be creating tools just for the sake of it. Instead, like humans, it has developed a set of methodologies during the work process and can transfer them between different tasks.

The experimental results also confirm this.

This tool-creating Agent comes out ahead on almost all five of the benchmarks mentioned earlier.

It comprehensively beats the baseline Agent built on Gemini 3 Pro, and on tasks requiring complex retrieval and reasoning it leads by more than ten percentage points.

In-situ Self-evolving Framework

How did it achieve this?

The research team used a brand-new framework called In-situ Self-evolving Agent.

At first glance, it's not easy to understand, but it vaguely feels like a very appealing concept.

A closer look shows that the industry has been working on self-evolving agents for a while, but in-situ self-evolution is different.

Ordinary self-evolution mostly happens during the training phase and depends heavily on high-quality external supervision signals: experts pre-select the evolution domain, one model creates questions or annotates the answers, and a new model then evolves by maximizing an objective function over that annotated data.

This mode usually optimizes based on a long-term goal and can fundamentally reshape the model's "brain."

The most common result is what various model manufacturers are currently doing: training a new model and releasing it to make a splash.

However, the drawbacks are also obvious.

The workload is huge, and the feedback loop is extremely long. Therefore, it can only be completed during the training phase. Once the model is deployed, there's no more "evolution."
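In pseudocode terms, that training-time pipeline looks roughly like the minimal sketch below. Every name in it (the teacher and student objects and their three method calls) is a hypothetical stand-in for illustration, not an API from the paper.

```python
# Hypothetical sketch of training-time self-evolution, as described above.
# None of these functions are real APIs; they stand in for the pipeline stages.

def training_time_self_evolution(teacher, student, domain, num_steps):
    dataset = []
    for _ in range(num_steps):
        question = teacher.generate_question(domain)  # expert-chosen domain
        answer = teacher.annotate_answer(question)    # external supervision signal
        dataset.append((question, answer))
    # The new model evolves by maximizing an objective over the annotated data;
    # the feedback loop closes only after a full (and expensive) training run.
    return student.finetune(dataset)
```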

In contrast, in-situ self-evolution is a type of self-evolution that occurs during the inference phase.

It requires no external supervision and no ground truth. Relying solely on the model's internal feedback during inference and the experience accumulated from previous interactions, it can distill reusable, general skills.

In other words, once the model is deployed, it can "learn while doing."
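Under the same caveat that every name below is invented for illustration, the inference-time counterpart looks roughly like this: the only "supervision" is whether things worked, and whatever worked gets kept.

```python
# Hypothetical sketch of in-situ self-evolution at inference time.
# `llm` and `extract_working_tool` are illustrative stubs, not the paper's API.

tool_library = {}  # tool name -> source code; persists after deployment

def llm(prompt: str) -> str:
    """Placeholder for a call to the backbone model (e.g. Gemini 3 Pro)."""
    raise NotImplementedError

def extract_working_tool(reply: str):
    """Placeholder: pull any fresh code out of the reply and test-run it.
    Returns (name, source) if the code executed cleanly, else None."""
    return None

def answer(query: str) -> str:
    # Solve the task with whatever tools already exist...
    reply = llm(f"Available tools: {list(tool_library)}\nTask: {query}")
    # ...and if the model wrote new code that ran successfully, distill it
    # into a named, reusable tool. Internal feedback replaces ground truth.
    distilled = extract_working_tool(reply)
    if distilled is not None:
        name, source = distilled
        tool_library[name] = source  # "learning while doing"
    return reply
```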

At this point, some readers will surely ask:

Is this autonomous learning, the AI industry's long-sought goal?

After just one round of training, the model could keep acquiring new abilities online, perhaps even hitting the singularity of an intelligence explosion and reaching Artificial Superintelligence (ASI).

In fact, at the Yunqi Conference in 2025, Wu Yongming, the CEO of Alibaba, pointed out:

ASI will definitely be achieved, and a key prerequisite is that AI can self-evolve.

It's worth noting that when the industry talks about self-evolution in the context of ASI, it mostly refers to the parameter level.

In-situ self-evolution focuses on three other aspects: workflow, memory, and tools.

It may not be the "ultimate" solution, but it's more realistic and can be implemented immediately.

I remember that a few weeks ago, at the Tsinghua Summit on Large Models, Yao Shunyu also mentioned a similar view:

Autonomous learning has already occurred. ChatGPT continuously adapts its chatting style based on the conversation process, and 95% of Claude's Agent codebase is written by the model itself.

The Agent developed by Yunjue Technology adopts exactly this immediately deployable "in-situ self-evolving" approach, but with a distinctive angle: tools come first.

The team believes that the workflow approach is prone to overfitting to a few tasks. Once the thinking pattern is fixed, it's difficult to generalize.

The memory approach can't avoid the hallucination problem inherent in large language models. As the number of tokens increases, the deviation will snowball.

From a first-principles perspective, tools are the most intuitive carriers of evolution.

First of all, tools directly determine the Agent's ability boundary.

Every wonder humans have built from Earth's resources rests on new production tools. The same goes for AI: no matter how much context it accumulates, without a "shovel" it can only sit on a gold mine and do nothing.

Secondly, tool execution naturally comes with high-quality supervision signals and doesn't require human annotation.

Whether a workflow is good or a memory is reliable is fairly subjective. But whether a tool works can be read directly off whether its code throws an error: the so-called binary feedback signal.

Moreover, with formal verification the code can be made maximally secure, allowing the Agent to safely perform low-level operations such as API calls and database reads and writes.
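To make the binary signal concrete, here is a minimal sketch that runs a candidate tool in a separate process and reads the exit code as pass/fail. The subprocess-with-timeout isolation is an assumed simplification, not the team's actual sandbox or verification setup.

```python
# Minimal sketch of binary feedback from tool execution. The isolation here
# (a subprocess with a timeout) is an assumed simplification of whatever
# sandboxing or verification the team actually uses.

import subprocess
import sys

def tool_passes(tool_source: str, test_snippet: str, timeout_s: float = 30.0) -> bool:
    """Return True iff the tool plus its test runs to completion without error."""
    program = tool_source + "\n\n" + test_snippet
    try:
        proc = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # a hang counts as a failure, too
    # Exit code 0 is the entire supervision signal: no human annotation needed.
    return proc.returncode == 0
```

On failure, the captured stderr can be fed back to the model so it can repair the tool and retry, matching the self-fixing behavior seen in the demo.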

Nor is this a matter of sacrificing one thing for another: once the tool set has largely converged, there is still time to improve the workflow and memory.

Based on the above considerations, and guided by the "tool-first" principle, the team built a multi-agent system capable of in-situ self-evolution.

It consists of four roles:

First is the Manager, which is responsible for overall planning.

After receiving the user's request, it analyzes the task, breaks down the goals, and checks the existing tool library to see if there are any available tools.

If it finds that the existing capabilities are insufficient, the Manager will instruct the Tool Developer to create a new tool on the spot and configure it immediately in the current context.

Once everything is ready, the Executor will use these tools to process the task.

If it finds that the task still can't be completed, it will pause the execution and report to the Manager.

After receiving the report, the Manager repeats the earlier process, supplementing tools and capabilities until the task can be fully completed.

After the task is completed, it's handed over to the Integrator, which integrates the execution history and intermediate results to generate the final answer.
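Putting the four roles together, the control flow reads roughly like the sketch below. The class and method names are invented for illustration; only the roles and their division of labor (Manager, Tool Developer, Executor, Integrator) come from the article.

```python
# Illustrative sketch of the four-role loop. All identifiers are invented;
# only the roles and their division of labor come from the article.

def run_task(request, manager, tool_developer, executor, integrator,
             tool_library, max_rounds=5):
    # Manager: analyze the task, break down goals, check the tool library.
    plan = manager.analyze(request, tool_library)
    history = []
    for _ in range(max_rounds):
        # Where capabilities are missing, new tools are built on the spot
        # and registered into the current context immediately.
        for spec in manager.missing_capabilities(plan, tool_library):
            tool_library.register(tool_developer.build(spec))
        # Executor: process the task with the (possibly enlarged) toolset.
        outcome = executor.execute(plan, tool_library)
        history.append(outcome)
        if outcome.completed:
            break
        # Otherwise pause, report back, and let the Manager supplement tools.
        plan = manager.replan(plan, outcome)
    # Integrator: merge execution history and intermediate results into the answer.
    return integrator.integrate(request, history)
```

Note the single retry loop: each round either finishes the task or grows the tool library, which is what lets the tool count accumulate and eventually saturate across benchmarks.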