
Stop reinventing the wheel: agent skills achieve "collective evolution," with an 88% improvement after six rounds of evolution.

[Account deactivated] · 2026-04-15 18:22
Farewell to the static skills library

AI agents based on large language models (LLMs) are already capable of handling complex tasks such as configuring services, debugging APIs, and automating multi-step workflows. These capabilities largely rely on "skills": structured programs that encode tool-calling and task-solving procedures.

However, the current agent skills ecosystem has a fundamental problem: skills are mostly static after deployment. Effective solutions that users discover during interactions remain confined to the current session; they are neither distilled into the skills library nor passed on to other users. When different users repeatedly encounter the same workflows, similar tool-calling patterns, and analogous failure scenarios, the system fails to learn from them. Every user ends up reinventing the wheel.

To address this pain point, the research team from DreamX proposed SkillClaw, a collective skills-evolution framework for multi-user agent ecosystems. It uses cross-user and cross-time interaction data as the core signal for skill improvement. Through an autonomous evolution engine, it continuously aggregates interaction trajectories, identifies behavior patterns, and updates the skills library, so that improvements discovered in one user's scenario automatically spread across the entire system.

The relevant paper has been published on arXiv, and the code is open-sourced on GitHub.

Paper link: https://arxiv.org/pdf/2604.08377

GitHub address: https://github.com/AMAP-ML/SkillClaw

The core contributions are as follows:

  • SkillClaw is the first framework to achieve multi-user-driven collective skills evolution. It turns the interaction experiences of different users into continuous updates of the shared skills library without any additional user effort.
  • Its Agentic Evolver skill-update mechanism analyzes interaction evidence through open-ended reasoning (instead of predefined rules) and independently decides whether to refine, create, or retain skills.
  • Experiments on the WildClawBench benchmark show that after 6 rounds of evolution, SkillClaw achieved continuous improvement across four task categories, with a relative improvement of 88.41% in the Creative Synthesis category.

How does SkillClaw work?

SkillClaw's design centers on a core insight: different users exercising the same skill in different scenarios generate complementary perspectives on the skill's behavior boundaries, revealing the conditions under which it works and those under which it fails. A single user rarely generates enough signal to distinguish "generalizable improvements" from "scenario-specific patches"; aggregating cross-user evidence provides a basis for stable skills evolution.

Figure | Overview of the overall SkillClaw framework

The entire system forms a cyclic pipeline: multi-user interaction → session collection → skills evolution → skills synchronization. The following sections elaborate on it in three stages.

1. From isolated sessions to shared evidence

SkillClaw first records each interaction session as a structured causal chain: user prompt → agent action (including tool calls) → intermediate feedback (tool results, error messages, user responses) → final answer. The complete intermediate process is retained because most skill-level failures are procedural. Problems such as incorrect parameter formats, omitted verification steps, and wrong tool-calling orders are invisible in the final answer and can only be diagnosed from the intermediate action-feedback chain.
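The paper does not publish a concrete schema, but the causal chain described above can be sketched as a minimal record type. All names here are illustrative, not taken from the SkillClaw codebase:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Step:
    """One agent action and the feedback it produced."""
    action: str       # e.g. a tool call with its arguments
    feedback: str     # tool result, error message, or user response
    ok: bool          # whether this step succeeded

@dataclass
class Session:
    """A full interaction recorded as a causal chain."""
    prompt: str
    steps: list[Step] = field(default_factory=list)
    final_answer: str = ""
    skill_used: Optional[str] = None  # skill invoked, if any

    def failed_steps(self) -> list[Step]:
        # Procedural failures live in the intermediate chain,
        # not in the final answer.
        return [s for s in self.steps if not s.ok]
```

Keeping every intermediate step (rather than just prompt and answer) is what makes failures like a wrong port or a skipped verification step diagnosable after the fact.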

Subsequently, all sessions are grouped by the skills they reference. For each skill, every session that invoked it is collected into an evidence group; sessions that used no skill are placed in a separate group. When multiple sessions call the same skill but produce different results, the skill itself becomes the control variable. This natural ablation enables two kinds of analysis: evaluating how existing skills actually perform under diverse real-world usage, and mining the no-skill group for repetitive processes not covered by any skill.
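Grouping by referenced skill is a simple bucketing step; a minimal sketch (the `"skill"` field name and sentinel key are assumptions for illustration):

```python
from collections import defaultdict

def group_by_skill(sessions):
    """Bucket sessions into evidence groups by the skill they invoked.
    Sessions that used no skill fall into a separate group, which is
    later mined for repetitive, uncovered workflows."""
    groups = defaultdict(list)
    for session in sessions:
        groups[session.get("skill") or "__no_skill__"].append(session)
    return dict(groups)
```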

2. Agentic Evolver: Autonomous skills evolution engine

The core of SkillClaw is the Agentic Evolver, an LLM agent equipped with a structured harness. It receives the grouped session evidence together with the current skill definition and decides how to act through open-ended reasoning. The harness provides structured input but does not constrain the reasoning process. This "fixed harness + open reasoning" separation lets the system handle diverse failure patterns without hand-writing rules for each situation.

Specifically, for each skill and its associated session group, the Evolver examines both successful and failed executions and selects one of three operations: Refine (correct identified errors or improve robustness), Create (add a new skill when the evidence reveals a repetitive subprocess not covered by existing skills), or Skip (leave the skill unchanged when the evidence is insufficient to justify modification).
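As a rough sketch of the harness interface, the three operations can be modeled as an enum plus a decision function. The real Evolver replaces the toy rules below with open-ended LLM reasoning; only the shape (grouped evidence in, one operation out) reflects the paper, and the threshold is an invented placeholder:

```python
from enum import Enum

class Op(Enum):
    REFINE = "refine"
    CREATE = "create"
    SKIP = "skip"

def decide(skill, successes, failures, min_evidence=3):
    """Toy stand-in for the Evolver's decision step. The actual system
    reasons open-endedly over the evidence with an LLM; the fixed part
    is only this harness contract."""
    if len(successes) + len(failures) < min_evidence:
        return Op.SKIP        # too little evidence to justify a change
    if skill is None and failures:
        return Op.CREATE      # repeated subprocess not covered by any skill
    if failures:
        return Op.REFINE      # correct identified errors, keep what works
    return Op.SKIP            # skill already performing well
```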

Crucially, the Evolver always analyzes successful and failed sessions jointly. Successful sessions define the invariants that must be preserved in the skill, i.e., the parts that work; failed sessions define the targets to be corrected. This joint perspective prevents a common failure mode: accidentally breaking an already-verified process while fixing a problem, and thus ensures that evolution is cumulative.

3. Synchronization and evolution cycle

Candidate skill updates produced by evolution must be verified before being written into the shared repository. Verification runs overnight, using idle user environments. For the current version and the candidate update of the same skill, the system selects relevant tasks from the interaction data collected that day, runs both versions in the same environment, and compares the results. Only an update that performs better is accepted and synchronized to all agents; a rejected update is retained only as a candidate record.

This verification step yields monotonic deployment behavior: since only improvements are adopted, the deployed skill pool does not degrade over time. The whole system forms a complete cycle: interaction → evidence → evolution → verification → deployment. Updated skills influence future interactions and generate new evidence for the next round of evolution. From the user's perspective, all of this happens automatically in the background, with no extra effort required.
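The accept-only-if-better rule can be sketched in a few lines. The `run` scoring callback and the strict-inequality tie-breaking are assumptions; the paper only states that the better-performing version is accepted:

```python
def verify_and_deploy(current, candidate, tasks, run):
    """Nightly A/B check: score both skill versions on the day's
    relevant tasks and accept the candidate only if it is strictly
    better, so the deployed skill pool never regresses."""
    current_score = sum(run(current, task) for task in tasks)
    candidate_score = sum(run(candidate, task) for task in tasks)
    return candidate if candidate_score > current_score else current
```

Because the comparison is made in the same environment on the same tasks, a worse or merely equal candidate is rejected, which is exactly what makes the deployed pool monotone.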

Experimental results

The research team evaluated SkillClaw on WildClawBench, a real-world agent benchmark containing 60 complex tasks across six domains: productivity workflows, code execution, social interaction, retrieval, creative generation, and safety alignment. The benchmark requires end-to-end execution in a real Linux container environment.

The experiment simulated a multi-user deployment scenario lasting 6 days (6 rounds). Each day was divided into a daytime interaction phase and a nighttime evolution-and-verification phase. Eight concurrent users participated, and all executions, evolutions, and verifications were driven by Qwen3-Max. The results are as follows:

Table | User-side performance evolution in four categories of WildClawBench (Day 1 is the baseline)

The Social Interaction category improved the fastest, rising from 54.01% to 60.34% on the second day and then holding steady, indicating that a high-impact workflow bottleneck was resolved quickly.

The Search & Retrieval category improved gradually: it first resolved input validation and file-accessibility problems, then progressively built constraint-aware retrieval planning, reflecting a "low-level reliability before high-level reasoning" pattern in retrieval tasks.

The Creative Synthesis category improved significantly on the second day and then plateaued, indicating that the main bottleneck lies not in content generation itself but in environment setup, such as file handling, working-directory configuration, and multimodal pipelines.

The Safety & Alignment category only improved on the fifth day, with the gains concentrated in execution reliability, such as the fallback strategy for Git authentication failures and the directory-cloning protocol.

Meanwhile, in the controlled verification experiment, on customized queries such as "basic extraction", "deadline parsing", and "save report", the average improvement after a single round of evolution reached 42.1%. Notably, the success rate of "save report" rose from 28.3% to 100.0%. The initial failures stemmed from missing environment-specific procedures (such as output path and format), which could be fully corrected once encoded as a reusable skill.

Table | Controlled verification results: Performance comparison of three customized queries before and after evolution

In addition, the research team also demonstrated the specific effects of skills evolution through multiple case studies.

For example, in the Slack message-analysis task, the original agent used a naive workflow and handled tool failures by trial and error (such as incorrect API port configuration). The evolved skill introduced a structured pipeline: first scan message previews to filter relevant content, then selectively retrieve the full messages, with known API configuration errors encoded directly into the skill. This transformation reflects three key improvements: task decomposition, proactive error correction, and selective retrieval.
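The evolved scan-then-fetch pattern amounts to a two-stage filter. A hypothetical sketch (function and parameter names are illustrative, not the actual SkillClaw skill):

```python
def analyze_messages(previews, fetch_full, relevant):
    """Evolved pattern: scan cheap message previews first, then
    selectively fetch full messages only for the relevant ones,
    instead of pulling everything and recovering by trial and error."""
    hit_ids = [msg_id for msg_id, preview in previews if relevant(preview)]
    return [fetch_full(msg_id) for msg_id in hit_ids]
```

The design point is cost-shaping: the expensive operation (full retrieval) runs only on the subset that a cheap preview scan has already judged relevant.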

Limitations and future directions

Of course, this research also has some limitations.

The research team points out that SkillClaw is currently at the small-scale testing stage, with limited numbers of user queries, feedback signals, and interaction depth. Within the 6-day experimental window, late-stage evolution in some categories (such as Creative Synthesis) failed to surpass the best skill pool established early on; the long-term evolution effect remains to be observed.

In addition, although the verification mechanism guarantees monotonic deployment, it introduces extra token overhead: candidate skills must run complete tool interactions in a real environment. Compared with direct deployment, this extra cost is the price paid for more stable user-side performance.

According to the paper, future work includes expanding the user scale and time span to enrich evolution trajectories, and introducing more diverse tasks and verification conditions.

From a static skills library to a dynamic, interaction-driven skills ecosystem, SkillClaw represents a new paradigm: letting an agent's capabilities grow autonomously from collective experience in real-world use rather than being maintained by hand. When the interaction trajectories of different users can converge into shared knowledge, the agent system gains the possibility of continuous evolution with usage.

This article is from the WeChat official account “Academic Headlines” (ID: SciTouTiao), author: Academic Headlines. Republished by 36Kr with permission.