When can an Agent write a skill by itself?
Why is the lobster so useful? One big reason: its skills are.
On December 18, 2025, Anthropic released Agent Skills as an open standard. It's a set of standardized folder specifications that allow agents to load professional skills just like installing apps. Each skill folder contains a SKILL.md file that clearly explains what the skill is and how to use it. It can also include executable scripts, enabling agents not only to "know what to do" but also to actually take action.
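As a concrete illustration, a skill folder might look like the following. The skill name and contents here are hypothetical examples, not taken from Anthropic's documentation; per the spec, SKILL.md carries YAML frontmatter with at least a name and a description, with instructions below and optional scripts alongside:

```markdown
---
name: quarterly-report-check
description: Check a quarterly financial report against the company's
  formatting and disclosure checklist before it is sent out.
---

# Quarterly Report Check

1. Open the attached report and verify the section order matches
   the checklist in this folder.
2. Run the bundled validation script on every embedded table.
3. Flag any disclosure item from the checklist that is missing.
```

The description in the frontmatter is what the agent reads to decide whether the skill applies; the body is only loaded once the skill is triggered.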
Once the standard was introduced, the industry followed at an unexpectedly rapid pace. Microsoft directly integrated it into VS Code and GitHub. OpenAI adopted an almost identical architecture in ChatGPT and Codex CLI, though it hasn't officially announced it. Coding tools like Cursor, Goose, and Amp also followed suit. Box used skills to teach Claude to convert files into PPTs and Excel spreadsheets that meet company standards, and Notion used skills to enable Claude to perform tasks directly in notes rather than just chatting.
Why is this standard important? Model companies have used harnesses like MCP, CLI tools, and memory layers to give agents "hands and feet," but agents still lack professional knowledge.
Agent Skills fill this gap. It's not about "what tools you can generally use and how to do things," but rather "how to do a specific thing correctly."
Skills are the crystallization of know-how in the work process. Another advantage is that they can be quickly replicated. A company can write a compliance check skill and distribute it directly to all colleagues' agents.
The blueprint looks great, but then reality hits.
Anthropic ships a tool called skill-creator, which promises to help users generate skills automatically. In the first week after its launch, developer Samhita Alla observed how more than 100 users actually used it and concluded that "most implementations seem more like toys than tools."
Skills fail to trigger when they should; overlong instructions confuse the agent; security vulnerabilities slip in; and file-format errors occur again and again.
Automatically generated skills are rough and unreliable. Truly useful skills rely on manual refinement by humans.
Of course, skills have caught on precisely because current agents still lack a sufficient understanding of human work processes, norms, and know-how.
However, we still hope that agents can discover solutions to problems on their own.
Actually, the question of "letting skills grow on their own" has been asked for 26 years.
01
From Weights to Code: A 26-Year Pursuit of Skills
In 1999, Rich Sutton and his students Doina Precup and Satinder Singh proposed a theoretical framework called the options framework. The core idea is that agents should be able to discover and combine reusable behavior modules on their own, rather than starting from scratch and trying step by step every time. This was the first formal proposal of a concept similar to skills in the field of reinforcement learning.
However, in that era, skills were trapped in the weight matrices of neural networks, being unexplainable, non-transferable, and non-editable. If you trained a skill to open a door, it was almost impossible to apply it to another environment.
This dilemma persisted for 24 years until, in 2023, Jim Fan and others' Voyager pulled skills out of the weights and into code in Minecraft. There, an agent driven by GPT-4 explored the game autonomously. Every time it learned a new ability, it wrote it down as a JavaScript function and stored it in a skill library. The next time it encountered a similar situation, it would first search the library: if a matching skill was found, it called it directly; if not, it created a new one.
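The store-then-retrieve loop can be sketched in a few lines. This is a hedged simplification: Voyager's actual implementation stores skills with GPT-4-generated description embeddings and retrieves by vector similarity, whereas this toy version scores by word overlap; all skill names and descriptions below are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    description: str
    code: str  # executable JavaScript in Voyager; stored as plain text here


class SkillLibrary:
    """Minimal sketch of a Voyager-style skill library.

    The real system embeds skill descriptions and retrieves by vector
    similarity; we approximate that with word overlap for simplicity.
    """

    def __init__(self) -> None:
        self.skills: list[Skill] = []

    def add(self, skill: Skill) -> None:
        self.skills.append(skill)

    def retrieve(self, task: str, k: int = 1) -> list[Skill]:
        words = set(task.lower().split())
        scored = sorted(
            self.skills,
            key=lambda s: len(words & set(s.description.lower().split())),
            reverse=True,
        )
        return scored[:k]


lib = SkillLibrary()
lib.add(Skill("craftPickaxe", "craft a wooden pickaxe from planks", "..."))
lib.add(Skill("mineIron", "mine iron ore with a stone pickaxe", "..."))
print(lib.retrieve("how to mine iron ore")[0].name)  # mineIron
```

The point of the design is not the matching heuristic but the representation: because each entry is readable code plus a natural-language description, skills can be inspected, edited, and shared.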
As a result, Voyager obtained 3.3 times the number of unique items compared to the previous best method and unlocked the technology tree 15.3 times faster. Writing skills in code means they are naturally explainable, editable, combinable, and transferable.
Voyager architecture diagram: Automatic curriculum, iterative prompting mechanism, and Skill Library (Wang et al., 2023)
Voyager's real contribution isn't in the numbers. It proved that when the representation of skills changes from internal parameters to readable code, the entire game rules change. Skills in parameter form are black boxes, invisible, unchangeable, and unable to be shared with other agents. Skills in code form can do all these things. This is the real turning point in 26 years.
Agents don't learn skills because they become smarter; rather, skills become readable, allowing them to be accumulated, verified, and spread.
However, Voyager has a fundamental limitation. It only exists in Minecraft. The game has closed rules, observable states, and immediate verification. The real world is not like this. An agent processing financial data can't immediately verify whether a skill will fail in special situations.
From Minecraft to the real world, a whole set of problems such as verification, quality assurance, and cross-environment migration await solutions.
From the second half of 2025 to the beginning of 2026, Anthropic defined the standard, the industry had demand, and academia had a focus. Things started to change intensively. It wasn't just one paper but a whole batch. From the autonomous discovery, encapsulation, and combination of skills to continuous improvement, there were systematic solutions for almost every aspect.
After skills have the infrastructure for circulation, "where skills come from" has changed from an academic interest to an industrial bottleneck.
This wave of research unfolds according to the life cycle of skills, including three parts: how skills are discovered, how they are encapsulated and combined, and how they are continuously improved.
02
All Three Paths Are Open: Exploration, Failure, and Learning
Let's start with the most fundamental question. Can an agent discover useful skills on its own without being taught step by step?
In June 2025, Yongjin Yang and others from KAIST published EXIF (Exploratory and Iterative Feedback), proposing a very interesting dual-agent architecture. There are two agents, one called Alice and the other called Bob, with a clear division of labor. Alice is the explorer, put into an environment to freely explore, try various operations, and record what works and what doesn't. Then Alice looks back at her exploration trajectory and extracts the definition of a skill from it.
Then these skills are handed over to Bob, who uses them to perform specific tasks. Bob's performance is feedback. Information such as which skills are useful, which are not, and where Bob gets stuck guides the direction of Alice's next round of exploration.
This cycle continues to iterate. Alice explores → defines skills → Bob executes → assesses shortcomings → guides the next round of exploration. The key is that the entire process doesn't require humans to provide any task descriptions or skill definitions. Alice and Bob complete the whole process from knowing nothing to accumulating a set of usable skills on their own.
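The cycle above can be sketched as a short control loop. This is a hedged illustration of the control flow only: `explore`, `extract_skills`, and `execute` stand in for the actual LLM-driven Alice and Bob agents, and the demo stubs below are invented.

```python
def exif_loop(explore, extract_skills, execute, rounds=3):
    """Skeleton of the EXIF cycle: Alice explores, defines skills,
    Bob executes them, and Bob's feedback steers the next round."""
    skills, feedback = [], None
    for _ in range(rounds):
        trajectory = explore(feedback)             # Alice explores the environment
        skills.extend(extract_skills(trajectory))  # Alice distills skills
        feedback = execute(skills)                 # Bob executes and reports gaps
    return skills


# Toy stand-ins so the loop is runnable end to end.
def demo_explore(feedback):
    return ["open_door"] if feedback is None else [f"improve_{feedback}"]


def demo_extract(trajectory):
    return trajectory


def demo_execute(skills):
    return "navigation"  # Bob reports the area where he got stuck


print(exif_loop(demo_explore, demo_extract, demo_execute, rounds=2))
# ['open_door', 'improve_navigation']
```

No human-provided task description appears anywhere in the loop: the only signal steering exploration is Bob's execution feedback.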
EXIF architecture diagram: Alice explores the environment to generate skills, and Bob executes tasks and provides feedback
The most interesting discovery of EXIF comes from an ablation test: the researchers had the same model play both Alice and Bob. Intuitively, teaching yourself should work poorly. In fact, single-model self-evolution turns out to be effective. Skill discovery doesn't require two complementary models; one model's "exploration" and "exploitation" can generate effective skills through self-play.
If EXIF discovers skills through "exploration," then EvoSkill, published in March 2026 by Salaheddin Alzubi and others from Sentient, takes a completely different path: it relies on "failure."
EvoSkill doesn't let the agent freely explore the environment. It lets the agent directly execute tasks and then analyzes the reasons for failure. Every operation in the execution process is recorded. When the task fails, a Proposer agent reviews these execution records, diagnoses the specific reasons for failure, such as errors in data extraction, confusion in time granularity, or lack of multi-source verification, and then proposes new skills or modifies existing ones accordingly.
The proposed skills are not directly adopted but go through an elimination round. New skills must prove themselves better than the existing skill combinations on the verification set or show improvement in a certain dimension without sacrificing performance in other dimensions to be retained. This screening mechanism draws on the idea of the Pareto frontier in multi - objective optimization. Only those skills that are "not completely dominated by others in any dimension" are retained, and the rest are eliminated.
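Pareto screening itself is mechanical and easy to sketch. The helper below is a generic non-dominated filter, not EvoSkill's actual code; the candidate skills and their (accuracy, robustness) validation scores are hypothetical.

```python
def dominates(a, b):
    """True if score vector a is at least as good as b on every
    dimension and strictly better on at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Keep skills whose validation scores are not dominated by any other."""
    return [
        (name, scores)
        for name, scores in candidates
        if not any(dominates(other, scores) for _, other in candidates if other != scores)
    ]


# Hypothetical (accuracy, robustness) scores on a validation set.
candidates = [
    ("extract_v1", (0.70, 0.60)),          # dominated by extract_v2 -> eliminated
    ("extract_v2", (0.80, 0.65)),
    ("multi_source_check", (0.75, 0.90)),  # trades accuracy for robustness -> kept
]
print([name for name, _ in pareto_front(candidates)])
# ['extract_v2', 'multi_source_check']
```

This is exactly the "not completely dominated in any dimension" criterion: a skill that is worse on one axis survives as long as it is better on another.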
EvoSkill evolution cycle: Propose new skills from failure and retain them after Pareto screening
Because EvoSkill appeared after Anthropic had productized Skills, its optimization happens purely at the skill level. There's no need to fine-tune the model or collect additional training data; the agent only needs to keep failing at tasks, keep analyzing, and keep improving its skills.
After iteration, skills improved performance by 7.3% on OfficeQA (office-scenario question answering) and 12.1% on SealQA (search-augmented question answering). More noteworthy is the cross-task generality: skills evolved on SealQA, with no additional adaptation, transferred directly to BrowseComp (a structurally very different web-search benchmark) and improved performance by 5.3%.
The evolved skills are useful in their own tasks and also work when transferred to other tasks.
There is a third path: demand-driven. The agent relies on neither exploration nor failure. Instead, while performing a task, it notices "I lack a skill to handle this situation" and creates one on the spot. It's like a programmer who, midway through writing code, finds that a needed function doesn't exist and stops to write it before continuing.
This path comes from a joint team of UC Berkeley and EPFL. In December 2025, Xu Huang, Junwu Chen, and others published CASCADE (Cumulative Agentic Skill Creation through Autonomous Development and Evolution).
CASCADE has a different starting point. The tools used in scientific research, such as material simulation software, chemistry calculation packages, and machine-learning force fields, are extremely specialized. Their documentation is scattered and their versions are chaotic; even human scientists often need several days to get a new piece of software running. So it's not enough for the agent to "freely explore" or "learn from failure," because it first has to figure out how to use these tools at all.
CASCADE's solution is to equip the agent with two meta-skills (skills for learning skills). The first is continuous learning. When encountering a tool it doesn't know how to use, the agent will search for documents on its own, extract code examples from web pages, and read the source code to understand the usage. The second is self-reflection. After an execution error, the agent doesn't simply retry but checks the runtime state, traces the dependency relationship using a knowledge graph, and even reads the source code of the underlying package to locate the root cause of the problem.
These two meta-skills are not hard-coded processes but behavioral patterns that emerge through carefully designed prompts and tool-calling interfaces.
The tool-usage and debugging experience that the agent acquires while solving a task is solidified in the memory system, moving from short-term session memory to cross-session consolidated memory, and finally distilled into a reusable skill set. The next time it encounters a similar tool or problem, it draws on this existing experience directly.
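The consolidation step can be caricatured as a promotion rule: a lesson that recurs across enough sessions graduates into a skill. This is a toy sketch under invented assumptions (the class name, threshold, and example lesson are all made up; CASCADE's real memory system is far richer).

```python
from collections import Counter


class MemorySystem:
    """Toy sketch of session-to-skill consolidation: per-session
    lessons accumulate, and once the same lesson has recurred in
    enough sessions it is promoted to a reusable skill."""

    def __init__(self, promote_after: int = 2) -> None:
        self.lesson_counts = Counter()  # lesson -> number of sessions seen in
        self.skills: list[str] = []
        self.promote_after = promote_after

    def end_session(self, lessons: list[str]) -> None:
        # Deduplicate within a session so a lesson counts once per session.
        for lesson in set(lessons):
            self.lesson_counts[lesson] += 1
            if (self.lesson_counts[lesson] >= self.promote_after
                    and lesson not in self.skills):
                self.skills.append(lesson)


mem = MemorySystem()
mem.end_session(["ase: set the cell before relaxing the structure"])
mem.end_session(["ase: set the cell before relaxing the structure"])
print(mem.skills)
# ['ase: set the cell before relaxing the structure']
```

The design choice worth noting is the threshold: promoting only recurring lessons filters one-off flukes out of the skill set.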
In SciSkillBench (116 materials-science and chemistry research tasks), GPT-5's out-of-the-box success rate is 35.4%; with CASCADE's evolution mechanism, it reaches 93.3%. More notably, CASCADE has reproduced computational experiments from published papers and can drive an automated synthesis workflow in a real laboratory.
This requires manipulating an internal software package it has never seen before, without documentation and not in the training data.
CASCADE architecture diagram: LLM + Skill Acquisition paradigm and DeepSolver multi-agent architecture
The three paths mentioned above actually correspond to three ways humans learn skills: curiosity-driven (I'll give it a try), failure-driven (I learned because I failed last time), and demand-driven (I found I needed this while working).
Humans use all three, but most people rely on the latter two most of the time. The same is true for current agents. The exploration path is the least efficient in the real environment.
However, the reason humans learn things quickly is that they can freely switch between the three modes, exploring when appropriate, reflecting when needed, and looking up information when necessary. Currently, no system has all three capabilities. EXIF won't actively look up documents, EvoSkill won't curiously explore unknown areas, and CASCADE won't systematically extract experience from failure. Current agents are still unbalanced in their learning strategies.
So far, there's an answer to the question of "where skills come from," but the answer is not complete.
03
Simple Skills Are Fine, but Combinations Can Collapse
After a skill is discovered, it needs to become a reliable and reusable module. If this step is not done well, the previous discovery is in vain.
The skills discovered by the methods above are all atomic: single-step operations, single API calls, or single-scenario processing logic. It's no problem for an agent to create a skill that extracts table data from a web page or one that calls a weather API. Even solving a problem with a complex tool that has clear engineering patterns is fine. But real-world tasks rarely require only one skill.
Consider a task like: collect detailed information by calling 5 APIs for each of 5 cat breeds, then cross-compare and generate a report. This requires nesting and combining multiple skills such as "search for breed information," "extract health data," and "format output," repeating the process 5 times, and then summarizing. Today's agents struggle with exactly this kind of task.
In February 2026, the Chinese Academy of Sciences and Harbin Institute of Technology published SkillCraft, which specifically measures an agent's skill-combination ability. There are 126 tasks and 21 API families, with difficulty scaled along two dimensions: the number of entities (N) and the complexity of API calls per entity (M). N×M forms a two-dimensional matrix, with the difficulty gradient ranging from Easy (N = 1, M = 2) to Hard (N = 5, M = 5).
SkillCraft designed a three-stage Skill Mode protocol. In the first stage, exploration, the agent is given a simplified task to figure out on its own. In the second stage, encapsulation, that experience is packaged into reusable skills. In the third stage, reuse, facing a large batch of similar tasks, the agent must reuse the skills it built.
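The three-stage protocol can be sketched as a simple driver. This is a hedged illustration of the evaluation flow, not SkillCraft's actual harness; `explore`, `encapsulate`, and `reuse` stand in for agent calls, and the cat-breed stubs are invented.

```python
def skill_mode(simple_task, bulk_tasks, explore, encapsulate, reuse):
    """Three-stage Skill Mode sketch: explore a simplified task,
    encapsulate the experience as skills, then reuse those skills
    across a batch of similar tasks."""
    trace = explore(simple_task)                   # stage 1: exploration
    skills = encapsulate(trace)                    # stage 2: encapsulation
    return [reuse(t, skills) for t in bulk_tasks]  # stage 3: reuse at scale


results = skill_mode(
    "look up 1 breed via 2 APIs",
    [f"breed_{i}" for i in range(5)],
    explore=lambda task: f"trace({task})",
    encapsulate=lambda trace: ["lookup_breed"],
    reuse=lambda task, skills: f"{task}:done-with-{skills[0]}",
)
print(len(results))  # 5
```

The protocol's point is the asymmetry: the expensive exploration happens once on an easy instance, and the benchmark then measures whether the encapsulated skill actually pays off across the harder N×M grid.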
What can be achieved with the support of skills? This gap is a direct