Let's have a serious chat: to what extent can a Skill distill us?
During one week at the end of March, five or six projects appeared on GitHub Trending almost simultaneously, each with a more outrageous name than the last.
“Colleague Skill” feeds a departing colleague's Feishu messages, DingTalk documents, Slack records, and WeChat chats into Claude to automatically generate a skill file. Once installed, the AI can “become” that colleague: it not only takes over their tasks but also speaks in their tone.
This project gained 9,500 stars in a single week. One commenter suggested renaming it “Colleague Kill”: once the Skill exists, the original colleague can be “killed off”.
Riding on this project's popularity, a “distillation” craze swept through the community.
01 Everything Can Be Distilled
exskill distills an ex-partner into a Skill. It supports WeChat chat records, QQ messages, social-media screenshots, and even photo EXIF (Exchangeable Image File Format) data, and it has even built a five-layer personality structure.
Boss Skill is more practical. It is divided into three modules: Boss Judgment reviews proposals against the boss's standards, Managing Up teaches you how to report bad news, and Persona reproduces the boss's speaking style.
It also comes with built-in templates for celebrities such as Elon Musk, Steve Jobs, and Jensen Huang.
The peak of this wave is Nüwa Skill. It uses six parallel agents to extract the mental models of public figures from more than 40 information sources and does the distillation for you. It already ships with 13 built-in personalities, including Paul Graham, Charlie Munger, and Richard Feynman.
Since then, it seems that we have entered an era where everyone can be distilled into a Skill.
At the same time, Carnegie Mellon University (CMU) published a paper titled “SKILLFOUNDRY: Building Self-Evolving Agent Skill Libraries from Heterogeneous Scientific Resources”.
The 6-step self-evolving pipeline of SkillFoundry (Source: Figure 1 of the SkillFoundry paper)
Although it is an academic paper, what it does is also distillation; only the target is not a person but the knowledge of entire scientific fields.
The logic of SkillFoundry is to scan GitHub repositories, API documentation, Jupyter notebooks, and academic papers, and test whether structured agent skills can be extracted from them automatically.
A single run of the pipeline discovered 286 skills spanning 27 fields; 71.1% of them were new capabilities not found in the existing skill library.
In a cell-type annotation task in genomics, adding skills raised coverage from 81.1% to 99.2% and accuracy from 68.5% to 82.9%.
On one side, the grassroots community's cyber-distillation of people; on the other, the knowledge alchemy of top academic teams. Both are trying to verify the same belief: if experience can be written down, AI can learn it through skills.
But no one wants to be distilled.
Another project, anti-distill, appeared on GitHub that same week.
The tool helps users generate a skill file that looks complete but has been emptied of its core knowledge: specific coding rules and methods are rewritten into lines like “Cache usage follows team specifications”.
Each line is technically correct, but the whole is correct nonsense with no real content.
The existence of anti - distill itself implies a problem.
If a skill could truly distill all of a person's abilities, it should be hard to neutralize. Yet anti-distill seems to do it effortlessly.
What is emptied may not be the “person” but a specific layer.
Behind this is actually a problem worth exploring.
To what extent can a skill distill us? Which layer can be emptied, and which cannot?
02 A Strange Phenomenon
At the end of February 2026, Xiangyi Li and others from the BenchFlow team released the first large-scale cross-domain skill evaluation, titled “SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks”. It covers 84 tasks across 11 domains and 7,308 test trajectories.
In aggregate, adding skills raised the average pass rate of agent tasks by 16.2 percentage points. Skills, it seems, really do work.
However, when we break down the numbers, some interesting phenomena emerge.
The medical and health domain saw the most dramatic improvement, with the pass rate jumping from 1.0% to 51.9%.
In the software engineering field, using skills only increased the pass rate by 4.5 percentage points.
Same skill mechanism, same batch of models, yet the effect differed by an order of magnitude across domains.
Skill effects in SkillsBench by domain (Source: Table 4 of the SkillsBench paper)
An even more counterintuitive finding came next.
SkillsBench divided skills into four levels according to the level of detail in their documentation.
The original data from the SkillsBench paper (Source: Table 6)
Detailed-level skills, which give steps and examples and focus on specific operations, raised the pass rate by 18.8 percentage points.
The Comprehensive level, however, which tries to cover every edge case, actually lowered the pass rate by 2.9 percentage points.
In other words, beyond a point, the more exhaustively a skill is written, the worse it performs.
During the same period, Han, Zhang, and others ran more detailed tests focused on software engineering, publishing “SWE-Skills-Bench: Do Agent Skills Actually Help in Real-World Software Engineering?”. They tested 49 skills from real open-source projects across roughly 565 task instances.
The result: 39 of the skills (about 80%) produced zero improvement in pass rate.
Only 7 skills had clearly positive effects. risk-metrics-calculation, for example, raised the pass rate by 30% because it encodes specific financial risk-calculation formulas.
The remaining 3 skills had outright negative effects.
The complete evaluation of 49 skills in SWE-Skills-Bench (Source: Table 2 of the paper)
The failure of linkerd-patterns is particularly instructive. The skill packages 7 sets of configuration templates for Linkerd. The content is objectively accurate, but the model becomes firmly anchored to the templates.
First the model wrote code against an outdated API version. Then, while reconciling the templates with the task requirements, it fabricated non-existent fields.
Finally, the examples in the templates led the model to add completely irrelevant resources.
All three failures came from the model abandoning its own judgment and blindly following the skill's specific examples.
The data makes one thing clear: writing a skill down does not automatically make it useful.
It works wonders on some tasks, does nothing on others, and actively harms a few. These differences do not look random; they appear to have structure.
To understand this structure, we first need to figure out what a skill is.
03 What is a Skill?
Many people understand a skill as an “advanced prompt”: a structured instruction telling the AI what to do when it encounters a certain class of problem.
If this is the case, there is no essential difference between a skill and a well - written system prompt.
But is that really the case?
At the beginning of 2026, Zhejiang University published a systematic SoK paper titled “Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward”, which gives a formal definition of a skill.
In their analysis, a skill is a four-tuple, S = (C, π, T, R). C is the applicability condition: when the skill should be triggered. This is not as simple as “the user typed a certain keyword”; it is matched automatically from the semantic features of the task. π is the execution strategy: the concrete operation steps and decision logic. T is the termination condition, which determines when to stop. R is the reusable interface, through which other skills or agents can call it.
The three-level progressive loading architecture of skills (Source: Figure 1 of the SoK paper)
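To make the four-tuple concrete, here is a minimal Python sketch of the paper's notation. The field names and the example skill are hypothetical illustrations, not the paper's code or any real skill-file schema.

```python
from dataclasses import dataclass, field

# Illustrative rendering of the SoK definition S = (C, pi, T, R).
# All names below are hypothetical, not a real skill-file format.

@dataclass
class Skill:
    condition: str      # C: semantic description of when to trigger
    strategy: str       # pi: execution steps, written in natural language
    termination: str    # T: when to stop
    interface: list[str] = field(default_factory=list)  # R: callable sub-skills

review_skill = Skill(
    condition="a pull request needs review against team conventions",
    strategy="Check naming, error handling, and test coverage, in that order.",
    termination="Stop once all three checks have produced a verdict.",
    interface=["lint-check", "coverage-report"],
)
```

Note that C is a semantic description the model matches against the task, not a menu item the user picks by hand.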
The two most critical words in this definition are “composable” and “routable”.
Routable means that when the model faces a task, it can automatically match the skill according to the semantic features without the user having to manually select it.
Composable means that a skill can delegate subtasks to other skills. SkillsBench data shows that collaboration works best with 2 to 3 skills, raising the pass rate by 18.6 percentage points; with more than 4 skills the gain drops to 5.9 points. Modularity pays off, but there is an optimal granularity.
The best collaborative effect comes from 2-3 skills working together (Source: Table 5 of the SkillsBench paper)
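The calling pattern behind composability can be sketched as skills delegating through a shared registry. Everything here (the registry, the three toy skills) is a hypothetical illustration, not how any real skill runtime is implemented.

```python
# A toy registry where a composite skill delegates subtasks to two others.
registry = {}

def skill(name):
    """Register a function as a named, callable skill."""
    def register(fn):
        registry[name] = fn
        return fn
    return register

@skill("extract-tables")
def extract_tables(doc):
    # Keep only lines that look like table rows.
    return [line for line in doc.splitlines() if "|" in line]

@skill("summarize")
def summarize(rows):
    return f"{len(rows)} table rows found"

@skill("report")
def report(doc):
    # The composite skill routes subtasks through the registry (the R interfaces).
    rows = registry["extract-tables"](doc)
    return registry["summarize"](rows)

print(report("a | b\nplain text\nc | d"))  # → 2 table rows found
```

Each unit stays small and single-purpose, which is consistent with the benchmark's finding that a handful of cooperating skills beats one sprawling skill.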
Skills also persist across sessions. They live in the file system and do not vanish with the conversation, so they accumulate: the 100th skill and the 1st can coexist and call each other. This makes skills a knowledge asset that can be built up gradually.
A prompt, by contrast, is a one-off piece of text with no call relationships.
From this perspective, a skill is more like a software unit rather than just a piece of text.
Four comparisons between Skill and other enhancement paradigms
Its difference from a workflow is also crucial.
Traditional workflows like Zapier or n8n hard-code the execution path entirely. Agentic workflow tools such as Dify, Coze, and LangGraph embed the judgment of large language models (LLMs) inside the nodes of a directed acyclic graph (DAG), but the flow skeleton between the nodes is still fixed.
A skill has no predefined DAG. The entire execution strategy π is written in natural language: the model decides for itself which path to take, which sub-skills to call, and when to stop. This layer of flexibility is the core value of a skill.
Now that the definition of a skill is clear, to understand its boundaries, we need to dig deeper.
First of all, since π is in natural language, the knowledge that a skill can carry must generally be expressible in language.
We can write “If the CPU usage exceeds 80% and lasts for 5 minutes, trigger capacity expansion” in it. But it is difficult to write “Judge whether this code change will cause architectural problems in three months”.
This is not because we don't know how to write it down, but because the thinking behind that judgment is not composed of rules that can be listed one by one.
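The contrast can be made concrete. The CPU rule from the text translates directly into a checkable predicate (the function below merely restates the article's example; the sample data is invented), while the architectural judgment has no equivalent rule set to encode.

```python
def should_scale(cpu_samples, threshold=0.80, window=5):
    """True if every 1-minute CPU sample in the last `window` exceeds the threshold."""
    recent = cpu_samples[-window:]
    return len(recent) == window and all(s > threshold for s in recent)

print(should_scale([0.50, 0.85, 0.90, 0.88, 0.92, 0.84]))  # → True
print(should_scale([0.85, 0.90, 0.70, 0.92, 0.84]))        # → False

# "Will this change cause architectural problems in three months?" admits no
# such predicate: the judgment is not built from enumerable rules.
```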
A skill's conditional routing depends on semantic matching, which also implies an upper limit on how far a skill library can scale: once too many skills have semantically similar descriptions, the model can no longer tell which one to use.
The paper “From Multi-Agent to Single-Agent” measured this threshold: selection accuracy declines significantly once the number of skills exceeds roughly 80 to 100.
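A toy sketch shows why routing degrades as descriptions crowd together. Bag-of-words cosine similarity stands in for real embeddings here, and the skill names and descriptions are invented for illustration.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two texts under a bag-of-words model."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

skills = {
    "pandas-cleaning": "clean and normalize tabular data with pandas",
    "pandas-merging":  "merge and join tabular data with pandas",
    "image-resizing":  "resize and crop image files",
}

query = "join two tables of data with pandas"
scores = {name: cosine(desc, query) for name, desc in skills.items()}

# The two pandas skills score close to each other, while the image skill is
# clearly apart; dozens more skills with overlapping wording would blur the
# ranking until the router can no longer separate them.
```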
This is actually very similar to human cognition. When you have more memories associated with the same