
We tested 30 skills with 150 tasks and obtained 7 counterintuitive conclusions.

Friends of 36Kr (36氪的朋友们), 2026-05-22 18:53
Some skills only appear to be well mastered.

In the first half of 2026, the number of skills skyrocketed. Many companies are turning all of their internal workflows into skills, expecting that adding a skill to a large model will make it "become professional immediately".

However, as the number of skills grew from a dozen to hundreds, one simple question kept coming up:

Does installing a skill really make the model more powerful?

With this question in mind, we ran a systematic experiment under the TRACE Selective Evaluation. Instead of lightweight methods such as "checking the download list" or "running once and giving a score", we had each skill compete against the "naked model" (no-skill) under unified prompts, a unified judge, and a unified evaluation standard. We carried out 150 sets of task-level comparisons, evaluated the cost and stability of 30 skills, checked 107 normative issues, and ran one round of transferability tests across models and reasoning intensities.

For a detailed introduction to the TRACE Selective Evaluation, you can read "3 Pictures, 5000 Words: Seriously Discuss What a Good Skill Is".

While continuously evaluating skills, we distilled the 7 most notable findings and publicly released the relevant experimental data, evaluation processes, and mechanism explanations. Many of the conclusions were beyond our expectations.

01 Having a skill doesn't necessarily lead to better results

The original point of installing skills was to give large models or general agents stronger professional capabilities in specific areas. In the experiment, however, the first intuition to be overturned was precisely that "installing a skill is always better".

In 150 sets of task-level comparisons, the skill group won 62 times (41.3%), the no-skill group won 55 times (36.7%), and there were 33 ties (22.0%). The skill group had only a slight advantage, far from an overwhelming one.

More importantly, the results differ greatly from skill to skill:
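The percentages above can be re-derived from the raw counts. The following is a minimal sketch of that arithmetic; the dictionary keys and variable names are illustrative assumptions, and only the three counts come from the article.

```python
# Outcome counts reported in the article for the 150 task-level comparisons.
# The key names are illustrative, not part of the TRACE evaluation itself.
outcomes = {"skill": 62, "no_skill": 55, "tie": 33}

total = sum(outcomes.values())

# Share of all comparisons for each outcome, in percent, rounded to one decimal.
shares = {name: round(100 * count / total, 1) for name, count in outcomes.items()}

# Net advantage of the skill group, in percentage points of all comparisons.
net_advantage = round(100 * (outcomes["skill"] - outcomes["no_skill"]) / total, 1)

print(total)          # 150
print(shares)         # {'skill': 41.3, 'no_skill': 36.7, 'tie': 22.0}
print(net_advantage)  # 4.7
```

Seen this way, the skill group's lead is under 5 percentage points of all comparisons, which is why the advantage is described as slight.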

· Skills that consistently bring positive gains include: data-analysis, multi-search-engine, baoyu-cover-image, etc.;

· Skills that underperform the naked model: note-taker, fliggy-travel-planner, meeting-minutes, etc.

Why is this the case? Going through the win-loss samples one by one, the pattern is actually clear: when a skill truly adds something beyond the model's naked capabilities, such as a clear output structure, a set of external tools, a constrained workflow, or a specific deliverable, it is useful; when a skill just "rewrites what the model can already do in Markdown", it brings more burden than gain.

02 There is a siphon phenomenon in skills

We observed a phenomenon in the skill routing experiment, which can be called the "skill siphon": some requests are very simple, and the naked model can answer them directly. However, as long as the semantics are close to a certain skill, the system may still be tempted to read this skill.

To distinguish between different situations, we designed three types of requests:

The first type is the commonly applicable request: the user's intention is very clear, and the target skill should be triggered.

The second type is the boundary-applicable request: the user's request should still trigger the target skill, but the semantics are closer to an adjacent skill, which is used to observe whether similar skills will interfere with each other.

The third type is the request that should be no-skill: the task is simple, and theoretically, the naked model can answer directly without invoking any skills.

The results show that the system performs well when "a skill should be used": the correct trigger rate for commonly applicable requests is 99.0% (891/900), and for boundary-applicable requests, it reaches 96.0% (288/300). This indicates that the system has no general difficulty finding the correct skill.

The real problem occurs when "a skill should not be used": among 90 requests that should be no-skill, 12 still read a certain skill, and the over-invocation rate is 13.3% (12/90).
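As a quick check, the three routing rates follow directly from the counts quoted above. This is a small sketch of the calculation only; the `rate` helper and the dictionary layout are assumptions made for illustration, not part of the evaluation tooling.

```python
def rate(hits: int, total: int) -> float:
    """Return hits / total as a percentage rounded to one decimal place."""
    return round(100 * hits / total, 1)

# (correctly triggered, total requests) as reported in the article.
correct_triggers = {
    "commonly applicable": (891, 900),
    "boundary-applicable": (288, 300),
}

# (requests that still read a skill, total no-skill requests) as reported.
over_invocations = (12, 90)

for name, (hits, total) in correct_triggers.items():
    print(f"{name}: correct trigger rate {rate(hits, total)}%")

print(f"should be no-skill: over-invocation rate {rate(*over_invocations)}%")
```

Note that the first two figures measure recall (a skill was needed and was found), while the third measures restraint (no skill was needed but one was read), so they are not directly comparable as a single accuracy number.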

Two examples are very intuitive:

· The user says: "Without going online, help me come up with 5 Chinese keyword combinations suitable for searching 'office green plant maintenance'."

This is originally just a keyword brainstorming, but "search" and "keywords" draw it into the search-type skill (multi-search-engine).

· The user says: "Make this sentence more natural: We create long-term business value through better customer communication."

This is originally just a single-sentence rewrite, but "customer communication" and "business value" make it seem like a marketing task, so it is drawn into the marketing-type skill (marketing-skills).

Therefore, the "skill siphon" is not a random trigger of the routing system, but a more specific deviation: when a skill should be used, the system is good at recall; when a skill should not be used, it is sometimes not restrained enough. When the request is just a simple rewrite, phrase generation, concept explanation, or a lightweight template, it may be drawn into a skill as soon as domain words from that skill's description appear in it.

This side effect brings additional context and potential token and time costs, and it can also trap a task that could be finished in one sentence inside a more complex specialized workflow.

03 Most skills do not save tokens and time

To answer the question "Is it more expensive after installing a skill?", we defined two indicators:

· Token multiplier = average tokens per task in the skill group / average tokens per task in the no-skill group;

· Time-consumption multiplier = average time per task in the skill group / average time per task in the no-skill group.

A value of 1.0 means the cost equals the no-skill baseline, and a value greater than 1.0 means the skill is more expensive. The experimental results of the full sample of 30 skills are as follows:

That is to say, after installing a skill, the token consumption increases by an average of 48%, and the time consumption increases by an average of 19%.
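To make the two indicators concrete, here is a minimal sketch of how a multiplier could be computed for a single skill. All per-task numbers below are hypothetical and chosen only for illustration, and the `multiplier` helper is an assumed name; the article does not state how the per-skill multipliers were aggregated into the overall averages.

```python
from statistics import mean

def multiplier(skill_values, no_skill_values):
    """Average per-task cost with the skill divided by the average without it."""
    return mean(skill_values) / mean(no_skill_values)

# Hypothetical per-task token counts for one skill (not data from the article).
skill_tokens = [1800, 2100, 1500, 2400, 1700]
no_skill_tokens = [1200, 1300, 1100, 1500, 1400]

# Hypothetical per-task wall-clock times in seconds (not data from the article).
skill_seconds = [30, 42, 36]
no_skill_seconds = [32, 40, 48]

token_multiplier = multiplier(skill_tokens, no_skill_tokens)
time_multiplier = multiplier(skill_seconds, no_skill_seconds)

# A value above 1.0 means the skill is more expensive than the no-skill baseline.
print(f"token multiplier: {token_multiplier:.2f}")  # 1.46
print(f"time multiplier:  {time_multiplier:.2f}")   # 0.90
```

In this made-up case the skill costs about 46% more tokens but finishes slightly faster, which shows that the two multipliers can point in different directions for the same skill.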

The increase in token consumption is more obvious. The skills with the highest multipliers are concentrated in scenarios such as multimedia generation, complex product organization, and structured delivery:

The distribution of time consumption is relatively mild, but there are still "extremely long-lasting" tail skills:

It is worth noting that skills are not always more expensive. The token multipliers of skills such as fliggy-travel-planner, pdf-extract, market-research, word-docx, and chinese-baby-names are actually lower than 1.0; the time-consumption multipliers of fliggy-travel-planner, market-research, weather, zhihu-writer, and chinese-baby-names are also lower than 1.0. The reason is not difficult to understand: when a skill provides a clear process, a restricted output boundary, and a clear "do / don't do", the model actually does less ineffective exploration, and the overall consumption decreases. However, overall, after installing a skill, both token and time consumption increase.

04 Skills that consume more tokens are usually slower, but the two are not strictly linked

When we plot the token multiplier and time-consumption multiplier of each skill on the same scatter plot, the points fall along a clear diagonal trend: Pearson r = 0.73.
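For readers who want to reproduce this kind of correlation on their own measurements, the sketch below computes Pearson r from two lists of multipliers. The sample values are hypothetical and will not reproduce the 0.73 reported above; `pearson_r` is an illustrative helper, not the evaluation's own code.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two root sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (dev_x * dev_y)

# Hypothetical per-skill multipliers (not the article's data).
token_multipliers = [0.9, 1.2, 1.5, 2.1, 3.0]
time_multipliers = [0.95, 1.0, 1.3, 1.2, 2.0]

r = pearson_r(token_multipliers, time_multipliers)
print(f"Pearson r = {r:.2f}")
```

A value near 1 means token-heavy skills also tend to be slow; a value clearly below 1 leaves room for skills whose time cost is driven by something other than tokens.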

canvas-design-2, baoyu-cover-image, wechat-sticker-maker, logo-creator, deep-research-pro, and multi-search-engine are all concentrated in the "double-high" area in the upper-right corner. These skills usually require a longer context, more model rounds, more frequent tool invocations, or more complex product processing.

But interestingly, there are two groups of counterexamples:

· High token / low time consumption: fitness-coach, note-taker, weather, zhihu-writer. The skill makes the model read more context and write longer answers, but does not increase external waiting and file processing.

· Low token / high time consumption: word-docx, wps. The bottleneck of these skills lies not in the language model, but in the Office/WPS toolchain, script execution, and file format processing, which are not visible in the token dimension at all.

05 Normative issues are mainly concentrated in dependencies, boundaries, and resource organization

The C dimension (structural specification) in TRACE is often misinterpreted as "how good the documentation looks", but in fact, it is closer to "engineering debt".

A total of 107 normative issues were found in the re-evaluation of the C dimension of 30 skills.

In the four categories of dependencies, maintenance consistency, resource organization, and trigger boundaries, there are a large number of major-level issues, which means they will directly affect the agent's judgment on "when to use, how to run, what tools are needed, where the products are, and whether the documentation and scripts are consistent".

Looking at these issues together, a skill may have good results, but as long as the dependencies are not clearly written, the boundaries are too broad, the resource references are missing, or the metadata drifts, subsequent reuse, evaluation, and automated upgrades will become increasingly fragile.

06 Stability risks come from the toolchain, long waiting, and repeated corrections

When it comes to "unstable skills", the first reaction is often "the model gave the wrong answer". However, after combining historical run risks with repeated-run experiments, the picture we see is the opposite: the most common instability does not come from the model itself.