Microsoft Initiates Skills Self - evolution: Training Skills like Neural Networks with 3,300 Stars Gained in a Week

Free your hands and evolve on your own.

From the prompts of large models to the Skills of agents, it seems to have evolved, but not completely.

In agent applications, more and more programmers are starting to spend a lot of time writing CLAUDE.md, Codex skill files, and system prompts for various Agents.

Manually writing these skill documents is essentially a trial - and - error manual task. Write a version, run a few tasks to see the effect, modify it if something seems wrong, and then run it again. This process is not fundamentally different from manually adjusting prompts before, except that the object has changed from a single sentence to an entire document.

This is actually quite absurd. We originally wanted more intelligent AI to help us work, but now, on the contrary, we are spending a lot of energy teaching AI how to work.

This problem seems to have reached an end. Microsoft open - sourced SkillOpt this week, a text space optimization framework that treats Agent skill documents as "trainable parameters", allowing skill documents to evolve on their own.

Official website link: https://microsoft.github.io/SkillOpt/#idea
Github link: https://github.com/microsoft/SkillOpt
Paper link: https://arxiv.org/abs/2605.23904

The core idea is very simple. Instead of training model weights, it only trains the natural language document that guides Agent behavior. In all 52 evaluation combinations of 7 target models, 6 benchmark tests, and 3 execution environments (direct dialogue, Codex, Claude Code), the skill documents trained by SkillOpt all reached the optimal or tied for the optimal result.

Skills Can Also Be Optimized and Trained

The core insight of SkillOpt can be summarized in one sentence: The skill document of an Agent is its "external weight". Since internal weights can be optimized using gradient descent, external weights should also have a systematic training method.

The SkillOpt process. The frozen target model executes using the current skills; the optimizer model proposes bounded modifications; the reserved validation determines whether the candidate becomes the new current skill.

Training Loop: Forward Propagation, Backward Propagation, Parameter Update

The training loop in traditional deep learning is: calculate the loss through forward propagation, calculate the gradient through backward propagation, and update the weights using the gradient. SkillOpt applies the same logic to the text space:

Rollout (Forward Propagation): The frozen target model uses the current version of the skill document to execute a batch of tasks and records the complete execution trajectory, including messages, tool calls, validation feedback, and final scores. The output of this step is "evidence", which is equivalent to the forward propagation result of a neural network.
Reflect (Backward Propagation): An independent optimizer model analyzes this batch of execution trajectories. The key design is that failed cases and successful cases are reflected on separately. The failed minibatch is used to discover "which operation rules need to be corrected", and the successful minibatch is used to confirm "which existing rules are working and should not be changed". This step is equivalent to calculating the "gradient in the text space", telling the system which direction the skill document should be modified.
Edit (Parameter Update): Based on the reflection results, the optimizer model proposes structured editing operations for the skill document: adding new rules (add), deleting invalid rules (delete), and replacing rules that need to be corrected (replace).
Gate (Validation Gating): The candidate new skill document must be run on a held - out validation set, and it is only accepted when the performance strictly improves. This step prevents overfitting and ensures that each update is a real improvement.

The entire loop runs for multiple epochs, and multiple steps are run within each epoch, which is exactly the same rhythm as training a neural network.

Textual Learning Rate: Preventing Catastrophic Forgetting

When training a neural network, a too - large learning rate can lead to catastrophic forgetting, where the model forgets old knowledge after learning new things. SkillOpt encounters exactly the same problem in the text space: if the edits in one operation are too large, the previously learned effective rules may be overwritten.

The solution is to introduce a "textual learning rate": there is an upper limit on the number of editing operations allowed in each step. In the paper, the default setting is lr = 4, that is, a maximum of 4 add/delete/replace operations per step. This constraint forces the optimizer to make only small adjustments each time, maintaining training stability.

Ablation experiments verified the necessity of this design: after removing the learning rate constraint, the performance on SearchQA dropped from 87.1% to 84.6%, on SpreadsheetBench from 77.5% to 75.7%, and on LiveMath from 61.3% to 57.3%.

Rejected - Edit Buffer: Negative Feedback Memory

Another ingenious design is the rejected - edit buffer. When an editing proposal is rejected by the validation gating, it is not simply discarded but enters a buffer. The optimizer can see these "failed attempts" during the subsequent reflection phase, thus avoiding repeatedly proposing similar ineffective edits.

This is equivalent to providing negative gradient information to the optimizer: it not only knows which direction to go but also knows which directions have been tried and do not work.

Ablation experiments also confirmed its value: after removing the rejected buffer, the performance on SpreadsheetBench dropped sharply from 77.5% to 72.9%.

Slow Update and Meta - Skills: Long - Term Memory Mechanism

SkillOpt also introduces two cross - epoch memory mechanisms:

Slow Update: At the end of each epoch, a vertical comparative analysis is performed on all accepted edits within the entire epoch to find consistent patterns across steps and produce a larger - scale update. This is similar to the learning rate warmup or periodic large - step update in deep learning.
Meta Skill: The optimizer itself also has a "meta - skill" document that records the experience it accumulates during the optimization process (for example, "for this benchmark, paying more attention to the format of tool calls is more effective than paying attention to reasoning steps"). This meta - skill is continuously updated between epochs, allowing the optimizer itself to evolve.

Crucially, these two mechanisms only exist during training. During deployment, the target model only needs the final best_skill.md, without any additional model calls or memory modules. The inference overhead is zero.

Leading in All 52 Evaluations

Main Experiment: 7 Models × 6 Benchmarks × 3 Environments

The evaluation coverage of SkillOpt is quite comprehensive:

The target models include GPT - 5.5, GPT - 5.4, GPT - 5.4 - mini, GPT - 5.4 - nano, GPT - 5.2, Qwen3.5 - 4B, Qwen3.6 - 35B - A3B, ranging from the most powerful closed - source models to small models with 4B parameters.

The benchmark tests cover 6 different types of tasks: SearchQA (question - answering), SpreadsheetBench (code generation/spreadsheet operations), OfficeQA (tool - enhanced question - answering), DocVQA (document visual question - answering), LiveMathematicianBench (mathematical reasoning), ALFWorld (embodied agents).

The execution environments include three mainstream Agent execution frameworks: direct dialogue, OpenAI Codex, and Anthropic Claude Code.

In all 52 (model × benchmark × environment) evaluation combinations, SkillOpt achieved the optimal or tied for the optimal result.

Some highlight data:

GPT - 5.5 in direct dialogue mode: an average increase of +23.5 points, with an increase of 38.9 points in SpreadsheetBench and 39.0 points in OfficeQA
GPT - 5.4 - nano (the smallest model): an average increase of +24.9 points, with an increase of 49.4 points in DocVQA and 35.1 points in ALFWorld
GPT - 5.5 + Codex environment: an increase of 57.5 points in SpreadsheetBench
GPT - 5.5 + Claude Code environment: an increase of 58.3 points in SpreadsheetBench

The smaller models have a larger improvement, which shows that skill documents are more helpful for models with weaker capabilities. A good operation manual is of far greater value to beginners than to experts, and this intuition also holds true for AI Agents.

Comparison Experiment: Crushing All Baseline Methods

SkillOpt was compared with 6 baseline methods: no skill, human - written skills, LLM - generated skills in one go, Trace2Skill, TextGrad, and GEPA.

On each benchmark, SkillOpt outperformed the strongest baseline method:

SearchQA: +1.9 points higher than the strongest baseline
SpreadsheetBench: +4.4 points higher than the strongest baseline
OfficeQA: +4.1 points higher than the strongest baseline
DocVQA: +1.7 points higher than the strongest baseline
LiveMath: +9.2 points higher than the strongest baseline
ALFWorld: +8.9 points higher than the strongest baseline

It is worth noting that TextGrad and GEPA are both existing text optimization methods. The advantage of SkillOpt over them shows that the systematic training loop design (learning rate, validation gating, negative feedback buffer) is indeed more effective than loose self - correction.

Transfer Experiment: One Training, Multiple Deployments

The skill documents trained by SkillOpt show strong transferability:

Cross - model transfer: The LiveMath skills trained on GPT - 5.4 can be directly transferred to GPT - 5.4 - nano, resulting in a 15.2 - point increase. There is no need to retrain for the small model.
Cross - environment transfer: The SpreadsheetBench skills trained in the Codex environment can be directly transferred to the Claude Code environment, resulting in a 31.8 - point increase. This means that the skill documents optimized in one Agent framework are still effective in another framework.
Self - optimization: Even when using GPT - 5.4 - nano as both the target model and the optimizer model (self - optimizing), there is still a 10.4 - point increase on SpreadsheetBench. This shows that the training loop of SkillOpt itself provides sufficient structured constraints, and even if the optimizer is not stronger than the target model, it can still find effective improvement directions.
Minimal deployment: Only a best_skill.md file is needed for final deployment. There is no need for an optimizer model, memory modules, or any additional inference overhead.

Visualization of Skill Evolution: Learning from Failure

The paper shows a complete training process for an ALFWorld task, with the target model being GPT - 5.4 - mini and the optimizer being GPT - 5.5.

The initial skill document is a concise ALFWorld operation guide. After 4 training steps, the following rules were added to the skill document:

"Treat any general target container instance as valid"
"Maintain a strictly numbered set of searched locations and do not re - check observed locations"
"Expand the search scope after multiple consecutive misses in a certain type of location"

These rules are automatically extracted from failure trajectories. For example, the third rule comes from the failure experience of the Agent repeatedly searching the same type of location without finding the target item in some tasks. After observing this pattern, the optimizer proposed the rule of "expanding the search scope".

Final result: The performance on the hard difficulty of the ALFWorld test set increased from 70.9% to 85.8%.

During the entire process, the edits in Step 3 once led to a decline in the validation set performance, but it was rescued by the slow update mechanism. The training set score in Step 4 was higher, but the validation set did not improve, so it was rejected by the gating. This cycle of "proposing hypotheses, validating, and accepting or rejecting" is exactly the same as the methodology of human scientific research.

SkillOpt tells us that everything about agents can be self - learned.

The role of humans in the AI workflow has taken another step back. In the future, we will transfer more cognitive burdens to machines.

This article is from the WeChat official account

该文观点仅代表作者本人，36氪平台仅提供信息存储空间服务。

With 3,300 stars gained in a week, Microsoft initiates Skills self-evolution, training skills in the same way as training neural networks.