
Gradient updates with zero human involvement: a new framework from MIT lets an AI model automatically generate its own fine-tuning data and autonomously update its weights.

Quantum Bit (量子位) · 2025-10-14 15:13
Even surpassing the high-quality synthetic data generated by GPT-4.1

Large language models have finally learned to update themselves!

MIT has proposed a new reinforcement learning framework that enables a model to generate its own fine-tuning data and self-update instructions, and thereby update its own weights.

Without human involvement, the model automatically performs gradient updates, autonomously acquiring new knowledge or adapting to new tasks.

The framework, called SEAL (Self-Adapting LLMs), is a two-layer nested learning mechanism.

Under this mechanism, rewards are computed from the updated model's performance on the task and used to optimize the strategy for generating self-update instructions.

SEAL is the first to give large models self-driven update capabilities at the weight level, freeing them from complete reliance on external supervised data.

The model automatically learns knowledge update strategies

The paper validates SEAL on two main experimental tasks: knowledge incorporation and few-shot learning.

These two types of tasks correspond to two basic scenarios in model evolution - remembering new knowledge and quickly adapting to new tasks.

Knowledge incorporation

In the knowledge incorporation experiment, the researchers used the Qwen2.5-7B model together with paragraphs and their corresponding questions from the SQuAD dataset.

At the beginning of each training round, the model receives a new paragraph as input context but does not see the corresponding question-answer pairs.

The model first generates training data from this input and then performs a small-scale LoRA fine-tune. This fine-tune uses only the generated data and includes neither the original paragraph nor the questions.

After fine-tuning, the model is evaluated on the downstream task: it must answer multiple questions about the paragraph without the original paragraph as input.
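To make the procedure concrete, below is a minimal sketch of one such round using the Hugging Face transformers and peft libraries. The prompt wording, LoRA settings, and training hyperparameters are illustrative assumptions, not the paper's actual configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "Qwen/Qwen2.5-7B"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# 1. Generate synthetic training data from the new paragraph, without ever
#    showing the model the downstream questions.
paragraph = "..."  # a new SQuAD paragraph
prompt = f"List the facts and implications that follow from this paragraph:\n{paragraph}"
inputs = tok(prompt, return_tensors="pt")
out_ids = model.generate(**inputs, max_new_tokens=256)
synthetic = tok.decode(out_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# 2. Attach LoRA adapters and fine-tune on the generated data only
#    (neither the original paragraph nor the questions are in the training set).
lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
peft_model = get_peft_model(model, lora_cfg)
batch = tok(synthetic, return_tensors="pt")
optim = torch.optim.AdamW(peft_model.parameters(), lr=1e-4)
for _ in range(3):  # a few passes over the tiny synthetic set
    loss = peft_model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optim.step()
    optim.zero_grad()

# 3. The updated model is then asked the paragraph's questions with no paragraph
#    in context; its accuracy is the downstream score later used as the RL reward.
```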

The experiment compared the following schemes:

  • Fine-tuning only with the original paragraph;
  • Using implication sentences generated by GPT-4.1 for auxiliary training;
  • Fine-tuning with data generated by SEAL itself.

In the single-paragraph scenario, the accuracy of the original Qwen model is 32.7%. It increases to 33.5% when fine-tuning with the original text, reaches 46.3% after adding synthetic data from GPT-4.1, and further increases to 47.0% after fine-tuning using the SEAL method.

This shows that SEAL not only outperforms fine-tuning on the original text or on non-optimized generated data, but can even surpass the high-quality synthetic data generated by GPT-4.1, demonstrating strong knowledge-integration ability.

In addition, the researchers tested larger sets of paragraphs, including a 200-paragraph setting and the complete SQuAD validation set (2,067 paragraphs). SEAL achieved 58.2% accuracy in the 200-paragraph setting, significantly higher than the non-optimized version, indicating that SEAL generalizes to larger-scale knowledge-incorporation tasks.

Few-shot learning

In the few-shot learning experiment, the researchers used the LLaMA-3.2-1B-Instruct model and a selected subset of tasks from the ARC-AGI dataset.

An ARC task typically provides several input-output examples of a grid transformation (expressed in text form); the goal is to infer the underlying rule from these examples and predict the output for a new input.

Rather than reasoning over the examples directly, SEAL generates a training configuration and performs a LoRA fine-tune, after which the updated model predicts the output for the task's test input.
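For illustration, a self-edit in this setting amounts to a small training configuration like the one below. The field names and values are hypothetical, chosen only to show the kind of choices (data augmentations and optimization settings) the model gets to make.

```python
# Hypothetical self-edit for one ARC task: the model chooses which augmented
# copies of the demonstration pairs to train on and how to run the LoRA fine-tune.
self_edit = {
    "augmentations": {
        "rotations": True,            # include rotated versions of the example grids
        "flips": True,                # include flipped versions
        "chained_transforms": False,  # skip compositions of transforms
    },
    "optimization": {
        "learning_rate": 1e-4,
        "epochs": 5,
        "loss_on_output_only": True,  # compute loss only on the output grids
    },
}
# The inner loop builds the augmented training set, runs a LoRA fine-tune of the
# 1B model with these settings, and then predicts the held-out test output.
```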

The results show that the strategy learned by SEAL achieves a 72.5% task success rate, far higher than the 0% obtained with fixed few-shot prompts alone and well above the 20% of the random-sampling strategy.

Although this falls short of the 100% achieved by the manually configured optimal strategy (Oracle TTT), as a learning path discovered autonomously by the model it demonstrates strong task-adaptation ability.

So, how does the SEAL framework work?

The dual-loop system automatically generates training instructions

Simply put, SEAL works by reading new information, rewriting it in its own words, and performing gradient updates on itself - in effect, autonomous learning.

Structurally, the SEAL system consists of two parts - an outer loop driven by reinforcement learning and an inner loop that performs the parameter update.

In the outer loop, when the model faces a new input context (such as a text paragraph or task examples), it first generates a self-edit instruction.

This is a natural-language "fine-tuning instruction" describing how the model should update itself based on the current input: what information to extract, what training samples to generate, and what training parameters to use (such as the learning rate, number of training epochs, and loss strategy).

In essence, the model is designing its own training process.
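As a purely illustrative example (neither the wording nor the format is taken from the paper), a self-edit for a knowledge-incorporation input might look something like this:

```python
# Hypothetical self-edit text for a paragraph about a fictional product launch;
# both the content and the layout are invented for illustration.
self_edit = """\
Training statements derived from the paragraph:
1. The device was announced in March.
2. It ships with a 10-hour battery.
3. It replaces the previous model in the product line.

Training setup:
- learning_rate: 1e-4
- epochs: 3
- compute the loss only on the statements above
"""
```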

The process then enters the inner loop, where the model performs a fine-tune according to the self-edit instruction.

This step constructs synthetic training data - which may include implication sentences, question-answer pairs, restated information, or task-specific sample transformations - and updates the weights with the specified optimization strategy.

The updated model is immediately used to solve the corresponding task (such as answering the relevant questions or making predictions), and its performance is evaluated to obtain a score.

This score (the reward) reflects how much the self-edit improved performance and is then used to optimize the strategy for generating future self-edits.
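Putting the two loops together, a single SEAL round can be sketched as below. The helper callables are stand-ins for the steps described above (self-edit generation, the LoRA update, and downstream evaluation), not the authors' actual interfaces.

```python
from typing import Any, Callable, Tuple

def seal_round(
    model: Any,
    context: str,
    task: Any,
    propose_self_edit: Callable[[Any, str], str],  # outer loop: model + context -> self-edit
    finetune_with: Callable[[Any, str], Any],      # inner loop: model + self-edit -> updated model
    evaluate: Callable[[Any, Any], float],         # updated model + task -> score in [0, 1]
) -> Tuple[str, float]:
    """One SEAL round: propose a self-edit, apply it, score the updated model."""
    self_edit = propose_self_edit(model, context)    # design the update from the new context
    updated_model = finetune_with(model, self_edit)  # build synthetic data and run the fine-tune
    reward = evaluate(updated_model, task)           # downstream performance becomes the reward
    return self_edit, reward
```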

To carry out this strategy optimization, SEAL uses an unconventional reinforcement learning method called ReST^EM (Reinforced Self-Training with Expectation-Maximization).

The key idea is not to backpropagate a reward gradient directly, but to use filtered sampling plus behavior cloning.

Specifically, for each context the model generates multiple self-edit candidates, and each candidate is applied separately in its own fine-tune to obtain a new model.

Each new model's task performance is then evaluated, and only the self-edits that yield an improvement are retained.

This batch of "effective" self-edits is then used as training data for a supervised fine-tune (behavior cloning) of the generation policy, and the cycle repeats, making the model increasingly likely to produce effective self-edits.

ReST^EM is essentially an expectation-maximization process. Compared with policy-gradient methods such as PPO, it is structurally simpler, trains more stably, and is better suited to learning the generation behavior of large models.
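The outer optimization can be sketched as a filter-then-clone step, using the same kind of stand-in helpers as before; this mirrors the description above rather than the authors' implementation.

```python
from typing import Any, Callable, Iterable, List, Tuple

def restem_step(
    model: Any,
    contexts: Iterable[str],
    sample_self_edits: Callable[[Any, str], List[str]],   # several candidates per context
    apply_and_score: Callable[[Any, str, str], float],    # fine-tune with a candidate, then score
    baseline_score: Callable[[Any, str], float],          # score without any self-edit
    behavior_clone: Callable[[Any, List[Tuple[str, str]]], Any],  # SFT on (context, self-edit) pairs
) -> Any:
    """One ReST^EM iteration: keep only self-edits that improve performance, then clone them."""
    kept: List[Tuple[str, str]] = []
    for ctx in contexts:
        for edit in sample_self_edits(model, ctx):
            # Filter: retain a candidate only if the model fine-tuned on it beats the baseline.
            if apply_and_score(model, ctx, edit) > baseline_score(model, ctx):
                kept.append((ctx, edit))
    # Behavior cloning on the surviving self-edits; no reward gradient is backpropagated.
    return behavior_clone(model, kept)
```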

Through this set of mechanisms, SEAL achieves "learning how to learn better".

The model does not merely complete tasks with existing data; it actively designs training procedures, constructs training data, and continuously refines this "self-learning" strategy through feedback - ultimately yielding a language model with self-editing and continuous-evolution capabilities.

Paper link: https://arxiv.org/abs/2506.10943

Project homepage: https://jyopari.github.io/posts/seal

This article is from the WeChat official account “Quantum Bit”. Author: Keleixi. Republished by 36Kr with permission.