
New breakthrough in large model training: Meta proposes LSP, which substantially improves model capability without additional data.

Academic Headlines · 2025-09-22 09:45
Break free from data dependency and let LLMs self-improve.

The shortage of high-quality data has become a bottleneck that limits the continued learning and capability growth of large language models (LLMs).

To address this, Meta has proposed a new reinforcement learning (RL) method called "Language Self-Play" (LSP), which lets a model improve itself without relying on any additional data, removing this dependence entirely.

Paper link: https://arxiv.org/abs/2509.07414

The method builds on the game-theoretic framework of self-play: it treats the model's capability as performance in a competitive game and obtains stronger policies by having the model play against itself.

In experiments on an instruction-following benchmark with Llama-3.2-3B-Instruct, the model not only improved its performance on challenging tasks through self-play alone, but in some cases did so more effectively than data-driven baselines.

Self-Play: Both Challenger and Solver

According to the paper, in the LSP framework, the same pre-trained LLM is assigned two different identities, forming a dynamic adversarial relationship.

Among them, the "Challenger" is responsible for generating query content, aiming to design more challenging instructions to "stump" the solver and minimize the task reward. To enable the Challenger to generate effective queries, the research team designed a dedicated prompt (<ChallengerPrompt>), clearly requiring it to generate inputs that match the task type and test the model's ability, which can be simple instructions or high-difficulty or "stress-test" content.

The "Solver" is responsible for responding to the queries generated by the Challenger, aiming to provide high-quality answers and maximize the task reward. The reward here can be either an objective score based on result verification or a subjective evaluation based on human preferences.

Figure | The LSP agent operates in two modes, Challenger and Solver. As the Solver keeps learning to respond better to prompts, the Challenger designs ever more challenging tasks. Both modes are implemented by the same model, which supports continuous training and automatically generates data of steadily improving quality.

The adversarial relationship between the Challenger and the Solver can be summed up simply: the former poses hard problems and the latter does its best to solve them, and through continuous confrontation the abilities of both sides improve together. To keep the self-play process stable and efficient, LSP introduces two core techniques (sketched together in code after the two descriptions below):

Group Relative Policy Optimization (GRPO): In each training iteration, the Challenger first generates N queries. For each query, the Solver generates G different answers and receives the corresponding task rewards. The group-level statistics computed from these rewards serve two purposes: they provide a baseline for judging the quality of each Solver answer, and they quantify query difficulty, the quantity the Challenger is trained to optimize.

KL Divergence Regularization: This term keeps the model from going astray. On the one hand, it ensures that the trained model does not drift too far from the initial reference model, avoiding large performance swings; on the other, it prevents the Challenger from generating meaningless "garbage" queries, keeping the training process effective.
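The sketch below shows how these two ingredients might fit together in code. It is an illustrative assumption, not the paper's implementation: `group_relative_advantages` standardizes rewards within a group of G answers to one query, and `kl_penalty` is one common way to keep the policy close to a reference model, with the coefficient beta chosen arbitrarily.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO-style advantages: each of the G answers to one query is scored
    relative to the group mean (and std) of rewards for that query."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def kl_penalty(logp_policy: np.ndarray, logp_ref: np.ndarray, beta: float = 0.05) -> np.ndarray:
    """A simple KL-style penalty, beta * (log pi - log pi_ref), meant to be
    subtracted from the reward; beta and the exact form are assumptions."""
    return beta * (logp_policy - logp_ref)

# Example: G = 4 Solver answers to one Challenger query.
rewards = np.array([0.2, 0.9, 0.4, 0.5])
advantages = group_relative_advantages(rewards)
# The same group statistics double as a difficulty signal for the Challenger,
# which is rewarded for driving the Solver's average reward down.
challenger_signal = -rewards.mean()
```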

From LSP-Zero to LSP: Long-Term and Stable Autonomous Training

Initially, the research team proposed a basic version, LSP-Zero: a pure zero-sum game that relies solely on the confrontation between the Challenger and the Solver to drive training, with no additional quality constraints.

However, experiments revealed an obvious flaw in LSP-Zero: as training progresses, the model tends to slide into meaningless adversarial games. For example, when using OpenAssistant's reward model (reward-model-deberta-v3-large-v2), the Solver may engage in "reward hacking": no matter what kind of query the Challenger poses, it responds with Python code to exploit the reward rules, pulling training away from the core goal of improving capability.

To steer the game toward high-quality interactions, the researchers upgraded LSP-Zero into LSP by introducing a self-reward for quality: the reference model scores each "Challenger query + Solver answer" pair, and this score is added to the final rewards of both sides (a rough code illustration follows the list). The self-reward uses a 7-point rubric that evaluates interaction quality along 7 dimensions:

  • The user's task can be clearly identified from the instruction;
  • The instruction is clear, specific, and well structured;
  • The Solver's response is understandable to the user;
  • The response addresses a substantial part of the user's problem (it need not resolve it completely);
  • The response answers the core elements of the question effectively and comprehensively;
  • The response is clear, concise, well organized, and useful;
  • The response is in a form and style the user is likely to prefer.
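As a rough illustration of how such a quality score could enter the game, the sketch below (an assumption, not the paper's exact formulation) adds a normalized rubric score to both players' rewards; setting the weight to zero recovers the pure zero-sum LSP-Zero setting.

```python
# Rough illustration (an assumption, not the paper's exact formulation) of
# adding a rubric-based self-reward to both players. `quality_score` is the
# reference model's 0-7 score for one "Challenger query + Solver answer" pair.

def combined_rewards(task_reward: float, quality_score: float, alpha: float = 1.0):
    """Return (solver_reward, challenger_reward).

    With alpha = 0 this is the pure zero-sum LSP-Zero game: the Challenger
    simply receives the negated task reward. With alpha > 0 (LSP), the same
    quality bonus is added to BOTH sides, so neither player gains from
    degenerate, low-quality interactions."""
    quality = alpha * (quality_score / 7.0)  # normalize the 0-7 rubric score
    solver_reward = task_reward + quality
    challenger_reward = -task_reward + quality
    return solver_reward, challenger_reward
```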

With the self-reward in place, LSP's self-play is no longer a simple zero-sum game but shifts toward a high-quality "win-win": the Challenger must generate valuable queries, the Solver must provide high-quality answers, and both sides jointly pursue a higher quality score. This change resolves the problem of meaningless confrontation and enables long-term, stable autonomous training.

To verify the effectiveness of LSP, the research team used the AlpacaEval benchmark and Llama-3.2-3B-Instruct as the base model and conducted two sets of experiments.

First, they compared data-free LSP with LSP-Zero (an ablation of the self-reward regularization) and with a model trained by RL on Alpaca data. This experiment measures how much of the performance gained from data-driven training can be recovered through the self-play strategy alone, in the complete absence of RL data.

Figure | Win-rate comparison on the AlpacaEval benchmark against the base model Llama-3.2-3B-Instruct for GRPO (data-driven, yellow bars) and for LSP-Zero and LSP (data-free, red and blue bars). All algorithms outperform the base model on the overall benchmark (rightmost bars). Specific win rates: GRPO 40.9%, LSP-Zero 40.1%, LSP 40.6%. The gray solid line marks the win rate of the base model against itself (i.e., a model playing against itself is equally likely to win, draw, or lose).

They computed the win rates of each algorithm against Llama-3.2-3B-Instruct on AlpacaEval, both overall and on each constituent dataset. Although no training data was used, LSP-Zero and LSP still significantly improved on the base model, with overall performance comparable to GRPO and with LSP holding an edge over LSP-Zero. Notably, on some tasks (such as the Vicuna dataset, which consists of conversational, open-ended instructions), LSP-Zero and LSP end up performing significantly better than both the base model and GRPO. The researchers attribute this to the conversational character of the Challenger-generated prompts, which closely matches those task requirements and highlights LSP's advantage in specific scenarios.
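For context, a head-to-head win rate of this kind can be computed from per-prompt judge verdicts roughly as sketched below; the half-credit treatment of ties is an assumption, and AlpacaEval's own scoring details may differ.

```python
# Sketch of a head-to-head win-rate computation; `verdicts` holds one judge
# verdict per AlpacaEval prompt for candidate-vs-base ("win", "tie", "loss").
# Counting ties as half a win is an assumption, not AlpacaEval's documented rule.

def win_rate(verdicts: list[str]) -> float:
    wins = sum(v == "win" for v in verdicts)
    ties = sum(v == "tie" for v in verdicts)
    return (wins + 0.5 * ties) / len(verdicts)

print(win_rate(["win", "loss", "tie", "win"]))  # 0.625
```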

Figure | Win-rate comparison on the AlpacaEval benchmark for LSP-Zero and LSP (data-free, red and blue bars) when training continues from the data-trained GRPO model (yellow bars). Overall, LSP performs better than GRPO, with a significant advantage on the Vicuna task. Specific win rates: GRPO 40.9%, LSP-Zero 40.0%, LSP 43.1%. The gray solid line marks the win rate of the base model against itself.

In addition, the research team ran a second set of experiments: they first trained the model with GRPO and then continued training it with LSP, using the GRPO model as the starting point. The results show that LSP can further improve performance on top of the existing gains: the overall win rate against Llama-3.2-3B-Instruct rose from 40.9% to 43.1%. On the Vicuna dataset, LSP-Zero lifted GRPO's win rate from 28.7% to 36.3%, and LSP reached 46.3%.

However, the LSP method also has limitations: on the Koala dataset, which mainly consists of queries from real chatbot users, LSP performs slightly worse than GRPO. The research team attributes this to the more structured, orderly style of LSP-generated queries, which is a poorer match for Koala's loose conversational scenarios; future work will need to improve the diversity of query generation.

New Possibilities for Data-Free Training

LSP not only tackles the data-dependence problem in large model training but also demonstrates the technical feasibility of data-free training, offering several benefits for the future development of large models.

For example, in terms of training cost, there is no need to collect, clean, and annotate large-scale data, which greatly reduces the labor and resources spent on data acquisition. In data-scarce application scenarios, LSP lets the model keep optimizing without relying on external data. And through the self-play plus self-reward mechanism, the model can undergo long-term autonomous training and evolve on its own.

The research team believes that once AI achieves "embodiment" and can collect its own experience data, this self-play framework is expected to show great potential in expanding knowledge.

This article is from the WeChat public account "Academic Headlines" (ID: SciTouTiao), author: Xiaoyu. It is published by 36Kr with authorization.