High-quality data can be synthesized without relying on huge parameter counts. This open-source framework lets small models "turn the tables together": the performance of a 7B model follows right behind that of a 72B model.
Is it possible for small models to be self-sufficient and jointly improve without distilling any large-scale language model?
The GRA framework (Generator–Reviewer–Adjudicator) proposed by the Shanghai Artificial Intelligence Laboratory in collaboration with Renmin University of China is exactly such a new paradigm:
This method centers on "multi-person collaboration" and "division of labor by role", systematically exploring how multiple open-source small models can generate high-quality training data through a collaborative mechanism.
Experimental results show that on 10 mainstream datasets covering mathematics, code, logical reasoning, general knowledge Q&A, and more, the quality of the data generated by GRA is comparable to or higher than the output of a single large language model (such as Qwen-2.5-72B-Instruct), with a significant lead on most tasks.
This project has been open-sourced. For details, see the link at the end of the article.
GRA Framework: "Simulating Paper Submission"
If the traditional approach has a single model generating data single-handedly, GRA works more like the simulated review process of a top conference: authors, reviewers, and the Area Chair each take their positions, and small models divide the labor, scoring and reviewing one another's output to keep data quality stable and standards unified.
1. Generator: Create new samples like an "author"
GRA first divides tasks into multiple domains (such as mathematics, programming, and logical reasoning). Each small Generator model is responsible for generating new instructions and responses in its assigned domain: it extracts keywords and summaries from the seed data and combines them with domain knowledge to produce high-quality samples with rich content, focused themes, and clear semantics.
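To make the workflow concrete, here is a minimal Python sketch of what a single Generator step could look like. The `chat()` helper, the domain-to-model mapping, and the prompt wording are all illustrative assumptions rather than the paper's exact design; `chat()` simply stands in for whatever inference backend serves the small models.

```python
import json
import random

def chat(model: str, prompt: str) -> str:
    """Hypothetical helper: send `prompt` to the small model `model` and return its reply."""
    raise NotImplementedError("plug in your own inference backend here")

# Illustrative domain-to-Generator assignment (not the paper's exact mapping).
GENERATORS = {
    "math": "Qwen-2.5-7B-Instruct",
    "code": "LLaMA-3.1-8B-Instruct",
    "reasoning": "InternLM3-8B-Instruct",
}

def generate_sample(domain: str, seed_pool: list[dict]) -> dict:
    """Draw a seed example, extract its keywords, and ask the domain's Generator
    for a brand-new instruction-response pair."""
    seed = random.choice(seed_pool)
    keywords = chat(GENERATORS[domain],
                    "List the key concepts of this task as comma-separated keywords:\n"
                    f"{seed['instruction']}")
    prompt = (
        f"You are an expert in {domain}. Using the keywords [{keywords}] as inspiration, "
        "write ONE new, self-contained instruction and a correct, well-explained response. "
        'Reply as JSON: {"instruction": "...", "response": "..."}'
    )
    return json.loads(chat(GENERATORS[domain], prompt))
```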
2. Reviewer: Strictly review like a "reviewer"
After each piece of data is generated, it will be sent to multiple small Reviewer models for two rounds of review:
First, check whether the instructions are reasonable and clear;
Then comprehensively evaluate the correctness, relevance, and language quality of the responses, and score them with comments.
The system screens samples by average score and score consistency: samples with low scores are eliminated outright, and those with divergent opinions are sent on to the next step (a rough sketch of this screening logic follows below).
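The sketch below reuses the hypothetical `chat()` helper from the Generator sketch above. The reviewer lineup, the 1-10 scale, and the acceptance/escalation thresholds are illustrative assumptions, not the paper's exact settings.

```python
import re
from statistics import mean, stdev

REVIEWERS = ["Mistral-7B-Instruct-v0.3", "Tulu-3-8B", "Qwen-2.5-7B-Instruct"]

def parse_score(reply: str) -> float:
    """Hypothetical helper: pull the first number out of a reviewer's reply."""
    return float(re.search(r"\d+(\.\d+)?", reply).group())

def review(sample: dict) -> str:
    """Collect a score from every Reviewer, then accept, reject, or escalate the sample."""
    scores = []
    for model in REVIEWERS:
        reply = chat(model,
                     "Score this instruction-response pair from 1 to 10 for instruction clarity, "
                     "correctness, relevance, and language quality, with a short comment:\n"
                     f"{sample}")
        scores.append(parse_score(reply))
    if mean(scores) < 6.0:        # low average score: eliminate directly
        return "reject"
    if stdev(scores) > 1.5:       # reviewers disagree: hand over to the Adjudicator
        return "escalate"
    return "accept"
```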
3. Adjudicator: Make the final decision like an "AC"
When there is a scoring conflict among Reviewers, the small Adjudicator model will step in, conduct an independent review, and make the final judgment. It is like the Area Chair in academic review, effectively avoiding "majority misjudgment" and ensuring that the remaining data is objective and reliable.
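The corresponding Adjudicator step can be sketched just as briefly; the model choice and prompt are again assumptions, and `chat()` is the same hypothetical helper as above.

```python
ADJUDICATOR = "InternLM3-8B-Instruct"   # illustrative choice of arbiter model

def adjudicate(sample: dict) -> bool:
    """Independent re-review of an escalated sample: the Adjudicator does not see the
    Reviewers' scores and makes the final keep/drop decision on its own."""
    verdict = chat(ADJUDICATOR,
                   "Independently judge whether this instruction-response pair is correct, "
                   f"relevant, and well written. Answer KEEP or DROP:\n{sample}")
    return "KEEP" in verdict.upper()
```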
4. Post-processing module: Make good data more "refined"
For samples that pass review, the system also performs semantic deduplication, summary completion, and format unification to further improve their consistency and quality of expression.
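The semantic deduplication step, for example, can be approximated with sentence embeddings and a cosine-similarity cutoff. The embedding model and the 0.9 threshold below are illustrative choices, not the paper's settings.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def dedup(samples: list[dict], threshold: float = 0.9) -> list[dict]:
    """Keep a sample only if its instruction is not near-identical (by cosine similarity)
    to one that has already been kept."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = encoder.encode([s["instruction"] for s in samples],
                                normalize_embeddings=True)
    kept, kept_embs = [], []
    for sample, emb in zip(samples, embeddings):
        if all(float(np.dot(emb, k)) < threshold for k in kept_embs):
            kept.append(sample)
            kept_embs.append(emb)
    return kept
```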
In short, GRA builds an automated system that simulates a top-conference review process: small models take turns playing the roles of creation, review, and arbitration, and generate high-quality training data through multiple rounds of collaboration.
This mechanism not only improves the diversity and fairness of data generation but also breaks the previous dependence on large-model distillation, realizing a "collective intelligence" path that truly belongs to small models.
Experimental Verification: "Three Cobblers with Their Wits Combined Equal Zhuge Liang"
The GRA team selected 10 public datasets covering four domains: mathematical reasoning (such as Math and GSM8K), code generation (HumanEval, MBPP), reasoning Q&A (HellaSwag, ARC-C, GPQA, BBH), and general knowledge Q&A (MMLU, IFEval) to comprehensively evaluate the performance of the GRA framework.
The GRA framework integrates 5 open-source small language models with parameter counts between 7B and 8B, including LLaMA-3.1-8B-Instruct, Qwen-2.5-7B-Instruct, InternLM3-8B-Instruct, Mistral-7B-Instruct-v0.3, and Tulu-3-8B.
The data generated by GRA was used to train two base models (LLaMA-3.1-8B-Base and Qwen-2.5-7B-Base), and was systematically compared with the original seed data and with data generated by distilling Qwen-2.5-32B and Qwen-2.5-72B-Instruct.
The core experimental results show:
1. Significantly better than the original data: Data generated by GRA brings an average improvement of 6.18% on LLaMA-3.1 and 11.81% on Qwen-2.5, showing that collaboration among small models alone can substantially improve data quality and training effectiveness.
2. Head-to-head with large-model distillation: On LLaMA-3.1, the model trained on GRA-generated data trails the Qwen-72B-distilled version by only 0.59%; on Qwen-2.5, it leads the Qwen-72B-distilled version by an average of 8.83%. This suggests that the small-model collaborative mechanism can become a lower-cost, more cost-effective alternative to large models.
3. "Bigger" large models ≠ better: The experiment also found that the performance improvement of Qwen - 72B compared with 32B is limited, reflecting that the return of the traditional distillation paradigm is gradually decreasing when further expanding the parameter scale. In contrast, the "collective wisdom" path of GRA has more expansion potential.
In a nutshell: multiple small models with a sensible division of labor can achieve training results comparable to, or even better than, those of large models. This not only saves compute but may also reshape our understanding of what effective data synthesis is.
Factor Analysis: "1 + 1 + 1 > 3"
Analyzing GRA's advantages along the dimensions of data diversity, quality, and difficulty control reveals the following key factors:
1. Diverse data to fill the blind spots
A t-SNE visualization shows that the distribution of GRA-generated data is noticeably wider and more uniform than that of the original seed data and the large-model distilled data, and that it fills in regions of the semantic space the original data does not cover. This indicates that GRA's data has stronger coverage and diversity.
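For readers who want to run a similar comparison on their own data, a minimal sketch follows; the embedding model and the t-SNE perplexity are arbitrary illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE

def plot_coverage(seed_texts: list[str], gra_texts: list[str]) -> None:
    """Embed two sets of instructions and project them to 2-D with t-SNE."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = encoder.encode(seed_texts + gra_texts)
    xy = TSNE(n_components=2, perplexity=30).fit_transform(np.asarray(embeddings))
    n = len(seed_texts)
    plt.scatter(xy[:n, 0], xy[:n, 1], s=5, label="seed data")
    plt.scatter(xy[n:, 0], xy[n:, 1], s=5, label="GRA data")
    plt.legend()
    plt.show()
```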
2. Reliable data quality with detailed and stable review
The data generated by GRA is not only reviewed by multiple small models; in the comparative experiment it also received high scores from Qwen-2.5-72B, with more than 87.3% of samples scored highly consistently.
At the same time, GRA's score distribution is smoother and more fine-grained, indicating stronger discrimination and consistency in evaluating data quality and confirming the reliability of its screening mechanism.
3. More "challenging" data for more effective training
Analysis with the Instruction-Following Difficulty (IFD) metric shows that the task difficulty of GRA-generated data is 14.58% higher than that of the seed data and essentially on par with the large-model distilled data (GRA: 75.82%, Qwen-72B distillation: 75.49%). This means GRA can construct challenging, knowledge-dense data that gives small models a stronger training signal.
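IFD is commonly computed as the ratio of a model's loss on the response with the instruction as context to its loss on the response alone; a higher ratio means the instruction helps less, i.e. the sample is harder. The sketch below follows that common definition, with the scoring model and prompt template chosen purely for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B"   # illustrative choice of scoring model
tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
lm.eval()

@torch.no_grad()
def response_loss(prefix: str, response: str) -> float:
    """Average cross-entropy over the response tokens, with the prefix tokens masked out."""
    prefix_len = tok(prefix, return_tensors="pt").input_ids.shape[1]
    ids = tok(prefix + response, return_tensors="pt").input_ids
    labels = ids.clone()
    labels[:, :prefix_len] = -100          # do not score the prefix/instruction tokens
    return lm(input_ids=ids, labels=labels).loss.item()

def ifd(instruction: str, response: str) -> float:
    """IFD ~= loss(response | instruction) / loss(response)."""
    return response_loss(instruction + "\n\n", response) / response_loss("", response)
```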
Paper address: https://arxiv.org/abs/2504.12322
Project address: https://github.com/GX-XinGao/GRA
Model address: https://huggingface.co/collections/GX-XinGao/gra-6801cba58ceb0074566cdb4e
This article is from the WeChat public account "QbitAI". Author: GRA team. Republished by 36Kr with permission.