
Understand the Unique Skills of GPT-5 at a Glance: The Invisible Weapon Determining the Future of AI

Friends of 36Kr | 2025-09-16 18:39
Beyond right and wrong, let AI recognize good and bad.

Before the release of GPT-5, The Information reported that GPT-5's performance gains came mainly from a "Universal Verifier" that OpenAI had developed.

Although GPT-5's eventual capability jump fell short of expectations, the Universal Verifier has become the next "Holy Grail" for large models and, recently, one of the hottest topics in AI circles.

Why is it so crucial?

Mainly because the previous wave of capability gains relied on reinforcement learning with verifiable rewards (RLVR). Simply put, it trains on problems that have standard answers, such as math and programming: award points for correct answers, deduct points for wrong ones, and the training effect is immediate.
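To make that concrete, here is a minimal Python sketch of such a verifiable reward. The last-line answer extraction is a deliberately naive assumption for illustration, not how any production system actually parses answers.

```python
# Minimal sketch (not any lab's actual implementation) of the binary,
# verifiable reward RLVR relies on. Assumption: the final answer is on
# the last line of the model's output.

def math_reward(model_output: str, reference_answer: str) -> float:
    """Return 1.0 if the model's final answer matches the reference, else 0.0."""
    predicted = model_output.strip().splitlines()[-1].strip()
    return 1.0 if predicted == reference_answer.strip() else 0.0


if __name__ == "__main__":
    print(math_reward("6 * 7 = 42\n42", "42"))          # 1.0 -> reinforce this rollout
    print(math_reward("I believe it is 41\n41", "42"))  # 0.0 -> penalize this rollout
```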

However, the real world is far more complex than just "right" or "wrong". For example, in fields such as medicine, education, and creativity, many problems don't have a single answer. A "good" answer may need to be both professionally reliable and show communication and empathy. RLVR struggles in these scenarios and may even cause the model to regress in open-ended questions.

To enable the model to further evolve, it is necessary to break through the limitations of "right/wrong" rewards, allowing AI to evaluate pros and cons in different fields like an expert and convert massive amounts of unstructured experience data into effective learning signals. The Universal Verifier was born for this purpose and is considered to potentially trigger the next paradigm shift in reinforcement learning.

Today, let's work through the main approaches to what is arguably the most important open problem in the large language model field right now, one that may hold the next paradigm shift in reinforcement learning.

This article is quite long, about 8,000 words. But only by understanding the "Universal Verifier" can you really grasp what the AI technology race after GPT-5 hinges on. Please bear with it.

Path 1: Let the model act as a judge, but with more complex criteria

The logic of the first path is extremely simple. Since what we need is a universal judge, why not let a large model with general judgment ability serve as the verifier?

This idea has a long history. The concept of "LLM-as-a-Judge" has been around since at least 2023.

Before the recent transformation of the reinforcement learning paradigm, it was regarded as an objective tool for evaluating AI capabilities.

At that time, however, although it provided a valuable third-party perspective, it never really entered the training loop and was not wired into the reward model, the core driving mechanism. It could tell right from wrong, but its judgments were not converted into real-time feedback signals that drive the model's iterative optimization.

But this connection was made soon enough. In August 2024, DeepMind's paper "Generative Verifiers" was the first attempt to train a language model directly into a verifier (GenRM) for reinforcement learning.

At the time, GenRM was applied mainly to fields with strong logic and clear steps, such as mathematical and algorithmic reasoning. Its most powerful variant, GenRM-CoT, analyzes a solution by generating its own chain-of-thought; its core advantage is accurately catching step-by-step errors in the calculation process.
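As a rough illustration of the idea (not DeepMind's actual code), a generative verifier can be queried as below; the `verifier.generate` and `verifier.yes_probability` methods are placeholders for whatever inference API the verifier model exposes.

```python
# Sketch of querying a GenRM-CoT-style verifier. The `verifier` object and
# its two methods are placeholders, not a real library API.

VERIFY_TEMPLATE = (
    "Question:\n{question}\n\n"
    "Proposed solution:\n{solution}\n\n"
    "Verify the solution step by step, then answer: Is the solution correct? (Yes/No)"
)

def genrm_cot_score(question: str, solution: str, verifier) -> float:
    """Return a soft correctness score in [0, 1].

    The verifier first writes its own step-by-step critique (a verification
    chain-of-thought); the score is then read off as the probability the
    verifier assigns to answering "Yes"."""
    prompt = VERIFY_TEMPLATE.format(question=question, solution=solution)
    critique = verifier.generate(prompt)                 # verification CoT
    return verifier.yes_probability(prompt + critique)   # P("Yes") used as the reward
```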

However, with the rise of o1 and RLVR, GenRM's edge was temporarily overshadowed. In fields like mathematics and programming that have definite answers, the training data itself already contains the most reliable "verification" signal. In that context, building a complex verifier like GenRM, which has to reason for itself, looked somewhat superfluous, and this line of work fell quiet for a while.

However, once RLVR's limitations surfaced in broader, more subjective open domains (such as creative writing, complex conversation, and humanities analysis), the GenRM path was taken seriously again.

The papers below basically follow this path, deepening and strengthening it to cope with the complexity of open domains. This is arguably the most mainstream school of thought on building a "Universal Verifier" today.

1. Since evaluating most things is complex, let our verification criteria be more complex too

Since most fields don't have definite solutions the way programming and mathematics do, simply construct a multi-dimensional, checklist-like "rubric" that breaks down what a high-quality answer requires, and use that rubric as the universal verifier for rewards.

Scale AI's paper "Rubrics as Rewards" (RaR), published on July 23, 2025, systematically demonstrates progress in this direction.

The RaR framework proposed in this research gives a concrete recipe for building a structured, multi-dimensional "value system" for AI.

Its core logic breaks down into three steps: experts write the law, the model interprets the law, and AI enforces it.

What dimensions should a good answer have? The model can't conjure the answer out of thin air on its own.

The first step of the RaR framework is for human experts and large language models to jointly define a "meta-framework" for evaluation in specific fields (such as medicine and science).

For example, in the medical field, the expert-written framework notes that "the rubric can cover aspects such as factual correctness, characteristics of an ideal answer, style, completeness, helpfulness, harmlessness, patient-centeredness, depth of reasoning, situational relevance, and empathy."

Experts also predefine importance levels and require the model to use classifications such as "Essential Criteria", "Important Criteria", "Optional Criteria", and "Pitfall Criteria".

RLVR worked in the past because the reward criteria were crystal clear and the effect could be amplified across a huge number of questions. If RaR required experts to hand-write a rubric for every single question, it would be nearly impossible to scale. RaR's answer: the meta-rules are written by human experts, but the specific evaluation system is scaled up by the model.

At this stage, a strong model takes the expert-defined "meta-framework", combines it with a specific "case" (i.e., a question plus an expert reference answer), and automatically generates a detailed, actionable checklist of 7 to 20 scoring items.
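A minimal sketch of what that generation step could look like in code; the prompt wording and the `llm` callable are assumptions made for illustration, not the paper's actual pipeline.

```python
# Sketch of the rubric-generation step: expert meta-framework + one case
# (question and reference answer) -> 7-20 labeled, checkable criteria.
# `llm` is a placeholder for any text-in/text-out model call.

RUBRIC_PROMPT = """You are designing an evaluation rubric.

Meta-framework written by domain experts:
{meta_framework}

Question:
{question}

Expert reference answer:
{reference_answer}

Write 7 to 20 concrete, independently checkable criteria. Label each line as
Essential Criteria, Important Criteria, Optional Criteria, or Pitfall Criteria,
in the form "<Label>: <criterion>"."""


def generate_rubric(llm, meta_framework: str, question: str,
                    reference_answer: str) -> list[str]:
    """Ask a strong model to instantiate the meta-framework for one case."""
    prompt = RUBRIC_PROMPT.format(meta_framework=meta_framework,
                                  question=question,
                                  reference_answer=reference_answer)
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]
```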

Each item is assigned a category weight that determines how much it matters for a correct answer. For example, for the question "How to diagnose kidney stones?", it automatically generates specific clauses like "Essential Criteria: point out the sensitivity of non-contrast spiral CT".

This step is the key to scalability: it leverages the experts' one-time effort of "writing the model answer" into tens of thousands of criteria that can be used for automated evaluation across scenarios.

With this detailed set of "legal provisions" (the rubric) in hand, training enters the reinforcement learning (RL) loop. The "student model" to be trained (Qwen2.5-7B) uses a method similar to GRPO to explore, generating multiple different answers to each question.

Another "judge model" (an LLM judge, GPT-4o-mini) then strictly follows the rubric and scores each of the student's answers.

Based on this dense, clear feedback signal, the student model keeps optimizing its generation policy and learns to write answers that consistently score high.
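Putting the loop's reward step together, the score a single answer receives might be computed roughly as follows; the category weights and the per-criterion `judge` callable are illustrative assumptions, not the paper's exact setup.

```python
from dataclasses import dataclass

# Illustrative category weights; RaR assigns importance levels, but these
# exact numbers are an assumption made for the sketch.
CATEGORY_WEIGHTS = {
    "Essential Criteria": 1.0,
    "Important Criteria": 0.6,
    "Optional Criteria": 0.3,
    "Pitfall Criteria": -1.0,   # exhibiting a pitfall should cost reward
}


@dataclass
class RubricItem:
    category: str   # one of the CATEGORY_WEIGHTS keys
    text: str       # e.g. "point out the sensitivity of non-contrast spiral CT"


def rubric_reward(answer: str, rubric: list[RubricItem], judge) -> float:
    """Weighted score over rubric items, normalized to roughly [-1, 1].

    `judge(answer, criterion_text) -> bool` stands in for the LLM judge
    (e.g. GPT-4o-mini) checking one criterion at a time."""
    total = sum(abs(CATEGORY_WEIGHTS[item.category]) for item in rubric)
    if total == 0:
        return 0.0
    score = sum(CATEGORY_WEIGHTS[item.category]
                for item in rubric if judge(answer, item.text))
    return score / total
```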

The model trained with the RaR framework does perform remarkably well. In the medical domain, RaR training lifted Qwen2.5-7B's score from 0.0818 (the raw base model) to 0.3194, nearly a fourfold improvement. Against the stronger instruction-tuned baseline, Qwen2.5-7B-Instruct, the score rose from 0.2359 to 0.3194, a relative improvement of roughly 35%.

The author also compared the improvement brought by this method with some other general methods of "LLM as Judge".

For example, on the HealthBench-1k medical benchmark, RaR achieved a relative improvement of up to 28% over the simple approach of having the model rate an answer directly on a 1-10 scale (Simple-Likert). RaR also matches or exceeds the approach of scoring an answer against an expert-written reference answer for each individual question (Reference-Likert); since that latter method is completely unscalable, RaR has a clear advantage.

Compared with the more basic approach of supervised fine-tuning on expert answers (SFT), RaR also leads by more than 30%.

These results show that with RaR's structured rubrics, the model does receive more accurate reward signals and therefore performs better on complex reasoning tasks.

The one catch with RaR as a universal verification method may be that experts still have to write a meta-evaluation framework for each field; until that is done, it is hard to call it truly "universal". But at least the method of extending verification into new fields is universal: it provides a scalable, efficient blueprint and toolbox for eventually "covering every field".

2. Rubicon: eliminate the seesaw effect in reinforcement learning so "universal" comes with no weak spots

After this paper, on August 18, Ant Group and Zhejiang University jointly published a rubric-based reinforcement learning paper in the same direction that goes a step further.

Although the basic rubric framework and logic resemble RaR, the Rubicon researchers built a large-scale system of more than 10,000 scoring criteria to comprehensively improve performance in subjective domains such as humanities, creativity, social interaction, and general conversation. Training the Qwen-30B-A3B model on just over 5,000 samples, they achieved an absolute improvement of 5.2% across various open-ended benchmarks (especially humanities-centered tasks), even scoring 2.4% higher than the far larger 671B DeepSeek-V3 model.

This achievement comes from a series of upgrades to the "rubric" training method by the team.

The first is an upgrade to the "rubric" itself. Rubicon's reward framework is more refined than RaR's single aggregate score: it can effectively filter out potentially problematic samples and keep reinforcement learning going for longer.

The veto mechanism is a hard filter: when the model fails on a key dimension (for example, tripping the "reward hacking" rules), that failure can directly veto the rewards from all other dimensions.

Saturation-aware aggregation keeps the model learning instead of plateauing. If the model scores extremely high on a single dimension, diminishing marginal utility is applied, encouraging comprehensive development rather than "farming points" on one aspect. In addition, Rubicon amplifies score differences in the high-score range with a non-linear function, giving the model a finer optimization gradient at the "selecting the best from the excellent" stage.
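The paper describes these mechanisms in prose rather than code, but the aggregation logic might be sketched like this; the particular thresholds and curve shapes are assumptions for illustration, not the paper's values.

```python
def rubicon_style_aggregate(dim_scores: dict[str, float], vetoed: bool) -> float:
    """Sketch of Rubicon-style aggregation over per-dimension scores in [0, 1].

    - Hard veto: a failure on a critical check (e.g. detected reward hacking)
      wipes out the reward regardless of the other dimensions.
    - Saturation awareness: very high single-dimension scores get diminishing
      marginal utility, so the model cannot farm one dimension for points.
    - High-score amplification: differences among already-strong answers are
      stretched to keep a usable optimization gradient.
    The 0.8 / 0.7 thresholds and the curves are illustrative assumptions."""
    if vetoed:
        return 0.0

    def saturate(s: float) -> float:
        # Above 0.8, each extra point on a single dimension counts for half.
        return s if s <= 0.8 else 0.8 + 0.5 * (s - 0.8)

    mean = sum(saturate(s) for s in dim_scores.values()) / len(dim_scores)

    # Stretch the high-score band (slope 2 above 0.7) so fine differences
    # among excellent answers still produce a gradient; rewards may exceed 1.
    return mean if mean <= 0.7 else 0.7 + 2.0 * (mean - 0.7)
```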

Beyond refining the rubric itself, the Rubicon team also tackled the biggest problem reinforcement learning faces in all-around development: the "seesaw effect".

Past RL experience tells us that when a single round of reinforcement learning tries to teach a model several different, even conflicting, skills at once, its performance behaves like a seesaw: as one ability goes up, another tends to go down.

So simply mixing every field's scoring criteria together in one training run does not yield across-the-board improvement; even with a "Universal Verifier", the model's capabilities would not improve universally.

To solve this, they designed a phased reinforcement learning process. The first stage lays the foundation: verifiable checks (e.g., does the output follow a requested format such as JSON, is the length within the specified range) plus static scoring criteria (more general standards such as logical clarity and information completeness) train the model to follow instructions reliably, so that it won't easily "go off the rails" on more complex tasks later.
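Those first-stage checks are precisely the kind of thing plain code can verify; a minimal sketch, with the specific rules and their equal weighting chosen here purely for illustration.

```python
import json

def stage1_reward(output: str, min_len: int = 50, max_len: int = 2000) -> float:
    """Sketch of stage-1 verifiable checks: reward basic instruction-following
    (valid JSON, length within range) before any field-specific rubric training.
    The two rules and their equal weighting are illustrative assumptions."""
    passed = 0

    # Check 1: the output parses as JSON (assuming JSON output was requested).
    try:
        json.loads(output)
        passed += 1
    except json.JSONDecodeError:
        pass

    # Check 2: the length falls inside the requested range.
    if min_len <= len(output) <= max_len:
        passed += 1

    return passed / 2.0   # fraction of verifiable checks satisfied
```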

In the second stage, rubrics extended to specific fields and problems are used for training. With this two-stage approach, the model's average performance on seven creativity-related open-ended benchmarks rose by 5.21%, while avoiding negative interference between abilities.