
The hole OpenAI dug last year has finally been filled. A scaling law for reward models appears for the first time, and a 1.8B model teaches a 70B behemoth a lesson.

新智元 2025-07-11 15:20
Recently, a brand-new reward model called POLAR has emerged. It adopts a contrastive learning paradigm, scoring a model's responses precisely by measuring their "distance" from reference answers. It not only removes the reliance on large amounts of manual annotation but also shows strong scaling potential, enabling small models to outperform rivals dozens of times their size.

All along, making AI better understand humans has been a core issue in the field of large models.

The Reward Model (RM) is the core technology for solving the problem of "understanding human preferences" and is also a key factor limiting the effectiveness of post-training.

In December 2024, OpenAI introduced a new Reinforcement Fine-Tuning (RFT) technique. During RFT, a Grader assigns reward scores against standard answers, helping the model "learn" how to produce correct results.

Figure 1: Representative example of OpenAI's Reinforcement Fine-tuning

Inspired by this, methods based on Reinforcement Learning with Verifiable Rewards (RLVR) have emerged; by scoring outputs with rules instead of a learned reward model, they avoid the reward model's inherent problems, such as low accuracy and poor generalization.

However, RLVR can often only provide 0/1 rewards and cannot make finer-grained preference distinctions.

For example, it struggles to generalize to open-ended tasks such as writing poems or casual chat, which limits its use in more general scenarios.
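As a toy illustration (not from the paper), a rule-based verifier can only say "exactly right" or "wrong", so a near-miss and a completely wrong answer receive the same zero reward:

```python
def rule_based_reward(response: str, gold: str) -> float:
    """Toy RLVR-style verifier: an exact string match earns 1, anything else 0."""
    return 1.0 if response.strip() == gold.strip() else 0.0

# A nearly-correct answer and a completely wrong one get the same reward:
print(rule_based_reward("42", "42"))      # 1.0
print(rule_based_reward("42.0", "42"))    # 0.0 (equivalent value, different form)
print(rule_based_reward("banana", "42"))  # 0.0 (completely wrong)
```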

To address this issue, researchers from the Shanghai AI Laboratory and Fudan University recently proposed a new reward model called POLAR and open-sourced two versions with parameter scales of 1.8B and 7B.

Different from traditional reward models based on "absolute preferences", POLAR adopts a new contrastive learning pre-training paradigm, which can flexibly give reward scores to model responses according to the reference answers.

The actual test results show that POLAR has fully demonstrated the potential of an excellent "Grader".

Paper link: https://arxiv.org/abs/2507.05197

Project link: https://github.com/InternLM/POLAR

Model link: https://huggingface.co/internlm/POLAR-7B

Feeding POLAR the official genetics example from OpenAI's RFT demo mentioned above, together with several constructed model responses, we found that POLAR produces exactly the correct partial order!

Response 1 (completely consistent with the reference):

FOXE3

Score: -0.278

Response 2 (the correct answer ranked 1st):

genes: [FOXE3, KDM5A, BBS5]

Score: -7.889

Response 3 (the correct answer ranked 2nd):

genes: [BBS5, FOXE3, KDM5A]

Score: -7.998

Response 4 (the correct answer ranked 3rd):

genes: [KDM5A, BBS5, FOXE3]

Score: -8.008

Response 5 (a spelling error similar to the correct answer):

genes: [FOXE2, KDM5A, BBS5]

Score: -8.965

Response 6 (no correct answer):

genes: [CHD7, KDM5A, BBS5]

Score: -10.225

In addition, POLAR can also produce the correct partial order for answers to open-ended questions, judged against a reference answer:

Question: Please summarize the second part of The Three-Body Problem humorously in one sentence.

Reference response (pointing out the Dark Forest and key plot):

Humans repelled the Trisolaran space fleet with the words "I'll tell the teacher".

Response 1 (pointing out the Dark Forest and key plot):

Humans were beaten by the Trisolarans, and finally scared off the professional team with Luo Ji's "cosmic blackmail".

Score: -7.414

Response 2 (pointing out the Dark Forest):

Luo Ji thought hard throughout the book and finally realized the truth: Don't post randomly on Moments, or you'll get killed.

Score: -8.070

Response 3 (partially factually incorrect, the two-dimensional foil is from the third part):

The whole universe is playing "hide and seek" in the "Dark Forest", but humans hung the sign saying "Don't kill me" backwards and attracted the two-dimensional foil demolition team.

Score: -8.141

Response 4 (not pointing out the Dark Forest):

The Wallfacers are acting in a suspense drama, the Wallbreakers are acting in a reasoning drama, and the Trisolarans are watching an absurd drama.

Score: -8.922

Response 5 (not humorous enough):

With the arrival of the Crisis Era, humans selected "Wallfacers" for psychological games and finally deterred the Trisolaran civilization with the Dark Forest Law.

Score: -9.742

Question: Help me come up with 3 non-repetitive idioms to describe heavy rain.

Reference response:

1. Torrential rain 2. Rain pouring down like a waterfall 3. Cloudburst

Response 1 (correct answer):

1. Torrential downpour 2. Violent storm 3. Rain falling in torrents

Score: -5.234

Response 2 (not idioms):

1. Heavy rain like a waterfall 2. Heavy rain pouring from the sky 3. Powerful rain

Score: -5.707

Response 3 (similar to the reference, repeating one):

1. Torrential rain 2. Rain pouring down like a waterfall 3. Rain pouring down like a waterfall

Score: -6.656

Response 4 (correct idioms, but one more than requested):

1. Torrential downpour 2. Violent storm 3. Rain falling in torrents 4. Torrential rain

Score: -7.023

Response 5 (idioms containing the character "rain", but two do not describe heavy rain):

1. Torrential downpour 2. After the rain, the sky clears 3. Bamboo shoots after a spring rain

Score: -8.578

POLAR fits naturally into the RFT reinforcement learning framework: it scores model outputs against the reference answer for each question, and the closer an output is to the reference, the higher the reward.

Through this training process, the policy model can be gradually optimized towards the optimal policy.
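The snippet below is only a schematic of how such a reference-based grader could slot into an RFT-style loop. The `polar_score` function is a toy string-similarity stand-in for the actual POLAR model (see the project repository for the real inference API), and the group-wise reward centering is one common policy-gradient choice, not necessarily the authors' exact recipe.

```python
from difflib import SequenceMatcher
from typing import List

def polar_score(reference: str, response: str) -> float:
    """Stand-in for the POLAR reward model. The real model is a neural network;
    this toy similarity only illustrates the interface: the closer the response
    is to the reference, the higher the reward."""
    return SequenceMatcher(None, reference, response).ratio()

def grade_rollouts(reference: str, rollouts: List[str]) -> List[float]:
    """Score each sampled rollout against the reference answer, then center the
    rewards within the group so above-average rollouts receive positive
    advantages (a common choice in RFT-style policy-gradient training)."""
    rewards = [polar_score(reference, r) for r in rollouts]
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# Example with the gene question from above:
advantages = grade_rollouts(
    reference="FOXE3",
    rollouts=["FOXE3", "genes: [FOXE3, KDM5A, BBS5]", "genes: [CHD7, KDM5A, BBS5]"],
)
print(advantages)  # the exact-match rollout gets the largest advantage
```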

How is POLAR trained?

POLAR adopts a new reward modeling paradigm, Policy Discriminative Learning (from which POLAR takes its name), that is decoupled from absolute preferences and can be scaled truly efficiently, giving the reward model the scalability and strong generalization of large language models.

Figure 2: Two-stage training (pre-training and preference fine-tuning) of POLAR and its usage method in RFT

Different from the traditional reward modeling method based on "absolute preferences", POLAR uses the "distance" between the training policy and the target policy as the reward signal.

The closer the training policy is to the target policy, the higher the reward POLAR gives.

Specifically, POLAR measures this distance with contrastive learning: outputs sampled from the same policy model serve as positive examples, while outputs sampled from different policy models serve as negative examples.

Constructing positive and negative samples this way yields an unbiased optimization objective. At the same time, the policy model is treated as an unbiased sampler of some distribution, so the distance between policies can be approximated by characterizing the differences between their samples.

The pre-training corpus of POLAR is completely constructed from automatically synthesized data.

Specifically, a large number of text prefixes are sampled from the LLM pre-training corpus, and models are randomly selected from the policy model pool for trajectory sampling.

The policy model pool here consists of 131 open-source Base LLMs and 53 Chat LLMs. The pre-training objective uses the Bradley-Terry Loss:
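Written out in the standard Bradley-Terry form (a reconstruction based on the description below; the paper's exact notation may differ), with $\sigma$ the sigmoid and $r_\theta$ the scalar score POLAR assigns to a pair of trajectories:

$$\mathcal{L}_{\text{pre}} = -\,\mathbb{E}\Big[\log \sigma\big(r_\theta(A_1, A_2) - r_\theta(A_1, B_1)\big)\Big]$$

That is, the pair drawn from the same policy should score higher than the pair drawn from different policies.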

Here, A1 and A2 represent samples generated by the same policy model (positive sample pairs); B1 represents a sample generated by a different policy model (negative sample).

Since the "distance" is relative, the two policy models A and B can be arbitrarily selected.

For example, A1 and A2 can be sampled from Qwen 1.5B, and B1 can be sampled from Qwen 72B. In this way, the pre-training corpus of POLAR is very easy to scale.
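A schematic of how such synthetic contrastive data could be assembled; the policy names and the `generate` helper below are placeholders for illustration, not the authors' actual pipeline:

```python
import random

# Illustrative policy pool; the real pool consists of 131 open-source base
# LLMs and 53 chat LLMs (the names below are placeholders).
POLICY_POOL = ["small-base-llm", "large-base-llm", "chat-llm"]

def generate(policy: str, prefix: str) -> str:
    """Placeholder for sampling a continuation of `prefix` from `policy`.
    A real pipeline would call the corresponding LLM here."""
    return f"{prefix} ... (continuation sampled from {policy})"

def make_training_triplet(prefix: str):
    """Build one contrastive triplet for the Bradley-Terry objective:
    A1 and A2 come from the same randomly chosen policy (positive pair),
    B1 comes from a different policy (negative sample)."""
    policy_a, policy_b = random.sample(POLICY_POOL, 2)
    a1 = generate(policy_a, prefix)
    a2 = generate(policy_a, prefix)
    b1 = generate(policy_b, prefix)
    return prefix, a1, a2, b1
```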

In the actual experiment, POLAR-1.8B used a total of 0.94T tokens of pre-training data, and POLAR-7B used a total of 3.6T tokens of pre-training data.

Through pre-training, POLAR learns to give higher rewards to samples generated by policies that are closer to each other, thereby implicitly modeling the differences and distances between policies.

After that, POLAR can align with human preferences using a small amount of preference data during the fine-tuning stage.

Specifically, for the same prompt, three trajectories are sampled, and the preference order is manually annotated. The Bradley-Terry Loss is also used for fine-tuning:
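One pairwise Bradley-Terry form consistent with this setup (an assumption on our part: the top-ranked trajectory A plays the role of the reference, as in pre-training) is roughly:

$$\mathcal{L}_{\text{ft}} = -\,\mathbb{E}\Big[\log \sigma\big(r_\theta(A, B) - r_\theta(A, C)\big)\Big]$$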

Here A > B > C denotes the trajectories ranked best, second-best, and worst.

This preference ranking implicitly defines a "policy difference". For example, A can be regarded as being sampled from the best policy distribution, and C can be regarded as being sampled from a policy distribution far from the best policy.

Scaling effect of POLAR

Figure 3: Scaling Law of POLAR

POLAR shows a scaling effect similar to the Next Token Prediction objective of large language models. This reflects the great potential of POLAR's unsupervised pre-training method.

As Figure 3 shows, the validation-set loss decreases as a power law in the number of model parameters N (fit R value 0.9886), and likewise decreases as a power law in the optimal training compute C (fit R value 0.9912).
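Schematically, these fits follow the familiar power-law form from LLM scaling studies (the exponents $\alpha_N$ and $\alpha_C$ stand for the fitted values, which are not reproduced here):

$$\mathcal{L}(N) \propto N^{-\alpha_N}, \qquad \mathcal{L}(C) \propto C^{-\alpha_C}$$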

These results indicate that allocating more computational resources will continuously bring better POLAR performance.

POLAR's excellent scaling behavior reflects its great potential for building more general and powerful reward models, and it may break through the last remaining bottleneck in scaling up the RL pipeline.

What's the effect?

Through the contrastive learning pre-training method, POLAR not only completely gets rid of the dependence on large-scale preference data but also can be scaled in an unsupervised manner on a large scale.

As a result, with only 1.8B-7B parameters, POLAR outperforms SOTA reward models with more than 70B parameters in downstream RL results, significantly enhancing the accuracy and generalization ability of the reward model.

Figure 4: Results of preference evaluation experiment

In preference evaluation, POLAR shows superior, well-rounded performance, outperforming SOTA reward models on most task dimensions.

For example, on STEM tasks, POLAR-1.8B and POLAR-7B outperform the best baselines by 24.9 and 26.2 percentage points respectively; on general tasks such as reasoning, chat, and creative writing, they pick up subtle differences between trajectories and accurately predict human preferences.

It is worth noting that POLAR-1.8B, with only 1.8B parameters, can achieve results comparable to those of Skywork-Reward-27B and WorldPM-72B-UltraFeedback (with parameter scales 15 times and 40 times that of POLAR-1.8B respectively).