
A 7B model's “emotional intelligence” rivals GPT-4o: Tencent cracks the open-domain RL problem, and the score jumps roughly fivefold.

Quantum Bit (量子位) | 2025-07-18 15:33
Solving the three major dilemmas of large models' “emotional intelligence”.

How should RL be done in open-ended conversations that have no standard answers?

Multi-turn conversations are the most typical open tasks for large models: high frequency, multi-turn, strongly context-dependent, and what constitutes a “good response” varies from person to person.

However, when RL is used to optimize large models' “emotional intelligence” in real interactions, RLVR runs into three major dilemmas:

  • Environmental Dilemma

Real conversations are multi-turn, dynamic, and highly personalized. How can we build an interactive environment that is both realistic and diverse, and allows the model to freely explore (rollout)?

  • Reward Dilemma

There is no standard answer for “high emotional intelligence”. How can we convert users' subjective satisfaction into stable and optimizable long-term rewards?

  • Training Dilemma

How can we achieve stable and efficient multi-turn online RL training on LLMs?

The RLVER (Reinforcement Learning with Verifiable Emotion Rewards) framework proposed by Tencent's Hunyuan Digital Human Team points to a way forward:

By having a stable and high-quality user simulator play the dual roles of “interaction environment” and “reward source” simultaneously, RLVER successfully introduces RLVR into multi-turn conversations, providing an effective and scalable new solution for training large models in open-domain RL.

After being trained by RLVER, the Qwen2.5-7B model's score on the Sentient-Benchmark for emotional conversations jumped from 13.3 to 79.2, performing on par with top commercial models such as GPT-4o and Gemini 2.5 Pro.

The model is now open-source. The link can be found at the end of the article.

RLVER: Building an Effective RL Closed-loop for the Open Problem of “Emotional Intelligence”

Traditional dialogue optimization either relies on static data or expensive manual annotation.

RLVER proposes a new path: centered around a user simulator that integrates “environment + reward”, it ingeniously solves the above three challenges.

The Simulator as the Environment: Creating a “Living” Dialogue World

The RLVER team realized that true “high emotional intelligence” varies from person to person. Therefore, the user simulator built by RLVER is not just a simple dialogue robot.

It comes with a variety of user profiles and interaction scenarios (different personalities, dialogue backgrounds, and latent needs), and can simulate a vast number of realistic, ever-changing users.

Each user interacts with the model independently and dynamically, updates their own emotional state in real-time based on the model's response, and gives personalized responses.

This provides the model with an online learning environment that can be infinitely explored, full of realism and diversity, while avoiding reward hacking.
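To make the environment side concrete, here is a minimal sketch of a profile-conditioned user simulator that tracks a running emotional state. The class names, fields, and placeholder functions are assumptions for illustration, not the RLVER implementation (which drives both steps with an LLM):

```python
from dataclasses import dataclass, field
import random

@dataclass
class UserProfile:
    # Hypothetical persona card: personality, dialogue background, latent need.
    personality: str
    scenario: str
    hidden_need: str

def simulate_mood_change(profile: UserProfile, history: list, reply: str) -> float:
    # Stand-in for the LLM simulator's judgment of how the reply shifts the user's
    # mood; RLVER derives this via explicit reasoning, not a random draw.
    return random.uniform(-1.0, 1.0)

def simulate_user_turn(profile: UserProfile, history: list, mood: float) -> str:
    # Stand-in for the LLM simulator's next user utterance.
    return f"(simulated user, mood {mood:+.2f}) ..."

@dataclass
class SimulatedUser:
    profile: UserProfile
    mood: float = 0.0                              # running emotional state
    history: list = field(default_factory=list)    # full dialogue so far

    def react(self, assistant_reply: str) -> tuple[str, float]:
        """Update the emotional state from the model's reply, then emit the next user turn."""
        self.history.append(("assistant", assistant_reply))
        self.mood += simulate_mood_change(self.profile, self.history, assistant_reply)
        turn = simulate_user_turn(self.profile, self.history, self.mood)
        self.history.append(("user", turn))
        return turn, self.mood

user = SimulatedUser(UserProfile("anxious perfectionist", "missed a project deadline",
                                 "wants reassurance, not a lecture"))
print(user.react("That sounds really stressful. What part is weighing on you most?"))
```

Each rollout can sample a fresh profile, so the policy keeps meeting different “people” instead of overfitting to one scripted user.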

The Simulator as the Reward: A Trustworthy “User Experience Scoring System”

The evaluation of “emotional intelligence” is essentially the user's subjective experience. But how can this subjective experience be transformed into stable and optimizable rewards?

Based on the SAGE framework, RLVER simulates the user's emotional changes after each round of dialogue through an explicit and reproducible reasoning process.

After the dialogue ends, the accumulated “total mood score” becomes the reward signal, directly driving the PPO/GRPO algorithm to optimize the model.

This design gets rid of the “black-box scorer” and explicitly models “user satisfaction” as a logically controllable reward function, making the training process more stable, transparent, and trustworthy.
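As a toy illustration of that reward design, the per-turn mood changes produced by the simulator can simply be accumulated into one dialogue-level score. The function below is a sketch under that assumption, not the paper's exact formula:

```python
def trajectory_reward(mood_deltas: list[float]) -> float:
    # The dialogue-level reward is the accumulated mood change over all turns.
    # In RLVER each delta comes from the simulator's explicit, reproducible
    # reasoning about the user's emotion; here the deltas are assumed as given.
    return sum(mood_deltas)

# A 4-turn conversation in which the user's mood mostly improves:
print(trajectory_reward([0.5, 1.0, -0.2, 0.8]))  # ≈ 2.1
```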

Global Reward Optimization: From Single-turn Feedback to “Global Emotional Trajectory” Optimization

Different from the sentence-by-sentence feedback method, RLVER focuses on the emotional change trend of the entire dialogue and only uses the final “total emotional score” as the reward to guide the model to optimize long-term strategies.

Only by truly understanding the user's intention and sustaining the user's positive emotions over the long run can the model obtain a higher total reward. This encourages the model to break out of local optima and learn more generalizable, strategic social-dialogue behaviors.
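One common way to use such a single end-of-dialogue score in PPO-style training is to place it on the final turn and let the advantage estimator propagate credit backwards, or alternatively to spread it evenly over the turns. The sketch below shows both options; it is an assumption about implementation detail, not the paper's documented scheme:

```python
def assign_trajectory_reward(num_turns: int, final_score: float,
                             last_turn_only: bool = True) -> list[float]:
    # Turn one dialogue-level score into per-turn rewards for the RL update.
    if last_turn_only:
        # Credit flows back to earlier turns via discounting / the value function.
        return [0.0] * (num_turns - 1) + [final_score]
    # Alternatively, spread the score evenly across turns.
    return [final_score / num_turns] * num_turns

print(assign_trajectory_reward(4, 2.1))                        # [0.0, 0.0, 0.0, 2.1]
print(assign_trajectory_reward(4, 2.1, last_turn_only=False))  # [0.525, 0.525, 0.525, 0.525]
```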

Core Achievement: A 7B Model That Competes with the Flagships of Industry Giants

After RLVER training, Qwen2.5-7B jumped from 13.3 to 79.2 on the Sentient-Benchmark for emotional conversations, matching top commercial models such as GPT-4o and Gemini 2.5 Pro.

More importantly, the model showed almost no decline in general abilities such as mathematics and coding, successfully avoiding “catastrophic forgetting”.

In addition, RLVER's impact on the model's behavioral style is striking: the model shifts from a “problem-solving style” to an “emotional style”, and its internal reasoning is no longer “how do I solve this problem” but “I can understand how you feel”.

Deep Insights: From Thinking to Action

During the training practice of RLVER, the research team also made some inspiring discoveries.

Insight 1: “Thoughtful” vs. “Reactive” Models — Two Paths to “Empathy”

RLVER introduces an explicit think-then-say prompt template (a minimal sketch appears at the end of this insight): before each response turn, the model must first perform emotional analysis and strategy reasoning, and only then generate the final reply. By comparing models trained with and without this “thinking” step, the research team observed two completely different paths to “empathy”:

“Thoughtful Model”: Towards “Deep Understanding”

The explicit thinking chain prompts the model to reason before generation, significantly improving two core abilities:

  • Problem insight: identifying the real causes and potential needs behind the user's emotions;

  • Empathetic expression and verification: accurately capturing and reflecting deep emotions, making the user “feel understood”.

This type of model is more like a “soulmate”: good at listening quietly, responding accurately, and building deep emotional connections through language.

“Reactive Model”: Towards “Quick Action”

In contrast, the model without guided thinking directly generates responses. Although it is slightly inferior in terms of insight and empathy, it spontaneously develops an “action-oriented” compensation strategy:

  • Quickly judging the user's dilemma and offering concrete, actionable suggestions or personalized invitations to act;

  • Compensating for the lack of emotional understanding with “practicality”, taking on the role of an “action-oriented partner”.

This comparison reveals an interesting phenomenon in RL training for open and complex tasks: when the model's capabilities are limited, it will spontaneously find strategic “compensation paths”. The diverse and multi-strategy compatible training environment provided by RLVER is the key soil for the evolution of such diverse behaviors.
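For concreteness, here is a minimal sketch of what a think-then-say template and its parsing might look like. The tag names and wording are assumptions for illustration, not the exact format used in RLVER:

```python
import re

# Hypothetical instruction appended to the system prompt.
THINK_THEN_SAY = (
    "Before replying, analyse the user's emotion and choose a strategy inside "
    "<think>...</think>, then write the reply the user will see inside <answer>...</answer>."
)

def parse_think_then_say(model_output: str) -> tuple[str, str]:
    # Split a model output into its hidden reasoning and the visible reply.
    think = re.search(r"<think>(.*?)</think>", model_output, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", model_output, re.DOTALL)
    return (
        think.group(1).strip() if think else "",
        answer.group(1).strip() if answer else model_output.strip(),
    )

reasoning, reply = parse_think_then_say(
    "<think>The user sounds exhausted; acknowledge the feeling before suggesting anything.</think>"
    "<answer>That sounds like a really draining week. I'm here, tell me more.</answer>"
)
print(reasoning)
print(reply)
```

Only the parsed reply would be shown to the simulated user; the reasoning stays internal.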

Insight 2: PPO vs. GRPO — Stable Growth or Capability Breakthrough?

In terms of optimization algorithms, the RLVER team also reached practical conclusions:

  • GRPO: tends to bring more stable and balanced capability growth.

  • PPO: can push the model's capabilities in specific dimensions (such as depth of empathy and core insight) to a higher ceiling.
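A quick sketch of where the two differ mechanically: GRPO normalizes rewards within a group of rollouts on the same scenario instead of relying on PPO's learned value function. The code below illustrates that group-relative advantage under these standard definitions:

```python
def grpo_advantages(group_rewards: list[float], eps: float = 1e-6) -> list[float]:
    # Group-relative advantages: each rollout's reward is normalized against the
    # mean and standard deviation of its group, so no separate critic is needed
    # (PPO, by contrast, estimates advantages with a learned value function).
    mean = sum(group_rewards) / len(group_rewards)
    std = (sum((r - mean) ** 2 for r in group_rewards) / len(group_rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in group_rewards]

# Four rollouts on the same simulated user, scored by final mood:
print(grpo_advantages([2.1, 0.4, -1.0, 3.0]))
```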

This leads to an interesting strategic consideration: for complex multi-dimensional capabilities like “emotional intelligence”, after the model has reached the “passing line” in all aspects, should it continue to be a “well-rounded warrior” or focus on building one or two “killer” dimensions?

In the paper's experiments, the latter delivered better overall performance.

Insight 3: The Impact of the Environment and Reward Style — A Strict Teacher Does Not Necessarily Produce Excellent Students

In the RLVER framework, the user simulator plays the dual roles of “training environment” and “reward model”. Its style, that is, how accepting the simulated user is and how it gives feedback, therefore has a direct impact on the model's learning path.

A natural question is: Will a more demanding user train a stronger model?

The experimental answer is: Harder is not always better.

The RLVER team built two types of user simulators:

  • Vanilla version: emotions are openly expressed, feedback is positive, and acceptance is relatively high;

  • Challenging version: emotions are reserved, feedback is restrained, and the bar for response quality is extremely high.
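The two simulator styles can be thought of as configuration profiles along a few axes. The fields below are hypothetical knobs for illustration, not the paper's actual configuration:

```python
from dataclasses import dataclass

@dataclass
class SimulatorStyle:
    emotion_expressiveness: float  # how openly the simulated user shows emotion (0-1)
    feedback_positivity: float     # how readily mood improves after a decent reply (0-1)
    quality_bar: float             # reply quality required before mood moves at all (0-1)

VANILLA = SimulatorStyle(emotion_expressiveness=0.9, feedback_positivity=0.8, quality_bar=0.3)
CHALLENGING = SimulatorStyle(emotion_expressiveness=0.3, feedback_positivity=0.4, quality_bar=0.8)
```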

After training and testing with the same initial model, the RLVER team found:

An overly difficult environment is not conducive to the model's early growth

Although the Challenging simulator is more realistic by design, its feedback is implicit and its fault tolerance is low, so in the early stage of training the model struggles to explore diverse strategies through trial and error and to obtain positive rewards. This can trap RL training in a vicious cycle of “no feedback → no learning → collapse”.

By contrast, the Vanilla simulator's feedback is relatively forgiving and positive, which better supports strategy exploration and capability accumulation early in training and helps the model form stable habits of empathetic expression.

Strategic implication: when using reinforcement learning to optimize open tasks (such as “emotional intelligence”), the training environment should not simply pile on difficulty; it should be designed around a “growth curve”. The premise of “a strict teacher produces excellent students” is that the student can already understand the teachings.

In the early stage when the capabilities are still shallow, a gentle and learnable “training partner user” can actually help the model grow into a true empathizer.

The model with thinking is more “resilient”

An additional interesting discovery is that in the Challenging environment, the model with an explicit “thinking structure” is significantly more robust:

  • Its overall score decreases, but it remains at a usable level;

  • The model without a thinking structure almost completely collapses, with its score dropping to 19.8.

This indicates that explicit reasoning can buffer the training instability caused by sparse rewards: even without clear feedback, the model can still extract signals about the user's needs through “internal analysis” and retain a degree of adaptability.

Previous work: Can AI Become an Emotional Master? Tencent Releases the Latest AI Social Intelligence List, and the Latest Version of GPT-4o Takes the First Place

Paper address: https://arxiv.org/abs/2507.03112

Project code: https://github.com/Tencent/digitalhuman/tree/main/RLVER

Open-source model: https://huggingface.co/RLVER

This article is from the WeChat official account “Quantum Bit”, author: Tencent's Hunyuan AI Digital Human Team, published by 36Kr with authorization.