For the first time, RL has been shown to give 3D generation models the ability to reason, with a dramatic jump in generation quality under complex text descriptions.
Reinforcement Learning (RL) has achieved remarkable results in image generation. What about 3D generation?
When GRPO brought a qualitative leap in the mathematical and code reasoning abilities of large models, a research team was the first to offer an answer: the first study to systematically introduce reinforcement learning into text-to-3D autoregressive generation has been accepted to CVPR 2026. Rather than simply transplanting 2D experience, the work conducts a comprehensive, systematic exploration targeted at the unique challenges of 3D generation, spanning reward design, algorithm selection, evaluation benchmarks, and training paradigms.
Why is 3D generation much more difficult than 2D?
RL has been proven effective in text and image generation, but directly applying it to 3D generation doesn't work.
The core tension: a 3D object has no single "standard view". A person can tell at a glance whether an image is correct, but a 3D object must be judged from multiple viewpoints simultaneously for geometric consistency, texture quality, and semantic alignment. If any one dimension is poorly designed, training fails.
A deeper problem: in the autoregressive decoding of 3D generation models, each token implicitly commits to the overall structure. This long-range dependency makes the sparse-reward problem more pronounced in 3D than in 2D, so it is hard for the model to perceive where, mid-sequence, things went wrong.
The research team dissected this problem into four dimensions for systematic study:
- Reward model design: which type of reward signal is most effective for 3D generation?
- RL algorithm selection: which GRPO variants suit the sequential nature of 3D?
- Evaluation: can existing benchmarks truly measure the reasoning ability of 3D generation?
- Paradigm upgrade: how can RL exploit the hierarchical structure of 3D generation?
The choice of reward model matters more than expected
Core insight: Human preference is the "key", and other rewards are "bonus points".
The research tested various reward combinations across four dimensions: human preference score (HPS v2.1), semantic alignment (CLIP Score), aesthetic quality, and 3D geometric consistency. The conclusions are clear:
- HPS v2.1 (human preference score) alone is the strongest single reward, directly determining the floor of the model's generation quality.
- Semantic alignment and aesthetic quality bring limited improvement on their own, but layered on top of human preference they continue to raise scores, forming a complementary relationship.
The most unexpected finding: a general-purpose multimodal model (Qwen2.5-VL) is more robust at evaluating 3D consistency than dedicated models. The reason is that no off-the-shelf dedicated reward model for 3D geometric consistency currently exists, and thanks to its broad understanding of spatial relationships, a general multimodal large model can fill this gap with a more stable reward signal.
Practical takeaway: don't expect to find a "universal reward". Make human preference the core, then layer on geometric consistency and semantic alignment in a multi-dimensional reward integration that covers every dimension of 3D generation quality.
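As a concrete illustration, a reward integration of this kind can be sketched as a weighted sum with human preference as the dominant term. All function names and weights below are hypothetical stand-ins for illustration, not the paper's actual implementation:

```python
def combined_reward(hps, clip, geo, aes,
                    w_hps=1.0, w_clip=0.3, w_geo=0.3, w_aes=0.2):
    """Weighted reward integration: human preference (HPS) is the core
    signal; semantic alignment (CLIP), geometric consistency, and
    aesthetics add complementary signal on top. Weights are illustrative."""
    return w_hps * hps + w_clip * clip + w_geo * geo + w_aes * aes
```

With all auxiliary scores at zero, this reduces to pure HPS, the strongest single reward in the study's comparison; the auxiliary terms then only ever add score on top of that core.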
Token-level vs. sequence-level: a crucial but overlooked choice
Core insight: 3D generation is naturally suited to token-level optimization; sequence-level operations can even be counterproductive.
The research systematically compared three algorithms, GRPO, DAPO, and GSPO, revealing an important pattern:
- Averaging the loss at the token level (DAPO's core improvement) brings the most significant gain. The reason: the global structural differences of a 3D object are reflected in every token of the autoregressive sequence, so a token-level average loss perceives the quality deviation at each generation step more precisely.
- Sequence-level operations (GSPO's idea) work well for math and code tasks but yield extremely limited benefit in 3D generation, where the key signals are sparse and drowned in a large number of neutral tokens.
- Dynamic sampling is a low-cost, high-return technique: it significantly stabilizes the training curve and avoids oscillations caused by excessive reward variance.
- Completely removing the KL penalty degrades performance. KL divergence still plays an important regularizing role in 3D generation, preventing the policy from drifting too far from the reference distribution.
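The token-level vs. sequence-level distinction, and the dynamic-sampling filter, can be made concrete with a toy sketch (a pure illustration under simplified assumptions; the function names and toy losses are not from the paper):

```python
def sequence_level_mean(losses_per_seq):
    """GRPO/GSPO-style: average within each sequence first, then across
    sequences, so every sequence carries equal weight regardless of length."""
    per_seq = [sum(seq) / len(seq) for seq in losses_per_seq]
    return sum(per_seq) / len(per_seq)

def token_level_mean(losses_per_seq):
    """DAPO-style: average over all tokens in the batch, so every token
    carries equal weight and per-step quality deviations are not diluted."""
    flat = [tok for seq in losses_per_seq for tok in seq]
    return sum(flat) / len(flat)

def dynamic_sampling_filter(reward_groups, eps=1e-6):
    """Keep only prompt groups whose rollout rewards actually differ;
    groups with (near-)identical rewards yield zero advantage and only
    add variance to training."""
    return [g for g in reward_groups if max(g) - min(g) > eps]
```

On a toy batch of two sequences with token losses [1.0, 1.0] and [4.0], the sequence-level mean is 2.5 while the token-level mean is 2.0, showing how the two weightings diverge on variable-length sequences.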
There are also clear conclusions on data scaling:
Doubling the training data helps, but tripling the number of iterations leads to overfitting: the model starts memorizing preference features by rote, and its generalization to rare object categories drops significantly. In RL training for 3D generation, data diversity matters more than training duration.
MME-3DR: why can't existing benchmarks evaluate 3D reasoning ability?
Existing 3D generation benchmarks (such as ShapeNet and Toys4K) mainly focus on object diversity and cannot measure a model's implicit reasoning under complex text descriptions. For example, they cannot evaluate fine-grained semantic alignment for a prompt like "a chair with wooden armrests and slightly worn legs, viewed from a 45-degree top-down angle from the left rear".
To address this, the team proposed the MME-3DR benchmark of 249 carefully selected complex 3D objects. Its evaluation dimensions span three levels, multi-view geometric consistency, semantic detail alignment, and texture realism, and it is specifically designed to measure generation performance in reasoning-intensive scenarios. MME-3DR is built so that models that merely memorize training data cannot score well, truly separating raw generation ability from generalizing reasoning ability.
AR3D-R1 outperforms existing SOTA methods such as Trellis on both the MME-3DR and Toys4K benchmarks, with a Kernel Distance of 0.156, verifying the substantial improvement in reasoning ability brought by RL training.
Hi-GRPO and AR3D-R1: 3D generation is naturally coarse-to-fine
Core insight: 3D generation is inherently hierarchical, so the RL paradigm should be hierarchical too.
During training the team observed an interesting phenomenon: the model first learns the global geometric shape in early iterations and only later refines texture details. This mirrors how humans perceive 3D objects (outline first, details second). Inspired by this, the research proposes the Hi-GRPO (Hierarchical GRPO) framework:
Stage 1 (coarse-grained): chain-of-thought reasoning produces high-level semantics and a rough geometric shape; a dedicated reward focuses on geometric consistency and overall structural integrity.
Stage 2 (fine-grained): low-level visual reasoning, conditioned on the CoT output of Stage 1, produces fine texture details; a dedicated reward focuses on appearance quality and component integrity.
Each stage uses its own independent reward integration, avoiding interference between geometric and texture rewards and giving the model the most accurate learning signal at each stage. This hierarchical design encodes the structural prior of 3D generation directly into the RL training paradigm.
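The two independent reward integrations might look like the following minimal sketch, where every function name, weight, and score key is a hypothetical placeholder rather than the authors' code:

```python
def coarse_reward(geometry_score, structure_score, w=0.5):
    """Stage 1 reward: only geometric consistency and structural integrity."""
    return w * geometry_score + (1 - w) * structure_score

def fine_reward(appearance_score, component_score, w=0.5):
    """Stage 2 reward: only appearance quality and component integrity."""
    return w * appearance_score + (1 - w) * component_score

def hi_grpo_rewards(scores):
    """Return the two stage rewards separately: they are never summed
    into one scalar, so geometry and texture signals cannot interfere."""
    r_coarse = coarse_reward(scores["geometry"], scores["structure"])
    r_fine = fine_reward(scores["appearance"], scores["component"])
    return r_coarse, r_fine
```

The design point is the separation itself: each stage's policy update sees only its own reward, mirroring how Hi-GRPO keeps geometric and texture learning signals decoupled.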
Quantitative results of the final model, AR3D-R1:
- The CLIP score rose from 22.7 to 29.3, roughly a 29% gain, indicating a significant improvement in semantic alignment.
- Kernel Distance dropped by roughly 37%, indicating a geometric distribution closer to that of real 3D objects.
- It outperforms existing SOTA methods such as Trellis on both MME-3DR (249 complex objects) and Toys4K, with the reasoning gains especially prominent under complex text descriptions.
Summary: RL for 3D generation needs to be customized
The core contribution of this work is not just a better 3D generation model but a systematic research framework for the field: for anyone looking to bring RL into 3D generation, it shows which rewards to test, which algorithms to pick, which benchmarks to evaluate on, and how to design a training paradigm that matches 3D structural priors.
As the authors ask in the paper's title, "Are we ready to use RL in text-to-3D generation?", this work answers: yes, but only if you customize rewards, algorithms, and training paradigms for 3D rather than simply copying 2D experience.
As RL matures in the language and image domains, the value of this methodology will extend beyond 3D generation itself, offering reusable ideas for RL in a wider range of multimodal generation tasks. The code has been open-sourced; readers are welcome to explore it.
Paper link:
https://arxiv.org/pdf/2512.10949 (CVPR 2026)
Code link:
https://github.com/Ivan-Tang-3D/3DGen-R1
This article is from the WeChat official account "QbitAI". Author: Focus on cutting - edge technology. Republished by 36Kr with permission.