3B Model Matches Opus 4.5 in Programming: The Mysterious Model Behind the Heated Debate Is Chinese-Made

Can small language models also have strong reasoning capabilities?

In recent days, a 3B-parameter model has drawn wide attention on X. On some verifiable reasoning tasks, such as programming, it has entered the performance range of frontier models like Gemini 3 Pro, GPT-5 high, Claude Opus 4.5, GLM-5, and Kimi K2.5, despite being far smaller than any of them.

The model is called VibeThinker-3B, a dense reasoning model with 3 billion parameters. It aims to explore how far verifiable reasoning ability can be pushed within the strict size limits of a small model.

After the model was released, many people were amazed by its results and said they wanted to try it out.

Notably, it is a Chinese model, developed by the Sina Weibo team.

According to the technical report, the model is designed for tasks with reliable verification signals, including mathematical reasoning, competitive programming, STEM reasoning, and instruction following with clear constraints.

It posts strong results on benchmarks in these areas: 94.3 on AIME26, 89.3 on HMMT25, 80.2 (Pass@1) on LiveCodeBench v6, and a 96.1% pass rate in the latest unpublished weekly and biweekly LeetCode contests from April 25 to May 31, 2026.
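
For readers unfamiliar with the Pass@1 figure quoted above, here is a minimal sketch of the widely used unbiased pass@k estimator. It is not the evaluation code from the technical report; the function names and the sample counts in the usage note are illustrative assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: number of samples generated for the problem
    c: number of those samples that passed every test
    k: sampling budget being evaluated (k = 1 gives Pass@1)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random
    # size-k subset of the n samples contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to the fraction of passing samples, so a hypothetical problem with 3 passing samples out of 10 contributes 0.3 to the benchmark average.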

How was this model trained? The technical report reveals some details.

The model is built on Qwen2.5-Coder-3B and post-trained with an upgraded Spectrum-to-Signal process. This process strengthens data synthesis, quality filtering, and curriculum learning in supervised fine-tuning (SFT); extends MGPO-style reinforcement learning to multiple verifiable domains; retains complete long-context reasoning trajectories; and consolidates the resulting capabilities through offline self-distillation and instruction reinforcement learning (Instruct RL).

Figure: Overall training process of VibeThinker-3B (the Spectrum-to-Signal process).

In addition, VibeThinker-3B introduces Claim-Level Reliability assessment (CLR), a test-time scaling strategy for answer-verifiable reasoning. CLR further improves scores on mathematical benchmarks, raising AIME26 from 94.3 to 97.1, HMMT25 from 89.3 to 95.4, and BruMO25 to 99.2.

The training process has four stages:

Curriculum-based two-stage SFT. The first stage focuses on broad capability coverage across mathematics, programming, STEM reasoning, general dialogue, and instruction following. The second stage shifts to harder, longer-horizon reasoning samples. Diversity-exploring distillation is used to retain multiple valid solution paths.
Multi-domain reasoning reinforcement learning. VibeThinker-3B reuses MGPO. Reinforcement learning is applied sequentially to mathematical, programming, and STEM reasoning tasks. Training uses a single 64K long-context window to retain complete long reasoning trajectories.
Offline self-distillation. High-quality trajectories are filtered and refined from the mathematical, programming, and STEM RL checkpoints and distilled into a unified student model. Learning potential scores are used to prioritize trajectories that are correct but not yet well imitated by the student.
Instruct RL. The final stage improves controllability on user-facing prompts. For format-sensitive and open-ended instruction data, rule-based validators and rubric-based reward models are used.
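
The article does not give the formula behind the learning potential score in the offline self-distillation stage. As a rough illustration of the stated idea (prefer trajectories that are correct but that the student does not yet reproduce well), the sketch below assumes the score is the student's mean per-token negative log-likelihood on a verified-correct trajectory. The `Trajectory` class, the function names, and that scoring rule are all hypothetical and are not taken from the technical report.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    prompt_id: str
    is_correct: bool               # verdict from a verifier (assumed to exist)
    token_logprobs: list[float]    # student's log-probability of each token


def learning_potential(traj: Trajectory) -> float:
    """Hypothetical score: student's mean negative log-likelihood.

    Incorrect or empty trajectories score 0.0 so they are never preferred.
    A higher score means the student imitates the trajectory less well.
    """
    if not traj.is_correct or not traj.token_logprobs:
        return 0.0
    return -sum(traj.token_logprobs) / len(traj.token_logprobs)


def select_for_distillation(trajs: list[Trajectory], top_k: int) -> list[Trajectory]:
    """Keep the top_k correct trajectories with the highest score."""
    correct = [t for t in trajs if t.is_correct]
    return sorted(correct, key=learning_potential, reverse=True)[:top_k]
```

Under this assumed rule, a correct trajectory the student already assigns high probability to ranks low, since training on it again would teach the student little.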

In a recent post, the well-known AI researcher and blogger Sebastian Raschka systematically summarized the key points disclosed in the VibeThinker-3B technical report.

Readers interested in the details can consult the technical report. The model is also publicly available for download.

Report title: VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models

Report link: https://arxiv.org/pdf/2606.16140

HuggingFace link: https://huggingface.co/WeiboAI/VibeThinker-3B

However, the application scope of this model is clearly limited because it does not perform well in fields that require general knowledge.

The team itself points this out and proposes the "parameter compression coverage hypothesis": different capabilities depend on model parameters in very different ways. Verifiable reasoning is closer to a highly compressible, parameter-dense ability whose core lies in multi-step reasoning, constraint satisfaction, self-correction, and answer verification. When the task space is clearly structured and the feedback signal is reliable enough, a compact model may also reach near-frontier reasoning ability. In contrast, open-domain knowledge, general dialogue, and long-tail scenario understanding rely more on large parameter counts to cover facts, concepts, and world knowledge broadly.

The hypothesis is thought-provoking. VentureBeat wrote in its report: "It reveals that there is a partial decoupling between reasoning ability and factual knowledge, and the former can be compressed more effectively than previously thought. This insight has far-reaching implications for how the industry views model design, deployment costs, and the spread of advanced AI capabilities."

The authors said that their goal is not to build a small model that replaces large-scale models, but to examine the real limits of small models along specific capability dimensions. With VibeThinker-3B, they hope to show that small models should not be regarded merely as a compromise for reducing deployment costs. In capability areas with clear feedback and verification mechanisms, small language models offer a promising research path: they are expected to reach frontier performance and to complement the traditional paradigm of scaling up parameters.

The model still faces some skepticism in the community. If you are curious about it, it is worth trying for yourself.

Reference links:

https://x.com/orcus108/status/2066876960073281582

This article is from the WeChat official account "MachineHeart" (ID: almosthuman2014), author: Zhang Qian. It is published by 36Kr with authorization.

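The views expressed in this article are solely those of the author. The 36Kr platform only provides information storage services.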
