Qwen was cited 38 times in Thinking Machines Lab's latest blog post on on-policy distillation.
Thinking Machines Lab (TML for short), the startup that doesn't publish papers but loves to post blogs, has just updated again with a post titled "On-policy Distillation".
On-policy distillation is a training method that combines the error-correcting relevance of reinforcement learning (RL) with the reward density of supervised fine-tuning (SFT). Applying it to mathematical reasoning and an internal chat assistant, TML found that on-policy distillation can outperform other methods at a very low cost.
According to the company's CEO, Mira Murati, the method can give small models strong domain-specific performance along with the ability to keep learning continuously.
Notably, in this new blog, TML clearly stated that the work was inspired by research from the Qwen team, and the Qwen3 series of models was used extensively in its experiments. In fact, the keyword "Qwen" appears 38 times in the original English blog - one more than the 37 times Lei Jun mentioned "Apple" at the Xiaomi 17 series launch event.
As a star startup, TML's every update attracts wide attention. Some people summarized the advantages of the new method, and some netizens even praised TML as "the real OpenAI".
Blog address: https://thinkingmachines.ai/blog/on-policy-distillation/
The main author of this blog is Kevin Lu, a researcher at Thinking Machines Lab. He previously worked at OpenAI, where he led the release of 4o-mini and contributed to models such as the GPT-5 series, GPT-oss, o3 & o4-mini, 4.1-nano & 4.1-mini, o1-mini, and o3-mini.
Now let's take a detailed look at the content of this blog.
Large language models (LLMs) can demonstrate expert-level performance in specific domains. This is the result of several capabilities working together, including perceiving the input, retrieving knowledge, planning and selecting among options, and executing reliably.
To achieve this, a series of training methods are required. We can roughly divide them into three stages:
Pre-training: Teach general abilities, such as language use, broad reasoning, and world knowledge.
Mid-training: Impart domain knowledge, such as code, medical databases, or company internal documents.
Post-training: Guide the target behavior, such as following instructions, solving mathematical problems, or chatting.
In specific professional domains, small models that have undergone intensive training often outperform large general models. There are many benefits to using small models:
For privacy or security reasons, they can be deployed locally.
They can be more easily continuously trained and updated.
They can also save inference costs.
To take advantage of these benefits, the right method needs to be selected for the subsequent stages of training.
The methods for post-training "student" models can be divided into two types:
On-policy training: Sample trajectories (rollouts) from the student model itself and assign some kind of reward to these trajectories.
Off-policy training: Rely on the target outputs from an external source, and the student model needs to learn to imitate these outputs.
For example, we may want to train a compact model to solve a given mathematical problem.
We can perform on-policy training through reinforcement learning (RL). Specifically, we score each trajectory of the student model based on whether it solves the problem. This scoring can be done manually or by a "teacher" model that can reliably give the correct answer.
The advantage of on-policy training is that the student can more directly learn to avoid mistakes by training on its own samples.
However, RL has a major drawback: its feedback is very sparse. No matter how many tokens a rollout spends, each training episode teaches only a fixed amount of information.
In our above example, the student only knows that "21" is the wrong answer and updates the model to avoid generating this trajectory. But it doesn't learn what exactly went wrong - whether it was a mistake in the order of operations or an error in the arithmetic itself. This sparsity of feedback makes RL inefficient in many applications.
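To make the sparsity concrete, here is a minimal sketch of what such a sequence-level reward looks like; the "Answer:" convention and the example strings are assumptions for illustration, not the blog's actual grader.

```python
# The entire rollout earns a single scalar based only on the final answer, so the
# student gets the same amount of feedback whether it spent 10 tokens or 10,000.

def sparse_reward(rollout_text: str, reference_answer: str) -> float:
    """Return 1.0 if the rollout's final answer matches the reference, else 0.0."""
    # Assumed convention: the model ends its rollout with "Answer: <value>".
    final_answer = rollout_text.rsplit("Answer:", 1)[-1].strip()
    return 1.0 if final_answer == reference_answer else 0.0

# Every token in the rollout shares this one bit of credit; nothing in the signal
# says whether the mistake was in the order of operations or in the arithmetic.
```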
Off-policy training is usually done through supervised fine-tuning (SFT), that is, training on a set of carefully curated, annotated examples for a specific task. The source of these annotated examples can be a teacher model that performs well on the current task.
We can use a mechanism called distillation: train the student model to match the output distribution of the teacher model. We train on the teacher's trajectories, which are the complete sequences of generated tokens, including the intermediate thinking steps.
At each step, we can either match the teacher's complete next-token distribution (often called "logit distillation") or train only on the sampled sequence itself. Sampled sequences provide an unbiased estimate of the teacher's distribution, so they achieve the same goal in expectation. The student then updates toward each token in the sequence in proportion to how unlikely it currently finds that token (in the blog's illustration, low-probability tokens are shown in a darker color).
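As a rough illustration of these two options, here is a minimal PyTorch sketch (not the blog's code); the tensor shapes and function names are assumptions.

```python
import torch
import torch.nn.functional as F

def logit_distillation_loss(student_logits: torch.Tensor,
                            teacher_logits: torch.Tensor) -> torch.Tensor:
    # Match the teacher's full next-token distribution at every position
    # (cross-entropy to the teacher's distribution, i.e. forward KL up to a constant).
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    student_logp = F.log_softmax(student_logits, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

def sampled_sequence_loss(student_logits: torch.Tensor,
                          teacher_tokens: torch.Tensor) -> torch.Tensor:
    # Plain cross-entropy on tokens the teacher actually sampled; averaged over many
    # teacher samples, this estimates the same objective as above.
    return F.cross_entropy(student_logits.flatten(0, 1), teacher_tokens.flatten())

# Expected shapes: logits (batch, seq_len, vocab); teacher_tokens (batch, seq_len).
```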
It has been proven that distilling large teacher models is very effective in training small models, enabling them to:
Follow instructions
Perform mathematical and scientific reasoning
Extract clinical information from medical notes
Participate in multi-turn chat conversations
The distillation datasets used for these and other applications are usually open-source and publicly available.
The drawback of off-policy training is that the student learns in the contexts that the teacher often encounters, rather than in the contexts that the student itself will often encounter in the future.
This may lead to compounding errors: if the student makes a mistake early on that the teacher has never made, it will find itself increasingly deviating from the states observed during training.
This problem becomes particularly prominent when we care about the student's performance on long sequences. To avoid this deviation, the student must learn to recover from its own mistakes.
Another problem observed in off-policy distillation is that the student can learn to imitate the teacher's style and confidence, but not necessarily its factual accuracy.
For example: if you are learning chess, on-policy RL is like playing chess by yourself without a coach. The feedback on winning or losing a game is directly related to your own playing style, but you only receive feedback once per game, and it doesn't tell you which moves contributed the most to the result. Off-policy distillation is similar to watching a grandmaster play chess - you observe very sophisticated moves, but these moves are made in game states that novice players rarely encounter.
We hope to combine the on-policy relevance of RL with the dense reward signal of distillation.
For learning chess, this is like having a teacher to score each of your moves, from "completely wrong" to "wonderful". For the post-training of LLMs, this is on-policy distillation.
On-policy Distillation - Combining the Best of Both Worlds
The core idea of on-policy distillation is to sample trajectories from the student model and use a high-performance teacher model to score each token of each trajectory.
Going back to our above mathematical example, on-policy distillation will score each step of solving the problem, punishing the wrong steps that lead the student to the wrong answer while reinforcing the correct steps.
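Concretely, the scoring step can be pictured as follows. This is a tensor-level sketch with assumed shapes, not the Tinker implementation: the student has already produced a rollout, the teacher runs a single forward pass over it, and we read off how much probability the teacher assigns to each student-generated token.

```python
import torch
import torch.nn.functional as F

def per_token_teacher_scores(teacher_logits: torch.Tensor,
                             rollout_ids: torch.Tensor,
                             prompt_len: int) -> torch.Tensor:
    """Score each token of a student-generated rollout with the teacher.

    teacher_logits: (seq_len, vocab) from ONE teacher forward pass over the rollout.
    rollout_ids: (seq_len,) token ids of the prompt + student-generated continuation.
    Returns log pi_teacher(x_t | x_<t) for every generated token: high values mean the
    teacher would plausibly have produced that token too, low values flag likely errors.
    """
    logp = F.log_softmax(teacher_logits[:-1], dim=-1)        # position t predicts token t+1
    token_logp = logp.gather(-1, rollout_ids[1:, None]).squeeze(-1)
    return token_logp[prompt_len - 1:]                       # keep only generated positions
```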
In this article, we explored the application of on-policy distillation in the following tasks:
1. Train the model for mathematical reasoning.
2. Train an assistant model with both domain knowledge and instruction-following ability.
We applied on-policy distillation to models that already have the basic abilities of pre-training and mid-training. We found that this is a cheap and powerful post-training method that successfully combines the advantages of on-policy training with a dense reward signal.
Our on-policy distillation work draws on DAgger (Ross et al., 2010), an iterative SFT algorithm that incorporates the teacher's evaluation of the states visited by the student.
It is also similar to process reward modeling (Lightman et al., 2023), an RL method that scores each step in the student model's chain of thought.
We extended the previous on-policy distillation work of Agarwal et al. (2023) and the Qwen3 team (2025). Using the Tinker training API, we replicated the Qwen3 result: on-policy distillation achieves comparable performance on reasoning benchmarks at a fraction of the cost of RL.
Implementation
You can follow each step of the implementation in this Tinker cookbook:
https://github.com/thinking-machines-lab/tinker-cookbook/tree/main/tinker_cookbook/recipes/distillation
Loss Function: Reverse KL
On-policy distillation can use a variety of loss functions to score the student's trajectories. For simplicity, we choose the per-token reverse KL - that is, the divergence between the student's distribution (π_θ) and the teacher's distribution (π_teacher) at each token, given the same preceding trajectory:
reverse KL at step t = KL( π_θ(· | x_<t) ‖ π_teacher(· | x_<t) ) = E_{x_t ∼ π_θ} [ log π_θ(x_t | x_<t) − log π_teacher(x_t | x_<t) ]
Our training objective is to minimize this reverse KL, which pushes the student to approximate the teacher's behavior in every state it finds itself in. The reverse KL is zero exactly when the student behaves identically to the teacher. For simplicity, we use a discount factor of zero: at any given time step, the student only optimizes the immediately next token, without considering future tokens.
Reverse KL has a natural synergy with RL, which usually optimizes a certain sequence-level reverse KL guided by the reward model. However, unlike most reward models in practice, reverse KL is "unhackable" because from the perspective of the teacher model, a low KL always corresponds to a high probability of the desired behavior. Another useful property of reverse KL is that it is "mode seeking" - it learns a specific behavior (the teacher's behavior) rather than spreading its distribution among several suboptimal options.
This method can save a lot of compute. Because the reward can be computed without waiting for a rollout to finish sampling, we can train on shorter or partial trajectories. Querying the teacher's log probabilities requires only one forward pass of the large model, while the trajectories themselves are generated by the smaller, cheaper student model.
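Below is a minimal sketch of this per-token loss with assumed tensor shapes; gradients flow only through the student (the teacher logits should be computed under no_grad), and partial rollouts are handled simply by masking.

```python
import torch
import torch.nn.functional as F

def per_token_reverse_kl(student_logits: torch.Tensor,
                         teacher_logits: torch.Tensor,
                         gen_mask: torch.Tensor) -> torch.Tensor:
    """Mean reverse KL over the student-generated tokens of a (possibly partial) rollout.

    student_logits, teacher_logits: (batch, seq_len, vocab), both evaluated on the
        student's own rollout; the teacher side needs just one forward pass.
    gen_mask: (batch, seq_len), 1 on student-generated tokens, 0 on prompt/padding.
    """
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)   # computed with torch.no_grad() upstream
    # KL(pi_student || pi_teacher) at each position: "mode seeking" toward the teacher.
    kl = (s_logp.exp() * (s_logp - t_logp)).sum(dim=-1)
    return (kl * gen_mask).sum() / gen_mask.sum().clamp(min=1)
```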
We also don't need a separate reward or annotation model. Combining the distillation-based per-token reward with the sequence-level environmental reward may be beneficial; this is an interesting potential research area in the future.
Illustration
Now let's look at a real example, which is a wrong student trajectory scored by the teacher model. This example is from SimpleBench, which requires the model