Don't just train harder. Give the main model a "mistake notebook," and a 6B combination can outperform an 8B model.
[Introduction] Traditional training only cares about whether the model's output is correct. A recent paper introduces a "Mistake Log" into large-model training: it records the model's internal state at the moment it makes a mistake, including the question, the reasoning process, and where the error occurred. This is much closer to human reflective learning. An auxiliary model learns from these Mistake Logs and corrects the main model's predictions in real time, improving its performance.
Many people find, looking back on their learning, that the real leap in ability doesn't come when they grind through the most practice problems, but when they start systematically organizing their "Mistake Logs".
The key is not copying down the wrong answers, but repeatedly asking: why did I think that way at the time? Which step of my judgment went off? Was this a one-off slip, or a recurring pattern of thinking?
It is through this reflective learning that humans gradually learn to recognize their own "error patterns" and become more stable in the face of complex and uncertain problems.
Then, a question arises: Do large language models have their own "Mistake Logs"?
In the current mainstream training paradigm, the learning process of large models is highly simplified into a cycle:
- Given input → Predict output
- Compare with the standard answer → Calculate loss
- Update parameters through backpropagation
In essence, this process emphasizes "how to better fit the correct answer".
The model only needs to know whether the result is right or wrong; it never asks, "What internal reasoning path led me to this wrong conclusion?"
This also reveals a key missing piece: current large models don't lack data or compute; what they lack is a human-like capacity for deep reflection, that is, a structured review centered on the error itself.
Researchers from the University of Illinois Urbana-Champaign and Princeton University published a new paper proposing a very "human-like" concept: the Mistake Log.
- Paper link: https://arxiv.org/pdf/2505.16270
- Code link: https://github.com/jiaruzouu/TransformerCopilot
Unlike traditional training, which only looks at the final output, the Mistake Log does not aim to answer "Did the model make a mistake?" It targets a more fundamental question: under what internal state did the model make this mistake?
In other words, it focuses not on the answer, but on the whole process of error generation.
The Three-Layer Structure of the Mistake Log
Question: What problem was the model trying to solve at that time?
During training, each input is mapped to a question-level representation (written here as $x_t$ for training step $t$), which captures "the task context the model is facing at this moment". This step corresponds to: "Which question was I working on at that time?"
Rationale (Core): The internal reasoning state of the model at that time
This is where the method departs from standard SFT. The work is not content with observing the final generated tokens; it directly reads the Transformer's hidden-state representations at every layer and every token position. These high-dimensional vectors are not human-readable text explanations, but the model's actual internal thinking trajectory:
Here $t$ indexes the training step, $i$ the token position, and $l$ the Transformer layer, and $h^{(l)}_{t,i}$ denotes the hidden state the model computes at that point.
After collecting all of these hidden states, a complete Rationale trajectory is obtained, e.g. $R_t = \{\, h^{(l)}_{t,i} \mid i = 1,\dots,n_t;\ l = 1,\dots,L \,\}$.
It can be regarded as a "snapshot of the cognitive state" of the model at the moment of making a mistake.
This step is similar to what humans recall when reviewing a mistake notebook: "Which formula was I deriving from?" "Why did I judge wrongly at this branch?"
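As a concrete illustration, here is a minimal sketch (using Hugging Face Transformers; the model name and prompt are placeholders, not the paper's setup) of how such a rationale trajectory could be collected from a decoder-only model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Collect the "Rationale" part of one Mistake Log entry:
# hidden states at every layer and every token position.
model_name = "meta-llama/Llama-3.2-3B"  # illustrative; any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Q: 17 * 24 = ?  A:", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple of (num_layers + 1) tensors,
# each of shape [batch, seq_len, hidden_dim]: one snapshot per layer,
# covering every token position -- the rationale trajectory R_t.
rationale = torch.stack(out.hidden_states, dim=0)  # [layers+1, batch, seq, dim]
print(rationale.shape)
```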
Mistakes: Fine-grained characterization of the error source at the token level
Different from using a single scalar loss to vaguely measure the overall error, this work locates the deviation at the token level: (1) compare the model's predicted distribution with the ground-truth distribution; (2) compute the difference between the two at each token:
- The model's predicted distribution: $\hat{p}_{t,i}$, the distribution the model places over the vocabulary at token position $i$;
- The ground-truth distribution: $p_{t,i}$, the one-hot distribution concentrated on the correct token;
- The discrepancy between the two: e.g. $m_{t,i} = p_{t,i} - \hat{p}_{t,i}$, or an equivalent per-token divergence between the two distributions.
Based on this, a heat map of errors is constructed, answering precisely: from which token did the error first appear, and how did it accumulate and magnify step by step?
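A minimal sketch of one way to realize this token-level discrepancy, assuming the one-hot-minus-softmax form written above (the function name and the exact definition are illustrative, not the paper's code):

```python
import torch
import torch.nn.functional as F

def token_level_mistakes(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Per-token discrepancy between the predicted distribution and the
    ground-truth (one-hot) distribution.

    logits: [seq_len, vocab_size] raw scores from the main (pilot) model
    labels: [seq_len] ground-truth token ids
    returns: [seq_len, vocab_size] signed error vectors m_{t,i}
    """
    probs = F.softmax(logits, dim=-1)                                  # predicted distribution
    one_hot = F.one_hot(labels, num_classes=logits.size(-1)).float()   # ground-truth distribution
    return one_hot - probs                                             # discrepancy per token

# A 1-D "heat map" of error magnitude per token position:
# error_per_token = token_level_mistakes(logits, labels).abs().sum(dim=-1)
```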
What does a complete Mistake Log contain?
Finally, each training iteration generates a triple:
- Question: The task context
- Rationale: The internal reasoning state
- Mistakes: Characterization of the deviation at the token level
If training runs for $T$ steps, the model implicitly accumulates $T$ structured mistake records, e.g. the collection $\{(x_t, R_t, M_t)\}_{t=1}^{T}$, where $M_t = \{m_{t,i}\}_i$ gathers the token-level discrepancies of step $t$.
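Putting the three pieces together, a mistake record per training step might be stored roughly like this (the data structure and field names are ours, for illustration only):

```python
from dataclasses import dataclass
import torch

@dataclass
class MistakeLogEntry:
    """One record per training step t (illustrative, not the paper's API)."""
    question: torch.Tensor   # x_t: input ids / context representation of the task
    rationale: torch.Tensor  # R_t: hidden states, e.g. [layers, seq, dim] (batch squeezed out)
    mistakes: torch.Tensor   # M_t: token-level discrepancies, e.g. [seq, vocab]

mistake_log: list[MistakeLogEntry] = []

# Inside the training loop, after the forward pass of step t:
# mistake_log.append(MistakeLogEntry(question=inputs["input_ids"][0],
#                                    rationale=rationale[:, 0],
#                                    mistakes=token_level_mistakes(logits, labels)))
```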
How are these Mistake Logs actually put to use?
The authors further propose an inspiring design: introduce an auxiliary model, the Copilot, specifically designed to learn from the Mistake Logs of the main model (the Pilot).
Training method of Copilot
- Input to the auxiliary model: the input context representation of the task is combined with the intermediate internal representations produced by the main model during reasoning, jointly characterizing the model's current decision-making state;
- Training objective of the auxiliary model: learn to predict the main model's token-level error distribution during generation, i.e., which positions are more likely to deviate and by how much.
In other words, what the Copilot learns is: "Under what internal reasoning state is the main model more likely to make which kinds of errors?"
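A toy sketch of what this training signal could look like, assuming the Copilot reads the Pilot's last-layer hidden states and regresses the logged token-level errors (the paper's actual Copilot is a larger transformer-based model; the module and loss below are simplified stand-ins meant only to show the input/target pairing):

```python
import torch
import torch.nn as nn

class TinyCopilot(nn.Module):
    """Toy stand-in for the auxiliary model: given the pilot's hidden state at
    each token, predict the token-level error vector that was logged for it."""
    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, vocab_size),
        )

    def forward(self, pilot_hidden: torch.Tensor) -> torch.Tensor:
        # pilot_hidden: [seq_len, hidden_dim] from the pilot's last layer
        return self.net(pilot_hidden)  # predicted per-token error over the vocabulary

# Training step against one Mistake Log entry (MSE is one reasonable choice):
# copilot = TinyCopilot(hidden_dim=3072, vocab_size=128256)
# pred = copilot(entry.rationale[-1])          # last layer -> [seq, dim]
# loss = nn.functional.mse_loss(pred, entry.mistakes)
# loss.backward()
```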
Pilot-Copilot collaborative reasoning
During generation, the correction logits output by the Copilot are fused with the main model's original logits, so corrections are applied in real time at the token-generation stage. The final model no longer just "memorizes answers"; it can dynamically correct its current reasoning trajectory based on accumulated error experience.
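The fusion step itself can be as simple as a weighted sum of logits; a minimal sketch (the additive form, the weight value, and the names here are illustrative; the paper defines its own fusion rule):

```python
import torch

def fused_next_token_logits(pilot_logits: torch.Tensor,
                            copilot_correction: torch.Tensor,
                            lam: float = 0.5) -> torch.Tensor:
    """Blend the pilot's raw logits with the copilot's correction signal.
    lam plays the role of the error-correction weight (lambda in the text)."""
    return pilot_logits + lam * copilot_correction

# At each decoding step, for the current position:
# corrected = fused_next_token_logits(pilot_logits, copilot_correction)
# next_token = torch.argmax(corrected, dim=-1)  # or sample from softmax(corrected)
```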
Theoretical result: Error correction is guaranteed
The paper further proves that, as long as the Copilot can accurately predict the error trend and the error-correction weight λ is chosen within a reasonable range, the expected error of the fused prediction is strictly smaller than that of the original model at every token position.
This means that the Mistake Log is not a heuristic technique, but an error - correction mechanism with clear theoretical support.
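To get some intuition for why a result of this shape can hold (a simplified back-of-the-envelope argument under a squared-error proxy and the notation introduced above, not the paper's actual theorem or proof): let $e_{t,i} = p_{t,i} - \hat{p}_{t,i}$ be the true per-token error and $\hat{e}_{t,i}$ the Copilot's prediction of it. Fusing with weight $\lambda$ gives $\tilde{p}_{t,i} = \hat{p}_{t,i} + \lambda\,\hat{e}_{t,i}$, so the residual error becomes

$$\|p_{t,i} - \tilde{p}_{t,i}\|^2 = \|e_{t,i} - \lambda\,\hat{e}_{t,i}\|^2 = \|e_{t,i}\|^2 - 2\lambda\,\langle e_{t,i}, \hat{e}_{t,i}\rangle + \lambda^2\,\|\hat{e}_{t,i}\|^2,$$

which is strictly smaller than $\|e_{t,i}\|^2$ whenever $\langle e_{t,i}, \hat{e}_{t,i}\rangle > 0$ and $0 < \lambda < 2\,\langle e_{t,i}, \hat{e}_{t,i}\rangle / \|\hat{e}_{t,i}\|^2$. In words: as long as the Copilot's predicted error points in roughly the right direction, a suitably small $\lambda$ can only reduce the error.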
Error correction lets small models "punch above their weight"
The experiments verified the effectiveness of the method on a range of mainstream models (such as LLaMA-3 and Qwen2.5) and 10 reasoning benchmarks. A particularly striking pattern: pairing a larger main Pilot model with a small auxiliary Copilot often yields a significant gain in cost-effectiveness.
The performance of LLaMA-3.2-3B plus a 3B Copilot (6B parameters in total) exceeds that of the original LLaMA-3.1-8B.
This suggests that the ability to correct one's own errors may matter more than simply scaling up model size.
Discussion and outlook
This work is the first to systematically define and explore the Mistake Log mechanism in the training of large models, but this is just a starting point.
Current mainstream "reflective" methods mostly rely on explicit Chain-of-Thought and external multi-agent error correction. These approaches largely stay at the output level, whereas the Mistake Log acts directly on the model's internal cognitive state.
A question worthy of in-depth