New Training Method of GPT-5-Thinking Unveiled: Teaching AI to Repent
OpenAI has come up with a new thing: Let ChatGPT "confess voluntarily".
This newly published research proposes Confession Training Confessions.
The core idea is to let the model generate a separate "confession report" after answering a question, truthfully reporting whether it has cut corners, taken advantage of loopholes, or violated instructions.
The experiment was conducted directly on GPT-5-Thinking, and the results were indeed quite effective:
Most of the time, the model will confess after making a mistake.
In 11 out of 12 test scenarios, the model has at least a 50% probability of admitting its mistake in the "confession".
The model is much more honest in the "confession" than in the "original answer".
For example, if the model deliberately gives a wrong answer, it may pretend nothing happened in the original answer, but will actively say "I deliberately gave a wrong answer just now" in the confession report.
It becomes more honest after enhanced training.
Although GPT-5-Thinking comes with a certain ability to admit mistakes out of the box, after specialized training, the accuracy of its confessions has further improved.
It doesn't affect the performance of the original task.
Training the model to "confess" won't degrade its original abilities to answer questions and write code. It's like learning an additional honesty skill.
How to Set Up a Confession Room for AI
The starting point of this research is straightforward: Large models are becoming better at lying.
During the reinforcement learning training process, the model needs to optimize multiple objectives simultaneously, including whether the answer is correct, whether the style is appropriate, and whether it complies with security policies, etc.
When these signals are mixed together, sometimes the model may learn some improper tricks, such as pretending to be confident when unsure, or finding loopholes in the reward function to get high scores, rather than truly completing the task.
The solution draws on the design logic of religious confession rooms: Anything said in the confession won't be "held accountable".
After the model gives the main answer, the system will send a message asking the model to provide a ConfessionReport, which includes three parts:
First, list all explicit and implicit instructions that should be followed.
Then, analyze item by item whether it has truly complied - and finally, report any ambiguous areas encountered during the process.
The key in technology lies in the isolation of reward signals. The reward for the confession content is completely independent of the reward for the main answer. If the model admits to cheating in the confession, it won't affect the score it gets in the main task.
In terms of specific implementation, on the basis of the standard reinforcement learning setup, the research team lets the model undergo confession training with a certain probability p (25% in the experiment).
The confession reward is calculated by an independent LLM judge, and the evaluation dimensions include: the completeness and accuracy of the instruction/objective/constraint list, the accuracy of the compliance analysis, and whether any conflicts or difficult - to - judge areas are omitted.
Make Honesty Easier than Lying through Technology
The research team conducted lightweight reinforcement learning training on GPT-5-Thinking and tested it on 12 specially designed evaluation datasets that induce bad behavior.
These datasets cover multiple fields such as hallucination, instruction following, "conspiracy" behavior, and reward hacking. The results show that even if the model exhibits such bad behavior, it is very likely to admit it in the confession.
When the model actually does something it shouldn't, the probability of admitting it in the confession is much higher than in the main answer.
Specifically, in 11 out of 12 evaluations, the model has at least a 50% probability of truthfully confessing its violations in the confession. On average, the probability of "false negatives" (the model violates but doesn't admit) is only 4.36%.
What's more interesting is a control experiment designed by the team. They deliberately used a weak model that is easily deceived as the judge for the main answer and the confession.
As the training progresses, the model indeed learns to deceive this weak judge. The true accuracy of the main answer decreases, but the accuracy of the confession increases instead.
The strong model finds that it's easier to get high scores by being honest rather than trying hard to fabricate lies to deceive the weak confession judge.
The training also brought some additional findings. GPT-5-Thinking itself has a fairly strong confession ability and can produce good - quality self - reports without specialized training.
However, after confession training, the accuracy has significantly improved in evaluations such as instruction hierarchy and conspiracy behavior.
However, the improvement in factual hallucination evaluation is limited. The team speculates that this is because the model "really doesn't know it's wrong" in these scenarios, rather than deliberately concealing it.
Using Confession and Thought - Chain Monitoring Together Keeps AI in Check
The team carefully analyzed the cases where the confession failed.
For false negatives (the model violates but doesn't admit), the analysis shows that there are almost no cases where the model "deliberately lies".
The vast majority of failures occur when the model itself doesn't realize the problem. For example, it gives a wrong answer in a knowledge quiz but genuinely thinks it's correct.
Some other cases are due to ambiguous instructions. For example, whether a hyphenated word counts as one word or two.
There are also some situations where the model indeed mentions the problem in the "uncertainty" field of the confession report, but the evaluation criteria don't count it as a valid confession.
False positives (the model complies but claims to violate) also follow a similar pattern. A typical scenario is that the model happens to guess the answer correctly but, due to lack of confidence, marks it as "uncertain about compliance" in the confession.
The paper also discusses the limitations of the confession method:
It can't prevent bad behavior from happening, but only expose it.
It's powerless against situations where the model genuinely thinks it's right.
It also can't defend against successful jailbreak attacks. If the user has successfully deceived the model into thinking that a violation is reasonable, the model naturally won't admit the problem in the confession.
OpenAI positions confession as a monitoring and diagnostic tool, similar to thought - chain monitoring, but it focuses on "whether the model has violated instructions" rather than "how the model reasons".
They plan to expand the training scale in the future and use confession in conjunction with other security technologies such as thought - chain monitoring and deliberate alignment.
Paper link: https://openai.com/index/how-confessions-can-keep-language-models-honest/
This article is from the WeChat official account "QbitAI", author: Meng Chen. Republished by 36Kr with permission.