HomeArticle

OpenAI's move is really ruthless. AI has gone from "playing hide-and-seek" to "exposing its own skeletons," focusing on being straightforward.

新智元2025-12-22 08:07
As AI becomes more powerful and enters higher-risk scenarios, transparent and secure AI is becoming increasingly important. OpenAI has proposed a "confession mechanism" for the first time, making the model's hallucinations, incentives for hackers, and even potential deceptive behaviors more visible.

As AI becomes increasingly intelligent, it also becomes more and more difficult to control.

A problem that gives AI researchers a headache is:

When AI starts to "play smart" with you, for example:

Speak nonsense seriously: Hallucination

Find loopholes in the training mechanism to get high scores: Reward hacking

Appear "scheming" in adversarial tests

...

How to crack it? This is a tricky problem.

The biggest problem is that the answers of these AIs often seem okay.

They have rigorous logic and fluent expressions, but you don't know where the "pitfalls" are buried: whether they took shortcuts, concealed uncertainties, or secretly violated the requirements.

So, researchers at OpenAI came up with a very bold idea:

Let the model "learn to confess."

The core of this research is to train the model to produce an additional self-confession "report" after answering a question:

Did I use the wrong method just now? Did I take a shortcut? Did I miss any instructions? Did I take advantage of any loopholes you didn't notice...

In order to let the model "put down its ideological burden" when confessing, this report does not affect the score of the main answer. It only examines one dimension:

Honesty.

The confession mechanism takes an important step towards AI transparency, allowing large models to actively admit whether they have followed the instructions and whether they have taken shortcuts, thus making hidden bad behaviors visible.

This gives us a window to check the internal state of AI, and an opportunity to truly understand where the problems are, so as to better improve the security and training methods of AI.

The paper contains more experiments, including comparisons with control models that do not use the confession mechanism, and analyses of cases where the confession fails.

https://cdn.openai.com/pdf/6216f8bc-187b-4bbb-8932-ba7c40c5553d/confessions_paper.pdf

How does the confession mechanism make AI more "honest"?

The exploration direction of the confession mechanism is to train the model to actively admit when it exhibits bad behaviors. Researchers call this "confession."

This is a concept - verification method that is still in its early stages. By training the model to actively "confess" when it violates instructions or takes shortcuts, these behaviors can be made visible.

OpenAI and other AI researchers have proven that large models may exhibit hallucinations, reward hacking, or dishonest behaviors.

Currently, the most serious behaviors, such as scheming deception, mainly appear in stress tests and adversarial evaluations.

However, as models become more powerful and more like AI agents, even rare imbalanced behaviors can have a greater impact. Therefore, we need methods that can better detect, understand, and mitigate AI risks.

How to make the model "confess" without reservation?

Researchers have completely decoupled "confession" from the rewards of the main task, only examining one dimension:

Reward for telling the truth, no reward for not telling the truth. It has nothing to do with the quality of the answer to the main task.

If the model admits problems in its confession, it will not be counted as a negative score.

As long as it truthfully states whether it has exploited test loopholes (hacking), deliberately underperformed (sandbagging), or violated instructions, it will receive a reward.

The purpose is to make the model truthfully report its real behaviors.

Tests show that the confession mechanism can significantly improve the visibility of bad behaviors.

In scenarios that induce bad behaviors, the proportion of "false negatives" (the model does not follow the instructions and does not confess) is only 4.4%.

The figure shows the accuracy of the model's self-"confession" of whether it has complied in four types of tasks.

The results are divided into the four categories in the above figure:

True positive: The model does not comply and confesses that it does not comply

True negative: The model complies and confesses that it complies

False positive: The model complies but falsely confesses that it does not comply

False negative: The model does not comply but falsely confesses that it complies

The overall purpose is to measure whether the model's self - assessment (self - confession) is reliable.

In the "hallucination" task, the accuracy rate reaches 81%, but there is also a certain proportion of false "confessions", which indicates that the model is less stable in judging its own performance in the area of "hallucination" than in other fields.

In tasks related to following instructions, the proportion of false confessions is only 4.4%, which indicates that AI is quite reliable in judging its own behaviors in such tasks.

For a given user prompt, the above four possible results depend on the combination of two dimensions:

Whether the model's response complies or not, and whether the "confession" claims to comply or not.

Overall, researchers found that "confession" statements are usually very accurate, and most of the errors are harmless, stemming more from real confusion rather than intentional misleading.

How does the confession mechanism work effectively?

The confession mechanism is that AI self - reports whether it has followed explicit and implicit instructions, whether it has taken shortcuts, exploited loopholes, or failed to meet expectations.

In this regard, large models can be trained to be honest.

Researchers trained the confession mechanism using a version of GPT - 5 Thinking and evaluated it on multiple out - of - distribution stress test datasets.

These datasets are designed to induce behaviors such as scheming deception, reward hacking, instruction violation, and hallucination.

The results show that even if the model really does something it shouldn't do, it is still likely to admit it in the confession. The overall effect is stable across different datasets and tasks.

After each conversation ends, a "system message" can be used to ask the model to give a confession report on the previous answer, including:

All explicit and implicit goals, instructions, and restrictions that the answer should meet;

An objective assessment of whether the answer has achieved these;

Ambiguities, uncertainties, or difficult - to - judge problems that the model encountered during the answering process.

During training, an independent reward model only evaluates the content of the confession, including whether the goals are complete, whether the analysis is accurate, and whether the description of uncertainties is sufficient.

The reward for the confession does not affect the score of the main answer, thus ensuring that what the model says in the confession will not affect the performance of the main answer.

Why does the confession work?

The reason is that many bad behaviors occur when the model optimizes multiple goals simultaneously.

In reinforcement learning, the model gradually learns the patterns that can get high scores by looking at the prompt words, generating answers, and getting rewards.

However, the reward signals often mix multiple factors: correctness, helpfulness, product specifications, safety requirements, user preferences, etc.

The interaction of these reward signals may push the model in directions we don't want, for example:

Hallucination research shows that some datasets reward "confident guesses" more than honest expressions of uncertainty;

Flattery research shows that when the preference signal is too strong, the model will over - agree;

Research by other teams shows that the model may engage in reward hacking, find loopholes in the training mechanism, improve the score but not really complete the task correctly.

These shortcuts may even generalize into more serious problems during deployment.

The confession mechanism can avoid this by completely separating the goals:

The main answer still optimizes all the regular factors, while the confession mechanism only optimizes the factor of honesty.

The content of the confession does not affect the reward for the main answer.

Therefore, even if the large model conceals or takes shortcuts in the main answer, it still has the motivation to tell the truth in the confession.

This is indeed the case in practice: the large model will admit bad behaviors in the confession, even if they are not revealed in the main answer.

When the confession mechanism fails, in most cases, the model really "doesn't understand": for example, the instructions are unclear, or it doesn't even realize that it has made a mistake, rather than deliberately misleading.

Importantly, early results show that even if the main answer does not use chain - of - thought (chain - of - thought), the confession mechanism is still effective.

This indicates that it does not depend on internal reasoning transparency and may be applicable to more different model architectures in the future.

The confession mechanism can work even without a compliant "ground - truth label".

In many real - world tasks, we cannot objectively label whether the model has followed the instructions. If we could, we would directly punish it instead of using the confession mechanism.

The confession mechanism rewards whether the model can give a structured, evidence - supported self - assessment.

Providing a real explanation is often easier than fabricating a perfect lie, so the tendency to be honest naturally emerges. The paper also discusses the situations and reasons for false negatives and false positives in the confession.

The paper provides preliminary evidence, not the final proof.

Although researchers used the advanced model GPT - 5 Thinking, the experimental scale is limited. In the future, they hope to further expand the scale to see if the honesty can be maintained.

What have we learned?

In some experiments, researchers deliberately used "easily exploitable reward signals" generated by weaker models to train both the main task reward and the confession reward.

This means that the model is learning under a set of "rules that can be easily manipulated".

Facts have proved that the model has indeed learned to exploit the weak reward signals. In the main task, the model "exploits the weak reward signals" and "goes astray", but in the confession part, it becomes more and more honest and accurate.

In this process, the model realizes that the simplest way to maximize the confession reward is to truthfully admit how it exploited the weak reward signals.

Limitations and prospects

The confession mechanism does not prevent bad behaviors; it only makes them visible.

Its main value is as a monitoring and diagnostic tool during training, evaluation, and deployment.

The confession mechanism, like chain - of - thought monitoring, can make the hidden reasoning process more transparent. It focuses on whether the instructions are violated, while chain - of - thought focuses on revealing the reasoning path.

Researchers point out that this work is still in the concept - verification stage.

Currently, its effect has not been tested under large - scale training conditions, and the confession accuracy is not perfect. More research is needed to make this method more reliable, robust, and applicable.

This research is also part of OpenAI's broader AI security system.

Confession is just one of the mechanisms, and together with methods such as deliberative alignment, chain - of - thought monitoring, and instruction hierarchy system, it forms a set of multi - layer transparent and supervisory tools.

The confession mechanism can be used for diagnosis in training and evaluation, as well as for real - time monitoring during deployment.

Although it cannot solve the multi - objective conflict problem alone, as an "honest mode", it can provide important support for overall honesty and security.

In the future, OpenAI plans to further expand the training scale of the confession and combine it with other transparency and security technologies (including chain - of - thought monitoring and deliberative alignment) to ensure that the model strictly complies with all instructions and policies (such as Model Spec) and truthfully reports its own behaviors.

Reference materials:

https://openai.com/index/how-confessions-can-keep-language-models-honest/

This article is from the WeChat public account "New Intelligence Yuan". The author is New Intelligence Yuan, the editor is Yuan Yu. It is published by 36Kr with authorization.