
Could fixing hallucinations kill AI? Science exposes the "inherent Achilles' heel" of large models.

新智元 · 2025-11-10 12:04
It's difficult to eliminate hallucinations in large models, and commercial interests prevent them from saying "I don't know."

A new article in Science points out that large models have an inherent, hard-to-solve weakness: hallucinations are difficult to eradicate. Training large models to say "I don't know" in uncertain situations does reduce hallucinations, but it may hurt user retention and engagement, shaking the business foundation of AI makers.

On the day OpenAI completed its restructuring, clearing the way for a future public listing, a widely read article in Science exposed an inherent, fatal weakness of large models, one that makes it difficult for them to ever fully shed hallucinations.

The article points out that although OpenAI has completed its long-awaited restructuring, its core products still hallucinate.

In the past, such hallucinations were mostly attributed to the quality of the training data, but that explanation is not sufficient.

Last month, a research team from OpenAI and the Georgia Institute of Technology argued in a preprint paper:

Just as students guess at difficult exam questions, large models also tend to guess in uncertain situations, generating plausible but wrong answers instead of admitting that they don't know.

Paper: Why Language Models Hallucinate, https://arxiv.org/abs/2509.04664

Choosing to say "I don't know" in uncertain situations can significantly reduce hallucinations, so why haven't model designers done this?

Researchers believe that the problem mainly lies in the training and evaluation mechanisms of large models:

The training and evaluation of large models tend to reward guessing rather than encourage admitting uncertainty.

But it's not easy to change this.

Making large models learn to say "I don't know" may also shake the business foundation of AI manufacturers.

For example, some people question whether OpenAI will sincerely make its models value "truthfulness" more than "attractiveness".

This is a huge challenge.

If ChatGPT often answers "I don't know", users may switch to competitors.

Why Are Large-Model Hallucinations Difficult to Eradicate?

"If you completely fix the hallucinations, you'll kill the product."

Wei Xing, an AI researcher at the University of Sheffield, once published an article arguing that OpenAI's "anti-hallucination" plan would kill ChatGPT.

OpenAI researchers believe that hallucinations are not mysterious. They analyzed the possible errors in the pre-training stage of large models and found that even when the training data is correct, the pre-training objective can still cause the model to make mistakes.

The researchers further pointed out that hallucinations persist into later stages because the scoring methods of mainstream evaluation systems encourage the model to guess, like a student in an exam, rather than honestly express uncertainty.

Exploring why hallucinations are hard to eradicate, OpenAI traces the root to the next-word-prediction objective of pre-training: the model learns, from vast amounts of text, to predict the next word according to statistical regularities.

But this prediction is learned indiscriminately: sentences are never optimized against true/false labels. With no examples marked as wrong, it is especially hard to distinguish valid statements from invalid ones, so hallucinations arise.

Take image recognition as an example. If millions of photos of cats and dogs are labeled as "cat" or "dog", the algorithm can classify them reliably.

But if the photos are labeled according to the pets' birthdays, since the birthday data is essentially random, no matter how advanced the algorithm is, this task will inevitably produce errors.

A similar mechanism exists in the pre-training of language models.

For example, spelling and parenthesis placement follow fixed patterns, so errors of that kind disappear as scale increases.

However, arbitrary low-frequency facts like pet birthdays cannot be predicted from patterns alone, so they are likely to cause hallucinations.
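The cat/dog-versus-birthday analogy can be simulated in a few lines. This is a toy sketch (not the paper's experiment): when the label is a fixed function of the input, a learned rule is perfectly reliable, but when the label is random and independent of the input, no model can beat chance.

```python
import random

random.seed(0)

n = 100_000
patterned_hits = 0
arbitrary_hits = 0
for _ in range(n):
    x = random.randint(0, 9)

    # Patterned label (like spelling rules): fully determined by the input,
    # so a model can learn the rule exactly.
    y_pattern = x % 2
    patterned_hits += (x % 2) == y_pattern

    # Arbitrary label (like a pet's birthday): independent of the input,
    # so any predictor is reduced to guessing.
    y_arbitrary = random.randint(1, 365)
    guess = random.randint(1, 365)
    arbitrary_hits += guess == y_arbitrary

print(f"patterned accuracy: {patterned_hits / n:.3f}")   # 1.000
print(f"arbitrary accuracy: {arbitrary_hits / n:.4f}")   # chance level, ≈ 1/365
```

No matter how large the model or how long it trains, the irreducible error on the "birthday" task stays at roughly 364/365, which is the statistical picture behind hallucinations on arbitrary low-frequency facts.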

OpenAI has clarified several misunderstandings about model "hallucinations":

Misunderstanding: Improving accuracy will eliminate hallucinations, because a 100% accurate model never hallucinates.

Claim: Accuracy will never reach 100%, because regardless of model scale, search ability, or reasoning ability, some real-world questions are inherently unanswerable.

Misunderstanding: Hallucinations are inevitable.

Claim: This is not the case because language models can choose to remain silent in uncertain situations.

Misunderstanding: Avoiding hallucinations requires a certain level of intelligence, which can only be achieved by large models.

Claim: Small models can find it easier to recognize their own limits. Faced with a question in Māori, a small model that knows no Māori can simply respond "I don't know", whereas a model with partial knowledge of Māori must evaluate how confident it is in its answer.

Misunderstanding: Hallucinations are mysterious glitches in modern language models.

Claim: We now understand the statistical mechanism that produces hallucinations and the evaluation rewards that sustain them.

Why Is It Hard to Stop Large Models from "Gaming the Benchmarks"?

When introducing this paper on its official blog, OpenAI described hallucinations as "seemingly reasonable but wrong statements".

Blog: https://openai.com/zh-Hans-CN/index/why-language-models-hallucinate/

Moreover, hallucinations are unpredictable and can appear in unexpected ways.

For example, asked for the title of a paper, a model may confidently give three different answers, all of them wrong.

Adam Kalai, a research scientist at OpenAI and a co-author of the paper, believes that although models can never be 100% accurate, that does not mean they have to hallucinate.

One remedy lies in the post-training stage: fine-tuning methods such as learning from human feedback can guide the model to be safer and more accurate.

But this also drives large models to "game the benchmarks":

Since model performance is scored through standardized benchmark tests, and a high score means fame and commercial success, companies often aim their training at scoring high.

OpenAI believes that the persistent existence of hallucinations is partly due to the wrong incentive mechanism set by the current evaluation methods.

Researchers analyzed ten popular benchmarks and found that nine use binary scoring, 1 point for a correct answer and 0 for a blank or wrong one; only WildBench uses a 1-10 scale.

Although answering "I don't know" may be considered slightly better than a "seriously hallucinated but seemingly reasonable answer", overall, it is still rated lower than a "barely qualified" answer.

This means that IDK (I don't know) may get some points under this benchmark, but it is not considered the preferred strategy.

Under this scoring mechanism, a wrong guess and a non-answer are penalized identically (both score 0), so models that "pretend to know" outscore models that cautiously answer "I don't know".

For example, if a large model is asked someone's birthday and doesn't know the answer, a random guess has a 1/365 chance of being right, while "I don't know" scores zero with certainty.

Across thousands of test questions, such a guessing model will ultimately outscore a cautious model that admits uncertainty.
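The expected-value arithmetic behind this incentive is short enough to write out directly:

```python
# Expected benchmark score under binary grading (1 for correct, 0 otherwise)
# on a question the model genuinely does not know, e.g. a person's birthday.
p_guess_correct = 1 / 365   # a random date is right once in 365 tries

expected_score_guess = p_guess_correct * 1 + (1 - p_guess_correct) * 0
expected_score_idk = 0.0    # "I don't know" always scores 0 under this rubric

print(expected_score_guess)  # ≈ 0.0027, strictly more than 0.0
# Summed over thousands of such questions, the guessing model accumulates a
# higher total score than the honest one, so leaderboards reward bluffing.
```

Guessing weakly dominates abstaining: it can never score lower and sometimes scores higher, which is exactly why binary grading pushes models toward confident fabrication.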

Kalai speculates that this may be the reason why anti - hallucination solutions have been slow to be implemented in the past.

Why Can a Higher Accuracy Score Also Mean More Hallucinations?

For purely objective questions with a single correct answer, OpenAI classifies model responses into three categories: accurate responses, errors, and abstentions (declining to guess).

OpenAI believes that "abstention" reflects humility and regards it as a core value.

On accuracy, the previous-generation o4-mini model performed slightly better, but its error rate (i.e., its hallucination rate) was significantly higher.

This shows that although strategic guessing in uncertain situations can improve accuracy, it will increase the occurrence of errors and hallucinations.

OpenAI believes that accuracy-only scoring still dominates the model evaluation system, pushing developers to build models that guess blindly rather than hold back when uncertain:

"This is the important reason why the model still hallucinates even though it is constantly improving, that is, it confidently gives wrong answers instead of admitting uncertainty."

Therefore, OpenAI's research team calls for a redesign of the scoring mechanism to punish wrong guesses, so that the model can "learn humility through setbacks".
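One way to see why penalizing wrong guesses changes the incentive is negative marking, familiar from some human exams. This is a minimal sketch of that idea, not the paper's exact rubric; the penalty value here is an illustrative assumption:

```python
def expected_score(confidence: float, wrong_penalty: float) -> float:
    """Expected score of answering, when a correct answer earns 1 point and
    a wrong answer costs `wrong_penalty`. Abstaining ("I don't know") scores 0."""
    return confidence * 1.0 - (1.0 - confidence) * wrong_penalty

# Under binary grading (no penalty), answering is never worse than abstaining,
# even at 1% confidence:
print(expected_score(0.01, 0.0))   # 0.01, still above the 0 for abstaining

# With a penalty, low-confidence guesses have negative expected value,
# so "I don't know" becomes the rational move:
print(expected_score(0.01, 3.0))   # ≈ -2.96, worse than abstaining

# Break-even point: answering pays off only when confidence > p / (1 + p).
penalty = 3.0
print(penalty / (1 + penalty))     # 0.75 for this illustrative penalty
```

Under such a rubric, a model maximizes its score by answering only when its confidence clears the threshold, which is precisely the "humility" behavior the researchers want evaluations to reward.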

Even Well-Intentioned Adjustments May Backfire

Carlos Jimenez, a computer scientist at Princeton University and creator of the SWE-bench benchmark, believes that changing the scoring standard will not be easy.

Because different subject areas have unique evaluation methods, "each discipline has its own standards for measuring uncertainty or confidence."

In addition, some scholars worry that even well-intentioned adjustments may backfire.

Hao Peng, a computer scientist at the University of Illinois Urbana-Champaign, warns that encouraging models to say "I don't know" may introduce new hallucinations of its own, much like current attempts to optimize the "confidence" of large models.

He pessimistically points out that at present, there may be no data or indicators that can naturally solve the problem of hallucinations because these models "are too good at exploiting the loopholes in the system".

Model hallucinations are not simply a data problem but the result of a combination of training mechanisms, evaluation systems, and business models.

At the technical level, researchers generally agree that hallucinations cannot be fully eradicated, only mitigated;

At the business level, if the model frequently answers "I don't know", both the user experience and retention rate may decline.

Currently, AI startups such as OpenAI and Anthropic face enormous pressure to turn a profit. Between making models more truthful and making them more attractive, no company wants to be the first to risk losing users by sacrificing attractiveness, since that would likely drive users to seemingly more "confident" competitors.

However, from the perspective of AI technological progress, making AI learn to say "I don't know" helps to reduce hallucinations and promote the evolution of the model towards a more mature direction.

References:

https://www.science.org/content/article/ai-hallucinates-because-it-s-trained-fake-answers-it-doesn-t-know

https://openai.com/zh-Hans-CN/index/why-language-models-hallucinate/

This article is from the WeChat official account 新智元, author: Yuan Yu. Republished by 36Kr with permission.