Why doesn't GPT-5 "hallucinate" anymore? OpenAI's new paper explains it thoroughly.
Although GPT-5's performance after release did not meet the industry's expectation of a "leap forward", its most remarkable improvement is a significant reduction in the hallucination rate.
Data provided by OpenAI shows that GPT-5 is about 45% less likely to make factual errors than GPT-4o, and about 80% less likely than OpenAI o3.
However, the reason behind this improvement has not been made public. In the System Card, OpenAI attributes it to reinforcement learning training, apparently using some of its latest training methods, which enable the models to "improve their thinking process, try different strategies, and recognize their own mistakes". The specific methods, however, remain a mystery.
On September 4th, OpenAI released a long-awaited paper titled "Why Language Models Hallucinate".
Although OpenAI has not fully disclosed all the technical details, by combining this official paper and the published technical documents, we can catch a glimpse of its core ideas.
01 Hallucination is Inevitable in the Pre-training Stage
The conclusion that hallucination is inevitable is not new. However, previous studies rarely approached it from the mechanisms of the language model itself, focusing instead on problems in the training data.
OpenAI's new paper shows from the outset that "hallucination" is a predictable by-product that inevitably arises from the statistical learning nature of large language models (LLMs).
OpenAI's reasoning logic is simple: It is more difficult to generate reliable information than to judge whether it is reliable, and there will inevitably be failures in the process of judging reliability.
First, the paper defines the inherent "judgment ability" of the language model based on its autoregressive prediction nature.
When evaluating a sentence, the model multiplies the conditional probabilities of each step through word-by-word prediction to obtain a total probability value. This probability value reflects the degree to which the sentence conforms to the statistical laws learned by the model from massive data. Based on this, the researchers proposed a theoretical "Is-It-Valid (IIV)" judge: when the internal probability of a sentence is higher than a set threshold, it is judged as "valid"; otherwise, it is "invalid".
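As a rough illustration of this idea (the function name, threshold, and probabilities below are invented for the sketch, not taken from the paper, and a real judge would also need to account for sentence length):

```python
import math

def iiv_judge(token_probs, threshold=0.01):
    """Toy 'Is-It-Valid' check: a sentence is judged 'valid' when the
    model's probability for it clears a fixed threshold.

    token_probs: the model's conditional probability for each token,
                 p(x_t | x_<t), as produced by autoregressive decoding.
    threshold:   an arbitrary cut-off, chosen here for illustration only.
    """
    # The sentence probability is the product of the per-token conditionals.
    sentence_prob = math.exp(sum(math.log(p) for p in token_probs))
    return sentence_prob > threshold

# A fluent-looking sentence: every token is fairly probable.
print(iiv_judge([0.6, 0.5, 0.7, 0.4]))    # True  -> "valid"
# An unfamiliar sentence: one very unlikely token sinks it.
print(iiv_judge([0.6, 0.001, 0.7, 0.4]))  # False -> "invalid"
```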
To put it simply, if a sentence "seems familiar and smooth" to the model, it is judged "valid"; otherwise, it is "invalid".
However, this theoretical "judge" is not always reliable. When dealing with gray information that is "unfamiliar but seems to have been seen somewhere", it will inevitably make mistakes. The paper lists various scenarios that lead to judgment failures, including: the model can only make guesses due to data sparsity (such as "isolated" facts); the model itself lacks the ability to understand complex concepts; and complex calculations, data distribution shifts, and errors in the training data itself (garbage in, garbage out).
Regarding the consequences of this inevitable "judgment error", the paper gives a strict mathematical conclusion: (the error rate of the generative model) ≥ 2 × (the error rate of the IIV judge).
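Written as a formula, this is a simplified restatement of the bound quoted above; the paper's full statement is more precise and carries additional correction terms:

```latex
% Simplified restatement: the generative error rate is bounded below
% by twice the error rate of the IIV judge.
\[
  \mathrm{err}_{\text{generation}} \;\geq\; 2 \cdot \mathrm{err}_{\text{IIV}}
\]
```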
The root of this amplification effect is that an error at the judgment level inevitably leads to at least two kinds of hallucination. For example, if the model judges that 1 + 1 = 3 is valid, at least two hallucinations follow: asserting that 1 + 1 = 3, and denying that 1 + 1 = 2. Both stem from the same wrong judgment.
The conclusion is thus clear: as long as the training data inevitably contains long-tail, sparse, and noisy parts, the model will inevitably fail at the judgment level, and every judgment error is amplified when transmitted to the generation task. Hallucinations in generation are therefore also inevitable.
Humans face much the same problem: there are many things we are not sure about. But humans have the proverb "know what you know, and know what you don't know", and for things we are unsure of, we can simply say "I don't know".
For models, alignment should be the process of teaching them to "know what they don't know", for example by raising the threshold of the model's internal IIV-like validity discriminator, or by concentrating probability on the more likely answers.
However, the second half of OpenAI's paper proves that post-training cannot effectively address this issue under the current evaluation system.
02 Post-training Fails to Effectively Suppress Hallucinations
Post-training is not completely ineffective. In the paper, OpenAI introduces the concept of calibration.
In a pre-trained model, the probability distribution over the next word is shaped entirely by the training material, which means the model's confidence roughly reflects how often things actually occur in that data. Because pre-training minimizes the loss function, the model naturally ends up calibrated.
However, this also gives rise to a "plateau effect": the model assigns relatively high confidence to many options, all of which can exceed the IIV judge's threshold, so hallucinations arise easily.
Post-training forcibly reshapes this flat probability landscape through explicit preference feedback (for example, humans prefer answer A over B, C, and D).
This process leads to the "miscalibration" of the model, concentrating the probability distribution. The model is taught to place most of the probability on the answer considered "best", forming a steep peak, while the probabilities of other once-plausible options (such as B, C, and D) are suppressed far below the IIV judgment threshold.
In this way, the model no longer needs to make guesses among multiple weak options because it is clearly told which "peak" to choose. When this peak happens to be the correct answer, the model successfully overcomes the hallucinations caused by uncertainty, and the hallucination rate is thus reduced.
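The "plateau vs. peak" picture can be illustrated with a toy distribution over four candidate answers; the numbers are invented, and the temperature trick below is only a stand-in for the sharpening effect of preference training, not how post-training actually works:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

candidates = ["A", "B", "C", "D"]
logits = [2.0, 1.8, 1.7, 1.5]          # pre-training: the options look similarly plausible
threshold = 0.2                        # stand-in for the IIV validity threshold

flat = softmax(logits)                     # calibrated "plateau"
peaked = softmax(logits, temperature=0.1)  # after preference tuning: mass collapses onto one peak

print([round(float(p), 2) for p in flat])    # [0.32, 0.26, 0.23, 0.19] -- several options clear 0.2
print([c for c, p in zip(candidates, flat) if p > threshold])    # ['A', 'B', 'C']
print([round(float(p), 2) for p in peaked])  # [0.84, 0.11, 0.04, 0.01] -- one steep peak
print([c for c, p in zip(candidates, peaked) if p > threshold])  # ['A']
```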
However, this "miscalibration" is a double-edged sword. While reducing the hallucinations caused by "guessing due to uncertainty", it may also increase the risk of "overconfidence".
One of the key directions of post-training is to reduce this overconfidence and enable the model to say "I don't know".
However, most of the mainstream evaluation benchmarks widely used to measure the capabilities of AI models, such as GPQA, MMLU-Pro, and SWE-bench, adopt a "binary scoring system": answers are simply judged as "correct" (1 point) or "wrong" (0 points).
This scoring mechanism brings a serious problem: it systematically punishes uncertainty. When the model faces a question it is not sure about, if it chooses to honestly answer "I don't know" (IDK) or refuses to answer, its score will be 0 points, which is exactly the same as the score for directly giving a wrong "best guess" answer. Under this rule, giving up answering is a "stupid" strategy, while baseless "bluffing" becomes a rational choice for pursuing higher scores.
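A quick expected-value check makes the incentive explicit (a toy calculation, not an experiment from the paper):

```python
def expected_binary_score(p_correct: float) -> float:
    """Expected score under 1/0 grading when the model answers."""
    return p_correct * 1 + (1 - p_correct) * 0

# Even a wild guess has a non-negative expected score,
# while an honest "I don't know" is fixed at 0.
for p in (0.9, 0.5, 0.1):
    print(f"guess with confidence {p}: {expected_binary_score(p):.1f} vs IDK: 0.0")
# Under binary grading, guessing weakly dominates abstaining at every confidence level.
```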
Therefore, in today's environment, where models compete on benchmark scores to demonstrate their strength, training a model to honestly answer "I don't know" is a thankless task.
Therefore, although post-training is technically capable of suppressing hallucinations, it is not properly guided in practice: the current industry evaluation standards systematically reward hallucination. As long as this evaluation paradigm of "punishing honesty and rewarding guessing" remains unchanged, hallucination will remain a stubborn obstacle to more reliable AI systems.
03 GPT-5's Possible Hallucination Killer and the Shortcomings of DeepSeek R1
Although the paper does not delve into the details of the post-training process and only criticizes binary right-or-wrong benchmarks, its conclusion still holds when extended to the RL setting.
The inference is that if a reinforcement learning (RL) process itself also adopts a binary reward path, it is very likely to reduce the model's ability to suppress hallucinations.
The core of reinforcement learning is to guide the behavior of the language model through a "reward model". The language model generates an answer, the reward model scores this answer, and then the language model adjusts its strategy according to the score in order to obtain a higher score in the future.
If the reward model adopts an extreme binary scoring system (such as +1 for a "good answer" and -1 for a "bad answer"), the following problems will occur:
● A factually wrong answer gets -1 point.
● An honest but unhelpful "I don't know" also gets -1 point.
This reproduces the benchmark defect described in the paper: an RL process built on a binary reward path bakes the incentive to "bluff" into training itself. It does not teach the model to calibrate its uncertainty; it punishes the expression of that uncertainty.
Currently, there are two main types of reward models.
One is the Outcome Reward Model (ORM), which is essentially the situation assumed above. Take DeepSeek R1, which uses ORM, as an example: its reward consists of two signals, whether the final answer is correct and whether the format is correct. This is an extreme binary path; as long as the final answer is correct, a high score is given.
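In spirit, such a reward can be sketched as follows. This is a simplified illustration of an outcome-plus-format rule, not DeepSeek's actual reward code; the `<think>` tag, the answer-extraction regex, and the 0.1 format weight are all illustrative assumptions:

```python
import re

def orm_style_reward(response: str, reference_answer: str) -> float:
    """Toy outcome-style reward: final-answer correctness plus a format check."""
    # Format signal: the response should wrap its reasoning in <think> tags.
    format_ok = bool(re.search(r"<think>.*</think>", response, re.DOTALL))
    # Outcome signal: only the extracted final answer matters.
    match = re.search(r"answer:\s*(.+)\s*$", response.strip(), re.IGNORECASE)
    answer_ok = bool(match) and match.group(1).strip() == reference_answer.strip()
    # Binary at its core: nothing in between right and wrong,
    # and an honest "I don't know" scores the same as a wrong answer.
    return (1.0 if answer_ok else -1.0) + (0.1 if format_ok else -0.1)

print(orm_style_reward("<think>2+2=4</think> Answer: 4", "4"))                 # 1.1
print(orm_style_reward("<think>not sure</think> Answer: I don't know", "4"))   # -0.9
```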
A post-training process that further reinforces this binary path is very likely to increase "stubborn" or "overconfident" hallucinations even as it reduces "hesitant" ones. Because overconfident hallucinations are harder to dislodge, the overall hallucination rate may actually rise.
This may be why DeepSeek R1 faced serious hallucination problems after its release. In the Vectara HHEM hallucination test, its hallucination rate was as high as 14.3%, far higher than the 3.9% of its base model, DeepSeek V3.
In contrast, the hallucination rate of models using a Process Reward Model (PRM), such as OpenAI o3 during the same period, was only 6.8%, less than half that of DeepSeek R1.
This is because PRM examines the model's "thinking process" (such as the Chain-of-Thought). When it finds that a reasoning step is based on a fabricated fact, it gives negative feedback at that step, forcing the model to learn to reason from facts. That said, it still relies on a good/bad or right/wrong judgment at each step, so it retains a binary form at the step level.
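A process-style reward can be sketched like this; the step verifier below is a hypothetical stand-in (real PRMs are learned models trained on step-level annotations), and it only catches the fabricated citation, not the wrong conclusion that follows from it:

```python
from typing import Callable, List

def prm_style_rewards(steps: List[str],
                      verify_step: Callable[[str], bool]) -> List[float]:
    """Toy process-style reward: each reasoning step gets its own score,
    so negative feedback lands exactly at the step that fabricates a fact.
    Note that each step is still judged in a binary good/bad fashion."""
    return [1.0 if verify_step(step) else -1.0 for step in steps]

# Hypothetical step verifier: flags any step citing a source it cannot confirm.
def verify_step(step: str) -> bool:
    return "according to" not in step.lower()

steps = [
    "The capital of France is Paris.",
    "According to a 2031 UN report, Paris has 40 million residents.",  # fabricated
    "Therefore, Paris is the most populous city on Earth.",
]
print(prm_style_rewards(steps, verify_step))  # [1.0, -1.0, 1.0]
```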
According to The Information, GPT-5 is very likely to introduce "Universal Verifier" technology to go beyond the original binary right-or-wrong evaluation standard, for example via the currently popular rubric-based approach: a separate "verifier model" scores outputs against a set of richer, non-binary criteria (such as factuality, logic, and nuance). This would remove the negative influence of binary incentives on the reinforcement learning process at its root.
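Neither the Universal Verifier nor OpenAI's rubrics are public, so the following is only a guess at the general shape: a second model grades an answer along several weighted, non-binary dimensions. The dimensions, weights, and scores here are invented for illustration:

```python
from typing import Dict

# Hypothetical rubric: the criteria and weights are assumptions for this sketch.
RUBRIC_WEIGHTS: Dict[str, float] = {
    "factuality": 0.5,   # are the claims supported?
    "logic": 0.3,        # do the steps follow from one another?
    "nuance": 0.1,       # does it acknowledge uncertainty and edge cases?
    "helpfulness": 0.1,  # does it actually address the question?
}

def rubric_reward(scores: Dict[str, float]) -> float:
    """Combine per-criterion scores (each in [0, 1], as graded by a separate
    verifier model) into a single scalar reward in [0, 1]."""
    return sum(RUBRIC_WEIGHTS[k] * scores[k] for k in RUBRIC_WEIGHTS)

# A hedged but factually careful answer is no longer forced to 0 or 1:
print(rubric_reward({"factuality": 1.0, "logic": 0.9, "nuance": 1.0, "helpfulness": 0.6}))  # 0.93
# A confident answer built on a fabricated fact loses most of its reward:
print(rubric_reward({"factuality": 0.1, "logic": 0.9, "nuance": 0.2, "helpfulness": 0.9}))  # 0.43
```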
This may be the secret of GPT-5's ability to achieve such a low hallucination rate.
Of course, all this may still be far from enough. At the end of the paper, the researchers propose that the best way to solve the hallucination problem is to introduce a scoring mechanism with penalties in the post-training stage.
This mechanism will clearly inform the model in the instructions that overconfidence will have a huge cost (for example, getting 1 point for a correct answer, -1 point for a wrong answer, -9 points for an overconfident wrong answer, and 0 points for not answering), forcing the model to transform from a simple "score optimizer" into a "risk assessor". It must accurately calibrate its own confidence level and only dare to answer when the confidence level is high enough.
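Using the numbers in that example, the break-even point can be computed directly; this back-of-the-envelope calculation assumes the -9 penalty applies to a confidently given wrong answer:

```python
def expected_score(confidence: float, wrong_penalty: float = 9.0) -> float:
    """Expected score for answering under a penalized scheme:
    +1 if correct, -wrong_penalty if the confident answer is wrong.
    Abstaining ("I don't know") always scores 0."""
    return confidence * 1.0 - (1.0 - confidence) * wrong_penalty

for c in (0.95, 0.85, 0.50):
    better = "answer" if expected_score(c) > 0 else "say 'I don't know'"
    print(f"confidence {c:.2f}: expected score {expected_score(c):+.2f} -> {better}")
# With a -9 penalty, answering pays off only when confidence exceeds 90% (p > 9/10),
# so the model is pushed to calibrate itself and abstain below that line.
```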
Perhaps only by making the model not only focus on scoring but also focus on truth can the hallucination problem be solved.
This article is from the WeChat official account "Tencent Technology", author: Boyang. It is published by 36Kr with authorization.