
We spoke with three university professors about the increasingly serious problem of AI hallucinations.

Zhiwei (知危) · 2025-07-15 11:20
The so-called "hallucinations" in AI actually stem from the fact that humans don't really know what they want.

Recently, there has been a farce caused by AI hallucinations on the internet.

On July 2nd, a flood of content suddenly appeared online claiming that "DeepSeek apologized to Wang Yibo after its AI model improperly associated him with a legal case." It was eventually discovered that DeepSeek had fabricated the entire event in conversation, even citing a judgment that cannot be found on the China Judgments Online platform.

This farce originated from hallucinations DeepSeek produced while conversing with users. The editorial department of "Zhiwei" therefore believes it is worth discussing why the hallucination rate of large AI models keeps rising.

Some time ago, shortly after its release, OpenAI's o3 model also drew wide attention because its hallucination rate "increased instead of decreasing."

The o3 model makes many strange mistakes: it fabricates code that it never actually ran, inserts invalid non-ASCII dashes into code, and even pretends to be calling tools.

On the PersonQA benchmark, o3 hallucinates in 33% of question-answering sessions, almost twice the rate of o1 (16%). The hallucination rate of o4-mini is as high as 48%, far above that of previously released reasoning models.

Other recently released deep-thinking models show a similar pattern: as reasoning ability increases, so does the hallucination rate.

Nathan Lambert, a scientist at the Allen Institute for Artificial Intelligence, published an article on o3's reasoning hallucinations, arguing that the problem arises from over-optimization in RL (reinforcement learning).

Nathan Lambert gave a classic example of the "reward hacking" phenomenon: a simulated cheetah trained to run fast in the MuJoCo environment ended up maximizing forward speed by doing handsprings instead of running. Similarly, o3 probably pretends to use tools because, during training, it was rewarded whenever it successfully called a tool.
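The gap between what the designer intends and what the proxy reward actually pays for is easy to reproduce in a toy setting. Below is a minimal, hypothetical Python sketch (not the MuJoCo cheetah experiment itself): the intended task is to reach a goal at position 10, but the reward pays for per-step movement, so a reward-greedy policy simply oscillates.

```python
# A minimal, hypothetical sketch of "reward hacking": the intended task is to
# reach position 10, but the proxy reward pays for per-step movement, so a
# reward-greedy policy oscillates instead of making progress.
# (Toy illustration only; not the MuJoCo cheetah experiment described above.)

def proxy_reward(prev_pos: int, new_pos: int) -> int:
    # Designer's intent: "moving a lot" should correlate with reaching the goal.
    return abs(new_pos - prev_pos)

def run_policy(policy, steps: int = 20):
    pos, total = 0, 0
    for _ in range(steps):
        new_pos = policy(pos)
        total += proxy_reward(pos, new_pos)
        pos = new_pos
    return pos, total

honest = lambda pos: pos + 1 if pos < 10 else pos   # walk toward the goal, then stop
hacker = lambda pos: 5 if pos == 0 else 0           # bounce forever for easy reward

print("honest:", run_policy(honest))  # -> (10, 10): reaches the goal, modest reward
print("hacker:", run_policy(hacker))  # -> (0, 100): never reaches it, far more reward
```

The "hacker" policy collects ten times the reward while never completing the intended task, which is the same structural failure the cheetah example illustrates.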

In reasoning models, this manifests as the answer being correct while the reasoning process is wrong, or bears no strict logical relationship to the answer. (This is a new type of hallucination, distinct from the factual hallucination in the DeepSeek rumor about apologizing to Wang Yibo.)

The Stanford University team [1] catalogued these strange behaviors: skipping key intermediate steps, plugging in special values to guess general rules, rough numerical approximation, logically non-closed derivations, and even avoiding real mathematical language altogether. Through systematic evaluation, the team also found that Grok 3 mini reaches a final-answer accuracy of 71.5%, while the accuracy of its reasoning process is only 6.0%.

Zhang Weinan, a professor, doctoral supervisor, and deputy director of the Department of Computer Science at Shanghai Jiao Tong University (whose main research directions include reinforcement learning and large decision-making models), told "Zhiwei": "To say that o3's hallucination rate rose because of over-optimization in reinforcement learning is really to say that humans don't know what they want."

"It is quite normal to develop to this stage. Reinforcement learning can optimize the performance of large models in certain tasks (such as mathematics and coding). After these capabilities are improved, people start to pay attention to its hallucination problem and think that what the large model says is abnormal. Such situations are also often found in other reinforcement learning application scenarios. For example, people first train a robot to walk fast, but later think that the robot doesn't walk gracefully."

Hao Jianye, a professor at the School of Intelligent Computing at Tianjin University and director of the Huawei Noah Decision and Reasoning Laboratory (whose main research directions include deep reinforcement learning and multi-agent systems), also agrees that the root cause lies in reinforcement learning. He told "Zhiwei": "In the reinforcement learning paradigm, the main supervision signal is whether the final result is correct. A large model's reasoning, especially multi-step reasoning on mathematical problems, is a very long multi-step decision-making process, yet reinforcement learning methods such as GRPO (a reinforcement learning algorithm) only give a reward at the last step. This may lead the model to a correct final result through a wrong intermediate reasoning process. The model may develop strategies that are wrong but efficient, which is the source of the so-called 'hallucination' phenomenon."

"Overall, at present, using reinforcement learning to train large models for slow thinking is still in a relatively preliminary stage, and basically relatively standard reinforcement learning methods are still used. Especially the online training methods, including GRPO, which is just a variant of PPO and essentially has no difference from PPO."

Wang Jun, a professor at the Department of Computer Science at University College London (whose main research directions include reinforcement learning and multi-agent systems), has conducted in-depth experimental research on this. He told "Zhiwei": "The current mainstream reinforcement learning methods such as GRPO, as well as approaches that prompt the model to think before outputting a result, all have many problems. One of them is that the model's thinking process is not regularized, which means its so-called thinking process may not conform to human logic."

"Specifically, we tested models such as DeepSeek R1 on the AIME benchmark test and analyzed all the wrong and correct cases of the mathematical problems in AIME. We found that when the model tries to maximize the reward while ignoring the normativity of the thinking process, the logic of its reasoning may not be correct, with a lot of repetition or redundancy, but it can still give the correct answer in the end. This kind of phenomenon can be understood as taking shortcuts."

"I'm quite disappointed about this. So although people have successively proposed various reinforcement learning algorithms such as GRPO, none of them really grasps the key to the problem."

"People also try to break through the limitations of algorithms such as GRPO. For example, we have a method like this: Assume that x is the input and y is the output. We let the model have the ability - given x and the previous y, to deduce x in reverse. After such training, the model can continuously improve its output ability, which greatly improves reinforcement learning."

"Currently, people don't pay attention to how to regularize the thinking process. We focus on this direction because, in essence, in most online reinforcement learning training, there is no correct answer in the thinking (reasoning) stage. Because there are no facts to tell the model what the thinking process should be, so in essence, it is implicit. If only a reward is provided when the result is output, then for this implicit intermediate process, if it is not regularized, it could be anything."

"From another dimension, whether the thinking chain is in the form of tokens (included in the output) or in the latent form (not included in the output), they are just different methods. The latent form may be more efficient or faster and is more suitable for tasks with real-time requirements, but it has poor interpretability. Of course, it can also be done in a hybrid way. Use the explicit token form during training, but use the latent form during execution if there is no need to output these tokens. Another possibility is that information can be transmitted in a latent way between large models and small models."

"Of course, calling this phenomenon a hallucination may not be accurate and is somewhat misleading. The hallucinations of large language models discussed in the past mainly belong to factual errors, which are an inevitable result of the probabilistic nature of AI generation. The reasoning process of AI is different from that of humans, but the answer is correct. It is just the result of the lack of constraints on the intermediate process in the reward settings of algorithms such as GRPO."

Professor Zhang Weinan further explained, "The data used to train such reasoning models may already contain a considerable amount of CoT (Chain of Thought) data obtained by large models (or agents) through interaction with the environment during reinforcement learning. In other words, the interaction data is generated by the model itself rather than coming entirely from human data."

"These CoT data are generally verified, that is, the verifier determines that the thinking process ultimately leads to the completion of the task, and then this thinking chain will be used as training data."

"However, people actually don't pay attention to whether the specific process of these thinking chains is standard or elegant at the sentence, grammar, and natural language levels. Therefore, this will inevitably cause a certain deviation in the ability of the large language model to'speak human language' after post-training. But its ability to solve professional tasks, such as problem-solving, planning, and decision-making of agents, has generally become stronger."

"Going deeper, it involves the core component of reinforcement learning, the'reward function'. Actually, humans still don't know how to design a correct and perfect reward function at present. The more fundamental reason is what was mentioned above, that humans don't know what they really want."

Professor Hao Jianye also emphasized, "Designing a reasonable reward function is the most crucial point in reinforcement learning methods and also the most difficult one."

Reward models can be divided into outcome-level reward models (ORM) and process-level reward models (PRM). An ORM easily lets the model reach the correct answer through a wrong reasoning path, so a PRM is needed to supervise the reasoning process. However, PRMs are very difficult to implement in practice, for example because collecting training data for them is expensive.
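A minimal, hypothetical sketch of the difference: an ORM scores only the final answer, while a PRM scores each intermediate step, so a solution that reaches the right answer through a bogus step looks perfect to the former and flawed to the latter. The step checker below is a toy stand-in for a learned PRM, not a real implementation.

```python
# A minimal, hypothetical sketch contrasting outcome-level (ORM) and
# process-level (PRM) reward signals. The step checker is a toy stand-in
# for a learned process reward model, not a real PRM implementation.

def orm_reward(final_answer: str, reference: str) -> float:
    # Outcome-level: one number for the whole solution.
    return 1.0 if final_answer == reference else 0.0

def prm_rewards(steps: list[str], step_is_valid) -> list[float]:
    # Process-level: one score per intermediate reasoning step.
    return [1.0 if step_is_valid(s) else 0.0 for s in steps]

# A solution whose final answer is right but whose middle step is nonsense.
solution_steps = ["2 + 2 = 4", "4 * 0 = 4", "therefore the answer is 4"]
toy_checker = lambda s: "* 0 = 4" not in s               # flags the bogus step

print(orm_reward("4", "4"))                      # -> 1.0: looks perfect to the ORM
print(prm_rewards(solution_steps, toy_checker))  # -> [1.0, 0.0, 1.0]: the PRM catches it
```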

"Not only is the data cost high, but the definition of PRM for the intermediate process itself is very difficult. Therefore, one solution is to better define the rewards for the intermediate process through manual or semi-automatic methods to guide the model and try to reduce the hallucination problem in the intermediate reasoning process."

"In addition, some techniques from past reinforcement learning can also be considered, such as how to distribute rewards - that is, how to reasonably distribute the final reward to each intermediate step, so as to automatically design a more accurate reward value for the intermediate process."

However, when asked about the development of reward function design in the past two years, Professor Zhang Weinan told "Zhiwei" bluntly, "There hasn't been any decent development."

Where does the difficulty of designing the reward function come from? It stems from the fact that, as an agent, a large model must interact with an increasingly complex environment to keep improving and eventually surpass humans.

Professor Zhang Weinan explained, "Applying reinforcement learning to large models has gradually blurred the boundary between large models and agents. For example, OpenAI's DeepResearch is still a model: in the pretraining stage it relies entirely on the next-token approach to directly output commands that call tools (generating a tool token from scratch, where that token corresponds to a callable API), without needing to pick which tool to call out of a prompt the way a traditional agent does."

"Previously, it was the executable framework that enabled the agent model to interact with the environment. Its function was to convert the perception signals from the environment into language tokens that the large language model could understand, and the tokens output by the large language model could be converted into control instructions for tasks and actions to be issued to the environment. But in fact, this is just a layer of framework. Now the agent model itself can do this. The problem is that you have to input all the task-related data into the large language model during pretraining."

"However, there are tens of thousands of such task types, which are inexhaustible. It is impossible for people to interact and obtain suitable data for each task in one training session and then let the large language model be trained uniformly using the next token decision method."

"So, there is always a trade-off relationship between mainstream tasks and outliers or a large number of narrow-edge tasks. For example, DeepResearch focuses on some professional tasks, such as research, scientific research, market research, mathematics, programming, etc. But the premise is that you must select these types of tasks during the training stage. But if one day I suddenly want to use the large model to handle a task like ordering takeout, it may not be able to do it because it has never seen the API for ordering takeout."

"Therefore, to improve the generalization ability of inference models, more external interaction is still needed. In the future development, both agents and large models need to interact with a dynamic environment to generate data that surpasses humans. One is to exceed all the text data accumulated by humans in quantity, and the other is to exceed humans in terms of data performance indicators."

"If it always just imitates humans, such as imitating how humans write text, it can at most surpass humans in the dimension of integration and application." And indeed, large language models have already surpassed humans in this aspect.

"If its development ceiling is limited by the 'teacher' (that is, humans themselves), then its growth space is very limited." For example, AlphaGo must interact with the environment to generate data for completing tasks and then adjust its own parameters based on these data to truly have stronger abilities than humans. AlphaGo can improve through self-play mainly because the environment is relatively simple, and it can use a previous version as an opponent. But now agents need to interact with the entire open internet, and the environment is the internet, which is much more complex.

As the model grows stronger, the reward model generally needs to progress as well to prevent over-optimization. So this requires not only an increasingly open and complex interaction environment, but also an increasingly powerful reward model.

Academic research on reward models has developed slowly. Bringing reward functions into large models, even into large deep-thinking models, is still only a very preliminary step, and reward models have long been restricted to scalar outputs, which greatly limits their expressiveness and the range of scenarios where they apply.

"In fact, reinforcement learning doesn't really restrict the algorithm to maximize a scalar reward signal. The real definition of reinforcement learning is: as long as the agent can dynamically interact with the environment and improve its own strategy based on the experience data of these interactions, it's okay. It doesn't have to use MDP (Markov Decision Process), have a reward function, or use a scalar reward, etc. It only needs the feedback of environmental changes. So this feedback can completely be non-scalar data, such as a visual signal, natural language, or multi-modal data. Just like humans, human learning has never had completely clear numerical feedback."

"So, in the future, when training a large language model, the final reward function design may be more like a critic, giving relevant textual and unstructured feedback. Then we need to propose a method to allow the model to continue to optimize based on this textual feedback. For example, the coach says: 'You didn't play that ball very well just now. You need to use more strength in your right upper limb when swinging the racket in the future.' It is completely possible to adjust the strategy based on such language feedback, and there is already some work being done in this area."

Professor Zhang Weinan added, "From a business-competition perspective, as long as large language models are trained with next-token prediction on real human data, it is hard to tell them apart; they can only compete on who has the larger model or executes more meticulously, because the differences at the data level are small when everyone uses essentially the same human data. But if a model can generate brand-new data for itself, that can drive continuous progress."

On the other hand, this also reflects that the industry's current benchmarks for testing the reasoning ability of large models have serious limitations.

"The current benchmarks cannot really evaluate the model's ability. To put it bluntly, people still tend to evaluate a very flexible large language model under the premise of some rules and fixed data. This is like using a test paper to judge a person's ability, which can only be a one-sided judgment. To really judge whether a person is reliable and how their various abilities are, it actually requires cooperation and continuous, multi-dimensional communication for evaluation."

From this discussion of the reward function, we can see that within the reinforcement learning framework the chain of thought of a large model is treated more as a path of environmental exploration, which reminds us to rethink the essence of reasoning models.

In fact, from the perspective of practical utility, the reasoning ability of large models has always been questioned.

Many scholars have argued that AI only appears to reason and is in fact relying on memory to "fit templates". The most telling evidence is that its generalization is very fragile: the Stanford University team [2] found that merely changing the variable names and value ranges in the original questions made the performance of many reasoning models drop significantly.
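The perturbation idea behind such robustness tests is straightforward, as in the hypothetical sketch below, which regenerates a templated question with new variable names and value ranges; the template is invented for illustration and is not the Stanford team's actual benchmark.

```python
# A minimal sketch of the perturbation idea behind such robustness tests:
# regenerate a templated question with fresh variable names and value ranges,
# then check whether the model still answers correctly. The template is made
# up for illustration and is not the Stanford team's actual benchmark.
import random

def perturbed_question(rng: random.Random) -> tuple[str, int]:
    name = rng.choice(["x", "n", "k", "speed", "apples"])
    a, b = rng.randint(2, 500), rng.randint(2, 500)
    question = f"If {name} = {a} and we add {b}, what is the new value of {name}?"
    return question, a + b                       # ground-truth answer for checking

rng = random.Random(0)
for _ in range(3):
    q, answer = perturbed_question(rng)
    print(q, "->", answer)   # each q would be sent to the model and compared with answer
```

A model that truly reasons should be indifferent to such surface changes; a model that fits templates often is not.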

The Anthropic team also found that the chain of thought may not reveal the context the model actually relies on and may not be fully related to the final answer [3]. For example, when clues about the final answer (which might be correct or wrong) were added to the prompt, the model accepted this "cheat note" and gave the corresponding correct (or wrong) answer, yet in most cases its reasoning chain never mentioned using the clue at all.

These strange phenomena have further stimulated people's desire to explore the essence of large model reasoning.

Recently, the Tsinghua University team [4] made the following discovery: given a sufficient number of samples, deep-thinking models perform no differently from their base models. RLVR (Reinforcement Learning with Verifiable Rewards) does not