
Is someone finally taking charge of the issue of AI spouting nonsense?

机器之心 · 2025-09-10 15:56
An AI hallucination detector is here: low-cost, scalable, real-time, with an AUC as high as 0.90

Imagine if large AI models like ChatGPT could flag the parts of their output they are uncertain about. Would you trust their answers more?

Last weekend, a paper published by OpenAI set off a storm in the community. It systematically traces the root cause of hallucinations to the reward signal: standard training and evaluation procedures tend to reward guessing rather than rewarding the model for admitting uncertainty. Perhaps because OpenAI recognized this problem and addressed it directly, the hallucination rate of GPT-5 has dropped significantly.

As large AI models are increasingly applied in high-risk fields such as medical consultation and legal advice, the hallucination problem becomes harder and harder to ignore, and many researchers are now focusing on it. Besides investigating the causes of hallucinations, as OpenAI did, many are also working on hallucination detection techniques. However, existing detection techniques face bottlenecks in practice: they are usually suited only to short factual queries, or they require expensive external resources for verification.

In response to this challenge, a new study from ETH Zurich and MATS proposes a low-cost, scalable detection method that can identify "hallucinated tokens" in long-form content in real time, and it has been successfully applied to models with up to 70 billion (70B) parameters.

Paper title: Real-Time Detection of Hallucinated Entities in Long-Form Generation

Paper link: https://arxiv.org/abs/2509.03531

Code link: https://github.com/obalcells/hallucination_probes

Project link: https://www.hallucination-probes.com/


The core of this method is to accurately identify entity-level hallucinations, such as fabricated names, dates, or citations, rather than judging the truth of an entire statement. This strategy maps naturally onto token-level labels, enabling real-time streaming detection.
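
To make the token-level framing concrete, the sketch below shows one way an entity span flagged as hallucinated could be projected onto per-token labels; the offset format, example text, and 0/1 label encoding are illustrative assumptions rather than the paper's exact schema.

```python
# Minimal sketch: projecting character-level entity spans onto token-level labels.
# The span format and label values here are illustrative assumptions.

def spans_to_token_labels(token_offsets, hallucinated_spans):
    """token_offsets: list of (start, end) character offsets, one per token.
    hallucinated_spans: list of (start, end) character ranges marked as fabricated.
    Returns a 0/1 label per token (1 = part of a hallucinated entity)."""
    labels = [0] * len(token_offsets)
    for i, (tok_start, tok_end) in enumerate(token_offsets):
        for span_start, span_end in hallucinated_spans:
            if tok_start < span_end and span_start < tok_end:  # any overlap
                labels[i] = 1
                break
    return labels

# Example: the fabricated date "March 3, 1921" is marked as a hallucinated entity.
text = "She was born on March 3, 1921 in Geneva."
token_offsets = [(0, 3), (4, 7), (8, 12), (13, 15), (16, 21),
                 (22, 24), (25, 29), (30, 32), (33, 39), (39, 40)]
hallucinated_spans = [(16, 29)]  # character range of "March 3, 1921"
print(spans_to_token_labels(token_offsets, hallucinated_spans))
# -> [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
```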

Detecting hallucinated entities with token-level probes. In long-form generation settings (LongFact, HealthBench), linear probes far exceed uncertainty-based baseline methods, and LoRA probes improve performance further. The probes also perform well in a short-form setting (TriviaQA) and an out-of-distribution reasoning domain (MATH). Results shown are for the Llama-3.3-70B model.

To achieve this, the researchers built an efficient annotation pipeline: they used web searches to verify the entities in model-generated content and annotated each token to indicate whether it is grounded in fact. On this purpose-built dataset, they trained accurate hallucination classifiers with simple, efficient techniques such as linear probes.

In evaluations across four major model families, the classifier outperforms existing baseline methods across the board. On long responses in particular, it far exceeds computationally more expensive methods such as semantic entropy. For example, on the Llama-3.3-70B model, this method reaches an AUC (a measure of classifier performance) of 0.90, versus 0.71 for the baseline. It also performs strongly in short-answer question settings.

Notably, although the classifier is trained only with entity-level labels, it can effectively identify incorrect answers in mathematical reasoning tasks. This suggests the method generalizes beyond entity detection and can catch a broader class of logical errors.

Although annotating the original dataset was expensive, the researchers found that data annotated on one model can be reused to train effective classifiers for other models. The team has therefore released the dataset publicly to support follow-up research in the community.

Method Overview

Dataset Construction for Token-Level Hallucination Detection

To train a classifier that detects hallucinations at the token level, the researchers need a dataset in which hallucinated content in long texts is precisely annotated. The process has two steps: (1) generate mixed texts containing both factual and hallucinated content; (2) annotate these texts at the token level to identify which tokens belong to fabricated entities. The figure below shows this annotation process.

Token-level annotation pipeline.

  • Data Generation

Based on the LongFact dataset, the researchers created LongFact++, a prompt set that is 10 times larger and covers more diverse domains.

LongFact++ includes four types of prompts: topic queries, celebrity biographies, citation generation, and legal cases. The goal is to induce large language models to generate entity-rich long texts, which then serve as the raw material for annotation.

  • Token-Level Annotation

Unlike traditional methods that decompose text into atomic claims, this work focuses on annotating entities (names, dates, citations, and so on), because entities have clear token boundaries and are easy to detect in a streaming fashion. The researchers used the Claude 4 Sonnet model with web search to automate the annotation process.

The system identifies entities in the text, verifies them through web searches, and marks each one as "Supported" (backed by evidence), "Not Supported" (shown to be fabricated), or "Insufficient Information".
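
As a rough illustration of how such verdicts might be stored, the sketch below defines a hypothetical per-entity annotation record; the field names, types, and helper function are assumptions for illustration, not the released dataset's actual format.

```python
# Sketch of a possible per-entity annotation record from the labeling pipeline.
# Field names and the helper below are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Literal

Label = Literal["Supported", "Not Supported", "Insufficient Information"]

@dataclass
class EntityAnnotation:
    text: str        # the entity as it appears in the generation, e.g. a name or date
    char_start: int  # character offsets into the model's response
    char_end: int
    label: Label     # verdict from the web-search-backed annotator
    evidence_urls: list[str] = field(default_factory=list)  # sources consulted

def is_hallucinated(ann: EntityAnnotation) -> bool:
    # One plausible convention: only entities contradicted by evidence count as
    # hallucinations; "Insufficient Information" entities are left out of the targets.
    return ann.label == "Not Supported"
```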

  • Label Quality

To verify annotation quality, the researchers ran two checks. First, human annotators agreed with the model's automatic annotations 84% of the time. Second, on a controlled dataset containing known, manually injected errors, the annotation pipeline achieved a recall of 80.6% with a false-positive rate of 15.8%.

Training Token-Level Probes

A probe is a hallucination detector attached to a language model. It consists of a linear "value head" and an optional LoRA adapter. The value head reads the hidden state $h_t^{(\ell)}$ at an intermediate layer $\ell$ of the model and outputs a token-level hallucination probability:

$$p_t = \sigma\!\left(w^{\top} h_t^{(\ell)} + b\right)$$

The total training loss $\mathcal{L}$ combines the probe loss $\mathcal{L}_{\text{probe}}$, which trains the hallucination classifier, with a regularization term $\mathcal{L}_{\text{reg}}$ that constrains changes to the model's behavior (relevant when the LoRA adapter is trained):

$$\mathcal{L} = \mathcal{L}_{\text{probe}} + \lambda\,\mathcal{L}_{\text{reg}}$$
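
A minimal PyTorch sketch of such a linear value head follows; the tensor shapes, layer selection, and the usage in the comments are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of a token-level linear probe ("value head") on a language model.
# Layer choice, shapes, and the usage comments are illustrative assumptions.
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.value_head = nn.Linear(hidden_size, 1)  # computes w^T h + b per token

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) from intermediate layer l
        logits = self.value_head(hidden_states).squeeze(-1)  # (batch, seq_len)
        return torch.sigmoid(logits)                         # per-token hallucination probability

# Hypothetical usage with a model that exposes hidden states:
#   outputs = model(input_ids, output_hidden_states=True)
#   h = outputs.hidden_states[layer_l]   # (batch, seq_len, hidden_size)
#   p_halluc = probe(h)                  # can be read out token by token during generation
```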

To address the fact that hallucination signals are usually concentrated on a few key tokens, the researchers designed a hybrid loss that combines a per-token loss with a span-maximum loss, schematically:

$$\mathcal{L}_{\text{probe}} = \mathcal{L}_{\text{token}} + \lambda_{\text{span}} \sum_{s \in \mathcal{S}_{\text{halluc}}} \mathrm{BCE}\!\left(\max_{t \in s} p_t,\; 1\right)$$

The cleverness of this design is that, for a span marked as hallucinated, it suffices for at least one token to receive a high hallucination score; the loss rewards exactly that, so the probe learns to focus on the few tokens that actually carry the error signal.
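
The sketch below shows one plausible way to implement such a hybrid of per-token binary cross-entropy and a max-over-span term in PyTorch; the weighting and reduction choices are assumptions and only approximate the paper's formulation.

```python
# Sketch of a hybrid probe loss: per-token BCE plus a max-over-span term.
# Weighting and reduction choices are illustrative assumptions.
import torch
import torch.nn.functional as F

def hybrid_probe_loss(probs, token_labels, hallucinated_spans, span_weight=1.0):
    """probs: (seq_len,) per-token hallucination probabilities in [0, 1].
    token_labels: (seq_len,) 0/1 targets.
    hallucinated_spans: list of (start, end) token-index ranges labeled as fabricated."""
    eps = 1e-8
    # Per-token term: ordinary binary cross-entropy over every labeled token.
    token_loss = F.binary_cross_entropy(probs, token_labels.float())

    # Span term: for each hallucinated span it is enough that the *maximum*
    # probability inside the span is high, so the probe is free to concentrate
    # its signal on the few tokens that actually carry the error.
    span_losses = [-torch.log(probs[start:end].max() + eps)
                   for start, end in hallucinated_spans]
    span_loss = torch.stack(span_losses).mean() if span_losses else probs.new_zeros(())

    return token_loss + span_weight * span_loss
```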

Experimental Results

In the long-form settings (LongFact and HealthBench), token-level probes significantly outperform the baseline methods on the two main models (Table 1). Simple linear probes consistently achieve AUC values above 0.85, and LoRA probes push the AUC above 0.89.

In contrast, the uncertainty-based baselines perform poorly, with AUC values no higher than 0.76. In the short-form setting (TriviaQA), the baselines do better than in the long-form settings, but the probes still lead: LoRA probes consistently achieve AUC values above 0.96, and linear probes also perform well. Notably, the probes also achieve strong results on the MATH dataset. This out-of-distribution performance indicates that the method captures signals of correctness that generalize beyond the fabricated entities it was originally trained to detect.

The authors replicated the long-form results on three secondary models, training each probe with only 2,000 annotated samples drawn from that model's own long-form generations. The results are similar: LoRA probes again outperform linear probes, with AUC values on LongFact generations ranging from 0.87 to 0.90. Full results for the secondary models appear in Table 5.

Although the AUC values of LoRA probes approach or exceed 0.9 in multiple settings, R@0.1 on long-form text is at most about 0.7; that is, at a 10% false-positive rate, the detector catches roughly two-thirds of hallucinated entities. These results highlight the practical gains over standard uncertainty-based baselines, but they also indicate that there is still room for improvement before such methods can be deployed widely in high-risk settings.
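
For reference, R@0.1 is simply the true-positive rate read off the ROC curve at a 10% false-positive rate; the short scikit-learn sketch below shows one way to compute both metrics (the linear interpolation between ROC points is an illustrative choice).

```python
# Sketch: AUC and recall at 10% false-positive rate (R@0.1) from detector scores.
# The interpolation between ROC points is an illustrative choice.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def auc_and_recall_at_fpr(labels, scores, target_fpr=0.1):
    """labels: 1 for hallucinated entities, 0 for supported ones.
    scores: detector scores, higher meaning more likely hallucinated."""
    auc = roc_auc_score(labels, scores)
    fpr, tpr, _ = roc_curve(labels, scores)
    # True-positive rate at the chosen false-positive rate, i.e. the fraction
    # of hallucinated entities caught while tolerating 10% false alarms.
    recall_at_fpr = float(np.interp(target_fpr, fpr, tpr))
    return auc, recall_at_fpr
```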

This article is from the WeChat official account “Machine Intelligence” (ID: almosthuman2014), edited by +0