
More accurate than GPT-5? An open-source model hits 99.9% accuracy on AIME25 for the first time, and the result has gone viral.

新智元 · 2025-08-25 11:49
Meta has proposed DeepConf: confidence filtering that improves both model accuracy and efficiency, reaching 99.9% accuracy on AIME while cutting generated tokens by 85%.

DeepConf was proposed by Meta AI and the University of California, San Diego. Its core idea is to let large models monitor their own confidence in real time during inference: low-confidence paths are dynamically pruned, while high-confidence paths take part in a weighted vote, balancing accuracy and efficiency. On AIME 2025, it made an open-source model reach 99.9% accuracy without external tools for the first time, while reducing generated tokens by 85%.

How can we make the model think more intelligently and efficiently while being confident in its answers?

Recently, research teams from Meta AI and the University of California, San Diego provided an exciting answer: Deep Think with Confidence (DeepConf), which lets the model think deeply with confidence.

Paper link: https://arxiv.org/pdf/2508.15260

Project homepage: https://jiaweizzhao.github.io/deepconf

Through parallel thinking and "confidence screening," the new method enabled the model to reach a remarkable 99.9% accuracy on AIME 2025, a top international mathematics competition.

It can be said that this is the first time an open-source model has reached 99.9% on AIME 2025 without using any tools!

Moreover, while maintaining high-quality reasoning, it reduced the number of generated tokens by 84.7%.

DeepConf also brings several significant advantages to parallel thinking:

  • Performance boost: across models and datasets, accuracy improved by roughly 10% on average.
  • Extreme efficiency: generated tokens were reduced by up to 85%.
  • Plug-and-play: compatible with any existing model, with no additional training (and no hyperparameter tuning!).
  • Easy deployment: it can be integrated into vLLM with only about 50 lines of code.

Take DeepConf's reasoning on problem 11 of HMMT 25 (the Harvard-MIT Mathematics Tournament) as an example.

The core idea is that DeepConf filters reasoning paths through "confidence signals" to obtain high-quality answers and strike a balance between efficiency and accuracy.

  • Horizontal axis (token index): Represents the inference steps generated by the model (increasing step by step with tokens).
  • Vertical axis (confidence): Represents the confidence level of each inference path at that step.
  • Green curve: Represents the confidence trajectories of different inference paths. Darker green indicates higher confidence.
  • Red cross: Inference paths below the confidence threshold are dynamically filtered out.
  • Green checkmark: high-confidence paths that are finally retained.
  • Final vote: under confidence-weighted majority voting, the retained paths converge on a single answer: 29.

During generation, DeepConf continuously monitors the confidence of reasoning paths. Low-confidence paths are promptly eliminated, and only "more confident" paths are retained to improve overall accuracy.

In the accuracy-comparison curve above, the vertical axis is accuracy. The yellow curve (DeepConf) sits clearly above the blue curve (standard method).

This indicates that DeepConf can achieve a higher accuracy rate under the same voting scale.

In the figure below, the horizontal axis represents the number of tokens (the computational cost required for inference). While maintaining a high accuracy rate, the yellow curve consumes significantly fewer tokens.

This indicates that DeepConf significantly reduces the generation of invalid tokens and has better inference efficiency.

DeepConf enables the model to stop "random thinking" and efficiently follow high - confidence inference tracks.

DeepConf supports two working modes:

  • Offline mode: after reasoning paths are complete, filter them by confidence, then weight their votes by quality.
  • Online mode: monitor confidence in real time and stop generation as soon as it drops below the threshold.

What's the secret of DeepConf?

Actually, LLMs know when they start to be uncertain, but people have never really paid attention to their "thinking process."

Previous methods applied confidence or entropy measures only after generation was complete, for evaluation or reinforcement learning (RL).

DeepConf's method is different. It captures inference errors during the generation process rather than after completion.

DeepConf monitors "local confidence" in real time and terminates erroneous reasoning paths before they consume thousands of tokens.

Only high-quality, high-confidence reasoning paths are retained!
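The online-mode idea can be sketched in a few lines of Python. Everything here is an illustrative stand-in: the window size, the threshold, and the per-token confidence values fed in are placeholders, not the paper's tuned settings.

```python
def generate_with_early_stop(conf_stream, window=5, threshold=1.0):
    """Online DeepConf sketch: consume per-token confidence values as
    they are produced and terminate the trace as soon as the mean over
    the most recent `window` tokens falls below `threshold`.
    Returns the number of tokens actually generated."""
    recent, n = [], 0
    for n, c in enumerate(conf_stream, start=1):
        recent.append(c)
        if len(recent) > window:
            recent.pop(0)
        if len(recent) == window and sum(recent) / window < threshold:
            return n  # stopped early: the rest of the trace is never generated
    return n
```

A trace whose confidence collapses partway through is cut off within a few tokens of the collapse, which is exactly where the large token savings come from.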

How does DeepConf "filter with confidence and vote with confidence"?

This figure shows the core mechanism of DeepConf during offline thinking:

It first determines which inference paths are trustworthy, eliminates unreliable paths in advance, and then allows reliable paths to conduct weighted voting to obtain a more accurate and efficient final answer.

First, it's about "how certain" each token is.

When the model is writing inference steps, each word (token) actually has a "confidence value" behind it.

If the model thinks "this step's answer is reliable," the confidence value is high. If it's unsure, the confidence value is low.

In the above figure, different shades of green and red are used to mark: green = more confident, red = less confident.
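As a concrete illustration of a per-token confidence value: one common formulation, assumed here rather than taken verbatim from the paper, averages the negative log-probabilities of the model's top-k candidate tokens at each step. The probability values below are made up for illustration.

```python
import math

def token_confidence(topk_probs):
    """Confidence of one generated token, taken here as the negative
    mean log-probability of the model's top-k candidate tokens: a
    peaked ("sure") distribution drives the non-top candidates toward
    zero probability, making this average large."""
    return -sum(math.log(p) for p in topk_probs) / len(topk_probs)

# Illustrative distributions, not real model outputs:
sure = token_confidence([0.97, 0.01, 0.01, 0.01])    # peaked: high confidence
unsure = token_confidence([0.25, 0.25, 0.25, 0.25])  # flat: low confidence
```

With values like these, the peaked distribution scores well above the flat one, matching the green-vs-red coloring in the figure.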

Second, not only should we look at individual tokens, but also the overall trend.

DeepConf doesn't just look at a single word. Instead, it uses a sliding window to check the average confidence value in a short passage to measure "whether this passage is reliable as a whole."

It pays special attention to the confidence of the last few sentences, because the final answer and conclusion usually appear at the end.

DeepConf also records the worst step in an inference chain. If there is an obvious "mishap" in the middle, this path is less reliable.

In this way, each complete inference link will get a comprehensive "confidence score."
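Those three signals can be combined into a single score per trace. The sketch below is one plausible way to do it; the window size, tail length, and the min() combination are assumptions for illustration, not the paper's exact aggregation.

```python
def sliding_window_confidences(conf, window=3):
    """Mean per-token confidence over each window of consecutive tokens."""
    return [sum(conf[i:i + window]) / window
            for i in range(len(conf) - window + 1)]

def trace_score(conf, window=3, tail=4):
    """Composite score for one finished reasoning trace, combining:
    - the lowest windowed confidence (the "worst step" in the chain), and
    - the tail confidence (mean over the last `tail` tokens, where the
      final answer is usually stated).
    Taking the min of the two is one simple way to combine them."""
    worst_group = min(sliding_window_confidences(conf, window))
    tail_conf = sum(conf[-tail:]) / tail
    return min(worst_group, tail_conf)
```

A trace with one obvious "mishap" in the middle scores below a uniformly confident trace of the same length, which is the behavior the text describes.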

Finally, it's about elimination first and then voting.

When the model generates many different inference paths in parallel:

  • Step 1: Filtering. Sort the "confidence scores" and discard the worst 10% directly to avoid waste.
  • Step 2: Voting. Among the remaining inference chains, instead of simply counting votes, weighted voting is conducted according to confidence.

That is to say, the opinion of a high-confidence path carries more weight; a low-confidence path, even if its answer is the same, won't increase the vote weight much.
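The two steps above can be sketched as follows. The `deepconf_vote` helper, the 90% keep ratio, and the use of the raw confidence score as the vote weight are illustrative choices, not the paper's exact implementation.

```python
from collections import defaultdict

def deepconf_vote(traces, keep_ratio=0.9):
    """Offline DeepConf sketch: drop the lowest-scoring 10% of traces,
    then let each survivor vote for its answer with a weight equal to
    its confidence score."""
    ranked = sorted(traces, key=lambda t: t[1], reverse=True)
    kept = ranked[:max(1, int(len(ranked) * keep_ratio))]
    votes = defaultdict(float)
    for answer, score in kept:
        votes[answer] += score
    return max(votes, key=votes.get)

# Five traces as (answer, confidence score); the weakest one is discarded.
result = deepconf_vote([("109", 3.0), ("109", 2.8), ("103", 0.5),
                        ("104", 0.4), ("98", 0.2)])
```

Here two confident traces agreeing on "109" easily outweigh three low-confidence dissenters, mirroring the figure's outcome.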

Finally, look at the results: on the right side of the figure, some paths say the answer is 109, while others say 103, 104, or 98.

However, since more paths support "109" and they carry higher confidence, 109 is selected as the final answer by the vote.

Achieving a 99.9% accuracy rate, higher than GPT-5

Offline mode results: Achieved a 99.9% accuracy rate in AIME 2025 (the baseline is 97%)!

Universal gains were achieved on 5 models × 5 datasets.

A stable accuracy improvement of about 10% was achieved under all settings.

Online mode results: Saved 33% - 85% of tokens in all benchmark tests!

In the AIME 2025 benchmark, using GPT-OSS-120B, DeepConf still achieved 97.9% accuracy with an 85% reduction in token consumption.

This method applies to various open-source models from 8B to 120B, achieving real-time efficiency without sacrificing quality.

Benchmarking confidence metrics in an offline environment. The reported values are accuracy rates (%).

Cons@512 and mean@512 denote majority voting over 512 reasoning trajectories and the mean of average confidence, respectively. All experiments were repeated 64 times.

Benchmarking DeepConf in an online environment.

With a voting budget of 512, accuracy (%) and the number of generated tokens (×10⁸) are reported for majority voting and for the DeepConf (high/low) methods.

Deep thinking based on confidence

The researchers' thinking is: How can we use "confidence" more skillfully to make the model think more accurately and faster?

As mentioned above, there are two usage scenarios here:

  • Offline thinking: after the model has written complete reasoning paths, evaluate each path's confidence retrospectively and aggregate the reliable results. The advantage of this approach is that it maximizes answer accuracy.
  • Online thinking: consult confidence in real time as the model generates its reasoning step by step. If a line of thought proves unreliable, stop it in time to avoid wasting compute. This can filter