Surpassing Claude Mythos and GPT-5.5, Stanford's Agent Verification Framework Achieves SOTA and Is Shared by a Transformer Co-Author
Lukasz Kaiser, a co-author of the Transformer paper, and Bing Xu, a co-author of the GAN paper, have reposted and expressed interest in a new piece of work:
the LLM-as-a-Verifier verification framework. The method is a general verification mechanism that can be combined with any Agent Harness and model.
It was jointly developed by Stanford, Berkeley, and NVIDIA.
The research shows that scaling verification compute significantly improves overall Agent performance, enough to outperform Claude Mythos and GPT-5.5 on the most influential AI coding benchmark, Terminal-Bench!
LLM-as-a-Verifier achieves state-of-the-art (SOTA) performance on both of the AI coding benchmarks Terminal-Bench and SWE-Bench Verified.
Method
Most Agent Harnesses actually "have" the ability to solve problems.
Run the same Agent many times (for example, 100 times), and it will often produce the correct answer in at least one of the attempts.
The problem is that the harness cannot tell which attempt is the correct one.
This gap is especially severe in long-horizon tasks.
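To make the gap concrete, here is a toy illustration (not from the paper): even a modest per-attempt success rate means at least one of 100 runs almost always succeeds, yet nothing in the harness identifies which run that was.

```python
import random

# Toy illustration (not from the paper): an agent that solves a task 30% of
# the time per attempt almost always succeeds in at least one of 100 runs.
random.seed(0)
attempts = [random.random() < 0.3 for _ in range(100)]  # True = correct run
print(f"pass@100: {any(attempts)}, correct runs: {sum(attempts)} of 100")
# The harness sees 100 finished trajectories, but nothing tells it which
# ones are correct -- that is the verification gap this work targets.
```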
LLM-as-a-Verifier substantially improves verification ability, and with it the downstream task success rate, by scaling score granularity, the number of repeated verifications, and criteria decomposition.
The team also found that as score granularity increases, the score gap between positive and negative samples widens further.
Core Problem: Limitations of LLM-as-a-Judge
The standard LLM-as-a-Judge prompts the model to output a score (for example, between 1 and 8) and takes the score with the highest probability as the final discrete score.
This scoring granularity is often too coarse.
When comparing long-horizon Agent trajectories, LLM-as-a-Judge frequently assigns the same score to different trajectories (for example, rating both of them 4), producing a tie and failing to distinguish them.
On Terminal-Bench, this coarse-grained scoring results in ties in 27% of comparisons, limiting the judge's accuracy and discriminative power.
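As a sketch of that argmax behavior (the score tokens and log-probabilities below are hypothetical, not from the paper), note how two clearly different distributions collapse to the same discrete score:

```python
def judge_score(score_token_logprobs: dict[str, float]) -> int:
    """Standard LLM-as-a-Judge: pick the single most probable score token."""
    best_token = max(score_token_logprobs, key=score_token_logprobs.get)
    return int(best_token)

# Hypothetical extracted log-probabilities over score tokens for two trajectories.
traj_a = {"3": -1.6, "4": -0.7, "5": -1.2}   # argmax -> "4"
traj_b = {"4": -0.9, "5": -1.0, "6": -1.8}   # argmax -> "4"
print(judge_score(traj_a), judge_score(traj_b))  # 4 4 -> tie
```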
LLM-as-a-Verifier: Paradigm Shift from Scoring to Verification
By definition, a judge forms an overall impression of a situation and renders a verdict, whereas a verifier checks the truth and correctness of specific claims, which demands more detailed, specific evaluation.
Therefore, the team proposed LLM-as-a-Verifier. It provides fine-grained feedback by scaling along the following three dimensions:
Granularity of score tokens
Number of repeated verifications
Decomposition of evaluation criteria
Given a task $t$ and two candidate trajectories $\tau_A$ and $\tau_B$, LLM-as-a-Verifier constructs a scoring prompt and obtains the corresponding conditional distributions $p(s \mid t, \tau_A)$ and $p(s \mid t, \tau_B)$ by extracting the top log-probabilities at the <score_A> and <score_B> positions.
LLM-as-a-Verifier defines the reward of a trajectory $\tau$ as:

$$R(\tau) = \frac{1}{C \cdot K} \sum_{c=1}^{C} \sum_{k=1}^{K} \sum_{s \in \mathcal{S}} p_{c,k}(s \mid t, \tau)\, v(s)$$

Where:
C = number of evaluation criteria
K = number of repeated verifications
G = number of score tokens (the granularity level), i.e. $|\mathcal{S}| = G$
$p_{c,k}(s \mid t, \tau)$ = the model's probability of score token $s$ under criterion $c$ in verification run $k$
$v(\cdot)$ = a function that maps each score token to a scalar value
$\mathcal{S}$ = the set of discrete score tokens
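A minimal sketch of this reward under simplifying assumptions: the score tokens are the strings "1" through "G", v(s) is the integer value of the token, and the extracted top log-probabilities are passed in directly (the data layout here is hypothetical):

```python
import math

def verifier_reward(token_logprobs: list[list[dict[str, float]]]) -> float:
    """Average expected score over C criteria x K repeated verifications."""
    total, runs = 0.0, 0
    for per_criterion in token_logprobs:      # C criteria
        for logprobs in per_criterion:        # K repeated verifications
            probs = {s: math.exp(lp) for s, lp in logprobs.items()}
            z = sum(probs.values())           # renormalize over the top tokens
            total += sum(p / z * int(s) for s, p in probs.items())
            runs += 1
    return total / runs

# The same hypothetical distributions that tied under argmax now separate:
traj_a = [[{"3": -1.6, "4": -0.7, "5": -1.2}]]   # C = 1, K = 1
traj_b = [[{"4": -0.9, "5": -1.0, "6": -1.8}]]
print(verifier_reward(traj_a), verifier_reward(traj_b))  # ~4.10 vs ~4.74
```

Taking the expectation over the score distribution instead of the argmax is what turns the 4-vs-4 tie above into a clear 4.10-vs-4.74 separation.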
When selecting the best trajectory, the method runs a round-robin tournament: for each pair of candidate trajectories (i, j), the verifier computes both rewards with the formula above.
The trajectory with the higher reward wins the comparison, and the trajectory with the most wins across all pairings is selected as the final result.
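A minimal sketch of that tournament, assuming a `pairwise_rewards` callable that wraps the scoring prompt and returns the two rewards for a pair (a hypothetical stand-in for the real verifier call):

```python
from itertools import combinations

def round_robin_best(trajectories, pairwise_rewards):
    """Return the index of the trajectory with the most pairwise wins."""
    wins = [0] * len(trajectories)
    for i, j in combinations(range(len(trajectories)), 2):
        r_i, r_j = pairwise_rewards(trajectories[i], trajectories[j])
        # With fine-grained rewards, exact ties are effectively eliminated,
        # so a simple comparison decides each match.
        if r_i >= r_j:
            wins[i] += 1
        else:
            wins[j] += 1
    return max(range(len(trajectories)), key=wins.__getitem__)
```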
Experimental Results
On complex long-horizon benchmarks such as Terminal-Bench 2.0 and SWE-Bench Verified, LLM-as-a-Verifier outperforms cutting-edge models across the board and achieves state-of-the-art (SOTA) performance. All experimental results are taken from the official leaderboards.
LLM-as-a-Verifier can be seamlessly integrated into different Agent Harness frameworks. Its generality is verified in the following three settings:
ForgeCode: accuracy improves to 86.4%;
Terminus-Kira: accuracy improves to 79.4%;
Terminus 2: accuracy improves to 71.2%.
This shows that the verification method is compatible with, and improves the performance of, any Agent Harness and model.
LLM-as-a-Verifier leads the traditional LLM-as-a-Judge across the board on both verification accuracy and tie elimination.
Even when the number of repeated verifications is increased (for example, to k = 16), the Verifier maintains an advantage of at least 7% in verification accuracy.
It also eliminates ties entirely.
The experiments show that increasing the score-token granularity and the number of repeated verifications both significantly improve verification accuracy.
Moreover, refining the score-token granularity (1→20) greatly reduces quantization error, bringing the estimated reward closer to the true reward.
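A toy illustration of the quantization effect (the numbers are illustrative, not the paper's): mapping a continuous quality value onto G discrete levels leaves at most half a level of rounding error, and that error shrinks as G grows.

```python
def quantize(quality: float, g: int) -> float:
    """Round a quality in [0, 1] to the nearest of g evenly spaced levels."""
    levels = g - 1
    return round(quality * levels) / levels

true_quality = 0.62
for g in (2, 8, 20):
    q = quantize(true_quality, g)
    print(f"G={g:2d}: score={q:.3f}, error={abs(q - true_quality):.3f}")
# G= 2: score=1.000, error=0.380
# G= 8: score=0.571, error=0.049
# G=20: score=0.632, error=0.012
```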
LLM-as-a-Verifier abandons the traditional single-score mechanism and decomposes trajectory verification into three composable evaluation criteria (a sketch follows the list below):
Specification: whether the trajectory meets all task requirements (paths, naming, etc.).
Output Format: whether the output matches the expected format.
Error Checking: whether the trajectory contains obvious error signals.
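As a sketch of how such a decomposition might be wired up (the prompt wording is illustrative, not taken from the paper), each criterion gets its own verification prompt, and the resulting score distributions feed the C-way average in the reward formula above:

```python
# Hypothetical per-criterion verification questions (illustrative wording).
CRITERIA = {
    "specification": "Does the trajectory satisfy every task requirement (paths, naming, etc.)?",
    "output_format": "Does the final output match the expected format?",
    "error_checking": "Does the trajectory contain obvious error signals (crashes, failed tests)?",
}

def build_verification_prompts(task: str, trajectory: str) -> list[str]:
    """One scoring prompt per criterion; each yields its own score distribution."""
    return [
        f"Task: {task}\nTrajectory: {trajectory}\n"
        f"Criterion: {question}\nAnswer with a single score token from 1 to 20: <score>"
        for question in CRITERIA.values()
    ]
```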
Compared with the traditional LLM-as-a-Judge, the LLM-as-a-Verifier framework combines finer score granularity, repeated verification, and criteria decomposition to achieve higher verification accuracy and sharper discrimination, eliminating scoring ties. This not only improves Agent performance but also significantly enhances the model's safety and stability on long-horizon tasks.
Team Introduction
This project is led by Jacky Kwok, a CS Ph.D. student at Stanford University. Major contributors include Shulu Li, an EECS Ph.D. student at UC Berkeley. The corresponding authors are Ion Stoica (professor at UC Berkeley and co-founder of Databricks), Azalia Mirhoseini (professor at Stanford, formerly at DeepMind and Anthropic), and Marco Pavone (Director of AI and Autonomous Driving Research at NVIDIA).
Blog: llm-as-a-verifier.notion.site
Code: llm-as-a-verifier.github.io
Contact: jackykwok@stanford.edu
This article is from the WeChat official account “QbitAI”. Author: LLM-as-a-Verifier. Reposted with permission by 36Kr.