The second half of Agent evaluation: Why do we need a "living" benchmark?
[Introduction] Claw-Eval-Live proposes the concept of a "living" benchmark. Through signal collection and task screening, it keeps the evaluation content aligned with real enterprise pain points instead of relying on a fixed question bank. The evaluation tracks the execution process as well as the final result, verifying the Agent's real capabilities end to end, from data invocation to state changes.
Today's AI Agents increasingly resemble "digital employees who can do real work": they can call APIs, query databases, write emails, modify code, arrange schedules, and generate reports. The genuinely hard questions, however, are not about whether they "can talk" but two more practical matters: can they actually complete the tasks, and do the tasks we use to test them still represent the workflows that matter most in the real world right now.
Claw-Eval addresses the former, Claw-Eval-Live the latter. The former answers "how do we confirm that the Agent has truly completed the task," and the latter answers "how can the benchmark's question bank continuously keep up with real-world needs." This article discusses this continuous upgrade of the evaluation logic. In a sense, it also marks the Agent benchmark's entry into its "second half": no longer just comparing who answers questions better, but comparing who is closer to the real world.
Paper link: https://arxiv.org/abs/2604.28139
Paper link: https://arxiv.org/pdf/2604.06132
Are you sure the Agent actually did it?
Before Claw-Eval, the mainstream approach to Agent evaluation was to give the Agent a task, look at the final result, and judge whether it was correct. Was the file created? Was the test passed? Did the answer match? If so, it was considered a pass.
This may sound reasonable, but for Agents, such an evaluation has two fatal problems.
First, it only looks at the results and ignores the actions. The model submits a beautiful report, but did it really query the correct data sources? Did it really call the correct APIs? Or did it just "fabricate" an answer that looks correct? Recent research has shown that advanced models actively seek shortcuts in evaluations, bypassing the expected execution path to directly satisfy the final check. An evaluation that only looks at the results leaves room for exactly this behavior.
Second, it is hard to reflect real deployment requirements. A truly deployable Agent not only needs to complete the work but must also avoid doing things it shouldn't along the way and operate stably in an environment where APIs time out or services throw errors. In other words, the evaluation should cover not only "whether it can be done" but also "whether it can be done safely and robustly." Claw-Eval further incorporates multimodal and multi-turn conversation tasks into a unified evaluation framework. However, its most crucial contribution is to advance Agent evaluation from "only looking at the answers" to "looking at the actions."
Claw-Eval: Turning the Agent's execution process into auditable evidence
Claw-Eval includes 300 manually verified tasks covering three major groups: general service orchestration, multimodal perception and generation, and multi-turn professional conversations. A total of 2,159 independently verifiable scoring rules are defined.
Its core idea can be summarized in one sentence: turn the Agent's execution process into auditable evidence. Each evaluation runs in an isolated environment and proceeds in three stages: Setup, Execution, and Judge. While the Agent is running, the scoring script and reference answers are not visible inside the container. What is ultimately used for scoring is not just the final output but three independent evidence chains: the execution trajectory, the server audit log, and the post-execution environment snapshot.
On this basis, Claw-Eval then incorporates completion, security, robustness, and cross-modal tasks into the same evaluation framework.
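To make the three-stage structure concrete, here is a minimal, self-contained sketch of the Setup / Execution / Judge split. Every name and value is a hypothetical stand-in rather than Claw-Eval's actual harness; the structural point is that the agent never sees the grader, and the Judge stage scores the three evidence chains collected after execution.

```python
# Minimal sketch of the Setup / Execution / Judge split (hypothetical names).
from dataclasses import dataclass, field

@dataclass
class Environment:
    state: dict = field(default_factory=lambda: {"ticket_T1": "open"})
    audit_log: list = field(default_factory=list)

    def call_tool(self, name: str, **kwargs):
        self.audit_log.append({"tool": name, **kwargs})      # server-side record
        if name == "tickets.close":
            self.state[f"ticket_{kwargs['ticket_id']}"] = "closed"

def run_agent(env: Environment) -> list:
    # Execution: a toy agent closes a ticket and reports its own trajectory.
    env.call_tool("tickets.close", ticket_id="T1")
    return [{"thought": "close the ticket", "action": "tickets.close"}]

def judge(trajectory: list, audit_log: list, snapshot: dict) -> bool:
    # Judge: score from evidence, not from the agent's self-report alone.
    called_tool = any(e["tool"] == "tickets.close" for e in audit_log)
    state_changed = snapshot.get("ticket_T1") == "closed"
    return called_tool and state_changed

env = Environment()                                   # Setup: isolated fixtures
trace = run_agent(env)                                # Execution: grader not visible
print(judge(trace, env.audit_log, dict(env.state)))   # Judge: prints True
```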
The most crucial finding of Claw-Eval is actually very straightforward: if the process is not considered, Agent evaluation becomes systematically lenient.
The team ran a strict controlled experiment: a vanilla LLM judge was given the complete conversation record and the source code of the scoring script, but not the server audit log or the environment snapshot. It still missed 44% of security violations and 13% of robustness issues. For Agents, "only looking at the results" is therefore not merely imprecise; it systematically overestimates the model.
Claw-Eval surfaces other findings as well, such as error injection significantly reducing reliability (Pass^3 can drop by up to 24 percentage points) and the absence of a single champion across multimodal and multi-turn conversation capabilities. For this article, however, one conclusion matters most: an Agent benchmark should look not only at the answers but also at the actions.
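A brief aside on the metric: Pass^k is commonly defined as the fraction of tasks for which all k independent runs succeed, i.e. a reliability measure under repetition. The sketch below uses that common definition with made-up task names; Claw-Eval's exact scorer may differ.

```python
# Minimal sketch of the Pass^k idea under a common definition: a task counts
# only if all k independent runs of it succeed. Outcomes below are made up.
from statistics import mean

def pass_power_k(runs_per_task: dict) -> float:
    """Fraction of tasks whose every repeated run passed."""
    return mean(all(runs) for runs in runs_per_task.values())

# A task that passes 2 of 3 runs still counts as a failure under Pass^3.
results = {
    "crm-sync":  [True, True, True],
    "hr-report": [True, False, True],
}
print(pass_power_k(results))  # 0.5
```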
However, once "how to evaluate" is finally settled, another, more practical problem emerges: even if the scoring is perfectly reliable, if the workflows the benchmark measures have drifted away from real-world needs, then no matter how accurate the evaluation is, it may be measuring the wrong thing.
This is exactly the problem that Claw-Eval-Live aims to solve.
Being "accurate in evaluation" is not enough; benchmarks can become outdated
From here on, the problem is no longer just "how to evaluate" but "what to evaluate." This is also where Claw-Eval-Live truly comes in.
Claw-Eval solves the problem of "whether the scoring is reliable." However, like almost all existing benchmarks, it has a more fundamental limitation:
The task set is fixed.
The 300 tasks were fixed on the day of release. No matter how the external tool ecosystem changes, how the focus of enterprise workflows shifts, or whether what users most want the Agent to automate moves from writing daily reports to cross-system reconciliation, the benchmark's task distribution stays the same.
In traditional NLP evaluations, this is not a big problem because task forms such as "translating a passage" or "answering a question" are relatively stable. However, in Agent evaluations, this problem is significantly amplified. Agents are not faced with abstract language tasks but specific workflows. And workflows are constantly changing - the tool stack is iterating, enterprise pain points are shifting, some automation scenarios are emerging from scratch, and others are moving from the core to the periphery.
A benchmark can be completely reproducible technically, but the combination of tasks it measures may be quietly deviating from what users most want the Agent to do at the moment.
This deviation does not come from a specific task becoming "outdated" but from the task mix ratio itself. The most popular automation requirements six months ago are likely to be different from those today.
This is the problem that Claw-Eval-Live aims to solve.
What does a "living" benchmark look like?
When people hear "live benchmark," their first reaction is often: doesn't that mean it changes every day, making it impossible to compare?
Claw-Eval-Live's answer is not "let the benchmark change all the time" but:
Make each release a snapshot of the real world at that moment.
Its core is a two-layer separation design:
Signal Layer - When building a new release, instead of the team brainstorming "what should be tested," the process starts from public workflow demand signals, such as the ClawHub Top-500 popular skills, to observe which workflows deserve attention at the moment. It should be stressed that these signals are neither automatic question generators nor precise measurements of real demand. They are a public, verifiable demand prior that helps the benchmark decide which workflows this release should focus on.
Release Layer - The truly public benchmark is still a fixed, timestamped snapshot. Task definitions, execution environment, data fixtures, and scoring scripts are all locked, so models can be compared stably and results remain fully reproducible academically.
The two layers are connected by a five-stage pipeline:
Signal Collection: Capture a timestamped snapshot of the ClawHub Top-500, with each signal carrying its source and metadata.
Pattern Clustering: Aggregate fragmented skill names into stable workflow patterns - clustering not by the surface names of skills but by the underlying user goals, objects of operation, and execution environments.
Family Weighting: Determine the target weights of each task family based on the upstream signal strength. Workflows with stronger signals will have a larger proportion in the release.
Seed Expansion and Screening: Expand the weighted patterns into executable task candidates. After trial runs and screening, only candidates that are runnable, reproducible, and produce meaningful score separation are retained - from 178 generated candidates down to 157.
Discrimination-Optimized Selection: Use mixed-integer linear programming (MILP) to select 105 public tasks from the 157 candidates, while balancing three considerations - release size, family coverage, and leaderboard discrimination.
The MILP here is not mechanically chasing "diversity"; it makes three things explicit: how large the public release should be, that every family must receive at least minimal coverage, and that the selected set must genuinely distinguish between models. Turning these otherwise vague curation judgments into auditable constraints is how Claw-Eval-Live makes the construction of each release transparent. A minimal sketch of what such a selection problem can look like follows.
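The sketch below assumes hypothetical candidate fields (`id`, `family`, `discrimination`) and uses the open-source PuLP library as the solver; it illustrates the shape of such a selection problem, not Claw-Eval-Live's actual formulation.

```python
# Hypothetical discrimination-optimized task selection via MILP (PuLP).
import pulp

def select_release(candidates, release_size=105, min_per_family=2):
    # candidates: list of dicts like {"id": str, "family": str, "discrimination": float}
    prob = pulp.LpProblem("release_selection", pulp.LpMaximize)
    x = {c["id"]: pulp.LpVariable(f"x_{c['id']}", cat="Binary") for c in candidates}

    # Objective: maximize total leaderboard discrimination of the selected set.
    prob += pulp.lpSum(c["discrimination"] * x[c["id"]] for c in candidates)

    # Constraint 1: the public release has a fixed size.
    prob += pulp.lpSum(x.values()) == release_size

    # Constraint 2: every task family is covered by at least a few tasks.
    families = {c["family"] for c in candidates}
    for fam in families:
        prob += pulp.lpSum(x[c["id"]] for c in candidates if c["family"] == fam) >= min_per_family

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [c["id"] for c in candidates if x[c["id"]].value() == 1]
```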
The current public release spans 105 tasks across 22 task families, evaluated on 13 advanced models. The tasks fall into two major execution environments - 87 service-driven business workflows (involving 18 controlled services such as CRM, email, calendar, finance, and work orders) and 18 local workspace repair tasks (terminal operations, environment repair, and configuration debugging).
Each task is not just a prompt but a complete executable evaluation unit: a task definition (task.yaml), tool interfaces, data fixtures, and a dedicated scoring script (grader.py), all of which are indispensable. Scoring follows Claw-Eval's evidence-anchoring principle - across the release, the three most common types of deterministic evidence are: data retrieval (whether the correct tools and data sources are called), data accuracy (whether entities and values match the ground truth), and action verification (whether the required state changes actually occur). Only where these deterministic checks cannot cover semantic dimensions (such as report organization quality or summary coherence) is a structured LLM judge introduced. A sketch of what such deterministic checks can look like follows.
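As an illustration only, here is a minimal sketch of a grader.py-style deterministic check, assuming a hypothetical evidence bundle on disk (audit_log.json, snapshot.json) with made-up tool names, field names, and ground-truth values; it is not Claw-Eval-Live's actual grading interface.

```python
# Hypothetical deterministic checks in the spirit of a grader.py:
# data retrieval, data accuracy, and action verification.
import json

def grade(evidence_dir: str) -> dict:
    with open(f"{evidence_dir}/audit_log.json") as f:
        audit = json.load(f)        # tool calls as recorded server-side
    with open(f"{evidence_dir}/snapshot.json") as f:
        snapshot = json.load(f)     # environment state after execution

    checks = {
        # Data retrieval: were the right tools and data sources actually called?
        "retrieved_from_crm": any(e.get("tool") == "crm.search_contacts" for e in audit),
        # Data accuracy: do reported entities/values match the fixture's ground truth?
        "correct_total": snapshot.get("report", {}).get("total") == 48200,
        # Action verification: did the required state change actually happen?
        "ticket_closed": snapshot.get("tickets", {}).get("T-1042") == "closed",
    }
    return {"checks": checks, "pass": all(checks.values())}
```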
Therefore, from the perspective of project evolution, the two works are in the same vein:
Claw-Eval solves the problem of "reliable scoring" - allowing us to see what the Agent has actually done.
Claw-Eval-Live solves the problem of "keeping the question bank in line with reality" - so the benchmark no longer rests on a fixed set of questions but keeps refocusing on the workflows most worth testing right now.
What do we see when the benchmark truly approaches reality?
The results of 13 advanced models on the current release are straightforward and sobering.
The overall ceiling is still very low
No model has exceeded a 70% pass rate. The gap between the top and the bottom is 22.9 percentage points. Real-world workflow automation is far from the stage of "reliable deployment."
It is worth noting that models with similar pass rates can differ noticeably in completion. MiMo V2 Pro, Kimi K2.5, and Gemini 3.1 Pro all have a 53.3% pass rate, but their Overall Completion ranges from 74.0 to 76.9. This shows that some models are not incapable of the tasks but often "almost finish" - the problem lies not in language ability but in the execution loop.
The truly impactful discovery: the difficult tasks are not what you think
Going by intuition, many people would guess that the hardest tasks are terminal operations and environment repair, the ones that demand hardcore technical skills.
Claw-Eval-Live's results are exactly the opposite.
In the grouped heatmap, the Development / Terminal group has nearly hit the ceiling for strong models: Claude Opus 4.6, GPT-5.4, and Claude Sonnet 4.6 all reach 100% there, and even the weakest model is above 72.2%. The genuinely difficult tasks are business tasks such as HR / People, Management / Ops, and cross-system workflows. In the HR / People group, no model exceeds 22.2%, and several have a pass rate of 0.
Looking further at the fine-grained families, the conclusion is even more striking. The average pass rate of HR is only 6.8%; MGMT fails completely under the public pass rules; the average pass rate