
Claude's pass rate is under 4%: SaaS-Bench shatters the "fully automated office" fantasy of Computer-Use

机器之心 2026-05-25 10:23
All SaaS applications targeting humans may need to be rebuilt for Agents.

Imagine a real workday: project managers need to update project statuses, finance personnel need to organize customer bills, and medical administrators need to verify appointment and insurance information.

These are not tasks for senior experts. In many cases, a diligent intern can complete them by following the procedures.

However, for today's AI Agents, these "daily tasks" are far harder than they look.

An Agent has to understand the business goal, find information across applications, keep state consistent, and still get every detail right in the system after dozens or even hundreds of steps.

This is the reality SaaS-Bench aims to reveal: Agents need to do more than click buttons and fill out forms. They need to complete long-horizon tasks in a real office.

If an Agent cannot reliably complete tasks that an intern handles every day, we need to re-evaluate: how far are we from a truly usable Agent?

Blog link: https://unipat.ai/blog/SaaS-Bench

GitHub link: https://github.com/UniPat-AI/SaaS-Bench

Paper link: https://arxiv.org/abs/2605.15777

The "singularity" of Computer-Use Agents has not arrived, but the cold water of reality has splashed down first.

In the past year, various GUI Agents have rushed to claim that they can do work on behalf of humans. Benchmark scores have soared, investors are excited, and the media is in a frenzy. "Fully automated office" seems to be just around the corner.

However, UniPat AI has just proven with a set of data that all of this is built on sand!

Leaderboard

23 real systems, 106 tasks: a brutal real-world test

Put simply, existing Agent evaluations use simulated environments and simple tasks that take a few dozen steps at most.

This is completely different from real work.

What does real-world office work look like? A medical administrator writes a SOAP medical record → fills out a case report → generates an official document. A finance clerk receives a reimbursement application → approves it → makes the payment → records it in the books. The work spans several systems, and the step count easily reaches the hundreds.

The idea of SaaS-Bench is very straightforward: move real systems directly into Docker and let Agents work against real front-end and back-end logic, database states, and business constraints.

SaaS-Bench tasks: real work scenarios

SaaS-Bench carefully selected 23 open-source SaaS (Software-as-a-Service) systems, all deployed locally through Docker, retaining complete front-end and back-end logic, database states, and business constraints. It covers six professional fields:

Software R&D: OpenProject, Baserow, Code-Server, Metabase

Business Finance: Twenty CRM, BigCapital, HRMS, Pretix

Medical Management: OpenEMR, OpnForm, OnlyOffice

Team Collaboration: SiYuan, Roundcube, Mattermost, ownCloud

Agricultural Supply Chain: FarmOS, Grocy, Recipya, E-Label

Independent Media: PhotoPrism, MediaCMS, BookLore, Watcharr

More importantly, these systems are not "empty web pages": each one is populated with real-world business data, including entity records such as users, projects, orders, and files. Agents enter a real work environment with historical data, distractors, and cross-system associations, rather than a blank test page.

Three-layer distribution of task modality, domain, and app

Among the 106 tasks, 93.4% span at least two applications, and half (53) are three-application tasks. There are 74 pure-text tasks and 32 tasks involving multimodal understanding. Based on the execution trajectory of Claude Opus 4.6, 97.3% of text tasks have more than 100 operation steps, and the longest trajectory reaches over 300 steps.

Task difficulty analysis: most tasks are Cross-App + Long-Horizon

Where do these tasks come from, and how is an Agent's ability to operate the systems evaluated?

SaaS-Bench builds its tasks with an "LLM generation + expert review" method:

First, the LLM generates tasks around the six professional fields and specific professional roles, spells out task goals, cross-application dependencies, and verification requirements, and reduces ambiguities and loopholes over multiple rounds of revision.

Experts then screen the tasks manually and check them by actually executing them, judging whether each task is professional, natural, achievable, and verifiable. Tasks with redundant steps, muddled logic, or inaccurate verification are revised or removed, so that every task can actually be run and accurately scored by the verifier.

Task construction flow chart: four stages ensure task quality

SaaS-Bench lets Agents operate the computer in the SaaS environment through Browser-Use and reports two metrics:

Resolved Score (strict, full pass): a task counts as 1 only when every checkpoint passes; otherwise it counts as 0.

Checkpoint Score (lenient, partial credit): the weighted fraction of checkpoints that passed.
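The two metrics can be sketched in a few lines. This is only an illustration of the definitions above, not SaaS-Bench's actual verifier: the `Checkpoint` class, its field names, and the example weights are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    """One verifiable condition of a task (hypothetical structure)."""
    name: str
    weight: float
    passed: bool


def resolved_score(checkpoints):
    """Strict metric: 1 only if every checkpoint passed, otherwise 0."""
    return 1 if checkpoints and all(c.passed for c in checkpoints) else 0


def checkpoint_score(checkpoints):
    """Lenient metric: weighted fraction of checkpoints that passed."""
    total = sum(c.weight for c in checkpoints)
    if total == 0:
        return 0.0
    return sum(c.weight for c in checkpoints if c.passed) / total


# Made-up example: two of three checkpoints pass.
example = [
    Checkpoint("create customer", 0.5, True),
    Checkpoint("issue invoice", 0.3, True),
    Checkpoint("record payment", 0.2, False),
]
print(resolved_score(example))              # 0: one checkpoint failed
print(round(checkpoint_score(example), 2))  # 0.8: 80% of the weight passed
```

A task can therefore earn most of its checkpoint credit and still count as unresolved, which is why the two numbers can sit far apart.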

Overview diagram of Agent → Browser-Use → Execution → Verification → Scoring

As the results below show, the huge gap between these two numbers exposes the core problem of today's Agents.

The leaderboard is out: all failed

Let's look at the numbers.

Main results (DeepSeek V4, M2.7, and GLM5.1 are single-modal models, evaluated only in the Text-Only Domain)

The strongest model, Claude Opus 4.7, has a checkpoint score of 43.9%, but its end-to-end resolved score is only 3.8%: out of 106 tasks, only 4 were fully passed. What about Kimi K2.5 and Gemini 3.1 Pro? Their resolved score is zero. Not a single task was completed.

What these numbers mean is brutal: Agents can push some intermediate stages of the work forward, but they can hardly finish a complete long-horizon workflow.

Can running multiple times save the situation?

Pass@k results of four models

Each model is run 3 times independently on the same task, and the task counts as passed if any one run passes. Overall, pass@3 improves on pass@1 by about 8 percentage points.
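The "passes if any run passes" rule can be sketched as follows. The article does not say whether the rule is applied per task or per checkpoint, so this sketch stays generic over "items", and the function name and run data are hypothetical.

```python
def pass_at_k(runs_per_item, k):
    """Fraction of items that pass in at least one of their first k runs.

    runs_per_item maps an item id (a task or a checkpoint, an assumption
    of this sketch) to a list of booleans, one per independent run.
    """
    if not runs_per_item:
        return 0.0
    passed = sum(1 for runs in runs_per_item.values() if any(runs[:k]))
    return passed / len(runs_per_item)


# Made-up results for four items, three independent runs each.
runs = {
    "item_1": [False, True, False],
    "item_2": [False, False, False],
    "item_3": [True, True, True],
    "item_4": [False, False, True],
}
print(pass_at_k(runs, 1))  # 0.25: only item_3 passes on the first run
print(pass_at_k(runs, 3))  # 0.75: item_1 and item_4 pass on a later run
```

An item that passes only on a later run raises pass@3 without raising pass@1, so a large gap between the two signals unstable execution rather than missing capability.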

Sonnet 4.6 jumped from 33.9% to 52.1% (+18.2pp) on multimodal tasks: it is not incapable, but its execution is extremely unstable.

This is not environmental randomness, because the initial state of every run is exactly the same. It is path dependence: a small difference at one decision point makes the rest of the trajectory diverge completely.

Running multiple times is helpful, but it is far from a solution.

The more complex, the lower the score

All three structural dimensions show a monotonic decrease:

Score vs. number of applications / Score vs. step length / Score vs. number of checkpoints

Number of applications spanned, from 1 to 4: the average score drops from 53% to 20%

Trajectory length: the longer the task trajectory, the lower the score, and markedly so

Number of checkpoints, ≤6 vs. ≥18: the average score drops from 65% to 27%

Tasks that combine "cross-application + long trajectory + fine-grained verification" score the lowest, and this is exactly the most common form of real-world workflows.

Four structural failures: where exactly do Agents fail?

The real value of SaaS-Bench lies not in the scores themselves but in exposing four fatal flaws of Agents in a real environment.

Failure 1: The longer the task, the more mistakes are made

Even if each checkpoint is passed with a probability as high as 95%, the probability of passing all 12 checkpoints is only 54% (0.95^12 ≈ 0.54, treating the checkpoints as independent). And the average number of checkpoints in SaaS-Bench far exceeds 12.
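The compounding is easy to reproduce. The sketch below assumes every checkpoint passes independently with the same probability, which is a simplification for illustration rather than a claim about how SaaS-Bench checkpoints actually behave. The function name is hypothetical.

```python
def all_pass_probability(per_checkpoint_rate, num_checkpoints):
    """Probability that every checkpoint passes, assuming independence."""
    return per_checkpoint_rate ** num_checkpoints


# The figure quoted in the text: 95% per checkpoint, 12 checkpoints.
print(round(all_pass_probability(0.95, 12), 2))  # 0.54

# The same per-checkpoint rate over longer tasks.
for n in (6, 12, 18, 24):
    print(n, round(all_pass_probability(0.95, n), 3))
```

Under this simplification, the chance of a full pass shrinks geometrically with the number of checkpoints, even though the per-checkpoint rate never changes.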

All models show the same pattern: the passing rate decreases as the task progresses, and no model can maintain its early-stage performance in the second half.

Models make fewer and fewer correct decisions as the task is executed

This is an irreversible downward curve. The further it goes, the less likely it is to complete the task.

Failure 2: One wrong step leads to a series of wrong steps

A typical case: The task requires creating a corporate customer "Arcturus Digital". The Agent filled in both the contact name and the company name at the same time, triggering the individual customer logic, and actually created an individual customer, Elena Vasquez.

After that, 10 invoices, the payment records, and the account reconciliations were all attached to the wrong entity. The core checkpoint carries only 3% of the weight, yet failing it costs 30% of the weight downstream.

Schematic diagram of the downstream failure chain caused by an upstream task

A 3% error node causes a 30% score loss.
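One way to picture the cascade is as a dependency graph in which a failed checkpoint drags down everything built on top of it. The sketch below is hypothetical throughout: the checkpoint names, the individual weights, and the dependency edges are invented so that the totals match the 3% and 30% quoted above, and they are not taken from the benchmark.

```python
def propagate_failures(depends_on, failed_roots):
    """Return every checkpoint that fails directly or through a dependency."""
    failed = set(failed_roots)
    changed = True
    while changed:
        changed = False
        for checkpoint, parents in depends_on.items():
            if checkpoint not in failed and any(p in failed for p in parents):
                failed.add(checkpoint)
                changed = True
    return failed


# Hypothetical weights in percentage points (they sum to 100).
weights = {"customer": 3, "payments": 5, "reconciliation": 5, "other_work": 67}
weights.update({f"invoice_{i}": 2 for i in range(1, 11)})

# Hypothetical dependencies: invoices need the customer, and so on.
invoices = [f"invoice_{i}" for i in range(1, 11)]
depends_on = {name: ["customer"] for name in invoices}
depends_on["payments"] = invoices
depends_on["reconciliation"] = ["payments"]

failed = propagate_failures(depends_on, {"customer"})
downstream_loss = sum(weights[c] for c in failed if c != "customer")
print(weights["customer"], downstream_loss)  # 3 30
```

The point of the sketch is that a checkpoint's own weight says little about what its failure costs. What matters is how much of the remaining task depends on it.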

Failure 3: No check after completion, just the assumption that it's correct

Claude Opus 4.6 identified a date error (2026-03-19 vs. 2026-03-20) at Step 124 and made the change, but did not return to the page to recheck and moved straight on to the following subtasks. When it submitted at Step 210, its report stated that "the bill date is 2026-03-20 and has been fixed", while the actual date on the page is still 03-1