
Tsinghua University and Zhipu Team propose Vision2Web: Validating and Evaluating Visual Website Development Based on Agents

Account deleted (author) · 2026-03-31 17:59
A hierarchical benchmark for visual website development

Recently, a research team from Tsinghua University and Zhipu jointly released Vision2Web, a hierarchical benchmark designed to evaluate the real-world development capabilities of multimodal code agents.

Different from previous tests limited to static page generation or partial code repair, this research reveals a key phenomenon:

As task complexity escalates from simple UI reproduction to full-stack system construction, even current SOTA models suffer a significant drop in performance.

According to the paper, Vision2Web fills a gap in existing evaluation suites by constructing three progressively harder task levels (static web pages, interactive front-ends, and full-stack websites), combined with a workflow-based agentic verification mechanism. It offers a new lens on the capability boundaries of AI in long-horizon, cross-modal software engineering.

Paper link: https://arxiv.org/abs/2603.26648

Core Design

Vision2Web divides website development tasks into three levels of increasing difficulty, each corresponding to different ability requirements.

Level 1: Static Web Pages

This level evaluates the model's ability to understand user interfaces and generate code in device-responsive scenarios. Given multiple prototype images of the same web page at desktop, tablet, and mobile resolutions, the model must generate a responsive page whose layout and style stay consistent across devices.
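The per-device evaluation this implies can be sketched as a small harness that scores the same candidate page once per device class. This is a minimal illustration, not the paper's actual pipeline: the viewport sizes are common illustrative values, and `render_page` / `judge_similarity` are hypothetical stubs standing in for a headless-browser renderer and a VLM judge.

```python
# Hypothetical sketch of a multi-device visual check for Level 1 tasks.
# Viewport sizes and both helper functions are illustrative stand-ins,
# not part of the Vision2Web release.

VIEWPORTS = {
    "desktop": (1920, 1080),
    "tablet": (768, 1024),
    "mobile": (375, 667),
}

def render_page(html_path, viewport):
    """Stub: would render the candidate page at the given viewport
    (e.g. via a headless browser) and return a screenshot handle."""
    return f"screenshot:{html_path}@{viewport[0]}x{viewport[1]}"

def judge_similarity(screenshot, prototype):
    """Stub: would ask a VLM judge to compare the screenshot against
    the prototype image and return a 0-100 visual score."""
    return 100.0 if screenshot and prototype else 0.0

def static_visual_scores(html_path, prototypes):
    """Score one responsive page once per device class, as Level 1 requires."""
    return {
        device: judge_similarity(render_page(html_path, size), prototypes[device])
        for device, size in VIEWPORTS.items()
    }
```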

Level 2: Interactive Front-End

This level requires the model to build a multi-page front-end application with a complete navigation flow and highly consistent structure after receiving multiple prototype images and text input describing the logical relationships between pages. This examines its cross-modal page organization and logical reasoning abilities.

Level 3: Full-Stack Website

As the ultimate test of engineering capability, the full-stack development task gives the model structured requirement documents and complex prototype images, and requires it to coordinate the entire pipeline: requirement understanding, state management, and integration debugging. The final deliverable is a runnable, logically self-consistent, fully functional full-stack system.

Figure | Overview of the Vision2Web visual website development hierarchical benchmark testing system. The system covers three levels of tasks: static web pages, interactive front-end interfaces, and full-stack websites, requiring the agent to integrate visual prototypes and text specifications. The evaluation uses a workflow-based agent verification paradigm, and quantitative analysis is conducted through dual indicators of functional correctness and visual fidelity.

For dataset construction, Vision2Web curates real-world website resources into 193 development tasks spanning 16 subcategories under four top-level categories: content, transactions, SaaS platforms, and public services. In total it provides 918 prototype images and 1,255 test cases, aiming to give a solid, high-quality data foundation for evaluating model generalization in diverse, highly complex scenarios.
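The task records described above could plausibly be modeled as follows. This is a speculative sketch of one record's shape based only on the counts and fields the article mentions; the field names are assumptions, not the dataset's actual schema.

```python
# Hypothetical record shape for one Vision2Web task; field names are
# assumptions inferred from the article, not the released schema.
from dataclasses import dataclass

@dataclass
class Vision2WebTask:
    task_id: str
    level: int             # 1 = static page, 2 = interactive front-end, 3 = full-stack
    category: str          # content, transactions, SaaS platform, or public service
    subcategory: str       # one of the 16 subcategories
    prototypes: list       # prototype image paths (918 images across all tasks)
    test_cases: list       # verification specs (1,255 cases across all tasks)
    spec_text: str = ""    # requirement document, used by Level 2/3 tasks
```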

Evaluation and Results

To address the challenges of functional and visual testing in end-to-end website evaluation, Vision2Web adopts a workflow-based agentic verification paradigm. The testing process is organized as a directed dependency graph: each node corresponds to a verification subroutine, and edges encode ordering dependencies and shared state between nodes.

The verification nodes are divided into two categories:

Functional Verification Nodes: Executed by a GUI agent, these judge functional correctness from the task goal, guiding actions, and verification criteria, and output a functional score.

Visual Verification Nodes: Executed by a VLM judge, these compare the rendered page against the prototype at the component level and output a visual score.
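The dependency-graph execution described above can be sketched in a few lines: run nodes in topological order, thread a shared state dict through the workflow (so, e.g., a login node's effects are visible to later nodes), and average scores per node type. This is a minimal sketch assuming Python's standard `graphlib`; the node/edge representation and the FS/VS averaging are illustrative choices, not the paper's implementation.

```python
# Minimal sketch of workflow-based verification: nodes run in dependency
# order over a shared state, and scores aggregate by node type.
from graphlib import TopologicalSorter

def run_verification(nodes, edges):
    """nodes: {name: (node_type, fn)} where fn(state) -> score in [0, 100]
       edges: {name: set of predecessor names} (every node has an entry)."""
    state = {}  # shared state threaded through the workflow (session, records, ...)
    scores = {"functional": [], "visual": []}
    # graphlib yields a valid topological order over the predecessor map
    for name in TopologicalSorter(edges).static_order():
        node_type, fn = nodes[name]
        scores[node_type].append(fn(state))
    avg = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return {"FS": avg(scores["functional"]), "VS": avg(scores["visual"])}
```

A toy workflow might log in first, then check the home page visually and exercise a create action that depends on the logged-in state:

```python
nodes = {
    "login": ("functional", lambda s: (s.setdefault("logged_in", True), 100.0)[1]),
    "home_visual": ("visual", lambda s: 80.0),
    "create_item": ("functional", lambda s: 90.0 if s.get("logged_in") else 0.0),
}
edges = {"login": set(), "home_visual": {"login"}, "create_item": {"login"}}
```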

Based on this, the research team comprehensively tested several cutting-edge models, including Claude-Opus-4.5, GPT-5, and Gemini-3-Pro, in a pre-configured containerized environment. The study reports six main findings:

Finding 1: Model capability declines significantly as task complexity increases. For example, Gemini-3-Pro-Preview scored 63.3 on static desktop tasks, but on full-stack tasks its visual score dropped sharply to 11.7 and its functional score to just 22.6.

Figure | End-to-end performance of multimodal coding agents on the Vision2Web dataset across three task levels. Reported metrics include device-level static scores, average functional scores (FS) and visual scores (VS) for interactive tasks and full-stack tasks. The deployment success rate (DSR) is for reference only and not an official metric. Unless otherwise specified, all metrics are reported on a 0-100 scale.

Finding 2: Models show clear weaknesses in device adaptation. Visual scores on tablet and mobile are generally 10%-20% lower than on desktop.

Finding 3: There are significant differences in the performance among models. In a horizontal comparison, Claude-Opus-4.5 performs the most robustly.

Finding 4: The choice of framework affects the results. Most models perform better under the OpenHands framework than under Claude Code.

Figure | Performance (visual score/functional score) of selected coding agents in different website categories under the OpenHands framework on the Vision2Web platform. Opus-4.5 and Sonnet-4.5 correspond to Claude-Opus-4.5 and Claude-Sonnet-4.5 respectively.

Finding 5: The website category affects the performance. The models perform best on public service websites with simple structures, but struggle the most on SaaS platforms with complex interactions.

Finding 6: Performance varies widely across functional categories. Pass rates for basic logic such as navigation and authentication are relatively high, but all models show clear weaknesses on tasks involving complex data flow, such as state management, CRUD operations, and file operations.

Summary and Outlook

The experiments show that as the task complexity climbs from static pages to full-stack systems, the performance of the models drops precipitously.

At the basic level, models often over-rely on superficial cues such as file names rather than robust multimodal grounding, producing layout misalignments and visual inconsistencies when assets are unnamed or components are generic.

In the multi-page interaction stage, this defect evolves into a lack of cross-view coherence. Models can often reproduce the home page well but fail to maintain the visual fidelity and interaction logic of subsequent pages.

At the most complex full-stack system level, due to the lack of a reliable self-verification mechanism, models are prone to deviate from the requirement specifications in long contexts, leading to project startup failures or runtime crashes.

These findings profoundly expose the limitations of current agents in overall task management and system-level engineering. Strong performance in isolated tasks cannot be reliably translated into end-to-end system construction, revealing systematic defects in handling structural complexity, cross-page coordination, and persistent state reasoning.

Looking ahead, research should move beyond the accuracy of single-point code generation toward hierarchically progressive task design and a principled, reproducible self-evaluation paradigm, as the basis for rigorously understanding and assessing the capabilities of coding agents.

This article is from the WeChat official account “Academic Headlines” (ID: SciTouTiao), author: Wang Yueran. It is published by 36Kr with authorization.