What Is the Next Step for AI Coding? The Latest Survey on “Multimodal Code Intelligence” Is Out, and These Directions Deserve Particular Attention
Given a screenshot, the AI can already generate code for you. This is no longer news.
The real challenge lies in the fact that the generated results must withstand execution and interaction testing.
Traditional “Text-to-Code” relies mainly on text descriptions for code generation. However, text is poorly suited to representing spatial hierarchies and complex structures, and what a single image conveys often takes a long passage to explain. For tasks such as frontend interfaces, visual diagrams, and CAD graphics, visual input is often more direct and complete. With the development of Multimodal Large Language Models (MLLMs), “Multimodal Code Intelligence”, which can understand images, interfaces, and diagrams, has begun to emerge.
In this context, teams from Meituan, the University of Hong Kong, and the Chinese University of Hong Kong, together with their collaborators, have published a new survey paper. The paper systematically summarizes the main tasks and bottlenecks of Multimodal Code Intelligence and proposes four main directions for future research.
Link to the paper: https://arxiv.org/abs/2606.15932
They point out, for example, that on the IWR benchmark visual fidelity can reach 64.25%, while the correctness of the interaction functions is only 24.39%. They also argue that the evaluation of Multimodal Code Intelligence should not only consider visual similarity but also examine correctness at the semantic, structural, executable, and interactive levels.
The relevant projects and resources have been published on GitHub.
Current Status
In its task taxonomy, the research team groups the tasks related to Multimodal Code Intelligence into two categories:
One category is Multimodal Code Synthesis, which focuses on the generation, editing, and refinement of code considering visual information.
The other category is “Code-Centered Inference and Action”, which emphasizes that code is not only the end product but can also serve as an intermediate interface for inference, tool calls, and task execution by agents.
They classify the existing research into the following four main directions:
Figure | Overview of the field of Multimodal Code Intelligence.
GUI Direction: The closed loop of website code generation and testing is the clearest, but the existing evaluations still focus on static visual similarity. The results on the IWR benchmark show that the visual fidelity of the model can reach 64.25%, while the correctness of the interaction functions is only 24.39%. In comparison, the evaluation on mobile devices is more difficult to standardize due to the lack of a unified execution and interaction environment.
Figure | Examples of GUI code generation tasks in websites and mobile applications.
Scientific Visualization: The core requirement is that the generated code must not only render the result correctly but also accurately express the data semantics, the document structure, or the relevant scientific process or mechanism.
Figure | Examples of scientific visualization code generation tasks, including diagrams, documents, presentations, and demonstration content.
Structured Graphics: It is emphasized that a shift from pixel similarity to structural correctness is necessary. SVG files must remain editable, flowcharts must maintain the logical topology and relationship type, and CAD files must restore the parametric construction logic, constraints, and feature dependencies.
Figure | Examples of structured graphic generation tasks.
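The difference between pixel similarity and structural correctness for flowcharts can be illustrated with a small sketch. It assumes, purely for illustration, that both the reference and the generated flowchart have already been parsed into typed edges with matching node names; the `Edge` type and `compare_topology` helper are hypothetical and are not taken from the paper.

```python
# A typed edge: (source node, target node, relation type).
Edge = tuple[str, str, str]


def compare_topology(reference: set[Edge], generated: set[Edge]) -> dict[str, set[Edge]]:
    """Report edges that are missing from, or extra in, the generated flowchart."""
    return {
        "missing": reference - generated,
        "extra": generated - reference,
    }


reference = {
    ("start", "check", "next"),
    ("check", "done", "yes"),
    ("check", "retry", "no"),
}
# The generated chart draws the same arrows but mislabels one branch.
generated = {
    ("start", "check", "next"),
    ("check", "done", "yes"),
    ("check", "retry", "yes"),
}

diff = compare_topology(reference, generated)
print("missing:", diff["missing"])
print("extra:", diff["extra"])
```

Two renderings of these charts could look almost identical, yet the edge comparison flags the wrong relation type on the retry branch. A real checker would also need to match nodes whose labels differ, which this sketch does not attempt.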
Cutting-Edge Tasks: Code is extended further from a “product” to an “interface for inference and action”. This includes programmed visual operations, video code generation, physical control, visually-driven programming, and a unified framework for multimodal code generation.
Figure | Examples of cutting-edge tasks and frameworks, including programmed visual operations, video code generation, physical control, visually-driven programming, and a unified framework.
Future Directions
With the further development of cutting-edge tasks towards interaction, execution, and control processes, the weaknesses of the existing evaluation system become more obvious.
Based on this situation, the research team proposes four future directions that should be considered.
1. Multi-Signal Validation
The research team points out that a single indicator cannot fully describe the correctness of Multimodal Code Intelligence. High visual similarity does not mean that the structure is correct; closer agreement with the reference code does not mean that the program is executable; and preference-based evaluations often only reflect local properties.
Therefore, the future evaluation system should not only give an overall score but instead create a more detailed “diagnostic image” that reports the visual fidelity, execution success rate, text correctness, data or semantic fidelity, structural validity, editability, and interaction correctness separately. At the same time, the evaluation design should clarify which properties the system optimizes, which validators are used, and distinguish between the reward signal in the training phase and the final reliability test.
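One way to picture such a “diagnostic image” is a report object that keeps each signal separate instead of collapsing them into one score. The sketch below is only an illustration: the class name, field names, and the 0-to-1 scale are assumptions, and the two filled-in values are the IWR figures quoted earlier in this article.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class DiagnosticReport:
    """Per-signal scores in [0, 1]; None means the signal was not measured."""
    visual_fidelity: Optional[float] = None
    execution_success: Optional[float] = None
    text_correctness: Optional[float] = None
    semantic_fidelity: Optional[float] = None
    structural_validity: Optional[float] = None
    editability: Optional[float] = None
    interaction_correctness: Optional[float] = None

    def weakest(self) -> tuple[str, float]:
        """Return the measured signal with the lowest score."""
        measured = {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not None
        }
        if not measured:
            raise ValueError("no signals were measured")
        name = min(measured, key=measured.get)
        return name, measured[name]

    def unmeasured(self) -> list[str]:
        """List the signals the evaluation did not report."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


report = DiagnosticReport(visual_fidelity=0.6425, interaction_correctness=0.2439)
print(report.weakest())
print(report.unmeasured())
```

Keeping unmeasured signals explicit makes it visible which properties an evaluation never checked, rather than hiding them behind a single number.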
2. Multi-State Verification
The research team believes that visual code tasks involving state changes should no longer be evaluated on isolated static results but must be considered across the entire execution process. GUI tasks illustrate this best: a page may visually reproduce the screenshot, but problems may still occur when clicking, routing, resizing the window, or updating the state.
This challenge exists not only in GUI tasks. Scientific demonstration code may be executable but convey a wrong mechanism; a video script may write the key frames correctly but lose the event sequence; a physical program may achieve the goal but fail in case of contact, coverage, or controller constraints.
Therefore, future benchmarks should not only consider a single result but cover the entire execution chain, including the initial state, the generated code or actions, the intermediate observations, the expected state transitions, the validator outputs, and the recovery scenarios. More specifically, website tasks must check DOM and state assertions, mobile tasks must be checked against designed operation trajectories or simulator gestures, video tasks must check temporal synchronization, and physical tasks must be diagnosed together with the simulator or the controller.
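The execution-chain idea can be sketched as a replay loop that checks state assertions after every action instead of inspecting only the final output. Everything here is a toy stand-in: `Step`, `verify_trace`, and the `buggy_counter` function are hypothetical names, and a real website benchmark would drive a browser rather than a Python dictionary.

```python
from dataclasses import dataclass
from typing import Callable

State = dict[str, object]


@dataclass
class Step:
    action: str      # e.g. "click:#add"
    expected: State  # assertions on the state after the action


def verify_trace(initial: State,
                 apply_action: Callable[[State, str], State],
                 steps: list[Step]) -> list[str]:
    """Replay the steps from the initial state and collect every failed assertion."""
    failures = []
    state = dict(initial)
    for i, step in enumerate(steps):
        state = apply_action(state, step.action)
        for key, want in step.expected.items():
            got = state.get(key)
            if got != want:
                failures.append(
                    f"step {i} ({step.action}): {key}={got!r}, expected {want!r}"
                )
    return failures


# Toy stand-in for generated code: a counter page whose reset button is broken.
def buggy_counter(state: State, action: str) -> State:
    new = dict(state)
    if action == "click:#add":
        new["count"] = new["count"] + 1
    elif action == "click:#reset":
        pass  # bug: the handler does nothing
    return new


steps = [
    Step("click:#add", {"count": 1}),
    Step("click:#add", {"count": 2}),
    Step("click:#reset", {"count": 0}),
]
failures = verify_trace({"count": 0}, buggy_counter, steps)
for line in failures:
    print(line)
```

A static screenshot of the initial page would look correct here; only the third transition exposes the broken reset handler.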
3. Cross-Task Transfer Testing
The research team points out that when evaluating a unified model, not only should it be checked whether it supports more task formats, but also whether the learned skills can be transferred between tasks. The key lies not in broader coverage but in whether the model has actually acquired reusable visual code skills, such as layout inference, symbol relationship modeling, and interaction understanding, rather than just improving the performance in individual tasks.
Therefore, in the future, a special transfer test protocol must be developed to compare the base model, the model improved on the source task, and the comparison model separately optimized for the target task, and report both positive and negative transfer. For example, it can be tested whether diagram training improves the layout inference ability, whether learning the document structure helps to transfer to other visual code tasks, and whether interaction monitoring can improve the repair ability of the generated products.
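The three-way comparison described above reduces to simple arithmetic on target-task scores. The following sketch assumes a single scalar score per model and a small noise tolerance; the class, its field names, and all numbers are made up for illustration and are not results from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TransferResult:
    base: float           # base model, scored on the target task
    source_tuned: float   # model improved on the source task, scored on the target task
    target_tuned: float   # comparison model optimized directly for the target task
    tolerance: float = 0.01  # assumed noise band; smaller differences count as neutral

    @property
    def gain(self) -> float:
        """Change on the target task caused by source-task training."""
        return self.source_tuned - self.base

    @property
    def direction(self) -> str:
        if self.gain > self.tolerance:
            return "positive"
        if self.gain < -self.tolerance:
            return "negative"
        return "neutral"

    @property
    def gap_closed(self) -> Optional[float]:
        """Fraction of the base-to-target-tuned gap recovered by transfer."""
        headroom = self.target_tuned - self.base
        if headroom <= 0:
            return None  # direct tuning gave no headroom to compare against
        return self.gain / headroom


# Made-up scores for illustration only.
chart_to_layout = TransferResult(base=0.40, source_tuned=0.46, target_tuned=0.60)
doc_to_svg = TransferResult(base=0.55, source_tuned=0.50, target_tuned=0.70)

for name, r in [("chart -> layout", chart_to_layout), ("document -> SVG", doc_to_svg)]:
    print(name, r.direction, r.gap_closed)
```

Reporting the direction for every source and target pair, including the negative ones, is what separates a transfer test from a plain coverage table.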
4. Verifiable Agent Traces
For agent-oriented visual code systems, the research team believes that complete process evidence must be retained in the future, connecting the visual evidence, the tool calls, the code changes, and the final result into a verifiable chain. Checking only whether the task eventually succeeds is not enough to judge whether the intermediate trajectory was actually supported by visual evidence, nor whether that trajectory had a causal influence on the result.
The research team mentions that an “Agent Evidence Protocol” must be created in the future. Each entry should at least include: the observed data, the referenced visual areas or tool outputs, the changed code or actions, the expected improvements in the validator results, the rendering results, and the withdrawal or rollback decisions made in case of missing evidence.
Such a protocol not only helps with rendering, ablation testing, counterfactual input, access control, sandbox protection, and manual review, but above all, it can trace the error back to more precise areas such as visual understanding, code generation, environment execution, validator design, or action selection itself. In this way, the agent-driven multimodal code system will no longer be just a black box measured only by the final success rate but rather a verifiable, auditable, and traceable process.
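As a rough illustration, one entry of such a protocol could be modeled as a record whose fields mirror the items listed above, plus a simple audit that flags changes made without supporting evidence. The field names and the `unsupported_changes` helper are assumptions made for this sketch; the article does not specify a concrete schema.

```python
from dataclasses import dataclass, field


@dataclass
class EvidenceEntry:
    observation: str                                         # what the agent observed
    evidence_refs: list[str] = field(default_factory=list)   # visual areas or tool outputs
    change: str = ""                                         # code change or action taken
    expected_improvement: str = ""                           # expected validator effect
    render_result: str = ""                                  # rendering result after the change
    rolled_back: bool = False                                # withdrawn for lack of evidence


def unsupported_changes(trace: list[EvidenceEntry]) -> list[int]:
    """Indices of entries that changed something without evidence and were kept."""
    return [
        i for i, entry in enumerate(trace)
        if entry.change and not entry.evidence_refs and not entry.rolled_back
    ]


trace = [
    EvidenceEntry(
        observation="button overlaps the footer",
        evidence_refs=["screenshot:region(0,540,800,600)"],
        change="set margin-bottom: 24px on .cta",
        expected_improvement="layout validator passes",
        render_result="no overlap",
    ),
    EvidenceEntry(
        observation="header might be too dark",
        change="changed header color",
    ),
    EvidenceEntry(
        observation="font may be wrong",
        change="swapped font family",
        rolled_back=True,
    ),
]

print(unsupported_changes(trace))
```

An audit like this only checks that evidence was cited, not that it actually caused the change; the ablation and counterfactual tests mentioned above would be needed for that.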
Some Problems
The research team points out that the core bottleneck of current Multimodal Code Intelligence lies not only in the generation ability itself but also in the lack of reliable validation mechanisms. The existing evaluations often rely on a single visual signal and cannot cover interaction, state changes, structural constraints, and time sequences:
- In website tasks, it cannot be determined based on a single screenshot whether clicks, routing, and state changes are correct;