What's next for AI coding? The latest review of "Multimodal Code Intelligence" is here, and these directions are worth watching
Given a screenshot, AI can generate the code for you. That is no longer new.
The real difficulty is making the generated result hold up under execution and interaction verification.
Traditional "text-to-code" generation relies mainly on text descriptions. But text is poor at expressing spatial hierarchy and complex structure, and what a single image conveys often takes a large amount of text to spell out. In tasks such as front-end interfaces, charts, and CAD graphics, visual input is often more direct and complete. With the development of Multimodal Large Language Models (MLLMs), "Multimodal Code Intelligence", which can understand images, interfaces, and charts, has emerged.
Against this backdrop, teams from Meituan, the University of Hong Kong, the Chinese University of Hong Kong, and their collaborators have published a new review paper. It systematically surveys the main tasks and bottlenecks of multimodal code intelligence and proposes four main directions for future research.
Paper link: https://arxiv.org/abs/2606.15932
Taking IWR-Bench as an example, they pointed out that current models can reach 64.25% visual fidelity, while the correctness rate of interactive functions is only 24.39%. They also argued that the evaluation of multimodal code intelligence should not focus only on visual similarity but should also examine correctness at the semantic, structural, execution, and interaction levels.
Related projects and resources have been made public on GitHub.
Current Progress
In the task definition section, the research team grouped the tasks related to multimodal code intelligence into two major categories:
One is multimodal code synthesis, which focuses on generating, editing, and refining code where visual information is involved.
The other is "code-centered reasoning and action", which emphasizes that code is not only the final product but can also serve as an intermediate interface for reasoning, tool invocation, and Agent task execution.
They organized existing research into the following four main directions:
Figure | Overview of the multimodal code intelligence field.
GUI Direction: Closed-loop verification is clearest for web code generation, but existing evaluations still focus too much on static visual similarity. Results on IWR-Bench show that models can reach 64.25% visual fidelity, while the correctness rate of interactive functions is only 24.39%. Mobile evaluation is harder to standardize because mobile devices lack a unified execution and interaction environment.
Figure | Examples of GUI code generation tasks in websites and mobile applications.
Scientific Visualization: The core requirement is that the generated code not only renders correctly but also accurately expresses data semantics, document structure, or the relevant scientific process or mechanism.
Figure | Examples of scientific visualization code generation tasks, including charts, documents, presentations, and demonstration content.
Structured Graphics: The emphasis shifts from pixel similarity to structural correctness. SVG should remain editable, flowcharts should preserve logical topology and relationship types, and CAD should recover parametric construction logic, constraints, and feature dependencies.
Figure | Examples of structured graphics generation tasks.
Cutting-Edge Tasks: These further extend code from a "product" to an "interface for reasoning and action", covering programmatic visual operations, video code generation, embodied control, vision-driven programming, and a unified multimodal code generation framework.
Figure | Cutting-edge tasks and frameworks, including programmatic visual operations, video code generation, embodied control, vision-driven programming, and a unified framework.
Future Directions
As cutting-edge tasks push code further into interaction, execution, and control, the shortcomings of the existing evaluation system have become more obvious.
Based on this, the research team proposed four future directions worthy of attention.
1. Multi-Signal Validation
The research team pointed out that no single metric can fully characterize the correctness of multimodal code intelligence. High visual similarity does not mean the structure is correct; being closer to the reference code does not mean the program is executable; and preference-based evaluation often reflects only local attributes.
Future evaluation systems should therefore not just report a total score but build a finer-grained "diagnostic profile" that reports visual fidelity, execution success rate, text correctness, data or semantic fidelity, structural validity, editability, and interaction correctness separately. Evaluation design should also make clear which attributes the system is optimizing and which verifiers are used, and should distinguish reward signals used in training from final reliability checks.
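As a rough illustration of what such a per-signal report could look like, here is a minimal Python sketch. The class name DiagnosticProfile, its field names, and its helper methods are assumptions made for this example and are not taken from the paper; the only real numbers are the two IWR-Bench figures quoted above, written as fractions.

```python
from dataclasses import dataclass, fields
from typing import Dict, Optional, Tuple


@dataclass
class DiagnosticProfile:
    """Hypothetical per-signal report; each score is a fraction in [0, 1].

    None means the signal was not measured for this system.
    """
    visual_fidelity: Optional[float] = None
    execution_success: Optional[float] = None
    text_correctness: Optional[float] = None
    semantic_fidelity: Optional[float] = None
    structural_validity: Optional[float] = None
    editability: Optional[float] = None
    interaction_correctness: Optional[float] = None

    def measured(self) -> Dict[str, float]:
        """Return only the signals that were actually measured."""
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not None
        }

    def weakest(self) -> Tuple[str, float]:
        """Return the lowest-scoring measured signal and its score."""
        scores = self.measured()
        if not scores:
            raise ValueError("no signals were measured")
        name = min(scores, key=scores.get)
        return name, scores[name]

    def gap(self, high: str, low: str) -> float:
        """Difference between two measured signals (high minus low)."""
        scores = self.measured()
        return scores[high] - scores[low]


# The two IWR-Bench figures quoted in the article, as fractions.
profile = DiagnosticProfile(visual_fidelity=0.6425,
                            interaction_correctness=0.2439)
weakest_signal = profile.weakest()
fidelity_gap = profile.gap("visual_fidelity", "interaction_correctness")
```

Reported this way, the gap of roughly 39.86 percentage points between visual fidelity and interaction correctness stays visible, whereas a single averaged score would hide it.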
2. Multi-State Verification
The research team believes that visual-code tasks involving state changes should be evaluated over the complete execution process rather than on isolated static results. GUI tasks illustrate this best: a page may reproduce the screenshot visually, yet problems can still surface on clicks, route changes, window resizing, or state updates.
This challenge is not limited to GUIs. Scientific demonstration code may execute but convey the wrong mechanism; a video script may get the key frames right but lose the event timing; an embodied program may eventually reach the goal but fail under contact, occlusion, or controller limits.
Future benchmarks should therefore look beyond a single result and cover the complete execution chain: the initial state, the generated code or actions, intermediate observations, expected state transitions, verifier outputs, and recovery cases. Specifically, web tasks need DOM and state assertions, mobile tasks need checks against designed operation trajectories or simulator gestures, video tasks need timing-synchronization verification, and embodied tasks need diagnosis with simulators or controllers.
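The idea of checking expected state transitions along an execution chain can be sketched in a few lines of Python. Everything here is hypothetical: Step, verify_trace, and the toy_page function stand in for a real browser or simulator harness, and the deliberately planted routing bug is only there to show how a page can pass an initial-state check yet fail once it is driven.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

State = Dict[str, Any]


@dataclass
class Step:
    action: str      # e.g. "click:menu" or "navigate:/cart"
    expected: State  # keys that must hold these values after the action


def verify_trace(initial: State,
                 steps: List[Step],
                 apply_action: Callable[[State, str], State]) -> List[dict]:
    """Replay the steps in order and assert the expected state after each one."""
    state = dict(initial)
    report = []
    for index, step in enumerate(steps):
        state = apply_action(state, step.action)
        # Record (expected, actual) for every key that does not match.
        mismatches = {
            key: (want, state.get(key))
            for key, want in step.expected.items()
            if state.get(key) != want
        }
        report.append({
            "step": index,
            "action": step.action,
            "passed": not mismatches,
            "mismatches": mismatches,
        })
    return report


def toy_page(state: State, action: str) -> State:
    """Toy stand-in for generated page logic, with a deliberate routing bug."""
    new_state = dict(state)
    if action == "click:menu":
        new_state["menu_open"] = not state["menu_open"]
    elif action.startswith("navigate:"):
        pass  # bug: the route is never updated
    return new_state


initial_state = {"route": "/", "menu_open": False}
trace = [
    Step("click:menu", {"menu_open": True}),
    Step("navigate:/cart", {"route": "/cart"}),
]
report = verify_trace(initial_state, trace, toy_page)
```

In this toy run the menu click passes and the navigation step fails with the mismatch recorded as expected "/cart" versus actual "/", which is the kind of error a single static screenshot comparison would not reveal.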
3. Cross-Task Transfer Testing
The research team pointed out that evaluating a unified model should not only ask whether it supports more task formats but also whether the capabilities it has learned transfer across tasks. What matters is not broader coverage but whether the model has truly acquired reusable visual-code capabilities, such as layout reasoning, symbolic relationship modeling, and interaction understanding, rather than merely improving on several individual tasks separately.
To this end, a dedicated transfer testing protocol is needed that compares the base model, the model enhanced on the source task, and a control model optimized separately for the target task, and that reports both positive and negative transfer. For example, one can test whether chart training improves layout reasoning, whether document structure learning transfers to other visual-code tasks, and whether interaction supervision improves the ability to repair generated artifacts.
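One possible way to tabulate the three-way comparison described above is sketched below in Python. The TransferResult class, the tolerance of 0.01, and the "recovered fraction" ratio are assumptions introduced for illustration and are not the paper's protocol; the scores are made-up placeholders, not reported results.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TransferResult:
    """Target-task scores (fractions in [0, 1]) for the three compared models."""
    source_task: str
    target_task: str
    base: float                # base model
    source_enhanced: float     # model enhanced on the source task
    target_specialized: float  # control model optimized for the target task

    @property
    def delta(self) -> float:
        """Change on the target task caused by source-task training."""
        return self.source_enhanced - self.base

    def direction(self, tolerance: float = 0.01) -> str:
        """Label the transfer; changes within the tolerance count as neutral."""
        if self.delta > tolerance:
            return "positive"
        if self.delta < -tolerance:
            return "negative"
        return "neutral"

    def recovered_fraction(self) -> Optional[float]:
        """Share of the control model's gain over base that transfer recovers."""
        headroom = self.target_specialized - self.base
        if headroom <= 0:
            return None  # the control model did not beat the base model
        return self.delta / headroom


# Placeholder scores for illustration only.
chart_to_layout = TransferResult("chart", "layout", 0.40, 0.46, 0.60)
chart_to_cad = TransferResult("chart", "cad", 0.50, 0.45, 0.70)
```

With these placeholder numbers, the first pair is labeled positive and recovers about 30% of the control model's gain, while the second is labeled negative. Reporting both rows side by side is the point: negative transfer is kept in the table rather than dropped.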
4. Verifiable Agent Traces
For Agent-oriented vision-code systems, the research team believes that more complete process evidence needs to be retained, linking visual evidence, tool invocations, code modifications, and the final result into a checkable chain. Whether the task ultimately succeeds is not enough to judge whether the intermediate trajectory is truly supported by visual evidence, nor does it show whether that trajectory had a causal effect on the result.
The research team proposed establishing an "Agent evidence log". Each record should include at least: the observations it is based on, the visual regions or tool outputs referenced, the code or actions modified, the expected improvement in the verifier result, the replay result, and the fallback or rollback decision triggered when the evidence is insufficient.
Such a log helps with replay, ablation testing, counterfactual inputs, permission control, sandbox protection, and manual review. More importantly, it can trace failures to more specific stages, such as whether the problem, safety issues included, lies in visual understanding, code generation, environment execution, verifier design, or action selection itself. In this way, an Agent-driven multimodal code system is no longer just a black box measured by its final success rate but comes closer to a verifiable, reviewable, and attributable process.
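A record with the fields listed above could be represented roughly as follows. This Python sketch is an assumption about one possible shape: the EvidenceRecord class, its field names, the audit rule, and the sample entries are all hypothetical and are not defined by the paper.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class EvidenceRecord:
    observation: str                        # what the agent observed
    evidence_refs: List[str]                # visual regions or tool outputs cited
    change: str                             # code or action that was modified
    expected_delta: float                   # expected verifier improvement
    replayed_delta: Optional[float] = None  # improvement measured on replay
    fallback: Optional[str] = None          # fallback or rollback decision

    def is_grounded(self) -> bool:
        """True if the step cites at least one piece of evidence."""
        return len(self.evidence_refs) > 0

    def met_expectation(self) -> bool:
        """True if replay was run and reached the expected improvement."""
        return (self.replayed_delta is not None
                and self.replayed_delta >= self.expected_delta)


def audit(log: List[EvidenceRecord]) -> List[int]:
    """Return indices of steps that cite no evidence, or that missed their
    expected improvement without recording a fallback decision."""
    flagged = []
    for index, record in enumerate(log):
        unresolved_miss = (not record.met_expectation()
                           and record.fallback is None)
        if not record.is_grounded() or unresolved_miss:
            flagged.append(index)
    return flagged


# Hypothetical sample entries for illustration only.
log = [
    EvidenceRecord("button overlaps footer", ["screenshot:region(0,900,400,60)"],
                   "set margin-bottom on .cta", 0.05, replayed_delta=0.06),
    EvidenceRecord("layout looks off", [],
                   "rewrite grid template", 0.10, replayed_delta=0.12),
    EvidenceRecord("click does nothing", ["tool:console_log#14"],
                   "bind onClick handler", 0.20, replayed_delta=0.00,
                   fallback="rollback"),
]
flagged_steps = audit(log)
first_line = json.dumps(asdict(log[0]))  # one JSON line per record
```

In this sample, only the second step is flagged: it improved the verifier score but cited no evidence, which is exactly the case that a final success rate alone cannot distinguish from a well-grounded fix.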
Some Issues
The research team pointed out that the core bottleneck of current multimodal code intelligence is not only generation ability itself but also the lack of a reliable verification mechanism. Existing evaluations often rely on a single visual signal and struggle to cover interaction, state changes, structural constraints, and timing:
- In web tasks, a single screenshot cannot determine whether clicking, routing, and state switching are correct;
- In chart tasks, similar rendering does not mean accurate data recovery;
- In SVG, flowchart, and CAD tasks, visual closeness can mask errors in structure, logic, or parametric constraints;
- In video and robot tasks, task completion does not mean that the timing or physical behavior is realistic and reliable.
At the same time, existing research lacks unified standards for dataset selection, evaluation metrics, and task settings, which makes it difficult to compare the results of different methods directly. Problems such as data leakage, benchmark saturation, and evaluation sensitivity further weaken the robustness and reliability of the conclusions.
Finally, they cautioned that although multimodal code intelligence is expected to lower the barrier to visual programming, insufficient verification can bring real risks such as broken web interactions, chart data errors, loss of structural information, distorted presentation of scientific mechanisms, and unsafe physical actions. In addition, screenshots and design files may contain private information, and generated code may be leaked or misused in proprietary environments.
This article is from the WeChat public account "Academic Headlines" (ID: SciTouTiao), author: Academic Headlines. It is published by 36Kr with authorization.