Another Chinese multimodal AI model has been open-sourced. In our hands-on tests, it converted screenshots into web pages and handled image-based shopping searches, and its API price has been cut in half.
ZDONGXI reported on December 9 that, the previous evening, Zhipu open-sourced its GLM-4.6V series of multimodal large models: the base GLM-4.6V (106B-A12B) for cloud and high-performance cluster scenarios, and the lightweight GLM-4.6V-Flash (9B) for local deployment and low-latency applications.
This morning, Zhipu also open-sourced AutoGLM, an agent comparable to the Doubao mobile assistant. When AutoGLM was released in October last year, the industry hailed it as the "world's first AI agent capable of operating a phone".
▲ GLM-4.6V open-source homepage (Source: Hugging Face)
▲ AutoGLM open-source homepage (Source: Hugging Face)
According to the official introduction, GLM-4.6V can handle tasks such as intelligent text-image content creation, image-based shopping and product recommendation, front-end replication with multi-round visual interaction, and long-context document and video understanding. ZDONGXI tried it out right away.
In our hands-on tests, GLM-4.6V's image search, cross-platform price comparison, and long-text and video understanding were fairly stable, and it generated text and web pages quickly and accurately. In mixed text-image generation, however, the images it produced failed to display, and it slightly misread vague instructions.
The GLM-4.6V series extends the training context window to 128k tokens and, for the first time in its model architecture, natively integrates Function Call (tool calling) into the vision model.
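For developers, this means a tool definition can be passed alongside image input in a single request. Below is a minimal sketch using the openai Python SDK against an OpenAI-compatible endpoint; the base URL follows Zhipu's existing open platform, and the model identifier and the search_product tool are illustrative assumptions, not details confirmed by the release.

```python
# Minimal sketch: image input plus a tool definition in one request.
# The base_url, model name, and tool are assumptions for illustration.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_ZHIPU_API_KEY",                      # placeholder key
    base_url="https://open.bigmodel.cn/api/paas/v4/",  # Zhipu's OpenAI-compatible endpoint
)

# A hypothetical tool the model may call after inspecting the image.
tools = [{
    "type": "function",
    "function": {
        "name": "search_product",
        "description": "Search e-commerce platforms for a product seen in an image",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

response = client.chat.completions.create(
    model="glm-4.6v",  # assumed model identifier
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/product.jpg"}},
            {"type": "text", "text": "Find this product online and compare prices."},
        ],
    }],
    tools=tools,
)
print(response.choices[0].message)  # may contain a tool_calls entry
```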
On performance, the GLM-4.6V series achieves state-of-the-art results at its parameter scale in key capabilities such as multimodal interaction, logical reasoning, and long context.
Across 34 benchmarks covering general visual question answering, multimodal reasoning, multi-agent tasks, multimodal long text, chart recognition, and spatial grounding, the 9B GLM-4.6V-Flash outscored Qwen3-VL-8B on 22. And GLM-4.6V, with 106B total parameters and 12B activated, performs close to Qwen3-VL-235B, a model with roughly twice as many parameters.
▲ Benchmark test of the GLM-4.6V series of models (Source: z.ai/blog/glm-4.6v)
On price, the GLM-4.6V series costs 50% less than GLM-4.5V: API calls start at 1 yuan per million input tokens and 3 yuan per million output tokens, and GLM-4.6V-Flash is entirely free.
▲ Price list of the GLM-4.6V series of models (Source: Zhipu AI)
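At these rates, per-request cost is easy to estimate. A quick back-of-the-envelope sketch, using the quoted prices; the token counts are made-up examples:

```python
# Cost of one GLM-4.6V API call at the quoted prices:
# 1 yuan per 1M input tokens, 3 yuan per 1M output tokens.
INPUT_YUAN_PER_TOKEN = 1.0 / 1_000_000
OUTPUT_YUAN_PER_TOKEN = 3.0 / 1_000_000

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the cost of a single call in yuan."""
    return input_tokens * INPUT_YUAN_PER_TOKEN + output_tokens * OUTPUT_YUAN_PER_TOKEN

# Filling the full 128k-token context and getting a 2k-token answer:
print(f"{request_cost(128_000, 2_000):.3f} yuan")  # 0.134 yuan
```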
Open-source address of GLM-4.6V:
GitHub: https://github.com/zai-org/GLM-V
Hugging Face: https://huggingface.co/collections/zai-org/glm-46v
ModelScope Community: https://modelscope.cn/collections/GLM-46V-37fabc27818446
Experience address of GLM-4.6V: https://chat.z.ai/
01. Intelligent text-image generation: can draft a WeChat post, but images fail to display
First, for intelligent text-image content creation, GLM-4.6V has built-in native multimodal tool calling: it can directly understand multimodal data such as images, screenshots, and document pages without first converting them into text descriptions for parsing.
We uploaded the GLM-4.5V technical report and asked it to generate an illustrated WeChat official account article. In about one to two minutes, GLM-4.6V read and understood the entire document and output a complete article with a title, an introduction, five sections, and a conclusion. After multiple attempts, however, the images still would not display.
▲ Intelligent text-image generation
02. Image-based shopping and product recommendation: automatic price comparison runs smoothly, but fuzzy searches are not fully understood
To try GLM-4.6V's image-based shopping and product recommendation features, we entered: "Help me search for the current prices of the iPhone 17 Pro Max across platforms."
GLM-4.6V automatically called the relevant tools to search the web and produced a price-comparison table listing product name, platform, brand, product image, purchase link, and store name; clicking a link jumps straight to the purchase page.
Checked against the purchase pages, the product names and prices GLM-4.6V found were all correct. However, every listing it compared came from JD.com, and the product names were lifted verbatim from the e-commerce pages, full of redundant text and left unsorted.
We also asked GLM-4.6V to find the same glasses Nick Wilde wears in "Zootopia 2". Through image search, it located real-world photos of the matching glasses but did not provide a purchase link.
▲ Fuzzy product search
03. Web page replication: generates page code smoothly from a single image, but errs on icon replacement
We uploaded a screenshot of the X platform's login page and asked GLM-4.6V to generate HTML code and a page preview.
▲ Generate web page code from a screenshot
▲ Generate web page preview
On receiving the instruction, GLM-4.6V immediately began generating HTML line by line and rendered the preview page. The X-style login page it produced is nearly identical to the original.
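The same screenshot-to-code flow can be driven over the API. Here is a minimal sketch under the same assumptions as before (OpenAI-compatible endpoint, assumed model name); the screenshot is sent as a base64 data URL so it needs no hosting, and the file names are placeholders.

```python
# Sketch: send a local screenshot and save the generated HTML.
import base64
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_ZHIPU_API_KEY",
    base_url="https://open.bigmodel.cn/api/paas/v4/",  # assumed endpoint
)

# Encode the screenshot as a data URL.
with open("login_page.png", "rb") as f:
    data_url = "data:image/png;base64," + base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="glm-4.6v",  # assumed model identifier
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": data_url}},
            {"type": "text",
             "text": "Reproduce this login page as a single self-contained HTML file."},
        ],
    }],
)

# Write the returned markup to disk for a quick look in the browser.
with open("replica.html", "w", encoding="utf-8") as f:
    f.write(response.choices[0].message.content)
```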
GLM-4.6V also supports multi-round visual interaction: based on the generated result, you can change the page's colors, reposition buttons, and make similar edits with natural-language instructions.
For example, building on the previous output, we asked it to change the theme color to sky blue and swap the X icon for a Z while keeping the original style. GLM-4.6V changed the theme color flawlessly, but for the icon swap it mistakenly produced an upward-arrow shape instead of a Z.
▲ Modify web page elements
04. Long-context document understanding: processes multiple Chinese and English papers at once, and reads long documents accurately
GLM-4.6V extends the aligned context between the visual encoder and the language model to 128k tokens. In practice, a 128k context holds roughly a 150-page document, a 200-page slide deck, or an hour of video.
To test its long-context document understanding, we gave GLM-4.6V three papers on online platform governance, two in Chinese and one in English, and asked it to read them and produce study notes.
In GLM-4.6V's output, images again failed to display, but the text was complete and logically clear, laying out each paper's core arguments and conclusions. The English paper was handled without errors or omissions.
05. Video understanding: analyzes video content quickly, but file size is capped
Finally, GLM-4.6V can also understand long videos. Users can upload an MP4 file of up to 200 MB and ask it to analyze the video's shooting techniques, content, and structure.
For example, we uploaded a 6-minute-48-second video on video-production techniques and asked it to summarize the video's ideas and content and to offer suggestions for running a photography-focused self-media channel.
▲ Video content understanding
Within a few seconds, GLM-4.6V delivered a thorough breakdown of the video's ideas, narrative techniques, camera work, and equipment choices, along with four step-by-step suggestions for becoming a photography blogger. The answer was accurate, clear, and complete.
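Driving video understanding over the API is less certain from the article alone. The sketch below assumes GLM-4.6V keeps the video_url content part that Zhipu's earlier GLM-4V API accepts; treat the field names, the endpoint, and the model identifier as assumptions to verify against the official docs.

```python
# Sketch: video understanding via a video_url content part (assumed to be
# supported, following Zhipu's earlier GLM-4V API; verify before use).
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_ZHIPU_API_KEY",
    base_url="https://open.bigmodel.cn/api/paas/v4/",  # assumed endpoint
)

response = client.chat.completions.create(
    model="glm-4.6v",  # assumed model identifier
    messages=[{
        "role": "user",
        "content": [
            {"type": "video_url", "video_url": {"url": "https://example.com/tutorial.mp4"}},
            {"type": "text",
             "text": "Summarize this video's structure, narration, and camera work."},
        ],
    }],
)
print(response.choices[0].message.content)
```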
06. Conclusion: GLM-4.6V lowers the barrier to using vision models
Based on our hands-on tests, GLM-4.6V can already be genuinely useful in daily work, though its outputs are not yet fully stable: images fail to display in generated official account articles, and small flaws remain when editing web page details. Still, at half the price of the previous version, and with the lightweight model free, the barrier to trying multimodal AI has dropped noticeably for individuals and small teams.
With the capabilities of AI companies converging, whoever can deliver a smoother experience at a lower cost may attract more developers.
In its official post, the Zhipu team wrote that this is its open-source release week and that more work will be open-sourced, something worth looking forward to.
This article is from the WeChat official account "ZDONGXI" (ID: zhidxcom), author: Wang Han. It is published by 36Kr with authorization.