
Zhipu has just released a 100-billion-parameter vision model that can tell McDonald's fried chicken from KFC's and beats 99% of humans at the "guess the location from the picture" challenge.

Zhidongxi, 2025-08-12 08:56
Ordinary users can try it for free, and the API comes with a free quota of 20 million tokens.

According to Zhidongxi's report on August 11, Zhipu tonight open-sourced its latest-generation visual understanding model, GLM-4.5V. The model is trained on Zhipu's new-generation text base model, GLM-4.5-Air, and follows the technical route of the previous-generation visual reasoning model, GLM-4.1V-Thinking. It has 106 billion total parameters and 12 billion active parameters. GLM-4.5V also adds a thinking-mode switch, letting users control whether the model reasons before answering.

The model's visual capabilities unlock some interesting use cases. For example, it can now tell McDonald's fried chicken wings apart from KFC's, analyzing aspects such as the color and texture of the chicken.

GLM-4.5V can also guess a location from a picture. Zhipu said GLM-4.5V competed against human players in an image-based geolocation contest; after seven days, its score ranked 66th on the competition website, surpassing 99% of human users.

Zhidongxi also used the model to reproduce a Xiaohongshu-style webpage from a screenshot, achieving roughly 80-90% similarity.

Zhipu shared GLM-4.5V's results on 42 benchmarks covering common tasks such as image, video, and document understanding, as well as graphical-interface agent operations. GLM-4.5V scored higher than comparably sized models, such as Step-3 and Qwen2.5-VL, on 41 of them.

The model is now available on open-source platforms including Hugging Face, ModelScope, and GitHub, with an additional FP8 quantized version provided. Zhipu has also built a demo app for it, but it is currently only available for Macs with Apple Silicon (non-Intel) chips.

Users can select the GLM-4.5V model on z.ai and upload pictures or videos to try it, or upload pictures in the Zhipu Qingyan app/web version with "reasoning mode" enabled.

To help developers explore GLM-4.5V's capabilities, Zhipu has simultaneously open-sourced a desktop assistant application. It can take screenshots and record the screen in real time to gather screen information, and relies on GLM-4.5V to handle everyday visual reasoning tasks such as code assistance, video content analysis, game problem-solving, and document interpretation.

The GLM-4.5V API is now available on Zhipu's open platform, BigModel.cn, with a free resource package of 20 million tokens. API pricing starts at 2 yuan per million input tokens and 6 yuan per million output tokens, and the API supports image, video, file, and text input.
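For developers, a minimal call might look like the sketch below, assuming the platform's OpenAI-compatible endpoint; the base URL, model id, and file name are illustrative assumptions, so check the BigModel documentation for the exact values.

```python
# A minimal sketch of calling GLM-4.5V through BigModel's OpenAI-compatible API.
# The base_url, model id, and image file are assumptions for illustration;
# consult the platform documentation for the exact values.
import base64
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_BIGMODEL_API_KEY",                    # issued on BigModel.cn
    base_url="https://open.bigmodel.cn/api/paas/v4/",   # assumed endpoint
)

# Encode a local photo so it can be sent inline as a data URL.
with open("street_corner.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="glm-4.5v",                                   # assumed model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            {"type": "text",
             "text": "Where was this photo most likely taken? "
                     "Give your best guess and approximate latitude and longitude."},
        ],
    }],
)
print(response.choices[0].message.content)
```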

After the model was launched, Zhidongxi immediately tried out its capabilities and summarized some of the technical innovations behind it.

Model open-source address:

https://github.com/zai-org/GLM-V

https://huggingface.co/collections/zai-org/glm-45v-68999032ddf8ecf7dcdbc102

https://modelscope.cn/collections/GLM-45V-8b471c8f97154e

Desktop assistant open-source address:

https://huggingface.co/spaces/zai-org/GLM-4.5V-Demo-App

01. A slight stumble in guessing locations from real-world photos, high similarity in webpage reproduction

Zhidongxi tried out some of the model's functions in the desktop assistant application built on GLM-4.5V. The application provides settings such as a thinking-mode switch, and users can also customize prompts and model settings, offering a high degree of freedom.

To use the model, users need to provide the API key for GLM-4.5V, which can be obtained on Zhipu's open platform.

During the test, Zhidongxi first used an official sample photo. The model accurately guessed the location and provided precise latitude and longitude as requested.

Next, we uploaded our own test picture, a photo of a corner of Lingyin Temple. The picture was quite challenging: it showed ancient buildings with yellow walls and dark roofs, tall trees, and tourists, but no obvious landmark information. Although the name "Lingyin Temple" was printed on a street lamp in the lower right corner, the model did not recognize it because the characters were not in simplified Chinese.

During its analysis, the model misread "Lingyin Temple" on the street lamp as "Baoyuanqing", misread "Gratitude" elsewhere in the picture as "Weisheng", and picked up the word "Inclusiveness" in another spot. Combining these words with environmental features, it concluded the location was Qingcheng Mountain in Dujiangyan, Sichuan. Although it failed to match the real location, the reasoning process was detailed and the result still had some reference value.

This model has certain GUI (Graphical User Interface) capabilities, which are crucial for Agent scenarios such as understanding and operating webpages or apps.

In the official demo, GLM-4.5V helps users work out discount information from a screenshot of a cluttered shopping site, then reflects on and verifies its results. The latest version of Zhipu's thinking-and-execution agent, AutoGLM, will use GLM-4.5V.

On the productivity side, GLM-4.5V can now reproduce front-end code from webpage screen recordings and screenshots: it analyzes elements such as content, style, and layout in the image, infers the underlying code, and then models and implements the interaction logic.

Zhidongxi tried the "reproduce specific functions from a webpage screen recording/screenshot" capability offered in the app. Users can click the screenshot or partial screen-recording button on the page and upload the captured video; the system compresses the video, reasons over it, generates the corresponding HTML code, and renders an interactive front end.
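For developers who would rather call the API directly than use the app, the same screenshot-to-page request might look roughly like the sketch below; the endpoint, model id, prompt wording, and file names are assumptions for illustration.

```python
# Sketch of the screenshot-to-front-end workflow via the API: send a page
# screenshot and ask for a single self-contained HTML file. Endpoint, model id,
# prompt wording, and file names are illustrative assumptions.
import base64
from openai import OpenAI

client = OpenAI(api_key="YOUR_BIGMODEL_API_KEY",
                base_url="https://open.bigmodel.cn/api/paas/v4/")  # assumed endpoint

with open("page_screenshot.png", "rb") as f:
    screenshot_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="glm-4.5v",  # assumed model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}},
            {"type": "text",
             "text": "Reproduce this page as one self-contained HTML file with inline "
                     "CSS and JS, matching the layout and styling as closely as possible. "
                     "Return only the HTML."},
        ],
    }],
)

# Save the reply so it can be opened in a browser; real output may need light
# cleanup, e.g. stripping markdown code fences around the HTML.
with open("reproduction.html", "w", encoding="utf-8") as f:
    f.write(response.choices[0].message.content)
```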

In Zhidongxi's actual test of this app-side feature, likely due to high traffic, the system had not returned a result after nearly 50 minutes. We then submitted the same task on Zhipu's official platform, giving GLM-4.5V a screenshot, and a reproduced version of the webpage was generated in under 10 minutes.

The actual test results of Zhidongxi (Result link: https://chat.z.ai/space/f00sx6s4jgp1-art)

The generated page presents more information than the Xiaohongshu web version: in addition to like counts, it also displays comment counts, and it adds a bottom function bar and a notification button in the upper right corner.

However, it failed to reproduce the alignment of the waterfall-flow (masonry) layout and lacked Xiaohongshu's characteristic look and feel. In addition, interactivity was not implemented in this version, probably because a static screenshot cannot convey dynamic operations; uploading a video might improve the result.

In a case from Zhipu's official demo, staff uploaded a short video of someone operating the Zhihu web version, and GLM-4.5V ultimately delivered a fairly complete webpage, with clicking, page navigation, and text input all working normally.

If users are unhappy with a particular spot on the webpage but do not know where the problem sits in the code, they can simply circle that area on a screenshot of the page, and the model will modify the underlying code accordingly.

In PPT and PDF scenarios, GLM-4.5V can read long, complex documents containing many charts and perform operations such as summarization, translation, and chart extraction.

Rather than using OCR to extract information from images, the model reads pages directly as pictures, which avoids some of the error propagation of a separate extraction step and improves how accurately visual and structured information such as charts and tables is retained and interpreted.

According to the blog post, GLM-4.5V also performs well in traditional computer-vision tasks such as visual grounding: it can accurately identify, analyze, and locate target objects according to user questions and output their bounding-box coordinates.

This capability can be applied to safety and quality inspection as well as aerial remote-sensing monitoring and analysis. Compared with traditional detection using vision-only models, GLM-4.5V can reason through more complex localization instructions, thanks to its richer world knowledge and stronger semantic understanding.
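As a small illustration of how such output could be consumed downstream, the sketch below converts normalized boxes into pixel coordinates; the reply format assumed here (a JSON list of [x1, y1, x2, y2] boxes normalized to 0-1000) is an assumption, so check the model card for the actual convention.

```python
# Sketch of turning a grounding reply into pixel coordinates. The assumed reply
# format is a JSON list of [x1, y1, x2, y2] boxes normalized to 0-1000; the
# real GLM-4.5V convention may differ, so verify it against the model card.
import json

def to_pixel_boxes(reply: str, width: int, height: int):
    """Scale normalized [x1, y1, x2, y2] boxes to pixel coordinates."""
    boxes = json.loads(reply)
    return [
        (round(x1 / 1000 * width), round(y1 / 1000 * height),
         round(x2 / 1000 * width), round(y2 / 1000 * height))
        for x1, y1, x2, y2 in boxes
    ]

# Hypothetical reply for a 1920x1080 aerial image with one detected object.
print(to_pixel_boxes("[[120, 340, 260, 610]]", 1920, 1080))
# -> [(230, 367, 499, 659)]
```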

02. Supports a 64K multimodal context, with targeted improvements in STEM, multimodal grounding, and agent capabilities

GLM-4.5V consists of a visual encoder, an MLP adapter, and a language decoder. It supports a 64K multimodal long context, accepts image and video input, and improves video-processing efficiency through three-dimensional convolution.
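To make that layout concrete, here is a rough PyTorch-style sketch of the encoder, adapter, decoder pipeline for video input; it is an illustrative structure under assumed dimensions, patch sizes, and layer counts, not the actual GLM-4.5V implementation.

```python
# Rough structural sketch of a visual encoder -> MLP adapter -> language decoder
# pipeline with 3D-convolution video patching, mirroring the layout described
# above. All dimensions, patch sizes, and layer counts are placeholder
# assumptions, not GLM-4.5V's real configuration.
import torch
import torch.nn as nn

class VisionLanguageSketch(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # 3D convolution over (time, height, width) merges neighboring frames
        # and patches, reducing the number of visual tokens for video input.
        self.video_patchify = nn.Conv3d(3, vision_dim,
                                        kernel_size=(2, 14, 14),
                                        stride=(2, 14, 14))
        self.visual_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=vision_dim, nhead=16,
                                       batch_first=True),
            num_layers=4)
        # The MLP adapter projects visual features into the language model's
        # embedding space; the decoder (a GLM-4.5-Air-style LLM) is omitted here.
        self.adapter = nn.Sequential(nn.Linear(vision_dim, llm_dim), nn.GELU(),
                                     nn.Linear(llm_dim, llm_dim))

    def encode_video(self, frames):                 # frames: (B, 3, T, H, W)
        tokens = self.video_patchify(frames)        # (B, C, T', H', W')
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, N, C) visual tokens
        tokens = self.visual_encoder(tokens)
        return self.adapter(tokens)                 # ready to interleave with text

model = VisionLanguageSketch()
video = torch.randn(1, 3, 8, 224, 224)              # 8 frames at 224x224
print(model.encode_video(video).shape)              # torch.Size([1, 1024, 4096])
```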