Hands-on with Zhipu's most powerful model: are the "Big Three" of AI programming about to take shape?

ZDONGXI (智东西) · 2026-06-17 16:17
Zhipu has filled in the last piece of the technical puzzle for long-horizon tasks.

ZDONGXI reported on June 17 that Zhipu today officially released and open-sourced its new-generation flagship model, GLM-5.2. On Code Arena, the programming leaderboard of the large-model blind-test platform Arena.ai, GLM-5.2 scored 1595 points. That places it second on the overall leaderboard, behind only Fable 5, and first among globally available models.

On the FrontierSWE benchmark, which assesses "ultra-long-range, open-ended, and high-difficulty software engineering tasks", GLM-5.2 currently ranks behind only Opus 4.8 and the temporarily unavailable Fable 5.

On Design Arena, which specifically evaluates a model's design taste, GLM-5.2 ranked first globally, putting its aesthetic sense at the global forefront.

On Zhihu, the well-known user toyama nao joked that people who access Opus through a relay service will face a new problem: if a relay passes off GLM-5.2 as Opus, they may genuinely be unable to tell the difference.

Users in China and abroad who have tried GLM-5.2 have responded enthusiastically. Some developers put it bluntly: "This is the first domestic model that reaches the Opus level in my workflow."

Overseas users also report that GLM-5.2 exceeded their expectations and that its gap with Fable 5 is much smaller than anticipated. With Fable 5 no longer available, overseas netizens had expected the gap to widen; instead, GLM is quickly catching up, which has given Anthropic a headache.

The GLM-5.2 API is now live, and enterprises and individual users can download and deploy the model directly from open-source platforms such as Hugging Face.

ZDONGXI has previously tested Zhipu's GLM-4.5, GLM-4.7, GLM-5, and GLM-5.1 in depth. After GLM-5.2 was released, we immediately ran several large-scale cases and saw a clear evolutionary path: if GLM-4.7 caught up with the then-top programming model Sonnet 4.6, then with GLM-5.2 the "usage experience" is basically indistinguishable from that of Opus-level models.

Until now, the only globally recognized top players in AI programming models were Anthropic (Claude series) and OpenAI (GPT series). With a first-place ranking among globally available programming models and a real-world reputation among developers as a "replacement for Opus", GLM-5.2 is entering this top-tier club. A "Big Three in Coding" made up of Anthropic, OpenAI, and Zhipu is arguably taking shape.

At a time when closed-source giants dominate the discourse on programming models and can revoke access at any time, GLM-5.2 hands the choice back to developers by being open source.

01. Four Hours of Collaborative Programming with GLM-5.2: Nearly a One-Million-Token Context, 16 Bugs Fixed, and a Civilization Clone Built from Scratch

My first hands-on task was to have GLM-5.2 develop a Civilization-style strategy game from scratch, iterating step by step from version M0 to version M4.

Before development began, I asked GLM-5.2 to write a PRD (product requirements document) and discussed the specific technical implementation with it. We settled on the Godot engine and GDScript to build a 2.5D-style game.

Version M0 is the foundation of the entire project. In this version, GLM-5.2 created and wrote more than a dozen files in a row, generating core content such as a standard map grid and basic game units. Once development was complete, GLM-5.2 quickly ran a verification pass and delivered version M0.

However, this version was only a preliminary result. The game design was still rough: characters were represented only by circular icons, there was no clear game mechanic, and there were many small bugs at the interaction level.

I decided to fix these bugs one by one during the M0 stage. Following my instructions, GLM-5.2 fixed several of them, including an information panel that would not open and initial units that could not move. Each bug was generally fixed within one or two rounds of dialogue, which is quite efficient.

After that, I skipped version M1 and asked GLM-5.2 to go straight to version M2, which is where the game gets its depth. Without explicit requirements from me, GLM-5.2 decided on its own to add four major subsystems: combat, a technology tree, a city economy, and resource limits. These new systems involved a relatively large amount of work, and GLM-5.2 worked continuously for more than 30 minutes to complete them.

Throughout this process, GLM-5.2 strictly followed the development rule we had agreed on: finish a feature, run a test, and move on to the next feature only if the test passes. By the later stage of this iteration, the context had already grown beyond 300,000 tokens, so it is notable that GLM-5.2 still remembered the rule.
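The rule described here (finish a feature, test it, continue only on a pass) can be sketched as a simple gated loop. This is a minimal illustration in Python rather than the project's GDScript, and it is not the harness used in the test: `Feature`, `run_gated`, and the three sample features are hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Feature:
    name: str
    build: Callable[[], None]  # hypothetical: implements the feature
    test: Callable[[], bool]   # hypothetical: returns True if the check passes


def run_gated(features: List[Feature]) -> List[str]:
    """Build features in order and stop at the first one whose test fails."""
    completed = []
    for feature in features:
        feature.build()
        if not feature.test():
            break  # do not start the next feature until this one passes
        completed.append(feature.name)
    return completed


# Toy project state standing in for the game's code base.
state = {"combat": False, "tech_tree": False}

features = [
    Feature("combat", lambda: state.update(combat=True), lambda: state["combat"]),
    # This build step does nothing, so its test fails and the loop stops here.
    Feature("tech_tree", lambda: None, lambda: state["tech_tree"]),
    Feature("economy", lambda: state.update(economy=True), lambda: True),
]

completed = run_gated(features)
print(completed)
```

Because the second feature's test fails, the third feature is never built, which is the point of the gate: errors are caught before more code is stacked on top of them.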

Version M3 turned the game from a sandbox into a complete match with a winner and a loser. GLM-5.2 implemented a tactical AI for the enemy and enlarged the map. Although my instructions focused mainly on feature iteration, GLM-5.2 also proactively considered performance. As the map grew, it decided to split terrain rendering into static and dynamic layers and added caching to the minimap, which made the game run more smoothly.
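The idea behind a static/dynamic split is that terrain rarely changes, so it can be drawn once and reused, while units are redrawn every frame. The sketch below illustrates that idea in Python with a text grid. It is an assumption about the general technique, not the GDScript that GLM-5.2 wrote, and `LayeredMapRenderer` and its methods are hypothetical names.

```python
class LayeredMapRenderer:
    """Caches the static terrain layer and overlays dynamic units per frame."""

    def __init__(self, terrain):
        self.terrain = terrain      # list of strings; changes rarely
        self._static_cache = None   # rendered terrain, built lazily
        self.static_redraws = 0     # counts how often terrain is rebuilt

    def _render_static(self):
        # Rebuild the terrain layer only when the cache is empty.
        if self._static_cache is None:
            self.static_redraws += 1
            self._static_cache = [list(row) for row in self.terrain]
        return self._static_cache

    def invalidate_terrain(self):
        # Call this when the terrain itself changes.
        self._static_cache = None

    def render(self, units):
        # Copy the cached terrain, then draw the dynamic layer on top.
        frame = [row[:] for row in self._render_static()]
        for (x, y), symbol in units.items():
            frame[y][x] = symbol
        return ["".join(row) for row in frame]


renderer = LayeredMapRenderer(["..~", ".^.", "..."])
frame1 = renderer.render({(0, 0): "W"})  # unit at column 0, row 0
frame2 = renderer.render({(1, 0): "W"})  # unit moved one tile to the right
print(frame1, frame2, renderer.static_redraws)
```

Two frames are produced but the terrain layer is built only once. A minimap cache works the same way: regenerate it when the underlying map changes rather than on every frame.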

Work on the later M4 version focused mainly on aesthetics and playability. At this stage, GLM-5.2 showed good aesthetic sense. For example, when I told it that the game's UI "lacked a game-like feel" and was just a pile of text, it found assets on its own to update the icons and redesigned the interaction cards, which lifted the visual quality of the whole game.

Finally, I ran into an unexpected bug: once the map was expanded to 100x100, the screen jumped violently when dragged, and nothing I tried fixed it. GLM-5.2 eventually located the cause: the problem had existed since version M0, only became obvious after the map was enlarged, and was related to UI controls.

This kind of root-cause analysis shows that GLM-5.2 can reach back across hundreds of thousands of tokens of context and pinpoint a bug hidden in the earliest version of the code.

After finishing all of these development tasks, we did a simple tally. In this project, GLM-5.2 used 870,000 tokens of context in total, which is close to its limit.

With the context close to one million tokens, GLM-5.2 reviewed all the bugs it had fixed during the task. It counted 16, which matched the actual number. It also still remembered the cause and fix for each bug, demonstrating reliable memory in a one-million-token context scenario.

02. GLM-5.2 Reads 30 Hours of Podcast Transcripts in One Go

Beyond programming, GLM-5.2's one-million-token context unlocks many other uses. In my daily work I often need to consolidate large amounts of long-form text, and a model with a larger context window makes that noticeably more efficient.

In the test, I uploaded 13 AI-related podcast transcripts at once. Their total duration exceeded 30 hours, and the text came to about 250,000 words, equivalent to at least 300,000 tokens. The episodes came from The Lex Fridman Podcast, featured different guests, spanned several weeks, and covered sub-fields such as large-model architecture, enterprise AI strategy, multimodality, AI security, and the open-source ecosystem. The information was highly dispersed, with many views echoing, supplementing, or contradicting one another across episodes.
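As a rough sanity check on these figures, the short calculation below derives the implied tokens-per-word ratio and the share of the context window the transcripts occupy. The word, token, and transcript counts come from the text above; treating the "one-million-token" window as exactly 1,000,000 tokens is an assumption made for the arithmetic.

```python
WORDS = 250_000             # approximate word count reported above
MIN_TOKENS = 300_000        # "at least 300,000 tokens"
TRANSCRIPTS = 13
CONTEXT_LIMIT = 1_000_000   # assumption: window taken as exactly one million tokens

tokens_per_word = MIN_TOKENS / WORDS        # implied lower bound on the ratio
context_share = MIN_TOKENS / CONTEXT_LIMIT  # fraction of the window used
avg_tokens = MIN_TOKENS / TRANSCRIPTS       # average size of one transcript

print(f"{tokens_per_word:.2f} tokens per word")
print(f"{context_share:.0%} of the context window")
print(f"about {avg_tokens:,.0f} tokens per transcript")
```

On these numbers the 13 transcripts fill roughly a third of the window, which leaves room for the follow-up questions and answers in the same session.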

After having GLM-5.2 read all 13 transcripts in one pass, I gave it the following analysis tasks:

(1) Cross-period View Tracking:

I asked GLM-5.2 to trace how the topic "whether the scaling law has encountered a bottleneck" was discussed across all 13 transcripts. GLM-5.2 identified Jensen Huang's (Huang Renxun's) flat rejection of the "pre-training bottleneck theory" and also found Sam Altman's emphasis on the importance of computing power in scaling, connecting a complete chain of evolving views across more than 30 hours of conversation and hundreds of thousands of words.

Finally, GLM-5.2 gave a summary. In 2023, the discussion was still about pre-training scaling alone, but the definition of the scaling law has since kept expanding into four curves covering pre-training, post-training, test-time (inference), and agents. It also judged that the main difficulty at present is still at the architecture level