
First tests of Musk's most expensive AI cause a stir: Grok 4 wins high praise and suffers setbacks in equal measure. Netizens: is this what I get for paying $20,000?

爱范儿 (ifanr) · 2025-07-11 17:25
Deified on one hand, a spectacular failure on the other.

The AI stage has never lacked for "a new king ascending the throne" plotlines.

For months, models have taken the stage one after another, each claiming to be more impressive than the last. Take yesterday's Grok 4: Elon Musk declared it "the smartest AI on Earth," and it had generated enormous buzz even before launch.

However, Grok models have a track record of excelling on benchmarks while falling short in real-world use.

Now, 24 hours after Grok 4's release, we've collected some real-world test cases shared by netizens. Let's see whether this model really has the chops, or whether it's another case of starting strong and ending weak.

Programming: A Mix of Highlights and Blunders

Blogger @mckaywrigley posed a creative programming problem to Grok 4 Heavy.

He asked it to create a three.js animation in which a crowd of people walks around and finally forms the words "Hello, world, I'm Grok," followed by a camera cut to a bird's-eye view. Grok needed only one attempt to deliver a surprisingly good result.

Throughout the process, Grok actively retrieved 3D model resources from the internet and built the entire scene in the browser with three.js. The new version of Grok has clearly improved in areas such as three.js and Blender.

Of course, UI generation remains a significant weakness. In the words of a netizen, "It's not the best designer. I sincerely hope it can catch up with Claude Opus 4 in this aspect, but it does have a knack for logical modeling and structure control."

It's worth mentioning that Grok 4 Heavy can run multiple agents in parallel, each working independently before their results are aggregated, which raises output quality at the mechanism level.

Blogger @tetsuoai put Grok 4 straight to "work," asking it to act as a C programmer with 15 years of experience and write a CLI tool that sorts and organizes the files in a folder.

Grok's output was thoroughly professional: the code was rigorous, and the handling of details was careful. For example, it used strrchr() to extract file extensions, strdup() to avoid dangling pointers, did not overlook edge cases or hidden files, and even used the ctype.h routines for case conversion.

Next, the challenge was ramped up.

He then asked Grok to design a 2D autonomous-driving simulation based on DQN reinforcement learning, covering everything from perception and training to collision feedback. Grok produced a complete set of code in one go, and the trained car could even pick up speed and complete laps autonomously.

Another test came from @DirtyTesLa, who asked Grok to write a web-based mini-game. The game ran surprisingly smoothly, though the player's own performance in the demo dragged down the overall impression.

However, Grok 4 also had some real blunders.

Blogger @karminski3 presented his classic test: a 3D physics simulation of 20 balls bouncing inside a heptagon. Grok ran the test three times; twice it produced outright syntax errors, and the one successful run was merely "barely usable."

Compared with the early version of DeepSeek-R1, Grok 4 doesn't have a significant edge.

He then added a more challenging test: "Chimney demolition simulation".

This is a 3D physics construction task: build a chimney structure with three.js, add a demolition point at the base, and simulate the collapse. Although the physics seems to involve only collision and gravity, the task actually tests the model's ability to understand instructions, generate code, and design interactions.

The good news is that it got the direction of gravity right, and the collapse effect was basically valid. However, the chimney ended up in a "half-demolished" state, the particle simulation was odd, the smoke rendering was blurry, the lighting was rough, and the UI was a mess: the buttons were gray and almost invisible.

Writing: High IQ but Low "EQ"

Grok 4's performance with a 192k context window is second only to Gemini's. In tests from 1k to 120k tokens, Grok 4 maintained a consistently high level, indicating genuine strength in semantic coherence and long-range recall.

When a netizen asked Grok 4 to write a six-line poem in which every word starts with the letter "S" while covering five elements (love, betrayal, revenge, tragedy, and heroism), Grok actually managed to write one, and it read quite smoothly.

However, in the more comprehensive short-story creative-writing benchmark, Grok 4's score of 7.69 is only average.

The evaluation team's summary was straightforward. Although Grok 4 can continuously produce stories with clear structures and complete plot progressions, the plots tend to be formulaic, the endings are dull, the language is overly showy, and the symbols and metaphors are superficial.

SVG Test Ground: Drawing Without Aid

Having large models generate SVG images can better evaluate their visual and spatial reasoning abilities, which is also one of the key capabilities on the path to AGI. A Reddit user designed a task where four models were asked to draw without any tool assistance.

[Draw the Map of the United States from Memory]

In the first round, the models were asked to generate the outline of the contiguous United States. Grok 4's geographical details were a bit blurry, but the outline logic was relatively complete. Meanwhile, Claude 4 Sonnet was the only model that accurately marked three regions (the contiguous United States, Alaska, and Hawaii) and added place names, showing better spatial sense and knowledge retrieval ability.

[Recreate a Line Drawing Comic]

When asked to reconstruct a line-drawing comic that had been split into three panels as a single pure SVG, Grok 4 performed outstandingly, with natural character movements. o3 also tried to piece the whole picture together, but its layout was chaotic, with overlapping text and clashing dialogue.

[Reconstruct an Album Cover]

In the third round, the models were asked to draw the cover of Radiohead's "In Rainbows". OpenAI o3 was the only model that highly reproduced the layout and structure, demonstrating strong memory and design execution ability. In contrast, Grok 4's composition was a bit thin and lacked a sense of hierarchy.

[Draw a Diagram of the Krebs Cycle]

In the biological-diagram task, Grok 4's output was well structured, with key elements such as NADH, ATP, and CO₂ all included in a logical layout. Claude 4 Sonnet had a very strong visual hierarchy, producing a diagram comparable to a PPT template. o3's style was more like classroom notes: concise but pedagogically clear.

[Draw Your Self-Portrait with SVG]

Finally, the models were asked to draw themselves, with no style restrictions. Grok 4 drew a human face; Gemini 2.5 Pro's output was a bit abstract; OpenAI o3 was highly recognizable and friendly; and Claude 4 Sonnet's output had a strong modern - art flavor.

Visualization: Black Hole Simulation, Euler's Identity, and a Philosophical Self-Portrait

Netizen @techartist_ used Grok 4 to build an interactive 3D black-hole simulation and visualization. It used three.js for rendering, combined with custom GLSL shaders, to reproduce the starry background and striking visual effects.

In a more "philosophical" test, @dvorahfr asked Grok an abstract question: "If you had to exist in a physical form, what would you look like?"

Blogger @KettlebellDan