Just now, Elon Musk released Grok 4, which ranked first on the overall list, and its annual fee soared to over 20,000.
All disciplines are at the postdoctoral level.
After a long period of preparation, xAI's next-generation large model, Grok 4, has finally been released! Its capabilities exceed our imagination.
At around 12:00 noon Beijing time today, the long-awaited xAI press conference finally began. Elon Musk appeared in the live stream. As soon as he came on, he said, "This is the best AI in the world. Let's show it off."
Musk said that Grok 4 can always get a perfect score in the SAT (the US college entrance exam) without looking at the questions in advance. It can also get close to a perfect score in any subject of the GRE, surpassing the level of all postgraduate students in the world. The most powerful aspect of Grok 4 is its reasoning ability, which has exceeded the reasoning level of humans.
Musk believes that Grok 4 can make new scientific discoveries within this year.
Thanks to the enhanced computing power and reinforcement learning training, the reasoning ability of Grok 4 has been improved by 10 times compared to its predecessor. From Grok 2 to Grok 4, different technical paradigms are adopted, namely next token prediction, pre - training calculation, pre - training + RL, and RL calculation.
Among them, the computing volume during the pre - training phase from Grok 2 to Grok 3 increased by 10 times. Grok 3 reasoning introduced RL fine - tuning for the first time, bringing in deep reasoning ability. The reinforcement learning of Grok 4 reasoning has increased the computing volume by another 10 times, which means a significant improvement in reasoning ability.
In addition, due to the improvement of the ability to call tools, Grok 4 has further amplified its own intelligence. Therefore, it can achieve results far beyond the state - of - the - art (SOTA) in various high - difficulty benchmarks.
Next comes the highlight: the benchmark test results of Grok 4.
First, HLE (Humanities Last Exam), including mathematics, chemistry, and logic. In the benchmark test results leaked last Saturday, Grok 4's standard score on the HLE (Humanities Last Exam) was 35%, which increased to 45% after using reasoning technology. However, most netizens were skeptical.
During today's live stream, xAI researchers said that in the past, the highest score that SOTA models could achieve when using tools was 41.0%.
Now, Grok 4 has further improved the results of this benchmark test.
Specifically, compared with other SOTA models (o3, Gemini 2.5 Pro), when using tools, Grok 4 scored 38.6%, and Grok 4 Heavy's score soared to 44.4%. If the large model is allowed to spend more time thinking during the test and use more external tools appropriately, the HLE score can be further increased to 50.7%.
Regarding other benchmark test results, including GPQA (Graduate - level Google Verification Q&A Benchmark), AIME25 (American Invitational Mathematics Examination), LCB (Jan - May) (Programming Competition / Online Algorithm Competition), HMMT25 (High School Mathematical Tournament), and USAMO25 (USA Mathematical Olympiad). As can be seen from the following figure, Grok 4 Heavy has achieved the latest SOTA in all of them.
In contrast, humans can hardly answer a few questions in the HLE test. Elon Musk emphasized several times: Grok now reaches the postdoctoral level in all disciplines, without exception. It hasn't discovered new science or new physical laws yet, but it's just a matter of time.
"I'd be very surprised if Grok doesn't discover practical new science and technology within this year," Musk said.
The full - set benchmark test results of the large - model performance evaluation platform Artificial Analysis show that Grok 4 has become the current leading AI model, with a total score of 73 points, leading o3, Gemini 2.5 Pro, Claude 4 Opus, and DeepSeek R1 0528.
Imagine where we are now. We are in the process of a big bang in intelligent development, which is unprecedented in human history. It's time to see what Grok 4 can actually do.
Let's take a look at one or two demos, such as "a 30 - second visualization of two black holes colliding and generating gravitational waves based on physical principles using HTML animation":
Grok 4 almost completely presented the simulation effect of gravitational waves from the approach of two black holes to their final merger. On one side of the animated image are the reasoning process, calculation steps, and code, and each paper consulted has a link.
Grok 4 has become a more versatile model
In addition to the improvement in various language benchmark scores, Grok 4 has also been strengthened in other aspects.
Among them, the speech ability of Grok 4 is twice as fast as its predecessor, with lower end - to - end latency; it supports 5 languages; and the total daily user stay time has increased by 10 times.
The newly added Grok characters, Eve and Sal, are now available in the iOS version of Grok. Sal supports multiple personalities, and Eve can sing and whisper.
In the ARC - AGI benchmark test set, which is specifically designed to evaluate the general reasoning ability of artificial intelligence systems and is regarded as an important touchstone for achieving AGI, aiming to test whether the model can flexibly solve new problems it has never seen before like humans.
On this extremely difficult benchmark that points directly to the core ability of AGI, Grok 4 has also achieved the latest SOTA. It reached 15.9% on ARC - AGI - 2, almost doubling the previous commercial SOTA and surpassing the current Kaggle competition SOTA.
In the Vending - Bench benchmark test, which focuses on evaluating the ability of agents to perform complex operation tasks in the real physical world. Its core goal is to bridge the "Sim2Real Gap" between traditional simulation environments (such as Habitat, AI2 - THOR) and the real world, and promote the practical application ability of robot technology in open scenarios.
It can be seen that Grok 4 has taken the lead compared with Claude Opus 4, humans, Gemini 2.5 Pro, and o3.
Grok 4 can be called through the API, providing a context window of 256K tokens. It is currently open for use, with the version number grok - 4 - 0709, and the price is the same as Grok 3.
According to the tests of Artificial Analysis, xAI's API currently provides Grok 4 services at a speed of 75 tokens per second. Although it is not as fast as o3 (188 tokens per second), it is better than Claude 4 Opus Thinking (66 tokens per second).
Finally, regarding the game experience, DannyLimanseta used Grok 4 to make a first - person shooter (FPS) game within 4 hours. Grok can not only be used to make games but also run games, understand the elements of excellent games, and propose improvement suggestions. The effect looks really good.
Next, xAI is expected to release a code model, a multimodal agent, and a video generation model. It seems that new products will be released at a monthly rate.