Shocking Scores of Musk's Grok 4 Leaked: Tops "Humanity's Last Exam" with 45% to Take First Place

OpenAI, Gemini and Claude are left stunned!

Grok 4's scores have leaked ahead of launch. It reportedly scored as high as 45% on "Humanity's Last Exam", far ahead of Gemini and Claude, which would make it one of the strongest models on current tests. Elon Musk has said that Grok 4 builds its reasoning on "first principles", and the model is expected to reshape the LLM landscape.

Grok 4 is coming soon, according to Elon Musk!

Even the currently deployed Grok has shown significant improvements in capabilities.

Meanwhile, a screenshot posted by netizen LEGIT leaked the scores of Grok 4 and Grok 4 Code on several key benchmarks.

The leak has since been confirmed by Tibor Blaho, a well-known figure in AI circles.

According to the leaked data, Grok 4 "leads by a large margin" on the GPQA, AIME 25, and SWE-bench evaluations, outscoring Google's Gemini 2.5 Pro, OpenAI o3, and Claude 4 Opus across the board.

GPQA (Graduate-Level Google-Proof Q&A): Grok 4 scored 87-88%, slightly better than Gemini 2.5 Pro's 86.4% and significantly higher than Claude 4 Opus's 79.6%.

AIME 25 (2025 American Invitational Mathematics Examination): Grok 4 scored 95%, far ahead of Claude 4 Opus's 75.5% and better than OpenAI o3's 88.9%.

SWE-bench (real-world software engineering tasks): Grok 4 Code scored 72-75%, on par with or slightly ahead of Claude 4 Opus's 72.5% and OpenAI o3's 71.7%.
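To put the three comparisons side by side, here is a minimal Python sketch that computes the lead, in percentage points, implied by the leaked figures. The numbers are only those quoted above and remain unverified; the dictionary layout and the helper name `margin_range` are illustrative assumptions. Because Grok's scores are given as ranges, each lead is also a range.

```python
# Scores in percent, as quoted from the leaked screenshot (unverified).
# Each entry is a (low, high) range; single values repeat the same number.
leaked = {
    "GPQA": {
        "Grok 4": (87.0, 88.0),
        "Gemini 2.5 Pro": (86.4, 86.4),
        "Claude 4 Opus": (79.6, 79.6),
    },
    "AIME 25": {
        "Grok 4": (95.0, 95.0),
        "OpenAI o3": (88.9, 88.9),
        "Claude 4 Opus": (75.5, 75.5),
    },
    "SWE-bench": {
        "Grok 4 Code": (72.0, 75.0),
        "Claude 4 Opus": (72.5, 72.5),
        "OpenAI o3": (71.7, 71.7),
    },
}


def margin_range(benchmark, leader, rival):
    """Return the (smallest, largest) lead of `leader` over `rival` in points."""
    lead_lo, lead_hi = leaked[benchmark][leader]
    rival_lo, rival_hi = leaked[benchmark][rival]
    return round(lead_lo - rival_hi, 1), round(lead_hi - rival_lo, 1)


for benchmark, scores in leaked.items():
    leader = next(iter(scores))  # the Grok entry is listed first
    for rival in list(scores)[1:]:
        low, high = margin_range(benchmark, leader, rival)
        print(f"{benchmark}: {leader} vs {rival}: {low:+.1f} to {high:+.1f} points")
```

Note that at the low end of its reported range, Grok 4 Code would sit half a point behind Claude 4 Opus on SWE-bench, so the lead there is much narrower than on AIME 25.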

Moreover, Grok 4 reportedly scored 35% by default and up to 45% on "Humanity's Last Exam" (HLE), a closed-book academic benchmark with the widest coverage and the highest difficulty.

This means that at its strongest, Grok 4 scores roughly twice as high as the current leader, Gemini 2.5 Pro, a full 24 percentage points higher.

Compared with Claude 4 Opus, which answers only 10.7% correctly, Grok 4's score is more than four times as high.
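The arithmetic behind these comparisons can be checked with a short Python sketch. It uses only the figures stated above; Gemini 2.5 Pro's score is not quoted directly in the leak coverage, so it is inferred here from the stated 24-point gap, which is an assumption of this sketch.

```python
# Figures in percent, as reported in the leak (unverified).
GROK4_HLE_DEFAULT = 35.0  # default setting
GROK4_HLE_BEST = 45.0     # strongest setting
CLAUDE4_OPUS_HLE = 10.7
GAP_OVER_GEMINI = 24.0    # percentage points, as stated in the article

# Assumption: Gemini 2.5 Pro's score is implied by the stated gap.
implied_gemini_hle = GROK4_HLE_BEST - GAP_OVER_GEMINI


def ratio(a, b):
    """Return how many times larger a is than b."""
    return a / b


print(f"Implied Gemini 2.5 Pro score: {implied_gemini_hle:.1f}%")
print(f"Grok 4 (best) vs Gemini 2.5 Pro: {ratio(GROK4_HLE_BEST, implied_gemini_hle):.2f}x")
print(f"Grok 4 (best) vs Claude 4 Opus: {ratio(GROK4_HLE_BEST, CLAUDE4_OPUS_HLE):.2f}x")
print(f"Grok 4 (default) vs Claude 4 Opus: {ratio(GROK4_HLE_DEFAULT, CLAUDE4_OPUS_HLE):.2f}x")
```

On these numbers, "twice" and "more than quadrupled" both hold only for the 45% best-case score; at the 35% default, the multiple over Claude 4 Opus is a little over three.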

The HLE exam is extremely tough and is designed to humble LLMs:

It consists of 2,500 expert-level questions across more than 100 disciplines

14% of the questions are multimodal (text + image)

24% of the questions are multiple-choice

It includes anti-memorization safeguards and a hidden test set to prevent "cheating-style training"
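As a rough illustration of what those percentages mean in absolute terms, the following Python sketch converts them to question counts. It assumes the 2,500 total and both percentages are exact, which the source does not guarantee; the helper name `share_to_count` is hypothetical.

```python
TOTAL_QUESTIONS = 2500      # as stated above
MULTIMODAL_PCT = 14         # text + image questions
MULTIPLE_CHOICE_PCT = 24    # multiple-choice questions


def share_to_count(total, percent):
    """Convert a whole-number percentage of `total` into a question count."""
    return total * percent // 100


multimodal = share_to_count(TOTAL_QUESTIONS, MULTIMODAL_PCT)
multiple_choice = share_to_count(TOTAL_QUESTIONS, MULTIPLE_CHOICE_PCT)

print(f"Multimodal questions: about {multimodal}")
print(f"Multiple-choice questions: about {multiple_choice}")
```

The two categories may overlap (a question can be both multimodal and multiple-choice), so the counts should not be added together.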

The following is a high-level chart of the knowledge it covers; each category includes many specific disciplines.

Project homepage: https://lastexam.ai/

Most cutting-edge models cannot come anywhere close to this score.

If this leak is true, then Grok 4 has passed one of the most difficult levels in the field of AI benchmark testing.

Due to its extremely high score in the HLE, the release of Grok 4 has once again sparked extensive discussions in the community.

Yes, if it's true, it means that this model has extremely powerful world knowledge.

Seeing such a powerful Grok 4, netizens can't wait and are urging its release online:

Source Code of Grok 4 Leaked

People's expectations for Grok 4 have been fully raised.

Elon Musk revealed in an earlier interview that Grok 3.5 tries to reason from first principles, that is, to apply the methods of physics to the thinking process.

Grok 3.5 is now Grok 4. Musk decided to jump directly from Grok 3 to Grok 4 instead of shipping small, incremental updates.

This seems to indicate that Grok 4 will have a significant breakthrough in capabilities!

A few days ago, someone on X found two Grok 4 models, Grok 4 and Grok 4 Code, in the source code of the xAI console.

Grok 4:

The latest and most outstanding flagship model. It shows unparalleled performance in natural language, mathematics, and reasoning, and is the perfect all-around choice.

Grok 4 Code:

A model specifically designed as a programming companion. You can ask it code-related questions or embed it directly in a code editor.

Some People Are Skeptical

Of course, some people still seem "heartbroken" by the earlier hype around Grok 3.

Dan Hendrycks, the creator of HLE, is an adviser to xAI and is closer to it than to other labs.

Netizens want to know whether Dan Hendrycks provided only safety advice or also gave specific R&D suggestions that could have improved the model's detailed scientific knowledge.

This inevitably recalls the earlier Llama 4 fiasco, which was also attributed to "targeted training" ahead of the benchmarks.

Elon Musk Promotes It Himself

Elon Musk posted on June 27 that he and his team were working overtime to develop Grok.

Grok 4 will be released after July 4. Going by US Eastern Time, that means it could arrive at any time from today.

Elon Musk specifically emphasized that a large-scale training run was needed to develop a "specialized" coding model.

At the Microsoft Build 2025 conference on May 20, Elon Musk explained that Grok 3.5 (now Grok 4) would be built on first principles.

Elon Musk:

Especially in the upcoming Grok 3.5, our goal is to make the model reason from first principles.

That is to say, think like a physicist and use the tools of physics to analyze problems.

If you want to get to the fundamental truth of things, you must break the problem down to the most basic axioms that are most likely to be correct, and then reason upwards from these foundations.

Then, you can check the final conclusion against these basic principles. In physics, if your result violates the conservation of energy or momentum, either you have discovered a Nobel-level new theory or, more likely, you have made a mistake.

So the core goal of building Grok 3.5 is to be guided by the basic principles of physics, apply these methods to reason about various problems, and strive to approach the truth with the smallest error.

Of course, errors are inevitable, but our goal is to continuously reduce these errors. This direction is crucial for AI safety.

I've been thinking about AI safety for a long time, and my final conclusion can actually be summarized by an old saying: Honesty is the best policy.

This is not only a moral requirement but also a safety guarantee. Of course, we will also make mistakes, but we promise to correct them as soon as possible.

We also look forward to feedback from the developer community: what do you need? Where did we go wrong? And how should we improve?

We hope that Grok will become a tool that developers are looking forward to and a platform where their voices can be truly heard.

Grok will continue to evolve and strive to meet the needs of developers.

Coding Ability Becomes a Battleground

Based on earlier speculation about the models listed in the Grok API, Grok 4 Code will be the highlight of this release, and there may also be a Grok 4 mini.

Elon Musk specifically highlighted Grok 4's coding ability, a focus that also reflects what other companies are doing. Coding ability has become the touchstone for measuring new models.

Google

Gemini 2.5 includes improved code generation, complex code refactoring and transformation, context management, better PR review capabilities, and customizable commands.

The Gemini CLI is a recently launched command-line AI assistant based on Gemini 2.5 Pro. It can handle a context of up to one million tokens and supports a range of development tasks, including code writing, debugging, content generation, and task management.

Anthropic

Claude 4 (including Opus and Sonnet) is the most powerful model series developed by Anthropic so far, significantly improving coding and AI agent capabilities.

Claude Code focuses on use in the terminal and provides a one-stop tool covering code editing, bug fixing, and architecture understanding, as well as running tests, linting, git operations, and PR creation.

OpenAI

The new version of Codex is fine-tuned from OpenAI o3 and is used to translate natural language into code, continuing the core capabilities of current-generation tools (such as GitHub Copilot).

DeepSeek

DeepSeek-R1-0528 is the latest R1 version launched by DeepSeek, positioned as a model with improved all-around reasoning and coding capabilities.

Since Elon Musk emphasized the coding ability, this release may be worth

The views in this article are solely those of the author. The 36Kr platform only provides information storage services.
