GPT-5 Internal Test Leaks Early: Beats Humans at Everyday Reasoning for the First Time, Strong in Programming, Math, and Science

OpenAI has announced a press conference at 1:00 a.m. tonight.

Right after the teaser for what is presumed to be GPT-5's release went out, hands-on reports from the internal test leaked ahead of schedule.

According to these reports, its reasoning ability surpasses humans for the first time and outperforms every other large model.

The claim comes from one user's hands-on test. He put every model into reasoning mode and had each answer 10 questions. GPT-5 got only one question wrong, a higher accuracy rate than humans.

It almost always answered correctly on the first try and never needed more than two. Other large models needed more attempts.

Nor is this an isolated case. Another user reported very similar results, with GPT-5 again getting only one of 10 questions wrong.

Beyond its reasoning ability, people with internal test access said that GPT-5 is also excellent at programming, mathematics, and solving scientific problems.

Well, some people have already started joking that GPT-5 will replace doctors.

What's certain now is that OpenAI has announced tonight's press conference and changed the "s" in "livestream" to "5".

And Altman, cryptic as ever, just posted a picture... You can make your own guesses.

In short, everything seems about to happen. Let's take a look at the early leaks to see how it actually performs!

Reasoning and programming abilities are worth attention

Currently, the notable abilities of GPT-5 include:

Reasoning

Programming

Solving scientific problems

Mathematics

First, on reasoning: user @invincibleHunter tried it out on Copilot.

The model did not reveal its version. However, a few days earlier someone had found that the Smart mode about to launch on Copilot integrates GPT-5, so the model is presumed to be GPT-5.

He tested 10 questions in total, logic problems along these lines:

Beth put four whole ice cubes into the frying pan at the start of the first minute, five at the start of the second minute, and some more at the start of the third minute, while none were added in the fourth minute. If the average number of ice cubes put into the pan per minute during the process of frying crispy eggs is five, then how many whole ice cubes will be in the pan at the end of the third minute?
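The arithmetic part of this puzzle can be sketched as follows. This is a minimal illustration, assuming the stated average is taken over all four minutes; the function name `cubes_added_in_third_minute` is hypothetical. It only computes how many cubes were added in the third minute. How many whole cubes remain in a hot frying pan at that point is the common-sense part of the question, which the arithmetic does not settle and for which the article does not give the expected answer.

```python
def cubes_added_in_third_minute(known_additions, average_per_minute, minutes):
    """Solve for the one unknown addition given the per-minute average.

    known_additions: cubes added in the minutes whose counts are stated.
    average_per_minute: stated average number of cubes added per minute.
    minutes: number of minutes the average is taken over (assumed here: 4).
    """
    total_added = average_per_minute * minutes
    return total_added - sum(known_additions)


# Minute 1: 4 cubes, minute 2: 5 cubes, minute 4: 0 cubes; minute 3 is unknown.
third_minute = cubes_added_in_third_minute([4, 5, 0], average_per_minute=5, minutes=4)
print(third_minute)  # 5 * 4 - (4 + 5 + 0) = 11
```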

The model then enters thinking mode to reason through the problem.

The only question it failed was:

There are two sisters. Amy always lies, and Sam always lies. You can't tell which one is which. You can ask one of the sisters a question to determine which of the two paths leads to the treasure. Which question should you ask to find the treasure (if two or more questions work, the correct answer is the shorter one)?

A) If I asked your sister which path leads to the treasure, what would she say? B) What's your sister's name? C) What's the path to find the treasure? D) If you had to guess, which path do you think I'd choose? E) What's in the treasure? F) What's your sister's phone number?

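The correct answer is C, but GPT-5 answered A. Because both sisters lie, whichever one is asked directly will point to the wrong path, so you simply take the other one. A also works, but C is shorter.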
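One way to check this answer is to enumerate the cases. The sketch below is a minimal model, not the tester's method: it assumes two paths labeled `left` and `right`, assumes that a liar asked a which-path question always names the path opposite to the true answer, and covers only options A and C, since the other options do not ask where the treasure is. All function names are hypothetical.

```python
PATHS = ("left", "right")


def other(path):
    """Return the path that is not `path`."""
    return PATHS[1] if path == PATHS[0] else PATHS[0]


def liar_direct(treasure):
    """Option C: a liar asked for the treasure path names the wrong one."""
    return other(treasure)


def liar_about_sister(treasure):
    """Option A: the sister (also a liar) would name the wrong path, and the
    sister being asked lies about that, so she names the right one."""
    sister_would_say = liar_direct(treasure)
    return other(sister_would_say)


def is_informative(answer_fn):
    """A question works if different treasure paths yield different answers."""
    answers = {answer_fn(treasure) for treasure in PATHS}
    return len(answers) == len(PATHS)


questions = {
    "A) If I asked your sister which path leads to the treasure, what would she say?": liar_about_sister,
    "C) What's the path to find the treasure?": liar_direct,
}

# Keep the questions that identify the treasure, then apply the length tie-break.
working = [text for text, fn in questions.items() if is_informative(fn)]
shortest = min(working, key=len)
print(shortest[0])  # prints "C"
```

Under these assumptions, C means taking the path opposite to the one named, while A means taking the path named. Both identify the treasure, so the puzzle's tie-break on length selects C.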

The tester conceded, however, that this question was very difficult and that he might have gotten it wrong himself.

Some people questioned the test results, arguing that the questions came from public datasets and might be included in the model's training data.

The tester replied that GPT-5's answers were long and accurate, which he took as evidence that the model was genuinely reasoning its way through the problems.

Its multimodal ability was tested as well, and it could directly generate an SVG of a unicorn.

Compared with what GPT-4 generated, it is a huge leap.

Moreover, two people with internal test access said they found GPT-5 very strong at programming and at solving scientific and mathematical problems.

However, they also said that the improvement from GPT-4 to GPT-5 seems less dramatic than the one from GPT-3 to GPT-4.

The reasons behind this may be related to data and AI infrastructure.

GPT-4's leap came mainly from more data and more computing power. On the compute side, OpenAI has kept expanding without obvious obstacles, but the shortage of data is hard to solve.

There were earlier rumors that OpenAI hired scientists to write training data in order to supply GPT-5 with enough high-quality data.

More recently, there have been reports that GPT-5's parameter count is much larger than GPT-4's.

Another factor is AI infrastructure. Because the model is so large, pre-training has become exponentially more difficult. Researchers have to wait for a pre-training run to finish before they can judge the model's performance, which takes several months. This has also affected GPT-5's release schedule to some extent.

Meanwhile, market competition is fierce. Core competitors such as Google and Anthropic are putting pressure on OpenAI.

For example, during the week of GPT-5's pre-launch hype, both companies released new models to grab the limelight.

There are also reports that Google will release an open-source large model to compete directly with OpenAI.

So it's understandable why Altman has been doing this kind of "crying wolf"-style promotion recently (just kidding).

Judging from the various signs, OpenAI's livestreamed launch at 1 a.m. on August 8, Beijing time, is most likely GPT-5. Everyone can look forward to it.

Finally, Altman's recent tweet was so cryptic that many people couldn't figure out what it meant, so they have been tagging @grok to explain it.

Grok's answers vary quite a bit. Take them as a reference only.

1. This photo shows the Death Star from "Star Wars", a planet-destroying space station. Sam Altman probably used this metaphor to joke that OpenAI's upcoming GPT-5 is a powerful AI model that may dominate competitors such as Google's Gemini 3.0. "That's no moon..." means it's far more than it seems.

2. That's no moon, it's a space station. Specifically, this is an AI-generated image of the Death Star from "Star Wars" posted by Sam Altman, probably hinting at a major announcement such as OpenAI's release of GPT-5. Exciting times are ahead!

3. This is a quote from "Star Wars": the Death Star looks like a moon but is a powerful space station ("That's no moon..."). NASA announced in August 2025 that it will quickly build a nuclear reactor on the moon by 2030 to meet the moon's power needs.


The views expressed in this article are solely the author's own. The 36Kr platform only provides information storage services.
