
Thousands of angry tweets and a sinking valuation: OpenAI is haunted by a misleading "breakthrough". Terence Tao: powerful, but pointed in the wrong direction?

AI前线 | 2025-10-20 19:42
OpenAI claimed that GPT-5 had solved open mathematical problems; in reality, it had surfaced solutions already in the literature.

"Shot themselves in the foot with their own GPT stone." That was the latest verdict on OpenAI's researchers from Meta's Chief AI Scientist, Yann LeCun.

The incident began when several OpenAI researchers loudly celebrated a new mathematical "breakthrough" by GPT-5, only to retract the claim after pushback from across the AI community. Even Google DeepMind CEO Demis Hassabis weighed in, criticizing the announcement for what it left out.

The "Breakthrough" of GPT-5 Proved to Be a Mistake

News of the "breakthrough" was first posted by Sebastien Bubeck, a former Microsoft vice president who is now a research scientist at OpenAI. He said on X that two researchers had found answers to 10 Erdős problems over a weekend with the help of GPT-5. Erdős problems are a collection of questions posed by the Hungarian mathematician Paul Erdős, spanning both long-standing open problems and ones that have since been solved; famous examples include the Distinct Distances Problem and the Discrepancy Problem. They are known for their difficulty and remain the subject of deep academic study, and some even carry cash prizes for their solution.

On October 18, OpenAI researcher Mark Sellke announced that, after thousands of GPT-5 queries, they had found answers to 10 Erdős problems that had been listed as open, along with significant partial progress on 11 others, and that the official website had been updated accordingly. In one case, the model even surfaced an error in Erdős' original paper, one that was later corrected by the scholars Martínez and Roldán-Pensado.

Other OpenAI researchers quickly amplified the news. Kevin Weil, a vice president at OpenAI, retweeted Sellke's post and confirmed it, saying, "GPT-5 solved 10 (!) previously unsolved Erdős problems and made progress on 11 other problems."

All of these posts have since been deleted.

Their statements made it sound as though GPT-5 had independently produced proofs for hard number-theory problems. That would have been not just a major scientific breakthrough but a sign that generative AI could discover genuinely unknown solutions, driving innovative research and paving the way for further advances. It had not, and the claim quickly collapsed.

Mathematician Thomas Bloom, who maintains the Erdős problems website, pushed back, calling the posts "seriously inconsistent with the facts". He explained, "GPT-5 just found references that already solve these problems, references I personally wasn't aware of before. The 'open' status on the website only means that I personally knew of no paper solving the problem."

Even within OpenAI, the story changed. Bubeck, who had trumpeted GPT-5's achievement, conceded that "(GPT-5) only found solutions already in the literature", though he argued it was still a real accomplishment: "I know how difficult it is to retrieve literature." Hassabis commented simply, "This is so embarrassing."

The "Backlash" Caused by Misleading Statements

The original tweets have largely been deleted, and the researchers involved have acknowledged the mistake.

Nevertheless, the episode reinforced the outside perception of OpenAI as an organization under enormous pressure and prone to acting rashly. It raises an obvious question: why would top AI researchers publish such sensational claims without checking the facts, in a field already awash in hype and billions of dollars of competing interests?

According to media reports, hashtags such as #OpenAIFail trended on social platforms; within days, more than 10,000 tweets voiced disappointment and doubt about OpenAI's supposed "achievements" in mathematics. Indexes tied to OpenAI's valuation also reportedly fell noticeably in pre-market trading after the incident.

Regulators are also tightening their scrutiny. According to media reports, the US Federal Trade Commission (FTC) has opened an inquiry into whether OpenAI's conduct constitutes false advertising, which could bring fines or other penalties. Lawmakers, meanwhile, are calling for greater transparency in AI research. Senator Maria Cantwell said, "We need to ensure that the progress of artificial intelligence is not exaggerated to the public, because it will weaken the public's trust in this technology and its applications."

Separately, after US regulators found that OpenAI had obtained privileged internal access to the FrontierMath benchmark through an undisclosed financial relationship with Epoch AI, they stepped up their scrutiny of the company, raising concerns about fair competition and benchmark transparency. An assistant director at Epoch AI confirmed that OpenAI could access most of the benchmark data, apart from a "reserved" set, and stressed that only an "oral agreement" prevented its use for training, leaving room for potential manipulation. Earlier, at the Davos Forum, the well-known artificial general intelligence (AGI) skeptic Gary Marcus had called OpenAI's public demonstrations "manipulative".

AI's Real Strength in Advancing Mathematics, as Terence Tao Sees It

The misleading publicity has obscured the genuinely valuable finding here: GPT-5 showed real practical value as a research tool for tracking down relevant academic papers. That ability matters most for problems whose literature is scattered or whose terminology is inconsistent.

Terence Tao, the renowned mathematician and professor of mathematics at the University of California, Los Angeles, has repeatedly said in public that AI assistants can change how mathematical research is done.

On October 17, he stressed in a post that the most productive application of AI in mathematics is not to throw the most powerful models at the hardest problems, but to use moderately powerful tools to accelerate and scale up the more ordinary, time-consuming yet still essential research tasks, relying on the experience and understanding humans have accumulated in such tasks to guide and verify the AI's output and integrate it safely into the research workflow. Tao noted that although AI has produced some scattered successes on hard problems, those came only with large investments of computing resources and expert effort.

A typical example of such a routine task is the literature review: finding prior work relevant to a specific problem. If a problem has a recognized name and a mature research community devoted to it, existing web search and literature-retrieval tools suffice to find both older and recent work on it, because the citation network among those papers is dense: a researcher can start from one core paper in the field and follow citations forward and backward to build a fairly complete picture of the current state of research on the problem.
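The forward-and-backward citation traversal described above amounts to a breadth-first search over a citation graph. The sketch below illustrates the idea with a hypothetical toy graph; the paper names and edge data are invented for demonstration:

```python
from collections import deque

def citation_closure(start, cites, cited_by, max_hops=2):
    """Breadth-first traversal of a citation graph, following both
    outgoing references (backward in time) and incoming citations
    (forward in time), up to max_hops steps from the starting paper."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        paper, hops = frontier.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in cites.get(paper, []) + cited_by.get(paper, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, hops + 1))
    return seen

# Toy citation graph: A cites B; C cites A; B cites D.
cites = {"A": ["B"], "C": ["A"], "B": ["D"]}
cited_by = {"B": ["A"], "A": ["C"], "D": ["B"]}
print(sorted(citation_closure("A", cites, cited_by)))  # ['A', 'B', 'C', 'D']
```

In a dense citation network, even a small hop limit recovers most of the relevant literature from a single core paper, which is why this manual strategy works well for well-named, well-studied problems.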

Tao also cited in his post the example of using AI to find literature resolving Erdős problems, and went on to list the benefits of applying AI to literature reviews:

The output of a literature-retrieval tool can be independently verified by a human, which makes it a well-suited application for AI (provided the user has the expertise to do the verification). The advantage is most pronounced when many problems must be searched in sequence rather than a single one. In that setting, the AI's success rate need not reach 100%: it only has to deliver more useful results (and fewer useless ones) than traditional, non-AI retrieval for the same investment of time and effort. Moreover, the up-front time spent learning to use the tools correctly is amortized over repeated use, so this way of working becomes especially attractive when retrieval must be done at scale.
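That cost-benefit argument can be made concrete with a small amortization calculation. All the numbers below are hypothetical placeholders, chosen only to illustrate how a one-time setup cost washes out over many queries:

```python
def useful_results_per_hour(n_queries, hit_rate, minutes_per_query, setup_minutes=0.0):
    """Useful references found per hour of effort, amortizing a one-time
    setup cost over all queries. All inputs are illustrative, not measured."""
    total_minutes = setup_minutes + n_queries * minutes_per_query
    return n_queries * hit_rate / (total_minutes / 60)

# Hypothetical comparison over a batch of 100 search problems:
# manual search is slower per query; AI-assisted search has a learning cost
# up front and a somewhat lower per-query hit rate.
manual = useful_results_per_hour(100, hit_rate=0.30, minutes_per_query=20)
assisted = useful_results_per_hour(100, hit_rate=0.25, minutes_per_query=5,
                                   setup_minutes=120)
print(f"manual: {manual:.2f}/h, AI-assisted: {assisted:.2f}/h")
```

Under these invented numbers, the assisted workflow wins despite its lower hit rate, and its edge grows with batch size, which matches Tao's point that the approach pays off precisely when retrieval is applied at scale.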

When a literature review is done by hand and turns up nothing, that negative result is rarely recorded explicitly (beyond the occasional "to the best of our knowledge, this is the first known progress on this problem"). One likely reason is embarrassment: if someone later finds a paper the review missed, the reviewers look careless. This causes two problems. First, if failed searches for a given problem are never reported, researcher after researcher may waste effort hunting for literature that does not exist. Second, people may believe a problem is still open when in fact no rigorous literature review was ever conducted and a solution already sits in the existing literature.

But when AI-driven tools are used to search a large batch of problems systematically, it becomes natural to report negative results alongside positive ones, for example: "Of the 36 problems searched, 24 (66%) returned new relevant results in our judgment, while 12 (33%) returned only literature we already knew or irrelevant results." This gives a more accurate picture of the existing literature on each problem.
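A batch report of that kind is a straightforward tally over per-problem outcomes. The sketch below mirrors the 36-problem example quoted above, using invented outcome labels (the percentages are truncated to whole numbers, as in the quoted report):

```python
from collections import Counter

# Hypothetical per-problem outcomes from a batch AI-assisted literature
# search, constructed to match the 36-problem example in the text.
outcomes = ["new"] * 24 + ["known_or_irrelevant"] * 12

counts = Counter(outcomes)
total = sum(counts.values())
for label, n in counts.items():
    # integer division truncates, so 24/36 reports as 66%, 12/36 as 33%
    print(f"{label}: {n}/{total} ({100 * n // total}%)")
```

Because both outcome categories are counted and reported, the "no relevant literature found" cases are preserved in the record instead of silently dropped, which is exactly the benefit Tao highlights.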

Earlier, Tao had also suggested that generative AI could "industrialize" mathematical research and accelerate the field's development, while stressing that human professional judgment remains essential for reviewing and triaging AI-generated results and integrating them safely into actual research.

Reference Links:

https://the-decoder.com/leading-openai-researcher-announced-a-gpt-5-math-breakthrough-that-never-happened/

https://mathstodon.xyz/@tao/115385022005130505

This article is from the WeChat official account "AI Frontline", author: Hua Wei. Republished by 36Kr with permission.