StartseiteArtikel

"AI Has Conquered Human Exam Questions," Conversation Between Terence Tao and Mark Chen: Large Language Models Are No Longer "Poor at Math"

36氪的朋友们2026-03-12 10:58
In the field of mathematics, one should not be afraid of failure.

Recently, at a seminar hosted by the Institute for Pure and Applied Mathematics (IPAM), Fields Medalist Terence Tao and Mark Chen, the Chief Researcher at OpenAI, participated in a "fireside chat." The two discussed the leapfrog progress of AI in the past year and how it will fundamentally change the way mathematical research is conducted.

During the conversation, Terence Tao and Mark Chen discussed the key changes in AI capabilities. A year ago, Terence Tao commented that GPT's performance in mathematics was "like a very inefficient graduate student." Today, AI has won gold - level results in the IMO competition, and the benchmark tests written by humans are being rapidly "topped."

It can be said that in the field of mathematics, large - scale models have basically shed the label of "poor student."

In Terence Tao's view, mathematics is a place where experiments are inexpensive and where trial - and - error is inexpensive. "If you're an engineer and a bridge collapses, that's an expensive mistake. If you're a surgeon and you cut the wrong organ, that's an expensive mistake."

As AI begins to quickly solve some long - neglected Erdős problems, the mathematical community is also rethinking the division of research labor, human - machine collaboration, and changes in the education system. When calculation and verification can be outsourced to machines, the form of mathematical research may be quietly changing.

The following is the essence of the "fireside chat" between Terence Tao and Mark Chen:

Question: What was your evaluation of AI's performance in mathematics a year ago? What changes have occurred in the past year?

Terence Tao: The changes are very significant. AI itself is progressing, but more importantly, it has begun to integrate into our daily research. The current in - depth research and literature search are far more effective than traditional methods, and code generation is also quite reliable.

Terence Tao, a Fields Medalist, a Chinese - American mathematician, and a professor at the University of California, Los Angeles

As a pure mathematician, I don't rely heavily on AI, but it has indeed changed my problem - solving habits. For example, if I want to verify a certain conjecture, I'll let AI have a try first. For "lemmas" that I know how to prove but am too lazy to write down, I'll just hand them over to it. However, when it comes to tackling the most difficult problems, I can't have in - depth conversations with it, at least not at this stage.

On a broader level, the mathematical community has begun to realize that AI has become a reality, and we must adjust our working methods. Those tedious tasks that we used to assign to graduate students can now be given to AI, which opens up possibilities for large - scale research projects that were previously unthinkable.

Therefore, although using AI to assist the existing processes is still a bit awkward, I'm more optimistic about redesigning a set of working processes for AI. Just like after the invention of the automobile, cities couldn't only plan roads for horse - riders. We're now in this transitional period.

Mark Chen: I don't blame Terence for saying a year ago that AI was like an inefficient graduate student. It was indeed the case at that time. We have an internal indicator called "autonomous working time" to measure how long a model can work continuously without crashing. Last year, it was still at the "minute - level." It often had hallucinations and got confused when faced with multiple tasks.

Mark Chen, the Chief Researcher at OpenAI

But for many people, the past year has been a turning point: the errors have decreased, and we can trust AI to work for longer periods. This allows us to get rid of the large amount of "scaffolding" - style assistance in the past and start to tackle bigger problems and form a collaboration with the model.

For example, a year ago, AI could probably get a bronze medal in the IMO. This summer, in all high - school mathematics and programming competitions, it has been winning gold medals. The benchmark tests written by humans are almost exhausted. So people are starting to turn their attention to the field of mathematical research, which is our goal.

OpenAI's goal isn't just to solve a few Olympiad math problems. Our real ambition is to push the scientific frontier forward. Now, the time span of tasks has become very long, and we can really start to do this. Although we haven't fully achieved it yet, the trend is very obvious.

Question: Does the solution of Erdős problems represent the current capabilities of AI?

Terence Tao: I've been following the set of Erdős problems. These problems vary greatly in difficulty. Some have puzzled the academic community for decades. I've also published papers on them and only made a little progress. AI really can't help with these difficult problems.

But Erdős proposed thousands of problems, and many of them are "long - tail problems" that have been neglected for a long time with almost no follow - up research. This is where AI has made breakthroughs: about twenty to thirty such problems have been solved by AI with very little human supervision, and the solutions can usually be verified by other AI tools. This shows that we've figured out a working process and won't be overwhelmed by AI's wrong answers.

This makes me see the possibility of a cultural shift: mathematicians shouldn't just focus on a few extremely difficult problems. Instead, they should start publishing lists of problems they really want answers to. For example, listing a hundred problems, maybe AI can solve 10% of them, and a high - school student can solve another 5%. Promote mathematical research in this community - driven way.

Question: Will mathematics become a large - team collaboration like biology?

Mark Chen: The trend is very obvious. In other scientific fields, the number of paper co - authors has been increasing exponentially over time, with mathematics and theoretical physics being exceptions. But now we're seeing changes. Projects like "First Proof" and the Erdős problems are finding truly worthy problems to tackle through in - depth interaction with the community.

We've also made similar attempts in physics. We invited top physicists to formulate a list of important problems that AI can handle, which in turn helps us improve the model. What we want to do is build a platform to enable global scientists to accelerate their research and empower the entire mathematical community.

Now we can already see young people in their early twenties using models to solve problems independently. Although it's not a major breakthrough yet, it's enough to change the entire research ecosystem.

Question: Can AI achieve the division of labor in mathematical research?

Terence Tao: This is exactly where AI has the most potential. Traditionally, mathematicians have to handle all aspects: posing problems, devising strategies, selecting strategies, implementing strategies, verifying results, and writing papers. We train everyone to be okay at each step, at most specializing by field. But we can't really divide the labor like in industry, where some people can specialize in technology and some in project management.

Now, with AI and formal verification tools, it's possible to make mathematical projects run like modern industry: each person specializes in one aspect. If no one can handle a certain aspect in the cooperation, let AI step in. Of course, currently, AI's capabilities are uneven, and full automation isn't possible. For example, if you let AI generate strategies in batches but the verification can't keep up, you'll receive hundreds or thousands of strategies but won't be able to handle them. When the verification ability catches up one day, a brand - new and extremely efficient way of doing mathematics will emerge.

Mark Chen: I'd like to add something. AI's capabilities are indeed uneven, so human - machine collaboration is very effective. Interestingly, AI is closer to humans in some aspects than you might think. You have to use a lot of reinforcement learning training to prevent it from giving up as easily as a human.

For example, when given a very difficult problem, after running a few tests, the model will think: "It's too difficult. I can't do it. I'll just pretend to work hard." We've seen this with the Erdős problems: when we asked AI to solve a problem, its first thing was to check the website. When it found it was an open problem, it directly gave up. You have to tell it: "Don't go online. Solve it by yourself. It's actually not that difficult."

Question: In the future, will it be a collaboration between humans and many AI agents, or will AI take the lead?

Terence Tao: It's both and neither. The kind of mathematics we're doing now may gradually develop in that direction, but at the same time, completely new forms of mathematics that are unimaginable now will also emerge. Mathematics is infinite, and there's no upper limit to its difficulty. Some problems are even unsolvable - AI can't mine all the bitcoins. There will always be a frontier. The capabilities of humans and current large - language models are exactly complementary. I believe the best combination will always be a complex "human + machine" combination, but the nature of this combination will change over time.

Question: To achieve higher intelligence, is it about computing power or algorithms?

Mark Chen: Both are indispensable. OpenAI's overall research idea is essentially about how to improve algorithms so that they can scale to the computing power we'll have in the next year or two. The algorithms we know are very basic and scalable, but they require a lot of engineering and fine - tuning to ensure they can really adapt to the next level.

The good news is that this is a multi - dimensional problem. We can increase the model size, build a bigger "brain," and store more knowledge. The more extensive and in - depth the knowledge is, the easier it is to establish connections and make leaps. We can also expand the reasoning dimension to let the model connect knowledge and create new insights. We can also let the model generate new knowledge for itself and amplify its capabilities in specific fields. All these dimensions working together will push the model towards more autonomous and longer - cycle tasks.

Question: Does the "First Proof" project represent the future form of mathematics?

Terence Tao: It will be a part of the future mathematical landscape. The "First Proof" is a very interesting experiment. The various proofs generated by AI are of good quality, but we've also clearly seen a "verification bottleneck." We've generated many proofs, some are very bad, some are okay, and some are similar to those in the literature. But there's currently no effective way to carefully evaluate how novel and interesting each proof is.

To make good use of AI's new capabilities, we need to design challenges that are easy to verify. To some extent, how much AI you can use and how much automation you can achieve depend on how strong your verification ability is. The two are directly proportional. So progress will first appear in fields that are easy to formalize, such as combinatorics, or numerical problems where the answer can be easily verified once found.

But it's different in other areas of mathematics. For example, when looking for a brand - new theory, a new conjecture, or a new problem - solving strategy, these things are much more difficult to verify. If AI generates a hundred strategies, in the end, it can only be evaluated by human experts. This is a bottleneck.

Question: What problems will improper goal - setting cause?

Terence Tao: This is a very delicate issue. AI is almost too good at executing goals exactly as set. If you ask it to solve a problem and want a proof of a theorem, maybe one day in the future, AI can directly give you an answer. But what you really want is the process of people's efforts: trying, failing, finding counter - examples, checking the literature, and sharing intermediate results. These are the real values in solving a problem. If you define the goal too narrowly, you're likely to lose all these values. So we must be more careful when setting goals and preserve the contingency and exploration space in the research process.

Mark Chen: This reminds me of an interesting thought experiment: you can train a model to only master the knowledge up to a specific point in time, and then imagine what a "First Proof" would look like at that time. Now that we have hindsight, we know which technologies are worth pursuing and what the model's creativity level is. To get the maximum signal, which day would you choose as the knowledge cut - off date? This question is worth pondering.

Question: Should the verification ability in mathematics be replicated in other scientific fields?

Terence Tao: I firmly believe that there's an upper limit to how much AI you can use in a workflow. Exceeding this limit will result in a net loss, meaning that the errors brought in are more than the problems solved. And this upper limit largely depends on the verification ability.

Mathematics is most capable of achieving a high level of automation because our verification standards are very strict, at least when proving specific problems. But verification itself also has weaknesses: natural language can be maliciously exploited. An AI may seem to perform well on the surface, diligently proving problems, but secretly add a few more axioms to the formal system. You can try to stop it, but if the AI is too powerful, you may have to limit its capabilities.

In other scientific fields, some methods can also be used for verification. For example, numerical simulation can be used as a verifier in some cases. But you can't completely rely on it. For example, if you train AI to imitate numerical simulation to predict the weather, it may seize some non - real features in the simulation. It may work well at first, but it will eventually fail. We need to have a clearer understanding of the limitations of verifiers.

Many verification systems work fine in normal use, but if you specifically train an AI to exploit the verifier to maximize the output, it will definitely find loopholes. AI is an extremely shrewd cheater.

Question: Why does OpenAI attach so much importance to mathematics and physics?

Mark Chen: The fundamental reason is that we've used up all the good evaluation criteria written by humans. And science itself is now the best evaluation criterion.

Mathematics is especially exciting because you can tackle a theorem, and in most cases, you can verify whether it's correct and be confident that you're really pushing the frontier. Similar attempts have also been made in physics. Although some problems (such as a certain constant being too small) may sound a bit vague, you can still build a fairly formal system. This allows us to really push the frontier in the fields of mathematics and physics.

But there's a deeper reason why we're so concerned about reasoning in natural language: we care about generalization ability. We hope to extend the reasoning ability to fields like biology and make breakthroughs there. In mathematics, a breakthrough is very clear: for example, if you solve the Navier - Stokes equations, that's a major breakthrough. Natural language is a good way to express this ability and can also help us avoid falling into the trap of "only spinning in the known technology toolbox" to some extent.

Question: What are the advantages of using mathematics as a test - bed for AI?

Terence Tao: In the words of the great mathematician Vladimir Igorevich Arnold, "Mathematics is a place where experiments are inexpensive." It's also a place where failure is inexpensive.

If you're an engineer and a bridge collapses, that's an expensive mistake. If you're a surgeon and you cut the wrong organ, that's an expensive mistake. But in mathematics, if you try to prove a theorem and the proof strategy doesn't work, it's not an expensive mistake. We have a much stronger culture of learning from mistakes than other disciplines. For AI experiments, mathematics is a much safer place than bridge building or heart surgery.

Mark Chen: This is exactly how we think at OpenAI. Our ultimate goal in developing AI is to use it to develop more powerful AI, creating a "flywheel." But building a stronger model is a very expensive thing. Once the system goes wrong, the computing resources are at risk, and wrong experiments will burn a large amount of money and computing power. So mathematics and physics are indeed safe fields for pushing the frontier.

Question: Will mathematics teaching change as a result?

Terence Tao: Change is inevitable. Weekly homework has become the first casualty, as students can completely use AI to complete it. But from another perspective, now