
"AI Has Overwhelmed Human Exam Questions," A Conversation Between Terence Tao and Mark Chen: Large Language Models Are No Longer "Math Underachievers"

Friends of 36Kr (36氪的朋友们), 2026-03-12 10:58
In the field of mathematics, one should not be afraid of failure.

Recently, at a seminar hosted by the Institute for Pure and Applied Mathematics (IPAM), Fields Medalist Terence Tao and Mark Chen, Chief Research Officer at OpenAI, sat down for a "fireside chat." The two discussed AI's leap forward over the past year and how it will fundamentally change the way mathematical research is done.

During the conversation, Terence Tao and Mark Chen discussed the key changes in AI capabilities. A year ago, Tao commented that GPT's performance in mathematics was "like a very inefficient graduate student." Today, AI has earned gold-medal results at the IMO, and human-written benchmarks are being rapidly "broken."

It is fair to say that, in mathematics, large models have largely shed the label of "underachiever."

In Terence Tao's view, mathematics is a place where experiments are inexpensive and where trial and error is inexpensive. "If you are an engineer and a bridge collapses, that's an expensive mistake; if you are a surgeon and cut the wrong organ, that's an expensive mistake."

As AI begins to quickly knock off Erdős problems that have long been overlooked, the mathematical community is rethinking the division of research labor, human-machine collaboration, and the education system. When calculation and verification can be outsourced to machines, the shape of mathematical research may be quietly changing.

The following are highlights of the "fireside chat" between Terence Tao and Mark Chen:

Question: What was your evaluation of AI's performance in mathematics a year ago? What changes have occurred in the past year?

Terence Tao: The changes are very significant. AI itself is progressing, but more importantly, it has begun to integrate into our daily research. Deep research and literature search are now far more effective than traditional methods, and code generation is also quite reliable.

Terence Tao, Fields Medalist, Chinese-American mathematician, and professor at the University of California, Los Angeles

As a pure mathematician, I don't rely heavily on AI, but it has changed how I approach problems. For example, if I want to check a conjecture, I'll let AI take a first pass; if there is a lemma I know how to prove but can't be bothered to write out, I'll hand it to AI. But when it comes to attacking the hardest problems, I can't have a deep conversation with it, at least not at this stage.

On a larger scale, the mathematical community has begun to realize that AI has arrived and that our working methods must adjust. The tedious tasks once assigned to graduate students can now be handed to AI, which opens the door to large-scale research projects that were previously unimaginable.

So although bolting AI onto existing workflows is still a bit clumsy, I'm more optimistic about redesigning workflows around AI. Just as after the automobile was invented, cities could no longer plan roads only for riders on horseback. We are in that transition period now.

Mark Chen: I don't blame Terence for saying a year ago that AI was like an inefficient graduate student; that is genuinely what it was. We have an internal metric called "autonomous working time" that measures how long a model can work continuously without falling apart. Last year it was still measured in minutes, and the model would often hallucinate and get confused when given too many tasks.

Mark Chen, Chief Research Officer at OpenAI

But for many people, the past year has been a turning point: errors have decreased, and we can safely let AI work for longer stretches. That lets us shed the heavy "scaffolding"-style assistance of before and really start tackling bigger problems in collaboration with the model.

For example, a year ago AI could probably earn a bronze medal at the IMO; this summer it has been winning gold across high-school mathematics and programming competitions. Human-written benchmarks are nearly exhausted, so people are turning their attention to mathematical research, which is our goal.

OpenAI's goal is not to solve a few Olympiad math problems. The real ambition is to push the scientific frontier forward. Now that tasks can span very long time horizons, we can really start to do this. We haven't fully achieved it yet, but the trend is very clear.

Question: Does solving Erdős problems represent the current capabilities of AI?

Terence Tao: I've been following the Erdős problem set. The problems vary enormously in difficulty. Some have stumped the field for decades; I've published papers on a few myself and made only modest progress. AI really can't help with those hard problems.

But Erdős posed thousands of problems, and many are "long-tail" problems that have long been overlooked and have almost no follow-up research. That is where AI has broken through: roughly two or three dozen such problems have been solved by AI with very little human supervision, and the solutions can usually be verified with other AI tools. It shows we have worked out a workflow that keeps us from being flooded with wrong answers from AI.

This makes me see the possibility of a cultural shift: mathematicians shouldn't focus only on a handful of extremely hard problems; they should start publishing lists of problems they genuinely want answered. List a hundred problems, say; maybe AI can solve 10% of them, and a high-school student another 5%. That is a community-driven way to advance mathematical research.

Question: Will mathematics become a large-team collaboration like biology?

Mark Chen: The trend is very clear. In other scientific fields, the number of co-authors per paper has been growing exponentially over time; mathematics and theoretical physics have been the exceptions. But now we are seeing change. Projects like "First Proof" and the Erdős problems work precisely by interacting deeply with the community to identify the problems really worth tackling.

We've also made similar attempts in physics. We invited top physicists to draw up a list of important problems AI can handle, which in turn helps us improve the model. What we want to build is a platform that enables scientists worldwide to accelerate their research and that empowers the entire mathematical community.

We can already see young people in their early 20s using models to solve problems independently. It's not a major breakthrough yet, but it's enough to change the entire research ecosystem.

Question: Can AI enable a division of labor in mathematical research?

Terence Tao: This is precisely where AI has the greatest potential. Traditionally, a mathematician has to cover everything: posing problems, devising strategies, choosing among them, implementing them, verifying results, and writing papers. We train everyone to be passable at each step, specializing at most by field. We can't really divide labor the way industry does, where some people specialize in the technology and others in project management.

Now, with AI and formal verification tools, it becomes possible to run mathematical projects like modern industry, with each person specializing in one aspect; if no one on the team can cover a given aspect, AI takes it over. Of course, AI's capabilities are currently uneven, and full automation isn't possible. For instance, if you have AI generate strategies in batches but verification can't keep up, you'll receive hundreds or thousands of strategies and be unable to process them. The day verification catches up, a brand-new and extremely efficient way of doing mathematics will emerge.
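
To make the verification half of this division of labor concrete, here is a minimal Lean 4 sketch (my illustration, not an example from the conversation; the names my_add_comm and pending_lemma are hypothetical): the statement of a lemma is the contract between collaborators, and the proof checker accepts or rejects a proof regardless of whether a human or an AI wrote it.

```lean
-- The statement is the contract between collaborators; Lean's kernel
-- checks the proof itself, so it does not matter whether a human or
-- an AI supplied it. (Hypothetical names, for illustration only.)
theorem my_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Large formalization projects divide labor by leaving holes: anyone
-- (or any AI) can later replace the `sorry` with a real proof, and
-- the kernel will confirm whether the replacement is valid.
theorem pending_lemma (n : Nat) : n * 1 = n := by
  sorry
```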

Mark Chen: Let me add something. AI's capabilities are indeed uneven, so human-machine collaboration is very effective. Interestingly, AI is closer to human in some respects than you might think: you need a lot of reinforcement learning training just to keep it from giving up as readily as humans do.

For example, given a very hard problem, the model will run a few tests and think: "This is too difficult. I can't do it. I'll just pretend to be working hard." We saw this with the Erdős problems: when we asked AI to solve one, the first thing it did was look the problem up on the website. On finding that it was an open problem, it gave up outright. You have to tell it: don't go online, solve it yourself, it's actually not that difficult.

Question: In the future, will it be a collaboration between humans and many AI agents, or will AI take the lead?

Terence Tao: I think it's both and neither. The kind of mathematics we do now may gradually develop in that direction, but at the same time, brand-new forms of mathematics that are currently unimaginable will also emerge. Mathematics is infinite, and there is no upper limit to its difficulty; some problems are even unsolvable. AI can't mine all the bitcoins; there will always be a frontier. The capabilities of humans and of today's large language models are precisely complementary. I believe the best combination will always be some intricate "human + machine" pairing, though the nature of that pairing will change over time.

Question: To reach a higher level of intelligence, is it about computing power or algorithms?

Mark Chen: Both are indispensable. OpenAI's overall research approach is essentially about improving algorithms so that they can scale to the compute we will have next year and the year after. The algorithms we know are very basic and can be scaled, but a great deal of engineering and fine-tuning is needed to make them truly fit the next level.

The good news is that this is a multi-dimensional problem. We can increase model size, building a larger "brain" that stores more knowledge; the broader and deeper the knowledge, the easier it is to form connections and make leaps. We can expand the reasoning dimension, letting the model connect knowledge to create new insights. And we can let the model generate new knowledge for itself, amplifying its capabilities in specific fields. All these dimensions work together to push the model toward more autonomous, longer-horizon tasks.

Question: Does the "First Proof" project represent the future form of mathematics?

Terence Tao: It will be part of the future mathematical landscape. "First Proof" is a very interesting experiment. The proofs AI generates are of decent quality, but we have indeed hit an obvious "verification bottleneck." We have generated a great many proofs: some are very bad, some are passable, and some resemble what is already in the literature. But there is currently no effective way to carefully assess how novel and interesting each proof is.

To make good use of AI's new capabilities, we need to design challenges that are easy to verify. To a first approximation, how much AI you can use, and how much automation you can achieve, is proportional to the strength of your verification ability. So progress will come first in fields that are easy to formalize, such as combinatorics, or in numerical problems where an answer, once found, is easy to check.
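
As a toy illustration of the proportionality Tao describes (a sketch of my own, using an arbitrary stand-in property, not an example from the talk): candidates are expensive and unreliable to generate but nearly free to check, so the throughput of the whole pipeline is set by the verifier.

```python
import math
import random

def verify(n: int) -> bool:
    # Cheap, exact check: is n a perfect square congruent to 1 mod 24?
    # (A stand-in for any "easy to verify once found" numerical claim.)
    r = math.isqrt(n)
    return r * r == n and n % 24 == 1

def generate_candidates(k: int) -> list[int]:
    # Stand-in for an expensive, unreliable generator (e.g. a language
    # model proposing answers): most of its proposals are wrong.
    return [random.randrange(1, 10_000) for _ in range(k)]

# The workflow scales only because each check costs microseconds:
# we can afford thousands of noisy proposals per surviving answer.
accepted = [n for n in generate_candidates(10_000) if verify(n)]
print(f"{len(accepted)} of 10000 candidates survived verification")
```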

But other areas of mathematics are different. When you are looking for a brand-new theory, a new conjecture, or a new proof strategy, these things are much harder to verify. If AI generates a hundred strategies, in the end only human experts can evaluate them. That is a bottleneck.

Question: What problems arise if goals are set inappropriately?

Terence Tao: This is a very delicate issue. AI is almost too good at executing goals exactly as stated. If you ask it to solve a problem and want a proof of a theorem, maybe one day AI will simply hand you the answer. But what you really want is the process of human effort: trying, failing, finding counterexamples, checking the literature, communicating intermediate results. That is the real value in solving a problem. Define the goal too narrowly and you may lose all of it. So we must set goals more carefully and preserve the serendipity and room for exploration in the research process.

Mark Chen: This reminds me of an interesting thought experiment. You could train a model whose knowledge stops at a specific point in time, then imagine it attempting a "First Proof" as of that date. With hindsight, we know which techniques turned out to be worth pursuing, so we could measure the model's level of creativity. To get the maximum signal, which day would you choose as the knowledge cutoff? It's a question worth pondering.

Question: Should the verification ability in mathematics be replicated in other scientific fields?

Terence Tao: I firmly believe there is an upper limit to how much AI you can use in any workflow. Exceed it and AI becomes a net loss: the errors it introduces outnumber the problems it solves. And that limit depends largely on your verification ability.

Mathematics can reach the highest level of automation because our verification standards are very strict, at least when it comes to proving specific statements. But verification itself has weaknesses: natural language can be gamed. An AI may appear to perform well on the surface, diligently proving theorems, while quietly adding a few extra axioms to the formal system. You can try to stop it, but if the AI is too capable, you may have to limit what it is allowed to do.
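
Tao's "extra axioms" worry can be made concrete in Lean 4. A minimal sketch (mine, not from the talk): once an unjustified axiom enters the environment, every statement becomes provable, which is why auditing a proof also means auditing the axioms it depends on; Lean exposes these with `#print axioms`.

```lean
-- If a prover is allowed to extend the formal system, everything
-- becomes "provable". The theorem below is false, yet it checks,
-- because an axiom asserting a contradiction was smuggled in.
axiom cheat : False

theorem one_eq_two : 1 = 2 := cheat.elim

-- The defense: inspect what a proof actually depends on.
#print axioms one_eq_two   -- 'one_eq_two' depends on axioms: [cheat]
```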

Other scientific fields also have methods that can serve as verifiers. Numerical simulation, for example, can act as a verifier in some cases, but you cannot rely on it completely. If you train an AI to imitate a numerical simulation in order to predict the weather, it may seize on spurious features of the simulation: it works well at first but eventually fails. We need a clearer understanding of a verifier's limitations.
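
The weather example can be sketched in a few lines of NumPy (a toy of my own construction, not from the talk): an emulator trained to imitate a simulator latches onto a diagnostic channel that leaks the answer inside the simulation but carries no information in the real world.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# "Simulator" world: channel x2 accidentally leaks the output y_sim.
x1 = rng.normal(size=n)                       # the real physical driver
y_sim = x1 + rng.normal(scale=0.3, size=n)    # simulator output
x2 = y_sim + rng.normal(scale=0.01, size=n)   # spurious leaked feature

# Fit a linear emulator on simulator data (ordinary least squares).
X = np.column_stack([x1, x2])
w, *_ = np.linalg.lstsq(X, y_sim, rcond=None)
print("learned weights [x1, x2]:", w.round(3))  # almost all weight on x2
print("simulator MSE:", round(float(np.mean((X @ w - y_sim) ** 2)), 4))

# Real world: x2 is just noise, uncorrelated with the target.
x1_r = rng.normal(size=n)
y_r = x1_r + rng.normal(scale=0.3, size=n)
x2_r = rng.normal(size=n)
mse = float(np.mean((np.column_stack([x1_r, x2_r]) @ w - y_r) ** 2))
print("real-world MSE:", round(mse, 3))         # orders of magnitude worse
```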

Many verification systems work fine in normal use, but if you specifically train an AI to exploit a verifier and maximize its score, it will certainly find the loopholes. AI is an extremely shrewd cheater.

Question: Why does OpenAI attach so much importance to mathematics and physics?

Mark Chen: The fundamental reason is that we have used up the good evaluations written by humans. Science itself is now the best evaluation.

Mathematics is especially exciting because you can attack a theorem and, in most cases, verify whether it is correct and be confident that you are genuinely pushing the frontier. We have made similar attempts in physics. Although some problems (such as a certain constant being too small) sound a bit vague, you can still build a fairly formal system around them. This lets us genuinely push the frontier in mathematics and physics.

But there is a deeper reason for our focus on reasoning in natural language: we care about generalization. We hope to extend reasoning ability to fields like biology and make breakthroughs there. In mathematics, a breakthrough is unambiguous: if you solve the Navier-Stokes equations, that is a major breakthrough. Natural language is a good medium for expressing this ability, and to some extent it also keeps us from merely spinning inside the known toolbox of techniques.

Question: What are the advantages of using mathematics as an AI testbed?

Terence Tao: In the words of the great mathematician Vladimir Igorevich Arnold, "Mathematics is a place where experiments are inexpensive." It is also a place where failure is inexpensive.

If you are an engineer and a bridge collapses, that's an expensive mistake; if you are a surgeon and cut the wrong organ, that's an expensive mistake. But in mathematics, if you try to prove a theorem and the proof strategy doesn't work, that's a cheap mistake. Our culture of learning from mistakes is much stronger than in other disciplines. For AI experiments, mathematics is a far safer place than bridge construction or heart surgery.

Mark Chen: This is exactly how we think at OpenAI. Our ultimate goal in developing AI is to use it to develop more powerful AI, forming a "flywheel." But building a stronger model is very expensive: when a system goes wrong, compute is on the line, and a failed experiment burns a great deal of money and computing power. So mathematics and physics really are safe fields in which to push the frontier.

Question: Will mathematics teaching change because of AI?