2 million views overnight, in sync with OpenAI: this evaluation framework made all of the world's top LLMs fail.
A paper led by a Chinese team has gone viral on platforms overseas, racking up 2 million reads overnight! The team behind it, founded by an MIT PhD who returned to China to start a company, brought together 24 top global institutions to give a strong push to the question of how AI can assist scientific discovery.
Recently, a paper led by a Chinese team and co-published with 24 top global universities and institutions, built to evaluate the capabilities of LLMs for Science, has caused a stir on platforms overseas.
That night, François Chollet, creator of Keras (one of the most efficient and user-friendly deep learning frameworks), reposted the paper link and exclaimed: "We urgently need new ideas to drive artificial intelligence toward scientific innovation."
After AI influencer Alex Prompter shared the paper's core abstract, Mark Cuban, owner of the NBA's Dallas Mavericks, reposted it, and Silicon Valley investors, European family offices, and sports media flooded the comment section.
Overnight, the cumulative readership approached 2 million.
Notably, OpenAI released an overview of its own paper, "FrontierScience: Evaluating AI's Ability to Perform Scientific Research Tasks," at the same time, pointing out that existing evaluation criteria fail in the field of AI for Science.
With a move so synchronized with OpenAI's, and such widespread discussion overseas, what kind of work has stirred up the global AI conversation?
How far is AI from being able to assist in scientific discovery?
Some time ago, the United States launched the "Genesis Project," claiming to mobilize "the largest-scale federal scientific research resources since the Apollo Project," with the goal of doubling the productivity and influence of US scientific research within a decade.
Yet just as an AI valuation bubble looms and the field's energy-to-output ratio draws intense scrutiny, a contradiction has emerged. On one side is a capital frenzy; on the other, AI capabilities remain stuck in superficial applications such as text-to-image generation. Large language models routinely top question-bank-style benchmarks like GPQA and MMMU, yet they still cannot accurately analyze even simple nuclear magnetic resonance (NMR) spectra.
People cannot help but ask: does scoring high on question banks mean a model can assist in scientific discovery? How far are current models from making scientific discoveries? What kind of AI model can be trusted to expand the boundaries of human existence? These debates have only intensified amid the escalating AI competition between China and the United States.
Against this backdrop, the paper "Evaluating LLMs in Scientific Discovery", led by the Chinese AI for Science startup "Deep Principle" together with 24 global research institutions including MIT, Harvard, Princeton, Stanford, Cambridge, and Oxford, offers a formal answer to this question of the era.
The paper introduces SDE (Scientific Discovery Evaluation), the first evaluation system for LLMs for Science. It comprehensively assesses the scientific research and discovery capabilities of mainstream large language models such as GPT-5, Claude-4.5, DeepSeek-R1, and Grok-4 across biology, chemistry, materials, and physics, spanning everything from individual scientific questions to full research projects.
Unlike previous evaluation systems, SDE shifts the assessment of model capabilities from simple question-and-answer formats to concrete "hypothesis → experiment → analysis" experimental scenarios.
The study found that GPT-5, Claude-4.5, DeepSeek-R1, and Grok-4 average only 50-70% accuracy, far below their 80-90% scores on question banks like GPQA and MMMU. On the 86 difficult "SDE-Hard" questions, the best score is under 12%. Together, these results expose shortcomings in multi-step reasoning, uncertainty quantification, and closing the experiment-theory loop.
More alarming still, gains from larger model scale and stronger reasoning show clearly diminishing marginal returns.
Compared with its previous generation, GPT-5 significantly increased its parameter scale and reasoning compute, yet across the four scientific fields of the SDE benchmark its average accuracy rose by only 3-5%, and in some scenarios (such as NMR structure analysis) performance even declined.
In other words, when it comes to advancing scientific discovery, today's large language models do not yet match even an ordinary undergraduate student.
Who is the team that led 24 top research institutions to publish this paper?
The corresponding author of the paper "Evaluating LLMs in Scientific Discovery" is Duan Chenru, the founder and CTO of "Deep Principle".
As early as 2021, while pursuing a PhD in chemistry at MIT, he helped found the AI for Science community with the support of Turing Award winner Yoshua Bengio, and organized an AI for Science workshop at NeurIPS.
In early 2024, he returned to China with Jia Haojun, an MIT PhD in physical chemistry, to co-found "Deep Principle", with Jia Haojun as CEO and Duan Chenru as CTO. Though both were born after 1995, they are already well known in the global AI for Science startup scene.
In the year and a half since its founding, the company has received investment from well-known institutions such as Linear Capital, Hillhouse Ventures, and Ant Group, and has established strategic partnerships with prominent AI for Science companies such as Jingtai Technology and Shenshi Technology.
From its inception, "Deep Principle" has carried the expectations of the world's leading AI for Science researchers. Today the company works on the front lines of global materials R&D, combining generative AI with quantum chemistry to advance a new era of materials discovery.
Over the past year, the team has continuously published significant results in top venues such as the Nature family of journals and JACS, demonstrating both its technical leadership and the open, communicative mindset of a startup founded by people born after 1995.
From exploring diffusion models for chemical reaction generation, showing that "it's not just about generating materials but also their synthesis paths"; to directly comparing machine learning potentials (MLPs) with diffusion models, showing that traditional MLPs are not omnipotent; to now organizing top scholars and universities to launch SDE, showing that traditional question-and-answer benchmarks cannot lead us to scientific superintelligence, the team has repeatedly targeted the core tensions in AI for Science.
At the same time, every AI4S company faces a daily test in real-world commerce: can AI truly solve new-product R&D problems and meet customer expectations?
As commercial collaborations with industry-leading customers have taken hold, "Deep Principle"'s database has accumulated a large volume of real-world industrial R&D scenario data and model application experience, drawn from customers and its own laboratory.
Deep academic research combined with frontline experience in commercializing AI for Science meant that when "Deep Principle" proposed building a new yardstick for evaluating LLMs for Science, the response was enthusiastic: more than 50 scientists from 23 of the world's top scientific discovery institutions formed a "dream team" to define SDE.
Among them are many well-known scholars active in the LLM field, such as:
- Huan Sun, initiator of MMMU and professor at The Ohio State University
- Yuanqi Du, a Cornell PhD and the "operations manager" of the AI4Science community
- Mengdi Wang, the youngest professor at Princeton and a pioneer in AI + biosafety
- Philippe Schwaller, father of IBM RXN and professor at EPFL
The scientific discovery scenarios that "Deep Principle" had accumulated early on became the starting point for what later grew into the SDE evaluation system.
After nearly nine months of cross-university, cross-disciplinary, and cross-time-zone collaboration, "Evaluating LLMs in Scientific Discovery" was officially published, with the corresponding institution clearly listed as: Deep Principle, Hangzhou, China.
With that, the Chinese startup "Deep Principle", carrying the collective wisdom of the world's top scientific discovery institutions, and OpenAI across the ocean now stand at the same starting line in the climb toward AI for Science, humanity's ascent to ultimate AGI.
Perhaps, when humanity looks back on the AGI era thousands of years from now, it will see that at the close of the first quarter of the 21st century, this serious discussion of AI for Science, echoed by Chinese and American teams alike, nudged LLMs out of their leaderboard arms race and a step closer to the vast ocean of real scientific discovery.
The research by "Deep Principle" and its more than 50 collaborators from over 20 institutions shows that the current development path of LLMs cannot "incidentally conquer" scientific discovery.
This path towards scientific super - intelligence requires more like - minded people to walk side by side.
This article is from the WeChat official account "New Intelligence Yuan", author: New Intelligence Yuan, editor: Aeneas, published by 36Kr with authorization.