Google has just released a 71-page AI research paper: its system surpasses experts across six major fields, achieving in a few hours what once took months.
Google's latest 71-page paper has stunned the research community: AI can not only write code but also propose new methods and run experiments like a scientist, even outperforming experts across six major fields. Exploration that once took months can now be completed in just a few hours. The pace of scientific research is being rewritten by AI.
In its latest 71-page paper, Google has dropped a bombshell on the research community.
Over the past year, DeepMind's FunSearch has demonstrated AI's potential in mathematical discovery, and teams such as MIT's have proposed the concept of an AI co-scientist.
Compared with those explorations, however, Google's new system goes further: it can not only propose new methods and verify them experimentally, but also outperform top experts across multiple fields.
Paper link: https://arxiv.org/abs/2509.06503
Unlike traditional code, which pursues only correctness, the sole goal of empirical software is to maximize the metric score of a research task.
This means AI has begun to take part in the most central aspects of research: hypothesis verification and methodological innovation.
Not just writing code: "empirical software" for research
In research, the most time-consuming part is not coming up with ideas but verifying them.
Scientists often have to write and debug large amounts of experimental code for a single problem, trying dozens or even hundreds of model and parameter combinations; the process can take months.
Google's new system dramatically accelerates this process. The team proposes a concept: empirical software.
Unlike regular software, which is usually judged only on functional correctness, the primary goal of empirical software is to maximize a preset quality score.
In other words, a research problem is re-abstracted as a scorable task.
The task consists of a clear problem description, a metric that measures quality, and a dataset. The AI's job is to keep optimizing toward the highest score.
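This task abstraction can be pictured as a small container holding the three pieces; a minimal sketch (the name `ScorableTask` and its fields are hypothetical, not from the paper):

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ScorableTask:
    """A research problem recast as a scorable task: a description,
    a dataset, and a metric that candidate solutions must maximize."""
    description: str
    dataset: Any
    metric: Callable[[Any, Any], float]  # (predictions, dataset) -> score

    def score(self, solution: Callable[[Any], Any]) -> float:
        # Run the candidate solution on the dataset and grade its output.
        return self.metric(solution(self.dataset), self.dataset)

# Toy example: the metric is accuracy against the dataset itself.
task = ScorableTask(
    description="toy identity task",
    dataset=[1, 2, 3],
    metric=lambda preds, data: sum(p == d for p, d in zip(preds, data)) / len(data),
)
```

Anything the system produces can then be compared on a single number, which is what makes automated search over solutions possible.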
Under this mechanism, the AI is no longer just a code-writing assistant; it acts more like a high-speed experimenter.
It first generates research ideas and writes executable code, runs the code in a sandbox environment, uses tree search to screen out candidate solutions worth exploring in depth, and then has the large language model repeatedly rewrite and optimize the code.
This process is repeated until the optimal solution is found.
The workflow of the AI research system: a research problem is converted into a scorable task, the large language model generates code, and tree search iteratively optimizes it until the best solution is found.
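The generate-score-rewrite loop described above can be sketched as a simple best-first search. This is an illustrative toy, not the paper's implementation: candidate "programs" are just numbers, and `score`/`rewrite` stand in for the sandbox run and the LLM rewrite step.

```python
import random

def tree_search(initial, score, rewrite, rounds=5, branch=4, keep=2):
    """Best-first loop: rewrite the most promising candidates, score the
    children in the 'sandbox', and keep only the top scorers per round."""
    frontier = [initial]
    best = (score(initial), initial)
    for _ in range(rounds):
        children = [rewrite(c) for c in frontier for _ in range(branch)]
        scored = sorted(((score(c), c) for c in children), reverse=True)
        if scored and scored[0] > best:
            best = scored[0]
        frontier = [c for _, c in scored[:keep]]
    return best

# Toy demo: the metric rewards closeness to 10, and "rewriting"
# is a small random perturbation of the candidate.
random.seed(0)
best_score, best_candidate = tree_search(
    initial=0.0,
    score=lambda x: -abs(x - 10),
    rewrite=lambda x: x + random.uniform(-1.0, 2.0),
)
```

In the real system the expensive steps are the LLM rewrite and the sandboxed evaluation; the search scaffolding itself stays this simple.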
The researchers also emphasized:
Its output is a coded solution that is verifiable, interpretable, and reproducible.
In other words, this is not merely a program; it is a genuine result that meets research standards.
Hard-core results across six fields
What is truly striking about Google's system is that it achieved results on par with experts in six entirely different scientific fields.
Genomics: 14% better than experts
On the problem of batch integration for single-cell RNA sequencing (scRNA-seq) data, Google's system demonstrated genuine research innovation.
The difficulty of this task is that different experimental batches carry complex technical biases; removing those biases while preserving real biological signal has long been the field's core challenge.
The researchers did not have the system start from scratch; instead, they fed it text descriptions of existing methods.
For example, BBKNN is a common batch-correction method. Its core idea: find each cell's nearest neighbors within every batch, then merge these neighbor sets into a batch-corrected overall graph.
An example method description for BBKNN. The researchers feed it to the system, and the AI rewrites and optimizes on that basis.
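The BBKNN idea as described — per-batch nearest neighbors merged into one graph — fits in a few lines. This is a toy illustration of the described idea, not the actual BBKNN implementation (which operates on dimensionality-reduced expression matrices at scale):

```python
import math

def batch_balanced_neighbors(cells, batches, k=3):
    """For each cell, take its k nearest neighbors *within every batch*
    separately, then union the per-batch sets into one corrected graph.
    `cells` are coordinate tuples; `batches` gives each cell's batch label."""
    graph = {}
    for i, ci in enumerate(cells):
        neighbors = set()
        for b in set(batches):
            # Distances to cells of this batch only, excluding the cell itself.
            candidates = sorted(
                (math.dist(ci, cells[j]), j)
                for j, bj in enumerate(batches) if bj == b and j != i
            )
            neighbors.update(j for _, j in candidates[:k])
        graph[i] = neighbors
    return graph
```

Because every cell is forced to connect into each batch, the resulting graph links batches together even when a plain k-NN graph would keep them separate.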
From there, the AI can generate new variants and combine methods.
In the end, it combined BBKNN with another method, ComBat, into an entirely new solution.
On the OpenProblems V2.0.0 composite metrics, the result was 14% better than the best manually designed method.
In the single-cell RNA sequencing batch-integration task, the AI system automatically combines methods, and its overall score exceeds that of existing expert tools.
Public health: outperforming the CDC official model
During the COVID-19 pandemic in the United States, the CDC's CovidHub Ensemble was regarded as the gold standard for forecasting hospitalizations.
Yet 14 models automatically generated by Google's system collectively outperformed the official ensemble.
AI's performance on the COVID-19 hospitalization forecasting task is generally better than the CDC's official CovidHub Ensemble.
Geographic remote sensing: segmentation accuracy exceeds 0.80
In the high-resolution remote sensing image segmentation task, all three models generated by the system beat existing methods, with segmentation accuracy (mIoU) above 0.80.
More importantly, the system adopts architectures such as U-Net and SegFormer and combines them with data augmentation, showing that it is not merely copying but transforming and optimizing.
The segmentation results generated by the AI system (bottom row) closely match the manual annotations (middle row) and are significantly better than traditional models.
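mIoU, the metric cited above, averages per-class intersection-over-union between predicted and reference masks. A minimal sketch on flat label sequences (real pipelines compute this on image arrays, but the arithmetic is the same):

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over the classes present in either mask.
    `pred` and `target` are flat sequences of integer class labels."""
    ious = []
    for c in range(num_classes):
        intersection = sum(p == c and t == c for p, t in zip(pred, target))
        union = sum(p == c or t == c for p, t in zip(pred, target))
        if union:  # skip classes absent from both masks
            ious.append(intersection / union)
    return sum(ious) / len(ious)
```

An mIoU above 0.80 means that, averaged over classes, predicted and reference regions overlap on more than 80% of their union.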
Neuroscience: predicting the activity of 70,000 neurons across the whole brain
In predicting whole-brain neural activity in zebrafish, the AI system not only beat all existing baselines but also designed a hybrid model that incorporates a biophysical simulator.
In the zebrafish whole-brain activity prediction task, the models generated by the AI system (blue) show lower overall error, comprehensively surpassing existing baselines (red). Among them, TS-Jaxley integrates a biophysical simulator into the prediction, improving interpretability.
Mathematics: solving difficult integrals with ease
Mathematical problems have always been the most challenging for algorithms.
Google's system was set on 19 extremely difficult integration tasks, with an unexpected result: standard numerical methods failed almost completely, while the AI system successfully computed 17 of them.
Examples of the numerical integration tasks. Google's system solved 17 of the 19 test integrals, while standard numerical methods failed to produce results.
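The kind of problem rewriting that helps here can be illustrated on a toy improper integral (this example is mine, not one of the paper's 19 tasks): a fixed-step midpoint rule converges slowly on the integral of x^(-1/2) over [0, 1] because of the singularity at 0, while the substitution x = t^2 removes the singularity entirely.

```python
import math

def midpoint(f, a, b, n=1000):
    """Fixed-step midpoint rule on [a, b] with n subintervals."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

# Naive form: the integrand 1/sqrt(x) blows up at x = 0, so a fixed
# grid gives poor accuracy no matter how the steps are placed.
naive = midpoint(lambda x: 1.0 / math.sqrt(x), 0.0, 1.0)

# Substituting x = t^2 (dx = 2t dt) cancels the singularity: the new
# integrand 1/sqrt(t^2) * 2t is just the constant 2 on (0, 1].
transformed = midpoint(lambda t: (1.0 / math.sqrt(t * t)) * (2.0 * t), 0.0, 1.0)

# The exact value of the integral is 2.
```

Choosing such a transformation automatically is exactly the kind of step a fixed numerical routine cannot take but a code-rewriting search can.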
This shows the system is not just scratching the surface; it has genuinely learned to find a way through in complex mathematical settings.
For researchers, it means AI can already deliver usable answers to long-troublesome numerical computations.
Time series: building a general prediction library from scratch
On the GIFT-Eval benchmark for general time series forecasting, Google's system accomplished a nearly impossible task:
Starting from scratch, and only by continuously optimizing a single piece of code, it built a general forecasting library covering 28 datasets, spanning 7 domains, and handling 10 frequencies from seconds to years.
This means AI can not only solve specific problems but also distill a set of general methods on its own, cracking cross-domain generalization, the hardest part of research.
A turning point in the scientific research paradigm: AI can innovate and cross boundaries
If the six cases above are the results, what is truly striking behind them is that AI is no longer content to imitate: it has demonstrated innovation and cross-disciplinary versatility in research.
In the genomics task, it automatically combined two different expert methods into a solution better than the human ones;
in the neuroscience task, it even combined a biophysical simulator with a deep model for the first time, opening up a brand-new hybrid approach.
There have been precedents in academia and industry: DeepResearchGym, for example, provided an evaluation framework, and the OpenProblems.bio community established public benchmarks for scRNA-seq.
Google's system, however, is the first to run the full pipeline on these benchmarks and deliver quantifiable, reproducible expert-level results.
This kind of innovation is not a one-off breakthrough but a pattern that recurs across disciplines.
From genomics to public health, from remote sensing images to time series prediction, the system can quickly adapt and find new paths.
The diversity of these benchmarks allows us to comprehensively evaluate its abilities in zero - shot generalization, high - dimensional signal processing, uncertainty quantification, semantic interpretation of complex data, and system - level modeling.
In the past, scientists advanced through repeated trials. Now the AI system can run trial-and-error at scale in the same way, hundreds of times faster, compressing months of exploration into a few hours.
This means the pace of scientific research may genuinely undergo exponential acceleration.
When AI enters the laboratory, what should humans do?
AI can already generate new methods, verify results, and outperform experts in multiple frontier fields; the role of the human scientist is being redefined.
In this system, the AI takes on tireless experimentation and exploration:
trying, optimizing, and screening thousands of solutions, compressing work that once took months or longer into hours or days.
Our system can rapidly generate expert-level solutions, shortening the exploration of a set of ideas from months to hours or days.
The responsibility of scientists is gradually shifting to proposing directions, judging value, and defining priorities.
AI can expand infinitely in technical paths, but the significance of scientific research problems themselves and the social value behind them still need to be set and grasped by humans.
This means that the division of labor in scientific research is moving towards a new pattern:
AI may become an efficient experimenter and method inventor, while humans make choices and decisions from a higher perspective.
It also means Google's system is no longer just an experiment with a research tool; it has taken the next step along the same track as projects like FunSearch and AI co-scientist:
from single-point breakthroughs to a cross-field research collaborator.
Notably, Google has open-sourced all the best solutions produced by the system and provides an interactive interface that lets researchers trace the entire search and discovery process.
This open attitude means that the scientific research community can directly verify and expand these AI - generated solutions in real tasks.
References:
https://arxiv.org/abs/2509.06503
https://research.google/blog/accelerating-scientific-discovery-with-ai-powered-empirical-software/
This article is from the WeChat official account "New Intelligence Yuan". Author: New Intelligence Yuan. Republished by 36Kr with permission.