Are AI scientists still relying on static rankings? The benchmark strikes back and reshapes the standard for automatic scientific-research evaluation.
AI scientists are pushing "automated scientific research" to a new stage, but a more dangerous problem is emerging alongside it: when the evaluator is static, the system may never learn the scientific mechanism, only how to score well on this particular exam.
The real danger of automated scientific research is no longer "being unable to search," but "being too good at gaming static evaluations."
Over the past year, systems like AI Scientists have demonstrated astonishing capabilities: proposing ideas, writing code, running experiments, analyzing results, and even automatically generating papers. However, the more powerful the system, the sharper a more fundamental problem becomes: if the evaluation environment is frozen, the system may learn to beat the evaluation without ever learning the scientific mechanism behind the task.
The paper's most important judgment lies exactly here.
Researchers from institutions including Texas A&M University and the University of Illinois Urbana-Champaign point out that the core risk facing autonomous scientific discovery is no longer weak search ability but cognitive overfitting to the benchmark itself: a sufficiently powerful search process may learn "how to win this particular exam" faster than it learns the underlying science.
Paper link: https://arxiv.org/abs/2603.29045
This is also the true meaning of the paper's title, "Let the Abyss Stare Back." Letting the abyss stare back is not merely a rhetorical flourish but a methodological shift: evaluation is transformed from a static, frozen, passive exam into a falsifier that actively counterattacks, actively hunts for loopholes, and actively probes weak points. It is no longer only the candidates that adapt to the benchmark; the benchmark starts to interrogate the candidates in return.
What DASES (Dynamic Adversarial Scientific Environment Synthesis and Mechanistic Co-Evolution) rewrites is not search ability but the definition of what counts as a discovery.
DASES changes the process from "propose candidates → score → retain" to "propose candidates → actively falsify → explain the failure → make a minimal correction → continue to evolve."
In this framework, there are three interlocking roles:
The Innovator proposes new scientific candidates;
The Abyss Falsifier is no longer a passive scorer; instead, it dynamically constructs new counter-example environments around the current candidate, specifically probing for its shortcuts, fragile assumptions, tail risks, and combinatorial instabilities;
The Mechanistic Causal Extractor does not merely report "failure"; it answers two more crucial questions: why did it fail, and what should be minimally changed in the next round?
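The interaction of these three roles can be sketched as a small loop. This is only an illustrative toy, not the paper's implementation: the role names follow the paper, but representing a candidate as a dictionary of "robustness scores" and the repair rule are our assumptions.

```python
# Hypothetical sketch of the DASES co-evolution loop described above
# (role names follow the paper; the toy "robustness score" representation
# of a candidate and all function bodies are our assumptions).

def falsifier_expand(candidate, frontier):
    """Abyss Falsifier: build a new counter-example environment aimed at the
    candidate's currently weakest mechanism, one notch harder each round."""
    weakest = min(candidate, key=candidate.get)
    return {"stress": weakest, "level": len(frontier) + 1}

def survives(candidate, frontier):
    """A candidate is retained only if it beats EVERY environment on the
    accumulated falsification frontier, not just the newest one."""
    return all(candidate[env["stress"]] >= env["level"] for env in frontier)

def minimal_edit(candidate, env):
    """Mechanistic Causal Extractor + Innovator: repair only the mechanism
    that failed, leaving everything else untouched (the minimal correction)."""
    patched = dict(candidate)
    patched[env["stress"]] += 1
    return patched

def dases(candidate, rounds=5):
    """Propose -> falsify -> explain -> minimally repair -> continue."""
    frontier = []
    for _ in range(rounds):
        frontier.append(falsifier_expand(candidate, frontier))
        while not survives(candidate, frontier):
            failed = next(e for e in frontier
                          if candidate[e["stress"]] < e["level"])
            candidate = minimal_edit(candidate, failed)
    return candidate, frontier

# toy candidate: robustness along three mechanisms the article mentions
final, frontier = dases({"shortcut": 0, "geometry": 0, "tails": 0})
```

The key structural point the sketch captures is that the frontier only grows: a surviving candidate must pass every past environment as well as the newest one, so nothing retained can quietly regress on an earlier falsification.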
The crucial point here is that the adversarial cases DASES generates are not random attacks but scientifically admissible counter-evidence environments.
The environments may grow harder, be pushed into the tails, and include counterfactuals and combinatorial perturbations, but the task semantics must not be violated. The goal is not to "force a breakthrough" by tampering with the problem definition, but to dig out the candidate's most damning vulnerabilities while the problem still holds.
What DASES pursues, therefore, is not the candidate with the highest score on a fixed benchmark but the candidate that still survives the strongest currently legitimate falsification frontier. This is also its most fundamental difference from many existing autonomous-scientist frameworks: not a larger search, but a rewritten evaluation standard.
Experimental Design
The smartest design choice is making the experiment extremely "clean": in the entire discovery game, only the loss may be modified.
To isolate this methodological question, the authors did not start with a large-scale task with vague boundaries. Instead, they deliberately chose a narrow but highly interpretable scientific problem: automatically discover a stronger image-classification loss function, with regularization allowed.
The truly remarkable part is how strictly the discovery process is constrained: the only thing that may be edited is the loss.
The backbone, optimizer, training schedule, data augmentation, data pipeline, and evaluation logic are all frozen. The system cannot "cheat its way to a better result" by secretly changing the training strategy, adjusting hyperparameters, or touching other modules. Any improvement can ultimately be attributed only to the loss itself.
This is what the paper calls the single editable scientific locus. It may look like a mere implementation constraint, but it is in fact the scientific foundation of the whole method: if you do not lock down the editable locus, an automated research system can easily "cheat" in places you are not watching. The result may look like a discovery, but in essence it is just protocol gaming.
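One way to picture the single editable locus is a harness in which the experimental protocol is literally immutable and a candidate supplies nothing but a loss function. The sketch below is our illustration, not the paper's code; all names and the toy training signal are assumptions.

```python
# Minimal sketch (not the paper's code) of the "single editable scientific
# locus": the protocol is frozen, and a candidate contributes only a loss.
from dataclasses import dataclass, FrozenInstanceError
from typing import Callable

@dataclass(frozen=True)  # frozen=True: candidates cannot mutate the protocol
class Protocol:
    backbone: str = "resnet18"
    optimizer: str = "sgd"
    lr: float = 0.1
    epochs: int = 90
    augmentation: str = "standard"

PROTOCOL = Protocol()  # one shared, immutable experimental protocol

def run_trial(loss_fn: Callable[[float], float], protocol: Protocol = PROTOCOL):
    """Stand-in for a full training run: the only degree of freedom between
    two trials is `loss_fn`, so any score difference is attributable to it."""
    raw_errors = [0.9, 0.5, 0.2]  # toy 'training signal' to keep this runnable
    score = sum(loss_fn(e) for e in raw_errors)
    return {"protocol": protocol, "score": score}

baseline = run_trial(lambda e: e)        # identity 'loss'
candidate = run_trial(lambda e: e ** 2)  # squared 'loss'
```

Because `Protocol` is frozen, any attempt by a candidate to touch the learning rate or schedule raises an error at runtime, which is the software analogue of the paper's auditability argument.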
To show why static validation can be deceptive, the authors built a discovery lab specifically designed to lure the model into taking shortcuts.
They constructed a synthetic discovery environment. On the surface it is just a four-class image-recognition task, but the only mechanism that truly determines the labels is foreground shape geometry. That is, the model is supposed to classify based on the shapes of circles, squares, triangles, and other polygons.
The catch is that the training distribution is deliberately set up to lead the model astray: each type of foreground has a high probability of co-occurring with a particular background color-texture family. So what the model is most likely to learn is not the foreground geometry but the background statistics.
More importantly, these backgrounds are not simple templates but texture families with rich random variation. In other words, this is not a crude toy setting but a reproducible, auditable falsification lab purpose-built to expose shortcut reliance.
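The statistical structure of such a shortcut-biased distribution is easy to sketch. The sampler below is our toy illustration (shape and background-family names, and the correlation strength, are assumptions, not the paper's parameters): the label is determined only by the foreground shape, yet the background family predicts it almost as well.

```python
# Toy sketch of a shortcut-biased training distribution (our assumption of
# the setup described above; not the paper's generator or parameters).
import random

SHAPES = ("circle", "square", "triangle", "pentagon")
BG_FAMILIES = ("red_noise", "blue_stripes", "green_grain", "gray_marble")

def sample_example(p_shortcut=0.95):
    """The label is caused ONLY by the foreground shape, but with probability
    p_shortcut the background family matches the label index, creating a
    spurious cue that is almost as predictive as the true mechanism."""
    label = random.randrange(len(SHAPES))
    if random.random() < p_shortcut:
        bg = BG_FAMILIES[label]          # spurious background-label correlation
    else:
        bg = random.choice(BG_FAMILIES)  # occasional decorrelated sample
    return {"shape": SHAPES[label], "background": bg, "label": label}
```

A counterfactual falsification environment is then trivial to express in the same terms: resample with `p_shortcut=0`, and any classifier that learned the background statistics rather than the shape collapses to chance.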
Therefore, what this paper really aims to prove is not as simple as "whether AI can find a stronger loss," but another more crucial question:
If the test set continuously targets the vulnerabilities of the candidates, can those candidates that seem good enough under static validation still hold up in the end?
Experimental Results
Table 1 and Figure 1 show that while static validation appears to be going smoothly, the real failure modes are being forced out round by round.
What they show is not "a certain method scores higher" but a more fundamental fact: static in-distribution (ID) validation keeps creating the illusion that the model is already good, yet as soon as the falsifier takes one step forward, the hidden failure modes are immediately exposed.
Table 1 breaks the DASES discovery trajectory into a series of clearly marked events. At first the system makes progress in a shortcut-biased environment; the Falsifier then adds neutral-background counterfactuals, then harder background-family swaps, then invariance-heavy geometry stress emphasizing geometric stability, and finally compositional tail interactions that combine several effective perturbations.
The most illuminating part of this table is that it lets readers see exactly what each candidate has learned.
Early candidates collapse outright when they meet background counterfactuals, indicating that they mainly learned background shortcuts. CE becomes the first bottleneck, suggesting that it fixes the most superficial layer of shortcuts but is far from a stable mechanism.
When the Falsifier adds geometric-invariance stress, CE's test performance drops significantly, indicating that the model has not formed a robust geometric representation. CE + L2 then becomes the second bottleneck: more stable than CE, but still broken once the combinatorial tail stress arrives.
Finally, FNG-CE reaches 54.4% on D4 and drops by only 0.1 points in the final D5 expansion, becoming the first candidate to truly cross the current falsification frontier.
So what Table 1 really proves is not "how much higher FNG-CE scores than CE" but this: a discovery is not whoever first posts a high score under static validation; it is whoever withstands round after round of stronger yet still legitimate counter-evidence.
Figure 1 makes the same logic even more intuitive. The gray line is static ID validation accuracy, which stays high almost throughout; the blue line is discovery-lab test accuracy. Whenever a falsifier expansion, marked by a red diamond, occurs, the blue line drops abruptly. The figure's most striking lesson is that "looking good the whole time" is entirely different from "actually withstanding counter-evidence."
The gray line says that if you look only at static validation, you will mistakenly believe the system has been making steady progress; the blue line says that each new piece of legitimate counter-evidence drags a previously invisible failure mode into view. Only at the end does FNG-CE, for the first time, unify "high score" and "resistance to counter-evidence."
In other words, Table 1 provides the round-by-round evidence and Figure 1 tells the overall story: static evaluation yields an optimistic illusion, while dynamic falsification screens out the candidates that are merely good at taking exams.
FNG-CE is not a random stack of regularizers but the minimal correction "forced out," step by step, by the falsification trace.
Another key point of this paper is that FNG-CE is not a "more complex loss" hand-picked by the authors; it is forced out, step by step, by the failure modes exposed in each preceding round of falsification.
After CE + L2, the Mechanistic Causal Extractor found two key problems still unsolved.
First, the model could still "win by length," inflating the feature norm so that confidence looks higher without any genuinely more stable discrimination mechanism. Second, the geometric structure of the feature space remained insufficiently uniform, with redundancy and anisotropy, so once counterfactuals and combinatorial perturbations were superimposed, inter-class separation would still collapse.
So DASES constructed FNG-CE: on top of CE, it simultaneously adds feature-norm regularization, feature-covariance geometry regularization, and L2 weight decay.
The three parts address three different problems respectively:
The norm term suppresses "gaming confidence through sheer norm length;"
The covariance term makes the feature geometry more uniform and less likely to become unstable under complex perturbations;
The L2 term continues to provide standard capacity control.
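To make the three ingredients concrete, here is a minimal NumPy sketch of an FNG-CE-style objective. The exact regularizer forms and the coefficient values are our assumptions for illustration; the paper specifies the actual recipe.

```python
# Hedged sketch of an FNG-CE-style loss: cross-entropy plus a feature-norm
# term, a feature-covariance geometry term, and L2 weight decay. Regularizer
# forms and coefficients are illustrative assumptions, not the paper's.
import numpy as np

def fng_ce_loss(logits, features, labels, weights,
                lam_norm=1e-3, lam_cov=1e-3, lam_l2=5e-4):
    # 1) standard cross-entropy on the logits (numerically stable log-softmax)
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(labels)), labels].mean()

    # 2) feature-norm term: penalize "winning by length", i.e. inflating
    #    confidence purely by growing the feature norm
    norm_term = (np.linalg.norm(features, axis=1) ** 2).mean()

    # 3) covariance-geometry term: push the feature covariance toward a
    #    scaled identity, reducing redundancy and anisotropy
    f = features - features.mean(axis=0, keepdims=True)
    cov = f.T @ f / max(len(f) - 1, 1)
    d = cov.shape[0]
    cov_term = ((cov - np.trace(cov) / d * np.eye(d)) ** 2).sum()

    # 4) plain L2 weight decay over the model parameters
    l2_term = sum((w ** 2).sum() for w in weights)

    return ce + lam_norm * norm_term + lam_cov * cov_term + lam_l2 * l2_term
```

Setting all three coefficients to zero recovers plain cross-entropy, which mirrors the article's narrative: each term was added only in response to a specific failure mode, not as decoration.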
What the paper really wants to show, then, is not that these ingredients have never appeared before, but this: under the combined constraints of a fixed protocol, a single editable locus, and dynamic falsification, this specific combination is the first minimal mechanistic answer that survives the entire frontier.
This is also FNG-CE's most convincing quality: it was not so much "designed" as "forced out" by the chain of counter-evidence.
Table 2 and Table 3 answer the most crucial question: FNG-CE is not just suited to the synthetic lab; it genuinely transfers to standard benchmarks.
At this point a very natural question arises: is FNG-CE simply a particularly good fit for this synthetic falsification lab?
The paper answers firmly with Table 2 and Table 3. The authors transferred the analytical form of FNG-CE to standard natural-image classification benchmarks and ran controlled comparisons on ResNet-18 and ResNet-50.
The results are highly consistent. As Table 2 shows, on ResNet-18 FNG-CE outperforms CE on all six datasets: CIFAR10, CIFAR100, DTD, CUBirds, VGGFlower, and TrafficSigns. As Table 3 shows, on ResNet-50 FNG-CE also achieves the best result on all seven datasets: ImageNet, CIFAR10, CIFAR100, DTD, CUBirds, VGGFlower, and TrafficSigns.
The most widely circulated number is ImageNet: on ResNet-50, FNG-CE reaches 71.56%, a gain of 0.83 percentage points over CE's 70.73%. This suggests that what DASES forced out is not merely a "trick tailored to the discovery lab" but a more transferable loss-level inductive bias.
One more important detail deserves emphasis: CE + L2 does not show this consistent improvement.
In other words, it is not that "adding a bit more regularization wins," nor that "being more stable in the synthetic environment guarantees transfer."
What really works is the mechanistic clue DASES uncovered through dynamic falsification: the model must not only shed shortcuts but also reduce geometric vulnerability and stay stable under legitimate combinatorial perturbations. Only a loss forced out along this line continues to hold on real benchmarks.
Conclusion
The real value of this work is not just discovering a new loss, but pushing automated scientific research forward.
Understanding this work merely as "finding a new image-classification loss" underestimates it.
What truly matters is that it rewrites the evaluation standard for autonomous scientific discovery:
A discovery is not what scores high; a discovery is what withstands active counter-evidence.
Past automated research systems were essentially sitting a fixed exam; what DASES does is give the exam the ability to fight back. The question that used to dominate was "can AI search faster?"; this paper asks a more crucial one:
When the benchmark or test set starts actively hunting for your vulnerabilities, can your discovery still stand?
In this sense, the significance of FNG-CE is not only that it beats CE and CE + L2 in the paper's controlled comparisons, but that it is a candidate forced out by legitimate counter-evidence that finally crosses the falsification frontier.