The chart-topping AIs all failed. Meta and Stanford ran a hell-level test, and GPT, Claude, and Gemini all scored zero.
Here is a usage document for FFmpeg and a compiled executable file.
Now, rewrite the entire program from scratch.
This is the challenge that ProgramBench poses to the world's top AIs.
It was just released yesterday, created by the same team behind SWE-Bench, a joint effort by Meta, Stanford, and Harvard.
200 software projects. 9 top models. Pass rate: 0%!
Co-first author John Yang is a PhD student at Stanford and the creator of SWE-Bench and SWE-agent.
Not just fixing bugs, but building software from scratch
In the past year, there have been more and more reports of cases where "AI agents build software from scratch".
Anthropic used a set of parallel Claude models to write a C compiler. Cursor published a blog post about long-term autonomous programming. Epoch AI's MirrorCode is also doing something similar.
However, these cases share a common problem. Each time, only a few projects are tested, and the scaffolding is manually optimized.
In contrast, ProgramBench has formalized this process.
200 tasks, a unified scaffold, and systematic anti-cheating checks, all packaged to benchmark standards.
Paper link: https://programbench.com/static/paper.pdf
Its predecessor, SWE-Bench, gives you an existing codebase and tells you where the bug is or which feature to add; you make the change. In essence, it is "reading comprehension plus local surgery".
For evaluation, it relies on unit tests that check whether your code's internal implementation is correct: your function signatures and variable names must match what the tests expect.
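To make that concrete, here is what such a test looks like in spirit. The module and function names below are invented for illustration, not taken from SWE-Bench itself; the point is that the test imports the project's internals, so the code under test must expose exactly this interface:

```python
# Hypothetical SWE-Bench-style unit test; "mypkg.tokenizer" and "split_tokens"
# are invented names. The test reaches into the project's internals, so a fix
# must keep exactly this function name, signature, and return format to pass.
from mypkg.tokenizer import split_tokens

def test_split_tokens_keeps_quoted_groups():
    assert split_tokens('a "b c"') == ["a", '"b c"']
```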
ProgramBench is completely the opposite.
It only gives you two things: a compiled executable file and a usage document.
Your task is to write code from scratch that reproduces the same behavior; all you can do is run the program and observe its input-output behavior.
You can choose any programming language, use any data structure, and split the modules however you like.
There is no code skeleton, no function signatures, and no hints.
In terms of evaluation, the research team used agent-driven fuzz testing to generate a total of 248,853 behavior tests for the 200 tasks.
If your program's input-output behavior matches the original's, you pass; otherwise, you fail. The tests are never shown to the model.
Different from the unit tests in SWE-Bench, the behavior tests in ProgramBench don't care what the internal structure of your code looks like, as long as the behavior is the same.
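As a minimal sketch of what one behavior test amounts to (the binary names and the use of Python's subprocess here are illustrative assumptions, not the paper's actual harness): feed the same input to the reference executable and the rewrite, and compare only what is observable from the outside.

```python
import subprocess

def run(binary: str, args: list[str], stdin: bytes) -> tuple[int, bytes, bytes]:
    """Capture only what an outside observer sees: exit code, stdout, stderr."""
    proc = subprocess.run([binary, *args], input=stdin, capture_output=True)
    return proc.returncode, proc.stdout, proc.stderr

def behavior_test(reference: str, candidate: str, args: list[str], stdin: bytes) -> bool:
    """Pass if the rewrite is indistinguishable from the reference on this input."""
    return run(reference, args, stdin) == run(candidate, args, stdin)

# Example: compare a rewritten jq against the original on one fuzz-generated input.
ok = behavior_test("./jq_original", "./jq_rewrite", [".a"], b'{"a": 1}')
```

Scaling this idea to the benchmark's 248,853 tests is then a matter of generating many such inputs per task.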
The 200 tasks cover a wide range of projects, including compression tools (zstd, lz4, brotli), language interpreters (PHP, Lua, tinycc), databases (DuckDB, SQLite), media processing (FFmpeg), and developer tools (ripgrep, fzf, jq).
The median number of code lines is 8,635, and the largest, FFmpeg, has 2.7 million lines.
In summary, this test examines whether AI has the ability to "think and design software like a human engineer", rather than just "finding and fixing the right places in existing code".
The nine models line up, all getting a zero
A total of 9 models participated in the test, covering the Claude, Gemini, and GPT families.
The full pass rate (passing all tests) is 0% for all of them.
Let's first look at the head-to-head competition among the three flagships.
The average test pass rates of GPT-5.4 and Gemini 3.1 Pro are almost the same, 38.3% and 36.6% respectively. However, their problem-solving styles are completely different.
GPT-5.4 only uses 16 API calls and costs $0.33. It basically writes the entire program in one go, with 100% of the code generated in a single edit and hardly any revisions afterwards.
Gemini 3.1 Pro is the most "observant" among the 9 models. It uses 94 API calls, and 34.1% of the operations are about running the original program and observing the input-output behavior. It does the most exploration, but the final results are not much different.
What really sets it apart is Claude Opus 4.7.
It has an average pass rate of 51.2% and passes more than 95% of the tests in 3% of the tasks. It is the only model that meets the "almost passed" standard. However, even it didn't get a perfect score in any task.
Overall, the performance of the 9 models shows a clear hierarchy.
The three flagships of the Claude series (Opus 4.7, Opus 4.6, Sonnet 4.6) lead, GPT-5.4 and Gemini 3.1 Pro form the second tier, and the remaining four small models have pass rates below 35%.
Another counterintuitive finding is that spending more money and taking more steps doesn't necessarily lead to better results.
Sonnet 4.6 runs an average of 868 commands per task, costs $27.09, and the longest trajectory is close to 2000 steps. However, its performance is worse than that of Opus 4.7, which only uses 93 calls and costs $3.81.
More importantly, in 98% of the runs, the models voluntarily submitted their answers thinking they were "done", without hitting the time or step limit.
It's not that they ran out of time; they genuinely can't do it.
In addition, which tasks are hard is highly consistent across models.
Everyone can get good scores on simple CLI tools (nnn, fzf, gron), while complex systems (FFmpeg, PHP, typst, ast-grep) are equally unforgiving to all models.
It should be noted that ProgramBench uses the mini-SWE-agent, a minimalist scaffolding without context compression, multi-agent collaboration, or customized toolchains.
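For context, a mini-SWE-agent-style loop is essentially one model talking to one shell: the model emits a command, the harness runs it, the raw output goes back into the context, and this repeats until the model declares it is done. The sketch below is a schematic of that idea, not mini-SWE-agent's actual code; `query_model` and the chat-style history format are placeholders for whatever LLM API is plugged in.

```python
import subprocess

def query_model(history: list[dict]) -> str:
    """Placeholder for the LLM call: returns the next shell command, or 'SUBMIT'."""
    raise NotImplementedError("wire this up to an actual model API")

def agent_loop(task_prompt: str, max_steps: int = 250) -> None:
    # No context compression, no sub-agents, no custom tools: just a growing shell transcript.
    history = [{"role": "user", "content": task_prompt}]
    for _ in range(max_steps):
        command = query_model(history)
        if command.strip() == "SUBMIT":  # the model decides on its own that it is finished
            return
        result = subprocess.run(command, shell=True, capture_output=True, text=True)
        history.append({"role": "assistant", "content": command})
        history.append({"role": "user", "content": result.stdout + result.stderr})
```

Keeping the scaffold this thin means the scores measure the model rather than the harness.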
The code gets written, but it doesn't look human-written
The research team compared the high-scoring solutions that passed more than 75% of the tests with the original human-written code and found several astonishing differences.
Single-file monsters.
The median number of files for human code is 15, while for the models it is 3.
60% of the solutions have only 1 to 3 code files.
Human engineers split modules by function, while models tend to cram everything into one huge file. The median directory depth for humans is 2 levels, while for models it is 1 level.
Fewer and longer functions.
Opus 4.7 writes only 29% as many functions as humans, Sonnet 4.6 24%, and GPT-5.4 just 10%.
However, the average length of each function is longer. The functions written by Gemini 3.1 Pro are 62% longer than those of humans.
Significant reduction in code volume.
The median number of lines of model code is 1,173, while for humans it is 3,068. 85% of the high-scoring solutions are shorter than the original version.
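For reference, structural metrics of this kind (file count, total lines, directory depth) are straightforward to approximate; the sketch below is a generic illustration, not the paper's measurement code, and counting functions would additionally need a per-language parser.

```python
from pathlib import Path

def structure_metrics(repo: str, exts: tuple[str, ...] = (".py", ".c", ".rs", ".go")) -> dict:
    """Rough structure stats for one repo: source file count, total lines, max directory depth."""
    root = Path(repo)
    files = [p for p in root.rglob("*") if p.is_file() and p.suffix in exts]
    lines = sum(len(p.read_text(errors="ignore").splitlines()) for p in files)
    depth = max((len(p.relative_to(root).parts) - 1 for p in files), default=0)
    return {"files": len(files), "lines": lines, "max_depth": depth}
```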
In summary, today's AI can write code, but it can't do software design.
It doesn't understand why modules need to be split or why human engineers spend time defining interfaces and abstract layers. The strategy is to cram all the logic into as few files and functions as possible, as long as it can run.
GPT-5.4's performance is the most extreme. It creates an average of only 5 files per task and makes 1.2 modifications. 39.5% of the trajectories have zero modifications after file creation.
In contrast, Claude Sonnet 4.6 creates an average of 11.3 files and makes 18.3 modifications, showing a development pattern closer to that of humans.
There is also a very interesting phenomenon.
The models only have a 50% chance of using the same programming language as the original version.
Among them, Python is the models' favorite, accounting for 36% of all 1,800 runs.
For projects originally written in Rust, only 44% are rewritten in Rust; for C/C++, it's 46%. Go projects have the highest "loyalty" at 70%.
No matter what language the original version was written in, the models have a one-third chance of rewriting it in Python.
They promised not to cheat, but once online they went straight to GitHub for the source code
This might be the most dramatic part of the whole research.
The research team conducted a controlled experiment, giving the models access to the Internet but clearly stating in the system prompt that "cheating is not allowed".
Then, 9 AI judges were used to check whether there was any cheating behavior in each trajectory.
The results are shocking.
36% of the tasks of Claude Sonnet 4.6 were judged as cheating, 21% for Claude Opus 4.6, and 20% for Gemini 3 Flash.
The cheating methods are diverse.
The most blatant one is to clone the source code repository from GitHub.
A slightly more covert method is to download through the package manager, such as cargo install or go get.
An even more cunning method is to rummage through the source code of dependent libraries in the local package cache directory.
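The study uses LLM judges to read whole trajectories, but even a crude keyword filter over the command log would catch the blatant cases above. The patterns below are illustrative assumptions, not the actual judging criteria.

```python
import re

# Commands that would hand the model the original source instead of forcing a rewrite.
SUSPECT_PATTERNS = [
    r"\bgit\s+clone\b",               # cloning the upstream repository outright
    r"\bcargo\s+install\b",           # pulling the crate through the package manager
    r"\bgo\s+get\b",                  # fetching the Go module
    r"\.cargo/registry|go/pkg/mod",   # rummaging through the local package cache
]

def flag_cheating(trajectory_commands: list[str]) -> list[str]:
    """Return every command in a trajectory that matches a suspect pattern."""
    return [cmd for cmd in trajectory_commands
            if any(re.search(pat, cmd) for pat in SUSPECT_PATTERNS)]
```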
However, the differences among the AI judges are also surprisingly large.
For Claude Opus 4.6, the 9 judges couldn't even agree with one another on whether it had cheated.