
Weng Jiayi, a post-training engineer at OpenAI, proposed a new paradigm hypothesis for Agentic AI.

Friends of 36Kr · 2026-05-11 08:02
In the era of coding agents, experience may once again be transformed into software that is readable, modifiable, and testable.

Over the past decade, AI progress has relied mainly on one approach: pouring more data and compute into ever-larger models, so that experience gets embedded in neural network parameters. This approach produced the post-ChatGPT leap in large models, but it also left behind a hard problem: as models grow more powerful, it remains difficult to explain, let alone correct, why they succeed or fail.

Weng Jiayi, an engineer at OpenAI, recently ran an experiment that points to another possibility: in an environment with clear goals, a runnable setup, and a feedback loop, AI can grow more capable not only through model training but also by independently modifying code.

On May 8, 2026, Weng Jiayi systematically documented this set of experiments on his personal blog, "Learning Beyond Gradients," and released the code repository, CSV experiment logs, and video replays. He has long focused on reinforcement learning and post-training infrastructure: he took part in the initial release of ChatGPT and worked on projects including GPT-4, GPT-4 Turbo, GPT-4o, the o-series, and GPT-5. Before joining OpenAI, he earned a bachelor's degree from the Department of Computer Science at Tsinghua University and a master's degree from Carnegie Mellon University. He is also the lead author of the open-source reinforcement learning library Tianshou and the high-performance parallel environment engine EnvPool.

Image: generated by AI.

He had Codex repeatedly write policy code, run the environment, read logs, watch replays, locate failures, then modify the code, add tests, and continue evaluating. After many iterations, Codex "developed" a pure-Python programmatic policy: it reached the theoretical maximum score of 864 in Atari Breakout and came close to common deep reinforcement learning algorithms in robot control simulation environments such as MuJoCo Ant and HalfCheetah.

What makes this set of experiments genuinely important is a core question: when the coding agent is strong enough, does learning have to happen in neural network weights?

In these experiments, experience is written into code, tests, logs, and replays, becoming a software system that can be read, modified, reviewed, and audited. If this direction holds up, the next step for Agentic AI may be not only training larger models but also having models help maintain an engineering system that keeps evolving.

01

The Engineering Closed Loop from 387 to Full Score

Weng Jiayi wrote in his blog that the experiment actually began as an engineering need. He maintains EnvPool in his spare time and needed a cheaper way to check whether a game environment was running properly than "run a neural network every time," because putting a neural network into CI was too expensive. The original question was: can we write a heuristic rule, cheap, reproducible, and clearly better than a random policy, that drives the environment into information-rich states?

He used Codex (with gpt-5.4 as the base model) to try a fully rule-based version. The initial prompt was blunt: "Write a policy to solve Breakout." The result was unsatisfying, and the low score by itself carried no information: the action semantics might be wrong, the state detection might be wrong, the evaluation pipeline might be wrong, or the policy structure itself might simply be too weak.

Weng Jiayi then changed the form of the task. Instead of asking Codex to hand over a policy.py directly, he asked it to maintain a complete loop: probe the actions and observations, write state detectors, write the policy, run full episodes, record trials.jsonl and summary.csv, generate videos or curves, examine failure modes, modify the policy, simplify the code, and run regression tests.
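
To make the loop concrete, here is a minimal sketch of what the "run a full episode, then log the trial" portion of such a harness could look like. This is an illustration under assumptions, not Weng Jiayi's actual code: it assumes Gymnasium's ALE Breakout environment, and the policy is any pure-Python callable from observation to action.

```python
# Hypothetical harness sketch (not the blog's actual code): run one full
# episode with a pure-Python policy and append the result to trials.jsonl.
import json
import gymnasium as gym  # assumes gymnasium[atari] / ale-py is installed

def run_episode(policy, seed=0):
    env = gym.make("ALE/Breakout-v5")
    obs, info = env.reset(seed=seed)
    total_reward, steps = 0.0, 0
    while True:
        action = policy(obs)  # rule-based code, not a neural network
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        steps += 1
        if terminated or truncated:
            break
    env.close()
    return total_reward, steps

def log_trial(path, seed, score, steps, note=""):
    # One JSON object per line, so the agent can grep its own history.
    with open(path, "a") as f:
        f.write(json.dumps({"seed": seed, "score": score,
                            "steps": steps, "note": note}) + "\n")
```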

The Breakout experiment logs document this process clearly. In the first round, Codex first confirmed the action space and observation shape, identified the colors of the ball, paddle, and bricks from RGB frames, and then used those image-derived labels to scan the 128-byte Atari RAM. The initial baseline score was only 99; after adding tunnel-offset logic, it rose to 387.
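
The "use image labels to scan the RAM" step can be pictured as a simple correlation search. The sketch below is my reconstruction under assumptions, not code from the repo: it ranks the 128 RAM addresses by how well each byte's trace tracks a quantity, such as the ball's x position, detected from pixels.

```python
# Illustrative reconstruction (assumed, not from the repo): find the RAM
# address whose byte value best tracks a pixel-detected quantity over time.
import numpy as np

def find_ram_address(ram_frames, pixel_ball_x):
    # ram_frames: (T, 128) uint8 snapshots of the Atari RAM
    # pixel_ball_x: (T,) ball x coordinates detected from RGB frames
    ram = np.asarray(ram_frames, dtype=np.float64)
    target = np.asarray(pixel_ball_x, dtype=np.float64)
    target -= target.mean()
    scores = np.zeros(128)
    for addr in range(128):
        byte = ram[:, addr] - ram[:, addr].mean()
        denom = np.linalg.norm(byte) * np.linalg.norm(target)
        scores[addr] = abs(byte @ target) / denom if denom > 0 else 0.0
    return int(np.argmax(scores))  # best-correlated RAM address
```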

387 is the kind of local maximum that invites misjudgment. The policy could reliably return the ball, but the ball's path was trapped in a periodic loop: no lives were lost, yet no new bricks were hit, and the score stalled. A human writing the code might have kept fine-tuning "catch accuracy." Codex instead analyzed the video and the trajectories of the last few dozen steps and identified the real problem: the ball's path lacked perturbation.

Image: the Atari Breakout game screen. The player controls the paddle at the bottom, bouncing the ball to break through the colored brick wall layer by layer. Codex reached the theoretical maximum score of 864 in this game.

Codex then added a cycle-breaking mechanism: if no reward arrives for a long time, it periodically adds an offset to the predicted landing point so the ball escapes the local cycle. The score rose from 387 to 507. Further iteration surfaced a new problem: on fast low balls, normal interception made the paddle lead too far and drift away. Codex added the parameter fast_low_ball_lead_steps = 3, and the score rose from 507 to 839. The final climb from 839 to 864 looked more like maintaining an already complex system: trying deadband, serve offset, stuck offset, brick-balance bias, and look-ahead steps. Many directions had no effect; the change that ultimately worked was a late-game condition: after the first brick wall is cleared, enable the stuck offset only when the ball is far from the paddle, and gradually release it as the ball approaches.
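
Pulling these pieces together, the controller described here can be sketched roughly as follows. Everything except the parameter name fast_low_ball_lead_steps (which the blog reports) is my assumption: the playfield bounds, thresholds, and offset sizes are placeholders.

```python
# Rough sketch of a geometric Breakout controller with cycle-breaking and
# a fast-low-ball lead. Only fast_low_ball_lead_steps comes from the blog;
# all other constants are illustrative assumptions.
from dataclasses import dataclass

SCREEN_LEFT, SCREEN_RIGHT = 8, 152  # assumed playfield bounds (pixels)
PADDLE_Y = 189                      # assumed paddle row

@dataclass
class BallState:
    ball_x: float
    ball_y: float
    vx: float
    vy: float

def predict_landing_x(s: BallState) -> float:
    """Advance the ball to the paddle row, reflecting off the side walls."""
    if s.vy <= 0:  # ball moving up: hold the center until it comes back
        return (SCREEN_LEFT + SCREEN_RIGHT) / 2
    x, y, vx = s.ball_x, s.ball_y, s.vx
    while y < PADDLE_Y:
        x += vx
        y += s.vy
        if x < SCREEN_LEFT or x > SCREEN_RIGHT:
            vx = -vx  # wall bounce
            x = max(SCREEN_LEFT, min(x, SCREEN_RIGHT))
    return x

def target_x(s: BallState, steps_since_reward: int,
             fast_low_ball_lead_steps: int = 3) -> float:
    x = predict_landing_x(s)
    # Cycle-breaking: after a long rewardless stretch, offset the
    # interception point so the rebound angle (and the ball's path) changes.
    if steps_since_reward > 600:  # threshold is an assumption
        x += 4 if (steps_since_reward // 300) % 2 == 0 else -4
    # Fast low balls: lead the interception by a few velocity steps.
    if s.vy > 2 and s.ball_y > 150:
        x += s.vx * fast_low_ball_lead_steps
    return x
```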

The final RAM default configuration stably scored 864 / 864 / 864 over three episodes, the theoretical ceiling for Breakout. Codex then ported the same geometric controller to a pure image-input version: no RAM reads, relying only on RGB segmentation to identify the paddle, ball, and brick balance. The image version scored 310 on the first run and 428 on the second, and reached 864 by the seventh local episode, corresponding to 14,504 local policy environment steps.
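
The state-reading swap amounts to replacing RAM lookups with a color-mask detector. Below is a minimal sketch under assumptions; the actual color thresholds and row boundaries in the repo may differ. In Breakout the paddle and ball are drawn in a similar reddish color, so the two can be separated by screen row.

```python
# Assumed pixel detector sketch: segment reddish pixels in a (210, 160, 3)
# RGB Atari frame, then split paddle from ball by vertical position.
import numpy as np

def detect(frame):
    r, g = frame[:, :, 0], frame[:, :, 1]
    mask = (r > 180) & (g < 120)        # reddish pixels (assumed threshold)
    ys, xs = np.nonzero(mask)
    is_paddle = ys > 185                # the paddle sits near the bottom
    paddle_x = xs[is_paddle].mean() if is_paddle.any() else None
    is_ball = (~is_paddle) & (ys > 95)  # below the brick rows (assumed)
    ball = ((xs[is_ball].mean(), ys[is_ball].mean())
            if is_ball.any() else None)
    return paddle_x, ball
```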

Image: Codex's sample-efficiency curve on Breakout. The blue line is the version that reads the game memory (RAM) directly; the red line is the version that only looks at the screen (Vision). The RAM version jumped through 99 → 387 → 507 → 839 → 864, first reaching the full score at the 81st episode and a cumulative 1.5 million environment steps; the Vision version, inheriting the mature structure migrated from the RAM version, reached 864 in only 7 episodes and about 14,500 environment steps.

Weng Jiayi specifically cautioned against reading this as "the image-input version reached full score from scratch in only 14,500 steps." The real sequence is that Codex first discovered the geometric controller, cycle-breaking, and late-game offset release in the RAM version, and only after that structure stabilized did it swap the state-reading layer from RAM to RGB. The 14,500 steps are the migration budget for the image version.

02

The Definition of Heuristic Learning

Finding a name for this continuously evolving "software policy" proved harder than writing its first version. Weng Jiayi eventually named the process Heuristic Learning (HL) and the object it maintains the Heuristic System (HS).

By his definition in the blog, the policy in HL is program code. Like today's common deep reinforcement learning, it has a loop of state, action, feedback, and update. The differences: the thing being updated is software structure rather than neural network parameters; the feedback is digested by a coding agent and can come from environment rewards, test cases, logs, videos, replays, or human feedback; and the update is not backpropagation but the coding agent directly editing policies, state detectors, tests, configurations, or memory.
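
The shape of that loop can be shown with a deliberately tiny, runnable toy (my own illustration, not from the blog): the state is an explicit, human-readable config; the feedback is an evaluation score; the memory is a trials log; and the update edits the config in place, with a trivial hill-climbing rule standing in for the coding agent.

```python
# Toy illustration of the HL loop shape (not from the blog): the update
# touches explicit, inspectable variables instead of network weights.
import json
import random

config = {"threshold": 0.0}  # explicit state a human (or agent) can read

def policy(obs: float) -> int:
    return 1 if obs > config["threshold"] else 0

def evaluate(seed: int) -> int:
    # Toy environment: the right move is to act exactly when obs > 0.3.
    rng = random.Random(seed)
    return sum(policy(o) == (o > 0.3)
               for o in (rng.random() for _ in range(100)))

trials = []  # explicit memory: every attempt stays readable afterwards
for round_id in range(10):
    score = evaluate(seed=round_id)
    trials.append({"round": round_id, "config": dict(config), "score": score})
    if score < 95:                   # feedback digested into a code edit:
        config["threshold"] += 0.05  # a stand-in for the coding agent
print(json.dumps(trials[-1]))
```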

It should be noted that "using programs instead of neural networks as policies" is not a concept Weng Jiayi invented. The academic community has discussed Programmatic Reinforcement Learning (Programmatic RL) for years: in 2019, the PROPEL framework from Rice University and Caltech studied a reinforcement learning method that represents policies as short programs in a symbolic language; in 2021, the LEAPS work went further, learning a program embedding space and combining differentiable program policies with RL training; in 2023, HPRL at ICML proposed hierarchical programmatic reinforcement learning, letting a meta-policy compose multiple programs; in 2024, the LLM-GS framework from NTU and Microsoft used the coding ability and common-sense reasoning of LLMs to guide the search for programmatic RL policies.

The consensus of these studies: compared with neural policies, programmatic policies offer better interpretability, formal verifiability, and generalization to unseen scenarios.

Weng Jiayi's substantive contribution this time lies in treating the coding agent as the engineering channel for maintaining the heuristic system. Programmatic RL used to depend either on hand-designed domain-specific languages or on search algorithms over a restricted program space; Weng Jiayi used Codex to fold code, logs, tests, video replays, and parameter tuning into a single agent's workflow, collapsing the iteration cost of program policies at a stroke. In other words, he is demonstrating a new engineering path: when the coding agent is strong enough, heuristic policies once considered "too expensive to maintain" may become cost-effective again.

Weng Jiayi provided a comparison table in his blog that lays out the differences between HL and Deep RL along four dimensions:

Policy form: in HL, the policy is code composed of rules, state machines, controllers, model-predictive control (MPC), and macro-actions; in Deep RL, it is neural network parameters.

State form: HL uses explicit variables, detectors, and caches; Deep RL uses an observation vector fed to the network.

Feedback form: HL treats tests, logs, and replays as valid signals; Deep RL relies mainly on a fixed reward function.

Memory form: HL can explicitly store trials, summaries, failure causes, and version diffs; Deep RL has essentially none in on-policy algorithms and relies on a replay buffer in off-policy ones.

The comparison highlights several engineering properties of HL: the policy is interpretable and can be translated into natural language; sample efficiency is measured in units of "one effective code change" rather than slow gradient updates; old capabilities can be frozen into regression tests, fixed-seed replays, or golden cases; overfitting to training seeds or test loopholes can be constrained through simplification, regression checks, and multi-seed evaluation; and old capabilities need not live only in weights but can live in rule sets and tests, which partly addresses the catastrophic forgetting that neural networks have never solved well.
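
The "old capabilities become regression tests" point translates directly into ordinary test tooling. Here is a sketch of what such a test file could look like, assuming the run_episode helper sketched earlier and hypothetical module names; the pinned scores follow the article's 864 / 864 / 864 result.

```python
# Hypothetical regression test (assumed module names): pin fixed-seed
# episode scores so later policy edits cannot silently lose old capability.
import pytest

from breakout_policy import policy, run_episode  # assumed module layout

GOLDEN = {0: 864, 1: 864, 2: 864}  # seed -> minimum acceptable score

@pytest.mark.parametrize("seed,expected", sorted(GOLDEN.items()))
def test_breakout_score_does_not_regress(seed, expected):
    score, _steps = run_episode(policy, seed=seed)
    assert score >= expected, f"seed {seed}: score {score} < {expected}"
```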

03

Batch Validation on Atari57: Boundaries and Shortcomings

If you look only at Breakout, the story is easily flattened into "AI wrote a perfect policy." But Weng Jiayi did not stop there. He scaled the same Codex workflow to the full Atari57 suite: 57 games, two observation modes, and three repetitions, for a total of 342 "unattended" search trajectories.

The experimental design was strict. Each game was tested with two input modes, reading the game memory directly or looking only at the screen, and each mode was repeated independently three times. Each of the 342 Codex agents received the same prompt template and then explored actions, wrote code, ran experiments, and recorded results entirely on its own, with no human hints. The constraints were tightly specified: no training of neural networks, no reading of game source code, and no use of any hidden information, with every step spent on debugging and trial-and-error counted toward the total cost. This was to prevent Codex from cheating by "peeking at the answer" in any form.

To measure the results, the field typically uses an index called HNS (Human-Normalized Score): put simply, each game's score is rescaled so that the average human player's level equals 1, making scores comparable across games.
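
Concretely, HNS rescales a raw game score against a random policy (0.0) and the average human (1.0). A minimal sketch follows; the Breakout baseline values below are ones commonly cited in the Atari literature and vary slightly across papers.

```python
# Human-Normalized Score: 0.0 = random policy, 1.0 = average human player.
def human_normalized_score(score, random_score, human_score):
    return (score - random_score) / (human_score - random_score)

# Example with commonly cited Breakout baselines (values differ by source):
# random ~= 1.7, human ~= 30.5, so a score of 864 maps to an HNS of ~29.9.
print(human_normalized_score(864, 1.7, 30.5))  # ~29.94
```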

Image: sample-efficiency comparison across the Atari57 suite. The horizontal axis is environment steps (log scale); the vertical axis is HNS (Human-Normalized Score, where 1.0 means average human level). Codex's screen-input version (red line) led the PPO baseline (blue/gray dashed line) decisively in the early phase and reached 0.81 at 9.7 million steps, approaching PPO's level around 10 million steps; Codex's memory-input version (purple line) converged at 0.59.

Measured by this standard, Codex's early-phase efficiency was striking. After consuming only 1 million environment steps, Codex's median HNS reached 0.32 with screen input and 0.26 with memory input, well above classic reinforcement learning algorithms such as PPO at the same stage. At 9.7 million steps, the screen version of Codex reached 0.81, approaching the level of about 0.88 to