
NVIDIA just disrupted itself: an AI agent evolved autonomously for 7 days, outperforming human operator engineers and GPU experts.

Machine Intelligence (机器之心) · 2026-03-26 12:10
Oh no, human cognition is the bottleneck.

This should be the most explosive article that just came out today.

In many WeChat groups for operator development, it has caused quite a stir.

“This might be the first true display of superhuman intelligence in the software field,” Xu Bing of NVIDIA just asserted on X. He was commenting on AVO, a new piece of NVIDIA research on which he, Terry Chen, and Zhifan Ye are co-first authors.

In this research submitted to arXiv just this Thursday, NVIDIA built the Agentic Variation Operator (AVO), a new type of evolutionary variation operator. It replaces the fixed mutation, crossover, and manually designed heuristic methods in classical evolutionary search with autonomous coding agents and has achieved quite astonishing practical performance.

Xu Bing said, “In some highly optimized attention-mechanism workloads, the agent can keep searching in the optimization loop for 7 days without human intervention, outperforming almost all human GPU experts.” Performance like this from AVO may well make many kernel/DSL developers tremble.

Xu Bing's tweet on X

Interestingly, in his tweet on X, Xu Bing also shared that when he and Terry Chen first started researching agent programming at NVIDIA a year and a half ago, they didn't know GPU programming. “So from the very beginning, we were committed to developing a fully automated system without human intervention.” They call it “blind coding.”

“In the past year and a half, the two of us have developed four generations of agents across two agent systems. Starting from the second generation, these agent stacks began to evolve on their own. Each agent now comprises about 100,000 non-blank lines of code.”

He also emphasized the larger significance behind AVO: “My bet: blind coding is the future of software engineering. Human cognitive ability is the bottleneck.”

Now let's take a closer look at what this paper, which may open a new era of “blind coding,” actually contributes.

Paper title: AVO: Agentic Variation Operators for Autonomous Evolutionary Search

Paper address: https://arxiv.org/abs/2603.24517v1

Large language models have become powerful components in evolutionary search. They replace manually designed mutation operators with learned code generation. In these systems, the LLM generates candidate solutions based on selected parents, while the usually heuristic-based framework is responsible for parent sampling, evaluation, and population management. This combination has achieved remarkable results in the fields of mathematical optimization and algorithm discovery, including flagship systems such as FunSearch and AlphaEvolve.

However, confining the LLM to candidate generation inside a preset pipeline fundamentally limits its capacity for discovery: each call produces only one output, and the model cannot actively consult references, test its changes, interpret feedback, or revise its plan before submitting a candidate. This limitation is particularly acute for implementations that have already been heavily hand-optimized and require deep, iterative engineering to improve further.
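The fixed-pipeline role described above can be sketched as a classic evolutionary loop in which the model is called exactly once per candidate. Everything here is illustrative: `llm_mutate` and `evaluate` are hypothetical stand-ins, not the real interfaces of FunSearch or AlphaEvolve.

```python
import random

def llm_mutate(parent_code: str) -> str:
    """Stand-in for a single, stateless LLM call that rewrites one candidate.
    In the fixed-pipeline style, this is the model's ONLY role."""
    return parent_code + "\n# mutated"

def evaluate(code: str) -> float:
    """Stand-in fitness function (e.g. measured kernel throughput)."""
    return random.random()

def evolutionary_search(seed: str, generations: int = 5, pop_size: int = 4) -> tuple:
    population = [(seed, evaluate(seed))]
    for _ in range(generations):
        # The framework, not the model, handles parent sampling,
        # evaluation, and population management.
        parent, _ = max(random.sample(population, min(2, len(population))),
                        key=lambda p: p[1])
        child = llm_mutate(parent)  # one call, one output; no tools, no retries
        population.append((child, evaluate(child)))
        population = sorted(population, key=lambda p: p[1], reverse=True)[:pop_size]
    return population[0]

best = evolutionary_search("def kernel(): pass")
```

The point of the sketch is the shape of the loop: the model never sees feedback, never tests, and never revises within a single mutation.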

The researchers studied this problem in the context of the attention mechanism. The attention mechanism is the core operator of the Transformer architecture and one of the most intensively optimized GPU operators. The FlashAttention series and NVIDIA's cuDNN library have pushed the attention throughput of successive generations of GPUs to the hardware limit; on the latest Blackwell architecture, both FlashAttention-4 (FA4) and cuDNN require months of manual optimization. To surpass these implementations, continuous and iterative interaction with the development environment is needed: study hardware documentation, analyze profiler output to identify bottlenecks, implement and test candidate optimization schemes, diagnose correctness failures, and revise strategies based on accumulated experience.

The latest progress in deep agents shows that LLMs combined with planning, persistent memory, and tool usage capabilities can autonomously handle such multi-step engineering workflows, with applications ranging from solving complex GitHub issues to generating critical deep learning software. This prompts the LLM to play a completely different role in evolutionary search: instead of restricting it to a fixed pipeline, the deep agent is promoted to the mutation operator itself.

For this reason, NVIDIA proposed the Agentic Variation Operators (AVO). In this mode, a self-directed code agent replaces the mutation and crossover processes in previous single-round LLM or fixed workflow systems. The AVO agent has access to all previous solutions, domain-specific knowledge bases, and evaluation tools. It can independently decide what to consult, what to modify, and when to evaluate, thus achieving continuous improvement over a long period.

To verify its effectiveness, NVIDIA applied AVO to the multi-head attention (MHA) kernel on the NVIDIA Blackwell B200 GPU and directly compared it with the expert-optimized cuDNN and FlashAttention-4 kernels. In a continuous 7-day autonomous evolution without human intervention, the agent explored more than 500 optimization directions and evolved 40 kernel versions. The final generated MHA kernel achieved a maximum throughput of 1668 TFLOPS at BF16 precision, surpassing cuDNN by up to 3.5% and FlashAttention-4 by up to 10.5% in the test configuration.

After analyzing the optimizations the agent discovered, NVIDIA found that they span multiple levels of kernel design, including register allocation, instruction pipeline scheduling, and load distribution, reflecting genuine hardware-level reasoning. Experiments show that the techniques discovered on MHA transfer effectively to grouped-query attention (GQA): the agent needed only an additional 30 minutes of autonomous adaptation to extend the evolved MHA kernel to GQA, improving performance by up to 7.0% over cuDNN and 9.3% over FlashAttention-4.

The main contributions of this research are as follows:

  • Propose Agentic Variation Operators (AVO): This is a new type of evolutionary variation operator that promotes the agent from a simple candidate generator to a mutation operator. The agent autonomously explores domain knowledge, implements modifications, and verifies results through iterative interaction with the environment.
  • Achieve SOTA performance: On the NVIDIA B200 GPU, the researchers achieved the top MHA throughput in the benchmark test configuration, reaching 1668 TFLOPS, outperforming cuDNN by up to 3.5% and FlashAttention-4 by up to 10.5%. In addition, they proved that these optimizations can be easily transferred to GQA, and significant performance gains can be obtained with only 30 minutes of autonomous evolution.
  • Microarchitecture optimization analysis: The researchers conducted a detailed analysis of the microarchitecture optimizations discovered by the agent in the benchmark test settings, indicating that the agent is performing real hardware-level reasoning rather than superficial code transformation.

Say goodbye to the pipeline, AI agents become real “evolutionary operators”

In the traditional LLM-based evolutionary search framework, the model is often trapped in a fixed pipeline and only serves as a candidate code generator. Each call can only output one result, and it cannot actively consult reference materials, test code, understand feedback, or revise strategies before final submission. This limitation is particularly fatal for top-level hardware optimization tasks that require in-depth and repeated iterations.

AVO breaks this limitation and instantiates the “mutation operator” as a self-driven agent loop. This AI agent can freely consult previous code version records, call domain-specific knowledge bases (such as the CUDA programming guide and PTX architecture documentation), and actively propose, fix, criticize, and verify code modifications based on execution feedback.
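The difference from the fixed pipeline can be sketched minimally: the mutation operator itself becomes an inner loop that consults notes, proposes an edit, measures it, and keeps only verified wins. The `evaluate` function below is a toy stand-in for a real benchmark run, and the knowledge-base strings are invented for illustration.

```python
import random

def evaluate(candidate: str) -> float:
    """Toy stand-in for a real benchmark run (e.g. measured TFLOPS)."""
    random.seed(hash(candidate) % 10**6)  # deterministic per candidate within a run
    return random.random()

def agentic_variation(parent: str, knowledge_base: list, budget: int = 10) -> str:
    """Sketch of an AVO-style mutation operator: instead of one stateless LLM
    call, an inner loop proposes a change, tests it against the evaluator, and
    revises based on feedback -- deciding on its own what to try next."""
    best, best_score = parent, evaluate(parent)
    for step in range(budget):
        hint = knowledge_base[step % len(knowledge_base)]  # "consult" domain notes
        candidate = f"{best}\n# try: {hint}"               # propose a modification
        score = evaluate(candidate)                        # run the evaluation tool
        if score > best_score:                             # keep only verified wins
            best, best_score = candidate, score
    return best

kb = ["vectorize loads", "overlap MMA with softmax", "rebalance registers"]
child = agentic_variation("def kernel(): pass", kb)
```

In the real system the "hints" come from the agent's own reading of CUDA/PTX documentation and profiler output, and the evaluation is an actual kernel benchmark; the loop structure is what distinguishes AVO from a single-call mutation.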

In short, AVO promotes the AI from a passive “code generator” to an “evolutionary operator” that controls the overall process.

Running autonomously for 7 days, beating top baselines on the Blackwell architecture

The research team deployed AVO on a very challenging task: optimizing the multi-head attention (MHA) kernel on the NVIDIA Blackwell (B200) GPU. The attention mechanism is the core of the Transformer architecture and one of the most heavily optimized computational targets on AI chips.

Without any human intervention, the AVO agent ran autonomously for 7 consecutive days.

Over these 7 days, the agent explored more than 500 optimization directions in the background and produced 40 valid kernel iterations. The MHA kernel it generated ultimately achieved a peak throughput of 1668 TFLOPS at BF16 precision.
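For context on how such TFLOPS figures are conventionally computed: attention throughput is derived from the analytic FLOP count of the two matrix multiplies (Q·Kᵀ and P·V), roughly 4·B·H·S²·D FLOPs for non-causal MHA. The sketch below shows the bookkeeping; the shapes and runtime are illustrative, not the paper's benchmark configuration.

```python
def attention_tflops(batch: int, heads: int, seq_len: int, head_dim: int,
                     runtime_ms: float, causal: bool = False) -> float:
    """Analytic throughput for an attention kernel: 2*S^2*D multiply-adds for
    Q @ K^T plus 2*S^2*D for P @ V, per (batch, head); causal masking halves it."""
    flops = 4 * batch * heads * seq_len**2 * head_dim
    if causal:
        flops //= 2
    return flops / (runtime_ms * 1e-3) / 1e12

# Illustrative numbers only (not the paper's configuration):
# a kernel at B=4, H=32, S=8192, D=128 finishing in 2.7 ms
print(round(attention_tflops(4, 32, 8192, 128, 2.7), 1))  # ≈ 1628.9 TFLOPS
```

Because the FLOP count is analytic, a faster kernel on the same problem shape reports proportionally higher TFLOPS, which is why small percentage gains over cuDNN are meaningful at these throughput levels.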

In the benchmark tests, AVO's results are striking:

  • Compared with NVIDIA's official closed-source cuDNN library customized for Blackwell, the throughput is increased by up to 3.5%.
  • Compared with the current most advanced open-source benchmark FlashAttention-4, the throughput is increased by up to 10.5%.

Strong generalization, transferring to grouped-query attention in 30 minutes

What's even more impressive is that the underlying microarchitecture optimizations the agent discovered are not overfit to a single scenario. When the researchers asked AVO to adapt the optimized MHA kernel to the grouped-query attention (GQA) widely used in today's large models, the agent completed the task after only about 30 minutes of autonomous adjustment.

In the GQA test, AVO still maintains an absolute leading advantage, with performance up to 7.0% higher than cuDNN and up to 9.3% higher than FlashAttention-4. This shows that the computational and memory access optimization patterns discovered by the agent during the MHA evolution process can be effectively generalized to GQA tasks with different computational characteristics.
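The transfer is plausible because GQA differs from MHA only in how key/value heads are shared: each group of query heads attends to a single K/V head. A minimal NumPy sketch of that head-grouping (shapes are illustrative, and this reference implementation has nothing to do with the evolved kernel's internals):

```python
import numpy as np

def gqa_attention(q, k, v, n_kv_heads):
    """Grouped-query attention: H query heads share n_kv_heads K/V heads.
    q: (H, S, D); k, v: (n_kv_heads, S, D). n_kv_heads == H recovers MHA."""
    H, S, D = q.shape
    group = H // n_kv_heads                        # query heads per K/V head
    out = np.empty_like(q)
    for h in range(H):
        kv = h // group                            # which shared K/V head to use
        scores = q[h] @ k[kv].T / np.sqrt(D)
        p = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
        p /= p.sum(axis=-1, keepdims=True)
        out[h] = p @ v[kv]
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 16, 4))
k = rng.standard_normal((2, 16, 4))
v = rng.standard_normal((2, 16, 4))
o = gqa_attention(q, k, v, n_kv_heads=2)  # 8 query heads, 2 shared K/V heads
```

Since the inner per-head math is identical to MHA, an MHA kernel's tiling and pipelining strategies largely carry over; mainly the K/V indexing and memory-reuse pattern change, which is consistent with the short 30-minute adaptation.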

In-depth microarchitecture reasoning

Analyzing the code changes submitted by AVO shows that the AI agent is not doing superficial work but is conducting real in-depth hardware-level logical reasoning:

  • Branchless accumulator rescaling: By eliminating a conditional branch, the agent removed warp-synchronization overhead, replacing it with a lighter memory barrier and increasing non-causal attention throughput by 8.1% in a single change.
  • Correction and tensor-core (MMA) pipeline overlap: The agent reorganized the execution pipeline, turning originally sequential dependencies into overlapped execution and significantly reducing hardware idle time.
  • Register rebalancing across warp groups: By analyzing profiler data, the agent found that some warp groups were spilling data to slow local memory for lack of registers. It reallocated Blackwell's 2048-register budget across warp groups, squeezing out a further 2.1% improvement.
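The first of these bullets can be illustrated abstractly with the online-softmax rescaling step that FlashAttention-style kernels perform: in the branchy form, the running accumulator is rescaled only when a new row maximum appears; the branchless form always applies the correction factor, relying on exp(0) = 1 to make the no-op case free. A toy scalar sketch in Python (the real optimization operates on per-warp accumulator registers in CUDA, and the function names here are mine):

```python
import math

def online_softmax_update_branchy(acc, m_old, s_new, m_row):
    """Rescale the running accumulator only when a new row max appears."""
    if m_row > m_old:                    # data-dependent branch -> divergence on GPU
        acc = acc * math.exp(m_old - m_row)
        m_old = m_row
    return acc + math.exp(s_new - m_old), m_old

def online_softmax_update_branchless(acc, m_old, s_new, m_row):
    """Always apply the correction factor; exp(0) = 1 makes the no-op case free."""
    m_new = max(m_old, m_row)            # max lowers to a select, not a branch
    acc = acc * math.exp(m_old - m_new)  # unconditional rescale
    return acc + math.exp(s_new - m_new), m_new
```

Both variants compute identical results; the branchless one simply trades a conditional for a cheap unconditional multiply, which is the general shape of the trick, even though the kernel-level win the paper reports comes from the warp-synchronization it avoids.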

This research by NVIDIA shows that AI agents can already reason jointly across multiple hardware subsystems (synchronization, memory ordering, pipeline scheduling, and register allocation). As an evolutionary variation operator not tied to any specific domain, AVO charts a clear path for future automated software optimization: it applies not only to AI chips and the deep-learning software stack, but is also expected to play a major role in any scientific or engineering field with extreme demands on computing power.

Are you scared that AI agents can achieve this level of self-evolution?

Reference links

https://x.com/bingxu_/status/2036983004200149460?s=46

https://x.com/nopainkiller/status/2036986666410532972

This article is from the WeChat official account “Machine Intelligence” (ID: almosthuman2014), author: Machine Intelligence, published by 36Kr with authorization.