
Co-founder of Anthropic Praises Meta: Ushering in the Era of "Agent Evolution" for Advertising Infrastructure

新智元 · 2026-01-15 08:12
An agent framework built on tree-structured chain-of-thought search is rewriting the low-level kernels of Meta's advertising system on complex heterogeneous hardware, in a "self-driving" fashion.

The technological foundation supporting trillion-scale recommendation systems such as Facebook Ads, Instagram Ads, and Reels Ads is undergoing a self-reconstruction driven by AI agents. Facing exponentially growing compute demands and the large-scale deployment of the self-developed MTIA chip, the traditional engineer-tuning model has reached its limit. Meta's latest paper reveals what is happening behind the scenes: an agent framework based on tree-structured chain-of-thought search is rewriting the low-level kernels of Meta's advertising system on complex heterogeneous hardware, in a "self-driving" fashion. The paper shows how automated code generation compresses the development time of NVIDIA GPU, AMD GPU, and Meta Training and Inference Accelerator (MTIA) kernels from weeks to hours, and delivers performance improvements of up to 17x in the production environment.

In Meta's advertising recommendation business, the Deep Learning Recommendation Model (DLRM) is the core technology supporting the daily experiences of billions of users.

However, as the business scales up rapidly, a systemic problem known as the "curse of dimensionality" is becoming a bottleneck to further development.

This problem consists of three dimensions:

  1. Diversity of model architectures: From traditional retrieval, coarse-ranking, and fine-ranking models to Transformer-based sequence models and generative recommendation models, each architecture has completely different computing requirements.
  2. Diversity of operator primitives: Besides traditional dense-compute operators such as matrix multiplication (GEMM), the recommendation system relies on more than 200 data preprocessing operators, covering operations such as feature extraction, normalization, deduplication, and masking. These seemingly simple operators become critical in large-scale deployments (see the sketch after this list).
  3. Hardware heterogeneity: Meta's infrastructure spans multiple generations of NVIDIA GPUs, AMD GPUs, and self-developed MTIA v1-v3 accelerators. Each type of hardware has its own memory hierarchy, programming model, and architectural features, so code cannot be ported directly between them.
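
To make the notion of an "operator primitive" more concrete, the sketch below shows what a few such preprocessing steps look like in plain PyTorch. The operator selection, tensor shapes, and padding convention are illustrative assumptions made for this article, not details from the paper; in production these steps run as hand-written or generated kernels.

```python
# Illustrative PyTorch versions of typical recommendation preprocessing
# primitives (normalization, deduplication, masking). Shapes, the padding
# convention, and the operator choice are assumptions for illustration.
import torch

def preprocess_features(dense: torch.Tensor, ids: torch.Tensor):
    # Normalize dense features column by column.
    dense = (dense - dense.mean(dim=0)) / (dense.std(dim=0) + 1e-6)
    # Deduplicate sparse ID features before the embedding lookup.
    unique_ids = torch.unique(ids)
    # Mask out padding IDs (here we assume 0 marks padding).
    unique_ids = unique_ids[unique_ids != 0]
    return dense, unique_ids

dense = torch.randn(1024, 16)
ids = torch.randint(0, 1000, (1024,))
normalized, deduped = preprocess_features(dense, ids)
print(normalized.shape, deduped.shape)
```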

Figure 1 shows Meta's self-developed MTIA chip: from the macroscopic data-center layout and rack deployment down to the microscopic circuit connections and chip core, it presents MTIA's design for improving AI workload performance and energy efficiency from multiple perspectives.

Figure 2 shows the details of the MTIA 2i architecture. Its core is an 8×8 processing element (PE) array interconnected through an on-chip network. Each PE integrates dual RISC-V cores and four dedicated hardware engines: the MLU for data conversion, the DPE for matrix operations, the RE for aggregation calculations, and the SIMD for vector processing, which are uniformly scheduled by the command processor (CP).

Multiplying these three dimensions together yields thousands of "model-operator-hardware" combinations.

Under the traditional manual optimization method, an experienced kernel engineer needs several weeks to complete a high - performance implementation for a single combination. This development model can no longer meet the rapidly iterating business requirements.

Facing this challenge, Meta proposed KernelEvolve, an agent-based kernel code generation framework that recasts kernel optimization as a graph search and evolution process.

Paper link: https://arxiv.org/abs/2512.23236

The design of KernelEvolve is inspired by evolutionary algorithms, modeling kernel optimization as a classic search problem with four core components (a minimal sketch of the resulting search loop follows this list):

  • Selection Policy: Based on the Upper Confidence Bound (UCB) tree-search algorithm, it selects the most promising optimization direction, dynamically balancing exploration and exploitation according to historical execution results.
  • Universal Operator: This is KernelEvolve's key innovation. Unlike traditional systems that rely on multiple static prompt templates, KernelEvolve uses a single, dynamically adaptive transformation function. Based on the runtime context, including profiling results, error messages, hardware constraints, and the history of prior optimizations, this function synthesizes prompts via retrieval augmentation, enabling large language models to reason jointly about correctness, performance, and architectural trade-offs.
  • Fitness Function: It evaluates both the correctness and the performance of a kernel. The system not only verifies numerical accuracy but also measures execution efficiency with multi-level profiling tools (from the system level down to the instruction level).
  • Termination Rule: The search terminates automatically when the compute budget is exhausted, optimization progress stalls, or a performance threshold is reached.
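
As a rough illustration of how these four components fit together, here is a minimal Python sketch of a UCB-driven search loop over a tree of kernel candidates. The `generate_variant` and `evaluate` functions are hypothetical placeholders standing in for the LLM-driven universal operator and the fitness function; the real system's components are far richer than this.

```python
# Minimal sketch of the search loop described above: UCB selection over a
# tree of kernel candidates, a mutation step, a fitness score, and simple
# termination rules. generate_variant() and evaluate() are placeholders.
import math, random

def generate_variant(code: str) -> str:
    # Placeholder for the LLM-driven "universal operator", which in the
    # real system synthesizes a prompt from profiling and error context.
    return code + f"\n# variant {random.random():.4f}"

def evaluate(code: str) -> float:
    # Placeholder fitness: the real system gates on numerical correctness
    # and then scores speedup from multi-level profilers.
    return random.random()

class Node:
    def __init__(self, code, parent=None):
        self.code, self.parent, self.children = code, parent, []
        self.visits, self.total = 0, 0.0

    def ucb(self, c=1.4):
        if self.visits == 0:
            return float("inf")                    # explore unvisited children first
        return self.total / self.visits + c * math.sqrt(
            math.log(self.parent.visits) / self.visits)

def search(baseline_code, budget=50, target=0.95):
    root = Node(baseline_code)
    best_code, best_fit = baseline_code, 0.0
    for _ in range(budget):                        # termination: budget exhausted
        node = root
        while node.children:                       # selection: descend by UCB score
            node = max(node.children, key=Node.ucb)
        child = Node(generate_variant(node.code), parent=node)
        node.children.append(child)
        fit = evaluate(child.code)                 # fitness: correctness + speed
        if fit > best_fit:
            best_code, best_fit = child.code, fit
        cur = child
        while cur:                                 # back-propagate the score
            cur.visits += 1
            cur.total += fit
            cur = cur.parent
        if best_fit >= target:                     # termination: threshold reached
            break
    return best_code

print(search("def kernel(x): ...")[:60])
```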

This breakthrough has not only made waves in the hardware community but also caught the attention of leading AI figures worldwide.

Jack Clark, co-founder of Anthropic, placed KernelEvolve at the top of his influential weekly newsletter "Import AI" (Issue 439) for in-depth analysis. He praised Meta for using a combination of models such as GPT, Claude, and Llama/CWM to achieve "automation of trillion-scale infrastructure" and argued that this signals that "LLM agents will become the universal compilation layer for heterogeneous AI systems," marking a profound shift in the software engineering paradigm.

Article link: https://jack-clark.net/2026/01/05/import-ai-439-ai-kernels-decentralized-training-and-universal-representations/

Multi-level abstraction and hardware adaptation

A key advantage of KernelEvolve is its support for multiple levels of programming abstraction, covering the entire software-hardware optimization stack from high-level DSLs down to low-level hardware instructions:

  • Triton DSL: for rapid prototyping and cross-platform development (a minimal example follows this list)
  • CuTe DSL: for deep optimization of NVIDIA GPUs
  • Hardware diagnostic languages: for low-level optimization of proprietary accelerators such as MTIA
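
For readers unfamiliar with these DSLs, below is a generic Triton kernel (a simple elementwise add) of the kind such a framework generates and tunes. It is a standard Triton-style example written for this article, not a kernel from the paper, and it assumes a CUDA- or ROCm-capable device.

```python
# A minimal Triton kernel (generic vector add), illustrating the kind of
# Python-embedded DSL code KernelEvolve targets; not taken from the paper.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                       # one program per block of elements
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                       # guard the tail block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)                    # enough blocks to cover all elements
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

if torch.cuda.is_available():
    a = torch.randn(4096, device="cuda")
    b = torch.randn(4096, device="cuda")
    assert torch.allclose(add(a, b), a + b)
```

Launch parameters such as BLOCK_SIZE are exactly the kind of knob an automated search sweeps per platform, since the best value differs across NVIDIA, AMD, and MTIA hardware.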

Figure 3 shows the Triton multi-target compilation architecture. The source code is progressively lowered through MLIR: from platform-independent Triton-MLIR to hardware-specific dialects (GPU/AMDGPU/MTIA), and finally to native binaries for the NVIDIA (PTX), AMD (AMDGCN), and MTIA (RISC-V) platforms.

This multi - level design enables KernelEvolve to select the most appropriate abstraction level for each hardware platform.

More importantly, the system integrates a persistent knowledge base that encodes hardware-specific constraints and accumulated optimization experience. This allows the system to generate effective kernel code even for proprietary accelerators that do not appear in the training corpora of large language models.

Agent architecture and self-improvement

KernelEvolve adopts a sophisticated agent architecture with multiple specialized sub-agents:

  • Context memory sub-agent: analyzes dynamic runtime information (kernel implementations, performance measurements, error diagnostics), diagnoses performance bottlenecks, and synthesizes optimization instructions.
  • Deep search sub-agent: performs deeper search and analysis when complex optimization scenarios are encountered.
  • Hardware interpreters: provide specialized execution environments for the NVIDIA, AMD, and MTIA platforms, ensuring that code is evaluated accurately on real hardware.
  • LLM synthesizer: generates dynamic prompts and can interface with external models (Claude 4.5, GPT-5) or Meta's internal Code World Model (CWM).

The system also maintains a complete metadata store that records the execution score and parent-child relationship of every node in the search tree, supporting continuous learning and iterative improvement of the optimization strategy.
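
The paper describes the universal operator and the LLM synthesizer as assembling prompts from runtime context via retrieval augmentation. The sketch below is one possible shape of that idea; every field name and the knowledge-base lookup are assumptions made for illustration, not Meta's actual interface.

```python
# Hedged sketch of retrieval-augmented prompt synthesis from runtime context,
# in the spirit of the "universal operator" and LLM synthesizer described
# above. All field names and the knowledge-base lookup are illustrative.
from dataclasses import dataclass, field

@dataclass
class RuntimeContext:
    kernel_source: str
    profile_summary: str                 # e.g. "memory-bound, 41% L2 hit rate"
    error_log: str = ""
    hardware: str = "MTIA"
    history: list = field(default_factory=list)

def synthesize_prompt(ctx: RuntimeContext, knowledge_base: dict) -> str:
    # Retrieve hardware-specific constraints and past optimization notes.
    hw_notes = knowledge_base.get(ctx.hardware, "")
    parts = [
        f"Target hardware: {ctx.hardware}. Constraints: {hw_notes}",
        f"Current kernel:\n{ctx.kernel_source}",
        f"Profiler findings: {ctx.profile_summary}",
    ]
    if ctx.error_log:
        parts.append(f"Last failure:\n{ctx.error_log}")
    if ctx.history:
        parts.append("Previously attempted transformations: " + "; ".join(ctx.history))
    parts.append("Propose one transformation that improves performance "
                 "while preserving numerical correctness.")
    return "\n\n".join(parts)
```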

Figure 4 shows the system architecture (top) and execution workflow (bottom) of KernelEvolve. The system uses a "self-evolving" tree-search state machine to coordinate sub-agents, evaluation tools, and AI hardware interpreters (MTIA/GPU/AMD). It dynamically generates Triton kernel candidates using LLM backends such as Claude 4.5, GPT-5, or Meta's internal CWM, and achieves closed-loop exploration and performance optimization through a persistent knowledge base and metadata store.

Closed-loop evolution

End-to-end evaluation pipeline

If tree search is the "brain" of KernelEvolve, then the end-to-end evaluation pipeline is its "neural reflex arc."

Meta did not simply hand the generated code to a compiler; it built a highly rigorous, automated verification and performance-feedback loop. KernelEvolve's complete workflow reflects this engineering rigor: the system is divided into three main modules that together form a closed optimization loop:

Left: Tree search engine

This is the "brain" of the entire system, maintaining a dynamically evolving search tree. Each node in the tree represents a kernel candidate solution, including both the PyTorch baseline implementation and the Triton optimized version.

The system ensures that an AI-generated kernel is mathematically equivalent to the native code by comparing the outputs of the two implementations on multiple identical inputs, eliminating at the root the accuracy risk that LLM-generated code could introduce (a simplified version of this check appears below). The search engine traverses the tree using the UCB strategy, continuously exploring new optimization paths. When a new candidate needs to be generated, the system calls a non-LLM static code generator that quickly produces standardized evaluation-harness code from templates.
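
A simplified version of that correctness gate might look like the following; the tolerances, the number of trials, and the (baseline, candidate) function interface are assumptions made for illustration.

```python
# Sketch of the correctness gate: run the PyTorch baseline and the generated
# kernel on the same random inputs and require numerical agreement.
# Tolerances and the (baseline_fn, candidate_fn) interface are assumptions.
import torch

def passes_correctness(baseline_fn, candidate_fn, make_inputs, trials=8,
                       rtol=1e-3, atol=1e-3) -> bool:
    for seed in range(trials):
        torch.manual_seed(seed)              # reproducible random inputs
        inputs = make_inputs()
        ref = baseline_fn(*inputs)           # PyTorch reference result
        out = candidate_fn(*inputs)          # generated-kernel result
        if not torch.allclose(ref, out, rtol=rtol, atol=atol):
            return False                     # any mismatch rejects the candidate
    return True
```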

Middle: AI toolchain code generation

This is the "source of creativity" of the system. The generated code is sent to a specialized toolchain for compilation and performance analysis.

Notably, KernelEvolve adopts a multi-level, multi-dimensional evaluation strategy: TritonBench verifies functional correctness, the Torch Profiler provides a system-level performance view, NVIDIA NCU performs in-depth analysis at the GPU instruction level, the Triton Proton tool measures latency inside the kernel, and MTIA Insight provides dedicated diagnostics for Meta's self-developed chips. The feedback from these profiling tools is fed back into the search engine to guide the next round of iteration (a simplified latency measurement is sketched below).
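
On the latency side, a stripped-down version of this feedback could be expressed with Triton's built-in benchmarking helper, as below. The real pipeline layers NCU, Proton, and MTIA Insight on top of such timings, and the wiring shown here is an assumption; it also assumes a GPU is available.

```python
# Simplified latency feedback: time a candidate kernel with Triton's
# benchmarking helper and report the speedup over the PyTorch baseline.
# The production pipeline adds NCU, Proton, and MTIA Insight on top.
import triton.testing

def measure_speedup(baseline_fn, candidate_fn, *inputs) -> float:
    base_ms = triton.testing.do_bench(lambda: baseline_fn(*inputs))
    cand_ms = triton.testing.do_bench(lambda: candidate_fn(*inputs))
    return base_ms / cand_ms                 # >1.0 means the candidate is faster
```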

Right: Heterogeneous AI hardware platforms

This is the "testing ground" of the system. KernelEvolve is equipped with specialized interpreters for each hardware platform. Each interpreter can collect hardware - specific performance indicators in real - time, such as GPU memory throughput, L2 cache hit rate, computing unit utilization, and other fine - grained data, and can even track specific stalled instructions.

These hardware - level insights provide valuable optimization clues for LLMs.

The entire process forms an adaptive "generate-evaluate-feedback" cycle: the search engine selects candidate nodes → the code generation toolchain produces implementations → the hardware interpreters execute them and collect performance data → the multi-dimensional analysis tools provide diagnostic feedback → the search engine adjusts its strategy based on the feedback.

This tightly integrated evaluation pipeline enables KernelEvolve to complete the optimization exploration that human engineers need weeks to finish within hours.

Figure 5 shows the end-to-end evaluation pipeline: through tree search, the system generates candidate kernels with standard dual implementations (PyTorch baseline and Triton optimization) and executes them on specialized hardware interpreters (GPU, AMD, MTIA). It uses tools such as TritonBench, NCU, MPP, and MTIA Insight to collect platform-specific profiling metrics, and the feedback directly guides subsequent search iterations. To automate evaluation across heterogeneous accelerators, KernelEvolve builds standardized interpreter environments on Meta's Bento platform, integrating the full software stack, compilation toolchain, and runtime dependencies.

Industrial-level verification

From benchmark to production

The effectiveness of KernelEvolve has been verified at multiple levels.

Benchmark test performance

On the public KernelBench test set, KernelEvolve demonstrated excellent robustness:

  • Achieved a 100% pass rate on all 250 problems in three difficulty levels
  • Tested 160 PyTorch ATen operators on three heterogeneous hardware platforms
  • All 480 "operator - platform" configurations were correct, with an accuracy rate of 100%

Production environment deployment

What's more impressive is its performance in Meta's real production environment:

  • Performance improvement: Across diverse advertising training and inference workloads, the generated kernels delivered speedups of up to 17x over their baseline implementations.