A milestone in AI programming: Google's AI writes its own code and stuns its engineers, with a GPU kernel algorithm that beats human-written code by 21%.
[Introduction] AlphaEvolve has just scored another big win. OpenEvolve, an open-source implementation of its approach, has taught itself to write code and evolved GPU kernels on Apple silicon that are 21% faster than human-written ones. This is a true milestone in the history of automated programming: the era of "AI programming for AI" has officially begun, and the automation singularity is drawing near.
Google's AlphaEvolve continues to create new miracles.
In mid-May, Google dropped a bombshell, hailed as mathematics' "Move 37" moment in a nod to AlphaGo, that has been challenging assumptions ever since: AI has acquired the ability to self-evolve.
Since then, many developers have confirmed in code that AlphaEvolve's matrix-multiplication breakthrough is real: one developer verified that it correctly performs 4×4 matrix multiplication using only 48 multiplications.
Now Asankhaya Sharma, co-founder and CTO of patched.codes, has used OpenEvolve, an open-source implementation based on the AlphaEvolve paper, to automatically discover high-performance GPU kernel algorithms.
Specifically, through self-evolving code, the system discovered a set of Metal GPU kernels on Apple Silicon that far exceed manually optimized ones.
On real-world Transformer inference tasks, they delivered an average performance improvement of 12.5%, with peak gains of 106%.
That puts the evolved kernels a full 21% ahead of the human engineers' code.
Without being given any human GPU-programming expertise, the system discovered the following optimizations:
· SIMD optimization matched to the hardware
· Two-pass online softmax
· GQA-specific memory layout optimization
This is more than a simple performance jump; it is a true milestone in the history of automated programming: a system that can, without human intervention, find optimization paths in complex hardware architectures that even experts struggle to spot.
More importantly, the result is not confined to a laboratory or a paper. It runs in the real world, on Apple chips, on today's most mainstream AI model workloads.
That demonstrates the practical usability of automated code optimization in production systems.
It marks the start of a new era: instead of humans hand-writing optimizations for machines, machines are beginning to write better code for themselves.
As hardware architectures keep iterating at high speed, tools like OpenEvolve will only grow more valuable, uncovering deep optimization opportunities that are extremely hard to find by human effort alone.
Challenge: GPU Kernel Optimization
Why is the GPU kernel optimization that OpenEvolve tackled so challenging?
Modern Transformer models rely on highly optimized attention kernels, and writing high-performance GPU code demands deep expertise in all of the following areas:
· Details of specific hardware architectures (such as Apple Silicon's unified memory and SIMD units)
· Low-level programming languages (such as the Metal Shading Language)
· Numerical algorithm design (such as attention mechanisms and numerical stability)
· Memory access pattern optimization
So, could OpenEvolve evolve stronger GPU kernel code entirely on its own, with no human writing any of it?
To find out, Sharma chose the Grouped Query Attention (GQA) implementation of the Qwen3-0.6B model as the test target: could OpenEvolve automatically generate a "scaled_dot_product_attention" kernel that outperforms MLX's production-grade code?
The project's target configuration was as follows:
· Model: Qwen3-0.6B (40 query heads, 8 key-value heads)
· Hardware: Apple M-series GPUs with unified memory
· Baseline: MLX's highly optimized attention implementation
· Challenge: fully automatic discovery of Metal kernel optimizations
Evolution Method
Sharma configured OpenEvolve to evolve the Metal kernel source directly while preserving its integration with the MLX framework.
The system started from a basic three-stage attention implementation and ran for more than 25 generations of evolution.
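For reference, the "basic three-stage attention" starting point can be sketched as follows. This is an illustrative NumPy model of the algorithm (function name, shapes, and the single-head simplification are my assumptions), not the project's actual Metal source:

```python
import numpy as np

def attention_three_stage(q, k, v, scale):
    """Naive three-stage attention for one head:
    (1) scores, (2) softmax, (3) weighted sum.

    q: (L, D) queries; k, v: (S, D) keys and values.
    """
    # Stage 1: scaled attention scores.
    scores = scale * (q @ k.T)                      # (L, S)
    # Stage 2: row-wise softmax (max-subtracted for numerical stability).
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # (L, S)
    # Stage 3: weighted sum over the values.
    return weights @ v                              # (L, D)
```

Each stage reads and writes the full score matrix, which is exactly the memory traffic the evolved kernels later reduce.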
Evolution Settings
Evaluation Strategy
Each kernel produced by evolution was tested comprehensively along the following dimensions:
- Correctness verification: compare numerical precision against the MLX baseline to ensure the results are correct.
- Performance testing: benchmark across 20 diverse inference scenarios, including short/long contexts and generation tasks.
- Safety checks: GPU error detection and Metal memory access verification.
- Robustness analysis: statistical analysis over repeated runs to confirm stable performance.
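The evaluation loop above can be sketched as a small harness. This is a hypothetical illustration of the idea (function names, tolerance, and scoring are my assumptions, not OpenEvolve's actual evaluator):

```python
import time
import numpy as np

def evaluate_kernel(candidate, baseline, scenarios, atol=1e-4, repeats=5):
    """Score a candidate kernel: reject it on any numerical mismatch,
    otherwise return its mean speedup over the baseline.

    candidate/baseline: callables taking (q, k, v); scenarios: input tuples.
    """
    speedups = []
    for q, k, v in scenarios:
        # Correctness gate: numerical precision against the baseline.
        if not np.allclose(candidate(q, k, v), baseline(q, k, v), atol=atol):
            return None  # incorrect kernels score nothing

        def timed(fn):
            # Robustness: median over repeated runs damps timer noise.
            ts = []
            for _ in range(repeats):
                t0 = time.perf_counter()
                fn(q, k, v)
                ts.append(time.perf_counter() - t0)
            return sorted(ts)[len(ts) // 2]

        speedups.append(timed(baseline) / timed(candidate))
    return float(np.mean(speedups))
```

The key design choice is the hard correctness gate: a fast-but-wrong kernel must never outscore a correct one, or evolution will optimize toward broken code.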
Key Optimizations
Remarkably, during evolution OpenEvolve independently discovered the following optimization strategies, each a genuine algorithmic innovation.
1. SIMD Optimization for Apple Silicon
On closer inspection, one highlight is an optimization OpenEvolve found entirely on its own:
for 128-dimensional attention heads, processing data in groups of 8 aligns perfectly with the SIMD width of Apple Silicon hardware.
It is as if the system automatically hit the hardware's "sweet spot": with no manual tuning, it maximizes performance and hardware utilization.
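The arithmetic behind the "perfect fit" is that 128 splits evenly into 16 groups of 8, so no remainder handling is needed. A minimal NumPy model of grouped accumulation (illustrative only; the real kernel does this with Metal vector types, and the function name is mine):

```python
import numpy as np

def dot_in_groups_of_8(a, b):
    """Dot product of two 128-dim vectors accumulated in groups of 8,
    mirroring how a 128-dim head maps onto 8-wide SIMD lanes."""
    assert a.shape == b.shape == (128,)
    # 128 dims split exactly into 16 chunks of 8 -- no tail loop required.
    partial = (a.reshape(16, 8) * b.reshape(16, 8)).sum(axis=1)
    return partial.sum()
```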
2. Two - Pass Online Softmax
Here OpenEvolve made a clever move: it fused two originally separate steps, softmax normalization and value accumulation, into a single computational loop.
The traditional algorithm needs three passes: compute the attention scores, normalize them, then take the weighted sum over the values.
The evolved kernel does it in two passes. The flow is simpler and memory-bandwidth usage drops significantly, which naturally means higher speed and lower resource consumption.
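The two-pass idea can be sketched for a single query row. Pass 1 finds the score maximum; pass 2 fuses the softmax denominator with the value accumulation, so the separate normalize and accumulate loops collapse into one. This is an illustrative NumPy sketch of the technique, not the evolved Metal kernel:

```python
import numpy as np

def attention_two_pass(q_row, k, v, scale):
    """Two-pass attention for one query row.

    q_row: (D,); k, v: (S, D).
    """
    S, D = k.shape
    # Pass 1: running maximum of the scaled scores (for stability).
    m = -np.inf
    for j in range(S):
        m = max(m, scale * float(q_row @ k[j]))
    # Pass 2: accumulate the softmax denominator and the weighted
    # value sum together, in a single loop over the keys.
    denom = 0.0
    acc = np.zeros(D)
    for j in range(S):
        w = np.exp(scale * float(q_row @ k[j]) - m)
        denom += w
        acc += w * v[j]
    return acc / denom
```

Because the exponentiated weights are consumed as soon as they are produced, they never need to be written back to memory, which is where the bandwidth saving comes from.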
3. Specific Memory Layout Optimization for GQA
The innovation here is an optimization tailored to the Qwen3 model's particular structure.
The model's ratio of query heads to key-value heads is 40:8 (i.e., 5:1). The system exploits this to design a coalesced memory access pattern.
The pattern fits Apple Silicon's unified memory architecture especially well, yielding very high efficiency.
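The 5:1 structure means each KV head serves a contiguous block of five query heads, so consecutive query heads read the same K/V tensors; that shared locality is what a coalesced layout can exploit. A minimal sketch of the head mapping (function name and defaults are mine, for illustration):

```python
def kv_head_for_query_head(q_head, n_q_heads=40, n_kv_heads=8):
    """Map a query head to the KV head it shares under GQA.

    With Qwen3-0.6B's 40:8 ratio, query heads 0-4 share KV head 0,
    heads 5-9 share KV head 1, and so on.
    """
    group = n_q_heads // n_kv_heads  # 5 query heads per KV head
    return q_head // group
```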
Evaluation Results
Sure enough, the evolved kernels showed significant performance improvements across a range of comprehensive benchmarks:
Core Performance Indicator Gains
- Decode speed: +12.5% on average (σ = 38.3%)
- Prefill speed: +14.4% on average (σ = 17.6%)
- Total throughput: +10.4% on average (σ = 30.7%)
- Memory usage: 0.99% average reduction (σ = 1.7%)
- Correctness: 100% numerical precision maintained
- Reliability: zero GPU errors or kernel crashes
Detailed Benchmark Test Results
Most notably, on repetitive-pattern generation tasks the evolved kernels boosted decode speed by a whopping 106%.
This shows just how strong the kernel is on specific classes of workload.
Statistical Analysis
Overall, the statistics show that OpenEvolve optimizes certain workload classes very effectively, uncovering performance headroom that the original hand-written code leaves untapped.
Across 20 test tasks, 7 showed significant improvement, with gains exceeding 25%, a genuine qualitative leap:
Significant gain (>25%): 7 of 20 benchmarks
Moderate gain (5-25%): 3 of 20 benchmarks
Unchanged (±5%): 4 of 20 benchmarks
Decline (<-5%): 6 of 20 benchmarks
The Unsung Hero: A High-Robustness Evaluation System
The project's success owes much to the evaluation system behind OpenEvolve.
It is no ordinary benchmarking tool: it is purpose-built for "hardcore" code like GPU kernels and addresses the many hazards of GPU kernel development.
GPU Safety Features
Command buffer protection: automatically detect and recover from Metal command buffer errors.
Memory access violation handling: safely contain GPU memory access violations.
Retry logic: exponential back-off retries for transient GPU errors.
Fallback mechanism: gracefully degrade to an alternative implementation when a kernel fails completely.
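The retry and fallback behavior described above can be sketched as a small wrapper. This is a hypothetical illustration (names, the use of RuntimeError as a stand-in for a transient GPU error, and the delay constants are all my assumptions):

```python
import time

def run_with_retries(kernel, fallback, max_retries=3, base_delay=0.05):
    """Run a GPU kernel with exponential back-off retries on transient
    errors, degrading to a known-good implementation if every attempt
    fails, so an evolved kernel can never take the whole run down."""
    for attempt in range(max_retries):
        try:
            return kernel()
        except RuntimeError:
            # Transient failure: wait base_delay * 2^attempt, then retry.
            time.sleep(base_delay * (2 ** attempt))
    # Graceful degradation after exhausting all retries.
    return fallback()
```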
Comprehensive Error Statistics
Thanks to this evaluation system's stability and robustness, OpenEvolve can boldly try aggressive optimization schemes without fear of making things worse.
Experimental code like a GPU kernel is error-prone: one small mistake can crash the whole program.
With such a robust safety net in place, the system is free to explore new optimizations and keep improving performance.
In-Depth Technical Analysis
Evolutionary Architecture for GPU Kernels
The project's success also depends on several OpenEvolve components working together:
- Intelligent code markers: special markers ensure that evolution targets only the Metal kernel source while fully preserving the MLX framework integration code.