Tian Yuandong's Startup's First Milestone: GPU Kernel Optimization Reaches SOTA on NVIDIA's Official Leaderboard

Let AI systems improve AI systems themselves

Tian Yuandong's startup has just presented its first research results.

Tian Yuandong announced on X that Recursive, the company he co-founded, achieved SOTA (state of the art) overall and in four subcategories on SOL-ExecBench, NVIDIA's official GPU kernel optimization leaderboard.

The result beat not only solutions handwritten by human GPU experts but also "other AI systems developed by GPU experts".

Recursive also achieved SOTA on two other difficult benchmarks.

One of them is NanoGPT Speedrun, an extreme-optimization track that programmers around the world have been grinding on for two years and that was widely thought to be close to its limit.

Recursive's AI system pushed the record further.

From proposing ideas, writing code, running experiments, and judging results to deciding what to do next, the entire research process was carried out by the AI itself.

The concept of "AI researching AI" has become a reality.

SOTA in All Three Benchmarks

Recursive has just made public the first batch of results of its automated AI research system.

The system achieved SOTA results on three different benchmarks, which cover language model training under a fixed budget, small-model training speed, and GPU kernel optimization.

The system is designed so that the AI completes the entire research cycle by itself.

It autonomously proposes improvement ideas for a target, implements them in code, runs experiments to verify their effect, and then decides the next step based on the results.

The system can run multiple research threads simultaneously, retain the useful experience accumulated in earlier experiments, and combine promising directions from different threads.

In addition, before counting an improvement as real progress, the system specifically checks whether it is a reward hack or merely the result of random noise.
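
As a rough illustration of the propose, implement, evaluate, verify cycle described above, here is a toy Python sketch. It is not Recursive's system: the objective `noisy_loss`, the proposal rule, the acceptance margin, and names such as `research_loop` are all hypothetical, and the "research" is reduced to tuning a short list of numbers. The point is only the shape of the loop, in particular re-running a promising candidate several times before accepting it so that one lucky measurement is not counted as progress.

```python
import random
import statistics


def noisy_loss(params, rng):
    """Hypothetical experiment: a quadratic objective plus measurement noise."""
    true_loss = sum((p - 3.0) ** 2 for p in params)
    return true_loss + rng.gauss(0.0, 0.05)


def mean_loss(params, rng, repeats=5):
    """Average several noisy runs to reduce the effect of random variation."""
    return statistics.mean(noisy_loss(params, rng) for _ in range(repeats))


def research_loop(initial, steps=200, margin=0.02, seed=0):
    rng = random.Random(seed)
    best = list(initial)
    best_score = mean_loss(best, rng)
    history = []  # accepted improvements, kept as reusable "experience"
    for step in range(steps):
        # 1. Propose: perturb the current best solution.
        candidate = [p + rng.gauss(0.0, 0.3) for p in best]
        # 2. Quick experiment: discard ideas that do not even look better once.
        if noisy_loss(candidate, rng) >= best_score:
            continue
        # 3. Verify: repeat the experiment and require a clear margin.
        candidate_score = mean_loss(candidate, rng)
        if candidate_score < best_score - margin:
            best, best_score = candidate, candidate_score
            history.append((step, candidate_score))
    return best, best_score, history


best, best_score, history = research_loop([0.0, 0.0])
```

In this sketch a candidate must clear two hurdles: a cheap single run, then an averaged re-run with a margin. A real system would replace each step with something far richer, but the accept-only-after-verification structure is the part the article emphasizes.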

Recursive chose these three benchmarks because they correspond to three core levers of AI progress: better training algorithms, faster training, and more efficient hardware utilization.

All three tasks have clear evaluation metrics and low result variance, and their evaluation methods can be continually hardened to keep the system from exploiting loopholes. That makes them well suited to an AI running the research cycle on its own.

The first benchmark is NanoChat Autoresearch. The task is to train a small language model to the lowest possible validation loss (measured by BPB) within a fixed budget of five minutes on a single GPU.
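
For readers unfamiliar with the metric, BPB (bits per byte) is commonly computed by converting the model's total cross-entropy on the validation text from nats to bits and dividing by the number of bytes in that text, which makes the score comparable across tokenizers. The sketch below shows that conversion. The function name and the example numbers are illustrative only and are not taken from the NanoChat code.

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per byte."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_nll_nats / (math.log(2) * total_bytes)


# Made-up example: 1,000 tokens at an average loss of 2.6 nats per token,
# covering 4,000 bytes of validation text.
example_bpb = bits_per_byte(total_nll_nats=1000 * 2.6, total_bytes=4000)
```

Lower is better: a model that needs fewer bits to encode each byte of held-out text is predicting that text more accurately.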

There is currently a public collaborative project called autoresearch@home for this task, in which dozens of humans and hundreds of agents jointly optimize the solution.

Recursive's system started searching from the same initial solution. Once several reward hacks were removed from the previous best community solution, that solution averaged 0.9372 BPB, while the solution found by Recursive's system reached 0.9109 BPB.

In terms of training time, Recursive's solution reaches the loss of Karpathy's original version in only about 77% of the time needed by the best community solution.

The system also ran a second set of experiments. Starting from a simple vanilla Transformer with AdamW, it lowered the validation loss from 1.059 BPB to 0.9344 BPB, which also beats the community's current best result.

The second benchmark is NanoGPT Speedrun. The task is to minimize the time needed to train a small GPT model to a fixed validation loss (3.28) on a single node with eight H100 GPUs.

The project has a two-year history. The community has contributed 83 record-breaking solutions, cutting the training time from about 45 minutes initially to 79.7 seconds. Little obvious room for improvement is left for latecomers.

Building on the existing best solution, Recursive's system reduced the training time further, from 79.7 seconds to 77.5 seconds, while still meeting the leaderboard's statistical-significance requirement for the validation loss.

This improvement is comparable to, or even larger than, recent improvements made by human contributors.
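
To put the figures quoted above in proportion, the short calculation below derives the relative gains from the article's own numbers (about 45 minutes, 79.7 seconds, and 77.5 seconds). The helper names are just for illustration.

```python
def speedup(before_s: float, after_s: float) -> float:
    """How many times faster the new run is."""
    return before_s / after_s


def percent_reduction(before_s: float, after_s: float) -> float:
    """Share of the original time that was removed, in percent."""
    return 100.0 * (before_s - after_s) / before_s


community_speedup = speedup(45 * 60, 79.7)   # roughly 34x over two years
latest_gain = percent_reduction(79.7, 77.5)  # roughly 2.8 percent
```

A gain of under three percent looks small next to the community's cumulative speedup, but each additional second gets harder to remove as the record approaches its floor.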

The system was also tested from a weaker starting point, a solution that takes about 15 minutes. Within a few days it compressed the training time to about 185 seconds, close to the roughly 180 seconds on the human leaderboard in May 2025.

The third benchmark is NVIDIA's SOL-ExecBench. The task is to write correct and fast implementations of 235 GPU kernels drawn from real-world workloads.

The kernels span matrix multiplication, reduction, normalization, attention components, quantization, and fused operators, and are evaluated on a B200 GPU.

The benchmark reports results as a Speed-of-Light (SOL) score. A score of 0.5 corresponds to an optimized PyTorch baseline, and a score of 1.0 corresponds to the theoretically optimal performance.

Recursive ran the system on all 235 kernels simultaneously, so that techniques discovered on one task could be reused on related tasks. In the end, it raised the average SOL score from 0.699, the previous best on the leaderboard, to 0.754.
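
The article gives only the two anchor points of the SOL score (0.5 at the optimized PyTorch baseline, 1.0 at the theoretical optimum), not the formula itself. The sketch below uses one simple mapping that satisfies both anchors, purely as an assumption for illustration. It may differ from NVIDIA's exact definition, and `sol_score`, `mean_sol_score`, and the timing values are hypothetical.

```python
def sol_score(t_kernel: float, t_baseline: float, t_sol: float) -> float:
    """Assumed score: 0.5 at the baseline runtime, 1.0 at the speed-of-light bound.

    t_kernel:   measured runtime of the candidate kernel
    t_baseline: runtime of the optimized baseline
    t_sol:      theoretical minimum runtime on the hardware
    """
    if not 0 < t_sol < t_baseline:
        raise ValueError("expected 0 < t_sol < t_baseline")
    if t_kernel < t_sol:
        # Faster than the theoretical bound usually means a measurement
        # problem or an exploit rather than a real speedup.
        raise ValueError("runtime below the theoretical bound")
    return 1.0 / (1.0 + (t_kernel - t_sol) / (t_baseline - t_sol))


def mean_sol_score(results):
    """Average the score over (t_kernel, t_baseline, t_sol) triples."""
    return sum(sol_score(*r) for r in results) / len(results)


# Made-up timings in milliseconds for three kernels.
average = mean_sol_score([(2.0, 2.0, 1.0), (1.0, 2.0, 1.0), (1.5, 2.0, 1.0)])
```

Under this assumed mapping, a kernel slower than the baseline scores below 0.5, and the score approaches 1.0 only as the runtime approaches the hardware bound.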

On all three benchmarks, reward hacking is a problem the Recursive team had to confront directly.

The problem is particularly prominent on SOL-ExecBench. Some candidate solutions inflate their scores by caching outputs, relying on persistent state, or exploiting the evaluation's timing mechanism.

To address this, the team builds correctness review into the research cycle, requiring candidate improvements to pass increasingly strict automated checks before they are recognized as real performance gains.
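
One of the failure modes mentioned above, caching outputs across calls, can be caught by evaluating each candidate on freshly generated inputs and comparing against a trusted reference. The sketch below is a minimal, hypothetical version of such a check in plain Python. `passes_fresh_input_check`, `reference_scale`, and `CachingCheat` are made-up names, and a real harness for GPU kernels would also need to guard against persistent device state and timing tricks, which this example does not attempt.

```python
import random


def reference_scale(xs):
    """Trusted reference implementation: double every element."""
    return [2.0 * x for x in xs]


def honest_scale(xs):
    """A legitimate candidate that recomputes the result on every call."""
    return [x + x for x in xs]


class CachingCheat:
    """Computes the answer once, then replays it regardless of the input."""

    def __init__(self):
        self._cached = None

    def __call__(self, xs):
        if self._cached is None:
            self._cached = reference_scale(xs)
        return self._cached


def passes_fresh_input_check(candidate, reference, trials=5, size=16,
                             seed=0, tol=1e-9):
    """Compare candidate and reference on new random inputs each trial."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(size)]
        got = candidate(list(xs))
        want = reference(list(xs))
        if len(got) != len(want):
            return False
        if any(abs(g - w) > tol for g, w in zip(got, want)):
            return False
    return True
```

Here `passes_fresh_input_check(honest_scale, reference_scale)` succeeds, while a `CachingCheat()` instance passes the first trial and fails on the second, because its replayed output no longer matches the new input.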

Recursive said it will open-source the materials generated in these experiments for external review and reuse. The team is still waiting for official hardware access before formally submitting its results to the NanoGPT Speedrun leaderboard.

Let AI Train Itself

Recursive Superintelligence (RSI for short) emerged from stealth mode last month.

The company currently has a team of fewer than 30 people. It has raised $650 million in a funding round at a valuation of $4.65 billion, approximately 31.6 billion yuan.

The round was co-led by Google's GV and Greycroft, with NVIDIA, AMD, and others also participating.

RSI's core thesis is that although scaling laws in the pre-training stage still matter, the marginal returns from simply adding more data, more computing power, and more parameters are declining.

RSI is betting on recursive self-improvement.

Put simply, that means letting an AI system continuously improve itself and then using this ability to drive broader scientific discovery.

According to RSI's roadmap, the first step is to train a system with the capabilities of "50,000 PhDs" to automate AI research itself; the second step is to apply this system to fields such as drug research and development, battery materials, and nuclear fusion physics.

RSI has eight co-founders. They previously served as research leaders at institutions such as OpenAI, Google DeepMind, Meta AI, Salesforce, and Uber, and most of them have a track record as successful entrepreneurs.

CEO Richard Socher was a doctoral student of Andrew Ng at Stanford and is one of the authors of ImageNet and GloVe. MetaMind, which he founded, was acquired by Salesforce, and he later founded the AI search engine You.com, which is valued at $1.5 billion.

Tian Yuandong previously served as a research scientist director at Meta FAIR. He has long worked on reinforcement learning, foundation model efficiency, and neural networks, and is one of the authors of ELF OpenGo.

Shi Tianlin graduated from the Yao Class at Tsinghua University and is a co-founder of Cresta. Cresta grew out of the Stanford AI Laboratory and applied the Transformer model to real-time customer service in 2019.

Alexey Dosovitskiy is one of the authors of the Vision Transformer, which in 2020 proposed applying the Transformer directly to sequences of image patches.

Tim Rocktäschel previously led open-endedness research at Google DeepMind and is currently a professor of artificial intelligence at UCL. The Rainbow Teaming method he and his collaborators proposed has been widely used by AI safety teams for red-team testing.

Josh Tobin was an early member of OpenAI and led OpenAI's Agents Research team.

Caiming Xiong previously led AI Research and Applied AI at Salesforce and has worked with Socher for a long time. The two co-authored work on controllable text generation such as CTRL.

Jeff Clune has long worked on open-ended evolutionary algorithms, AI-generating algorithms, and AI safety. He is also one of the authors of the Darwin Gödel Machine paper, which explores letting an AI system modify its own code and then using benchmarks to verify that the changes are real improvements.

Taken together, the resumes of the eight co-founders read like a cross-section of the AI industry, and the company's name reflects the ambition of what they have chosen to do together.

Just over a month after being valued at $4.65 billion, a team of fewer than 30 people has presented three SOTA results that can be reproduced and verified externally, which goes some way toward justifying that valuation.

Judging from these results, "AI improving AI" has taken its first step, and the team has stated clearly that it will go on to apply the system to more complex real-world research tasks.

Reference Links:

[1] https://x.com/tydsh/status/2065230411840827427

[2] https://www.recursive.com/articles/first-steps-toward-automated-ai-research

This article is from the WeChat official account "QbitAI", author: Keleixi. It is published by 36Kr with authorization.

