
Tian Yuandong's recursive AI company, the kind Anthropic warned about, has just taken its first steps

机器之心, 2026-06-12 11:54
A system that autonomously drives the AI research cycle forward and has improved the best results on three benchmark tests.

A few days ago, Anthropic published an article titled “When AI Builds Itself”, which quickly sparked intense discussion. The article discloses a remarkable amount of internal data: By May 2026, over 80% of Anthropic's codebase was written by Claude. The daily code merge volume of engineers is eight times higher than in 2024. In an internal test, Claude sped up a piece of training code by about 52 times relative to the baseline, whereas an experienced human researcher typically needs 4 to 8 hours to achieve a 4-fold speedup.

Anthropic sees this trajectory pointing toward a deeper goal: “Recursive Self-Improvement”, an AI system that autonomously designs, builds, and trains its successor versions without human intervention at every step. Notably, the company also calls on the industry to coordinate so that the development of top-tier AI can be temporarily or even completely halted when the moment of recursive self-improvement arrives. And Anthropic is already acting on this: the use of the latest Claude Fable 5 in the research and development of top-tier AI is restricted.

Now, Recursive Superintelligence has announced that it has taken the first step towards automated AI research.

This new company, co-founded by Tian Yuandong, came out of stealth just a month ago and has now published its first public technical result. The team has developed an open system for automated knowledge discovery and achieved SOTA results in three benchmark tests. Simply put, they have managed to make AI conduct experiments for you.

https://x.com/tydsh/status/2065062838255649082

First Result: Let AI Conduct Experiments for You

The first public technical result of Recursive is called “First Steps Toward Automated AI Research”.

Tweet: https://x.com/Recursive_SI/status/2064980090702962699

Repository Address: https://github.com/recursive-org/first-steps-toward-automated-ai-research

Blog Address: https://www.recursive.com/articles/first-steps-toward-automated-ai-research

In short, the core of this work is to develop a system that autonomously drives the AI research cycle forward and improves the previous best values in three benchmark tests.

Before we analyze the result in more detail, it is necessary to understand the design logic of this system.

The traditional AI research process is a closed loop that depends heavily on humans: “Develop ideas, write code, conduct experiments, analyze results, develop new ideas”. The bottleneck is not computing power but people. Only a few researchers worldwide can design top-tier training processes, and they must be closely involved in every experiment iteration.

Recursive's system tries to automate this closed loop.

It works like this: For a defined optimization goal, the system automatically formulates experiment ideas, implements the code, conducts validation, learns from it, and then decides how the next search should be carried out. Multiple research lines can be advanced in parallel, effective discoveries can be reused between different tasks, and a mechanism for detecting reward hacking is integrated into the entire cycle to prevent the system from “taking shortcuts” and improving the evaluation indicators without actually making any improvements.
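The cycle described above can be sketched as a toy loop. Everything in this sketch is hypothetical: the function names, the one-parameter "experiment", and the holdout-based reward hacking check are illustrative assumptions, not Recursive's implementation.

```python
import random

def propose(best_lr, rng):
    """Formulate an experiment idea: perturb the best known setting."""
    return best_lr * rng.choice([0.5, 0.8, 1.25, 2.0])

def run_experiment(lr):
    """Stand-in for 'implement the code and validate it'.
    Returns (reported_metric, holdout_metric); lower is better.
    The toy optimum is lr = 0.1."""
    loss = (lr - 0.1) ** 2
    return loss, loss

def looks_like_reward_hacking(reported, holdout, tol=1e-9):
    """Flag a result whose reported metric beats an independent holdout check."""
    return holdout - reported > tol

def research_loop(start_lr=1.0, steps=50, seed=0):
    rng = random.Random(seed)
    best_lr = start_lr
    best_loss, _ = run_experiment(best_lr)
    history = []  # what the loop has "learned" so far
    for _ in range(steps):
        candidate = propose(best_lr, rng)
        reported, holdout = run_experiment(candidate)
        if looks_like_reward_hacking(reported, holdout):
            continue  # discard suspicious gains instead of building on them
        history.append((candidate, reported))
        if reported < best_loss:
            # The next proposal starts from the improved setting.
            best_lr, best_loss = candidate, reported
    return best_lr, best_loss, history
```

A real system would replace `run_experiment` with code generation and training runs, and would run several such loops in parallel while sharing `history` between them.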

This is not a special-purpose tool tuned to a single problem but a general, cross-domain framework for automated research. Recursive demonstrates this with three significantly different test scenarios.

Three Battlefields, Three New Records

Scenario 1: Training a Small Model with a Fixed Computing Budget (NanoChat Autoresearch)

The rules for this benchmark test come from the Autoresearch project by Andrej Karpathy (creator of nanoGPT and a founding member of OpenAI): On a single GPU, a small language model is trained within a fixed five-minute training budget so that the validation loss (measured in BPB, bits per byte, the lower the better) is as low as possible.
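For readers unfamiliar with the metric, bits per byte can be computed from an ordinary cross-entropy loss as follows. This is a minimal sketch of the standard conversion; the function name is hypothetical, and it assumes the loss is summed in nats over the whole text.

```python
import math

def bits_per_byte(total_nats, text):
    """Convert a summed cross-entropy loss (in nats) over `text`
    into bits per UTF-8 byte."""
    n_bytes = len(text.encode("utf-8"))
    return total_nats / math.log(2) / n_bytes

# Example: three tokens with a loss of 2.0 nats each, covering 11 bytes of text.
example = bits_per_byte(3 * 2.0, "hello world")
```

Because the denominator counts bytes rather than tokens, the value does not depend on how the tokenizer splits the text.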

This scenario is naturally suitable for automated research: The experiment cycles are short, the variance of the indicators is low, and reward hacking is relatively easy to detect. For this reason, a community project called “autoresearch@home” has been running on this benchmark for a long time – dozens of human researchers and hundreds of AI agents work together to continuously improve the indicator.

Recursive's system started from the same initial code and improved the validation BPB from 0.9372 (the previous best value in the community) to 0.9109, an improvement of 0.0263 BPB. In other words, Recursive's solution reaches the same training quality about 1.3 times faster than the competition.

The improvements found by the system are not based on a single trick. They combine changes to the architecture, auxiliary losses, changes to the attention mechanism, the behavior of the optimizer, weight decay scheduling, and compiler settings. One of the most important discoveries is a richer short-term memory mechanism: In the value path of the attention mechanism, bigrams (adjacent word pairs) and trigrams (triples) are both embedded via a hash table, then weighted and mixed through learnable gates. Different Transformer layers use different hash functions to reduce the probability of collisions between the layers.
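The hashed n-gram idea can be illustrated with a small sketch. The table size, embedding width, FNV-1a-style hash, and sigmoid gates below are all assumptions chosen for illustration; the source does not specify these details, and how the result is mixed into the attention values is left out.

```python
import math
import random

TABLE_SIZE = 64  # hypothetical number of hash buckets
DIM = 4          # hypothetical embedding width

def make_table(seed):
    """Randomly initialised embedding table (would be learned in practice)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(DIM)] for _ in range(TABLE_SIZE)]

def ngram_bucket(tokens, layer):
    """Hash an n-gram of token ids to a bucket. `layer` salts the hash so
    that different layers collide on different n-grams."""
    h = (0xCBF29CE484222325 ^ (layer * 0x9E3779B97F4A7C15)) % (1 << 64)
    for t in tokens:
        h = ((h ^ t) * 0x100000001B3) % (1 << 64)
    return h % TABLE_SIZE

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def ngram_features(token_ids, pos, layer, bi_table, tri_table, gate_bi, gate_tri):
    """Gated sum of the hashed bigram and trigram embeddings ending at `pos`."""
    out = [0.0] * DIM
    if pos >= 1:  # a bigram needs one preceding token
        row = bi_table[ngram_bucket(token_ids[pos - 1:pos + 1], layer)]
        out = [o + sigmoid(gate_bi) * r for o, r in zip(out, row)]
    if pos >= 2:  # a trigram needs two preceding tokens
        row = tri_table[ngram_bucket(token_ids[pos - 2:pos + 1], layer)]
        out = [o + sigmoid(gate_tri) * r for o, r in zip(out, row)]
    return out
```

In a real model, `gate_bi` and `gate_tri` would be learnable parameters and the returned vector would be added to the value vector at position `pos`.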

This trick is conceptually related to works like DeepSeek Engram, but the system applied it to the fixed-budget scenario in a specific variant that has not been described in the public literature.

Scenario 2: Race for Maximum Training Speed (NanoGPT Speedrun)

While the previous scenario is an “evolution” of the results of an active community, this scenario is much more difficult.

The NanoGPT Speedrun is another benchmark initiated by Karpathy and continuously optimized by the community for more than two years: On 8 H100 GPUs, a GPT model is trained to a validation loss of 3.28 in the shortest possible time. Since mid-2024, the community has reduced the time from about 45 minutes to 79.7 seconds through 83 documented contributions. Each new solution must build on extremely optimized code and still save more time, which is clearly difficult.

Starting from the previous best solution, Recursive's system cut the training time further to 77.5 seconds, a saving of 2.2 seconds. This matches or exceeds the improvements recently achieved by human contributors.

The most important tricks found by the system this time are:

Attention calculation with FP8 precision. The community solution uses FP8 (8-bit floating point) only in the last layer of the model (the language model head), while the system extends FP8 to the matrix multiplications of the attention layers. The forward pass uses FP8 to double the Tensor Core throughput, while the backward pass stays in BF16 to ensure stability.
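To give a feel for how coarse FP8 is, the following sketch rounds values to the E4M3 format (4 exponent bits, 3 mantissa bits) in plain Python and applies it to a dot product. It only illustrates the numerical effect of FP8 rounding in a forward pass; the per-tensor scaling scheme and function names are assumptions, and it says nothing about the actual GPU kernels.

```python
import math

E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format

def round_to_e4m3(x):
    """Round a float to the nearest E4M3 value (saturating; NaN handling omitted)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    _, e = math.frexp(a)   # a = m * 2**e with 0.5 <= m < 1
    e = max(e, -5)         # below 2**-6 values are subnormal with fixed spacing 2**-9
    step = 2.0 ** (e - 4)  # 4 significant bits: 1 implicit + 3 stored
    return sign * round(a / step) * step

def quantize_tensor(values):
    """Per-tensor scaling: map the largest magnitude onto E4M3_MAX, then round."""
    amax = max(abs(v) for v in values) or 1.0
    scale = E4M3_MAX / amax
    return [round_to_e4m3(v * scale) for v in values], scale

def fp8_dot(a, b):
    """Dot product with both operands rounded to FP8 and a full-precision accumulator."""
    qa, sa = quantize_tensor(a)
    qb, sb = quantize_tensor(b)
    return sum(x * y for x, y in zip(qa, qb)) / (sa * sb)
```

With only three mantissa bits, each operand carries a relative rounding error of up to about 6%, which is the kind of error a more conservative backward pass avoids by staying in a wider format.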

Annealing exploration noise in the optimizer. The system injects zero-mean Gaussian noise into the update step of the NorMuon optimizer, with an amplitude that decreases linearly to zero as training progresses. The optimizer thus first “boldly explores” and then “stably converges”, which helps the final solution land in a flatter loss basin.
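The noise schedule described above can be sketched in a few lines. This is an illustrative sketch only: it takes an update already computed by some optimizer rather than implementing NorMuon, and the initial amplitude `sigma0` and the choice to scale the noise by the learning rate are assumptions.

```python
import random

def noise_scale(step, total_steps, sigma0):
    """Linearly decay the noise amplitude from sigma0 at step 0 to 0 at the end."""
    return sigma0 * max(0.0, 1.0 - step / total_steps)

def noisy_update(params, update, step, total_steps, lr, sigma0, rng):
    """Apply `update` plus zero-mean Gaussian noise whose amplitude is annealed."""
    sigma = noise_scale(step, total_steps, sigma0)
    return [p - lr * (u + rng.gauss(0.0, sigma)) for p, u in zip(params, update)]

# Example: early steps are perturbed, the final step is noise-free.
rng = random.Random(0)
early = noisy_update([1.0], [2.0], step=0, total_steps=100, lr=0.1, sigma0=0.5, rng=rng)
final = noisy_update([1.0], [2.0], step=100, total_steps=100, lr=0.1, sigma0=0.5, rng=rng)
```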

More efficient fused MLP kernel. The system rewrote a Triton GPU kernel so that only the squared ReLU activations are stored in the forward pass. In the backward pass, the non-squared intermediate results are recomputed inside the kernel, which saves a complete read and write of the activation tensor in GPU high-bandwidth memory. This is a direct speedup at the hardware level.
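The store-less, recompute-more idea can be shown at the level of the math. The source does not say how the kernel recomputes the non-squared values; taking a square root of the stored output is one possible way and is an assumption of this sketch, which also ignores everything specific to Triton.

```python
import math

def forward(x):
    """Squared-ReLU activation; only this output is kept for the backward pass."""
    return [max(v, 0.0) ** 2 for v in x]

def backward(saved_sq, grad_out):
    """Recover relu(x) = sqrt(relu(x)**2) on the fly, then apply
    d/dx relu(x)**2 = 2 * relu(x)."""
    return [2.0 * math.sqrt(s) * g for s, g in zip(saved_sq, grad_out)]

saved = forward([-1.0, 0.5, 2.0])
grads = backward(saved, [1.0, 1.0, 1.0])
```

The saving comes from never writing the intermediate `relu(x)` tensor to memory and never reading it back; the extra arithmetic is cheap by comparison.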

These three improvements come from three different areas of expertise: precision strategy, optimizer design, and GPU kernel programming. That the system still found room for improvement after two years of community optimization speaks for itself.

Scenario 3: Optimization of GPU Kernels (SOL-ExecBench)

The first two scenarios deal with the model training level, while the third scenario goes deeper: the optimization of GPU computing kernels.

SOL-ExecBench is a benchmark from NVIDIA that includes 235 kernel programming tasks covering various real-world workloads such as matrix multiplication, reduction, normalization layers, attention components, quantization routines, and fused blocks. The evaluation criterion is the SOL score: 0.5 corresponds to the reference implementation in PyTorch, and 1.0 corresponds to the theoretical hardware limit. The previous best public score was 0.699.

Recursive's system was run on all 235 kernels and was allowed to reuse optimization patterns it had found (e.g., memory access strategies, blocking methods, reduction techniques) across tasks. In the end, the score improved to 0.754, which reduces the gap to the hardware limit by 18%.
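The 18% figure follows directly from the scores quoted above, as this small check shows (the helper name is hypothetical):

```python
def gap_closed(old_score, new_score, limit=1.0):
    """Fraction of the remaining distance to `limit` removed by the improvement."""
    return (new_score - old_score) / (limit - old_score)

# The gap to the limit shrinks from 1.0 - 0.699 = 0.301 to 1.0 - 0.754 = 0.246.
print(f"{gap_closed(0.699, 0.754):.1%}")  # prints 18.3%
```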

This scenario is particularly significant because kernel development is an extremely specialized field – there are only a few engineers worldwide who can write efficient Triton/CUDA kernels. The Recursive team admits in their blog that they themselves are not experts in the kernel field. “These ideas come from the system itself and not from our expertise.”

Recursive: AI Research for the Recursive Improvement of AI

Recursive Superintelligence, the company that published this result, was founded between late 2025 and early 2026 and has only...