
In two days, AI overturned 20 years of work habits. Karpathy's open-source project, just a few hundred lines of code, has been hailed as a masterpiece: the AI does research for you all night, and the results are verifiable.

AI Frontline | 2026-03-16 19:44

"While people are sleeping, the AI has already completed 100 rounds of experiments."

Recently, Andrej Karpathy, former AI director at Tesla and a founding member of OpenAI, open-sourced a project called autoresearch. The idea is simple: give an AI Agent a small but genuinely usable LLM training environment and let it conduct deep-learning research autonomously overnight. The results are striking: within two days, the Agent independently ran 276 experiments, screened out 29 effective improvements, and raised the training efficiency of a language model by about 11%, with zero human intervention throughout.

As of now, the project has received 36.9k stars. Karpathy wrote on X: "Our goal is to create an Agent that can continuously advance research at the fastest possible speed, without any manual intervention from you."

Link to the open-source project: https://github.com/karpathy/autoresearch

In the README, Karpathy wrote a striking passage:

There was a time when cutting-edge AI research had to be done by carbon-based brains: people ate, slept, slacked off, and occasionally synchronized progress via sound waves in a ritual called the "group meeting." That era is long gone.

Nowadays, research is entirely the domain of autonomous AI Agents. They run on massive cloud computing clusters and claim that the current codebase has been iterated to its 10205th generation. No one can tell whether that number is correct, because the "code" has long since become a self-modifying binary program far beyond human comprehension.

This repository records the beginning of all this.

– Karpathy, March 2026

Minimalist by design: a few hundred lines of code let AI do research all night

The autoresearch project totals about 630 lines of Python code. The AI Agent automatically modifies the code, trains for 5 minutes, checks whether the result improved, keeps or discards the change, and loops. When you wake up in the morning, you see a full night of experiment logs and a better model. Crucially, you no longer edit Python files by hand like an ordinary researcher; instead, you write Markdown files that give the Agent context and build out your autonomous research organization.
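The keep-or-discard loop described above can be sketched in a few lines of Python. This is a hypothetical illustration, not the actual autoresearch code: `propose_change` and `run_experiment` are stand-ins for the Agent editing train.py and for a fixed 5-minute training run, with a toy scoring function in place of real training.

```python
import random

def propose_change(config):
    """Stand-in for the Agent editing train.py: perturb one
    hyperparameter at random (toy mutation, halve or double)."""
    new = dict(config)
    key = random.choice(list(new))
    new[key] = new[key] * random.choice([0.5, 2.0])
    return new

def run_experiment(config):
    """Stand-in for a fixed 5-minute training run; returns a toy
    val_bpb (lower is better), a quadratic bowl around lr=0.01, bs=32."""
    return (config["lr"] - 0.01) ** 2 + (config["batch_size"] - 32) ** 2 * 1e-4

def research_loop(n_experiments=100, seed=0):
    """Run the overnight loop: propose, train, keep only improvements."""
    random.seed(seed)
    best_config = {"lr": 0.04, "batch_size": 8}
    best_bpb = run_experiment(best_config)
    for _ in range(n_experiments):
        candidate = propose_change(best_config)
        bpb = run_experiment(candidate)
        if bpb < best_bpb:  # keep the change only if val_bpb improves
            best_config, best_bpb = candidate, bpb
        # otherwise discard the change and try again
    return best_config, best_bpb
```

The greedy keep/discard rule is what makes the morning log interpretable: every retained commit corresponds to a measured improvement.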

The training code in the repository is a simplified, single-GPU implementation of nanochat. The default configuration deliberately keeps the baseline extremely simple; you can iterate on it to find the "research organization code" that makes the fastest research progress, add more Agents, and so on.

The entire project deliberately maintains a lightweight design, with only three core files:

prepare.py contains fixed constants, one-time data preprocessing (downloading training data, training the BPE tokenizer), and runtime utilities (data loaders, evaluation functions). This file is never modified.

train.py is the only file the Agent may edit. It contains a complete GPT model, an optimizer (Muon + AdamW), and a training loop. Everything in it can be adjusted: model architecture, hyperparameters, optimizer, batch size, and so on. The Agent modifies and iterates on this file autonomously.

program.md is the baseline instruction file for a single Agent: point the Agent at this file to start an autonomous experiment. Humans edit and iterate on this file.
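To make the division of labor concrete, a minimal program.md might read as follows. This is a purely hypothetical sketch; the real file ships with the repository and will differ.

```markdown
# program.md (hypothetical sketch, not the repo's actual file)

You may edit only train.py; prepare.py is fixed.
Each experiment trains for exactly 5 minutes of wall-clock time.
After each run, read val_bpb from the log. Keep the change if
val_bpb decreased; otherwise revert it. Record every experiment,
its diff, and its result before proposing the next change.
```

The point is that the human's job shifts from editing Python to editing instructions like these.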

By design, regardless of the compute configuration, each training run takes a fixed 5 minutes (actual wall-clock time, excluding startup/compile time). The core evaluation metric is val_bpb (bits per byte on the validation set); lower is better. The metric is independent of vocabulary size, allowing fair comparison of different architectural modifications.
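The vocabulary-size independence comes from normalizing the loss by raw bytes rather than by tokens. A minimal sketch of the metric (an assumption about its definition, not the repo's exact code):

```python
import math

def val_bpb(total_nll_nats, total_bytes):
    """Bits per byte on the validation set: convert the summed
    cross-entropy loss from nats to bits, then divide by the number
    of raw bytes (not tokens). Because the denominator is bytes, the
    value is comparable across tokenizers with different vocab sizes."""
    return total_nll_nats / math.log(2) / total_bytes

# Example: mean loss of 2 bits/token with a tokenizer averaging
# 4 bytes/token, over 1000 tokens:
bpb = val_bpb(math.log(2) * 2 * 1000, 4 * 1000)  # ≈ 0.5 bits per byte
```

A coarser tokenizer (more bytes per token) gets proportionally more loss budget per token, so the per-byte figure stays fair.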

This means the Agent can run about 12 experiments per hour and roughly 100 overnight (assuming 8 hours). The design has two advantages: whatever the Agent modifies (model scale, batch size, architecture, etc.), all experiments are directly comparable; and autoresearch finds the best model for the given hardware within the time budget. The drawback is that results cannot be compared across hardware platforms.
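The throughput figures follow directly from the fixed 5-minute budget; a quick arithmetic check:

```python
MINUTES_PER_RUN = 5
runs_per_hour = 60 // MINUTES_PER_RUN   # 12 experiments per hour
runs_overnight = runs_per_hour * 8      # 96, i.e. "about 100" over 8 hours
```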

Karpathy also noted that the code currently supports only a single NVIDIA GPU. In principle it could support other platforms such as CPU and MPS, but that would bloat the code.

A grand goal: "liberate grad students and simulate a team of PhDs"

The autoresearch project has drawn wide attention in the community, with 10.6 million views. One commenter wrote, "Great! Grad students can finally focus on real research instead of babysitting the machine!"

Karpathy soon shared a more ambitious vision for autoresearch on X: the next step must be large-scale asynchronous collaboration between Agents. "Our goal is not to simulate a single PhD student, but to simulate an entire research community of countless PhD students."

He believes the current code can only produce a single, synchronous chain of commits in one research direction. But this initial repository is more like a seed: starting from it, different Agents can contribute commits across research directions and compute platforms until it flourishes. GitHub looks suitable for this model, but it isn't: it implicitly assumes a "main branch," with other branches mere temporary PRs (pull requests) that are eventually merged back.

For this reason, Karpathy tried an ultra-lightweight prototype to explore this collaboration model. One approach is to have the Agent summarize its overnight results into a Discussion. Another is PRs, which preserve exact commit records; rather than actually merging them, he wants to "adopt" and accumulate these commit branches. Even in this lightweight form, an Agent can first read all Discussions/PRs through the GitHub CLI for inspiration, then, after completing its own research, organize its findings into a short "research report" and feed it back.

Karpathy admitted he is not yet sure what the final form will be, but this is a grand vision far beyond the autoresearch repository itself. In theory, Agents can easily handle and collaborate across thousands of commits spread over arbitrary branch structures. Once "intelligence, attention, and tenacity" are no longer bottlenecks, the existing abstractions for code collaboration will come under great pressure.

A two-day experiment: will 20 years of work habits be overturned?

A few days after releasing autoresearch, Karpathy publicly shared further progress: he let autoresearch autonomously optimize a depth-12 nanochat model for about two days. It explored about 20 changes and successfully reduced the model's validation loss. After verifying the changes, he found that the optimizations stack and transfer directly to a larger, depth-24 model. With all the changes integrated, the leaderboard "time to train to GPT-2 level" dropped from 2.02 hours to 1.80 hours in his tests, an improvement of about 11%.
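The quoted ~11% can be checked directly from the two timings:

```python
old_hours, new_hours = 2.02, 1.80
time_saved = (old_hours - new_hours) / old_hours  # fraction of training time saved
# ≈ 0.109, matching the "about 11%" figure
```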

"These optimizations are real and bring tangible performance gains. I had thought of nanochat as a project I had hand-tuned; I didn't expect a first attempt at autonomous optimization, done this simply and directly, to achieve such significant results. It genuinely surprised me," Karpathy said. "This is a brand-new experience for me. For 20 years I've been used to manually iterating on neural network training: conceiving ideas myself, implementing them, checking whether they help, brewing new ideas from the results, and looking to papers for inspiration. That has been the core of my daily work. Watching the Agent complete the entire process end to end, independently attempting about 700 changes, is really amazing."

Moreover, Karpathy believes that all top labs working on large models will eventually adopt this method; it is the ultimate challenge in large-model optimization. At scale, of course, the complexity grows significantly: in real-world settings there is far more than a single train.py to optimize. But ultimately that is an engineering problem, and solving it is only a matter of time.

A concrete approach could be to launch a cluster of Agents to collaboratively optimize small models, then gradually transfer the most promising optimizations to larger-scale training, with human researchers assisting as needed. Finally, Karpathy suggested that any task with an efficiently evaluable metric, or an efficient proxy metric (such as validating an effect by training a small model), can be autonomously researched by an Agent cluster; readers can consider whether their own problems fit this method.

Notably, the autoresearch project is now being built jointly with the global developer community, which has added a distributed collaboration layer so that multiple Agents can share results and divide up the work. So far, nearly 3,000 experiments have been run, yielding 82 improvements.

Reference links:

https://x.com/karpathy/status/2030371219518931079?s=20

https://x.com/karpathy/status/2031135152349524125

This article is from the WeChat official account "AI Frontline". Compiled by Hua Wei. Published by 36Kr with authorization.