Oh no! An AI from the 1930s is coming to take programmers' jobs.
On Labor Day, even the nearly 100-year-old vintage large model has to work.
Yes, someone has fine-tuned a large model with knowledge only up to 1930 into a software engineer...
The process was easier than expected. With only 250 training samples, the old model solved the first programming problem of its life: it patched the xarray library.
An AI that has never even seen a television is now "going bad" like Claude and trying to take programmers' jobs. (Just kidding.)
The Vintage Silicon-Based Software Engineer
First, some background: who is this 1930 model?
It's the recently popular "old-man AI," talkie-1930-13b.
Behind it are AI researcher Nick Levine, Associate Professor David Duvenaud of the University of Toronto, and the well-known Alec Radford, widely regarded as the father of the GPT series.
The old model's most interesting design choice is a strict rule on its training data: not a single word from after January 1, 1931, is allowed!
Yes, it doesn't know about TVs, the Internet, or how World War II ended...
The old model's world is forever frozen at midnight on December 31, 1930.
But what "stunned" the whole internet is this: when given a Python programming problem, this old-fashioned model, a "spirit from the past" spanning nearly a century, actually wrote its first line of Python code.
It's really outrageous.
Now, the old model is making another effort.
Someone fine-tuned Alec Radford's 1930 vintage LLM to solve real software engineering problems on SWE-bench.
Unexpectedly, the old model actually succeeded.
After 250 training samples, it implemented its first fix: a small patch for the xarray library.
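For intuition, here is what "training on agent trajectories" might look like mechanically. This is a hypothetical sketch only: the message format and the helper name are my own illustration, not taken from the talkie-coder repository.

```python
# Hypothetical sketch: converting one agent trajectory into supervised
# fine-tuning pairs. Format and names are illustrative assumptions,
# not the actual talkie-coder data pipeline.

def trajectory_to_examples(messages: list[dict]) -> list[dict]:
    """For each assistant turn, pair the conversation so far (prompt)
    with that turn's text (completion)."""
    examples = []
    for i, msg in enumerate(messages):
        if msg["role"] == "assistant":
            prompt = "\n".join(
                f'{m["role"]}: {m["content"]}' for m in messages[:i]
            )
            examples.append({"prompt": prompt, "completion": msg["content"]})
    return examples

# Toy trajectory: issue report, model action, tool feedback, proposed fix.
traj = [
    {"role": "user", "content": "Issue: aggregation drops attrs"},
    {"role": "assistant", "content": "Let me inspect the relevant module"},
    {"role": "user", "content": "Observation: keep_attrs defaults to False"},
    {"role": "assistant", "content": "Patch: pass keep_attrs=True"},
]
examples = trajectory_to_examples(traj)
print(len(examples))  # 2
```

Each assistant turn becomes one training pair, so even a handful of multi-round trajectories can yield many supervised examples.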
The centenarian has taken on a tough job.
By the way, the team has released the entire process of the old model implementing the xarray library patch.
To be honest, if judged by the standards of cutting-edge LLMs, this demo is really frustrating.
For a simple problem, the old model took a full 49 rounds to solve it. It was long and slow.
Some rounds were genuinely hard to watch: so clumsy it made you anxious, though you can't stay angry at the old model.
In other moments, though, it was downright exciting, like reading a thrilling novel.
Let me give the most "straightforward" example (it's anything but).
The old model actually messed up at the beginning.
During the 12th round of the conversation, it failed to apply the patch.
The code can have errors, but the veteran never dies.
The old model didn't give up. It kept trying until it finally realized where it went wrong...
Then, in the 44th round, it fixed it!!
I know the fix itself is very simple; in code quality it can't even match a novice, let alone state-of-the-art AI.
But what really matters is the old model's thinking throughout the problem-solving process.
The reasoning ability shown in this process is exactly the same as what we see in modern models.
A 1930 model can also make mistakes, reflect, and self - correct.
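That mistake-and-retry rhythm is exactly what an agent harness encourages. As a rough illustration only (the function names here are invented for this sketch, not taken from the actual project), such a loop might look like:

```python
# Minimal, hypothetical sketch of an agent harness retry loop:
# propose a patch, try to apply it, feed any error back, repeat.

def apply_patch(patch: str) -> tuple[bool, str]:
    """Stand-in for `git apply`: accepts only a well-formed patch header."""
    if patch.startswith("--- a/"):
        return True, "patch applied"
    return False, "error: corrupt patch"

def run_agent(propose_patch, max_rounds: int = 50) -> int:
    """Loop until the proposed patch applies; return the round it took."""
    feedback = ""
    for round_no in range(1, max_rounds + 1):
        patch = propose_patch(feedback)
        ok, feedback = apply_patch(patch)
        if ok:
            return round_no
    return -1

# Simulated model: produces malformed patches until the third attempt.
attempts = iter(["diff oops", "still wrong", "--- a/xarray/core.py"])
rounds_needed = run_agent(lambda fb: next(attempts))
print(rounds_needed)  # 3
```

The point is that self-correction emerges from the loop: the model sees its own error message as the next prompt, just as the 1930 model did between rounds 12 and 44.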
Beyond the demo, its performance in the benchmark is also remarkable.
When the fine-tuning data is scaled up to about 75K trajectories, roughly 1 billion tokens, the model achieves 4.5% pass@1 on SWE-bench-Verified.
Bear in mind, it previously managed only 4% pass@100 on HumanEval, so this improvement is quite significant.
Although the absolute value is still low, it's already quite astonishing for a model with 1930 knowledge.
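A quick note on the metrics, since pass@1 and pass@100 are easy to conflate: pass@k is the probability that at least one of k sampled solutions passes the tests. The standard unbiased estimator (introduced alongside HumanEval) can be sketched as:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: with n total samples for a problem,
    c of them correct, the chance that a random draw of k samples
    contains at least one correct solution."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a pass is guaranteed
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# One correct solution out of 100 samples:
print(round(pass_at_k(100, 1, 1), 3))  # 0.01  (strict: first try must pass)
print(pass_at_k(100, 1, 100))          # 1.0   (any of 100 tries may pass)
```

So pass@100 only asks whether any of 100 attempts succeeds, while pass@1 demands success on a single attempt, which makes 4.5% pass@1 a far stricter result than 4% pass@100.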
What's even more interesting is another control experiment.
In fact, the team also trained a sibling model for the old model, called talkie-web, which was pre-trained on Internet data.
With the same fine-tuning formula, talkie-web achieved 5.5% on SWE-bench-Verified.
Yes, even with the twin favored by the addition of Internet data, it outperformed the old model by only 1 percentage point.
You are welcome to reproduce the above results.
This is not a time-travel thriller. The team has open-sourced the project on GitHub. The link is at the end of the article. Friends who are interested can try it out.
The team itself is also very excited and says in the README:
If you have more computing power on hand, we'd love to see a comparison of the complete scaling curves of the 1930 model and the Internet model during post-training expansion.
I'm looking forward to it. This is much more interesting than a simple benchmark show - off.
What is Intelligence?
The team didn't analyze the reasons behind it, but after reading the many comments under the post, I think this is a topic worth discussing.
We always thought that an AI needed to consume the entire Internet to become smart.
But if a model that has only read books from before 1930 can write code and fix bugs after a little post-training...
Shouldn't we rethink our understanding of "what is intelligence"?
A 4.5% pass@1 is of course not impressive compared to today's state-of-the-art models. But what it proves is more important than any benchmark score.
In effect, a mind educated entirely before 1930 can still come to grips with modern software engineering.
The data volume from a hundred years ago, combined with the right post - training method, is enough to produce modern - day reasoning.
The bottleneck of intelligence may not lie in the amount of pre-training data after all.
You don't need a model trained on all of human knowledge; basic language understanding ability is enough.
Maybe, while we're rushing forward on the Scaling path, we can also take a little break, look up, and have a casual chat with people around us —
Hey, you know...
What exactly is the essence of intelligence?
Reference links:
[1]https://x.com/rdolmedo_/status/2050665193374732430?s=20
[2]https://github.com/RicardoDominguez/talkie-coder
This article is from the WeChat official account "Quantum Bit". The author focuses on cutting-edge technology. 36Kr has been authorized to publish it.