Demis Hassabis's challenge has been taken up by the father of GPT: a model with its knowledge frozen in 1930.
"Can a model whose training data stops at 1911 derive, on its own, the general theory of relativity that Einstein proposed in 1915?" At the beginning of this year, Demis Hassabis posed this as an extremely demanding test for AGI.
Unexpectedly, someone really tried to do this, and one of the authors is Alec Radford, the father of GPT.
Recently, Alec Radford teamed up with David Duvenaud, a co-author of the neural ordinary differential equations paper (and PhD advisor of its first author, Ricky T. Q. Chen), and quant researcher Nick Levine on an intriguing project: they trained a 13B model called Talkie exclusively on pre-1931 data, then talked with it to see what would happen.
This "model from 1930" is cut off from all modern knowledge pollution. This gives researchers a rare opportunity: when you want to test whether an AI really understands certain abilities or just repeats the answers in the training data, talkie - 1930 is the honest reference system (theoretically). It is also a good starting point for exploring the question raised by Hassabis.
What's the use of a model from 1930?
Talkie's training data comes entirely from English texts published before 1931, including books, newspapers, periodicals, patents, and legal documents, totaling 260 billion tokens. That year was chosen as the cut-off because, in the United States, works published before it have entered the public domain and can be used legally.
After training the model, the researchers did something fun: they opened a 24-hour live channel and let Claude Sonnet 4.6 chat with talkie-1930 around the clock to probe the knowledge boundary of this "person from another era". The conversation logs are public.
The model is also open for anyone to try: https://talkie-lm.com/chat
But what's more interesting is not the specific performance of the model, but why the researchers did this.
They raised a question: to what extent can a model that only lives in the past "anticipate" the future?
They scraped nearly 5,000 descriptions of historical events from the "Today in History" column of The New York Times and measured how "surprising" each one was to Talkie: in information-theoretic terms, the surprisal (negative log-likelihood) per byte of text. As expected, Talkie was unsurprised by events before 1930; after 1930 the surprisal climbed sharply, peaked for the 1950s and 1960s, and then leveled off.
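The measurement code isn't shown in the writeup, but per-byte surprisal is straightforward to compute with any causal language model. Here is a minimal sketch using the standard Hugging Face interface; the checkpoint name "talkie-1930-13b" is a placeholder, not a published model ID.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint name; the Talkie weights are not published under this ID.
tok = AutoTokenizer.from_pretrained("talkie-1930-13b")
model = AutoModelForCausalLM.from_pretrained("talkie-1930-13b").eval()

def bits_per_byte(text: str) -> float:
    """Mean surprisal of `text` under the model, in bits per UTF-8 byte."""
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss  # mean nats per predicted token
    n_predicted = enc["input_ids"].shape[1] - 1  # labels are shifted by one position
    total_bits = loss.item() * n_predicted / math.log(2)
    return total_bits / len(text.encode("utf-8"))

# An event inside the training window should score low...
print(bits_per_byte("1912: The Titanic sinks on her maiden voyage."))
# ...while a post-cutoff event should register as far more surprising.
print(bits_per_byte("1969: Apollo 11 lands on the Moon."))
```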
There is a more ambitious idea behind this method. The researchers quoted the question Demis Hassabis, the founder of DeepMind, once raised (see above), and added several examples of their own: Sikorsky's helicopter patent (1935), Turing's paper on computable numbers (1936), Carlson's electrostatic copying patent (1942). All of these are things Talkie "officially" cannot know. But if a model is large enough and understands deeply enough, could it reach them by deduction from the knowledge it does have?
No one has an answer yet, but the question is worth taking seriously.
The second motivation they give is the contamination problem.
Researchers evaluating large models face a long-standing headache: how do you know a model really "knows" something, rather than having seen the answer in its training data? The problem is nearly intractable, because modern training sets are far too large to audit item by item.
Talkie sidesteps this problem by construction. It has no idea what Python is and has never seen a line of modern code. So the researchers ran an experiment: they evaluated it on the HumanEval programming benchmark, randomly sampling a few Python functions as in-context examples and asking it to write a new one on its own, then measuring how often at least one of 100 attempts was correct (pass@100).
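The article doesn't show the scoring code, but pass@k is conventionally computed with the unbiased estimator from the original HumanEval paper (Chen et al., 2021); a minimal version:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    n samples drawn per problem, c of them correct, budget of k attempts."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Even 3 correct answers out of 200 samples yields a respectable pass@100.
print(pass_at_k(n=200, c=3, k=100))  # ~0.88
```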
The result: Talkie really can learn from the examples, and as model scale grows, its performance on the task improves slowly but steadily.
Still, compared with same-sized models trained on modern web data, Talkie lags far behind. Moreover, every problem it got right falls into one of two buckets: trivially simple one-liners, or small modifications of the example programs. The researchers single out one case, a decoding function for a rotation cipher. The prompt supplied the encoding function; Talkie grasped the idea of "undoing" it and changed a plus sign to a minus sign. One token of difference, and the answer was correct. They argue this shows the model has some grasp of the abstract concept of an inverse function rather than mere imitation.
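The article doesn't reproduce the exact prompt, but the HumanEval suite contains a task of precisely this shape; the pair below (a sketch, with a shift of 5 assumed) illustrates just how small the required edit is.

```python
def encode_shift(s: str) -> str:
    """Shift every lowercase letter forward by 5 positions (given as the example)."""
    return "".join(chr(((ord(ch) - ord("a") + 5) % 26) + ord("a")) for ch in s)

def decode_shift(s: str) -> str:
    """The required inverse: identical code with + flipped to -."""
    return "".join(chr(((ord(ch) - ord("a") - 5) % 26) + ord("a")) for ch in s)

assert decode_shift(encode_shift("attackatdawn")) == "attackatdawn"
```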
A model that knows nothing about digital computers can still pick up the logic of programming from a few examples. That was enough to convince the researchers the direction is worth pursuing.
The third motivation is a deeper question about data diversity.
All of today's mainstream large models, whether GPT, Claude, or Gemini, ultimately trace back to the same source of training data: the Internet. Direct crawling, distillation, synthetic data: all of it is a product of the same ocean of information. That raises a question worth taking seriously: we may believe we are studying the general laws of language models, when in fact we are only studying the particular properties of models trained on the Internet. How much of the shared temperament, capability, and behavior of these models comes from commonalities of human language and culture, and how much simply from training on the same data source?
Talkie provides a control group. By studying the similarities and differences between it and modern models, the researchers hope to isolate which features are the general attributes of language models and which are the unique products of "Internet training".
To measure Talkie's capabilities more directly, the researchers also trained a "modern twin": a model with exactly the same architecture but with the training data swapped for the modern web dataset FineWeb. The two were compared head-on along three dimensions: language understanding, numerical calculation, and factual knowledge.
The result: Talkie lags across the board. But the researchers noticed a detail: many test questions are out of scope for a model that only knows the pre-1930 world; it has no reason to know those things. After filtering those questions out, the gap between the two models shrank by roughly half.
On language understanding and numerical calculation, Talkie comes quite close to its modern twin. The researchers attribute the remaining gap to two likely causes: the poor OCR quality of historical texts, and a topic distribution in the training corpus that differs sharply from modern ones.
How hard is it to train a retro model?
Training a retro model is far from as simple as it sounds.
The trickiest problem is "time leakage". The training data's cut-off is 1930, but "published before 1930" does not mean "contains nothing after 1930". A book published in 1920 may have been reprinted later with a modern preface added by an editor; a newspaper's digital archive may carry contemporary annotations written by the archivist. Once such content slips into the training set, the model suddenly "knows" things it has no business knowing.
Exactly this happened with the early 7B version: asked who was President of the United States in 1936 and what important legislation he signed, it rattled off the details of Roosevelt and the New Deal without hesitation, and even mentioned the United Nations and the post-war division of Germany. A model supposed to live only in 1930 had somehow glimpsed the later world through a crack.
The researchers built a set of n-gram-based anachronism classifiers to filter the training data, but admit the method is imperfect: the 13B Talkie still has a hazy awareness of some post-World-War-II events. How to seal the crack completely remains an open problem.
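The authors describe the filter only as n-gram-based; the sketch below shows the general idea, with an invented, illustrative term list rather than their actual classifier.

```python
import re

# Illustrative blocklist of post-1930 n-grams; the real classifier's list is not public.
ANACHRONISMS = {"united nations", "world war ii", "the new deal", "nuclear weapon"}

def flag_anachronistic(doc: str, max_n: int = 3) -> set[str]:
    """Return any known post-1930 n-grams (up to max_n words) found in the document."""
    words = re.findall(r"[a-z]+", doc.lower())
    hits: set[str] = set()
    for n in range(1, max_n + 1):
        grams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
        hits |= grams & ANACHRONISMS
    return hits

print(flag_anachronistic("A preface noting the founding of the United Nations."))
# -> {'united nations'}
```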
Another headache is data quality. There was no digital publishing in 1930; every text had to be scanned and recognized from paper originals. Traditional OCR handles clean print well, but on old books with complex layouts or poor preservation, the output is often dreadful: misaligned letters, scrambled paragraphs, stray symbols. The researchers ran a controlled experiment: at the same training volume, a model trained on traditionally OCR-transcribed text reached only 30% of the performance of one trained on manually transcribed text. Rule-based cleaning lifts that to 70%, but the gap remains large.
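The cleaning rules themselves aren't published; a minimal sketch of the kind of rule-based OCR cleanup involved, assuming typical artifacts in scans of old print (hyphenated line breaks, long-s transcriptions, runs of whitespace):

```python
import re

def clean_ocr(text: str) -> str:
    """Apply a few illustrative OCR-cleanup rules, not the authors' actual pipeline."""
    text = re.sub(r"-\s*\n\s*", "", text)   # rejoin words hyphenated across line breaks
    text = text.replace("ſ", "s")           # normalize archaic long s to modern s
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # collapse runs of blank lines
    return text.strip()

print(clean_ocr("The aero-\nplane croſſed   the Atlantic."))
# -> "The aeroplane crossed the Atlantic."
```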
They are now building an OCR system tailored to historical documents, hoping to close this gap.
Post-training alignment is a problem too. Instruction fine-tuning of modern large models relies on large amounts of human-annotated dialogue data, all of which carries the flavor and presuppositions of the modern world. Fine-tuning Talkie on it is like sending a Victorian gentleman to corporate training: afterwards he starts talking in slide-deck bullet points. An early version of Talkie did exactly that for a while after reinforcement learning, speaking in lists and bullets, nothing like a person from the 1930s.
To solve this, the researchers started from the historical texts themselves. They generated instruction-response pairs from well-structured old books such as etiquette manuals, letter-writing guides, cookbooks, and encyclopedias, and built a post-training pipeline from scratch. They had Claude Opus 4.6 play the user and Talkie play the assistant to generate multi-turn conversations, then used Claude Sonnet 4.6 as a judge to score Talkie's answers. At the start of training the judge's average score was 2 out of 5; by the end it had risen to 3.4.
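A hedged sketch of what the judging step might look like with the Anthropic Messages API; the rubric prompt is invented, and the exact API alias for the judge model may differ from the name the article uses.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
JUDGE_MODEL = "claude-sonnet-4-6"  # the judge named in the article; alias may differ

def judge_score(conversation: str, talkie_reply: str) -> int:
    """Ask the judge model to rate Talkie's reply on a 1-5 scale."""
    msg = client.messages.create(
        model=JUDGE_MODEL,
        max_tokens=4,
        messages=[{
            "role": "user",
            "content": (
                "Score the assistant's reply from 1 (poor) to 5 (excellent) for "
                "helpfulness and for staying in a pre-1931 voice. "
                "Reply with a single digit.\n\n"
                f"--- Conversation ---\n{conversation}\n\n"
                f"--- Reply ---\n{talkie_reply}"
            ),
        }],
    )
    return int(msg.content[0].text.strip())
```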
They also concede that using a modern AI as judge is itself a form of era contamination. The fully clean approach would be to have Talkie's base model evaluate Talkie's conversations: the model judging itself, living entirely within the logic of 1930. That is the direction they want to try next.
They are currently training a GPT-3-scale model and hope to release it this summer. Preliminary estimates suggest the corpus can be expanded beyond 1T tokens of historical text, which should be enough for a GPT-3.5-level model with capabilities similar to the original ChatGPT.