MIT discovers the secret to making AI smart, and it's exactly the same one humans use.
Have you noticed that when you ask an AI to read a long article, it forgets what it read earlier? Or that when you ask it to process an extremely long document, the answers it gives are completely off-topic? In academia there is a specific term for this phenomenon: "context rot". This is also a common problem with current AI: large models have very poor memory. The longer the article, the dumber the model gets!
On the last day of 2025, the Massachusetts Institute of Technology (MIT) published a groundbreaking paper aimed at solving this problem.
The paper is titled "Recursive Language Models", or RLM for short.
It may sound very academic, but in plain English: Just let the AI do it one more time, and the performance will skyrocket.
Paper link: https://arxiv.org/pdf/2512.24601
Let me reveal two core data points first:
In complex reasoning tasks, simply having the model process the data 2-4 more times can increase the accuracy rate by 10%-25%.
When processing extremely long documents, an RLM (Recursive Language Model) maintains stable performance even at a scale of over 10 million tokens, while traditional models completely collapse!
What does this mean?
Previously, we thought that if an AI wasn't smart enough, the answer was to add more parameters and buy more GPUs.
This MIT paper completely changes the game: stop adding parameters. Let the model redo the task, and the results might be much better. (It really is like supervising a human: just ask them to take another pass.)
It turns out that the solution to the problem is that simple!
Moreover, many big names on X have given it a thumbs-up!
Starting with an infuriating problem
Have you ever had the following experiences?
You ask ChatGPT to write an article for you. It writes 3000 words in one go, but when you read it, you realize it's completely off-topic.
Or you ask it to write code. After it's done, when you run the code, it's full of bugs.
But miraculously, when you ask it to check again and think it through, sometimes it can suddenly correct the mistakes.
Researchers at MIT found that this isn't some kind of mystery; there's a pattern to it.
Most mistakes made by AI aren't because it doesn't understand but because it writes the first draft too quickly.
Just like when you write a thesis, the first draft is always terrible, but after revising it three or four times, it's like it was written by someone else.
The same goes for AI.
The problem is that most current large models operate in a one-pass mode: you input a question, it outputs an answer, and that's it.
They don't rework, self-check, or deliberate repeatedly on their own.
Let's look at how large models currently operate from another angle:
Suppose you're an intern who just joined a company. Your boss gives you a 500-page document and asks you to summarize it into a report.
What would you do?
A normal person would first skim through it to find the key chapters, then read each chapter one by one, summarize each chapter after reading it, and finally string all the summaries together.
Right?
But large models don't do it this way.
Large models read the entire 500-page document from start to finish in one go and then try to answer questions based on their memory.
There's no way they can remember everything.
This is the dilemma that large models face.
It's not that they aren't smart; it's that they can't remember.
The MIT paper aims to give AI the ability to rework.
The real bottleneck of AI: not a small brain but a poor memory
Before discussing MIT's solution, I need to clarify why this is so important.
You may have heard of the term "context window".
What does it mean?
You can imagine a large AI model as a genius, but this genius has a fatal flaw - his workbench is too small.
You give him a very long document and ask him to analyze it, but he can only put a small part of the document on his workbench to read.
What about the part that exceeds the size of the workbench? He can't see it and just ignores it.
Currently, the most powerful model, GPT-5, can fit up to about 270,000 tokens (roughly equivalent to 200,000 Chinese characters) on its workbench.
It sounds pretty impressive, right?
But here's the problem.
Even within this limit of 270,000 tokens, the model's performance drops sharply as the input gets longer.
When you give it 8000 tokens, it performs extremely well.
When you give it 80,000 tokens, it starts to get a bit confused.
When you give it 270,000 tokens, it starts to talk nonsense.
Why?
Because there's too much information for it to process, and its mind gets muddled.
It's like asking someone to memorize an entire encyclopedia and then answer questions: they may have read it all, but they can't find the relevant information when they need it.
This is the current dilemma of large models: It's not that the context window isn't long enough; it's that they can't use it well even if it is.
MIT's genius idea: Put the data in a drawer
Okay, the problem is clear. Now let's look at MIT's solution.
The traditional approach is to directly stuff the data into the AI's "brain".
MIT's approach is: Stop stuffing it. Put it in a drawer instead.
They invented something called RLM.
The core idea of RLM is: Don't let the AI directly read that extremely long document. Instead, let the AI use code to search through the document.
Let me give you an example.
Previously, an AI was like a student. You slapped an entire textbook in front of him and said, "Read it and then answer my questions."
The student would say, "??? I can't read it all. Can I just read part of it?"
Then he would grit his teeth and read the first part, and just give up on the rest.
RLM works differently.
It's more like equipping this student with a table-of-contents system and a search engine.
The document is still the same, but the student doesn't need to read it from start to finish. He can first look at the table of contents to get an overview of the structure, then search for relevant paragraphs based on the questions and extract the useful information.
Even more impressively, this student can break down a complex problem into several smaller problems, and then - here comes the key point - he can summon his clones to handle each small problem simultaneously and finally aggregate the answers.
This is what recursion means: AI can call its clones to help itself with the work.
Or, to put it even more simply:
It treats this extremely long document as a database stored externally instead of directly stuffing it into its "brain".
Then, the model can write code to search this database.
Need the content of Chapter 1? Write a piece of code to search for it.
Need the content of Chapter 10? Write another piece of code to search for it.
Need to compare the content of Chapter 1 and Chapter 10?
Then search for Chapter 1 first, summarize it, then search for Chapter 10, summarize it, and finally combine the two summaries.
This is like having an external hard drive with unlimited capacity.
It doesn't matter if the model's "brain" can't hold that much information.
It can search the hard drive at any time, looking up what it needs.
In theory, this way, the model can process infinitely long documents.
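To make the "drawer" idea concrete, here is a minimal sketch of the chapter-comparison workflow described above. Everything in it is illustrative rather than the paper's actual interface: the call_llm helper stands in for whatever model API is used, and the regular expression assumes a document whose chapters are literally labelled "Chapter 1", "Chapter 2", and so on.

```python
import re

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to the underlying language model;
    in practice this would wrap whatever LLM API the system uses."""
    raise NotImplementedError

def compare_chapters(document: str, a: int, b: int) -> str:
    """Compare two chapters without ever putting the whole document
    into the model's context window."""

    def extract_chapter(n: int) -> str:
        # Assumes chapters are introduced by headings like "Chapter 3".
        match = re.search(rf"Chapter {n}\b(.*?)(?=Chapter {n + 1}\b|\Z)",
                          document, flags=re.S)
        return match.group(1) if match else ""

    # Each chapter gets summarized in its own short model call.
    summary_a = call_llm("Summarize this chapter:\n" + extract_chapter(a))
    summary_b = call_llm("Summarize this chapter:\n" + extract_chapter(b))

    # The final call only ever sees the two short summaries.
    return call_llm(f"Compare these two summaries:\n\n{summary_a}\n\n{summary_b}")
```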
How exactly did they do it?
MIT's implementation is actually quite elegant.
They equipped the AI with a Python programming environment (REPL) and stored that extremely long document as a variable.
Then the AI no longer reads the document directly but operates on it using code.
For example:
Want to know how long the document is? Write a line of code len(input_text) to find out.
Want to see the first 1000 characters of the document? Write input_text[:1000].
Want to search for keywords in the document? Write a regular expression.
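Put together, a session in that REPL might look something like the sketch below. The file name and the keyword are made up for the example; only the pattern of interaction (inspect, slice, search) follows the paper's description.

```python
import re

# The long document lives in the REPL as an ordinary Python variable,
# outside the model's context window.
with open("long_document.txt", encoding="utf-8") as f:
    input_text = f.read()

# How long is the document?
print(len(input_text))

# Peek at the first 1000 characters to get a feel for its structure.
print(input_text[:1000])

# Search for a keyword and print a bit of surrounding context for each hit.
for m in re.finditer(r"recursive", input_text, flags=re.IGNORECASE):
    start = max(m.start() - 80, 0)
    print(input_text[start:m.end() + 80])
```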
Even more amazingly, the AI can segment the document, assign each segment to a sub-AI for processing, and then aggregate the results itself.
This sub-AI is actually the same model, called recursively on a smaller piece of the work.
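Here is a minimal sketch of that recursive pattern, reusing the hypothetical call_llm helper from the earlier sketch: if the text fits comfortably, answer directly; otherwise split it, hand each piece to a recursive sub-call of the same model, and merge the partial answers. In the real RLM the model decides for itself how to split and what to read; this sketch hard-codes a fixed chunk size purely for simplicity.

```python
def recursive_answer(question: str, text: str, chunk_size: int = 50_000) -> str:
    """Answer a question about `text`, recursing when the text is too long
    to hand to the model in one piece."""
    if len(text) <= chunk_size:
        # Base case: the piece is small enough for a single model call.
        return call_llm(f"{question}\n\n{text}")

    # Split into chunks and let sub-calls of the same model handle each one.
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    partial_answers = [recursive_answer(question, c, chunk_size) for c in chunks]

    # Aggregate: the final call only sees the short partial answers.
    return call_llm(
        "Combine these partial answers into one final answer:\n\n"
        + "\n\n".join(partial_answers)
    )
```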
This design has two significant advantages:
First, the AI doesn't need to remember that extremely long document in its "brain".
The document is stored in an external drawer, and the AI uses code to retrieve it when needed.
This means that, in theory, the document can be infinitely long - as long as the drawer is big enough.
Second, the AI can decide for itself what to read and what not to read.
It won't stupidly read from start to finish but will smartly pick out the key parts to read.
This greatly saves computational costs and improves accuracy.
How effective is it?
MIT conducted a series of experiments in the paper, and the results are quite shocking.
Experiment 1: Understanding extremely long documents
They used many test sets, one of which is called OOLONG. This test requires the AI to understand extremely long documents and answer questions that require comprehensive information from the whole text.
Result: the accuracy rate of the GPT-5 base model is 44%, while that of RLM reaches 56.5%.
In CodeQA, the accuracy rate of the GPT-5 base model is 24%, while that of RLM reaches 62%, roughly 2.6 times the baseline!
Experiment 2: Ultra-long documents (over 10 million tokens)
They also increased the document length to over 10 million tokens (equivalent to the length of dozens of books).
GPT-5? It can't handle it at all and just crashes.
RLM (GPT-5)? It remains stable, and its performance hardly drops.
This is a qualitative leap.
Experiment 3: Cost comparison
You may think: Such a powerful thing must be extremely expensive, right?
Surprisingly, it isn't.
In the BrowseComp-Plus benchmark test, it costs about $1.5-2.7