Large models also need to sleep. Let the AI take a nap and it will be smarter when it wakes up.
Even AI can't handle working 24/7.
Carnegie Mellon University and the University of Maryland published a paper titled "Language Models Need Sleep".
When large models process long contexts without rest, they can really become "exhausted" and perform poorly.
The inspiration for this research comes from the operating mechanism of the human brain.
When people sleep, the hippocampus replays the day's short-term memories over and over, consolidating them into cortical synapses and turning them into long-term knowledge.
The research team believes models can do the same, so they designed a sleep mechanism. When the model's context window is almost full, instead of forcing it to continue, they let it "take a nap". The model replays and digests the recent context several times, compresses it into long-term weights, clears the cache, and then resumes work after "waking up".
Tests show that a moderate increase in the number of "sleep" iterations can significantly improve the model's performance on deep reasoning tasks.
This holds especially for difficult problems that require step-by-step derivation: the more complex the problem, the more "sleep" the model needs.
What's going on?
What's wrong with large models? Why do they need to "sleep"?
The core of the Transformer is the attention mechanism. However, attention has an inherent shortcoming: its compute cost grows quadratically with context length, while the KV cache grows linearly.
For the same reasoning task, the compute cost differs enormously between an 8K and a 128K context window. Most of the extra compute is spent relating the current step to historical information.
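To make the scaling concrete, here is a back-of-the-envelope sketch. The function names and the model dimensions (hidden size 4096, 32 layers, 8 KV heads, head dimension 128, 2 bytes per value) are illustrative assumptions, not figures from the paper, and the FLOP count covers only the attention score and weighted-sum terms of a full pass over the context.

```python
def attention_flops(seq_len: int, d_model: int) -> int:
    """Rough FLOPs for QK^T plus (scores @ V) in one layer over a full context."""
    # Each of the two matmuls costs about 2 * seq_len * seq_len * d_model FLOPs.
    return 2 * 2 * seq_len * seq_len * d_model


def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Memory for cached keys and values: one K and one V vector per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


short_ctx, long_ctx = 8 * 1024, 128 * 1024  # 8K vs 128K tokens

flops_ratio = attention_flops(long_ctx, 4096) / attention_flops(short_ctx, 4096)
cache_ratio = (kv_cache_bytes(long_ctx, 32, 8, 128)
               / kv_cache_bytes(short_ctx, 32, 8, 128))

print(flops_ratio, cache_ratio)  # 256.0 16.0
```

A context 16 times longer costs roughly 256 times more attention compute but only 16 times more cache memory, whatever the exact model dimensions are, because they cancel out in the ratios.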
So there are currently two approaches:
One is to tough it out: keep going until the cache can't hold any more, then evict the oldest information. Once that information is gone, the model behaves as if it had never seen it.
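The eviction approach can be pictured with a fixed-size sliding window. This is only a minimal sketch: `SlidingKVCache` is a hypothetical class made up for illustration, and real inference engines use more elaborate eviction policies.

```python
from collections import deque


class SlidingKVCache:
    """Toy KV cache that keeps only the most recent `max_tokens` entries."""

    def __init__(self, max_tokens: int):
        # A deque with maxlen silently drops the oldest item when it is full.
        self.entries = deque(maxlen=max_tokens)

    def append(self, token: str, key: int, value: int) -> None:
        self.entries.append((token, key, value))

    def visible_tokens(self) -> list:
        """Tokens the model can still attend to."""
        return [token for token, _, _ in self.entries]


cache = SlidingKVCache(max_tokens=4)
for i, tok in enumerate(["A", "B", "C", "D", "E", "F"]):
    cache.append(tok, key=i, value=i)

print(cache.visible_tokens())  # ['C', 'D', 'E', 'F']
```

Tokens "A" and "B" are simply gone: nothing downstream can attend to them again.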
The other is the SSM + Attention hybrid architecture that has been popular in the past two years, such as Samba and Qwen3.5.
The hybrid architecture is a compromise. It compresses old information into fast weights, which take up no cache space while still letting the model recall that information.
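The idea of fast weights can be sketched with the generic outer-product update used in linear-attention-style memories. This is an assumption chosen for illustration; the article does not describe the exact update rule of Samba, Qwen3.5, or the paper's own SSM module, and the helper names `write` and `read` are hypothetical.

```python
def write(W, value, key):
    """Store a (key, value) pair by adding the outer product value * key^T to W."""
    for i in range(len(value)):
        for j in range(len(key)):
            W[i][j] += value[i] * key[j]


def read(W, query):
    """Recall by multiplying the fast-weight matrix with a query vector."""
    return [sum(W[i][j] * query[j] for j in range(len(query)))
            for i in range(len(W))]


d = 3
W = [[0.0] * d for _ in range(d)]  # state size is d x d, however many tokens arrive

# Two tokens with orthogonal keys, so they do not interfere with each other.
write(W, value=[1.0, 2.0, 3.0], key=[1.0, 0.0, 0.0])
write(W, value=[4.0, 5.0, 6.0], key=[0.0, 1.0, 0.0])

print(read(W, [1.0, 0.0, 0.0]))  # [1.0, 2.0, 3.0]
```

The matrix `W` never grows as tokens arrive, which is why this kind of memory relieves cache pressure. With non-orthogonal keys the stored values start to blur into each other, which is one reason more careful write rules exist.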
This does relieve some memory pressure. However, the team found that even when the fast weights still have spare capacity, performance degrades as the number of reasoning steps grows and the logical chain gets longer.
In other words, the current bottleneck is not storage capacity but deep reasoning ability, which cannot keep up.
Before historical information is evicted from the KV cache, the model gets only one forward pass to internalize it. A single pass is simply not enough to support the decomposition and derivation of complex logic.
This is similar to the human brain. You experience a lot of things during the day, but you don't digest them all on the spot. Instead, your brain processes them when you are asleep.
The hippocampus replays the important fragments of the day over and over again during sleep, consolidating short-term memories into cortical synapses and turning them into long-term knowledge.
But this process must be offline. That is, you have to fall asleep first and turn off external stimuli temporarily so that the brain can concentrate its computing power on digestion.
Moreover, it's not just one replay; it needs to be repeated several times.
What does the model's "sleep" look like?
The team carried this whole brain-inspired logic over to the model.
In their design, when the model's context window is almost full, the model is not forced to continue. It is put to "sleep" instead.
Here, "sleep" means the model stops taking in new tokens, goes completely offline, and runs multiple rounds of recursive forward passes over all the accumulated context.
Using learnable local rules, it repeatedly refines and integrates the information and gradually updates the fast weights in the SSM module, completing a deep compression and digestion of the context.
Once digestion is done, the KV cache is cleared, and the model wakes up with the updated weights and continues working.
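The control flow of such a sleep phase can be sketched as follows. This is a toy under loud assumptions: the paper's local rules are learned, whereas this sketch substitutes a fixed delta rule (an error-correcting outer-product update), and `sleep`, `delta_update`, and `recall_error` are hypothetical names. It only measures how faithfully the cached pairs are absorbed into the fast weights, not reasoning ability.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def delta_update(W, key, value, lr):
    """Nudge W so that W @ key moves toward value (stand-in for a learned rule)."""
    pred = matvec(W, key)
    for i in range(len(W)):
        err = value[i] - pred[i]
        for j in range(len(key)):
            W[i][j] += lr * err * key[j]


def recall_error(W, pairs):
    """Total absolute error when recalling every stored value from its key."""
    return sum(abs(v - p) for key, value in pairs
               for v, p in zip(value, matvec(W, key)))


def sleep(W, kv_cache, n_rounds, lr=0.5):
    """Offline phase: replay the cached context n_rounds times, then clear it."""
    for _ in range(n_rounds):
        for key, value in kv_cache:
            delta_update(W, key, value, lr)
    kv_cache.clear()  # the model wakes up with an empty cache


# Two cached (key, value) pairs with overlapping, non-orthogonal keys.
context = [([1.0, 0.0], [2.0, -1.0]), ([0.6, 0.8], [0.5, 3.0])]

errors = {}
for n_rounds in (1, 4, 16):
    W = [[0.0, 0.0], [0.0, 0.0]]
    cache = list(context)
    sleep(W, cache, n_rounds)
    errors[n_rounds] = recall_error(W, context)

print(errors)  # the error shrinks as the number of sleep rounds grows
```

In this toy, one replay leaves the overlapping memories partly tangled, while repeated replays drive the recall error toward zero. All the extra work happens inside `sleep`; reading from `W` afterwards costs the same no matter how many rounds were run.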
In terms of compute allocation, all the additional overhead is concentrated in the "sleep" stage. After the model wakes up, inference works just like an ordinary model's and needs only one forward pass.
The "sleep duration" here is essentially the number of rounds of iterative processing. The more rounds, the more thoroughly the model sorts through and refines the context.
The team selected three types of tasks for testing: cellular automata, multi-hop graph retrieval, and GSM-Infinite mathematical reasoning. All three allow precise control over two variables: reasoning depth and memory load.
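To see how a cellular automaton lets both variables be dialed independently, consider the following sketch. It assumes a one-dimensional elementary automaton with Rule 110 and wrap-around edges; the article does not say which automaton or task format the paper uses, and `make_example` is a hypothetical helper. Here the number of steps plays the role of reasoning depth, and the width of the initial row plays the role of memory load.

```python
def step(cells, rule=110):
    """Apply one update of an elementary cellular automaton (wrap-around edges)."""
    n = len(cells)
    out = []
    for i in range(n):
        left, center, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        index = (left << 2) | (center << 1) | right  # neighborhood as a 3-bit number
        out.append((rule >> index) & 1)              # look up that bit in the rule
    return out


def make_example(initial, depth, rule=110):
    """Build one task instance: predict the row after `depth` sequential updates."""
    state = list(initial)
    for _ in range(depth):
        state = step(state, rule)
    return {"prompt": list(initial), "depth": depth, "answer": state}


example = make_example([0, 0, 0, 1], depth=2)
print(example["answer"])  # [0, 1, 1, 1]
```

Each extra step depends on the previous one, so a larger `depth` forces a longer chain of sequential derivation, while a wider row means more state to keep track of.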
The results show that increasing the number of sleep iterations steadily improves the model's overall performance, with the gains concentrated in difficult deep reasoning tasks.
In other words, simple problems can be solved while the model is "awake", but difficult ones require "sleep": only after multiple rounds of sorting things out can the model think clearly.
All we can say is that taking a break really is a good way to get more done. Sometimes you need to stop in order to think carefully (doge).
Paper link: https://arxiv.org/abs/2605.26099
This article is from the WeChat official account "QbitAI". Author: Wen Le. Republished by 36Kr with permission.