Reshaping the Memory Architecture: LLMs are Installing an "Operating System"
Large models with extremely long context windows often experience "memory loss," and "memory" needs to be managed.
As is well known, the context windows of modern large language models (LLMs) are limited: most models can only handle thousands to tens of thousands of tokens. The early GPT-3, for example, had a window of only ~2,048 tokens. Some recent models have expanded their windows to millions or even tens of millions of tokens; Meta's Llama 4 Scout, for instance, claims to reach 10 million tokens.
The figure shows the evolution of the context window size of LLMs.
Note: The token count is an approximate maximum. "GPT-4.1" refers to the GPT-4 update from April 2025, and "Scout" is a 17B-parameter Llama 4 variant designed specifically for long contexts.
LLMs have an inherent "memory defect": their context windows are limited, which severely restricts their ability to stay consistent across multi-round, multi-session interactions over the long term.
As a result, modern LLMs generally struggle to maintain long-term memory. This is a serious limitation for many applications, since memory is the key to reflection and planning and an indispensable part of any intelligent agent system.
Overview diagram of an LLM-based autonomous agent system. Image source: Lil'Log, https://lilianweng.github.io/posts/2023-06-23-agent/
Recently, research on memory for large models has been on the rise. MemOS, which was open-sourced a few days ago, has attracted a lot of attention.
Different from traditional Retrieval-Augmented Generation (RAG) or purely parametric storage, MemOS treats "memory" as a system resource as important as compute. It continuously updates and manages the long-term memory of large models, and within a single framework it schedules, integrates, archives, and manages permissions for plaintext memory, activation-state memory, and parametric memory, giving large models the ability to continuously evolve and self-update.
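The idea of treating memory as a schedulable resource can be made concrete with a small sketch. The class below is purely illustrative (it is not the MemOS API): it models the three memory forms mentioned above as units with an owner, a permission check, and archive/schedule operations.

```python
# Illustrative sketch only -- not the real MemOS API. It models plaintext,
# activation-state, and parametric memory as schedulable units with
# permissions, archiving, and a simple admission/selection policy.
from dataclasses import dataclass
from typing import Any, Callable, Literal

MemoryKind = Literal["plaintext", "activation", "parameter"]

@dataclass
class MemoryUnit:
    kind: MemoryKind
    content: Any
    owner: str              # permission scope, e.g. "user", "agent", "system"
    archived: bool = False

class MemoryScheduler:
    """Schedules, integrates, archives, and permission-checks memory units."""

    def __init__(self) -> None:
        self.units: list[MemoryUnit] = []

    def admit(self, unit: MemoryUnit) -> None:
        self.units.append(unit)                 # integrate new memory

    def archive(self, predicate: Callable[[MemoryUnit], bool]) -> None:
        for unit in self.units:
            if predicate(unit):
                unit.archived = True            # move cold memory out of the active set

    def schedule(self, owner: str, budget: int) -> list[MemoryUnit]:
        """Return at most `budget` active units this owner is allowed to read."""
        visible = [u for u in self.units
                   if not u.archived and u.owner in (owner, "system")]
        return visible[:budget]
```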
The memory of large models is inseparable from long-context processing ability
The models discussed above can handle very large numbers of tokens, even tens of millions. This falls under the long-context processing ability of LLMs, and practical experience tells us that LLMs with strong long-context processing also tend to have stronger memory ability.
Long Context
- It refers to the length of historical text that the model can "see" during the current inference process.
- Essentially, it is the length of the sequence input into the model at one time.
- It is used to solve tasks that require context maintenance, such as document Q&A, multi-round dialogue, and code analysis.
The "long - context processing ability" includes:
Length generalization ability: the ability of the model to extrapolate to sequences longer than any seen during training. Some models fail catastrophically once the input exceeds the training length.
Efficient attention ability: mechanisms (sub-quadratic algorithms) that reduce the compute/memory cost of long sequences, including approximate attention, sparse patterns, or entirely alternative architectures.
Information retention ability: the ability of the model to actually use distant information. If the model effectively ignores content beyond a certain position, even a huge context window is useless; without proper training, attention weights may decay or context may be lost past a certain length (a simple probe for this is sketched after this list).
Prompt design and utilization ability: research on how to design prompts that make the most of long contexts.
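Information retention is the easiest of these to test empirically. Below is a minimal "needle in a haystack" style probe; the `generate` function is a hypothetical wrapper around whatever model is being evaluated, and the pass check is deliberately crude.

```python
# Minimal retention probe: bury one fact (the "needle") at a random depth in
# filler text and ask whether the model can still answer a question about it.
# `generate(prompt) -> str` is a hypothetical wrapper around the model under test.
import random

def retention_probe(generate, filler_sentence: str, needle: str,
                    question: str, expected_answer: str,
                    approx_tokens: int) -> bool:
    words_per_filler = max(1, len(filler_sentence.split()))
    n_fillers = max(1, approx_tokens // words_per_filler)
    pieces = [filler_sentence] * n_fillers
    pieces.insert(random.randrange(len(pieces) + 1), needle)   # random depth
    prompt = " ".join(pieces) + "\n\nQuestion: " + question + "\nAnswer:"
    answer = generate(prompt)
    return expected_answer.lower() in answer.lower()            # crude string match
```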
Memory
- It refers to the information retained by the model across multiple rounds of dialogue/use.
- It is a persistent mechanism that records information about users, dialogues, preferences, etc.
Aurimas Griciūnas, the founder and CEO of SwirlAI, breaks the memory of LLM agents down as follows:
1. Episodic memory - This type of memory covers the agent's past interactions and executed operations. Whenever an operation is completed, the control system writes the operation information into persistent storage for later recall or backtracking.
2. Semantic memory - Semantic memory includes accessible external knowledge information and the agent's understanding of its own state and capabilities. This memory can be either background knowledge visible only inside the agent or a grounding context used to limit the information scope and improve the accuracy of answers, which filters out information relevant to the current task from massive Internet data.
3. Procedural memory - Procedural memory refers to the structural information related to the system's operating mechanism, such as the format of system prompts, available tools, and preset behavioral boundaries (guardrails).
4. In a specific task scenario, the agent system retrieves relevant information from long-term memory as needed and temporarily stores it in a local cache for quick access during task execution.
5. The information retrieved from long-term memory, together with the information currently in the local cache, constitutes the agent's working memory (also known as short-term memory). This information is assembled into the final prompt fed to the large language model (LLM) to guide it in generating subsequent actions or task responses.
As shown in the figure, items 1-3 are usually labeled long-term memory, while item 5 is labeled short-term memory.
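For concreteness, the sketch below gives one possible set of data structures for these memory types. The names and fields are illustrative assumptions, not taken from Griciūnas's diagram or any specific agent framework.

```python
# Illustrative data structures for the memory types above; names and fields
# are assumptions, not part of any specific agent framework.
from dataclasses import dataclass, field

@dataclass
class EpisodicRecord:        # past interactions and executed operations
    step: int
    action: str
    result: str

@dataclass
class SemanticFact:          # external knowledge or self-knowledge
    subject: str
    statement: str
    source: str = "internal"

@dataclass
class ProceduralSpec:        # system prompt format, tools, guardrails
    system_prompt: str
    tools: list[str] = field(default_factory=list)
    guardrails: list[str] = field(default_factory=list)

@dataclass
class WorkingMemory:         # short-term cache assembled per task
    retrieved: list[str] = field(default_factory=list)
    scratch: list[str] = field(default_factory=list)
```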
Long-context ability and memory ability can work together:
Information in the memory system (such as user preferences) can be injected into the context as part of the prompt.
The long-context window helps the model maintain short-term "memory" within the current dialogue, reducing its reliance on the memory system.
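A minimal sketch of that interplay, assuming a plain string-based prompt format: entries from the memory system (here, user preferences) are injected as text, while the long context window carries the recent dialogue verbatim.

```python
# Sketch of injecting memory-system content into the prompt alongside the
# recent dialogue held in the context window. The prompt format is assumed.
def build_prompt(preferences: list[str], recent_turns: list[str],
                 user_message: str, max_turns: int = 20) -> str:
    memory_block = "\n".join(f"- {p}" for p in preferences)        # long-term memory
    dialogue_block = "\n".join(recent_turns[-max_turns:])          # short-term "memory"
    return (
        "Known user preferences (from the memory system):\n"
        f"{memory_block}\n\n"
        "Recent dialogue (from the context window):\n"
        f"{dialogue_block}\n\n"
        f"User: {user_message}\nAssistant:"
    )
```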
Several methods to implement LLM memory
Methods for long contexts
As discussed earlier, when the dialogue exceeds the context length, LLMs may forget user preferences, ask repetitive questions, or even contradict previously confirmed facts. The most direct way to improve the memory ability of LLMs is to enhance their long-context processing ability. Current methods for doing so include the following:
1. RAG (Retrieval-Augmented Generation) is a very versatile method for building knowledge bases and guiding LLM generation through retrieval. By converting structured or unstructured data into retrievable semantic representations, RAG implements a "retrieve first, then generate" process, enabling LLMs to draw on external knowledge to answer factual questions and reduce hallucinations.
The RAG architecture supports dynamic updates of documents, making it easy to build a real-time, scalable, and editable knowledge system, which in turn provides a basis for constructing LLM memory and designing memory systems.
The figure compares the RAG process with the pure long-context approach. RAG is efficient but may miss indirect context; the long-context approach is comprehensive but requires the model to process a very large input.
2. Hierarchical summarization: When summarizing a book, each chapter can be summarized recursively to obtain intermediate summaries, and these intermediate summaries can then be summarized again, and so on. This method can handle inputs far exceeding the model's context length, but the procedure is cumbersome, and errors are easily introduced and accumulated across multiple rounds of summarization (see the sketch after this list).
3. Sliding window inference: For tasks such as reading comprehension over long texts, the model can be applied to sliding windows of the text (e.g., paragraphs 1-5, then paragraphs 2-6, and so on), and the outputs of each window can then be merged by some rule or by a second model.
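As referenced above, hierarchical summarization is easy to express as a recursive map-reduce over chunks. The sketch below assumes a hypothetical `summarize(text) -> str` call whose input always fits in the model's context window.

```python
# Hierarchical summarization sketch. `summarize(text) -> str` is a hypothetical
# call to the underlying LLM; each group is assumed to fit in its context window.
def hierarchical_summary(chunks: list[str], summarize, fan_in: int = 4) -> str:
    """Recursively summarize groups of chunks until a single summary remains."""
    if not chunks:
        return ""
    if len(chunks) == 1:
        return chunks[0]
    merged = []
    for i in range(0, len(chunks), fan_in):
        group = "\n\n".join(chunks[i:i + fan_in])
        merged.append(summarize(group))          # intermediate summary per group
    return hierarchical_summary(merged, summarize, fan_in)
```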
Researchers have explored various algorithmic approaches to expanding the context window. Broadly, these methods fall into: (a) position-encoding methods for length extrapolation, (b) efficient or sparse attention architectures, (c) alternative sequence models (replacing self-attention), and (d) hybrid or memory-augmented methods.
For more detail on the long-context windows of LLMs, see the article by Dr. Adnan Masood:
Article link: https://medium.com/%40adnanmasood/long-context-windows-in-large-language-models-applications-in-comprehension-and-code-03bf4027066f
Methods for memory
Although context ability is closely related to large-model memory, the context window is not directly equivalent to memory.
Take building a chatbot as an example. The chatbot needs to remember what the user said in previous conversations. As the conversation grows, memory management moves information out of the input context and into a searchable persistent store, summarizes it so that the relevant facts stay in the input context, and restores relevant content from earlier conversations when needed. This mechanism lets the chatbot keep the most relevant information in its input context when generating the next response.
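A minimal sketch of that eviction/summarization/recall loop is shown below. The `summarize`, `embed`, and vector-store objects are placeholders for whatever components a real system would use, not a specific library's API.

```python
# Sketch of chatbot memory management: old turns are evicted from the input
# context into a searchable store, a rolling summary keeps key facts in context,
# and relevant earlier content is recalled on demand. All components are
# placeholders, not a specific library's API.
class ChatMemoryManager:
    def __init__(self, summarize, embed, store, max_context_turns: int = 30):
        self.summarize = summarize              # text -> short summary
        self.embed = embed                      # text -> vector
        self.store = store                      # persistent store with .add() / .search()
        self.max_context_turns = max_context_turns
        self.context: list[str] = []            # turns kept in the input context
        self.rolling_summary = ""               # compressed view of evicted turns

    def add_turn(self, turn: str) -> None:
        self.context.append(turn)
        evicted = []
        while len(self.context) > self.max_context_turns:
            old = self.context.pop(0)                       # move out of the context
            self.store.add(self.embed(old), old)            # persist for later recall
            evicted.append(old)
        if evicted:
            self.rolling_summary = self.summarize(
                self.rolling_summary + "\n" + "\n".join(evicted))

    def recall(self, query: str, k: int = 3) -> list[str]:
        return self.store.search(self.embed(query), k)      # restore relevant content
```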
Memory-based methods look very similar to RAG, and in fact they are. They can generally be divided into two types.
Fixed memory pool
One type of method uses an external encoder to inject knowledge into the memory pool. For example, Memory Networks were designed to address the forgetting problem in recurrent neural networks (RNNs), and subsequent work computes a weighted sum over the entire memory pool as the representative memory vector. The most representative work, MemoryLLM, integrates a built-in memory pool in the latent space of the LLM. This memory pool is designed to integrate new knowledge effectively within a fixed capacity while minimizing forgetting, thus avoiding unbounded memory growth.
Another type of method uses the language model itself as the encoder to update memory. For example, Memory Transformer and RMT propose adding memory tokens when reading the context, with a memory pool of at most 20 tokens.
Although these fixed-size memory pools show promising results in experiments, their performance is still limited by memory capacity.
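To make the capacity constraint concrete, here is a toy fixed-capacity pool. It is not MemoryLLM's actual mechanism, only an illustration of the two operations such pools need: blending new knowledge into a fixed set of slots, and reading the pool back as a weighted sum.

```python
# Toy fixed-capacity memory pool (not MemoryLLM's actual mechanism): writes blend
# new vectors into fixed slots; reads return an attention-style weighted sum.
import numpy as np

class FixedMemoryPool:
    def __init__(self, slots: int, dim: int, mix: float = 0.3):
        self.pool = np.zeros((slots, dim), dtype=np.float32)
        self.mix = mix                  # how strongly new knowledge overwrites old
        self.cursor = 0

    def write(self, vec: np.ndarray) -> None:
        i = self.cursor % len(self.pool)
        # blend rather than replace outright, to limit forgetting
        self.pool[i] = (1 - self.mix) * self.pool[i] + self.mix * vec
        self.cursor += 1

    def read(self, query: np.ndarray) -> np.ndarray:
        # softmax-weighted sum over the whole pool as the memory readout
        scores = self.pool @ query
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ self.pool
```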
Non-fixed memory pool
Other memory-based methods usually use non-fixed-size memory pools and introduce different forgetting mechanisms to cope with continuous memory growth. In these methods, the memory pool usually takes one of the following forms:
1. Hidden states: For example, MemoryBank stores intermediate representations as persistent memory content.
2. Key-value pairs: Representative methods include KNN-LM and LONGMEM, which save and recall knowledge in a retrievable key-value structure.
3. Vectors in hidden space: For example, Memformer enhances context memory by saving vectors in the latent space.
4. Raw texts: For example, RET-LLM stores knowledge as triples in memory and retrieves information relevant to the current context through API queries.
These methods provide a more flexible memory mechanism, but because they lack structured compression and management, the stored knowledge can be redundant, hurting memory efficiency and model inference performance.
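The key-value form is the easiest to illustrate. Below is a sketch of a growing key-value memory with a simple least-recently-used forgetting rule; real systems such as KNN-LM and LONGMEM use far more refined designs.

```python
# Sketch of a growing key-value memory with least-recently-used forgetting.
# Real systems (KNN-LM, LONGMEM) are far more refined; this only shows the shape.
import numpy as np

class KeyValueMemory:
    def __init__(self, max_items: int = 10_000):
        self.keys: list[np.ndarray] = []
        self.values: list[str] = []
        self.last_used: list[int] = []
        self.clock = 0
        self.max_items = max_items

    def add(self, key: np.ndarray, value: str) -> None:
        if len(self.keys) >= self.max_items:        # forget the stalest entry
            stale = int(np.argmin(self.last_used))
            for lst in (self.keys, self.values, self.last_used):
                lst.pop(stale)
        self.keys.append(key)
        self.values.append(value)
        self.last_used.append(self.clock)
        self.clock += 1

    def recall(self, query: np.ndarray, k: int = 5) -> list[str]:
        sims = np.array([float(query @ key) for key in self.keys])
        top = np.argsort(-sims)[:k]
        for i in top:
            self.last_used[i] = self.clock          # mark as recently used
        self.clock += 1
        return [self.values[i] for i in top]
```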
For more on techniques related to large-model memory, see the following paper:
- Paper title: MemoryLLM: Towards Self-Updatable Large Language Models
- Paper link: https://arxiv.org/abs/2402.04624
Memory data management: Memory system
As mentioned earlier, the memory of LLMs is very similar to a database. Although RAG introduces external knowledge in plain text, it remains a stateless way of working, lacking lifecycle management and persistent representation of memory.
A memory system is essentially very close to RAG retrieval, but it adds richer information organization, management, and retrieval on top of memory storage. By combining memory-storage management with principles from computer operating systems, a more complete memory mechanism can be built, giving LLMs more persistent memory.
Recently, research on LLM memory systems has gradually come into the spotlight. Most of it is inspired by the memory mechanisms of traditional operating systems and establishes a new architecture for memory management. Consider several recent representative works:
Andrew Ng, co-founder of Coursera, former head of Baidu's AI Group, and founder and former head of Google Brain, mentioned in a recent short course:
The input context window of large language models (LLMs) has limited space. Using a longer input context is not only more costly but also slower to process. Therefore, it is crucial to manage the content stored in this context window.
In the paper "MemGPT: Towards LLMs as Operating Systems," the authors propose using an LLM agent to manage this context window. The system is equipped with a large