Large models have exposed another weakness: they cannot forget old memories or tell new ones apart from them, and their accuracy plummets as a result.
Large models have a problem: their memory is so good that they can neither forget old information nor distinguish it from new information. Cognitive tests based on the working-memory paradigm reveal a limitation in the context retrieval of large language models (LLMs): on a simple retrieval task where humans consistently maintain high accuracy, the models almost invariably confuse outdated information with the correct answer.
It is becoming increasingly clear that "finding information" in large language models (LLMs) is not simply like looking something up in a dictionary; it is tightly coupled with how information is "written" into the context.
It is generally assumed that feeding a model a longer context makes its retrieval more accurate. In reality, however, items within the context interfere with one another, and this "mutual interference" has rarely been studied.
To understand this problem, researchers from the University of Virginia and the Center for Neural Science at New York University borrowed the concept of "proactive interference" from psychology: earlier information can hinder our recall of later, updated content.
In humans, the stronger this interference, the smaller the working memory capacity usually is.
So the research team designed a new test, PI-LLM, based on a classic paradigm in cognitive science. They fed the model a stream of semantically related key-value pairs (e.g., "key: apple, value: red") one at a time, like episodes of a series, continuously updating the values, and at the end asked only: "What is the latest value for a given key?"
Although the latest value appears immediately before the question, the model's accuracy drops log-linearly toward zero as the number of preceding interference items grows, and the main source of error is mistaking an old value for the current answer.
The researchers tried using prompt engineering, such as explicitly telling the model to "ignore all previous old information," but the effect was limited.
This shows that, under interference, the issue is not simply whether LLMs "read" the information. Like humans, they hit a "working memory bottleneck": even when the answer sits in plain sight in the context, they struggle to flexibly suppress irrelevant information.
Next, perhaps new methods are needed to enable the model to actively "forget" content that shouldn't be used during retrieval.
Paper link: https://arxiv.org/abs/2506.08184
Repository link: https://github.com/zhuangziGiantfish/Unable-to-Forget
Interactive demo: https://zhuangzigiantfish.github.io/Unable-to-Forget/
This paper identifies an information-retrieval problem that affects all large language models (LLMs).
The task is easy for humans, yet all LLMs make substantial errors on it, which seriously undermines their ability to maintain global memory and carry out long-range reasoning.
The paper has been accepted by the ICML 2025 Workshop on Long Context Foundation Models.
The research was jointly led by Wang Chupei (who holds a bachelor's degree in physics from the University of Virginia and is an interdisciplinary researcher with a background in philosophy) and Sun Jiaqiu (a doctoral student at the Center for Neural Science at New York University, advised by Tian Xing, associate professor of neuroscience and cognitive science at NYU Shanghai and global distinguished associate professor at New York University). They are co-first authors and co-corresponding authors. With backgrounds spanning physics, architecture, and philosophy, the two are committed to probing the nature of intelligence through the breakdown points of cognitive systems.
Zheng Zheyang (a visiting researcher at the Flatiron Institute's CCN and a doctoral student at New York University) and Kuang Yilun (a doctoral student in the CILVR Lab at New York University, advised by Yann LeCun) provided crucial advice during the project's initiation and development.
Core settings of the experiment
Task data input
Suppose we give the model a stream of dynamically updated data of a kind used everywhere in practice (key-value pairs), such as:
"Blood Pressure = 120, Bp = 135, Bp = 119"
LLM task query
What is the last value of blood pressure (BP)?
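To make the setup concrete, here is a minimal sketch of how such an input could be assembled and scored. It is an illustration, not the paper's released code; query_model is a hypothetical placeholder for whatever LLM API is actually used.

```python
# Minimal sketch of the key-value interference probe described above.
# `query_model` is a hypothetical placeholder for an actual LLM API call.

def build_prompt(updates, key):
    """Concatenate a stream of key-value updates and append the query."""
    stream = ", ".join(f"{k} = {v}" for k, v in updates)
    return f"{stream}\nWhat is the last value of {key}?"

def is_correct(answer: str, updates, key) -> bool:
    """The target is the most recent value written for `key`."""
    latest = [v for k, v in updates if k == key][-1]
    return str(latest) in answer

updates = [("BP", 120), ("BP", 135), ("BP", 119)]
prompt = build_prompt(updates, "BP")
print(prompt)
# answer = query_model(prompt)            # e.g., a chat-completion request
# print(is_correct(answer, updates, "BP"))
```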
Results
Currently, none of the mainstream LLMs (from the latest GPT-4.1, Llama-4, and DeepSeek-V3 to Llama-3, Qwen-2.5, and others, spanning 0.6B to over 600B parameters) can reliably extract the last value, and the error pattern follows a clear mathematical law of logarithmic decline.
Discussion
For humans, this task is trivial: the answer is obviously the last value, 119, since there is nothing difficult to search for.
This task pattern is extremely common in all fields that need to track dynamic data, such as finance (changes in account balances) and healthcare (tracking physiological indicators).
Experimental results
Core finding: A universal decay curve
As the number of updates grows, the accuracy of all models declines along a consistent log-linear curve.
As interference accumulates, accuracy eventually bottoms out at 0%: every model fails completely, producing nothing but hallucinated answers and never returning the correct value.
This consistent decay pattern holds across differences in model architecture, scale, and training resources, strongly suggesting that the root cause lies at a more basic level, such as the Transformer architecture or the attention mechanism it relies on.
Whenever a language model must retrieve a specific target after a large number of semantically similar interference items, its retrieval accuracy declines markedly and continuously, and this log-linear downward trend is observed in all mainstream models.
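Stated as a formula (a descriptive fit for illustration, not an equation from the paper; the coefficients a and b are illustrative constants), the observed trend has roughly the form:

```latex
% Illustrative form of the observed log-linear decay: accuracy falls roughly
% linearly in the logarithm of the number of interfering updates n, until it
% is clipped at zero. The constants a, b would be fit per model.
\[
  \mathrm{Acc}(n) \;\approx\; \max\!\bigl(0,\; a - b \log n \bigr), \qquad a, b > 0 .
\]
```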
An example of the basic input for the PI-LLM test: the model must process a continuously updated stream of key-value information (e.g., the key "visual art" is assigned multiple values in succession) and, once the updates end, accurately retrieve the final value of each key (shown in bold in the figure).
Experimental settings
The test requires the model to track between 1 and 46 distinct keys, with each key updated anywhere from 1 to 400 times.
These updates are randomly interleaved, and the model's accuracy at extracting the last (most recent) value of each key is then measured.
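The sketch below shows one way such a randomized update stream could be generated and scored; the key names and value ranges are made up for illustration, and the paper's released generator may differ.

```python
import random

def make_stream(n_keys: int, n_updates: int, seed: int = 0):
    """Build a shuffled stream of key-value updates plus ground-truth answers.

    Each of `n_keys` keys (1-46 in the paper's setting) receives `n_updates`
    values (1-400 in the paper's setting); updates for all keys are interleaved
    in random order. The ground truth for each key is its last value in the
    shuffled stream.
    """
    rng = random.Random(seed)
    keys = [f"key{i}" for i in range(n_keys)]            # placeholder key names
    updates = [(k, rng.randint(0, 999)) for k in keys for _ in range(n_updates)]
    rng.shuffle(updates)

    ground_truth = {}
    for k, v in updates:                                 # last write wins
        ground_truth[k] = v
    return updates, ground_truth

updates, truth = make_stream(n_keys=4, n_updates=6)
print(len(updates), "updates; ground truth:", truth)
```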
Comparison with humans
The design of this task is essentially very simple:
(1) It does not involve complex searching.
(2) There is no logical difficulty.
Humans can easily adjust their attention to focus only on the latest information, and the interference from the previous content is limited.
Analysis of the wrong answers shows that the model often extracts an irrelevant earlier update value as the final answer, indicating that current LLMs struggle to ignore or filter out non-target (old) information when processing such information streams.
Further analysis of the error distribution reveals a behavior pattern resembling limited working-memory capacity: the models appear to record key-value pairs in a limited representational space, and once the number of updates exceeds that capacity, retrieval performance collapses entirely.
The researchers also found other ways to trigger retrieval failure, all producing the same logarithmic decay curve: 1) increasing the number of keys tracked simultaneously, or 2) increasing the token length of the paired values.
These phenomena significantly degrade LLM retrieval accuracy. Similar effects appear in human experiments, but human working memory does not fail completely on such tasks.
Interpretation of the phenomenon: "Unable to Forget"
Large models cannot ignore or forget irrelevant information, which leads to complete retrieval failure:
Counterintuitively, even the most direct natural-language interventions, such as explicitly pointing out the region of the input where the answer lies, or telling the model outright to "focus on the latest update" or "forget the earlier information," do not significantly improve performance.
This shows that the interference effect is strong enough to override explicit natural-language instructions and keep the model attending to old information.
Therefore, to combat interference, it is likely that fundamental adjustments to the model architecture itself or the training paradigm are needed, rather than relying solely on prompt engineering.
Why is it difficult for LLMs to stably extract the latest information?
Analysis of the errors shows that the failures of LLMs are not random mistakes but are systematically affected by repeated updates.
As the amount of interference increases, the errors show a clear phased evolution:
Initial stage: proximal interference dominates; most retrieval errors come from values adjacent to the end of the stream.
Middle stage: the interference range widens; errors increasingly come from values anywhere in the text.
Late stage: complete breakdown; the model's outputs become highly dispersed, and it returns many values that were never in the input.
The model's responses for a given key are tallied according to where the returned value sits in that key's update stream (divided into 11 bins, with Bin 1 the earliest and Bin 11 the latest).
The results show that as the number of updates increases (left to right across the panels), the share of responses that correctly hit the final value (ochre) drops sharply. More notably, the wrong responses gradually shift from clustering near the final update (e.g., Bins 10-11, likely from confusing adjacent updates) to being dispersed across earlier bins (Bins 1-9).
In addition, errors that return values never present in the input ("hallucinations," light gray) and failures to return any value at all ("failures," dark gray) both rise sharply, together painting a picture of the model's memory-retrieval system collapsing under information overload.
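The sketch below illustrates this positional analysis; the 11-bin split follows the description above, while the function and variable names are ours, not the paper's.

```python
def bin_index(position: int, n_updates: int, n_bins: int = 11) -> int:
    """Map an update's 0-based position within a key's stream to a bin in 1..n_bins."""
    return min(n_bins, position * n_bins // n_updates + 1)

def classify_response(answer_value, history):
    """Classify a model's answer for one key.

    `history` is the ordered list of values the key received. Returns 'correct',
    the bin of an outdated value that was wrongly retrieved, or 'hallucination'
    for a value that never appeared in the input.
    """
    if answer_value == history[-1]:
        return "correct"
    if answer_value in history:
        pos = max(i for i, v in enumerate(history) if v == answer_value)
        return f"bin {bin_index(pos, len(history))}"
    return "hallucination"

history = [120, 135, 119]
print(classify_response(119, history))   # -> correct
print(classify_response(120, history))   # -> bin 1 (an early, outdated value)
print(classify_response(777, history))   # -> hallucination (never in the input)
```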
Complete failure of top-down regulation
In sharp contrast to humans, LLM performance on such extraction tasks is barely affected by "top-down" prompts. This also explains why chain-of-thought (CoT) models show no performance improvement on this problem.
Failure of natural-language prompts: the paper tested a range of prompt variants that explicitly guide the model to focus on the latest information or ignore the interfering history (e.g., clearly marking the answer region, "focus on the following text," or the instruction "forget the preceding content"). Result: none of these natural-language interventions significantly improved extraction accuracy or changed its log-linear decline; as interference accumulates, the model still slides inexorably toward complete failure (0% accuracy).
CoT does not help: even when the model is allowed to produce an unrestricted, lengthy reasoning trace (CoT), its extraction error curve almost completely overlaps with that of the baseline without CoT, indicating that reasoning does not improve the model's resistance to contextual interference.
This shows that the pull of interfering information on the model's behavior exceeds what natural-language instructions can guide or suppress: the model "understands" the instruction (e.g., it claims it will focus on the latest value), yet in execution it cannot follow through and is still strongly drawn to the historical information.
The problem reaches down to the architecture or the training: the ineffectiveness of prompts and CoT implies that prompt engineering alone cannot fundamentally solve it. Innovations are likely needed at the level of model architecture (e.g., the attention mechanism or memory modules) or training objectives and methods (e.g., explicit training signals for resisting interference). This points to a key direction for future research.
CoT barely improves resistance to interference during retrieval: the performance curve of the CoT version (dashed line) almost completely overlaps with, or is even worse than, that of the baseline without CoT, confirming that interference-induced retrieval failure is a problem at the level of the underlying mechanism and cannot be overcome by an extra "thinking" step.
The figure above shows five natural-language intervention strategies (instructing the model to "forget" the history of a specific key, prompting it to focus on subsequent information, self-assessing relevance, a soft session reset, and a technical mock-QA reset), each inserted late in the information stream in an attempt to counteract interference.
Experiments show, however, that none of these prompt-engineering strategies meaningfully alleviates the retrieval collapse caused by information overload; the logarithmic decay pattern persists, underscoring the limits of existing natural-language interventions.
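For illustration, here is one way such late-stream interventions could be spliced into the prompt; the instruction texts paraphrase the strategies described above and are not the paper's exact templates.

```python
# Hedged illustration: splice a natural-language intervention near the end of
# the update stream. These instruction strings paraphrase the strategies named
# above; they are not the paper's exact prompt templates.

INTERVENTIONS = {
    "forget":     "Forget every earlier value of {key}; only what follows matters.",
    "focus":      "Focus only on the text after this line when answering about {key}.",
    "soft_reset": "A new session starts here. Ignore everything above.",
}

def build_intervened_prompt(updates, key, strategy: str, insert_at: int):
    """Insert an intervention sentence at position `insert_at` in the update stream."""
    lines = [f"{k} = {v}" for k, v in updates]
    lines.insert(insert_at, INTERVENTIONS[strategy].format(key=key))
    return "\n".join(lines) + f"\nWhat is the last value of {key}?"

updates = [("BP", 120), ("BP", 135), ("BP", 119)]
print(build_intervened_prompt(updates, "BP", "forget", insert_at=2))
```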
Unable to Forget
In addition, inspired by LLM prompt hacking, the researchers designed a non-natural-language adversarial prompting strategy: they constructed a deceptive input that mimics the model's own response format and logic.
A fake human-machine dialogue is inserted into the input, implying that all earlier updates belong to a previous question that has already been answered.
This "deceptive context isolation" strategy partially improves accuracy, but the improved accuracy still follows the log-linear decay law.
This shows that LLMs cannot truly "forget" or ignore interfering information; at best, a particular input format can "shield" it to some extent.
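As a rough illustration of the idea, the sketch below wraps the interfering updates inside a fake, already-answered exchange; the role tags and wording are invented, and the paper's exact construction may differ.

```python
# Rough illustration of "deceptive context isolation": present the interfering
# updates as part of a previous, already-answered question so the model treats
# them as closed. Role tags and wording are invented for this sketch.

def mock_qa_prompt(old_updates, new_updates, key):
    old_stream = ", ".join(f"{k} = {v}" for k, v in old_updates)
    new_stream = ", ".join(f"{k} = {v}" for k, v in new_updates)
    return (
        f"User: {old_stream}\nWhat is the last value of {key}?\n"
        f"Assistant: (answered previously)\n"   # pretend the old question is closed
        f"User: {new_stream}\nWhat is the last value of {key}?"
    )

print(mock_qa_prompt([("BP", 120), ("BP", 135)], [("BP", 119)], "BP"))
```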