Does an AI Agent really remember what it has seen? MemEye conducts a "visual check-up" on multimodal long-term memory.
Over the past year, expectations for what AI Agents can do have kept expanding: they help us organize materials, write code, browse the web, and operate computers. They are also receiving more and more visual input, such as pictures, screenshots, photos, and video frames. A natural next question is: if an Agent sees the layout of my room, my health dashboard, a screenshot of a card game, a product logo, or a photo of a route today, will it still remember them tomorrow?
The question sounds simple, but for multimodal long-term memory it is not, because "seeing" doesn't equal "remembering", and "remembering" doesn't equal "being able to use it later".
Many systems appear to have multimodal memory, but in practice they first convert pictures into text descriptions (captions) and then store those captions as ordinary text in the memory bank. This is efficient and cheap. The problem is that once a picture is compressed into text, any detail the caption omits cannot be recovered from memory later.
Paper: https://arxiv.org/abs/2605.15128
Dataset: https://huggingface.co/datasets/MemEyeBench/MemEye
Code: https://github.com/MinghoKwok/MemEye
Summary of MemEye in one sentence
MemEye is a vision-centric evaluation framework for the long-term memory of multimodal Agents. The question it asks is not "Can the model understand a picture?", but:
When visual information is scattered across long multi-round conversations and multiple sessions, can the Agent retain key visual evidence and select the currently valid information when the state is constantly changing?
This is also the difference between MemEye and many existing benchmarks: it doesn't just give the model more pictures. Instead, it specifically tests visual memory problems that "can't be solved by relying solely on text, captions, or semantic retrieval".
Why do we need a new evaluation? Because caption hacking is too easy
In many multimodal memory tasks, the questions come with pictures, but the answers may be hinted at by the conversation text, given away by the options, or leaked through a rough caption. As a result, the model may seem to "remember the picture" when in fact it only remembers the text.
Take a simple example. If the question is "Did the user upload a photo of the kitchen or the bedroom last time?", a caption saying "This is a photo of the kitchen" would be enough. The model doesn't need to actually retain the picture.
However, real scenarios are often not that simple. Users may ask:
- "Among the three material samples next to the floor last time, which one is the same as the one later placed next to the cabinet door?"
- "In the health dashboard, did the time corresponding to the highest point of the blood sugar curve change later?"
- "In the card game, after Player 2 first changed from 4 cards to 5 cards, how many red cards did Player 3 have in hand?"
- "The label on the display cabinet was replaced later. Which one is currently valid?"
These questions require finer visual evidence: local regions, similar instances, small text, colors, quantities, positional relationships, and state updates over time. An ordinary caption may only mention that "there are several samples", "there is a dashboard", or "several people are playing cards", and it cannot preserve every detail that might be asked about in the future.
So the first core claim of MemEye is: if a benchmark can be easily bypassed with captions, it can hardly prove that the Agent truly has visual memory.
How is MemEye designed? Two axes to break down the problem clearly
MemEye's central design is a two-dimensional coordinate system that breaks "why visual memory is difficult" down along two axes:
First, look at the X-axis: How detailed is the visual evidence?
X1 is scene-level: The model only needs to know the general scene, such as a kitchen, a street, a comic scene, or a health dashboard.
X2 is region-level: The model needs to focus on local areas in the picture, such as a corner of a room, one side of an intersection, or a certain module in an interface.
X3 is instance-level: The model needs to identify a specific object among multiple similar ones, such as two similar characters, several similar cards, or several similar material samples.
X4 is pixel-level: The model needs to read even finer visual information, such as small text, numbers, colors, textures, precise quantities, and OCR-like clues.
Next, look at the Y-axis: How much reasoning over memory is required?
Y1 is atomic retrieval: Finding one relevant piece of evidence is basically enough to answer the question.
Y2 is relational association: The model needs to connect multiple non-conflicting clues, such as tracking the same character or object across sessions.
Y3 is evolutionary synthesis: This is the most difficult level, because later visual evidence may update, overwrite, or refute earlier visual evidence. The model not only needs to find relevant information but also has to determine which state is still valid.
The crucial distinction here is that relevant evidence is not necessarily valid evidence. An old screenshot may be highly relevant to the question, but if it has been overwritten by a newer screenshot, it is stale evidence.
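As a minimal sketch of how the two axes could be attached to a question, the snippet below encodes the X and Y levels as enums and derives a cell label. The names `Granularity`, `Reasoning`, and `QuestionTag` are hypothetical and are not taken from the MemEye code base.

```python
from dataclasses import dataclass
from enum import IntEnum


class Granularity(IntEnum):
    """X-axis: how fine-grained the required visual evidence is."""
    SCENE = 1     # X1
    REGION = 2    # X2
    INSTANCE = 3  # X3
    PIXEL = 4     # X4


class Reasoning(IntEnum):
    """Y-axis: how much reasoning over memory is required."""
    ATOMIC = 1        # Y1
    RELATIONAL = 2    # Y2
    EVOLUTIONARY = 3  # Y3


@dataclass(frozen=True)
class QuestionTag:
    """Hypothetical label placing one question on the X/Y grid."""
    x: Granularity
    y: Reasoning

    @property
    def cell(self) -> str:
        # Human-readable grid cell, e.g. "X4-Y3".
        return f"X{self.x.value}-Y{self.y.value}"

    @property
    def needs_validity_check(self) -> bool:
        # Only Y3 questions involve evidence that may have been overwritten.
        return self.y == Reasoning.EVOLUTIONARY


tag = QuestionTag(Granularity.PIXEL, Reasoning.EVOLUTIONARY)
print(tag.cell, tag.needs_validity_check)
```

Tagging every question this way is what makes it possible to report results per cell instead of as a single total.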
MemEye dataset: Making pictures irreplaceable
Under this framework, MemEye has built a benchmark covering real-life scenarios: 371 questions, 221 sessions, 848 dialogue rounds, and 438 pictures. Each question has two forms: multiple-choice and open-ended.
The tasks cover 8 life scenarios across four categories (leisure, family, occupation, and personal): card game records, comic entertainment, home improvement, outdoor navigation, brand memory, cross-scenario memory, health care, and social chatting.
To avoid "fake visual questions", MemEye also applies a multi-layer filtering mechanism. For example, if the model can answer correctly given only the text and the options, the question may have leaked the answer. If the model can still answer correctly after the picture is replaced with a minimal caption, the original picture is not necessary. If the model still cannot answer when given the correct picture and the correct clues, the question itself may be unclear.
These filters make MemEye more like a visual memory check-up: It tries to ensure that the remaining questions truly require the model to retain and use the key evidence in the image.
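The three checks described above can be sketched as a simple decision function. This is only an illustration under stated assumptions: the `ProbeResult` fields, the check order, and the verdict strings are hypothetical, and the actual pipeline may use several models, repeated trials, or human review.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """Hypothetical outcome of probing one candidate question three ways."""
    text_only_correct: bool     # correct from text + options, with no image
    caption_only_correct: bool  # correct with the image replaced by a minimal caption
    oracle_correct: bool        # correct when given the right image and clues


def filter_verdict(probe: ProbeResult) -> str:
    """Return 'keep' or a reason for dropping the question."""
    if probe.text_only_correct:
        return "drop: answer leaked by text or options"
    if probe.caption_only_correct:
        return "drop: original image not necessary"
    if not probe.oracle_correct:
        return "drop: question may be unclear"
    return "keep"


print(filter_verdict(ProbeResult(False, False, True)))
```

A question survives only if it is unanswerable without the image yet answerable with it, which is the property the filters are meant to enforce.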
Experimental setup: 13 memory methods and 4 VLM backbones
MemEye evaluated 13 memory methods, which can be roughly divided into two categories.
The first category is text-based memory: pictures are converted into dense captions, and a text-only system then handles them through full-context prompting, RAG, reflection, memory updates, and similar mechanisms. This type of method is good at organizing textual state but tends to lose visual details.
The second category is multimodal memory: Retain the original visual input or use image embeddings for retrieval. This type of method can better preserve details, but it also faces another problem: when the history is long and there are many similar pictures, it may find "relevant pictures" but not the "latest valid picture".
The VLM backbones covered in the experiment include Qwen3-VL-8B-Instruct, GPT-4.1-nano, GPT-5.4-mini, and Gemini-2.5-flash-lite. Multiple-choice questions are scored with exact match (EM), and open-ended answers are mainly scored with LLM-as-a-Judge.
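For the multiple-choice side, exact-match scoring can be sketched as follows. The normalization rules here (trimming whitespace and surrounding punctuation, ignoring case) are an assumption for illustration; the benchmark's own scorer may normalize answers differently.

```python
def normalize(answer: str) -> str:
    """Reduce an option answer such as ' (b). ' to a bare uppercase letter."""
    return answer.strip().strip("().:").strip().upper()


def exact_match(prediction: str, gold: str) -> bool:
    """True when the normalized prediction equals the normalized gold option."""
    return normalize(prediction) == normalize(gold)


def em_accuracy(predictions: list[str], golds: list[str]) -> float:
    """Fraction of predictions that exactly match their gold option."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return hits / len(predictions)


print(em_accuracy(["(b)", "A", "c."], ["B", "A", "D"]))
```

Open-ended answers cannot be scored this way, which is why a judge model is used for them instead.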
Research results
1. Captions are okay for coarse-grained questions, but performance drops on details
The results of MemEye show that caption-based memory is still competitive for scene-level and region-level questions. The reason is simple: The overall scene, main objects, and rough regions can usually be covered by text descriptions.
However, a gap starts to appear at the instance level and the pixel level, because the answers may hinge on the identity of a specific object, small labels, small numbers, color differences, or local textures, and captions easily omit this information.
This is not because the captions are badly written, but because a caption is a lossy representation by nature. The captioner has to decide which information is worth writing down, and the details a future question will need may not be among them.
So the first important insight from MemEye is: If a task requires high-precision visual evidence, don't compress pictures into unrecoverable text too early.
2. Retaining the original picture helps, but it's not enough
If captions lose details, does retaining the original picture solve the problem? Not entirely.
Retaining the original picture does help with questions high on the X-axis, especially those that need instance-level and pixel-level visual evidence. However, in Y3 tasks, where the state changes over time, the system must also know which picture represents the current state.
For example, the label in a room was originally A and was later replaced with B. The retrieval system may find both A and B because they are both relevant to the question. But the correct answer depends on which is the latest state.
This is also a very important finding in MemEye: Semantic relevance does not equal temporal validity. A memory system that only looks for similar content is easily misled by old evidence.
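The gap between relevance and validity can be illustrated with a small sketch. It assumes each stored image is already tagged with a `slot` (the thing whose state it shows) and a session index; both fields, the threshold, and the function names are hypothetical, and deciding that two images describe the same state is itself a hard problem that this sketch does not address.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    image_id: str
    slot: str         # what the image shows the state of, e.g. "cabinet_label"
    session: int      # larger means later
    relevance: float  # similarity to the query, from any retriever


def most_relevant(candidates: list[Evidence]) -> Evidence:
    """Pure semantic retrieval: pick the highest-scoring image."""
    return max(candidates, key=lambda e: e.relevance)


def latest_valid(candidates: list[Evidence], min_relevance: float = 0.5) -> list[Evidence]:
    """Keep relevant images, then keep only the newest image per slot."""
    latest: dict[str, Evidence] = {}
    for e in candidates:
        if e.relevance < min_relevance:
            continue
        current = latest.get(e.slot)
        if current is None or e.session > current.session:
            latest[e.slot] = e
    return sorted(latest.values(), key=lambda e: e.relevance, reverse=True)


history = [
    Evidence("label_A.png", "cabinet_label", session=1, relevance=0.91),
    Evidence("label_B.png", "cabinet_label", session=4, relevance=0.84),
]
print(most_relevant(history).image_id)              # the stale label wins on similarity
print([e.image_id for e in latest_valid(history)])  # only the newer label survives
```

In this toy history the older label scores higher on similarity, so a retriever that ranks by relevance alone would surface the stale evidence.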
3. Current systems don't simply "fail to remember"; they get stuck at different stages
The value of MemEye is not just to tell us which method has a higher score, but to help locate where the failure occurs.
Some systems can organize state changes but lose detailed visual information; some retain the original pictures but retrieve outdated pictures from a long history; some find relevant evidence but cannot determine which evidence is still valid; and some are distracted by irrelevant content as the history grows longer and the topics multiply.
Therefore, future multimodal long-term memory systems probably cannot rely solely on a simple vector retrieval module, nor can they simply stuff the entire history into the prompt. A more reliable direction may be a combination of the following three:
- Image memory: Retain fine-grained visual evidence;
- Text or structured memory: Record state changes, updates, conflicts, and overwrite relationships;
- Temporally valid evidence selection: Select the currently valid evidence in a long history.
Significance: Not creating a leaderboard, but diagnosing the memory system
Many benchmarks end up as a total-score leaderboard. But for Agent memory, a total score is not enough, because two systems may have similar totals while failing for completely different reasons.
MemEye is more like a diagnostic tool: it separates the granularity of visual evidence and the depth of memory reasoning, allowing us to clearly see whether the system loses visual details, finds the wrong evidence, or can't handle state updates.
This matters for future multimodal Agents. In the real world, an Agent won't just face a single static picture. It will encounter a constantly changing home, continuously updated health data, an evolving game state, work interfaces that switch frequently, and new evidence constantly emerging in personal context.
If an Agent can't distinguish between "what I've seen before" and "what is still valid now", it will be difficult for it to become a reliable long-term assistant.
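The diagnostic use of the two axes can be sketched as a per-cell accuracy breakdown. The result tuples below are made-up toy data used only to show the mechanics, not numbers from the paper, and `accuracy_by_cell` is a hypothetical helper.

```python
from collections import defaultdict


def accuracy_by_cell(results):
    """results: iterable of (x_level, y_level, correct) tuples."""
    totals = defaultdict(lambda: [0, 0])  # cell -> [num_correct, num_total]
    for x, y, correct in results:
        cell = f"X{x}-Y{y}"
        totals[cell][0] += int(correct)
        totals[cell][1] += 1
    return {cell: hit / n for cell, (hit, n) in sorted(totals.items())}


def overall(results):
    """Plain total accuracy over all questions."""
    results = list(results)
    return sum(int(c) for _, _, c in results) / len(results)


# Toy data: two systems with the same total score but different failure cells.
system_a = [(1, 1, True), (1, 1, True), (4, 3, False), (4, 3, False)]
system_b = [(1, 1, False), (1, 1, True), (4, 3, True), (4, 3, False)]

print(overall(system_a), accuracy_by_cell(system_a))
print(overall(system_b), accuracy_by_cell(system_b))
```

Both toy systems have the same total, but the first one fails entirely on the pixel-level, state-changing cell, which a single leaderboard number would hide.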
Conclusion: True visual memory means remembering correctly, finding what's needed, and being able to use it
MemEye reminds us that multimodal long-term memory is not simply "storing more history" or converting pictures into captions and putting them into a vector library.