
Did Liang Wenfeng postpone V4 to cure OpenClaw's amnesia?

字母AI · 2026-03-16 17:17
Either don't make a move, or make a game-changing one.

When will DeepSeek release V4? Everyone in the AI circle is speculating, but perhaps only Liang Wenfeng knows the correct answer.

Doubao, Qianwen, Yuanbao... Big companies and small ones alike rushed out new versions during the Spring Festival, fearing that a step too late would leave them overshadowed by V4's glory.

Any normal CEO in this situation, with the entire industry waiting eagerly, would have pushed out a semi-finished product long ago.

Grab the limelight first, then iterate slowly: that is standard practice in the Internet industry.

But Liang Wenfeng refuses. A peer close to him put it plainly: "The team is stable and has a solid foundation. They won't rush a release."

Foreign media report that V4 is an architecture-level rebuild: 1 trillion parameters, a million-token context, native multimodality, scheduled for release in April.

The core of this iteration is called LTM: Long-Term Memory.

LTM is a system that builds persistent memory into the model architecture itself. It lets the AI remember who the user is, what has been discussed, and what the user prefers, across conversations and tasks, consolidating important information the way a human does rather than starting from scratch every time it is switched on.

And this ability is exactly what OpenClaw lacks the most.

OpenClaw can do real work for people, but its memory system essentially just writes notes into local Markdown files. At runtime it keeps sending those notes back to the large model, so the longer OpenClaw runs, the more tokens it burns just shipping its memories around.
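The cost growth is easy to see in a toy sketch (the variable names and the crude token counter below are invented for illustration; this is not OpenClaw's actual code):

```python
# Sketch of the "replay your notes every turn" memory pattern described above.
# Every call re-sends the entire memory file, so prompt cost grows each turn.

def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: ~1 token per whitespace-split word.
    return len(text.split())

memory_notes = []            # lines appended to a local Markdown file
prompt_cost_per_turn = []

for turn in range(1, 6):
    memory_notes.append(f"- turn {turn}: user said something worth keeping")
    # Each call must re-send the WHOLE memory file plus the new message:
    prompt = "\n".join(memory_notes) + f"\nuser message {turn}"
    prompt_cost_per_turn.append(count_tokens(prompt))

print(prompt_cost_per_turn)  # cost climbs every turn, roughly linearly
```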

The entire community is trying every trick to fix this: patches, plugins, Skills. But no one can fix it at the root, because the problem lies in the model itself; it has a poor memory by nature.

What LTM aims to do is to cure this problem at the architectural level.

The challenges brought by this update far exceed those of a regular version iteration. Moreover, modules such as the model's emotional interaction and personalized memory have not been fully migrated and need further optimization.

Therefore, Liang Wenfeng is not procrastinating but being prudent.

In an industry where everyone is competing to be the first to release and gain the most attention, Liang Wenfeng chooses to wait until everything is in place before making a move.

R1 became an instant hit not because it had a head start, but because it left its opponents speechless the moment it launched.

He clearly intends to treat V4 the same way: either don't release it, or make it a game-changer when it lands.

01 What is Liang Wenfeng up to?

The popularity of OpenClaw has made people realize that once AI really starts doing work for people, the model's ability to understand and remember context is no longer a nice-to-have; it is the baseline for usability.

An agent that can't remember the previous conversation will make repeated mistakes, lose the task state, and forget what you said after just a few rounds.

So in the past two years, the industry has also introduced many LTM solutions.

For example, the Berkeley team proposed MemGPT in 2023. Drawing on the concept of virtual memory, it allows the model to decide when to load which information from external storage into the context window and when to swap it out.
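MemGPT's paging idea can be sketched in a few lines (an illustrative toy, not the real MemGPT API): a small fixed-size "window" backed by unbounded external storage, with the oldest entry swapped out when the window fills.

```python
# Toy sketch of virtual-memory-style context paging, in the spirit of MemGPT.
# Class and method names are invented for illustration.

from collections import OrderedDict

class PagedMemory:
    def __init__(self, window_slots: int):
        self.window = OrderedDict()   # what the model can "see" right now
        self.archive = {}             # external storage, effectively unbounded
        self.window_slots = window_slots

    def remember(self, key: str, fact: str):
        self.archive[key] = fact

    def page_in(self, key: str) -> str:
        fact = self.archive[key]
        self.window[key] = fact
        self.window.move_to_end(key)
        if len(self.window) > self.window_slots:  # window full:
            self.window.popitem(last=False)       # swap out the oldest entry
        return fact

mem = PagedMemory(window_slots=2)
for k, v in [("name", "Alice"), ("project", "V4"), ("deadline", "April")]:
    mem.remember(k, v)

mem.page_in("name")
mem.page_in("project")
mem.page_in("deadline")
print(list(mem.window))  # only the most recently paged-in facts remain
```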

Mem0, released in 2025, took this approach a step further in engineering: it outperformed OpenAI's built-in memory by 26% on the LOCOMO benchmark while cutting token consumption by 90%. It is also the most widely used memory plugin in the OpenClaw community today.

Recently, there have also been SYNAPSE, which uses spreading activation to simulate human associative memory retrieval, and SimpleMem, which uses recursive semantic compression to solve memory inflation.

However, all these solutions have a common ceiling: they are all middleware running outside the model.

Extraction, compression, and retrieval of memory are all handled by the external system; the model itself plays no part. The quality of memory therefore depends entirely on the engineering of that external system, and what the model receives is uneven.

Moreover, all memories ultimately need to be injected into the model through the context window. Just like the problem OpenClaw faces, the more memories there are, the higher the token cost.

Another point is that the model cannot "learn" from the external memory. In this process, the model is just reading notes organized by others, rather than truly internalizing experience into ability.

Liang Wenfeng is likely to take a completely different path.

Judging from the Engram paper that carries Liang Wenfeng's name and the leaked V4 architecture, DeepSeek's approach is not to bolt a memory system onto the outside of the model but to embed memory ability directly into the model architecture itself.

Engram showed that a dedicated conditional memory space can be created inside the Transformer: static knowledge is stored and retrieved via O(1) hash lookup, occupies no context-window capacity, and adds no inference compute.

More importantly, the "infinite memory mechanism" experiment of Engram shows that the capacity of this memory space can be expanded almost infinitely, and the inference overhead of the model remains constant.

To put it more simply, the only way for current models to "remember" something is to put it into the conversation window. When the window is full, something has to be discarded.

Engram is like installing an independent hard drive for the model. Memories go onto this external drive instead of piling up in the model's own cramped workspace, and when a particular memory is needed, the model simply plugs the drive in.

Moreover, this hard drive can theoretically be expanded infinitely, and the search speed remains constant.
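The data-structure claim at the heart of this can be illustrated with a plain hash map (purely a sketch of the O(1) property; DeepSeek's actual Engram internals are not public, and the names below are mine):

```python
# Retrieval from a hash-addressed store costs O(1) no matter how large the
# store grows, unlike scanning a context window, which scales with its length.

def scan_context(context, key):
    # Context-window style: every lookup touches stored entries one by one, O(n).
    for k, v in context:
        if k == key:
            return v
    return None

def hash_lookup(store, key):
    # Engram-style conditional memory: a single hash probe, O(1).
    return store.get(key)

entries = [(f"fact-{i}", f"value-{i}") for i in range(100_000)]
context, store = entries, dict(entries)

# Both find the same answer, but growing the store 10x grows the scan cost
# ~10x while leaving the hash probe flat.
assert scan_context(context, "fact-99999") == hash_lookup(store, "fact-99999")
```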

If this path is successful, it means that DeepSeek has skipped the entire "external memory" technical paradigm and directly entered the era of "native memory".

If you are familiar with OpenClaw, you will see that Liang Wenfeng is aiming at its weakest link: OpenClaw gives AI hands and feet, but never gave it a brain that can remember.

OpenClaw's memory system has three structural defects.

The first is compression loss.

After the context window is full, OpenClaw will automatically compress the old conversation into a summary to free up space. Although the facts are retained, the context and logic of the conversation are completely lost and cannot be recovered.

In other words, what you were actually discussing, the reasoning chain behind each decision, the tone, the priorities: all gone, and irretrievable.

For example, before compression, the agent remembers a complete debugging plan. After compression, there is only one sentence left, "The user is debugging a bug", and all the specific troubleshooting paths are lost.
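A toy sketch of that failure mode (the conversation and the one-line summarizer are invented for illustration):

```python
# Naive summary compression: the topic survives, the reasoning chain does not.

conversation = [
    "user: the API returns 500 after deploy",
    "agent: logs show a DB timeout; suspect the connection pool",
    "agent: pool size is 5, traffic needs ~40; raising it fixed staging",
    "agent: next step: apply the same pool setting in production",
]

def compress(turns):
    # Collapse everything into a one-line gist, as in the example above.
    return "summary: the user is debugging a bug"

summary = compress(conversation)

assert "debugging" in summary            # the bare fact is retained
assert "connection pool" not in summary  # the actual fix and next step are gone
```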

The second is retrieval failure.

After a few weeks of use, the memory files pile up to hundreds of entries, and retrieval is based on vector similarity. However, vector retrieval can only match semantically similar segments and cannot understand the logical relationships between entries.

For example, suppose you use OpenClaw to draft three plans, scattered across different files, and the third is the one finalized with the client. When you later ask for the finalized plan, all three were "plans sent to the client", so the retriever may well surface the first or second instead.
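The failure can be reproduced with a toy bag-of-words retriever (real systems use learned embeddings, but the limitation is the same: a similarity score cannot encode which plan superseded which):

```python
# Cosine similarity over word counts: all three plans score alike, and the
# top hit for "the plan we sent to the client" is a draft, not the final plan.

import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb)

memory = [
    "plan draft one sent to the client for the spring campaign",
    "plan draft two sent to the client for the spring campaign",
    "plan three finalized with the client for the spring campaign",
]
query = "the plan we sent to the client"

scores = [cosine(query, entry) for entry in memory]
best = memory[scores.index(max(scores))]
# The retriever returns a draft: "sent" matches the drafts more strongly than
# "finalized", and nothing in the vectors says the third plan replaced the rest.
```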

The third is the limited memory capacity.

OpenClaw's memory has two layers: core memory (MEMORY.md), injected in full into the context at the start of each session, and log memory, recalled on-demand through a search tool.

It sounds reasonable, but core memory has a hard cap: a single file is truncated at 20,000 characters, and all bootstrap files together may not exceed 150,000.

Yet the longer you use it, the longer MEMORY.md grows: either information gets truncated and lost, or per-session token consumption climbs linearly.

On the log side, meanwhile, on-demand retrieval depends entirely on the model's own judgment. If it deems a piece of information irrelevant it won't recall it, even when the information is right there. Important details are easily lost.
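A sketch of the truncation behavior, using the caps quoted above (the loader function and its names are hypothetical, not OpenClaw's real code):

```python
# Bootstrap injection with the article's two hard caps: 20,000 chars per file,
# 150,000 chars across all bootstrap files. Anything past a cap is silently cut.

SINGLE_FILE_CAP = 20_000
BOOTSTRAP_TOTAL_CAP = 150_000

def load_core_memory(files: dict) -> str:
    loaded, total = [], 0
    for name, text in files.items():
        clipped = text[:SINGLE_FILE_CAP]      # per-file truncation
        room = BOOTSTRAP_TOTAL_CAP - total
        clipped = clipped[:max(room, 0)]      # global truncation
        total += len(clipped)
        loaded.append(clipped)
    return "\n".join(loaded)

# A MEMORY.md that has grown past the cap silently loses its tail:
memory_md = "important fact. " * 2_000       # 32,000 characters
core = load_core_memory({"MEMORY.md": memory_md})
assert len(core) == SINGLE_FILE_CAP          # 12,000 characters were dropped
```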

Put simply, the three problems are one problem: the window is only so big. Stuff more into it, and either memories get garbled, or you can't find what you need, or it costs too much. OpenClaw's memory is not really "remembering"; it is "copying a pile of notes and then failing to find them".

If V4 really makes this path work at the architectural level, it will not only solve OpenClaw's problems but also turn the model into a "growing model".

The longer you use it, the better it understands you. This is fundamentally different from the user experience of all current large models because no matter how powerful current models are, they are like a blank sheet of paper every time they are opened.

A recent study by Tencent confirms the value of this path from another perspective.

Yao Shunyu, who left OpenAI to join Tencent as chief AI scientist, published his first paper under the new affiliation in February.

The paper is called CL-bench, short for Context Learning Benchmark, and it specifically tests whether large models can truly learn from context.

It's not testing how much knowledge they have memorized but whether they can learn and apply from the materials you provide.

The results are not good.

The average accuracy of all frontier models is just 17.2%. The best performer, GPT-5.1, managed only 23.7%. In other words, if you carefully prepare detailed background material and feed it to the AI, there is a better-than-80% chance it won't really "learn" from it.

Yao Shunyu's judgment in the paper is that the gap between current AI and true intelligence lies not in the amount of knowledge but in the ability to learn. An AI full of knowledge but unable to learn is like a person who has memorized an entire dictionary but can't write.

He expressed a similar view at the AGI-Next Frontier Summit: the core bottleneck keeping large models from high-value applications is whether they can "make good use of the context".

How to handle memory is likely to become the core theme in 2026. Once context learning and memory become reliable, models may be able to achieve autonomous learning.

Liang Wenfeng cannot be unaware of this, which is why the release date keeps slipping.

02 The areas DeepSeek needs to improve

Vision is one thing, and reality is another.

While Liang Wenfeng spent a year in seclusion, his competitors were not standing still waiting for him. DeepSeek has more catching up to do than the outside world imagines.

The first shortcoming is multimodality, which is also the biggest one.

To date, DeepSeek is still a text-only model. It can't process images, video, or audio.

It's not that DeepSeek has no visual ability at all. In January this year it released OCR 2, a small 3B-parameter document-understanding model whose core replaces the traditional visual encoder with DeepEncoder V2, letting the model read document pages the way a human does.

In document-parsing benchmarks, OCR 2 beat far larger models such as Qwen3-VL-235B while using the fewest visual tokens.

But OCR 2 does exactly one thing: extract text, tables, and formulas from documents. In essence it is a one-way conversion from image to text, not general visual understanding.

In other words, OCR 2 proves that DeepSeek has the ability to do well in visual encoding, but there is a huge technological gap between "being able to read documents" and "being able to process videos, audio, and understand natural scenes".

Meanwhile, the other large companies have long since entered the "full-modality" era.

ByteDance's Seedance 2.0 has demonstrated the user base and commercial potential an excellent multimodal model can command. GPT-5.4 natively supports audio, video, and computer operation.

It is reported that one of Liang Wenfeng's main tasks in the past six months has been to make up for the shortcoming in visual content processing.

The second shortcoming is agent ability.

The title of the pinned article on DeepSeek's WeChat official account is "The first step towards the agent era", which shows that Liang Wenfeng knows which direction to go.

As more and more people start using OpenClaw, both large and small companies are emphasizing the agent ability of their models.

Kimi K2.5 can autonomously schedule 100 sub - agents and process 1,500 steps in parallel. ChatGPT's agent function can automatically fill out forms, book tickets, and retrieve information across websites. Claude has launched Agent Teams, where multiple AIs collaborate to complete complex tasks.

The third shortcoming is AI programming.

This is the fastest - growing and most commercially mature track in 2026.

In the programming benchmark SWE-bench Verified, Claude Opus 4.6 scored 80.8%, GPT-5.3 Codex about 80%, and DeepSeek V3.2 only 73.1%.

In the harder SWE-bench Pro, DeepSeek V3.2 scored 40.9%, far below GPT-5.4's 57.7%.

More importantly, the industry has evolved from "Vibe Coding" to "Agentic Engineering", enabling AI to independently complete engineering - level tasks.

The title of Zhipu's GLM-5 paper is "From Vibe Coding to Agentic Engineering". The model can run code continuously for 24 hours, making 700 tool calls and 800 context switches to build a GBA emulator from scratch.

Earlier reports said internal tests showed DeepSeek-V4's programming ability exceeding that of Claude Sonnet 3; but Claude Sonnet 3.5 has since been retired entirely by Anthropic.

The fourth shortcoming is AI search.

Now almost all ChatBot products are connected to the Internet. You rarely see an app that has a separate switch for model Internet access.

OpenAI has ChatGPT Search, and Google has Gemini Embedding 2 search. DeepSeek's search ability has always been a weakness, and its search results often contain hallucinations.

Vectara's test shows that the hallucination rate of DeepSeek R1 is as high as 14.3%, nearly four times that of V3 (3.9%).

In academic citation retrieval the situation is even worse: 91.43% of cited results are wrong, including fabricated paper titles, invented DOIs, and misattributed authors.

DeepSeek itself admits that hallucination is an "inevitable" problem at the current stage.

DeepSeek does not have its own search infrastructure and can only rely on third - party interfaces, so the quality of information sources is uncontrollable.

The model's own fact - checking ability is not strong enough. Even if it gets the correct search results, it may introduce errors in the generation process. The combination of these two problems leads to the poor user experience of "searching but getting inaccurate results".

In the agent era, search is not an added bonus but a necessity.

None of DeepSeek's shortcomings can be solved by minor adjustments. Liang Wenfeng is not just creating a stronger V -