Streaming Intent Detection + Permanent Memory: NUS & NTU Release Pask, Bringing Jarvis-Style AI into Reality
[Introduction] An AI that proactively assists, like a true personal assistant, is central to our vision of AGI. Proactive agents have been proposed many times, but they are hard to realize in real-life scenarios: existing work stays at the conceptual level and fails to address the real-time performance, depth of analysis, and memory that a complex world demands. The team led by Xie Zhifei of Nanyang Technological University proposes Pask, which combines a bottom-layer small-model streaming intent detector with upper-layer agent execution to deliver the first proactive agent that runs in real time, analyzes in depth, and self-evolves on a personal global memory.
During a meeting, when the discussion turned to "Can Pask's context be integrated with Obsidian or Feishu? It will be hard for it to stand alone," Pask thought for three seconds and then offered the following:
"If you collect all user memories into your own system, you are essentially creating a new data silo. That is more controllable in the short term, but it causes two long-term problems: first, it is incompatible with the tools users already use; second, migration costs are high, so the ecosystem ceiling stays low. A better positioning is not a self-contained memory library but a personal context entry plus an active AI orchestration layer in the ecosystem. That is, users' knowledge and behavior stay in their original systems, while you handle understanding, indexing, triggering, coordinating, and executing. The ideal position for Pask is a unified entry point that understands the user best, plus an agent layer capable of proactive action across ecosystems."
The ideal AI, the AGI that humans truly expect, must possess a certain ability: proactivity.
This proactivity is neither the "autonomy" of products like OpenClaw, nor an AI that drafts an email for you after a meeting. It must be a super-assistive brain that deeply understands you, anticipates your intentions, and delivers crucial help at the right time.
It is not something to be invoked like a tool; it needs to surround you, making you feel "proactively cared for and genuinely helped."
The problem may look like a matter of Agent mechanics, but once you start building, it turns out to be far harder than expected:
- Low accuracy. Scattered triggers read like spam, and providing in-depth, real-time assistance is extremely difficult.
- Poor real-time performance. Merely inferring a human need takes 3-4 seconds, while the maximum delay humans tolerate is about two seconds.
- Deep understanding backed by massive memory. A proactive AI ingests a massive number of new tokens every day; it cannot query everything on every turn. How can the system autonomously switch to the correct memory context and, most importantly, truly understand its owner?
After surveying a large number of relevant papers and products, the researchers from Nanyang Technological University found that most prior work sidesteps these key issues, especially real-time performance.
So the researchers set out on their own and proposed Pask, a "demand detection - memory - proactive agent" paradigm comprising a new problem-solving formulation, the IntentFlow streaming intent detection model, a self-evolving memory module, and a proactive-agent engineering architecture.
Paper link: https://arxiv.org/abs/2604.08000
Demand Detection, Long-term Memory, and the Proactive Agent Paradigm
First, we need to determine: What "components" does proactive AI need?
The researchers propose a general paradigm for turning passive models into proactive intelligence, built from three interacting modules: Demand Detection (DD), Long-term Memory (MM), and the Proactive System (PAS).
Demand Detection (DD) is the first and most central step. It listens and watches alongside the human and detects current demands in real time, such as "he needs to know what this word means right now" or "he may be doubting whether the other person is telling the truth."
Long-term Memory (MM) handles the personalized part of the system. It grows and evolves with the user and serves as the "long-term context" throughout.
The Proactive System (PAS) is the underlying execution logic of the whole agent; it runs in a loop and drives the first two components to work together.
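The interaction of the three modules can be sketched as a simple loop. This is a hypothetical Python sketch: the class names (`DemandDetector`, `ProactiveSystem`) and the toy question-mark trigger rule are illustrative, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Long-term Memory (MM): the persistent 'long-term context'."""
    facts: list = field(default_factory=list)

    def update(self, event: str) -> None:
        self.facts.append(event)

class DemandDetector:
    """Demand Detection (DD): watches the input stream for current needs."""
    def detect(self, utterance: str, memory: Memory):
        # Toy rule: an explicit question signals a demand.
        return utterance if utterance.endswith("?") else None

class ProactiveSystem:
    """Proactive System (PAS): runs in a loop, driving DD and MM together."""
    def __init__(self):
        self.dd, self.mm = DemandDetector(), Memory()
        self.actions = []

    def step(self, utterance: str) -> None:
        demand = self.dd.detect(utterance, self.mm)   # 1. detect demand
        if demand is not None:
            self.actions.append(f"assist: {demand}")  # 2. act proactively
        self.mm.update(utterance)                     # 3. evolve memory

pas = ProactiveSystem()
for line in ["The meeting starts now.", "What does 'latency' mean here?"]:
    pas.step(line)
print(pas.actions)  # one proactive action, for the detected question
```

The point of the sketch is the control flow: PAS calls DD on every tick and updates MM regardless of whether a demand fired, so the memory keeps evolving even when the AI stays silent.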
IntentFlow: Streaming Intent Detection Model
Over-proactive AI can be information harassment.
A good proactive AI must strike a precise balance among real-time performance, accuracy, and trigger frequency. Unfortunately, accuracy, memory lookup, and real-time response are inherently in tension.
The greater challenge is that this cannot be done the traditional Agent way. If the entire pipeline must finish within 2 seconds, at most 1 second remains for intent detection, which is not even enough for a single API call.
Intent reasoning plus memory lookup takes at least 10 seconds.
Proactive AI is not something a simple agent mechanism can handle. Inspired by end-to-end streaming models in speech and video, the researchers chose a "model + agents" implementation path: they retrained an intent detection model that runs in real time on the text stream and built IntentFlow, which ingests text-based multimodal information streams together with user memories and autonomously determines what the human needs at the moment.
IntentFlow does not care about the specific final result; it focuses only on what the human needs.
IntentFlow is more like a bridge: on one side is the information stream the user faces, and on the other is the latest and most capable AI in the world. It serves purely as a new entry point for AI to intervene at the right time.
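The streaming idea can be illustrated with a toy stand-in: instead of waiting for a full transcript and then making a slow API call, a stateful detector consumes the text chunk by chunk and fires an intent as soon as enough evidence has accumulated. The trigger phrases, intent names, and 1-second budget below are illustrative assumptions, not IntentFlow's actual mechanism (which is a trained model, not keyword rules).

```python
import time

class StreamingIntentDetector:
    """Toy stand-in for a streaming intent detector: consumes a text
    stream chunk by chunk and fires as soon as evidence accumulates,
    rather than waiting for the complete transcript."""

    TRIGGERS = {"what does": "define_term", "is that true": "fact_check"}

    def __init__(self, budget_s: float = 1.0):
        self.budget_s = budget_s   # per-chunk latency budget
        self.buffer = ""

    def feed(self, chunk: str):
        start = time.monotonic()
        self.buffer += chunk.lower()
        intent = None
        for phrase, name in self.TRIGGERS.items():
            if phrase in self.buffer:
                intent = name
                self.buffer = ""   # reset after firing
                break
        # Each incremental step must stay within the latency budget.
        assert time.monotonic() - start < self.budget_s
        return intent

det = StreamingIntentDetector()
chunks = ["so, what ", "does TTFT ", "mean exactly"]
fired = [det.feed(c) for c in chunks]
print(fired)  # the intent fires mid-stream, on the second chunk
```

Because state persists across `feed` calls, detection latency is bounded by the cost of one incremental step, not by the length of the conversation so far; that is the property that makes a sub-2-second end-to-end budget plausible.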
Memory (MM): A Multi-layer Self-evolving Memory System
The memory system is the core of the co-growth between proactive AI and human, and a proactive AI's memory carries one additional requirement: real-time access.
In Pask, the researchers drew inspiration from the cache / main memory / external storage hierarchy of computer systems and designed a three-layer memory system:
- User Memory (analogous to cache): the AI needs to know, at any time, who the user is and what they prefer.
- Workspace Memory (analogous to main memory): holds all context within the current event.
- Global Memory (analogous to external storage): real-world events usually come in series; global memory acts as a "super-context" that persists across events.
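The hierarchy described above can be sketched as a lookup order plus an archiving step. This is a minimal illustration under assumed semantics; the field names and the `end_event` archiving rule are hypothetical, not Pask's API.

```python
class ThreeTierMemory:
    """Sketch of a cache / RAM / disk-style memory hierarchy:
    fast always-on user facts, per-event workspace, and a global
    store that persists across events."""

    def __init__(self, user_profile: dict):
        self.user = user_profile   # cache-like: always-available user facts
        self.workspace = {}        # RAM-like: context of the current event
        self.global_store = {}     # disk-like: persists across events

    def lookup(self, key: str):
        # Check the fastest layer first, then fall through to slower ones.
        for layer in (self.user, self.workspace, self.global_store):
            if key in layer:
                return layer[key]
        return None

    def end_event(self, event_id: str):
        # Archive the workspace so later events can use it as super-context.
        self.global_store[event_id] = self.workspace
        self.workspace = {}

mem = ThreeTierMemory({"name": "Alice", "lang": "en"})
mem.workspace["topic"] = "quarterly review"
print(mem.lookup("lang"))    # served from the user layer
mem.end_event("meeting-01")
print(mem.lookup("topic"))   # no longer live once the event has ended
```

The design point mirrors the storage analogy: user preferences are cheap to consult on every detection step, while cross-event context is only materialized when a new event needs it.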
PAS: The Streaming System at the Bottom of Proactive AI
How can proactive AI run stably in a complex real-world environment?
Its underlying system is quite complex: each demand requires an independent process, and all environment state must be maintained continuously. The system contains one large DD-MM loop as well as numerous small internal loops.
The underlying system is divided into three layers:
- Front-end: handles the input and output of information streams.
- Server back-end: handles multi-process execution, loop control, and data-storage scheduling.
- AI back-end: connects to external models and provides callable search, tools, and a code-execution environment.
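The "independent process per demand" requirement can be sketched with concurrent workers: each detected demand gets its own worker so that a slow model or tool call never blocks the detection of new demands. Threads stand in here for the processes the article describes; the function names and queue-based result channel are assumptions for illustration.

```python
import queue
import threading
import time

def handle_demand(demand: str, results: queue.Queue) -> None:
    """AI back-end stand-in: each demand runs in its own worker so a
    slow external call cannot stall the main detection loop."""
    time.sleep(0.01)  # pretend to call an external model or tool
    results.put(f"handled: {demand}")

results: queue.Queue = queue.Queue()
demands = ("define term", "fact-check claim", "draft reply")
workers = [threading.Thread(target=handle_demand, args=(d, results))
           for d in demands]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(sorted(results.queue))  # all three demands handled concurrently
```

In a production system, one would likely use real processes or an async task queue rather than threads, but the isolation principle is the same: the per-demand worker owns its own execution, and the server back-end only schedules and collects.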
Experimental Results
Pask was evaluated on ten task types across three domains: learning, work, and daily life; its performance is comparable to that of closed-source models using chains of thought.
On latency, while other open- and closed-source models generally take 3-10 seconds to infer a human's latent needs, IntentFlow completes full intent detection, combining the user's personal, work, and global memories, in only 1.5 seconds.
In the report, the researchers conducted detailed experiments on proactive AI and summarized 12 findings.
The Exploration of Proactive AI Has Just Begun
AI has come a long way in becoming smarter, but understanding humans is just starting.
There is no unified answer in the real world, only complex scenarios, roles, and tasks. Each industry has its own workflow, judgment method, and implicit rules. The same sentence may have completely different underlying demands when spoken by different people.
The core challenge of proactive AI is data.
Real-world intent data hardly exists. The problem is not a lack of manual annotation; logically, annotation often does not hold at all.
Proactive AI no longer follows the logic of "I know what I don't know" but of "I don't know what I don't know," because the deeper, more valuable information often lies beyond the user's current cognition.
Often, people do not really know what they want, let alone what they will need next. The proof that the AI guessed right is not whether a Q&A pair is aligned, but whether the user immediately feels "this is it" upon receiving the help.
The combination of a bottom-layer streaming intent model and upper-layer agent execution is the future of proactive AI.
During Pask's year of development, the researchers spent several months on a pure agents mechanism, and the final conclusion was simple: it won't work.
The reason is straightforward: the delay humans will accept is not even enough for a single complete model call, let alone intent reasoning.
The first author, Xie Zhifei, has a background in speech model development. Facing this core contradiction, he immediately recognized a repeat of the history of real-time dialogue models: before 2024, when voice assistants carried a 3-second delay, nobody could use them.
Once real-time models like GPT-4o emerged, applications of speech models exploded. The team then thought of using a streaming model for intent detection, which led to IntentFlow.
So from the very beginning, Pask did not aim to compete on "smarter" execution agents. It focuses on one thing: guessing what people want, faster and more accurately.
It does not build larger models or more complex call logic. Instead, it tries to answer one question: can an AI, within a continuous context, understand you, know your deeper intentions before you speak, and provide the most valuable help within a very short window at the right moment?
The future of AI is for the ability to proactively understand intent to reach every mobile device around you, so that AI does not merely answer questions but becomes a companion that understands you in real time, stays close to you continuously, and truly knows you.
Author Introduction
The first author and Project Lead of the Pask research team, Xie Zhifei, is a Ph.D. student at Nanyang Technological University. The corresponding authors are Professors Yan Shuicheng, Miao Chunyan, and Ye Deheng.
His research direction is multimodal streaming models. As an undergraduate, he developed the Mini-Omni series, the world's first "open-source GPT-4o"-style real-time dialogue models. Three of his first-author papers have each been cited over a hundred times, and his open-source projects have accumulated over 5k stars.
He later hit it off with Professor Yan Shuicheng and left Tsinghua University to join NUS LV_Lab, becoming Professor Yan's first Ph.D. student after his return to academia.
Reference: https://arxiv.org/abs/2604.08000
This article is from the WeChat official account "New Intelligence Yuan". Editor: LRST.