Thinking Machines, the lab founded by OpenAI's former CTO, presented a prototype of an AI that is always "present".
On May 11th, Thinking Machines released a new model called the Interaction Model. The AI laboratory, founded by former OpenAI CTO Mira Murati, previously published the OPD distillation paradigm that had a profound impact on DeepSeek V4. This time, they claim the newly released model represents the next-generation mode of human-machine interaction.
The starting point of their argument is communication studies.
In 1991, Herbert Clark and Susan Brennan laid out the constraints that communication media place on effective human communication in their classic paper "Grounding in Communication". Thinking Machines adopted three of these conditions as a diagnostic framework and checked the status of current AI interaction systems against them one by one.
Copresence: both parties share the same perceptual field. What you see, hear, and are experiencing can also be perceived by the other party.
Cotemporality: reception is nearly synchronous with production. While you are speaking, the other party is already processing what you say, with no gap of "waiting for you to finish speaking before starting to understand".
Simultaneity: both parties can send and receive at the same time. While you are speaking, the other party can give real-time feedback such as micro-expressions, nodding, or interrupting.
These three conditions are naturally met in face-to-face conversation. When you chat with a friend in a coffee shop, you share the same physical space (copresence). As soon as you start speaking, the other party is listening and understanding (cotemporality). And the other party will frown or nod while you speak to signal "I'm following" or "I don't quite agree" (simultaneity).
Thinking Machines' diagnosis: current AI systems fail the first two conditions entirely. Recent full-duplex voice models have made some progress on the third, but it is still incomplete.
AI has never been truly "present"
Thinking Machines believes the most fundamental way current AI falls short of presence is that every dialogue system is built on the concept of turns.
The user finishes a passage, the model processes it, the model outputs a response. One turn ends, the next begins. This structure cuts off copresence at the root.
First, it lacks copresence. AI perceives you only when you actively provide input. When you are silent, your world does not exist for it. It has no idea whether you frowned, walked to the window, or whether a piece of bad news just popped up on your screen. Its perceptual field is limited to the narrow pipeline of what you "actively push" to it through the keyboard or microphone.
Second, it lacks cotemporality. The model must wait for you to "finish speaking" before it starts processing. Voice Activity Detection (VAD) has to detect a long enough silence to decide your turn is over. During this "waiting for you to finish" gap, the model has no real-time understanding of what you are saying.
Thinking Machines offered an analogy in their blog. Imagine discussing a key disagreement with a colleague, but you can only communicate by email. You write and send, then wait for a reply. The other party writes and sends, then waits for your next email. No one would consider this a good way to work through a complex, collaborative problem.
But this is the interaction mode of all current AI systems.
The third condition, simultaneity, has seen the fastest progress over the past two years. Real-time voice AI has been trying to make systems send and receive at the same time. On May 7th, OpenAI released GPT-Realtime-2, and ByteDance fully rolled out Seeduplex on Doubao. But a closer look at the architectures shows that different companies implement concurrency to very different degrees.
And even then, they only address simultaneity; the first two conditions remain untouched.
Full-duplex at the communication layer, but the model is still waiting for you to finish speaking
GPT-Realtime-2 is a voice model OpenAI launched four days before Thinking Machines' release, and it is currently their most capable real-time interaction offering. Let's first look at what it achieves.
It has GPT-5-level reasoning ability, a 128K context window, and, most importantly, improved parallel tool calling, so you can control systems and invoke tools by voice. It scores 15.2% higher than its predecessor on Big Bench Audio, making it a very strong voice model.
But here we only care about one question: how far has it progressed in terms of the three conditions?
Let's first look at the architecture. The underlying layer of OpenAI's Realtime API is WebSocket, a full-duplex communication protocol. Your audio stream is continuously sent to the server, and the AI's audio stream is continuously returned to you, with both directions open at once. So simultaneity is solved at the communication level: you can start speaking while the AI is speaking, and the AI can keep outputting while you speak. The channel is bidirectional, with no rule that one party must wait for the other to finish before starting.
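For intuition, here is a minimal sketch of full-duplex streaming over WebSocket in Python, using the `websockets` library. The endpoint, message framing, and audio stubs are invented for illustration; this is not the actual Realtime API schema:

```python
import asyncio
import websockets  # pip install websockets

async def read_mic_chunk() -> bytes:
    """Stand-in for a microphone: 20 ms of silence per chunk."""
    await asyncio.sleep(0.02)
    return b"\x00" * 640  # 16 kHz, 16-bit mono, 20 ms

async def send_mic_audio(ws):
    # Upstream direction: push audio continuously, never waiting for a reply.
    while True:
        await ws.send(await read_mic_chunk())

async def receive_ai_audio(ws):
    # Downstream direction: consume the AI's audio as it arrives.
    async for chunk in ws:
        print(f"received {len(chunk)} bytes of AI audio")

async def main():
    # Both coroutines run concurrently over one socket: that is full-duplex.
    async with websockets.connect("wss://example.com/realtime") as ws:  # hypothetical endpoint
        await asyncio.gather(send_mic_audio(ws), receive_ai_audio(ws))

asyncio.run(main())
```

Neither direction ever blocks on the other, which is exactly the property the article means by "both directions open at once".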
The problem lies in the model behind the channel.
Although the WebSocket continuously receives your audio, the model is not "always listening". Between you and the model sits a server-side VAD (Voice Activity Detection) module acting as a gatekeeper. VAD's job is to decide "whether the user has finished speaking". Only when it detects a long enough silence and declares your turn over is the model woken up to process what you just said.
By analogy, the channel is a two-way road where cars can drive in both directions at any time. But the model is a toll station at the end of the road. It doesn't open the gate the moment it sees a car coming; it waits for all the cars to arrive (you finish speaking) and then releases them for processing in one batch.
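A toy version of that gatekeeper, assuming a simple energy threshold (production VADs are model-based, and all the numbers here are illustrative):

```python
import numpy as np

class SilenceGateVAD:
    """Toy end-of-turn detector: the 'gatekeeper' between channel and model.

    Declares the user's turn over after `silence_ms` of low-energy audio.
    """
    def __init__(self, frame_ms=20, silence_ms=500, energy_threshold=1e-4):
        self.frames_needed = silence_ms // frame_ms
        self.energy_threshold = energy_threshold
        self.silent_frames = 0

    def push_frame(self, frame: np.ndarray) -> bool:
        """Feed one audio frame; return True once the turn is judged finished."""
        energy = float(np.mean(frame.astype(np.float64) ** 2))
        if energy < self.energy_threshold:
            self.silent_frames += 1
        else:
            self.silent_frames = 0  # speech resumed: reset the silence clock
        return self.silent_frames >= self.frames_needed
```

Only when `push_frame` finally returns True is the buffered audio handed to the model, which is precisely why cotemporality fails: nothing is understood while you are still talking.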
What about interruption? If you start speaking while the AI is speaking, VAD detects new voice activity, the system cancels the AI's current output, and then waits for you to finish speaking before triggering a new round of generation.
Note that in this flow, the interruption is triggered by VAD, not by the model itself noticing that you have started speaking. The model is told externally to "stop", then waits for a new round of input to accumulate before starting again.
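Put together, the cancel-and-restart flow looks roughly like the state machine below. Every interface here (`speech_started`, `turn_finished`, `cancel_generation`, `generate`) is invented for illustration:

```python
from enum import Enum, auto

class State(Enum):
    LISTENING = auto()
    SPEAKING = auto()

class CancelAndRestartSession:
    """Sketch of the 'cancel and restart' barge-in flow described above.

    The model never notices the interruption itself; the VAD layer cancels
    the running generation and later triggers a fresh one.
    """
    def __init__(self, vad, model):
        self.vad, self.model = vad, model
        self.state = State.LISTENING
        self.buffer = []

    def on_audio_frame(self, frame):
        if self.state is State.SPEAKING and self.vad.speech_started(frame):
            self.model.cancel_generation()        # stop imposed from outside
            self.state = State.LISTENING
        if self.state is State.LISTENING:
            self.buffer.append(frame)
            if self.vad.turn_finished(frame):     # long-enough silence detected
                self.model.generate(self.buffer)  # a new round begins only now
                self.buffer = []
                self.state = State.SPEAKING
```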
So although the underlying channel supports concurrency, the old turn-based pattern is intact, and cotemporality is not addressed at all.
Full-duplex at the model layer, but it still doesn't know what you look like
ByteDance's Seeduplex, launched in April 2025, goes a step further than OpenAI. It is a large voice model that achieves full duplex at the model level.
GPT-Realtime-2's concurrency lives in the communication layer: WebSocket allows simultaneous transmission in both directions, but the model itself still "waits for you to finish before thinking". Seeduplex pushes concurrency inside the model.
Its three-stream architecture (a listening stream, a speaking stream, and a control stream), combined with R-PEC (Relative Position Encoding), lets the model genuinely process input and output at the same time. The listening stream continuously analyzes what you are saying, the speaking stream generates the response in parallel, and the control stream arbitrates between the two in real time.
As a result, the false-interruption rate drops by 50% compared with half-duplex models, and the rate of talking over the user drops by 40%.
This is a real advance in concurrency. GPT-Realtime-2's interruption mechanism is "cancel and restart": the AI is stopped, waits for you to finish, then generates a new round. Seeduplex's interruption is continuous: the AI listens while it speaks, and if it judges that you want to cut in, it yields smoothly, without the "cancel, wait, restart" break. It has upgraded from a walkie-talkie to a telephone.
In terms of the three conditions, this closes the gap on simultaneity: not pseudo-concurrency at the communication layer, but genuine simultaneous processing of input and output streams inside the model.
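To make the idea concrete, here is a conceptual sketch of one three-stream decoding step in PyTorch. This is a toy under assumed dimensions, not Seeduplex's published architecture; the interleaving scheme and the binary control head are invented for illustration:

```python
import torch
import torch.nn as nn

class ThreeStreamStep(nn.Module):
    """One decoding step over listening / speaking / control streams."""
    def __init__(self, d_model=512, vocab=32000):
        super().__init__()
        self.backbone = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.speak_head = nn.Linear(d_model, vocab)  # next speaking-stream token
        self.control_head = nn.Linear(d_model, 2)    # 0 = keep speaking, 1 = yield the floor

    def forward(self, listen_emb, speak_emb):
        # Listening and speaking tokens share one sequence, so the same
        # weights "hear" and "talk" within the same step: no turn boundary.
        h = self.backbone(torch.cat([listen_emb, speak_emb], dim=1))
        last = h[:, -1]
        return self.speak_head(last), self.control_head(last)
```

The key point the sketch captures is that yielding is a prediction the model itself makes every step, rather than a cancellation imposed by an external VAD.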
But what about copresence and cotemporality? Just as with GPT-Realtime-2, they are unchanged.
Both are pure voice models with no visual input. When you are silent, you still don't exist for them. And R-PEC is relative timing encoding: it knows that a given token in the listening stream comes "before" or "after" a given token in the speaking stream, but it has no absolute clock anchoring each position to a specific moment in the real world.
It knows order, but has no continuous sense of presence. When there is no voice activity, the three streams have nothing to process and the model sits idle.
To sum up with an analogy: GPT-Realtime-2 is a walkie-talkie that can be interrupted. You press the button and it stops to listen to you. Seeduplex is a real telephone: two people can speak at the same time without confusion.
But what Thinking Machines wants to achieve is face - to - face interaction.
Face-to-face means that even when no one is speaking, two people still share the same space, the same stretch of time, and the same silence.
Integrate interactivity into the model
Both the walkie-talkie and the telephone solve only one of the three conditions. Thinking Machines aims to address all three. How?
Let's start with the first condition, copresence.
Copresence: Let AI access all the modalities you are interacting with
AI needs to have the same perceptual bandwidth as you. It should be able to see what you can see and hear what you can hear.
So they trained a multimodal model. To meet the requirement of cotemporality, they did not take the current mainstream route of bolting encoder scaffolds onto a voice model to add multimodal ability. Instead, they trained a unified model from scratch.
Cotemporality requires the processing of different modalities to be unified in time. If the system is to align multiple modality streams with real temporal precision, video frames, audio segments, and text tokens must be anchored to the same representation space at the same time. Any delay jitter from external components will break that alignment.
For example, if vision uses one independent encoder (say, a ViT), audio another (say, Whisper), and text a third, the three have different processing delays: vision might take 80ms, audio 40ms, and text is nearly instantaneous.
These differences look small, but they are fatal for the downstream time alignment, as the toy calculation below shows.
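Here is a back-of-the-envelope illustration, reusing the latencies quoted above and assuming (purely for the sake of the example) that the fused stream is aligned in 50ms windows:

```python
# Illustrative only: the encoder latencies are the ones quoted above, and the
# 50 ms alignment window is an assumption for the sake of the example.
ENCODER_LATENCY_MS = {"vision": 80, "audio": 40, "text": 1}
WINDOW_MS = 50  # how finely the fused stream is bucketed in time

event_t = 120  # one real-world event at t = 120 ms, seen by all modalities
for modality, latency in ENCODER_LATENCY_MS.items():
    arrival = event_t + latency
    bucket = arrival // WINDOW_MS
    print(f"{modality:>6}: arrives at {arrival} ms -> alignment bucket {bucket}")

# vision: arrives at 200 ms -> alignment bucket 4
#  audio: arrives at 160 ms -> alignment bucket 3
#   text: arrives at 121 ms -> alignment bucket 2
```

The same instant lands in three different buckets, so the model sees one event as three desynchronized ones.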
This is why Thinking Machines emphasizes in their technical document that "interactivity must be part of the model itself" rather than being assembled from external scaffolds.
Internalize every function that requires temporal precision into the model and train everything jointly from scratch. This is not an aesthetic preference but an engineering necessity.
Concretely: audio input goes through a lightweight dMel (discretized mel spectrogram) embedding layer with minimal preprocessing; video frames are cut into 40×40 patches and embedded with an hMLP (hierarchical MLP) stem; text uses standard embeddings. All of these components and the main Transformer are jointly trained from scratch, in an encoder-free early-fusion setup.
The result is that every modality's path from input to the Transformer is as short as possible, and the delays are as uniform as possible.
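A minimal sketch of what such an encoder-free early-fusion frontend could look like, with illustrative dimensions; the real dMel and hMLP designs are richer than the shallow projections shown here:

```python
import torch
import torch.nn as nn

PATCH = 40  # 40x40 image patches, as described above

class EarlyFusionFrontend(nn.Module):
    """Toy encoder-free early fusion: every modality gets only a shallow
    projection, then all tokens share one jointly trained Transformer."""
    def __init__(self, d_model=512, n_mels=80, vocab=32000):
        super().__init__()
        self.audio_proj = nn.Linear(n_mels, d_model)   # lightweight dMel-style embedding
        self.patch_proj = nn.Sequential(               # hMLP-style patch embedding
            nn.Linear(3 * PATCH * PATCH, d_model), nn.GELU(),
            nn.Linear(d_model, d_model),
        )
        self.text_emb = nn.Embedding(vocab, d_model)   # standard text embedding
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, mel, patches, text_ids):
        # All three paths are shallow and of comparable cost, so their
        # latencies stay close and the joint time alignment holds.
        tokens = torch.cat([
            self.audio_proj(mel),        # (B, T_audio, d)
            self.patch_proj(patches),    # (B, T_patches, d)
            self.text_emb(text_ids),     # (B, T_text, d)
        ], dim=1)
        return self.backbone(tokens)
```

There is no ViT or Whisper in the path: the point is that nothing heavyweight sits between raw input and the shared backbone.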
Unified representation here is not a standalone innovation but an enabling condition: it keeps the modalities from slowing one another down and provides the precision floor for the next step, time anchoring.
Of course, there is another reason to train from scratch: Thinking Machines believes interaction ability itself grows as the model's general ability grows, while a scaffold cannot.
Only a unified model can ride that growth and let interaction scale.
Cotemporality: Give the model a continuous internal clock
Cotemporality is the core of this architecture.
To be "present" all the time, the model needs a continuous internal clock, not an event-driven wake-up.
Current language models are passive along the time dimension. Their view of time is event-driven: wake when there is something to do, sleep when there is nothing.
Thinking Machines inverts this paradigm. Their Interaction Model runs on a 200ms micro-turn: every 200ms, the model processes a batch of input tokens and generates a batch of output tokens. Whether or not you are speaking, whether or not an "event" occurs, this 200ms heartbeat never stops.
Why 200ms? Because this is roughly the minimum meaningful feedback interval in human conversation. Conversation-analysis research puts the shortest natural backchannel ("um", "yes", "and then?") at about 200ms. Any shorter and the feedback feels unnatural; any longer and the other party feels you are "not listening".
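A toy event loop conveys the paradigm shift; `model`, `sensors`, and `speaker` are hypothetical interfaces, and only the 200ms cadence comes from the source:

```python
import time

MICRO_TURN_S = 0.2  # the 200 ms heartbeat

def micro_turn_loop(model, sensors, speaker):
    """Toy event loop for the micro-turn idea (interfaces are made up).

    Unlike an event-driven agent, the loop ticks every 200 ms whether or not
    anything happened: silence is itself an input the model observes.
    """
    while True:
        tick_start = time.monotonic()
        inputs = sensors.drain()      # may be empty: silence is still a turn
        outputs = model.step(inputs)  # may be empty: staying quiet is a choice
        speaker.emit(outputs)
        # Sleep out the remainder of the window to keep the clock steady.
        elapsed = time.monotonic() - tick_start
        time.sleep(max(0.0, MICRO_TURN_S - elapsed))
```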
In each 200ms micro - turn, the model first reads all the input tokens (from various modalities) and then generates the tokens that