She once led the technology at OpenAI; now she aims to rewrite OpenAI's rules.
To be blunt, talking to AI today is not much different from using a walkie-talkie.
You type, hit send, and it starts to think. You stare at the screen for seconds, sometimes minutes, until it spits out a long block of text. When you finish reading, you type the next message.
If human-machine interaction stays at this level, AGI will never arrive.
Human collaboration has never been turn-based. When two people argue face to face, tone, expression, pauses, and interruptions all carry signal, and information flows every millisecond. That is real bandwidth.
A company is rewriting this rule: Thinking Machines Lab, founded by Mira Murati, the former CTO of OpenAI. Her goal differs from her former employer's: OpenAI builds top-tier closed-source models, while she focuses on human-AI collaboration.
To achieve that collaboration, the turn-based mode has to be overturned first.
Yesterday, TML released TML-Interaction-Small. Despite the "Small" in its name, it has 276 billion parameters and is the industry's first large model to natively support real-time, multimodal human-machine collaboration: a response latency of 0.4 seconds, active visual intervention without a wake word, and the ability to listen, watch, think, and speak at the same time.
In benchmarks of intelligence and interaction, it ranks first on both axes. Some competitors are not even qualified to enter.
The second half of the large-model era has shifted from piling up compute and parameters to a revolution in machine emotional intelligence and interaction instinct.
01
External add-ons are a dead end
Think about it: why is a face-to-face argument more efficient than an email thread?
Email is turn-based. You write a paragraph, I reply with another; between them sits the time spent thinking and typing, and all emotion, expression, and tone are lost. Face to face is different: you interrupt me before I finish, and I adjust my words the moment you frown. The exchange of information is parallel, continuous, and two-way.
Current AIs, including the flagship products of OpenAI and Anthropic, are still essentially in email mode.
TML's technical report names this phenomenon "single-threaded real-world perception". Until the user finishes speaking, the AI has effectively lost its five senses: it cannot hear your tone, see your expression, or tell whether a pause is hesitation or just a breath. While generating a response, its perception is frozen too; unless you forcibly interrupt, it plays through like a tape recorder.
The root cause lies in the architecture. Most existing multimodal AIs are stitched together from bolted-on modules: voice activity detection decides whether the user has finished speaking, speech recognition converts audio to text, the large language model thinks, and speech synthesis reads the answer aloud. The pipeline is cascaded and serial, and every stage adds latency and loses information.
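To see why a cascade hurts, a back-of-envelope sketch helps. The stage latencies below are illustrative guesses for a generic pipeline, not measurements from any report:

```python
# Each hand-off in a cascaded voice pipeline adds its own delay before the
# user hears anything. Numbers are illustrative guesses, not measurements.
stages = {
    "VAD waits to confirm end of speech": 0.6,
    "ASR transcribes the utterance":      0.3,
    "LLM produces its first token":       0.8,
    "TTS produces its first audio":       0.3,
}

total = sum(stages.values())
for name, t in stages.items():
    print(f"{name}: {t:.1f} s")
print(f"cascade total: {total:.1f} s")
```

In a serial cascade the delays simply add; an end-to-end model collapses the hand-offs, which is what makes sub-second figures like TML's reported 0.4 seconds plausible.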
Rich Sutton, the father of reinforcement learning, made a point in "The Bitter Lesson" that TML quotes in its report: complex systems that rely on hand-crafted human design are eventually outperformed by approaches built on brute-force computation and a unified architecture.
In plain terms: add-ons have no future. True interaction ability must be native to the model, as natural as breathing, upgrading it from prompt-driven responses to accompanying collaboration.
02
Seamless two-way interaction
It sounds simple but is hard to pull off. Breaking the shackles of the turn-based mode at the technical foundation is like swapping an airplane's engine mid-flight.
TML-Interaction-Small (hereafter TML-Small) can listen, watch, think, and speak simultaneously thanks to four intuitive yet disruptive innovations in its underlying architecture:
1. Time-aligned micro-turns
This is the most imaginative part of the TML architecture.
The traditional Transformer compresses the input and output streams into one ordered token sequence, but text, audio, and video differ so much in information density and complexity that they cannot simply be flattened into the same dimension. TML-Small therefore slices the continuous audio and video streams of the real world into "micro-turns" of 200 milliseconds each.
Within each 200-millisecond slice, the model receives input and generates output simultaneously. It does not have to wait for the user to complete a full turn; it keeps exchanging information with the user in high-frequency fragments.
This calculus-like treatment, dialogue divided into ever-finer intervals, dissolves the artificial "turn boundaries", so the model naturally understands breathing pauses and the handover of speaking turns in conversation. Simultaneous interpretation, the main application of today's audio models, falls out of this design.
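The slicing idea can be sketched in a few lines. This is our own toy illustration, not TML's code; the 16 kHz sample rate and the `listen`/`speak_chunk` method names are assumptions:

```python
# Toy sketch of 200 ms "micro-turns" (our illustration, not TML's code).
SLICE_MS = 200          # duration of one micro-turn, per the report
SAMPLE_RATE = 16_000    # assumed audio sample rate

def micro_turns(audio_stream):
    """Split a continuous sample stream into 200 ms micro-turn slices."""
    samples_per_slice = SAMPLE_RATE * SLICE_MS // 1000  # 3200 samples
    buf = []
    for sample in audio_stream:
        buf.append(sample)
        if len(buf) == samples_per_slice:
            yield buf
            buf = []

def converse(audio_stream, model):
    """Inside every slice the (hypothetical) model both ingests input and may
    emit output, so there is no global turn boundary to wait for."""
    for slice_ in micro_turns(audio_stream):
        model.listen(slice_)          # perception stays live
        chunk = model.speak_chunk()   # may be empty if the model stays silent
        if chunk:
            yield chunk
```

The key contrast with a turn-based loop: input and output interleave at slice granularity instead of alternating whole utterances.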
2. Early fusion without an encoder
In saying goodbye to patchwork, TML also takes fusion to the extreme: early fusion.
Convinced that bolted-on modules are not the road to AGI, the new model uses no standalone speech-recognition system and no separate visual encoder.
Audio is converted directly into dMel features; video frames are cut into 40×40-pixel patches and projected through a lightweight MLP. These raw audio and video slices then enter the same Transformer together with the text.
The secret behind TML-Small's zero-loss, real-time, native multimodal perception is that all components are jointly trained from scratch.
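A minimal sketch of encoder-free early fusion. Only the 40×40 patch size comes from the report; the toy model width, the single linear projection standing in for the MLP, and all shapes are our assumptions:

```python
import numpy as np

# Sketch of encoder-free early fusion (assumed shapes; not TML's code).
D_MODEL = 64   # toy model width (assumption)
PATCH = 40     # 40x40-pixel patches, per the report

rng = np.random.default_rng(0)
# Single linear layer standing in for the "lightweight MLP" projection.
W_patch = rng.normal(size=(PATCH * PATCH * 3, D_MODEL)) * 0.02

def patch_tokens(frame):
    """Split an HxWx3 frame into 40x40 patches, project each to D_MODEL."""
    h, w, _ = frame.shape
    patches = [
        frame[i:i + PATCH, j:j + PATCH].reshape(-1)
        for i in range(0, h, PATCH)
        for j in range(0, w, PATCH)
    ]
    return np.stack(patches) @ W_patch      # (num_patches, D_MODEL)

def fuse(text_emb, audio_emb, frame):
    """Early fusion: all modalities share one token sequence."""
    return np.concatenate([text_emb, audio_emb, patch_tokens(frame)], axis=0)

frame = rng.random((80, 120, 3))            # 2x3 = 6 patches
seq = fuse(rng.random((5, D_MODEL)), rng.random((3, D_MODEL)), frame)
print(seq.shape)                            # 5 text + 3 audio + 6 video tokens
```

Because fusion happens before any modality-specific encoding, the Transformer sees raw audio and pixel patches in the same sequence it sees text, which is what "joint training from scratch" operates on.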
3. A dual-track system: foreground interaction plus background thinking
AI companies worldwide are racking their brains over the impossible triangle of performance, speed, and cost. Many end-to-end speech models achieve millisecond-level latency only by limiting themselves to small talk or basic translation; confronted with complex mathematical reasoning or programming, they simply collapse.
TML offers an elegant architectural answer: dual-track parallelism.
The interaction model always stays in the foreground, online in real time, observing the situation, responding quickly, and keeping things under control, like front-line service staff in a company.
When it meets a complex task that requires deep thinking, search, or tool use, the foreground packages the rich context and hands it to the background for asynchronous processing.
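The hand-off pattern can be mimicked with ordinary async code. This is a hypothetical toy, not TML's system; the timings are compressed for the demo:

```python
import asyncio

# Toy dual-track sketch (hypothetical; the real system is internal to TML).

async def background_solver(task):
    """Slow track: deep reasoning, search, tool use."""
    await asyncio.sleep(0.2)                 # stand-in for heavy computation
    return f"solved: {task}"

async def foreground_loop(utterances):
    """Fast track: keeps answering every tick while hard work runs async."""
    pending = None
    replies = []
    for text in utterances:
        if "integral" in text and pending is None:
            # Package context and hand the hard task to the background.
            pending = asyncio.create_task(background_solver(text))
            replies.append("Let me work that out while we keep talking.")
        else:
            replies.append(f"ack: {text}")   # instant small talk
        if pending and pending.done():
            replies.append(await pending)    # surface the finished result
            pending = None
        await asyncio.sleep(0.05)            # one micro-turn tick (sped up)
    if pending:
        replies.append(await pending)        # flush any unfinished work
    return replies

replies = asyncio.run(foreground_loop(["hi", "solve this integral", "nice weather"]))
print(replies)
```

The point of the design: the foreground never blocks on the slow track, so latency stays flat even while a hard problem is being chewed on behind the scenes.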
4. The compute economics and low-level engineering of 276 billion parameters
Interaction at this frequency puts brutal pressure on compute cost. Fortunately, TML-Small lives up to its name: as a 276B-parameter Mixture-of-Experts (MoE) model, it activates only 12B parameters per inference step.
Meanwhile, to handle the inference overhead of a flood of 200-millisecond fragments, the TML team, borrowing from domestic AI companies, built streaming-sessions support at the systems level: sequences persist in GPU memory, avoiding frequent re-allocation. The optimization has also been contributed to the open-source framework SGLang.
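Sparse activation is what makes the 276B/12B split work. Below is a generic top-k MoE router sketch; only the 276B-total / 12B-active ratio comes from the report, and all sizes and the gating scheme are our own toy choices:

```python
import numpy as np

# Generic top-k MoE router sketch (toy sizes; not TML's code).
rng = np.random.default_rng(0)
N_EXPERTS, D, TOP_K = 8, 4, 2

expert_weights = [rng.normal(size=(D, D)) for _ in range(N_EXPERTS)]
gate_w = rng.normal(size=(D, N_EXPERTS))

def moe_forward(x):
    """Route a token to its TOP_K best experts; the rest stay idle."""
    logits = x @ gate_w
    top = np.argsort(logits)[-TOP_K:]            # indices of chosen experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                                 # softmax over chosen experts
    y = sum(wi * (x @ expert_weights[i]) for wi, i in zip(w, top))
    return y, top

x = rng.normal(size=D)
y, used = moe_forward(x)
print(f"experts used: {sorted(used.tolist())} of {N_EXPERTS}")
print(f"active fraction, TML-style: {12/276:.1%}")
```

With only a few experts firing per token, the cost of a forward pass tracks the active 12B, not the full 276B, which is why high-frequency slicing stays affordable.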
03
Competitors can't even enter the test venue
The leaderboard numbers are thought-provoking.
In the combined evaluation of intelligence and interaction quality, TML-Small sits at the top on both axes: high intelligence and fast response. In the latency test it scores 0.40 seconds, faster than the latest real-time models from OpenAI and Google and approaching the limit of human reflexes.
But two other results are the real shock.
First, TML was forced to invent a new evaluation dimension, because existing commercial models score essentially zero on these tasks. The test is simple: the user asks to be reminded to take a deep breath every 4 seconds. TML-Small's accuracy exceeds 60%; the other models stay silent. They have no sense of time.
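A minimal sketch of what passing that probe requires: the agent must consult a clock on its own initiative, something a turn-based model structurally cannot do because it only acts when prompted. This is our own illustration, not the benchmark's code, with the period shortened for the demo:

```python
import time

# Toy version of the "remind me every N seconds" probe (our sketch,
# not the benchmark code). A turn-based model fails because it has no
# self-driven clock: it only responds when the user sends a message.
def timed_reminders(period_s, count):
    start = time.monotonic()
    reminders = []
    for i in range(1, count + 1):
        # Sleep until the i-th deadline, compensating for drift.
        time.sleep(max(0.0, start + i * period_s - time.monotonic()))
        reminders.append("take a deep breath")
    return reminders, time.monotonic() - start

msgs, elapsed = timed_reminders(period_s=0.05, count=3)  # 4 s in the real probe
print(len(msgs))
```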
Second, the active-vision test. Traditional voice assistants look at the screen only after hearing a wake word. TML-Small watches the screen on its own and interrupts with a prompt the moment the user completes the target. No wake word, no bolted-on modules: for the first time, AI has truly grown eyes and gained a sense of time.
04
The world after the bandwidth leap
Once AI breaks the collaboration-bandwidth bottleneck of the turn-based mode, it is no longer just a text generator on a screen. The business logic of several industries will be rewritten.
The definition of the digital employee changes. Today's AI customer service can only recite a script; it cannot detect a shift in your tone or notice your frown. A digital employee with TML's capabilities can cut short a long-winded answer before you lose patience and volunteer extra information when you hesitate. Industries that run on reading human emotion, such as customer service, sales, and consulting, will feel it first.
Spatial computing and next-generation games will change too. Apple Vision Pro has been criticized for "lacking a soul", and what it lacks is a real-time companion agent. AR glasses powered by TML could host an agent that sees the same scenery you do, warns of danger, and interprets in real time. Game NPCs no longer have to stand frozen at fixed posts; with a sense of time, they can initiate interaction and finally break free of scripts.
Embodied intelligence finally gets a brain. The world that autonomous driving and robots face has no pause button, and the traditional large model's habit of waiting for you to finish speaking before thinking causes fatal delays for a robot. TML's 200-millisecond processing cadence maps neatly onto a robot's low-level perception-decision-control loop, making it arguably the best fit available at this stage.
05
Conclusion
TML admits its limitations at the end of the report: context management for ultra-long conversations and dependence on high-quality networks. A larger model is slated for later this year.
For the past three years, the industry has piled on parameters so that AI can write more complex code and solve harder math problems. One thing is being forgotten:
The greatness of human civilization lies not only in individual flashes of inspiration but also in the instinct for collaboration and communication.
In the pursuit of AGI, teaching machines to breathe in sync with humans and communicate seamlessly matters more than making them smarter.
The walkie-talkie era should come to an end.
This article is from the WeChat official account "Silicon-based Starlight" (author: Siqi), republished by 36Kr with permission.