JD.com and former OpenAI CTO Mira Murati have bet on the same AI track
Imagine this scenario:
An elderly person living alone slips in the living room, and the pain prevents them from calling for help. A wearable device or a home camera "sees" the abnormality. Without waiting for a voice command, the AI issues an alert on its own initiative and quickly contacts the family or an emergency center.
Or you are watching an exciting football game. At the moment of a crucial goal, you have no time to rewind or ask a question, and your AI glasses automatically provide slow-motion analysis and tactical interpretation.
These scenarios are no longer fantasies about the future but real problems that JoyAI-VL-Interaction, the world's first full-stack open-source visual-language interaction model, just open-sourced by JD, attempts to solve.
Over the past two years, the capability boundaries of large models have kept expanding, but mainstream interaction still follows a "turn-based" logic: users ask questions, and models answer. This is efficient, but in many scenarios it does not fit. Many important events happen too quickly for users to ask a question, and in many scenarios there is no voice command at all.
This year, one judgment is becoming an industry consensus: AI is moving from "predicting the next token" to "predicting the next physical state". This also means AI needs to evolve from a passive information processor into an active participant.
At this juncture, JD open-sourced JoyAI-VL-Interaction, the world's first full-stack open-source real-time visual-language interaction model. In a continuous video stream, it can independently determine when to respond, when to remain silent, and when to hand over complex tasks to a back-end model.
What JoyAI-VL-Interaction wants to prove is that AI that truly enters the physical world should not always wait to be asked. It should learn to see, judge actively, and offer help at the right moment.
This is also a stronger signal from JD AI: from model capabilities to industrial scenarios, AI competition is moving from on-screen Q&A to the real world.
Why visual-language interaction?
In the real physical world, a great deal of key information appears at moments when users have no time to ask questions. This "not enough time" is sometimes an experience issue, but more often it is a capability boundary imposed by the model paradigm.
The industry is not unaware of this limitation.
In the first half of 2026, real-time interaction became the hottest keyword in multimodal AI. The industry generally advanced along two routes: making turn-based conversations faster, and making voice calls more natural.
The former emphasizes low latency or arbitrary input and output, but at its core the model still answers only when asked. The latter lets the model speak while listening and be interrupted at any time, which feels closer to a real conversation, but the focus is still on voice.
The problem is that many changes in the real world do not first turn into a sentence. Fires, falls, approaching vehicles, changes in screen content, and production-line abnormalities all appear in images before they appear in language. If AI can only wait for people to speak, it can hardly be truly "present".
Thinking Machines Lab, founded by Mira Murati, made the same judgment as JD. On May 11th, the company proposed the concept of interaction models and released some research preview demos, arguing that the autonomous response paradigm of interaction models leaves more room for human-AI collaboration than the traditional question-and-answer paradigm.
The fact that the two teams converged on the same idea almost simultaneously is a signal in itself: scaling up interactivity as an inherent ability of the model is an inevitable direction for the industry over the next few years.
The difference is that JD puts visual-language at the center, separating voice out as a pluggable I/O channel and making visual-language the "primary driving modality" for the model's autonomous decision-making.
That is, from the moment the camera is turned on, JoyAI-VL-Interaction continuously "watches" the changing images of the physical world and independently decides whether to speak, what to say, and whether to hand over the task.
This is also where the potential of visual interaction lies: it can be used in scenarios such as elderly and child care, assistance for blind users, AI glasses, sports commentary, store inspections, warehousing and logistics, and robot collaboration. Users do not need to formulate a question first; AI can pick up needs from changes in the environment.
Therefore, vision is not just another input method but an irreplaceable perceptual channel for AI moving towards "predicting the next physical state".
This point is also emphasized in JD's technical report on JoyAI-VL-Interaction. The report shows that across six real-world streaming scenarios, JoyAI-VL-Interaction has a win rate of 77.6% against leading domestic models and 87.9% against foreign models; in the monitoring and early-warning scenario, which tests event-capturing ability the most, the win rate reaches 100%. The report argues that the gap is not just about answer quality but about the ability to act at the right moment.
However, achieving active visual interaction is indeed more difficult.
Data for voice interaction is relatively easy to acquire. Large voice-command datasets let the model learn when humans speak, how to interrupt, and how to continue a conversation. The data required for visual interaction is completely different: the model needs to learn which signals in continuously changing images are worth responding to and which should be ignored.
A deeper barrier is the ability to define scenarios. Voice interaction has a natural trigger boundary: when the user starts speaking, the interaction begins. Visual interaction has no clear start or end, and the model must determine the boundaries itself within an unbounded information flow.
This is also where JD's uniqueness lies: the company does not search for scenarios in abstract laboratories but already operates in real business networks such as retail, logistics, health, and industry.
This means that JD AI faces not a single chat entry point but a large number of real tasks: how goods are transferred, how devices cooperate, how robots work with humans, and how abnormalities can be detected in advance. The model can learn from real needs and iterate on real feedback.
Although technical routes involve trade-offs, the future interaction form of AGI must be active intelligence. An intelligent agent must have a complete cycle of environmental perception, autonomous decision-making, and real-time response. So it is not that many companies are unwilling to develop large visual-interaction models; rather, they currently lack the soil in which visual interaction can grow. This is why capital and computing power flocked first to the voice-interaction track.
So JD's choice to start from vision is not only a choice of technical route but also a consequence of its strategic position. Compared with many large-model players, JD is closer to the operating sites of the physical world and has a greater need for AI that can actively perceive and respond in real time.
If you want this day to come faster, someone needs to start earlier.
Lightweight, open-source, and deployable
What does it mean to be the world's first full-stack open-source model?
Redefining the interaction paradigm sounds grand, but in real-world applications the first threshold is very simple: AI should not constantly disturb people, nor should it stay silent when it ought to give a reminder.
People usually expect AI to be more talkative, but in real-time visual interaction, a model that keeps interrupting is not smart. The truly valuable ability is to step in at critical moments and stay silent at irrelevant ones.
Therefore, JoyAI-VL-Interaction trains "silence" as an ability. The model needs to master a three-way judgment: in which scenarios it should respond actively, in which it should remain silent, and in which it should hand the task to other models.
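The three-way judgment described above can be pictured as a small routing function. This is only a sketch under loud assumptions: the names `Action`, `FrameSignal`, and `decide`, the `relevance` and `complexity` scores, and both thresholds are hypothetical illustrations, not part of JoyAI-VL-Interaction. The article describes the model as learning this judgment itself, so the hand-set thresholds here merely stand in for that learned behavior.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    SILENT = "silent"      # nothing worth interrupting the user for
    RESPOND = "respond"    # speak up directly with a short answer
    DELEGATE = "delegate"  # hand the task to a heavier back-end model


@dataclass
class FrameSignal:
    """Hypothetical per-moment summary of what the camera currently shows."""
    description: str
    relevance: float   # 0..1, how much the event matters to the user (assumed score)
    complexity: float  # 0..1, how much reasoning a useful reply would need (assumed score)


def decide(signal: FrameSignal, respond_at: float = 0.6, delegate_at: float = 0.7) -> Action:
    """Route one observation to silence, a direct reply, or delegation."""
    if signal.relevance < respond_at:
        return Action.SILENT       # staying quiet is a deliberate outcome, not a failure
    if signal.complexity >= delegate_at:
        return Action.DELEGATE     # relevant, but too hard for a quick answer
    return Action.RESPOND          # relevant and simple enough to answer now


# Illustrative stream of observations (made-up values).
stream = [
    FrameSignal("curtain moves in the wind", relevance=0.1, complexity=0.1),
    FrameSignal("person falls in the living room", relevance=0.95, complexity=0.2),
    FrameSignal("user points at a long contract", relevance=0.8, complexity=0.9),
]
actions = [decide(s) for s in stream]
for s, a in zip(stream, actions):
    print(f"{s.description}: {a.value}")
```

The point of the sketch is that silence is a first-class outcome alongside responding and delegating, rather than the absence of a decision.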
If this set of capabilities stayed only in papers, its value would be limited. The point of JD's emphasis on "full-stack open-source" this time is to open up the model, the inference system, and the application-building path together, so that developers can actually run, modify, and use it.
JD chose an engineering route that spreads more easily: an 8B-parameter model that can be deployed on a single 3090 graphics card. At this size, individual developers can run it, consumer-grade hardware can support it, and edge devices can host it.
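A rough back-of-the-envelope calculation suggests why an 8B-parameter model plausibly fits on one card. The sketch below assumes the 24 GB of memory on an RTX 3090 and considers model weights only; the precisions listed are assumptions, since the article does not say which one JD uses, and real deployments also need room for the vision encoder, activations, and cache. The helper name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Memory taken by model weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


PARAMS = 8e9      # "8B" parameters, as stated in the article
GPU_GIB = 24.0    # RTX 3090 memory; assumed fully available for this estimate

# Assumed storage precisions (not stated in the source).
precisions = {"fp16/bf16": 2.0, "int8": 1.0, "int4": 0.5}

estimates = {name: weight_memory_gib(PARAMS, b) for name, b in precisions.items()}
for name, gib in estimates.items():
    headroom = GPU_GIB - gib
    print(f"{name}: weights ~{gib:.1f} GiB, headroom ~{headroom:.1f} GiB")
```

Under these assumptions, even 16-bit weights leave several GiB free on a 24 GB card, which is consistent with the article's single-3090 claim but does not by itself confirm the actual memory footprint.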
For real-time visual interaction, being lightweight does not mean reduced capability but a clearer division of labor.
JoyAI-VL-Interaction acts more like a front-end interaction layer, responsible for observing the environment, judging timing, and handling short exchanges. When it encounters complex tasks that require in-depth reasoning, it automatically dispatches them to back-end agents chosen by the user, such as OpenClaw, Codex, and Claude Code, which is why an 8B model is sufficient.
For example, the model can first tell the user "Let me think about it", then hand the difficult problem to the back end while remaining present; once the back end returns the result, it relays the answer to the user. Meanwhile, it can keep helping the user with other real-time interactions.
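The hand-off pattern in this example, acknowledging first, delegating in the background, and staying responsive in the meantime, can be sketched with Python's `asyncio`. Everything here is an assumption for illustration: `backend_agent` and `front_end` are hypothetical names, the `hard:` prefix is a made-up way to mark complex requests, and the sleeps stand in for real inference time. It is not JD's inference system.

```python
import asyncio


async def backend_agent(task: str) -> str:
    """Stand-in for a slow back-end agent doing in-depth reasoning."""
    await asyncio.sleep(0.2)  # simulated long-running work
    return f"back-end result for {task!r}"


async def front_end(events: list[str]) -> list[str]:
    """Handle a stream of events without blocking on delegated tasks."""
    transcript: list[str] = []
    pending: list[asyncio.Task] = []
    for event in events:
        if event.startswith("hard:"):
            # Acknowledge immediately, then delegate in the background.
            transcript.append("Let me think about it")
            pending.append(asyncio.create_task(backend_agent(event[len("hard:"):])))
        else:
            # Short exchanges are answered by the front end itself.
            transcript.append(f"quick reply to {event!r}")
        await asyncio.sleep(0.01)  # yield control, as if waiting for the next frame
        # Relay any back-end answers that arrived in the meantime.
        for task in [t for t in pending if t.done()]:
            transcript.append(task.result())
            pending.remove(task)
    # Stream ended: wait for whatever is still outstanding.
    for task in pending:
        transcript.append(await task)
    return transcript


if __name__ == "__main__":
    for line in asyncio.run(front_end(["hard:plan a route", "door opened", "cup moved"])):
        print(line)
```

The design choice worth noting is that the front end never awaits the back-end task inside the loop, so quick replies continue while the delegated work runs.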
JD also kept the underlying system lightweight: through video encoding, long-term memory, and context compression, the model can keep watching long video streams at low cost and hold end-to-end latency at the sub-second level. For ordinary readers, the key is not these technical terms but the result: AI can stay in real scenarios for longer and at a lower barrier to entry.
This cost-effective, deployable choice also leads directly to JD's open-source strategy. Only when the model is lightweight enough, the system complete enough, and the deployment threshold low enough can real-time visual interaction grow from an experiment by a few teams into an application ecosystem explored jointly by more developers and enterprises.
JD has open-sourced this inference system with a clear goal: to enable anyone with a 3090 or better graphics card and a camera to quickly build their own real-time visual-interaction application.
JoyAI-VL-Interaction has received day-0 support from vLLM-Omni and has been natively integrated into the vLLM-Omni mainline.
Bring AI back to the physical world
The purpose of open-sourcing is to hand the imagining of applications over to a larger market, because the value of a technological breakthrough must ultimately be tested by the real world.
The first batch of envisioned applications for JoyAI-VL-Interaction is intuitive: in live sports broadcasts, AI can automatically commentate at the moment of a crucial goal or a game-winning shot; when monitoring stocks, it can continuously observe screen changes and flag abnormalities; in home care, it can proactively issue warnings when an elderly person falls or a child approaches a dangerous area; paired with AI glasses, it can help users identify roads, products, screens, and their surroundings; when serving blind users, it can convert visual information into real-time assistance.
JD has even higher hopes for robots: a model that knows when to speak, when to remain silent, and when to seek help from the back-end system can make robots more efficient and closer to the "well-mannered" intelligent assistant people expect.
The fundamental reason JD dares to "stir up" this field at this juncture is that it holds physical-world data assets that other large-model players do not have.
In the industry landscape of 2026, physical-world data assets carry particular weight.
2026 is known in the industry as the "Year of Embodied Intelligence Data". Against this backdrop, a sharp contradiction stands out: high-quality physical-interaction data is extremely scarce and falls far short of what large-scale training needs. The bottleneck of algorithm iteration has shifted fully from the model side to the data side.
At this point, JD announced that it will accumulate 10 million hours of high-quality real-world scenario video data within two years and mobilize 600,000 people to take part in the collection.
JD has more than 3,000 real-world business scenarios covering fields such as retail, logistics, health, and industry. This year, it also pioneered a community-grid collection model in Suqian, deploying a large number of self-developed JoyEgoCam head-mounted terminals and mobilizing nearby small and medium-sized enterprises and residents to collect data in real-world work scenarios.
The rollout has been fast. In March, JD announced the completion of the world's first embodied-intelligence data collection center in Suqian; in April, it released the industry's first embodied-data infrastructure covering the entire chain of collection, storage, labeling, training, evaluation, simulation, and testing; in May, JoyEgoCam entered mass production and began continuously collecting first-person perspective data.
This data is the scarcest fuel for training embodied models and visual-interaction models. As embodied data is added to training, the value of JoyAI-VL-Interaction will extend further, from "a model that can actively see" into more concrete physical spaces such as robots, unmanned vehicles, warehouses, stores, and homes.
Between model and application, JoyAI-Echo, which JD open-sourced on June 3rd, also plays a key role. Echo is good at real-time generation of long videos, and Interaction is good at real-time understanding and interaction. Open-sourcing two models within a month means that JD has connected the input and output ends of video multimodality and is treating AI's entry into the physical world as a longer-term priority.
At the launch press conference for this year's 618 event, JD said it wants to become the "world's largest physical-world operation center".
In the era of human-machine interaction, the industry is increasingly concerned with how AI understands the physical world. JD's approach differs from that of most large-model players: the company operates in the physical world.
Warehousing, distribution, retail, health, and industry are all training and testing grounds for AI and embodied intelligence.