Jieyue Xingchen's Big Gamble
Text by | Deng Yongyi
Edited by | Su Jianxun
On May 8th, Jiang Daxin, the long-absent CEO of Jieyue Xingchen, appeared at a media communication meeting in Beijing.
"In the multimodal field, if there is a shortcoming in any aspect, it will delay the process of exploring AGI." Jiang Daxin gave a clear judgment. In the past year, he has repeatedly mentioned on different occasions that multimodality is the only way to achieve AGI.
Among the "Six Little Dragons", compared with other players who are booming in terms of financing and market influence, Jieyue Xingchen's low - key stance is particularly noticeable.
The company has been the least vocal of the group, yet its positioning is distinctive: over the past two years it has stayed out of the frenzy of paid traffic acquisition for applications and made only tentative attempts at To C products.
Multimodality has now become Jieyue's most prominent label. The company is pouring most of its efforts into exploring this path.
Since its founding two years ago, Jieyue has released a total of 22 self-developed base models covering text, voice, image, video, music, reasoning, and more. Sixteen of them, over 70%, are multimodal, which has earned Jieyue the industry reputation of "king of multimodal competition."
However, the development stage of multimodality is different from that of language models.
In language models, where the technical route has converged, almost every company has iterated along similar lines. Multimodal exploration, by contrast, is still at an early stage; from top-tier tech giants to AI startups, everyone is a traveler in the fog.
When Sora shocked the world in 2024, AI entrepreneurs reacted in different ways. "When Sora came out, we were actually quite disappointed. We thought its main line should have been to integrate understanding and generation, but it focused only on generation and did little on understanding," Jiang Daxin said.
Jiang Daxin told Intelligent Emergence that, measured against the technical evolution timeline of language models, the native multimodal direction of "integrating understanding and generation" that Jieyue is betting on may still be at a stage before GPT-1, when the Transformer had only just emerged.
One of the major difficulties in multimodality is that fusing modalities must not degrade the performance of any individual modality, and above all must not cost the model intelligence. The route Jieyue has chosen is an extremely hard one: the same large model must handle both understanding and generation. This has been the company's main line of development since its founding.
In the native multimodal direction, understanding and generation are two sides of the same coin (a toy sketch follows this list), which means:
Capable of understanding: the model can grasp the relationships between objects in a picture, and this understanding needs to be supervised from the generation end.
Capable of generation: the generated content in turn needs to be constrained by understanding, so that it does not step outside how the physical world actually works.
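To make the idea concrete, here is a minimal, hypothetical sketch in PyTorch (not Jieyue's actual architecture; all names and sizes are invented): a single shared backbone carries both an understanding head and a generation head, so the two training objectives flow through the same parameters and constrain each other.

```python
# Hypothetical sketch: one shared Transformer backbone with an understanding head
# (predicting text tokens, e.g. answers or captions) and a generation head
# (predicting discrete image tokens). Joint training means each objective
# supervises the representation the other relies on.
import torch
import torch.nn as nn

class UnifiedUnderstandGenerate(nn.Module):
    def __init__(self, text_vocab=32000, image_vocab=8192, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(text_vocab + image_vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)  # one shared model
        self.understand_head = nn.Linear(d_model, text_vocab)       # "capable of understanding"
        self.generate_head = nn.Linear(d_model, image_vocab)        # "capable of generation"

    def forward(self, tokens):
        h = self.backbone(self.embed(tokens))
        return self.understand_head(h), self.generate_head(h)

model = UnifiedUnderstandGenerate()
tokens = torch.randint(0, 32000, (2, 16))        # toy interleaved token ids
text_logits, image_logits = model(tokens)
print(text_logits.shape, image_logits.shape)     # (2, 16, 32000) and (2, 16, 8192)
```

The only point of the sketch is the shared backbone: losses from the understanding side and from the generation side both backpropagate through the same weights, which is what "mutual supervision" means here in practice.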
It was not until GPT-4o image generation arrived in 2025, when Ghibli-style and anthropomorphic filters set the internet alight, that multimodality returned to the center of the global AI stage. The progress of reasoning models, with DeepSeek as the flag-bearer, also fills in an important piece of the multimodal puzzle.
Multimodality and Agents are unquestionably the two keywords of 2025. Over the past year, Jiang Daxin has repeatedly emphasized on different occasions that multimodality is the only way to AGI, while the Agent is the industry's current early-stage form on that path.
Jieyue is now also laying out its Agent strategy, partnering with companies such as Oppo, Geely Automobile Group, and Qianli Technology to apply Agents in key scenarios including automobiles, mobile phones, embodied intelligence, and IoT.
On the other hand, after DeepSeek's meteoric rise shocked the world and once-solid technical barriers were breached, everyone was pushed to an anxious crossroads: what should the next technical route be?
Large companies have existing scenarios and users, and still have enough time to adjust direction and double down. For large-model startups, the question is especially urgent. Within just two months, some of the "Six Little Tigers" of large models have laid off teams and cut To C applications; others have stopped paid traffic acquisition and refocused on language models.
For startups, exploring more cutting-edge, uncharted territory may be the more important, and in its own way more certain, thing to do at this stage.
For Jieyue, this is also a high-stakes bet. The company has organized several internal teams pursuing different technical routes. "Any route may achieve a breakthrough, so we need to run them in parallel," Jiang Daxin said.
At this communication meeting, in addition to disclosing upcoming model and product plans, Jiang Daxin gave his key judgments on the current multimodal field. The following has been edited and condensed by Intelligent Emergence:
A shortcoming in any aspect of multimodality will delay the AGI process
Pushing the upper limit of intelligence is still the most important thing at present. I have repeated on many occasions that multimodality is the only way to achieve AGI.
In the past two years, we have seen the evolution of models across the industry basically follow this roadmap: simulating the world, then exploring the world, then summarizing the world.
Technically, development is likewise moving from single modality to multimodality, from multimodal fusion to integrated understanding and generation, and then from reinforcement learning onward to AI for Science.
From the very beginning, Jieyue has believed that multimodality is essential to artificial general intelligence. Why?
First, AGI is benchmarked against human intelligence, and human intelligence is multifaceted. Beyond the symbolic intelligence carried by language, everyone also has visual, spatial, and motor intelligence, and these kinds of intelligence have to be learned through vision and other modalities.
Second, from an application perspective, whatever we build, we need AI to be able to listen, see, and speak, so that it can understand the user's environment and communicate more naturally. Multimodality lets the agent fully perceive and understand the world, and therefore better understand the user's intentions.
Therefore, in the multimodal field, a shortcoming in any direction will delay the process of achieving AGI.
At the multimodal level, model development is heading in two directions. First, adding reinforcement learning on top of the pre-trained base model elicits long chains of thought during inference and greatly improves the model's reasoning ability.
From the release of OpenAI's o1 to the release of DeepSeek R1 just before the Spring Festival, I think reasoning models have turned from a trend into a paradigm; they now essentially dominate the language-model field.
This is a very hot area where everyone is scrambling to innovate. A somewhat newer capability that gets less attention is how to bring reasoning into the multimodal field.
For example, show the model a picture of a football match and ask where it was taken and whose home field it is. Answering combines the model's perception with its internal knowledge to reason, which greatly strengthens plain visual understanding.
△Source: Jieyue Xingchen
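As a purely illustrative decomposition of that stadium question (the function names and observations below are invented, not a real API), the task splits into a perception step that turns the image into observations and a reasoning step that chains those observations with internal knowledge; in an actual multimodal reasoning model both happen inside one network.

```python
# Hypothetical illustration: "perceive" stands in for visual perception,
# "reason" stands in for chain-of-thought over internal knowledge.
from typing import List

def perceive(image_path: str) -> List[str]:
    # A real model would extract cues such as pitch markings, crests, banners.
    return ["football pitch", "club crest on the stands", "home-team banners"]

def reason(observations: List[str], question: str) -> str:
    steps = [
        "identify the crest and banners",
        "recall which club uses them",
        "recall that club's home stadium",
    ]
    return (f"Q: {question}\n"
            f"Observed: {', '.join(observations)}\n"
            f"Chain of thought: {' -> '.join(steps)} -> answer")

print(reason(perceive("match.jpg"), "Where is this, and whose home field is it?"))
```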
The second trend is the integration of multimodal understanding and generation. More precisely, it is the integration of understanding and generation in the visual domain, where one model handles both.
Why must they be integrated? Take a video of a teacher writing on a blackboard: the posture of the teacher's hand, the chalk marks appearing on the board. Sora can simulate what the scene will look like next. But if the teacher stops writing halfway and you ask what content will be written next, that prediction requires a model that understands.
Generated content needs to be constrained by understanding, to ensure that what is generated is meaningful and valuable.
Conversely, understanding needs to be supervised by generation. Only when I can generate can I know that I truly understand.
The integration of understanding and generation can better assist in generative reasoning.
Take an example: when a person paints a large picture, they usually don't finish it in one pass. Today's models, by contrast, produce the whole image at once. A person starts from a concept, perhaps thinking about the overall structure first, and then paints the details step by step.
Painting is really a chain-of-thought process. Why isn't the model's generation a chain of thought? Because understanding and generation are not integrated. I should sketch a framework first, then draw on top of that framework, generate conditioned on what has already been generated, and even redo a stroke if it doesn't look right. We are still stuck at this point: without integrated understanding and generation, it is difficult for the model to generate step by step along a long chain of thought.
△Source: Jieyue Xingchen
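A hypothetical sketch of that loop, with stub functions rather than a working image model: draft a coarse layout, let the understanding side critique the draft against the prompt, and regenerate conditioned on the feedback, instead of emitting the whole picture in one shot.

```python
# Stub implementation of "generate step by step with a chain of thought":
# draft -> critique (understanding) -> refine (generation), repeated.
from dataclasses import dataclass

@dataclass
class Canvas:
    description: str  # stand-in for real image latents or pixels

def draft_layout(prompt: str) -> Canvas:
    return Canvas(description=f"rough layout for: {prompt}")

def critique(canvas: Canvas, prompt: str) -> str:
    # Understanding pass: check the draft against the prompt and physical plausibility.
    return "composition fine; add chalk marks and fix the hand posture"

def refine(canvas: Canvas, feedback: str) -> Canvas:
    # Generation pass conditioned on both the current draft and the critique.
    return Canvas(description=canvas.description + f" | refined per: {feedback}")

def generate_with_chain_of_thought(prompt: str, steps: int = 3) -> Canvas:
    canvas = draft_layout(prompt)
    for _ in range(steps):
        canvas = refine(canvas, critique(canvas, prompt))
    return canvas

print(generate_with_chain_of_thought("a teacher writing on a blackboard"))
```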
In the language field, "predict the next token" is the only task, and the entire training process comes down to judging whether that prediction is correct.
When it comes to vision, people ask: can we use one model to do "predict the next frame"? That is the soul-searching question of the visual field, and unfortunately it remains unsolved.
The reason it remains unsolved lies in the complexity of the modality. People say language is very complex, but statistically language is a simple thing, because there are at most a few hundred thousand tokens.
In the visual field, however, even a single picture (never mind video) is hard: a 1024×1024 image is a roughly million-dimensional space, and every dimension is continuous. The difficulty is of a different order.
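A back-of-the-envelope comparison of the two output spaces, assuming a single-channel 1024×1024 frame as in the example above (an RGB frame would be three times larger):

```python
# Illustrative arithmetic only: next-token prediction picks one symbol from a
# discrete vocabulary; next-frame prediction must assign a continuous value to
# every pixel dimension at once.
text_vocab = 300_000        # "at most a few hundred thousand tokens"
pixels = 1024 * 1024        # dimensions of a single 1024x1024 frame
print(f"{pixels:,}")        # 1,048,576 continuous dimensions
print(f"{pixels * 3:,}")    # 3,145,728 if the frame is RGB (assumption, not from the talk)
```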
In language, the emergence of the Transformer in 2017 mattered so much because it was a scalable architecture for integrated text understanding and generation; most earlier models did not scale.
In 2020, the significance of GPT-3 was that, for the first time, a large amount of internet data was poured into this scalable architecture and one model was used to handle all NLP (natural language processing) tasks.
In 2022 came ChatGPT, which added instruction following on top of the pre-trained model; that was GPT-3.5.
GPT-4 strengthened this further. The "GPT-4 moment" refers to the stage at which the model truly reaches a level comparable to human intelligence in that modality.
Now, with the addition of reasoning, we can solve very complex problems.
What's next? Many people think it should be online learning or autonomous learning, that is, the ability to keep learning new knowledge from the environment on its own.
So far, we think the technical route of language models has basically converged, and no other branches have emerged. Therefore, we believe that the visual field can also follow the same route.
The first step is a highly scalable architecture. Measured against language models, multimodal "integration of understanding and generation" is arguably still at the Transformer stage, before GPT even existed: the Transformer appeared in 2017, and GPT-1 in 2018.
DeepSeek tells us the paid-traffic playbook no longer holds
I think DeepSeek's emergence taught us that the paid-traffic logic no longer holds. DeepSeek has never bought traffic; if it opened the floodgates, it would easily exceed 100 million users.
Of course, we need to rethink whether product growth in the AI era really works like the traditional internet, driven by paid traffic acquisition. DeepSeek's rise gives everyone a new perspective on this question.
And it is not only DeepSeek: films like Nezha 2 and games like Black Myth: Wukong share the same trait. None of them relied on traditional large-scale ad spending to accumulate users.
Model breakthroughs come before commercialization. As in the analogy I made just now: first there was GPT-3.5, then ChatGPT; first there were multimodal fusion and reasoning models, then today's maturing Agents. Only after multimodal understanding and generation are integrated, and integrated in a scalable way, can humanoid robots truly generalize.
If a breakthrough is made in this area, its value will not be limited to the Agent field. I most hope to see new breakthroughs in the generalization of embodied intelligence and the establishment of world models.
△Source: Jieyue Xingchen
In 2025, we renamed the product from "Yuewen" to "Jieyue AI," marking its transformation from a ChatGPT-like product into one with Agent capabilities.
On Agent products and commercialization: our intelligent-terminal work is ultimately aimed at the consumer market. Although we cooperate with leading enterprises, the products Jieyue develops with them ultimately serve end users.
Why do we still insist on developing foundation models? Because the industry's technology curve is still in a very steep growth phase.
When Sora first came out in 2024, it shocked everyone; looking back from this year, it no longer seems so amazing. Jieyue does not want to miss the main line of growth in this process, so we will keep investing in foundation-model R&D.
From an application perspective, we have always believed that applications and models are complementary: the model sets the ceiling for the application, while the application supplies concrete scenarios and data for the model.
Data is also very important. The product's form evolves as the model evolves; it is a dynamic process.
Partner with leading companies in the industry and focus on terminal Agents
As model capabilities keep increasing, the kind of model you have determines which applications can be unlocked, mature, and thrive.
In the early