
A bold gamble by "Jieyue Xingchen"

Deng Yongyi | 2025-05-12 08:23
DeepSeek tells us that the logic of paid traffic acquisition doesn't hold.

Text by | Deng Yongyi

Edited by | Su Jianxun

On May 8th, Jiang Daxin, CEO of Jieyue Xingchen, who had long stayed out of the public eye, appeared at a media briefing in Beijing.

"In the multimodal field, if there is a shortcoming in any aspect, it will delay the process of exploring AGI." Jiang Daxin gave a clear judgment. In the past year, he has repeatedly mentioned on different occasions that multimodality is the only way to achieve AGI.

Among the "Six Little Dragons", compared with other players who are booming in terms of financing and market influence, Jieyue Xingchen's low - key stance is particularly noticeable.

It is the quietest of the group, yet memorable for its distinctive positioning. Over the past two years it has stayed out of the frenzy of buying traffic for applications and has made only tentative attempts at To C products.

Multimodality has now become Jieyue's most prominent label. The company is pouring most of its efforts into exploring this path.

Since its founding two years ago, Jieyue has released 22 self-developed foundation models covering text, speech, images, video, music, reasoning, and more. Sixteen of them are multimodal, over 70% of the total, which has earned Jieyue the industry nickname of "multimodal over-achiever."

However, the development stage of multimodality is different from that of language models.

In language models, the technical route has converged, and nearly every company iterates along a similar path. Multimodal exploration, by contrast, is still in its early days: from big tech to AI startups, everyone is groping in the fog.

When Sora stunned the world in 2024, AI founders reacted differently. "When Sora came out, we were actually quite disappointed. We thought its main line should be to integrate understanding and generation, but they focused only on generation and did little on understanding," Jiang Daxin said.

Jiang Daxin told "Intelligent Emergence" that, measured against the technical timeline of language models, the native multimodal direction Jieyue is betting on, "integrating understanding and generation," may still be at the pre-GPT-1 stage, when the Transformer had only just emerged.

One major difficulty of multimodality is that the performance of individual modalities must not be lost during integration; above all, there must be no "intelligence degradation." The route Jieyue has chosen is, frankly, extremely hard: a single large model must be capable of both understanding and generation. This has been Jieyue Xingchen's main line of development since its founding.

Understanding and generation are two sides of the same coin in the native multimodal direction, which means:

Capable of understanding: The model can understand the relationships between objects in the picture, which requires supervision from the generation end.

Capable of generation: The generated content also needs to be controlled by understanding to ensure that it does not exceed the cognitive scope of the physical world.

It was not until GPT-4o's image generation arrived in 2025, and its Studio Ghibli-style and anthropomorphic filters went viral, that multimodality returned to the center of the global AI stage. Progress in reasoning models, represented by DeepSeek, also fills in an important piece of the puzzle for multimodal exploration.

Multimodality and Agents are without question the two keywords of 2025. Over the past year, Jiang Daxin has also repeatedly stressed on different occasions that multimodality is the only way to AGI, and that the Agent is the initial form the industry is exploring on the path to AGI.

Jieyue is now also laying out the Agent field. In application scenarios such as automobiles, smartphones, embodied intelligence, and IoT, it has partnered with companies including Oppo, Geely Automobile Group, and Qianli Technology to put Agents to work in those key scenarios.

On the other hand, after DeepSeek shot to fame and once-solid technical moats were breached, everyone arrived at an anxious crossroads: where should the technical route go next?

Large companies have existing scenarios and users, and still have time to adjust course and double down. For large-model startups, the question is far more urgent. In just two months, some of the "Six Little Tigers" of large models have laid off teams and cut To C applications; others have halted paid traffic acquisition and refocused on language models.

For startups, exploring more cutting-edge, uncharted territory may be the more important, and more certain, thing to do at this stage.

For Jieyue, this too is a high-stakes gamble. The company has now organized several internal teams pursuing different technical routes. "Any of these routes may achieve a breakthrough, and we need to keep them developing in parallel," Jiang Daxin said.

At the briefing, besides disclosing upcoming model and product plans, Jiang Daxin offered his key judgments on the current multimodal field. The following has been edited and condensed by "Intelligent Emergence":

Any shortcoming in multimodality will delay the AGI process

  • Pursuing the upper limit of intelligence is still the most important thing at present. I have also repeated on many occasions that multimodality is the only way to achieve AGI.

    In the past two years, we have seen that model evolution across the industry basically follows this roadmap: simulating the world → exploring the world → summarizing the world.

    Technically, development is also moving from single modality to multimodality, from multimodal fusion to the integration of understanding and generation, and then from reinforcement learning to AI for Science.

  • Jieyue has believed from the very beginning that multimodality is essential for artificial general intelligence. Why?

    First of all, AGI is benchmarked against human intelligence, and human intelligence is diverse. Besides the symbolic intelligence that comes from language, humans also have visual, spatial, and motor intelligence, among others. These kinds of intelligence must be learned through vision and other modalities.

    From an application perspective, no matter what kind of application we develop, we need AI to be able to listen, see, and speak so that it can better understand the user's environment and communicate with users more naturally. Multimodality allows the intelligent agent to fully understand and perceive the world, so that it can better understand the user's intentions.

    Therefore, in the multimodal field, any shortcoming in any direction will delay the process of achieving AGI.

  • At the multimodal level, there are two trends in how models will develop. First, layering reinforcement learning on top of the pretrained base model can elicit long chains of thought during inference, greatly improving the model's reasoning ability.

  • From OpenAI's release of o1 to DeepSeek's release of R1 before the Spring Festival, I think this marks the point where reasoning models went from a trend to a paradigm. Reasoning models now essentially dominate the language model field.

    This is a crowded area where everyone is scrambling to innovate. A newer, less-noticed direction is how to bring reasoning into the multimodal field.

    For example, if I show a picture of a football match and ask the model where the photo was taken and whose home ground it is, the model combines its perceptual ability with internal knowledge to reason, which greatly enhances plain visual understanding.

△Source: Jieyue Xingchen

  • The second trend is the integration of multimodal understanding and generation. More precisely, it is the integration of understanding and generation in the visual field, where both understanding and generation are completed by one model.

    Why must the two be integrated? Take a video of a teacher writing on a blackboard: the posture of the teacher's hand, the chalk marks left on the board. Sora can simulate what the scene will look like next. But if the teacher stops writing halfway and we want to predict what he will write next, that requires an understanding model to make the prediction.

    The generated content needs to be controlled by understanding to ensure that it is meaningful and valuable;

    Conversely, understanding needs to be supervised by generation. Only when I can generate can I know that I have truly understood.

    The integration of understanding and generation can better assist in generative reasoning.

    Let me give an example: When a person is painting a large picture, they usually don't finish it all at once. Now, when the model generates a picture, it produces the whole picture at once. But when a person paints, they first have a concept, think about the overall structure, and then paint the details step by step.

    Painting is really a chain-of-thought process. Why don't our models generate in a chain-of-thought way? Because understanding and generation are not integrated. I should sketch a framework, add details on top of it, generate, and then keep generating based on what has already been generated; if a stroke looks wrong, I can go back, revise it, and generate again (see the sketch below). For now we are still stuck on exactly this problem, the lack of integration between understanding and generation, which makes it difficult to generate with a long chain of thought.

△Source: Jieyue Xingchen
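
To make the loop Jiang describes concrete, here is a minimal sketch of an understanding-supervised, coarse-to-fine generation process. It is not Jieyue's actual system: the function names, the painting plan, and the text-based canvas are hypothetical stand-ins, stubbed out so the control flow runs as written.

```python
# Hypothetical sketch of "chain-of-thought generation": generate coarse-to-fine,
# with an understanding step supervising every generation step.

from dataclasses import dataclass, field
from typing import List


@dataclass
class Canvas:
    """The evolving picture, represented abstractly as a list of edit steps."""
    steps: List[str] = field(default_factory=list)


def generate_step(canvas: Canvas, instruction: str) -> str:
    """Hypothetical generation call: produce the next partial edit."""
    return f"draw({instruction}) on top of {len(canvas.steps)} earlier steps"


def critique_step(canvas: Canvas, step: str) -> bool:
    """Hypothetical understanding call: check that the new step stays
    consistent with the scene so far. Stubbed to accept everything."""
    return True


def chain_of_thought_paint(prompt: str) -> Canvas:
    """Paint the way a person does: layout first, then details, revising as needed."""
    canvas = Canvas()
    plan = ["overall layout", "main subjects", "background details", "final touches"]
    for instruction in plan:
        step = generate_step(canvas, f"{prompt}: {instruction}")
        # Understanding supervises generation: redo the step if it breaks consistency.
        if not critique_step(canvas, step):
            step = generate_step(canvas, f"redo {instruction}")
        canvas.steps.append(step)
    return canvas


if __name__ == "__main__":
    for i, s in enumerate(chain_of_thought_paint("a teacher writing on a blackboard").steps, 1):
        print(i, s)
```

In a truly unified model, generate_step and critique_step would be the generation and understanding halves of the same network, which is exactly the integration Jiang argues is still missing today.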

  • In the language field, "predict the next token" is the only task, and the entire training process checks whether your prediction is correct.

    When it comes to vision, people naturally ask: can a model "predict the next frame"? That is the soul-searching question of the visual field. Unfortunately, it remains unsolved.

    The reason it is unsolved lies in the complexity of the modality. People say language is very complex, but statistically speaking language is simple, because there are at most a few hundred thousand tokens.

    In the visual field, however, even a single picture (never mind video) is another matter: a 1024×1024 image has about a million dimensions, and each dimension is a continuous space. The difficulty is of a different order.
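
To make the contrast concrete, here is a rough back-of-envelope calculation of the numbers cited above; the vocabulary size is an illustrative assumption, not a figure from the talk.

```python
# Discrete next-token prediction vs. continuous next-frame prediction,
# using the orders of magnitude mentioned above (illustrative only).

VOCAB_SIZE = 150_000                    # assumed order of magnitude for an LLM vocabulary
WIDTH, HEIGHT, CHANNELS = 1024, 1024, 3

pixels = WIDTH * HEIGHT                 # 1,048,576 -- the "one million dimensions"
continuous_values = pixels * CHANNELS   # ~3.1 million continuous values per RGB frame

print(f"next-token prediction: one choice out of {VOCAB_SIZE:,} discrete tokens")
print(f"next-frame prediction: {continuous_values:,} continuous values per frame")
```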

  • In the language field, the emergence of the Transformer in 2017 was of great significance to the industry because it provided a scalable architecture for integrating text understanding and generation. Before that, most other models did not scale.

    The significance of GPT-3 in 2020 was that, for the first time, we put a large amount of Internet data into this scalable architecture and used a single model to handle all NLP (natural language processing) tasks;

    In 2022, ChatGPT emerged, which added instruction-following on the basis of the pre-trained model. This is what GPT-3.5 did;

    With GPT-4, this ability was further enhanced. The "GPT-4 moment" refers to the stage at which a model in a given modality truly reaches a level comparable to human intelligence.

    Now, with the addition of reasoning, we can solve very complex problems.

  • What's next? Many people think it should be online learning or autonomous learning, that is, the ability to continuously learn new knowledge based on the environment.

    So far, we believe the technical route of language models has essentially converged, with no other branches. We therefore believe the visual field can follow the same route.

    The first step is a highly scalable architecture. Measured against language models, the "integration of understanding and generation" in multimodality is arguably still at the Transformer stage, before GPT even existed: the Transformer appeared in 2017, and GPT-1 came out in 2018.

DeepSeek tells us that the logic of paid traffic acquisition doesn't hold

  • I think the emergence of DeepSeek taught us that the logic of paid traffic acquisition doesn't hold. DeepSeek has never bought traffic, yet if it opened itself up to the traffic, it would easily exceed 100 million users.

    Of course, we need to rethink whether product traffic growth in the AI era really works the way it did on the traditional Internet, by buying traffic. DeepSeek's emergence gives everyone a new perspective for re-examining this question.

    It is not only DeepSeek: films like "Ne Zha 2" and games like "Black Myth: Wukong" have something in common, too. None of them relied on traditional large-scale paid promotion to build their audience.

  • Model breakthroughs precede commercialization. As I said in the analogy just now: first there was GPT-3.5, then ChatGPT; first there were multimodal fusion and reasoning models, then today's relatively mature Agents. Only with the integration of multimodal understanding and generation, especially scalable integration, can humanoid robots truly achieve generalization.

    If a breakthrough is made in this area, its value will not be limited to the Agent field. I most hope to see new breakthroughs in the generalization of embodied intelligence and the establishment of world models.

△Source: Jieyue Xingchen

  • In 2025, we renamed the product from "Yuewen" to "Jieyue AI," marking its transformation from a ChatGPT-like product into one with Agent capabilities.

    In terms of Agent products and commercialization, our smart terminals are in fact To C. Although we cooperate with leading enterprises, the products Jieyue develops with them ultimately serve C-end users.

  • Why do we still insist on doing R&D on foundation models? Because I think the industry's technology curve is still climbing very steeply.

    When Sora first came out in 2024, it shocked everyone; looking back this year, people no longer find it that amazing. Jieyue does not want to miss the mainstream line of growth in this process, so we will keep doing R&D on foundation models.

    From an application perspective, we have always believed that applications and models are complementary: the model determines the application's ceiling, while the application supplies concrete scenarios and data for the model.

    Data is also very important. The product form evolves with the evolution of the model, which is a dynamic development process.

Partner with industry-leading companies and focus on terminal Agents

  • As model capabilities keep improving, the type of model determines which applications can be unlocked, mature, and thrive.

    In the early days, various chatbots were very popular. After Agents emerged, we could use them to solve math problems and write code;

    The next