Soul CTO Tao Ming: How Can Humans and AI "Keep Chatting"? | An Exclusive Interview with 36Kr
Author | Song Wanxin
Editor | Zheng Huaizhou
Entering 2024, domestic large models have begun to take a development path different from that of their overseas counterparts: shifting from investment in underlying models to exploration of the application layer.
On the consumer side, figuring out what ordinary users actually need from large models is the key for vendors to turn AI into real applications.
Some time ago, at the GITEX GLOBAL conference in Dubai, Soul App demonstrated its latest progress in bringing large models into social scenarios, including its newest self-developed 3D virtual human with multimodal AI interaction. At the conference, 36Kr sat down with Soul App CTO Tao Ming.
Ever since large models built on speech and semantic understanding emerged, "chatting" has been one of their native application scenarios. Today, many large-model vendors also use chat as the entry point for search and interaction in their products.
But further questions follow. Why would users want to chat with a bot? How long can such one-on-one chatting last? How large is this demand, really?
Tao Ming told 36Kr that, judging from Soul's practice, if people and AI are not placed in a shared scenario, "keeping the conversation going" has a high threshold. This is a difficulty that today's AI chat products face collectively.
"AI only having cognitive ability is not personified enough. It must also have perception and long-term memory ability to bring more experiences to users."
Soul officially started R&D on AIGC in 2020. It has since launched its self-developed language large model Soul X, along with voice capabilities including large models for voice generation, voice recognition, voice conversation, and music generation. In 2024, Soul's AI large-model capabilities were upgraded as a whole to a multimodal, end-to-end large model.
Tracing the company's genes back further, when Soul was founded in 2016 it was a stranger-social product built on AI recommendation technology.
At the time, Soul did not import real-world relationships from the address book or rely on LBS. Instead, it took an AI approach: the algorithm-driven Lingxi engine analyzed users' content and behavior on the platform and recommended other users with whom a social connection seemed likely.
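Soul has not published how the Lingxi engine works internally; recommendation of this kind is typically built on representations of user content and behavior. The sketch below only illustrates that general idea, and every name in it (embed_profile, recommend_users, the toy vocabulary) is invented for the example rather than taken from Soul's system.

```python
# Illustrative sketch only: Lingxi's real internals are not public.
# A bag-of-words vector stands in for a learned user embedding.
import numpy as np

def embed_profile(posts: list[str], vocab: dict[str, int]) -> np.ndarray:
    """Turn a user's posts into a normalized bag-of-words vector."""
    vec = np.zeros(len(vocab))
    for post in posts:
        for token in post.lower().split():
            if token in vocab:
                vec[vocab[token]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def recommend_users(target: np.ndarray, candidates: dict[str, np.ndarray], top_k: int = 3) -> list[str]:
    """Rank candidate users by cosine similarity to the target user's embedding."""
    scores = {uid: float(target @ emb) for uid, emb in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy usage: a hiking-and-jazz user is matched with the jazz fan, not the coder.
vocab = {"hiking": 0, "jazz": 1, "coding": 2}
me = embed_profile(["Love hiking and jazz"], vocab)
others = {"u1": embed_profile(["jazz every weekend"], vocab),
          "u2": embed_profile(["coding all day"], vocab)}
print(recommend_users(me, others, top_k=1))  # ["u1"]
```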
In the era of large models, now that Soul has better AI tools, how will it innovate in social scenarios? Below is an edited transcript of the conversation between 36Kr and Soul CTO Tao Ming.
01 About the application of AI in Soul
36Kr: At GITEX, Soul mainly showed its domestic version, and the overseas version has not been launched yet. Why is that?
Tao Ming: We do have overseas products, but their experience and the stickiness they create for users are not yet sufficient, so they have not been released.
That said, whether it is the overseas product or the main product, they simply target different markets with different product forms, functions, and scenarios. Underneath, we hope to connect them, so the basic technical capabilities on display are the same.
36Kr: How are cost reduction and efficiency gains reflected in Soul's two AI models?
Tao Ming: One is a 3D large model on the visual side. Soul has been exploring this since 2020, hoping users can create another persona in this space and generate their own avatar with one click. The other is a multimodal large model, so that users on Soul can converse not only with real people but also with AI. These two lines are advancing in parallel, and the perception layer of the AI large model is already quite mature.
At the organizational level, the functions that used to be split into NLP, 3D, CV, voice, and so on have been reorganized. The team has closed out the single-modal model work and built an integrated team to construct the multimodal model.
At the technical level, organizational change naturally brings changes in technical direction, so the entire technology effort now has only two lines: one is building a multimodal model that fuses 3D, CV, and voice in a GPT-4o-like way; the other is experimenting along the lines of the o1 model recently released by OpenAI.
We are very focused now and will not put resources into other technical branches. That in itself is relatively cost-effective.
36Kr: From the user's perspective, is communicating with a digital human a real need?
Tao Ming: Human-machine conversation is a basic atomic capability of the product, but an atomic capability cannot be pushed to users directly. Instead, we need to build a community where AI beings and human beings coexist. Such a community cannot be sustained by single-point chat alone; it requires more scenarios in which AI and humans coexist.
Besides, when it comes to one-on-one chat, the AI chat products made by the "Big Model Six Tigers" can all hold a conversation, but the hard part is keeping the conversation going. If people and AI are not in a shared scenario, one-on-one chatting has quite a high threshold.
So why do we insist on the GPT-4o direction? AI that only has cognitive ability is not humanlike enough. It must also have perception ability to bring users richer experiences.
36Kr: If you develop along those interaction scenarios, how is that different from a companion-style game like Miracle Nikki?
Tao Ming: In games like Love and Producer, chat interaction is one-off. The difference with Soul is that, for example, if you catch a cold today, it may still remember and ask on the third day whether you have recovered. That feeling is completely different from a mechanical exchange where you say one sentence and it replies with one.
So we need to strengthen AI's perception and memory. That is the most important thing.
36Kr: How do you achieve long-term memory for AI?
Tao Ming: At first, the idea was retrieval: before answering, the relevant information was looked up from a stored memory base. Later, we built a small AI model: before the input reaches the conversation large model, the small model extracts memory points for the user, possibly hundreds of them. The longer the history, the wider the scope those memory points cover.
Now the idea is to feed the long-term memory data in directly, but this is a big technical direction with many details. For example, memory cannot be treated as fully continuous: if a certain point in the whole memory repeats multiple times, say the user catches a cold more than once, which occurrence should be used? It differs by scenario, and that requires some manual annotation and assistance; a single model cannot solve it on its own.
So there is still room for improvement in solving the user experience end to end. Setting aside product and operations, it is hard to solve this end to end with technology alone.
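Soul has not disclosed how this memory pipeline is implemented. Purely as an illustration of the idea Tao Ming describes, where a small model extracts "memory points" before the conversation reaches the large model, the sketch below uses a trivial keyword heuristic as a stand-in for the small model; the names (MemoryPoint, extract_memory_points, build_prompt) are invented for the example.

```python
# Illustrative sketch only; Soul's actual memory pipeline is not public.
# A keyword heuristic stands in for the "small model" that extracts memory
# points, and build_prompt shows how those points could be prepended to the
# conversation large model's input.
from dataclasses import dataclass

@dataclass
class MemoryPoint:
    timestamp: str   # when the fact was mentioned
    fact: str        # short, user-specific fact worth remembering

def extract_memory_points(history: list[tuple[str, str]]) -> list[MemoryPoint]:
    """Scan (timestamp, utterance) pairs and keep utterances that look like personal facts."""
    keywords = ("i caught", "i have", "my birthday", "i live", "i feel")
    points = []
    for ts, utterance in history:
        if any(k in utterance.lower() for k in keywords):
            points.append(MemoryPoint(timestamp=ts, fact=utterance.strip()))
    return points

def build_prompt(points: list[MemoryPoint], user_message: str, max_points: int = 5) -> str:
    """Prepend the most recent memory points so the large model can 'remember' them."""
    recent = points[-max_points:]   # naive recency rule; real systems need smarter selection
    memory_block = "\n".join(f"[{p.timestamp}] {p.fact}" for p in recent)
    return f"Known facts about the user:\n{memory_block}\n\nUser: {user_message}"

# Example: the model can be reminded that the user caught a cold two days ago.
history = [("2024-10-01", "I caught a cold today."), ("2024-10-02", "Work was busy.")]
print(build_prompt(extract_memory_points(history), "Good morning!"))
```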
36Kr: Which metrics does Soul value more at the moment: user time spent, per-user asset cost, or something else?
Tao Ming: Right now we pay more attention to active users, because time spent alone does not capture overall activity, so we look at activity as a whole. AI itself is an inclusive tool; it cannot serve only a certain group of people; any user on Soul should be able to benefit.
02 About the application prospects of large models
36Kr: Do you rent chips for your training?
Tao Ming: Both. We do not have our own Class A/B/C data centers. On the one hand, we have bought dedicated cards on various cloud platforms; on the other, we have bought some elastic cards.
This is also a cost consideration. If you had bought thousands of cards last year, those cards would already have lost 60% of their value this year. At our current level of resource integration, we try to turn fixed costs into variable costs.
36Kr: Where does the difficulty in large-model R&D lie for the industry right now?
Tao Ming: No cards. I was in the United States a while ago and discussed this with people from the Llama team. Some of Llama's technical documents are actually very detailed, so I asked: aren't you afraid that releasing such detailed technical documents will let competitors or overseas customers catch up with you?
They said that even if such technical documents are released and many people read them, those people cannot reproduce the work, because they have no cards. There is also the time issue: running the training for every technical detail takes a great deal of time.
36Kr: Some companies among the "Big Model Six Tigers" have begun to slow down their pre-training pace.
Tao Ming: Because at the pre-training level everyone can now see where the ceiling is, it makes no difference whether you reach that ceiling right now, in the short term, or further down the road; it means nothing extra. When you face something and know what cards each player ultimately holds, everyone's mindset becomes less anxious.
36Kr: So where do you think the bottleneck is? Nvidia?
Tao Ming: Ultimately it is Nvidia, but for now OpenAI still appears to be in the lead.
36Kr: Is the main bottleneck for large-model updates that the B200 has not shipped yet?
Tao Ming: Yes, that is a very important factor. But for China, resources are no longer the problem. Domestic resources are not that scarce. Especially since the second half of last year, many dealers who had been hoarding cards are now selling them. As long as you want them, you can get them; it comes down to whether you are willing to make such a large investment.
Overseas, however, card resources are indeed a problem. In the short term, China's problem is not computing power but what each company should actually do in the short term. The "Big Model Six Tigers" do pre-training and may reach, say, GPT's level; but what can be done once they get there, and what comes next? That has not really been thought through yet.
36Kr: In this wave of AI technology, is your product pushing the technology forward, or is technology R&D pushing product development?
Tao Ming: The original logic was that the product side raised requirements and the technology side implemented them. The situation is somewhat different now.
Now there is a group within Soul where both product people and AI algorithm engineers can raise requirements. In other words, product and engineering are not distinguished. At the current stage, the engineers actually raise more of the requirements.
Technical engineers know better what AI can and cannot do right now, so many of their requirements are well-defined; but this situation is determined by the current stage of the technology. When it comes to the boundaries of AI, product and engineering will eventually reach the same level of understanding.
36Kr: How many people are on the technical team now?
Tao Ming: The technical team is several hundred people, and AI accounts for nearly half.
36Kr: Were the AI people newly hired or transferred from existing roles?
Tao Ming: We already had people in this function; the team has now been expanded.