Yang Zhilin can't respond to everything.
Written by Yongyi Deng
Edited by Jianxun Su
Entering 2024, China's large-model companies face an increasingly difficult situation. On one hand, the "Six Tigers" who rushed into the field in 2023, securing substantial financing and high valuations, now face questions from all sides: AI applications are homogeneous, and a viable business model has yet to be established.
On the other hand, the iteration pace of the top models, led by OpenAI, has slowed, and GPT-5 has still not been released. Recently, the whole industry has been asking: has the Scaling Law of large models failed?
But Yang Zhilin, the long-absent founder of the Dark Side of the Moon, says: the Scaling Law still holds, but what gets scaled has changed.
△Yang Zhilin, founder of the Dark Side of the Moon. Source: photo by the author
On November 16, the Dark Side of the Moon officially released its new mathematical model, k0-math.
This is a model focused on mathematical reasoning. In the demo, k0-math not only solved difficult competition math problems but, more importantly, displayed its step-by-step thinking as it worked: from reading the problem to breaking it into sub-steps. When a solution step goes wrong, k0-math can also reflect on whether its reasoning was flawed and return to a specific step to restart the derivation.
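The solve-verify-backtrack behavior described above can be pictured as a small control loop. This is a hypothetical sketch for illustration only; the function names (`decompose`, `execute`, `check`) are invented and do not correspond to any real Kimi interface.

```python
# Hypothetical sketch of a "decompose, verify, backtrack" reasoning loop.
# All function names here are invented placeholders, not a real Kimi API.

def solve(problem, decompose, execute, check, max_retries=3):
    """Run each reasoning step; re-derive a step whose result fails the
    check instead of abandoning the whole solution."""
    partial = []                         # accepted intermediate results
    for step in decompose(problem):      # split the problem into steps
        for _ in range(max_retries):
            result = execute(step, partial)
            if check(step, partial, result):
                partial.append(result)   # step verified, keep it
                break                    # move on to the next step
        else:
            raise RuntimeError(f"could not verify step {step!r}")
    return partial[-1]                   # final accepted answer
```

Here `check` plays the role of the verification signal that lets the model notice a flawed step and retry from that point rather than restarting the whole solution from scratch.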
Benchmark results released by the Dark Side of the Moon show that Kimi k0-math's mathematical ability rivals the two publicly available models in OpenAI's world-leading o1 series: o1-mini and o1-preview.
Yang Zhilin also stressed that, to keep the comparison with o1 fair, the team ran live tests on several different types of test sets.
△Benchmark results of the k0-math model. Source: photo by the author
On four mathematics benchmarks (the Senior High School Entrance Exam, the College Entrance Exam, the Postgraduate Entrance Exam, and MATH, which includes entry-level competition problems), the first-generation k0-math outperformed o1-mini and o1-preview.
On the two harder, competition-level question banks, OMNI-MATH and AIME, the first-generation k0-math reached 90% and 83% of o1-mini's best scores, respectively.
Just one month earlier, Kimi had released its latest version, the Kimi Exploration Edition, which builds CoT (Chain of Thought) into the model. The Exploration Edition's autonomous AI search can mimic human reasoning: decomposing complex problems into multiple levels, performing deep search, and reflecting on and improving its results on the fly.
Whether it is the Kimi Exploration Edition or today's k0-math, the signal is the same: keep raising the model's intelligence and depth of thought. This is also Kimi's first step in catching up with the top models represented by OpenAI's o1.
However, Yang Zhilin also admits that the current k0-math still has many limitations.
For hard problems, such as the toughest College Entrance Exam questions or IMO competition problems, k0-math still has a meaningful probability of making mistakes. Sometimes the model also overthinks: for a trivial problem like 1 + 1 = 2, it may take unnecessary steps to repeatedly verify the answer, or even "guess" the answer without being able to show why it is correct.
As a representative of the "technical idealist" camp among China's AI startups, Yang Zhilin has repeatedly stressed the significance of the Scaling Law, the most important technical principle behind large models.
But now he also states plainly that the industry paradigm is shifting: from expanding compute and parameter scale to a technical route dominated by reinforcement learning, focused on raising the model's level of intelligence.
"The development of AI is like riding a swing: you switch back and forth between two states. Sometimes the algorithms and data are ready but the compute is not, and the job is to add compute. But today we find that simply scaling up compute may not directly solve the problem, so we need to break through the bottleneck by changing the algorithms," Yang Zhilin explained.
Releasing the math model k0-math on this particular day also carries special meaning: November 16 is the first anniversary of the Dark Side of the Moon's first product, Kimi Chat.
Over the past two years, the Dark Side of the Moon has been one of China's most closely watched AI startups. From the breakout of the Kimi assistant in 2023, to the surge in user-acquisition spending in 2024, to the recent arbitration controversy, the team has stayed at the center of the storm, as if walking through fog.
But now the Dark Side of the Moon clearly has no intention of responding to everything. At the press conference, Yang Zhilin spoke only about the new model and technical matters, and announced just one number: as of October 2024, Kimi's monthly active users had reached 36 million.
△Kimi's latest user data. Source: photo by the author
"I remain fairly optimistic." Yang Zhilin predicts that the paradigm shift does not mean that pre-training built mainly on scale is suddenly useless: the top models can still release much of pre-training's potential over the next half generation to full generation.
And once the models' thinking ability improves further, large models can be deployed to solve more specialized tasks across industries.
What follows are more of Yang Zhilin's remarks and responses at the press conference, edited and organized by "Intelligent Emergence":
AI development is like riding a swing; essentially, you have to be good friends with Scaling
Q: After turning to the reinforcement learning route, will data become a relatively big challenge for model iteration?
Yang Zhilin: This is indeed the core issue on the reinforcement learning route. Previously, with next-token prediction, we used static data, and our techniques for filtering, scoring, and selecting data were fairly mature.
But on the reinforcement learning route, much of the data is generated by the model itself (for example, its own thinking processes). While thinking, the model needs to know whether an idea is right or wrong, which places much higher demands on the reward model. We also do a lot of alignment work, which suppresses these problems to a degree.
Q: During model iteration, how do you balance the earlier route of expanding compute with reinforcement learning?
Yang Zhilin: I think AI develops like a swing: you switch back and forth between two states. If your algorithms and data are fully ready but the compute is not, what you need is more engineering, better infra, and then it keeps improving.
From the birth of the Transformer to GPT-4, I think the main tension was how to scale; there were arguably no fundamental problems with the algorithms or data.
But today, with scale almost sufficient, you find that adding more compute may not directly solve the problem. The core issue is that there is not that much high-quality data: ten-odd trillion tokens is roughly the ceiling of what the human internet has accumulated over more than 20 years.
So we need to change the algorithms so that this does not become a bottleneck. All good algorithms are friends with Scaling; they release its greater potential.
We started working on reinforcement learning very early. I think it is a very important trend for the next step: by changing the objective function and the learning method, we keep things scaling.
Q: Would a non-Transformer route solve this problem?
Yang Zhilin: No, because it is not fundamentally an architecture problem. It is a problem of the learning algorithm, or of lacking a learning objective. I don't think there is anything essentially wrong with the architecture.
Q: On inference cost: once the math version lands in the Kimi Exploration Edition, can users choose between models, or will you route by question? Also, your main monetization is tipping rather than subscriptions. How do you balance the cost?
Yang Zhilin: Our next version will most likely let users choose. In the early stage, this better matches users' expectations: we don't want the model to think for a long time about something as simple as 1 + 1, so this is probably the early-stage solution.
But ultimately this is a technical problem. First, we can dynamically allocate the optimal amount of compute. If the model is smart enough, it will know how much thinking time a given problem deserves, just as a person would not ponder "1 + 1" for long.
Second, cost keeps declining. For example, reaching last year's GPT-4-level performance this year may only take a dozen or so billion parameters, where last year it took well over a hundred billion. So I think the general rule of the industry is: first make models bigger, then make them smaller.
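The dynamic allocation Yang describes can be pictured as routing each question to a reasoning budget based on estimated difficulty. This is a minimal sketch with an invented heuristic and invented budget numbers; nothing here reflects Kimi's actual policy.

```python
# Hypothetical sketch: map estimated difficulty to a "thinking" budget.
# The thresholds and budget sizes are made up for illustration.

def thinking_budget(question, estimate_difficulty, budgets=(1, 8, 64)):
    """Map an estimated difficulty in [0, 1] to a number of reasoning steps."""
    easy, medium, hard = budgets
    d = estimate_difficulty(question)
    if d < 0.3:
        return easy      # e.g. "1 + 1 = ?" gets almost no deliberation
    if d < 0.7:
        return medium
    return hard          # competition-level problems get the longest budget
```

The interesting design question, as Yang notes, is making `estimate_difficulty` part of the model itself rather than a hand-written rule.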
Q: Is the AI field now constrained by the Scaling Law?
Yang Zhilin: I am fairly optimistic. The core point is that a static dataset is a relatively simple, crude way to use data. Now, in many cases, reinforcement learning brings humans into the loop.
For example, labeling just 100 examples can already produce a very significant effect, and the model does the rest of the thinking on its own. I think more and more problems will be solved this way.
In terms of approach, the reinforcement learning route has fairly high certainty; often the real question is how to actually tune the model properly. I think the ceiling is very high.
Q: Last year you said long context was the first step of the landing on the moon. Which step are the math model and deep reasoning?
Yang Zhilin: It is the second step.
Q: Everyone now thinks pre-training scale has hit a bottleneck. Once the US hits that bottleneck, how does it affect the landscape of Chinese and American large models? Is the gap widening or narrowing?
Yang Zhilin: I have always thought the gap is relatively constant, and for us, that may even be a good thing.
Suppose you keep pre-training, with a budget of 1B this year and 10B or 100B next year. That may not be sustainable.
Of course, post-training also needs to scale, but its scaling starts from a very low base. For a long time, compute may not be your bottleneck, and then innovation ability matters more. In that situation, I think we have an advantage.
Q: Are the deep reasoning you released earlier and the math model you mentioned today features that are quite distant from ordinary users? How do you see their relationship with users?
Yang Zhilin: Actually, they are not distant.
I see two kinds of value. First, the math model is extremely valuable in education products today, and that already plays an important role in our overall traffic.
Second, it is a technical iteration and validation, and we can apply the technology in more scenarios, such as the extensive search in the Exploration Edition I just mentioned. So it carries these two meanings.
Maintain a single product form, and maintain the highest ratio of GPUs to people
Q: Everyone is now debating AI applications: the super app has not appeared, and a large number of AI applications are highly homogeneous. What is your view?
Yang Zhilin: I think the super app has already appeared. ChatGPT has more than 500 million monthly active users. Is that a super application? It is at least halfway there; the question has largely been answered.
Even a product like Character.AI had quite a lot of users early on but struggled to break out of its niche later. Along the way, we also look at the US market to judge which business will ultimately be the largest and has the higher probability of success.
We will still focus on what we believe has the highest ceiling and is most relevant to our AGI mission.
Q: The industry is now seeing AI startups being acquired and talent leaving to return to big companies. What do you make of it?
Yang Zhilin: We have not run into this problem, though some other companies may have. I think it is normal: the industry has entered a new stage, shifting from many companies doing this at the start to fewer companies doing it now.
Next, what everyone does will gradually diverge. I think that is inevitable. Some companies will not be able to keep operating, and these problems will arise. That is the law of industry development.
Q: You rarely talk about model training. What is the state of your pre-training now?
Yang Zhilin: I think pre-training still has room, roughly half a generation to one generation of models, and that room will be released next year. By next year, I think the leading models will push pre-training to a fairly extreme stage.
But our judgment is that the most important thing next will be reinforcement learning; in other words, the paradigm will shift somewhat. Essentially it is still Scaling: not that you stop scaling, but that you scale in different ways. That is our judgment.
On the future, competition, and going global
Q: Sora is about to release a product. When will you release a multimodal product? How do you view multimodality?
Yang Zhilin: We are working on it too; several of our multimodal capabilities are in internal testing.
On multimodality, I think AI's two most important capabilities in the next step are thinking and interaction, and thinking matters far more than interaction.
It is not that interaction is unimportant, but thinking determines the ceiling, while interaction is a necessary condition. For example, without vision ability, visual interaction is impossible.
But thinking is like this: look at the task at hand and how hard it is to annotate. Do you need a PhD to annotate it, or can anyone do it? Which of these is more