Zhipu AI CEO Zhang Peng: It is too early to worry that the Scaling Law has hit a ceiling | WISE2024 Business Kings
The environment keeps changing, and the times keep evolving. The "Business Kings" ride the trend of the times, insist on creating, and seek new driving forces. Against the backdrop of the Chinese economy's large-scale transformation, the WISE2024 Business Kings Conference aims to discover the truly resilient "Business Kings" and to explore the "right things" amid the Chinese business wave.
From November 28 to 29, the two-day 36Kr WISE2024 Business Kings Conference was held in Beijing. An all-star event in Chinese business, the WISE Conference is now in its twelfth year, witnessing the resilience and potential of Chinese business in an ever-changing era.
2024 has been a somewhat ambiguous year, with more change than stability. Compared with the past decade, people's pace has slowed and development has become more rational. 2024 is also a year of seeking new economic momentum, and new industrial shifts are placing higher demands on everyone's adaptability. This year's WISE Conference takes "Hard But Right Thing" as its theme: in 2024, what counts as the right thing has become the topic we most want to discuss.
In the morning session of the WISE Conference, Zhang Peng, CEO of Zhipu, gave a talk titled "GLM Large Model and General Artificial Intelligence," covering recent hot topics in AI and the field's current stage of development.
Zhipu AI is one of China's star startups in the large-model field. Even before OpenAI launched ChatGPT, when few in China were digging into large models, Zhipu AI had already begun exploring them.
As an important cornerstone of general artificial intelligence, the GLM large model not only draws on the computing power and generalization ability of deep learning, but also performs strongly in semantic understanding, knowledge reasoning, and other areas.
In his speech, Zhang Peng not only reviewed the classic curve of the artificial intelligence field's development, but also addressed recent hot topics in the large-model field, such as whether the Scaling Law has hit a wall and whether large-model R&D progress has slowed.
Zhang Peng said that, essentially, from a research and technology perspective, this generation of large models is unlike the previous generation of AI technology. While language ability has improved significantly, large models still have considerable room to develop in other modalities, including vision and hearing.
"A large number of researchers are making new breakthroughs, and each new breakthrough will bring some new opportunities. It is indeed a bit premature for us to worry about hitting the ceiling or hitting a dead end," Zhang Peng said.
In addition, Zhang Peng proposed five development stages for AI, covering language ability, logical thinking, tool invocation, self-learning, and more.
Zhang Peng
The following is the transcript of Zhang Peng's speech:
Zhang Peng: Good afternoon, everyone! It's already past 12 o'clock. Despite the cold outside, the venue is still lively. Those who are still here listening to me must be true fans.
On my way here today, I was thinking about what to talk about. Our marketing team asked us to prepare the content early and submit it to the organizers, so the material may still be something we prepared a while ago.
In the past two days, everyone has been discussing some new questions, including whether the Scaling Law has hit a ceiling and whether large models will keep developing. There are many views, and I have been asked such questions constantly. So I may not stick strictly to the prepared content, and will instead share some of our recent thinking.
This is the artificial intelligence curve released in the middle of this year. You can see many terms related to artificial intelligence and large language models on it, each in a different position representing public attention to that topic. Personally, I think it increasingly resembles a curve of public opinion and media heat, rather than one truly driven by technology.
You can see many new things on it, including embodied intelligence, agents, and other technologies, which are still on a very rapidly rising curve. So viewed as a whole, this wave of artificial intelligence is still in a stage of very rapid ascent.
So the worry that the whole industry will suddenly enter a trough because the Scaling Law has hit a wall may be unnecessary. Like worrying that artificial intelligence will rule humanity, it is still too early. Let's wait and see.
In a very narrow sense, the Scaling Law has indeed encountered some challenges. Look at this curve: purely in terms of language ability, large models have indeed run into a ceiling problem similar to the previous generation of artificial intelligence, with all abilities approaching the limits of humans and of what human experts can evaluate.
You can recall that this is exactly why the previous generation of artificial intelligence entered a bottleneck period: we humans had no way to teach AI how to break through. Our human ceiling is there, and all the data fed to AI comes from humans. Whether this ceiling can be broken is a question everyone may need to think about now.
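The flattening being described here can be pictured with a toy power-law curve. The sketch below is purely illustrative and not from the talk: the functional form follows the commonly cited power-law shape of scaling laws, and the constants are hypothetical, chosen only to show diminishing absolute returns as models grow.

```python
# Illustrative sketch only: a power-law scaling model of loss versus
# parameter count N, L(N) = (N_c / N) ** alpha. The constants n_c and
# alpha are hypothetical, not measurements from any real model.
def scaling_loss(n_params: float, n_c: float = 8.8e13, alpha: float = 0.076) -> float:
    return (n_c / n_params) ** alpha

# Each 10x jump in parameters cuts loss by the same *ratio*, so the
# absolute improvement shrinks with every decade of scale - which is
# what makes the curve feel like it is approaching a ceiling.
losses = [scaling_loss(n) for n in (1e9, 1e10, 1e11, 1e12)]
gains = [losses[i] - losses[i + 1] for i in range(len(losses) - 1)]
assert all(g > 0 for g in gains)       # loss keeps falling...
assert gains[0] > gains[1] > gains[2]  # ...but by less each time
```

Under this toy model, scaling alone never stops helping, but each additional order of magnitude buys less, which is one way to read the "narrow sense" in which the Scaling Law has met a challenge.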
That is the view from the language model alone. But essentially, from a research and technology perspective, this generation is not like the previous generation of artificial intelligence technology.
The neural networks and convolutional neural networks underlying the previous generation of AI look relatively simple now; the field converged very quickly and stabilized.
By contrast, research on the foundations of large pre-trained models is still iterating very quickly, with a lot of blank space remaining. A large number of researchers are making new breakthroughs, and each one brings new opportunities. It is indeed a bit premature to worry about hitting a ceiling or a dead end.
What we just discussed is the language model, which has indeed run into very real problems: it seems all the data has been fed in, and the pace of intelligence improvement has slowed somewhat. But beyond language there are many other things, such as vision and hearing, where there is still very, very large room.
For example, many problems in visual understanding still need to be studied. Current models cannot yet comprehensively interpret the complex scenes the human eye perceives, and the gap with humans remains very obvious. There is much we still need to do here.
We have also put a lot of effort into this recently, combining our visual understanding ability with hardware and on-device systems so that edge devices gain stronger understanding. Many tasks in the real world require input across modalities, including language, vision, and hearing.
In summary, we can look at this picture. We divide the development stages and progress steps of artificial intelligence into five levels. In fact, OpenAI also has a similar classification method.
In our understanding, the first three levels are very similar to OpenAI's. First, the simplest and most important is language ability, along with the other modalities, which we collectively call multi-modal abilities; vision and hearing still have a lot of room.
The second level is logical thinking ability, exemplified by OpenAI's o1, which has been widely discussed recently. Many teams in China are also building models with strong reasoning and complex problem-solving abilities, and they keep evolving. In logical reasoning, we can probably reach about 60% of the human level.
Going further: how do we give large models and AI hands and feet, so they can use a rich variety of tools and generate greater productivity? This tool-invocation ability is also a hot topic lately.
We have also made a small breakthrough recently, allowing the agent to help people operate apps on mobile phones and applications on PCs to solve some repetitive and procedural tasks.
Going further, there may be some slight differences from OpenAI. We believe the fourth and fifth levels center on AI's ability to learn on its own.
Human ability and intelligence can keep updating and iterating because humans can self-learn: through continuous practice and feedback, we improve ourselves and create new data, experience, intelligence, and tools. These are the core abilities that let humans move forward and create new things.
We hope that AI can have a stronger self-learning ability, so that it is possible to break through the existing ceiling like humans and create new things. In the future, we can use this ability to explore, research, and find new boundaries of science.
One of the things we have done recently is to study how to integrate visual, auditory, and voice abilities with a very powerful language understanding ability in the multi-modal ability to solve some problems in reality.
OpenAI calls its classification and development path the Road to General Artificial Intelligence, and you can see a clear evolutionary route: from large language models, to multi-modality, to tool use, to self-learning. The whole path is very clear.
Why this order? Language is the foundation; human intelligence is itself multi-modal; tools then connect the brain to the physical world; and finally comes self-learning.
We have conducted some discussions with interdisciplinary experts such as brain science researchers and neuroscience experts. The evolution of current artificial intelligence technology has touched some aspects of general artificial intelligence.
The human brain is modular and diverse, a fact confirmed by modern brain science: language, logical reasoning, vision, touch, motor control, and so on. The colored parts are the areas AI or large models have already touched; the gray parts are areas we have not reached or have touched only lightly, including the use of external tools, like hands and feet. As in the stepped diagram we drew earlier, in the understanding and use of natural language we have approached the best human upper limit.
In logical reasoning, emotion, innovation, and tool use, we have made certain breakthroughs, but many areas remain largely blank. So what is the next generation of the Scaling Law? We may find more places where scaling laws take effect in these blank or still-underdeveloped areas.
In this process, you can see that Zhipu's development path in fact parallels OpenAI's, because our underlying ideas are very similar: human intelligence, or general artificial intelligence, cannot come from breaking the ceiling of a single ability.
Think about it: what was the final outcome of the previous generation of AI breaking the ceiling of a single ability, whether NLP (natural language processing) or computer vision? A tool-type achievement that cannot solve the general problems we expect to solve in real life, using big data only to solve small tasks.
What this generation of generative AI, or large models, solves is precisely the opposite: we train with big data and simple tasks in order to solve a broader range of problems. That is the core problem this generation of generative artificial intelligence must solve.
But solving diverse problems necessarily requires a combination of multiple capabilities. It is hard to imagine that in our daily work or social life we rely on only one perceptual ability.
To solve problems in real life and work, it must be a combination of multiple abilities. This is why we do a comprehensive combination of various types and different modal abilities.
The new-generation model we released in August is a product matrix combining multiple abilities. Starting from our best text ability as the base, it combines visual, language, and code abilities, giving it comprehensive, generalized capabilities for facing generalized tasks.
In August we also underwent public evaluation and scrutiny from industry, academia, and the public, with very good results. The fourth-generation model traded wins and losses with first-tier international models, which we were very pleased to see.
On this basis come new capabilities, such as video generation. We have made a new upgrade: a higher bitrate, 60-frames-per-second video generation, higher 4K clarity, and more realistic images, and it can combine with our voice ability to automatically dub the videos.
As everyone knows, the progress from silent films to sound films is a very important leap in the history of cinema. The generation of videos from completely silent pictures to being able to generate dubbing at the same time is also a huge progress, indicating that we are taking another step towards the understanding and generation of the physical world.
We can generate videos not only with higher clarity, but also with more aspect ratios, faster generation, and simultaneous multi-channel generation, making everyone more efficient.
This is AutoGLM, which lets everyone experience operating a phone with language and voice. I believe friends who follow us have seen the recent video introductions and tried the beta of the corresponding product.
Since its release, this product has received widespread attention, and we have gotten a great deal of feedback, though it is still a fairly early attempt. After this month of testing and feedback, we are working hard to keep iterating on it. If you are interested, please keep following; we will have updates for you.
Due to time constraints, we won't play this video to the end. If you are interested, you can try it yourself.
Through reinforcement learning, we have greatly improved the success rate of large models using tools the way humans do. On general tasks where the success rate may have been only about 20%, accuracy has roughly doubled to nearly 40%.
In the future, we hope to connect the large-model brain to more smart devices through our comprehensive model capabilities, our multi-modal and cross-modal abilities, and our general AI Agent capabilities, so AI can land faster, enter the physical world, and bring a new experience of human-computer interaction.
I feel that era is already arriving. Throughout this process, Zhipu adheres to a dual-wheel-drive philosophy: continuously breaking through on technology, while never forgetting to turn that technology into new products that create more customer value in the market.
This is the end of my sharing. Thank you all!