Latest interview with Demis Hassabis: to achieve AGI, we must move beyond simply expanding the context window and build mechanisms for continuous learning and memory.
On April 29th, Demis Hassabis, CEO of Google DeepMind and head of Google's AI efforts, was interviewed by Y Combinator, sharing his latest thinking on AGI, the evolution of large models, AI-driven scientific discovery, and technology entrepreneurship.
Demis Hassabis' career path is exceptionally rare in the tech world. Born in the UK, he emerged as a chess prodigy in his youth, and at 17 he led the design of the best-selling video game "Theme Park". He then returned to academia for a doctorate in cognitive neuroscience; his research on the mechanisms of memory and imagination in the brain became foundational work in the field. In 2010 he co-founded DeepMind around one core mission: to solve intelligence. The company was later acquired by Google, and Hassabis has served as CEO of Google DeepMind ever since.
Over the past decade or so, DeepMind has achieved a string of technological breakthroughs: AlphaGo defeated Go world champion Lee Sedol, and AlphaFold solved the protein structure prediction problem that had stumped biologists for 50 years, with the core results made freely available to scientists worldwide. That work earned Hassabis the Nobel Prize in Chemistry last year. Today he leads the Google DeepMind team developing the Gemini models, continuing toward the goal of artificial general intelligence (AGI) that he set in his adolescence.
We have distilled the core points of this interview. The key takeaways follow:
1. Reaching AGI requires moving beyond simply expanding the context window and building mechanisms for continuous learning and memory
The industry has grown accustomed to continually expanding the context window, but stuffing every piece of information (useful, useless, even wrong) into working memory is a brute-force approach with very high computational cost. Even with a context of tens of millions of tokens, retrieving a specific piece of information is prohibitively expensive. A true AGI system needs to learn continuously: it must gracefully integrate new knowledge into its existing knowledge base and retrieve it accurately in the right situations, rather than rereading the entire lengthy history from scratch every time.
2. Reinforcement learning will reshape the introspection and reasoning abilities of large models
Reinforcement learning has been severely underestimated on the path to higher-level intelligence. The chain-of-thought reasoning demonstrated by today's frontier models is essentially a revival of the ideas behind AlphaGo and AlphaZero on large foundation models. Current large models often lack introspection when reasoning and will blindly retry after choosing a wrong answer. DeepMind is reintroducing classic algorithms such as Monte Carlo tree search and deeply integrating reinforcement learning with large models to break through the ceiling on current models' reasoning ability.
3. Small on-device models and an open-source strategy are the inevitable choices for edge deployment
Through model distillation, models with very small parameter counts can already reach 90% to 95% of the performance of frontier models, with significant advantages in speed and cost. The mainstream form of future computing will pair large cloud models, responsible for complex coordination, with on-device models running on phones, smart glasses, or home robots that handle local private data. Since on-device models can easily be extracted once deployed on physical hardware, making them fully open is the inevitable strategic choice.
4. The goal of AI in scientific exploration is to go beyond pattern matching and propose new hypotheses
Scientific discovery cannot be limited to interpolating over existing data. AI must not only solve existing problems perfectly but also be able to invent new rules and propose new hypotheses. DeepMind is working outward from the cell nucleus, aiming to build a complete "virtual cell" within the next decade. The measure of AI's capacity for scientific discovery is whether it can pass the "Einstein test": given only the physics known before 1901, can it go beyond pattern matching and independently derive the special theory of relativity?
5. Technology entrepreneurs should build highly specialized vertical systems to collaborate with AGI
Technology companies typically take about a decade to mature, which means AGI will arrive around the middle of the current entrepreneurial cycle (around 2030). Facing this near-certainty, entrepreneurs should not try to force the complex specifics of vertical domains into general-purpose large models, as this degrades the general models' efficiency and other capabilities. A more sensible path is to build highly specialized, independent tool systems or infrastructure, and then adapt to a collaborative arrangement in which a general AGI acts as the "brain" that autonomously calls these vertical systems.
The following is the transcript of Demis Hassabis' interview:
1. What is still missing before achieving AGI?
Garry Tan: Demis Hassabis has one of the most unusual careers in tech. He was a chess prodigy as a child and designed the best-selling video game "Theme Park" at 17. He then returned to school for a doctorate in cognitive neuroscience, publishing foundational research on the mechanisms of memory and imagination in the brain. In 2010 he co-founded DeepMind with a single mission: to solve intelligence. I think they have achieved it.
Since then, his lab has repeatedly achieved things most people thought were decades away. AlphaGo defeated the world champion of Go, and AlphaFold solved the 50-year grand challenge of protein structure prediction, with the results made freely available to scientists worldwide. This work earned him the Nobel Prize in Chemistry last year. Now Demis leads the Google DeepMind team building Gemini, working toward the goal of artificial general intelligence (AGI) he set in his adolescence. Please welcome Demis.
You've been thinking about AGI longer than almost anyone. Looking at current paradigms such as large-scale pre-training, RLHF, and chain-of-thought (CoT), how much of the final AGI architecture do you think we've already got? And what is fundamentally missing right now?
Demis Hassabis: First of all, thank you, Garry, for the wonderful introduction. I'm very glad to be here, and thank you all for the warm welcome. This venue is really great, and I should come here more often. It's truly inspiring to work in this field. Back to your question: I'm very confident that the technical components you mentioned will all be part of the final AGI architecture. They have come a long way, and we have demonstrated many of their capabilities. I don't think we'll discover in a few years that these technologies were dead ends; that just doesn't seem plausible.
But on top of what we know works, one or two key pieces may still be missing. For example, aspects of continuous learning, long-horizon reasoning, and memory systems remain unsolved, as does making systems behave more consistently across the board. I think these problems must be solved to reach AGI.
Existing techniques might scale directly to AGI through incremental innovations, but it's also possible that one or two major theoretical problems still need solving. Even if unsolved mysteries remain, I don't think there are more than one or two. I'd put the odds of the two scenarios at roughly 50-50, so at Google DeepMind we're pursuing both directions simultaneously.
Garry Tan: When working with a series of agent systems, what amazes me most is that they largely reuse the same weights. So continuous learning is a very interesting concept, because right now it's a bit like we're taping things together, like a dream cycle mechanism that runs at night.
Demis Hassabis: The dream cycle is really cool. We used to think about this problem through episodic memory and the consolidation mechanism. In fact, my Ph.D. was on how the hippocampus works and performs memory consolidation, that is, how to gracefully integrate new knowledge into an existing knowledge base. The brain does this extremely well, and it mostly happens during sleep, especially the rapid eye movement stage, when the brain replays important episodes to learn from them.
In fact, one of the ways DQN, our earliest Atari-playing program, mastered its games was through experience replay. We borrowed the idea from neuroscience, training the model by replaying successful trajectories many times. That was back in 2013; looking back, it was almost the dark ages of AI, but it was a very important step.
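The experience replay idea Hassabis describes can be sketched in a few lines. This is a minimal illustrative buffer in the style of DQN's replay mechanism, not DeepMind's actual implementation: transitions are stored in a bounded buffer and sampled uniformly at random, which breaks the temporal correlation between consecutive game frames.

```python
import random
from collections import deque

class ReplayBuffer:
    """Bounded store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity: int = 10_000):
        self.buffer = deque(maxlen=capacity)  # old transitions fall off the end

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Uniform random sampling decorrelates the training batch.
        batch = random.sample(self.buffer, batch_size)
        return list(zip(*batch))  # columns: states, actions, rewards, ...

# Toy usage: log 100 dummy transitions, then draw a training batch.
buf = ReplayBuffer()
for t in range(100):
    buf.push(t, t % 4, 1.0, t + 1, False)
states, actions, rewards, next_states, dones = buf.sample(8)
```

In the full DQN training loop, the sampled batch would feed a Q-learning update; prioritized variants later replaced uniform sampling with importance-weighted draws of the most surprising transitions.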
I agree with you. Right now we're patching things up everywhere, for instance by simply stuffing everything into the context window, and it feels unsatisfying. Even though we're building machines rather than biological brains, you can have a perfect context window or memory of millions or even tens of millions of tokens, but retrieving and extracting the right content still carries a cost, and it's tightly coupled to the specific decision you have to make right now. This problem should not be underestimated: even if you can store all the data, retrieval is extremely expensive. I think there's still a lot of room for innovation in areas like memory.
Garry Tan: That's true. It's wild that a context of millions of tokens already seems large enough for so many operations.
Demis Hassabis: For most application scenarios it is indeed large enough. If you think about it, the context window is roughly equivalent to working memory. Human working memory holds only a handful of items, about seven on average, while AI now has context windows of millions or even tens of millions of tokens. The problem is that we're trying to stuff everything in, including unimportant or wrong information.
This brute-force approach doesn't seem right. The next challenge is real-time video: if you naively record every token, a million tokens isn't actually much, only enough for about 20 minutes of video. So if you want a system that truly understands long-term context and knows what has happened in your life over the past month or two, you need far more capacity.
Garry Tan: DeepMind has historically leaned toward reinforcement learning and search, as in AlphaGo, AlphaZero, and MuZero. How much of that thinking has carried into how you build Gemini today? Is reinforcement learning (RL) still underestimated?
Demis Hassabis: Yes, I think reinforcement learning is very likely underestimated. Technology develops in waves. Since DeepMind's founding we've been researching agents, and that has always been our publicly stated goal. All the Atari work and AlphaGo are essentially agent systems.
By agent systems we mean systems that can independently achieve goals, make active decisions, and formulate plans. We started this work in games because games made the problem tractable, then gradually took on more complex tasks. After AlphaGo, for example, we built AlphaStar for StarCraft. By that point we had essentially conquered every game on the market.
The natural next question was whether these models could generalize into world models or language models rather than being confined to games, simple or complex. That's the direction we've pursued over the past few years. In fact, much of what we do today, including all the frontier models with thinking modes and chain-of-thought reasoning, is in some sense a return to the pioneering ideas of AlphaGo.
Much of the work we did back then is still highly relevant today. We're revisiting old ideas, including methods like Monte Carlo tree search, and applying them in a more general way at the scale of today's large models, while further strengthening reinforcement learning on the existing foundation. The concepts behind AlphaGo and AlphaZero are extremely valuable at the current stage of foundation model development, and I think they are where the major breakthroughs of the next few years will come from.
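The Monte Carlo tree search idea Hassabis refers to rests on a simple selection rule. This is a minimal sketch of the classic UCT (upper confidence bound for trees) formula, the textbook form rather than AlphaGo's learned-network variant: each candidate move is scored by its average value plus an exploration bonus that shrinks as the move is visited more often.

```python
import math

def uct_score(child_value: float, child_visits: int,
              parent_visits: int, c: float = 1.4) -> float:
    """Classic UCT: exploitation term plus exploration bonus."""
    if child_visits == 0:
        return float("inf")  # always try unvisited moves first
    exploit = child_value / child_visits
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore

def select(stats: dict, parent_visits: int) -> str:
    """Pick the move with the highest UCT score.

    stats maps move -> (total_value, visit_count)."""
    return max(stats, key=lambda m: uct_score(*stats[m], parent_visits))

# Toy node: "c" is unvisited, so it is selected before the others.
stats = {"a": (3.0, 10), "b": (1.0, 2), "c": (0.0, 0)}
best = select(stats, parent_visits=12)
print(best)  # → c
```

In AlphaGo the exploration bonus was additionally shaped by a policy network's move priors, and the value term came from a learned value network instead of raw rollout averages.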
2. Why small models are becoming so powerful
Garry Tan: I have another question. We keep needing larger models to improve intelligence, yet at the same time model distillation lets much smaller models run far faster. You have the incredible Flash models, which achieve 95% of frontier-model performance at only a tenth of the cost. Is that right?
Demis Hassabis: I think this is one of our core advantages. You undoubtedly have to build the largest models to have the most cutting-edge capabilities, but our greatest advantage has always been how quickly we can distill that frontier capability and package it into smaller models.
We invented the distillation process early on, and thanks to scientists like Jeff and Oriol we remain the world's leading experts in it. We also have enormous internal demand for the technology, because we serve the world's largest AI user surfaces.
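The distillation process Hassabis mentions can be sketched in its classic soft-target form (the Hinton-style formulation; the exact recipes Gemini uses are not public). The student is trained to match the teacher's temperature-softened output distribution, which carries far more information per example than a hard label.

```python
import math

def softmax(logits, T: float = 1.0):
    """Temperature-softened softmax; higher T flattens the distribution."""
    exps = [math.exp(x / T) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_logits, T: float = 2.0) -> float:
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, T)  # teacher's soft targets
    q = softmax(student_logits, T)  # student's predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [4.0, 1.0, 0.2]
aligned = [4.1, 0.9, 0.3]      # student closely mimics the teacher
misaligned = [0.2, 1.0, 4.0]   # student's ranking is reversed
assert distill_loss(aligned, teacher) < distill_loss(misaligned, teacher)
```

In practice this KL term is usually mixed with the ordinary cross-entropy on ground-truth labels, and the gradients are scaled by T squared to keep the two terms comparable.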
Beyond AI Overviews in Search and the Gemini app, more and more Google products, such as Google Maps and YouTube, now integrate Gemini. That reaches billions of users; we have more than a dozen products with over a billion users each. So inference must be extremely fast, efficient, and cheap, with very low latency. That gives us strong motivation to build Flash and the even smaller Flash-Lite models, to make them extremely efficient and, we hope, a perfect fit for every kind of workload we handle daily.
Garry Tan: I'm curious how smart these smaller models can actually get. For example, is there a theoretical limit to the distillation process? Could a model with 50B or 400B parameters someday be as smart as today's amazing frontier models?
Demis Hassabis: I don't think we've hit any kind of limit, or at least no one in the industry currently knows whether there's a limit on information-carrying capacity. Perhaps at some point there will be an insurmountable information-density bottleneck. But as things stand, six months to a year after we release a Pro or frontier model, you'll see the same capabilities in very small edge models. You can also see this in our Gemma models; I hope you like the four Gemma models, because for their parameter size their performance is truly amazing. Behind that is heavy use of distillation and innovative thinking about how to make extremely small models extremely efficient. So I currently see no theoretical limit, and we're still quite far from any ceiling.
Garry Tan: That's amazing, really great. One of the most incredible things we're seeing now is that engineers can do 500 to 1,000 times the work they could six months ago. I mean many people in this room; their output may be a thousand times what it used to be. As Steve Yegge put it, that's equivalent to the total output of a Google engineer in the 2000s. Very exciting.
Demis Hassabis: I think small models have many uses. Obviously, reducing costs is one of them, but more importantly, there is an advantage in speed. Whether it's programming or other work, this speed allows you to iterate much faster, especially when you're collaborating deeply with the system. We really need such an extremely fast system. Maybe they don't quite reach the level of frontier models, as you said, only having 95% or 90% of the performance, but that's good enough. In the face of the agile iteration speed, this gain far exceeds the 10% performance gap.
Another important use is running these models at the edge, mainly for efficiency, privacy, and security. These systems may run on devices handling extremely private information, or in robotics; home robots, for instance, need highly efficient yet capable local models to coordinate their actions. With larger frontier models available in the cloud, a device only needs to delegate tasks to the cloud in specific situations, while all audio and video streams stay local. I think that's the ideal end state.
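The edge/cloud split Hassabis describes can be sketched as a routing policy. This is a hypothetical illustration, not Google's actual design: the names, the complexity score, and the threshold are all invented for the sketch. The one hard rule it encodes is the privacy constraint from the interview, that raw audio and video never leave the device.

```python
def route(task_complexity: float, contains_private_media: bool,
          local_threshold: float = 0.7) -> str:
    """Decide whether a request runs on-device or is delegated to the cloud.

    task_complexity: assumed 0..1 difficulty score (hypothetical).
    contains_private_media: True if raw audio/video is involved.
    """
    if contains_private_media:
        return "on-device"   # private streams are always processed locally
    if task_complexity <= local_threshold:
        return "on-device"   # fast, cheap local path for routine requests
    return "cloud"           # delegate hard reasoning to the frontier model

# Routine query stays local; hard public query goes to the cloud;
# anything touching private media stays local regardless of difficulty.
assert route(0.3, False) == "on-device"
assert route(0.9, False) == "cloud"
assert route(0.9, True) == "on-device"
```

A real system would of course estimate the complexity score with the local model itself and weigh latency, battery, and connectivity, but the privacy-first ordering of the checks is the point of the sketch.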
3. The future of continuous learning and agents
Garry Tan: Regarding the