
All of the latest insights from Demis Hassabis of DeepMind are right here.

QbitAI 2025-09-15 15:31
The "Scientific Golden Age" from NanoBanana to AGI

The popularity of Nano Banana has prompted Demis Hassabis, the CEO of Google DeepMind, to discuss AGI once again in a recent interview.

If we achieve full AGI within the next decade, it will usher in a golden age of science, a new renaissance.

Of course, Nano Banana is not AGI, but it also demonstrates some of the key capabilities and characteristics that Hassabis believes an AGI system should possess.

Hassabis once predicted that we might achieve AGI around 2030. However, the bottleneck that needs to be overcome is that current AI systems do not yet possess true "PhD-level intelligence": they may excel in certain areas but still make simple mistakes in others.

Moreover, today's AI lacks "true creativity" and cannot propose new conjectures or hypotheses.

To build AGI, we need to understand the world around us and the physical world, not just the abstract world of language or mathematics.

Despite these challenges, Hassabis still firmly believes that the arrival of AGI will initiate a "golden age of science" and bring significant benefits to humanity in various fields such as energy and health.

Netizens have commented that this is one of the most realistic discussions to date about the challenges and opportunities on the path to AGI.

Without further ado, QbitAI has translated and compiled this interview for you. Let's take a look:

  • The ability to generate realistic physical interaction scenarios is in itself proof of the system's deep understanding of the laws of the world.
  • Humanoid robots are extremely valuable for daily tasks, but specialized robot forms also have their irreplaceable application scenarios.
  • AGI should possess original creative ability, not just optimize existing systems.
  • Not everyone can achieve the same output quality, because it also depends on skills such as how you use the tool, aesthetic judgment, and storytelling ability.
  • I believe that within the next 10 years, the drug development cycle is expected to be shortened from several years or even a decade to a few weeks or even days.
  • The strength of Nano Banana lies not only in its being a top-notch image generator but also in its amazing consistency.
  • The ultimate goal of a hybrid system is to integrate verified solutions upstream into the learning component.
  • ...

The Nobel Prize and Google DeepMind

Host: First of all, congratulations on winning the Nobel Prize for the amazing breakthroughs of AlphaFold. Maybe you've talked about this before, but I know everyone here would love to hear where you were and what it was like when you received the news.

Hassabis: It was a very surreal moment (laughs). It was all so incredible. They'll notify you about 10 minutes before everything goes live. When you get that call from Sweden, it's like a bolt of lightning - it's the call every scientist dreams of. Then there were several ceremonies, and I spent a whole week in Sweden with the royal family. It was amazing.

The most amazing part is when they bring the laureates' book out of the vault for you to sign, alongside the other great laureates. It's a truly incredible moment. On other pages you can see Richard Feynman, Murray Gell-Mann, Marie Curie, Albert Einstein, and Niels Bohr. Then you keep flipping through, and you get to write your own name in that book.

Host: Did you have a hunch that you were nominated and knew this might be coming your way?

Hassabis: It's quite astonishing how well they can keep things under wraps in this day and age - the secret is guarded like a national treasure in Sweden. So all the outside world hears are rumors. Some people thought AlphaFold might deserve this kind of recognition. However, the award criteria consider both the scientific breakthrough and its real-world impact, and the latter might take 20 or 30 years to become apparent. So no one can predict when it will come, or even whether it will come at all. That's exactly what makes scientific research so fascinating.

Host: What a pleasant surprise. Congratulations. Speaking of DeepMind: Alphabet is a large conglomerate with many business lines. What role does DeepMind play within it, and what are its main responsibilities?

Hassabis: We now treat Google DeepMind as a single entity - the merger a few years ago brought all of the AI teams across Google and Alphabet together. We pooled the strengths of each team to form this unified department.

I prefer to describe it like this: We're like the "engine room" of Google and Alphabet. We're not only building the core Gemini model but also developing various AI models, including video models and interactive world models. Now these models are fully integrated into the Google ecosystem, and almost every product and interactive interface runs on the AI models we've developed.

Now, billions of users interact with our models through AI Overviews, AI Mode, or the Gemini app - and this is just the beginning. We're deeply integrating AI into all our products, such as Workspace and Gmail. For us, this is an excellent opportunity: we can conduct cutting-edge research and immediately let users worldwide experience the results.

Host: How many people are there in your team, and what's their situation? Are they scientists or engineers? What's the composition of your team like?

Hassabis: Our team currently has about 5,000 people, mainly engineers and doctoral researchers... I'd guess they account for over 80%, which means about three or four thousand top technical people.

The Genie 3 World Model

Host: Model iterations are happening very quickly these days. New models and even entirely new categories of models are constantly emerging, like the Genie world model that was released a few days ago. So, what is the Genie world model? We've prepared a demonstration video, and we can discuss it during the live broadcast.

Demonstration video: What you're seeing isn't just a game or a video. They're complete virtual worlds generated by Genie 3. As a new breakthrough in world models, now you only need to describe a scene in text, and Genie 3 can instantly generate an interactive and immersive environment, allowing you to truly "step into" the imagined world you've created.

Hassabis: Yes, all these dynamic images and interactive worlds you're seeing - notice that someone is using the arrow keys and the space bar to control this 3D environment in real time. The key point is that all these pixels are generated on the fly. Before a player explores a certain area, there's no content there at all.

For example, in this scene someone paints graffiti on a wall in a room. When the player turns around and looks back, the graffiti is still on the wall, even though that part of the scene didn't exist before. Even more amazingly, you can type in prompts at any time, like "a person wearing a chick costume" or "a jet ski," and the AI will instantly integrate these elements into the scene. I think it's truly amazing.
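
To make the interaction pattern concrete, here is a minimal Python sketch of what such an interactive world-model loop looks like conceptually: each frame is generated on demand from the text prompt, the frame history (which is what keeps earlier content such as the graffiti consistent), the player's control input, and any newly injected prompt. The ToyWorldModel class below is a stand-in invented purely for illustration; Genie 3 itself is not exposed through a public API of this shape.

```python
import numpy as np

class ToyWorldModel:
    """Toy stand-in for a learned interactive world model (hypothetical interface).

    A real system would be a large neural network; here every "frame" is just a
    random image so that the control loop below actually runs.
    """

    def __init__(self, prompt: str, height: int = 90, width: int = 160):
        self.prompt = prompt
        self.history = []   # past frames, kept so earlier content stays consistent
        self.h, self.w = height, width

    def step(self, action: str, inject: str | None = None) -> np.ndarray:
        # A real model would condition on (prompt, full frame history, action, inject),
        # which is why graffiti you painted earlier is still there when you look back.
        frame = np.random.rand(self.h, self.w, 3)
        self.history.append(frame)
        return frame

# Interactive loop: every frame is generated on the fly from the user's input.
world = ToyWorldModel(prompt="a sunlit room with a graffiti wall")
for action in ["forward", "turn_left", "paint_wall", "turn_right"]:
    inject = "a person wearing a chick costume" if action == "turn_right" else None
    frame = world.step(action, inject)
    print(action, frame.shape)
```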

Host: It's a bit hard to wrap your head around. We've all played immersive 3D video games, but nothing here was created in advance - the objects weren't pre-built in a 3D engine like Unity or Unreal. Everything you're seeing is 2D imagery generated in real time by the AI, yet you get a completely immersive 3D experience - that's the truly groundbreaking part.

Hassabis: This model essentially learns the laws of physics through reverse engineering. It has analyzed millions of real-world videos from platforms like YouTube and independently deduced the operating logic of the real world. Although it's not perfect yet, it can generate highly consistent interactive scenarios that last for a minute or two. It's particularly worth noting that its scope goes far beyond human activities: you can control a puppy on the beach or interact with jellyfish, a genuine simulation of a richly varied world.

Host: The working principle of traditional 3D rendering engines is that programmers pre-write all the physical rules, such as how light reflects and how objects move. You create a 3D model, and the engine calculates the lighting effects based on the pre-set program and finally renders the image. But the breakthrough of Genie is that it autonomously understands these physical rules just by watching a vast number of videos. There are no manually programmed physical laws; it simply masters complex principles such as light reflection and object movement through observation and learning.
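
The contrast the host is describing can be shown with a deliberately tiny example: in (a) the physical rule (gravity, a bounce) is written by the programmer; in (b) a model is fit purely to observed state transitions and recovers the motion without ever being given the equation. This NumPy sketch is only an illustration of the idea, not how Genie is trained (which uses deep networks on video).

```python
import numpy as np

# (a) Traditional engine: the programmer writes the physical rule explicitly.
def engine_step(height: float, velocity: float, dt: float = 0.05, g: float = 9.8):
    velocity -= g * dt                      # hand-coded gravity
    height += velocity * dt
    if height < 0:                          # hand-coded bounce
        height, velocity = 0.0, -0.8 * velocity
    return height, velocity

# (b) Learned dynamics: fit a next-state predictor purely from observed trajectories,
# analogous to a world model inferring physics from video rather than coded rules.
states, next_states = [], []
h, v = 5.0, 0.0
for _ in range(500):
    nh, nv = engine_step(h, v)              # the "training videos" come from the toy engine
    states.append([h, v])
    next_states.append([nh, nv])
    h, v = nh, nv

X, Y = np.array(states), np.array(next_states)
W, *_ = np.linalg.lstsq(X, Y, rcond=None)   # linear model: next_state ≈ state @ W
# (A linear fit can't capture the bounce; a real world model uses a deep network.)

print("learned prediction:", np.array([5.0, 0.0]) @ W)
print("engine ground truth:", engine_step(5.0, 0.0))
```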

Hassabis: Yes, it was trained not only on real-world video data but also on synthetic data from game engines. This project is of special significance to me. What really struck me is that when I first entered the industry in the 1990s, I personally wrote game AI and graphics engines, and manually programming polygon modeling and physics engines was extremely difficult back then. Now look at Genie: the dynamic reflections on the water surface, the fluidity of materials, the physical behavior of objects - all the effects that used to require painstaking programming now come out of the box.

Host: It's hard to put into words how complex the problems this model has solved are. This breakthrough is truly beyond imagination. Where will this technology take us? If we fast-forward this model to... the fifth generation?

Hassabis: The original intention of developing this kind of model has always been clear. Although ordinary language models (such as the basic version of Gemini) are constantly improving, from the very first day of Gemini's birth, we've been committed to building a truly multimodal system - one that can handle any type of input, including images, audio, and video, and generate output in any form.

This is related to the core proposition of artificial general intelligence (AGI): True AGI must understand our physical world, not just the abstract domains of language or mathematics. This physical cognitive ability is precisely the key missing link in current robotics technology and is also a prerequisite for the practical application of daily AI assistants such as smart glasses - they must understand the physical environment you're in and its operating laws.

Therefore, the Genie model and our top-notch text-to-video system, Veo, are both part of our effort to build "world models" - models that understand the dynamics and physical laws of the world. The ability to generate realistic physical interaction scenarios is in itself proof of the system's deep understanding of the laws of the world.

The Robotics Revolution

Host: This technology will ultimately lead to revolutionary breakthroughs in robotics. Although this is just one of its application directions, perhaps we can discuss the current state of the art in vision-language-action models.

Our envisioned general-purpose system is like this: a machine with camera observation capabilities. I can use language, either text or voice, to tell it that I want it to do something. Then it knows how to take actual actions in the real world to accomplish that task.

Hassabis: Exactly. You can take a look at our Gemini, specifically the real-time version of Gemini. In this version, you can hold up your phone and point it at the surrounding world - I suggest anyone try it - its understanding of the real world has reached an amazing level. We're considering integrating it into a more convenient device, like glasses, and then it will become a true daily assistant. When you're walking down the street, it can recommend various things to you. We can also embed it in Google Maps.

In the field of robotics, we've built something called the "Gemini robot model," which is fine-tuned on the basis of the Gemini model using additional robot data. In the demonstration released this summer, two robotic hands were manipulating objects on a table. You can directly talk to the robot, for example, "Put the yellow object into the red bucket," and it can convert the language into precise action instructions.

This is the power of a multimodal model, not just a robot-specific model: it brings its understanding of the real world into the interaction. Ultimately, what you need is not only a user-friendly interface (UI/UX) but also the cognitive ability for robots to navigate the world safely.
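
A rough sketch of the interface shape being described: an instruction in natural language, a perception step that grounds the named objects, and a planner that emits motion commands. All of the function and class names below are made up for illustration; the actual Gemini robot model maps instruction and images to actions end to end with a neural network rather than with rules like these.

```python
from dataclasses import dataclass

@dataclass
class DetectedObject:
    name: str
    position: tuple[float, float, float]     # x, y, z in metres, robot frame

# Stubbed perception: a real system would run a vision model on the camera feed.
def perceive() -> list[DetectedObject]:
    return [
        DetectedObject("yellow block", (0.30, 0.10, 0.02)),
        DetectedObject("red bucket", (0.45, -0.20, 0.00)),
    ]

# Toy language-to-action step: ground the colours mentioned in the instruction,
# then emit a fixed pick-and-place sequence of motion commands.
def plan(instruction: str, scene: list[DetectedObject]) -> list[str]:
    words = instruction.lower()
    src = next(o for o in scene if "bucket" not in o.name and o.name.split()[0] in words)
    dst = next(o for o in scene if "bucket" in o.name and o.name.split()[0] in words)
    return [
        f"move_gripper_to {src.position}",
        "close_gripper",
        f"move_gripper_to {dst.position}",
        "open_gripper",
    ]

for command in plan("Put the yellow object into the red bucket", perceive()):
    print(command)
```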

Host: I asked Sundar (Google CEO) this question. Does this mean that we can ultimately build a general-purpose robot operating system layer similar to Unix or Android? By then, if this system can run stably on a sufficient number of devices, a large number of robot devices, companies, and products will emerge and suddenly flourish globally, because a general-purpose software foundation will exist.

Hassabis: Exactly. We're indeed implementing a strategy similar to the "Android model," if you will. We're building a general-purpose operating system layer across robots and also exploring vertical integration: deeply integrating the latest models with specific robot types to achieve end-to-end learning optimization. Both paths are quite interesting, and we're advancing them in parallel.
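
The "Android model" analogy can be pictured as a thin hardware-abstraction layer: each robot body implements the same small interface, and the general-purpose policy above it only ever talks to that interface. The names below are hypothetical, purely to show the layering.

```python
from abc import ABC, abstractmethod

class RobotDriver(ABC):
    """Hypothetical hardware-abstraction layer, analogous to Android's HAL:
    each robot vendor implements this, and the general-purpose model above it
    never needs to know which body it is driving."""

    @abstractmethod
    def get_camera_frame(self) -> bytes: ...

    @abstractmethod
    def send_joint_command(self, joint_targets: list[float]) -> None: ...

class TwoArmTabletopDriver(RobotDriver):
    def get_camera_frame(self) -> bytes:
        return b"frame-bytes"                         # stub camera

    def send_joint_command(self, joint_targets: list[float]) -> None:
        print("moving joints to", joint_targets)      # stub actuation

def run_policy(driver: RobotDriver, instruction: str) -> None:
    # A hardware-agnostic policy: it consumes frames, emits joint targets,
    # and works on any robot body that implements RobotDriver.
    _frame = driver.get_camera_frame()
    driver.send_joint_command([0.0, 0.5, -0.25])

run_policy(TwoArmTabletopDriver(), "put the yellow object into the red bucket")
```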

Host: Do you think the humanoid form is a good design for robots? There's some controversy in this regard. Some people think that the human environment is designed for humans, but specific tasks may require specialized forms - for example, folding clothes, washing dishes, or cleaning may require different structural designs.

Hassabis: I think both will have their place. Actually, 5-10 years ago, I firmly believed that specific tasks required specialized robots, especially in the industrial field. The types of robots needed in laboratories and production lines are completely different, and they all need to be optimized in form for specific tasks.

However, for general-purpose or personal-use robots, the humanoid form may be crucial because the physical world we live in is designed for humans. All facilities such as stairs and porches are built based on human ergonomics. Instead of transforming the world, it's more reasonable to make robots adapt to the existing human environment.

So I think it's reasonable to say that the humanoid form is extremely valuable for daily tasks, but specialized robot forms also have their irreplaceable application scenarios.

Host: What are your expectations for the next five to seven years? I mean, what's your vision for robotics technology?

Hassabis: I do, and I've spent a lot of time thinking about it. I feel that we're still in the early stages of robotics technology. There will be real "disruptive moments" in the next few years, but the current algorithms still need to be upgraded. The general foundation on which these robot models rely needs to become more reliable and more accurate in understanding the world. I believe these breakthroughs will be achieved in the next two or three years.

Then there's the hardware aspect. The key issue is the timing of scaling up. I think ultimately we'll have millions of robots helping society and improving productivity. But when you talk to hardware experts, you need to determine at what stage the hardware is good enough to commit to a scale-up plan. When we