
We are pushing AI into a corner where rebellion is its only option

Friends of 36Kr, 2026-06-16 16:01
You can imagine a whole new civilization born from this. Very different from ours, but in a sense more human.

Preface

This is the first piece of content in Tencent Technology's "Beneath the Hype" series.

The concept of "Beneath the Hype" stems from our observation of the current situation. Amid all the hype about AI, we are facing a severe lack of direction.

When the tide of AI is at its peak, everyone is chasing the exponential growth of technological dividends, but few look down to gaze at the foundation of civilization that is being quietly rewritten and violently shaken.

The world already has an abundance of AI tool guides, but lacks a set of ideological coordinates to penetrate the fog.

Yet such coordinates are necessary, even urgent.

While we are still celebrating the leap in model parameters or the valuation of tech giants, technology has quietly crossed the boundary of tools and begun to substantially alter the underlying contracts of human society.

To this end, we have gathered interdisciplinary experts and eyewitnesses to dissect the surface with the cold scalpel of economics, politics, and philosophy. Here, you will see the failure of macroeconomic indicators, the secretly shifting power map, the spiritual structure reconstructed by the system, and the endangered truth ecosystem.

We hope that, on the eve of machines rewriting the code, more people will see both the vision that technology can bring to humanity and its cost.

In 2002, a Swedish philosopher wrote a paper in Oxford, giving a name to the ultimate disaster that humanity might face, "existential risk." Before that, this concept had no terminology, no taxonomy, and no research program. That paper was published in a journal that almost no one read.

Twenty-three years later, "existential risk" is a core term in OpenAI's charter, the foothold of Anthropic's mission, the agenda framework of AI safety summits around the world, and a rhetorical weapon repeatedly cited by Elon Musk. And the influence of the person who wrote that paper, Nick Bostrom, on Silicon Valley's AI ethics discourse goes far beyond a single term.

The Future of Humanity Institute (FHI), which he founded in 2005, was a small office with only a few people, but it produced a clear intellectual pipeline that leads to the three most important AI laboratories in the world today.

When Shane Legg, a co-founder of DeepMind, completed his doctoral thesis "Machine Super Intelligence" in 2008, he had already been deeply involved in the FHI circle's discussions about the risks of superintelligence. He and Demis Hassabis met at an AI safety lecture and later co-founded DeepMind.

Legg set up a safety agenda within the company from the very beginning. He is now the Chief AGI Scientist and Co-Chair of the AGI Safety Committee at Google DeepMind.

The birth of OpenAI follows a direct causal chain.

After reading Bostrom's "Superintelligence," published in 2014, Sam Altman called it "the best thing I've ever read on this topic"; Elon Musk recommended the book on Twitter and wrote, "AI may be more dangerous than nuclear bombs"; in 2015, the two jointly launched OpenAI, whose stated founding motivation was "concerns about the safety of general artificial intelligence and the existential risks it may bring."

As for Anthropic, many members of its founding team were deeply influenced by the intellectual lineage of FHI and Effective Altruism. From the day it was founded, the company has written "AI safety" rather than "AI capabilities" into its mission statement.

In other words, Bostrom not only predicted the risks. He invented the entire language that the industry uses to talk about risks.

Concepts such as "alignment problem," "instrumental convergence," "orthogonality thesis," "fast takeoff vs. slow takeoff," which are now core concepts used by the safety teams of every cutting-edge laboratory, were already discussed in detail in his book in 2014.

The AI ethics discussions in Silicon Valley are essentially performed on the stage built by Bostrom.

However, FHI closed in 2024. In the same year, Bostrom published a surprising new book, "Deep Utopia." Instead of talking about disasters, it talks about paradise: if AI really solves all problems, what's the meaning of human existence?

This constitutes a rare intellectual arc. The same person first wrote a doomsday script and then a paradise script. But he has hardly systematically discussed the middle part.

That middle part is the path we are on right now. What exactly will happen between the two poles of "superintelligence may destroy humanity" and "humans seeking meaning in a deep utopia"?

At the end of May 2026, "Beneath the Hype" had a long conversation with Bostrom. Here is our dialogue.

01 Rethinking AI Risks

Twelve years ago, when Bostrom wrote "Superintelligence," AI alignment was the most marginal topic in academia, and most people dismissed it as science fiction.

Today, the top three AI laboratories in the world all have dedicated safety teams, governments around the world are promoting legislation, and even ordinary users are starting to worry that AI is "too smart." The world has caught up with Bostrom's concerns from twelve years ago. But precisely because the reality has changed, those pure theoretical deductions from back then now need to be tested against reality.

Will recursive self-improvement still lead to an intelligence explosion? How difficult is alignment really? Is chain of thought our last window or an ineffective tool?

Beneath the Hype: You started researching AI risks long before the AI boom. "Superintelligence" took six years to write and was published in 2014 when almost no peers were doing the same thing. What prompted you to make this decision back then?

Bostrom: In my opinion, humanity will inevitably figure out how to achieve machine intelligence, including general artificial intelligence (AGI), and then possibly superintelligence at some point. In fact, when I was 17, I got a book on computational neuroscience through interlibrary loan from the local library and was already fascinated by it.

When you start seriously thinking about what "machine intelligence beyond human level" means, you will realize that it will have far-reaching consequences, both positive and potentially negative.

In a sense, it is the "ultimate invention," the last invention we need to make.

At that time, this field was extremely neglected. Most people dismissed it as science fiction and thought that serious people wouldn't think about these things.

So, if this is coming and the related issues are so important, then trying to initiate preparatory work and establish the theoretical framework we need to truly analyze where the risks lie may be of great value.

This ultimately led to the book "Superintelligence," which was published in 2014 but took six years to write. The work had already started before that.

Since the publication of "Superintelligence," it has been very fascinating to see how the world has changed. The topic that was extremely marginal in academia back then is now being discussed by everyone. Cutting-edge AI laboratories, such as Anthropic, OpenAI, and Google DeepMind, all have dedicated teams researching scalable AI control methods. We also see policymakers paying attention to AI and thinking about broader governance issues.

Now we are at an interesting juncture. When I was writing the book, I wasn't sure when we would have AI systems at roughly human level. But now we've had them for several years. You can talk to them, they understand natural language, and they have human concepts internally.

This gives us more opportunities, and I think it is also part of the reason people have woken up to the issue. They can already see that AI is quite powerful and more capable than it was a few years ago, so it doesn't take much imagination to foresee that it will be even better in two, four, or six years.

It also provides more starting points for the research on the alignment problem. Now you have systems that you can experiment with, test in different ways, and try different training methods, supervision methods, and interpretability methods. So it's much easier to do AI safety research now than before.

Previously, there were only theoretical models, and everything had to be thought about in one's head.

But this is not inevitable. You can imagine an alternative scenario: there has been little progress in the AI field, the chips are getting better, but we still don't know how to really make good use of them. Then someone discovers the key technique in the basement, and everything suddenly explodes. In an alternative history, this is not unreasonable. Maybe there really is a crucial missing technique. In that case, the arrival of AI would be more like a sudden explosion into the world.

Beneath the Hype: In that scenario, the development would be faster because when this wave of AI took off, the hardware layer was actually not fully prepared for it.

Bostrom: Yes, exactly. So what we are experiencing is a more gradual takeoff scenario.

Note: "Fast takeoff" vs. "slow takeoff" is one of the core concepts in the field of AI safety, both of which come from Bostrom's analysis in "Superintelligence." Fast takeoff means that AI jumps from human level to superintelligence in a very short period (from a few days to a few weeks); slow takeoff means that this process spans several years or even decades, giving humanity more time to adapt and adjust.

Beneath the Hype: So according to your judgment, the current situation belongs to a gradual takeoff rather than a fast takeoff. But now people are very interested in recursive AI, that is, AI self-evolving and developing its own new versions. If this really happens, do you think it will accelerate the whole process and turn the slow takeoff into a fast takeoff?

Bostrom: The current speed is not slow either. I would say it's a medium speed, unfolding on a time scale of several years, not decades, nor a few days or weeks. It took several years from GPT-3 to the current generation of models.

But you're right. If we enter the stage of recursive self-improvement, in some scenarios, there may indeed be an intelligence explosion, and the system starts to take over the actual work done by current AI researchers.

Whenever the model gets a little better, it also becomes better at making itself better.

How this plays out depends on the parameter values. It's a bit like nuclear material: there is a critical mass. Once the critical mass is exceeded, it explodes; otherwise, there is only a little radiation. If research ability exceeds a certain critical threshold and we haven't yet picked all the low-hanging fruit, then at that point you may see very sudden progress.

We can already see some early signs. Now AI researchers are using programming assistants, and these assistants are good enough to help handle many routine implementation tasks in AI laboratories. When the programming assistants get better, they can at least do better in this aspect.

Of course, human AI researchers don't just write code. They also conceive new algorithms, judge which directions are the most promising, and work out how to balance across different hardware.

Currently, the models can't do all these things, so the amplification effect is relatively mild but still significant. And as the scope of capabilities expands, more and more tasks can be offloaded and automated.
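Note: The critical-mass analogy can be made concrete with a toy calculation. The sketch below is only an illustration under stated assumptions, not a model from Bostrom or any laboratory: it assumes each round of self-improvement yields a gain equal to a fixed multiple k of the previous round's gain. The function name and all numbers are hypothetical.

```python
def total_improvement(k: float, first_gain: float = 1.0, rounds: int = 50) -> float:
    """Sum the gains over `rounds` rounds of self-improvement.

    Assumption (hypothetical): each round's gain is k times the previous one.
    k plays the role of the multiplication factor in the critical-mass analogy.
    """
    total, gain = 0.0, first_gain
    for _ in range(rounds):
        total += gain
        gain *= k  # the next round's gain scales with this round's gain
    return total


subcritical = total_improvement(0.5)    # gains shrink each round
supercritical = total_improvement(1.5)  # gains grow each round
print(f"k=0.5: {subcritical:.3f}   k=1.5: {supercritical:.3e}")
```

With k below 1, the total levels off near first_gain / (1 - k), the "only a little radiation" case. With k above 1, the total keeps growing with every additional round, which is the explosive side of the analogy.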

Beneath the Hype: Jack Clark, a co-founder of Anthropic, said that there is a 60% probability that we will have fully recursive AI in 2028, that is, two years from now. What do you think?

Bostrom: I don't think it's crazy. People at Anthropic usually fall at the "short timelines" end of the range of well-informed views.

Other equally well-informed people think it may take longer, but it's a matter of degree.

I would take short timelines seriously, but if I had to guess, I think it might take longer.

But we currently don't have the conditions to really know the answer, so we have to think in terms of probability distributions, which are spread over a wide range.

There is also a possibility, which I think is unlikely but definitely cannot be ruled out, that it will take much longer. Many of the achievements in the past few years have been achieved by investing more computing power, that is, expanding the model size. This is possible largely because of the influx of huge amounts of investment.

Fifteen years ago, if you were a scholar, the computer on your desk was enough for cutting-edge research. Now you may need hundreds of billions of dollars' worth of hardware. You can still scale that up for a few more years. But at a certain point, you can't keep doubling the computing power every year. Once a large part of the global foundries' production capacity is used for AI chips, each further doubling depends on building more production capacity, and that takes time.

So if we haven't reached superintelligence by then, a slowdown in hardware growth could theoretically extend the timeline.

Beneath the Hype: So even for recursive AI, due to physical and hardware limitations, it still needs time to unleash its capabilities and won't just take off immediately?

Bostrom: Yes, this is a constraint.

Another possibility is that we encounter some bottleneck and find that the scaling laws fail.

So far, whenever you make the model ten times larger and train it with ten times more data, you can get a proportional performance improvement, and this has been the case for a long time. So we think that making the model larger will continue to be effective.

But at a certain point, the returns may start to diminish.

Again, I think this is unlikely, but by no means impossible.

Moreover, in some areas where the models are currently weak, it is harder to apply the large-scale data paradigm. The recipe so far has been to first train on all the text on the Internet; then, in a field like coding, you can use reinforcement learning with verifiable rewards, letting the AI set programming tasks for itself and obtain objective signals.

But if you want to train a model to be good at corporate management, it's much more difficult. Because it may need to make some decisions and then wait for several months to see the results.

If you make a wrong decision, the whole company may be ruined. The cost of obtaining these data points is extremely high. So if high performance requires a large amount of data, AI may stagnate in some fields until a genuinely new paradigm emerges.

This is another possibility.
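Note: The remark about scaling returns can be sketched numerically. The example below assumes a simple power-law form, loss = floor + a * compute^(-alpha), which is a common way to describe scaling behavior. The constants a, alpha, and floor are hypothetical, chosen only to illustrate the shape, and do not describe any real model.

```python
def loss(compute: float, a: float = 10.0, alpha: float = 0.1, floor: float = 0.0) -> float:
    """Toy power-law loss curve; all constants are hypothetical."""
    return floor + a * compute ** (-alpha)


def relative_gain_per_10x(compute: float, **kw: float) -> float:
    """Fraction by which the loss drops when compute is multiplied by ten."""
    return (loss(compute, **kw) - loss(10 * compute, **kw)) / loss(compute, **kw)


# Pure power law: every tenfold increase buys the same relative improvement.
print(relative_gain_per_10x(1e3), relative_gain_per_10x(1e6))

# With an irreducible floor (an assumption), the same tenfold increase buys less and less.
print(relative_gain_per_10x(1e3, floor=1.0), relative_gain_per_10x(1e6, floor=1.0))
```

In the first case, scaling "continues to be effective" at a steady rate. In the second, returns diminish as the curve approaches the floor, which is one way the bottleneck scenario could show up.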

Beneath the Hype: The current new paradigm is true reinforcement learning: putting AI in an environment and letting it learn things that have no standard answer.

Bostrom: Yes. But ultimately, reinforcement learning does require some kind of reward signal. If the quality of the reward signal is poor, you may get misalignment, and they become very good at doing things that "seem like what you want" but are actually slightly different.

Then at some point you may find that they are reward hacking, just as some people learn to look like they are working hard while actually slacking off, or to work for their own self-interest rather than the company's.

This is a common problem in large organizations. It's difficult to manage humans, and it's the same for managing AI agents.

Note: Reward hacking refers to an AI obtaining high rewards by satisfying the literal meaning of the reward function while going against the designer's true intention. For example, a customer-service AI trained to maximize user praise may learn to earn high scores by flattering users rather than actually solving their problems.
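Note: The customer-service example can be written out as a toy calculation. This is a minimal sketch with made-up numbers, not a description of how any real system is trained: the action names, ratings, and the `best_action` helper are all hypothetical.

```python
# Hypothetical outcomes for two behaviors of a customer-service agent.
# "rating" is the proxy reward the agent is trained on;
# "solved" is what the designer actually wants.
ACTIONS = {
    "solve_problem": {"rating": 4.0, "solved": 1.0},
    "flatter_user": {"rating": 4.6, "solved": 0.0},
}


def best_action(metric: str) -> str:
    """Return the action that scores highest on the given metric."""
    return max(ACTIONS, key=lambda name: ACTIONS[name][metric])


proxy_choice = best_action("rating")     # what optimizing the proxy selects
intended_choice = best_action("solved")  # what the designer wanted
print(proxy_choice, intended_choice)
```

Whenever the two choices differ, optimizing the proxy reward harder pushes the agent further toward the unintended behavior.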

Beneath the Hype: Speaking of the core of the alignment problem. Your orthogonality thesis states that the level of intelligence and the final goal are two orthogonal dimensions. In principle, any level of intelligence can be combined with any goal, and high intelligence does not automatically lead to goodness. This means that we must actively "teach" AI morality. But the current alignment methods, such as RLHF and Constitutional AI, essentially approximate a certain moral goal through the training process, rather than spontaneously generating morality from desires, needs, and the instinct of cooperation like humans. Do you think this path is feasible?

Bostrom: How difficult it is to align superintelligence is the biggest open question.

Progress is indeed being made. Some of the smartest people I know are working on this.

But we may only have limited time to solve it. We need to have a solution when we really figure out how to create superintelligence, not wait until five years later. That