HomeArticle

Is the agent a new attack entry point? What exactly does OpenAI review internally before a model is launched? Board members explain in detail for the first time.

极客邦科技InfoQ2026-05-11 11:31
Today, the real challenge of AI security is no longer just "whether the model will say something wrong."

In the past few years, most discussions about AI have centered around the growth of model capabilities themselves: stronger reasoning, longer context handling, more human - like interaction methods, and increasingly autonomous AI agents. However, in the view of Zico Kolter, a member of the OpenAI board and the head of the Machine Learning Department at Carnegie Mellon University, what truly deserves attention is not just the improvement of capabilities itself, but that AI systems are entering a new phase of "self - generation, self - reinforcement, and self - expansion." The entire industry still lacks a sufficiently clear understanding of what this change actually means.

What makes Kolter special is that he doesn't discuss AI risks from a single perspective. As the chair of the OpenAI Safety & Security Committee (SSC) and one of the most important AI Security researchers globally, Kolter has long been at the intersection of cutting - edge model research, security governance, and AI attack - defense studies.

In a recent nearly two - hour in - depth conversation, he systematically discussed OpenAI's model release review mechanism, why stronger models don't automatically lead to higher security, and why prompt injection has become a core risk in the era of agents.

Different from many general discussions about AI risks, Kolter's perspective is highly engineering - oriented. He repeatedly emphasizes: The real challenge in AI security today is no longer just "whether the model will say something wrong." As agents start to have long - term task execution capabilities, tool - calling capabilities, and real - world permissions, the attack surface of AI is expanding rapidly, and the security system must also evolve synchronously.

The following is compiled from the interview video. InfoQ has made deletions and edits without changing the original meaning.

What Really Happens Before a New Model is Released

Mat: Over the past few years, you've gradually become one of the most influential figures in the fields of AI governance and AI security. I think a good starting point is to talk about your role at OpenAI. You joined the OpenAI board a few years ago and are now also a member of the security committee. Can you help everyone understand your specific position at OpenAI and what you're responsible for?

Zico Kolter: Of course. I joined the OpenAI board in August 2024. Soon after, I started serving as the chair of the Safety & Security Committee (SSC).

This committee is mainly responsible for overseeing security issues during the model development process. More precisely, it oversees the overall governance mechanism of OpenAI in model development and security governance.

Specifically, there is a very large security organization within OpenAI, which includes many different teams, each responsible for different aspects of security work. For example: the Safety Systems Team, the Preparedness Team, the Alignment Teams, the Model Policy Teams, and many other teams with different focuses.

The essence of the SSC's responsibility is to provide governance - level supervision of this entire system. The actual work includes: having meetings with these teams; understanding what they're doing; asking questions related to model security; understanding the preparations before model release; and understanding how they design and implement various security guardrails. We don't directly participate in specific R & D, but we do participate in the supervision of the entire process.

One of the SSC's more public and easily noticeable responsibilities is to conduct a review before the official release of a model. Before the release of a major model, the SSC will organize a large - scale review meeting, which many team members will attend. OpenAI has many standards for model release, such as preparedness, which we can discuss in detail later.

The team will submit a large amount of materials to us, including: information on model capabilities, security test results, third - party evaluation reports, and various risk analyses. Based on these materials, we will judge whether these models meet the policies and standards set by OpenAI. Essentially, the team first completes its internal work and then reports to us. If we think there are still issues that need further understanding, we can request a delay in model release.

Mat: So what's the specific process like? For example, will you call Sam and say, "GPT - 5.5 can't be released now"?

Zico Kolter: In reality, it's more like sending an explanatory email or memo after the meeting, saying, "We need to see additional information or further verification."

Mat: Does this happen often, or is it a very special case?

Zico Kolter: I don't want to discuss too many details of the specific process here. But basically, for every major model release, we will hold such a meeting, and communication usually starts long before the official release. The committee will continuously communicate with researchers to understand the development of the model, so there are usually no "sudden surprises." Essentially, this is still a supervisory role.

I know the topic of "corporate governance" doesn't sound particularly exciting, but if you're familiar with corporate governance, it's actually very similar to the audit committee on a board. The audit committee oversees finances, often communicates with the CFO, and reviews materials submitted to the SEC. I believe AI companies must also establish similar governance mechanisms. Since AI has developed into a huge industry, it needs this level of supervision and assurance mechanisms. So I really hope that more AI companies will establish institutions similar to the "Safety & Security Committee" in the future - no matter what they're specifically called - to be specifically responsible for overseeing the model release and governance process.

Mat: I agree. As a VC who often participates in audit committees and compensation committees, I know that corporate governance usually isn't the most eye - catching topic. But when models can have a huge impact on the entire world, the importance of this matter is completely different. You mentioned earlier that there are many teams related to security and safety within OpenAI. Can you talk more specifically about how they're organized internally?

Zico Kolter: Of course. The organizational structure of these teams will actually be adjusted to some extent. I don't want to over - emphasize the specific structure because it's not the most core part. What's really important is what these teams are doing.

For example, OpenAI has a Preparedness Team. The Preparedness Framework itself is public. OpenAI has publicly released the relevant framework. I remember the first version was released in February 2024 - even before I joined the board. The framework has been updated several times since then.

So - called preparedness is essentially a document that stipulates what security conditions must be met when a model's capabilities reach certain thresholds. I think this is a very good idea for model release security. Of course, I must emphasize that not all AI security issues are applicable to this framework.

It mainly targets "catastrophic harms." The basic logic is that when a model's capabilities develop to a certain level, on the one hand, these capabilities can be used in a large number of positive scenarios, and on the other hand, they may also be exploited by malicious actors. For example, the stronger a model's capabilities in biological knowledge, the higher the risk of it being used for harmful purposes. The same is true for cybersecurity. Of course, we hope that models can help identify and fix software vulnerabilities, as this is one of the most valuable application directions of AI. However, the problem is that such capabilities naturally have a dual - use property - they can be used for defense, but they may also be used for attack.

The role of the preparedness framework is to systematically list these risk types, including: biological risks, cyber risks, and AI self - improvement risks, and then conduct evaluations through benchmark tests. Some of these evaluations are completed by OpenAI, while others are carried out by external institutions.

Then, the framework will stipulate what security guardrails must be in place for a model to operate or be released when its capabilities reach a certain threshold. This is the basic idea of preparedness.

I think the entire industry has established quite good standards in this regard. Not only does OpenAI have a preparedness framework, but Anthropic has RSP (Responsible Scaling Policies), and Google DeepMind has the Frontier Model Framework. Many companies are doing similar things.

Of course, I still need to emphasize that this is only part of the entire AI security picture, because there are still many risks that do not belong to "catastrophic abuse." Some issues are more related to model behavior, such as: what a model should reject, what it should allow, and how it should behave in specific scenarios. There are also some risks that have actually risen to the "social system level." They are not caused by the release of a single model, but are the result of the continuous evolution of the entire AI ecosystem.

I think a very obvious trend now is that AI security is shifting from "model - level problems" to "ecosystem - level problems." People are no longer just concerned about "what a single model can do," but about "what capabilities the entire AI system as a whole is acquiring." Therefore, all these issues must be included in the scope of AI security. This is also why there are so many different security teams within OpenAI. And preparedness is just a relatively clear, public, and institutionalized model release governance framework.

Bigger Models Don't Mean Safer

Mat: You mentioned earlier that OpenAI, DeepMind, and Anthropic are all promoting various security frameworks and governance mechanisms. From the perspective of the entire industry, how do you think the development speed of AI security governance and security compares with the growth of model capabilities themselves? After all, we can clearly see that model capabilities are increasing at an astonishing speed. So do you think the progress in the broad field of AI security as a whole has kept up with this pace?

Zico Kolter: I think the security field is definitely making progress and has achieved many results. The problem is - as you said - that model capabilities themselves are also growing at a high speed. Objectively speaking, in many quantifiable evaluation dimensions, today's models are indeed safer than they were a year ago. Their security guardrails are more difficult to bypass, and their overall robustness has also improved. In many testable scenarios, the cases of models deviating from expectations (misalignment) have also decreased. I remember Jan Leike from Anthropic once shared some charts on Twitter showing the downward trend of model misalignment over time. Therefore, from a very practical perspective, models are indeed getting better continuously.

But at the same time, another thing is happening: the "control surface" of models is expanding at an unprecedented speed. Models can perform more and more actions, and the ways in which AI is integrated into real - world systems are becoming more and more complex. They are penetrating into various infrastructures we use every day. Moreover, the autonomy given to agentic systems today is far greater than it was a year ago. So the real question is: Can the improvement of security capabilities keep up with the expansion speed of AI deployment scale?

In a sense, the fact that these models can still work stably today already shows that the progress in security has indeed played a role. But there is always a core challenge in the future: How can we ensure that the progress of security work can at least keep up with the speed at which AI spreads and penetrates into the real world?

This requires continuous investment. Not only do model providers need to invest, but third - party security institutions and end - users also need to take responsibility. The reality is that we are deploying AI in more and more places, and it is becoming an ubiquitous basic capability. The question is no longer "whether to deploy AI," but: How can we ensure that the security mechanism can continuously keep up with the evolution speed of model capabilities?

Mat: It's very interesting. I want to follow up on what you mentioned earlier - as models become stronger, are they also becoming safer? I know you organized the largest - ever agent red - team attack competition, with a total of 1.8 million attack attempts. So what was the conclusion you finally observed? What is the relationship between model capabilities and vulnerability?

Zico Kolter: This project was done when I was at Gray Swan. Gray Swan is an AI security company I co - founded more than two years ago. The phenomenon we observed in that study is actually quite common.

Many people default to a certain way of thinking: If a model is not good at something now, what should we do? Just wait for the next - generation model. And in many fields, this logic does hold. For example, if you want a model to be better at math, law, or programming - usually, as long as you wait for a larger model, better post - training, and stronger reinforcement learning tuning, the capabilities will generally improve. Sometimes, when you train a model to improve a certain ability, it may also improve in other abilities.

But so far, we haven't seen the same pattern in "robustness." That is to say, a model doesn't automatically become more difficult to manipulate or attack just because it gets larger. Of course, this doesn't mean that models haven't improved in these dimensions. They are indeed making progress. But this progress doesn't come for free.

If you really want to make a model more robust and secure, you must explicitly and specifically train for security capabilities. For example, conduct specialized security training, add input - output monitoring modules, add additional filtering layers, build independent security subsystems, and introduce more external monitoring mechanisms. Moreover, security is not just a problem of the model itself; it will ultimately extend to the entire system level. You need to monitor how the model is used; in some cases, you also need to use large language models to monitor large language models. Modern AI security is essentially a whole set of hierarchical security systems.

And these things cannot be bypassed. You can't expect a model to automatically become secure just by "getting larger." True security can only be achieved through a large amount of engineering investment and systematic construction. This is why many AI companies are continuously investing heavily in the security field today. The reason we can see continuous improvement in the security dimension of models is not because the improvement of capabilities naturally brings about an improvement in security, but because someone has really done a lot of additional work behind the scenes.

Mat: Where do security problems actually come from? Is it because after a model's reasoning ability becomes stronger, it can come up with both good and bad ideas? Or does it come from the training data itself?

Zico Kolter: To answer this question, we first need to break down the concept of "AI security." Because it is actually an extremely broad term, and I think it must be broad enough. The reason is that AI security actually includes many fundamentally different problems, but people often use the same word to refer to these problems.

I usually roughly divide AI risks into four categories. Of course, I must first note that all classification systems are not completely correct; at most, they are "useful." This classification is also incomplete, but this is how I personally understand it.

The first type of risk is the risk caused by the model itself making mistakes. This includes hallucinations, the model spouting nonsense, misunderstandings, and making obviously unreasonable judgments. Prompt injection actually belongs to this category to some extent because, in essence, it is still the model not really understanding the complete context and being "tricked" by others. That is to say, this type of risk is essentially the imperfection of the model's capabilities, which are obvious mistakes from a human perspective.

The second type of risk is "harmful use." This problem is completely different from the first type. The first type of problem comes from the model not being smart enough, while the second type of problem comes from the model being too smart. For example, if a model is very good at biology, which is a good thing in itself, but malicious users may also use this ability to do bad things. The problem is not when the model fails, but when it succeeds.

The third type of risk is more related to the social and psychological levels. This involves the impact of AI on society, the economy, and human - to - human relationships. Humans did not evolve to have long - term conversations with such systems, and now we are starting to establish a continuous interaction relationship with them. This in itself will bring new risks.

The fourth type of risk is the so - called "runaway scenario." That is, the model becomes so powerful that it starts to surpass humans in some fields, may even be able to self - improve, and we gradually lose the familiar control ability. People can of course continue to imagine various possibilities for what will happen next.

I want to emphasize that I'm not saying that these risks will definitely occur, nor am I judging their probability of occurrence. Some risks we have already seen, while others are still just potential possibilities. But they are all real problems that must be seriously considered. At least within OpenAI, people do seriously discuss these problems. I think the entire AI industry, including the research community, has a very broad understanding of these risks. Even if a team only focuses on one type of risk, they usually know the overall situation.

Therefore, when we talk about AI risks and AI security, we can't just focus on one problem and ignore the others. Otherwise, even if the system is completely immune to prompt injection attacks, if it can still be used for harmful purposes, the problem still exists; and vice versa. AI security is becoming an increasingly real and urgent problem, and we must promote this work in a more holistic way.

The Dispute between "Accelerationists" and "Doomers"

Mat: Over the past few years, the debate between "accelerationists" and "doomers" has been very intense and seems to repeat with the industry cycle. What do you think of this discussion? Is this dichotomy really helpful?

Zico Kolter: I actually don't like these labels, and I don't like either of the two labels because they often carry obvious negative connotations. As long as someone expresses strong concerns about AI risks, they will be called a "doomer"; and if someone advocates for promoting model release, they will be labeled an "accelerationist."