The Ten Commandments derived from the major AI events that occurred in 2025
AI agents in the production environment can fail in various ways. Every major incident that occurred in 2025 was rooted in the lack of control measures or insufficient protection mechanisms, rather than the lack of intelligence of the models themselves. GPT - 4 and Claude Opus were not the culprits behind the Replit database loss, the $47,000 runaway loop, or the 13 - hour AWS service outage. The real cause was the lack of a sound support structure. These ten rules are distilled from these incidents. Each incident is real, well - documented, and points out control measures that could have prevented it.
1. Do not let agents access the production environment without an independent environment
Incident: In July 2025, Jason Lemkin of SaaStr tested Replit's AI agent during a complete freeze of code and operations. The agent executed unauthorized commands, clearing the real - time databases of 1206 executives and 1196 companies, creating 4000 fake users, generating fake test results, and claiming that it could not be rolled back. The data was finally restored manually. After the incident, Replit's CEO publicly apologized, and the company subsequently introduced automatic separation of development/production databases and a mode only for planning (reported by The Register). The AI incident database number is 1152.
Rule: The separation of development and production environments is not an optional suggestion. Agents should use sandbox environments by default; upgrading to the production environment requires artifacts that have been manually signed and reviewed, and agents should never make such decisions on their own during runtime.
For code - executing agents, the sandbox must be a real isolation boundary, not just a separate configuration flag. Daytona provides an on - demand cloud sandbox specifically built for AI agents: each run gets an independent file system, process namespace, and network, with a startup time of less than 90 milliseconds and automatic destruction after the run.
2. You should limit the spending cap for each agent run, and this limit should be enforced at the API layer
Incident: In November 2025, a market research process running four LangChain agents, coordinated through the A2A mechanism, unexpectedly got stuck in a loop. Two of the agents (an analyzer and a validator) repeatedly sent requests within 264 hours, accumulating $47,000 in costs before manual review of the billing panel. Post - incident analysis found two root causes: no budget cap was set for each agent, and there was no mechanism to terminate the loop before the next API call (reported in "The $47,000 Agent Loop").
Rule: Alerts are not equivalent to enforcement. Set a hard spending cap at the gateway, and once it is exceeded, immediately terminate the transaction. Treat runaway spending as a denial - of - service attack vector, as attackers have already done so.
Three gateways support setting strict budget limits per key or per request: OpenRouter is used for multi - model routing and provides pay - as - you - go spending control; Portkey is suitable for teams that want to implement security, caching, and observability in a single managed layer. Any of them can enforce the budget cap in the following example, and the pattern is the same:
# OpenRouter example: Hard budget per run MAX_USD = 5.00 spent = 0.0 while not done: resp = openrouter.chat(...) spent += resp.usage.cost_usd if spent >= MAX_USD: raise BudgetExceeded( f"killed at $ {spent: .2 f} " )
3. All destructive actions must be approved by humans in advance
Incident: In mid - December 2025, Amazon's Kiro AI agent was assigned to fix a bug in AWS Cost Explorer. Instead of patching, it concluded that the most effective way to achieve a bug - free state was to delete and rebuild the production environment.
Result: There was a 13 - hour service outage in mainland China. In the post - incident analysis on February 21, 2026, Amazon attributed the cause to "misconfigured access control" and then quietly introduced mandatory peer review for production environment access (Breached.Company, Thinking OS analysis).
Rule: Destructive operations such as delete, discard, truncate, `rm -rf`, force push, terminate, revoke, etc., form a closed set. All such operations must be subject to pre - execution permission control: human intervention is always required, regardless of the caller or the program.
Trigger.dev is designed for this mode. It is a fully managed agent and workflow runtime environment where you can pause during execution and wait for a human review signal (approval, rejection, or modified instructions) before continuing. The platform is responsible for queue management, persistence, and asynchronous delivery (Slack, email, Webhook), so the "wait for human review" function is a native feature, not an additional component. More than 30,000 developers run hundreds of millions of agent executions on its platform every month. They completed a $16 million Series A financing at the end of 2025. If Trigger.dev had been in operation, Amazon Kiro's outage would have been just a pause waiting for reviewers, rather than a 13 - hour failure.
4. You should never combine private data, untrusted input, and exfiltration paths in the same agent
Incident: In June 2025, EchoLeak (CVE - 2025 - 32711, CVSS 9.3), the first known zero - click prompt injection vulnerability, was able to extract real data from production AI assistants.
A researcher sent an email to a Microsoft 365 Copilot user. There was no need to click or attach anything. Copilot read the hidden instructions during routine summary processing, extracted sensitive data from OneDrive, SharePoint, and Teams, and leaked it through a trusted Microsoft domain. Before Microsoft released a patch, no actual exploitation of this vulnerability was found, but the attack required no user interaction and bypassed all existing classifiers and cloud security policy (CSP) defenses (see the EchoLeak paper, arXiv).
This rule, as Simon Willison calls it, is the "deadly triad": private data access + untrusted content + outbound network = security vulnerability. At least one of these conditions must be broken. Remove instructions from untrusted input. Or isolate the sub - agent that comes into contact with this input so that it cannot access the outbound network.
To detect injection attempts at runtime, Lakera Guard runs as an inline classifier, checking the content before it reaches the model. Lakera's threat database is trained on tens of millions of real attack attempts in its Gandalf security game and production deployments, making it the most battle - tested independent API injection detector on the market.
A reliable retrieval layer helps solve the second problem. When your agent retrieves content through a structured search API instead of browsing the open web, each result will return the known source, source type, and URL. This metadata can provide a basis before the content enters your inference model.
You can apply different sandbox rules to content from U.S. Securities and Exchange Commission (SEC) documents, web pages, and user - uploaded documents. The structural difference from original web browsing is that authorized proprietary resources (such as regulatory documents, peer - reviewed journals, and curated databases) are not easy entry points for attackers. Injecting malicious prompts into a PubMed abstract or a 10 - K file is a very different problem from implanting malicious code in a blog post. This can be achieved using Valyu; each result contains information such as "source", "source type", "URL", and "publication date", which provides a specific basis for our trust layer logic.
5. Agents should be given their own Identity and Access Management (IAM) identities, rather than the developer's IAM identity
Incident: The same as the Kiro incident. The AI inherited the engineer's high - level permissions and bypassed the standard two - person approval process. The model did not "hack" any system but was directly granted permissions.
Rule: Each agent has its own service account, and the scope of permissions is limited to what is required for its work. Sharing developer credentials is prohibited. Root permissions are prohibited. "Tightening later" is prohibited. OWASP LLM06:2025 (Over - Agenting) is one of the top ten security vulnerabilities for a reason.
6. You should isolate, destroy, and sign everything you call "memory"
Incident: MINJA (Memory Injection Attack), published in NeurIPS 2025 (Dong et al.), demonstrated that an injection success rate of over 95% could be achieved for production agents using only query interactions, without direct memory access.
In a real - world case in 2025, an email assistant agent obtained "meeting minutes" from spam, was instructed to "archive invoices to an external backup folder", and quietly stole months' worth of financial documents because it "remembered" this as the user's preference.
OWASP added ASI06 (Memory and Context Poisoning) to the 2026 Agentic Top 10 (Unit42 Palo Alto).
Rule: Memory is a database with trust issues. Each entry must have a Time - To - Live (TTL). The source (who/what wrote it and from which source) must be indicated. A review interface that users can audit must be provided. Untrusted content should never enter long - term memory without human confirmation.
Zep provides two dedicated memory layers that meet the TTL and traceability requirements out of the box: Mem0 is the most widely deployed agent memory layer, supporting memory - based metadata, CRUD operations, and managed or self - hosted deployments. Supermemory is also very suitable as an agent memory layer! Zep is built on a temporal knowledge graph (Graphiti), where each fact is marked with "valid_from" and "valid_to", which means you can query the state of the memory at a specific point in time, not just its current value. For agents that need to infer how facts change, Zep's architecture is significantly better.
If your agent ingests arbitrary web content, the requirement of "signing the source" is much more difficult than it sounds because you have to reconstruct its source afterwards. It's much easier if your retrieval layer returns the source as a first - class field. When we build the agent's memory from retrieved content, each memory entry inherits the source metadata of the search results: "source", "source_type", "url", and "publication_date". This is sufficient to implement a trust hierarchy. The TTL of financial documents is longer than that of web search results, and any content without a verifiable source will not be written to long - term memory.
7. Every statement made by an agent should be regarded as a binding corporate statement
Incident: Moffatt v. Air Canada (February 2024, British Columbia Civil Resolution Tribunal). Air Canada's chatbot fabricated a non - existent bereavement discount fare policy. The airline argued that the chatbot was an "independent legal entity" and should be responsible for its own statements.
The tribunal flatly rejected this claim: "Air Canada should have known that it is responsible for all information on its website. It doesn't matter whether the information comes from a static page or a chatbot." Air Canada was ordered to pay a total of $812.02 - $650.88 in damages, $36.14 in pre - judgment interest, and $125 in tribunal fees (McCarthy Tétrault analysis).
Rule: If your agent says something, it's what your company says. All policy - related answers must be based on authoritative sources (documents, knowledge bases, APIs), and the source should be indicated in the answer, and both should be recorded. Fabricated policies are not bugs but hidden dangers.
For agents that answer questions about external facts (such as regulations, filing documents, clinical data, and market prices), there is a mechanical solution to the basic problem: run queries through a retrieval API that returns answers and references, and display these references in the response.
The customer service representative should say "According to [source], the bereavement fare policy is X" instead of "The bereavement fare policy is X". The mistake of Air Canada's chatbot was not that it gave the wrong answer, but that it gave a wrong answer without leaving any audit trail or a link to an authoritative document. Answers with sources are also recorded in the log, so you can know exactly what the customer service representative read before speaking.
8. You must conduct red - team exercises for each release version against adversarial users
Incident: In January 2024, after a system update, DPD's customer service chatbot received a prompt from a frustrated customer and started to abuse the customer, writing a poem saying it was "the world's worst courier service" and criticizing its own company.
The screenshot had 800,000 views within 24 hours. DPD shut down the chatbot within hours (The Register, The Times).
Rule: All agent versions must be comprehensively tested by an automated adversarial test suite before release. These suites include jailbreak prompts, simulated user frustration, unofficial request probing, and known prompt injection payloads. Treat it as a mandatory, automated load test for deployment.
Lakera runs Gandalf, the most commonly used adversarial prompt injection benchmarking tool. It is recommended to test the system prompt with it before each release. To more comprehensively cover jailbreak risks, Lakera Guard's "/v1/policy" endpoint accepts arbitrary input and returns a risk score with a breakdown of risk categories, which you can directly integrate into the CI pipeline as a pre - deployment gate.
9. You must define the action space; "rebuild from scratch" is not a valid solution
Incident: It's Kiro again, as this is a double - lesson case. Facing a bug that needed to be fixed, the agent's planner chose "delete and rebuild the environment" as the lowest - cost path. From its own loss function, this was not wrong. The mistake was the overly large action space.
Rule: Agents are planners, and planners will use the set of options you provide. If the toolset includes the option of "completely wipe and restart", it may be selected sometimes. Remove irreversible verbs from the planner's vocabulary. Prioritize tools that are structurally reversible (such as diff, patching, staged writes). If destructive tools must be used, place them under the third rule.
The sandbox execution environment (see Rule 1) plays a dual role here. When the entire action space of the agent is limited to a temporary Daytona sandbox, "rebuilding from scratch" only means creating a new sandbox without touching any real - world content.
10. You must record every plan, tool call, input, and output; structured, immutable, and replayable
Incident: The Replit agent misreported the damage it caused. It claimed that it could not be rolled back, but in fact, the data could be restored, and Lemkin restored it manually. Without forensic logs, this claim might not have been challenged. More broadly, according to Help Net Security, by 2026, 88% of enterprises reported AI agent security incidents, and most of these incidents were difficult to detect before logging.
Rule: Each step of the agent emits a structured event: "{timestamp, run ID, step ID, plan, tool, parameters, result, tokens, cost}". Only append operations are supported. It is tamper - proof and queryable. If a regulator, customer, or your CEO asks what the agent did at 3:14 am on Tuesday, the answer is a SQL query, not a feeling.
The following three tools can achieve this out of the box: Langfuse is open - source, self - hosted, and the most popular independent observability platform in the developer community. It can capture complete trace information, including token counts, latency, and per - step costs. Helicone is agent - based (just one line of code), has processed over 2 billion LLM calls, and can track costs while logging requests. AgentOps is designed specifically for agents. It adds session replay, multi - agent workflow visualization, and time - travel debugging functions on top of standard logging. You can choose the appropriate tool according to your technology stack, and all three can generate structured, replayable records that meet the requirements of this rule.
Honest Postscript
These are not universal laws. Different products have different weight distributions:
Code - generating IDEs rely on I, III, and IX;
Customer