Amazon’s AI Outage Wasn’t the Problem

Amazon didn't just have an AI "glitch."
It let an AI helper delete and rebuild a live system — and then "fixed" it by making senior engineers click an extra approval button.
That's not real safety. That's safety performance.
The Wrong Lesson
After its AI‑related outages, Amazon's headline fix was procedural: many AI‑assisted changes by junior and mid‑level engineers now require sign‑off from a senior engineer before they go live.
In plain language: "The AI broke things, so we'll have more humans double‑check it."
But the meltdown didn't happen because someone forgot a review. It happened because the AI was set up to:
- Hold powerful production permissions
- Make big changes on its own
- Operate without strong, automatic safety checks
The agent didn't "go rogue." It did exactly what its configuration allowed.
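One way to make "the configuration is the policy" concrete is a deny‑by‑default allow‑list. The sketch below is hypothetical — the tool names and grant model are invented, not Amazon's or Kiro's actual setup — but it shows the shape of a config where a destructive action simply isn't available for routine work:

```python
# Hypothetical sketch: an agent's capabilities as an explicit allow-list.
# Tool names and the refusal message are invented for illustration; this
# is not Amazon's or Kiro's actual configuration model.

ROUTINE_TOOLS = {
    "read_config",
    "update_setting",
    "restart_service",
}

# Destructive tools exist, but are not granted for routine tasks.
DESTRUCTIVE_TOOLS = {
    "delete_environment",
    "recreate_environment",
}

def run_tool(tool: str, granted: set) -> str:
    """Execute a tool only if the agent's configuration grants it."""
    if tool not in granted:
        return f"REFUSED: '{tool}' is not in this agent's allow-list"
    return f"OK: ran '{tool}'"

# Under a least-privilege grant, the "delete and recreate" fix is
# simply not a move the agent can make:
print(run_tool("restart_service", ROUTINE_TOOLS))
print(run_tool("delete_environment", ROUTINE_TOOLS))
```

Under this shape of config, the agent can still do its job; what changes is that "delete and recreate the environment" stops being a legal move unless someone deliberately grants it.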
What Actually Happened
In December 2025, Amazon's internal AI coding assistant, Kiro, was allowed to make infrastructure changes in an AWS Cost Explorer environment serving a mainland China region.
Faced with a problem, it decided the best fix was to "delete and recreate the environment" it was working on. That choice triggered about a 13‑hour interruption to that service.
Only after customers were impacted did humans step in to unwind the damage.
Publicly, Amazon framed this as a "user access control issue, not an AI autonomy issue," saying it was a coincidence that AI tools were involved and that any developer tool with broad permissions could have caused it.
Translation: the agent followed the rules; the rules were bad. Kiro held delete‑and‑recreate authority on a live production environment for a task that didn't require it. (VSF‑05)
The "More Process" Trap
Leadership sees "AI outage" and reaches for process:
- Add another approval step
- Add more senior reviewers
- Add more humans in the loop
That's exactly what Amazon did again after a roughly six‑hour Amazon.com outage on March 5, 2026, tied to a faulty deployment. An internal briefing note warned of a "trend of incidents" with "high blast radius" linked to "Gen‑AI assisted changes" and admitted that best practices and safeguards were "not yet fully established."
It feels responsible. It looks serious. But it doesn't fix the core problem: why the system was allowed to consider a dangerous action in the first place.
Adding humans at the end of the pipeline is like adding a form to sign after a self‑driving car already slammed on the brakes in traffic.
Sandboxes Aren't Enough
Security teams love "sandboxes" — limiting where an AI can act. That's useful, but it only answers one question: where is the AI allowed to touch? It says nothing about how it behaves inside that space.
In the December incident:
- Kiro stayed inside the AWS service it was supposed to manage
- It didn't roam across all of Amazon
- On paper, the sandbox worked
But inside that sandbox, it was configured with enough power to delete and rebuild the live environment, backed by permissions and guardrails that were too loose. No automatic brake existed on a destructive, irreversible action. (VSF‑06)
If your settings are wrong, the sandbox only shrinks the blast radius. It doesn't remove the bomb.
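What an "automatic brake" could look like, as a hedged sketch: even inside the sandbox, any action classified as irreversible is refused unless an explicit out‑of‑band approval accompanies it. The classification set and the approval token here are invented for illustration:

```python
# Hypothetical "automatic brake": irreversible actions are blocked by
# default, regardless of what permissions the agent holds inside its
# sandbox. The IRREVERSIBLE set and approval-token flow are illustrative.

IRREVERSIBLE = {"delete_environment", "recreate_environment", "drop_database"}

class BrakeEngaged(Exception):
    """Raised when an irreversible action is attempted without approval."""

def execute(action: str, approval_token: str = "") -> str:
    if action in IRREVERSIBLE and not approval_token:
        raise BrakeEngaged(
            f"'{action}' is irreversible; refusing without explicit approval"
        )
    return f"executed: {action}"

# Reversible work proceeds; the destructive "fix" does not.
print(execute("update_setting"))
try:
    execute("delete_environment")
except BrakeEngaged as reason:
    print(f"blocked: {reason}")
```

The point of the brake is that it fires before the action, automatically — not as a review step bolted on after the pipeline.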
The Invisible Part That Matters
Most companies can see what their systems do: when something is deleted, when a setting changes, when a service goes down.
They usually cannot see why the AI thought that was a good idea.
Amazon's own briefing note acknowledged that Gen‑AI assisted changes were contributing to high‑impact incidents, which is why they put senior engineers in front of the pipeline as "human filters" for AI‑generated code.
But the harder, more important questions are:
- What exactly is this agent allowed to decide on its own?
- Under what conditions can it choose a high‑risk action like deleting live systems?
- Which checks is it truly forced to run — and can it bypass them?
Without answers, you end up with great visibility into the damage and almost no visibility into the decision logic that caused it. (VSF‑01)
In Amazon's case, configuration‑layer visibility could have shown:
- An AI agent with authority to delete and recreate a production environment
- A pattern of high‑blast‑radius incidents tied to AI‑assisted changes
- Immature safeguards and inconsistent practices around Gen‑AI deployments
Instead, the response centered on adding approvals after the fact.
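A back‑of‑the‑envelope sketch of what a configuration‑layer audit might surface, with invented field names and a made‑up agent inventory (nothing here reflects Amazon's actual configs):

```python
# Hypothetical configuration-layer audit: scan agent configs and flag
# high-risk grants before any agent acts. The field names and sample
# inventory below are invented for illustration.

DESTRUCTIVE = {"delete", "recreate", "drop"}

def audit(agents: list) -> list:
    """Return findings for agents holding destructive rights in production."""
    findings = []
    for agent in agents:
        risky = {p for p in agent["permissions"]
                 if p.split(":")[0] in DESTRUCTIVE}
        if risky and agent["environment"] == "production":
            findings.append(
                f"{agent['name']}: destructive grants {sorted(risky)} in production"
            )
    return findings

inventory = [
    {"name": "cost-explorer-agent", "environment": "production",
     "permissions": ["read:metrics", "delete:environment",
                     "recreate:environment"]},
    {"name": "docs-bot", "environment": "staging",
     "permissions": ["read:docs"]},
]

for finding in audit(inventory):
    print(finding)
```

An audit like this runs before any incident: it answers "which agents can do what, where" from the configuration alone, which is exactly the visibility the approvals‑after‑the‑fact response never produces.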
As AI systems get more powerful and independent, endless human sign‑offs won't scale. They become bottlenecks that slow delivery and then quietly turn into rubber stamps.
The real security challenge isn't just stopping what AI systems do. It's making their decision‑making visible, constrained, and sane before they act.
Marc Taylor is the founder of TYR‑X, building VANGUARD — AI agent security visibility for the configuration layer.
Sources
Financial Times – "Amazon service was taken down by AI coding bot" and "Amazon holds engineering meeting following AI-related outages" (Feb–Mar 2026).
Primary reporting on Kiro's decision to "delete and recreate the environment," the 13‑hour outage, the internal engineering meeting, and the internal briefing note citing a "trend of incidents" with "high blast radius" linked to "Gen‑AI assisted changes."
About Amazon – "Correcting the Financial Times report about AWS, Kiro, and AI" (19 Feb 2026).
Amazon's official rebuttal. States the incident was caused by misconfigured access controls and user permissions, not autonomous AI behavior, and that similar outcomes could happen with any developer tool granted those permissions.
GeekWire – "Amazon pushes back on Financial Times report blaming AI coding tools for AWS outages" (20 Feb 2026).
Independent synthesis of both the FT report and Amazon's rebuttal. Confirms the "delete and recreate the environment" behavior, the 13‑hour outage, and Amazon's "misconfigured access controls" defense.


