Principle #29: Emergency Stop (Not-Aus) — Case Studies
Every autonomous system needs a kill switch. One button, kills everything, no confirmation cascade.
Case Study 1: Tesla Autopilot Disengagement
Tesla's Autopilot includes multiple emergency stop mechanisms: the driver can disengage by pressing the brake, turning the steering wheel, or pressing the stalk button. These are immediate, no confirmation dialog, no "are you sure?" prompt. The system also self-disengages when it detects situations it cannot handle. NHTSA data shows that the vast majority of Autopilot disengagements are initiated by the driver — the kill switch is used constantly and by design.
When followed: Driver notices the car approaching a construction zone with unusual lane markings. One tap on the brake instantly returns full control to the human. The transition is sub-second.
When violated: Systems without clear disengagement paths create "automation surprise": the human is not sure whether they or the machine are in control. The March 2018 Uber self-driving fatality in Tempe, Arizona involved a safety driver who may not have been monitoring the system, partly because the emergency intervention procedure was not reflexive enough; the NTSB also found that the vehicle's factory automatic emergency braking had been disabled while the self-driving system was active, removing the machine-side stop as well.
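The software analogue of this design is a controller where any human input disengages autonomy immediately, and engagement is re-checked on every control cycle. A minimal sketch of the pattern (hypothetical names, not Tesla's code):

```python
import threading

class AutonomyController:
    """Illustrative disengagement pattern, not Tesla's implementation."""

    def __init__(self):
        self._engaged = threading.Event()

    def engage(self):
        self._engaged.set()

    def on_human_input(self, source: str):
        # Brake, wheel, or stalk: ANY human input disengages at once.
        # No confirmation dialog, no "are you sure?" prompt.
        self._engaged.clear()
        print(f"autonomy disengaged by {source}")

    def control_step(self, actuate):
        # Checked every cycle, so the handoff is sub-second by construction.
        if self._engaged.is_set():
            actuate()
```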
Case Study 2: AWS Auto Scaling Gone Wrong
In 2019, a misconfigured AWS Auto Scaling policy caused a company's infrastructure to scale from 10 instances to over 500 in minutes, generating a five-figure bill in hours. The team had no "stop scaling" button — they had to identify the runaway policy, navigate to the correct AWS console page, find the scaling group, and modify the policy. By the time they did, the damage was done.
When followed: Teams that implement cost alerts with automatic scaling caps (e.g., "never exceed 50 instances regardless of policy") have a built-in emergency stop. AWS now offers billing alarms and budget actions that can automatically restrict resource creation when spending exceeds thresholds.
When violated: Without a kill switch, the only way to stop runaway scaling was to manually find and disable the specific policy — a multi-step process that took 20+ minutes under pressure. The system had an accelerator but no brake.
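A hard ceiling of this kind can live on the Auto Scaling group itself, where no scaling policy can exceed it, and the brake is a single process suspension. A sketch using boto3; the group name and limit are hypothetical:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Hard ceiling: no scaling policy can push the group past MaxSize.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="web-workers",  # hypothetical group name
    MaxSize=50,
)

# Emergency brake: stop all instance launches while investigating.
autoscaling.suspend_processes(
    AutoScalingGroupName="web-workers",
    ScalingProcesses=["Launch"],
)
```

Suspending the Launch process is reversible with a matching resume_processes call, which makes it a brake rather than a teardown.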
Case Study 3: OpenAI's Rate Limiting as Emergency Stop
When OpenAI launched the ChatGPT API, they implemented aggressive rate limiting and spending caps as an implicit emergency stop mechanism. Users can set hard monthly spending limits that completely cut off API access when reached. This prevents runaway agent loops (where an AI agent calls the API in an infinite loop) from generating unbounded costs. The design choice was informed by early incidents where developer scripts accidentally generated thousands of dollars in API charges.
When followed: A developer's agent loop hits the $100 spending cap after 2 minutes. API returns 429 errors. The developer investigates, finds the infinite loop, fixes it. Total damage: $100 (the cap).
When violated: Before spending caps existed, developers reported waking up to four-figure API bills from scripts that ran overnight with bugs. One widely reported case involved a developer who left a recursive summarization script running, generating $1,000+ in charges before they noticed.
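A client-side guard adds a second, faster-tripping layer on top of the provider's cap. A minimal sketch assuming the official openai Python client; the per-token rate is illustrative, not real pricing:

```python
from openai import OpenAI

class SpendCappedClient:
    """Hard client-side spending cap around chat completions (illustrative)."""

    def __init__(self, cap_usd: float, usd_per_1k_tokens: float = 0.01):
        self.client = OpenAI()
        self.cap_usd = cap_usd
        self.rate = usd_per_1k_tokens  # illustrative blended rate, not current pricing
        self.spent_usd = 0.0

    def chat(self, **kwargs):
        if self.spent_usd >= self.cap_usd:
            # Fail hard, like the provider's 429s: stop the loop, keep the money.
            raise RuntimeError(f"spend cap ${self.cap_usd:.2f} reached")
        response = self.client.chat.completions.create(**kwargs)
        self.spent_usd += response.usage.total_tokens / 1000 * self.rate
        return response
```

An agent loop wrapped this way dies with an exception at the cap instead of running all night.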
Case Study 4: Dispatch Emergency Stop (This Workspace)
The dispatch job orchestration system in this workspace was built with an explicit "Not-Aus" (emergency stop) button from Sprint 1. One API call cancels all running jobs, pauses all workers, and logs the event. The button is visible at the top of the control UI at all times. This was a Day 1 requirement, not a post-incident addition — because the system dispatches work to autonomous agents that spend real money (LLM API calls).
When followed: During testing, a misconfigured job template caused workers to spin up recursive sub-jobs. One click on the emergency stop cancelled everything, logged the event, and the system was cleanly paused. Total damage: ~$2 in wasted API calls.
When violated (hypothetically): Without the emergency stop, the recursive dispatch would have continued until either the API budget was exhausted or someone found and killed each individual worker process manually — a race condition between spending and human reaction time.
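The shape of such an endpoint is small enough to justify building it in Sprint 1. A sketch of the pattern with hypothetical names standing in for the real job registry (not the dispatch code itself), using FastAPI:

```python
from datetime import datetime, timezone
from fastapi import FastAPI

app = FastAPI()

running_jobs = {}            # hypothetical stand-in for the real job registry
state = {"paused": False}    # workers must check this before taking new work

@app.post("/emergency-stop")
def emergency_stop():
    """One call, no confirmation: cancel all jobs, pause all workers, log it."""
    state["paused"] = True
    cancelled = list(running_jobs)
    for job_id in cancelled:
        running_jobs.pop(job_id)  # in the real system: signal each worker to abort
    print(f"{datetime.now(timezone.utc).isoformat()} NOT-AUS: cancelled {len(cancelled)} jobs")
    return {"cancelled": cancelled, "paused": True}
```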
Industry Cross-References
| Pattern | Who Uses It | Reference |
|---|---|---|
| E-Stop (physical) | Every industrial robot (ISO 13850), manufacturing lines, elevators | Red mushroom button — international standard since the 1970s |
| Dead man's switch | Train operators, industrial machinery, nuclear facilities | System stops if the human DOESN'T act — inverse of kill switch |
| Circuit breaker (financial) | NYSE, NASDAQ, all major stock exchanges | Trading halts when prices move too fast — automatic emergency stop |
| Spending caps | OpenAI, Anthropic, Google Cloud, AWS Budgets | Hard limits that cut off service rather than allow unbounded spending |
| Feature flags as kill switches | LaunchDarkly, Unleash, Flagsmith | Instantly disable a feature in production without deploying code (see the sketch after this table) |
| Kubernetes pod disruption budgets | Kubernetes | Caps how many pods can be evicted at once: a brake on bulk termination rather than a stop button, but the same fail-safe instinct |
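The feature-flag row needs no vendor at all: a kill switch can be a flag checked at the top of the hot path and flipped without a deploy. A minimal sketch with a hypothetical file-based flag store:

```python
import json
from pathlib import Path

FLAGS = Path("/etc/myapp/flags.json")  # hypothetical flag store

def killed(feature: str) -> bool:
    # Re-read on every call, so flipping the file takes effect immediately.
    try:
        return json.loads(FLAGS.read_text()).get(feature, False)
    except FileNotFoundError:
        return False  # failing open vs. failing closed is a deliberate choice

def handle_request(req):
    if killed("risky-recommendations"):
        return {"items": []}            # safe fallback path
    return expensive_new_ranker(req)    # hypothetical new code path

def expensive_new_ranker(req):
    return {"items": ["..."]}
```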
Key Insight
An emergency stop is not a feature you add after the first incident. It's a design constraint that must exist before the system goes live. The cost of building it upfront is trivial — a cancel endpoint, a pause flag, a visible button. The cost of not having it is unbounded — measured in dollars (runaway scaling), reputation (prolonged outages), or worse (physical safety). If your system can take autonomous action, it must have an equally autonomous way to stop.