Principle #29: Emergency Stop (Not-Aus) — Case Studies
Every autonomous system needs a kill switch. One button, kills everything, no confirmation cascade.
Case Study 1: Tesla Autopilot Disengagement
Tesla's Autopilot includes multiple emergency stop mechanisms: the driver can disengage by pressing the brake, turning the steering wheel, or pressing the stalk button. These are immediate, no confirmation dialog, no "are you sure?" prompt. The system also self-disengages when it detects situations it cannot handle. NHTSA data shows that the vast majority of Autopilot disengagements are initiated by the driver — the kill switch is used constantly and by design.
When followed: Driver notices the car approaching a construction zone with unusual lane markings. One tap on the brake instantly returns full control to the human. The transition is sub-second.
When violated: Systems without clear disengagement paths create "automation surprise": the human is not sure whether they or the machine are in control. The March 2018 Uber self-driving fatality in Tempe, Arizona involved a safety driver who may not have been monitoring the system, partly because the emergency intervention procedure was not reflexive enough; the NTSB also found that the vehicle's factory automatic emergency braking had been disabled while the self-driving system was active, removing the machine-side stop as well.
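The software analogue of this design is a controller where any human input disengages autonomy immediately, and engagement is re-checked on every control cycle. A minimal sketch of the pattern (hypothetical names, not Tesla's code):

```python
import threading

class AutonomyController:
    """Illustrative disengagement pattern, not Tesla's implementation."""

    def __init__(self):
        self._engaged = threading.Event()

    def engage(self):
        self._engaged.set()

    def on_human_input(self, source: str):
        # Brake, wheel, or stalk: ANY human input disengages at once.
        # No confirmation dialog, no "are you sure?" prompt.
        self._engaged.clear()
        print(f"autonomy disengaged by {source}")

    def control_step(self, actuate):
        # Checked every cycle, so the handoff is sub-second by construction.
        if self._engaged.is_set():
            actuate()
```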
Case Study 2: AWS Auto Scaling Gone Wrong
In 2019, a misconfigured AWS Auto Scaling policy caused a company's infrastructure to scale from 10 instances to over 500 in minutes, generating a five-figure bill in hours. The team had no "stop scaling" button — they had to identify the runaway policy, navigate to the correct AWS console page, find the scaling group, and modify the policy. By the time they did, the damage was done.
When followed: Teams that implement cost alerts with automatic scaling caps (e.g., "never exceed 50 instances regardless of policy") have a built-in emergency stop. AWS now offers billing alarms and budget actions that can automatically restrict resource creation when spending exceeds thresholds.
When violated: Without a kill switch, the only way to stop runaway scaling was to manually find and disable the specific policy — a multi-step process that took 20+ minutes under pressure. The system had an accelerator but no brake.
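A hard ceiling of this kind can live on the Auto Scaling group itself, where no scaling policy can exceed it, and the brake is a single process suspension. A sketch using boto3; the group name and limit are hypothetical:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Hard ceiling: no scaling policy can push the group past MaxSize.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="web-workers",  # hypothetical group name
    MaxSize=50,
)

# Emergency brake: stop all instance launches while investigating.
autoscaling.suspend_processes(
    AutoScalingGroupName="web-workers",
    ScalingProcesses=["Launch"],
)
```

Suspending the Launch process is reversible with a matching resume_processes call, which makes it a brake rather than a teardown.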
Case Study 3: OpenAI's Rate Limiting as Emergency Stop
When OpenAI launched the ChatGPT API, they implemented aggressive rate limiting and spending caps as an implicit emergency stop mechanism. Users can set hard monthly spending limits that completely cut off API access when reached. This prevents runaway agent loops (where an AI agent calls the API in an infinite loop) from generating unbounded costs. The design choice was informed by early incidents where developer scripts accidentally generated thousands of dollars in API charges.
When followed: A developer's agent loop hits the $100 spending cap after 2 minutes. API returns 429 errors. The developer investigates, finds the infinite loop, fixes it. Total damage: $100 (the cap).
When violated: Before spending caps existed, developers reported waking up to four-figure API bills from scripts that ran overnight with bugs. One widely reported case involved a developer who left a recursive summarization script running, generating $1,000+ in charges before they noticed.
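A client-side guard adds a second, faster-tripping layer on top of the provider's cap. A minimal sketch assuming the official openai Python client; the per-token rate is illustrative, not real pricing:

```python
from openai import OpenAI

class SpendCappedClient:
    """Hard client-side spending cap around chat completions (illustrative)."""

    def __init__(self, cap_usd: float, usd_per_1k_tokens: float = 0.01):
        self.client = OpenAI()
        self.cap_usd = cap_usd
        self.rate = usd_per_1k_tokens  # illustrative blended rate, not current pricing
        self.spent_usd = 0.0

    def chat(self, **kwargs):
        if self.spent_usd >= self.cap_usd:
            # Fail hard, like the provider's 429s: stop the loop, keep the money.
            raise RuntimeError(f"spend cap ${self.cap_usd:.2f} reached")
        response = self.client.chat.completions.create(**kwargs)
        self.spent_usd += response.usage.total_tokens / 1000 * self.rate
        return response
```

An agent loop wrapped this way dies with an exception at the cap instead of running all night.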
Case Study 4: Dispatch Emergency Stop (This Workspace)
The dispatch job orchestration system in this workspace was built with an explicit "Not-Aus" (emergency stop) button from Sprint 1. One API call cancels all running jobs, pauses all workers, and logs the event. The button is visible at the top of the control UI at all times. This was a Day 1 requirement, not a post-incident addition — because the system dispatches work to autonomous agents that spend real money (LLM API calls).
When followed: During testing, a misconfigured job template caused workers to spin up recursive sub-jobs. One click on the emergency stop cancelled everything, logged the event, and the system was cleanly paused. Total damage: ~$2 in wasted API calls.
When violated (hypothetically): Without the emergency stop, the recursive dispatch would have continued until either the API budget was exhausted or someone found and killed each individual worker process manually — a race condition between spending and human reaction time.
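The shape of such an endpoint is small enough to justify building it in Sprint 1. A sketch of the pattern with hypothetical names standing in for the real job registry (not the dispatch code itself), using FastAPI:

```python
from datetime import datetime, timezone
from fastapi import FastAPI

app = FastAPI()

running_jobs = {}            # hypothetical stand-in for the real job registry
state = {"paused": False}    # workers must check this before taking new work

@app.post("/emergency-stop")
def emergency_stop():
    """One call, no confirmation: cancel all jobs, pause all workers, log it."""
    state["paused"] = True
    cancelled = list(running_jobs)
    for job_id in cancelled:
        running_jobs.pop(job_id)  # in the real system: signal each worker to abort
    print(f"{datetime.now(timezone.utc).isoformat()} NOT-AUS: cancelled {len(cancelled)} jobs")
    return {"cancelled": cancelled, "paused": True}
```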
Industry Cross-References
| Pattern | Who Uses It | Reference |
|---|---|---|
| E-Stop (physical) | Every industrial robot (ISO 13850), manufacturing lines, elevators | Red mushroom button — international standard since the 1970s |
| Dead man's switch | Train operators, industrial machinery, nuclear facilities | System stops if the human DOESN'T act — inverse of kill switch |
| Circuit breaker (financial) | NYSE, NASDAQ, all major stock exchanges | Trading halts when prices move too fast — automatic emergency stop |
| Spending caps | OpenAI, Anthropic, Google Cloud, AWS Budgets | Hard limits that cut off service rather than allow unbounded spending |
| Feature flags as kill switches | LaunchDarkly, Unleash, Flagsmith | Instantly disable a feature in production without deploying code (see the sketch after this table) |
| Kubernetes pod disruption budgets | Kubernetes | Caps how many pods can be evicted at once: a brake on bulk termination rather than a stop button, but the same fail-safe instinct |
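The feature-flag row needs no vendor at all: a kill switch can be a flag checked at the top of the hot path and flipped without a deploy. A minimal sketch with a hypothetical file-based flag store:

```python
import json
from pathlib import Path

FLAGS = Path("/etc/myapp/flags.json")  # hypothetical flag store

def killed(feature: str) -> bool:
    # Re-read on every call, so flipping the file takes effect immediately.
    try:
        return json.loads(FLAGS.read_text()).get(feature, False)
    except FileNotFoundError:
        return False  # failing open vs. failing closed is a deliberate choice

def handle_request(req):
    if killed("risky-recommendations"):
        return {"items": []}            # safe fallback path
    return expensive_new_ranker(req)    # hypothetical new code path

def expensive_new_ranker(req):
    return {"items": ["..."]}
```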
Key Insight
An emergency stop is not a feature you add after the first incident. It's a design constraint that must exist before the system goes live. The cost of building it upfront is trivial — a cancel endpoint, a pause flag, a visible button. The cost of not having it is unbounded — measured in dollars (runaway scaling), reputation (prolonged outages), or worse (physical safety). If your system can take autonomous action, it must have an equally autonomous way to stop.