docs: add case studies for 4 key principles with industry cross-references

Enriched principles #6 (Cheapest Model per Task), #9 (Checkpoint/Resume),
#10 (Diagnose Before Retrying), and #29 (Emergency Stop) with 14 concrete
case studies and 24 industry cross-references including GitHub Copilot,
Knight Capital, Apache Kafka, Tesla Autopilot, and workspace-internal examples.
2026-04-11 10:56:35 +02:00
parent 325935226d
commit 2502f17bff
5 changed files with 220 additions and 0 deletions


@@ -0,0 +1,46 @@
# Principle #6: Cheapest Model per Task — Case Studies
> Don't use Opus for what Haiku can do.
---
## Case Study 1: GitHub Copilot's Model Tiering
GitHub Copilot initially used a single Codex model for all completions. As costs scaled with millions of users, GitHub introduced tiered model routing: smaller, faster models handle inline completions (the most frequent, lowest-complexity task), while larger models handle chat-based code generation and multi-file reasoning. By 2024, GitHub reported that most Copilot completions are served by models significantly smaller than the original Codex — reducing per-completion cost by an order of magnitude while maintaining acceptance rates above 30%.
**When followed:** GitHub could offer Copilot at $10/month despite serving billions of completions. The tiered approach made the product commercially viable at scale.
**When violated:** Early Copilot usage reportedly cost GitHub more per user than the subscription price — an unsustainable model that only worked because Microsoft could absorb the loss while optimizing.
## Case Study 2: Anthropic's Own Model Lineup
Anthropic explicitly designs its model family (Haiku, Sonnet, Opus) for task-appropriate routing. Their documentation recommends Haiku for classification, extraction, and simple Q&A — tasks where Opus would deliver marginally better results at 30-75x the cost. Companies like Notion and DuckDuckGo use Haiku for high-volume, latency-sensitive features (autocomplete, summarization) while reserving Sonnet/Opus for complex reasoning tasks.
**When followed:** A classification pipeline processing 1M items/day at Haiku rates ($0.80/MTok input) costs roughly $40-80/day. The same pipeline on Opus could cost $1,500+/day — for negligibly different accuracy on a well-defined task.
**When violated:** Teams that default to the "best" model for everything hit budget walls within weeks of production deployment, then scramble to optimize under pressure — introducing bugs and regressions in the process.
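The arithmetic behind the Haiku figure above is worth making explicit. A minimal sketch (the Sonnet/Opus rates and the per-item token count are illustrative assumptions, not quoted prices):
```python
# Illustrative cost comparison for a 1M-item/day classification pipeline.
# The Haiku input rate matches the figure above; the other rates and the
# per-item token count are assumptions for the sketch, not quoted prices.
PRICE_PER_MTOK_INPUT = {"haiku": 0.80, "sonnet": 3.00, "opus": 15.00}  # USD

def daily_input_cost(model: str, items_per_day: int, tokens_per_item: int) -> float:
    """Input-token cost per day, ignoring output tokens for simplicity."""
    mtok_per_day = items_per_day * tokens_per_item / 1_000_000
    return mtok_per_day * PRICE_PER_MTOK_INPUT[model]

for model in ("haiku", "sonnet", "opus"):
    # ~75 input tokens per item lands inside the $40-80/day range quoted above
    print(f"{model:>6}: ${daily_input_cost(model, 1_000_000, 75):,.0f}/day")
# haiku: $60/day, sonnet: $225/day, opus: $1,125/day
```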
## Case Study 3: Two-Pass Writing Pipeline (Colette)
In the Colette virtual publisher system (this workspace), book chapter generation uses a two-pass approach: a cheaper model (Haiku or local Qwen) drafts the chapter, then Sonnet polishes the prose to match the author's voice profile. This achieves ~80% cost reduction compared to using Sonnet for the entire generation, with subjectively equivalent output quality — because the expensive model's strength (nuanced voice matching) is concentrated where it matters.
**When followed:** A full book fanout (14 chapters x 4 publisher variants x 2 languages = 112 generation tasks) stays under $15 instead of $75+.
**When violated:** Early experiments used Sonnet for everything, burning through API credits in a single session. The cost made iterative experimentation (the key to quality) financially painful.
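A minimal sketch of the two-pass shape described above (the `generate` helper and the model names are hypothetical placeholders, not the actual Colette implementation):
```python
# Hypothetical two-pass draft-then-polish pipeline; generate() stands in for
# whatever LLM client the real system uses.
def generate(model: str, prompt: str) -> str:
    raise NotImplementedError  # placeholder for an actual LLM call

def write_chapter(outline: str, voice_profile: str) -> str:
    # Pass 1: the cheap model produces the bulk of the tokens.
    draft = generate("cheap-draft-model", f"Draft a chapter from this outline:\n{outline}")
    # Pass 2: the expensive model is spent only where it matters, voice and polish.
    return generate(
        "premium-polish-model",
        f"Rewrite this draft to match the voice profile.\n"
        f"Voice profile:\n{voice_profile}\n\nDraft:\n{draft}",
    )
```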
---
## Industry Cross-References
| Pattern | Who Uses It | Reference |
|---------|-------------|-----------|
| Model cascade / routing | OpenAI (GPT-4 Turbo vs Mini), Anthropic (Haiku/Sonnet/Opus), Google (Flash vs Pro) | All major providers now offer tiered model families specifically for this |
| Speculative decoding | Meta (Llama), various inference providers | Small model drafts tokens, large model verifies — same principle at inference level |
| Cost-aware ML serving | Netflix (different models for different recommendation surfaces) | "Recommendations at Netflix" engineering blog |
| Cheap filter, expensive classifier | Spam filtering (SpamAssassin rules → ML model) | Industry standard since the 2000s — cheap heuristics first, expensive ML only for ambiguous cases |
---
## Key Insight
The principle isn't "use the cheapest model." It's "use the cheapest model **that meets the task's quality bar.**" This requires knowing what quality bar each task actually needs — which most teams skip, defaulting to "use the best model for everything" out of uncertainty. The fix is empirical: test cheaper models on your actual task, measure the quality delta, and only pay for the premium model when the delta matters.
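Measuring that delta does not require much machinery. A hypothetical sketch (the `classify_with` helper, the model names, and the labeled sample are placeholders for your actual task):
```python
# Hypothetical quality-delta check: run a labeled sample through each candidate
# model and pay for the premium model only if the gap matters for this task.
def classify_with(model: str, text: str) -> str:
    raise NotImplementedError  # placeholder for the real LLM call

def accuracy(model: str, sample: list[tuple[str, str]]) -> float:
    correct = sum(classify_with(model, text) == label for text, label in sample)
    return correct / len(sample)

def pick_model(sample: list[tuple[str, str]], cheap: str, premium: str,
               quality_bar: float = 0.02) -> str:
    delta = accuracy(premium, sample) - accuracy(cheap, sample)
    return premium if delta > quality_bar else cheap
```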


@@ -0,0 +1,47 @@
# Principle #9: Checkpoint / Resume — Case Studies
> Every long-running operation must be interruptible and resumable.
---
## Case Study 1: GitLab's Failed Database Migration (2017)
On January 31, 2017, GitLab.com experienced a major outage when an engineer, intending to wipe a lagging secondary during a replication incident, accidentally ran `rm -rf` on the primary's PostgreSQL data directory. The subsequent recovery revealed that none of their five backup methods were fully functional. The 6-hour restore process had no checkpoint mechanism — if it failed partway through, they had to start over. The incident led to 18 hours of downtime and permanent loss of ~6 hours of data (issues, merge requests, comments).
**When followed:** Modern migration tools (like Django's migration framework or Flyway) apply changes as discrete, numbered steps. If step 47 of 200 fails, you resume from step 47 — not step 1. Each completed migration is recorded in a tracking table.
**When violated:** GitLab's restore was an all-or-nothing operation. The team live-streamed the recovery on YouTube, manually monitoring progress for hours with no ability to checkpoint and resume if something went wrong.
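The discrete-step pattern is small enough to sketch without a framework (the tracking table and step names here are illustrative, not Django's or Flyway's actual schema):
```python
import sqlite3

# Minimal resumable migration runner: every completed step is recorded in a
# tracking table, so a rerun skips what was already applied.
MIGRATIONS = [
    ("001_create_users", "CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY)"),
    ("002_add_email", "ALTER TABLE users ADD COLUMN email TEXT"),
]

def migrate(db: sqlite3.Connection) -> None:
    db.execute("CREATE TABLE IF NOT EXISTS schema_migrations (name TEXT PRIMARY KEY)")
    applied = {row[0] for row in db.execute("SELECT name FROM schema_migrations")}
    for name, sql in MIGRATIONS:
        if name in applied:
            continue  # resume point: already-applied steps are skipped
        db.execute(sql)
        db.execute("INSERT INTO schema_migrations (name) VALUES (?)", (name,))
        db.commit()  # checkpoint after each step, not after the whole batch
```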
## Case Study 2: Apache Kafka's Consumer Offset Design
Kafka's entire consumer model is built around checkpoint/resume. Every consumer tracks its "offset" — the position in the log it has read up to. If a consumer crashes, it restarts from its last committed offset, reprocessing at most a small window of messages. This design makes Kafka consumers inherently resumable.
**When followed:** LinkedIn (where Kafka originated) processes trillions of messages daily. Consumer crashes, deployments, and scaling events happen constantly — none cause data loss or require full reprocessing because every consumer checkpoints its progress.
**When violated:** Teams that build custom queue consumers without offset tracking end up with "poison pill" messages that crash the consumer in an infinite loop, or "lost message" bugs where a restart skips unprocessed items because there's no record of where processing stopped.
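In client code the pattern is a manual commit after each fully processed message. A minimal sketch using the kafka-python client (the topic, broker, and `process` handler are placeholders):
```python
from kafka import KafkaConsumer  # kafka-python

def process(payload: bytes) -> None:
    ...  # application-specific handling

# Commit the offset only after a message is fully processed; a crash means
# at most the in-flight message is reprocessed on restart.
consumer = KafkaConsumer(
    "events",                          # placeholder topic name
    bootstrap_servers="localhost:9092",
    group_id="example-group",
    enable_auto_commit=False,
)
for message in consumer:
    process(message.value)
    consumer.commit()                  # checkpoint: progress survives a restart
```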
## Case Study 3: psyresearch Paper Ingestion (This Workspace)
The psyresearch project ingests academic papers from PubMed and CrossRef APIs. Early versions processed papers in a single batch — if the script crashed at paper 847 of 2000, all progress was lost. After adopting checkpoint/resume (hash-based dedup on DOI + JSONL progress log), the ingestion became interruptible: kill it at any point, restart, and it skips already-processed papers. This also enabled incremental daily runs that only process new papers.
**When followed:** A rate-limit error at paper 500 means restarting processes only papers 501+. Daily incremental runs take seconds instead of minutes because they skip the 95% of papers already in the database.
**When violated:** The original implementation would re-fetch and re-process all 2000+ papers on every run, wasting API calls and risking rate limit bans from PubMed's API.
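A minimal sketch of that pattern, an append-only progress log keyed on DOI (file names and helpers are illustrative, not the actual psyresearch code):
```python
import json
from pathlib import Path

PROGRESS_LOG = Path("ingested.jsonl")

def already_done() -> set:
    """DOIs recorded by previous (possibly interrupted) runs."""
    if not PROGRESS_LOG.exists():
        return set()
    return {json.loads(line)["doi"] for line in PROGRESS_LOG.read_text().splitlines() if line}

def store(paper: dict) -> None:
    ...  # placeholder for fetch/parse/insert into the database

def ingest(papers: list) -> None:
    done = already_done()
    with PROGRESS_LOG.open("a") as log:
        for paper in papers:
            if paper["doi"] in done:
                continue  # resume: skip work finished before the interruption
            store(paper)
            log.write(json.dumps({"doi": paper["doi"]}) + "\n")
            log.flush()  # checkpoint one item at a time, not one batch
```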
---
## Industry Cross-References
| Pattern | Who Uses It | Reference |
|---------|-------------|-----------|
| Write-ahead log (WAL) | PostgreSQL, SQLite, all major databases | Core durability mechanism: every change is written to the log before it is applied to the data files |
| Consumer offsets | Apache Kafka, Amazon SQS (visibility timeout), RabbitMQ (ack) | Message queue standard since the 2010s |
| Job checkpointing | SLURM (HPC), PyTorch (model training), Kubernetes Jobs | Long-running compute workloads save state periodically |
| Resumable uploads | Google Drive, AWS S3 multipart upload, tus.io protocol | Large file uploads broken into chunks with server-side tracking |
| Git itself | Linus Torvalds | Every commit is a checkpoint. Branching is cheap resumption from any point. |
---
## Key Insight
Checkpoint/resume is not a nice-to-have for long-running operations — it's a correctness requirement. Any operation that takes longer than a few seconds can be interrupted (network failure, OOM kill, user Ctrl-C, power loss). The question isn't "will it be interrupted?" but "when it's interrupted, how much work do we lose?" The answer should be "at most the current item," never "everything."


@@ -0,0 +1,55 @@
# Principle #10: Diagnose Before Retrying — Case Studies
> When something fails, understand why before trying again.
---
## Case Study 1: Knight Capital's $440M Loss (2012)
On August 1, 2012, Knight Capital deployed new trading software that contained a bug in its order routing logic. When the system started sending erroneous orders at market open, the operations team's response was to restart the servers — multiple times. Each restart re-triggered the same faulty code, generating more bad orders. In 45 minutes, Knight Capital accumulated $440 million in losses. The company survived only through an emergency capital infusion from outside investors and was later acquired by Getco.
**When followed:** A proper incident response would have been: stop the system (see Principle #29 Emergency Stop), read the logs, understand that the new deployment was the variable, and roll back. The bug was in the code, not in a transient state — no amount of restarting would fix it.
**When violated:** The team retried (restarted) without diagnosing, turning a bug into an existential corporate event. Each restart was "the same command hoping for a different result."
## Case Study 2: Amazon S3 Outage (2017)
On February 28, 2017, an S3 engineer ran a command to remove a small number of servers from a subsystem. A typo caused the command to remove far more servers than intended, taking down a large portion of S3 in us-east-1. The cascading failure affected thousands of websites and services. Amazon's post-mortem revealed that the recovery tools themselves depended on S3 — creating a circular dependency that prolonged the outage to nearly 4 hours.
**When followed:** Amazon's post-mortem led to concrete changes: input validation on removal commands (never allow removing more than X% in one operation), removing circular dependencies in recovery tooling, and adding safeguards that force operators to confirm large-scale operations.
**When violated:** The initial incident was a typo (understandable), but the extended outage was caused by not understanding the dependency graph before attempting recovery. The team tried to restart subsystems that depended on the very service that was down.
## Case Study 3: Retry Storms in Microservices
A well-documented antipattern in microservice architectures: Service A calls Service B, which is slow due to a database issue. Service A's timeout triggers a retry. But Service A has 100 instances, each retrying 3 times, so Service B now receives 300 requests instead of 100 — making its database problem worse. Service B's increased latency causes Service C (which also depends on B) to start retrying. Within minutes, the entire cluster is in a retry storm, with every service hammering every other service.
**When followed:** Companies like Netflix and Google implement circuit breakers (Hystrix, Envoy proxy) that detect failure patterns and stop retrying. When a downstream service is failing, the circuit "opens" and returns a fast failure instead of adding load. Netflix's Chaos Engineering practice (Chaos Monkey) specifically tests that services degrade gracefully rather than cascade.
**When violated:** The 2015 GitHub outage was partly caused by a retry storm between their application servers and a Redis cluster. The retries amplified a small Redis slowdown into a site-wide outage lasting several hours.
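A stripped-down circuit breaker captures the core idea (thresholds and naming are illustrative; production implementations such as Hystrix or Polly add half-open probing, metrics, and isolation):
```python
import time

class CircuitBreaker:
    """Fail fast instead of piling retries onto a struggling dependency."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.failures >= self.max_failures:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast instead of retrying")
            self.failures = 0  # cool-down elapsed, allow a fresh attempt
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```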
## Case Study 4: AI Agent Retry Loops (This Workspace)
During early agent development, a common failure mode was: agent tries to run a command, gets a permission error, retries the same command, gets the same error, retries again — burning tokens and time on an inherently unfixable problem. The fix was explicit in the global CLAUDE.md: "Never retry the same command hoping for a different result." Agents now read error messages, check if the error is transient (network timeout → maybe retry) or structural (permission denied → fix the cause), and act accordingly.
**When followed:** Agent hits a permission error, reads the message, identifies it needs `sudo` or a different SSH key, fixes the command, succeeds on the next attempt. Total cost: 2 attempts.
**When violated:** Agent retries the same `git push` 5 times against a repo where the SSH key isn't configured. Each retry burns 30 seconds and API tokens, with zero chance of success. Total cost: 5 failed attempts + wasted tokens + user frustration.
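A minimal sketch of that triage step (the marker lists are illustrative, not the actual CLAUDE.md rules):
```python
# Rough triage: only transient failures justify another attempt; structural
# failures need a changed command, not a repeated one.
TRANSIENT_MARKERS = ("timed out", "connection reset", "429", "rate limit", "503")
STRUCTURAL_MARKERS = ("permission denied", "not found", "authentication failed",
                      "no such file")

def classify_error(error_message: str) -> str:
    msg = error_message.lower()
    if any(marker in msg for marker in STRUCTURAL_MARKERS):
        return "structural"  # fix the cause before running anything again
    if any(marker in msg for marker in TRANSIENT_MARKERS):
        return "transient"   # retrying with backoff and a limit is reasonable
    return "unknown"         # diagnose further before deciding
```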
---
## Industry Cross-References
| Pattern | Who Uses It | Reference |
|---------|-------------|-----------|
| Circuit breaker | Netflix (Hystrix), Microsoft (.NET Polly), Envoy proxy | "Release It!" by Michael Nygard (2007) — introduced the pattern |
| Exponential backoff with jitter | AWS SDK, Google Cloud client libraries, gRPC retry policies | Standard retry strategy that at least reduces retry storm amplitude |
| Chaos engineering | Netflix (Chaos Monkey), Gremlin, AWS Fault Injection Simulator | Proactively testing failure modes so diagnosis is pre-learned |
| Blameless post-mortems | Google SRE, Etsy, PagerDuty | "Diagnose" formalized as a team process after every incident |
| Observability (not just monitoring) | Honeycomb, Datadog, Grafana | The tooling to make diagnosis possible — traces, not just metrics |
---
## Key Insight
Retrying is not a recovery strategy — it's a bet that the failure was transient. That bet is correct for network timeouts and wrong for everything else. The cost of a wrong bet ranges from "wasted time" (agent retry loops) to "company-ending" (Knight Capital). The principle is simple: read the error, classify it (transient vs structural), then act accordingly. Retry only transient failures, and even then, with backoff and a retry limit.
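For the transient branch, "with backoff and a retry limit" has a standard shape: capped exponential backoff with full jitter (parameters and the exception type are illustrative):
```python
import random
import time

class TransientError(Exception):
    """Stand-in for whatever exception type marks a transient failure."""

def retry_transient(fn, attempts: int = 5, base: float = 0.5, cap: float = 10.0):
    """Retry a callable with capped exponential backoff and full jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise  # retry limit reached: stop and surface the failure
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```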


@@ -0,0 +1,56 @@
# Principle #29: Emergency Stop (Not-Aus) — Case Studies
> Every autonomous system needs a kill switch. One button, kills everything, no confirmation cascade.
---
## Case Study 1: Tesla Autopilot Disengagement
Tesla's Autopilot includes multiple emergency stop mechanisms: the driver can disengage by pressing the brake, turning the steering wheel, or pressing the stalk button. These are immediate, no confirmation dialog, no "are you sure?" prompt. The system also self-disengages when it detects situations it cannot handle. NHTSA data shows that the vast majority of Autopilot disengagements are initiated by the driver — the kill switch is used constantly and by design.
**When followed:** Driver notices the car approaching a construction zone with unusual lane markings. One tap on the brake instantly returns full control to the human. The transition is sub-second.
**When violated:** Systems without clear disengagement paths create "automation surprise" — the human isn't sure if they're in control or the machine is. The March 2018 Uber self-driving fatality in Tempe, Arizona involved a safety driver who may not have been monitoring the system, partly because the emergency intervention procedure was not reflexive enough.
## Case Study 2: AWS Auto Scaling Gone Wrong
In 2019, a misconfigured AWS Auto Scaling policy caused a company's infrastructure to scale from 10 instances to over 500 in minutes, generating a five-figure bill in hours. The team had no "stop scaling" button — they had to identify the runaway policy, navigate to the correct AWS console page, find the scaling group, and modify the policy. By the time they did, the damage was done.
**When followed:** Teams that implement cost alerts with automatic scaling caps (e.g., "never exceed 50 instances regardless of policy") have a built-in emergency stop. AWS now offers billing alarms and budget actions that can automatically restrict resource creation when spending exceeds thresholds.
**When violated:** Without a kill switch, the only way to stop runaway scaling was to manually find and disable the specific policy — a multi-step process that took 20+ minutes under pressure. The system had an accelerator but no brake.
## Case Study 3: OpenAI's Rate Limiting as Emergency Stop
When OpenAI launched the ChatGPT API, they implemented aggressive rate limiting and spending caps as an implicit emergency stop mechanism. Users can set hard monthly spending limits that completely cut off API access when reached. This prevents runaway agent loops (where an AI agent calls the API in an infinite loop) from generating unbounded costs. The design choice was informed by early incidents where developer scripts accidentally generated thousands of dollars in API charges.
**When followed:** A developer's agent loop hits the $100 spending cap after 2 minutes. API returns 429 errors. The developer investigates, finds the infinite loop, fixes it. Total damage: $100 (the cap).
**When violated:** Before spending caps existed, developers reported waking up to four-figure API bills from scripts that ran overnight with bugs. One widely reported case involved a developer who left a recursive summarization script running, generating $1,000+ in charges before they noticed.
## Case Study 4: Dispatch Emergency Stop (This Workspace)
The dispatch job orchestration system in this workspace was built with an explicit "Not-Aus" (emergency stop) button from Sprint 1. One API call cancels all running jobs, pauses all workers, and logs the event. The button is visible at the top of the control UI at all times. This was a Day 1 requirement, not a post-incident addition — because the system dispatches work to autonomous agents that spend real money (LLM API calls).
**When followed:** During testing, a misconfigured job template caused workers to spin up recursive sub-jobs. One click on the emergency stop cancelled everything, logged the event, and the system was cleanly paused. Total damage: ~$2 in wasted API calls.
**When violated (hypothetically):** Without the emergency stop, the recursive dispatch would have continued until either the API budget was exhausted or someone found and killed each individual worker process manually — a race condition between spending and human reaction time.
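The shape of such a kill switch is small: a shared pause flag plus one cancel-all endpoint. A minimal sketch (Flask and all names here are illustrative, not the actual dispatch API):
```python
import threading
from flask import Flask

app = Flask(__name__)
EMERGENCY_STOP = threading.Event()  # shared pause flag every worker checks

def cancel_all_running_jobs() -> int:
    ...  # placeholder: signal or kill whatever is currently executing
    return 0

@app.post("/emergency-stop")
def emergency_stop():
    """One call, no confirmation dialog: pause intake and cancel running work."""
    EMERGENCY_STOP.set()                   # workers stop picking up new jobs
    cancelled = cancel_all_running_jobs()  # cancel everything in flight
    app.logger.warning("emergency stop triggered, %d jobs cancelled", cancelled)
    return {"status": "stopped", "cancelled": cancelled}

def worker_loop() -> None:
    while not EMERGENCY_STOP.is_set():     # the pause flag in action
        ...  # pull the next job and run it
```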
---
## Industry Cross-References
| Pattern | Who Uses It | Reference |
|---------|-------------|-----------|
| E-Stop (physical) | Every industrial robot (ISO 13850), manufacturing lines, elevators | Red mushroom button — international standard since the 1970s |
| Dead man's switch | Train operators, industrial machinery, nuclear facilities | System stops if the human DOESN'T act — inverse of kill switch |
| Circuit breaker (financial) | NYSE, NASDAQ, all major stock exchanges | Trading halts when prices move too fast — automatic emergency stop |
| Spending caps | OpenAI, Anthropic, Google Cloud, AWS Budgets | Hard limits that cut off service rather than allow unbounded spending |
| Feature flags as kill switches | LaunchDarkly, Unleash, Flagsmith | Instantly disable a feature in production without deploying code |
| Kubernetes pod disruption budgets | Kubernetes | Prevents accidentally killing too many pods at once — a form of controlled emergency stop |
---
## Key Insight
An emergency stop is not a feature you add after the first incident. It's a design constraint that must exist before the system goes live. The cost of building it upfront is trivial — a cancel endpoint, a pause flag, a visible button. The cost of not having it is unbounded — measured in dollars (runaway scaling), reputation (prolonged outages), or worse (physical safety). If your system can take autonomous action, it must have an equally autonomous way to stop.


@@ -1,5 +1,21 @@
# Status Log
## 2026-04-11 — Case Studies for Key Principles
### Completed
- Created `docs/examples/` directory with 4 case study documents
- **#6 Cheapest Model per Task**: GitHub Copilot tiering, Anthropic model family routing, Colette two-pass pipeline
- **#9 Checkpoint / Resume**: GitLab 2017 DB loss, Apache Kafka offsets, psyresearch paper ingestion
- **#10 Diagnose Before Retrying**: Knight Capital $440M loss, AWS S3 2017 outage, microservice retry storms, agent retry loops
- **#29 Emergency Stop**: Tesla Autopilot disengagement, AWS auto-scaling runaway, OpenAI spending caps, dispatch Not-Aus
- Each case study includes: real-world scenario, followed vs violated analysis, industry cross-reference table, key insight summary
- Total: 14 case studies across 4 principles, with 24 industry cross-references
### What's Next
- Add case studies for remaining high-impact principles (e.g., #2 Vertical Spike, #14 Parallel by Default, #26 PDCA)
- Cross-link case studies from README.md principle entries
- Consider blog post or talk format
## 2026-03-31 — Project Created
### Completed