Principle #10: Diagnose Before Retrying — Case Studies

When something fails, understand why before trying again.


Case Study 1: Knight Capital's $440M Loss (2012)

On August 1, 2012, Knight Capital deployed new trading software that contained a bug in its order routing logic. When the system started sending erroneous orders at market open, the operations team's response was to restart the servers — multiple times. Each restart re-triggered the same faulty code, generating more bad orders. In 45 minutes, Knight Capital accumulated $440 million in losses. The company had to be rescued in an emergency acquisition by Getco.

When followed: A proper incident response would have been: stop the system (see Principle #29 Emergency Stop), read the logs, understand that the new deployment was the variable, and roll back. The bug was in the code, not in a transient state — no amount of restarting would fix it.

When violated: The team retried (restarted) without diagnosing, turning a bug into an existential corporate event. Each restart was "the same command hoping for a different result."

Case Study 2: Amazon S3 Outage (2017)

On February 28, 2017, an S3 engineer ran a command to remove a small number of servers from a subsystem. A typo caused the command to remove far more servers than intended, taking down a large portion of S3 in us-east-1. The cascading failure affected thousands of websites and services. Amazon's post-mortem revealed that the recovery tools themselves depended on S3 — creating a circular dependency that prolonged the outage to nearly 4 hours.

When followed: Amazon's post-mortem led to concrete changes: input validation on removal commands (never allow removing more than X% in one operation), removing circular dependencies in recovery tooling, and adding safeguards that force operators to confirm large-scale operations.
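Below is a minimal sketch of what such an input-validation guard could look like. The function name, the 5% threshold, and the decommission step are illustrative assumptions, not Amazon's actual tooling.

```python
# Hypothetical capacity-removal guard; the 5% limit is an illustrative threshold.
MAX_REMOVAL_FRACTION = 0.05  # never remove more than 5% of a fleet in one operation

def remove_servers(fleet: list[str], to_remove: list[str]) -> None:
    if len(to_remove) > len(fleet) * MAX_REMOVAL_FRACTION:
        raise ValueError(
            f"Refusing to remove {len(to_remove)} of {len(fleet)} servers "
            f"(limit is {MAX_REMOVAL_FRACTION:.0%} per operation)"
        )
    for server in to_remove:
        print(f"decommissioning {server}")  # real tooling would drain and remove the host
```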

When violated: The initial incident was a typo (understandable), but the extended outage was caused by not understanding the dependency graph before attempting recovery. The team tried to restart subsystems that depended on the very service that was down.

Case Study 3: Retry Storms in Microservices

A well-documented antipattern in microservice architectures: Service A calls Service B, which is slow due to a database issue. Service A's timeout triggers a retry. But Service A has 100 instances, each retrying 3 times, so Service B now receives 300 requests instead of 100 — making its database problem worse. Service B's increased latency causes Service C (which also depends on B) to start retrying. Within minutes, the entire cluster is in a retry storm, with every service hammering every other service.

When followed: Companies like Netflix and Google implement circuit breakers (Hystrix, Envoy proxy) that detect failure patterns and stop retrying. When a downstream service is failing, the circuit "opens" and returns a fast failure instead of adding load. Netflix's Chaos Engineering practice (Chaos Monkey) specifically tests that services degrade gracefully rather than cascade.
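A minimal circuit-breaker sketch, to make the mechanism concrete. This is not Hystrix or Envoy; the thresholds and class name are illustrative assumptions.

```python
import time

# Illustrative circuit breaker: after enough consecutive failures the circuit
# "opens" and calls fail fast instead of adding load to the struggling service.
class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (traffic flows)

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast instead of adding load")
            self.opened_at = None  # half-open: let one trial request through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # open the circuit
            raise
        self.failures = 0  # a success closes the circuit and resets the count
        return result
```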

When violated: The 2015 GitHub outage was partly caused by a retry storm between their application servers and a Redis cluster. The retries amplified a small Redis slowdown into a site-wide outage lasting several hours.

Case Study 4: AI Agent Retry Loops (This Workspace)

During early agent development, a common failure mode was: agent tries to run a command, gets a permission error, retries the same command, gets the same error, retries again — burning tokens and time on an inherently unfixable problem. The fix was explicit in the global CLAUDE.md: "Never retry the same command hoping for a different result." Agents now read error messages, check if the error is transient (network timeout → maybe retry) or structural (permission denied → fix the cause), and act accordingly.
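A sketch of what that transient-vs-structural classification can look like in practice. The pattern lists and function name below are illustrative examples, not the workspace's actual rule set.

```python
# Illustrative error classification: decide whether a retry can possibly help.
TRANSIENT_PATTERNS = ("timed out", "connection reset", "temporarily unavailable", "429")
STRUCTURAL_PATTERNS = ("permission denied", "not found", "authentication failed")

def classify_error(message: str) -> str:
    lower = message.lower()
    if any(p in lower for p in STRUCTURAL_PATTERNS):
        return "structural"   # retrying cannot help; fix the cause first
    if any(p in lower for p in TRANSIENT_PATTERNS):
        return "transient"    # a bounded retry with backoff is reasonable
    return "unknown"          # diagnose before deciding; do not blindly retry
```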

When followed: Agent hits a permission error, reads the message, identifies it needs sudo or a different SSH key, fixes the command, succeeds on the next attempt. Total cost: 2 attempts.

When violated: Agent retries the same git push 5 times against a repo where the SSH key isn't configured. Each retry burns 30 seconds and API tokens, with zero chance of success. Total cost: 5 failed attempts + wasted tokens + user frustration.


Industry Cross-References

| Pattern | Who Uses It | Reference |
|---------|-------------|-----------|
| Circuit breaker | Netflix (Hystrix), Microsoft (.NET Polly), Envoy proxy | "Release It!" by Michael Nygard (2007) — introduced the pattern |
| Exponential backoff with jitter | AWS SDK, Google Cloud client libraries, HTTP/1.1 RFC | Standard retry strategy that at least reduces retry storm amplitude |
| Chaos engineering | Netflix (Chaos Monkey), Gremlin, AWS Fault Injection Simulator | Proactively testing failure modes so diagnosis is pre-learned |
| Blameless post-mortems | Google SRE, Etsy, PagerDuty | "Diagnose" formalized as a team process after every incident |
| Observability (not just monitoring) | Honeycomb, Datadog, Grafana | The tooling to make diagnosis possible — traces, not just metrics |

Key Insight

Retrying is not a recovery strategy — it's a bet that the failure was transient. That bet is correct for network timeouts and wrong for everything else. The cost of a wrong bet ranges from "wasted time" (agent retry loops) to "company-ending" (Knight Capital). The principle is simple: read the error, classify it (transient vs structural), then act accordingly. Retry only transient failures, and even then, with backoff and a retry limit.
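A minimal sketch of that policy: retry only transient failures, with exponential backoff, full jitter, and a hard attempt limit. The predicate is caller-supplied (for example, the classify_error helper sketched in Case Study 4); the defaults are illustrative.

```python
import random
import time

# Sketch of "retry only transient failures, with backoff, jitter, and a limit".
def retry_transient(fn, is_transient, max_attempts=4, base_delay=0.5):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if not is_transient(exc) or attempt == max_attempts:
                raise  # structural failure or retry budget exhausted: stop and diagnose
            # full jitter: sleep a random amount up to the exponential ceiling
            time.sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
```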