docs: add case studies for principles #2, #13, and #30

- #2 Vertical Spike Before Framework: protobuf, Rails, Meta LLaMA
- #13 Autonomous but Auditable: Constitutional AI, GPT-4 red teaming, EU AI Act
- #30 Self-Monitoring Guardian Pattern: Netflix chaos eng, Google Borg, DeepMind AlphaFold

# Principle #2: Vertical Spike Before Framework — Case Studies
> Validate architecture through working code, not docs.
---
## Case Study 1: Google's Protocol Buffers — Spike to Standard (2001-2008)
Protocol Buffers (protobuf) were not designed as a universal serialization framework. They started as an internal tool at Google around 2001, built to solve a specific problem: efficient serialization for Google's internal RPC system. The team built a working spike for their own use case (search infrastructure), iterated on it through real production traffic, and only extracted it as a general-purpose framework years later. Google open-sourced protobuf in 2008 — seven years after the initial spike. By that time, protobuf had been battle-tested across virtually every Google service.
**When followed:** Protobuf's design reflects years of real-world usage patterns (backward compatibility, schema evolution, compact wire format). These features weren't theorized — they were discovered through production incidents and evolving requirements. The framework was extracted from proven code, not designed in advance.
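To make the schema-evolution claim concrete, here is a minimal Python sketch of the idea behind protobuf's wire format: every field is written with a numbered tag, so a reader built against an older schema can skip fields it has never heard of. This illustrates varint tag-length framing only, not Google's implementation, and the message fields are hypothetical.
```python
# Minimal sketch of protobuf-style tag/value framing (illustrative only).
# Each field is written as a varint tag (field_number << 3 | wire_type)
# followed by its payload; the wire type tells any reader how many bytes
# to consume, so fields added in a newer schema are skippable, not fatal.

def encode_varint(n: int) -> bytes:
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

def decode_varint(buf: bytes, pos: int) -> tuple[int, int]:
    result, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result, pos
        shift += 7

def encode_field(field_number: int, wire_type: int, payload: bytes) -> bytes:
    tag = encode_varint((field_number << 3) | wire_type)
    if wire_type == 2:                  # length-delimited: string, bytes, message
        return tag + encode_varint(len(payload)) + payload
    return tag + payload                # wire type 0: varint

# A "v2" writer emits field 3, which a "v1" reader has never heard of.
msg = (encode_field(1, 2, b"jdoe")                # 1: user_name (hypothetical)
       + encode_field(2, 0, encode_varint(42))    # 2: request_count
       + encode_field(3, 2, b"us-east"))          # 3: region, new in v2

def read_known_fields(buf: bytes, known: set) -> dict:
    fields, pos = {}, 0
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x7
        if wire_type == 0:
            value, pos = decode_varint(buf, pos)
        elif wire_type == 2:
            length, pos = decode_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        if field_number in known:       # unknown fields are skipped, not errors
            fields[field_number] = value
    return fields

print(read_known_fields(msg, known={1, 2}))  # v1 reader: {1: b'jdoe', 2: 42}
```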
**When violated:** Thrift, Facebook's equivalent (later donated to Apache), was open-sourced in 2007 with a broader scope: multiple languages, multiple serialization formats, and a full RPC framework. The wider scope produced a more complex system with a less cohesive design. Thrift works, but its adoption never matched protobuf's, partly because it tried to be a framework from day one instead of growing organically from a spike.
## Case Study 2: Ruby on Rails — Extracted from Basecamp (2004)
David Heinemeier Hansson (DHH) built Basecamp (a project management tool) first, then extracted Ruby on Rails from the working application. Rails was not designed as a web framework and then used to build Basecamp — it was the other way around. The framework emerged from patterns that proved useful in a real product. DHH has repeatedly stated this was the key design decision: "Frameworks are extracted, not designed."
**When followed:** Rails' conventions (convention over configuration, RESTful routing, ActiveRecord) reflect patterns that actually worked in Basecamp's codebase. This gave Rails an opinionated but coherent design that developers could learn quickly. By 2006, Rails had become one of the most popular web frameworks — because its design decisions were validated by production use.
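A rough Python sketch of the convention-over-configuration idea that ActiveRecord popularized: the table name and primary key are derived from the class name rather than declared. The toy `Model` base class and naive pluralizer are hypothetical; real ActiveRecord is far richer.
```python
# Toy ActiveRecord-flavored base class: the table name and primary key
# follow naming conventions, so a concrete model needs zero mapping config.

class Model:
    @classmethod
    def table_name(cls) -> str:
        # Convention: "Post" -> "posts", "Category" -> "categories".
        name = cls.__name__.lower()
        return name[:-1] + "ies" if name.endswith("y") else name + "s"

    @classmethod
    def find_sql(cls, record_id: int) -> str:
        # Convention: every table has an integer primary key called "id".
        return f"SELECT * FROM {cls.table_name()} WHERE id = {int(record_id)}"

class Post(Model):
    pass

class Category(Model):
    pass

print(Post.find_sql(7))      # SELECT * FROM posts WHERE id = 7
print(Category.find_sql(3))  # SELECT * FROM categories WHERE id = 3
```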
**When violated:** Java's Enterprise JavaBeans (EJB) specification was designed as a framework by committee before widespread production use. EJB 1.0 and 2.0 were notoriously complex, requiring dozens of boilerplate files for simple operations. The spec was driven by anticipated needs rather than observed patterns. It took until EJB 3.0 (2006) — after years of community backlash and competition from Spring — to simplify the framework to something practical.
## Case Study 3: Meta's LLaMA Release Strategy (2023-2024)
Meta's approach to open-source LLM releases followed a spike-before-framework pattern. LLaMA 1 (Feb 2023) was released as a research artifact: a set of model weights with minimal tooling, no API, and a restrictive license. It was a spike to test the hypothesis that open-weight models could compete with proprietary ones. The community's response (massive adoption, a fine-tuning ecosystem) validated the approach. Only then did Meta invest in LLaMA 2 (July 2023) with a proper commercial license, safety fine-tuning (RLHF), and enterprise-ready tooling. The LLaMA 3 generation (2024) expanded further, adding multimodal variants later in the 3.x line and a full framework (torchtune, llama-stack).
**When followed:** Each LLaMA release incorporated lessons from the previous one's real-world usage. LLaMA 2's safety training was informed by LLaMA 1's observed misuse patterns. LLaMA 3's tooling reflected the community's actual needs (quantization, fine-tuning, deployment) rather than guesses about what developers might want.
**When violated:** Google's initial Bard rollout (announced February 2023, publicly launched March 2023) attempted to ship a full product (chat interface, integration with Google services, consumer-facing UX) without first validating the underlying model's capabilities through a research spike. The result was a product that made a factual error in its own announcement demo and spent months playing catch-up on reliability.
---
## Industry Cross-References
| Pattern | Who Uses It | Reference |
|---------|-------------|-----------|
| Spike and stabilize | Extreme Programming (XP), Lean Startup | "Build the simplest thing that could work, then iterate" |
| Extracted frameworks | Rails (from Basecamp), Django (from Lawrence Journal-World), Flask (from an April Fools' joke) | Most successful frameworks were extracted, not designed |
| Walking skeleton | Alistair Cockburn (2004) | "A tiny implementation of the system that performs a small end-to-end function" |
| Steel thread | ThoughtWorks | End-to-end implementation of one feature through all layers |
| Tracer bullet development | The Pragmatic Programmer (Hunt & Thomas, 1999) | "A single tracer through all layers of the system" |
---
## Key Insight
The pattern is remarkably consistent: successful frameworks and platforms start as solutions to specific, concrete problems — and only become generalized after proving themselves in production. Teams that start with the framework (designing for flexibility, anticipating needs, building abstractions) consistently produce systems that are either over-engineered for the actual use case or miss the actual use case entirely. The principle isn't anti-planning — it's anti-speculation. Build something real first, then extract the patterns.

# Principle #13: Autonomous but Auditable — Case Studies
> Agents work independently. Humans can always follow what happened.
---
## Case Study 1: Anthropic's Constitutional AI (2022)
Anthropic developed Constitutional AI (CAI) as a method for training AI systems where the model self-improves its responses based on a set of written principles — autonomously, without human feedback on every output. The key design choice: every step of the self-critique and revision process is logged and auditable. Researchers can inspect which principle the model invoked, what the original response was, and how it was revised. This transparency allowed Anthropic to iterate on the constitution itself based on observed patterns in the revision logs.
**When followed:** Anthropic published the full set of constitutional principles and detailed the RLAIF (Reinforcement Learning from AI Feedback) process in their research paper. External researchers could evaluate whether the principles led to the intended behavior. The audit trail from constitution to behavior made the system debuggable — when Claude produced unexpected outputs, researchers could trace back to which constitutional principle was (or wasn't) triggered.
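The shape of such an audit trail is easy to sketch. The following is not Anthropic's code; it is a hypothetical critique-and-revise loop that records which principle fired, the draft it saw, and the revision it produced, so a reviewer can replay the chain later.
```python
import json
from datetime import datetime, timezone

# Hypothetical constitution: each principle is a named predicate over a draft.
CONSTITUTION = {
    "avoid_medical_advice": lambda text: "you should take" in text.lower(),
    "avoid_absolutes":      lambda text: "always" in text.lower(),
}

def revise(draft: str, principle: str) -> str:
    # Stand-in for the model's self-revision step.
    return f"[revised under {principle}] {draft}"

def critique_and_revise(draft: str, audit_log: list) -> str:
    for principle, violates in CONSTITUTION.items():
        if violates(draft):
            revised = revise(draft, principle)
            audit_log.append({
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "principle": principle,    # which rule fired
                "before": draft,           # the draft the critic saw
                "after": revised,          # what it was rewritten to
            })
            draft = revised
    return draft

log: list = []
final = critique_and_revise("You should take aspirin, it always works.", log)
print(json.dumps(log, indent=2))   # the full chain from principle to behavior
```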
**When violated:** Earlier RLHF approaches at various labs relied on human labelers whose decision-making was opaque. When the model learned unexpected behaviors, there was no audit trail — you couldn't trace a specific model behavior back to a specific labeling decision. Debugging was guesswork.
## Case Study 2: GPT-4 Red Teaming and System Card (2023)
Before releasing GPT-4, OpenAI conducted an extensive red teaming program with over 50 external experts across domains (cybersecurity, biorisk, political science, AI alignment). The results were published in the GPT-4 System Card — a 60-page document detailing the model's capabilities, limitations, and safety evaluations. Every red team finding was documented with the test methodology, the model's response, and the mitigation applied.
**When followed:** The System Card allowed policymakers, researchers, and the public to evaluate GPT-4's safety profile before and after deployment. When issues were found post-launch (e.g., jailbreaks), the System Card provided a baseline for understanding what had been tested and what hadn't. The red teaming process was autonomous (experts worked independently) but fully auditable (everything documented in the System Card).
**When violated:** Google's rushed launch of Bard in February 2023 included a factual error in the demo itself (misidentifying which telescope first photographed an exoplanet). The lack of a published evaluation process meant there was no audit trail showing what testing had been done — leading to a $100B market cap drop and public perception that the product was released without adequate quality checks.
## Case Study 3: EU AI Act — Transparency Requirements (2024)
The EU AI Act, which entered into force in August 2024, codifies "autonomous but auditable" as law for high-risk AI systems. Article 14 requires human oversight capabilities, Article 12 mandates automatic logging of system operations, and Article 13 requires that AI systems be designed to be "sufficiently transparent to enable deployers to interpret a system's output and use it appropriately." High-risk AI systems must maintain logs for the entire lifecycle, enabling post-hoc auditing by authorities.
**When followed:** Companies deploying AI hiring tools in the EU must now log every decision the system makes, the data it used, and the criteria it applied. When a candidate challenges a rejection, the audit trail exists to investigate whether the system discriminated — even though the system operated autonomously at scale.
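A minimal sketch of what Article 12-style decision logging could look like for a hypothetical screening tool follows; the field names are illustrative, not taken from the Act.
```python
import json
import uuid
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class ScreeningDecisionRecord:
    """One append-only log entry per automated decision (hypothetical schema)."""
    candidate_id: str
    decision: str                 # e.g. "advance" or "reject"
    model_version: str            # which model produced the decision
    features_used: dict           # the inputs the system actually saw
    criteria: list                # the rules or thresholds that applied
    record_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

def log_decision(record: ScreeningDecisionRecord,
                 path: str = "decisions.jsonl") -> None:
    # Append-only JSON Lines file: a challenged decision can be looked up later.
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")

log_decision(ScreeningDecisionRecord(
    candidate_id="c-1042",
    decision="reject",
    model_version="screener-2.3.1",
    features_used={"years_experience": 3, "skills_match": 0.41},
    criteria=["skills_match >= 0.5"],
))
```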
**When violated:** Before the AI Act, Amazon's experimental AI recruiting tool (scrapped in 2018) was found to systematically downgrade resumes containing the word "women's" (as in "women's chess club"). The system had operated autonomously for years without sufficient logging of its decision patterns. By the time the bias was discovered, the damage was done and there was no audit trail to determine exactly which candidates were affected.
---
## Industry Cross-References
| Pattern | Who Uses It | Reference |
|---------|-------------|-----------|
| Model cards | Google (2019), Hugging Face (standard practice) | "Model Cards for Model Reporting" — Mitchell et al., standardized transparency documents |
| System cards | OpenAI (GPT-4, DALL-E 3) | Pre-release safety documentation with red team findings |
| AI audit trails | EU AI Act (Art. 12), NIST AI RMF | Regulatory requirement for high-risk AI logging |
| Experiment tracking | MLflow, Weights & Biases, Neptune | Every training run logged with hyperparameters, metrics, artifacts |
| Decision logs (ADRs) | Software architecture (Nygard, 2011) | "Architecture Decision Records" — same principle applied to engineering decisions |
---
## Key Insight
Autonomy without auditability is a liability — legally, ethically, and practically. The pattern is consistent across all case studies: systems that operate independently but leave a clear trail can be debugged, improved, and trusted. Systems that operate as black boxes eventually produce a failure that nobody can explain or trace, eroding trust and inviting regulation. The cost of logging and documentation is trivial compared to the cost of "we don't know what happened."

# Principle #30: Self-Monitoring (Guardian Pattern) — Case Studies
> The system monitors itself. A background watchdog checks health every N minutes and logs findings.
---
## Case Study 1: Netflix's Chaos Engineering and Health Checks (2011-present)
Netflix pioneered self-monitoring at scale with their Chaos Engineering practice, starting with Chaos Monkey in 2011. But the lesser-known counterpart is their health check infrastructure: every Netflix microservice (over 1,000 of them) implements standardized health check endpoints that are continuously polled by an internal monitoring system. Services report not just "alive/dead" but detailed health metrics: connection pool usage, cache hit rates, error rates per endpoint, and dependency health. When a service detects its own degradation (e.g., database response time exceeding a threshold), it can self-report as unhealthy and trigger load balancer rerouting — before external monitoring even notices.
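As a rough illustration (not Netflix's actual endpoint format), a health check that reports graded metrics and self-declares degradation when a dependency exceeds its budget might look like this; the thresholds and dependency stubs are invented for the sketch.
```python
import random
import time

# Hypothetical budgets and stubbed dependency checks: a sketch of the
# "detailed health, self-reported" idea, not Netflix's implementation.
DB_LATENCY_BUDGET_MS = 250
ERROR_RATE_BUDGET = 0.05

def ping_database() -> None:
    time.sleep(random.uniform(0.01, 0.3))  # stub: pretend to round-trip the DB

def recent_error_rate() -> float:
    return random.uniform(0.0, 0.1)        # stub: errors/requests, last minute

def health() -> dict:
    start = time.monotonic()
    ping_database()
    db_ms = (time.monotonic() - start) * 1000
    error_rate = recent_error_rate()
    healthy = db_ms < DB_LATENCY_BUDGET_MS and error_rate < ERROR_RATE_BUDGET
    return {
        "status": "UP" if healthy else "DEGRADED",  # poller reroutes on DEGRADED
        "db_latency_ms": round(db_ms, 1),           # graded detail, not just alive/dead
        "error_rate": round(error_rate, 3),
    }

print(health())
```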
**When followed:** During a 2015 AWS outage affecting us-east-1, Netflix's self-monitoring systems detected degraded dependencies within seconds. Services automatically shed load, failover systems engaged, and most users experienced no disruption. The guardian pattern — services monitoring their own health and reporting it upstream — meant the system healed faster than any human could have responded.
**When violated:** In contrast, the 2016 Delta Airlines outage (caused by a power failure in their data center) lasted over 6 hours partly because their monitoring systems were in the same data center that went down. The systems couldn't monitor themselves because the monitors were co-located with the infrastructure they were supposed to watch — a single point of failure in the guardian itself.
## Case Study 2: Google's Borg and Self-Healing Infrastructure (2003-present)
Google's Borg cluster management system (the predecessor to Kubernetes) implements the guardian pattern at the infrastructure level. Borg continuously monitors all running tasks, checking process health, resource usage, and task output. When a task fails, Borg automatically reschedules it on a different machine. When a machine becomes unresponsive, Borg marks it as dead and migrates all its tasks. Google's published Borg paper (2015) reports that this self-monitoring and self-healing handles thousands of machine failures per day across their fleet, with no human intervention.
**When followed:** Kubernetes (the open-source descendant of Borg) made this pattern accessible to everyone. Liveness probes, readiness probes, and startup probes are the guardian pattern in standard form: the system asks each component "are you healthy?" at regular intervals, and takes corrective action (restart, stop routing traffic, wait) based on the answer. A typical production Kubernetes cluster handles dozens of pod restarts per day automatically.
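The probe loop itself is simple enough to sketch. The hypothetical watchdog below mirrors what a liveness probe does: poll on a fixed interval, tolerate a failure threshold, then take corrective action.
```python
import time

FAILURE_THRESHOLD = 3   # consecutive failures before corrective action
PERIOD_SECONDS = 10     # the "every N" in the guardian pattern

def watchdog(check_health, restart, max_polls=None):
    """Liveness-probe-style loop: poll, count consecutive failures, act."""
    failures, polls = 0, 0
    while max_polls is None or polls < max_polls:
        polls += 1
        if check_health():
            failures = 0                  # any success resets the count
        else:
            failures += 1
            print(f"health check failed ({failures}/{FAILURE_THRESHOLD})")
            if failures >= FAILURE_THRESHOLD:
                restart()                 # corrective action, no human in the loop
                failures = 0
        time.sleep(PERIOD_SECONDS)
```
Kubernetes expresses the same loop declaratively: a livenessProbe's `periodSeconds` and `failureThreshold` configure the interval and tolerance, and the kubelet performs the polling and the restart.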
**When violated:** Early Docker deployments (pre-Kubernetes, 2014-2016) often ran containers with no health checking at all, a "fire and forget" model. Containers that hung (consuming CPU but producing no output) or that kept running while no longer functional would persist indefinitely until a human noticed. The Docker ecosystem's response was to add the HEALTHCHECK Dockerfile instruction and orchestrators with probe-based monitoring: in effect, retrofitting the guardian pattern that was missing.
## Case Study 3: DeepMind's Safety Monitoring for AlphaFold (2020-2024)
DeepMind's AlphaFold system, which predicts protein structures with near-experimental accuracy, includes extensive self-monitoring for computational correctness. The system runs confidence scoring on every prediction (pLDDT scores), flags low-confidence regions automatically, and logs detailed metrics about each prediction run. When AlphaFold DB was scaled to predict structures for over 200 million proteins (2022), the self-monitoring system was essential: it automatically identified and flagged predictions that fell below quality thresholds, preventing low-confidence predictions from being published alongside high-confidence ones.
**When followed:** Researchers using AlphaFold DB can filter by confidence scores because the system self-assessed every prediction. This guardian-level quality monitoring allowed DeepMind to scale from 350,000 predictions (2021) to 200+ million (2022) without proportionally scaling human review — the system's self-monitoring replaced most manual quality checks.
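In spirit, the guardian step reduces to self-assessment plus a threshold filter. The sketch below mimics the shape of pLDDT-style scores (0-100, with roughly 70 a commonly used cutoff for confident regions); the proteins and numbers are made up.
```python
from statistics import mean

# Made-up predictions with per-residue confidence scores, mimicking the
# shape of pLDDT self-assessment (0-100; about 70 is a common cutoff).
predictions = [
    {"protein": "P001", "plddt": [92.1, 88.4, 90.3]},
    {"protein": "P002", "plddt": [55.0, 48.2, 61.7]},
    {"protein": "P003", "plddt": [81.9, 74.5, 69.8]},
]

CONFIDENCE_CUTOFF = 70.0

publishable, flagged = [], []
for p in predictions:
    bucket = publishable if mean(p["plddt"]) >= CONFIDENCE_CUTOFF else flagged
    bucket.append(p["protein"])          # the system grades its own output

print("publishable:", publishable)       # ['P001', 'P003']
print("flagged for review:", flagged)    # ['P002']
```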
**When violated:** Earlier computational biology tools often produced predictions without confidence metrics or self-assessment. Researchers had to manually validate results through wet-lab experiments — a process that doesn't scale. Publications based on unmonitored computational predictions led to retractions when the predictions turned out to be wrong but had been trusted at face value.
---
## Industry Cross-References
| Pattern | Who Uses It | Reference |
|---------|-------------|-----------|
| Health check endpoints | Kubernetes (liveness/readiness probes), Consul, AWS ALB target groups | Industry standard for service health monitoring |
| Watchdog timer | Embedded systems, operating systems (Linux watchdog), hardware BMC | Hardware-level guardian — system resets if the watchdog isn't fed |
| Self-healing infrastructure | Kubernetes, Netflix Titus, Google Borg, AWS Auto Scaling | Automatic corrective action based on health monitoring |
| Canary analysis | Google (Canarying deployments), Kayenta, Flagger | Automated comparison of new vs old deployment health metrics |
| Anomaly detection | Datadog, New Relic, Amazon CloudWatch Anomaly Detection | ML-based guardian that learns normal patterns and alerts on deviations |
| Confidence scoring | AlphaFold (pLDDT), LLM token probabilities, weather forecast confidence | Self-assessment built into the prediction pipeline |
---
## Key Insight
Self-monitoring is what separates systems that scale from systems that don't. Without a guardian, scaling means proportionally scaling human oversight — which is expensive, slow, and error-prone. The guardian pattern inverts this: the system watches itself, and humans only intervene when the system escalates. The key design decision is what to monitor and what corrective action to take. Good guardians check outcomes (is the prediction correct? is the response time acceptable?) not just inputs (is the process running?). A process can be running and completely broken — only outcome-based monitoring catches that.