# Principle #6: Cheapest Model per Task — Case Studies
> Don't use Opus for what Haiku can do.
---
## Case Study 1: GitHub Copilot's Model Tiering
GitHub Copilot initially used a single Codex model for all completions. As costs scaled with millions of users, GitHub introduced tiered model routing: smaller, faster models handle inline completions (the most frequent, lowest-complexity task), while larger models handle chat-based code generation and multi-file reasoning. By 2024, GitHub reported that most Copilot completions are served by models significantly smaller than the original Codex — reducing per-completion cost by an order of magnitude while maintaining acceptance rates above 30%.
**When followed:** GitHub could offer Copilot at $10/month despite serving billions of completions. The tiered approach made the product commercially viable at scale.
**When violated:** Early Copilot usage reportedly cost GitHub more per user than the subscription price — an unsustainable model that only worked because Microsoft could absorb the loss while optimizing.
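The routing layer behind this pattern is simple; the hard part is classifying tasks. A minimal sketch of tiered routing, with made-up model names and task types (nothing here reflects GitHub's actual configuration):
```python
# Tiered model routing: map each task type to the cheapest model that
# meets its quality bar. All names are illustrative only.
MODEL_TIERS = {
    "inline_completion": "small-fast-model",   # highest volume, lowest complexity
    "chat_codegen": "mid-size-model",          # conversational code generation
    "multi_file_reasoning": "largest-model",   # rare, needs the most capability
}

def route(task_type: str) -> str:
    # Unknown task types fall back to the largest model: misrouting a hard
    # task to a small model usually costs more (retries, bad output) than
    # overpaying once.
    return MODEL_TIERS.get(task_type, "largest-model")
```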
## Case Study 2: Anthropic's Own Model Lineup
Anthropic explicitly designs its model family (Haiku, Sonnet, Opus) for task-appropriate routing. Their documentation recommends Haiku for classification, extraction, and simple Q&A — tasks where Opus would deliver marginally better results at 30-75x the cost. Companies like Notion and DuckDuckGo use Haiku for high-volume, latency-sensitive features (autocomplete, summarization) while reserving Sonnet/Opus for complex reasoning tasks.
**When followed:** A classification pipeline processing 1M items/day at Haiku rates ($0.80/MTok input) costs roughly $40-80/day, assuming 50-100 input tokens per item. The same pipeline on Opus ($15/MTok input) could cost $1,500+/day — for negligibly different accuracy on a well-defined task.
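The arithmetic behind those figures, with the per-item token count as an explicit assumption:
```python
# Back-of-envelope daily cost comparison. Rates are $/MTok of input;
# the 50-100 tokens/item range is an assumption, not a measured value.
ITEMS_PER_DAY = 1_000_000
TOKENS_PER_ITEM = (50, 100)
RATES = {"Haiku": 0.80, "Opus": 15.00}

for model, rate in RATES.items():
    lo, hi = (ITEMS_PER_DAY * t / 1e6 * rate for t in TOKENS_PER_ITEM)
    print(f"{model}: ${lo:,.0f}-${hi:,.0f}/day")
# Haiku: $40-$80/day
# Opus: $750-$1,500/day
```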
**When violated:** Teams that default to the "best" model for everything hit budget walls within weeks of production deployment, then scramble to optimize under pressure — introducing bugs and regressions in the process.
## Case Study 3: Two-Pass Writing Pipeline (Colette)
In the Colette virtual publisher system (this workspace), book chapter generation uses a two-pass approach: a cheaper model (Haiku or local Qwen) drafts the chapter, then Sonnet polishes the prose to match the author's voice profile. This achieves ~80% cost reduction compared to using Sonnet for the entire generation, with subjectively equivalent output quality — because the expensive model's strength (nuanced voice matching) is concentrated where it matters.
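A condensed sketch of that two-pass shape using the Anthropic Python SDK. The model IDs are placeholders for whatever is current, and `voice_profile` is assumed to be a prose description of the author's style:
```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
CHEAP, EXPENSIVE = "claude-3-5-haiku-latest", "claude-sonnet-4-5"

def ask(model: str, prompt: str) -> str:
    resp = client.messages.create(
        model=model,
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

def generate_chapter(outline: str, voice_profile: str) -> str:
    # Pass 1: the cheap model produces the bulk of the tokens.
    draft = ask(CHEAP, f"Draft a book chapter from this outline:\n{outline}")
    # Pass 2: the expensive model only rewrites, so its strength
    # (voice matching) is paid for exactly once per chapter.
    return ask(EXPENSIVE, "Rewrite this draft to match the voice profile."
               f"\n\nVoice profile:\n{voice_profile}\n\nDraft:\n{draft}")
```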
**When followed:** A full book fanout (14 chapters x 4 publisher variants x 2 languages = 112 generation tasks) stays under $15 instead of $75+.
**When violated:** Early experiments used Sonnet for everything, burning through API credits in a single session. The cost made iterative experimentation (the key to quality) financially painful.
---
## Industry Cross-References
| Pattern | Who Uses It | Reference |
|---------|-------------|-----------|
| Model cascade / routing | OpenAI (GPT-4o vs GPT-4o mini), Anthropic (Haiku/Sonnet/Opus), Google (Flash vs Pro) | All major providers now offer tiered model families specifically for this |
| Speculative decoding | Meta (Llama), various inference providers | Small model drafts tokens, large model verifies — same principle at inference level |
| Cost-aware ML serving | Netflix (different models for different recommendation surfaces) | "Recommendations at Netflix" engineering blog |
| Cheap filter, expensive classifier | Spam filtering (SpamAssassin rules → ML model) | Industry standard since the 2000s — cheap heuristics first, expensive ML only for ambiguous cases (sketched below) |
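The last row generalizes well beyond spam. A toy sketch of the cascade shape, with stand-in heuristics and a placeholder for the expensive path:
```python
# Cheap-filter / expensive-classifier cascade. The heuristics and the
# "expensive" classifier below are illustrative stand-ins.
HIGH, LOW = 3, 0  # rule-score thresholds bounding the ambiguous middle

def rule_score(message: str) -> int:
    """Cheap heuristics: one point per suspicious marker."""
    markers = ("free money", "act now", "click here")
    return sum(m in message.lower() for m in markers) + message.count("!")

def expensive_classify(message: str) -> str:
    """Placeholder for the costly path (an ML model or LLM call)."""
    return "spam" if "unsubscribe" in message.lower() else "ham"

def classify(message: str) -> str:
    score = rule_score(message)
    if score >= HIGH:
        return "spam"  # confident: resolved for free
    if score <= LOW:
        return "ham"   # confident: resolved for free
    return expensive_classify(message)  # pay only for the ambiguous middle
```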
---
## Key Insight
The principle isn't "use the cheapest model." It's "use the cheapest model **that meets the task's quality bar.**" This requires knowing what quality bar each task actually needs — which most teams skip, defaulting to "use the best model for everything" out of uncertainty. The fix is empirical: test cheaper models on your actual task, measure the quality delta, and only pay for the premium model when the delta matters.
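The empirical test can be this small (the `ask_model` callable and the labeled sample stand in for whatever harness and data the team already has):
```python
def accuracy_by_model(models, labeled_sample, ask_model):
    """Run each model over (input, expected) pairs and report accuracy,
    so the cost/quality trade-off becomes an explicit number."""
    return {
        m: sum(ask_model(m, x) == y for x, y in labeled_sample) / len(labeled_sample)
        for m in models
    }

# If this returns e.g. {"haiku": 0.94, "opus": 0.96}, the question becomes:
# is two points of accuracy on this task worth a 30-75x cost multiplier?
```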