# Principle #6: Cheapest Model per Task — Case Studies

> Don't use Opus for what Haiku can do.


## Case Study 1: GitHub Copilot's Model Tiering

GitHub Copilot initially used a single Codex model for all completions. As costs scaled with millions of users, GitHub introduced tiered model routing: smaller, faster models handle inline completions (the most frequent, lowest-complexity task), while larger models handle chat-based code generation and multi-file reasoning. By 2024, GitHub reported that most Copilot completions are served by models significantly smaller than the original Codex — reducing per-completion cost by an order of magnitude while maintaining acceptance rates above 30%.
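A minimal sketch of this kind of tiered routing, assuming a simple task-type lookup; the task labels, model names, and fallback choice are illustrative, not Copilot's actual routing logic:

```python
# Tiered model routing sketch: cheap models handle frequent, low-complexity
# tasks; the expensive model is reserved for work that actually needs it.
# Task labels and model names are illustrative placeholders.
ROUTES = {
    "inline_completion": "small-fast-model",     # highest volume, lowest complexity
    "chat_codegen": "mid-tier-model",            # conversational, single-file scope
    "multi_file_reasoning": "large-model",       # rare, highest complexity
}

def pick_model(task_type: str) -> str:
    """Return the cheapest model configured for this task type."""
    return ROUTES.get(task_type, "mid-tier-model")  # safe middle default for unknown tasks
```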

**When followed:** GitHub could offer Copilot at $10/month despite serving billions of completions. The tiered approach made the product commercially viable at scale.

**When violated:** Early Copilot usage reportedly cost GitHub more per user than the subscription price — an unsustainable model that only worked because Microsoft could absorb the loss while optimizing.

## Case Study 2: Anthropic's Own Model Lineup

Anthropic explicitly designs its model family (Haiku, Sonnet, Opus) for task-appropriate routing. Their documentation recommends Haiku for classification, extraction, and simple Q&A — tasks where Opus would deliver marginally better results at 30-75x the cost. Companies like Notion and DuckDuckGo use Haiku for high-volume, latency-sensitive features (autocomplete, summarization) while reserving Sonnet/Opus for complex reasoning tasks.

**When followed:** A classification pipeline processing 1M items/day at Haiku rates ($0.80/MTok input) costs roughly $40-80/day. The same pipeline on Opus could cost $1,500+/day — for negligibly different accuracy on a well-defined task.
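A back-of-the-envelope check of those figures, assuming roughly 50-100 input tokens per item; the Opus input price used here (~$15/MTok) is an assumption chosen to reproduce the $1,500+/day figure:

```python
# Cost sketch for 1M classification items/day at different per-item token counts.
# The Haiku price is from the text above; the Opus price is an assumed figure.
items_per_day = 1_000_000
haiku_price = 0.80     # $ per million input tokens
opus_price = 15.00     # $ per million input tokens (assumed)

for tokens_per_item in (50, 100):
    mtok = items_per_day * tokens_per_item / 1_000_000
    print(f"{tokens_per_item} tok/item: Haiku ${mtok * haiku_price:.0f}/day, "
          f"Opus ${mtok * opus_price:.0f}/day")
# 50 tok/item:  Haiku $40/day, Opus $750/day
# 100 tok/item: Haiku $80/day, Opus $1500/day
```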

**When violated:** Teams that default to the "best" model for everything hit budget walls within weeks of production deployment, then scramble to optimize under pressure — introducing bugs and regressions in the process.

## Case Study 3: Two-Pass Writing Pipeline (Colette)

In the Colette virtual publisher system (this workspace), book chapter generation uses a two-pass approach: a cheaper model (Haiku or local Qwen) drafts the chapter, then Sonnet polishes the prose to match the author's voice profile. This achieves ~80% cost reduction compared to using Sonnet for the entire generation, with subjectively equivalent output quality — because the expensive model's strength (nuanced voice matching) is concentrated where it matters.
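A sketch of the draft-then-polish shape, not the actual Colette code; `call_model(model_name, prompt)` stands in for whatever LLM client is already in use, and the model names are placeholders:

```python
from typing import Callable

def generate_chapter(outline: str,
                     voice_profile: str,
                     call_model: Callable[[str, str], str]) -> str:
    """Two-pass generation: a cheap model drafts, an expensive model polishes."""
    draft = call_model(
        "cheap-draft-model",        # e.g. Haiku or a local model
        f"Write a chapter draft from this outline:\n{outline}",
    )
    return call_model(
        "expensive-polish-model",   # e.g. a Sonnet-class model
        "Rewrite the draft below to match the voice profile, "
        "keeping structure and facts intact.\n\n"
        f"Voice profile:\n{voice_profile}\n\nDraft:\n{draft}",
    )
```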

**When followed:** A full book fanout (14 chapters × 4 publisher variants × 2 languages = 112 generation tasks) stays under $15 instead of $75+.

**When violated:** Early experiments used Sonnet for everything, burning through API credits in a single session. The cost made iterative experimentation (the key to quality) financially painful.


## Industry Cross-References

| Pattern | Who Uses It | Reference |
|---|---|---|
| Model cascade / routing | OpenAI (GPT-4 Turbo vs Mini), Anthropic (Haiku/Sonnet/Opus), Google (Flash vs Pro) | All major providers now offer tiered model families specifically for this |
| Speculative decoding | Meta (Llama), various inference providers | Small model drafts tokens, large model verifies — same principle at inference level |
| Cost-aware ML serving | Netflix (different models for different recommendation surfaces) | "Recommendations at Netflix" engineering blog |
| Cheap filter, expensive classifier | Spam filtering (SpamAssassin rules → ML model) | Industry standard since the 2000s — cheap heuristics first, expensive ML only for ambiguous cases |
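A sketch of the "cheap filter, expensive classifier" cascade from the last row, with illustrative rules and a placeholder `classify_with_llm` hook:

```python
from typing import Callable

# Stand-in for SpamAssassin-style rules; phrases are illustrative.
OBVIOUS_SPAM = ("free money", "claim your prize", "wire transfer urgently")

def classify(message: str, classify_with_llm: Callable[[str], str]) -> str:
    """Cheap heuristics decide the easy cases; only ambiguous messages
    reach the expensive model via classify_with_llm."""
    text = message.lower()
    if any(phrase in text for phrase in OBVIOUS_SPAM):
        return "spam"                    # rule hit: no model call needed
    if len(text.split()) < 3:
        return "ham"                     # trivially short: not worth a model call
    return classify_with_llm(message)    # expensive path for the ambiguous remainder
```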

## Key Insight

The principle isn't "use the cheapest model." It's "use the cheapest model that meets the task's quality bar." This requires knowing what quality bar each task actually needs — which most teams skip, defaulting to "use the best model for everything" out of uncertainty. The fix is empirical: test cheaper models on your actual task, measure the quality delta, and only pay for the premium model when the delta matters.
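A minimal sketch of that empirical step, assuming the task can be expressed as labeled input/output pairs; `run_cheap` and `run_premium` are placeholders for your own model-calling functions:

```python
from typing import Callable, Iterable, Tuple

def quality_delta(samples: Iterable[Tuple[str, str]],
                  run_cheap: Callable[[str], str],
                  run_premium: Callable[[str], str]) -> float:
    """Accuracy gap (premium minus cheap) on a labeled sample of the real task."""
    samples = list(samples)
    cheap_hits = sum(run_cheap(x) == expected for x, expected in samples)
    premium_hits = sum(run_premium(x) == expected for x, expected in samples)
    return (premium_hits - cheap_hits) / len(samples)
```

If the measured delta is below what the task's quality bar actually requires, the cheaper model wins by default.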