docs: add dogfood comparison report (plain Claude vs ArcheFlow PDCA)

docs/dogfood-2026-04-04.md (new file, 78 lines)

# ArcheFlow Dogfood Report: Colette Expose/Pitch Generation

Date: 2026-04-04
Task: Implement expose and pitch generation steps in Colette's fanout pipeline
Project: writing.colette (Python, 27 modules, 457 tests)

## Task Description

The fanout pipeline in `src/colette/fanout.py` had two placeholder steps (`generate_expose`, `generate_pitch`) that logged "not yet implemented". The task was to replace them with real LLM-powered implementations that generate publishing proposals and pitch letters.
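
For orientation, a step of this kind might look roughly like the sketch below. It is not Colette's code: the one-argument `LLMClient` protocol, the `StubClient` test double, the prompt text, the `synopsis` and `out_dir` parameters, and the `expose.md` file name are all assumptions made for illustration. Only the names `generate_expose` and `LLMClient.create()` come from the report.

```python
from pathlib import Path
from typing import Protocol


class LLMClient(Protocol):
    """Assumed minimal protocol; the real signature may differ."""

    def create(self, prompt: str) -> str: ...


class StubClient:
    """Hypothetical test double that records how many times it was called."""

    def __init__(self, reply: str) -> None:
        self.reply = reply
        self.calls = 0

    def create(self, prompt: str) -> str:
        self.calls += 1
        return self.reply


def generate_expose(client: LLMClient, synopsis: str, out_dir: Path) -> Path:
    """Ask the client for an expose and write it to out_dir/expose.md."""
    # Prompt wording and output file name are illustrative assumptions.
    prompt = f"Write a publishing proposal (expose) for this book:\n\n{synopsis}"
    text = client.create(prompt)
    out_dir.mkdir(parents=True, exist_ok=True)  # guard against a missing directory
    out_path = out_dir / "expose.md"
    out_path.write_text(text, encoding="utf-8")
    return out_path
```

A stub like this supports the kind of checks the report describes as basic tests: that the output file exists and that the client was called exactly once.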

## Conditions

| Condition | Strategy | Agents | Time | Lines added |
|-----------|----------|--------|------|-------------|
| **Plain Claude** (no orchestration) | None | 0 | ~3 min | 107 (+75 impl, +32 test) |
| **ArcheFlow PDCA** (standard workflow) | pdca | 4 (Explorer, Creator, Maker, Guardian) | ~15 min | 230 (+145 impl, +85 test) |

## Findings

### Bugs introduced

| Condition | Bug | Caught by | Severity |
|-----------|-----|-----------|----------|
| Plain Claude | None | N/A | N/A |
| ArcheFlow | `task_type`/`file_path` kwargs passed to `LLMClient.create()`, but they exist only on `GuardedLLMClient` | Guardian review | CRITICAL (runtime crash on non-guarded clients) |

**Key observation:** ArcheFlow's Maker introduced a bug that plain Claude avoided. The Guardian caught it, but introducing a bug and then catching it is extra work for the same outcome.
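
The class of bug described here can be reproduced in a few lines. The sketch below is illustrative only: the protocol signature, the `PlainClient` class, the body of `GuardedLLMClient`, and both `generate_pitch_*` helpers are hypothetical. The report confirms only that `task_type` and `file_path` are accepted by `GuardedLLMClient` and are not part of `LLMClient.create()`.

```python
from typing import Protocol


class LLMClient(Protocol):
    """Assumed shared contract: only `prompt` is a protocol argument."""

    def create(self, prompt: str) -> str: ...


class PlainClient:
    """Hypothetical non-guarded client implementing exactly the protocol."""

    def create(self, prompt: str) -> str:
        return f"response to: {prompt}"


class GuardedLLMClient:
    """Assumed shape of a guarded client that accepts extra keyword arguments."""

    def create(self, prompt: str, task_type: str = "", file_path: str = "") -> str:
        return f"[{task_type}] response to: {prompt}"


def generate_pitch_buggy(client: LLMClient) -> str:
    # Passes kwargs that exist only on GuardedLLMClient, so any client
    # that implements just the protocol raises TypeError at call time.
    return client.create("Write a pitch", task_type="pitch", file_path="pitch.md")


def generate_pitch_fixed(client: LLMClient) -> str:
    # Uses protocol arguments only, so every conforming client works.
    return client.create("Write a pitch")
```

Under these assumptions, `generate_pitch_buggy(GuardedLLMClient())` succeeds while `generate_pitch_buggy(PlainClient())` raises `TypeError`, which matches the "runtime crash on non-guarded clients" severity note. A static type checker run against the protocol would likely flag the extra keyword arguments as well.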

### Code comparison

| Metric | Plain Claude | ArcheFlow |
|--------|-------------|-----------|
| Implementation lines | 75 | 145 |
| Test lines | 32 | 85 |
| LLMClient compatibility | Clean (protocol args only) | Needed fix (extra kwargs) |
| Prompt detail | Adequate (10 sections listed) | More detailed (explicit section descriptions) |
| Defensive coding | Minimal (follows existing patterns) | More (mkdir guards, fallback paths) |
| Test thoroughness | Basic (file existence, call count) | More thorough (token accumulation, error states) |

### Process overhead

| Phase | Time | Value added |
|-------|------|-------------|
| Explorer research | ~60s | Low: task was well-scoped, pattern was obvious from reading 2 lines |
| Creator proposal | ~45s | Low: 300-line plan for a 75-line task, mostly restating what the code already showed |
| Maker implementation | ~90s | Same as plain Claude, but produced more verbose code and a bug |
| Guardian review | ~30s | Mixed: caught 1 real bug; the other 4 of its 5 findings (80%) were noise |

### Why plain Claude won

1. **Pattern-following task.** Two placeholder functions, one existing pattern to copy. No ambiguity, no design decisions, no security concerns.
2. **Direct protocol reading.** Plain Claude checked the `LLMClient.create()` signature and used only standard args. The Maker, working from the Creator's plan (which didn't mention the protocol), used extra kwargs it saw in `GuardedLLMClient`.
3. **Less indirection = fewer errors.** The Creator-to-Maker handoff lost information. The Creator specified "call llm_client.create()" but not the exact signature constraints. Plain Claude read the source of truth directly.

### When ArcheFlow would have been worth it

This task had none of these signals:

- Ambiguous requirements (need Explorer)
- Multiple valid approaches (need Creator to evaluate)
- Security-sensitive code (need Guardian for real threats)
- Cross-cutting changes (5+ files, interaction risks)
- Unfamiliar codebase (need research phase)

### Improvement opportunities

1. **Auto-select should skip orchestration** for pattern-following tasks (placeholder + existing pattern in same file)
2. **Creator compact mode:** for simple tasks, emit a 10-line diff-style plan, not a 300-line essay
3. **Explorer budget cap:** 60s max for single-file tasks
4. **Guardian calibration:** inject project conventions to reduce false positives from 80% to ~40%
5. **Baseline capture:** run the same task without ArcheFlow to enable A/B comparison
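
The first item can be made concrete with a small sketch of an auto-select check driven by the signals listed under "When ArcheFlow would have been worth it". This is one possible shape, not ArcheFlow's actual heuristic: the `TaskSignals` dataclass, its field names, and `should_orchestrate` are hypothetical, and the `files_touched >= 5` cutoff is taken from the "5+ files" signal above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSignals:
    """Hypothetical task descriptor; all field names are illustrative."""

    files_touched: int
    ambiguous_requirements: bool = False
    multiple_valid_approaches: bool = False
    security_sensitive: bool = False
    unfamiliar_codebase: bool = False


def should_orchestrate(task: TaskSignals) -> bool:
    """Return True only when at least one of the report's signals is present."""
    cross_cutting = task.files_touched >= 5  # "5+ files" signal from the report
    return (
        task.ambiguous_requirements
        or task.multiple_valid_approaches
        or task.security_sensitive
        or task.unfamiliar_codebase
        or cross_cutting
    )
```

Assuming the Colette task touched two files and raised no other signal, `should_orchestrate(TaskSignals(files_touched=2))` returns `False`, so a check of this shape would have skipped orchestration where a "touches 2+ files" threshold activates it.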

## Conclusion

For this specific task (simple, pattern-following, single-file, well-scoped), ArcheFlow added cost without adding quality. Plain Claude was faster, produced less code, and avoided a bug that the Maker introduced.

This is not a failure of ArcheFlow's design; it is a calibration problem. The auto-select heuristic should have detected this as a skip-orchestration task. The complexity threshold for ArcheFlow activation needs to be higher than "touches 2+ files."

**Honest assessment:** ArcheFlow's value-add starts at tasks requiring genuine design decisions, security review, or cross-module coordination. Below that threshold, it's ceremony.