docs: add dogfood report #2 (batch API) with 7 improvement hypotheses
# ArcheFlow Dogfood Report #2: Batch API Integration

Date: 2026-04-04
Task: Wire Anthropic Batch API into Colette's fanout pipeline with CLI commands and state persistence
Project: writing.colette (Python, 27 modules, 457 tests)
Complexity: High — 4 files, async API, state persistence, error recovery, CLI commands

## Experimental Setup

Same task, same starting commit, two conditions:

1. **Baseline**: Plain Claude, no orchestration, single pass
2. **ArcheFlow**: PDCA standard workflow (Maker + Guardian review)

No Explorer or Creator was used this time — the task scope was clear enough to skip planning and go directly to Maker + Guardian (effectively a fast workflow).

## Quantitative Comparison

| Metric | Baseline | ArcheFlow | Delta |
|--------|----------|-----------|-------|
| Lines added | 189 | 279 | +48% |
| Files touched | 4 | 4 | same |
| Time | ~5 min | ~12 min | +140% |
| Commits | 1 | 4 | cleaner history |
| Tests written | 1 | 2 | +1 |
| Tests passing | 13/13 | 14/14 | +1 |
| Bugs introduced | 0 | 1 | worse |
| Bugs caught by review | 0 | 5 | better |
| **Real bugs in final code** | **1** | **0** (after fix) | **ArcheFlow wins** |

## Bug Analysis

### Bugs found only by Guardian (not present in baseline)

| # | Bug | Severity | Impact |
|---|-----|----------|--------|
| 3 | `hash()` non-deterministic across processes for chapter index mapping | HIGH | Data loss on resume — chapters mapped to wrong files |

This bug was **introduced by ArcheFlow's Maker** and caught by the Guardian. Baseline used `enumerate(i)` and avoided it entirely. Net: zero value.

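The hazard behind finding #3 can be shown in miniature. This is an illustrative sketch, not Colette's actual code: Python salts `str` hashes per interpreter process (unless `PYTHONHASHSEED` is pinned), so any mapping keyed on `hash(...)` is not stable across a save/resume boundary, while `enumerate`-based positions are.

```python
# Illustrative sketch of finding #3; names do not match Colette's code.

def fragile_index(chapters: list[str]) -> dict[int, str]:
    # hash(str) is salted per interpreter process (PYTHONHASHSEED), so these
    # keys differ between the run that wrote the state file and the run that
    # resumes from it -- chapters get mapped to the wrong files on resume.
    return {hash(title): f"out/{i}.json" for i, title in enumerate(chapters)}

def stable_index(chapters: list[str]) -> dict[int, str]:
    # enumerate() keys are deterministic: a chapter's position in the list
    # is the same in every process, so a resumed run sees the same mapping.
    return {i: f"out/{i}.json" for i, title in enumerate(chapters)}

chapters = ["intro", "rising action", "finale"]
assert stable_index(chapters) == {0: "out/0.json", 1: "out/1.json", 2: "out/2.json"}
```

Within a single process both mappings look fine, which is exactly why the bug is invisible to a happy-path test.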
### Bugs present in BOTH versions, caught only by Guardian

| # | Bug | Severity | Impact |
|---|-----|----------|--------|
| 4 | Timeout marks variant as "done" — permanently loses batch state | HIGH | Silent data loss — timed-out batches can never be resumed |

This is the **key finding**. Both implementations had this design-level bug. Only ArcheFlow's Guardian caught it. Plain Claude missed it because there was no review step.

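The shape of the failure can be sketched as follows. All names here are hypothetical reconstructions, not the actual Colette code: if the caller assigns a terminal status unconditionally, the timeout branch silently becomes "done" and the batch id needed for resume is effectively discarded.

```python
# Hypothetical sketch of finding #4; names do not match Colette's code.

def wait_and_retrieve(batch_id: str, timed_out: bool):
    # Returns results on success, or None when polling times out.
    if timed_out:
        return None
    return {"results": f"results-for-{batch_id}"}

def process_variant_buggy(state: dict, timed_out: bool) -> dict:
    results = wait_and_retrieve(state["batch_id"], timed_out)
    state["status"] = "done"       # BUG: unconditional, even on timeout
    state["results"] = results     # None is persisted; resume is impossible
    return state

def process_variant_fixed(state: dict, timed_out: bool) -> dict:
    results = wait_and_retrieve(state["batch_id"], timed_out)
    if results is None:
        state["status"] = "pending"   # keep batch_id so a later run can resume
    else:
        state["status"] = "done"
        state["results"] = results
    return state
```

The fix is a one-branch change, but spotting it requires tracing the timeout path across two functions, which is what the Guardian's error-path pass did.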
### Bugs in both, not caught by either initially

| # | Bug | Severity | Impact |
|---|-----|----------|--------|
| 1 | API key resolution inconsistency (env vs config) | CRITICAL | Wrong key used under mixed-key environments |
| 5 | No JSON error handling on corrupted state files | HIGH | Crash on truncated state file |

Guardian flagged these. Baseline would have shipped them silently.

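For finding #5 the defensive pattern is small. This is a generic sketch under assumed names (the loader name and empty-dict fallback are not Colette's API):

```python
import json
from pathlib import Path

# Generic sketch for finding #5; function name and fallback are illustrative.
def load_state(path: Path) -> dict:
    """Load batch state, tolerating a missing or corrupted state file."""
    try:
        return json.loads(path.read_text())
    except FileNotFoundError:
        return {}  # first run: no state yet
    except json.JSONDecodeError:
        # A truncated write (e.g. a crash mid-save) should not crash the
        # pipeline; surface an empty state and let the caller re-fetch.
        return {}
```

Whether the right fallback is an empty state or a hard error with a clear message is a design choice; the point is that a truncated file must hit a deliberate branch rather than an unhandled exception.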
## Qualitative Observations

### Where Guardian added real value

1. **Error path analysis**: Guardian systematically checked "what happens when X fails?" for timeout, cancellation, corruption, and cross-process resume. Plain Claude focused on the happy path.
2. **Cross-process state**: The `hash()` non-determinism finding required reasoning about Python's hash randomization across interpreter invocations — a subtle runtime property that isn't visible from reading the code in isolation.
3. **Data loss scenarios**: Finding #4 (timeout → "done" → lost forever) required understanding the interaction between `wait_and_retrieve`'s timeout branch and the caller's unconditional status assignment. This is a two-module interaction that a single-pass implementation doesn't systematically check.

### Where Guardian added noise

1. **Finding #2 (batch_id validation)**: Technically valid, but the Anthropic SDK already rejects malformed IDs. Low practical risk.
2. **Finding #1 (API key source)**: Valid, but it matches existing patterns throughout the codebase — flagging it here without flagging it elsewhere is inconsistent.

### The Maker problem

The ArcheFlow Maker introduced a bug (hash-based indexing) that the baseline avoided. This happened because:

- The Maker was working from a task description rather than reading the existing sequential rewrite code closely
- The Creator's plan (when used in dogfood #1) over-specified some things and under-specified others
- Working through an intermediary (plan → implementation) introduces information loss

This is a structural weakness of the PDCA model: the Plan-to-Do handoff can corrupt information.

## Conclusions

### Complexity threshold confirmed

| Task type | Orchestration value |
|-----------|---------------------|
| Simple (pattern-following, single file) | **Negative** — adds cost, Maker introduces bugs |
| Medium (multi-file feature, clear scope) | **Neutral** — extra code but similar outcome |
| Complex (error handling, state, async, resume) | **Positive** — Guardian catches design-level bugs |

The differentiator is **error path coverage**. Guardian's systematic "what if this fails?" analysis catches bugs that single-pass implementation misses because implementers focus on making things work, not on making failures safe.
### The honest ROI question

For this task: Guardian caught 1 bug the baseline missed (timeout data loss). That bug would have caused real data loss in production when a batch times out. The cost was ~7 extra minutes and a Maker-introduced bug that had to be fixed.

Is preventing a production data loss bug worth 7 extra minutes? Yes. But only because this was a task where data loss was possible. For a pure UI change or a refactor with no persistence, the answer would be no.

---
## Improvement Hypotheses

Based on both dogfood runs, here are concrete hypotheses about how to improve ArcheFlow's value-to-cost ratio:

### H1: Guardian-Only Mode (skip Plan/Do orchestration)

**Observation**: In both dogfoods, the Maker produced equivalent-or-worse code than plain Claude. The value came entirely from the Guardian review.

**Hypothesis**: A "review-only" mode, where the user implements normally and then runs ArcheFlow as a post-implementation review, would capture the Guardian's value without the Maker's overhead.

**Test**: Implement the same task plain, then run `af-review` (Guardian + Skeptic on the diff). Compare bug catch rate to full PDCA.

**Expected outcome**: Same bug catch rate, ~60% less cost.

### H2: Pre-Implementation Threat Modeling (Guardian before Maker)

**Observation**: Guardian found error-handling bugs (timeout, corruption) that the Maker didn't anticipate. If Guardian's "what could go wrong?" analysis ran BEFORE implementation, the Maker could build in error handling from the start.

**Hypothesis**: Running a lightweight Guardian analysis on the Creator's plan (not the code) would produce a "threat list" that the Maker addresses during implementation, eliminating the need for a fix cycle.

**Sequence**: Creator → Guardian(plan) → Maker(plan + threats) → Guardian(code)

**Expected outcome**: Fewer Maker-introduced bugs, a shorter fix cycle, and a Guardian code review that focuses on implementation correctness rather than missing error paths.

### H3: Differential Review (only review what the Maker DIDN'T get from the plan)

**Observation**: The Maker copies most of the plan correctly. The bugs are in the gaps — things the plan didn't specify (error handling, cross-process state, timeout recovery).

**Hypothesis**: Instead of reviewing the entire diff, focus the Guardian on the delta between the plan and the implementation — what the Maker added, changed, or skipped that wasn't in the plan.

**Test**: Extract the plan's explicit instructions, diff against the implementation, and give Guardian only the unplanned additions.

**Expected outcome**: Higher signal-to-noise ratio (fewer false positives on code that correctly follows the plan), focused attention on the dangerous gaps.

### H4: Project Convention Calibration (reduce false positives)

**Observation**: Guardian flagged API key handling (finding #1) and batch_id validation (finding #2) — both valid in absolute terms but inconsistent with the project's existing patterns. The project doesn't validate IDs or centralize key management anywhere else.

**Hypothesis**: Injecting a "project conventions" summary before Guardian review (e.g., "this project uses env vars for API keys, does not validate external IDs, handles errors via outer try/except") would let Guardian calibrate its expectations and only flag deviations from convention, not the convention itself.

**Test**: Run Guardian with and without convention context on the same diff. Count false positives.

**Expected outcome**: 30-50% reduction in noise findings without missing real bugs.

### H5: Abandon PDCA for Implementation, Keep It for Review

**Observation**: Across both dogfoods, the cycle-back mechanism (Plan→Do→Check→Act→cycle back) never triggered. All reviews were APPROVED_WITH_FIXES, and fixes were applied in a single pass. The cyclic model added structural overhead (event tracking, artifact routing, convergence detection) that was never used.

**Hypothesis**: For most tasks, a linear pipeline (implement → multi-reviewer check → targeted fix) is sufficient. Reserve cyclic PDCA for tasks where reviewers fundamentally reject the approach (not just the implementation).

**Test**: Compare PDCA standard (cycle-back enabled) vs pipeline (no cycle-back) on 10 tasks. Measure: how often does cycle-back actually improve the outcome?

**Expected outcome**: Cycle-back triggers in <10% of tasks. Pipeline matches PDCA quality for 90%+ of cases at lower cost.

### H6: Evidence-Gated Findings Actually Work

**Observation**: Of Guardian's 5 findings in this dogfood, 3 were substantive (timeout data loss, hash non-determinism, no JSON error handling) and 2 were low-value (API key pattern, batch_id format). The substantive ones cited specific code paths and failure scenarios. The low-value ones cited general principles without evidence of actual exploitation.

**Hypothesis**: The evidence-gating mechanism added in v0.7.0 (ban hedged phrases, require command output or code citation) would have automatically downgraded finding #2 ("could corrupt log output") while preserving findings #3 and #4 (which cite specific code paths and failure mechanisms).

**Test**: Re-run the Guardian review with evidence-gating active. Count how many findings survive vs. get downgraded.

**Expected outcome**: 1-2 findings correctly downgraded, 0 real bugs missed.

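The gating idea can be approximated with a simple filter. The sketch below is an illustration of the concept only; the hedge-phrase list and the citation heuristic are assumptions, not the actual v0.7.0 mechanism:

```python
import re

# Illustrative sketch of evidence gating; not the actual v0.7.0 code.
HEDGED = re.compile(r"\b(could|might|may|potentially)\b", re.IGNORECASE)
CITATION = re.compile(r"`[^`]+`|\w+\.py:\d+")  # code span or file:line reference

def gate(finding: str) -> str:
    """Downgrade findings that hedge without citing concrete evidence."""
    if HEDGED.search(finding) and not CITATION.search(finding):
        return "downgraded"
    return "kept"

assert gate("could corrupt log output") == "downgraded"
assert gate("timeout branch in `wait_and_retrieve` marks state done") == "kept"
```

A real implementation would presumably gate on structured finding fields rather than regexes over prose, but the failure mode it targets is the same: confident-sounding findings with no evidence attached.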
### H7: Shadow Detection for the Maker

**Observation**: The Maker introduced a bug (hash-based indexing) because it deviated from the existing codebase pattern (enumerate-based indexing). This is the "Rogue" shadow — the Maker going off-script from what the codebase already does.

**Hypothesis**: A pre-commit check that compares the Maker's implementation against the existing codebase patterns (e.g., "how are chapter indices computed elsewhere in fanout.py?") would catch Rogue deviations before the Guardian review.

**Test**: Add a "pattern conformance" check to the Do phase that greps for how the modified variables/functions are used elsewhere in the file.

**Expected outcome**: Catches Rogue shadow bugs at implementation time rather than review time, saving a review cycle.

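As a sketch of what such a conformance check could look like (entirely hypothetical; ArcheFlow has no such check today, and the heuristic below is deliberately crude):

```python
import re

# Hypothetical pattern-conformance sketch: flag a file that mixes a new
# hash()-based index with the enumerate()-based indexing used elsewhere.
def conflicting_index_styles(source: str) -> bool:
    uses_enumerate = bool(re.search(r"\benumerate\(", source))
    uses_hash_index = bool(re.search(r"\bhash\(", source))
    return uses_enumerate and uses_hash_index

existing = "for i, ch in enumerate(chapters): index[i] = ch"
patched = existing + "\nindex[hash(title)] = path"
assert not conflicting_index_styles(existing)
assert conflicting_index_styles(patched)
```

A production version would need per-pattern rules rather than a single regex pair, but even this crude check would have flagged the dogfood's hash-indexing bug before review.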
---
## Recommended Next Steps (Priority Order)

1. **H1**: Build `af-review` mode (Guardian-only on existing diff) — lowest effort, highest expected ROI
2. **H4**: Project convention injection — reduce noise without missing signal
3. **H2**: Pre-implementation threat modeling — address the root cause of missing error handling
4. **H5**: Default to pipeline strategy, reserve PDCA for rejections
5. **H7**: Maker pattern conformance check — reduce Maker-introduced bugs