
ArcheFlow Dogfood Report #2: Batch API Integration

Date: 2026-04-04
Task: Wire Anthropic Batch API into Colette's fanout pipeline with CLI commands and state persistence
Project: writing.colette (Python, 27 modules, 457 tests)
Complexity: High — 4 files, async API, state persistence, error recovery, CLI commands

Experimental Setup

Same task, same starting commit, two conditions:

  1. Baseline: Plain Claude, no orchestration, single pass
  2. ArcheFlow: PDCA standard workflow (Maker + Guardian review)

No Explorer or Creator was used this time — the task scope was clear enough to skip planning and go directly to Maker + Guardian (effectively a fast workflow).

Quantitative Comparison

| Metric | Baseline | ArcheFlow | Delta |
|---|---|---|---|
| Lines added | 189 | 279 | +48% |
| Files touched | 4 | 4 | same |
| Time | ~5 min | ~12 min | +140% |
| Commits | 1 | 4 | cleaner history |
| Tests written | 1 | 2 | +1 |
| Tests passing | 13/13 | 14/14 | +1 |
| Bugs introduced | 0 | 1 | worse |
| Bugs caught by review | 0 | 5 | better |
| Real bugs in final code | 1 | 0 (after fix) | ArcheFlow wins |

Bug Analysis

Bugs found only by Guardian (not present in baseline)

| # | Bug | Severity | Impact |
|---|---|---|---|
| 3 | hash() non-deterministic across processes for chapter index mapping | HIGH | Data loss on resume — chapters mapped to wrong files |

This bug was introduced by ArcheFlow's Maker and caught by the Guardian. The baseline used enumerate() for indexing and avoided it entirely. Net value: zero, since ArcheFlow created and then caught its own bug.
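
A minimal sketch of the two indexing approaches, with illustrative names rather than the actual fanout code:

```python
# Illustrative sketch only; variable names are not from the real fanout module.
chapters = ["intro", "rising-action", "climax"]

# Maker's approach: hash() of the chapter key. Python randomizes string hashes
# per interpreter process (PYTHONHASHSEED), so an index written by one run
# cannot be trusted by a later run that resumes the batch.
unstable_index = {title: hash(title) % 1000 for title in chapters}

# Baseline approach: positional index from enumerate(). Deterministic across
# processes as long as chapter order is stable.
stable_index = {title: i for i, title in enumerate(chapters)}
```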

Bugs present in BOTH versions, caught only by Guardian

| # | Bug | Severity | Impact |
|---|---|---|---|
| 4 | Timeout marks variant as "done" — permanently loses batch state | HIGH | Silent data loss — timed-out batches can never be resumed |

This is the key finding. Both implementations had this design-level bug. Only ArcheFlow's Guardian caught it. Plain Claude missed it because there was no review step.
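
A hedged sketch of the failure shape, assuming wait_and_retrieve returns None on timeout (the real signatures in the pipeline may differ):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Variant:
    batch_id: str
    status: str = "pending"
    results: Optional[dict] = None

def wait_and_retrieve(batch_id: str) -> Optional[dict]:
    """Stand-in for the real API call; returns None to simulate a timeout."""
    return None

def process_variant(variant: Variant) -> None:
    results = wait_and_retrieve(variant.batch_id)
    # Bug shape: status is assigned unconditionally, so a timed-out batch is
    # recorded as "done" and its batch_id is never polled again.
    variant.status = "done"
    variant.results = results

def process_variant_fixed(variant: Variant) -> None:
    results = wait_and_retrieve(variant.batch_id)
    if results is None:
        variant.status = "pending"  # keep the batch resumable
    else:
        variant.status = "done"
        variant.results = results
```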

Bugs in both, not caught by either initially

| # | Bug | Severity | Impact |
|---|---|---|---|
| 1 | API key resolution inconsistency (env vs config) | CRITICAL | Wrong key used under mixed-key environments |
| 5 | No JSON error handling on corrupted state files | HIGH | Crash on truncated state file |

Guardian flagged both during review; the baseline would have shipped them silently.
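
For the state-file finding, a defensive loader along these lines would avoid the crash; the state schema and file layout here are assumptions, not Colette's actual code:

```python
import json
from pathlib import Path

def load_state(path: Path) -> dict:
    """Load batch state, treating a missing or corrupted file as empty state
    instead of crashing mid-pipeline."""
    try:
        return json.loads(path.read_text())
    except FileNotFoundError:
        return {}
    except json.JSONDecodeError:
        # Truncated or corrupted file: keep the broken copy for inspection
        # and start from empty state rather than raising.
        path.rename(path.with_suffix(".corrupt"))
        return {}
```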

Qualitative Observations

Where Guardian added real value

  1. Error path analysis: Guardian systematically checked "what happens when X fails?" for timeout, cancellation, corruption, and cross-process resume. Plain Claude focused on the happy path.
  2. Cross-process state: The hash() non-determinism finding required reasoning about Python's hash randomization across interpreter invocations — a subtle runtime property that isn't visible from reading the code in isolation.
  3. Data loss scenarios: Finding #4 (timeout → "done" → lost forever) requires understanding the interaction between wait_and_retrieve's timeout branch and the caller's unconditional status assignment. This is a two-module interaction that a single-pass implementation doesn't systematically check.

Where Guardian added noise

  1. Finding #2 (batch_id validation): Technically valid but the Anthropic SDK already rejects malformed IDs. Low practical risk.
  2. Finding #1 (API key source): Valid but matches existing patterns throughout the codebase — flagging it here without flagging it elsewhere is inconsistent.

The Maker problem

The ArcheFlow Maker introduced a bug (hash-based indexing) that the baseline avoided. This happened because:

  • The Maker was working from a task description rather than reading the existing sequential rewrite code as closely as the baseline did
  • The Creator's plan (when used in dogfood #1) over-specified some things and under-specified others
  • Working through an intermediary (plan → implementation) introduces information loss

This is a structural weakness of the PDCA model: the Plan-to-Do handoff can corrupt information.

Conclusions

Complexity threshold confirmed

| Task type | Orchestration value |
|---|---|
| Simple (pattern-following, single file) | Negative — adds cost, Maker introduces bugs |
| Medium (multi-file feature, clear scope) | Neutral — extra code but similar outcome |
| Complex (error handling, state, async, resume) | Positive — Guardian catches design-level bugs |

The differentiator is error path coverage. Guardian's systematic "what if this fails?" analysis catches bugs that single-pass implementation misses because implementers focus on making things work, not on making failures safe.

The honest ROI question

For this task: Guardian caught 1 bug the baseline missed (timeout data loss). That bug would have caused real data loss in production when a batch times out. The cost was ~7 extra minutes and a Maker-introduced bug that had to be fixed.

Is preventing a production data loss bug worth 7 extra minutes? Yes. But only because this was a task where data loss was possible. For a pure UI change or a refactor with no persistence, the answer would be no.


Improvement Hypotheses

Based on both dogfood runs, here are concrete hypotheses about how to improve ArcheFlow's value-to-cost ratio:

H1: Guardian-Only Mode (skip Plan/Do orchestration)

Observation: In both dogfoods, the Maker produced equivalent-or-worse code than plain Claude. The value came entirely from the Guardian review.

Hypothesis: A "review-only" mode where the user implements normally and then runs ArcheFlow as a post-implementation review would capture the Guardian's value without the Maker's overhead.

Test: Implement the same task plain, then run af-review (Guardian + Skeptic on the diff). Compare bug catch rate to full PDCA.

Expected outcome: Same bug catch rate, ~60% less cost.

H2: Pre-Implementation Threat Modeling (Guardian before Maker)

Observation: Guardian found error-handling bugs (timeout, corruption) that the Maker didn't anticipate. If Guardian's "what could go wrong?" analysis ran BEFORE implementation, the Maker could build in error handling from the start.

Hypothesis: Running a lightweight Guardian analysis on the Creator's plan (not the code) would produce a "threat list" that the Maker addresses during implementation, eliminating the need for a fix cycle.

Sequence: Creator → Guardian(plan) → Maker(plan + threats) → Guardian(code)

Expected outcome: Fewer Maker-introduced bugs, shorter fix cycle, Guardian's code review focuses on implementation correctness rather than missing error paths.

H3: Differential Review (only review what the Maker DIDN'T get from the plan)

Observation: The Maker copies most of the plan correctly. The bugs are in the gaps — things the plan didn't specify (error handling, cross-process state, timeout recovery).

Hypothesis: Instead of reviewing the entire diff, focus the Guardian on the delta between the plan and the implementation — what the Maker added, changed, or skipped that wasn't in the plan.

Test: Extract the plan's explicit instructions, diff against the implementation, and give Guardian only the unplanned additions.

Expected outcome: Higher signal-to-noise ratio (fewer false positives on code that correctly follows the plan), focused attention on the dangerous gaps.
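
A rough sketch of how the unplanned delta might be extracted; the plan format (backticked identifiers) and the heuristic itself are assumptions:

```python
import re

def unplanned_hunks(plan_text: str, diff_hunks: list[str]) -> list[str]:
    """Return diff hunks that mention no identifier named in the plan.
    Heuristic: identifiers appear as backticked names in the plan text."""
    planned = set(re.findall(r"`([A-Za-z_][A-Za-z0-9_.]*)`", plan_text))
    return [
        hunk for hunk in diff_hunks
        if not any(name in hunk for name in planned)  # nothing in the plan covers this change
    ]
```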

H4: Project Convention Calibration (reduce false positives)

Observation: Guardian flagged API key handling (finding #1) and batch_id validation (finding #2) — both valid in absolute terms but inconsistent with the project's existing patterns. The project doesn't validate IDs or centralize key management anywhere else.

Hypothesis: Injecting a "project conventions" summary before Guardian review (e.g., "this project uses env vars for API keys, does not validate external IDs, handles errors via outer try/except") would let Guardian calibrate its expectations and only flag deviations from convention, not the convention itself.
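
A minimal sketch of what such a convention summary might look like when injected ahead of the review prompt; the wording and format are assumptions, not ArcheFlow's actual prompt structure:

```python
# Hypothetical convention summary; the real one would be project-specific.
PROJECT_CONVENTIONS = """\
- API keys come from environment variables; there is no central key manager.
- External IDs (batch_id, message_id) are not validated before use.
- Errors are handled by an outer try/except at the CLI boundary, not per call.
"""

def guardian_prompt(diff: str) -> str:
    """Prepend conventions so the reviewer flags deviations, not the conventions."""
    return (
        "Project conventions (flag deviations from these, not the conventions "
        "themselves):\n" + PROJECT_CONVENTIONS + "\nDiff under review:\n" + diff
    )
```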

Test: Run Guardian with and without convention context on the same diff. Count false positives.

Expected outcome: 30-50% reduction in noise findings without missing real bugs.

H5: Abandon PDCA for Implementation, Keep It for Review

Observation: Across both dogfoods, the cycle-back mechanism (Plan→Do→Check→Act→cycle back) never triggered. All reviews were APPROVED_WITH_FIXES, and fixes were applied in a single pass. The cyclic model added structural overhead (event tracking, artifact routing, convergence detection) that was never used.

Hypothesis: For most tasks, a linear pipeline (implement → multi-reviewer check → targeted fix) is sufficient. Reserve cyclic PDCA for tasks where reviewers fundamentally reject the approach (not just the implementation).

Test: Compare PDCA standard (cycle-back enabled) vs pipeline (no cycle-back) on 10 tasks. Measure: how often does cycle-back actually improve the outcome?

Expected outcome: Cycle-back triggers in <10% of tasks. Pipeline matches PDCA quality for 90%+ of cases at lower cost.

H6: Evidence-Gated Findings Actually Work

Observation: Of Guardian's 5 findings in this dogfood, 3 were substantive (timeout data loss, hash non-determinism, no JSON error handling) and 2 were low-value (API key pattern, batch_id format). The substantive ones cited specific code paths and failure scenarios. The low-value ones cited general principles without evidence of actual exploitation.

Hypothesis: The evidence-gating mechanism added in v0.7.0 (ban hedged phrases, require command output or code citation) would have automatically downgraded finding #2 ("could corrupt log output") while preserving findings #3 and #4 (which cite specific code paths and failure mechanisms).
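
A rough sketch of the gating heuristic; the banned-phrase list and citation format are assumptions about how the v0.7.0 mechanism works:

```python
import re

# Phrases that signal speculation without concrete evidence.
HEDGED = re.compile(r"\b(could|might|may|potentially|in theory)\b", re.IGNORECASE)
# Evidence markers: a file:line citation or quoted command output.
CITATION = re.compile(r"\S+\.py:\d+|^\$ ", re.MULTILINE)

def gate_finding(finding_text: str) -> str:
    """Downgrade findings that hedge without citing code or command output."""
    if HEDGED.search(finding_text) and not CITATION.search(finding_text):
        return "downgraded"
    return "kept"
```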

Test: Re-run the Guardian review with evidence-gating active. Count how many findings survive vs. get downgraded.

Expected outcome: 1-2 findings correctly downgraded, 0 real bugs missed.

H7: Shadow Detection for the Maker

Observation: The Maker introduced a bug (hash-based indexing) because it deviated from the existing codebase pattern (enumerate-based indexing). This is the "Rogue" shadow — the Maker going off-script from what the codebase already does.

Hypothesis: A pre-commit check that compares the Maker's implementation against the existing codebase patterns (e.g., "how are chapter indices computed elsewhere in fanout.py?") would catch Rogue deviations before the Guardian review.

Test: Add a "pattern conformance" check to the Do phase that greps for how the modified variables/functions are used elsewhere in the file.

Expected outcome: Catches Rogue shadow bugs at implementation time rather than review time, saving a review cycle.
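
A minimal sketch of such a conformance check; the file path and identifier are illustrative, and a real check would derive both from the diff:

```python
import subprocess

def existing_uses(identifier: str, path: str = "fanout.py") -> list[str]:
    """Grep for prior uses of an identifier the Maker touched, so new code can
    be compared against the established pattern before review."""
    result = subprocess.run(
        ["grep", "-n", identifier, path],
        capture_output=True, text=True,
    )
    return result.stdout.splitlines()

# Example: how are chapter indices computed elsewhere in fanout.py?
# existing_uses("chapter_index")
```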


Prioritized Next Steps

  1. H1: Build af-review mode (Guardian-only on existing diff) — lowest effort, highest expected ROI
  2. H4: Project convention injection — reduce noise without missing signal
  3. H2: Pre-implementation threat modeling — address the root cause of missing error handling
  4. H5: Default to pipeline strategy, reserve PDCA for rejections
  5. H7: Maker pattern conformance check — reduce Maker-introduced bugs