feat: add run replay for archetype effectiveness analysis

- archeflow-decision.sh records decision points during runs - archeflow-replay.sh: timeline, whatif, compare commands - What-if replay with adjustable archetype weights - /af-replay skill for interactive use - Tests in archeflow-replay.bats
2026-04-06 21:43:29 +02:00
parent 506143d613
commit 4f8e2a9962
8 changed files with 129 additions and 4 deletions
--- a/.claude-plugin/plugin.json
+++ b/.claude-plugin/plugin.json
@@ -1,7 +1,7 @@
 {
  "name": "archeflow",
  "description": "Multi-agent orchestration with Jungian archetypes. PDCA quality cycles, shadow detection, git worktree isolation. Zero dependencies — works with any Claude Code session.",
-  "version": "0.8.0",
+  "version": "0.9.0",
  "author": {
    "name": "Chris Nennemann"
  },
@@ -19,7 +19,7 @@
    "colette-bridge", "git-integration", "multi-project", "cost-tracking",
    "custom-archetypes", "workflow-design", "domains",
    "templates", "autonomous-mode", "using-archeflow",
-    "af-status", "af-score", "af-dag", "af-report"
+    "af-status", "af-score", "af-dag", "af-report", "af-replay"
  ],
  "hooks": "hooks/hooks.json"
 }
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -2,6 +2,11 @@

 All notable changes to ArcheFlow are documented in this file.

+## [0.9.0] -- 2026-04-06
+
+### Added
+- Run replay: `decision.point` events via `archeflow-decision.sh`; `archeflow-replay.sh` with `timeline`, `whatif` (weighted archetype weights + threshold), and `compare`; skill `af-replay`; DAG labels for `decision.point`.
+
 ## [0.7.0] -- 2026-04-04

 ### Added
--- a/README.md
+++ b/README.md
@@ -194,11 +194,13 @@ ArcheFlow ships with 19 skills organized by function. The `run` skill is self-co

 ## Library Scripts

-Eight shell scripts in `lib/` power the process infrastructure.
+Ten shell scripts in `lib/` power the process infrastructure.

 | Script | Purpose | Usage |
 |--------|---------|-------|
 | `archeflow-event.sh` | Append structured JSONL events to a run log | `archeflow-event.sh <run_id> <type> <phase> <agent> '<json>'` |
+| `archeflow-decision.sh` | Log a `decision.point` (phase, archetype, input, decision, confidence) | `archeflow-decision.sh <run_id> check guardian 'diff' 'needs_changes' 0.85` |
+| `archeflow-replay.sh` | Timeline + weighted what-if over recorded verdicts | `archeflow-replay.sh compare <run_id> --weights sage=2,guardian=1` |
 | `archeflow-dag.sh` | Render ASCII DAG from JSONL events | `archeflow-dag.sh events.jsonl --color` |
 | `archeflow-report.sh` | Generate Markdown process report | `archeflow-report.sh events.jsonl --output report.md --dag` |
 | `archeflow-progress.sh` | Regenerate live progress file from events | `archeflow-progress.sh <run_id>` |
--- a/docs/status.md
+++ b/docs/status.md
@@ -1,5 +1,11 @@
 # ArcheFlow — Status Log

+## 2026-04-06: Run replay (v0.9.0)
+
+- `lib/archeflow-decision.sh` — append `decision.point` (phase, archetype, input, decision, confidence).
+- `lib/archeflow-replay.sh` — `timeline` / `whatif` (weighted archetypes, threshold) / `compare`; optional `--json`.
+- Skill `af-replay`, plugin bump, DAG renders `decision.point`, `tests/archeflow-replay.bats`.
+
 ## 2026-04-04: Triple Release Sprint (v0.4 → v0.6)

 ### What happened
--- a/skills/af-replay/SKILL.md
+++ b/skills/af-replay/SKILL.md
@@ -0,0 +1,42 @@
+---
+name: af-replay
+description: "Replay and analyze a recorded ArcheFlow run: decision timeline and weighted what-if. Usage: /af-replay <run-id> [--timeline|--whatif|--compare] [--weights arch=w,...]"
+user-invocable: true
+---
+
+# ArcheFlow Run Replay
+
+Inspect a completed or in-progress run logged in `.archeflow/events/<run_id>.jsonl`. Use this to study which archetypes drove outcomes and to simulate **weighted** consensus (what-if).
+
+## Recording (during PDCA)
+
+After each meaningful orchestration choice, log a **decision point** (in addition to `review.verdict` where applicable):
+
+```bash
+./lib/archeflow-decision.sh <run_id> <phase> <archetype> '<input_summary>' '<decision>' <confidence> [parent_seq]
+```
+
+Fields stored: `phase`, `archetype`, `input`, `decision`, `confidence`, `ts` (event timestamp). The event type is `decision.point`.
+
+Lower-level alternative:
+
+```bash
+./lib/archeflow-event.sh "$RUN_ID" decision.point check guardian \
+  '{"archetype":"guardian","input":"diff","decision":"needs_changes","confidence":0.85}' 7
+```
+
+## Commands (from project root)
+
+| Action | Shell |
+|--------|--------|
+| Timeline | `./lib/archeflow-replay.sh timeline <run_id>` |
+| What-if | `./lib/archeflow-replay.sh whatif <run_id> [--weights guardian=2,sage=0.5] [--threshold 0.5] [--json]` |
+| Both | `./lib/archeflow-replay.sh compare <run_id> [--weights ...]` |
+
+- **Timeline** lists `decision.point` rows and `review.verdict` (check phase).
+- **What-if** reads the **last** `review.verdict` per archetype in check. **Original** outcome uses strict any-veto (any non-approve → BLOCK). **Replay** uses weighted mean strictness: each reviewer contributes weight × (1 if not approved, else 0); BLOCK if mean ≥ threshold (default 0.5).
+- **`--json`** emits machine-readable output for dashboards or scripts.
+
+## Learning effectiveness
+
+Correlate `decision.point` confidence and verdicts with cycle outcomes (`cycle.boundary`, `run.complete`) and `./lib/archeflow-score.sh extract` to see which archetypes add signal for which task shapes.
--- a/skills/run/SKILL.md
+++ b/skills/run/SKILL.md
@@ -352,6 +352,7 @@ Emit events via `./lib/archeflow-event.sh <run_id> <type> <phase> <agent> '<json
 | After agent returns | `agent.complete` | archetype, duration_ms, artifacts, summary |
 | Phase boundary | `phase.transition` | from, to, artifacts_so_far |
 | Alternative chosen | `decision` | what, chosen, alternatives, rationale |
+| Orchestrator decision (replay) | `decision.point` | archetype, input, decision, confidence — use `./lib/archeflow-decision.sh` |
 | Reviewer verdict | `review.verdict` | archetype, verdict, findings[] |
 | Fix addressing review | `fix.applied` | source, finding, file, line |
 | End of PDCA cycle | `cycle.boundary` | cycle, max_cycles, exit_condition, convergence |
@@ -403,6 +404,12 @@ Scores stored in `.archeflow/memory/effectiveness.jsonl`. After 10+ runs, recomm

 ---

+## Run replay (decision log + what-if)
+
+After key choices (routing, fast-path skip, escalation), emit `decision.point` via `./lib/archeflow-decision.sh` so runs can be inspected with `./lib/archeflow-replay.sh timeline|whatif|compare <run_id>`. Weighted what-if helps estimate how much each review archetype swayed the effective ship/block outcome. See skill `af-replay`.
+
+---
+
 ## Dry-Run Mode

 When `--dry-run`: Run Plan phase only. Display workflow, agent counts, confidence scores, cost estimate. Ask user to proceed. If yes, continue with `--start-from do`.
--- a/skills/using-archeflow/SKILL.md
+++ b/skills/using-archeflow/SKILL.md
@@ -7,7 +7,7 @@ description: Use at session start when implementing features, reviewing code, de

 On activation, print ONE line then proceed silently:
 ```
-archeflow v0.8.0 · 23 skills · <domain> domain
+archeflow v0.9.0 · 24 skills · <domain> domain
 ```
 Domain auto-detected: `writing` if `colette.yaml` exists, `research` if paper/thesis files, `code` otherwise.

@@ -46,6 +46,7 @@ Do NOT use for: single-line fixes, questions, reading/exploring, config tweaks,
 | `/af-memory` | Cross-run lesson memory |
 | `/af-fanout` | Colette book fanout via agents |
 | `/af-dag` | DAG of current/last run |
+| `/af-replay <run_id>` | Decision timeline + weighted what-if on recorded events |

 ## Mini-Reflect Fallback

--- a/tests/archeflow-replay.bats
+++ b/tests/archeflow-replay.bats
@@ -0,0 +1,62 @@
+# Tests for archeflow-replay.sh — timeline, what-if, and compare modes.
+
+setup() {
+  load test_helper
+  _common_setup
+
+  mkdir -p .archeflow/events
+  cat > ".archeflow/events/replay-run.jsonl" <<'EVENTS'
+{"ts":"2026-04-03T10:00:00Z","run_id":"replay-run","seq":1,"parent":[],"type":"run.start","phase":"plan","agent":null,"data":{"task":"replay test"}}
+{"ts":"2026-04-03T10:05:00Z","run_id":"replay-run","seq":2,"parent":[1],"type":"decision.point","phase":"check","agent":"guardian","data":{"archetype":"guardian","input":"diff","decision":"needs_changes","confidence":0.88}}
+{"ts":"2026-04-03T10:06:00Z","run_id":"replay-run","seq":3,"parent":[1],"type":"review.verdict","phase":"check","agent":"guardian","data":{"archetype":"guardian","verdict":"needs_changes","findings":[]}}
+{"ts":"2026-04-03T10:07:00Z","run_id":"replay-run","seq":4,"parent":[1],"type":"review.verdict","phase":"check","agent":"sage","data":{"archetype":"sage","verdict":"approved","findings":[]}}
+{"ts":"2026-04-03T10:08:00Z","run_id":"replay-run","seq":5,"parent":[1],"type":"run.complete","phase":"act","agent":null,"data":{"agents_total":2,"fixes_total":0}}
+EVENTS
+}
+
+@test "replay: usage without args" {
+  run "$LIB_DIR/archeflow-replay.sh"
+  [ "$status" -eq 1 ]
+  [[ "$output" == *"Usage"* ]]
+}
+
+@test "replay: timeline shows decision.point" {
+  run "$LIB_DIR/archeflow-replay.sh" timeline replay-run
+  [ "$status" -eq 0 ]
+  [[ "$output" == *"decision.point"* ]]
+  [[ "$output" == *"guardian"* ]]
+  [[ "$output" == *"needs_changes"* ]]
+}
+
+@test "replay: whatif strict blocks when any reviewer blocks" {
+  run "$LIB_DIR/archeflow-replay.sh" whatif replay-run
+  [ "$status" -eq 0 ]
+  [[ "$output" == *"BLOCK"* ]]
+}
+
+@test "replay: whatif weighted can ship when blocker is down-weighted" {
+  run "$LIB_DIR/archeflow-replay.sh" whatif replay-run --weights guardian=0.2,sage=3
+  [ "$status" -eq 0 ]
+  [[ "$output" == *"Weighted replay"* ]] || [[ "$output" == *"SHIP"* ]]
+  [[ "$output" == *"SHIP"* ]]
+}
+
+@test "replay: whatif --json is valid JSON" {
+  run "$LIB_DIR/archeflow-replay.sh" whatif replay-run --json
+  [ "$status" -eq 0 ]
+  echo "$output" | jq -e '.run_id == "replay-run"' >/dev/null
+}
+
+@test "replay: compare includes timeline and whatif" {
+  run "$LIB_DIR/archeflow-replay.sh" compare replay-run
+  [ "$status" -eq 0 ]
+  [[ "$output" == *"Decision timeline"* ]]
+  [[ "$output" == *"What-if replay"* ]]
+}
+
+@test "decision: logs decision.point via wrapper" {
+  run "$LIB_DIR/archeflow-decision.sh" replay-run check trickster 'diff only' 'edge_case' 0.61 1
+  [ "$status" -eq 0 ]
+  last=$(jq -r 'select(.type=="decision.point") | .data.decision' ".archeflow/events/replay-run.jsonl" | tail -1)
+  [ "$last" = "edge_case" ]
+}