18 Commits

Author SHA1 Message Date
3ef956485f docs: add HITL discussion — Wiggum Breaks as formal autonomy boundary
New subsection in Discussion framing Wiggum Breaks as the formal boundary
between autonomous and human-supervised operation. Derives HITL from
convergence theory rather than pre-defined approval gates. Covers
oscillation, divergence, and repeated shadow detection as provably
unproductive conditions that trigger human escalation.
2026-04-08 05:21:20 +02:00
1e96d87f49 feat: introduce Wiggum Break as named circuit breaker
Replaces generic "circuit breaker" with "Wiggum Break" — policy enforcement
halt condition named after Chief Wiggum (policy + Ralph Loop's dad).
Hard breaks (immediate halt) and soft breaks (finish then halt) with
wiggum.break event type. Updated both papers and shadow-detection skill.
2026-04-08 05:19:35 +02:00
d99f449083 docs: add Six Sigma Agent, AgileCoder, Reflexion citations to taxonomy paper
Incorporate findings from literature survey: Six Sigma Agent (arXiv:2601.22290)
as the only prior explicit PM/OM-named framework, AgileCoder for Scrum sprints,
Reflexion as implicit PDCA, CAMEL for role theory.
2026-04-08 05:15:55 +02:00
58315ac982 docs: add taxonomy paper — PM/OM methods for agent orchestration
Survey of 12 operations management methods (PDCA, Scrum, DMAIC, Kanban,
TOC, Lean, OODA, Cynefin, Stage-Gate, Design Thinking, TRIZ, FMEA, SPC)
evaluated against 5 agent constraints. Includes compatibility matrix
and decision framework.
2026-04-08 05:13:59 +02:00
24ea632207 docs: add arXiv paper on ArcheFlow architecture
LaTeX paper describing the archetypal role system, PDCA quality cycles,
shadow detection framework, attention filters, convergence detection,
and effectiveness scoring. References Lu et al. 2026 (Assistant Axis)
for persona stability grounding.
2026-04-08 04:54:14 +02:00
55dde5f07a docs: add ArcheFlow roadmap v0.9-v0.12 2026-04-06 23:08:11 +02:00
4f8e2a9962 feat: add run replay for archetype effectiveness analysis
- archeflow-decision.sh records decision points during runs
- archeflow-replay.sh: timeline, whatif, compare commands
- What-if replay with adjustable archetype weights
- /af-replay skill for interactive use
- Tests in archeflow-replay.bats
2026-04-06 21:43:29 +02:00
506143d613 feat: add decision.point event, decision logger, and run replay 2026-04-06 21:33:42 +02:00
607a53f1bf feat: add decision.point event type, decision logger, and run replay script
- archeflow-decision.sh: convenience wrapper for logging PDCA decision points
- archeflow-replay.sh: timeline view and weighted what-if replay for recorded runs
- archeflow-event.sh: add decision.point usage example
- archeflow-dag.sh: render decision.point events in DAG output
2026-04-06 21:33:36 +02:00
6a49c21bbe test: add bats test suite for lib/ helper scripts
110 tests across 10 test files covering all lib/ scripts:
- archeflow-event.sh: JSONL format, seq numbering, parent fields, validation
- archeflow-memory.sh: add/list/decay/forget/inject/extract commands
- archeflow-git.sh: branch creation, commit format, merge strategies, safety
- archeflow-report.sh: markdown output, summary mode, in-progress handling
- archeflow-progress.sh: progress.md generation, JSON mode, error handling
- archeflow-score.sh: archetype scoring, effectiveness report, validation
- archeflow-dag.sh: DAG rendering, color flags, tree structure
- archeflow-rollback.sh: arg parsing, phase validation, mutual exclusivity
- archeflow-init.sh: template listing, clone from project, arg validation
- archeflow-review.sh: diff modes, stats, branch/commit range review

Includes test_helper.bash (shared setup/teardown with temp git repos)
and scripts/run-tests.sh runner.
2026-04-06 21:20:05 +02:00
6bae80b874 feat: add af-status, af-score, af-dag, af-report slash command skills 2026-04-06 21:10:22 +02:00
43a147676e refactor: slim session-start hook from 55 to ~20 lines of injected context
Create ACTIVATION.md as minimal stub for session-start injection.
Full SKILL.md stays in place for on-demand loading when commands are invoked.
2026-04-06 21:10:14 +02:00
14d70689ce refactor: ArcheFlow v0.8.0 — consolidate 27 to 19 skills, corrective action framework 2026-04-06 21:07:01 +02:00
130c04fa58 feat: corrective action framework + CLAUDE.md rewrite + v0.8.0 cleanup
- Extend shadow-detection with 3-layer corrective action framework:
  archetype shadows, system shadows (tunnel vision, echo chamber, etc.),
  and policy boundaries (checkpoints, budget gates, circuit breakers)
- Rewrite CLAUDE.md with proper guardrails (DO/DO NOT, skill writing rules,
  200-line max per skill, no bash pseudo-code in skills)
- Update plugin.json to v0.8.0 with consolidated 19-skill list
- Update README architecture tree and skills reference
- Update using-archeflow version string to v0.8.0 / 19 skills
- Remove 8 empty skill directories (absorbed into run skill)
2026-04-06 20:52:27 +02:00
752177528f refactor: trim act-phase skill from 371 to 140 lines
Remove duplicated routing tables, verbose JSON event examples,
writing/prose domain template (belongs in domains/colette-bridge),
--start-from section (belongs in run skill), and redundant checklist.
Consolidate three Agent() templates into one compact template.
Preserve all routing rules, decision logic, and feedback format.
2026-04-06 20:50:59 +02:00
a1667633ad Merge branch 'refactor/consolidate-check-phase-v2' into refactor/trim-secondary-skills
# Conflicts:
#	skills/colette-bridge/SKILL.md
#	skills/using-archeflow/SKILL.md
2026-04-06 20:50:31 +02:00
8837a359ac refactor: simplify memory and shadow-detection skills
Trim verbose implementation details that duplicate what the bash helper
scripts already handle. Memory skill: 278 -> 120 lines. Shadow detection
skill: 180 -> 66 lines. All essential protocols, tables, and commands
preserved; removed redundant algorithm descriptions, multiple examples,
and narrative prose.
2026-04-06 20:42:47 +02:00
af1f4e7da7 refactor: merge attention-filters into check-phase skill
Consolidate the attention-filters skill (122 lines) into check-phase,
reducing check-phase from 234 to 110 lines. Removed verbose bash code
blocks, 30-line consolidated output example, re-check protocol (belongs
in act-phase), and motivational section. Updated all references in
README, plugin.json, using-archeflow, and colette-bridge.
2026-04-06 20:41:36 +02:00
42 changed files with 4398 additions and 805 deletions


@@ -1,7 +1,7 @@
 {
   "name": "archeflow",
   "description": "Multi-agent orchestration with Jungian archetypes. PDCA quality cycles, shadow detection, git worktree isolation. Zero dependencies — works with any Claude Code session.",
-  "version": "0.7.0",
+  "version": "0.9.0",
   "author": {
     "name": "Chris Nennemann"
   },
@@ -14,12 +14,12 @@
     "shadow-detection", "workflows"
   ],
   "skills": [
-    "run", "orchestration", "plan-phase", "do-phase", "check-phase", "act-phase",
-    "shadow-detection", "attention-filters", "convergence", "artifact-routing",
-    "process-log", "memory", "effectiveness", "progress",
-    "colette-bridge", "git-integration", "multi-project",
-    "custom-archetypes", "workflow-design", "domains", "cost-tracking",
-    "templates", "autonomous-mode", "using-archeflow", "presence"
+    "run", "sprint", "review", "check-phase", "act-phase",
+    "shadow-detection", "memory", "progress", "presence",
+    "colette-bridge", "git-integration", "multi-project", "cost-tracking",
+    "custom-archetypes", "workflow-design", "domains",
+    "templates", "autonomous-mode", "using-archeflow",
+    "af-status", "af-score", "af-dag", "af-report", "af-replay"
   ],
   "hooks": "hooks/hooks.json"
 }

.gitignore

@@ -8,3 +8,11 @@ Thumbs.db
 # Editor
 *.swp
 *~
+# Paper build artifacts
+paper/*.aux
+paper/*.bbl
+paper/*.blg
+paper/*.log
+paper/*.out
+paper/*.pdf
+paper/*.toc


@@ -2,6 +2,11 @@
 All notable changes to ArcheFlow are documented in this file.
+
+## [0.9.0] -- 2026-04-06
+### Added
+- Run replay: `decision.point` events via `archeflow-decision.sh`; `archeflow-replay.sh` with `timeline`, `whatif` (adjustable archetype weights + threshold), and `compare`; skill `af-replay`; DAG labels for `decision.point`.
+
 ## [0.7.0] -- 2026-04-04
 ### Added

CLAUDE.md

@@ -1,71 +1,119 @@
 # archeflow — Multi-Agent Orchestration Plugin for Claude Code
-Workspace-level orchestration: parallel agent teams across project portfolios, PDCA cycles with Jungian archetype roles, sprint runner, and post-implementation review. Installed as a Claude Code plugin.
+PDCA quality cycles with Jungian archetype roles, corrective action framework, sprint runner, and post-implementation review. Zero dependencies — pure Bash + Markdown.
+## Tech Stack
+- **Runtime:** Bash (lib scripts) + Claude Code skill system (Markdown skills)
+- **No build step, no dependencies** — pure bash + markdown
+- **Plugin format:** Claude Code plugin (skills/, hooks/, agents/, templates/)
+## Key Commands
+```bash
+# Use via Claude Code slash commands:
+/af-sprint        # Main mode: work the queue across projects
+/af-run <task>    # Deep orchestration with PDCA cycles
+/af-review        # Post-implementation security/quality review
+/af-status        # Current run status
+/af-init          # Initialize ArcheFlow in a project
+/af-score         # Archetype effectiveness scores
+/af-memory        # Cross-run lesson memory
+/af-report        # Full process report
+/af-fanout        # Colette book fanout via agents
+```
 ## Architecture
 ```
-skills/             Slash command implementations (one dir per skill)
-  sprint/           /af-sprint — queue-driven parallel agent runner
-  run/              /af-run — PDCA orchestration
-  review/           /af-review — Guardian-led code review
-  plan-phase/       PDCA Plan phase
-  do-phase/         PDCA Do phase
-  check-phase/      PDCA Check phase
-  act-phase/        PDCA Act phase
-  memory/           Cross-run lessons learned
-  cost-tracking/    Token/cost awareness
-  domains/          Domain detection (code, writing, research)
-  ...               ~25 skill directories
-hooks/
-  hooks.json        Hook definitions
-  session-start/    Auto-activation on session start
-agents/             Archetype agent definitions
-  explorer.md       Divergent thinking, research
-  creator.md        Design, architecture
-  maker.md          Implementation
-  guardian.md       Security, risk, quality gates
-  sage.md           Wisdom, patterns, trade-offs
-  skeptic.md        Devil's advocate
-  trickster.md      Edge cases, unconventional approaches
-lib/                Bash helper scripts (git, DAG, events, progress, etc.)
-templates/bundles/  Pre-configured workflow bundles
+skills/             Slash commands and internal protocols (one SKILL.md per dir)
+  run/              /af-run — self-contained PDCA orchestration (core skill)
+  sprint/           /af-sprint — queue-driven parallel agent dispatch
+  review/           /af-review — Guardian-led code review
+  check-phase/      Shared reviewer protocol (used by run + review)
+  act-phase/        Finding collection, fix routing, exit decisions
+  shadow-detection/ Corrective action framework (archetype + system + policy)
+  memory/           Cross-run lessons learned
+  cost-tracking/    Token/cost awareness and budget enforcement
+  domains/          Domain detection (code, writing, research)
+  colette-bridge/   Writing context loader from colette.yaml
+  multi-project/    Cross-repo orchestration with dependency DAG
+  git-integration/  Per-phase commits, branch strategy, rollback
+  templates/        Workflow/team bundle gallery
+  autonomous-mode/  Unattended session protocol
+  using-archeflow/  Session-start activation (auto-loaded via hook)
+agents/             Archetype personality definitions (one .md per archetype)
+lib/                Bash helper scripts (events, git, memory, progress, etc.)
+hooks/              Session-start hook (injects using-archeflow)
+templates/bundles/  Pre-configured workflow bundles
+docs/               Roadmap, dogfood notes, test reports
 ```
-## Domain Rules
-- Skills are Markdown files with frontmatter — follow existing skill format exactly
-- Agents are archetype personas — maintain their distinct voice and perspective
-- Dogfood observations go to `archeflow/.archeflow/memory/lessons.jsonl`
-- Cost tracking: prefer cheap models for bulk ops, expensive for creative/review
-- PDCA cycle order is mandatory: Plan -> Do -> Check -> Act
-## Do NOT
-- Add runtime dependencies — this must stay zero-dependency
-- Change archetype personalities without updating all referencing skills
-- Skip the Check phase in PDCA cycles (quality gate)
-- Modify hooks.json format without testing plugin reload
-- Use ArcheFlow to orchestrate simple single-file tasks (overhead not justified)
+## Commands
+| Command | Purpose |
+|---------|---------|
+| `/af-run <task>` | PDCA orchestration with full agent cycle |
+| `/af-sprint` | Work the queue across projects |
+| `/af-review` | Review existing code changes |
+| `/af-status` | Current/last run status |
+| `/af-init` | Initialize ArcheFlow in a project |
+| `/af-score` | Archetype effectiveness scores |
+| `/af-memory` | Cross-run lesson memory |
+| `/af-report` | Full process report |
+| `/af-fanout` | Colette book fanout via agents |
+## Core Concepts
+### PDCA Cycle
+```
+Plan (Explorer + Creator) -> Do (Maker in worktree) -> Check (Guardian first, then others) -> Act (fix, merge, or cycle)
+```
+### Archetypes
+Explorer (research), Creator (design), Maker (implement), Guardian (security), Skeptic (assumptions), Trickster (edge cases), Sage (quality). Each has a virtue and a shadow — see `shadow-detection` skill.
+### Corrective Action Framework
+Three layers, one escalation protocol:
+- **Archetype shadows** — individual agent dysfunction
+- **System shadows** — orchestration-level issues (echo chamber, tunnel vision, scope creep)
+- **Policy boundaries** — operational limits (checkpoints, budgets, Wiggum Breaks)
+### Workflows
+| Risk Level | Workflow | Agents |
+|------------|----------|--------|
+| Low | `fast` | Creator -> Maker -> Guardian |
+| Medium | `standard` | Explorer + Creator -> Maker -> Guardian + Skeptic + Sage |
+| High | `thorough` | Explorer + Creator -> Maker -> All 4 reviewers |
+## Guardrails
+### DO
+- Keep skills self-contained. The `run` skill needs zero prerequisites — it was consolidated for a reason.
+- Write skills as operational instructions Claude can follow, not software specifications.
+- Use tables for reference data, numbered steps for protocols.
+- Emit events via `./lib/archeflow-event.sh` — but never let logging block orchestration.
+- Maintain the corrective action framework when adding new agent types.
+- Test skill changes by running `/af-run --dry-run` and verifying the flow.
+- Keep archetype personalities distinct — each agent definition in `agents/` has a specific voice.
+### DO NOT
+- **Add runtime dependencies.** This must stay zero-dependency (Bash + Markdown only).
+- **Bloat skills back up.** The consolidation from 27 to ~15 skills was intentional. Do not create new skills for internal implementation details — inline them.
+- **Write bash pseudo-code in skills.** Skills are Claude instructions, not shell scripts. Use one-liner commands or lib script references, not multi-line bash blocks.
+- **Duplicate protocol definitions.** Finding format lives in `check-phase`. Routing table lives in `act-phase`. Shadow detection lives in `shadow-detection`. One source of truth per concept.
+- **Skip the Check phase** in PDCA cycles. It's the quality gate.
+- **Change archetype personalities** without updating all referencing skills and agent definitions.
+- **Use ArcheFlow for trivial tasks.** Single-file fixes, config changes, questions — just do them directly.
+- **Let skills exceed ~200 lines.** If a skill is growing past this, it probably needs splitting or the content belongs in a lib script.
+### Skill Writing Rules
+1. **Frontmatter**: `name` (kebab-case), `description` (one-liner + `<example>` tags for user-invocable skills)
+2. **Structure**: Imperative voice. Lead with what to do, not why. Tables > prose. Steps > paragraphs.
+3. **Agent templates**: Keep Agent() spawn templates concise. Include only the prompt, subagent_type, and isolation mode.
+4. **Cross-references**: Use `archeflow:<skill-name>` backtick syntax to reference other skills. Avoid circular dependencies.
+5. **Bash commands**: One-liners only in skills. Multi-step logic belongs in `lib/` scripts.
+### Cost Awareness
+- Prefer cheap models (haiku) for analytical tasks (validation, diff scoring)
+- Use capable models (sonnet/opus) for creative tasks (writing, complex design)
+- Budget enforcement via `cost-tracking` skill and `.archeflow/config.yaml`
+- Track token spend per agent in events for post-run analysis
+### Git Rules
+- Signing: `git config gpg.format ssh`, key at `~/.ssh/id_ed25519_dev.pub`
+- Push: `GIT_SSH_COMMAND="ssh -i /home/c/.ssh/id_ed25519_dev -o IdentitiesOnly=yes" git push origin main`
+- Conventional commits: `feat:`, `fix:`, `chore:`, `docs:`, `refactor:`
+- No Co-Authored-By trailers
+- All work on worktree branches until explicitly merged
+- Merges use `--no-ff` (individually revertable)
+## Dogfooding
+When using ArcheFlow to develop ArcheFlow itself:
+- Log observations to `.archeflow/memory/lessons.jsonl`
+- Note friction points, shadow false positives, skill gaps
+- Test skill changes with `/af-run --dry-run` before committing


@@ -146,69 +146,61 @@ Shadow detection is quantitative, not vibes. Explorer output exceeding 2000 word
 ## Skills Reference
-ArcheFlow ships with 24 skills organized by function.
+ArcheFlow ships with 19 skills organized by function. The `run` skill is self-contained -- no prerequisites needed.
 ### Core Orchestration
 | Skill | Description |
 |-------|-------------|
-| `archeflow:run` | Automated PDCA execution loop -- single-command orchestration with `--start-from`, `--dry-run`, and cycle-back |
-| `archeflow:orchestration` | Step-by-step PDCA execution guide for manual orchestration |
-| `archeflow:plan-phase` | Explorer and Creator output formats and protocols |
-| `archeflow:do-phase` | Maker implementation rules and worktree commit strategy |
-| `archeflow:check-phase` | Shared reviewer protocols and output format |
-| `archeflow:act-phase` | Post-Check decision logic: collect findings, route fixes, exit or cycle |
+| `archeflow:run` | Self-contained PDCA orchestration -- Plan/Do/Check/Act with adaptation rules, pipeline strategy, and cycle-back |
+| `archeflow:sprint` | Queue-driven parallel agent dispatch across projects (primary mode) |
+| `archeflow:review` | Guardian-led code review on diff/branch/commit range |
+| `archeflow:check-phase` | Shared reviewer protocol -- finding format, evidence requirements, attention filters |
+| `archeflow:act-phase` | Finding collection, fix routing, exit decisions |
 ### Quality and Safety
 | Skill | Description |
 |-------|-------------|
-| `archeflow:shadow-detection` | Quantitative dysfunction detection and automatic correction |
-| `archeflow:attention-filters` | Context optimization per archetype -- each agent gets only what it needs |
-| `archeflow:convergence` | Detects convergence, stalling, and oscillation in multi-cycle runs |
-| `archeflow:artifact-routing` | Inter-phase artifact protocol -- naming, storage, routing, archiving |
-### Process Intelligence
-| Skill | Description |
-|-------|-------------|
-| `archeflow:process-log` | Event-sourced JSONL logging with DAG parent relationships |
+| `archeflow:shadow-detection` | Corrective action framework -- archetype shadows, system shadows, policy boundaries |
 | `archeflow:memory` | Cross-run memory that learns recurring findings and injects lessons |
-| `archeflow:effectiveness` | Archetype scoring on signal-to-noise, fix rate, cost efficiency |
-| `archeflow:progress` | Live progress file watchable from a second terminal |
 ### Integration
 | Skill | Description |
 |-------|-------------|
 | `archeflow:colette-bridge` | Bridges ArcheFlow with the Colette writing platform |
-| `archeflow:git-integration` | Git-per-phase commits, branch-per-run, rollback to any phase boundary |
+| `archeflow:git-integration` | Per-phase commits, branch-per-run, rollback |
 | `archeflow:multi-project` | Cross-repo orchestration with dependency DAG and shared budget |
-| `archeflow:cost-tracking` | Budget enforcement, per-agent cost aggregation, model tier recommendations |
 ### Configuration
 | Skill | Description |
 |-------|-------------|
-| `archeflow:domains` | Domain adapters for writing, research, and non-code workflows |
 | `archeflow:custom-archetypes` | Create domain-specific roles (database reviewer, compliance auditor, etc.) |
-| `archeflow:workflow-design` | Design custom workflows with per-phase archetype assignment and exit conditions |
+| `archeflow:workflow-design` | Design custom workflows with per-phase archetype assignment |
+| `archeflow:domains` | Domain adapters for writing, research, and other non-code workflows |
+| `archeflow:cost-tracking` | Budget enforcement, per-agent cost aggregation, model tier recommendations |
 | `archeflow:templates` | Template gallery for sharing workflows, teams, and setup bundles |
-| `archeflow:autonomous-mode` | Unattended overnight sessions with progress logging and safe stopping |
+| `archeflow:autonomous-mode` | Unattended sessions with corrective action checkpoints |
+| `archeflow:progress` | Live progress file watchable from a second terminal |
+| `archeflow:presence` | User-facing output format -- show outcomes, not mechanics |
 ### Meta
 | Skill | Description |
 |-------|-------------|
-| `archeflow:using-archeflow` | Session-start skill -- activation criteria, workflow selection, quick reference |
+| `archeflow:using-archeflow` | Session-start activation -- decision tree, workflow selection, commands |
 ## Library Scripts
-Eight shell scripts in `lib/` power the process infrastructure.
+Ten shell scripts in `lib/` power the process infrastructure.
 | Script | Purpose | Usage |
 |--------|---------|-------|
 | `archeflow-event.sh` | Append structured JSONL events to a run log | `archeflow-event.sh <run_id> <type> <phase> <agent> '<json>'` |
+| `archeflow-decision.sh` | Log a `decision.point` (phase, archetype, input, decision, confidence) | `archeflow-decision.sh <run_id> check guardian 'diff' 'needs_changes' 0.85` |
+| `archeflow-replay.sh` | Timeline + weighted what-if over recorded verdicts | `archeflow-replay.sh compare <run_id> --weights sage=2,guardian=1` |
 | `archeflow-dag.sh` | Render ASCII DAG from JSONL events | `archeflow-dag.sh events.jsonl --color` |
 | `archeflow-report.sh` | Generate Markdown process report | `archeflow-report.sh events.jsonl --output report.md --dag` |
 | `archeflow-progress.sh` | Regenerate live progress file from events | `archeflow-progress.sh <run_id>` |
@@ -341,47 +333,28 @@ archetypes: [explorer, creator, maker, guardian, db-specialist]
 ```
 archeflow/
-├── .claude-plugin/plugin.json   # Plugin manifest (v0.5.0)
+├── .claude-plugin/plugin.json   # Plugin manifest
 ├── agents/                      # 7 archetype personas (behavioral protocols)
-│   ├── explorer.md              # Plan: research and context mapping
-│   ├── creator.md               # Plan: solution design and proposals
-│   ├── maker.md                 # Do: implementation in isolated worktree
-│   ├── guardian.md              # Check: security and reliability review
-│   ├── skeptic.md               # Check: assumption challenging
-│   ├── trickster.md             # Check: adversarial testing
-│   └── sage.md                  # Check: holistic quality review
-├── skills/                      # 24 behavioral skills
-│   ├── run/                     # Automated PDCA loop
-│   ├── orchestration/           # Manual PDCA execution guide
-│   ├── plan-phase/              # Plan protocols
-│   ├── do-phase/                # Do protocols
-│   ├── check-phase/             # Check protocols
-│   ├── act-phase/               # Act phase decision logic
-│   ├── shadow-detection/        # Dysfunction detection
-│   ├── attention-filters/       # Context optimization
-│   ├── convergence/             # Cycle convergence detection
-│   ├── artifact-routing/        # Inter-phase artifact protocol
-│   ├── process-log/             # Event-sourced JSONL logging
-│   ├── memory/                  # Cross-run learning
-│   ├── effectiveness/           # Archetype scoring
-│   ├── progress/                # Live progress file
-│   ├── colette-bridge/          # Colette writing platform bridge
-│   ├── git-integration/         # Per-phase git commits
-│   ├── multi-project/           # Cross-repo orchestration
-│   ├── custom-archetypes/       # Domain-specific roles
-│   ├── workflow-design/         # Custom workflow design
-│   ├── domains/                 # Domain adapters
-│   ├── cost-tracking/           # Budget and cost management
-│   ├── templates/               # Template gallery
-│   ├── autonomous-mode/         # Unattended sessions
-│   └── using-archeflow/         # Session-start activation
-├── lib/                         # 8 shell scripts (process infrastructure)
+│   ├── explorer.md, creator.md  # Plan phase agents
+│   ├── maker.md                 # Do phase agent
+│   └── guardian.md, skeptic.md, # Check phase agents
+│       trickster.md, sage.md
+├── skills/                      # 19 skills (consolidated from 27)
+│   ├── run/                     # Self-contained PDCA orchestration (core)
+│   ├── sprint/                  # Queue-driven parallel agent dispatch
+│   ├── review/                  # Guardian-led code review
+│   ├── check-phase/             # Shared reviewer protocol + attention filters
+│   ├── act-phase/               # Finding collection + fix routing
+│   ├── shadow-detection/        # Corrective action framework (3 layers)
+│   ├── memory/                  # Cross-run learning
+│   └── ...                      # + 12 config/integration skills
+├── lib/                         # 10 shell scripts (events, git, memory, etc.)
 ├── hooks/                       # Auto-activation (SessionStart)
 ├── examples/                    # Walkthroughs, templates, custom archetypes
 └── docs/                        # Roadmap, changelog
 ```
-The flow: skills define behavioral rules (what agents should do), agents define personas (how they think), lib scripts handle tooling (event logging, git, reporting), and hooks wire it all together at session start. Events are emitted at every phase transition, forming a DAG that can be rendered, reported, or scored after the run.
+Skills define behavioral rules, agents define personas, lib scripts handle tooling, hooks wire it together at session start. The `run` skill is self-contained -- it absorbed 8 previously separate skills (orchestration, plan-phase, do-phase, artifact-routing, process-log, convergence, effectiveness, attention-filters) into one 459-line operational guide.
 ## Philosophy


@@ -0,0 +1,235 @@
# ArcheFlow Roadmap — From Framework to Tool
Status: Planning (2026-04-06)
Context: v0.8.0 shipped — consolidated skills, corrective action framework, 110 tests. The scaffolding is solid. Now make it genuinely useful.
## Guiding Principle
Every feature must close a feedback loop or remove friction. No features that add complexity without a measurable improvement in speed, cost, or quality.
---
## Tier 1: Make the Sprint Runner Smart (highest impact)
### 1.1 Queue from Git Issues
**Problem:** Manual `queue.json` is the biggest friction point. Nobody wants to maintain a JSON file by hand.
**Solution:** `./scripts/ws sync-issues` that:
- Reads Gitea/GitHub issues via API (`gh issue list` or Gitea REST)
- Maps labels to priority: `P0`=critical/blocker, `P1`=high, `P2`=medium, `P3`=low/enhancement
- Maps labels to estimate: `size/S`, `size/M`, `size/L`, `size/XL` (default: M)
- Extracts `depends_on` from "blocks #N" / "depends on #N" in issue body
- Upserts into `queue.json` (doesn't overwrite manual edits, merges by issue ID)
- Skips issues with `wontfix`, `duplicate`, `question` labels
**Scope:** One script in `scripts/`, ~100 lines. Gitea API + GitHub API (detect from remote URL). Needs API token in env var `GITEA_TOKEN` or `GITHUB_TOKEN`.
**Test:** bats tests with mock API responses (curl fixture files).
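The label mappings above are mechanical enough to sketch directly. A minimal sketch in plain shell, following the label names listed above; `map_priority`, `map_estimate`, and `extract_depends` are hypothetical helpers, not existing ArcheFlow scripts:

```bash
# Label -> priority, per the mapping above (default: P2 / medium)
map_priority() {
  case "$1" in
    critical|blocker) echo P0 ;;
    high)             echo P1 ;;
    low|enhancement)  echo P3 ;;
    *)                echo P2 ;;
  esac
}

# size/* label -> estimate bucket (default: M)
map_estimate() {
  case "$1" in
    size/S)  echo S ;;
    size/L)  echo L ;;
    size/XL) echo XL ;;
    *)       echo M ;;
  esac
}

# "depends on #N" / "blocks #N" in an issue body -> issue numbers, one per line
extract_depends() {
  printf '%s\n' "$1" | grep -oE '(depends on|blocks) #[0-9]+' | grep -oE '[0-9]+'
}
```

The upsert-by-issue-ID and label-skip logic would sit on top of these in the `sync-issues` script.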
### 1.2 Cost Estimation
**Problem:** Users don't know what a sprint will cost before running it.
**Solution:** `/af-sprint --dry-run` shows estimated cost:
```
Sprint estimate: 7 tasks, ~18 agents, est. $1.20-$2.40, ~12 minutes
P1: writing.colette fanout (L) — est. $0.50, 4 agents
P1: tool.archeflow review (M) — est. $0.15, 2 agents
...
Proceed? [y/n]
```
**How:** Track actual token counts per task size (S/M/L/XL) in `.archeflow/memory/cost-history.jsonl`. After 5+ tasks per size bucket, use median. Before that, use defaults: S=$0.05, M=$0.15, L=$0.50, XL=$1.50.
**Scope:** Update `sprint` skill with estimation section. Add cost logging to `archeflow-event.sh` (include `tokens_used` in `agent.complete` data). New script `lib/archeflow-cost.sh` for estimation.
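The median-per-bucket idea can be sketched in plain POSIX shell, keeping the zero-dependency constraint. The `size` and `cost` field names are assumptions about the `cost-history.jsonl` schema, not a documented format:

```bash
median_cost() {  # $1 = size bucket (S/M/L/XL); reads cost-history.jsonl on stdin
  grep "\"size\":\"$1\"" \
    | sed -E 's/.*"cost":([0-9.]+).*/\1/' \
    | sort -n \
    | awk '{ a[NR] = $1 }
           END {
             if (NR == 0) exit 1   # no history yet: caller falls back to the S/M/L/XL defaults
             if (NR % 2)  print a[(NR + 1) / 2]
             else         printf "%.2f\n", (a[NR/2] + a[NR/2 + 1]) / 2
           }'
}
```

The "5+ tasks per bucket before trusting the median" rule would be a count check before calling this.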
### 1.3 Smart Workflow Selection
**Problem:** Current auto-selection uses keyword matching ("fix" -> pipeline). This is crude.
**Solution:** Analyze the actual task + codebase signals:
| Signal | Source | Workflow |
|--------|--------|----------|
| Files matching `auth|crypto|secret|token|session` | task description + file paths | -> thorough |
| Public API changes (OpenAPI spec modified, exported functions changed) | git diff | -> thorough |
| <3 files changed, all in same dir | git diff | -> fast/pipeline |
| Test files only | git diff | -> pipeline |
| Historical: this project's last 3 runs needed 0 cycles | memory | -> fast |
| Historical: this project's last run had 2+ CRITICALs | memory | -> thorough |
**Scope:** Add to the `run` skill's Strategy Selection section. Read git diff stats + memory lessons before choosing. ~20 lines of logic replacing the current keyword table.
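The diff-stat rows of the signal table could reduce to something like the following sketch; the function name and thresholds mirror the table but are illustrative, not the shipped logic (the memory-based signals are omitted):

```bash
select_workflow() {  # $1 = newline-separated changed file paths (e.g. from `git diff --name-only`)
  files=$1
  # Security-sensitive paths always escalate to the thorough workflow
  if printf '%s\n' "$files" | grep -qE 'auth|crypto|secret|token|session'; then
    echo thorough
    return
  fi
  n=$(printf '%s\n' "$files" | grep -c .)
  dirs=$(printf '%s\n' "$files" | xargs -n1 dirname 2>/dev/null | sort -u | grep -c .)
  # Small, localized change: fast/pipeline
  if [ "$n" -lt 3 ] && [ "$dirs" -eq 1 ]; then
    echo fast
    return
  fi
  echo standard
}
```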
---
## Tier 2: Close the Learning Loop
### 2.1 Confidence Calibration
**Problem:** Creator's confidence scores (0.0-1.0) are self-reported and uncalibrated. A Creator that always says 0.8 but gets rejected 40% of the time is not useful.
**Solution:** After each `run.complete`, log calibration data:
```jsonl
{"run_id":"...","creator_confidence":{"task":0.8,"solution":0.7,"risk":0.6},"actual_outcome":"rejected","cycles":2,"criticals":1}
```
At run start, inject calibration context into Creator prompt:
```
Your historical calibration: You rate task understanding at 0.8 avg,
but 35% of runs with that score needed cycle-back. Consider scoring
more conservatively.
```
**Scope:** New field in `archeflow-memory.sh` calibration store. ~30 lines in `run` skill to log + inject. Needs 5+ runs before meaningful.
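One way the calibration readout could be computed from the logged JSONL — a sketch assuming the `"task"` confidence and `"actual_outcome"` fields shown above:

```bash
rejection_rate_at() {  # $1 = a task-confidence value, e.g. 0.8; reads calibration JSONL on stdin
  grep "\"task\":$1" | awk '
    /"actual_outcome":"rejected"/ { rej++ }
    { n++ }
    END { if (n) printf "%d%%\n", 100 * rej / n }'
}
```

The resulting percentage would be interpolated into the "Your historical calibration" prompt block.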
### 2.2 Archetype Auto-Tuning
**Problem:** The effectiveness scoring system exists (`archeflow-score.sh`) but nothing acts on it.
**Solution:** After 10+ runs, auto-generate recommendations:
```
Archetype Recommendations (based on 15 runs):
Guardian: essential (caught real issues in 80% of runs)
Sage: keep (useful findings in 60% of runs)
Skeptic: demote to thorough-only (useful in 20%, mostly INFO)
Trickster: keep for thorough (caught 2 bugs Guardian missed)
```
Add to `/af-score` output. Store recommendation in config as `reviewers.recommended`:
```yaml
reviewers:
recommended:
always: [guardian]
default: [sage]
thorough_only: [skeptic, trickster]
# Auto-generated 2026-04-06 from 15 runs. Override with explicit config.
```
**Scope:** Update `archeflow-score.sh` with recommendation logic. Update `run` skill to read recommended config. Add to `af-score` skill display.
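The bucketing behind these recommendations can be as simple as thresholding each archetype's useful-finding rate. A sketch with assumed cutoffs (80%/50% are illustrative, not spec):

```shell
# recommend_tier: map an archetype's useful-finding rate (0..1) to a tier.
recommend_tier() {
  awk -v r="$1" 'BEGIN {
    if (r >= 0.8)      print "essential"      # caught real issues in most runs
    else if (r >= 0.5) print "keep"
    else               print "thorough_only"  # only worth the cost in thorough
  }'
}
```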
### 2.3 Campaign Memory
**Problem:** Related runs (e.g., "harden all API endpoints") don't share context.
**Solution:** Optional `--campaign <id>` flag on `/af-run`:
- Links runs under a campaign ID
- Cross-run context: "In Run 1, we found the auth pattern uses middleware X. In Run 2, the same pattern applies."
- Campaign-level progress: "3/8 endpoints hardened, 2 CRITICALs remaining"
- Campaign memory injected into Explorer/Creator prompts
**Scope:** New field in event schema. Campaign index in `.archeflow/campaigns/`. Update memory injection to filter by campaign. ~50 lines in `run` skill.
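A minimal sketch of the campaign index itself (file layout under `.archeflow/campaigns/` follows the Scope note; helper names are illustrative):

```shell
# Hypothetical campaign index helpers.
campaign_add() {
  # $1 = campaign id, $2 = run id to link
  mkdir -p .archeflow/campaigns
  printf '%s\n' "$2" >> ".archeflow/campaigns/$1.runs"
}
campaign_runs() {
  # Print run ids linked to campaign $1 (nothing if the campaign is unknown).
  cat ".archeflow/campaigns/$1.runs" 2>/dev/null || true
}
```

Memory injection would then filter lessons to the run ids that `campaign_runs` returns.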
---
## Tier 3: Integrate with Real Workflow
### 3.1 Findings as PR Comments
**Problem:** Review findings live in `.archeflow/artifacts/`. Nobody reads artifact files — they read PR comments.
**Solution:** After Check phase, if a PR exists for the branch:
```bash
# Post each CRITICAL/WARNING as a PR review comment.
# (The comments API also requires the commit the comment anchors to.)
gh api "repos/{owner}/{repo}/pulls/${pr}/comments" \
  --field commit_id="$(git rev-parse HEAD)" \
  --field body="$(printf '🛡️ **Guardian** [CRITICAL/security]\n\n%s\n\nSuggested fix: %s' "${description}" "${fix}")" \
  --field path="${file}" --field line="${line}"
```
**Scope:** New `--pr <number>` flag on `/af-run` and `/af-review`. Script `lib/archeflow-pr.sh` for posting comments. Falls back gracefully if no PR or no API token.
### 3.2 CI Hook Mode
**Problem:** ArcheFlow runs manually. It should run automatically on PRs.
**Solution:** Lightweight CI integration:
```yaml
# .github/workflows/archeflow-review.yml (or Gitea equivalent)
on: pull_request
jobs:
review:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: claude --plugin-dir ./archeflow -p "/af-review --branch ${{ github.head_ref }} --pr ${{ github.event.number }}"
```
Only runs Guardian (fast, cheap). Posts findings as PR comments. No PDCA overhead.
**Scope:** Template workflow file in `examples/ci/`. Update `review` skill to support `--pr` flag. Documentation.
### 3.3 Watch Mode
**Problem:** You have to remember to run `/af-review` after pushing.
**Solution:** `/af-watch` — background process that monitors a branch:
- Uses `git log --since` polling (every 60s)
- On new commits: auto-run `/af-review` on the diff
- Posts findings as PR comments if PR exists
- Respects budget gate from corrective action framework
**Scope:** New skill `af-watch/SKILL.md` (~30 lines). Uses the `loop` skill infrastructure. Low priority — CI hook mode covers most use cases.
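The polling core reduces to one state-file comparison per tick. A sketch (helper name and state-file path are illustrative; the real skill would feed any printed range to `/af-review`):

```shell
# check_branch: detect new commits on a branch between polls.
check_branch() {
  # $1 = branch/ref, $2 = state file remembering the last seen commit
  local branch="$1" state="$2" head last
  head=$(git rev-parse "$branch") || return 1
  last=$(cat "$state" 2>/dev/null || true)
  if [[ -n "$last" && "$head" != "$last" ]]; then
    echo "${last}..${head}"   # new commits: this is the range to review
  fi
  printf '%s\n' "$head" > "$state"
}
```

Run from the repository root on each tick of a `sleep 60` loop; non-empty output is the diff range to review.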
---
## Tier 4: Replay and Analysis
### 4.1 Decision Journal
**Problem:** No visibility into why ArcheFlow made specific choices during a run.
**Solution:** Already started with `archeflow-decision.sh` and `archeflow-replay.sh`. Extend:
- Log every decision point: workflow selection, A1/A2/A3 triggers, fix routing, shadow detections
- `/af-replay <run_id> --timeline` shows the decision chain
- `/af-replay <run_id> --whatif --workflow thorough` simulates: "What would thorough have found?"
**Scope:** Mostly built. Needs integration into the `run` skill (emit `decision.point` events at each choice). The replay script needs the what-if simulation logic.
### 4.2 Run Comparison
**Problem:** No way to evaluate whether workflow X is better than workflow Y for a project.
**Solution:** `/af-replay compare <run_a> <run_b>`:
```
Run A (standard, 4m30s, $0.80): 5 findings, 4 resolved, 1 INFO remaining
Run B (thorough, 12m, $2.10): 7 findings, 6 resolved, 1 INFO remaining
Delta: +2 findings (both INFO), +163% cost, +167% time
Verdict: Standard was sufficient for this task.
```
**Scope:** Update `archeflow-replay.sh` with comparison mode. Needs at least 2 runs on similar tasks.
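The delta arithmetic is straightforward once each run is summarized as JSON. A sketch with assumed field names (not the real event schema; `run_delta` is a hypothetical helper):

```shell
# run_delta: given two run summaries as JSON, print the comparison deltas.
run_delta() {
  jq -n --argjson a "$1" --argjson b "$2" '{
    findings_delta: ($b.findings - $a.findings),
    cost_delta_pct: (((($b.cost - $a.cost) / $a.cost) * 100) | round),
    time_delta_pct: (((($b.seconds - $a.seconds) / $a.seconds) * 100) | round)
  }'
}
```

The verdict line ("Standard was sufficient") would come from checking whether the extra findings in the more expensive run include anything above INFO.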
---
## Implementation Order
```
v0.9.0 — Sprint Intelligence
1.1 Queue from issues
1.2 Cost estimation
1.3 Smart workflow selection
v0.10.0 — Learning Loop
2.1 Confidence calibration
2.2 Archetype auto-tuning
2.3 Campaign memory
v0.11.0 — Integration
3.1 Findings as PR comments
3.2 CI hook mode
3.3 Watch mode (stretch)
v0.12.0 — Analysis
4.1 Decision journal (mostly done)
4.2 Run comparison
```
Each version is independently shippable. No version depends on a later one.
## What NOT to Build
- **Web dashboard** — Terminal is the interface. Don't add a server.
- **Embedding-based memory** — Keyword matching works. Don't add vector DBs.
- **Agent marketplace** — Focus on the 7 built-in archetypes being excellent.
- **Multi-user collaboration** — ArcheFlow is a single-user tool. Git is the collaboration layer.
- **Plugin system for plugins** — ArcheFlow IS a plugin. Don't go meta.


@@ -1,5 +1,11 @@
 # ArcheFlow — Status Log
+## 2026-04-06: Run replay (v0.9.0)
+- `lib/archeflow-decision.sh` — append `decision.point` (phase, archetype, input, decision, confidence).
+- `lib/archeflow-replay.sh` — `timeline` / `whatif` (weighted archetypes, threshold) / `compare`; optional `--json`.
+- Skill `af-replay`, plugin bump, DAG renders `decision.point`, `tests/archeflow-replay.bats`.
 ## 2026-04-04: Triple Release Sprint (v0.4 → v0.6)
 ### What happened


@@ -7,7 +7,7 @@ const path = require("path");
 try {
 const pluginRoot = path.resolve(__dirname, "..");
-const skillFile = path.join(pluginRoot, "skills", "using-archeflow", "SKILL.md");
+const skillFile = path.join(pluginRoot, "skills", "using-archeflow", "ACTIVATION.md");
 if (!fs.existsSync(skillFile)) {
 console.log("{}");


@@ -87,6 +87,9 @@ EVENTS_PARSED=$(jq -r '
 elif .type == "agent.complete" then
 (.data.archetype // .agent // "unknown") + " (" + .phase + ")" +
 (if (.data.tokens // 0) > 0 then " [" + (.data.tokens | tostring) + " tok]" else "" end)
+elif .type == "decision.point" then
+(.data.archetype // .agent // "?") + " → " + (.data.decision // "?") +
+" (conf " + ((.data.confidence // 0) | tostring) + ")"
 elif .type == "decision" then
 "decision: " + (.data.what // "unknown") + " → " + (.data.chosen // "unknown")
 elif .type == "phase.transition" then
@@ -209,7 +212,7 @@ render_node() {
 local colored_label
 case "$type" in
 phase.transition) colored_label="${C_TRANS}${label}${C_RESET}" ;;
-decision) colored_label="${C_DECISION}${label}${C_RESET}" ;;
+decision|decision.point) colored_label="${C_DECISION}${label}${C_RESET}" ;;
 review.verdict) colored_label="${C_VERDICT}${label}${C_RESET}" ;;
 *) colored_label="${pc}${label}${C_RESET}" ;;
 esac

lib/archeflow-decision.sh (new executable file)

@@ -0,0 +1,48 @@
#!/usr/bin/env bash
# archeflow-decision.sh — Log a PDCA decision point for run replay / effectiveness analysis.
#
# Appends a decision.point event to .archeflow/events/<run_id>.jsonl with:
# phase, archetype (agent + data.archetype), input, decision, confidence, ts (via event layer)
#
# Usage:
# ./lib/archeflow-decision.sh <run_id> <phase> <archetype> '<input>' '<decision>' <confidence> [parent_seq]
#
# Examples:
# ./lib/archeflow-decision.sh 2026-04-06-auth check guardian \
# 'diff + proposal risks' 'needs_changes' 0.82 7
# ./lib/archeflow-decision.sh 2026-04-06-auth act "" 'route findings' 'send_to_maker' 0.9
#
# confidence: 0.0-1.0 (orchestrator-estimated certainty in the recorded choice)
#
# Requires: jq (via archeflow-event.sh)
set -euo pipefail
LIB_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
if [[ $# -lt 6 ]]; then
  echo "Usage: $0 <run_id> <phase> <archetype> '<input>' '<decision>' <confidence> [parent_seq]" >&2
  exit 1
fi
RUN_ID="$1"
PHASE="$2"
ARCH="$3"
INPUT="$4"
DECISION="$5"
CONF_RAW="$6"
PARENT="${7:-}"
if ! [[ "$CONF_RAW" =~ ^[0-9]*\.?[0-9]+$ ]]; then
  echo "Error: confidence must be a number (e.g. 0.85)" >&2
  exit 1
fi
DATA=$(jq -cn \
  --arg a "$ARCH" \
  --arg i "$INPUT" \
  --arg d "$DECISION" \
  --argjson c "$CONF_RAW" \
  '{archetype:$a, input:$i, decision:$d, confidence:$c}')
exec "$LIB_DIR/archeflow-event.sh" "$RUN_ID" decision.point "$PHASE" "$ARCH" "$DATA" "$PARENT"


@@ -8,6 +8,9 @@
 # ./lib/archeflow-event.sh 2026-04-03-der-huster agent.complete plan creator '{"duration_ms":167522}' 2
 # ./lib/archeflow-event.sh 2026-04-03-der-huster phase.transition do "" '{"from":"plan","to":"do"}' 3,4
 # ./lib/archeflow-event.sh 2026-04-03-der-huster fix.applied act "" '{"source":"guardian"}' 8
+# ./lib/archeflow-event.sh 2026-04-03-der-huster decision.point check guardian \
+#   '{"archetype":"guardian","input":"diff","decision":"needs_changes","confidence":0.85}' 7
+# # Or use: ./lib/archeflow-decision.sh <run_id> <phase> <arch> '<input>' '<decision>' <confidence> [parent]
 #
 # Parent seqs: comma-separated seq numbers of causal parent events (DAG).
 # "2" → single parent [2]

lib/archeflow-replay.sh (new executable file)

@@ -0,0 +1,228 @@
#!/usr/bin/env bash
# archeflow-replay.sh — Inspect recorded runs: decision timeline and weighted what-if replay.
#
# Usage:
# archeflow-replay.sh timeline <run_id>
# archeflow-replay.sh whatif <run_id> [--weights arch=w,arch2=w2] [--threshold 0.5] [--json]
# archeflow-replay.sh compare <run_id> [--weights ...] [--threshold ...] [--json]
#
# Events file: .archeflow/events/<run_id>.jsonl (relative to current working directory)
#
# whatif / compare:
# - Loads check-phase review.verdict events (last verdict per archetype).
# - Original gate (strict): BLOCK if any reviewer is not approved.
# - Replay gate (weighted): BLOCK if sum(weight * strict) / sum(weight) >= threshold,
# where strict=1 for non-approved verdicts, else 0. Default weight per archetype is 1.0.
#
# Requires: jq
set -euo pipefail
if [[ $# -lt 2 ]]; then
  echo "Usage: $0 {timeline|whatif|compare} <run_id> [options]" >&2
  echo "" >&2
  echo "  timeline <run_id>   Decision timeline (decision.point + review.verdict)" >&2
  echo "  whatif <run_id> [--weights k=v,...] [--threshold 0.5] [--json]" >&2
  echo "  compare <run_id>    (timeline + whatif summary)" >&2
  exit 1
fi
COMMAND="$1"
RUN_ID="$2"
shift 2
if ! command -v jq &>/dev/null; then
  echo "Error: jq is required." >&2
  exit 1
fi
EVENT_FILE=".archeflow/events/${RUN_ID}.jsonl"
resolve_event_file() {
  if [[ ! -f "$EVENT_FILE" ]]; then
    echo "Error: event file not found: $EVENT_FILE" >&2
    exit 1
  fi
}
cmd_timeline() {
  resolve_event_file
  echo "## Decision timeline — run_id=${RUN_ID}"
  echo ""
  local cnt
  cnt=$(jq -s '[.[] | select(.type == "decision.point")] | length' "$EVENT_FILE")
  if [[ "$cnt" -gt 0 ]]; then
    echo "### decision.point (${cnt})"
    jq -r 'select(.type == "decision.point")
      | "- \(.ts) [\(.phase)] \(.data.archetype // .agent // "?") \(.data.decision) conf=\(.data.confidence // "n/a") input=\(.data.input // "")"' \
      "$EVENT_FILE"
    echo ""
  else
    echo "### decision.point"
    echo "(none — emit with ./lib/archeflow-decision.sh during the run)"
    echo ""
  fi
  echo "### review.verdict (check phase)"
  if jq -e -s '[.[] | select(.type == "review.verdict" and .phase == "check")] | length > 0' "$EVENT_FILE" >/dev/null 2>&1; then
    jq -r 'select(.type == "review.verdict" and .phase == "check")
      | "- \(.ts) \(.data.archetype // .agent // "?") verdict=\(.data.verdict) findings=\((.data.findings // []) | length)"' \
      "$EVENT_FILE"
  else
    echo "(none)"
  fi
  echo ""
}
parse_weights_to_json() {
  local raw="${1:-}"
  local obj='{}'
  if [[ -z "$raw" ]]; then
    echo '{}'
    return
  fi
  IFS=',' read -ra pairs <<< "$raw"
  for pair in "${pairs[@]}"; do
    [[ -z "$pair" ]] && continue
    local k="${pair%%=*}"
    local v="${pair#*=}"
    k=$(echo "$k" | tr '[:upper:]' '[:lower:]' | xargs)
    v=$(echo "$v" | xargs)
    if [[ -z "$k" || "$k" == "$pair" ]]; then
      echo "Error: invalid weight entry (use arch=1.5): $pair" >&2
      exit 1
    fi
    obj=$(echo "$obj" | jq --arg k "$k" --argjson v "$v" '. + {($k): $v}')
  done
  echo "$obj"
}
cmd_whatif() {
  local weights_str=""
  local threshold="0.5"
  local json_out="false"
  while [[ $# -gt 0 ]]; do
    case "$1" in
      --weights)
        weights_str="$2"
        shift 2
        ;;
      --threshold)
        threshold="$2"
        shift 2
        ;;
      --json)
        json_out="true"
        shift
        ;;
      *)
        echo "Unknown option: $1" >&2
        exit 1
        ;;
    esac
  done
  resolve_event_file
  local weights_json
  weights_json="$(parse_weights_to_json "$weights_str")"
  local result
  result=$(jq -s --argjson weights "$weights_json" --argjson thr "$threshold" --arg run_id "$RUN_ID" '
    def strict($v):
      if $v == null then 1
      else ($v | ascii_downcase) as $lv
        | if ($lv == "approved" or $lv == "approve") then 0 else 1 end
      end;
    def norm_key: ascii_downcase;
    ([.[] | select(.type == "review.verdict" and .phase == "check")]
      | sort_by(.seq)
      | reduce .[] as $e ({}; . + { (($e.data.archetype // $e.agent // "unknown") | norm_key): $e })
    ) as $last |
    ($last | keys) as $keys |
    if ($keys | length) == 0 then
      {
        run_id: $run_id,
        error: "no check-phase review.verdict events; nothing to simulate"
      }
    else
      [ $keys[] as $k | $last[$k] as $ev |
        ($weights[($k | norm_key)] // 1.0) as $w
        | strict($ev.data.verdict) as $s
        | {
            archetype: ($ev.data.archetype // $ev.agent // $k),
            verdict: ($ev.data.verdict // "unknown"),
            weight: $w,
            strict: $s,
            weighted_contrib: ($w * $s)
          }
      ] as $rows |
      ($rows | map(.weighted_contrib) | add) as $num |
      ($rows | map(.weight) | add) as $den |
      (if $den > 0 then ($num / $den) else 0 end) as $ratio |
      (if ($rows | map(.strict) | max) == 1 then "BLOCK" else "SHIP" end) as $strict_out |
      (if $ratio >= $thr then "BLOCK" else "SHIP" end) as $replay_out |
      {
        run_id: $run_id,
        threshold: $thr,
        weights_used: $weights,
        strict_any_veto: {
          outcome: $strict_out,
          description: "BLOCK if any reviewer verdict is not approved"
        },
        weighted_replay: {
          weighted_strictness: ($ratio * 1000 | round / 1000),
          outcome: $replay_out,
          description: ("BLOCK if weighted strictness >= " + ($thr | tostring))
        },
        reviewers: $rows
      }
    end
  ' "$EVENT_FILE")
  if [[ "$json_out" == "true" ]]; then
    echo "$result"
  else
    echo "$result" | jq -r '
      if .error then "Error: \(.error)" else
        "# What-if replay — run_id=\(.run_id)\n",
        "",
        "## Outcomes",
        "| Model | Result |",
        "|-------|--------|",
        "| Original (any non-approve → BLOCK) | \(.strict_any_veto.outcome) |",
        "| Weighted replay (threshold=\(.threshold)) | \(.weighted_replay.outcome) |",
        "",
        "## Weighted strictness",
        "\(.weighted_replay.weighted_strictness) (0 = all approved, 1 = all blocking)",
        "",
        "## Per reviewer",
        "| Archetype | Verdict | Weight | Strict | w×strict |",
        "|-----------|---------|--------|--------|----------|",
        (.reviewers[] | "| \(.archetype) | \(.verdict) | \(.weight) | \(.strict) | \(.weighted_contrib) |"),
        "",
        (if (.weights_used | length) > 0 then
          "## Custom weights applied\n" + (.weights_used | to_entries | map("- \(.key): \(.value)") | join("\n")) + "\n"
        else empty end)
      end
    '
  fi
}
cmd_compare() {
  cmd_timeline
  echo ""
  cmd_whatif "$@"
}
case "$COMMAND" in
  timeline) cmd_timeline ;;
  whatif) cmd_whatif "$@" ;;
  compare) cmd_compare "$@" ;;
  *)
    echo "Unknown command: $COMMAND" >&2
    exit 1
    ;;
esac

paper/Makefile (new file)

@@ -0,0 +1,18 @@
# Build the ArcheFlow paper
# Usage: make (build PDF)
# make clean (remove build artifacts)
MAIN = archeflow
.PHONY: all clean
all: $(MAIN).pdf
$(MAIN).pdf: $(MAIN).tex references.bib
	pdflatex $(MAIN)
	bibtex $(MAIN)
	pdflatex $(MAIN)
	pdflatex $(MAIN)
clean:
	rm -f $(addprefix $(MAIN).,aux bbl blg log out pdf toc lof lot nav snm vrb)

paper/archeflow.tex (new file)

@@ -0,0 +1,880 @@
\documentclass[11pt,a4paper]{article}
% ---- Packages ----
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{amsmath,amssymb}
\usepackage{graphicx}
\usepackage{booktabs}
\usepackage{hyperref}
\usepackage{xcolor}
\usepackage{listings}
\usepackage{subcaption}
\usepackage{tikz}
\usetikzlibrary{shapes,arrows.meta,positioning,fit,calc}
\usepackage[numbers]{natbib}
\usepackage{geometry}
\geometry{margin=1in}
% ---- Listings style ----
\lstset{
basicstyle=\ttfamily\small,
breaklines=true,
frame=single,
framesep=3pt,
columns=flexible,
keepspaces=true,
showstringspaces=false,
commentstyle=\color{gray},
keywordstyle=\color{blue!70!black},
}
% ---- Title ----
\title{%
ArcheFlow: Multi-Agent Orchestration with\\
Archetypal Roles and PDCA Quality Cycles%
}
\author{
Christian Nennemann\\
Independent Researcher\\
\texttt{chris@nennemann.de}\\
\texttt{https://github.com/XORwell/archeflow}
}
\date{April 2026}
\begin{document}
\maketitle
% ============================================================
\begin{abstract}
We present \textsc{ArcheFlow}, an open-source orchestration framework for
multi-agent software engineering that assigns \emph{archetypal roles}---derived
from Jungian analytical psychology---to LLM agents and coordinates them through
\emph{Plan--Do--Check--Act} (PDCA) quality cycles. Each of seven archetypes
(Explorer, Creator, Maker, Guardian, Skeptic, Trickster, Sage) carries a defined
cognitive virtue and a quantitatively detected \emph{shadow}---a failure mode
triggered when the virtue becomes excessive. The framework implements a
three-layer corrective action system (archetype shadows, system shadows, policy
boundaries) that detects and mitigates agent dysfunction during autonomous
operation. We describe ArcheFlow's architecture as a zero-dependency plugin for
Claude Code, detail its attention filtering, feedback routing, convergence
detection, and effectiveness scoring mechanisms, and discuss connections to
recent work on persona stability in language models
\citep{lu2026assistant}. ArcheFlow demonstrates that structured persona
assignment with shadow detection can maintain productive agent behavior across
extended autonomous sessions spanning multiple projects and quality domains
(code, prose, research). The system is publicly available under the MIT license.
\end{abstract}
% ============================================================
\section{Introduction}
\label{sec:introduction}
The rise of agentic coding assistants---tools that autonomously write, test,
review, and commit code---has created a new class of software engineering
challenges. While individual LLM agents can produce competent code, the quality
of autonomous output degrades under conditions that are well-known from human
software teams: reviewers who rubber-stamp, architects who over-engineer,
implementers who ignore specifications, and testers who optimize for coverage
metrics rather than real defects.
These failure modes are not merely analogies. \citet{lu2026assistant}
demonstrate that language models occupy a measurable \emph{persona space} and
can drift from their trained Assistant identity during extended conversations,
particularly under emotional or philosophical pressure. Their ``Assistant
Axis''---a dominant directional component in activation space---predicts when
models will exhibit uncharacteristic behavior. If a single model drifts, a
multi-agent system where each agent maintains a distinct persona faces
compounded persona management challenges.
ArcheFlow addresses this problem by drawing on two established frameworks:
\begin{enumerate}
\item \textbf{Jungian archetypal psychology} \citep{jung1968archetypes}, which
provides a taxonomy of cognitive orientations---each with a productive
\emph{virtue} and a destructive \emph{shadow}---that map naturally onto
software engineering roles.
\item \textbf{PDCA quality cycles} \citep{deming1986out}, which provide a
convergence mechanism for iterative refinement with measurable exit criteria.
\end{enumerate}
The contribution of this paper is threefold:
\begin{itemize}
\item We present a \emph{shadow detection framework} that quantitatively
identifies agent dysfunction---not through sentiment analysis or output
classification, but through structural metrics (output length, finding ratios,
scope violations) specific to each archetype's failure mode (Section~\ref{sec:shadows}).
\item We describe \emph{attention filters} and \emph{feedback routing} mechanisms
that constrain what each agent sees and where its output flows, preventing the
information overload and echo chamber effects that plague na\"ive multi-agent
systems (Section~\ref{sec:attention}).
\item We demonstrate that PDCA convergence detection---including oscillation
analysis and divergence scoring---provides principled stopping criteria for
iterative review cycles (Section~\ref{sec:convergence}).
\end{itemize}
ArcheFlow is implemented as a zero-dependency plugin (Bash + Markdown) for
Claude Code\footnote{\url{https://claude.ai/claude-code}}, Anthropic's CLI
coding assistant. It has been used in production across a portfolio of 10--30
repositories spanning code, creative writing, and academic research.
% ============================================================
\section{Related Work}
\label{sec:related}
\subsection{Multi-Agent Software Engineering}
Multi-agent systems for software engineering have proliferated since 2024.
\citet{hong2024metagpt} propose MetaGPT, which assigns human-like roles
(product manager, architect, engineer) to LLM agents and enforces structured
communication through Standardized Operating Procedures (SOPs). ChatDev
\citep{qian2024chatdev} simulates a virtual software company with role-playing
agents communicating through natural language chat. SWE-Agent
\citep{yang2024sweagent} focuses on single-agent benchmark performance on
GitHub issues, demonstrating that tool-augmented agents can resolve real-world
bugs.
These systems share a common limitation: roles are defined by \emph{job
descriptions} rather than \emph{cognitive orientations}. A ``product manager''
agent may behave identically to a ``tech lead'' agent when both receive the same
context, because the role boundary is semantic rather than structural. ArcheFlow
addresses this through attention filters (Section~\ref{sec:attention}) that
physically restrict what each agent perceives, ensuring that role differences
manifest in behavior rather than merely in prompts.
\subsection{Persona Stability in Language Models}
\citet{lu2026assistant} identify the ``Assistant Axis'' in LLM activation
space---a linear direction capturing the degree to which a model operates in its
default helpful mode versus an alternative persona. Their key findings are
directly relevant to multi-agent orchestration:
\begin{enumerate}
\item \textbf{Persona space is low-dimensional}: only 4--19 principal
components explain 70\% of persona variance across 275 character archetypes.
\item \textbf{Drift is predictable}: user message embeddings predict response
position along the Assistant Axis ($R^2 = 0.53$--$0.77$).
\item \textbf{Drift correlates with harm}: models are more liable to produce
harmful outputs when drifted from the Assistant identity ($r = 0.39$--$0.52$).
\end{enumerate}
ArcheFlow's shadow detection (Section~\ref{sec:shadows}) can be understood as an
\emph{application-level} analog to activation capping: where \citet{lu2026assistant}
constrain neural activations to maintain persona stability, ArcheFlow constrains
\emph{behavioral outputs} through quantitative triggers and corrective prompts.
Both approaches recognize that productive personas require active stabilization,
not merely initial assignment.
\subsection{Quality Cycles in Software Engineering}
The Plan--Do--Check--Act (PDCA) cycle, formalized by \citet{deming1986out} and
rooted in Shewhart's statistical process control \citep{shewhart1939statistical},
is the dominant quality improvement framework in manufacturing and has been
applied to software engineering through agile retrospectives and continuous
improvement. To our knowledge, ArcheFlow is the first system to apply PDCA
cycles to multi-agent LLM orchestration with formal convergence detection and
oscillation analysis.
\subsection{Jungian Archetypes in Computing}
While Jungian archetypes have been applied in user experience design
\citep{hartson2012ux}, brand strategy, and game design, their application to
AI agent systems is novel. The closest related work is in computational
creativity, where archetypal narratives have been used to structure story
generation \citep{winston2011strong}. ArcheFlow extends this to software
engineering by mapping archetypal virtues and shadows to measurable engineering
outcomes.
% ============================================================
\section{Architecture}
\label{sec:architecture}
ArcheFlow is a plugin for Claude Code that operates entirely through prompt
engineering, shell scripts, and file-based communication. It has zero runtime
dependencies beyond Bash and a compatible LLM backend.
\begin{figure}[t]
\centering
\begin{tikzpicture}[
node distance=1.2cm and 2cm,
phase/.style={draw, rounded corners, minimum width=2.5cm, minimum height=0.8cm, font=\small\bfseries},
agent/.style={draw, rounded corners, minimum width=2cm, minimum height=0.6cm, font=\small, fill=blue!5},
arrow/.style={-{Stealth[length=3mm]}, thick},
label/.style={font=\scriptsize, text=gray},
]
% PDCA Cycle
\node[phase, fill=yellow!20] (plan) {Plan};
\node[phase, fill=green!20, right=of plan] (do) {Do};
\node[phase, fill=orange!20, right=of do] (check) {Check};
\node[phase, fill=red!15, right=of check] (act) {Act};
% Plan agents
\node[agent, below left=0.8cm and 0.3cm of plan] (explorer) {Explorer};
\node[agent, below right=0.8cm and 0.3cm of plan] (creator) {Creator};
% Do agent
\node[agent, below=0.8cm of do] (maker) {Maker};
% Check agents
\node[agent, below left=0.8cm and -0.2cm of check] (guardian) {Guardian};
\node[agent, below=0.8cm of check] (skeptic) {Skeptic};
\node[agent, below right=0.8cm and -0.2cm of check] (sage) {Sage};
% Arrows
\draw[arrow] (plan) -- (do);
\draw[arrow] (do) -- (check);
\draw[arrow] (check) -- (act);
\draw[arrow, dashed] (act.south) -- ++(0,-0.5) -| node[label, below, pos=0.25] {cycle back} (plan.south);
% Agent connections
\draw[-] (plan.south) -- (explorer.north);
\draw[-] (plan.south) -- (creator.north);
\draw[-] (do.south) -- (maker.north);
\draw[-] (check.south) -- (guardian.north);
\draw[-] (check.south) -- (skeptic.north);
\draw[-] (check.south) -- (sage.north);
\end{tikzpicture}
\caption{ArcheFlow PDCA cycle with archetypal agent assignments. The dashed arrow represents cycle-back when reviewers find issues. A Trickster agent (not shown) joins the Check phase in \texttt{thorough} workflows.}
\label{fig:pdca}
\end{figure}
\subsection{Components}
The system comprises four component types:
\begin{description}
\item[Agent personas] (\texttt{agents/*.md}): Behavioral protocols for each
archetype, defining the agent's cognitive lens, output format, and quality
criteria. Each persona is a Markdown file loaded as a system prompt.
\item[Skills] (\texttt{skills/*/SKILL.md}): Operational instructions that
Claude Code follows to orchestrate the PDCA cycle. The core \texttt{run} skill
(466 lines) is self-contained---it encodes the complete orchestration protocol
including workflow selection, agent spawning, attention filtering, convergence
checking, and exit decisions.
\item[Library scripts] (\texttt{lib/*.sh}): Ten Bash scripts handling
infrastructure concerns: JSONL event logging, git operations (per-phase
commits, branch management, rollback), cross-run memory, progress tracking,
effectiveness scoring, and run replay.
\item[Hooks] (\texttt{hooks/}): Session-start hook that auto-activates
ArcheFlow and injects the domain detection logic.
\end{description}
\subsection{Execution Modes}
ArcheFlow provides three execution modes optimized for different use cases:
\begin{description}
\item[Sprint] (\texttt{/af-sprint}): Queue-driven parallel dispatch. Reads a
priority-ordered task queue, spawns 3--5 agents across different projects
simultaneously, collects results, commits, and starts the next batch. Designed
for throughput over ceremony.
\item[Review] (\texttt{/af-review}): Guardian-led post-implementation review
on existing diffs, branches, or commit ranges. No planning or implementation
orchestration---pure quality analysis.
\item[Run] (\texttt{/af-run}): Full PDCA orchestration for complex tasks
requiring structured exploration, design, implementation, and multi-perspective
review.
\end{description}
\subsection{Domain Adaptation}
ArcheFlow adapts its terminology and quality criteria based on domain detection:
\texttt{code} (diffs, tests, security), \texttt{writing} (voice consistency,
dialect authenticity, narrative structure), and \texttt{research} (source quality,
argument coherence, citation accuracy). Domain is auto-detected from project
contents or specified in configuration.
% ============================================================
\section{The Seven Archetypes}
\label{sec:archetypes}
Each archetype embodies a cognitive orientation with a defined virtue (productive
mode) and shadow (destructive mode). Table~\ref{tab:archetypes} summarizes the
complete taxonomy.
\begin{table}[t]
\centering
\caption{The seven ArcheFlow archetypes with their PDCA phase assignments,
cognitive virtues, shadow failure modes, and model tiers.}
\label{tab:archetypes}
\begin{tabular}{@{}lllll@{}}
\toprule
\textbf{Archetype} & \textbf{Phase} & \textbf{Virtue} & \textbf{Shadow} & \textbf{Model Tier} \\
\midrule
Explorer & Plan & Contextual Clarity & Rabbit Hole & Haiku \\
Creator & Plan & Decisive Framing & Over-Architect & Sonnet \\
Maker & Do & Execution Discipline & Rogue & Sonnet \\
Guardian & Check & Threat Intuition & Paranoid & Sonnet \\
Skeptic & Check & Assumption Surfacing & Paralytic & Haiku \\
Trickster & Check & Adversarial Creativity & False Alarm & Haiku \\
Sage & Check & Maintainability Judgment & Bureaucrat & Haiku \\
\bottomrule
\end{tabular}
\end{table}
The archetype--shadow pairing is not metaphorical; it is the core mechanism
for maintaining agent quality. The virtue describes \emph{what} the archetype
contributes; the shadow describes what happens when that contribution becomes
excessive. An Explorer who never stops researching (Rabbit Hole) delays the
entire pipeline. A Guardian who rejects everything (Paranoid) prevents any
code from shipping.
\subsection{Cost-Aware Model Assignment}
Not all archetypes require the same model capability. Analytical tasks
(exploration, assumption checking, code quality review) can be performed by
cheaper models (Haiku), while creative tasks (architecture design,
implementation, security analysis) benefit from more capable models (Sonnet).
This tiered assignment reduces per-run costs by 40--60\% compared to using the
most capable model for all agents, with no observed quality degradation in
analytical roles.
% ============================================================
\section{Shadow Detection and Corrective Action}
\label{sec:shadows}
\subsection{Archetype Shadows}
Shadow detection is \emph{quantitative, not sentiment-based}. Each archetype has
a specific trigger condition derived from structural properties of its output:
\begin{table}[h]
\centering
\caption{Shadow detection triggers. Each trigger is evaluated automatically
after the agent completes.}
\label{tab:shadows}
\begin{tabular}{@{}lll@{}}
\toprule
\textbf{Archetype} & \textbf{Shadow} & \textbf{Trigger} \\
\midrule
Explorer & Rabbit Hole & Output $> 2000$ words without Recommendation section \\
Creator & Over-Architect & $> 2$ new abstractions for a single feature \\
Maker & Rogue & No tests in changeset, or files outside proposal scope \\
Guardian & Paranoid & CRITICAL:WARNING ratio $> 2{:}1$, or zero approvals \\
Skeptic & Paralytic & $> 7$ challenges with $< 50\%$ having alternatives \\
Trickster & False Alarm & Findings in untouched code, or $> 10$ total findings \\
Sage & Bureaucrat & Review length $> 2\times$ code change length \\
\bottomrule
\end{tabular}
\end{table}
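Because each trigger is purely structural, detection reduces to simple checks over the completed agent's output. A minimal Python sketch of two triggers follows; function names and data shapes are illustrative (the actual implementation is a Bash skill):

```python
def rabbit_hole(output: str) -> bool:
    """Explorer shadow: more than 2000 words without a Recommendation section."""
    return len(output.split()) > 2000 and "Recommendation" not in output

def paralytic(challenges: list[dict]) -> bool:
    """Skeptic shadow: more than 7 challenges with < 50% offering alternatives."""
    if len(challenges) <= 7:
        return False
    with_alternative = sum(1 for c in challenges if c.get("alternative"))
    return with_alternative / len(challenges) < 0.5
```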
The escalation protocol follows a three-strike pattern:
\begin{enumerate}
\item \textbf{First detection}: Inject a correction prompt that names the
shadow and redirects the agent toward its virtue.
\item \textbf{Second detection} (same shadow, same run): Replace the agent
with a fresh instance.
\item \textbf{Third detection}: Escalate to the user for manual intervention.
\end{enumerate}
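The three-strike protocol can be sketched as a small state machine keyed on the (archetype, shadow) pair; the action names below are illustrative, not the implementation's literals:

```python
from collections import Counter

class ShadowEscalator:
    """Three-strike protocol: correct, then replace, then escalate to the user.
    Strikes are counted per (archetype, shadow) pair within a single run."""

    def __init__(self):
        self.strikes = Counter()

    def on_detection(self, archetype: str, shadow: str) -> str:
        self.strikes[(archetype, shadow)] += 1
        n = self.strikes[(archetype, shadow)]
        if n == 1:
            return "inject-correction"   # name the shadow, redirect to virtue
        if n == 2:
            return "replace-agent"       # fresh instance, same archetype
        return "escalate-to-user"        # manual intervention
```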
\subsection{System Shadows}
Beyond individual archetype dysfunction, ArcheFlow monitors for
\emph{system-level} failure modes:
\begin{description}
\item[Echo Chamber]: Multiple reviewers produce identical findings, suggesting
they are confirming each other rather than applying independent judgment.
Detected when $> 60\%$ of findings across reviewers share the same
file-and-category tuple.
\item[Tunnel Vision]: All findings cluster in a single file or module while
the changeset spans multiple. Detected when $> 80\%$ of findings target
$< 20\%$ of changed files.
\item[Scope Creep]: Maker modifies files not mentioned in the Creator's
proposal. Detected by comparing \texttt{do-maker-files.txt} against the
proposal's file list.
\end{description}
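The two statistical system shadows can be sketched directly from their thresholds; finding dictionaries with \texttt{file} and \texttt{category} keys are an assumed representation:

```python
from collections import Counter

def echo_chamber(findings: list[dict]) -> bool:
    """> 60% of findings across reviewers share one (file, category) tuple."""
    if not findings:
        return False
    tuples = Counter((f["file"], f["category"]) for f in findings)
    return tuples.most_common(1)[0][1] / len(findings) > 0.6

def tunnel_vision(findings: list[dict], changed_files: list[str]) -> bool:
    """> 80% of findings target < 20% of the changed files."""
    if not findings or not changed_files:
        return False
    per_file = Counter(f["file"] for f in findings)
    covered, used = 0, 0
    # Take files in descending finding count until > 80% of findings are covered.
    for _, count in per_file.most_common():
        covered += count
        used += 1
        if covered / len(findings) > 0.8:
            return used / len(changed_files) < 0.2
    return False
```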
\subsection{Policy Boundaries and the Wiggum Break}
The third layer enforces operational limits through budget gates, cycle
limits, and checkpoint policies. When limits are exceeded, the system
triggers a \emph{Wiggum Break}\footnote{Named after Chief Wiggum from
\emph{The Simpsons}---a nod to both ``policy enforcement'' and the
Ralph Loop plugin for Claude Code.}---a circuit breaker that halts
execution, saves state, and reports to the user.
Wiggum Breaks are classified as \emph{hard} (halt immediately) or
\emph{soft} (finish current task, then halt):
\begin{description}
\item[Hard breaks]: 3 consecutive agent failures, 3 consecutive shadow
detections in one run, test suite broken after merge, 2+ oscillating
findings.
\item[Soft breaks]: convergence score $< 0.5$ for 2 consecutive cycles,
findings unchanged between cycles, budget $> 95\%$ spent.
\end{description}
Each Wiggum Break emits a \texttt{wiggum.break} event capturing the
trigger, run state, and unresolved findings for post-run analysis.
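Since all ArcheFlow state lives in JSONL event logs, emitting a break is an append of one structured record. A sketch (field names are illustrative; the real emitter is a Bash library script):

```python
import json
import time

def emit_wiggum_break(log_path: str, kind: str, trigger: str,
                      run_state: dict, unresolved: list) -> None:
    """Append a wiggum.break event to the JSONL event log.
    `kind` is "hard" (halt immediately) or "soft" (finish task, then halt)."""
    event = {
        "type": "wiggum.break",
        "ts": time.time(),
        "kind": kind,                      # hard | soft
        "trigger": trigger,                # e.g. "3-consecutive-shadow-detections"
        "run_state": run_state,
        "unresolved_findings": unresolved,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(event) + "\n")
```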
\subsection{Connection to the Assistant Axis}
The shadow detection framework addresses the same fundamental problem identified
by \citet{lu2026assistant}: models drift from productive personas during
extended operation. Where their work identifies drift in activation space and
proposes activation capping as a mitigation, ArcheFlow operates at the
\emph{behavioral} level---detecting drift through output structure rather than
internal representations, and correcting through prompt injection rather than
activation manipulation.
This application-level approach has a practical advantage: it requires no access
to model internals and works with any LLM backend, including API-only models
where activation-level interventions are impossible. The tradeoff is that
behavioral detection is necessarily coarser than activation-level measurement
and can only detect drift after it manifests in output, not before.
% ============================================================
\section{Attention Filters and Information Flow}
\label{sec:attention}
A key design principle is that each agent receives \emph{only the information
relevant to its role}. This is implemented through \emph{attention filters}---rules
governing which artifacts from prior phases are injected into each agent's
context.
\begin{table}[h]
\centering
\caption{Attention filter matrix. Each agent receives only the artifacts marked
with \checkmark.}
\label{tab:attention}
\begin{tabular}{@{}lccccc@{}}
\toprule
\textbf{Agent} & \textbf{Task} & \textbf{Explorer} & \textbf{Creator} & \textbf{Diff} & \textbf{Reviews} \\
\midrule
Explorer & \checkmark & & & & \\
Creator & \checkmark & \checkmark & & & \\
Maker & \checkmark & & \checkmark & & \\
Guardian & & & (risks) & \checkmark & \\
Skeptic & & & \checkmark & & \\
Sage & & & \checkmark & \checkmark & \\
Trickster & & & & \checkmark & \\
\bottomrule
\end{tabular}
\end{table}
The rationale for attention filtering is twofold:
\begin{enumerate}
\item \textbf{Independence}: Reviewers who see each other's findings tend to
converge on a shared narrative rather than applying independent judgment. By
isolating reviewer inputs, ArcheFlow ensures that each reviewer contributes a
genuinely distinct perspective.
\item \textbf{Focus}: An agent given everything tends to address everything,
producing diluted analysis. The Trickster, for example, receives \emph{only}
the diff---no design rationale, no risk analysis---forcing it to evaluate the
code purely on its own terms.
\end{enumerate}
In PDCA cycle 2+, the feedback from the Act phase is routed selectively:
Creator-routed issues go to the Creator, Maker-routed issues go to the Maker.
Neither sees the other's feedback, preventing defensive responses to criticism
that was directed elsewhere.
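The filter matrix is static data, so context assembly is a lookup and a projection. A sketch mirroring Table~\ref{tab:attention} (artifact key names are illustrative):

```python
# Attention filter matrix: each agent sees only these artifacts.
ATTENTION = {
    "Explorer":  {"task"},
    "Creator":   {"task", "explorer_report"},
    "Maker":     {"task", "creator_proposal"},
    "Guardian":  {"creator_risks", "diff"},   # only the proposal's risks section
    "Skeptic":   {"creator_proposal"},
    "Sage":      {"creator_proposal", "diff"},
    "Trickster": {"diff"},
}

def build_context(agent: str, artifacts: dict) -> dict:
    """Inject only the artifacts the agent's attention filter allows."""
    return {k: v for k, v in artifacts.items() if k in ATTENTION[agent]}
```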
% ============================================================
\section{Feedback Routing}
\label{sec:routing}
When the Check phase identifies issues, the Act phase must decide where to route
each finding for the next cycle. ArcheFlow uses a deterministic routing table
based on the source archetype and finding category:
\begin{table}[h]
\centering
\caption{Feedback routing table. Findings are routed to the agent best equipped
to address them, preventing cross-contamination.}
\label{tab:routing}
\begin{tabular}{@{}llll@{}}
\toprule
\textbf{Source} & \textbf{Category} & \textbf{Routes To} & \textbf{Rationale} \\
\midrule
Guardian & security, breaking-change & Creator & Design must change \\
Guardian & reliability, dependency & Creator & Architectural decision \\
Skeptic & design, scalability & Creator & Assumptions need revision \\
Sage & quality, consistency & Maker & Implementation refinement \\
Sage & testing & Maker & Test gap, not design flaw \\
Trickster & reliability (design flaw) & Creator & Needs redesign \\
Trickster & reliability (test gap) & Maker & Needs more tests \\
\bottomrule
\end{tabular}
\end{table}
The disambiguation principle: if fixing the issue requires changing the
\emph{approach}, route to Creator. If it requires changing the \emph{code within
the existing approach}, route to Maker. Findings that persist across two
consecutive cycles are escalated to the user rather than cycled indefinitely.
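Because the routing is deterministic, it can be expressed as a lookup table plus the two-cycle escalation rule. A sketch (category strings are illustrative labels for the rows of Table~\ref{tab:routing}):

```python
# (source archetype, finding category) -> target agent.
ROUTES = {
    ("Guardian", "security"): "Creator",
    ("Guardian", "breaking-change"): "Creator",
    ("Guardian", "reliability"): "Creator",
    ("Guardian", "dependency"): "Creator",
    ("Skeptic", "design"): "Creator",
    ("Skeptic", "scalability"): "Creator",
    ("Sage", "quality"): "Maker",
    ("Sage", "consistency"): "Maker",
    ("Sage", "testing"): "Maker",
    ("Trickster", "reliability-design-flaw"): "Creator",
    ("Trickster", "reliability-test-gap"): "Maker",
}

def route(finding: dict, persisted_cycles: int) -> str:
    """Route a finding; anything persisting two consecutive cycles escalates."""
    if persisted_cycles >= 2:
        return "User"
    return ROUTES[(finding["source"], finding["category"])]
```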
% ============================================================
\section{Convergence Detection}
\label{sec:convergence}
\subsection{Convergence Score}
In PDCA cycle 2+, ArcheFlow compares current findings against the previous cycle
and classifies each as \textsc{New}, \textsc{Resolved}, \textsc{Persistent}, or
\textsc{Regressed}. The convergence score is:
\begin{equation}
C = \frac{|\textsc{Resolved}|}{|\textsc{Resolved}| + |\textsc{New}| + |\textsc{Regressed}|}
\label{eq:convergence}
\end{equation}
\begin{table}[h]
\centering
\caption{Convergence score interpretation and corresponding actions.}
\label{tab:convergence}
\begin{tabular}{@{}lll@{}}
\toprule
\textbf{Score Range} & \textbf{Status} & \textbf{Action} \\
\midrule
$C > 0.8$ & Converging & Continue if cycles remain \\
$0.5 \leq C \leq 0.8$ & Stalling & Continue with caution \\
$C < 0.5$ & Diverging & Stop if 2 consecutive diverging cycles \\
$C = 0$ & Stuck & Stop immediately \\
\bottomrule
\end{tabular}
\end{table}
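Representing findings as hashable fingerprints, the classification and Equation~\ref{eq:convergence} reduce to set arithmetic. A sketch (the zero-denominator convention is an assumption; the paper does not specify it):

```python
def convergence(prev: set, curr: set, resolved_earlier: set) -> tuple[float, str]:
    """Classify findings against the previous cycle and compute
    C = |Resolved| / (|Resolved| + |New| + |Regressed|).
    Findings are hashable fingerprints (e.g. file:category:hash)."""
    resolved = len(prev - curr)
    regressed = len(curr & resolved_earlier)
    new = len(curr - prev - resolved_earlier)
    denom = resolved + new + regressed
    c = resolved / denom if denom else 0.0
    if c == 0:
        status = "stuck"
    elif c < 0.5:
        status = "diverging"
    elif c <= 0.8:
        status = "stalling"
    else:
        status = "converging"
    return c, status
```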
\subsection{Oscillation Detection}
A finding is \emph{oscillating} if it was present in cycle $n-2$, absent in
cycle $n-1$, and present again in cycle $n$. Two or more oscillating findings
trigger an immediate stop with escalation to the user, as oscillation indicates
a fundamental tension in the review criteria that automated cycles cannot
resolve.
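With the same fingerprint representation, oscillation is a three-way set operation over the cycle history:

```python
def oscillating(history: list[set]) -> set:
    """Findings present in cycle n-2, absent in n-1, present again in n."""
    if len(history) < 3:
        return set()
    two_ago, previous, current = history[-3], history[-2], history[-1]
    return (two_ago - previous) & current

def should_break(history: list[set]) -> bool:
    """Two or more oscillating findings trigger an immediate stop."""
    return len(oscillating(history)) >= 2
```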
\subsection{Adaptive Workflow Escalation}
Convergence detection interacts with workflow selection through Rule A1: if the
run uses a \texttt{fast} workflow and the Guardian reports $\geq 2$ CRITICAL
findings, the next cycle escalates to \texttt{standard} (adding the Skeptic and
Sage reviewers). Once escalated, the workflow remains escalated for the
duration of the run.
Conversely, Rule A2 provides a \emph{fast path}: if the Guardian reports zero CRITICAL
and zero WARNING findings, remaining reviewers are skipped entirely, and the
system proceeds directly to Act. This optimization reduces the cost of runs
where the Maker's implementation is clean.
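The two rules can be sketched as pure functions over a Guardian severity summary (the \texttt{critical}/\texttt{warning} field names are an assumed representation):

```python
def next_workflow(current: str, guardian: dict) -> str:
    """Rule A1: a fast workflow escalates to standard when the Guardian
    reports two or more CRITICAL findings; escalation is sticky."""
    if current == "fast" and guardian["critical"] >= 2:
        return "standard"
    return current

def skip_remaining_reviewers(guardian: dict) -> bool:
    """Rule A2 fast path: a clean Guardian report (no CRITICAL, no WARNING)
    skips the remaining reviewers and proceeds directly to Act."""
    return guardian["critical"] == 0 and guardian["warning"] == 0
```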
% ============================================================
\section{Evidence Validation}
\label{sec:evidence}
Reviewer findings are subject to evidence validation before they influence
routing decisions. A CRITICAL or WARNING finding is downgraded to INFO if:
\begin{itemize}
\item It uses \emph{banned hedging phrases} without supporting evidence:
``might be'', ``could potentially'', ``appears to'', ``seems like'', ``may not''.
\item It contains \emph{no evidence}: no command output, code citation, line
reference, or reproduction steps.
\end{itemize}
This mechanism addresses a well-known failure mode of LLM reviewers: generating
plausible-sounding but unsupported concerns. By requiring evidence for
high-severity findings, ArcheFlow forces reviewers to ground their analysis in
the actual changeset rather than speculation.
Downgrades are tracked in the event log but do \emph{not} modify the original
artifact files, preserving the complete reviewer output for post-run analysis.
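A sketch of the downgrade rule; the evidence pattern below is illustrative (the paper does not enumerate what counts as evidence beyond command output, code citations, line references, and reproduction steps):

```python
import re

# Hedging phrases that, absent evidence, downgrade a finding (from the paper).
BANNED = ["might be", "could potentially", "appears to", "seems like", "may not"]
# Heuristic evidence markers: shell output, code fences, line references, repro steps.
EVIDENCE = re.compile(r"(\$ |```|line \d+|:\d+|steps to reproduce)", re.IGNORECASE)

def validate(finding: dict) -> dict:
    """Downgrade CRITICAL/WARNING findings that lack supporting evidence."""
    if finding["severity"] not in ("CRITICAL", "WARNING"):
        return finding
    hedged = any(p in finding["body"].lower() for p in BANNED)
    if not EVIDENCE.search(finding["body"]):
        reason = "hedging-without-evidence" if hedged else "no-evidence"
        return {**finding, "severity": "INFO", "downgrade_reason": reason}
    return finding
```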
% ============================================================
\section{Effectiveness Scoring}
\label{sec:effectiveness}
After each completed run, ArcheFlow scores review archetypes across five
dimensions:
\begin{table}[h]
\centering
\caption{Effectiveness scoring dimensions and their weights.}
\label{tab:effectiveness}
\begin{tabular}{@{}lp{7cm}r@{}}
\toprule
\textbf{Dimension} & \textbf{Description} & \textbf{Weight} \\
\midrule
Signal-to-noise & Ratio of useful findings to total findings & 0.30 \\
Fix rate & Fraction of findings that led to applied fixes & 0.25 \\
Cost efficiency & Useful findings per dollar of model inference cost & 0.20 \\
Accuracy & Fraction not contradicted by other reviewers & 0.15 \\
Cycle impact & Whether findings contributed to cycle exit decision & 0.10 \\
\bottomrule
\end{tabular}
\end{table}
Scores accumulate in a cross-run memory file
(\texttt{.archeflow/memory/effectiveness.jsonl}). After 10+ completed runs,
the system recommends model tier changes (e.g., promoting a Haiku-tier reviewer
to Sonnet if its signal-to-noise is consistently high) and, in extreme cases,
archetype removal for persistently low-scoring reviewers.
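The composite score is a weighted sum over normalized dimensions; the promotion cutoff below is illustrative, since the paper does not fix a numeric threshold:

```python
WEIGHTS = {"signal_to_noise": 0.30, "fix_rate": 0.25, "cost_efficiency": 0.20,
           "accuracy": 0.15, "cycle_impact": 0.10}

def effectiveness(metrics: dict) -> float:
    """Weighted sum of the five dimensions, each normalized to [0, 1]."""
    return round(sum(w * metrics[k] for k, w in WEIGHTS.items()), 4)

def recommend_tier(history: list[float], current: str) -> str:
    """After 10+ runs, promote a consistently strong Haiku-tier reviewer.
    The 0.7 cutoff is an illustrative assumption."""
    if len(history) >= 10 and current == "haiku" and min(history) > 0.7:
        return "sonnet"
    return current
```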
% ============================================================
\section{Cross-Run Memory}
\label{sec:memory}
ArcheFlow maintains a lesson-learning system that persists across runs. When
recurring findings are detected---the same category of issue appearing in
multiple runs---the system stores a lesson and injects it into future agents
as additional context.
Lessons decay over time: each lesson has a relevance counter that increments on
reuse and decrements on irrelevance. Lessons that fall below a threshold are
archived rather than injected, preventing the accumulation of stale guidance.
The memory system also performs regression detection: if a previously resolved
issue reappears, it is flagged as a regression with higher priority than a
fresh finding.
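The decay and regression mechanics can be sketched as follows; lesson fields and the archival threshold are assumed representations:

```python
def touch(lesson: dict, was_relevant: bool) -> dict:
    """Relevance counter: increments on reuse, decrements on irrelevance."""
    delta = 1 if was_relevant else -1
    return {**lesson, "relevance": lesson["relevance"] + delta}

def injectable(lessons: list[dict], threshold: int = 0) -> list[dict]:
    """Lessons at or below the threshold are archived rather than injected."""
    return [l for l in lessons if l["relevance"] > threshold]

def classify(fingerprint: str, resolved_earlier: set) -> str:
    """Regression detection: a previously resolved issue that reappears is
    flagged with higher priority than a fresh finding."""
    return "regression" if fingerprint in resolved_earlier else "fresh"
```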
% ============================================================
\section{Implementation}
\label{sec:implementation}
ArcheFlow is implemented in approximately 6,700 lines across three layers:
\begin{itemize}
\item \textbf{Skills} (19 Markdown files, $\sim$2,500 lines): Operational
instructions for Claude Code, written as imperative protocols. The core
\texttt{run} skill encodes the complete PDCA orchestration in 466 lines.
\item \textbf{Agent personas} (7 Markdown files, $\sim$700 lines): Behavioral
protocols defining each archetype's cognitive lens, output format, and
self-review checklist.
\item \textbf{Library scripts} (10 Bash scripts, $\sim$3,500 lines): Event
logging, git operations, memory management, progress tracking, effectiveness
scoring, and run replay.
\end{itemize}
The system uses no database, no API server, and no runtime dependencies beyond
Bash 4+ and a Claude Code installation. All state is stored in JSONL event logs
and Markdown artifact files. This zero-dependency architecture was a deliberate
design choice: orchestration infrastructure that itself requires complex setup
and maintenance undermines the autonomy it is supposed to enable.
\subsection{Git Integration}
ArcheFlow creates per-phase commits, enabling fine-grained rollback. The Maker
operates in a git worktree---an isolated working copy---so its changes do not
affect the main branch until explicitly merged. If post-merge tests fail, the
system auto-reverts the merge and cycles back with ``integration test failure''
feedback.
\subsection{Run Replay}
All orchestration decisions are logged as \texttt{decision.point} events,
enabling post-hoc analysis. The replay system provides:
\begin{itemize}
\item \textbf{Timeline view}: chronological sequence of all decisions with
confidence scores.
\item \textbf{Weighted what-if}: re-evaluation of the ship/block outcome
using different reviewer weights, answering questions like ``would the outcome
have changed if we weighted Guardian 2x and Sage 0.5x?''
\item \textbf{Cross-run comparison}: side-by-side analysis of decision
patterns across runs.
\end{itemize}
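The weighted what-if can be sketched as re-scoring the logged findings under alternative weights. The aggregation below (weighted CRITICAL count against a block threshold) is an illustrative assumption; the paper does not specify the exact decision function:

```python
def whatif(findings: list[dict], weights: dict,
           block_threshold: float = 1.0) -> str:
    """Re-evaluate the ship/block outcome under alternative reviewer weights.
    Each CRITICAL finding contributes its reviewer's weight; the run blocks
    when the weighted total reaches the threshold."""
    score = sum(weights.get(f["source"], 1.0)
                for f in findings if f["severity"] == "CRITICAL")
    return "block" if score >= block_threshold else "ship"
```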
% ============================================================
\section{Multi-Domain Application}
\label{sec:domains}
ArcheFlow's archetype system extends beyond code. The framework has been
deployed across three domains:
\subsection{Software Engineering}
The primary domain. Archetypes map to standard engineering roles: Explorer
performs codebase research, Creator designs architecture, Maker writes code,
and the Check-phase archetypes review for security (Guardian), design flaws
(Skeptic), edge cases (Trickster), and overall quality (Sage).
\subsection{Creative Writing}
In writing mode, the same archetype structure applies with adapted quality
criteria. Custom archetypes (story-explorer, story-sage) replace or augment
the defaults. The framework integrates with Colette, a voice profiling system
that maintains consistent authorial voice across chapters. Quality gates check
for voice consistency, dialect authenticity, and narrative structure rather
than test coverage and security.
\subsection{Academic Research}
In research mode, quality criteria shift to source quality, argument coherence,
citation accuracy, and methodological rigor. The Guardian reviews for logical
fallacies and unsupported claims rather than security vulnerabilities.
% ============================================================
\section{Discussion}
\label{sec:discussion}
\subsection{Archetypes vs. Role Descriptions}
The key distinction between ArcheFlow's approach and prior multi-agent systems
is the \emph{shadow} mechanism. A role description tells an agent what to do;
an archetype tells an agent what to do \emph{and what doing too much of it
looks like}. This bidirectional specification creates a bounded operating
range for each agent, preventing the unbounded optimization that leads to
dysfunction.
The connection to \citet{lu2026assistant}'s persona axis is instructive.
They show that model personas exist on a continuum, with the Assistant identity
at one extreme and theatrical/mystical identities at the other. ArcheFlow's
archetypes deliberately position agents \emph{away} from the default Assistant
toward specific cognitive orientations---but the shadow mechanism prevents them
from drifting too far, maintaining a productive operating range analogous to
what \citeauthor{lu2026assistant} achieve through activation capping.
\subsection{Wiggum Breaks as Human-in-the-Loop Boundaries}
A central question in autonomous agent systems is: \emph{when should the
system stop acting and ask a human?} Most frameworks treat this as an
implementation detail---a timeout, a retry limit, an exception handler.
ArcheFlow treats it as a first-class architectural concept through the
\emph{Wiggum Break}.
The Wiggum Break defines the \textbf{formal boundary between autonomous and
human-supervised operation}. It is not a failure mode; it is the system's
\emph{designed} response to situations where autonomous resolution is
provably unproductive:
\begin{itemize}
\item \textbf{Oscillation} (finding present $\to$ absent $\to$ present)
indicates a genuine tension in the review criteria that no amount of
cycling will resolve---only human judgment about which criterion takes
priority.
\item \textbf{Divergence} (convergence score $< 0.5$ for two consecutive
cycles) indicates that the implementation is getting worse with each
iteration---the agents lack the context or capability to solve the
problem, and continuing wastes resources.
\item \textbf{Repeated shadow detection} (same dysfunction three times)
indicates that the corrective action framework has exhausted its
options---the task structure is incompatible with the assigned archetype,
and a human must re-scope.
\end{itemize}
This framing inverts the typical HITL paradigm. Rather than asking
``how much autonomy should the system have?'' and pre-defining approval
gates, ArcheFlow asks ``under what conditions is autonomy
\emph{provably unproductive}?'' and derives the HITL boundary from
convergence theory. The system runs autonomously by default and escalates
only when it can demonstrate---through quantitative metrics, not
heuristics---that continued autonomous operation will not improve the
outcome.
This approach has three advantages over pre-defined approval gates:
\begin{enumerate}
\item \textbf{Adaptive autonomy}: Simple tasks never trigger a Wiggum
Break; complex tasks trigger one quickly. The HITL boundary adapts to
task difficulty without manual configuration.
\item \textbf{Auditable escalation}: Every Wiggum Break emits a
\texttt{wiggum.break} event with the trigger condition, run state, and
unresolved findings. The human receives not just a request for help,
but a structured summary of \emph{why} autonomous resolution failed
and what specifically needs their judgment.
\item \textbf{Minimal interruption}: Pre-defined gates (``approve every
PR'', ``review every design'') interrupt the human on tasks the system
could have handled autonomously. Convergence-derived breaks interrupt
only when the system has evidence that it cannot proceed productively.
\end{enumerate}
The Wiggum Break thus operationalizes a principle from resilience
engineering: the system should be \emph{autonomy-seeking} (preferring to
resolve issues itself) but \emph{escalation-ready} (able to produce a
useful handoff when self-resolution fails). The quality of the handoff---not
just the fact of escalation---is what makes HITL effective.
\subsection{Limitations}
\begin{enumerate}
\item \textbf{No activation-level control}: ArcheFlow operates purely at the
prompt level. It cannot detect persona drift before it manifests in output,
unlike activation-level approaches \citep{lu2026assistant}.
\item \textbf{Single LLM backend}: The current implementation targets Claude
Code. While the architectural principles are model-agnostic, the skill and
hook system is specific to Claude Code's plugin API.
\item \textbf{Evaluation methodology}: We have not conducted controlled
experiments comparing ArcheFlow's output quality against baselines (single-agent,
role-based multi-agent without shadows, PDCA without archetypes). The system
has been evaluated through production use across real projects, which
demonstrates practical utility but not causal attribution.
\item \textbf{Shadow trigger thresholds}: The quantitative thresholds
(e.g., 2000 words for Rabbit Hole, ratio $> 2{:}1$ for Paranoid) were
determined empirically through iterative use and may not generalize across
all codebases and domains.
\end{enumerate}
\subsection{Future Work}
\begin{enumerate}
\item \textbf{Activation-level integration}: Combining behavioral shadow
detection with the Assistant Axis measurement from \citet{lu2026assistant}
could provide earlier and more reliable drift detection, particularly for
open-weight models where activations are accessible.
\item \textbf{Controlled evaluation}: A systematic comparison across standard
benchmarks (SWE-bench, HumanEval) would establish whether the archetype +
PDCA approach provides measurable quality improvements over simpler
orchestration strategies.
\item \textbf{Archetype discovery}: Rather than hand-designing archetypes,
the persona space analysis from \citet{lu2026assistant} could be used to
identify \emph{natural} cognitive orientations that models adopt, potentially
revealing useful archetypes that human intuition would not suggest.
\item \textbf{Cross-model persona stability}: Investigating whether shadow
triggers calibrated for one model family transfer to others, or whether
per-model calibration is necessary.
\end{enumerate}
% ============================================================
\section{Conclusion}
\label{sec:conclusion}
ArcheFlow demonstrates that multi-agent LLM orchestration benefits from
structured persona management---not just telling agents \emph{what to do},
but actively monitoring and correcting \emph{how they do it}. The combination
of Jungian archetypes (providing a principled taxonomy of cognitive virtues and
their failure modes) with PDCA quality cycles (providing convergence guarantees
and principled stopping criteria) produces an orchestration framework that
maintains productive agent behavior across extended autonomous sessions.
The shadow detection mechanism---quantitative triggers for archetype-specific
dysfunction---addresses the same persona stability challenge identified by
\citet{lu2026assistant} at the application level, requiring no access to model
internals and working with any LLM backend. While coarser than activation-level
approaches, behavioral shadow detection is practical, interpretable, and
immediately deployable.
ArcheFlow is open-source under the MIT license and available at
\url{https://github.com/XORwell/archeflow}.
% ============================================================
\section*{Acknowledgments}
The author thanks the Claude Code team at Anthropic for building the plugin
infrastructure that made ArcheFlow possible, and the authors of
\citet{lu2026assistant} for the Assistant Axis framework that informed the
theoretical grounding of shadow detection.
% ============================================================
\bibliographystyle{plainnat}
\bibliography{references}
\end{document}

paper/references.bib
@article{lu2026assistant,
title={The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models},
author={Lu, Christina and Gallagher, Jack and Michala, Jonathan and Fish, Kyle and Lindsey, Jack},
journal={arXiv preprint arXiv:2601.10387},
year={2026},
url={https://arxiv.org/abs/2601.10387}
}
@book{jung1968archetypes,
title={The Archetypes and the Collective Unconscious},
author={Jung, Carl Gustav},
year={1968},
publisher={Princeton University Press},
edition={2nd},
series={Collected Works of C.G. Jung},
volume={9}
}
@book{deming1986out,
title={Out of the Crisis},
author={Deming, W. Edwards},
year={1986},
publisher={MIT Press},
address={Cambridge, MA}
}
@book{shewhart1939statistical,
title={Statistical Method from the Viewpoint of Quality Control},
author={Shewhart, Walter Andrew},
year={1939},
publisher={Graduate School of the Department of Agriculture},
address={Washington, DC}
}
@article{hong2024metagpt,
title={MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework},
author={Hong, Sirui and Zhuge, Mingchen and Chen, Jonathan and Zheng, Xiawu and Cheng, Yuheng and Zhang, Ceyao and Wang, Jinlin and Wang, Zili and Yau, Steven Ka Shing and Lin, Zijuan and Zhou, Liyang and Ran, Chenyu and Xiao, Lingfeng and Wu, Chenglin and Schmidhuber, J{\"u}rgen},
journal={arXiv preprint arXiv:2308.00352},
year={2024},
url={https://arxiv.org/abs/2308.00352}
}
@article{qian2024chatdev,
title={ChatDev: Communicative Agents for Software Development},
author={Qian, Chen and Liu, Wei and Liu, Hongzhang and Chen, Nuo and Dang, Yufan and Li, Jiahao and Yang, Cheng and Chen, Weize and Su, Yusheng and Cong, Xin and Xu, Juyuan and Li, Dahai and Liu, Zhiyuan and Sun, Maosong},
journal={arXiv preprint arXiv:2307.07924},
year={2024},
url={https://arxiv.org/abs/2307.07924}
}
@article{yang2024sweagent,
title={SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering},
author={Yang, John and Jimenez, Carlos E and Wettig, Alexander and Lieret, Kilian and Narasimhan, Karthik and Press, Ofir},
journal={arXiv preprint arXiv:2405.15793},
year={2024},
url={https://arxiv.org/abs/2405.15793}
}
@article{chen2025persona,
title={Persona Vectors: Monitoring and Controlling Character Traits via Activation Directions},
author={Chen, Yiwei and others},
journal={arXiv preprint arXiv:2507.21509},
year={2025},
url={https://arxiv.org/abs/2507.21509}
}
@article{bai2022constitutional,
title={Constitutional AI: Harmlessness from AI Feedback},
author={Bai, Yuntao and Kadavath, Saurav and Kundu, Sandipan and Askell, Amanda and Kernion, Jackson and Jones, Andy and Chen, Anna and Goldie, Anna and Mirhoseini, Azalia and McKinnon, Cameron and others},
journal={arXiv preprint arXiv:2212.08073},
year={2022},
url={https://arxiv.org/abs/2212.08073}
}
@book{hartson2012ux,
title={The UX Book: Process and Guidelines for Ensuring a Quality User Experience},
author={Hartson, Rex and Pyla, Pardha S.},
year={2012},
publisher={Morgan Kaufmann},
address={Burlington, MA}
}
@inproceedings{winston2011strong,
title={The Strong Story Hypothesis and the Directed Perception Hypothesis},
author={Winston, Patrick Henry},
booktitle={AAAI Fall Symposium: Advances in Cognitive Systems},
year={2011},
pages={345--352}
}

paper/taxonomy-refs.bib
% ---- Agent Frameworks ----
@article{hong2024metagpt,
title={MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework},
author={Hong, Sirui and Zhuge, Mingchen and Chen, Jonathan and Zheng, Xiawu and Cheng, Yuheng and Zhang, Ceyao and Wang, Jinlin and Wang, Zili and Yau, Steven Ka Shing and Lin, Zijuan and Zhou, Liyang and Ran, Chenyu and Xiao, Lingfeng and Wu, Chenglin and Schmidhuber, J{\"u}rgen},
journal={arXiv preprint arXiv:2308.00352},
year={2024},
url={https://arxiv.org/abs/2308.00352}
}
@article{qian2024chatdev,
title={ChatDev: Communicative Agents for Software Development},
author={Qian, Chen and Liu, Wei and Liu, Hongzhang and Chen, Nuo and Dang, Yufan and Li, Jiahao and Yang, Cheng and Chen, Weize and Su, Yusheng and Cong, Xin and Xu, Juyuan and Li, Dahai and Liu, Zhiyuan and Sun, Maosong},
journal={arXiv preprint arXiv:2307.07924},
year={2024},
url={https://arxiv.org/abs/2307.07924}
}
@article{wu2023autogen,
title={AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation},
author={Wu, Qingyun and Bansal, Gagan and Zhang, Jieyu and Wu, Yiran and Li, Beibin and Zhu, Erkang and Jiang, Li and Zhang, Xiaoyun and Zhang, Shaokun and Liu, Jiale and Awadallah, Ahmed Hassan and White, Ryen W. and Burger, Doug and Wang, Chi},
journal={arXiv preprint arXiv:2308.08155},
year={2023},
url={https://arxiv.org/abs/2308.08155}
}
@article{yang2024sweagent,
title={SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering},
author={Yang, John and Jimenez, Carlos E and Wettig, Alexander and Lieret, Kilian and Narasimhan, Karthik and Press, Ofir},
journal={arXiv preprint arXiv:2405.15793},
year={2024},
url={https://arxiv.org/abs/2405.15793}
}
@article{nennemann2026archeflow,
title={ArcheFlow: Multi-Agent Orchestration with Archetypal Roles and PDCA Quality Cycles},
author={Nennemann, Christian},
journal={arXiv preprint},
year={2026},
url={https://github.com/XORwell/archeflow}
}
@article{nguyen2024agilecoder,
title={AgileCoder: Dynamic Collaborative Agents for Software Development based on Agile Methodology},
author={Nguyen, Minh Huynh and Chau, Thang Phan and Phung, Phong X. and Bui, Nghi D. Q.},
journal={arXiv preprint arXiv:2406.11912},
year={2024},
url={https://arxiv.org/abs/2406.11912}
}
@article{patel2026sixsigma,
title={The Six Sigma Agent: Achieving Enterprise-Grade Reliability in LLM Systems Through Consensus-Driven Decomposed Execution},
author={Patel, Rushi and Surendira, Bala and George, Allen and Kapale, Kiran},
journal={arXiv preprint arXiv:2601.22290},
year={2026},
url={https://arxiv.org/abs/2601.22290}
}
@article{shinn2023reflexion,
title={Reflexion: Language Agents with Verbal Reinforcement Learning},
author={Shinn, Noah and Cassano, Federico and Gopinath, Ashwin and Narasimhan, Karthik and Yao, Shunyu},
journal={Advances in Neural Information Processing Systems},
volume={36},
year={2023},
url={https://arxiv.org/abs/2303.11366}
}
@article{xia2024eddops,
title={Evaluation-Driven Development and Operations of LLM Agents: A Process Model and Reference Architecture},
author={Xia, Boming and Lu, Qinghua and Zhu, Liming and Xing, Zhenchang and Zhao, Dehai and Zhang, Hao},
journal={arXiv preprint arXiv:2411.13768},
year={2024},
url={https://arxiv.org/abs/2411.13768}
}
@article{rasheed2024survey,
title={LLM-Based Multi-Agent Systems for Software Engineering: Literature Review, Vision and the Road Ahead},
author={Rasheed, Zeeshan and others},
journal={ACM Transactions on Software Engineering and Methodology},
year={2025},
url={https://arxiv.org/abs/2404.04834}
}
@article{li2023camel,
title={CAMEL: Communicative Agents for ``Mind'' Exploration of Large Language Model Society},
author={Li, Guohao and Hammoud, Hasan Abed Al Kader and Itani, Hani and Khizbullin, Dmitrii and Ghanem, Bernard},
journal={Advances in Neural Information Processing Systems},
volume={36},
year={2023},
url={https://arxiv.org/abs/2303.17760}
}
% ---- Persona Stability ----
@article{lu2026assistant,
title={The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models},
author={Lu, Christina and Gallagher, Jack and Michala, Jonathan and Fish, Kyle and Lindsey, Jack},
journal={arXiv preprint arXiv:2601.10387},
year={2026},
url={https://arxiv.org/abs/2601.10387}
}
% ---- PM/OM Foundations ----
@book{deming1986out,
title={Out of the Crisis},
author={Deming, W. Edwards},
year={1986},
publisher={MIT Press},
address={Cambridge, MA}
}
@book{shewhart1939statistical,
title={Statistical Method from the Viewpoint of Quality Control},
author={Shewhart, Walter Andrew},
year={1939},
publisher={Graduate School of the Department of Agriculture},
address={Washington, DC}
}
@book{goldratt1984goal,
title={The Goal: A Process of Ongoing Improvement},
author={Goldratt, Eliyahu M. and Cox, Jeff},
year={1984},
publisher={North River Press},
address={Great Barrington, MA}
}
@book{ohno1988toyota,
title={Toyota Production System: Beyond Large-Scale Production},
author={Ohno, Taiichi},
year={1988},
publisher={Productivity Press},
address={Portland, OR}
}
@book{womack1996lean,
title={Lean Thinking: Banish Waste and Create Wealth in Your Corporation},
author={Womack, James P. and Jones, Daniel T.},
year={1996},
publisher={Simon \& Schuster},
address={New York}
}
@article{cooper1990stagegate,
title={Stage-Gate Systems: A New Tool for Managing New Products},
author={Cooper, Robert G.},
journal={Business Horizons},
volume={33},
number={3},
pages={44--54},
year={1990},
publisher={Elsevier}
}
@article{snowden2007cynefin,
title={A Leader's Framework for Decision Making},
author={Snowden, David J. and Boone, Mary E.},
journal={Harvard Business Review},
volume={85},
number={11},
pages={68--76},
year={2007}
}
@book{altshuller1999innovation,
title={The Innovation Algorithm: TRIZ, Systematic Innovation and Technical Creativity},
author={Altshuller, Genrich},
year={1999},
publisher={Technical Innovation Center},
address={Worcester, MA}
}
@unpublished{boyd1976destruction,
title={Destruction and Creation},
author={Boyd, John R.},
year={1976},
note={Unpublished manuscript, widely circulated}
}
@book{schwaber2020scrum,
title={The Scrum Guide},
author={Schwaber, Ken and Sutherland, Jeff},
year={2020},
publisher={Scrum.org},
note={Available at \url{https://scrumguides.org}}
}
@techreport{mil1949fmea,
title={MIL-P-1629: Procedures for Performing a Failure Mode, Effects and Criticality Analysis},
institution={United States Department of Defense},
year={1949},
note={Revised as MIL-STD-1629A, 1980}
}

% ---- paper/taxonomy.tex ----
\documentclass[11pt,a4paper]{article}
% ---- Packages ----
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{amsmath,amssymb}
\usepackage{graphicx}
\usepackage{booktabs}
\usepackage{hyperref}
\usepackage{xcolor}
\usepackage{listings}
\usepackage{subcaption}
\usepackage{tikz}
\usetikzlibrary{shapes,arrows.meta,positioning,fit,calc,matrix}
\usepackage[numbers]{natbib}
\usepackage{geometry}
\usepackage{enumitem}
\geometry{margin=1in}
% ---- Colors ----
\definecolor{highfit}{HTML}{2E7D32}
\definecolor{medfit}{HTML}{F57F17}
\definecolor{lowfit}{HTML}{C62828}
\definecolor{neutral}{HTML}{546E7A}
% ---- Title ----
\title{%
From Factory Floor to Token Stream:\\
A Taxonomy of Operations Management Methods\\
for LLM Agent Orchestration%
}
\author{
Christian Nennemann\\
Independent Researcher\\
\texttt{chris@nennemann.de}
}
\date{April 2026}
\begin{document}
\maketitle
% ============================================================
\begin{abstract}
Multi-agent systems built on large language models (LLMs) increasingly adopt
metaphors from human project management---sprints, standups, code review---yet
draw from a remarkably narrow slice of the operations management literature.
This paper presents a systematic taxonomy of twelve established PM/OM methods,
evaluates their structural compatibility with LLM agent constraints (stateless
invocation, cheap cloning, deterministic dysfunction, absence of human
psychology), and identifies which methods are underexploited, which are
inapplicable, and which require fundamental adaptation. We find that methods
designed for \emph{flow optimization} (Kanban, Theory of Constraints) and
\emph{rapid decision-making} (OODA Loop) are structurally well-suited to
agent orchestration but remain largely unexplored, while methods centered on
\emph{human psychology} (Scrum ceremonies, Design Thinking empathy phases)
transfer poorly without significant reformulation. We propose a decision
framework for selecting orchestration methods based on task complexity, agent
count, and quality requirements, and identify five open research directions
at the intersection of operations management and agentic AI.
\end{abstract}
% ============================================================
\section{Introduction}
\label{sec:intro}
The dominant paradigm for multi-agent LLM systems borrows from agile software
development: agents are organized into ``teams'' with role-based
specialization, tasks are decomposed into work items, and results are reviewed
before merging \citep{hong2024metagpt, qian2024chatdev}. This borrowing is
natural---the humans building these systems are software engineers familiar
with agile methods---but it is also narrow. The operations management
literature contains dozens of methods developed over a century of industrial
practice, each encoding different assumptions about workflow structure, quality
assurance, failure modes, and coordination costs.
Not all of these methods are equally applicable to LLM agents. Agents differ
from human workers in five structurally important ways:
\begin{enumerate}[label=\textbf{C\arabic*}]
\item \label{c:stateless} \textbf{Stateless invocation}: Agents do not
retain memory between invocations unless explicitly persisted. Human team
members accumulate institutional knowledge automatically.
\item \label{c:cloning} \textbf{Cheap to clone, expensive to coordinate}:
Spawning a new agent costs milliseconds and cents; coordinating two agents
costs tokens and latency. For human teams, the inverse holds---hiring is
expensive, coordination is (comparatively) cheap.
\item \label{c:dysfunction} \textbf{Deterministic dysfunction}: LLM agents
fail in predictable, repeatable patterns---verbosity, scope creep, false
positives---rather than the varied, context-dependent failures of human
cognition \citep{nennemann2026archeflow}.
\item \label{c:psychology} \textbf{No psychology}: Agents have no morale,
fatigue, ego, or office politics. Methods designed to manage human
psychology (retrospectives, team-building, conflict resolution) have no
direct function.
\item \label{c:speed} \textbf{Cycle speed}: Agents complete tasks in
seconds to minutes, enabling iteration frequencies that would be
impractical for human teams. Methods that assume week-long or month-long
cycles can be compressed.
\end{enumerate}
These constraints define a \emph{fitness landscape}: some PM/OM methods gain
effectiveness when applied to agents (because agents remove friction those
methods were designed to manage), while others lose their raison d'\^etre
(because they solve human problems agents don't have).
This paper contributes:
\begin{itemize}
\item A systematic taxonomy of twelve PM/OM methods evaluated against the
five agent constraints (\ref{c:stateless}--\ref{c:speed}).
\item A compatibility matrix scoring each method's structural fit for
agent orchestration (\S\ref{sec:matrix}).
\item A decision framework for practitioners selecting orchestration
strategies (\S\ref{sec:decision}).
\item Five open research directions at the intersection of operations
management theory and agentic AI (\S\ref{sec:future}).
\end{itemize}
% ============================================================
\section{Background: Current Agent Orchestration Landscape}
\label{sec:background}
\subsection{Frameworks and Their Implicit PM Models}
The current generation of multi-agent LLM frameworks implicitly adopts
project management concepts, though rarely with explicit attribution to
PM/OM theory.
\textbf{MetaGPT} \citep{hong2024metagpt} assigns human job titles (product
manager, architect, engineer) and enforces communication through Standardized
Operating Procedures (SOPs)---an implicit adoption of \emph{waterfall}
phase gates with role-based access control.
\textbf{ChatDev} \citep{qian2024chatdev} simulates a software company with
sequential phases (design, coding, testing, documentation). Despite the
``company'' framing, the execution model is a \emph{linear pipeline} with
pair-programming-style chat between adjacent roles.
\textbf{AgileCoder} \citep{nguyen2024agilecoder} is the first framework to
explicitly adopt sprint-based iteration, assigning Scrum Master and Product
Manager roles to LLM agents with a Dynamic Code Graph Generator tracking
inter-file dependencies between sprints.
\textbf{CrewAI} organizes agents into ``crews'' with a ``manager'' agent
orchestrating task delegation---an implicit \emph{hierarchical management}
model with single-point-of-failure coordination.
\textbf{AutoGen} \citep{wu2023autogen} provides a conversation-based
framework where agents negotiate through multi-turn dialogue. The implicit
model is \emph{committee decision-making}---all agents see all messages,
consensus emerges through discussion.
\textbf{The Six Sigma Agent} \citep{patel2026sixsigma} decomposes tasks
into atomic dependency trees, executes each node $n$ times with independent
LLM samples, and uses consensus voting to achieve defect rates scaling as
$O(p^{\lceil n/2 \rceil})$---reaching 3.4 DPMO (the Six Sigma threshold)
at $n=13$.
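The consensus scaling can be checked with a short binomial-tail calculation (a sketch of the voting argument only; the per-sample error rate $p = 0.05$ is an assumed value for illustration, not a figure from the cited paper):

```python
from math import comb

def majority_vote_error(p: float, n: int) -> float:
    """Probability that majority voting over n independent samples fails,
    given per-sample error rate p (n odd: at least (n+1)//2 wrong votes)."""
    k_min = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(k_min, n + 1))

# With an assumed p = 0.05, the leading term shrinks as p**ceil(n/2);
# at n = 13 the failure rate falls below the 3.4e-6 Six Sigma threshold.
assert majority_vote_error(0.05, 13) < 3.4e-6
```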
\textbf{Reflexion} \citep{shinn2023reflexion} implements a de facto PDCA
loop through verbal reinforcement: Plan $\to$ Act $\to$ Evaluate (Check)
$\to$ Reflect (Act), though it does not name this structure explicitly.
\textbf{ArcheFlow} \citep{nennemann2026archeflow} explicitly applies PDCA
quality cycles with Jungian archetypal roles, representing the first
framework to deliberately adopt a named PM/OM methodology with formal
convergence criteria.
\subsection{The Gap}
Despite the variety of frameworks, the PM/OM methods actually employed
cluster tightly around four approaches: (1) waterfall-style sequential
phases (MetaGPT, ChatDev), (2) role-based team simulation (CAMEL
\citep{li2023camel}, CrewAI), (3) informal ``manager'' delegation
(AutoGen), and (4) agile sprints (AgileCoder). The Six Sigma Agent
\citep{patel2026sixsigma} is a notable exception---the only framework to
explicitly name a PM/OM method as its primary architectural contribution.
Methods from lean manufacturing, constraint theory, military
decision-making, innovation management, and failure analysis remain
unexplored in the peer-reviewed agent orchestration literature, despite
strong structural compatibility with agent constraints.
% ============================================================
\section{Taxonomy of PM/OM Methods}
\label{sec:taxonomy}
We evaluate twelve methods spanning five categories: iterative improvement,
flow optimization, decision-making, innovation management, and quality
engineering. For each method, we describe the core mechanism, evaluate
structural compatibility with agent constraints \ref{c:stateless}--\ref{c:speed},
identify the primary adaptation required, and assess overall fitness.
% ---- 3.1 Iterative Improvement ----
\subsection{Iterative Improvement Methods}
\subsubsection{PDCA (Plan--Do--Check--Act)}
\label{sec:pdca}
\textbf{Origin}: Shewhart \citep{shewhart1939statistical}, popularized by
Deming \citep{deming1986out}.
\textbf{Mechanism}: Four-phase cycle repeated until quality targets are met.
Each cycle narrows the gap between current and desired state through
structured feedback.
\textbf{Agent fitness}: \textsc{High}. PDCA's phase structure maps directly
to agent orchestration: Plan (research + design agents), Do (implementation
agent), Check (review agents), Act (routing + merge decisions). The cycle
abstraction handles the core challenge of ``when to stop iterating'' through
convergence metrics. Demonstrated in ArcheFlow \citep{nennemann2026archeflow}.
\textbf{Key adaptation}: Convergence detection must be automated (human PDCA
relies on subjective judgment). ArcheFlow addresses this with a convergence
score based on finding classification (new, resolved, persistent, regressed)
and oscillation detection.
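A minimal sketch of such an automated convergence check, assuming findings carry stable IDs across Check phases (the stopping criterion here is illustrative, not ArcheFlow's exact scoring):

```python
def classify_findings(history):
    """history: one set of finding IDs per PDCA Check phase (>= 2 cycles).
    Returns the four classes used for convergence scoring."""
    curr, prev = history[-1], history[-2]
    seen_before = set().union(*history[:-1])
    return {
        "new": curr - seen_before,
        "persistent": curr & prev,
        "regressed": (curr & seen_before) - prev,  # resolved earlier, back now
        "resolved": prev - curr,
    }

def oscillating(history, finding):
    """A finding oscillates if it toggles present/absent at least twice."""
    present = [finding in cycle for cycle in history]
    return sum(a != b for a, b in zip(present, present[1:])) >= 2

def converged(history):
    """Illustrative criterion: stop iterating once a cycle introduces
    nothing new, regresses nothing, and carries nothing forward."""
    c = classify_findings(history)
    return not (c["new"] or c["regressed"] or c["persistent"])
```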
\textbf{Constraint fit}: Stateless (\ref{c:stateless})---artifacts persist
state between cycles. Cloning (\ref{c:cloning})---fresh agents per cycle
avoid accumulated bias. Speed (\ref{c:speed})---cycles complete in minutes,
enabling 2--3 cycles where humans would manage one.
\subsubsection{Scrum}
\label{sec:scrum}
\textbf{Origin}: Schwaber \& Sutherland, 1995 \citep{schwaber2020scrum}.
\textbf{Mechanism}: Time-boxed sprints with defined roles (Product Owner,
Scrum Master, Development Team), ceremonies (planning, daily standup,
review, retrospective), and artifacts (backlog, sprint board, burndown).
\textbf{Agent fitness}: \textsc{Low--Medium}. Scrum's ceremony-heavy
structure exists primarily to manage human coordination challenges: standups
maintain shared awareness (agents can share a filesystem), retrospectives
address interpersonal friction (agents have none), sprint planning negotiates
capacity (agents have deterministic throughput). The useful kernel---time-boxed
work with a prioritized backlog---is trivially implementable without Scrum's
overhead.
\textbf{Key adaptation}: Strip ceremonies, keep the backlog + sprint
structure. ``Daily standups'' become status file reads. ``Retrospectives''
become cross-run memory extraction. The Scrum Master role is pure overhead
for agents.
\textbf{Constraint fit}: Psychology (\ref{c:psychology})---most Scrum
ceremonies solve human problems. Speed (\ref{c:speed})---sprint length
compresses from weeks to minutes. Cloning (\ref{c:cloning})---team
stability (a Scrum value) is irrelevant when agents are stateless.
\subsubsection{DMAIC (Six Sigma)}
\label{sec:dmaic}
\textbf{Origin}: Motorola, 1986; systematized by General Electric.
\textbf{Mechanism}: Define--Measure--Analyze--Improve--Control. Unlike PDCA,
DMAIC emphasizes \emph{statistical measurement} of process capability and
explicitly separates analysis (understanding the problem) from improvement
(fixing it).
\textbf{Agent fitness}: \textsc{Medium--High}. The Define--Measure--Analyze
front-loading is valuable for agents: it forces explicit quality metrics
\emph{before} implementation, preventing the common failure mode of agents
optimizing for the wrong objective. The Control phase---establishing
monitoring to prevent regression---maps to cross-run memory systems.
\textbf{Key adaptation}: Agents can compute statistical process control
metrics (defect rates, cycle times, sigma levels) automatically from event
logs. The ``Measure'' phase, which is expensive and tedious for humans,
becomes a strength: agents can instrument everything.
\textbf{Constraint fit}: Speed (\ref{c:speed})---full DMAIC in minutes.
Dysfunction (\ref{c:dysfunction})---agent failure modes have measurable
baselines, making sigma calculations meaningful. Stateless
(\ref{c:stateless})---Control phase requires persistent monitoring, which
must be explicitly built.
% ---- 3.2 Flow Optimization ----
\subsection{Flow Optimization Methods}
\subsubsection{Kanban}
\label{sec:kanban}
\textbf{Origin}: Toyota Production System, Taiichi Ohno, 1950s \citep{ohno1988toyota}.
\textbf{Mechanism}: Pull-based workflow with explicit work-in-progress (WIP)
limits. Work items flow through columns (stages); new work is pulled only
when capacity is available. No iterations---continuous flow.
\textbf{Agent fitness}: \textsc{High}. Kanban's WIP limits directly address
a critical agent challenge: \emph{coordination cost scaling}. Without WIP
limits, spawning more agents increases throughput initially but eventually
degrades quality due to coordination overhead (conflicting changes, merge
conflicts, context fragmentation). Kanban provides a principled mechanism for
determining optimal concurrency.
\textbf{Key adaptation}: WIP limits should be \emph{dynamic}, adjusting
based on observed coordination costs (merge conflicts, finding duplications)
rather than fixed. The pull mechanism maps naturally: agents poll a task
queue and pull the highest-priority item they can handle.
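One way to sketch this pull mechanism in code (the AIMD-style WIP adjustment rule and the cap of 8 are illustrative assumptions, not an established tuning):

```python
import heapq

class AgentKanban:
    """Pull-based task queue with a dynamic WIP limit (illustrative sketch)."""

    def __init__(self, wip_limit: int = 3, cap: int = 8):
        self.wip_limit = wip_limit
        self.cap = cap
        self._queue = []          # (priority, task) min-heap; 0 = most urgent
        self.in_progress = set()

    def add(self, priority: int, task: str) -> None:
        heapq.heappush(self._queue, (priority, task))

    def pull(self):
        """An idle agent pulls the most urgent task, respecting the WIP limit."""
        if len(self.in_progress) >= self.wip_limit or not self._queue:
            return None
        _, task = heapq.heappop(self._queue)
        self.in_progress.add(task)
        return task

    def complete(self, task: str, merge_conflict: bool = False) -> None:
        """Dynamic WIP: halve on an observed coordination failure, creep up
        toward the cap on clean completions."""
        self.in_progress.discard(task)
        if merge_conflict:
            self.wip_limit = max(1, self.wip_limit // 2)
        else:
            self.wip_limit = min(self.cap, self.wip_limit + 1)
```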
\textbf{Constraint fit}: Cloning (\ref{c:cloning})---WIP limits are
\emph{exactly} the missing constraint for cheap-to-clone agents. Speed
(\ref{c:speed})---flow metrics (lead time, cycle time, throughput) update
in real-time. Psychology (\ref{c:psychology})---no ``swarming'' or
``blocked item'' social dynamics to manage.
\subsubsection{Theory of Constraints (TOC)}
\label{sec:toc}
\textbf{Origin}: Goldratt, \emph{The Goal}, 1984 \citep{goldratt1984goal}.
\textbf{Mechanism}: Identify the system's constraint (bottleneck), exploit
it (maximize its throughput), subordinate everything else to it, elevate it
(invest to remove it), repeat. The Five Focusing Steps.
\textbf{Agent fitness}: \textsc{High}. In multi-agent pipelines, the
bottleneck is typically the most capable (and expensive) agent: the
implementation agent that must run on a powerful model, or the security
reviewer that requires deep context. TOC provides a framework for
organizing the entire pipeline around this constraint.
\textbf{Key adaptation}: ``Exploit the constraint'' means ensuring the
bottleneck agent never waits for input. Pre-compute its context, batch
its inputs, and schedule cheaper agents (research, formatting, validation)
to run during its processing time. ``Subordinate'' means cheaper agents
should produce output in the format the bottleneck needs, not in whatever
format is easiest for them.
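The Five Focusing Steps reduce to a small throughput calculation; a sketch (the stage timings and the perfect-parallelism assumption in `elevate` are illustrative):

```python
def find_bottleneck(stage_times):
    """Step 1 (identify): the constraint is the stage with the highest
    service time per work item. stage_times maps stage -> seconds/item."""
    return max(stage_times, key=stage_times.get)

def system_throughput(stage_times):
    """Items/hour of the pipeline = throughput of the constraint alone."""
    return 3600 / max(stage_times.values())

def elevate(stage_times, stage, clones):
    """Step 4 (elevate): parallelize a stage across cheap clones, dividing
    its effective service time (assumes perfectly parallel work items)."""
    out = dict(stage_times)
    out[stage] = out[stage] / clones
    return out
```

Elevating a non-bottleneck stage leaves system throughput unchanged, which is the quantitative content of "subordinate everything else to the constraint."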
\textbf{Constraint fit}: Cloning (\ref{c:cloning})---non-bottleneck agents
are cheap to overprovision. Speed (\ref{c:speed})---constraint shifts can
be detected and responded to within a single run. Dysfunction
(\ref{c:dysfunction})---bottleneck agent's failure mode has outsized impact,
justifying targeted shadow detection.
\subsubsection{Lean / Toyota Production System}
\label{sec:lean}
\textbf{Origin}: Ohno, 1988 \citep{ohno1988toyota}; Womack \& Jones, 1996 \citep{womack1996lean}.
\textbf{Mechanism}: Eliminate waste (\emph{muda}), reduce variability
(\emph{mura}), avoid overburden (\emph{muri}). Seven wastes: overproduction,
waiting, transport, overprocessing, inventory, motion, defects.
\textbf{Agent fitness}: \textsc{Medium--High}. The seven wastes map
surprisingly well to agent systems:
\begin{itemize}[nosep]
\item \textbf{Overproduction}: Agents generating output nobody reads
(verbose research reports, unused alternative proposals).
\item \textbf{Waiting}: Agents idle while waiting for predecessor output
(sequential pipeline where parallel would work).
\item \textbf{Transport}: Redundant context passing (sending full codebase
to agents that need only a diff).
\item \textbf{Overprocessing}: Running thorough review on trivial changes.
\item \textbf{Inventory}: Accumulated artifacts from prior cycles that
are never referenced.
\item \textbf{Motion}: Agents reading files they don't need, exploring
irrelevant code paths.
\item \textbf{Defects}: Findings that are false positives, requiring
rework to dismiss.
\end{itemize}
\textbf{Key adaptation}: Lean's ``respect for people'' pillar has no direct
analog. The technical pillar (continuous improvement, waste elimination)
transfers fully.
% ---- 3.3 Decision-Making ----
\subsection{Decision-Making Methods}
\subsubsection{OODA Loop (Observe--Orient--Decide--Act)}
\label{sec:ooda}
\textbf{Origin}: John Boyd, 1976 \citep{boyd1976destruction}. Military
strategy for air combat; later generalized to competitive decision-making.
\textbf{Mechanism}: Continuous loop of Observe (gather data), Orient (analyze
context, update mental models), Decide (select course of action), Act
(execute). The key insight is that the \emph{speed} of the loop---not any
individual decision's quality---determines competitive advantage. ``Getting
inside the opponent's OODA loop'' means acting faster than the adversary can
react.
\textbf{Agent fitness}: \textsc{High}. OODA is structurally similar to PDCA
but optimized for speed over thoroughness. For agent systems, this maps to
scenarios requiring rapid adaptation: adversarial testing, incident response,
market-reactive coding, or any context where the problem space changes
during execution.
\textbf{Key adaptation}: Boyd's ``Orient'' phase---updating mental models
based on new information---is the hardest to implement for stateless agents.
It requires either persistent state (a world model that updates across
iterations) or a ``fast reorientation'' agent that rapidly synthesizes new
information into an updated context.
\textbf{Constraint fit}: Speed (\ref{c:speed})---agents can OODA at
superhuman frequency. Stateless (\ref{c:stateless})---the Orient phase
needs explicit state management. Psychology (\ref{c:psychology})---Boyd's
concept of ``mental agility'' translates to model selection: smaller, faster
models for rapid OODA; larger models for deep Orient phases.
\subsubsection{Cynefin Framework}
\label{sec:cynefin}
\textbf{Origin}: Snowden \& Boone, 2007 \citep{snowden2007cynefin}.
\textbf{Mechanism}: Classify problems into five domains---\textsc{Clear}
(obvious cause-effect), \textsc{Complicated} (expert analysis needed),
\textsc{Complex} (emergent, probe-sense-respond), \textsc{Chaotic}
(act first, then sense), \textsc{Confused} (unknown domain)---and apply
domain-appropriate strategies.
\textbf{Agent fitness}: \textsc{Medium--High}. Cynefin provides a
\emph{meta-framework}: instead of choosing one orchestration method for all
tasks, classify the task first, then select the appropriate method:
\begin{itemize}[nosep]
\item \textsc{Clear}: Single agent, no review (``fix this typo'').
\item \textsc{Complicated}: Expert agent with review (PDCA fast workflow).
\item \textsc{Complex}: Multiple competing proposals, let results emerge
(PDCA standard/thorough with parallel alternatives).
\item \textsc{Chaotic}: Act immediately, stabilize, then analyze (OODA
with hotfix agent, then PDCA for proper fix).
\end{itemize}
\textbf{Key adaptation}: Task classification must be automated. Proxies:
number of files affected, cross-module dependencies, security sensitivity,
test coverage of affected area.
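Using those proxies, an automated classifier might look like the following (the thresholds and the `live_incident` flag are illustrative assumptions; in practice they would be calibrated against run history):

```python
def classify_task(files_affected, cross_module_deps, security_sensitive,
                  test_coverage, live_incident=False):
    """Map proxy signals onto Cynefin domains (illustrative thresholds)."""
    if live_incident:
        return "chaotic"       # act first, stabilize, analyze afterwards
    if files_affected <= 1 and cross_module_deps == 0 and not security_sensitive:
        return "clear"         # single agent, no review
    if test_coverage >= 0.6 and cross_module_deps <= 2:
        return "complicated"   # expert agent with review
    return "complex"           # parallel proposals, emergent selection

# Domain -> orchestration strategy, following the list above.
STRATEGY = {
    "clear": "single agent, no review",
    "complicated": "PDCA fast",
    "complex": "PDCA standard/thorough with parallel alternatives",
    "chaotic": "OODA hotfix, then PDCA",
}
```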
% ---- 3.4 Innovation Management ----
\subsection{Innovation Management Methods}
\subsubsection{Stage-Gate}
\label{sec:stagegate}
\textbf{Origin}: Cooper, 1990 \citep{cooper1990stagegate}.
\textbf{Mechanism}: Innovation projects pass through stages (scoping,
business case, development, testing, launch), separated by gates where a
cross-functional team decides: Go, Kill, Hold, or Recycle. The gate
decision is binary---no ``continue with reservations.''
\textbf{Agent fitness}: \textsc{Medium}. The gate mechanism maps well to
agent confidence checks: a Creator agent's proposal either meets the
confidence threshold (Go) or doesn't (Kill/Recycle). However, Stage-Gate
assumes expensive stages (weeks/months of human work), making Kill decisions
high-stakes. For agents, stages are cheap (minutes), reducing the value of
formal gate decisions.
\textbf{Key adaptation}: Gates become lightweight confidence checks rather
than committee reviews. The ``Kill'' decision---rare and painful in human
innovation---should be common and cheap for agents. Explore multiple
proposals in parallel, gate aggressively, continue only the best.
\subsubsection{Design Thinking}
\label{sec:designthinking}
\textbf{Origin}: IDEO / Stanford d.school, 2000s.
\textbf{Mechanism}: Five phases: Empathize (understand the user),
Define (frame the problem), Ideate (generate solutions), Prototype (build
quickly), Test (get feedback). Emphasis on user empathy and divergent
thinking.
\textbf{Agent fitness}: \textsc{Low}. Design Thinking's core value
proposition---\emph{empathy with users}---is precisely what LLM agents
cannot genuinely do. Agents can simulate empathy (generate persona-based
scenarios), but the insight that comes from observing real users in context
has no agent equivalent. The Ideate phase (divergent brainstorming) is
feasible but produces quantity over quality without the ``empathy filter''
that makes Design Thinking effective.
\textbf{Key adaptation}: If used, the Empathize phase must be replaced
with explicit user research artifacts (personas, journey maps, interview
transcripts) provided as input. This transforms Design Thinking from a
discovery method into a synthesis method---fundamentally changing its nature.
\subsubsection{TRIZ}
\label{sec:triz}
\textbf{Origin}: Altshuller, 1946--1985 \citep{altshuller1999innovation}.
Theory of Inventive Problem Solving.
\textbf{Mechanism}: Problems contain contradictions (improving one parameter
worsens another). TRIZ provides a contradiction matrix mapping 39 engineering
parameters to 40 inventive principles. Instead of compromise, TRIZ seeks
solutions that resolve the contradiction.
\textbf{Agent fitness}: \textsc{Medium}. TRIZ's structured problem-solving
is well-suited to agents: the contradiction matrix is a lookup table, and
agents can systematically apply inventive principles. However, TRIZ requires
\emph{reformulating the problem as a contradiction}---a creative step that
is itself challenging for agents.
\textbf{Key adaptation}: Provide the contradiction matrix as context. Train
agents to identify the ``improving parameter'' and ``worsening parameter''
in engineering tasks (e.g., ``improving security worsens performance'').
Use TRIZ principles as a structured brainstorming prompt for the Creator
archetype.
% ---- 3.5 Quality Engineering ----
\subsection{Quality Engineering Methods}
\subsubsection{FMEA (Failure Mode and Effects Analysis)}
\label{sec:fmea}
\textbf{Origin}: US Military, 1949 \citep{mil1949fmea}; adopted by
automotive (AIAG) and aerospace.
\textbf{Mechanism}: For each component/process step, systematically
enumerate: (1) potential failure modes, (2) effects of each failure,
(3) causes, (4) current controls, (5) risk priority number
(severity $\times$ occurrence $\times$ detection). Address highest-RPN
items first.
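The RPN bookkeeping is mechanical and suits an analytical agent; a sketch (the 1--10 scales follow standard FMEA practice, the example failure modes below are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    component: str
    mode: str
    severity: int    # 1-10: impact of the effect
    occurrence: int  # 1-10: likelihood of the cause
    detection: int   # 1-10: 10 = hardest to detect with current controls

    @property
    def rpn(self) -> int:
        """Risk Priority Number = severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection

def prioritize(modes):
    """Address highest-RPN items first."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)
```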
\textbf{Agent fitness}: \textsc{High}. FMEA's systematic enumeration is
exactly what LLM agents excel at: given a design, enumerate everything that
could go wrong, assess severity, and propose mitigations. The Risk Priority
Number provides a quantitative framework for prioritizing review effort---more
principled than the common ``CRITICAL/WARNING/INFO'' severity classification.
\textbf{Key adaptation}: Use FMEA \emph{before} implementation (as part of
the Plan phase) rather than only during review. An FMEA agent analyzes the
Creator's proposal and generates a failure mode table; the Maker then
implements with awareness of high-RPN failure modes; the Guardian validates
that mitigations are in place.
\textbf{Constraint fit}: Dysfunction (\ref{c:dysfunction})---agents' own
failure modes can be pre-enumerated via FMEA, creating a meta-level
quality system. Cloning (\ref{c:cloning})---FMEA agents are cheap
(analytical, not creative), enabling systematic coverage.
\subsubsection{Statistical Process Control (SPC)}
\label{sec:spc}
\textbf{Origin}: Shewhart, 1920s \citep{shewhart1939statistical}.
\textbf{Mechanism}: Monitor process outputs over time using control charts.
Distinguish \emph{common cause} variation (inherent to the process) from
\emph{special cause} variation (attributable to specific events). React only
to special causes; reduce common cause variation through process improvement.
\textbf{Agent fitness}: \textsc{Medium--High}. SPC requires historical data,
which agent orchestration systems naturally generate (event logs, finding
counts, cycle times, token usage). Control charts over agent effectiveness
scores can distinguish between normal variation (``Guardian found 2 issues
this run vs. 1 last run'') and genuine degradation (``Guardian's false
positive rate spiked after a model update'').
\textbf{Key adaptation}: Sufficient run history is needed to establish
control limits. Early runs operate without SPC; after 10--20 runs,
control limits become meaningful. Model updates reset control limits
(new process = new baseline).
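A minimal individuals-chart sketch (simplified: it estimates sigma from the sample standard deviation of a baseline window rather than the conventional moving-range method):

```python
from statistics import mean, stdev

def control_limits(baseline):
    """Center line +/- 3 sigma over a baseline window of run metrics
    (meaningful only after enough history, e.g. 10-20 runs)."""
    mu, sigma = mean(baseline), stdev(baseline)
    return mu - 3 * sigma, mu + 3 * sigma

def special_cause(baseline, observation):
    """React only to points outside the control limits; in-control
    variation is common cause and should not trigger intervention."""
    lo, hi = control_limits(baseline)
    return observation < lo or observation > hi
```

For example, with a Guardian finding-count baseline hovering between 1 and 2 per run, a run with 3 findings stays inside the limits (common cause), while a run with 6 signals genuine degradation.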
% ============================================================
\section{Compatibility Matrix}
\label{sec:matrix}
Table~\ref{tab:matrix} scores each method against the five agent constraints,
producing an overall fitness assessment.
\begin{table}[t]
\centering
\small
\caption{Compatibility matrix: PM/OM methods scored against agent constraints.
\textcolor{highfit}{\textbf{+}} = method benefits from this constraint;
\textcolor{lowfit}{\textbf{--}} = method is undermined;
\textcolor{neutral}{\textbf{0}} = neutral.
Overall fitness: H = High, M = Medium, L = Low.}
\label{tab:matrix}
\begin{tabular}{@{}l*{5}{c}c@{}}
\toprule
\textbf{Method} &
\textbf{C1} &
\textbf{C2} &
\textbf{C3} &
\textbf{C4} &
\textbf{C5} &
\textbf{Fit} \\
\midrule
PDCA & \textcolor{neutral}{0} & \textcolor{highfit}{+} & \textcolor{highfit}{+} & \textcolor{neutral}{0} & \textcolor{highfit}{+} & \textbf{H} \\
Scrum & \textcolor{lowfit}{--} & \textcolor{neutral}{0} & \textcolor{neutral}{0} & \textcolor{lowfit}{--} & \textcolor{highfit}{+} & \textbf{L--M} \\
DMAIC & \textcolor{lowfit}{--} & \textcolor{highfit}{+} & \textcolor{highfit}{+} & \textcolor{neutral}{0} & \textcolor{highfit}{+} & \textbf{M--H} \\
Kanban & \textcolor{neutral}{0} & \textcolor{highfit}{+} & \textcolor{neutral}{0} & \textcolor{highfit}{+} & \textcolor{highfit}{+} & \textbf{H} \\
TOC & \textcolor{neutral}{0} & \textcolor{highfit}{+} & \textcolor{highfit}{+} & \textcolor{highfit}{+} & \textcolor{highfit}{+} & \textbf{H} \\
Lean & \textcolor{neutral}{0} & \textcolor{highfit}{+} & \textcolor{neutral}{0} & \textcolor{lowfit}{--} & \textcolor{highfit}{+} & \textbf{M--H} \\
OODA & \textcolor{lowfit}{--} & \textcolor{highfit}{+} & \textcolor{neutral}{0} & \textcolor{highfit}{+} & \textcolor{highfit}{+} & \textbf{H} \\
Cynefin & \textcolor{neutral}{0} & \textcolor{neutral}{0} & \textcolor{neutral}{0} & \textcolor{neutral}{0} & \textcolor{neutral}{0} & \textbf{M--H} \\
Stage-Gate & \textcolor{neutral}{0} & \textcolor{highfit}{+} & \textcolor{neutral}{0} & \textcolor{highfit}{+} & \textcolor{lowfit}{--} & \textbf{M} \\
Design Think. & \textcolor{neutral}{0} & \textcolor{neutral}{0} & \textcolor{neutral}{0} & \textcolor{lowfit}{--} & \textcolor{neutral}{0} & \textbf{L} \\
TRIZ & \textcolor{neutral}{0} & \textcolor{highfit}{+} & \textcolor{neutral}{0} & \textcolor{neutral}{0} & \textcolor{highfit}{+} & \textbf{M} \\
FMEA & \textcolor{neutral}{0} & \textcolor{highfit}{+} & \textcolor{highfit}{+} & \textcolor{highfit}{+} & \textcolor{highfit}{+} & \textbf{H} \\
SPC & \textcolor{lowfit}{--} & \textcolor{neutral}{0} & \textcolor{highfit}{+} & \textcolor{highfit}{+} & \textcolor{highfit}{+} & \textbf{M--H} \\
\bottomrule
\end{tabular}
\end{table}
\subsection{Analysis}
Several patterns emerge from the compatibility matrix:
\textbf{High-fitness methods share three properties}: they are
\emph{mechanistic} (decisions follow rules, not judgment), \emph{flow-oriented}
(optimize throughput, not team dynamics), and \emph{metric-driven} (quality
is quantified, not discussed). PDCA, Kanban, TOC, OODA, and FMEA all share
this profile.
\textbf{Low-fitness methods are psychology-dependent}: Scrum and Design
Thinking derive their primary value from managing human cognitive and social
limitations. Without those limitations, the methods become overhead.
\textbf{The ``Cheap Clone'' constraint is universally beneficial}: every
method either benefits from or is neutral to the ability to spawn agents
cheaply. This suggests that agent orchestration should generally favor
\emph{parallelism}---run multiple approaches simultaneously, then
select the best result.
\textbf{``Stateless'' is the most disruptive constraint}: methods that
assume accumulated knowledge (Scrum's team velocity, SPC's control charts,
DMAIC's baseline measurements) require explicit persistence mechanisms that
agents don't provide natively.
% ============================================================
\section{Hybrid Approaches and Method Composition}
\label{sec:hybrid}
The methods in our taxonomy are not mutually exclusive. Effective agent
orchestration likely requires combining methods at different levels:
\subsection{Proposed Three-Layer Architecture}
\begin{description}
\item[Strategic layer (Cynefin)]: Classify the task and select the
appropriate orchestration method. Simple tasks get a single agent;
complicated tasks get PDCA; complex tasks get parallel competing
approaches; chaotic tasks get OODA.
\item[Operational layer (PDCA/OODA + Kanban)]: Execute the selected
method with flow control. Kanban WIP limits prevent coordination
overload. PDCA provides quality convergence for standard tasks; OODA
provides rapid adaptation for time-sensitive tasks.
\item[Quality layer (FMEA + SPC + TOC)]: Monitor execution quality.
FMEA front-loads failure analysis in the Plan phase. SPC monitors
long-term agent effectiveness trends. TOC identifies and optimizes
around bottleneck agents.
\end{description}
\subsection{ArcheFlow as a Case Study}
ArcheFlow \citep{nennemann2026archeflow} already implements elements of
this three-layer architecture, though without explicitly naming all methods:
\begin{itemize}[nosep]
\item \textbf{Strategic}: Workflow selection (fast/standard/thorough)
functions as a simplified Cynefin classification.
\item \textbf{Operational}: PDCA cycles with convergence detection;
sprint mode with WIP-limited parallel dispatch (implicit Kanban).
\item \textbf{Quality}: Shadow detection (behavioral FMEA for agent
failure modes); effectiveness scoring (rudimentary SPC); Guardian
fast-path (TOC---don't waste the bottleneck on clean code); ``Wiggum
Break'' circuit breakers (hard/soft halt conditions with event logging).
\end{itemize}
The gap is in explicit TOC application (identifying and optimizing around
the most expensive agent) and in OODA integration for time-sensitive tasks.
% ============================================================
\section{Decision Framework}
\label{sec:decision}
We propose a practitioner-oriented decision framework that selects an
orchestration method from the task's Cynefin classification:
\begin{figure}[h]
\centering
\begin{tikzpicture}[
box/.style={draw, rounded corners, minimum width=3.5cm, minimum height=0.7cm, font=\small, fill=#1},
arrow/.style={-{Stealth[length=3mm]}, thick},
]
% Decision tree
\node[box=yellow!20] (start) {Task arrives};
\node[box=orange!15, below=0.8cm of start] (cynefin) {Classify (Cynefin)};
\node[box=green!15, below left=1cm and 2cm of cynefin] (clear) {Clear};
\node[box=green!15, below left=1cm and 0cm of cynefin] (complicated) {Complicated};
\node[box=blue!10, below right=1cm and 0cm of cynefin] (complex) {Complex};
\node[box=red!10, below right=1cm and 2cm of cynefin] (chaotic) {Chaotic};
\node[box=white, below=0.7cm of clear, text width=2.5cm, align=center, font=\scriptsize] (m1) {Single agent\\No review};
\node[box=white, below=0.7cm of complicated, text width=2.5cm, align=center, font=\scriptsize] (m2) {PDCA fast\\+ FMEA};
\node[box=white, below=0.7cm of complex, text width=2.5cm, align=center, font=\scriptsize] (m3) {PDCA thorough\\+ parallel proposals};
\node[box=white, below=0.7cm of chaotic, text width=2.5cm, align=center, font=\scriptsize] (m4) {OODA\\then PDCA};
\draw[arrow] (start) -- (cynefin);
\draw[arrow] (cynefin) -- (clear);
\draw[arrow] (cynefin) -- (complicated);
\draw[arrow] (cynefin) -- (complex);
\draw[arrow] (cynefin) -- (chaotic);
\draw[arrow] (clear) -- (m1);
\draw[arrow] (complicated) -- (m2);
\draw[arrow] (complex) -- (m3);
\draw[arrow] (chaotic) -- (m4);
\end{tikzpicture}
\caption{Decision framework for selecting agent orchestration method
based on Cynefin task classification.}
\label{fig:decision}
\end{figure}
\textbf{Cross-cutting concerns} apply regardless of classification:
\begin{itemize}[nosep]
\item \textbf{Kanban WIP limits}: Always. Prevents coordination overload.
\item \textbf{TOC awareness}: Identify the costliest agent; schedule
others around it.
\item \textbf{SPC monitoring}: After 10+ runs, establish control limits
for agent effectiveness.
\item \textbf{Lean waste audit}: Periodically review token usage patterns
for waste (unused artifacts, redundant context, overprocessing).
\end{itemize}
% ============================================================
\section{Open Research Directions}
\label{sec:future}
\subsection{Adaptive Method Selection}
Current frameworks use a fixed orchestration method. An adaptive system
would classify each incoming task (Cynefin), select the appropriate method,
and switch methods mid-execution if the task's nature changes (e.g.,
a ``complicated'' task reveals unexpected complexity during exploration).
This requires a \emph{method-aware orchestrator} that understands the
assumptions and exit criteria of each method.
\subsection{Kanban for Agent Swarms}
As agent counts increase beyond 5--10, coordination costs dominate.
Kanban's WIP limits and flow metrics provide a theoretical basis for
determining optimal agent concurrency, but empirical studies are needed
to establish how coordination cost scales with agent count across
different task types and model capabilities.
\subsection{OODA for Adversarial Agent Scenarios}
Boyd's OODA loop was designed for competitive environments where speed of
decision-making determines the winner. Applications include adversarial
testing (red team agents vs. blue team agents), competitive code generation
(multiple agents racing to solve a problem), and incident response
(rapid diagnosis and mitigation under time pressure).
\subsection{Cross-Method Quality Metrics}
Each PM/OM method defines quality differently: PDCA uses convergence scores,
Six Sigma uses sigma levels, Lean uses waste ratios, SPC uses control
limits. A unified quality metric for agent orchestration---one that allows
meaningful comparison across methods---does not yet exist.
\subsection{FMEA for Agent Failure Modes}
Agent failure modes (hallucination, scope creep, false positive reviews,
persona drift \citep{lu2026assistant}) can be systematically enumerated
using FMEA methodology. A comprehensive FMEA catalog for LLM agents---with
severity, occurrence, and detection ratings calibrated from empirical
data---would provide a foundation for designing more robust orchestration
systems.
% ============================================================
\section{Conclusion}
\label{sec:conclusion}
The operations management literature offers a rich toolkit for agent
orchestration that extends far beyond the agile methods currently dominant
in the field. Our taxonomy reveals that the highest-fitness methods---PDCA,
Kanban, TOC, OODA, and FMEA---share a common profile: mechanistic,
flow-oriented, and metric-driven. Methods centered on human psychology
(Scrum, Design Thinking) transfer poorly without fundamental reformulation.
The key insight is that LLM agents are not ``fast humans.'' They have
fundamentally different constraint profiles---cheap to clone, expensive to
coordinate, stateless, psychologically inert---and these differences make
some PM/OM methods \emph{more} effective (OODA loops at superhuman speed,
FMEA with exhaustive enumeration) while rendering others irrelevant
(standups without psychology, retrospectives without learning).
We encourage the agent orchestration community to look beyond agile sprints
and role-playing frameworks toward the broader operations management
tradition. A century of industrial practice has much to teach us about
orchestrating intelligent agents---if we take the time to translate.
% ============================================================
\section*{Acknowledgments}
The author thanks the operations management and quality engineering
communities whose work, developed over decades for human organizations,
provides the theoretical foundation for this analysis.
% ============================================================
\bibliographystyle{plainnat}
\bibliography{taxonomy-refs}
\end{document}

scripts/run-tests.sh Executable file

@@ -0,0 +1,34 @@
#!/usr/bin/env bash
# run-tests.sh — Run all ArcheFlow bats tests.
#
# Usage: ./scripts/run-tests.sh [bats-args...]
# Examples:
# ./scripts/run-tests.sh # Run all tests
# ./scripts/run-tests.sh --filter "event" # Run only event tests
# ./scripts/run-tests.sh -t # TAP output
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_DIR="$(cd "$SCRIPT_DIR/.." && pwd)"
TESTS_DIR="$PROJECT_DIR/tests"
# Find bats binary
BATS="${BATS:-}"
if [[ -z "$BATS" ]]; then
if command -v bats &>/dev/null; then
BATS="bats"
elif [[ -x "$HOME/.local/bin/bats" ]]; then
BATS="$HOME/.local/bin/bats"
else
echo "ERROR: bats not found. Install bats-core or set BATS env var." >&2
exit 1
fi
fi
echo "Running ArcheFlow tests..."
echo " bats: $($BATS --version)"
echo " tests: $TESTS_DIR"
echo ""
exec "$BATS" "$@" "$TESTS_DIR"/*.bats


@@ -1,292 +1,46 @@
---
name: act-phase
description: |
  Use after the Check phase completes. Collects reviewer findings, routes fixes, applies them, decides whether to exit or cycle.
  <example>Automatically loaded during orchestration after Check phase</example>
---
# Act Phase
Turn Check phase findings into fixes, then decide: exit or cycle.
```
Check output → Collect → Deduplicate → Route → Fix → Exit or Cycle
```
---
## Step 1: Collect and Consolidate Findings
Parse all reviewer outputs into one table grouped by severity (CRITICAL / WARNING / INFO):
| # | Source | Location | Category | Description | Suggested Fix |
|---|--------|----------|----------|-------------|---------------|
| 1 | guardian | src/auth/handler.ts:48 | security | Empty string bypasses validation | Add length check |
### Deduplication
Same file + same category + similar description = one finding. Use the higher severity, credit all sources (e.g. `guardian + skeptic`).
### Cross-Cycle Tracking (cycle > 1)
Compare against prior cycle findings:
- **Resolved** — no longer present, mark resolved, do not re-raise
- **Persisting** — same location + category, increment `cycle_count`
- **New** — first appearance, `cycle_count: 1`
Finding persisting 2+ cycles = flag for escalation (see Step 4).
---
## Step 2: Fix Routing
This is the **canonical routing table** (single source of truth for the whole system):
| Source | Category | Routes to | Reason |
|--------|----------|-----------|--------|
@@ -296,76 +50,91 @@
| Sage | quality, consistency | Maker | Implementation refinement |
| Sage | testing | Maker | Test gap, not design flaw |
| Trickster | reliability (design flaw) | Creator | Needs redesign |
| Trickster | reliability (test gap), testing | Maker | Needs more tests |
**Disambiguation:** If the fix requires changing the approach → Creator. If it requires changing code within the existing approach → Maker.
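As a hypothetical default, category-based routing could be mechanized like this. Note the real table also keys on the source archetype and on the design-flaw vs. test-gap judgment, which this sketch deliberately omits:

```bash
# route_category — simplified category → target mapping (illustrative only).
# Treat as a default; the canonical routing table above is authoritative.
route_category() {
  case "$1" in
    design|breaking-change) echo "Creator" ;;   # approach must change
    security|reliability|dependency|quality|testing|consistency) echo "Maker" ;;
    *) echo "Maker" ;;                          # unknown categories default to code-level fix
  esac
}
```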
### Direct Fix (no agent)
Apply with Edit tool when **all** are true:
- Mechanical (typo, naming, formatting, import order)
- No behavioral change
- No test update needed
- Single file
### Maker Fix (spawn agent)
Spawn a targeted Maker when the fix involves code logic, tests, multiple files, or behavioral changes. Batch findings in the same file area into one Maker spawn.
```
Agent(
description: "Fix: <description>",
prompt: "You are the MAKER archetype.
Branch: <maker's branch>
Findings:
1. [CRITICAL] file:line — issue → suggested fix
2. [WARNING] file:line — issue → suggested fix
Rules: fix ONLY these issues, add/update tests if behavior changes,
run tests, commit each fix separately as 'fix: <description>'.
Do NOT refactor surrounding code.",
isolation: "worktree",
mode: "bypassPermissions"
)
```
### Design Fix (route to Creator)
Design findings are NOT fixed in Act. Collect them into `act-feedback.md` for the Creator in the next cycle (see Step 5).
---
## Step 3: Fix Application
Apply in severity order: CRITICAL → WARNING → INFO. Within the same severity, group by file.
For each fix:
1. Apply the change (direct edit or via Maker agent)
2. Emit `fix.applied` event with source, finding, file, severity, before/after
3. For non-trivial fixes: re-run only the originating reviewer scoped to changed files. New findings from re-check get added with source `re-check:<reviewer>`
---
## Step 4: Exit Decision
```
CRITICAL = 0 AND criteria met     → EXIT: proceed to merge
CRITICAL = 0 AND criteria NOT met → CYCLE: feedback to Creator
CRITICAL > 0 AND cycles remaining → CYCLE: build feedback, go to Plan
CRITICAL > 0 AND no cycles left   → STOP: report unresolved to user
Same CRITICAL persists 2+ cycles  → ESCALATE: ask user for guidance
```
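The exit/cycle/escalate decision described in this skill reduces to a small pure function. A sketch with illustrative argument names:

```bash
# decide_next <critical_count> <criteria_met:yes|no> <cycles_left> <persisted_cycles>
# Echoes EXIT | CYCLE | STOP | ESCALATE per the Act-phase decision rules.
decide_next() {
  local critical=$1 criteria_met=$2 cycles_left=$3 persisted=$4
  if [ "$persisted" -ge 2 ]; then echo "ESCALATE"   # same CRITICAL stuck across cycles
  elif [ "$critical" -eq 0 ] && [ "$criteria_met" = "yes" ]; then echo "EXIT"
  elif [ "$critical" -eq 0 ]; then echo "CYCLE"     # criteria not yet met
  elif [ "$cycles_left" -gt 0 ]; then echo "CYCLE"  # criticals remain, budget remains
  else echo "STOP"                                  # criticals remain, budget exhausted
  fi
}
```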
Emit `cycle.boundary` event with: cycle number, max_cycles, critical/warning/info remaining, fixes applied, next action.
---
## Step 5: Cycle Feedback
When cycling back, produce `act-feedback.md`:
```markdown
## Cycle N → Cycle N+1
### For Creator (design changes needed)
| # | Source | Severity | Category | Issue | Cycles Open |
|---|--------|----------|----------|-------|-------------|
### For Maker (implementation fixes needed)
| # | Source | Severity | Category | Issue | Cycles Open |
|---|--------|----------|----------|-------|-------------|
### Resolved This Cycle
| # | Source | Issue | How Resolved |
|---|--------|-------|--------------|
### Persisting Issues (escalation candidates)
| # | Source | Issue | Cycles Open | Action |
|---|--------|-------|-------------|--------|
```
Route findings into Creator vs Maker sections using the routing table in Step 2.

skills/af-dag/SKILL.md Normal file

@@ -0,0 +1,34 @@
---
name: af-dag
description: |
Show the DAG of the current or last ArcheFlow run.
<example>User: "/af-dag"</example>
<example>User: "/af-dag 2026-04-06-jwt-auth"</example>
---
# ArcheFlow Run DAG
1. Parse `run_id` from args. If none provided, read the latest run_id from `.archeflow/events/index.jsonl`.
2. Run `./lib/archeflow-dag.sh .archeflow/events/<run_id>.jsonl` if the script exists. Display its output.
3. If the script does not exist, read `.archeflow/events/<run_id>.jsonl` and render a text DAG:
- Each node is an event (phase transitions, agent starts/completes, findings).
- Show parent relationships via indentation.
- Mark completed events with `[done]`, active with `[running]`, failed with `[FAIL]`.
Example output:
```
run.start 2026-04-06-jwt-auth
plan.start
agent.complete explorer (42s)
agent.complete creator (68s)
do.start
agent.complete maker (180s)
check.start
agent.complete guardian (55s) -- 3 findings
agent.complete skeptic (40s) -- 1 finding
act.start
fixes.applied 3/4
run.complete (6m12s)
```
4. If no events found for the run_id, say: "No events found for run `<run_id>`."
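When neither the helper script nor jq is available, a plain-awk fallback can pull event types in order for the text DAG. This assumes one JSON event per line with a `"type"` field, as in the event examples above:

```bash
# Fallback: list event types in log order without jq.
run_id="2026-04-06-jwt-auth"   # illustrative run id
awk 'match($0, /"type":"[^"]+"/) { print substr($0, RSTART + 8, RLENGTH - 9) }' \
  ".archeflow/events/${run_id}.jsonl"
```

Indentation, durations, and `[done]`/`[FAIL]` markers would still need to come from the parent/seq fields.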

skills/af-replay/SKILL.md Normal file

@@ -0,0 +1,42 @@
---
name: af-replay
description: "Replay and analyze a recorded ArcheFlow run: decision timeline and weighted what-if. Usage: /af-replay <run-id> [--timeline|--whatif|--compare] [--weights arch=w,...]"
user-invocable: true
---
# ArcheFlow Run Replay
Inspect a completed or in-progress run logged in `.archeflow/events/<run_id>.jsonl`. Use this to study which archetypes drove outcomes and to simulate **weighted** consensus (what-if).
## Recording (during PDCA)
After each meaningful orchestration choice, log a **decision point** (in addition to `review.verdict` where applicable):
```bash
./lib/archeflow-decision.sh <run_id> <phase> <archetype> '<input_summary>' '<decision>' <confidence> [parent_seq]
```
Fields stored: `phase`, `archetype`, `input`, `decision`, `confidence`, `ts` (event timestamp). The event type is `decision.point`.
Lower-level alternative:
```bash
./lib/archeflow-event.sh "$RUN_ID" decision.point check guardian \
'{"archetype":"guardian","input":"diff","decision":"needs_changes","confidence":0.85}' 7
```
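The trailing parent-seq argument implies knowing the last sequence number already in the log. A minimal sketch for recovering it, assuming each event line carries a numeric `seq` field as in the examples in this repo:

```bash
# Read the last "seq" in a run log to use as the next event's parent.
run_log=".archeflow/events/${RUN_ID}.jsonl"
last_seq=$(awk 'match($0, /"seq":[0-9]+/) { s = substr($0, RSTART + 6, RLENGTH - 6) } END { print s + 0 }' "$run_log")
echo "parent seq for next event: $last_seq"   # 0 when the log is empty
```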
## Commands (from project root)
| Action | Shell |
|--------|--------|
| Timeline | `./lib/archeflow-replay.sh timeline <run_id>` |
| What-if | `./lib/archeflow-replay.sh whatif <run_id> [--weights guardian=2,sage=0.5] [--threshold 0.5] [--json]` |
| Both | `./lib/archeflow-replay.sh compare <run_id> [--weights ...]` |
- **Timeline** lists `decision.point` rows and `review.verdict` (check phase).
- **What-if** reads the **last** `review.verdict` per archetype in check. **Original** outcome uses strict any-veto (any non-approve → BLOCK). **Replay** uses weighted mean strictness: each reviewer contributes weight × (1 if not approved, else 0); BLOCK if mean ≥ threshold (default 0.5).
- **`--json`** emits machine-readable output for dashboards or scripts.
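The weighted rule can be reproduced outside the replay script. A sketch, assuming verdicts flattened to `archetype|weight|verdict` rows (an illustrative format; the real script reads `review.verdict` events):

```bash
# BLOCK if the weighted mean of non-approvals >= threshold (default 0.5).
weighted_whatif() {
  awk -F'|' -v thr="${1:-0.5}" '
    { total += $2; if ($3 != "approve") strict += $2 }   # weight counts against approval
    END { r = (total > 0 && strict / total >= thr) ? "BLOCK" : "PASS"; print r }
  '
}
```

Compare with the strict any-veto original: any single non-approve row would BLOCK regardless of weight.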
## Learning effectiveness
Correlate `decision.point` confidence and verdicts with cycle outcomes (`cycle.boundary`, `run.complete`) and `./lib/archeflow-score.sh extract` to see which archetypes add signal for which task shapes.

skills/af-report/SKILL.md Normal file

@@ -0,0 +1,40 @@
---
name: af-report
description: |
Generate a full process report for an ArcheFlow run.
<example>User: "/af-report"</example>
<example>User: "/af-report 2026-04-06-jwt-auth"</example>
---
# ArcheFlow Run Report
1. Parse `run_id` from args. If none provided, read the latest run_id from `.archeflow/events/index.jsonl`.
2. Run `./lib/archeflow-report.sh .archeflow/events/<run_id>.jsonl` if the script exists. Display its output.
3. If the script does not exist, read `.archeflow/events/<run_id>.jsonl` and produce a markdown report:
```markdown
# ArcheFlow Report: <run_id>
## Overview
| Field | Value |
|-------|-------|
| Task | ... |
| Workflow | fast/standard/thorough |
| Cycles | N |
| Duration | Xm Ys |
| Total Cost | $X.XX |
## Phase Summary
For each phase (Plan, Do, Check, Act): agents involved, duration, token cost, key outputs.
## Findings
Table of all findings: severity, category, description, archetype source, resolution (fixed/dismissed/deferred).
## Fixes Applied
List of fixes with before/after summary and which finding they addressed.
## Lessons Learned
Any new lessons extracted to memory during this run.
```
4. If no events found for the run_id, say: "No events found for run `<run_id>`."

skills/af-score/SKILL.md Normal file

@@ -0,0 +1,23 @@
---
name: af-score
description: |
Show archetype effectiveness scores across runs.
<example>User: "/af-score"</example>
---
# ArcheFlow Effectiveness Scores
1. Run `./lib/archeflow-score.sh list` if the script exists. Display its output.
2. If the script does not exist, read `.archeflow/memory/effectiveness.jsonl` directly.
3. Summarize per archetype as a table:
| Archetype | Runs | Signal/Noise | Fix Rate | Avg Cost |
|-----------|------|--------------|----------|----------|
| Guardian | ... | ... | ... | ... |
| Skeptic | ... | ... | ... | ... |
- **Signal/Noise**: findings that led to actual fixes vs total findings raised.
- **Fix Rate**: percentage of findings that were applied (not dismissed).
- **Avg Cost**: mean token cost per review across runs.
4. If no effectiveness data exists, say: "No effectiveness data yet. Run `/af-run` at least once."
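As a fallback illustration of the Signal/Noise computation, assume findings flattened to `archetype|led_to_fix` rows (1 = led to a fix, 0 = dismissed; an assumed layout, not the actual `effectiveness.jsonl` schema):

```bash
# Signal/noise per archetype: findings that led to fixes / total findings.
signal_noise() {
  awk -F'|' '
    { total[$1]++; fixed[$1] += $2 }
    END { for (a in total) printf "%s %.2f\n", a, fixed[a] / total[a] }
  ' | sort
}
```

Fix Rate is the same ratio expressed as a percentage; Avg Cost would need token counts joined in from the run events.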

skills/af-status/SKILL.md Normal file

@@ -0,0 +1,25 @@
---
name: af-status
description: |
Show ArcheFlow status — current/last run, active agents, findings.
<example>User: "/af-status"</example>
---
# ArcheFlow Status
1. Read `.archeflow/state.json` if it exists. Extract: task, phase, cycle, workflow, active agents, findings count, start time.
2. If `state.json` does not exist, read the latest entry from `.archeflow/events/index.jsonl`. Extract run_id, task, last event type, timestamp.
3. Calculate duration from start time to now (or to completion time if run finished).
4. Report as a compact table:
| Field | Value |
|-------|-------|
| Run | `<run_id>` |
| Task | `<task description>` |
| Phase | `<current phase>` |
| Cycle | `<cycle number>` |
| Workflow | `<fast/standard/thorough>` |
| Findings | `<count>` |
| Duration | `<elapsed>` |
5. If no `state.json` and no `index.jsonl`, say: "No active or recent ArcheFlow runs."


@@ -1,121 +0,0 @@
---
name: attention-filters
description: Use when spawning archetype agents to decide what context each agent receives. Reduces token waste and sharpens focus by passing only relevant artifacts.
---
# Attention Filters
Each archetype needs different context. Pass only what's relevant — not everything.
| Archetype | Receives | Does NOT Receive |
|-----------|----------|-----------------|
| Explorer | Task description, codebase access | Prior proposals or reviews |
| Creator | Explorer's research + task description | Implementation details |
| Maker | Creator's proposal | Explorer's research, reviews |
| Guardian | Maker's git diff + proposal risk section | Explorer's research |
| Skeptic | Creator's proposal (focus: assumptions) | Git diff details |
| Trickster | Maker's git diff only | Everything else |
| Sage | Proposal + implementation + diff | Explorer's raw research |
## Why This Matters
- **Token cost:** A Guardian reading the Explorer's 2000-word research wastes ~2600 tokens on irrelevant context
- **Focus:** An agent with too much context drifts from its archetype's concern
- **Shadow prevention:** Over-loading context encourages rabbit-holing (Explorer) and scope creep (Maker)
## In Practice
When spawning a Check-phase agent, include only the filtered context in the prompt:
```
# Guardian receives:
"Review these changes: <git diff output>
The proposal identified these risks: <risks section only>
Verdict: APPROVED or REJECTED with findings."
# NOT:
"Here is the full research, the full proposal, the full implementation,
the full git log, and everything else we have..."
```
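One way to build such a filtered prompt is to extract only the relevant section from the proposal artifact. A minimal sketch, assuming the proposal lives in `plan-creator.md` with a `## Risks` heading (both are assumptions for illustration):

```shell
d=$(mktemp -d)
cat > "$d/plan-creator.md" <<'EOF'
## Proposal
Full design rationale the Guardian should NOT see.
## Risks
- SQL injection in the query builder
- Token validation skipped on empty input
## Test Strategy
Unit tests for the handler.
EOF
# keep only the Risks section (lines between "## Risks" and the next "## " heading)
risks=$(awk '/^## Risks/{f=1; next} /^## /{f=0} f' "$d/plan-creator.md")
printf 'Review these changes: <git diff output>\nThe proposal identified these risks:\n%s\n' "$risks"
```

The Guardian prompt ends up containing the risk bullets and nothing from the proposal's design rationale.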
## Prompt Construction Templates
### Explorer
- **Receives:** Task description, file tree (max 200 lines), prior-cycle feedback (if cycle 2+)
- **Excludes:** Creator proposals, Maker diffs, reviewer outputs
- **Token target:** ~2000 tokens input
### Creator
- **Receives:** Task description, Explorer research (if available), prior-cycle feedback (if cycle 2+)
- **Excludes:** Maker diffs, reviewer outputs
- **Token target:** ~3000 tokens input
### Maker
- **Receives:** Creator's proposal (full), test strategy section, file list
- **Excludes:** Explorer research, reviewer outputs, prior-cycle feedback
- **Token target:** ~2500 tokens input
### Guardian
- **Receives:** Maker's git diff, proposal risk section, test results
- **Excludes:** Explorer research, Creator rationale, Skeptic/Sage outputs
- **Token target:** ~2000 tokens input
### Skeptic
- **Receives:** Creator's proposal (assumptions + architecture decision), confidence scores
- **Excludes:** Git diff details, Explorer raw research, other reviewer outputs
- **Token target:** ~1500 tokens input
### Trickster
- **Receives:** Maker's git diff only, attack surface summary (file types + entry points)
- **Excludes:** Proposal, research, other reviewer outputs
- **Token target:** ~1500 tokens input
### Sage
- **Receives:** Creator's proposal, Maker's implementation summary + diff, test results
- **Excludes:** Explorer raw research, other reviewer verdicts
- **Token target:** ~2500 tokens input
## Token Budget Targets
| Archetype | Fast | Standard | Thorough |
|-----------|------|----------|----------|
| Explorer | skip | 2000 | 3000 |
| Creator | 2000 | 3000 | 4000 |
| Maker | 2000 | 2500 | 3000 |
| Guardian | 1500 | 2000 | 2500 |
| Skeptic | skip | 1500 | 2000 |
| Trickster | skip | skip | 1500 |
| Sage | skip | 2500 | 3000 |
"skip" means the archetype is not spawned in that workflow tier.
## Cycle-Back Filtering
When injecting prior-cycle feedback into cycle 2+:
1. **Summary only** — pass the structured feedback table (issue, source, severity), not full reviewer artifacts
2. **Strip resolved items** — if a finding was marked Fixed in the Act phase, exclude it
3. **Compress context** — prior proposal diffs reduce to "What Changed" section only (not full re-proposal)
4. **Cap at 500 tokens** — if feedback exceeds this, summarize by severity (CRITICAL first, then WARNING, drop INFO)
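Step 4's severity-ordered compression can be sketched over a findings table. The file name and the row cap (standing in for the 500-token budget) are illustrative:

```shell
d=$(mktemp -d)
cat > "$d/feedback.md" <<'EOF'
| src/auth.ts:48 | CRITICAL | security | Empty string bypasses validation |
| src/auth.ts:52 | WARNING | security | Missing rate limit |
| src/auth.ts:30 | INFO | design | Consider caching tokens |
EOF
# CRITICAL first, then WARNING, drop INFO; head stands in for the token cap
compressed=$({ grep '| CRITICAL |' "$d/feedback.md"; grep '| WARNING |' "$d/feedback.md"; } | head -n 10)
printf '%s\n' "$compressed"
```

INFO rows never survive compression, and CRITICAL rows always sort ahead of WARNING rows, so truncation drops the least severe items first.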
## Filter Verification Checklist
Before spawning each agent, verify:
- [ ] Prompt contains ONLY the artifacts listed in that archetype's "Receives" above
- [ ] No cross-contamination from other reviewers' outputs
- [ ] Token count is within 20% of the target for the current workflow tier
- [ ] Prior-cycle feedback (if any) is summarized, not raw
- [ ] Excluded artifacts are genuinely absent (search for keywords like file paths from excluded sources)
## Context Isolation
Attention filters control *what* each agent receives. Context isolation controls *how* that context is constructed — ensuring agents operate on provided facts, not ambient knowledge.
### Rules
1. **No session bleed.** Agents receive fresh context only — constructed from task description, artifact files, or extracted sections. They must not inherit session state, chat history, or prior agent prompts.
2. **No cross-agent contamination.** An agent receives another agent's output only if the attention filter table above explicitly allows it. Guardian does not see Skeptic's output. Skeptic does not see the Maker's diff. Violations produce unreliable reviews.
3. **Controller-constructed only.** All agent context is assembled by the orchestrator from: (a) the task description, (b) artifact files on disk, or (c) extracted sections of those artifacts. Agents never pull their own context.
4. **No ambient knowledge.** Agents cannot "remember" findings from prior phases or cycles unless that information is explicitly injected via the cycle-back filtering protocol above. An agent that references information not in its prompt is hallucinating.
5. **Verification.** Before spawning each agent, confirm the constructed prompt has zero references to other agents' raw outputs that are not in the "Receives" column. Search for file paths, archetype names, and finding descriptions from excluded sources.
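Rule 5's check can be as simple as grepping the constructed prompt for markers of excluded artifacts. The marker list below is illustrative, not a complete contamination vocabulary:

```shell
# Hypothetical contamination check before spawning a Guardian
prompt='You are the GUARDIAN. Review the diff: <diff>. Proposal risks: <risks>.'
clean=yes
for marker in "Explorer research" "Skeptic verdict" "chat history"; do
  if printf '%s' "$prompt" | grep -qi "$marker"; then clean=no; fi
done
echo "prompt contamination check: $clean"
```

A real implementation would also search for file paths and finding descriptions unique to the excluded artifacts.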
@@ -1,233 +1,110 @@
---
name: check-phase
description: Use when acting as Guardian, Skeptic, Sage, or Trickster in the Check phase. Defines review rules, finding format, attention filters, and spawning protocol.
---
# Check Phase
Reviewers examine the Maker's implementation. This skill defines shared rules, finding format, and spawning protocol.
## Shared Rules
1. Review against the proposal's intended design, not invented requirements.
2. Read actual code via `git diff` on the Maker's branch.
3. Use the finding format below for every issue.
4. Give a clear verdict: `APPROVED` or `REJECTED` with rationale.
5. `STATUS: DONE` signals agent completion. `APPROVED`/`REJECTED` is domain output. Both are parsed independently.
## Finding Format
| Location | Severity | Category | Description | Fix |
|----------|----------|----------|-------------|-----|
| src/auth/handler.ts:48 | CRITICAL | security | Empty string bypasses validation | Add length check |
**Severity:** CRITICAL = must fix, blocks approval. WARNING = should fix, doesn't block alone. INFO = nice to have, never blocks.
**Categories:** `security` `reliability` `design` `breaking-change` `dependency` `quality` `testing` `consistency`
## Evidence Requirements
Every CRITICAL or WARNING must include concrete evidence. Without evidence, downgrade to INFO.
**Valid evidence:** command output, exit codes, code citations with line numbers, git diff excerpts, reproduction steps.
**Banned in CRITICAL/WARNING:** "might be", "could potentially", "appears to", "seems like", "may not". Rewrite with evidence or downgrade.
For each CRITICAL/WARNING, state: (1) what was tested, (2) what was observed, (3) what correct behavior should be.
## Attention Filters
Each archetype receives only relevant context. Do not pass everything.
| Archetype | Receives | Excludes |
|-----------|----------|----------|
| Guardian | Maker's git diff + proposal risk section + test results | Explorer research, Creator rationale, other reviewers |
| Skeptic | Creator's proposal (assumptions + architecture) + confidence scores | Git diff, Explorer research, other reviewers |
| Sage | Creator's proposal + Maker's diff + implementation summary + test results | Explorer raw research, other reviewer verdicts |
| Trickster | Maker's git diff + attack surface summary (file types + entry points) | Proposal, research, other reviewers |
**Token budget targets:**
| Archetype | Fast | Standard | Thorough |
|-----------|------|----------|----------|
| Guardian | 1500 | 2000 | 2500 |
| Skeptic | skip | 1500 | 2000 |
| Trickster | skip | skip | 1500 |
| Sage | skip | 2500 | 3000 |
**Context isolation:** Agents receive fresh, controller-constructed context only. No session bleed, no cross-agent contamination, no ambient knowledge. Verify zero references to excluded artifacts before spawning.
**Cycle-back filtering (cycle 2+):** Pass the structured feedback table only (not full reviewer artifacts). Strip resolved items. Cap at 500 tokens — summarize by severity if exceeded.
## Reviewer Spawning Protocol
### Step 1: Guardian First (mandatory)
Guardian always runs first. It receives the Maker's git diff and the proposal's risk section only.
Save output to `.archeflow/artifacts/${RUN_ID}/check-guardian.md`.
### Step 2: A2 Fast-Path Evaluation
After Guardian completes, count CRITICAL and WARNING findings in its output. If both are zero, the run is not escalated, and this is not the first cycle of a thorough workflow — skip the remaining reviewers and proceed to the Act phase.
### Step 3: Parallel Remaining Reviewers
If A2 does not trigger, spawn the remaining reviewers in parallel:
| Workflow | Reviewers (after Guardian) |
|----------|--------------------------|
| `fast` | None (Guardian only) |
| `fast` (escalated) | Skeptic + Sage |
| `standard` | Skeptic + Sage |
| `thorough` | Skeptic + Sage + Trickster |
Each reviewer gets context per the attention filters above.
### Step 4: Collect and Consolidate
For each reviewer: save output to `.archeflow/artifacts/${RUN_ID}/check-<archetype>.md`, emit a `review.verdict` event, record the sequence number.
**Deduplication:** If two reviewers raise the same issue (same file + same category), merge into one finding using the higher severity. Don't double-count.
**Verdict:** Count CRITICAL findings across all reviewers (after dedup). Any CRITICAL = `REJECTED`. Otherwise `APPROVED`.
Example consolidated output:
```markdown
## Check Phase Results — Cycle 1
### Guardian: APPROVED
| Location | Severity | Category | Description | Fix |
|----------|----------|----------|-------------|-----|
| src/auth.ts:52 | WARNING | security | Missing rate limit | Add rate limiter |
### Verdict: APPROVED — 0 critical, 1 warning
```
## Timeout Handling
Each reviewer has a **5-minute timeout**. On timeout: emit `agent.complete` with `"error": true`, log a WARNING, treat the reviewer as having delivered no findings, and proceed.
**Exception:** a Guardian timeout is blocking — abort the Check phase and report to the user.
### Re-Check Protocol (Act Phase Fixes)
When the Act phase routes findings back to the Maker and the Maker applies fixes in a subsequent cycle, the Check phase re-runs with the updated diff. Reviewers who previously rejected should focus on whether their specific findings were addressed. The structured feedback from `act-feedback.md` provides the mapping of which findings were routed where.
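The same-file + same-category merge rule from deduplication can be sketched over pipe-separated findings. The field layout and severity ranking are assumptions for illustration:

```shell
findings='src/api.ts:30|security|WARNING|Input not sanitized
src/api.ts:30|security|CRITICAL|Input not sanitized
src/auth.ts:48|reliability|WARNING|Empty string accepted'
# keep one finding per (file, category) key, preferring the higher severity
dedup=$(printf '%s\n' "$findings" | awk -F'|' '
  { rank = ($3 == "CRITICAL") ? 3 : ($3 == "WARNING") ? 2 : 1
    key = $1 "|" $2
    if (rank > best[key]) { best[key] = rank; row[key] = $0 } }
  END { for (k in row) print row[k] }' | sort)
printf '%s\n' "$dedup"
```

The duplicate security finding collapses to its CRITICAL copy, so the verdict count sees it once.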
@@ -352,10 +352,12 @@ Emit events via `./lib/archeflow-event.sh <run_id> <type> <phase> <agent> '<json
| After agent returns | `agent.complete` | archetype, duration_ms, artifacts, summary |
| Phase boundary | `phase.transition` | from, to, artifacts_so_far |
| Alternative chosen | `decision` | what, chosen, alternatives, rationale |
| Orchestrator decision (replay) | `decision.point` | archetype, input, decision, confidence — use `./lib/archeflow-decision.sh` |
| Reviewer verdict | `review.verdict` | archetype, verdict, findings[] |
| Fix addressing review | `fix.applied` | source, finding, file, line |
| End of PDCA cycle | `cycle.boundary` | cycle, max_cycles, exit_condition, convergence |
| Shadow triggered | `shadow.detected` | archetype, shadow, trigger, action |
| Policy halt | `wiggum.break` | trigger, run_state, unresolved_findings, hard/soft |
| Run ends | `run.complete` | status, cycles, agents_total, fixes_total |
Parent rules: `run.start` has `parent: []`. Agents parent to the event that triggered them. Phase transitions fan-in from all completing events. Parallel agents share the same parent.
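The event rows above share one append-only JSONL shape. A minimal stand-in for what `archeflow-event.sh` writes (the real script's output format is not shown here, so the field names below are assumptions):

```shell
d=$(mktemp -d)
RUN_ID=demo-001
emit() { # args: type phase agent json-payload
  printf '{"run_id":"%s","type":"%s","phase":"%s","agent":"%s","data":%s}\n' \
    "$RUN_ID" "$1" "$2" "$3" "$4" >> "$d/index.jsonl"
}
emit run.start plan '' '{"parent":[]}'
emit agent.complete check guardian '{"duration_ms":1200}'
emit review.verdict check guardian '{"verdict":"APPROVED","findings":[]}'
wc -l < "$d/index.jsonl"
```

An append-only log like this is what makes the status report's fallback path (reading the latest `index.jsonl` entry) possible.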
@@ -403,6 +405,12 @@ Scores stored in `.archeflow/memory/effectiveness.jsonl`. After 10+ runs, recomm
---
## Run replay (decision log + what-if)
After key choices (routing, fast-path skip, escalation), emit `decision.point` via `./lib/archeflow-decision.sh` so runs can be inspected with `./lib/archeflow-replay.sh timeline|whatif|compare <run_id>`. Weighted what-if helps estimate how much each review archetype swayed the effective ship/block outcome. See skill `af-replay`.
---
## Dry-Run Mode
When `--dry-run`: Run Plan phase only. Display workflow, agent counts, confidence scores, cost estimate. Ask user to proceed. If yes, continue with `--start-from do`.
@@ -1,66 +1,139 @@
---
name: shadow-detection
description: |
  Corrective action framework for agent dysfunction, system health, and operational policy.
  Three layers — archetype shadows, system shadows, policy boundaries — one escalation protocol.
---
# Corrective Action Framework
Detect dysfunction. Apply corrective action. Escalate if repeated.
Three layers, one protocol:
- **Archetype Shadows** — individual agent dysfunction (virtue pushed too far)
- **System Shadows** — orchestration-level dysfunction (process going wrong)
- **Policy Boundaries** — operational limits (time, cost, quality thresholds)
---
## Archetype Shadows
| Archetype | Shadow | Detect (any) | Corrective Action |
|-----------|--------|-------------|-------------------|
| Explorer | Rabbit Hole | Output >2000w without Recommendation; >3 tangents; >15 files no patterns; no synthesis in final 25% | "Summarize top 3 findings and one recommendation in 300 words." |
| Creator | Over-Architect | >2 new abstractions for one feature; "future-proof" in rationale; scope exceeds task >50%; >1 new package | "Design for the current order of magnitude. Remove abstractions for hypothetical requirements." |
| Maker | Rogue | Zero test files with >=3 files changed; single monolithic commit; files outside proposal; no test run evidence | "Read the proposal. Write a test. Commit. Revert out-of-scope files." |
| Guardian | Paranoid | CRITICAL:WARNING ratio >2:1 (min 3); zero APPROVED in 3+ reviews; <50% findings include fix; findings require compromised systems | "For each CRITICAL: would a senior engineer block a PR? If not, downgrade. Every rejection needs a specific fix." |
| Skeptic | Paralytic | >7 challenges; <50% include alternatives; same concern 2+ times reworded; >3 findings outside scope | "Rank by impact. Keep top 3 with alternatives. Delete the rest." |
| Trickster | False Alarm | Findings in untouched code; >10 findings for <5 files; impossible scenarios; >3 without repro steps | "Delete findings outside the diff. Rank by likelihood x impact. Keep top 3-5." |
| Sage | Bureaucrat | Review words >2x diff lines; findings outside changeset; >2 "consider" without action; suggesting docs for trivial functions | "Limit to issues affecting maintainability in 6 months. Every finding needs a specific action." |
### Shadow Immunity
Intensity alone is not a shadow. **Shadow = behavior disconnected from the goal.**
- Explorer reading 20 files in a monorepo with scattered deps -- not rabbit hole if each is relevant
- Guardian blocking with 2 CRITICALs -- not paranoid if both are genuine vulnerabilities
- Trickster finding 5 edge cases -- not false alarm if all are in changed code with repro steps
---
## System Shadows
Orchestration-level dysfunction that isn't tied to one archetype.
| Shadow | Detect | Corrective Action |
|--------|--------|-------------------|
| **Tunnel Vision** | All reviewers flag same category (e.g., 4 security findings, 0 quality/testing) | "Redistribute attention. Are we missing quality, testing, or design concerns?" |
| **Echo Chamber** | Unanimous approval in <30s on standard/thorough workflow | "Suspicious fast consensus. Re-run Guardian with adversarial prompt." |
| **Gold Plating** | Maker working on INFO fixes while CRITICALs remain open | "Fix CRITICALs first. Park INFO items." |
| **Analysis Paralysis** | Plan phase >2x longer than Do phase; Explorer spawned 3+ times | "Stop researching. Ship a proposal with known gaps." |
| **Cargo Cult** | Memory lesson injected but the same finding repeats anyway | "Lesson ineffective. Reword, strengthen, or remove it." |
| **Broken Window** | 3+ WARNINGs deferred across consecutive runs in the same project | "Accumulated tech debt. Schedule a cleanup sprint." |
| **Scope Creep** | Maker changes >2x files listed in proposal | "Revert to proposal scope. If more files needed, update the proposal first." |
---
## Policy Boundaries
Operational limits that protect session quality, cost, and resumability.
### Checkpoint Policy
Every **45 minutes** or **3 completed tasks** (whichever first):
1. Commit + push all work in progress
2. Write handoff summary to `control-center.md`
3. Log token spend so far
4. Compare output quality: last task vs first task
5. If quality degrading -> STOP with clean state
6. If budget >80% spent -> STOP with clean state
7. Otherwise -> continue
### Budget Gate
| Threshold | Action |
|-----------|--------|
| 50% budget spent | Log warning, continue |
| 80% budget spent | Downgrade models (sonnet->haiku for reviewers) |
| 95% budget spent | Complete current task, then STOP |
| 100% budget | STOP immediately, commit WIP |
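The budget gate reduces to a threshold ladder. A sketch (the action names are shorthand labels, not real commands):

```shell
budget_action() { # arg: percentage of budget spent -> action label
  pct=$1
  if   [ "$pct" -ge 100 ]; then echo "stop-immediately-commit-wip"
  elif [ "$pct" -ge 95 ];  then echo "finish-current-task-then-stop"
  elif [ "$pct" -ge 80 ];  then echo "downgrade-reviewer-models"
  elif [ "$pct" -ge 50 ];  then echo "log-warning-continue"
  else                          echo "continue"
  fi
}
budget_action 83   # -> downgrade-reviewer-models
```

Ordering the checks from highest to lowest threshold means each spend level maps to exactly one action.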
### Wiggum Break (Circuit Breaker)
Named after Chief Wiggum — policy enforcement AND the Ralph Loop's dad.
When a Wiggum Break triggers, the system halts execution, saves state, and
reports to the user. "Bake 'em away, toys."
**Hard breaks** (halt immediately, commit WIP):
| Trigger | Reason |
|---------|--------|
| 3 consecutive agent failures/timeouts | Infrastructure issue, not a code problem |
| 3 consecutive task failures in sprint | Something systemic is wrong |
| Same shadow detected 3+ times in one cycle | Task needs to be broken down or re-scoped |
| Test suite broken after merge | Auto-revert, then halt |
| 2+ oscillating findings (present→absent→present) | Fundamental tension in review criteria |
**Soft breaks** (finish current task, then halt):
| Signal | Reason |
|--------|--------|
| Cycle N findings identical to cycle N-1 | No progress — present best result |
| Convergence score <0.5 for 2 consecutive cycles | "This needs a different approach" |
| Reviewer finding count increases cycle over cycle | Implementation is diverging, not converging |
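The divergence signal in the last row is a one-line comparison at each cycle boundary (the counts below are illustrative):

```shell
prev_findings=3   # reviewer findings in cycle N-1 (hypothetical)
curr_findings=5   # reviewer findings in cycle N (hypothetical)
if [ "$curr_findings" -gt "$prev_findings" ]; then
  echo "wiggum soft break: findings rising ($prev_findings -> $curr_findings), diverging"
fi
```

A rising count means each fix is generating more review findings than it resolves, which is exactly the "diverging, not converging" condition.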
When a Wiggum Break fires, emit a `wiggum.break` event with trigger, run state, and unresolved findings.
The event log makes it easy to audit why a run was halted and whether the break was warranted.
### Context Pollution
| Signal | Action |
|--------|--------|
| >15 memory lessons injected into one prompt | Prune to top 5 by frequency |
| >20 findings tracked across cycles | Summarize into top 5 themes |
| Agent prompt exceeds estimated 50% of context window | Strip examples, keep rules only |
---
## Unified Escalation Protocol
All three layers use the same escalation:
| Step | Archetype Shadows | System Shadows | Policy Boundaries |
|------|-------------------|----------------|-------------------|
| **1st** | Apply corrective action, let agent continue | Apply corrective action, continue run | Apply boundary action (downgrade, checkpoint) |
| **2nd** (same issue) | Replace the agent; shadow is entrenched | Pause run, report to user | Force stop with clean state |
| **3rd** (pattern) | Escalate to user: "task needs re-scoping" | Escalate to user: "systemic issue" | Escalate to user: "resource limits reached" |
---
## Integration
Shadow checks run **after each agent completes** during orchestration. System shadow checks run **at phase boundaries**. Policy checks run **on a timer and at task boundaries**.
The `run` skill references this framework at:
- Step 3 (Check phase): archetype shadow monitoring
- Step 4 (Act phase): convergence/diminishing returns
- Step 5 (Completion): effectiveness scoring
- Sprint skill: checkpoint policy between batches


@@ -0,0 +1,22 @@
# ArcheFlow -- Active
Multi-agent orchestration using archetypal roles and PDCA quality cycles.
## Session Start
On activation, print ONE line then proceed silently:
```
archeflow v0.8.0 · 19 skills · <domain> domain
```
Domain: `writing` if `colette.yaml` exists, `research` if paper/thesis files, `code` otherwise.
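A minimal sketch of this detection rule: `colette.yaml` is the documented signal for `writing`, while the `*paper*`/`*thesis*` globs are only an assumption about what "paper/thesis files" means here.

```shell
# Decide the ArcheFlow domain from files in the current directory.
detect_domain() {
  if [ -f colette.yaml ]; then
    echo writing
  # compgen -G succeeds only if the glob matches at least one file.
  elif compgen -G '*paper*' >/dev/null || compgen -G '*thesis*' >/dev/null; then
    echo research
  else
    echo code
  fi
}
```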
## When to Use
| Need | Command |
|------|---------|
| Work the queue | `/af-sprint` |
| Deep orchestration | `/af-run <task>` |
| Code review | `/af-review` |
| Simple fix / question | Skip ArcheFlow — just do it directly |
Do NOT use ArcheFlow for: single-line fixes, questions, reading code, config tweaks, git ops.


@@ -7,7 +7,7 @@ description: Use at session start when implementing features, reviewing code, de
 On activation, print ONE line then proceed silently:
 ```
-archeflow v0.7.0 · 25 skills · <domain> domain
+archeflow v0.9.0 · 24 skills · <domain> domain
 ```
 Domain auto-detected: `writing` if `colette.yaml` exists, `research` if paper/thesis files, `code` otherwise.
@@ -46,6 +46,7 @@ Do NOT use for: single-line fixes, questions, reading/exploring, config tweaks,
 | `/af-memory` | Cross-run lesson memory |
 | `/af-fanout` | Colette book fanout via agents |
 | `/af-dag` | DAG of current/last run |
+| `/af-replay <run_id>` | Decision timeline + weighted what-if on recorded events |
 ## Mini-Reflect Fallback

tests/archeflow-dag.bats

@@ -0,0 +1,71 @@
# Tests for archeflow-dag.sh — ASCII DAG rendering from JSONL events.
#
# Validates: basic rendering, parent relationships, color flags, missing file handling.
setup() {
load test_helper
_common_setup
# Create a standard events file with parent relationships
cat > "$BATS_TEST_TMPDIR/dag-events.jsonl" <<'EVENTS'
{"ts":"2026-04-03T10:00:00Z","run_id":"dag-run","seq":1,"parent":[],"type":"run.start","phase":"plan","agent":null,"data":{"task":"DAG test"}}
{"ts":"2026-04-03T10:01:00Z","run_id":"dag-run","seq":2,"parent":[1],"type":"agent.complete","phase":"plan","agent":"creator","data":{"archetype":"creator","duration_ms":60000,"tokens":1500}}
{"ts":"2026-04-03T10:02:00Z","run_id":"dag-run","seq":3,"parent":[2],"type":"phase.transition","phase":"do","agent":null,"data":{"from":"plan","to":"do"}}
{"ts":"2026-04-03T10:03:00Z","run_id":"dag-run","seq":4,"parent":[3],"type":"agent.complete","phase":"do","agent":"maker","data":{"archetype":"maker","duration_ms":120000,"tokens":3000}}
{"ts":"2026-04-03T10:04:00Z","run_id":"dag-run","seq":5,"parent":[4],"type":"run.complete","phase":"act","agent":null,"data":{"agents_total":2,"fixes_total":0}}
EVENTS
}
@test "dag: exits 1 with usage when called with no args" {
run "$LIB_DIR/archeflow-dag.sh"
[ "$status" -eq 1 ]
[[ "$output" == *"Usage"* ]]
}
@test "dag: exits 1 when events file not found" {
run "$LIB_DIR/archeflow-dag.sh" nonexistent.jsonl
[ "$status" -eq 1 ]
[[ "$output" == *"not found"* ]]
}
@test "dag: renders run.start as root node" {
run "$LIB_DIR/archeflow-dag.sh" "$BATS_TEST_TMPDIR/dag-events.jsonl" --no-color
[ "$status" -eq 0 ]
[[ "$output" == *"#1"* ]]
[[ "$output" == *"run.start"* ]]
}
@test "dag: renders agent.complete events with archetype name" {
run "$LIB_DIR/archeflow-dag.sh" "$BATS_TEST_TMPDIR/dag-events.jsonl" --no-color
[ "$status" -eq 0 ]
[[ "$output" == *"creator"* ]]
[[ "$output" == *"maker"* ]]
}
@test "dag: renders phase transitions" {
run "$LIB_DIR/archeflow-dag.sh" "$BATS_TEST_TMPDIR/dag-events.jsonl" --no-color
[ "$status" -eq 0 ]
[[ "$output" == *"plan"* ]]
[[ "$output" == *"do"* ]]
}
@test "dag: renders run.complete with agent/fix counts" {
run "$LIB_DIR/archeflow-dag.sh" "$BATS_TEST_TMPDIR/dag-events.jsonl" --no-color
[ "$status" -eq 0 ]
[[ "$output" == *"run.complete"* ]]
[[ "$output" == *"2 agents"* ]]
}
@test "dag: --no-color suppresses ANSI codes" {
run "$LIB_DIR/archeflow-dag.sh" "$BATS_TEST_TMPDIR/dag-events.jsonl" --no-color
[ "$status" -eq 0 ]
# Should not contain escape sequences
[[ "$output" != *$'\033'* ]]
}
@test "dag: uses tree-drawing characters for hierarchy" {
run "$LIB_DIR/archeflow-dag.sh" "$BATS_TEST_TMPDIR/dag-events.jsonl" --no-color
[ "$status" -eq 0 ]
# Should contain box-drawing characters (either unicode or ASCII connectors)
[[ "$output" == *"├"* ]] || [[ "$output" == *"└"* ]]
}

tests/archeflow-event.bats

@@ -0,0 +1,127 @@
# Tests for archeflow-event.sh — structured JSONL event logging.
#
# Validates: JSONL output format, sequence numbering, parent field handling,
# input validation, file/directory creation.
setup() {
load test_helper
_common_setup
}
teardown() {
_common_teardown
}
@test "event: exits 1 with usage when called with fewer than 4 args" {
run "$LIB_DIR/archeflow-event.sh" run1 type1 plan
[ "$status" -eq 1 ]
[[ "$output" == *"Usage"* ]]
}
@test "event: creates events directory and file on first call" {
run "$LIB_DIR/archeflow-event.sh" test-run run.start plan "" '{"task":"test"}'
[ "$status" -eq 0 ]
[ -d ".archeflow/events" ]
[ -f ".archeflow/events/test-run.jsonl" ]
}
@test "event: first event has seq=1" {
run "$LIB_DIR/archeflow-event.sh" test-run run.start plan "" '{"task":"test"}'
[ "$status" -eq 0 ]
local seq
seq=$(head -1 ".archeflow/events/test-run.jsonl" | jq -r '.seq')
[ "$seq" -eq 1 ]
}
@test "event: second event has seq=2" {
"$LIB_DIR/archeflow-event.sh" test-run run.start plan "" '{"task":"test"}' 2>/dev/null
"$LIB_DIR/archeflow-event.sh" test-run agent.complete plan creator '{"dur":100}' "1" 2>/dev/null
local count
count=$(wc -l < ".archeflow/events/test-run.jsonl")
[ "$count" -eq 2 ]
local seq2
seq2=$(tail -1 ".archeflow/events/test-run.jsonl" | jq -r '.seq')
[ "$seq2" -eq 2 ]
}
@test "event: output is valid JSONL" {
"$LIB_DIR/archeflow-event.sh" test-run run.start plan "" '{"task":"hello"}' 2>/dev/null
# jq will fail if the line is not valid JSON
jq empty ".archeflow/events/test-run.jsonl"
}
@test "event: fields are correctly populated" {
"$LIB_DIR/archeflow-event.sh" test-run agent.complete do maker '{"tokens":500}' 2>/dev/null
local event
event=$(head -1 ".archeflow/events/test-run.jsonl")
[ "$(echo "$event" | jq -r '.run_id')" = "test-run" ]
[ "$(echo "$event" | jq -r '.type')" = "agent.complete" ]
[ "$(echo "$event" | jq -r '.phase')" = "do" ]
[ "$(echo "$event" | jq -r '.agent')" = "maker" ]
[ "$(echo "$event" | jq -r '.data.tokens')" = "500" ]
}
@test "event: empty agent becomes null in JSON" {
"$LIB_DIR/archeflow-event.sh" test-run phase.transition do "" '{"from":"plan","to":"do"}' 2>/dev/null
local agent
agent=$(head -1 ".archeflow/events/test-run.jsonl" | jq -r '.agent')
[ "$agent" = "null" ]
}
@test "event: parent field is empty array for root events" {
"$LIB_DIR/archeflow-event.sh" test-run run.start plan "" '{}' 2>/dev/null
local parent
parent=$(head -1 ".archeflow/events/test-run.jsonl" | jq -c '.parent')
[ "$parent" = "[]" ]
}
@test "event: single parent is parsed correctly" {
"$LIB_DIR/archeflow-event.sh" test-run run.start plan "" '{}' 2>/dev/null
"$LIB_DIR/archeflow-event.sh" test-run agent.complete plan creator '{}' "1" 2>/dev/null
local parent
parent=$(tail -1 ".archeflow/events/test-run.jsonl" | jq -c '.parent')
[ "$parent" = "[1]" ]
}
@test "event: multiple parents (fan-in) are parsed correctly" {
"$LIB_DIR/archeflow-event.sh" test-run run.start plan "" '{}' 2>/dev/null
"$LIB_DIR/archeflow-event.sh" test-run a plan "" '{}' "1" 2>/dev/null
"$LIB_DIR/archeflow-event.sh" test-run b plan "" '{}' "1" 2>/dev/null
"$LIB_DIR/archeflow-event.sh" test-run merge plan "" '{}' "2,3" 2>/dev/null
local parent
parent=$(tail -1 ".archeflow/events/test-run.jsonl" | jq -c '.parent')
[ "$parent" = "[2,3]" ]
}
@test "event: rejects invalid JSON data" {
run "$LIB_DIR/archeflow-event.sh" test-run run.start plan "" 'not-json'
[ "$status" -eq 1 ]
[[ "$output" == *"invalid JSON"* ]]
}
@test "event: rejects invalid parent format" {
run "$LIB_DIR/archeflow-event.sh" test-run run.start plan "" '{}' "abc"
[ "$status" -eq 1 ]
[[ "$output" == *"invalid parent format"* ]]
}
@test "event: timestamp is ISO 8601 UTC format" {
"$LIB_DIR/archeflow-event.sh" test-run run.start plan "" '{}' 2>/dev/null
local ts
ts=$(head -1 ".archeflow/events/test-run.jsonl" | jq -r '.ts')
# Matches YYYY-MM-DDTHH:MM:SSZ
[[ "$ts" =~ ^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$ ]]
}
@test "event: default data is empty object when omitted" {
"$LIB_DIR/archeflow-event.sh" test-run run.start plan agent 2>/dev/null
local data
data=$(head -1 ".archeflow/events/test-run.jsonl" | jq -c '.data')
[ "$data" = "{}" ]
}
@test "event: confirmation message goes to stderr" {
run "$LIB_DIR/archeflow-event.sh" test-run run.start plan "" '{}' "" 2>&1
[[ "$output" == *"[archeflow-event]"* ]]
[[ "$output" == *"#1"* ]]
}

tests/archeflow-git.bats

@@ -0,0 +1,212 @@
# Tests for archeflow-git.sh — git branch/commit strategy for ArcheFlow runs.
#
# Validates: branch creation with correct naming, commit formatting,
# merge strategies, input validation, and safety guards.
setup() {
load test_helper
_common_setup
}
teardown() {
_common_teardown
}
# --- Usage ---
@test "git: exits 1 with usage when called with fewer than 2 args" {
run "$LIB_DIR/archeflow-git.sh"
[ "$status" -eq 1 ]
[[ "$output" == *"Usage"* ]]
}
@test "git: exits 1 for unknown command" {
run "$LIB_DIR/archeflow-git.sh" nonexistent test-run
[ "$status" -ne 0 ]
[[ "$output" == *"Unknown command"* ]]
}
# --- init ---
@test "git init: creates branch with archeflow/ prefix" {
run "$LIB_DIR/archeflow-git.sh" init test-run
[ "$status" -eq 0 ]
local current
current=$(git branch --show-current)
[ "$current" = "archeflow/test-run" ]
}
@test "git init: stores base branch in .archeflow/runs/<run_id>/base-branch" {
"$LIB_DIR/archeflow-git.sh" init test-run 2>/dev/null
[ -f ".archeflow/runs/test-run/base-branch" ]
local base
base=$(cat ".archeflow/runs/test-run/base-branch")
[ "$base" = "main" ]
}
@test "git init: fails if branch already exists" {
"$LIB_DIR/archeflow-git.sh" init test-run 2>/dev/null
git checkout main --quiet
run "$LIB_DIR/archeflow-git.sh" init test-run
[ "$status" -ne 0 ]
[[ "$output" == *"already exists"* ]]
}
# --- commit ---
@test "git commit: uses conventional commit format by default" {
"$LIB_DIR/archeflow-git.sh" init test-run 2>/dev/null
# Create a file to commit
mkdir -p .archeflow/events
echo '{"test":true}' > .archeflow/events/test-run.jsonl
"$LIB_DIR/archeflow-git.sh" commit test-run plan "initial plan" 2>/dev/null
local msg
msg=$(git log -1 --format=%s)
[[ "$msg" == "archeflow(plan): initial plan" ]]
}
@test "git commit: stages event file automatically" {
"$LIB_DIR/archeflow-git.sh" init test-run 2>/dev/null
mkdir -p .archeflow/events
echo '{"test":true}' > .archeflow/events/test-run.jsonl
"$LIB_DIR/archeflow-git.sh" commit test-run plan "test commit" 2>/dev/null
# Verify the event file was committed
local committed_files
committed_files=$(git diff-tree --no-commit-id --name-only -r HEAD)
[[ "$committed_files" == *"test-run.jsonl"* ]]
}
@test "git commit: stages extra files passed as arguments" {
"$LIB_DIR/archeflow-git.sh" init test-run 2>/dev/null
echo "extra content" > extra.txt
"$LIB_DIR/archeflow-git.sh" commit test-run do "with extras" extra.txt 2>/dev/null
local committed_files
committed_files=$(git diff-tree --no-commit-id --name-only -r HEAD)
[[ "$committed_files" == *"extra.txt"* ]]
}
@test "git commit: reports nothing to commit when no changes" {
"$LIB_DIR/archeflow-git.sh" init test-run 2>/dev/null
# Commit the init artifacts first so there's a clean state
git add -A && git commit -m "init artifacts" --quiet 2>/dev/null || true
run bash -c "cd '$BATS_TEST_TMPDIR' && '$LIB_DIR/archeflow-git.sh' commit test-run plan 'empty' 2>&1"
[ "$status" -eq 0 ]
[[ "$output" == *"Nothing to commit"* ]]
}
@test "git commit: fails if not on the run branch" {
"$LIB_DIR/archeflow-git.sh" init test-run 2>/dev/null
git checkout main --quiet
run "$LIB_DIR/archeflow-git.sh" commit test-run plan "wrong branch"
[ "$status" -ne 0 ]
[[ "$output" == *"Expected to be on branch"* ]]
}
# --- phase-commit ---
@test "git phase-commit: creates commit with phase transition message" {
"$LIB_DIR/archeflow-git.sh" init test-run 2>/dev/null
mkdir -p .archeflow/events
echo '{"test":true}' > .archeflow/events/test-run.jsonl
"$LIB_DIR/archeflow-git.sh" phase-commit test-run plan 2>/dev/null
local msg
msg=$(git log -1 --format=%s)
# Should contain the phase transition arrow
[[ "$msg" == *"plan"* ]]
[[ "$msg" == *"do"* ]]
}
# --- merge ---
@test "git merge: squash merge is the default strategy" {
"$LIB_DIR/archeflow-git.sh" init test-run 2>/dev/null
mkdir -p .archeflow/events
echo '{"test":true}' > .archeflow/events/test-run.jsonl
"$LIB_DIR/archeflow-git.sh" commit test-run plan "test" 2>/dev/null
"$LIB_DIR/archeflow-git.sh" merge test-run 2>/dev/null
local current
current=$(git branch --show-current)
[ "$current" = "main" ]
local msg
msg=$(git log -1 --format=%s)
[[ "$msg" == *"archeflow run test-run"* ]]
}
@test "git merge: --no-ff creates a merge commit" {
"$LIB_DIR/archeflow-git.sh" init test-run 2>/dev/null
mkdir -p .archeflow/events
echo '{"test":true}' > .archeflow/events/test-run.jsonl
"$LIB_DIR/archeflow-git.sh" commit test-run plan "test" 2>/dev/null
"$LIB_DIR/archeflow-git.sh" merge test-run --no-ff 2>/dev/null
local current
current=$(git branch --show-current)
[ "$current" = "main" ]
# no-ff merge commit should have 2 parents
local parent_count
parent_count=$(git cat-file -p HEAD | grep -c '^parent')
[ "$parent_count" -eq 2 ]
}
@test "git merge: rejects unknown merge strategy" {
"$LIB_DIR/archeflow-git.sh" init test-run 2>/dev/null
mkdir -p .archeflow/events
echo '{"test":true}' > .archeflow/events/test-run.jsonl
"$LIB_DIR/archeflow-git.sh" commit test-run plan "test" 2>/dev/null
run "$LIB_DIR/archeflow-git.sh" merge test-run --fast-forward
[ "$status" -ne 0 ]
[[ "$output" == *"Unknown merge strategy"* ]]
}
@test "git merge: fails with uncommitted changes" {
"$LIB_DIR/archeflow-git.sh" init test-run 2>/dev/null
echo "dirty" > dirty.txt
git add dirty.txt
run "$LIB_DIR/archeflow-git.sh" merge test-run
[ "$status" -ne 0 ]
[[ "$output" == *"Uncommitted changes"* ]]
}
# --- format_message ---
@test "git commit: simple style uses 'phase: msg' format" {
"$LIB_DIR/archeflow-git.sh" init test-run 2>/dev/null
# Create config with simple style
mkdir -p .archeflow
echo "commit_style: simple" > .archeflow/config.yaml
mkdir -p .archeflow/events
echo '{"test":true}' > .archeflow/events/test-run.jsonl
"$LIB_DIR/archeflow-git.sh" commit test-run plan "simple test" 2>/dev/null
local msg
msg=$(git log -1 --format=%s)
[ "$msg" = "plan: simple test" ]
}
# --- status ---
@test "git status: shows branch info for existing run" {
"$LIB_DIR/archeflow-git.sh" init test-run 2>/dev/null
run "$LIB_DIR/archeflow-git.sh" status test-run
[ "$status" -eq 0 ]
[[ "$output" == *"Branch: archeflow/test-run"* ]]
[[ "$output" == *"Base: main"* ]]
}
@test "git status: fails for nonexistent branch" {
run "$LIB_DIR/archeflow-git.sh" status nonexistent
[ "$status" -ne 0 ]
[[ "$output" == *"does not exist"* ]]
}
# --- cleanup ---
@test "git cleanup: fails if currently on the run branch" {
"$LIB_DIR/archeflow-git.sh" init test-run 2>/dev/null
run "$LIB_DIR/archeflow-git.sh" cleanup test-run
[ "$status" -ne 0 ]
[[ "$output" == *"Cannot delete"* ]]
}

tests/archeflow-init.bats

@@ -0,0 +1,81 @@
# Tests for archeflow-init.sh — project initialization from templates.
#
# Validates: usage output, --list, --from (clone), and argument parsing.
setup() {
load test_helper
_common_setup
}
teardown() {
_common_teardown
}
@test "init: shows usage when called with no args" {
run "$LIB_DIR/archeflow-init.sh"
[ "$status" -eq 0 ]
[[ "$output" == *"Usage"* ]]
[[ "$output" == *"bundle-name"* ]]
}
@test "init: --list shows template listing without errors" {
run "$LIB_DIR/archeflow-init.sh" --list
[ "$status" -eq 0 ]
[[ "$output" == *"Templates"* ]]
[[ "$output" == *"Bundles"* ]]
}
@test "init: --from fails when source has no .archeflow dir" {
local source_dir
source_dir=$(mktemp -d)
run "$LIB_DIR/archeflow-init.sh" --from "$source_dir"
[ "$status" -ne 0 ]
[[ "$output" == *"No .archeflow/"* ]]
rm -rf "$source_dir"
}
@test "init: --from clones setup from another project" {
# Create a source project with .archeflow structure
local source_dir
source_dir=$(mktemp -d)
mkdir -p "$source_dir/.archeflow/teams" "$source_dir/.archeflow/workflows"
echo "name: test-team" > "$source_dir/.archeflow/teams/test.yaml"
echo "name: test-workflow" > "$source_dir/.archeflow/workflows/test.yaml"
echo "bundle: test" > "$source_dir/.archeflow/config.yaml"
run "$LIB_DIR/archeflow-init.sh" --from "$source_dir"
[ "$status" -eq 0 ]
[ -f ".archeflow/teams/test.yaml" ]
[ -f ".archeflow/workflows/test.yaml" ]
[ -f ".archeflow/config.yaml" ]
rm -rf "$source_dir"
}
@test "init: --from skips events and artifacts directories" {
local source_dir
source_dir=$(mktemp -d)
mkdir -p "$source_dir/.archeflow/events" "$source_dir/.archeflow/artifacts"
mkdir -p "$source_dir/.archeflow/teams"
echo "name: test" > "$source_dir/.archeflow/teams/t.yaml"
echo '{"test":true}' > "$source_dir/.archeflow/events/run.jsonl"
echo "artifact" > "$source_dir/.archeflow/artifacts/test.txt"
run "$LIB_DIR/archeflow-init.sh" --from "$source_dir"
[ "$status" -eq 0 ]
[ ! -f ".archeflow/events/run.jsonl" ]
[ ! -f ".archeflow/artifacts/test.txt" ]
[[ "$output" == *"skipped events"* ]]
rm -rf "$source_dir"
}
@test "init: rejects unknown options" {
run "$LIB_DIR/archeflow-init.sh" --nonexistent
[ "$status" -ne 0 ]
[[ "$output" == *"Unknown option"* ]]
}
@test "init: --save fails with no .archeflow directory" {
run "$LIB_DIR/archeflow-init.sh" --save test-save
[ "$status" -ne 0 ]
[[ "$output" == *"No .archeflow/"* ]]
}

tests/archeflow-memory.bats

@@ -0,0 +1,227 @@
# Tests for archeflow-memory.sh — cross-run lesson memory management.
#
# Validates: add, list, decay, forget, inject filtering, and JSONL format.
setup() {
load test_helper
_common_setup
}
teardown() {
_common_teardown
}
# --- Usage / error handling ---
@test "memory: exits 1 with usage when called with no args" {
run "$LIB_DIR/archeflow-memory.sh"
[ "$status" -eq 1 ]
[[ "$output" == *"Usage"* ]]
}
@test "memory: exits 1 for unknown command" {
run "$LIB_DIR/archeflow-memory.sh" nonexistent
[ "$status" -eq 1 ]
[[ "$output" == *"Unknown command"* ]]
}
# --- add ---
@test "memory add: creates lessons.jsonl and appends a valid JSONL line" {
run "$LIB_DIR/archeflow-memory.sh" add preference "Always validate inputs"
[ "$status" -eq 0 ]
[ -f ".archeflow/memory/lessons.jsonl" ]
jq empty ".archeflow/memory/lessons.jsonl"
}
@test "memory add: lesson has correct fields" {
"$LIB_DIR/archeflow-memory.sh" add pattern "Guardian misses SQL injection" 2>/dev/null
[ "$(jq -r '.type' .archeflow/memory/lessons.jsonl)" = "pattern" ]
[ "$(jq -r '.description' .archeflow/memory/lessons.jsonl)" = "Guardian misses SQL injection" ]
[ "$(jq -r '.source' .archeflow/memory/lessons.jsonl)" = "user_feedback" ]
[ "$(jq -r '.frequency' .archeflow/memory/lessons.jsonl)" = "1" ]
[ "$(jq -r '.run_id' .archeflow/memory/lessons.jsonl)" = "manual" ]
[ "$(jq -r '.domain' .archeflow/memory/lessons.jsonl)" = "general" ]
}
@test "memory add: generates sequential IDs" {
"$LIB_DIR/archeflow-memory.sh" add pattern "first lesson" 2>/dev/null
"$LIB_DIR/archeflow-memory.sh" add pattern "second lesson" 2>/dev/null
local id1 id2
id1=$(head -1 ".archeflow/memory/lessons.jsonl" | jq -r '.id')
id2=$(tail -1 ".archeflow/memory/lessons.jsonl" | jq -r '.id')
[ "$id1" = "m-001" ]
[ "$id2" = "m-002" ]
}
@test "memory add: generates tags from description" {
"$LIB_DIR/archeflow-memory.sh" add pattern "Guardian misses SQL injection attacks" 2>/dev/null
local tags_count
tags_count=$(head -1 ".archeflow/memory/lessons.jsonl" | jq '.tags | length')
[ "$tags_count" -gt 0 ]
}
@test "memory add: exits 1 when description is missing" {
run "$LIB_DIR/archeflow-memory.sh" add pattern
[ "$status" -eq 1 ]
[[ "$output" == *"Usage"* ]]
}
# --- list ---
@test "memory list: shows message when no lessons exist" {
run bash -c "'$LIB_DIR/archeflow-memory.sh' list 2>&1"
[ "$status" -eq 0 ]
[[ "$output" == *"No lessons"* ]]
}
@test "memory list: shows table header and lesson data" {
"$LIB_DIR/archeflow-memory.sh" add pattern "Test lesson for listing" 2>/dev/null
run "$LIB_DIR/archeflow-memory.sh" list
[ "$status" -eq 0 ]
[[ "$output" == *"ID"* ]]
[[ "$output" == *"Freq"* ]]
[[ "$output" == *"m-001"* ]]
[[ "$output" == *"Test lesson for listing"* ]]
}
# --- decay ---
@test "memory decay: increments runs_since_last_seen" {
"$LIB_DIR/archeflow-memory.sh" add pattern "Decay test lesson" 2>/dev/null
"$LIB_DIR/archeflow-memory.sh" decay 2>/dev/null
local runs_since
runs_since=$(head -1 ".archeflow/memory/lessons.jsonl" | jq '.runs_since_last_seen')
[ "$runs_since" -eq 1 ]
}
@test "memory decay: decrements frequency after 10 runs" {
"$LIB_DIR/archeflow-memory.sh" add pattern "Decay frequency test" 2>/dev/null
# Set frequency=3 and runs_since=9 to trigger decay on next call
local tmp=".archeflow/memory/lessons.jsonl.tmp"
head -1 ".archeflow/memory/lessons.jsonl" | jq -c '.frequency = 3 | .runs_since_last_seen = 9' > "$tmp"
mv "$tmp" ".archeflow/memory/lessons.jsonl"
"$LIB_DIR/archeflow-memory.sh" decay 2>/dev/null
local freq
freq=$(head -1 ".archeflow/memory/lessons.jsonl" | jq '.frequency')
[ "$freq" -eq 2 ]
}
@test "memory decay: archives lesson when frequency reaches 0" {
"$LIB_DIR/archeflow-memory.sh" add pattern "Will be archived" 2>/dev/null
# Set frequency=1 and runs_since=9 to trigger archival
local tmp=".archeflow/memory/lessons.jsonl.tmp"
head -1 ".archeflow/memory/lessons.jsonl" | jq -c '.frequency = 1 | .runs_since_last_seen = 9' > "$tmp"
mv "$tmp" ".archeflow/memory/lessons.jsonl"
"$LIB_DIR/archeflow-memory.sh" decay 2>/dev/null
# Lesson should be gone from lessons file (file should be empty)
local remaining
remaining=$(wc -l < ".archeflow/memory/lessons.jsonl" | tr -d ' ')
[ "$remaining" -eq 0 ]
# And present in archive
[ -f ".archeflow/memory/archive.jsonl" ]
local archived_count
archived_count=$(wc -l < ".archeflow/memory/archive.jsonl" | tr -d ' ')
[ "$archived_count" -eq 1 ]
}
@test "memory decay: does nothing when no lessons exist" {
run "$LIB_DIR/archeflow-memory.sh" decay
[ "$status" -eq 0 ]
}
# --- forget ---
@test "memory forget: moves lesson to archive" {
"$LIB_DIR/archeflow-memory.sh" add pattern "Will forget this" 2>/dev/null
"$LIB_DIR/archeflow-memory.sh" forget m-001 2>/dev/null
# Lessons file should be empty
local remaining
remaining=$(wc -l < ".archeflow/memory/lessons.jsonl" | tr -d ' ')
[ "$remaining" -eq 0 ]
# Archive should have it
[ -f ".archeflow/memory/archive.jsonl" ]
local archived_id
archived_id=$(head -1 ".archeflow/memory/archive.jsonl" | jq -r '.id')
[ "$archived_id" = "m-001" ]
}
@test "memory forget: exits 1 for nonexistent ID" {
"$LIB_DIR/archeflow-memory.sh" add pattern "test" 2>/dev/null
run "$LIB_DIR/archeflow-memory.sh" forget m-999
[ "$status" -eq 1 ]
[[ "$output" == *"not found"* ]]
}
@test "memory forget: exits 1 when no lessons file exists" {
run "$LIB_DIR/archeflow-memory.sh" forget m-001
[ "$status" -eq 1 ]
[[ "$output" == *"No lessons file"* ]]
}
# --- inject ---
@test "memory inject: outputs nothing when no lessons file exists" {
run "$LIB_DIR/archeflow-memory.sh" inject code guardian
[ "$status" -eq 0 ]
[ -z "$output" ]
}
@test "memory inject: outputs relevant lessons with frequency >= 2" {
"$LIB_DIR/archeflow-memory.sh" add pattern "Test injection lesson" 2>/dev/null
# Bump frequency to 2
local tmp=".archeflow/memory/lessons.jsonl.tmp"
jq -c '.frequency = 2' ".archeflow/memory/lessons.jsonl" > "$tmp"
mv "$tmp" ".archeflow/memory/lessons.jsonl"
run "$LIB_DIR/archeflow-memory.sh" inject "" ""
[ "$status" -eq 0 ]
[[ "$output" == *"Known Issues"* ]]
[[ "$output" == *"Test injection lesson"* ]]
}
@test "memory inject: skips lessons with frequency < 2 (except preferences)" {
"$LIB_DIR/archeflow-memory.sh" add pattern "Low frequency lesson" 2>/dev/null
# frequency is 1 by default, type is pattern -> should NOT be injected
run "$LIB_DIR/archeflow-memory.sh" inject "" ""
[ "$status" -eq 0 ]
[ -z "$output" ]
}
@test "memory inject: always injects preferences regardless of frequency" {
"$LIB_DIR/archeflow-memory.sh" add preference "User prefers explicit error messages" 2>/dev/null
run "$LIB_DIR/archeflow-memory.sh" inject "" ""
[ "$status" -eq 0 ]
[[ "$output" == *"User prefers explicit error messages"* ]]
}
# --- extract ---
@test "memory extract: exits 1 when events file not found" {
run "$LIB_DIR/archeflow-memory.sh" extract nonexistent.jsonl
[ "$status" -eq 1 ]
[[ "$output" == *"not found"* ]]
}
@test "memory extract: extracts findings from review.verdict events" {
# Create a mock events file with a review.verdict
mkdir -p .archeflow/events
cat > /tmp/test-events.jsonl <<'EOF'
{"run_id":"test-run","seq":1,"type":"run.start","phase":"plan","data":{"task":"test"}}
{"run_id":"test-run","seq":2,"type":"review.verdict","phase":"check","data":{"archetype":"guardian","verdict":"needs_changes","findings":[{"severity":"warning","description":"Missing input validation on user endpoint","category":"code"}]}}
EOF
run "$LIB_DIR/archeflow-memory.sh" extract /tmp/test-events.jsonl
[ "$status" -eq 0 ]
[ -f ".archeflow/memory/lessons.jsonl" ]
local desc
desc=$(jq -r '.description' ".archeflow/memory/lessons.jsonl")
[[ "$desc" == *"Missing input validation"* ]]
rm -f /tmp/test-events.jsonl
}


@@ -0,0 +1,78 @@
# Tests for archeflow-progress.sh — live progress file generation.
#
# Validates: markdown output structure, JSON mode, missing events handling, exit codes.
setup() {
load test_helper
_common_setup
# Create standard events for progress tests
mkdir -p .archeflow/events
cat > ".archeflow/events/test-run.jsonl" <<'EVENTS'
{"ts":"2026-04-03T10:00:00Z","run_id":"test-run","seq":1,"parent":[],"type":"run.start","phase":"plan","agent":null,"data":{"task":"Build feature","workflow":"standard","team":"default"}}
{"ts":"2026-04-03T10:01:00Z","run_id":"test-run","seq":2,"parent":[1],"type":"agent.complete","phase":"plan","agent":"creator","data":{"archetype":"creator","duration_ms":60000,"tokens":1500,"estimated_cost_usd":0.02,"summary":"Planned"}}
EVENTS
}
@test "progress: exits 1 with usage when called with no args" {
run "$LIB_DIR/archeflow-progress.sh"
[ "$status" -eq 1 ]
[[ "$output" == *"Usage"* ]]
}
@test "progress: exits 1 when events file not found" {
run "$LIB_DIR/archeflow-progress.sh" nonexistent-run
[ "$status" -eq 1 ]
[[ "$output" == *"not found"* ]]
}
@test "progress: default mode generates progress.md" {
run "$LIB_DIR/archeflow-progress.sh" test-run
[ "$status" -eq 0 ]
[ -f ".archeflow/progress.md" ]
[[ "$output" == *"# ArcheFlow Run: test-run"* ]]
[[ "$output" == *"Status:"* ]]
[[ "$output" == *"Progress"* ]]
}
@test "progress: json mode outputs valid JSON" {
run "$LIB_DIR/archeflow-progress.sh" test-run --json
[ "$status" -eq 0 ]
echo "$output" | jq empty
local run_id
run_id=$(echo "$output" | jq -r '.run_id')
[ "$run_id" = "test-run" ]
}
@test "progress: json mode includes completed agents" {
run "$LIB_DIR/archeflow-progress.sh" test-run --json
[ "$status" -eq 0 ]
local completed_count
completed_count=$(echo "$output" | jq '.completed | length')
[ "$completed_count" -eq 1 ]
local agent
agent=$(echo "$output" | jq -r '.completed[0].agent')
[ "$agent" = "creator" ]
}
@test "progress: json mode shows correct phase" {
run "$LIB_DIR/archeflow-progress.sh" test-run --json
[ "$status" -eq 0 ]
local phase
phase=$(echo "$output" | jq -r '.phase')
[ "$phase" = "plan" ]
}
@test "progress: reports error in json when events file missing" {
run "$LIB_DIR/archeflow-progress.sh" missing-run --json
# JSON mode returns the JSON even on error
local error
error=$(echo "$output" | jq -r '.error // empty')
[[ "$error" == *"not found"* ]]
}
@test "progress: rejects unknown flags" {
run "$LIB_DIR/archeflow-progress.sh" test-run --invalid
[ "$status" -eq 1 ]
[[ "$output" == *"Unknown flag"* ]]
}


@@ -0,0 +1,62 @@
# Tests for archeflow-replay.sh — timeline, what-if, and compare modes.
setup() {
load test_helper
_common_setup
mkdir -p .archeflow/events
cat > ".archeflow/events/replay-run.jsonl" <<'EVENTS'
{"ts":"2026-04-03T10:00:00Z","run_id":"replay-run","seq":1,"parent":[],"type":"run.start","phase":"plan","agent":null,"data":{"task":"replay test"}}
{"ts":"2026-04-03T10:05:00Z","run_id":"replay-run","seq":2,"parent":[1],"type":"decision.point","phase":"check","agent":"guardian","data":{"archetype":"guardian","input":"diff","decision":"needs_changes","confidence":0.88}}
{"ts":"2026-04-03T10:06:00Z","run_id":"replay-run","seq":3,"parent":[1],"type":"review.verdict","phase":"check","agent":"guardian","data":{"archetype":"guardian","verdict":"needs_changes","findings":[]}}
{"ts":"2026-04-03T10:07:00Z","run_id":"replay-run","seq":4,"parent":[1],"type":"review.verdict","phase":"check","agent":"sage","data":{"archetype":"sage","verdict":"approved","findings":[]}}
{"ts":"2026-04-03T10:08:00Z","run_id":"replay-run","seq":5,"parent":[1],"type":"run.complete","phase":"act","agent":null,"data":{"agents_total":2,"fixes_total":0}}
EVENTS
}
@test "replay: usage without args" {
run "$LIB_DIR/archeflow-replay.sh"
[ "$status" -eq 1 ]
[[ "$output" == *"Usage"* ]]
}
@test "replay: timeline shows decision.point" {
run "$LIB_DIR/archeflow-replay.sh" timeline replay-run
[ "$status" -eq 0 ]
[[ "$output" == *"decision.point"* ]]
[[ "$output" == *"guardian"* ]]
[[ "$output" == *"needs_changes"* ]]
}
@test "replay: whatif strict blocks when any reviewer blocks" {
run "$LIB_DIR/archeflow-replay.sh" whatif replay-run
[ "$status" -eq 0 ]
[[ "$output" == *"BLOCK"* ]]
}
@test "replay: whatif weighted can ship when blocker is down-weighted" {
run "$LIB_DIR/archeflow-replay.sh" whatif replay-run --weights guardian=0.2,sage=3
[ "$status" -eq 0 ]
[[ "$output" == *"Weighted replay"* ]] || [[ "$output" == *"SHIP"* ]]
[[ "$output" == *"SHIP"* ]]
}
@test "replay: whatif --json is valid JSON" {
run "$LIB_DIR/archeflow-replay.sh" whatif replay-run --json
[ "$status" -eq 0 ]
echo "$output" | jq -e '.run_id == "replay-run"' >/dev/null
}
@test "replay: compare includes timeline and whatif" {
run "$LIB_DIR/archeflow-replay.sh" compare replay-run
[ "$status" -eq 0 ]
[[ "$output" == *"Decision timeline"* ]]
[[ "$output" == *"What-if replay"* ]]
}
@test "decision: logs decision.point via wrapper" {
run "$LIB_DIR/archeflow-decision.sh" replay-run check trickster 'diff only' 'edge_case' 0.61 1
[ "$status" -eq 0 ]
last=$(jq -r 'select(.type=="decision.point") | .data.decision' ".archeflow/events/replay-run.jsonl" | tail -1)
[ "$last" = "edge_case" ]
}


@@ -0,0 +1,80 @@
# Tests for archeflow-report.sh — Markdown process report generation from JSONL events.
#
# Validates: report output format, summary mode, missing file handling, jq dependency check.
setup() {
load test_helper
_common_setup
# Create a standard events file used by multiple tests
mkdir -p .archeflow/events
cat > "$BATS_TEST_TMPDIR/events.jsonl" <<'EVENTS'
{"ts":"2026-04-03T10:00:00Z","run_id":"test-run","seq":1,"parent":[],"type":"run.start","phase":"plan","agent":null,"data":{"task":"Write unit tests","workflow":"standard","team":"default"}}
{"ts":"2026-04-03T10:01:00Z","run_id":"test-run","seq":2,"parent":[1],"type":"agent.complete","phase":"plan","agent":"creator","data":{"archetype":"creator","duration_ms":60000,"tokens":1500,"summary":"Designed test structure"}}
{"ts":"2026-04-03T10:02:00Z","run_id":"test-run","seq":3,"parent":[2],"type":"phase.transition","phase":"do","agent":null,"data":{"from":"plan","to":"do"}}
{"ts":"2026-04-03T10:05:00Z","run_id":"test-run","seq":4,"parent":[3],"type":"agent.complete","phase":"do","agent":"maker","data":{"archetype":"maker","duration_ms":180000,"tokens":3000,"summary":"Implemented tests"}}
{"ts":"2026-04-03T10:06:00Z","run_id":"test-run","seq":5,"parent":[4],"type":"phase.transition","phase":"check","agent":null,"data":{"from":"do","to":"check"}}
{"ts":"2026-04-03T10:07:00Z","run_id":"test-run","seq":6,"parent":[5],"type":"review.verdict","phase":"check","agent":"guardian","data":{"archetype":"guardian","verdict":"approved","findings":[]}}
{"ts":"2026-04-03T10:08:00Z","run_id":"test-run","seq":7,"parent":[6],"type":"run.complete","phase":"act","agent":null,"data":{"status":"completed","cycles":1,"agents_total":3,"fixes_total":0,"duration_ms":480000}}
EVENTS
}
@test "report: exits 1 with usage when called with no args" {
run "$LIB_DIR/archeflow-report.sh"
[ "$status" -eq 1 ]
[[ "$output" == *"Usage"* ]]
}
@test "report: exits 1 when events file not found" {
run "$LIB_DIR/archeflow-report.sh" nonexistent.jsonl
[ "$status" -eq 1 ]
[[ "$output" == *"not found"* ]]
}
@test "report: full mode produces markdown with header and overview" {
run "$LIB_DIR/archeflow-report.sh" "$BATS_TEST_TMPDIR/events.jsonl"
[ "$status" -eq 0 ]
[[ "$output" == *"# Process Report: Write unit tests"* ]]
[[ "$output" == *"test-run"* ]]
[[ "$output" == *"Overview"* ]]
[[ "$output" == *"Status"* ]]
[[ "$output" == *"completed"* ]]
}
@test "report: full mode includes phase sections" {
run "$LIB_DIR/archeflow-report.sh" "$BATS_TEST_TMPDIR/events.jsonl"
[ "$status" -eq 0 ]
[[ "$output" == *"PLAN"* ]]
[[ "$output" == *"DO"* ]]
[[ "$output" == *"CHECK"* ]]
}
@test "report: summary mode outputs one-line summary" {
run "$LIB_DIR/archeflow-report.sh" "$BATS_TEST_TMPDIR/events.jsonl" --summary
[ "$status" -eq 0 ]
# Should be a single logical line with key stats
[[ "$output" == *"[completed]"* ]]
[[ "$output" == *"Write unit tests"* ]]
[[ "$output" == *"1 cycles"* ]]
[[ "$output" == *"test-run"* ]]
}
@test "report: --output writes to file instead of stdout" {
run "$LIB_DIR/archeflow-report.sh" "$BATS_TEST_TMPDIR/events.jsonl" --output "$BATS_TEST_TMPDIR/report.md"
[ "$status" -eq 0 ]
[ -f "$BATS_TEST_TMPDIR/report.md" ]
local content
content=$(cat "$BATS_TEST_TMPDIR/report.md")
[[ "$content" == *"# Process Report"* ]]
}
@test "report: summary for in-progress run shows [in-progress]" {
# Events file without run.complete
cat > "$BATS_TEST_TMPDIR/in-progress.jsonl" <<'EVENTS'
{"ts":"2026-04-03T10:00:00Z","run_id":"wip-run","seq":1,"parent":[],"type":"run.start","phase":"plan","agent":null,"data":{"task":"WIP task","workflow":"fast","team":"default"}}
EVENTS
run "$LIB_DIR/archeflow-report.sh" "$BATS_TEST_TMPDIR/in-progress.jsonl" --summary
[ "$status" -eq 0 ]
[[ "$output" == *"[in-progress]"* ]]
[[ "$output" == *"WIP task"* ]]
}
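The summary assertions above pin down what data the one-line summary must carry: status (or `[in-progress]` when `run.complete` is absent), task, cycle count, and run id, all recoverable from the first and last events. A sketch of that extraction with jq (illustrative only; the exact line format produced by `archeflow-report.sh` is not reproduced here):

```shell
# Pull the summary fields from run.start / run.complete events in a
# JSONL file, falling back to "in-progress" when no run.complete exists.
events=$(mktemp)
cat > "$events" <<'EOF'
{"type":"run.start","run_id":"test-run","data":{"task":"Write unit tests"}}
{"type":"run.complete","run_id":"test-run","data":{"status":"completed","cycles":1,"agents_total":3}}
EOF
status=$(jq -r 'select(.type=="run.complete") | .data.status' "$events" | tail -1)
task=$(jq -r 'select(.type=="run.start") | .data.task' "$events")
cycles=$(jq -r 'select(.type=="run.complete") | .data.cycles' "$events" | tail -1)
summary="[${status:-in-progress}] $task ($cycles cycles)"
rm -f "$events"
echo "$summary"
```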

# Tests for archeflow-review.sh — git diff extraction for code review.
#
# Validates: argument parsing, diff modes, stats output, empty diff handling.
setup() {
load test_helper
_common_setup
}
teardown() {
_common_teardown
}
@test "review: --help shows usage" {
run "$LIB_DIR/archeflow-review.sh" --help
[ "$status" -eq 0 ]
[[ "$output" == *"Usage"* ]]
[[ "$output" == *"--branch"* ]]
[[ "$output" == *"--commit"* ]]
}
@test "review: exits 1 when no changes to review" {
run "$LIB_DIR/archeflow-review.sh"
[ "$status" -eq 1 ]
[[ "$output" == *"No changes"* ]]
}
@test "review: shows diff for uncommitted changes" {
echo "new content" > testfile.txt
git add testfile.txt
run "$LIB_DIR/archeflow-review.sh"
[ "$status" -eq 0 ]
[[ "$output" == *"testfile.txt"* ]]
}
@test "review: --stat-only prints stats without diff content" {
echo "stat content" > statfile.txt
git add statfile.txt
run "$LIB_DIR/archeflow-review.sh" --stat-only
[ "$status" -eq 0 ]
# Stats go to stderr and stdout stays empty (no diff content), but bats'
# `run` merges both streams into $output, so assert on the combined text
[[ "$output" == *"Review Stats"* ]]
}
@test "review: --branch fails for nonexistent branch" {
run "$LIB_DIR/archeflow-review.sh" --branch nonexistent-branch-xyz
[ "$status" -ne 0 ]
[[ "$output" == *"not found"* ]]
}
@test "review: rejects unknown arguments" {
run "$LIB_DIR/archeflow-review.sh" --unknown
[ "$status" -ne 0 ]
[[ "$output" == *"Unknown argument"* ]]
}
@test "review: --branch shows diff against base" {
# Create a feature branch with changes
git checkout -b feat/test-review --quiet
echo "feature" > feature.txt
git add feature.txt
git commit -m "feat: add feature" --quiet
git checkout main --quiet
run "$LIB_DIR/archeflow-review.sh" --branch feat/test-review
[ "$status" -eq 0 ]
[[ "$output" == *"feature.txt"* ]]
}
@test "review: --commit shows diff for commit range" {
echo "first" > first.txt
git add first.txt
git commit -m "first" --quiet
echo "second" > second.txt
git add second.txt
git commit -m "second" --quiet
run "$LIB_DIR/archeflow-review.sh" --commit HEAD~1..HEAD
[ "$status" -eq 0 ]
[[ "$output" == *"second.txt"* ]]
}

# Tests for archeflow-rollback.sh — post-merge test and phase rollback.
#
# Validates: argument parsing, mutual exclusivity, phase validation, test-cmd config reading.
setup() {
load test_helper
_common_setup
}
teardown() {
_common_teardown
}
@test "rollback: exits with error when called with no args" {
run "$LIB_DIR/archeflow-rollback.sh"
[ "$status" -ne 0 ]
}
@test "rollback: rejects mutually exclusive --to and --test-cmd" {
run "$LIB_DIR/archeflow-rollback.sh" test-run --to plan --test-cmd "true"
[ "$status" -eq 2 ]
[[ "$output" == *"mutually exclusive"* ]]
}
@test "rollback: rejects invalid phase names" {
run "$LIB_DIR/archeflow-rollback.sh" test-run --to invalid-phase
[ "$status" -eq 2 ]
[[ "$output" == *"Invalid phase"* ]]
}
@test "rollback: accepts valid phase names (plan, do, check)" {
# This will fail because no git branch exists, but should NOT fail on phase validation
run "$LIB_DIR/archeflow-rollback.sh" test-run --to plan
# Should fail later (archeflow-git.sh rollback) not on phase validation
[[ "$output" != *"Invalid phase"* ]]
}
@test "rollback: exits 2 when no test command available" {
run "$LIB_DIR/archeflow-rollback.sh" test-run
[ "$status" -eq 2 ]
[[ "$output" == *"No test command"* ]]
}
@test "rollback: reads test_command from config.yaml" {
mkdir -p .archeflow
echo 'test_command: "echo ok"' > .archeflow/config.yaml
# HEAD won't have archeflow in its message, but the script just warns and proceeds
run "$LIB_DIR/archeflow-rollback.sh" test-run
# It should pick up the command and try to run it (test should pass -> exit 0)
[ "$status" -eq 0 ]
[[ "$output" == *"Tests passed"* ]]
}
@test "rollback: rejects unknown options" {
run "$LIB_DIR/archeflow-rollback.sh" test-run --unknown-flag
[ "$status" -eq 2 ]
[[ "$output" == *"Unknown option"* ]]
}
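The config test above relies on `archeflow-rollback.sh` reading `test_command` out of `.archeflow/config.yaml`. One way a shell script might do that without a YAML parser is a naive key extraction (an assumption for illustration; the real script may parse the file differently):

```shell
# Naive test_command extraction: strip the "test_command:" key with sed,
# drop surrounding quotes with tr, then run the resulting command.
dir=$(mktemp -d)
echo 'test_command: "echo ok"' > "$dir/config.yaml"
test_cmd=$(sed -n 's/^test_command:[[:space:]]*//p' "$dir/config.yaml" | tr -d '"')
result=$(eval "$test_cmd")
rm -rf "$dir"
echo "$result"
```

For the fixture value `test_command: "echo ok"` this yields `echo ok` and running it prints `ok`, which is why the config test expects exit 0 and a "Tests passed" message.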

tests/archeflow-score.bats
# Tests for archeflow-score.sh — archetype effectiveness scoring.
#
# Validates: score extraction from events, report generation, input validation.
setup() {
load test_helper
_common_setup
# Create a complete run events file with review data
mkdir -p .archeflow/events .archeflow/memory
cat > "$BATS_TEST_TMPDIR/scored-events.jsonl" <<'EVENTS'
{"ts":"2026-04-03T10:00:00Z","run_id":"score-run","seq":1,"parent":[],"type":"run.start","phase":"plan","agent":null,"data":{"task":"Score test"}}
{"ts":"2026-04-03T10:01:00Z","run_id":"score-run","seq":2,"parent":[1],"type":"agent.complete","phase":"plan","agent":"creator","data":{"archetype":"creator","duration_ms":60000,"tokens":1500,"estimated_cost_usd":0.02}}
{"ts":"2026-04-03T10:02:00Z","run_id":"score-run","seq":3,"parent":[2],"type":"agent.complete","phase":"do","agent":"maker","data":{"archetype":"maker","duration_ms":120000,"tokens":3000,"estimated_cost_usd":0.05}}
{"ts":"2026-04-03T10:03:00Z","run_id":"score-run","seq":4,"parent":[3],"type":"review.verdict","phase":"check","agent":"guardian","data":{"archetype":"guardian","verdict":"needs_changes","findings":[{"severity":"warning","description":"Missing validation","fix_required":true},{"severity":"info","description":"Consider logging","fix_required":false}]}}
{"ts":"2026-04-03T10:03:30Z","run_id":"score-run","seq":5,"parent":[3],"type":"review.verdict","phase":"check","agent":"sage","data":{"archetype":"sage","verdict":"approved","findings":[]}}
{"ts":"2026-04-03T10:04:00Z","run_id":"score-run","seq":6,"parent":[4],"type":"fix.applied","phase":"act","agent":null,"data":{"source":"guardian","finding":"Missing validation"}}
{"ts":"2026-04-03T10:05:00Z","run_id":"score-run","seq":7,"parent":[6],"type":"cycle.boundary","phase":"act","agent":null,"data":{"cycle":1,"max_cycles":3,"met":true,"next_action":"merge"}}
{"ts":"2026-04-03T10:06:00Z","run_id":"score-run","seq":8,"parent":[7],"type":"run.complete","phase":"act","agent":null,"data":{"status":"completed","cycles":1,"agents_total":4,"fixes_total":1}}
EVENTS
}
@test "score: exits 1 with usage when called with no args" {
run "$LIB_DIR/archeflow-score.sh"
[ "$status" -eq 1 ]
[[ "$output" == *"Usage"* ]]
}
@test "score: exits 1 for unknown command" {
run "$LIB_DIR/archeflow-score.sh" nonexistent
[ "$status" -eq 1 ]
[[ "$output" == *"Unknown command"* ]]
}
@test "score extract: exits 1 when events file not found" {
run "$LIB_DIR/archeflow-score.sh" extract nonexistent.jsonl
[ "$status" -eq 1 ]
[[ "$output" == *"not found"* ]]
}
@test "score extract: exits 1 for incomplete run (no run.complete)" {
cat > "$BATS_TEST_TMPDIR/incomplete.jsonl" <<'EVENTS'
{"ts":"2026-04-03T10:00:00Z","run_id":"incomplete","seq":1,"parent":[],"type":"run.start","phase":"plan","agent":null,"data":{"task":"Incomplete"}}
EVENTS
run "$LIB_DIR/archeflow-score.sh" extract "$BATS_TEST_TMPDIR/incomplete.jsonl"
[ "$status" -eq 1 ]
[[ "$output" == *"run.complete"* ]]
}
@test "score extract: creates effectiveness.jsonl with archetype scores" {
run "$LIB_DIR/archeflow-score.sh" extract "$BATS_TEST_TMPDIR/scored-events.jsonl"
[ "$status" -eq 0 ]
[ -f ".archeflow/memory/effectiveness.jsonl" ]
# Should have scores for guardian and sage (the reviewers)
local guardian_score
guardian_score=$(grep '"guardian"' ".archeflow/memory/effectiveness.jsonl" | head -1)
[ -n "$guardian_score" ]
# Verify JSONL is valid
while IFS= read -r line; do
echo "$line" | jq empty
done < ".archeflow/memory/effectiveness.jsonl"
}
@test "score extract: guardian has correct finding counts" {
"$LIB_DIR/archeflow-score.sh" extract "$BATS_TEST_TMPDIR/scored-events.jsonl" 2>/dev/null
local guardian
guardian=$(grep '"guardian"' ".archeflow/memory/effectiveness.jsonl" | head -1)
local total_findings
total_findings=$(echo "$guardian" | jq '.findings_total')
[ "$total_findings" -eq 2 ]
local useful_findings
useful_findings=$(echo "$guardian" | jq '.findings_useful')
[ "$useful_findings" -eq 1 ]
local fixes
fixes=$(echo "$guardian" | jq '.fixes_applied')
[ "$fixes" -eq 1 ]
}
@test "score extract: composite score is between 0 and 1" {
"$LIB_DIR/archeflow-score.sh" extract "$BATS_TEST_TMPDIR/scored-events.jsonl" 2>/dev/null
while IFS= read -r line; do
local score
score=$(echo "$line" | jq '.composite_score')
# score >= 0 and score <= 1
[ "$(echo "$score >= 0" | bc)" -eq 1 ]
[ "$(echo "$score <= 1" | bc)" -eq 1 ]
done < ".archeflow/memory/effectiveness.jsonl"
}
@test "score report: exits 1 when no effectiveness data" {
run "$LIB_DIR/archeflow-score.sh" report
[ "$status" -eq 1 ]
[[ "$output" == *"No effectiveness data"* ]]
}
@test "score report: outputs markdown table with archetype data" {
"$LIB_DIR/archeflow-score.sh" extract "$BATS_TEST_TMPDIR/scored-events.jsonl" 2>/dev/null
run "$LIB_DIR/archeflow-score.sh" report
[ "$status" -eq 0 ]
[[ "$output" == *"Archetype Effectiveness Report"* ]]
[[ "$output" == *"Archetype"* ]]
[[ "$output" == *"guardian"* ]]
}
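The range test above only constrains `composite_score` to [0, 1]; it does not fix the formula. One plausible composite, purely illustrative (the real extractor defines its own weighting), blends finding precision with whether fixes were applied and clamps the result:

```shell
# Hypothetical composite: equal blend of finding precision
# (findings_useful / findings_total) and a fixes-applied indicator,
# clamped to [0,1]. Field names follow the jq assertions in the tests.
rec='{"archetype":"guardian","findings_total":2,"findings_useful":1,"fixes_applied":1}'
score=$(echo "$rec" | jq '
  (.findings_useful / (.findings_total // 1)) as $precision
  | ([.fixes_applied, 1] | min) as $fixed
  | (0.5 * $precision + 0.5 * $fixed)
  | if . > 1 then 1 elif . < 0 then 0 else . end')
echo "$score"
```

For the guardian fixture (2 findings, 1 useful, 1 fix applied) this gives 0.5 × 0.5 + 0.5 × 1 = 0.75, safely inside the asserted range.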

tests/test_helper.bash
# test_helper.bash — Shared setup/teardown for ArcheFlow bats tests.
#
# Usage in .bats files:
# setup() { load test_helper; _common_setup; }
# teardown() { _common_teardown; }
#
# Provides:
# - BATS_TEST_TMPDIR: unique temp directory per test
# - Mock .archeflow/ structure via a git repo
# - LIB_DIR: path to the lib/ scripts under test
LIB_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/../lib" && pwd)"
_common_setup() {
# Create a unique temp directory for this test
BATS_TEST_TMPDIR="$(mktemp -d)"
export BATS_TEST_TMPDIR
# Work inside the temp dir so scripts create .archeflow/ there
cd "$BATS_TEST_TMPDIR"
# Initialize a minimal git repo (many scripts need it)
git init --quiet
git config user.email "test@test.com"
git config user.name "Test User"
# Disable commit signing in tests (global config may have it enabled)
git config commit.gpgsign false
git config tag.gpgsign false
# Create an initial commit so HEAD exists
echo "init" > README.md
git add README.md
git commit -m "init" --quiet
}
_common_teardown() {
# Return to a safe directory before cleanup
cd /tmp
rm -rf "$BATS_TEST_TMPDIR"
}
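The fixture `_common_setup` builds can be reproduced in plain shell when debugging a script under test outside of bats, a sketch of the same steps: isolated temp repo, signing disabled so global config cannot interfere, one commit so HEAD resolves:

```shell
# Standalone replay of the _common_setup fixture (not part of the helper).
tmp=$(mktemp -d)
cd "$tmp"
git init --quiet
git config user.email "test@test.com"
git config user.name "Test User"
git config commit.gpgsign false
git config tag.gpgsign false
echo "init" > README.md
git add README.md
git commit -m "init" --quiet
# HEAD should now resolve, which is what scripts under test depend on
head_ok=$(git rev-parse --verify HEAD >/dev/null 2>&1 && echo yes || echo no)
cd /tmp && rm -rf "$tmp"
echo "$head_ok"
```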