Files

Christian Nennemann 55dde5f07a docs: add ArcheFlow roadmap v0.9-v0.12

2026-04-06 23:08:11 +02:00

9.1 KiB

Raw Blame History

ArcheFlow Roadmap — From Framework to Tool

Status: Planning (2026-04-06) Context: v0.8.0 shipped — consolidated skills, corrective action framework, 110 tests. The scaffolding is solid. Now make it genuinely useful.

Guiding Principle

Every feature must close a feedback loop or remove friction. No features that add complexity without measurable improvement in either speed, cost, or quality.

Tier 1: Make the Sprint Runner Smart (highest impact)

1.1 Queue from Git Issues

Problem: Manual queue.json is the biggest friction point. Nobody wants to maintain a JSON file by hand.

Solution: ./scripts/ws sync-issues that:

Reads Gitea/GitHub issues via API (gh issue list or Gitea REST)
Maps labels to priority: P0=critical/blocker, P1=high, P2=medium, P3=low/enhancement
Maps labels to estimate: size/S, size/M, size/L, size/XL (default: M)
Extracts depends_on from "blocks #N" / "depends on #N" in issue body
Upserts into queue.json (doesn't overwrite manual edits, merges by issue ID)
Skips issues with wontfix, duplicate, question labels

Scope: One script in scripts/, ~100 lines. Gitea API + GitHub API (detect from remote URL). Needs API token in env var GITEA_TOKEN or GITHUB_TOKEN.

Test: bats tests with mock API responses (curl fixture files).

1.2 Cost Estimation

Problem: Users don't know what a sprint will cost before running it.

Solution: /af-sprint --dry-run shows estimated cost:

Sprint estimate: 7 tasks, ~18 agents, est. $1.20-$2.40, ~12 minutes
  P1: writing.colette fanout (L) — est. $0.50, 4 agents
  P1: tool.archeflow review (M) — est. $0.15, 2 agents
  ...
Proceed? [y/n]

How: Track actual token counts per task size (S/M/L/XL) in .archeflow/memory/cost-history.jsonl. After 5+ tasks per size bucket, use median. Before that, use defaults: S=$0.05, M=$0.15, L=$0.50, XL=$1.50.

Scope: Update sprint skill with estimation section. Add cost logging to archeflow-event.sh (include tokens_used in agent.complete data). New script lib/archeflow-cost.sh for estimation.

1.3 Smart Workflow Selection

Problem: Current auto-selection uses keyword matching ("fix" -> pipeline). This is crude.

Solution: Analyze the actual task + codebase signals:

Signal	Source	Workflow
Files matching `auth	crypto	secret
Public API changes (OpenAPI spec modified, exported functions changed)	git diff	-> thorough
<3 files changed, all in same dir	git diff	-> fast/pipeline
Test files only	git diff	-> pipeline
Historical: this project's last 3 runs needed 0 cycles	memory	-> fast
Historical: this project's last run had 2+ CRITICALs	memory	-> thorough

Scope: Add to the run skill's Strategy Selection section. Read git diff stats + memory lessons before choosing. ~20 lines of logic replacing the current keyword table.

Tier 2: Close the Learning Loop

2.1 Confidence Calibration

Problem: Creator's confidence scores (0.0-1.0) are self-reported and uncalibrated. A Creator that always says 0.8 but gets rejected 40% of the time is not useful.

Solution: After each run.complete, log calibration data:

{"run_id":"...","creator_confidence":{"task":0.8,"solution":0.7,"risk":0.6},"actual_outcome":"rejected","cycles":2,"criticals":1}

At run start, inject calibration context into Creator prompt:

Your historical calibration: You rate task understanding at 0.8 avg,
but 35% of runs with that score needed cycle-back. Consider scoring
more conservatively.

Scope: New field in archeflow-memory.sh calibration store. ~30 lines in run skill to log + inject. Needs 5+ runs before meaningful.

2.2 Archetype Auto-Tuning

Problem: The effectiveness scoring system exists (archeflow-score.sh) but nothing acts on it.

Solution: After 10+ runs, auto-generate recommendations:

Archetype Recommendations (based on 15 runs):
  Guardian: essential (caught real issues in 80% of runs)
  Sage: keep (useful findings in 60% of runs)
  Skeptic: demote to thorough-only (useful in 20%, mostly INFO)
  Trickster: keep for thorough (caught 2 bugs Guardian missed)

Add to /af-score output. Store recommendation in config as reviewers.recommended:

reviewers:
  recommended:
    always: [guardian]
    default: [sage]
    thorough_only: [skeptic, trickster]
  # Auto-generated 2026-04-06 from 15 runs. Override with explicit config.

Scope: Update archeflow-score.sh with recommendation logic. Update run skill to read recommended config. Add to af-score skill display.

2.3 Campaign Memory

Problem: Related runs (e.g., "harden all API endpoints") don't share context.

Solution: Optional --campaign <id> flag on /af-run:

Links runs under a campaign ID
Cross-run context: "In Run 1, we found the auth pattern uses middleware X. In Run 2, the same pattern applies."
Campaign-level progress: "3/8 endpoints hardened, 2 CRITICALs remaining"
Campaign memory injected into Explorer/Creator prompts

Scope: New field in event schema. Campaign index in .archeflow/campaigns/. Update memory injection to filter by campaign. ~50 lines in run skill.

Tier 3: Integrate with Real Workflow

3.1 Findings as PR Comments

Problem: Review findings live in .archeflow/artifacts/. Nobody reads artifact files — they read PR comments.

Solution: After Check phase, if a PR exists for the branch:

# Post each CRITICAL/WARNING as a PR review comment
gh api repos/{owner}/{repo}/pulls/{pr}/comments \
  --field body="🛡️ **Guardian** [CRITICAL/security]\n\n${description}\n\nSuggested fix: ${fix}" \
  --field path="${file}" --field line="${line}"

Scope: New --pr <number> flag on /af-run and /af-review. Script lib/archeflow-pr.sh for posting comments. Falls back gracefully if no PR or no API token.

3.2 CI Hook Mode

Problem: ArcheFlow runs manually. It should run automatically on PRs.

Solution: Lightweight CI integration:

# .github/workflows/archeflow-review.yml (or Gitea equivalent)
on: pull_request
jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: claude --plugin-dir ./archeflow -p "/af-review --branch ${{ github.head_ref }} --pr ${{ github.event.number }}"

Only runs Guardian (fast, cheap). Posts findings as PR comments. No PDCA overhead.

Scope: Template workflow file in examples/ci/. Update review skill to support --pr flag. Documentation.

3.3 Watch Mode

Problem: You have to remember to run /af-review after pushing.

Solution: /af-watch — background process that monitors a branch:

Uses git log --since polling (every 60s)
On new commits: auto-run /af-review on the diff
Posts findings as PR comments if PR exists
Respects budget gate from corrective action framework

Scope: New skill af-watch/SKILL.md (~30 lines). Uses the loop skill infrastructure. Low priority — CI hook mode covers most use cases.

Tier 4: Replay and Analysis

4.1 Decision Journal

Problem: No visibility into why ArcheFlow made specific choices during a run.

Solution: Already started with archeflow-decision.sh and archeflow-replay.sh. Extend:

Log every decision point: workflow selection, A1/A2/A3 triggers, fix routing, shadow detections
/af-replay <run_id> --timeline shows the decision chain
/af-replay <run_id> --whatif --workflow thorough simulates: "What would thorough have found?"

Scope: Mostly built. Needs integration into the run skill (emit decision.point events at each choice). The replay script needs the what-if simulation logic.

4.2 Run Comparison

Problem: No way to evaluate whether workflow X is better than workflow Y for a project.

Solution: /af-replay compare <run_a> <run_b>:

Run A (standard, 4m30s, $0.80): 5 findings, 4 resolved, 1 INFO remaining
Run B (thorough, 12m, $2.10):   7 findings, 6 resolved, 1 INFO remaining
Delta: +2 findings (both INFO), +165% cost, +167% time
Verdict: Standard was sufficient for this task.

Scope: Update archeflow-replay.sh with comparison mode. Needs at least 2 runs on similar tasks.

Implementation Order

v0.9.0 — Sprint Intelligence
  1.1 Queue from issues
  1.2 Cost estimation
  1.3 Smart workflow selection

v0.10.0 — Learning Loop
  2.1 Confidence calibration
  2.2 Archetype auto-tuning
  2.3 Campaign memory

v0.11.0 — Integration
  3.1 Findings as PR comments
  3.2 CI hook mode
  3.3 Watch mode (stretch)

v0.12.0 — Analysis
  4.1 Decision journal (mostly done)
  4.2 Run comparison

Each version is independently shippable. No version depends on a later one.

What NOT to Build

Web dashboard — Terminal is the interface. Don't add a server.
Embedding-based memory — Keyword matching works. Don't add vector DBs.
Agent marketplace — Focus on the 7 built-in archetypes being excellent.
Multi-user collaboration — ArcheFlow is a single-user tool. Git is the collaboration layer.
Plugin system for plugins — ArcheFlow IS a plugin. Don't go meta.

9.1 KiB Raw Blame History