feat: add memory, convergence, colette bridge, templates, progress, effectiveness, git integration
- skills/memory: cross-run learning from recurring findings + lib/archeflow-memory.sh
- skills/convergence: oscillation detection + early termination in multi-cycle runs
- skills/colette-bridge: auto-inject voice profiles, personas, characters from colette.yaml
- skills/templates: workflow/team/archetype gallery with init/save/share
- skills/progress: live .archeflow/progress.md during runs
- skills/effectiveness: per-archetype signal-to-noise + cost efficiency scoring
- skills/git-integration: auto-branch per run, commit per phase, rollback support
skills/effectiveness/SKILL.md (new file, 200 lines)
---
name: effectiveness
description: |
  Track archetype effectiveness across runs. Scores each archetype on signal-to-noise,
  fix rate, cost efficiency, accuracy, and cycle impact. Recommends model tier changes
  and archetype removal based on rolling averages.
  <example>User: "Which reviewers are actually useful?"</example>
  <example>User: "Show archetype effectiveness report"</example>
---

# Agent Effectiveness Scoring

Track which archetypes are most useful and which waste tokens. Over multiple runs, build a profile of each archetype's effectiveness and use it to optimize team composition and model selection.

## Storage

```
.archeflow/memory/effectiveness.jsonl   # Per-run archetype scores (append-only)
```

## Scoring Dimensions

For each archetype that participates in a run, calculate these scores:

| Dimension | How Measured | Weight |
|-----------|--------------|--------|
| **Signal-to-noise** | useful findings / total findings | 0.30 |
| **Fix rate** | findings that led to actual fixes / total findings | 0.25 |
| **Cost efficiency** | useful findings per dollar spent | 0.20 |
| **Accuracy** | findings not contradicted by other reviewers | 0.15 |
| **Cycle impact** | did this archetype's findings lead to cycle exit? | 0.10 |

### Definitions

- **Useful finding**: A finding in a `review.verdict` event with `severity >= WARNING` (i.e., severity is `warning`, `bug`, or `critical`) AND `fix_required == true`.
- **Actual fix**: A `fix.applied` event whose `source` field matches this archetype (or whose DAG `parent` chain traces back to this archetype's `review.verdict` event).
- **Contradicted finding**: Another reviewer's `review.verdict` has `verdict == "approved"` for the same scope where this archetype flagged an issue. Approximation: if archetype A flags N findings but archetype B approves the same code with 0 findings in overlapping severity categories, A's unmatched findings are considered potentially contradicted.
- **Cycle impact**: The archetype's findings (with `fix_required == true`) resulted in fixes that were part of the final approved cycle. Determined by checking if `fix.applied` events referencing this archetype exist before the final `cycle.boundary` with `met == true`.

### Composite Score

```
composite = (signal_to_noise * 0.30)
          + (fix_rate * 0.25)
          + (cost_efficiency_normalized * 0.20)
          + (accuracy * 0.15)
          + (cycle_impact * 0.10)
```

**Cost efficiency normalization**: Raw cost efficiency is `useful_findings / cost_usd`. To normalize it to the 0-1 range, use `min(1.0, raw_efficiency / 100)`. The threshold of 100 means "100 useful findings per dollar" is considered perfect efficiency (achievable with haiku on structured reviews).
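
The formula and normalization above can be sketched as a small shell helper (a sketch only; `awk` does the floating-point math, and the argument order is this example's assumption, not the script's actual interface):

```shell
# composite_score S2N FIX_RATE RAW_COST_EFF ACCURACY CYCLE_IMPACT(0|1)
# Weights and the /100 normalization come from the tables above.
composite_score() {
  awk -v s2n="$1" -v fr="$2" -v ce="$3" -v acc="$4" -v ci="$5" 'BEGIN {
    norm = ce / 100                 # normalize raw cost efficiency
    if (norm > 1) norm = 1          # min(1.0, raw_efficiency / 100)
    printf "%.2f\n", s2n*0.30 + fr*0.25 + norm*0.20 + acc*0.15 + ci*0.10
  }'
}

composite_score 0.80 0.75 30 0.90 1   # prints 0.72
```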

## Per-Run Scoring

After `run.complete`, calculate scores for each archetype that participated. The `extract` command does this.

### Per-Run Score Record

```jsonl
{"ts":"2026-04-03T16:00:00Z","run_id":"2026-04-03-der-huster","archetype":"guardian","signal_to_noise":0.85,"fix_rate":1.0,"cost_efficiency":42.5,"accuracy":1.0,"cycle_impact":true,"composite_score":0.91,"tokens":5000,"cost_usd":0.004,"model":"haiku","findings_total":4,"findings_useful":3,"fixes_applied":3}
```

Appended to `.archeflow/memory/effectiveness.jsonl`.
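
A minimal sketch of the append step (the abbreviated record is illustrative; a real one carries all the fields shown above). Writing one compact JSON object per line keeps the file valid JSONL:

```shell
# Append one (abbreviated, illustrative) score record to the store.
mkdir -p .archeflow/memory
record='{"ts":"2026-04-03T16:00:00Z","archetype":"guardian","composite_score":0.91}'
printf '%s\n' "$record" >> .archeflow/memory/effectiveness.jsonl
```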

### Scoring Non-Review Archetypes

Only archetypes that produce `review.verdict` events are scored (Guardian, Skeptic, Sage, Trickster, and any custom review archetypes). Non-review archetypes (Explorer, Creator, Maker) are tracked by cost-tracking but not effectiveness-scored, because their output quality is measured differently (by whether the run succeeds, not by individual findings).

## Aggregate Scoring

Across all runs, maintain rolling averages (computed on demand, not stored):

```jsonl
{"archetype":"guardian","runs":12,"avg_composite":0.88,"avg_signal_noise":0.82,"avg_cost_efficiency":38.2,"trend":"stable","recommendation":"keep"}
{"archetype":"trickster","runs":8,"avg_composite":0.35,"avg_signal_noise":0.20,"avg_cost_efficiency":5.1,"trend":"declining","recommendation":"consider_removing"}
```

### Trend Calculation

Compare the average composite score of the last 5 runs to the 5 runs before that:

- **improving**: last-5 avg > prior-5 avg + 0.05
- **declining**: last-5 avg < prior-5 avg - 0.05
- **stable**: within +/- 0.05

If fewer than 10 runs exist, trend is `"insufficient_data"`.
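
The trend rule above can be sketched as a shell filter (a sketch; it reads one composite score per line, oldest first):

```shell
# trend: compare the last-5 average to the prior-5 average, per the
# thresholds above; fewer than 10 scores yields "insufficient_data".
trend() {
  awk '
    { s[NR] = $1 }
    END {
      if (NR < 10) { print "insufficient_data"; exit }
      for (i = NR - 4; i <= NR;     i++) last  += s[i]
      for (i = NR - 9; i <= NR - 5; i++) prior += s[i]
      d = last / 5 - prior / 5
      if      (d >  0.05) print "improving"
      else if (d < -0.05) print "declining"
      else                print "stable"
    }'
}

printf '%s\n' 0.60 0.62 0.58 0.61 0.60 0.70 0.72 0.71 0.74 0.73 | trend   # improving
```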

### Recommendations

Based on aggregate composite scores:

| Composite Score | Recommendation | Meaning |
|-----------------|----------------|---------|
| >= 0.70 | `keep` | Archetype is valuable, contributes meaningful findings |
| 0.40 - 0.69 | `optimize` | Consider cheaper model or tighter review lens |
| < 0.40 | `consider_removing` | Might be wasting tokens, review whether it adds value |
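
The threshold table maps directly onto a small helper (a sketch using the cutoffs above):

```shell
# recommend COMPOSITE — map an aggregate composite score to a recommendation.
recommend() {
  awk -v s="$1" 'BEGIN {
    if      (s >= 0.70) print "keep"
    else if (s >= 0.40) print "optimize"
    else                print "consider_removing"
  }'
}

recommend 0.88   # keep
recommend 0.45   # optimize
recommend 0.35   # consider_removing
```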

## Integration Points

### At Run Start

When the `run` skill initializes, show a brief effectiveness summary for the team's archetypes:

```
Archetype effectiveness (last 10 runs):
  guardian: 0.88 (keep) — haiku, $0.004/run avg
  sage: 0.72 (keep) — sonnet, $0.08/run avg
  skeptic: 0.45 (optimize) — haiku, $0.003/run avg
  trickster: 0.32 (consider_removing) — haiku, $0.003/run avg
```

### Model Tier Suggestions

Cross-reference effectiveness with model assignment:

- **High effectiveness on cheap model** (composite >= 0.7, model = haiku): "Keep cheap. Working well."
- **Low effectiveness on cheap model** (composite < 0.5, model = haiku): "Consider upgrading to sonnet — cheap model may not be capturing issues."
- **High effectiveness on expensive model** (composite >= 0.7, model = sonnet): "Try downgrading to haiku — may maintain quality at lower cost."
- **Low effectiveness on expensive model** (composite < 0.5, model = sonnet): "Consider removing — expensive and not contributing."
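
The four rules above can be sketched as one lookup (output wording is illustrative; note the rules leave scores in the 0.5-0.7 band without a suggestion):

```shell
# tier_suggestion COMPOSITE MODEL — cross-reference score with model tier.
tier_suggestion() {
  awk -v s="$1" -v m="$2" 'BEGIN {
    if (m == "haiku"  && s >= 0.7) print "keep cheap"
    if (m == "haiku"  && s <  0.5) print "upgrade to sonnet"
    if (m == "sonnet" && s >= 0.7) print "try downgrading to haiku"
    if (m == "sonnet" && s <  0.5) print "consider removing"
  }'
}

tier_suggestion 0.88 haiku    # keep cheap
tier_suggestion 0.45 sonnet   # consider removing
```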

### Cost-Tracking Integration

Divide effectiveness by estimated cost to get "value per dollar":

```
value_per_dollar = composite_score / cost_usd
```

This metric helps compare archetypes directly: a cheap archetype with moderate effectiveness may have a higher value_per_dollar than an expensive one with high effectiveness.
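
Using the run-start summary's numbers above (guardian 0.88 at $0.004/run, sage 0.72 at $0.08/run), a quick check of the metric:

```shell
# value_per_dollar COMPOSITE COST_USD
value_per_dollar() {
  awk -v s="$1" -v c="$2" 'BEGIN { printf "%.1f\n", s / c }'
}

value_per_dollar 0.88 0.004   # guardian: 220.0
value_per_dollar 0.72 0.08    # sage: 9.0
```

Despite sage's solid score, guardian's value_per_dollar is more than 20x higher, which is exactly the comparison this metric is for.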

## Effectiveness Script

**Location:** `lib/archeflow-score.sh`

```
Usage:
  archeflow-score.sh extract <events.jsonl>   # Score archetypes from a completed run
  archeflow-score.sh report                   # Show aggregate effectiveness report
  archeflow-score.sh recommend <team.yaml>    # Recommend model tiers for a team
```

### `extract` Command

1. Read all events from the JSONL file
2. Verify a `run.complete` event exists (scoring incomplete runs is unreliable)
3. For each `review.verdict` event:
   - Count total findings and useful findings (severity >= WARNING, fix_required)
   - Cross-reference with `fix.applied` events via the `source` field or DAG parent chain
   - Check for contradictions from other reviewers
   - Determine cycle impact
4. Calculate all scoring dimensions and composite score
5. Append per-archetype score records to `.archeflow/memory/effectiveness.jsonl`
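
Steps 2-3 can be roughed out with `grep` over a fixture file (a sketch only: the event field names here are simplified stand-ins for the real schema, and the real script also follows DAG parent chains rather than just string-matching):

```shell
# Tiny fixture standing in for a completed run's events.jsonl.
cat > events.jsonl <<'EOF'
{"type":"review.verdict","archetype":"guardian","findings_total":3}
{"type":"fix.applied","source":"guardian"}
{"type":"run.complete"}
EOF

# Step 2: refuse to score a run with no run.complete event.
grep -q '"type":"run.complete"' events.jsonl || { echo "incomplete run" >&2; exit 1; }

# Step 3 (partial): tally verdicts and applied fixes per archetype.
verdicts=$(grep -c '"type":"review.verdict".*"archetype":"guardian"' events.jsonl)
fixes=$(grep -c '"type":"fix.applied".*"source":"guardian"' events.jsonl)
echo "guardian: $verdicts verdict(s), $fixes fix(es)"   # guardian: 1 verdict(s), 1 fix(es)
```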

### `report` Command

1. Read `.archeflow/memory/effectiveness.jsonl`
2. Group by archetype
3. Calculate rolling averages (last 10 runs per archetype)
4. Calculate trends (last 5 vs. prior 5)
5. Output a markdown table:

```markdown
# Archetype Effectiveness Report

| Archetype | Runs | Avg Score | S/N | Fix Rate | Cost Eff | Accuracy | Trend | Rec |
|-----------|------|-----------|-----|----------|----------|----------|-------|-----|
| guardian | 12 | 0.88 | 0.82 | 0.95 | 38.2 | 0.97 | stable | keep |
| sage | 10 | 0.72 | 0.70 | 0.80 | 12.1 | 0.88 | improving | keep |
| skeptic | 8 | 0.45 | 0.40 | 0.50 | 22.5 | 0.60 | stable | optimize |
| trickster | 8 | 0.35 | 0.20 | 0.30 | 5.1 | 0.55 | declining | consider_removing |

**Model suggestions:**
- skeptic (haiku, score 0.45): Consider upgrading to sonnet or tightening review lens
- trickster (haiku, score 0.35): Consider removing — low signal, low fix rate
```
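
The grouping and averaging steps of the report can be sketched with `awk`'s `match()` (a sketch: it assumes the flat record shape shown earlier and averages all runs rather than windowing to the last 10):

```shell
# avg_by_archetype: read score records on stdin, print "<archetype> <avg> <runs>".
avg_by_archetype() {
  awk '
    # "archetype":"X" — the value starts 13 chars in, minus the closing quote.
    match($0, /"archetype":"[^"]*"/)       { a = substr($0, RSTART + 13, RLENGTH - 14) }
    # "composite_score":N — the value starts 18 chars in.
    match($0, /"composite_score":[0-9.]+/) { sum[a] += substr($0, RSTART + 18, RLENGTH - 18); n[a]++ }
    END { for (k in sum) printf "%s %.2f %d\n", k, sum[k] / n[k], n[k] }'
}

printf '%s\n' \
  '{"archetype":"guardian","composite_score":0.90}' \
  '{"archetype":"guardian","composite_score":0.86}' | avg_by_archetype   # guardian 0.88 2
```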

### `recommend` Command

1. Read the team preset YAML file
2. For each archetype in the team, look up its effectiveness from `.archeflow/memory/effectiveness.jsonl`
3. Cross-reference current model assignment with effectiveness
4. Output recommendations:

```markdown
# Model Recommendations for team: story-development

| Archetype | Current Model | Score | Suggestion |
|-----------|---------------|-------|------------|
| guardian | haiku | 0.88 | Keep haiku — high effectiveness at low cost |
| sage | sonnet | 0.72 | Keep sonnet — quality-sensitive role |
| skeptic | haiku | 0.45 | Try sonnet — may improve signal quality |
| trickster | haiku | 0.35 | Consider removing from team |
```

## Design Principles

1. **Append-only.** Score records are immutable facts. Aggregates are computed on demand.
2. **Review archetypes only.** Non-review agents (Explorer, Creator, Maker) are not scored — their value is in the final product, not in individual findings.
3. **Relative, not absolute.** Scores are meaningful in comparison (guardian vs. trickster), not as standalone numbers. The thresholds (0.7, 0.4) are starting points — calibrate after 20+ runs.
4. **Actionable.** Every report ends with concrete recommendations (keep, optimize, remove, change model).
5. **Cheap to compute.** One JSONL scan per report. No databases, no external services.