---
name: effectiveness
description: Track archetype effectiveness across runs. Scores each archetype on signal-to-noise, fix rate, cost efficiency, accuracy, and cycle impact. Recommends model tier changes and archetype removal based on rolling averages. <example>User: "Which reviewers are actually useful?"</example> <example>User: "Show archetype effectiveness report"</example>
---

# Agent Effectiveness Scoring

Track which archetypes are most useful vs. which waste tokens. Over multiple runs, build a profile of each archetype's effectiveness and use it to optimize team composition and model selection.

## Storage

```
.archeflow/memory/effectiveness.jsonl     # Per-run archetype scores (append-only)
```

## Scoring Dimensions

For each archetype that participates in a run, calculate these scores:

| Dimension | How Measured | Weight |
|-----------|--------------|--------|
| Signal-to-noise | useful findings / total findings | 0.30 |
| Fix rate | findings that led to actual fixes / total findings | 0.25 |
| Cost efficiency | useful findings per dollar spent | 0.20 |
| Accuracy | findings not contradicted by other reviewers | 0.15 |
| Cycle impact | did this archetype's findings lead to cycle exit? | 0.10 |

### Definitions

- **Useful finding:** A finding in a `review.verdict` event with severity >= WARNING (i.e., severity is `warning`, `bug`, or `critical`) AND `fix_required == true`.
- **Actual fix:** A `fix.applied` event whose `source` field matches this archetype (or whose DAG parent chain traces back to this archetype's `review.verdict` event).
- **Contradicted finding:** Another reviewer's `review.verdict` has `verdict == "approved"` for the same scope where this archetype flagged an issue. Approximation: if archetype A flags N findings but archetype B approves the same code with 0 findings in overlapping severity categories, A's unmatched findings are considered potentially contradicted.
- **Cycle impact:** The archetype's findings (with `fix_required == true`) resulted in fixes that were part of the final approved cycle. Determined by checking if `fix.applied` events referencing this archetype exist before the final `cycle.boundary` with `met == true`.
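In practice, "useful finding" is a filter over the event log. A minimal jq sketch, assuming `review.verdict` events carry an `event` type field, an `archetype`, and a `findings` array with `severity` and `fix_required` entries (the exact schema is not spelled out here):

```bash
# Count one archetype's useful findings per the definition above.
# ASSUMPTION: field names (`event`, `archetype`, `findings`,
# `severity`, `fix_required`) are illustrative, not confirmed.
jq -s '
  [ .[]
    | select(.event == "review.verdict" and .archetype == "guardian")
    | .findings[]?
    | select(.fix_required == true
             and (.severity == "warning" or .severity == "bug" or .severity == "critical"))
  ] | length
' events.jsonl
```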

## Composite Score

```
composite = (signal_to_noise * 0.30)
          + (fix_rate * 0.25)
          + (cost_efficiency_normalized * 0.20)
          + (accuracy * 0.15)
          + (cycle_impact * 0.10)
```

**Cost efficiency normalization:** Raw cost efficiency is `useful_findings / cost_usd`. To normalize to the 0-1 range, use `min(1.0, raw_efficiency / 100)`. The threshold of 100 means "100 useful findings per dollar" is considered perfect efficiency (achievable with haiku on structured reviews).
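Put together, the composite for one per-run score record (the record shape shown in the next section) can be computed with a short jq program. A sketch, not necessarily how `archeflow-score.sh` implements it:

```bash
# Composite score for one per-run record, including the
# min(1.0, raw_efficiency / 100) normalization of cost efficiency.
jq '
  ([.cost_efficiency / 100, 1.0] | min) as $eff
  | (.signal_to_noise * 0.30)
    + (.fix_rate * 0.25)
    + ($eff * 0.20)
    + (.accuracy * 0.15)
    + ((if .cycle_impact then 1 else 0 end) * 0.10)
' score-record.json
```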

## Per-Run Scoring

After `run.complete`, calculate scores for each archetype that participated. The `extract` command does this.

### Per-Run Score Record

```json
{"ts":"2026-04-03T16:00:00Z","run_id":"2026-04-03-der-huster","archetype":"guardian","signal_to_noise":0.85,"fix_rate":1.0,"cost_efficiency":42.5,"accuracy":1.0,"cycle_impact":true,"composite_score":0.91,"tokens":5000,"cost_usd":0.004,"model":"haiku","findings_total":4,"findings_useful":3,"fixes_applied":3}
```

Appended to `.archeflow/memory/effectiveness.jsonl`.

## Scoring Non-Review Archetypes

Only archetypes that produce `review.verdict` events are scored (Guardian, Skeptic, Sage, Trickster, and any custom review archetypes). Non-review archetypes (Explorer, Creator, Maker) are covered by cost-tracking but not effectiveness-scored, because their output quality is measured differently: by whether the run succeeds, not by individual findings.

## Aggregate Scoring

Across all runs, maintain rolling averages (computed on-demand, not stored):

```json
{"archetype":"guardian","runs":12,"avg_composite":0.88,"avg_signal_noise":0.82,"avg_cost_efficiency":38.2,"trend":"stable","recommendation":"keep"}
{"archetype":"trickster","runs":8,"avg_composite":0.35,"avg_signal_noise":0.20,"avg_cost_efficiency":5.1,"trend":"declining","recommendation":"consider_removing"}
```

### Trend Calculation

Compare the average composite score of the last 5 runs to the 5 runs before that:

- **improving:** last-5 avg > prior-5 avg + 0.05
- **declining:** last-5 avg < prior-5 avg - 0.05
- **stable:** within ±0.05

If fewer than 10 runs exist, the trend is `insufficient_data`.
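Expressed as a jq sketch over the score log (assuming records are appended in run order):

```bash
# Trend for one archetype: last-5 average vs. the 5 runs before that.
jq -s '
  [ .[] | select(.archetype == "guardian") | .composite_score ] as $s
  | if ($s | length) < 10 then "insufficient_data"
    else
      ($s[-5:] | add / 5) as $last
      | ($s[-10:-5] | add / 5) as $prior
      | if $last > $prior + 0.05 then "improving"
        elif $last < $prior - 0.05 then "declining"
        else "stable" end
    end
' .archeflow/memory/effectiveness.jsonl
```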

## Recommendations

Based on aggregate composite scores:

| Composite Score | Recommendation | Meaning |
|-----------------|----------------|---------|
| >= 0.70 | `keep` | Archetype is valuable; contributes meaningful findings |
| 0.40 - 0.69 | `optimize` | Consider a cheaper model or a tighter review lens |
| < 0.40 | `consider_removing` | May be wasting tokens; review whether it adds value |
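The mapping is mechanical; a minimal shell helper (thresholds from the table above) might look like this:

```bash
# Map an aggregate composite score to a recommendation.
recommend() {
  awk -v s="$1" 'BEGIN {
    if (s >= 0.70)      print "keep"
    else if (s >= 0.40) print "optimize"
    else                print "consider_removing"
  }'
}

recommend 0.88   # keep
recommend 0.35   # consider_removing
```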

## Integration Points

### At Run Start

When the run skill initializes, show a brief effectiveness summary for the team's archetypes:

```
Archetype effectiveness (last 10 runs):
  guardian:  0.88 (keep)     — haiku, $0.004/run avg
  sage:      0.72 (keep)     — sonnet, $0.08/run avg
  skeptic:   0.45 (optimize) — haiku, $0.003/run avg
  trickster: 0.32 (consider_removing) — haiku, $0.003/run avg
```

### Model Tier Suggestions

Cross-reference effectiveness with model assignment:

- **High effectiveness on cheap model** (composite >= 0.7, model = haiku): "Keep cheap. Working well."
- **Low effectiveness on cheap model** (composite < 0.5, model = haiku): "Consider upgrading to sonnet — cheap model may not be capturing issues."
- **High effectiveness on expensive model** (composite >= 0.7, model = sonnet): "Try downgrading to haiku — may maintain quality at lower cost."
- **Low effectiveness on expensive model** (composite < 0.5, model = sonnet): "Consider removing — expensive and not contributing."
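As a sketch, this decision table is a two-key lookup. The 0.5-0.7 middle band is not specified above, so the fallback branch here is an assumption:

```bash
# Suggest a model tier change from (composite score, current model).
suggest() {  # usage: suggest <composite> <model>
  awk -v s="$1" -v m="$2" 'BEGIN {
    if      (m == "haiku"  && s >= 0.7) print "Keep cheap. Working well."
    else if (m == "haiku"  && s <  0.5) print "Consider upgrading to sonnet."
    else if (m == "sonnet" && s >= 0.7) print "Try downgrading to haiku."
    else if (m == "sonnet" && s <  0.5) print "Consider removing."
    else print "No change suggested."   # ASSUMPTION: mid-band default
  }'
}
```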

### Cost-Tracking Integration

Divide effectiveness by estimated cost to get "value per dollar":

```
value_per_dollar = composite_score / cost_usd
```

This metric helps compare archetypes directly: a cheap archetype with moderate effectiveness may have a higher `value_per_dollar` than an expensive one with high effectiveness.
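Computed per score record, this is a one-liner over the log:

```bash
# Value per dollar for every per-run score record.
jq -c '{archetype, value_per_dollar: (.composite_score / .cost_usd)}' \
  .archeflow/memory/effectiveness.jsonl
```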

## Effectiveness Script

Location: `lib/archeflow-score.sh`

Usage:

```
archeflow-score.sh extract <events.jsonl>     # Score archetypes from a completed run
archeflow-score.sh report                     # Show aggregate effectiveness report
archeflow-score.sh recommend <team.yaml>      # Recommend model tiers for a team
```

### `extract` Command

1. Read all events from the JSONL file
2. Verify a `run.complete` event exists (scoring incomplete runs is unreliable)
3. For each `review.verdict` event:
   - Count total findings and useful findings (severity >= WARNING, `fix_required`)
   - Cross-reference with `fix.applied` events via the `source` field or DAG parent chain
   - Check for contradictions from other reviewers
   - Determine cycle impact
4. Calculate all scoring dimensions and the composite score
5. Append per-archetype score records to `.archeflow/memory/effectiveness.jsonl`
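A condensed sketch of steps 2 through 5. The event schema and the script's actual internals are assumptions; the accuracy and cycle-impact checks are elided:

```bash
#!/usr/bin/env bash
# Sketch of the extract flow. ASSUMPTIONS: events carry an `event` field;
# review.verdict events have `archetype` and `findings`; fix.applied
# events have `source`. Output records here are deliberately partial.
set -euo pipefail
events="$1"
out=".archeflow/memory/effectiveness.jsonl"

# Step 2: refuse to score an incomplete run.
if [ "$(jq -s 'any(.[]; .event == "run.complete")' "$events")" != "true" ]; then
  echo "no run.complete event; refusing to score" >&2
  exit 1
fi

# Steps 3-5 (simplified): signal-to-noise and fix rate per archetype.
jq -sc '
  [ .[] | select(.event == "fix.applied") ] as $fixes
  | [ .[] | select(.event == "review.verdict") ]
  | group_by(.archetype)[]
  | .[0].archetype as $arch
  | ([ .[].findings[]? ] | length) as $total
  | ([ .[].findings[]?
       | select(.fix_required == true
                and (.severity == "warning" or .severity == "bug" or .severity == "critical"))
     ] | length) as $useful
  | ([ $fixes[] | select(.source == $arch) ] | length) as $fixed
  | { archetype: $arch, findings_total: $total, findings_useful: $useful,
      signal_to_noise: (if $total > 0 then $useful / $total else 0 end),
      fix_rate:        (if $total > 0 then $fixed  / $total else 0 end) }
' "$events" >> "$out"
```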

### `report` Command

1. Read `.archeflow/memory/effectiveness.jsonl`
2. Group by archetype
3. Calculate rolling averages (last 10 runs per archetype)
4. Calculate trends (last 5 vs. prior 5)
5. Output a markdown table:

```markdown
# Archetype Effectiveness Report

| Archetype | Runs | Avg Score | S/N | Fix Rate | Cost Eff | Accuracy | Trend | Rec |
|-----------|------|-----------|-----|----------|----------|----------|-------|-----|
| guardian | 12 | 0.88 | 0.82 | 0.95 | 38.2 | 0.97 | stable | keep |
| sage | 10 | 0.72 | 0.70 | 0.80 | 12.1 | 0.88 | improving | keep |
| skeptic | 8 | 0.45 | 0.40 | 0.50 | 22.5 | 0.60 | stable | optimize |
| trickster | 8 | 0.35 | 0.20 | 0.30 | 5.1 | 0.55 | declining | consider_removing |

**Model suggestions:**
- skeptic (haiku, score 0.45): Consider upgrading to sonnet or tightening review lens
- trickster (haiku, score 0.35): Consider removing — low signal, low fix rate
```
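Steps 2 and 3 reduce to a group-and-average pass. A sketch, assuming append order within each archetype group reflects run order:

```bash
# Rolling average composite over each archetype's last 10 records.
jq -sc '
  group_by(.archetype)[]
  | .[-10:] as $recent
  | { archetype: .[0].archetype,
      runs: length,
      avg_composite: (($recent | map(.composite_score) | add) / ($recent | length)) }
' .archeflow/memory/effectiveness.jsonl
```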

### `recommend` Command

1. Read the team preset YAML file
2. For each archetype in the team, look up its effectiveness from `.archeflow/memory/effectiveness.jsonl`
3. Cross-reference the current model assignment with effectiveness
4. Output recommendations:

```markdown
# Model Recommendations for team: story-development

| Archetype | Current Model | Score | Suggestion |
|-----------|--------------|-------|------------|
| guardian | haiku | 0.88 | Keep haiku — high effectiveness at low cost |
| sage | sonnet | 0.72 | Keep sonnet — quality-sensitive role |
| skeptic | haiku | 0.45 | Try sonnet — may improve signal quality |
| trickster | haiku | 0.35 | Consider removing from team |
```

## Design Principles

1. **Append-only.** Score records are immutable facts. Aggregates are computed on-demand.
2. **Review archetypes only.** Non-review agents (Explorer, Creator, Maker) are not scored — their value is in the final product, not in individual findings.
3. **Relative, not absolute.** Scores are meaningful in comparison (guardian vs. trickster), not as standalone numbers. The thresholds (0.7, 0.4) are starting points — calibrate after 20+ runs.
4. **Actionable.** Every report ends with concrete recommendations (keep, optimize, remove, change model).
5. **Cheap to compute.** One JSONL scan per report. No databases, no external services.