---
name: effectiveness
description: |
  Track archetype effectiveness across runs. Scores each archetype on
  signal-to-noise, fix rate, cost efficiency, accuracy, and cycle impact.
  Recommends model tier changes and archetype removal based on rolling averages.
  User: "Which reviewers are actually useful?"
  User: "Show archetype effectiveness report"
---

# Agent Effectiveness Scoring

Track which archetypes are most useful vs. which waste tokens. Over multiple runs, build a profile of each archetype's effectiveness and use it to optimize team composition and model selection.

## Storage

```
.archeflow/memory/effectiveness.jsonl   # Per-run archetype scores (append-only)
```

## Scoring Dimensions

For each archetype that participates in a run, calculate these scores:

| Dimension | How Measured | Weight |
|-----------|--------------|--------|
| **Signal-to-noise** | useful findings / total findings | 0.30 |
| **Fix rate** | findings that led to actual fixes / total findings | 0.25 |
| **Cost efficiency** | useful findings per dollar spent | 0.20 |
| **Accuracy** | findings not contradicted by other reviewers | 0.15 |
| **Cycle impact** | did this archetype's findings lead to cycle exit? | 0.10 |

### Definitions

- **Useful finding**: A finding in a `review.verdict` event with `severity >= WARNING` (i.e., severity is `warning`, `bug`, or `critical`) AND `fix_required == true`.
- **Actual fix**: A `fix.applied` event whose `source` field matches this archetype (or whose DAG `parent` chain traces back to this archetype's `review.verdict` event).
- **Contradicted finding**: Another reviewer's `review.verdict` has `verdict == "approved"` for the same scope where this archetype flagged an issue. Approximation: if archetype A flags N findings but archetype B approves the same code with 0 findings in overlapping severity categories, A's unmatched findings are considered potentially contradicted.
- **Cycle impact**: The archetype's findings (with `fix_required == true`) resulted in fixes that were part of the final approved cycle. Determined by checking if `fix.applied` events referencing this archetype exist before the final `cycle.boundary` with `met == true`.

### Composite Score

```
composite = (signal_to_noise * 0.30)
          + (fix_rate * 0.25)
          + (cost_efficiency_normalized * 0.20)
          + (accuracy * 0.15)
          + (cycle_impact * 0.10)
```

**Cost efficiency normalization**: Raw cost efficiency is `useful_findings / cost_usd`. To normalize to 0-1 range, use: `min(1.0, raw_efficiency / 100)`. The threshold of 100 means "100 useful findings per dollar" is considered perfect efficiency (achievable with haiku on structured reviews).

## Per-Run Scoring

After `run.complete`, calculate scores for each archetype that participated. The `extract` command does this.

### Per-Run Score Record

```jsonl
{"ts":"2026-04-03T16:00:00Z","run_id":"2026-04-03-der-huster","archetype":"guardian","signal_to_noise":0.85,"fix_rate":1.0,"cost_efficiency":42.5,"accuracy":1.0,"cycle_impact":true,"composite_score":0.91,"tokens":5000,"cost_usd":0.004,"model":"haiku","findings_total":4,"findings_useful":3,"fixes_applied":3}
```

Appended to `.archeflow/memory/effectiveness.jsonl`.
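A minimal sketch of the composite arithmetic behind `composite_score`, assuming the five dimensions have already been measured for the run; this is illustrative only, not the shipped `lib/archeflow-score.sh` logic. The records store `cycle_impact` as a boolean, so the sketch treats it as 1 or 0.

```bash
# Sketch only: weighted composite from already-measured dimensions.
# Args: signal_to_noise, fix_rate, raw cost efficiency (useful findings per dollar),
#       accuracy, cycle_impact (1 if findings were part of the approved cycle, else 0).
composite() {
  awk -v sn="$1" -v fr="$2" -v ce="$3" -v acc="$4" -v ci="$5" 'BEGIN {
    ce_norm = ce / 100               # min(1.0, raw_efficiency / 100)
    if (ce_norm > 1) ce_norm = 1
    printf "%.2f\n", sn*0.30 + fr*0.25 + ce_norm*0.20 + acc*0.15 + ci*0.10
  }'
}

composite 0.80 0.75 30 0.90 1   # -> 0.72
```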
### Scoring Non-Review Archetypes

Only archetypes that produce `review.verdict` events are scored (Guardian, Skeptic, Sage, Trickster, and any custom review archetypes). Non-review archetypes (Explorer, Creator, Maker) are tracked by cost-tracking but not effectiveness-scored, because their output quality is measured differently (by whether the run succeeds, not by individual findings).

## Aggregate Scoring

Across all runs, maintain rolling averages (computed on-demand, not stored):

```jsonl
{"archetype":"guardian","runs":12,"avg_composite":0.88,"avg_signal_noise":0.82,"avg_cost_efficiency":38.2,"trend":"stable","recommendation":"keep"}
{"archetype":"trickster","runs":8,"avg_composite":0.35,"avg_signal_noise":0.20,"avg_cost_efficiency":5.1,"trend":"declining","recommendation":"consider_removing"}
```

### Trend Calculation

Compare the average composite score of the last 5 runs to the 5 runs before that:

- **improving**: last-5 avg > prior-5 avg + 0.05
- **declining**: last-5 avg < prior-5 avg - 0.05
- **stable**: within +/- 0.05

If fewer than 10 runs exist, trend is `"insufficient_data"`.

### Recommendations

Based on aggregate composite scores:

| Composite Score | Recommendation | Meaning |
|-----------------|----------------|---------|
| >= 0.70 | `keep` | Archetype is valuable, contributes meaningful findings |
| 0.40 - 0.69 | `optimize` | Consider cheaper model or tighter review lens |
| < 0.40 | `consider_removing` | Might be wasting tokens, review whether it adds value |

## Integration Points

### At Run Start

When the `run` skill initializes, show a brief effectiveness summary for the team's archetypes:

```
Archetype effectiveness (last 10 runs):
  guardian:  0.88 (keep) — haiku, $0.004/run avg
  sage:      0.72 (keep) — sonnet, $0.08/run avg
  skeptic:   0.45 (optimize) — haiku, $0.003/run avg
  trickster: 0.32 (consider_removing) — haiku, $0.003/run avg
```

### Model Tier Suggestions

Cross-reference effectiveness with model assignment:

- **High effectiveness on cheap model** (composite >= 0.7, model = haiku): "Keep cheap. Working well."
- **Low effectiveness on cheap model** (composite < 0.5, model = haiku): "Consider upgrading to sonnet — cheap model may not be capturing issues."
- **High effectiveness on expensive model** (composite >= 0.7, model = sonnet): "Try downgrading to haiku — may maintain quality at lower cost."
- **Low effectiveness on expensive model** (composite < 0.5, model = sonnet): "Consider removing — expensive and not contributing."

### Cost-Tracking Integration

Divide the composite score by the estimated cost to get "value per dollar":

```
value_per_dollar = composite_score / cost_usd
```

This metric helps compare archetypes directly: a cheap archetype with moderate effectiveness may have higher value_per_dollar than an expensive one with high effectiveness.

## Effectiveness Script

**Location:** `lib/archeflow-score.sh`

```
Usage:
  archeflow-score.sh extract     # Score archetypes from a completed run
  archeflow-score.sh report      # Show aggregate effectiveness report
  archeflow-score.sh recommend   # Recommend model tiers for a team
```

### `extract` Command

1. Read all events from the JSONL file
2. Verify a `run.complete` event exists (scoring incomplete runs is unreliable)
3. For each `review.verdict` event (see the sketch after this list):
   - Count total findings and useful findings (severity >= WARNING, fix_required)
   - Cross-reference with `fix.applied` events via the `source` field or DAG parent chain
   - Check for contradictions from other reviewers
   - Determine cycle impact
4. Calculate all scoring dimensions and composite score
5. Append per-archetype score records to `.archeflow/memory/effectiveness.jsonl`
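The event schema is defined by the run log, not by this skill, so the sketch below assumes a simplified shape: every event has `type` and `archetype` fields, each `review.verdict` carries a `findings` array of `{severity, fix_required}` objects, and `RUN_EVENTS` is a placeholder for the run's event file. It only covers the counting in step 3, uses the `fix.applied` count as a proxy for "findings that led to fixes", and skips contradictions, cost efficiency, and cycle impact.

```bash
# Hypothetical sketch of step 3 for one archetype; not the shipped script.
jq -s -c --arg arch "guardian" '
  if ([.[] | select(.type == "run.complete")] | length) == 0
  then error("run incomplete; refusing to score")          # step 2
  else
    [.[] | select(.type == "review.verdict" and .archetype == $arch)
         | .findings[]] as $findings
    | ($findings | map(select(
          (.severity == "warning" or .severity == "bug" or .severity == "critical")
          and .fix_required == true)) | length) as $useful
    # proxy: fix.applied count stands in for "findings that led to fixes"
    | ([.[] | select(.type == "fix.applied" and .source == $arch)] | length) as $fixes
    | { archetype: $arch,
        findings_total: ($findings | length),
        findings_useful: $useful,
        fixes_applied: $fixes,
        signal_to_noise: (if $findings == [] then 0 else ($useful / ($findings | length)) end),
        fix_rate:        (if $findings == [] then 0 else ($fixes  / ($findings | length)) end) }
  end
' "$RUN_EVENTS"
```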
### `report` Command

1. Read `.archeflow/memory/effectiveness.jsonl`
2. Group by archetype
3. Calculate rolling averages (last 10 runs per archetype)
4. Calculate trends (last 5 vs. prior 5)
5. Output a markdown table:

```markdown
# Archetype Effectiveness Report

| Archetype | Runs | Avg Score | S/N | Fix Rate | Cost Eff | Accuracy | Trend | Rec |
|-----------|------|-----------|-----|----------|----------|----------|-------|-----|
| guardian | 12 | 0.88 | 0.82 | 0.95 | 38.2 | 0.97 | stable | keep |
| sage | 10 | 0.72 | 0.70 | 0.80 | 12.1 | 0.88 | improving | keep |
| skeptic | 8 | 0.45 | 0.40 | 0.50 | 22.5 | 0.60 | stable | optimize |
| trickster | 8 | 0.35 | 0.20 | 0.30 | 5.1 | 0.55 | declining | consider_removing |

**Model suggestions:**
- skeptic (haiku, score 0.45): Consider upgrading to sonnet or tightening review lens
- trickster (haiku, score 0.35): Consider removing — low signal, low fix rate
```

### `recommend` Command

1. Read the team preset YAML file
2. For each archetype in the team, look up its effectiveness from `.archeflow/memory/effectiveness.jsonl`
3. Cross-reference current model assignment with effectiveness
4. Output recommendations:

```markdown
# Model Recommendations for team: story-development

| Archetype | Current Model | Score | Suggestion |
|-----------|---------------|-------|------------|
| guardian | haiku | 0.88 | Keep haiku — high effectiveness at low cost |
| sage | sonnet | 0.72 | Keep sonnet — quality-sensitive role |
| skeptic | haiku | 0.45 | Try sonnet — may improve signal quality |
| trickster | haiku | 0.35 | Consider removing from team |
```

## Design Principles

1. **Append-only.** Score records are immutable facts. Aggregates are computed on-demand.
2. **Review archetypes only.** Non-review agents (Explorer, Creator, Maker) are not scored — their value is in the final product, not in individual findings.
3. **Relative, not absolute.** Scores are meaningful in comparison (guardian vs. trickster), not as standalone numbers. The thresholds (0.7, 0.4) are starting points — calibrate after 20+ runs.
4. **Actionable.** Every report ends with concrete recommendations (keep, optimize, remove, change model).
5. **Cheap to compute.** One JSONL scan per report. No databases, no external services.
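Principles 1 and 5 show up concretely in the `report` path: score records stay append-only, and the rolling averages, trends, and recommendations fall out of a single pass over the JSONL. Below is a rough sketch of that pass, using the field names from the per-run score records; the output loosely mirrors the aggregate records shown under Aggregate Scoring and is not the shipped implementation.

```bash
# Sketch of the single-scan aggregation behind `report` (illustrative only).
# Field names (.archetype, .ts, .composite_score) come from the per-run records.
jq -s -c '
  group_by(.archetype) | map(
    sort_by(.ts) as $all
    | ($all | .[-10:]) as $recent                  # rolling window: last 10 runs
    | (($recent | map(.composite_score) | add) / ($recent | length)) as $avg
    | { archetype: .[0].archetype,
        runs: ($all | length),
        avg_composite: $avg,
        trend: (
          if ($all | length) < 10 then "insufficient_data"
          else (
            ((($all | .[-5:])    | map(.composite_score) | add) / 5) as $last5
            | ((($all | .[-10:-5]) | map(.composite_score) | add) / 5) as $prior5
            | if   $last5 > $prior5 + 0.05 then "improving"
              elif $last5 < $prior5 - 0.05 then "declining"
              else "stable"
              end
          ) end),
        recommendation: (
          if   $avg >= 0.70 then "keep"
          elif $avg >= 0.40 then "optimize"
          else "consider_removing"
          end) }
  ) | .[]
' .archeflow/memory/effectiveness.jsonl
```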