---
name: effectiveness
description: |
  Track archetype effectiveness across runs. Scores each archetype on signal-to-noise,
  fix rate, cost efficiency, accuracy, and cycle impact. Recommends model tier changes
  and archetype removal based on rolling averages.
  <example>User: "Which reviewers are actually useful?"</example>
  <example>User: "Show archetype effectiveness report"</example>
---

# Agent Effectiveness Scoring

Track which archetypes are most useful vs. which waste tokens. Over multiple runs, build a profile of each archetype's effectiveness and use it to optimize team composition and model selection.

## Storage

```
.archeflow/memory/effectiveness.jsonl   # Per-run archetype scores (append-only)
```
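
The file is plain JSONL, so it can be inspected directly, for example to see how many score records each archetype has accumulated:

```bash
# One record per archetype per run, so this is also a rough run count per archetype.
jq -r '.archetype' .archeflow/memory/effectiveness.jsonl | sort | uniq -c
```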

## Scoring Dimensions

For each archetype that participates in a run, calculate these scores:

| Dimension | How Measured | Weight |
|-----------|--------------|--------|
| **Signal-to-noise** | useful findings / total findings | 0.30 |
| **Fix rate** | findings that led to actual fixes / total findings | 0.25 |
| **Cost efficiency** | useful findings per dollar spent | 0.20 |
| **Accuracy** | findings not contradicted by other reviewers | 0.15 |
| **Cycle impact** | did this archetype's findings lead to cycle exit? | 0.10 |

### Definitions

- **Useful finding**: A finding in a `review.verdict` event with `severity >= WARNING` (i.e., severity is `warning`, `bug`, or `critical`) AND `fix_required == true`.
- **Actual fix**: A `fix.applied` event whose `source` field matches this archetype (or whose DAG `parent` chain traces back to this archetype's `review.verdict` event).
- **Contradicted finding**: Another reviewer's `review.verdict` has `verdict == "approved"` for the same scope where this archetype flagged an issue. Approximation: if archetype A flags N findings but archetype B approves the same code with 0 findings in overlapping severity categories, A's unmatched findings are considered potentially contradicted.
- **Cycle impact**: The archetype's findings (with `fix_required == true`) resulted in fixes that were part of the final approved cycle. Determined by checking if `fix.applied` events referencing this archetype exist before the final `cycle.boundary` with `met == true`.
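
As a sketch, the useful-finding count per archetype could be pulled from a run's events file with jq. The `severity`/`fix_required` test follows the definition above; the `type`, `archetype`, and `findings` field names are assumptions about the event shape, not a documented schema:

```bash
# Count total and useful findings per archetype from review.verdict events.
jq -s '
  [ .[] | select(.type == "review.verdict") ]
  | group_by(.archetype)
  | map({
      archetype: .[0].archetype,
      findings_total: ([ .[].findings[] ] | length),
      findings_useful: ([ .[].findings[]
        | select((.severity == "warning" or .severity == "bug" or .severity == "critical")
                 and .fix_required == true) ] | length)
    })
' events.jsonl
```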

### Composite Score

```
composite = (signal_to_noise * 0.30)
          + (fix_rate * 0.25)
          + (cost_efficiency_normalized * 0.20)
          + (accuracy * 0.15)
          + (cycle_impact * 0.10)
```

**Cost efficiency normalization**: Raw cost efficiency is `useful_findings / cost_usd`. To normalize to the 0-1 range, use `min(1.0, raw_efficiency / 100)`. The threshold of 100 means "100 useful findings per dollar" is considered perfect efficiency (achievable with haiku on structured reviews).
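
A minimal sketch of this calculation in shell (as `archeflow-score.sh` might do it), with illustrative dimension values and `cycle_impact` treated as 1 or 0:

```bash
# Illustrative dimension values; in practice these come from the event scan.
signal_to_noise=0.80; fix_rate=0.75; cost_efficiency=40; accuracy=0.90; cycle_impact=1

composite=$(awk -v sn="$signal_to_noise" -v fr="$fix_rate" -v ce="$cost_efficiency" \
                -v ac="$accuracy" -v ci="$cycle_impact" 'BEGIN {
  norm = ce / 100; if (norm > 1.0) norm = 1.0      # min(1.0, raw_efficiency / 100)
  printf "%.2f", sn*0.30 + fr*0.25 + norm*0.20 + ac*0.15 + ci*0.10
}')
echo "$composite"   # prints 0.74 for the values above
```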

## Per-Run Scoring

After `run.complete`, calculate scores for each archetype that participated. The `extract` command does this.

### Per-Run Score Record

```jsonl
{"ts":"2026-04-03T16:00:00Z","run_id":"2026-04-03-der-huster","archetype":"guardian","signal_to_noise":0.85,"fix_rate":1.0,"cost_efficiency":42.5,"accuracy":1.0,"cycle_impact":true,"composite_score":0.91,"tokens":5000,"cost_usd":0.004,"model":"haiku","findings_total":4,"findings_useful":3,"fixes_applied":3}
```

Appended to `.archeflow/memory/effectiveness.jsonl`.

### Scoring Non-Review Archetypes

Only archetypes that produce `review.verdict` events are scored (Guardian, Skeptic, Sage, Trickster, and any custom review archetypes). Non-review archetypes (Explorer, Creator, Maker) are tracked by cost-tracking but not effectiveness-scored, because their output quality is measured differently (by whether the run succeeds, not by individual findings).

## Aggregate Scoring

Across all runs, maintain rolling averages (computed on-demand, not stored):

```jsonl
{"archetype":"guardian","runs":12,"avg_composite":0.88,"avg_signal_noise":0.82,"avg_cost_efficiency":38.2,"trend":"stable","recommendation":"keep"}
{"archetype":"trickster","runs":8,"avg_composite":0.35,"avg_signal_noise":0.20,"avg_cost_efficiency":5.1,"trend":"declining","recommendation":"consider_removing"}
```
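
A sketch of the on-demand aggregation with jq (rolling window, trend, and recommendation omitted), using only fields from the per-run score record above:

```bash
# Group per-run records by archetype and average the composite score.
jq -s '
  group_by(.archetype)
  | map({
      archetype: .[0].archetype,
      runs: length,
      avg_composite: ((map(.composite_score) | add) / length)
    })
' .archeflow/memory/effectiveness.jsonl
```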

### Trend Calculation

Compare the average composite score of the last 5 runs to the 5 runs before that:

- **improving**: last-5 avg > prior-5 avg + 0.05
- **declining**: last-5 avg < prior-5 avg - 0.05
- **stable**: within +/- 0.05

If fewer than 10 runs exist, trend is `"insufficient_data"`.
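
A sketch of the trend rule for one archetype with jq, assuming the append-only file is already in chronological order:

```bash
# Compare the last 5 composite scores to the 5 before them for "guardian".
jq -s --arg a guardian '
  [ .[] | select(.archetype == $a) | .composite_score ] as $scores
  | if ($scores | length) < 10 then "insufficient_data"
    else
      ($scores[-5:] | add / 5) as $recent
      | ($scores[-10:-5] | add / 5) as $prior
      | if $recent > $prior + 0.05 then "improving"
        elif $recent < $prior - 0.05 then "declining"
        else "stable" end
    end
' .archeflow/memory/effectiveness.jsonl
```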

### Recommendations

Based on aggregate composite scores:

| Composite Score | Recommendation | Meaning |
|----------------|----------------|---------|
| >= 0.70 | `keep` | Archetype is valuable, contributes meaningful findings |
| 0.40 - 0.69 | `optimize` | Consider a cheaper model or a tighter review lens |
| < 0.40 | `consider_removing` | Might be wasting tokens; review whether it adds value |
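
The same thresholds as a small shell helper (a sketch, not the exact logic in `archeflow-score.sh`):

```bash
recommend() {
  # $1 = rolling-average composite score, e.g. 0.45
  awk -v s="$1" 'BEGIN {
    if (s >= 0.70)      print "keep"
    else if (s >= 0.40) print "optimize"
    else                print "consider_removing"
  }'
}
recommend 0.45   # -> optimize
```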

## Integration Points

### At Run Start

When the `run` skill initializes, show a brief effectiveness summary for the team's archetypes:

```
Archetype effectiveness (last 10 runs):
  guardian:  0.88 (keep) — haiku, $0.004/run avg
  sage:      0.72 (keep) — sonnet, $0.08/run avg
  skeptic:   0.45 (optimize) — haiku, $0.003/run avg
  trickster: 0.32 (consider_removing) — haiku, $0.003/run avg
```

### Model Tier Suggestions

Cross-reference effectiveness with model assignment:

- **High effectiveness on cheap model** (composite >= 0.7, model = haiku): "Keep cheap. Working well."
- **Low effectiveness on cheap model** (composite < 0.5, model = haiku): "Consider upgrading to sonnet — cheap model may not be capturing issues."
- **High effectiveness on expensive model** (composite >= 0.7, model = sonnet): "Try downgrading to haiku — may maintain quality at lower cost."
- **Low effectiveness on expensive model** (composite < 0.5, model = sonnet): "Consider removing — expensive and not contributing."
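
The four rules as a sketch; scores between 0.5 and 0.7 fall outside them and are left unchanged here, which is an assumption rather than documented behavior:

```bash
suggest() {
  local score="$1" model="$2"
  case "$model" in
    haiku)
      awk -v s="$score" 'BEGIN {
        if (s >= 0.7)     print "Keep cheap. Working well."
        else if (s < 0.5) print "Consider upgrading to sonnet."
        else              print "No change."
      }' ;;
    sonnet)
      awk -v s="$score" 'BEGIN {
        if (s >= 0.7)     print "Try downgrading to haiku."
        else if (s < 0.5) print "Consider removing."
        else              print "No change."
      }' ;;
  esac
}
suggest 0.45 haiku   # -> Consider upgrading to sonnet.
```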

### Cost-Tracking Integration

Divide effectiveness by estimated cost to get "value per dollar":

```
value_per_dollar = composite_score / cost_usd
```

This metric helps compare archetypes directly: a cheap archetype with moderate effectiveness may have a higher value_per_dollar than an expensive one with high effectiveness.
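
For example, with the per-run averages from the run-start summary above, the cheaper skeptic outranks the stronger but pricier sage on this metric:

```
skeptic: 0.45 / 0.003 = 150 value per dollar
sage:    0.72 / 0.08  =   9 value per dollar
```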

## Effectiveness Script

**Location:** `lib/archeflow-score.sh`

```
Usage:
  archeflow-score.sh extract <events.jsonl>   # Score archetypes from a completed run
  archeflow-score.sh report                   # Show aggregate effectiveness report
  archeflow-score.sh recommend <team.yaml>    # Recommend model tiers for a team
```

### `extract` Command

1. Read all events from the JSONL file
2. Verify a `run.complete` event exists (scoring incomplete runs is unreliable)
3. For each `review.verdict` event:
   - Count total findings and useful findings (severity >= WARNING, fix_required)
   - Cross-reference with `fix.applied` events via the `source` field or DAG parent chain
   - Check for contradictions from other reviewers
   - Determine cycle impact
4. Calculate all scoring dimensions and composite score
5. Append per-archetype score records to `.archeflow/memory/effectiveness.jsonl`

### `report` Command

1. Read `.archeflow/memory/effectiveness.jsonl`
2. Group by archetype
3. Calculate rolling averages (last 10 runs per archetype)
4. Calculate trends (last 5 vs. prior 5)
5. Output a markdown table:

```markdown
# Archetype Effectiveness Report

| Archetype | Runs | Avg Score | S/N | Fix Rate | Cost Eff | Accuracy | Trend | Rec |
|-----------|------|-----------|-----|----------|----------|----------|-------|-----|
| guardian | 12 | 0.88 | 0.82 | 0.95 | 38.2 | 0.97 | stable | keep |
| sage | 10 | 0.72 | 0.70 | 0.80 | 12.1 | 0.88 | improving | keep |
| skeptic | 8 | 0.45 | 0.40 | 0.50 | 22.5 | 0.60 | stable | optimize |
| trickster | 8 | 0.35 | 0.20 | 0.30 | 5.1 | 0.55 | declining | consider_removing |

**Model suggestions:**
- skeptic (haiku, score 0.45): Consider upgrading to sonnet or tightening review lens
- trickster (haiku, score 0.35): Consider removing — low signal, low fix rate
```

### `recommend` Command

1. Read the team preset YAML file
2. For each archetype in the team, look up its effectiveness from `.archeflow/memory/effectiveness.jsonl`
3. Cross-reference current model assignment with effectiveness
4. Output recommendations:

```markdown
# Model Recommendations for team: story-development

| Archetype | Current Model | Score | Suggestion |
|-----------|--------------|-------|------------|
| guardian | haiku | 0.88 | Keep haiku — high effectiveness at low cost |
| sage | sonnet | 0.72 | Keep sonnet — quality-sensitive role |
| skeptic | haiku | 0.45 | Try sonnet — may improve signal quality |
| trickster | haiku | 0.35 | Consider removing from team |
```

## Design Principles

1. **Append-only.** Score records are immutable facts. Aggregates are computed on-demand.
2. **Review archetypes only.** Non-review agents (Explorer, Creator, Maker) are not scored — their value is in the final product, not in individual findings.
3. **Relative, not absolute.** Scores are meaningful in comparison (guardian vs. trickster), not as standalone numbers. The thresholds (0.7, 0.4) are starting points — calibrate after 20+ runs.
4. **Actionable.** Every report ends with concrete recommendations (keep, optimize, remove, change model).
5. **Cheap to compute.** One JSONL scan per report. No databases, no external services.