---
name: effectiveness
description: |
  Track archetype effectiveness across runs. Scores each archetype on signal-to-noise,
  fix rate, cost efficiency, accuracy, and cycle impact. Recommends model tier changes
  and archetype removal based on rolling averages.
  <example>User: "Which reviewers are actually useful?"</example>
  <example>User: "Show archetype effectiveness report"</example>
---

# Agent Effectiveness Scoring

Track which archetypes are most useful vs. which waste tokens. Over multiple runs, build a profile of each archetype's effectiveness and use it to optimize team composition and model selection.

## Storage

```
.archeflow/memory/effectiveness.jsonl   # Per-run archetype scores (append-only)
```
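
The file is plain JSONL, so it can be inspected directly, for example to see how many score records each archetype has accumulated:

```bash
# One record per archetype per run, so this is also a rough run count per archetype.
jq -r '.archetype' .archeflow/memory/effectiveness.jsonl | sort | uniq -c
```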

## Scoring Dimensions

For each archetype that participates in a run, calculate these scores:

| Dimension | How Measured | Weight |
|-----------|--------------|--------|
| **Signal-to-noise** | useful findings / total findings | 0.30 |
| **Fix rate** | findings that led to actual fixes / total findings | 0.25 |
| **Cost efficiency** | useful findings per dollar spent | 0.20 |
| **Accuracy** | findings not contradicted by other reviewers | 0.15 |
| **Cycle impact** | did this archetype's findings lead to cycle exit? | 0.10 |

### Definitions

- **Useful finding**: A finding in a `review.verdict` event with `severity >= WARNING` (i.e., severity is `warning`, `bug`, or `critical`) AND `fix_required == true`.
- **Actual fix**: A `fix.applied` event whose `source` field matches this archetype (or whose DAG `parent` chain traces back to this archetype's `review.verdict` event).
- **Contradicted finding**: Another reviewer's `review.verdict` has `verdict == "approved"` for the same scope where this archetype flagged an issue. Approximation: if archetype A flags N findings but archetype B approves the same code with 0 findings in overlapping severity categories, A's unmatched findings are considered potentially contradicted.
- **Cycle impact**: The archetype's findings (with `fix_required == true`) resulted in fixes that were part of the final approved cycle. Determined by checking if `fix.applied` events referencing this archetype exist before the final `cycle.boundary` with `met == true`.
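
As a sketch, the useful-finding count per archetype could be pulled from a run's events file with jq. The `severity`/`fix_required` test follows the definition above; the `type`, `archetype`, and `findings` field names are assumptions about the event shape, not a documented schema:

```bash
# Count total and useful findings per archetype from review.verdict events.
jq -s '
  [ .[] | select(.type == "review.verdict") ]
  | group_by(.archetype)
  | map({
      archetype: .[0].archetype,
      findings_total: ([ .[].findings[] ] | length),
      findings_useful: ([ .[].findings[]
        | select((.severity == "warning" or .severity == "bug" or .severity == "critical")
                 and .fix_required == true) ] | length)
    })
' events.jsonl
```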

### Composite Score

```
composite = (signal_to_noise * 0.30)
          + (fix_rate * 0.25)
          + (cost_efficiency_normalized * 0.20)
          + (accuracy * 0.15)
          + (cycle_impact * 0.10)
```

**Cost efficiency normalization**: Raw cost efficiency is `useful_findings / cost_usd`. To normalize to the 0-1 range, use `min(1.0, raw_efficiency / 100)`. The threshold of 100 means "100 useful findings per dollar" is considered perfect efficiency (achievable with haiku on structured reviews).
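
A minimal sketch of this calculation in shell (as `archeflow-score.sh` might do it), with illustrative dimension values and `cycle_impact` treated as 1 or 0:

```bash
# Illustrative dimension values; in practice these come from the event scan.
signal_to_noise=0.80; fix_rate=0.75; cost_efficiency=40; accuracy=0.90; cycle_impact=1

composite=$(awk -v sn="$signal_to_noise" -v fr="$fix_rate" -v ce="$cost_efficiency" \
                -v ac="$accuracy" -v ci="$cycle_impact" 'BEGIN {
  norm = ce / 100; if (norm > 1.0) norm = 1.0      # min(1.0, raw_efficiency / 100)
  printf "%.2f", sn*0.30 + fr*0.25 + norm*0.20 + ac*0.15 + ci*0.10
}')
echo "$composite"   # prints 0.74 for the values above
```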

## Per-Run Scoring

After `run.complete`, calculate scores for each archetype that participated. The `extract` command does this.

### Per-Run Score Record

```jsonl
{"ts":"2026-04-03T16:00:00Z","run_id":"2026-04-03-der-huster","archetype":"guardian","signal_to_noise":0.85,"fix_rate":1.0,"cost_efficiency":42.5,"accuracy":1.0,"cycle_impact":true,"composite_score":0.91,"tokens":5000,"cost_usd":0.004,"model":"haiku","findings_total":4,"findings_useful":3,"fixes_applied":3}
```

Appended to `.archeflow/memory/effectiveness.jsonl`.

### Scoring Non-Review Archetypes

Only archetypes that produce `review.verdict` events are scored (Guardian, Skeptic, Sage, Trickster, and any custom review archetypes). Non-review archetypes (Explorer, Creator, Maker) are tracked by cost-tracking but not effectiveness-scored, because their output quality is measured differently (by whether the run succeeds, not by individual findings).

## Aggregate Scoring

Across all runs, maintain rolling averages (computed on-demand, not stored):

```jsonl
{"archetype":"guardian","runs":12,"avg_composite":0.88,"avg_signal_noise":0.82,"avg_cost_efficiency":38.2,"trend":"stable","recommendation":"keep"}
{"archetype":"trickster","runs":8,"avg_composite":0.35,"avg_signal_noise":0.20,"avg_cost_efficiency":5.1,"trend":"declining","recommendation":"consider_removing"}
```
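
A sketch of the on-demand aggregation with jq (rolling window, trend, and recommendation omitted), using only fields from the per-run score record above:

```bash
# Group per-run records by archetype and average the composite score.
jq -s '
  group_by(.archetype)
  | map({
      archetype: .[0].archetype,
      runs: length,
      avg_composite: ((map(.composite_score) | add) / length)
    })
' .archeflow/memory/effectiveness.jsonl
```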

### Trend Calculation

Compare the average composite score of the last 5 runs to the 5 runs before that:

- **improving**: last-5 avg > prior-5 avg + 0.05
- **declining**: last-5 avg < prior-5 avg - 0.05
- **stable**: within +/- 0.05

If fewer than 10 runs exist, trend is `"insufficient_data"`.
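
A sketch of the trend rule for one archetype with jq, assuming the append-only file is already in chronological order:

```bash
# Compare the last 5 composite scores to the 5 before them for "guardian".
jq -s --arg a guardian '
  [ .[] | select(.archetype == $a) | .composite_score ] as $scores
  | if ($scores | length) < 10 then "insufficient_data"
    else
      ($scores[-5:] | add / 5) as $recent
      | ($scores[-10:-5] | add / 5) as $prior
      | if $recent > $prior + 0.05 then "improving"
        elif $recent < $prior - 0.05 then "declining"
        else "stable" end
    end
' .archeflow/memory/effectiveness.jsonl
```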

### Recommendations

Based on aggregate composite scores:

| Composite Score | Recommendation | Meaning |
|----------------|----------------|---------|
| >= 0.70 | `keep` | Archetype is valuable, contributes meaningful findings |
| 0.40 - 0.69 | `optimize` | Consider a cheaper model or a tighter review lens |
| < 0.40 | `consider_removing` | Might be wasting tokens; review whether it adds value |
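
The same thresholds as a small shell helper (a sketch, not the exact logic in `archeflow-score.sh`):

```bash
recommend() {
  # $1 = rolling-average composite score, e.g. 0.45
  awk -v s="$1" 'BEGIN {
    if (s >= 0.70)      print "keep"
    else if (s >= 0.40) print "optimize"
    else                print "consider_removing"
  }'
}
recommend 0.45   # -> optimize
```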

## Integration Points

### At Run Start

When the `run` skill initializes, show a brief effectiveness summary for the team's archetypes:

```
Archetype effectiveness (last 10 runs):
  guardian:  0.88 (keep) — haiku, $0.004/run avg
  sage:      0.72 (keep) — sonnet, $0.08/run avg
  skeptic:   0.45 (optimize) — haiku, $0.003/run avg
  trickster: 0.32 (consider_removing) — haiku, $0.003/run avg
```

### Model Tier Suggestions

Cross-reference effectiveness with model assignment:

- **High effectiveness on cheap model** (composite >= 0.7, model = haiku): "Keep cheap. Working well."
- **Low effectiveness on cheap model** (composite < 0.5, model = haiku): "Consider upgrading to sonnet — cheap model may not be capturing issues."
- **High effectiveness on expensive model** (composite >= 0.7, model = sonnet): "Try downgrading to haiku — may maintain quality at lower cost."
- **Low effectiveness on expensive model** (composite < 0.5, model = sonnet): "Consider removing — expensive and not contributing."
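
The four rules as a sketch; scores between 0.5 and 0.7 fall outside them and are left unchanged here, which is an assumption rather than documented behavior:

```bash
suggest() {
  local score="$1" model="$2"
  case "$model" in
    haiku)
      awk -v s="$score" 'BEGIN {
        if (s >= 0.7)     print "Keep cheap. Working well."
        else if (s < 0.5) print "Consider upgrading to sonnet."
        else              print "No change."
      }' ;;
    sonnet)
      awk -v s="$score" 'BEGIN {
        if (s >= 0.7)     print "Try downgrading to haiku."
        else if (s < 0.5) print "Consider removing."
        else              print "No change."
      }' ;;
  esac
}
suggest 0.45 haiku   # -> Consider upgrading to sonnet.
```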

### Cost-Tracking Integration

Divide effectiveness by estimated cost to get "value per dollar":

```
value_per_dollar = composite_score / cost_usd
```

This metric helps compare archetypes directly: a cheap archetype with moderate effectiveness may have a higher value_per_dollar than an expensive one with high effectiveness.
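
For example, with the per-run averages from the run-start summary above, the cheaper skeptic outranks the stronger but pricier sage on this metric:

```
skeptic: 0.45 / 0.003 = 150 value per dollar
sage:    0.72 / 0.08  =   9 value per dollar
```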

## Effectiveness Script

**Location:** `lib/archeflow-score.sh`

```
Usage:
  archeflow-score.sh extract <events.jsonl>   # Score archetypes from a completed run
  archeflow-score.sh report                   # Show aggregate effectiveness report
  archeflow-score.sh recommend <team.yaml>    # Recommend model tiers for a team
```

### `extract` Command

1. Read all events from the JSONL file
2. Verify a `run.complete` event exists (scoring incomplete runs is unreliable)
3. For each `review.verdict` event:
   - Count total findings and useful findings (severity >= WARNING, fix_required)
   - Cross-reference with `fix.applied` events via the `source` field or DAG parent chain
   - Check for contradictions from other reviewers
   - Determine cycle impact
4. Calculate all scoring dimensions and composite score
5. Append per-archetype score records to `.archeflow/memory/effectiveness.jsonl`

### `report` Command

1. Read `.archeflow/memory/effectiveness.jsonl`
2. Group by archetype
3. Calculate rolling averages (last 10 runs per archetype)
4. Calculate trends (last 5 vs. prior 5)
5. Output a markdown table:

```markdown
# Archetype Effectiveness Report

| Archetype | Runs | Avg Score | S/N | Fix Rate | Cost Eff | Accuracy | Trend | Rec |
|-----------|------|-----------|-----|----------|----------|----------|-------|-----|
| guardian | 12 | 0.88 | 0.82 | 0.95 | 38.2 | 0.97 | stable | keep |
| sage | 10 | 0.72 | 0.70 | 0.80 | 12.1 | 0.88 | improving | keep |
| skeptic | 8 | 0.45 | 0.40 | 0.50 | 22.5 | 0.60 | stable | optimize |
| trickster | 8 | 0.35 | 0.20 | 0.30 | 5.1 | 0.55 | declining | consider_removing |

**Model suggestions:**
- skeptic (haiku, score 0.45): Consider upgrading to sonnet or tightening review lens
- trickster (haiku, score 0.35): Consider removing — low signal, low fix rate
```

### `recommend` Command

1. Read the team preset YAML file
2. For each archetype in the team, look up its effectiveness from `.archeflow/memory/effectiveness.jsonl`
3. Cross-reference current model assignment with effectiveness
4. Output recommendations:

```markdown
# Model Recommendations for team: story-development

| Archetype | Current Model | Score | Suggestion |
|-----------|--------------|-------|------------|
| guardian | haiku | 0.88 | Keep haiku — high effectiveness at low cost |
| sage | sonnet | 0.72 | Keep sonnet — quality-sensitive role |
| skeptic | haiku | 0.45 | Try sonnet — may improve signal quality |
| trickster | haiku | 0.35 | Consider removing from team |
```

## Design Principles

1. **Append-only.** Score records are immutable facts. Aggregates are computed on-demand.
2. **Review archetypes only.** Non-review agents (Explorer, Creator, Maker) are not scored — their value is in the final product, not in individual findings.
3. **Relative, not absolute.** Scores are meaningful in comparison (guardian vs. trickster), not as standalone numbers. The thresholds (0.7, 0.4) are starting points — calibrate after 20+ runs.
4. **Actionable.** Every report ends with concrete recommendations (keep, optimize, remove, change model).
5. **Cheap to compute.** One JSONL scan per report. No databases, no external services.