---
name: effectiveness
description: Track archetype effectiveness across runs. Scores each archetype on signal-to-noise, fix rate, cost efficiency, accuracy, and cycle impact. Recommends model tier changes and archetype removal based on rolling averages. <example>User: "Which reviewers are actually useful?"</example> <example>User: "Show archetype effectiveness report"</example>
---
# Agent Effectiveness Scoring
Track which archetypes are most useful vs. which waste tokens. Over multiple runs, build a profile of each archetype's effectiveness and use it to optimize team composition and model selection.
## Storage

```
.archeflow/memory/effectiveness.jsonl   # Per-run archetype scores (append-only)
```
## Scoring Dimensions
For each archetype that participates in a run, calculate these scores:
| Dimension | How Measured | Weight |
|---|---|---|
| Signal-to-noise | useful findings / total findings | 0.30 |
| Fix rate | findings that led to actual fixes / total findings | 0.25 |
| Cost efficiency | useful findings per dollar spent | 0.20 |
| Accuracy | findings not contradicted by other reviewers | 0.15 |
| Cycle impact | did this archetype's findings lead to cycle exit? | 0.10 |
## Definitions

- **Useful finding:** A finding in a `review.verdict` event with `severity >= WARNING` (i.e., severity is `warning`, `bug`, or `critical`) AND `fix_required == true`.
- **Actual fix:** A `fix.applied` event whose `source` field matches this archetype (or whose DAG `parent` chain traces back to this archetype's `review.verdict` event).
- **Contradicted finding:** Another reviewer's `review.verdict` has `verdict == "approved"` for the same scope where this archetype flagged an issue. Approximation: if archetype A flags N findings but archetype B approves the same code with 0 findings in overlapping severity categories, A's unmatched findings are considered potentially contradicted.
- **Cycle impact:** The archetype's findings (with `fix_required == true`) resulted in fixes that were part of the final approved cycle. Determined by checking if `fix.applied` events referencing this archetype exist before the final `cycle.boundary` with `met == true`.
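To make the "useful finding" count concrete, here is a minimal jq sketch. It assumes each event line carries an `event` field naming its type, an `archetype` field, and a `findings` array whose entries have `severity` and `fix_required`; the exact event schema comes from the event log, not this skill:

```bash
# Total vs. useful findings per archetype from one run's event log (sketch).
jq -s '
  map(select(.event == "review.verdict"))
  | group_by(.archetype)[]
  | {archetype: .[0].archetype,
     findings_total: ([.[].findings[]?] | length),
     findings_useful: ([.[].findings[]?
       | select((.severity == "warning" or .severity == "bug" or .severity == "critical")
                and .fix_required == true)] | length)}
' events.jsonl
```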
## Composite Score

```
composite = (signal_to_noise * 0.30)
          + (fix_rate * 0.25)
          + (cost_efficiency_normalized * 0.20)
          + (accuracy * 0.15)
          + (cycle_impact * 0.10)
```
**Cost efficiency normalization:** Raw cost efficiency is `useful_findings / cost_usd`. To normalize to the 0-1 range, use `min(1.0, raw_efficiency / 100)`. The threshold of 100 means "100 useful findings per dollar" is considered perfect efficiency (achievable with haiku on structured reviews).
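As a quick sketch of the arithmetic (assuming the five dimension values are already computed, with cycle impact passed as 0 or 1):

```bash
# Composite score from the five dimension values (sketch).
# Args: signal_to_noise fix_rate raw_cost_efficiency accuracy cycle_impact(0|1)
composite_score() {
  awk -v sn="$1" -v fr="$2" -v ce="$3" -v acc="$4" -v ci="$5" 'BEGIN {
    nce = ce / 100; if (nce > 1) nce = 1       # min(1.0, raw_efficiency / 100)
    printf "%.2f\n", sn*0.30 + fr*0.25 + nce*0.20 + acc*0.15 + ci*0.10
  }'
}

composite_score 0.85 1.0 42.5 1.0 1    # -> 0.84
```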
## Per-Run Scoring

After `run.complete`, calculate scores for each archetype that participated. The `extract` command does this.
### Per-Run Score Record

```json
{"ts":"2026-04-03T16:00:00Z","run_id":"2026-04-03-der-huster","archetype":"guardian","signal_to_noise":0.85,"fix_rate":1.0,"cost_efficiency":42.5,"accuracy":1.0,"cycle_impact":true,"composite_score":0.91,"tokens":5000,"cost_usd":0.004,"model":"haiku","findings_total":4,"findings_useful":3,"fixes_applied":3}
```

Appended to `.archeflow/memory/effectiveness.jsonl`.
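A minimal append sketch (fields abbreviated; `jq -n` builds the JSON so quoting stays correct, and `>>` keeps the log append-only):

```bash
# Append one per-run score record (sketch, abbreviated fields).
mkdir -p .archeflow/memory
jq -cn \
  --arg ts "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --arg run_id "$RUN_ID" \
  --arg archetype "$ARCHETYPE" \
  --argjson composite "$COMPOSITE" \
  '{ts: $ts, run_id: $run_id, archetype: $archetype, composite_score: $composite}' \
  >> .archeflow/memory/effectiveness.jsonl
```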
## Scoring Non-Review Archetypes

Only archetypes that produce `review.verdict` events are scored (Guardian, Skeptic, Sage, Trickster, and any custom review archetypes). Non-review archetypes (Explorer, Creator, Maker) are tracked by cost-tracking but not effectiveness-scored, because their output quality is measured differently (by whether the run succeeds, not by individual findings).
## Aggregate Scoring

Across all runs, maintain rolling averages (computed on-demand, not stored):

```json
{"archetype":"guardian","runs":12,"avg_composite":0.88,"avg_signal_noise":0.82,"avg_cost_efficiency":38.2,"trend":"stable","recommendation":"keep"}
{"archetype":"trickster","runs":8,"avg_composite":0.35,"avg_signal_noise":0.20,"avg_cost_efficiency":5.1,"trend":"declining","recommendation":"consider_removing"}
```
### Trend Calculation

Compare the average composite score of the last 5 runs to the 5 runs before that:

- `improving`: last-5 avg > prior-5 avg + 0.05
- `declining`: last-5 avg < prior-5 avg - 0.05
- `stable`: within +/- 0.05

If fewer than 10 runs exist, the trend is `insufficient_data`.
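A sketch for a single archetype (here `guardian`), reusing the +/- 0.05 band:

```bash
# Trend: mean composite of last 5 runs vs. the 5 before that (sketch).
jq -rs --arg a guardian '
  map(select(.archetype == $a)) | sort_by(.ts)
  | if length < 10 then "insufficient_data"
    else (.[-5:]    | map(.composite_score) | add / length) as $last
       | (.[-10:-5] | map(.composite_score) | add / length) as $prior
       | if   $last > $prior + 0.05 then "improving"
         elif $last < $prior - 0.05 then "declining"
         else "stable" end
    end
' .archeflow/memory/effectiveness.jsonl
```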
## Recommendations

Based on aggregate composite scores:

| Composite Score | Recommendation | Meaning |
|---|---|---|
| >= 0.70 | keep | Archetype is valuable, contributes meaningful findings |
| 0.40 - 0.69 | optimize | Consider a cheaper model or a tighter review lens |
| < 0.40 | consider_removing | May be wasting tokens; review whether it adds value |
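The mapping itself is a simple threshold check; a minimal sketch:

```bash
# Map an aggregate composite score to a recommendation (sketch).
recommend() {
  awk -v s="$1" 'BEGIN {
    if      (s >= 0.70) print "keep"
    else if (s >= 0.40) print "optimize"
    else                print "consider_removing"
  }'
}

recommend 0.88   # -> keep
recommend 0.35   # -> consider_removing
```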
## Integration Points

### At Run Start

When the run skill initializes, show a brief effectiveness summary for the team's archetypes:
```
Archetype effectiveness (last 10 runs):
  guardian:  0.88 (keep) — haiku, $0.004/run avg
  sage:      0.72 (keep) — sonnet, $0.08/run avg
  skeptic:   0.45 (optimize) — haiku, $0.003/run avg
  trickster: 0.32 (consider_removing) — haiku, $0.003/run avg
```
### Model Tier Suggestions
Cross-reference effectiveness with model assignment:
- High effectiveness on cheap model (composite >= 0.7, model = haiku): "Keep cheap. Working well."
- Low effectiveness on cheap model (composite < 0.5, model = haiku): "Consider upgrading to sonnet — cheap model may not be capturing issues."
- High effectiveness on expensive model (composite >= 0.7, model = sonnet): "Try downgrading to haiku — may maintain quality at lower cost."
- Low effectiveness on expensive model (composite < 0.5, model = sonnet): "Consider removing — expensive and not contributing."
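A sketch of these four rules as a shell helper (scores between 0.5 and 0.7 fall through with no suggestion, matching the gaps above):

```bash
# Model tier suggestion from composite score + current model (sketch).
tier_suggestion() {  # args: score model
  awk -v s="$1" -v m="$2" 'BEGIN {
    if      (m == "haiku"  && s >= 0.7) print "Keep cheap. Working well."
    else if (m == "haiku"  && s <  0.5) print "Consider upgrading to sonnet."
    else if (m == "sonnet" && s >= 0.7) print "Try downgrading to haiku."
    else if (m == "sonnet" && s <  0.5) print "Consider removing."
    else                                print "No change suggested."
  }'
}
```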
### Cost-Tracking Integration

Divide the composite score by estimated cost to get "value per dollar":

```
value_per_dollar = composite_score / cost_usd
```
This metric helps compare archetypes directly: a cheap archetype with moderate effectiveness may have higher value_per_dollar than an expensive one with high effectiveness.
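For example, the guardian record above yields 0.91 / 0.004 ≈ 227.5, while a sonnet reviewer scoring 0.72 at $0.08 per run yields 9.0.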
## Effectiveness Script

Location: `lib/archeflow-score.sh`

Usage:

```
archeflow-score.sh extract <events.jsonl>   # Score archetypes from a completed run
archeflow-score.sh report                   # Show aggregate effectiveness report
archeflow-score.sh recommend <team.yaml>    # Recommend model tiers for a team
```
### extract Command

- Read all events from the JSONL file
- Verify a `run.complete` event exists (scoring incomplete runs is unreliable)
- For each `review.verdict` event:
  - Count total findings and useful findings (severity >= WARNING, `fix_required`)
  - Cross-reference with `fix.applied` events via the `source` field or DAG parent chain (sketched below)
  - Check for contradictions from other reviewers
  - Determine cycle impact
- Calculate all scoring dimensions and the composite score
- Append per-archetype score records to `.archeflow/memory/effectiveness.jsonl`
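The `fix.applied` cross-reference can be sketched with jq, assuming each `fix.applied` line carries a `source` field naming the originating archetype (the DAG-parent fallback is omitted):

```bash
# Fixes attributed to each archetype: the numerator of the fix rate (sketch).
jq -rs '
  map(select(.event == "fix.applied"))
  | group_by(.source)[]
  | "\(.[0].source)\t\(length) fixes"
' events.jsonl
```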
### report Command

- Read `.archeflow/memory/effectiveness.jsonl`
- Group by archetype
- Calculate rolling averages (last 10 runs per archetype)
- Calculate trends (last 5 vs. prior 5)
- Output a markdown table:
```markdown
# Archetype Effectiveness Report

| Archetype | Runs | Avg Score | S/N | Fix Rate | Cost Eff | Accuracy | Trend | Rec |
|-----------|------|-----------|-----|----------|----------|----------|-------|-----|
| guardian | 12 | 0.88 | 0.82 | 0.95 | 38.2 | 0.97 | stable | keep |
| sage | 10 | 0.72 | 0.70 | 0.80 | 12.1 | 0.88 | improving | keep |
| skeptic | 8 | 0.45 | 0.40 | 0.50 | 22.5 | 0.60 | stable | optimize |
| trickster | 8 | 0.35 | 0.20 | 0.30 | 5.1 | 0.55 | declining | consider_removing |

**Model suggestions:**
- skeptic (haiku, score 0.45): Consider upgrading to sonnet or tightening review lens
- trickster (haiku, score 0.35): Consider removing — low signal, low fix rate
```
### recommend Command

- Read the team preset YAML file
- For each archetype in the team, look up its effectiveness from `.archeflow/memory/effectiveness.jsonl`
- Cross-reference current model assignment with effectiveness
- Output recommendations:
```markdown
# Model Recommendations for team: story-development

| Archetype | Current Model | Score | Suggestion |
|-----------|--------------|-------|------------|
| guardian | haiku | 0.88 | Keep haiku — high effectiveness at low cost |
| sage | sonnet | 0.72 | Keep sonnet — quality-sensitive role |
| skeptic | haiku | 0.45 | Try sonnet — may improve signal quality |
| trickster | haiku | 0.35 | Consider removing from team |
```
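A sketch of the lookup step (team-YAML parsing is elided; the archetype names are supplied by hand here):

```bash
# Last-10-run average composite per team archetype (sketch).
for a in guardian sage skeptic trickster; do
  score=$(jq -rs --arg a "$a" '
    map(select(.archetype == $a)) | sort_by(.ts) | .[-10:]
    | if length == 0 then "no data"
      else (map(.composite_score) | add / length) end
  ' .archeflow/memory/effectiveness.jsonl)
  printf '%s\t%s\n' "$a" "$score"
done
```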
## Design Principles

- **Append-only.** Score records are immutable facts. Aggregates are computed on-demand.
- **Review archetypes only.** Non-review agents (Explorer, Creator, Maker) are not scored — their value is in the final product, not in individual findings.
- **Relative, not absolute.** Scores are meaningful in comparison (guardian vs. trickster), not as standalone numbers. The thresholds (0.7, 0.4) are starting points — calibrate after 20+ runs.
- **Actionable.** Every report ends with concrete recommendations (keep, optimize, remove, change model).
- **Cheap to compute.** One JSONL scan per report. No databases, no external services.