---
name: effectiveness
description: Track archetype effectiveness across runs. Scores each archetype on signal-to-noise, fix rate, cost efficiency, accuracy, and cycle impact. Recommends model tier changes and archetype removal based on rolling averages. <example>User: "Which reviewers are actually useful?"</example> <example>User: "Show archetype effectiveness report"</example>
---

# Agent Effectiveness Scoring

Track which archetypes are most useful vs. which waste tokens. Over multiple runs, build a profile of each archetype's effectiveness and use it to optimize team composition and model selection.

## Storage

```
.archeflow/memory/effectiveness.jsonl     # Per-run archetype scores (append-only)
```

## Scoring Dimensions

For each archetype that participates in a run, calculate these scores:

| Dimension | How Measured | Weight |
|-----------|--------------|--------|
| Signal-to-noise | useful findings / total findings | 0.30 |
| Fix rate | findings that led to actual fixes / total findings | 0.25 |
| Cost efficiency | useful findings per dollar spent | 0.20 |
| Accuracy | findings not contradicted by other reviewers | 0.15 |
| Cycle impact | did this archetype's findings lead to cycle exit? | 0.10 |

### Definitions

- **Useful finding:** A finding in a `review.verdict` event with severity >= WARNING (i.e., severity is `warning`, `bug`, or `critical`) AND `fix_required == true`.
- **Actual fix:** A `fix.applied` event whose `source` field matches this archetype (or whose DAG parent chain traces back to this archetype's `review.verdict` event).
- **Contradicted finding:** Another reviewer's `review.verdict` has `verdict == "approved"` for the same scope where this archetype flagged an issue. Approximation: if archetype A flags N findings but archetype B approves the same code with 0 findings in overlapping severity categories, A's unmatched findings are considered potentially contradicted.
- **Cycle impact:** The archetype's findings (with `fix_required == true`) resulted in fixes that were part of the final approved cycle. Determined by checking if `fix.applied` events referencing this archetype exist before the final `cycle.boundary` with `met == true`.
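In practice, "useful finding" is a filter over the event log. A minimal jq sketch, assuming `review.verdict` events carry an `event` type field, an `archetype`, and a `findings` array with `severity` and `fix_required` entries (the exact schema is not spelled out here):

```bash
# Count one archetype's useful findings per the definition above.
# ASSUMPTION: field names (`event`, `archetype`, `findings`,
# `severity`, `fix_required`) are illustrative, not confirmed.
jq -s '
  [ .[]
    | select(.event == "review.verdict" and .archetype == "guardian")
    | .findings[]?
    | select(.fix_required == true
             and (.severity == "warning" or .severity == "bug" or .severity == "critical"))
  ] | length
' events.jsonl
```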

## Composite Score

```
composite = (signal_to_noise * 0.30)
          + (fix_rate * 0.25)
          + (cost_efficiency_normalized * 0.20)
          + (accuracy * 0.15)
          + (cycle_impact * 0.10)
```

**Cost efficiency normalization:** Raw cost efficiency is `useful_findings / cost_usd`. To normalize to the 0-1 range, use `min(1.0, raw_efficiency / 100)`. The threshold of 100 means "100 useful findings per dollar" is considered perfect efficiency (achievable with haiku on structured reviews).
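Put together, the composite for one per-run score record (the record shape shown in the next section) can be computed with a short jq program. A sketch, not necessarily how `archeflow-score.sh` implements it:

```bash
# Composite score for one per-run record, including the
# min(1.0, raw_efficiency / 100) normalization of cost efficiency.
jq '
  ([.cost_efficiency / 100, 1.0] | min) as $eff
  | (.signal_to_noise * 0.30)
    + (.fix_rate * 0.25)
    + ($eff * 0.20)
    + (.accuracy * 0.15)
    + ((if .cycle_impact then 1 else 0 end) * 0.10)
' score-record.json
```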

## Per-Run Scoring

After `run.complete`, calculate scores for each archetype that participated. The `extract` command does this.

### Per-Run Score Record

```json
{"ts":"2026-04-03T16:00:00Z","run_id":"2026-04-03-der-huster","archetype":"guardian","signal_to_noise":0.85,"fix_rate":1.0,"cost_efficiency":42.5,"accuracy":1.0,"cycle_impact":true,"composite_score":0.91,"tokens":5000,"cost_usd":0.004,"model":"haiku","findings_total":4,"findings_useful":3,"fixes_applied":3}
```

Appended to `.archeflow/memory/effectiveness.jsonl`.

## Scoring Non-Review Archetypes

Only archetypes that produce `review.verdict` events are scored (Guardian, Skeptic, Sage, Trickster, and any custom review archetypes). Non-review archetypes (Explorer, Creator, Maker) are covered by cost-tracking but not effectiveness-scored, because their output quality is measured differently: by whether the run succeeds, not by individual findings.

## Aggregate Scoring

Across all runs, maintain rolling averages (computed on-demand, not stored):

```json
{"archetype":"guardian","runs":12,"avg_composite":0.88,"avg_signal_noise":0.82,"avg_cost_efficiency":38.2,"trend":"stable","recommendation":"keep"}
{"archetype":"trickster","runs":8,"avg_composite":0.35,"avg_signal_noise":0.20,"avg_cost_efficiency":5.1,"trend":"declining","recommendation":"consider_removing"}
```

### Trend Calculation

Compare the average composite score of the last 5 runs to the 5 runs before that:

- **improving:** last-5 avg > prior-5 avg + 0.05
- **declining:** last-5 avg < prior-5 avg - 0.05
- **stable:** within ±0.05

If fewer than 10 runs exist, the trend is `insufficient_data`.
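Expressed as a jq sketch over the score log (assuming records are appended in run order):

```bash
# Trend for one archetype: last-5 average vs. the 5 runs before that.
jq -s '
  [ .[] | select(.archetype == "guardian") | .composite_score ] as $s
  | if ($s | length) < 10 then "insufficient_data"
    else
      ($s[-5:] | add / 5) as $last
      | ($s[-10:-5] | add / 5) as $prior
      | if $last > $prior + 0.05 then "improving"
        elif $last < $prior - 0.05 then "declining"
        else "stable" end
    end
' .archeflow/memory/effectiveness.jsonl
```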

## Recommendations

Based on aggregate composite scores:

| Composite Score | Recommendation | Meaning |
|-----------------|----------------|---------|
| >= 0.70 | `keep` | Archetype is valuable; contributes meaningful findings |
| 0.40 - 0.69 | `optimize` | Consider a cheaper model or a tighter review lens |
| < 0.40 | `consider_removing` | May be wasting tokens; review whether it adds value |
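The mapping is mechanical; a minimal shell helper (thresholds from the table above) might look like this:

```bash
# Map an aggregate composite score to a recommendation.
recommend() {
  awk -v s="$1" 'BEGIN {
    if (s >= 0.70)      print "keep"
    else if (s >= 0.40) print "optimize"
    else                print "consider_removing"
  }'
}

recommend 0.88   # keep
recommend 0.35   # consider_removing
```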

## Integration Points

### At Run Start

When the run skill initializes, show a brief effectiveness summary for the team's archetypes:

```
Archetype effectiveness (last 10 runs):
  guardian:  0.88 (keep)     — haiku, $0.004/run avg
  sage:      0.72 (keep)     — sonnet, $0.08/run avg
  skeptic:   0.45 (optimize) — haiku, $0.003/run avg
  trickster: 0.32 (consider_removing) — haiku, $0.003/run avg
```

### Model Tier Suggestions

Cross-reference effectiveness with model assignment:

- **High effectiveness on cheap model** (composite >= 0.7, model = haiku): "Keep cheap. Working well."
- **Low effectiveness on cheap model** (composite < 0.5, model = haiku): "Consider upgrading to sonnet — cheap model may not be capturing issues."
- **High effectiveness on expensive model** (composite >= 0.7, model = sonnet): "Try downgrading to haiku — may maintain quality at lower cost."
- **Low effectiveness on expensive model** (composite < 0.5, model = sonnet): "Consider removing — expensive and not contributing."
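As a sketch, this decision table is a two-key lookup. The 0.5-0.7 middle band is not specified above, so the fallback branch here is an assumption:

```bash
# Suggest a model tier change from (composite score, current model).
suggest() {  # usage: suggest <composite> <model>
  awk -v s="$1" -v m="$2" 'BEGIN {
    if      (m == "haiku"  && s >= 0.7) print "Keep cheap. Working well."
    else if (m == "haiku"  && s <  0.5) print "Consider upgrading to sonnet."
    else if (m == "sonnet" && s >= 0.7) print "Try downgrading to haiku."
    else if (m == "sonnet" && s <  0.5) print "Consider removing."
    else print "No change suggested."   # ASSUMPTION: mid-band default
  }'
}
```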

### Cost-Tracking Integration

Divide effectiveness by estimated cost to get "value per dollar":

```
value_per_dollar = composite_score / cost_usd
```

This metric helps compare archetypes directly: a cheap archetype with moderate effectiveness may have a higher `value_per_dollar` than an expensive one with high effectiveness.
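Computed per score record, this is a one-liner over the log:

```bash
# Value per dollar for every per-run score record.
jq -c '{archetype, value_per_dollar: (.composite_score / .cost_usd)}' \
  .archeflow/memory/effectiveness.jsonl
```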

## Effectiveness Script

Location: `lib/archeflow-score.sh`

Usage:

```
archeflow-score.sh extract <events.jsonl>     # Score archetypes from a completed run
archeflow-score.sh report                     # Show aggregate effectiveness report
archeflow-score.sh recommend <team.yaml>      # Recommend model tiers for a team
```

### `extract` Command

1. Read all events from the JSONL file
2. Verify a `run.complete` event exists (scoring incomplete runs is unreliable)
3. For each `review.verdict` event:
   - Count total findings and useful findings (severity >= WARNING, `fix_required`)
   - Cross-reference with `fix.applied` events via the `source` field or DAG parent chain
   - Check for contradictions from other reviewers
   - Determine cycle impact
4. Calculate all scoring dimensions and the composite score
5. Append per-archetype score records to `.archeflow/memory/effectiveness.jsonl`
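A condensed sketch of steps 2 through 5. The event schema and the script's actual internals are assumptions; the accuracy and cycle-impact checks are elided:

```bash
#!/usr/bin/env bash
# Sketch of the extract flow. ASSUMPTIONS: events carry an `event` field;
# review.verdict events have `archetype` and `findings`; fix.applied
# events have `source`. Output records here are deliberately partial.
set -euo pipefail
events="$1"
out=".archeflow/memory/effectiveness.jsonl"

# Step 2: refuse to score an incomplete run.
if [ "$(jq -s 'any(.[]; .event == "run.complete")' "$events")" != "true" ]; then
  echo "no run.complete event; refusing to score" >&2
  exit 1
fi

# Steps 3-5 (simplified): signal-to-noise and fix rate per archetype.
jq -sc '
  [ .[] | select(.event == "fix.applied") ] as $fixes
  | [ .[] | select(.event == "review.verdict") ]
  | group_by(.archetype)[]
  | .[0].archetype as $arch
  | ([ .[].findings[]? ] | length) as $total
  | ([ .[].findings[]?
       | select(.fix_required == true
                and (.severity == "warning" or .severity == "bug" or .severity == "critical"))
     ] | length) as $useful
  | ([ $fixes[] | select(.source == $arch) ] | length) as $fixed
  | { archetype: $arch, findings_total: $total, findings_useful: $useful,
      signal_to_noise: (if $total > 0 then $useful / $total else 0 end),
      fix_rate:        (if $total > 0 then $fixed  / $total else 0 end) }
' "$events" >> "$out"
```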

### `report` Command

1. Read `.archeflow/memory/effectiveness.jsonl`
2. Group by archetype
3. Calculate rolling averages (last 10 runs per archetype)
4. Calculate trends (last 5 vs. prior 5)
5. Output a markdown table:

```markdown
# Archetype Effectiveness Report

| Archetype | Runs | Avg Score | S/N | Fix Rate | Cost Eff | Accuracy | Trend | Rec |
|-----------|------|-----------|-----|----------|----------|----------|-------|-----|
| guardian | 12 | 0.88 | 0.82 | 0.95 | 38.2 | 0.97 | stable | keep |
| sage | 10 | 0.72 | 0.70 | 0.80 | 12.1 | 0.88 | improving | keep |
| skeptic | 8 | 0.45 | 0.40 | 0.50 | 22.5 | 0.60 | stable | optimize |
| trickster | 8 | 0.35 | 0.20 | 0.30 | 5.1 | 0.55 | declining | consider_removing |

**Model suggestions:**
- skeptic (haiku, score 0.45): Consider upgrading to sonnet or tightening review lens
- trickster (haiku, score 0.35): Consider removing — low signal, low fix rate
```
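Steps 2 and 3 reduce to a group-and-average pass. A sketch, assuming append order within each archetype group reflects run order:

```bash
# Rolling average composite over each archetype's last 10 records.
jq -sc '
  group_by(.archetype)[]
  | .[-10:] as $recent
  | { archetype: .[0].archetype,
      runs: length,
      avg_composite: (($recent | map(.composite_score) | add) / ($recent | length)) }
' .archeflow/memory/effectiveness.jsonl
```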

### `recommend` Command

1. Read the team preset YAML file
2. For each archetype in the team, look up its effectiveness from `.archeflow/memory/effectiveness.jsonl`
3. Cross-reference the current model assignment with effectiveness
4. Output recommendations:

```markdown
# Model Recommendations for team: story-development

| Archetype | Current Model | Score | Suggestion |
|-----------|--------------|-------|------------|
| guardian | haiku | 0.88 | Keep haiku — high effectiveness at low cost |
| sage | sonnet | 0.72 | Keep sonnet — quality-sensitive role |
| skeptic | haiku | 0.45 | Try sonnet — may improve signal quality |
| trickster | haiku | 0.35 | Consider removing from team |
```

## Design Principles

1. **Append-only.** Score records are immutable facts. Aggregates are computed on-demand.
2. **Review archetypes only.** Non-review agents (Explorer, Creator, Maker) are not scored — their value is in the final product, not in individual findings.
3. **Relative, not absolute.** Scores are meaningful in comparison (guardian vs. trickster), not as standalone numbers. The thresholds (0.7, 0.4) are starting points — calibrate after 20+ runs.
4. **Actionable.** Every report ends with concrete recommendations (keep, optimize, remove, change model).
5. **Cheap to compute.** One JSONL scan per report. No databases, no external services.