feat: add memory, convergence, colette bridge, templates, progress, effectiveness, git integration
- skills/memory: cross-run learning from recurring findings + lib/archeflow-memory.sh
- skills/convergence: oscillation detection + early termination in multi-cycle runs
- skills/colette-bridge: auto-inject voice profiles, personas, characters from colette.yaml
- skills/templates: workflow/team/archetype gallery with init/save/share
- skills/progress: live .archeflow/progress.md during runs
- skills/effectiveness: per-archetype signal-to-noise + cost efficiency scoring
- skills/git-integration: auto-branch per run, commit per phase, rollback support
skills/effectiveness/SKILL.md (new file, 200 lines)
---
name: effectiveness
description: |
  Track archetype effectiveness across runs. Scores each archetype on signal-to-noise,
  fix rate, cost efficiency, accuracy, and cycle impact. Recommends model tier changes
  and archetype removal based on rolling averages.
  <example>User: "Which reviewers are actually useful?"</example>
  <example>User: "Show archetype effectiveness report"</example>
---

# Agent Effectiveness Scoring

Track which archetypes are most useful and which waste tokens. Over multiple runs, build a profile of each archetype's effectiveness and use it to optimize team composition and model selection.

## Storage

```
.archeflow/memory/effectiveness.jsonl   # Per-run archetype scores (append-only)
```

## Scoring Dimensions

For each archetype that participates in a run, calculate these scores:

| Dimension | How Measured | Weight |
|-----------|--------------|--------|
| **Signal-to-noise** | useful findings / total findings | 0.30 |
| **Fix rate** | findings that led to actual fixes / total findings | 0.25 |
| **Cost efficiency** | useful findings per dollar spent | 0.20 |
| **Accuracy** | findings not contradicted by other reviewers | 0.15 |
| **Cycle impact** | did this archetype's findings lead to cycle exit? | 0.10 |

### Definitions

- **Useful finding**: A finding in a `review.verdict` event with `severity >= WARNING` (i.e., severity is `warning`, `bug`, or `critical`) AND `fix_required == true`.
- **Actual fix**: A `fix.applied` event whose `source` field matches this archetype (or whose DAG `parent` chain traces back to this archetype's `review.verdict` event).
- **Contradicted finding**: Another reviewer's `review.verdict` has `verdict == "approved"` for the same scope where this archetype flagged an issue. Approximation: if archetype A flags N findings but archetype B approves the same code with 0 findings in overlapping severity categories, A's unmatched findings are considered potentially contradicted.
- **Cycle impact**: The archetype's findings (with `fix_required == true`) resulted in fixes that were part of the final approved cycle. Determined by checking if `fix.applied` events referencing this archetype exist before the final `cycle.boundary` with `met == true`.

### Composite Score

```
composite = (signal_to_noise * 0.30)
          + (fix_rate * 0.25)
          + (cost_efficiency_normalized * 0.20)
          + (accuracy * 0.15)
          + (cycle_impact * 0.10)
```

**Cost efficiency normalization**: Raw cost efficiency is `useful_findings / cost_usd`. To normalize it to the 0-1 range, use `min(1.0, raw_efficiency / 100)`. The threshold of 100 means "100 useful findings per dollar" is considered perfect efficiency (achievable with haiku on structured reviews).
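
The formula and normalization above can be sketched as a small shell helper (a sketch only; `awk` does the floating-point math, and the argument order is this example's assumption, not the script's actual interface):

```shell
# composite_score S2N FIX_RATE RAW_COST_EFF ACCURACY CYCLE_IMPACT(0|1)
# Weights and the /100 normalization come from the tables above.
composite_score() {
  awk -v s2n="$1" -v fr="$2" -v ce="$3" -v acc="$4" -v ci="$5" 'BEGIN {
    norm = ce / 100                 # normalize raw cost efficiency
    if (norm > 1) norm = 1          # min(1.0, raw_efficiency / 100)
    printf "%.2f\n", s2n*0.30 + fr*0.25 + norm*0.20 + acc*0.15 + ci*0.10
  }'
}

composite_score 0.80 0.75 30 0.90 1   # prints 0.72
```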

## Per-Run Scoring

After `run.complete`, calculate scores for each archetype that participated. The `extract` command does this.

### Per-Run Score Record

```jsonl
{"ts":"2026-04-03T16:00:00Z","run_id":"2026-04-03-der-huster","archetype":"guardian","signal_to_noise":0.85,"fix_rate":1.0,"cost_efficiency":42.5,"accuracy":1.0,"cycle_impact":true,"composite_score":0.91,"tokens":5000,"cost_usd":0.004,"model":"haiku","findings_total":4,"findings_useful":3,"fixes_applied":3}
```

Appended to `.archeflow/memory/effectiveness.jsonl`.
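
A minimal sketch of the append step (the abbreviated record is illustrative; a real one carries all the fields shown above). Writing one compact JSON object per line keeps the file valid JSONL:

```shell
# Append one (abbreviated, illustrative) score record to the store.
mkdir -p .archeflow/memory
record='{"ts":"2026-04-03T16:00:00Z","archetype":"guardian","composite_score":0.91}'
printf '%s\n' "$record" >> .archeflow/memory/effectiveness.jsonl
```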

### Scoring Non-Review Archetypes

Only archetypes that produce `review.verdict` events are scored (Guardian, Skeptic, Sage, Trickster, and any custom review archetypes). Non-review archetypes (Explorer, Creator, Maker) are tracked by cost-tracking but not effectiveness-scored, because their output quality is measured differently (by whether the run succeeds, not by individual findings).

## Aggregate Scoring

Across all runs, maintain rolling averages (computed on demand, not stored):

```jsonl
{"archetype":"guardian","runs":12,"avg_composite":0.88,"avg_signal_noise":0.82,"avg_cost_efficiency":38.2,"trend":"stable","recommendation":"keep"}
{"archetype":"trickster","runs":8,"avg_composite":0.35,"avg_signal_noise":0.20,"avg_cost_efficiency":5.1,"trend":"declining","recommendation":"consider_removing"}
```

### Trend Calculation

Compare the average composite score of the last 5 runs to the 5 runs before that:

- **improving**: last-5 avg > prior-5 avg + 0.05
- **declining**: last-5 avg < prior-5 avg - 0.05
- **stable**: within +/- 0.05

If fewer than 10 runs exist, trend is `"insufficient_data"`.
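
The trend rule above can be sketched as a shell filter (a sketch; it reads one composite score per line, oldest first):

```shell
# trend: compare the last-5 average to the prior-5 average, per the
# thresholds above; fewer than 10 scores yields "insufficient_data".
trend() {
  awk '
    { s[NR] = $1 }
    END {
      if (NR < 10) { print "insufficient_data"; exit }
      for (i = NR - 4; i <= NR;     i++) last  += s[i]
      for (i = NR - 9; i <= NR - 5; i++) prior += s[i]
      d = last / 5 - prior / 5
      if      (d >  0.05) print "improving"
      else if (d < -0.05) print "declining"
      else                print "stable"
    }'
}

printf '%s\n' 0.60 0.62 0.58 0.61 0.60 0.70 0.72 0.71 0.74 0.73 | trend   # improving
```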

### Recommendations

Based on aggregate composite scores:

| Composite Score | Recommendation | Meaning |
|-----------------|----------------|---------|
| >= 0.70 | `keep` | Archetype is valuable, contributes meaningful findings |
| 0.40 - 0.69 | `optimize` | Consider cheaper model or tighter review lens |
| < 0.40 | `consider_removing` | Might be wasting tokens, review whether it adds value |
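
The threshold table maps directly onto a small helper (a sketch using the cutoffs above):

```shell
# recommend COMPOSITE — map an aggregate composite score to a recommendation.
recommend() {
  awk -v s="$1" 'BEGIN {
    if      (s >= 0.70) print "keep"
    else if (s >= 0.40) print "optimize"
    else                print "consider_removing"
  }'
}

recommend 0.88   # keep
recommend 0.45   # optimize
recommend 0.35   # consider_removing
```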

## Integration Points

### At Run Start

When the `run` skill initializes, show a brief effectiveness summary for the team's archetypes:

```
Archetype effectiveness (last 10 runs):
  guardian: 0.88 (keep) — haiku, $0.004/run avg
  sage: 0.72 (keep) — sonnet, $0.08/run avg
  skeptic: 0.45 (optimize) — haiku, $0.003/run avg
  trickster: 0.32 (consider_removing) — haiku, $0.003/run avg
```

### Model Tier Suggestions

Cross-reference effectiveness with model assignment:

- **High effectiveness on cheap model** (composite >= 0.7, model = haiku): "Keep cheap. Working well."
- **Low effectiveness on cheap model** (composite < 0.5, model = haiku): "Consider upgrading to sonnet — cheap model may not be capturing issues."
- **High effectiveness on expensive model** (composite >= 0.7, model = sonnet): "Try downgrading to haiku — may maintain quality at lower cost."
- **Low effectiveness on expensive model** (composite < 0.5, model = sonnet): "Consider removing — expensive and not contributing."
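
The four rules above can be sketched as one lookup (output wording is illustrative; note the rules leave scores in the 0.5-0.7 band without a suggestion):

```shell
# tier_suggestion COMPOSITE MODEL — cross-reference score with model tier.
tier_suggestion() {
  awk -v s="$1" -v m="$2" 'BEGIN {
    if (m == "haiku"  && s >= 0.7) print "keep cheap"
    if (m == "haiku"  && s <  0.5) print "upgrade to sonnet"
    if (m == "sonnet" && s >= 0.7) print "try downgrading to haiku"
    if (m == "sonnet" && s <  0.5) print "consider removing"
  }'
}

tier_suggestion 0.88 haiku    # keep cheap
tier_suggestion 0.45 sonnet   # consider removing
```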

### Cost-Tracking Integration

Divide effectiveness by estimated cost to get "value per dollar":

```
value_per_dollar = composite_score / cost_usd
```

This metric helps compare archetypes directly: a cheap archetype with moderate effectiveness may have a higher value_per_dollar than an expensive one with high effectiveness.
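
Using the run-start summary's numbers above (guardian 0.88 at $0.004/run, sage 0.72 at $0.08/run), a quick check of the metric:

```shell
# value_per_dollar COMPOSITE COST_USD
value_per_dollar() {
  awk -v s="$1" -v c="$2" 'BEGIN { printf "%.1f\n", s / c }'
}

value_per_dollar 0.88 0.004   # guardian: 220.0
value_per_dollar 0.72 0.08    # sage: 9.0
```

Despite sage's solid score, guardian's value_per_dollar is more than 20x higher, which is exactly the comparison this metric is for.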

## Effectiveness Script

**Location:** `lib/archeflow-score.sh`

```
Usage:
  archeflow-score.sh extract <events.jsonl>   # Score archetypes from a completed run
  archeflow-score.sh report                   # Show aggregate effectiveness report
  archeflow-score.sh recommend <team.yaml>    # Recommend model tiers for a team
```

### `extract` Command

1. Read all events from the JSONL file
2. Verify a `run.complete` event exists (scoring incomplete runs is unreliable)
3. For each `review.verdict` event:
   - Count total findings and useful findings (severity >= WARNING, fix_required)
   - Cross-reference with `fix.applied` events via the `source` field or DAG parent chain
   - Check for contradictions from other reviewers
   - Determine cycle impact
4. Calculate all scoring dimensions and composite score
5. Append per-archetype score records to `.archeflow/memory/effectiveness.jsonl`
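
Steps 2-3 can be roughed out with `grep` over a fixture file (a sketch only: the event field names here are simplified stand-ins for the real schema, and the real script also follows DAG parent chains rather than just string-matching):

```shell
# Tiny fixture standing in for a completed run's events.jsonl.
cat > events.jsonl <<'EOF'
{"type":"review.verdict","archetype":"guardian","findings_total":3}
{"type":"fix.applied","source":"guardian"}
{"type":"run.complete"}
EOF

# Step 2: refuse to score a run with no run.complete event.
grep -q '"type":"run.complete"' events.jsonl || { echo "incomplete run" >&2; exit 1; }

# Step 3 (partial): tally verdicts and applied fixes per archetype.
verdicts=$(grep -c '"type":"review.verdict".*"archetype":"guardian"' events.jsonl)
fixes=$(grep -c '"type":"fix.applied".*"source":"guardian"' events.jsonl)
echo "guardian: $verdicts verdict(s), $fixes fix(es)"   # guardian: 1 verdict(s), 1 fix(es)
```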

### `report` Command

1. Read `.archeflow/memory/effectiveness.jsonl`
2. Group by archetype
3. Calculate rolling averages (last 10 runs per archetype)
4. Calculate trends (last 5 vs. prior 5)
5. Output a markdown table:

```markdown
# Archetype Effectiveness Report

| Archetype | Runs | Avg Score | S/N | Fix Rate | Cost Eff | Accuracy | Trend | Rec |
|-----------|------|-----------|-----|----------|----------|----------|-------|-----|
| guardian | 12 | 0.88 | 0.82 | 0.95 | 38.2 | 0.97 | stable | keep |
| sage | 10 | 0.72 | 0.70 | 0.80 | 12.1 | 0.88 | improving | keep |
| skeptic | 8 | 0.45 | 0.40 | 0.50 | 22.5 | 0.60 | stable | optimize |
| trickster | 8 | 0.35 | 0.20 | 0.30 | 5.1 | 0.55 | declining | consider_removing |

**Model suggestions:**
- skeptic (haiku, score 0.45): Consider upgrading to sonnet or tightening review lens
- trickster (haiku, score 0.35): Consider removing — low signal, low fix rate
```
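
The grouping and averaging steps of the report can be sketched with `awk`'s `match()` (a sketch: it assumes the flat record shape shown earlier and averages all runs rather than windowing to the last 10):

```shell
# avg_by_archetype: read score records on stdin, print "<archetype> <avg> <runs>".
avg_by_archetype() {
  awk '
    # "archetype":"X" — the value starts 13 chars in, minus the closing quote.
    match($0, /"archetype":"[^"]*"/)       { a = substr($0, RSTART + 13, RLENGTH - 14) }
    # "composite_score":N — the value starts 18 chars in.
    match($0, /"composite_score":[0-9.]+/) { sum[a] += substr($0, RSTART + 18, RLENGTH - 18); n[a]++ }
    END { for (k in sum) printf "%s %.2f %d\n", k, sum[k] / n[k], n[k] }'
}

printf '%s\n' \
  '{"archetype":"guardian","composite_score":0.90}' \
  '{"archetype":"guardian","composite_score":0.86}' | avg_by_archetype   # guardian 0.88 2
```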

### `recommend` Command

1. Read the team preset YAML file
2. For each archetype in the team, look up its effectiveness from `.archeflow/memory/effectiveness.jsonl`
3. Cross-reference current model assignment with effectiveness
4. Output recommendations:

```markdown
# Model Recommendations for team: story-development

| Archetype | Current Model | Score | Suggestion |
|-----------|---------------|-------|------------|
| guardian | haiku | 0.88 | Keep haiku — high effectiveness at low cost |
| sage | sonnet | 0.72 | Keep sonnet — quality-sensitive role |
| skeptic | haiku | 0.45 | Try sonnet — may improve signal quality |
| trickster | haiku | 0.35 | Consider removing from team |
```

## Design Principles

1. **Append-only.** Score records are immutable facts. Aggregates are computed on demand.
2. **Review archetypes only.** Non-review agents (Explorer, Creator, Maker) are not scored — their value is in the final product, not in individual findings.
3. **Relative, not absolute.** Scores are meaningful in comparison (guardian vs. trickster), not as standalone numbers. The thresholds (0.7, 0.4) are starting points — calibrate after 20+ runs.
4. **Actionable.** Every report ends with concrete recommendations (keep, optimize, remove, change model).
5. **Cheap to compute.** One JSONL scan per report. No databases, no external services.