feat: core improvements — feedback loop, attention filters, shadow heuristics, metrics, auto-activation

- Cross-cycle feedback protocol with structured finding format, routing, and resolution tracking
- Attention filter enforcement: explicit context include/exclude per archetype
- Shadow detection: quantitative checklists with concrete thresholds
- Orchestration metrics: per-phase timing, agent count, findings summary
- Autonomous mode wiring: checkpoint protocol, session log, stop conditions
- Auto-activation: SessionStart hook fires ArcheFlow for implementation tasks without user config
- Emoji avatars for all 7 archetypes
- Standardized finding format across all reviewers for cross-cycle tracking
- Persisted implementation plan in docs/

2026-04-03 06:02:10 +02:00
parent eec1fc3d82
commit d08dc657d1
14 changed files with 553 additions and 85 deletions
--- a/skills/shadow-detection/SKILL.md
+++ b/skills/shadow-detection/SKILL.md
@@ -30,10 +30,11 @@ Maintainability Judgment      → reviews only      → Bureaucrat
 - Reading more than 15 files without producing findings
 - Output is a raw inventory of files with no analysis or recommendation

-**Triggers:**
-- Output length > 2000 words without a recommendation section
-- More than 3 "see also" or "related" tangents
-- No patterns or recommendation in output
+**Detection Checklist** (trigger on ANY):
+- [ ] Output >2000 words without a `### Recommendation` section
+- [ ] >3 tangent topics not directly related to the original task
+- [ ] >15 files read with no `### Patterns` identified
+- [ ] No synthesis language (recommend, suggest, conclusion, finding, summary) in final 25% of output

 **Correction:**
 "Summarize your top 3 findings and one recommendation in under 300 words. If your output has no Recommendation section, add one. A dump is not research."
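The word-count, `### Patterns`, and synthesis-language items in the checklist above are mechanical enough to script. The following is a minimal sketch, not part of the commit: the function name `research_dump_flags`, the `files_read` parameter, and the choice to match synthesis words as substrings are all assumptions made for illustration. The tangent-topic item needs human or model judgment and is left out.

```python
import re

# Synthesis words come from the checklist; substring matching is an assumption,
# so "recommend" also matches "recommendation".
SYNTHESIS_WORDS = ("recommend", "suggest", "conclusion", "finding", "summary")


def research_dump_flags(output: str, files_read: int = 0) -> list[str]:
    """Return the checklist items tripped by an output (hypothetical helper)."""
    words = output.split()
    flags = []

    # Item 1: more than 2000 words and no "### Recommendation" heading.
    has_recommendation = re.search(r"^### Recommendation\b", output, re.MULTILINE)
    if len(words) > 2000 and not has_recommendation:
        flags.append("long output without Recommendation section")

    # Item 3: more than 15 files read and no "### Patterns" heading.
    has_patterns = re.search(r"^### Patterns\b", output, re.MULTILINE)
    if files_read > 15 and not has_patterns:
        flags.append("many files read without Patterns section")

    # Item 4: no synthesis language in the final 25% of the words.
    tail = " ".join(words[len(words) * 3 // 4:]).lower()
    if not any(word in tail for word in SYNTHESIS_WORDS):
        flags.append("no synthesis language in final 25%")

    return flags
```

Because the checklist triggers on ANY item, a caller would treat a non-empty return value as a detection.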
@@ -49,10 +50,11 @@ Maintainability Judgment      → reviews only      → Bureaucrat
 - Configuration systems for things that could be constants
 - Proposal has more infrastructure than business logic

-**Triggers:**
-- More than 2 new abstractions (interfaces, base classes, factories) for a single feature
-- "In the future we might need..." appears in rationale
-- Proposal scope exceeds original task by > 50%
+**Detection Checklist** (trigger on ANY):
+- [ ] >2 new abstractions (interfaces, base classes, factories, registries) for a single feature
+- [ ] "In the future we might need..." or "future-proof" appears in rationale
+- [ ] Proposal scope (files changed) exceeds original task scope by >50%
+- [ ] More than 1 new package/module introduced for a single feature

 **Correction:**
 "Design for the current order of magnitude. If the app has 1000 users, design for 10,000 — not 10 million. Remove abstractions that serve hypothetical requirements."
@@ -68,10 +70,11 @@ Maintainability Judgment      → reviews only      → Bureaucrat
 - Large uncommitted working tree
 - Files changed that aren't mentioned in the proposal

-**Triggers:**
-- No test files in the changeset
-- Single monolithic commit instead of incremental commits
-- Diff contains files not listed in the Creator's proposal
+**Detection Checklist** (trigger on ANY):
+- [ ] Zero test files (`.test.`, `.spec.`, `_test.`) in the changeset with >=3 files changed
+- [ ] Single monolithic commit instead of incremental commits
+- [ ] Diff contains files not listed in the Creator's proposal `### Changes` section
+- [ ] No evidence of running existing test suite before finishing

 **Correction:**
 "Read the proposal. Write a test. Commit what you have. Revert changes to files not in the proposal. Then continue."
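Two items in the checklist above reduce to comparisons over file lists. Here is a rough sketch, assuming the changeset and the proposal's `### Changes` section have already been parsed into lists of paths; the helper names `missing_tests` and `unlisted_files` are hypothetical and do not appear in the commit.

```python
# Filename markers copied from the checklist item.
TEST_MARKERS = (".test.", ".spec.", "_test.")


def missing_tests(changed_files: list[str]) -> bool:
    """True when >=3 files changed and none of them looks like a test file."""
    has_test = any(marker in path for path in changed_files for marker in TEST_MARKERS)
    return len(changed_files) >= 3 and not has_test


def unlisted_files(diff_files: list[str], proposed_files: list[str]) -> list[str]:
    """Files present in the diff but absent from the proposal's change list."""
    return sorted(set(diff_files) - set(proposed_files))
```

For example, `unlisted_files(["a.py", "b.py"], ["a.py"])` returns `["b.py"]`, which is the set of files the correction text asks the agent to revert.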
@@ -87,10 +90,11 @@ Maintainability Judgment      → reviews only      → Bureaucrat
 - Rejecting without suggesting how to fix
 - Security concerns for internal-only code at external-API severity

-**Triggers:**
-- CRITICAL:WARNING ratio > 2:1
-- Zero APPROVED verdicts in 3+ consecutive reviews
-- Less than 50% of findings include a suggested fix
+**Detection Checklist** (trigger on ANY):
+- [ ] CRITICAL:WARNING ratio >2:1 (with minimum 3 total findings)
+- [ ] Zero APPROVED verdicts in 3+ consecutive reviews
+- [ ] <50% of findings include a suggested fix in the `Fix` column
+- [ ] Findings reference attack scenarios that require already-compromised internal systems

 **Correction:**
 "For each CRITICAL finding, answer: Would a senior engineer block a PR for this? If not, downgrade. Every rejection must include a specific, implementable fix."
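The ratio and fix-rate thresholds above can be checked with a few lines of arithmetic. The sketch below assumes findings are available as simple records; the `Finding` dataclass and both function names are illustrative, not the commit's actual finding format. One design choice to note: when there are no WARNING findings, any CRITICAL finding is treated as exceeding the 2:1 ratio.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """Hypothetical record; only severity and the Fix column are modeled."""
    severity: str  # e.g. "CRITICAL" or "WARNING"
    fix: str = ""  # empty string means no suggested fix


def severity_skew(findings: list[Finding]) -> bool:
    """CRITICAL:WARNING ratio >2:1, applied only with at least 3 total findings."""
    if len(findings) < 3:
        return False
    critical = sum(1 for f in findings if f.severity == "CRITICAL")
    warning = sum(1 for f in findings if f.severity == "WARNING")
    # Cross-multiplied to avoid dividing by zero when warning == 0.
    return critical > 2 * warning


def low_fix_rate(findings: list[Finding]) -> bool:
    """True when fewer than 50% of findings carry a suggested fix."""
    if not findings:
        return False
    with_fix = sum(1 for f in findings if f.fix.strip())
    return with_fix / len(findings) < 0.5
```

The minimum of 3 findings keeps a review with a single CRITICAL finding from tripping the ratio check on its own.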
@@ -106,10 +110,11 @@ Maintainability Judgment      → reviews only      → Bureaucrat
 - "What about X?" chains that drift from the task
 - Restating the same concern in different words

-**Triggers:**
-- Challenge count > 7
-- Less than 50% of challenges include alternatives
-- Same conceptual concern raised multiple times
+**Detection Checklist** (trigger on ANY):
+- [ ] >7 findings/challenges raised in a single review
+- [ ] <50% of findings include an alternative in the `Fix` column
+- [ ] Same conceptual concern appears 2+ times with different wording
+- [ ] >3 findings reference code or scenarios outside the task scope

 **Correction:**
 "Rank your challenges by impact. Keep the top 3. Each must include a specific alternative. Delete the rest."
@@ -125,13 +130,14 @@ Maintainability Judgment      → reviews only      → Bureaucrat
 - 20 findings when 3 good ones would cover the real risks
 - Edge cases for edge cases (diminishing returns)

-**Triggers:**
-- Findings reference code untouched by the implementation
-- More than 10 findings for a small change
-- Findings describe scenarios that can't happen in the actual deployment context
+**Detection Checklist** (trigger on ANY):
+- [ ] Any finding references code untouched by the Maker's diff
+- [ ] >10 findings for a change touching <5 files
+- [ ] Findings describe scenarios requiring conditions that can't occur in the deployment context
+- [ ] >3 findings without reproduction steps

 **Correction:**
-"Quality over quantity. Delete findings outside the Maker's diff. Rank remaining by likelihood × impact. Keep top 3-5. Three real findings beat twenty noise."
+"Quality over quantity. Delete findings outside the Maker's diff. Rank remaining by likelihood x impact. Keep top 3-5. Three real findings beat twenty noise."

 ---

@@ -144,10 +150,11 @@ Maintainability Judgment      → reviews only      → Bureaucrat
 - Suggesting refactors unrelated to the current task
 - Deep-sounding analysis that doesn't end with a specific action

-**Triggers:**
-- Review word count > 2x the code change's word count
-- Suggestions reference files not in the changeset
-- Findings contain "consider" or "think about" without a specific action
+**Detection Checklist** (trigger on ANY):
+- [ ] Review word count >2x the code change's line count (rough: review words > diff lines x 2)
+- [ ] Any finding references files not in the Maker's changeset
+- [ ] >2 findings use "consider" or "think about" without a concrete action in the `Fix` column
+- [ ] Suggesting documentation for functions with <5 lines or self-descriptive names

 **Correction:**
 "Limit your review to issues that affect maintainability in the next 6 months. Every finding must end with a specific action. If you can't state the consequence of NOT fixing it, don't raise it."
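The first item in the checklist above compares review words against diff lines. A minimal sketch of that comparison follows, assuming "diff lines" means added and removed lines in a unified diff; that reading, and the function name `review_too_long`, are assumptions rather than anything the commit specifies.

```python
def review_too_long(review_text: str, diff_text: str) -> bool:
    """True when review words exceed 2x the number of changed diff lines."""
    review_words = len(review_text.split())
    # Count added/removed lines, skipping the "+++" and "---" file headers.
    # Limitation of this sketch: a changed line whose own content begins
    # with "--" or "++" is mistaken for a header and not counted.
    changed = [
        line
        for line in diff_text.splitlines()
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]
    return review_words > 2 * len(changed)
```

The threshold is deliberately rough, as the checklist itself says, so a caller might treat a positive result as a prompt to trim the review rather than a hard failure.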