feat: principles #26-#32 — PDCA, prod testing, changelog, emergency stop, guardian, debug logs, multi-auth

2026-03-31 21:31:44 +00:00
parent 0f5456bb53
commit 68e2c6e7a8
1 changed files with 87 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -269,6 +269,93 @@ When an agent needs a tool that isn't installed, it should install it automatica

 ---

+## Quality & Process
+
+### 26. PDCA Every Sprint
+
+Plan-Do-Check-Act after every sprint, not just at the end. Check catches bugs before they compound.
+
+- Plan: define features + acceptance criteria
+- Do: implement with team, commit after each feature
+- Check: test in production, read debug logs, try bad inputs, verify on mobile
+- Act: fix everything found before starting next sprint
+- Never skip Check. A shipped bug costs 10x more than a caught bug.
+
+**Origin:** Sprint 1-3 each had a PDCA cycle that caught rate limiting issues, SSE race conditions, and Caddy routing gaps.
+
+### 27. Test in Production, Not in Mocks
+
+For single-user tools: test against the real deployment. Mocks hide integration bugs.
+
+- `curl` against the live API after each deploy
+- Try the PWA on your actual phone
+- Submit real jobs through the real worker
+- Read the real debug logs
+
+**Origin:** "Committe regelmäßig und test in production — keine mocks!"
+
+### 28. Changelog as First-Class Artifact
+
+Every project gets a CHANGELOG.md. Updated with every sprint. The user should never have to ask "what changed?"
+
+- Reverse-chronological, grouped by version/sprint
+- Include Added/Changed/Security/Fixed sections
+- Link to relevant commits if helpful
+- Update it DURING the sprint, not after
+
+**Origin:** "Ich brauch gute changelogs um bei allem laufenden zu bleiben."
+
+### 29. Emergency Stop (Not-Aus)
+
+Every autonomous system needs a kill switch. One button, kills everything, no confirmation cascade.
+
+- Cancel all running jobs immediately
+- Pause the system (workers stop polling)
+- Log the event as critical
+- Resume button to unpause
+- Visible at all times, not buried in a menu
+
+**Origin:** "Und wir brauchen einen Not-Aus-Knopf ;)"
+
+### 30. Self-Monitoring (Guardian Pattern)
+
+The system monitors itself. A background watchdog checks health every N minutes and logs findings.
+
+- Check: stuck jobs, dead workers, error spikes, DB connectivity
+- Log structured findings to a queryable debug_log
+- Agent can read the logs to self-diagnose
+- Future: alert the user via push/webhook when degraded
+- Clean up old logs automatically
+
+**Origin:** "We should have a guardian who checks every other minute what's going on."
+
+### 31. Debug Logs as Agent Interface
+
+Structured debug logs aren't just for humans — they're an API for the agent to understand system health.
+
+- Queryable by level, component, time range
+- Secret-safe (auto-redact tokens, keys, passwords)
+- Agent reads them between sprints to catch issues
+- Self-healing: agent detects error patterns and applies fixes
+
+**Origin:** Built during dispatch development — agent reads `/debug/logs` to diagnose production issues.
+
+### 32. Multi-Layer Auth for Admin Endpoints
+
+Regular API operations and admin/debug operations need different auth levels.
+
+- Regular token: job CRUD, worker operations
+- Admin token: debug logs, stats, worker management, emergency stop
+- Rate limiting: stricter on admin endpoints
+- Never share the same token for both levels
+
+**Origin:** "Ich hoffe wir haben da ne mehrstufige Authentifizierung dahinter..."
+
+---
+
 ## (inbox — unsorted ideas)

+- **Least-privilege agent access**: Agents should SSH as a dedicated non-root user (e.g. `deploy@`) with scoped sudo for only what they need (systemctl, caddy reload). No root SSH long-term.
+- **Immutable deploy artifacts**: Agent builds a tarball/image, uploads it, runs a deploy script. Never edits files in-place on production.
+
 _Drop new principles here. They get organized on next pass._