diff --git a/README.md b/README.md
index 74ccfa9..6813e66 100644
--- a/README.md
+++ b/README.md
@@ -269,6 +269,93 @@ When an agent needs a tool that isn't installed, it should install it automatica
 
 ---
 
+## Quality & Process
+
+### 26. PDCA Every Sprint
+
+Run Plan-Do-Check-Act after every sprint, not just at the end of the project. The Check step catches bugs before they compound.
+
+- Plan: define features and acceptance criteria
+- Do: implement with the team, commit after each feature
+- Check: test in production, read debug logs, try bad inputs, verify on mobile
+- Act: fix everything found before starting the next sprint
+- Never skip Check. A shipped bug costs 10x more than a caught bug.
+
+**Origin:** Sprints 1-3 each had a PDCA cycle that caught rate limiting issues, SSE race conditions, and Caddy routing gaps.
+
+### 27. Test in Production, Not in Mocks
+
+For single-user tools: test against the real deployment. Mocks hide integration bugs.
+
+- `curl` against the live API after each deploy
+- Try the PWA on your actual phone
+- Submit real jobs through the real worker
+- Read the real debug logs
+
+**Origin:** "Commit regularly and test in production, no mocks!"
+
+### 28. Changelog as First-Class Artifact
+
+Every project gets a CHANGELOG.md, updated with every sprint. The user should never have to ask "what changed?"
+
+- Reverse-chronological, grouped by version/sprint
+- Include Added/Changed/Security/Fixed sections
+- Link to relevant commits if helpful
+- Update it DURING the sprint, not after
+
+**Origin:** "I need good changelogs to stay up to date on everything that's going on."
+
+### 29. Emergency Stop (Not-Aus)
+
+Every autonomous system needs a kill switch. One button, kills everything, no confirmation cascade.
+
+- Cancel all running jobs immediately
+- Pause the system (workers stop polling)
+- Log the event as critical
+- Resume button to unpause
+- Visible at all times, not buried in a menu
+
+**Origin:** "And we need an emergency stop button ;)"
+
+### 30. Self-Monitoring (Guardian Pattern)
+
+The system monitors itself. A background watchdog checks health every N minutes and logs findings.
+
+- Check: stuck jobs, dead workers, error spikes, DB connectivity
+- Log structured findings to a queryable debug_log
+- Agent can read the logs to self-diagnose
+- Future: alert the user via push/webhook when degraded
+- Clean up old logs automatically
+
+**Origin:** "We should have a guardian who checks every other minute what's going on."
+
+### 31. Debug Logs as Agent Interface
+
+Structured debug logs aren't just for humans; they're an API for the agent to understand system health.
+
+- Queryable by level, component, time range
+- Secret-safe (auto-redact tokens, keys, passwords)
+- Agent reads them between sprints to catch issues
+- Self-healing: agent detects error patterns and applies fixes
+
+**Origin:** Built during dispatch development: the agent reads `/debug/logs` to diagnose production issues.
+
+### 32. Multi-Layer Auth for Admin Endpoints
+
+Regular API operations and admin/debug operations need different auth levels.
+
+- Regular token: job CRUD, worker operations
+- Admin token: debug logs, stats, worker management, emergency stop
+- Rate limiting: stricter on admin endpoints
+- Never share the same token for both levels
+
+**Origin:** "I hope we have multi-level authentication behind that..."
+
+---
+
 ## (inbox — unsorted ideas)
 
+- **Least-privilege agent access**: Agents should SSH as a dedicated non-root user (e.g. `deploy@`) with scoped sudo for only what they need (systemctl, caddy reload). No root SSH long-term.
+- **Immutable deploy artifacts**: Agent builds a tarball/image, uploads it, runs a deploy script. Never edits files in-place on production.
+
 _Drop new principles here. They get organized on next pass._
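
Review note on principle 31: the "secret-safe (auto-redact tokens, keys, passwords)" bullet could be illustrated with a small sketch like the one below. This is only one possible approach, not the project's actual implementation. The function name `redact`, the pattern list `_SECRET_PATTERNS`, and the `[REDACTED]` placeholder are all hypothetical, and the patterns are deliberately simple and not exhaustive.

```python
import re

# Hypothetical patterns (assumptions, not taken from the project):
# a real deployment would tune these to its own secret formats.
_SECRET_PATTERNS = [
    # HTTP bearer credentials, e.g. "Bearer abc.def"
    re.compile(r"(?i)\b(bearer)(\s+)(\S+)"),
    # key=value or key: value pairs whose key looks sensitive
    re.compile(r"(?i)\b(token|api[_-]?key|password|secret)\b(\s*[=:]\s*)(\S+)"),
]


def redact(message: str) -> str:
    """Return the log message with likely secret values masked.

    The key name and separator are kept so the log line stays readable;
    only the value is replaced with a fixed placeholder.
    """
    for pattern in _SECRET_PATTERNS:
        message = pattern.sub(
            lambda m: f"{m.group(1)}{m.group(2)}[REDACTED]", message
        )
    return message


if __name__ == "__main__":
    print(redact("login ok password=hunter2 user=bob"))
    print(redact("Authorization: Bearer abc.def"))
```

Running redaction at write time, before a line reaches the log store, means the agent-facing log endpoint never has to be trusted to filter secrets on read.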