feat: principles #26-#32 — PDCA, prod testing, changelog, emergency stop, guardian, debug logs, multi-auth

2026-03-31 21:31:44 +00:00
parent 0f5456bb53
commit 68e2c6e7a8


@@ -269,6 +269,93 @@ When an agent needs a tool that isn't installed, it should install it automatica
---
## Quality & Process
### 26. PDCA Every Sprint
Run a full Plan-Do-Check-Act cycle after every sprint, not just at the end of the project. The Check step catches bugs before they compound.
- Plan: define features + acceptance criteria
- Do: implement with team, commit after each feature
- Check: test in production, read debug logs, try bad inputs, verify on mobile
- Act: fix everything found before starting next sprint
- Never skip Check. A shipped bug costs far more to fix than one caught during Check.
**Origin:** Sprints 1-3 each had a PDCA cycle that caught rate limiting issues, SSE race conditions, and Caddy routing gaps.
### 27. Test in Production, Not in Mocks
For single-user tools: test against the real deployment. Mocks hide integration bugs.
- `curl` against the live API after each deploy
- Try the PWA on your actual phone
- Submit real jobs through the real worker
- Read the real debug logs
**Origin:** "Commit regularly and test in production, no mocks!"
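The `curl` step above could also be scripted so it runs after every deploy. The following is a minimal sketch, not the project's actual tooling: the function names, the bearer-token header, and any paths passed in are assumptions for illustration.

```python
import urllib.error
import urllib.request


def smoke_check(base_url, paths, token=None, timeout=5.0):
    """Request each path on the live deployment and return {path: status}.

    A status of None means no HTTP response arrived at all
    (DNS failure, connection refused, timeout).
    """
    results = {}
    for path in paths:
        request = urllib.request.Request(base_url.rstrip("/") + path)
        if token:
            # Assumed auth scheme; adjust to whatever the API expects.
            request.add_header("Authorization", f"Bearer {token}")
        try:
            with urllib.request.urlopen(request, timeout=timeout) as response:
                results[path] = response.status
        except urllib.error.HTTPError as err:
            results[path] = err.code  # server answered, but with 4xx/5xx
        except (urllib.error.URLError, OSError):
            results[path] = None  # server unreachable
    return results


def failures(results):
    """Return the paths that did not answer with a 2xx status."""
    return [p for p, s in results.items() if s is None or not 200 <= s < 300]
```

A deploy script could call `smoke_check` with the real base URL and abort when `failures(...)` is non-empty. This only covers the API; the phone and worker checks in the list remain manual.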
### 28. Changelog as First-Class Artifact
Every project gets a CHANGELOG.md. Updated with every sprint. The user should never have to ask "what changed?"
- Reverse-chronological, grouped by version/sprint
- Include Added/Changed/Security/Fixed sections
- Link to relevant commits if helpful
- Update it DURING the sprint, not after
**Origin:** "I need good changelogs to stay up to date on everything."
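To keep the format consistent, changelog entries could be generated rather than typed by hand. The sketch below is one possible helper under assumptions: the `## version (date)` heading layout and both function names are illustrative, not a format this document prescribes.

```python
SECTIONS = ("Added", "Changed", "Security", "Fixed")


def render_entry(version, date, changes):
    """Render one changelog entry as Markdown; empty sections are skipped."""
    lines = [f"## {version} ({date})", ""]
    for section in SECTIONS:
        items = changes.get(section, [])
        if items:
            lines.append(f"### {section}")
            lines.extend(f"- {item}" for item in items)
            lines.append("")
    return "\n".join(lines)


def prepend_entry(changelog_text, entry, title="# Changelog"):
    """Insert the entry directly below the title so the newest comes first."""
    if changelog_text.startswith(title):
        body = changelog_text[len(title):]
    else:
        body = changelog_text
    return f"{title}\n\n{entry}\n{body.lstrip(chr(10))}"
```

Calling `prepend_entry` at the end of each feature, rather than at the end of the sprint, keeps the file reverse-chronological and current while work is still in progress.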
### 29. Emergency Stop (Not-Aus)
Every autonomous system needs a kill switch. One button, kills everything, no confirmation cascade.
- Cancel all running jobs immediately
- Pause the system (workers stop polling)
- Log the event as critical
- Resume button to unpause
- Visible at all times, not buried in a menu
**Origin:** "And we need an emergency stop button ;)"
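The behaviour listed above might look roughly like this in code. It is a sketch under assumptions: the `EmergencyStop` class, its method names, and the idea of registering one cancel callable per job are hypothetical, not the actual implementation.

```python
import logging
import threading

logger = logging.getLogger("emergency_stop")


class EmergencyStop:
    """Process-wide kill switch: one call cancels jobs and pauses polling."""

    def __init__(self):
        self._paused = threading.Event()
        self._lock = threading.Lock()
        self._cancel_hooks = {}  # job_id -> callable that cancels the job

    def register(self, job_id, cancel):
        """Track a running job together with the callable that cancels it."""
        with self._lock:
            self._cancel_hooks[job_id] = cancel

    def unregister(self, job_id):
        """Forget a job that finished on its own."""
        with self._lock:
            self._cancel_hooks.pop(job_id, None)

    def trigger(self, reason="manual"):
        """Pause first so no new work starts, then cancel everything running."""
        self._paused.set()
        with self._lock:
            hooks = dict(self._cancel_hooks)
            self._cancel_hooks.clear()
        for job_id, cancel in hooks.items():
            try:
                cancel()
            except Exception:
                # One failing hook must not block the remaining cancellations.
                logger.exception("cancel failed for job %s", job_id)
        logger.critical("EMERGENCY STOP (%s): cancelled %d job(s)", reason, len(hooks))
        return list(hooks)

    def resume(self):
        """Let workers poll again."""
        self._paused.clear()
        logger.warning("system resumed after emergency stop")

    @property
    def paused(self):
        return self._paused.is_set()
```

Workers would check `paused` before each poll and skip the poll while it is set. There is deliberately no confirmation argument on `trigger`, which matches the "no confirmation cascade" requirement.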
### 30. Self-Monitoring (Guardian Pattern)
The system monitors itself. A background watchdog checks health every N minutes and logs findings.
- Check: stuck jobs, dead workers, error spikes, DB connectivity
- Log structured findings to a queryable debug_log
- Agent can read the logs to self-diagnose
- Future: alert the user via push/webhook when degraded
- Clean up old logs automatically
**Origin:** "We should have a guardian who checks every other minute what's going on."
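A single guardian pass could be sketched as below, covering only the stuck-job and dead-worker checks. The record shapes (`id`, `status`, `started_at`, `last_seen`), the thresholds, and all function names are assumptions chosen for illustration.

```python
import time
from dataclasses import dataclass


@dataclass
class Finding:
    level: str      # "warning" or "error"
    component: str  # which check produced the finding
    message: str


def check_stuck_jobs(jobs, now, max_runtime_s=1800):
    """Flag jobs that have been 'running' for longer than max_runtime_s."""
    return [
        Finding("warning", "jobs",
                f"job {job['id']} running for {int(now - job['started_at'])}s")
        for job in jobs
        if job["status"] == "running" and now - job["started_at"] > max_runtime_s
    ]


def check_dead_workers(workers, now, heartbeat_timeout_s=120):
    """Flag workers whose last heartbeat is older than heartbeat_timeout_s."""
    return [
        Finding("error", "workers",
                f"worker {w['id']} silent for {int(now - w['last_seen'])}s")
        for w in workers
        if now - w["last_seen"] > heartbeat_timeout_s
    ]


def run_guardian_once(jobs, workers, now=None):
    """Run every check once and return the combined findings."""
    now = time.time() if now is None else now
    return check_stuck_jobs(jobs, now) + check_dead_workers(workers, now)
```

A scheduler or background thread would call `run_guardian_once` every N minutes and write each `Finding` to the debug log. Error-spike detection, DB connectivity checks, and log cleanup are left out of this sketch.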
### 31. Debug Logs as Agent Interface
Structured debug logs aren't just for humans — they're an API for the agent to understand system health.
- Queryable by level, component, time range
- Secret-safe (auto-redact tokens, keys, passwords)
- Agent reads them between sprints to catch issues
- Self-healing: agent detects error patterns and applies fixes
**Origin:** Built during dispatch development — agent reads `/debug/logs` to diagnose production issues.
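The redaction and query properties above can be illustrated with a small sketch. The regex patterns, the entry fields (`ts`, `level`, `component`), and the function names are assumptions; a real deployment would need patterns tuned to the secrets it actually handles.

```python
import re

LEVELS = {"debug": 10, "info": 20, "warning": 30, "error": 40, "critical": 50}

# Assumed patterns: "Bearer <value>" and "<key>=<value>" / "<key>: <value>".
_BEARER = re.compile(r"(?i)\b(bearer)\s+[A-Za-z0-9._~+/=-]+")
_KEY_VALUE = re.compile(r"(?i)\b(token|api[_-]?key|password|secret)(\s*[=:]\s*)\S+")


def redact(message):
    """Replace anything that looks like a credential with [REDACTED]."""
    message = _BEARER.sub(r"\1 [REDACTED]", message)
    return _KEY_VALUE.sub(r"\1\2[REDACTED]", message)


def query(entries, min_level="debug", component=None, since=None, until=None):
    """Filter log entries by minimum level, component, and time range."""
    threshold = LEVELS[min_level]
    return [
        e for e in entries
        if LEVELS[e["level"]] >= threshold
        and (component is None or e["component"] == component)
        and (since is None or e["ts"] >= since)
        and (until is None or e["ts"] <= until)
    ]
```

Redacting at write time, before an entry is stored, means an agent reading the logs never sees a secret in the first place. Pattern-based redaction will miss unlabelled secrets, so it should be treated as a safety net rather than a guarantee.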
### 32. Multi-Layer Auth for Admin Endpoints
Regular API operations and admin/debug operations need different auth levels.
- Regular token: job CRUD, worker operations
- Admin token: debug logs, stats, worker management, emergency stop
- Rate limiting: stricter on admin endpoints
- Never share the same token for both levels
**Origin:** "I hope we have multi-layer authentication behind that..."
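The two-level split could be enforced with a check along these lines. This is a sketch under assumptions: apart from `/debug/logs`, the route prefixes are hypothetical, as are the function names and the choice to reject the admin token on regular routes.

```python
import hmac

REGULAR, ADMIN = "regular", "admin"

# Hypothetical route prefixes; anything not listed here counts as regular.
ADMIN_PREFIXES = ("/debug/", "/stats", "/workers/manage", "/emergency-stop")


def required_scope(path):
    """Return the scope a request path needs."""
    return ADMIN if path.startswith(ADMIN_PREFIXES) else REGULAR


def authorize(path, presented_token, regular_token, admin_token):
    """Return True only if the presented token matches the scope's own token.

    In this sketch the admin token is not accepted on regular routes,
    so each token stays confined to one level.
    """
    if regular_token == admin_token:
        raise ValueError("regular and admin tokens must differ")
    expected = admin_token if required_scope(path) == ADMIN else regular_token
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(presented_token.encode(), expected.encode())
```

Refusing to start when both tokens are equal turns "never share the same token" from a convention into a hard failure. The stricter rate limit for admin endpoints is not shown here and would sit in front of this check.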
---
## (inbox — unsorted ideas)
- **Least-privilege agent access**: Agents should SSH as a dedicated non-root user (e.g. `deploy@`) with scoped sudo for only what they need (systemctl, caddy reload). No root SSH long-term.
- **Immutable deploy artifacts**: Agent builds a tarball/image, uploads it, runs a deploy script. Never edits files in-place on production.
_Drop new principles here. They get organized on next pass._