docs: add operational runbook, Grafana dashboard, and production docker-compose
Add comprehensive operational documentation:

- docs/operations/backup-restore.md: SQLCipher, file backend, blob backup/restore
- docs/operations/key-rotation.md: auth token, TLS, federation, DB key, OPAQUE rotation
- docs/operations/incident-response.md: playbook for common incidents
- docs/operations/scaling-guide.md: resource sizing, scaling triggers, capacity planning
- docs/operations/monitoring.md: Prometheus metrics, alert rules, log monitoring
- docs/operations/dashboards/qpq-overview.json: Grafana dashboard template
- docs/operations/prometheus.yml + alerts: Prometheus scrape and alert config
- docs/operations/grafana-provisioning/: auto-provisioning for datasources and dashboards
- docker-compose.prod.yml: production stack (server + Prometheus + Grafana)
- .env.example: documented environment variable template
338 docs/operations/incident-response.md (new file)
@@ -0,0 +1,338 @@
# Incident Response Playbook

This document provides procedures for responding to common operational incidents in a quicproquo deployment.

## Severity Levels

| Level | Description | Response Time | Examples |
|-------|-------------|---------------|----------|
| P1 - Critical | Service down, data loss, key compromise | Immediate | Server crash loop, DB corruption, leaked secrets |
| P2 - Major | Degraded service, partial outage | 15 minutes | High latency, storage full, cert expiry |
| P3 - Minor | Non-critical issue, monitoring alert | 1 hour | Rate limit spikes, non-critical warnings |
## Incident: Server Not Starting

### Symptoms

- Server process exits immediately
- Logs show "TLS cert or key missing" or "production forbids" errors

### Diagnosis

```bash
# Check server logs
journalctl -u qpq-server --since "10 min ago" --no-pager

# Docker
docker compose logs --tail=50 server
```

### Common Causes and Fixes

**Missing TLS certificates (production mode)**
```bash
# Production requires pre-existing certs (no auto-generation)
ls -la data/server-cert.der data/server-key.der

# If missing, restore from backup or generate new ones
# See: key-rotation.md
```

**Missing auth token (production mode)**
```bash
# Production requires QPQ_AUTH_TOKEN >= 16 chars, not "devtoken"
# (printf avoids counting the trailing newline that echo would add)
printf '%s' "$QPQ_AUTH_TOKEN" | wc -c
```

**Database locked or corrupt**
```bash
# Check if another process holds the database
fuser data/qpq.db

# Verify database integrity (use the sqlcipher CLI; stock sqlite3
# cannot decrypt a SQLCipher database)
sqlcipher data/qpq.db "PRAGMA key='${QPQ_DB_KEY}'; PRAGMA integrity_check;"
```

**Port already in use**
```bash
# Check if something is already bound to UDP port 7000 (QUIC runs over UDP)
ss -ulnp | grep 7000
```
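The checks above can be bundled into a single pre-start gate so a misconfigured node fails fast instead of entering a crash loop. A minimal sketch (the `qpq_preflight` function name is hypothetical; the env vars and file names are the ones this runbook uses, but verify the defaults against your deployment):

```shell
#!/bin/sh
# Hypothetical preflight gate: applies the same constraints described
# in this section before the server process is ever started.
qpq_preflight() {
    data_dir="${QPQ_DATA_DIR:-data}"

    # Auth token: present, >= 16 chars, not the dev default
    tok_len=$(printf '%s' "${QPQ_AUTH_TOKEN:-}" | wc -c)
    if [ "$tok_len" -lt 16 ] || [ "${QPQ_AUTH_TOKEN:-}" = "devtoken" ]; then
        echo "FAIL: QPQ_AUTH_TOKEN unset, too short, or still 'devtoken'" >&2
        return 1
    fi

    # Production mode requires pre-existing TLS material
    for f in "$data_dir/server-cert.der" "$data_dir/server-key.der"; do
        if [ ! -f "$f" ]; then
            echo "FAIL: missing $f" >&2
            return 1
        fi
    done
    echo "preflight OK"
}
```

Wiring this into the unit's `ExecStartPre=` (or a container entrypoint) turns the "production forbids" startup errors into one clear message.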
## Incident: Node Down / Unresponsive

### Symptoms

- Clients cannot connect
- Health check failures
- No new log entries

### Diagnosis

```bash
# 1. Check if the process is running
systemctl status qpq-server
# or: docker compose ps

# 2. Check resource usage
top -bn1 | grep qpq
df -h /var/lib/quicproquo
free -h

# 3. Check QUIC port is reachable
# From another host (note: a UDP probe only confirms the port is not
# ICMP-rejected; it gives no positive proof the server is answering):
nc -uzv <server-ip> 7000

# 4. Check for OOM kills
dmesg | grep -i "out of memory\|oom" | tail -5
journalctl -k | grep -i oom
```

### Recovery

```bash
# Restart the service
systemctl restart qpq-server

# If OOM: increase memory limit
systemctl edit qpq-server --force
# MemoryMax=2G

# If disk full: see "Storage Full" incident below
```
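The OOM check in step 4 can be turned into a small filter that works on any kernel log stream, which helps when paging through saved logs rather than a live box. A sketch (the `count_oom_kills` name is made up for this example; the grep pattern is the one used in the diagnosis step):

```shell
#!/bin/sh
# Count lines mentioning the OOM killer in kernel log text on stdin,
# e.g.: journalctl -k | count_oom_kills
count_oom_kills() {
    grep -ci 'out of memory\|oom' || true   # || true: grep exits 1 on zero matches
}
```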
## Incident: Storage Full

### Symptoms

- `enqueue` operations fail
- Logs show "No space left on device"
- `delivery_queue_depth` gauge rising

### Diagnosis

```bash
# Check disk usage
df -h /var/lib/quicproquo
du -sh /var/lib/quicproquo/*

# Check largest files
du -a /var/lib/quicproquo | sort -rn | head -20

# Check blob storage specifically
du -sh /var/lib/quicproquo/blobs/
find /var/lib/quicproquo/blobs/ -type f | wc -l
```

### Recovery

```bash
# 1. Identify and remove expired messages (the cleanup task handles this,
#    but if it's behind, you can trigger manual cleanup)

# For SQL backend: delete expired delivery messages
# (unquoted EOF so the shell expands ${QPQ_DB_KEY}; the sqlcipher CLI is
# needed to open a SQLCipher database)
sqlcipher data/qpq.db <<EOF
PRAGMA key = '${QPQ_DB_KEY}';
DELETE FROM delivery_queue WHERE expires_at IS NOT NULL AND expires_at < unixepoch();
VACUUM;
EOF

# 2. Remove orphaned blobs (blobs not referenced by any message)
#    This is application-specific; coordinate with the codebase

# 3. If the data partition is full, expand the volume
#    AWS EBS: aws ec2 modify-volume --volume-id vol-xxx --size 100
#    Then: resize2fs /dev/xvdf

# 4. Move to a larger disk
systemctl stop qpq-server
rsync -av /var/lib/quicproquo/ /mnt/new-volume/quicproquo/
# Update QPQ_DATA_DIR and QPQ_DB_PATH to point to the new location
systemctl start qpq-server
```

### Prevention

- Set up disk usage alerts at 70% and 90% thresholds.
- Configure message TTL (`ttl_secs`) to auto-expire old messages.
- Schedule regular `VACUUM` on the SQLCipher database.
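The 70%/90% thresholds are easy to enforce with a small cron-able check. A sketch (function names are hypothetical; point it at your data mount):

```shell
#!/bin/sh
# Exit 0 below 70% usage, 1 at 70-89% (warn), 2 at 90%+ (critical),
# mirroring the alert thresholds suggested under "Prevention".
disk_usage_pct() {
    df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

check_disk() {
    pct=$(disk_usage_pct "$1")
    if [ "$pct" -ge 90 ]; then
        echo "CRIT: $1 at ${pct}% used"
        return 2
    elif [ "$pct" -ge 70 ]; then
        echo "WARN: $1 at ${pct}% used"
        return 1
    fi
    echo "OK: $1 at ${pct}% used"
}
```

Usage from cron, e.g.: `check_disk /var/lib/quicproquo || <your alerting hook>`.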
## Incident: DDoS / Connection Flood

### Symptoms

- `rate_limit_hit_total` counter spiking
- `auth_login_failure_total` counter spiking
- High CPU usage
- Legitimate clients cannot connect

### Diagnosis

```bash
# Check connection rate limit hits in metrics
curl -s http://localhost:9090/metrics | grep rate_limit

# Check auth failure rate
curl -s http://localhost:9090/metrics | grep auth_login_failure

# Check active connections (QUIC uses UDP)
ss -unp | grep 7000 | wc -l
```

### Mitigation

```bash
# 1. The server has built-in per-IP connection rate limiting.
#    Check the logs for "connection rate limit exceeded" messages.

# 2. Block offending IPs at the firewall level
iptables -A INPUT -s <attacker-ip> -j DROP

# 3. For volumetric attacks, use upstream DDoS protection
#    (Cloudflare Spectrum, AWS Shield, etc.)

# 4. If the server is overwhelmed, restart to clear state
systemctl restart qpq-server

# 5. Enable log redaction to reduce I/O pressure during attacks
#    Set QPQ_REDACT_LOGS=true
```
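Before blocking IPs in step 2, it helps to see which peers are actually flooding the UDP port. A sketch that summarizes peer addresses from pre-filtered `ss` output (the `top_udp_talkers` name is made up; the awk column assumes iproute2's State / Recv-Q / Send-Q / Local / Peer layout, so sanity-check it against your `ss` version):

```shell
#!/bin/sh
# Top 10 source IPs by socket count. Feed it header-free ss lines, e.g.:
#   ss -unp | grep 7000 | top_udp_talkers
top_udp_talkers() {
    awk '{ print $5 }' |        # peer address:port column
        sed 's/:[^:]*$//' |     # strip the port, keep the IP
        sort | uniq -c | sort -rn | head -10
}
```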
## Incident: Key Compromise

### Auth Token Compromised

**Severity: P1**

```bash
# 1. Immediately rotate the auth token
NEW_TOKEN=$(openssl rand -base64 32)

# 2. Update server config and restart
#    See: key-rotation.md "Auth Token Rotation"

# 3. Notify all legitimate clients of the new token

# 4. Review logs for unauthorized access
journalctl -u qpq-server | grep "auth_login_success" | tail -100
```

### TLS Private Key Compromised

**Severity: P1**

```bash
# 1. Generate and install a new certificate immediately
#    See: key-rotation.md "TLS Certificate Rotation"

# 2. Revoke the compromised certificate with your CA
#    (procedure depends on your CA)

# 3. Restart the server with the new certificate
systemctl restart qpq-server

# 4. If clients pin certificates, notify them of the change
```

### Database Key Compromised

**Severity: P1**

```bash
# 1. Stop the server
systemctl stop qpq-server

# 2. Rekey the database immediately
#    See: key-rotation.md "Database Encryption Key Rotation"

# 3. Assess data exposure
#    If the attacker had access to the database file, assume all
#    stored data (users, key packages, delivery queues) is compromised.

# 4. Consider notifying affected users
```

### OPAQUE ServerSetup Compromised

**Severity: P1**

```bash
# 1. Rotate the OPAQUE ServerSetup
#    See: key-rotation.md "OPAQUE ServerSetup Rotation"
#    WARNING: This invalidates ALL OPAQUE credentials.

# 2. All users must re-register with new credentials
# 3. Review logs for unauthorized OPAQUE authentications
```
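For the auth-token case, generating and validating the replacement in one step avoids deploying a token the server will reject. A sketch (the `new_auth_token` name is hypothetical; the >= 16 chars / not-"devtoken" constraints are the ones stated earlier in this playbook):

```shell
#!/bin/sh
# Generate a replacement auth token and check it against the production
# constraints before it goes anywhere near the server config.
new_auth_token() {
    tok=$(openssl rand -base64 32) || return 1
    len=$(printf '%s' "$tok" | wc -c)
    if [ "$len" -lt 16 ] || [ "$tok" = "devtoken" ]; then
        echo "generated token failed validation" >&2
        return 1
    fi
    printf '%s\n' "$tok"
}
```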
## Incident: High Latency

### Symptoms

- Clients report slow message delivery
- `delivery_queue_depth` gauge is high
- Fetch operations are slow

### Diagnosis

```bash
# 1. Check system resources
top -bn1 | head -20
iostat -x 1 3

# 2. Check database performance
# (unquoted EOF so ${QPQ_DB_KEY} expands; sqlcipher CLI for SQLCipher DBs)
sqlcipher data/qpq.db <<EOF
PRAGMA key = '${QPQ_DB_KEY}';
PRAGMA integrity_check;
PRAGMA wal_checkpoint(PASSIVE);
-- Check table sizes
SELECT 'delivery_queue', count(*) FROM delivery_queue
UNION ALL SELECT 'key_packages', count(*) FROM key_packages
UNION ALL SELECT 'users', count(*) FROM users;
EOF

# 3. Check queue depth via metrics
curl -s http://localhost:9090/metrics | grep delivery_queue_depth
```

### Recovery

```bash
# 1. Checkpoint the WAL (reduces WAL file size)
sqlcipher data/qpq.db "PRAGMA key='${QPQ_DB_KEY}'; PRAGMA wal_checkpoint(TRUNCATE);"

# 2. VACUUM to reclaim space and defragment
sqlcipher data/qpq.db "PRAGMA key='${QPQ_DB_KEY}'; VACUUM;"

# 3. If the queue is huge, check for clients not fetching
#    (delivery_queue rows accumulate when clients are offline)

# 4. If I/O-bound: move database to faster storage (SSD/NVMe)
```
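The queue-depth check in step 3 can be scripted against the Prometheus exposition format so an alert fires before clients notice. A sketch (`queue_depth` is a made-up helper; the `delivery_queue_depth` metric and the :9090 endpoint are the ones used in the diagnosis steps):

```shell
#!/bin/sh
# Extract the first delivery_queue_depth sample from exposition-format
# text on stdin, e.g.:
#   curl -s http://localhost:9090/metrics | queue_depth
queue_depth() {
    awk '/^delivery_queue_depth[ {]/ { print $NF; exit }'
}
```

Compare the result against a threshold in cron, e.g. `[ "$(curl -s http://localhost:9090/metrics | queue_depth)" -gt 10000 ]` to gate an alert.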
## Incident: Certificate Expiring

### Symptoms

- Log warning: "TLS certificate expires within 30 days"
- Monitoring alert on certificate expiry

### Response

```bash
# 1. Check current certificate expiry
openssl x509 -inform DER -in data/server-cert.der -noout -enddate

# 2. Renew the certificate
#    See: key-rotation.md "TLS Certificate Rotation"

# 3. Verify the new certificate is loaded
journalctl -u qpq-server --since "1 min ago" | grep -i cert
```
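The expiry date from step 1 is more useful as "days left" when wiring up the 30-day warning. A sketch relying on GNU `date` (the `cert_days_left` name is hypothetical; the DER format matches the check above):

```shell
#!/bin/sh
# Print integer days until the DER certificate at $1 expires
# (negative once it has already expired).
cert_days_left() {
    end=$(openssl x509 -inform DER -in "$1" -noout -enddate | cut -d= -f2) || return 1
    end_s=$(date -d "$end" +%s) || return 1
    now_s=$(date +%s)
    echo $(( (end_s - now_s) / 86400 ))
}
```

Cron usage, e.g.: `[ "$(cert_days_left data/server-cert.der)" -lt 30 ]` to gate an alert.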
## Post-Incident Checklist

After resolving any incident:

1. **Document** the incident: timeline, root cause, resolution steps
2. **Verify** the service is fully operational (check metrics, test client connections)
3. **Review** whether monitoring would have caught this earlier
4. **Update** alerts and thresholds based on findings
5. **Communicate** with affected users if there was data exposure or service disruption
6. **Schedule** follow-up actions (e.g., add monitoring, improve automation)