docs: add operational runbook, Grafana dashboard, and production docker-compose
Add comprehensive operational documentation:

- docs/operations/backup-restore.md: SQLCipher, file backend, blob backup/restore
- docs/operations/key-rotation.md: auth token, TLS, federation, DB key, OPAQUE rotation
- docs/operations/incident-response.md: playbook for common incidents
- docs/operations/scaling-guide.md: resource sizing, scaling triggers, capacity planning
- docs/operations/monitoring.md: Prometheus metrics, alert rules, log monitoring
- docs/operations/dashboards/qpq-overview.json: Grafana dashboard template
- docs/operations/prometheus.yml + alerts: Prometheus scrape and alert config
- docs/operations/grafana-provisioning/: auto-provisioning for datasources and dashboards
- docker-compose.prod.yml: production stack (server + Prometheus + Grafana)
- .env.example: documented environment variable template
338 docs/operations/incident-response.md (new file)
@@ -0,0 +1,338 @@
# Incident Response Playbook

This document provides procedures for responding to common operational incidents in a quicproquo deployment.

## Severity Levels

| Level | Description | Response Time | Examples |
|-------|-------------|---------------|----------|
| P1 - Critical | Service down, data loss, key compromise | Immediate | Server crash loop, DB corruption, leaked secrets |
| P2 - Major | Degraded service, partial outage | 15 minutes | High latency, storage full, cert expiry |
| P3 - Minor | Non-critical issue, monitoring alert | 1 hour | Rate limit spikes, non-critical warnings |
## Incident: Server Not Starting

### Symptoms

- Server process exits immediately
- Logs show "TLS cert or key missing" or "production forbids" errors

### Diagnosis

```bash
# Check server logs
journalctl -u qpq-server --since "10 min ago" --no-pager

# Docker
docker compose logs --tail=50 server
```

### Common Causes and Fixes

**Missing TLS certificates (production mode)**
```bash
# Production requires pre-existing certs (no auto-generation)
ls -la data/server-cert.der data/server-key.der

# If missing, restore from backup or generate new ones
# See: key-rotation.md
```

**Missing auth token (production mode)**
```bash
# Production requires QPQ_AUTH_TOKEN >= 16 chars, not "devtoken"
# (printf avoids counting the trailing newline that echo would add)
printf '%s' "$QPQ_AUTH_TOKEN" | wc -c
```

**Database locked or corrupt**
```bash
# Check if another process holds the database
fuser data/qpq.db

# Verify database integrity (use the sqlcipher CLI; stock sqlite3
# cannot decrypt a SQLCipher database)
sqlcipher data/qpq.db "PRAGMA key='${QPQ_DB_KEY}'; PRAGMA integrity_check;"
```

**Port already in use**
```bash
# Check if something is already bound to UDP port 7000 (QUIC runs over UDP)
ss -ulnp | grep 7000
```
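The checks above can be bundled into a single pre-start gate so a misconfigured node fails fast instead of entering a crash loop. A minimal sketch (the `qpq_preflight` function name is hypothetical; the env vars and file names are the ones this runbook uses, but verify the defaults against your deployment):

```shell
#!/bin/sh
# Hypothetical preflight gate: applies the same constraints described
# in this section before the server process is ever started.
qpq_preflight() {
    data_dir="${QPQ_DATA_DIR:-data}"

    # Auth token: present, >= 16 chars, not the dev default
    tok_len=$(printf '%s' "${QPQ_AUTH_TOKEN:-}" | wc -c)
    if [ "$tok_len" -lt 16 ] || [ "${QPQ_AUTH_TOKEN:-}" = "devtoken" ]; then
        echo "FAIL: QPQ_AUTH_TOKEN unset, too short, or still 'devtoken'" >&2
        return 1
    fi

    # Production mode requires pre-existing TLS material
    for f in "$data_dir/server-cert.der" "$data_dir/server-key.der"; do
        if [ ! -f "$f" ]; then
            echo "FAIL: missing $f" >&2
            return 1
        fi
    done
    echo "preflight OK"
}
```

Wiring this into the unit's `ExecStartPre=` (or a container entrypoint) turns the "production forbids" startup errors into one clear message.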
## Incident: Node Down / Unresponsive

### Symptoms

- Clients cannot connect
- Health check failures
- No new log entries

### Diagnosis

```bash
# 1. Check if the process is running
systemctl status qpq-server
# or: docker compose ps

# 2. Check resource usage
top -bn1 | grep qpq
df -h /var/lib/quicproquo
free -h

# 3. Check QUIC port is reachable
# From another host (note: a UDP probe only confirms the port is not
# ICMP-rejected; it gives no positive proof the server is answering):
nc -uzv <server-ip> 7000

# 4. Check for OOM kills
dmesg | grep -i "out of memory\|oom" | tail -5
journalctl -k | grep -i oom
```

### Recovery

```bash
# Restart the service
systemctl restart qpq-server

# If OOM: increase memory limit
systemctl edit qpq-server --force
# MemoryMax=2G

# If disk full: see "Storage Full" incident below
```
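The OOM check in step 4 can be turned into a small filter that works on any kernel log stream, which helps when paging through saved logs rather than a live box. A sketch (the `count_oom_kills` name is made up for this example; the grep pattern is the one used in the diagnosis step):

```shell
#!/bin/sh
# Count lines mentioning the OOM killer in kernel log text on stdin,
# e.g.: journalctl -k | count_oom_kills
count_oom_kills() {
    grep -ci 'out of memory\|oom' || true   # || true: grep exits 1 on zero matches
}
```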
## Incident: Storage Full

### Symptoms

- `enqueue` operations fail
- Logs show "No space left on device"
- `delivery_queue_depth` gauge rising

### Diagnosis

```bash
# Check disk usage
df -h /var/lib/quicproquo
du -sh /var/lib/quicproquo/*

# Check largest files
du -a /var/lib/quicproquo | sort -rn | head -20

# Check blob storage specifically
du -sh /var/lib/quicproquo/blobs/
find /var/lib/quicproquo/blobs/ -type f | wc -l
```

### Recovery

```bash
# 1. Identify and remove expired messages (the cleanup task handles this,
#    but if it's behind, you can trigger manual cleanup)

# For SQL backend: delete expired delivery messages
# (unquoted EOF so the shell expands ${QPQ_DB_KEY}; the sqlcipher CLI is
# needed to open a SQLCipher database)
sqlcipher data/qpq.db <<EOF
PRAGMA key = '${QPQ_DB_KEY}';
DELETE FROM delivery_queue WHERE expires_at IS NOT NULL AND expires_at < unixepoch();
VACUUM;
EOF

# 2. Remove orphaned blobs (blobs not referenced by any message)
#    This is application-specific; coordinate with the codebase

# 3. If the data partition is full, expand the volume
#    AWS EBS: aws ec2 modify-volume --volume-id vol-xxx --size 100
#    Then: resize2fs /dev/xvdf

# 4. Move to a larger disk
systemctl stop qpq-server
rsync -av /var/lib/quicproquo/ /mnt/new-volume/quicproquo/
# Update QPQ_DATA_DIR and QPQ_DB_PATH to point to the new location
systemctl start qpq-server
```

### Prevention

- Set up disk usage alerts at 70% and 90% thresholds.
- Configure message TTL (`ttl_secs`) to auto-expire old messages.
- Schedule regular `VACUUM` on the SQLCipher database.
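The 70%/90% thresholds are easy to enforce with a small cron-able check. A sketch (function names are hypothetical; point it at your data mount):

```shell
#!/bin/sh
# Exit 0 below 70% usage, 1 at 70-89% (warn), 2 at 90%+ (critical),
# mirroring the alert thresholds suggested under "Prevention".
disk_usage_pct() {
    df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

check_disk() {
    pct=$(disk_usage_pct "$1")
    if [ "$pct" -ge 90 ]; then
        echo "CRIT: $1 at ${pct}% used"
        return 2
    elif [ "$pct" -ge 70 ]; then
        echo "WARN: $1 at ${pct}% used"
        return 1
    fi
    echo "OK: $1 at ${pct}% used"
}
```

Usage from cron, e.g.: `check_disk /var/lib/quicproquo || <your alerting hook>`.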
## Incident: DDoS / Connection Flood

### Symptoms

- `rate_limit_hit_total` counter spiking
- `auth_login_failure_total` counter spiking
- High CPU usage
- Legitimate clients cannot connect

### Diagnosis

```bash
# Check connection rate limit hits in metrics
curl -s http://localhost:9090/metrics | grep rate_limit

# Check auth failure rate
curl -s http://localhost:9090/metrics | grep auth_login_failure

# Check active connections (QUIC uses UDP)
ss -unp | grep 7000 | wc -l
```

### Mitigation

```bash
# 1. The server has built-in per-IP connection rate limiting.
#    Check the logs for "connection rate limit exceeded" messages.

# 2. Block offending IPs at the firewall level
iptables -A INPUT -s <attacker-ip> -j DROP

# 3. For volumetric attacks, use upstream DDoS protection
#    (Cloudflare Spectrum, AWS Shield, etc.)

# 4. If the server is overwhelmed, restart to clear state
systemctl restart qpq-server

# 5. Enable log redaction to reduce I/O pressure during attacks
#    Set QPQ_REDACT_LOGS=true
```
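Before blocking IPs in step 2, it helps to see which peers are actually flooding the UDP port. A sketch that summarizes peer addresses from pre-filtered `ss` output (the `top_udp_talkers` name is made up; the awk column assumes iproute2's State / Recv-Q / Send-Q / Local / Peer layout, so sanity-check it against your `ss` version):

```shell
#!/bin/sh
# Top 10 source IPs by socket count. Feed it header-free ss lines, e.g.:
#   ss -unp | grep 7000 | top_udp_talkers
top_udp_talkers() {
    awk '{ print $5 }' |        # peer address:port column
        sed 's/:[^:]*$//' |     # strip the port, keep the IP
        sort | uniq -c | sort -rn | head -10
}
```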
## Incident: Key Compromise

### Auth Token Compromised

**Severity: P1**

```bash
# 1. Immediately rotate the auth token
NEW_TOKEN=$(openssl rand -base64 32)

# 2. Update server config and restart
#    See: key-rotation.md "Auth Token Rotation"

# 3. Notify all legitimate clients of the new token

# 4. Review logs for unauthorized access
journalctl -u qpq-server | grep "auth_login_success" | tail -100
```

### TLS Private Key Compromised

**Severity: P1**

```bash
# 1. Generate and install a new certificate immediately
#    See: key-rotation.md "TLS Certificate Rotation"

# 2. Revoke the compromised certificate with your CA
#    (procedure depends on your CA)

# 3. Restart the server with the new certificate
systemctl restart qpq-server

# 4. If clients pin certificates, notify them of the change
```

### Database Key Compromised

**Severity: P1**

```bash
# 1. Stop the server
systemctl stop qpq-server

# 2. Rekey the database immediately
#    See: key-rotation.md "Database Encryption Key Rotation"

# 3. Assess data exposure
#    If the attacker had access to the database file, assume all
#    stored data (users, key packages, delivery queues) is compromised.

# 4. Consider notifying affected users
```

### OPAQUE ServerSetup Compromised

**Severity: P1**

```bash
# 1. Rotate the OPAQUE ServerSetup
#    See: key-rotation.md "OPAQUE ServerSetup Rotation"
#    WARNING: This invalidates ALL OPAQUE credentials.

# 2. All users must re-register with new credentials
# 3. Review logs for unauthorized OPAQUE authentications
```
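For the auth-token case, generating and validating the replacement in one step avoids deploying a token the server will reject. A sketch (the `new_auth_token` name is hypothetical; the >= 16 chars / not-"devtoken" constraints are the ones stated earlier in this playbook):

```shell
#!/bin/sh
# Generate a replacement auth token and check it against the production
# constraints before it goes anywhere near the server config.
new_auth_token() {
    tok=$(openssl rand -base64 32) || return 1
    len=$(printf '%s' "$tok" | wc -c)
    if [ "$len" -lt 16 ] || [ "$tok" = "devtoken" ]; then
        echo "generated token failed validation" >&2
        return 1
    fi
    printf '%s\n' "$tok"
}
```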
## Incident: High Latency

### Symptoms

- Clients report slow message delivery
- `delivery_queue_depth` gauge is high
- Fetch operations are slow

### Diagnosis

```bash
# 1. Check system resources
top -bn1 | head -20
iostat -x 1 3

# 2. Check database performance
# (unquoted EOF so ${QPQ_DB_KEY} expands; sqlcipher CLI for SQLCipher DBs)
sqlcipher data/qpq.db <<EOF
PRAGMA key = '${QPQ_DB_KEY}';
PRAGMA integrity_check;
PRAGMA wal_checkpoint(PASSIVE);
-- Check table sizes
SELECT 'delivery_queue', count(*) FROM delivery_queue
UNION ALL SELECT 'key_packages', count(*) FROM key_packages
UNION ALL SELECT 'users', count(*) FROM users;
EOF

# 3. Check queue depth via metrics
curl -s http://localhost:9090/metrics | grep delivery_queue_depth
```

### Recovery

```bash
# 1. Checkpoint the WAL (reduces WAL file size)
sqlcipher data/qpq.db "PRAGMA key='${QPQ_DB_KEY}'; PRAGMA wal_checkpoint(TRUNCATE);"

# 2. VACUUM to reclaim space and defragment
sqlcipher data/qpq.db "PRAGMA key='${QPQ_DB_KEY}'; VACUUM;"

# 3. If the queue is huge, check for clients not fetching
#    (delivery_queue rows accumulate when clients are offline)

# 4. If I/O-bound: move database to faster storage (SSD/NVMe)
```
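The queue-depth check in step 3 can be scripted against the Prometheus exposition format so an alert fires before clients notice. A sketch (`queue_depth` is a made-up helper; the `delivery_queue_depth` metric and the :9090 endpoint are the ones used in the diagnosis steps):

```shell
#!/bin/sh
# Extract the first delivery_queue_depth sample from exposition-format
# text on stdin, e.g.:
#   curl -s http://localhost:9090/metrics | queue_depth
queue_depth() {
    awk '/^delivery_queue_depth[ {]/ { print $NF; exit }'
}
```

Compare the result against a threshold in cron, e.g. `[ "$(curl -s http://localhost:9090/metrics | queue_depth)" -gt 10000 ]` to gate an alert.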
## Incident: Certificate Expiring

### Symptoms

- Log warning: "TLS certificate expires within 30 days"
- Monitoring alert on certificate expiry

### Response

```bash
# 1. Check current certificate expiry
openssl x509 -inform DER -in data/server-cert.der -noout -enddate

# 2. Renew the certificate
#    See: key-rotation.md "TLS Certificate Rotation"

# 3. Verify the new certificate is loaded
journalctl -u qpq-server --since "1 min ago" | grep -i cert
```
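The expiry date from step 1 is more useful as "days left" when wiring up the 30-day warning. A sketch relying on GNU `date` (the `cert_days_left` name is hypothetical; the DER format matches the check above):

```shell
#!/bin/sh
# Print integer days until the DER certificate at $1 expires
# (negative once it has already expired).
cert_days_left() {
    end=$(openssl x509 -inform DER -in "$1" -noout -enddate | cut -d= -f2) || return 1
    end_s=$(date -d "$end" +%s) || return 1
    now_s=$(date +%s)
    echo $(( (end_s - now_s) / 86400 ))
}
```

Cron usage, e.g.: `[ "$(cert_days_left data/server-cert.der)" -lt 30 ]` to gate an alert.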
## Post-Incident Checklist

After resolving any incident:

1. **Document** the incident: timeline, root cause, resolution steps
2. **Verify** the service is fully operational (check metrics, test client connections)
3. **Review** whether monitoring would have caught this earlier
4. **Update** alerts and thresholds based on findings
5. **Communicate** with affected users if there was data exposure or service disruption
6. **Schedule** follow-up actions (e.g., add monitoring, improve automation)