docs: add operational runbook, Grafana dashboard, and production docker-compose

Add comprehensive operational documentation:
- docs/operations/backup-restore.md: SQLCipher, file backend, blob backup/restore
- docs/operations/key-rotation.md: auth token, TLS, federation, DB key, OPAQUE rotation
- docs/operations/incident-response.md: playbook for common incidents
- docs/operations/scaling-guide.md: resource sizing, scaling triggers, capacity planning
- docs/operations/monitoring.md: Prometheus metrics, alert rules, log monitoring
- docs/operations/dashboards/qpq-overview.json: Grafana dashboard template
- docs/operations/prometheus.yml + alerts: Prometheus scrape and alert config
- docs/operations/grafana-provisioning/: auto-provisioning for datasources and dashboards
- docker-compose.prod.yml: production stack (server + Prometheus + Grafana)
- .env.example: documented environment variable template
2026-03-04 20:30:57 +01:00
parent b94248b3b6
commit 91c5495ab7
12 changed files with 1872 additions and 0 deletions

@@ -0,0 +1,250 @@
# Key Rotation Procedures
This document provides step-by-step procedures for rotating all cryptographic material in a quicproquo deployment.
## Auth Token Rotation
The auth token (`QPQ_AUTH_TOKEN`) is used for bearer-token authentication (auth version 1). OPAQUE-authenticated sessions are not affected by token rotation.
### Procedure
```bash
# 1. Generate a new token (minimum 16 characters for production)
NEW_TOKEN=$(openssl rand -base64 32)
echo "New token: $NEW_TOKEN"
# 2. Update the config file or environment
# Option A: TOML config file
# (use | as the sed delimiter: base64 output can contain / characters)
sed -i "s|^auth_token = .*|auth_token = \"$NEW_TOKEN\"|" qpq-server.toml
# Option B: Environment variable (systemd)
systemctl edit qpq-server --force
# Add under [Service]: Environment=QPQ_AUTH_TOKEN=<new-token>
# Option C: Docker Compose
# Update QPQ_AUTH_TOKEN in docker-compose.prod.yml or .env file
# 3. Restart the server
systemctl restart qpq-server
# or: docker compose restart server
# 4. Update all clients with the new token
# Clients using OPAQUE auth are unaffected.
# Clients using bearer-token auth must update their QPQ_ACCESS_TOKEN.
```
### Impact
- Sessions are held in memory, so active bearer-token sessions do not survive the restart in step 3; affected clients must reconnect with the new token.
- New bearer-token connections must use the new token.
- OPAQUE-authenticated clients are not affected.
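The minimum-length rule mentioned in step 1 can be checked before the token is written into any config. The following is a minimal sketch, not part of qpq-server: `token_is_long_enough` is a hypothetical helper, and the default of 16 characters is taken from the comment in the procedure above.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with qpq-server): succeed only if the
# token is at least MIN characters long. MIN defaults to 16, mirroring the
# production minimum noted in the procedure.
token_is_long_enough() {
  local token="$1" min="${2:-16}"
  [ "${#token}" -ge "$min" ]
}

# Example usage, assuming NEW_TOKEN was set in step 1:
# token_is_long_enough "$NEW_TOKEN" || { echo "token too short" >&2; exit 1; }
```

Running the check before the `sed` or `systemctl edit` step avoids restarting the server with a token that it may reject.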
## TLS Certificate Rotation
The server uses DER-encoded X.509 certificates for QUIC TLS 1.3. The server validates certificates at startup and warns if expiry is within 30 days.
### Procedure
```bash
# 1. Obtain a new certificate (example with Let's Encrypt / certbot)
certbot certonly --standalone -d chat.example.com
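# 2. Convert PEM to DER format (qpq-server expects DER)
#    Note: openssl x509 converts only the first (leaf) certificate in fullchain.pem.
openssl x509 -in /etc/letsencrypt/live/chat.example.com/fullchain.pem \
  -outform DER -out /tmp/server-cert.der
#    Write the key in a subshell with umask 077 so it is not created world-readable in /tmp.
(umask 077 && openssl pkey -in /etc/letsencrypt/live/chat.example.com/privkey.pem \
  -outform DER -out /tmp/server-key.der)
# 3. Set restrictive permissions on the private key
chmod 600 /tmp/server-key.der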
# 4. Back up the current certificates
cp data/server-cert.der data/server-cert.der.bak
cp data/server-key.der data/server-key.der.bak
# 5. Replace certificates
cp /tmp/server-cert.der data/server-cert.der
cp /tmp/server-key.der data/server-key.der
# 6. Verify the new certificate
openssl x509 -inform DER -in data/server-cert.der -noout -text | head -20
# 7. Restart the server (required for the new TLS config to take effect)
systemctl restart qpq-server
# 8. Verify the server started with the new certificate
journalctl -u qpq-server --since "1 min ago" | grep -i tls
```
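The 30-day expiry warning described above can also be checked from the shell before or after a rotation. This is a sketch under assumptions: `days_to_seconds` and `cert_valid_for_days` are hypothetical helper names, and the check relies on the `-checkend` option of `openssl x509`, which exits non-zero when the certificate expires within the given number of seconds.

```bash
#!/usr/bin/env bash
# Convert a day count to seconds for openssl's -checkend option.
days_to_seconds() {
  echo $(( $1 * 86400 ))
}

# Hypothetical helper: succeed if the DER certificate remains valid for at
# least DAYS more days (default 30, matching the server's startup warning).
cert_valid_for_days() {
  local cert="$1" days="${2:-30}"
  openssl x509 -inform DER -in "$cert" -noout \
    -checkend "$(days_to_seconds "$days")" >/dev/null
}

# Example usage (path taken from the procedure above):
# cert_valid_for_days data/server-cert.der 30 || echo "renew soon" >&2
```

A check like this could run from monitoring or cron so that an approaching expiry is noticed even when the server has not been restarted recently.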
### Self-Signed Certificate (Development)
In non-production mode, the server auto-generates a self-signed certificate if none exists. To force regeneration:
```bash
rm data/server-cert.der data/server-key.der
systemctl restart qpq-server
# Server will generate a new self-signed cert for localhost/127.0.0.1/::1
```
### Automated Renewal with Certbot
```bash
#!/bin/bash
# /opt/qpq/scripts/renew-cert.sh
set -euo pipefail
DOMAIN="chat.example.com"
CERT_DIR="/etc/letsencrypt/live/$DOMAIN"
QPQ_DATA="/var/lib/quicproquo"
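certbot renew --quiet
# Exit early unless the live certificate is newer than the deployed one,
# so the server is not restarted on every cron run.
if [ -e "$QPQ_DATA/server-cert.der" ] && \
   [ ! "$CERT_DIR/fullchain.pem" -nt "$QPQ_DATA/server-cert.der" ]; then
  exit 0
fi
openssl x509 -in "$CERT_DIR/fullchain.pem" -outform DER -out "$QPQ_DATA/server-cert.der"
# Create the key with a restrictive umask so it is never world-readable
(umask 077 && openssl pkey -in "$CERT_DIR/privkey.pem" -outform DER -out "$QPQ_DATA/server-key.der")
chmod 600 "$QPQ_DATA/server-key.der"
chown qpq:qpq "$QPQ_DATA/server-cert.der" "$QPQ_DATA/server-key.der"
systemctl restart qpq-server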
```
```cron
# Run cert renewal check twice daily
0 3,15 * * * /opt/qpq/scripts/renew-cert.sh >> /var/log/qpq-cert-renew.log 2>&1
```
## Federation Certificate Rotation
Federation uses mutual TLS (mTLS) with a shared CA for server-to-server authentication.
### Procedure
```bash
# 1. Generate a new federation certificate signed by the federation CA
openssl req -new -nodes -keyout /tmp/federation-key.pem \
-out /tmp/federation.csr -subj "/CN=chat.example.com"
openssl x509 -req -in /tmp/federation.csr \
-CA federation-ca.pem -CAkey federation-ca-key.pem \
-CAcreateserial -days 365 -out /tmp/federation-cert.pem
# 2. Convert to DER
openssl x509 -in /tmp/federation-cert.pem -outform DER -out data/federation-cert.der
openssl pkey -in /tmp/federation-key.pem -outform DER -out data/federation-key.der
chmod 600 data/federation-key.der
# 3. Restart the server
systemctl restart qpq-server
# 4. Coordinate with federation peers: they must trust the same CA
```
## Database Encryption Key Rotation
The SQLCipher database key (`QPQ_DB_KEY`) encrypts all data at rest.
### Procedure (SQLCipher PRAGMA rekey)
```bash
# 1. Stop the server
systemctl stop qpq-server
# 2. Back up the database
cp data/qpq.db /backups/qpq-pre-rekey-$(date +%Y%m%d).db
# 3. Rekey the database
#    Use the sqlcipher CLI: a sqlite3 binary not built against SQLCipher cannot open the file.
sqlcipher data/qpq.db <<'EOF'
PRAGMA key = 'old-encryption-key';
PRAGMA rekey = 'new-encryption-key';
EOF
# 4. Verify the database opens with the new key
sqlcipher data/qpq.db "PRAGMA key = 'new-encryption-key'; PRAGMA integrity_check;"
# 5. Update the environment/config with the new key
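# Option A: systemd
systemctl edit qpq-server --force
# Add under [Service]: Environment=QPQ_DB_KEY=new-encryption-key
# Option B: Docker Compose .env (replace the existing entry; appending would leave a stale duplicate)
sed -i 's|^QPQ_DB_KEY=.*|QPQ_DB_KEY=new-encryption-key|' .env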
# 6. Start the server
systemctl start qpq-server
```
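Updating an env file by hand makes it easy to leave two `QPQ_DB_KEY` lines behind. The sketch below shows one way to make the update idempotent. `set_env_var` is a hypothetical helper, not part of the qpq tooling, and it assumes a plain `KEY=VALUE` file such as the `.env` used with Docker Compose.

```bash
#!/usr/bin/env bash
# Hypothetical helper: set KEY=VALUE in an env file, replacing any existing
# assignment of KEY or adding one if none exists. The entry ends up on the
# last line of the file.
set_env_var() {
  local file="$1" key="$2" value="$3" tmp
  tmp="$(mktemp)"
  # Copy every line except existing assignments of KEY.
  if [ -f "$file" ]; then
    grep -v "^${key}=" "$file" > "$tmp" || true
  fi
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  # mktemp creates the file with mode 600, which suits a file holding secrets.
  mv "$tmp" "$file"
}

# Example usage, assuming NEW_DB_KEY holds the key passed to PRAGMA rekey:
# set_env_var .env QPQ_DB_KEY "$NEW_DB_KEY"
```

Writing to a temporary file and moving it into place means an interrupted run leaves the original env file untouched.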
### Full Re-encryption (Alternative)
If `PRAGMA rekey` is unavailable or you want a fresh database file:
```bash
# 1. Stop the server and back up
systemctl stop qpq-server
cp data/qpq.db /backups/qpq-pre-rekey.db
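# 2. Export into a new database encrypted with the new key.
#    This uses SQLCipher's sqlcipher_export() instead of piping .dump,
#    because the shell cannot run SQL and a dot-command from one argument.
sqlcipher data/qpq.db <<'EOF'
PRAGMA key = 'old-key';
ATTACH DATABASE 'data/qpq-new.db' AS rekeyed KEY 'new-key';
SELECT sqlcipher_export('rekeyed');
DETACH DATABASE rekeyed;
EOF
# 3. Verify the new file, then replace the database
sqlcipher data/qpq-new.db "PRAGMA key = 'new-key'; PRAGMA integrity_check;"
mv data/qpq-new.db data/qpq.db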
# 4. Update config and restart
systemctl start qpq-server
```
## OPAQUE ServerSetup Rotation
The OPAQUE ServerSetup is generated once and persisted. Rotating it invalidates all registered OPAQUE credentials.
**WARNING: Rotating the OPAQUE ServerSetup requires all users to re-register. Only do this if the setup is compromised.**
```bash
# 1. Stop the server
systemctl stop qpq-server
# 2. Back up the database
cp data/qpq.db /backups/qpq-pre-opaque-rotate.db
# 3. Delete the persisted OPAQUE setup
# For SQL backend (sqlcipher CLI, since the database is encrypted):
sqlcipher data/qpq.db "PRAGMA key='${QPQ_DB_KEY}'; DELETE FROM server_state WHERE key = 'opaque_setup';"
# For file backend:
rm -f data/opaque_setup.bin
# 4. Start the server (it will generate a new OPAQUE ServerSetup)
systemctl start qpq-server
# 5. All users must re-register (existing OPAQUE credentials are invalid)
```
## Server Signing Key Rotation
The Ed25519 signing key is used for delivery proofs. Rotating it means old delivery proofs cannot be verified against the new key.
```bash
# 1. Stop the server
systemctl stop qpq-server
# 2. Back up
cp data/qpq.db /backups/qpq-pre-sigkey-rotate.db
# 3. Delete the persisted signing key seed
# For SQL backend:
sqlcipher data/qpq.db "PRAGMA key='${QPQ_DB_KEY}'; DELETE FROM server_state WHERE key = 'signing_key_seed';"
# 4. Start the server (generates a new Ed25519 signing key)
systemctl start qpq-server
```
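The database, OPAQUE, and signing-key procedures all begin with a plain `cp` backup and then delete or rewrite data. Since those steps cannot be undone without the backup, it may be worth confirming the copy before continuing. The following is a sketch only: `backup_verified` is a hypothetical helper, and the timestamped naming scheme is an assumption rather than a project convention.

```bash
#!/usr/bin/env bash
# Hypothetical helper: copy SRC into DEST_DIR under a labelled, timestamped
# name, confirm the copy is byte-identical, and print the backup path.
backup_verified() {
  local src="$1" dest_dir="$2" label="$3" dest
  dest="${dest_dir}/$(basename "$src").${label}.$(date +%Y%m%d-%H%M%S)"
  cp -p "$src" "$dest" && cmp -s "$src" "$dest" || return 1
  echo "$dest"
}

# Example usage, run after stopping the server:
# backup_verified data/qpq.db /backups pre-opaque-rotate || exit 1
```

The helper returns non-zero if the copy fails or differs from the source, so a calling script can stop before the destructive step.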
## Rotation Schedule
| Key Material | Rotation Frequency | Impact |
|---|---|---|
| Auth token | Quarterly or on compromise | Clients using bearer auth must update |
| TLS certificate | Before expiry (automate with certbot) | Server restart required |
| Federation cert | Annually or before expiry | Coordinate with peers |
| DB encryption key | Annually or on compromise | Server downtime required |
| OPAQUE ServerSetup | Only on compromise | All users must re-register |
| Server signing key | Only on compromise | Old delivery proofs unverifiable |
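The calendar-based rows of the schedule can be turned into a simple reminder check. This is a sketch under assumptions: `rotation_due` is a hypothetical helper, the mapping of "quarterly" to 90 days and "annually" to 365 days is an approximation chosen here, and the date arithmetic relies on GNU `date -d`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if at least INTERVAL_DAYS have elapsed between
# LAST_ROTATED and TODAY (both YYYY-MM-DD; TODAY defaults to the current UTC
# date). Returns 2 if a date cannot be parsed. Requires GNU date.
rotation_due() {
  local last="$1" interval_days="$2" today="${3:-$(date -u +%F)}"
  local last_s today_s
  last_s="$(date -u -d "$last" +%s)" || return 2
  today_s="$(date -u -d "$today" +%s)" || return 2
  [ $(( (today_s - last_s) / 86400 )) -ge "$interval_days" ]
}

# Example usage (intervals are assumptions: 90 days ~ quarterly, 365 ~ annually):
# rotation_due 2026-01-01 90  && echo "auth token rotation due"
# rotation_due 2025-03-04 365 && echo "DB key rotation due"
```

Compromise-driven rotations in the table are not covered by a date check and still need to be triggered by the incident process.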