docs: add operational runbook, Grafana dashboard, and production docker-compose

Add comprehensive operational documentation:
- docs/operations/backup-restore.md: SQLCipher, file backend, blob backup/restore
- docs/operations/key-rotation.md: auth token, TLS, federation, DB key, OPAQUE rotation
- docs/operations/incident-response.md: playbook for common incidents
- docs/operations/scaling-guide.md: resource sizing, scaling triggers, capacity planning
- docs/operations/monitoring.md: Prometheus metrics, alert rules, log monitoring
- docs/operations/dashboards/qpq-overview.json: Grafana dashboard template
- docs/operations/prometheus.yml + alerts: Prometheus scrape and alert config
- docs/operations/grafana-provisioning/: auto-provisioning for datasources and dashboards
- docker-compose.prod.yml: production stack (server + Prometheus + Grafana)
- .env.example: documented environment variable template
2026-03-04 20:30:57 +01:00
parent b94248b3b6
commit 91c5495ab7
12 changed files with 1872 additions and 0 deletions


@@ -0,0 +1,199 @@
# Backup and Restore Procedures
This document covers backup and restore for all quicproquo server data stores.
## Data Inventory
| Data | Location | Backend | Contains |
|------|----------|---------|----------|
| SQLCipher DB | `QPQ_DB_PATH` (default `data/qpq.db`) | `store_backend=sql` | Users, key packages, delivery queues, sessions, KT log, OPAQUE setup, blob metadata, moderation |
| File store | `QPQ_DATA_DIR` (default `data/`) | `store_backend=file` | Bincode-serialized key packages, delivery queues, server state |
| Blob storage | `QPQ_DATA_DIR/blobs/` | Filesystem | Uploaded file transfer blobs |
| TLS certificates | `QPQ_TLS_CERT`, `QPQ_TLS_KEY` | DER files | Server identity |
| OPAQUE ServerSetup | Inside DB or file store | Persisted | OPAQUE credential state (critical for auth) |
| Server signing key | Inside DB or file store | Persisted | Ed25519 key for delivery proofs |
| KT Merkle log | Inside DB or file store | Persisted | Key transparency audit log |
## SQLCipher Backup
### Hot Backup (Online)
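SQLCipher supports the `.backup` command while the server is running (WAL mode allows concurrent readers). Use the `sqlcipher` shell for this: a stock `sqlite3` binary that is not linked against SQLCipher cannot open the encrypted database.
```bash
# 1. Open the encrypted database in the SQLCipher shell
sqlcipher data/qpq.db
# 2. At the sqlcipher prompt, set the encryption key
PRAGMA key = 'your-db-key-here';
# 3. Perform an online backup. Type a literal file name here:
#    shell substitutions such as $(date ...) are not expanded at this prompt.
.backup /backups/qpq-20260304-203000.db
.quit
```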
### Scripted Hot Backup
```bash
#!/bin/bash
set -euo pipefail
BACKUP_DIR="/backups/qpq"
DB_PATH="${QPQ_DB_PATH:-data/qpq.db}"
DB_KEY="${QPQ_DB_KEY:?QPQ_DB_KEY must be set}"
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/qpq-${TIMESTAMP}.db"
mkdir -p "$BACKUP_DIR"
sqlcipher "$DB_PATH" <<EOF
PRAGMA key = '${DB_KEY}';
.backup ${BACKUP_FILE}
EOF
# Verify the backup is readable. Only the last output line is compared,
# because some SQLCipher versions also print "ok" in response to PRAGMA key.
RESULT=$(sqlcipher "$BACKUP_FILE" "PRAGMA key = '${DB_KEY}'; PRAGMA integrity_check;" | tail -n 1) || true
if [ "$RESULT" = "ok" ]; then
  echo "Backup verified: $BACKUP_FILE"
else
  echo "ERROR: backup verification failed" >&2
  exit 1
fi
# Delete backups older than 7 days
find "$BACKUP_DIR" -name 'qpq-*.db' -mtime +7 -delete
```
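An age-only rule such as `-mtime +7` will eventually delete every backup if the backup job silently stops running. A count-based rule that keeps the newest N files regardless of age can complement it. The sketch below is one possible approach: `prune_keep_newest` is a hypothetical helper, not part of quicproquo, and it assumes the `qpq-YYYYmmdd-HHMMSS.db` naming used by the script above, whose names sort chronologically.

```bash
#!/bin/bash
# Hypothetical helper: keep only the newest KEEP backups in DIR.
# Assumes file names of the form qpq-YYYYmmdd-HHMMSS.db, so that a
# reverse lexical sort lists the newest backup first.
prune_keep_newest() {
  local dir="$1" keep="$2" f
  printf '%s\n' "$dir"/qpq-*.db \
    | sort -r \
    | tail -n "+$((keep + 1))" \
    | while IFS= read -r f; do
        # The -f test also skips the unexpanded glob when nothing matches.
        if [ -f "$f" ]; then
          rm -- "$f"
        fi
      done
}
```

For example, `prune_keep_newest /backups/qpq 28` would keep roughly one week of six-hourly backups.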
### Cold Backup (Offline)
```bash
# 1. Stop the server
systemctl stop qpq-server # or docker compose stop server
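# 2. Copy the database file
STAMP=$(date +%Y%m%d)
cp data/qpq.db "/backups/qpq-${STAMP}.db"
# 3. Copy the WAL and SHM files if they exist, keeping the same base name
#    so SQLite can associate them with the copied database
cp data/qpq.db-wal "/backups/qpq-${STAMP}.db-wal" 2>/dev/null || true
cp data/qpq.db-shm "/backups/qpq-${STAMP}.db-shm" 2>/dev/null || true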
# 4. Restart the server
systemctl start qpq-server
```
## File Backend Backup
When using `store_backend=file`, data is stored as bincode files under `QPQ_DATA_DIR`.
```bash
# Full directory backup
tar czf /backups/qpq-data-$(date +%Y%m%d-%H%M%S).tar.gz \
-C "$(dirname "${QPQ_DATA_DIR:-data}")" \
"$(basename "${QPQ_DATA_DIR:-data}")"
```
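A tar archive can be truncated or corrupted after it is written, for example during an offsite copy. One option, sketched here, is to store a SHA-256 checksum next to each archive and check it before restoring. `write_checksum` and `verify_checksum` are hypothetical helper names, and the sketch assumes GNU coreutils `sha256sum` is available.

```bash
#!/bin/bash
# Hypothetical helpers; assumes GNU coreutils sha256sum.

# Write ARCHIVE.sha256 next to ARCHIVE. The checksum file stores only the
# base name, so the pair can be moved to another directory together.
write_checksum() {
  local archive="$1" dir name
  dir="$(dirname "$archive")"
  name="$(basename "$archive")"
  ( cd "$dir" && sha256sum "$name" > "$name.sha256" )
}

# Succeed only if ARCHIVE still matches its recorded checksum.
verify_checksum() {
  local archive="$1" dir name
  dir="$(dirname "$archive")"
  name="$(basename "$archive")"
  ( cd "$dir" && sha256sum -c "$name.sha256" > /dev/null )
}
```

Calling `write_checksum` right after `tar czf` and `verify_checksum` right before `tar xzf` covers both ends of the archive's lifetime.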
## Blob Storage Backup
Blobs are stored in `QPQ_DATA_DIR/blobs/`. These are immutable once written.
```bash
# Incremental rsync (blobs are write-once, ideal for rsync)
rsync -av --progress data/blobs/ /backups/blobs/
```
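Because blobs are write-once, a file that exists on both sides should be byte-identical, and any difference suggests corruption on one side. The following sketch lists such files using only `find` and `cmp`. `changed_blobs` is a hypothetical helper name, and the directory arguments are examples.

```bash
#!/bin/bash
# Hypothetical helper: print the relative path of every file that exists
# in both SRC and DST but whose contents differ. Files missing from DST
# are not reported, since they may simply not have been synced yet.
changed_blobs() {
  local src="$1" dst="$2" rel
  ( cd "$src" && find . -type f -print ) | while IFS= read -r rel; do
    if [ -f "$dst/$rel" ] && ! cmp -s "$src/$rel" "$dst/$rel"; then
      printf '%s\n' "${rel#./}"
    fi
  done
}
```

An empty result from `changed_blobs data/blobs /backups/blobs` means every blob present in both places matches.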
## TLS Certificate Backup
```bash
# Back up TLS certificates (store separately from DB backups)
mkdir -p /backups/tls
cp data/server-cert.der /backups/tls/server-cert.der
cp data/server-key.der /backups/tls/server-key.der
# Federation certs (if federation is enabled)
cp data/federation-cert.der /backups/tls/federation-cert.der 2>/dev/null || true
cp data/federation-key.der /backups/tls/federation-key.der 2>/dev/null || true
cp data/federation-ca.der /backups/tls/federation-ca.der 2>/dev/null || true
```
## Restore Procedures
### Restore SQLCipher Database
```bash
# 1. Stop the server
systemctl stop qpq-server
# 2. Move the current (corrupt/lost) database aside
mv data/qpq.db data/qpq.db.broken 2>/dev/null || true
rm -f data/qpq.db-wal data/qpq.db-shm
# 3. Copy the backup in place
cp /backups/qpq-20260304.db data/qpq.db
# 4. Verify integrity
sqlcipher data/qpq.db "PRAGMA key = '${QPQ_DB_KEY}'; PRAGMA integrity_check;"
# 5. Start the server (migrations will apply automatically if needed)
systemctl start qpq-server
```
### Restore File Backend
```bash
# 1. Stop the server
systemctl stop qpq-server
# 2. Replace the data directory
mv data data.broken 2>/dev/null || true
tar xzf /backups/qpq-data-20260304.tar.gz -C .
# 3. Restore TLS certs if not included in the data backup
cp /backups/tls/server-cert.der data/server-cert.der
cp /backups/tls/server-key.der data/server-key.der
# 4. Start the server
systemctl start qpq-server
```
### Restore Blobs Only
```bash
rsync -av /backups/blobs/ data/blobs/
```
## Backup Schedule Recommendations
| Frequency | What | Method |
|-----------|------|--------|
| Every 6 hours | SQLCipher database | Hot backup script via cron |
| Daily | File backend / full data dir | tar + offsite copy |
| Hourly | Blobs | rsync (incremental) |
| On change | TLS certificates | Manual + secret manager |
## Cron Example
```cron
# SQLCipher hot backup every 6 hours
0 */6 * * * /opt/qpq/scripts/backup-db.sh >> /var/log/qpq-backup.log 2>&1
# Full data directory daily at 02:00
0 2 * * * tar czf /backups/qpq-data-$(date +\%Y\%m\%d).tar.gz -C /var/lib quicproquo
# Blob sync every hour
0 * * * * rsync -a /var/lib/quicproquo/blobs/ /backups/blobs/
# Prune backups older than 30 days
0 3 * * 0 find /backups -type f -name 'qpq-*' -mtime +30 -delete
```
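Cron jobs can fail silently, so it may be worth alerting when the newest backup is older than expected. The sketch below checks for at least one recent file. `backup_is_fresh` is a hypothetical helper name, and it relies on the `-mmin` and `-maxdepth` extensions found in GNU and BSD `find`.

```bash
#!/bin/bash
# Hypothetical helper: succeed if DIR holds at least one qpq-*.db file
# modified within the last MAX_MINUTES minutes.
backup_is_fresh() {
  local dir="$1" max_minutes="$2" recent
  recent="$(find "$dir" -maxdepth 1 -type f -name 'qpq-*.db' -mmin "-${max_minutes}")"
  [ -n "$recent" ]
}
```

With the six-hour schedule above, a check such as `backup_is_fresh /backups/qpq 420 || echo "no recent backup" >&2` leaves an hour of slack before it reports a problem.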
## Verification
Always verify backups after creation:
```bash
# SQLCipher integrity check
sqlcipher /backups/qpq-latest.db \
"PRAGMA key = '${QPQ_DB_KEY}'; PRAGMA integrity_check; SELECT count(*) FROM users;"
# File backend: check the archive is valid
tar tzf /backups/qpq-data-latest.tar.gz > /dev/null
# TLS cert: check it parses, print its validity dates, and fail if it has already expired
openssl x509 -inform DER -in /backups/tls/server-cert.der -noout -dates -checkend 0
```