docs: add operational runbook, Grafana dashboard, and production docker-compose

Add comprehensive operational documentation:

- docs/operations/backup-restore.md: SQLCipher, file backend, blob backup/restore
- docs/operations/key-rotation.md: auth token, TLS, federation, DB key, OPAQUE rotation
- docs/operations/incident-response.md: playbook for common incidents
- docs/operations/scaling-guide.md: resource sizing, scaling triggers, capacity planning
- docs/operations/monitoring.md: Prometheus metrics, alert rules, log monitoring
- docs/operations/dashboards/qpq-overview.json: Grafana dashboard template
- docs/operations/prometheus.yml + alerts: Prometheus scrape and alert config
- docs/operations/grafana-provisioning/: auto-provisioning for datasources and dashboards
- docker-compose.prod.yml: production stack (server + Prometheus + Grafana)
- .env.example: documented environment variable template
2026-03-04 20:30:57 +01:00
parent b94248b3b6
commit 91c5495ab7
12 changed files with 1872 additions and 0 deletions
--- a/docs/operations/monitoring.md
+++ b/docs/operations/monitoring.md
@@ -0,0 +1,225 @@
+# Monitoring Guide
+
+This document covers metrics collection, alerting, dashboards, log monitoring, and health checks for quicproquo.
+
+## Enabling Metrics
+
+The server exports Prometheus metrics via HTTP when configured:
+
+```bash
+# Environment variables
+QPQ_METRICS_LISTEN=0.0.0.0:9090
+QPQ_METRICS_ENABLED=true
+
+# Or in qpq-server.toml
+metrics_listen = "0.0.0.0:9090"
+metrics_enabled = true
+```
+
+Metrics are served at `http://<metrics_listen>/metrics` in Prometheus exposition format.
+
+## Available Metrics
+
+### Counters
+
+| Metric | Description | Labels |
+|--------|-------------|--------|
+| `enqueue_total` | Total messages enqueued | - |
+| `enqueue_bytes_total` | Total bytes enqueued | - |
+| `fetch_total` | Total message fetches completed | - |
+| `fetch_wait_total` | Total long-poll fetch waits | - |
+| `key_package_upload_total` | Total MLS key package uploads | - |
+| `auth_login_success_total` | Successful OPAQUE login completions | - |
+| `auth_login_failure_total` | Failed login attempts | - |
+| `rate_limit_hit_total` | Rate limit rejections | - |
+
+### Gauges
+
+| Metric | Description |
+|--------|-------------|
+| `delivery_queue_depth` | Current delivery queue depth (sampled) |
+
+## Prometheus Configuration
+
+```yaml
+# prometheus.yml
+global:
+  scrape_interval: 15s
+  evaluation_interval: 15s
+
+scrape_configs:
+  - job_name: 'qpq-server'
+    static_configs:
+      - targets: ['qpq-server:9090']
+    scrape_interval: 10s
+```
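+
+Prometheus evaluates alert rules only from files listed under `rule_files`, which the scrape configuration above does not include. A minimal sketch, assuming the rules from the next section are saved as `prometheus-alerts.yml` alongside `prometheus.yml` (the file name and location are assumptions; adjust them to your layout):
+
+```yaml
+# prometheus.yml (fragment)
+# Assumed file name and location for the alert rules.
+rule_files:
+  - prometheus-alerts.yml
+```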
+
+## Alert Rules
+
+```yaml
+# prometheus-alerts.yml
+groups:
+  - name: qpq-server
+    rules:
+      # Server down
+      - alert: QpqServerDown
+        expr: up{job="qpq-server"} == 0
+        for: 1m
+        labels:
+          severity: critical
+        annotations:
+          summary: "qpq-server is down"
+          description: "Prometheus cannot scrape qpq-server metrics for > 1 minute."
+
+      # High auth failure rate (potential brute force)
+      - alert: QpqHighAuthFailureRate
+        expr: rate(auth_login_failure_total[5m]) > 10
+        for: 2m
+        labels:
+          severity: warning
+        annotations:
+          summary: "High authentication failure rate"
+          description: "{{ $value | printf \"%.1f\" }} auth failures/sec over 5 minutes."
+
+      # Rate limiting active
+      - alert: QpqRateLimitActive
+        expr: rate(rate_limit_hit_total[5m]) > 5
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Rate limiting is actively rejecting requests"
+          description: "{{ $value | printf \"%.1f\" }} rate limit hits/sec."
+
+      # Delivery queue growing
+      - alert: QpqDeliveryQueueHigh
+        expr: delivery_queue_depth > 10000
+        for: 10m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Delivery queue depth is high"
+          description: "Queue depth: {{ $value }}. Clients may not be fetching."
+
+      - alert: QpqDeliveryQueueCritical
+        expr: delivery_queue_depth > 100000
+        for: 5m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Delivery queue depth is critical"
+          description: "Queue depth: {{ $value }}. Investigate immediately."
+
+      # No enqueue activity (service may be stuck)
+      - alert: QpqNoEnqueueActivity
+        expr: rate(enqueue_total[15m]) == 0
+        for: 30m
+        labels:
+          severity: warning
+        annotations:
+          summary: "No messages enqueued in 30 minutes"
+          description: "Check if the service is accepting connections."
+
+      # Auth success ratio too low
+      - alert: QpqLowAuthSuccessRatio
+        expr: >
+          rate(auth_login_success_total[5m])
+          / (rate(auth_login_success_total[5m]) + rate(auth_login_failure_total[5m]))
+          < 0.5
+        for: 10m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Auth success ratio below 50%"
+          description: "More than half of login attempts are failing."
+```
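+
+The success-ratio expression in `QpqLowAuthSuccessRatio` is fairly long and is also useful on dashboards. One option is to precompute it with a recording rule. The sketch below is illustrative only: the group name and the recorded series name are hypothetical and are not defined anywhere else in this repository.
+
+```yaml
+# prometheus-alerts.yml (additional group, hypothetical)
+groups:
+  - name: qpq-server-recording
+    rules:
+      # Hypothetical series name, following the level:metric:operations convention.
+      - record: job:auth_login_success:ratio_rate5m
+        expr: >
+          sum by (job) (rate(auth_login_success_total[5m]))
+          / (sum by (job) (rate(auth_login_success_total[5m]))
+          + sum by (job) (rate(auth_login_failure_total[5m])))
+```
+
+An alert or panel could then reference `job:auth_login_success:ratio_rate5m` instead of repeating the full expression.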
+
+## Key Dashboard Panels
+
+See `dashboards/qpq-overview.json` for the full Grafana dashboard. Key panels:
+
+### Message Throughput
+- **Enqueue rate**: `rate(enqueue_total[5m])`
+- **Fetch rate**: `rate(fetch_total[5m])`
+- **Enqueue bandwidth**: `rate(enqueue_bytes_total[5m])`
+
+### Authentication
+- **Login success rate**: `rate(auth_login_success_total[5m])`
+- **Login failure rate**: `rate(auth_login_failure_total[5m])`
+- **Success ratio**: `rate(auth_login_success_total[5m]) / (rate(auth_login_success_total[5m]) + rate(auth_login_failure_total[5m]))`
+
+### Delivery Queue
+- **Queue depth**: `delivery_queue_depth`
+- **Queue growth rate**: `deriv(delivery_queue_depth[10m])`
+
+### Rate Limiting
+- **Rate limit hits**: `rate(rate_limit_hit_total[5m])`
+
+### Infrastructure (Node Exporter)
+- CPU, memory, disk, network from `node_exporter`
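+
+The infrastructure panels need a `node_exporter` scrape job, which the Prometheus configuration in this guide does not define. A minimal sketch, assuming the exporter is reachable under the host name `node-exporter` (an assumption; 9100 is the exporter's default port):
+
+```yaml
+# prometheus.yml (fragment): extra entry under scrape_configs
+scrape_configs:
+  - job_name: 'node'
+    static_configs:
+      # Assumed host name; 9100 is node_exporter's default port.
+      - targets: ['node-exporter:9100']
+```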
+
+## Grafana Dashboard
+
+Import the dashboard from `dashboards/qpq-overview.json`:
+
+1. Open Grafana -> Dashboards -> Import
+2. Upload `docs/operations/dashboards/qpq-overview.json`
+3. Select your Prometheus data source
+4. Save
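+
+As an alternative to the manual import, Grafana can load the data source and dashboard from provisioning files at startup. The sketch below uses Grafana's provisioning file format; the file names, the `prometheus` service name, and the dashboard mount path are assumptions and may differ from the files under `docs/operations/grafana-provisioning/`.
+
+```yaml
+# provisioning/datasources/prometheus.yml (hypothetical file name)
+apiVersion: 1
+datasources:
+  - name: Prometheus
+    type: prometheus
+    access: proxy
+    url: http://prometheus:9090   # assumed service name
+    isDefault: true
+---
+# provisioning/dashboards/qpq.yml (hypothetical file name)
+apiVersion: 1
+providers:
+  - name: qpq
+    type: file
+    options:
+      path: /var/lib/grafana/dashboards   # assumed mount point for qpq-overview.json
+```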
+
+## Log Monitoring
+
+The server logs through the `tracing` crate; verbosity is controlled by the `RUST_LOG` environment variable:
+
+```bash
+# Production: info level (RUST_LOG only sets the filter; it does not select JSON output)
+RUST_LOG=info
+
+# Debug specific modules
+RUST_LOG=info,quicproquo_server::node_service=debug
+
+# Verbose debugging
+RUST_LOG=debug
+```
+
+### Key Log Messages to Monitor
+
+| Log Pattern | Meaning | Action |
+|-------------|---------|--------|
+| `"TLS certificate expires within 30 days"` | Cert expiring soon | Rotate certificate |
+| `"TLS certificate is self-signed"` | Self-signed cert in use | Replace with CA-signed cert in production |
+| `"connection rate limit exceeded"` | IP being rate limited | Check for DDoS |
+| `"running without QPQ_AUTH_TOKEN"` | Insecure mode | Must not appear in production |
+| `"db_key is empty; SQL store will be plaintext"` | Unencrypted DB | Must not appear in production |
+| `"shutdown signal received"` | Graceful shutdown started | Expected during deploys |
+| `"generated and persisted new OPAQUE ServerSetup"` | Fresh OPAQUE setup | Expected on first start only |
+
+### Log Aggregation
+
+For production, pipe logs to a log aggregator:
+
+```bash
+# Systemd -> journald -> Loki/Elasticsearch
+journalctl -u qpq-server -f --output=json | \
+  promtail --stdin --client.url=http://loki:3100/loki/api/v1/push
+
+# Docker -> Loki driver (requires the Loki Docker logging plugin to be installed)
+docker run --log-driver=loki \
+  --log-opt loki-url="http://loki:3100/loki/api/v1/push" \
+  qpq-server
+```
+
+## Health Checking
+
+The Docker image includes a basic health check (TLS cert file exists). For deeper health checks:
+
+```bash
+# Simple: check that a process is listening on UDP port 7000
+ss -ulnp | grep ':7000'
+
+# Metrics endpoint (if enabled)
+curl -sf http://localhost:9090/metrics > /dev/null
+
+# Full client connection test
+qpq-client --server 127.0.0.1:7000 --auth-token "$TOKEN" --ping
+```
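+
+The metrics check above could also serve as a container health check. A minimal sketch for a Compose file, assuming the service is named `qpq-server`, that `curl` is available inside the image, and that metrics listen on port 9090 (all three are assumptions):
+
+```yaml
+# docker-compose.prod.yml (fragment, hypothetical service definition)
+services:
+  qpq-server:
+    healthcheck:
+      # Assumes curl exists in the image and metrics are enabled on port 9090.
+      test: ["CMD", "curl", "-sf", "http://localhost:9090/metrics"]
+      interval: 30s
+      timeout: 5s
+      retries: 3
+```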