docs: rewrite mdBook documentation for v2 architecture

Update 25+ files and add 6 new pages to reflect the v2 migration from Cap'n Proto to Protobuf framing over QUIC. Integrate the SDK and Operations docs into the mdBook, restructure SUMMARY.md, and rewrite the wire format, architecture, and protocol sections with accurate v2 content.
2026-03-04 22:02:31 +01:00
parent f7a7f672b4
commit d073f614b3
31 changed files with 4423 additions and 2379 deletions
--- a/docs/src/operations/monitoring.md
+++ b/docs/src/operations/monitoring.md
@@ -0,0 +1,233 @@
+# Monitoring Guide
+
+This document covers metrics collection, alerting, dashboards, log
+monitoring, and health checks for quicproquo server deployments.
+
+## Enabling Metrics
+
+The server exports Prometheus metrics over HTTP when metrics are enabled,
+either through environment variables or in `qpq-server.toml`:
+
+```bash
+# Environment variables
+QPQ_METRICS_LISTEN=0.0.0.0:9090
+QPQ_METRICS_ENABLED=true
+
+# Or in qpq-server.toml
+metrics_listen = "0.0.0.0:9090"
+metrics_enabled = true
+```
+
+Metrics are served at `http://<metrics_listen>/metrics` in Prometheus
+exposition format.
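+
+As a rough illustration of what a scraper receives, the Python sketch below
+parses a few lines of exposition text into a dictionary. The sample payload,
+its values, and the helper name `parse_exposition` are assumptions made for
+this example; only the metric names come from this guide. It handles
+unlabeled samples only, which matches the label-free metrics listed here.
+
+```python
+# Hypothetical sample of what /metrics might return; the values are made up.
+SAMPLE = """\
+# HELP enqueue_total Total messages enqueued
+# TYPE enqueue_total counter
+enqueue_total 1027
+# TYPE delivery_queue_depth gauge
+delivery_queue_depth 42
+"""
+
+
+def parse_exposition(text):
+    """Parse unlabeled "name value" samples from exposition text."""
+    samples = {}
+    for line in text.splitlines():
+        line = line.strip()
+        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
+        if not line or line.startswith("#"):
+            continue
+        parts = line.split()
+        # parts[0] is the metric name, parts[1] the sample value.
+        samples[parts[0]] = float(parts[1])
+    return samples
+
+
+metrics = parse_exposition(SAMPLE)
+print(metrics["enqueue_total"], metrics["delivery_queue_depth"])
+```
+
+For anything beyond a quick check, a maintained Prometheus client library is
+likely a better choice than hand-rolled parsing.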
+
+## Available Metrics
+
+### Counters
+
+| Metric | Description | Labels |
+|--------|-------------|--------|
+| `enqueue_total` | Total messages enqueued | - |
+| `enqueue_bytes_total` | Total bytes enqueued | - |
+| `fetch_total` | Total message fetches completed | - |
+| `fetch_wait_total` | Total long-poll fetch waits | - |
+| `key_package_upload_total` | Total MLS key package uploads | - |
+| `auth_login_success_total` | Successful OPAQUE login completions | - |
+| `auth_login_failure_total` | Failed login attempts | - |
+| `rate_limit_hit_total` | Rate limit rejections | - |
+
+### Gauges
+
+| Metric | Description |
+|--------|-------------|
+| `delivery_queue_depth` | Current delivery queue depth (sampled) |
+
+## Prometheus Configuration
+
+```yaml
+# prometheus.yml
+global:
+  scrape_interval: 15s
+  evaluation_interval: 15s
+
+scrape_configs:
+  - job_name: 'qpq-server'
+    static_configs:
+      - targets: ['qpq-server:9090']
+    scrape_interval: 10s
+```
+
+## Alert Rules
+
+```yaml
+# prometheus-alerts.yml
+groups:
+  - name: qpq-server
+    rules:
+      # Server down
+      - alert: QpqServerDown
+        expr: up{job="qpq-server"} == 0
+        for: 1m
+        labels:
+          severity: critical
+        annotations:
+          summary: "qpq-server is down"
+          description: "Prometheus has been unable to scrape qpq-server metrics for more than 1 minute."
+
+      # High auth failure rate (potential brute force)
+      - alert: QpqHighAuthFailureRate
+        expr: rate(auth_login_failure_total[5m]) > 10
+        for: 2m
+        labels:
+          severity: warning
+        annotations:
+          summary: "High authentication failure rate"
+          description: "{{ $value | printf \"%.1f\" }} auth failures/sec over 5 minutes."
+
+      # Rate limiting active
+      - alert: QpqRateLimitActive
+        expr: rate(rate_limit_hit_total[5m]) > 5
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Rate limiting is actively rejecting requests"
+          description: "{{ $value | printf \"%.1f\" }} rate limit hits/sec."
+
+      # Delivery queue growing
+      - alert: QpqDeliveryQueueHigh
+        expr: delivery_queue_depth > 10000
+        for: 10m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Delivery queue depth is high"
+          description: "Queue depth: {{ $value }}. Clients may not be fetching."
+
+      - alert: QpqDeliveryQueueCritical
+        expr: delivery_queue_depth > 100000
+        for: 5m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Delivery queue depth is critical"
+          description: "Queue depth: {{ $value }}. Investigate immediately."
+
+      # No enqueue activity (service may be stuck)
+      - alert: QpqNoEnqueueActivity
+        expr: rate(enqueue_total[15m]) == 0
+        for: 30m
+        labels:
+          severity: warning
+        annotations:
+          summary: "No messages enqueued for at least 30 minutes"
+          description: "Check if the service is accepting connections."
+
+      # Auth success ratio too low
+      - alert: QpqLowAuthSuccessRatio
+        expr: >
+          rate(auth_login_success_total[5m])
+          / (rate(auth_login_success_total[5m]) + rate(auth_login_failure_total[5m]))
+          < 0.5
+        for: 10m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Auth success ratio below 50%"
+          description: "More than half of login attempts are failing."
+```
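+
+To make the arithmetic behind `QpqLowAuthSuccessRatio` concrete, the Python
+sketch below recomputes the ratio from two pairs of counter samples. It is a
+simplification: PromQL's `rate()` also handles counter resets and
+extrapolates to the window edges, which this sketch ignores. The function
+names and the sample counter values are hypothetical.
+
+```python
+import math
+
+
+def rate(prev, curr, seconds):
+    """Per-second increase of a counter between two samples (no reset handling)."""
+    return (curr - prev) / seconds
+
+
+def auth_success_ratio(success_rate, failure_rate):
+    """success / (success + failure); NaN when there were no login attempts."""
+    total = success_rate + failure_rate
+    if total == 0:
+        return math.nan
+    return success_rate / total
+
+
+def should_fire(ratio, threshold=0.5):
+    # Any comparison against NaN is false, so an idle server does not alert.
+    return ratio < threshold
+
+
+# Made-up counter samples taken 300 seconds (5 minutes) apart.
+success_rate = rate(100, 130, 300)  # 30 successful logins in the window
+failure_rate = rate(10, 100, 300)   # 90 failed logins in the window
+ratio = auth_success_ratio(success_rate, failure_rate)
+print(round(ratio, 2), should_fire(ratio))
+```
+
+The idle case matters in practice: with no login attempts the expression
+divides zero by zero, and the comparison drops the result instead of firing.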
+
+## Key Dashboard Panels
+
+See `dashboards/qpq-overview.json` for the full Grafana dashboard. Key panels:
+
+### Message Throughput
+
+- **Enqueue rate**: `rate(enqueue_total[5m])`
+- **Fetch rate**: `rate(fetch_total[5m])`
+- **Enqueue bandwidth**: `rate(enqueue_bytes_total[5m])`
+
+### Authentication
+
+- **Login success rate**: `rate(auth_login_success_total[5m])`
+- **Login failure rate**: `rate(auth_login_failure_total[5m])`
+- **Success ratio**: `rate(auth_login_success_total[5m]) / (rate(auth_login_success_total[5m]) + rate(auth_login_failure_total[5m]))`
+
+### Delivery Queue
+
+- **Queue depth**: `delivery_queue_depth`
+- **Queue growth rate**: `deriv(delivery_queue_depth[10m])`
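+
+PromQL's `deriv()` estimates the per-second slope of a gauge using simple
+linear regression over the samples in the range. The Python sketch below
+shows the same least-squares calculation on a handful of invented
+`(timestamp, value)` pairs; the function name and the sample data are
+assumptions for illustration.
+
+```python
+def deriv(samples):
+    """Least-squares slope (units per second) of (timestamp, value) pairs."""
+    n = len(samples)
+    if n < 2:
+        raise ValueError("need at least two samples")
+    mean_t = sum(t for t, _ in samples) / n
+    mean_v = sum(v for _, v in samples) / n
+    # Covariance of time and value, and variance of time (both unnormalized).
+    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
+    var = sum((t - mean_t) ** 2 for t, _ in samples)
+    if var == 0:
+        raise ValueError("timestamps must not all be identical")
+    return cov / var
+
+
+# Invented queue-depth samples, one every 10 seconds, growing by 20 each time.
+queue_depth = [(0, 100), (10, 120), (20, 140), (30, 160)]
+print(deriv(queue_depth))  # 2.0 messages per second
+```
+
+A sustained positive slope suggests the queue is filling faster than clients
+drain it, even while the absolute depth is still below the alert thresholds.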
+
+### Rate Limiting
+
+- **Rate limit hits**: `rate(rate_limit_hit_total[5m])`
+
+### Infrastructure (Node Exporter)
+
+- CPU, memory, disk, and network metrics from `node_exporter`
+
+## Grafana Dashboard
+
+Import the dashboard from `dashboards/qpq-overview.json`:
+
+1. Open Grafana -> Dashboards -> Import
+2. Upload `docs/operations/dashboards/qpq-overview.json`
+3. Select your Prometheus data source
+4. Save
+
+## Log Monitoring
+
+The server uses `tracing`, with log levels controlled by the `RUST_LOG` environment variable:
+
+```bash
+# Production: info level with structured JSON output
+RUST_LOG=info
+
+# Debug specific modules
+RUST_LOG=info,quicproquo_server::node_service=debug
+
+# Verbose debugging
+RUST_LOG=debug
+```
+
+### Key Log Messages to Monitor
+
+| Log Pattern | Meaning | Action |
+|-------------|---------|--------|
+| `"TLS certificate expires within 30 days"` | Cert expiring soon | Rotate certificate |
+| `"TLS certificate is self-signed"` | Self-signed cert in use | Replace with CA-signed cert in production |
+| `"connection rate limit exceeded"` | A client IP is being rate limited | Check for abuse or a DDoS attack |
+| `"running without QPQ_AUTH_TOKEN"` | Insecure mode | Must not appear in production |
+| `"db_key is empty; SQL store will be plaintext"` | Unencrypted DB | Must not appear in production |
+| `"shutdown signal received"` | Graceful shutdown started | Expected during deploys |
+| `"generated and persisted new OPAQUE ServerSetup"` | Fresh OPAQUE setup | Expected on first start only |
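+
+One way to act on these patterns is a small scanner that flags matching
+lines. The Python sketch below is only an illustration: the severity labels,
+the helper names, and the sample log lines are assumptions, and only the
+pattern strings are taken from the table above.
+
+```python
+# Substrings copied from the table above, mapped to a suggested severity.
+# The severity labels are an assumption of this sketch, not server output.
+PATTERNS = {
+    "TLS certificate expires within 30 days": "warning",
+    "TLS certificate is self-signed": "warning",
+    "connection rate limit exceeded": "warning",
+    "running without QPQ_AUTH_TOKEN": "critical",
+    "db_key is empty; SQL store will be plaintext": "critical",
+}
+
+
+def classify(line):
+    """Return (pattern, severity) for the first matching pattern, or None."""
+    for pattern, severity in PATTERNS.items():
+        if pattern in line:
+            return pattern, severity
+    return None
+
+
+def scan(lines):
+    """Return every (pattern, severity) hit, in input order."""
+    return [hit for hit in map(classify, lines) if hit is not None]
+
+
+# Hypothetical log lines; the real log layout may differ.
+SAMPLE_LOGS = [
+    "INFO shutdown signal received",
+    "WARN running without QPQ_AUTH_TOKEN",
+    "WARN connection rate limit exceeded",
+]
+for pattern, severity in scan(SAMPLE_LOGS):
+    print(severity, pattern)
+```
+
+In a real deployment the same matching would more likely live in the log
+aggregator's alerting rules than in a standalone script.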
+
+### Log Aggregation
+
+For production, pipe logs to a log aggregator:
+
+```bash
+# Systemd -> journald -> Loki/Elasticsearch
+journalctl -u qpq-server -f --output=json | \
+  promtail --stdin --client.url=http://loki:3100/loki/api/v1/push
+
+# Docker -> Loki driver
+docker run --log-driver=loki \
+  --log-opt loki-url="http://loki:3100/loki/api/v1/push" \
+  qpq-server
+```
+
+## Health Checking
+
+The Docker image includes a basic health check that only verifies the TLS
+certificate file exists. For deeper checks, use one of the following:
+
+```bash
+# Simple: check that the process is listening on the UDP port (QUIC runs over UDP)
+ss -ulnp | grep 5001
+
+# Metrics endpoint (if enabled)
+curl -sf http://localhost:9090/metrics > /dev/null
+
+# Full client connection test
+qpq-client --server 127.0.0.1:5001 --ping
+```
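+
+The metrics endpoint check can also be scripted, for example from an
+external prober. The Python sketch below mirrors the `curl -sf` call using
+only the standard library; the function name and the default timeout are
+assumptions, and the URL in the usage note reuses the example address above.
+
+```python
+import urllib.error
+import urllib.request
+
+
+def metrics_endpoint_ok(url, timeout=2.0):
+    """Return True if a GET on the URL answers with HTTP 200, else False."""
+    try:
+        with urllib.request.urlopen(url, timeout=timeout) as resp:
+            return resp.status == 200
+    except (urllib.error.URLError, OSError):
+        # Covers refused connections, timeouts, and HTTP error statuses
+        # (urllib raises HTTPError, a URLError subclass, for 4xx and 5xx).
+        return False
+```
+
+Calling `metrics_endpoint_ok("http://localhost:9090/metrics")` returns
+`False` when metrics are disabled or the server is down, so the result can
+feed directly into a container or orchestrator health probe.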