docs: rewrite mdBook documentation for v2 architecture

Update 25+ files and add 6 new pages to reflect the v2 migration from
Cap'n Proto to Protobuf framing over QUIC. Integrates SDK and Operations
docs into the mdBook, restructures SUMMARY.md, and rewrites the wire
format, architecture, and protocol sections with accurate v2 content.
2026-03-04 22:02:31 +01:00
parent f7a7f672b4
commit d073f614b3
31 changed files with 4423 additions and 2379 deletions

@@ -0,0 +1,233 @@
# Monitoring Guide
This document covers metrics collection, alerting, and dashboards for
quicproquo server deployments.
## Enabling Metrics
When metrics are enabled, the server exports Prometheus metrics over HTTP.
Configure this through environment variables or in `qpq-server.toml`:
```bash
# Environment variables
QPQ_METRICS_LISTEN=0.0.0.0:9090
QPQ_METRICS_ENABLED=true
# Or in qpq-server.toml
metrics_listen = "0.0.0.0:9090"
metrics_enabled = true
```
Metrics are served at `http://<metrics_listen>/metrics` in Prometheus
exposition format.
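As a quick way to confirm the endpoint returns what you expect, the exposition text can be parsed with a few lines of Python. The sketch below is illustrative only: `parse_metrics` and the `SAMPLE` payload are hypothetical, and it handles only unlabelled samples, which matches the metrics listed in this guide but not the full exposition format.

```python
# Illustrative sample of what /metrics might return; the values are made up.
SAMPLE = """\
# HELP enqueue_total Total messages enqueued
# TYPE enqueue_total counter
enqueue_total 42
# TYPE delivery_queue_depth gauge
delivery_queue_depth 7
"""


def parse_metrics(text):
    """Return {metric_name: value} for unlabelled samples in exposition text."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        # A sample line is "<name> <value> [timestamp]"; labelled series
        # (name{...}) are skipped because this sketch does not parse labels.
        if len(parts) < 2 or "{" in parts[0]:
            continue
        try:
            samples[parts[0]] = float(parts[1])
        except ValueError:
            continue
    return samples


metrics = parse_metrics(SAMPLE)
print(metrics["enqueue_total"], metrics["delivery_queue_depth"])
```

Against a live server, the text would come from the metrics URL instead of `SAMPLE`, for example via `urllib.request.urlopen`.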
## Available Metrics
### Counters
| Metric | Description | Labels |
|--------|-------------|--------|
| `enqueue_total` | Total messages enqueued | - |
| `enqueue_bytes_total` | Total bytes enqueued | - |
| `fetch_total` | Total message fetches completed | - |
| `fetch_wait_total` | Total long-poll fetch waits | - |
| `key_package_upload_total` | Total MLS key package uploads | - |
| `auth_login_success_total` | Successful OPAQUE login completions | - |
| `auth_login_failure_total` | Failed login attempts | - |
| `rate_limit_hit_total` | Rate limit rejections | - |
### Gauges
| Metric | Description |
|--------|-------------|
| `delivery_queue_depth` | Current delivery queue depth (sampled) |
## Prometheus Configuration
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'qpq-server'
    static_configs:
      - targets: ['qpq-server:9090']
    scrape_interval: 10s
```
## Alert Rules
```yaml
# prometheus-alerts.yml
groups:
  - name: qpq-server
    rules:
      # Server down
      - alert: QpqServerDown
        expr: up{job="qpq-server"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "qpq-server is down"
          description: "Prometheus cannot scrape qpq-server metrics for > 1 minute."
      # High auth failure rate (potential brute force)
      - alert: QpqHighAuthFailureRate
        expr: rate(auth_login_failure_total[5m]) > 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High authentication failure rate"
          description: "{{ $value | printf \"%.1f\" }} auth failures/sec over 5 minutes."
      # Rate limiting active
      - alert: QpqRateLimitActive
        expr: rate(rate_limit_hit_total[5m]) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Rate limiting is actively rejecting requests"
          description: "{{ $value | printf \"%.1f\" }} rate limit hits/sec."
      # Delivery queue growing
      - alert: QpqDeliveryQueueHigh
        expr: delivery_queue_depth > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Delivery queue depth is high"
          description: "Queue depth: {{ $value }}. Clients may not be fetching."
      - alert: QpqDeliveryQueueCritical
        expr: delivery_queue_depth > 100000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Delivery queue depth is critical"
          description: "Queue depth: {{ $value }}. Investigate immediately."
      # No enqueue activity (service may be stuck)
      - alert: QpqNoEnqueueActivity
        expr: rate(enqueue_total[15m]) == 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "No messages enqueued in 30 minutes"
          description: "Check if the service is accepting connections."
      # Auth success ratio too low
      - alert: QpqLowAuthSuccessRatio
        expr: >
          rate(auth_login_success_total[5m])
          / (rate(auth_login_success_total[5m]) + rate(auth_login_failure_total[5m]))
          < 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Auth success ratio below 50%"
          description: "More than half of login attempts are failing."
```
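The arithmetic behind `QpqLowAuthSuccessRatio` can be checked by hand. The sketch below mirrors the expression in Python; the function names are hypothetical and the inputs stand in for the two `rate(...)` values. It also shows why the rule stays quiet on an idle server: with no login attempts the ratio is 0/0, which is NaN, and a comparison against NaN is never true.

```python
import math


def success_ratio(success_per_sec, failure_per_sec):
    """Share of login attempts that succeeded, or NaN when there were none."""
    total = success_per_sec + failure_per_sec
    if total == 0:
        # PromQL evaluates 0/0 to NaN; Python would raise, so return NaN here.
        return math.nan
    return success_per_sec / total


def low_ratio_alert(success_per_sec, failure_per_sec, threshold=0.5):
    """True when the ratio is below the threshold used by the alert rule."""
    return success_ratio(success_per_sec, failure_per_sec) < threshold


print(low_ratio_alert(1.0, 3.0))  # ratio 0.25
print(low_ratio_alert(3.0, 1.0))  # ratio 0.75
print(low_ratio_alert(0.0, 0.0))  # no attempts, ratio is NaN
```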
## Key Dashboard Panels
See `dashboards/qpq-overview.json` for the full Grafana dashboard. Key panels:
### Message Throughput
- **Enqueue rate**: `rate(enqueue_total[5m])`
- **Fetch rate**: `rate(fetch_total[5m])`
- **Enqueue bandwidth**: `rate(enqueue_bytes_total[5m])`
### Authentication
- **Login success rate**: `rate(auth_login_success_total[5m])`
- **Login failure rate**: `rate(auth_login_failure_total[5m])`
- **Success ratio**: `rate(auth_login_success_total[5m]) / (rate(auth_login_success_total[5m]) + rate(auth_login_failure_total[5m]))`
### Delivery Queue
- **Queue depth**: `delivery_queue_depth`
- **Queue growth rate**: `deriv(delivery_queue_depth[10m])`
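
`deriv()` estimates the per-second slope of a gauge using simple linear regression over the samples in the window. The following is a minimal Python sketch of that calculation, not Prometheus's actual code; `slope_per_second` and the sample points are made up for illustration.

```python
def slope_per_second(samples):
    """Least-squares slope of (timestamp_seconds, value) pairs."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    # Slope = covariance(t, v) / variance(t).
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("timestamps must not all be identical")
    return cov / var


# Hypothetical queue depth sampled every 10 seconds.
points = [(0, 100), (10, 110), (20, 120)]
growth = slope_per_second(points)
# Seconds until the 10000 warning threshold at the current growth rate.
eta = (10000 - points[-1][1]) / growth
print(growth, eta)
```

A positive slope that persists across several windows suggests clients are not draining the queue as fast as messages arrive.
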
### Rate Limiting
- **Rate limit hits**: `rate(rate_limit_hit_total[5m])`
### Infrastructure (Node Exporter)
- CPU, memory, disk, network from `node_exporter`
## Grafana Dashboard
Import the dashboard from `dashboards/qpq-overview.json`:
1. Open Grafana -> Dashboards -> Import
2. Upload `docs/operations/dashboards/qpq-overview.json`
3. Select your Prometheus data source
4. Save
## Log Monitoring
The server logs through the `tracing` crate; verbosity is controlled by the `RUST_LOG` environment variable:
```bash
# Production: info level with structured JSON output
RUST_LOG=info
# Debug specific modules
RUST_LOG=info,quicproquo_server::node_service=debug
# Verbose debugging
RUST_LOG=debug
```
### Key Log Messages to Monitor
| Log Pattern | Meaning | Action |
|-------------|---------|--------|
| `"TLS certificate expires within 30 days"` | Cert expiring soon | Rotate certificate |
| `"TLS certificate is self-signed"` | Self-signed cert in use | Replace with CA-signed cert in production |
| `"connection rate limit exceeded"` | IP being rate limited | Check for DDoS |
| `"running without QPQ_AUTH_TOKEN"` | Insecure mode | Must not appear in production |
| `"db_key is empty; SQL store will be plaintext"` | Unencrypted DB | Must not appear in production |
| `"shutdown signal received"` | Graceful shutdown started | Expected during deploys |
| `"generated and persisted new OPAQUE ServerSetup"` | Fresh OPAQUE setup | Expected on first start only |
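
If no log aggregator is in place yet, the patterns in the table can be matched with a small script. This is a sketch that assumes each message appears verbatim as a substring of a log line; `scan_log` and the example lines are hypothetical, and the real log format may differ.

```python
# Substrings taken from the table above, mapped to the suggested action.
WATCHED = {
    "TLS certificate expires within 30 days": "Rotate certificate",
    "TLS certificate is self-signed": "Replace with CA-signed cert in production",
    "connection rate limit exceeded": "Check for DDoS",
    "running without QPQ_AUTH_TOKEN": "Must not appear in production",
    "db_key is empty; SQL store will be plaintext": "Must not appear in production",
}


def scan_log(lines):
    """Return (line_number, pattern, action) for every watched message found."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for pattern, action in WATCHED.items():
            if pattern in line:
                hits.append((number, pattern, action))
    return hits


# Hypothetical log lines; only the message text is taken from this guide.
example = [
    "INFO shutdown signal received",
    "WARN running without QPQ_AUTH_TOKEN",
    "WARN connection rate limit exceeded",
]
for hit in scan_log(example):
    print(hit)
```
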
### Log Aggregation
For production, pipe logs to a log aggregator:
```bash
# Systemd -> journald -> Loki/Elasticsearch
journalctl -u qpq-server -f --output=json | \
promtail --stdin --client.url=http://loki:3100/loki/api/v1/push
# Docker -> Loki driver
docker run --log-driver=loki \
--log-opt loki-url="http://loki:3100/loki/api/v1/push" \
qpq-server
```
## Health Checking
The Docker image includes a basic health check that only verifies the TLS
certificate file exists. For deeper health checks:
```bash
# Simple: check that a process is listening on UDP port 5001 (QUIC runs over UDP)
ss -ulnp | grep 5001
# Metrics endpoint (if enabled)
curl -sf http://localhost:9090/metrics > /dev/null
# Full client connection test
qpq-client --server 127.0.0.1:5001 --ping
```
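
Where `curl` is not available (for example in a minimal container image), the metrics probe can be approximated with the Python standard library. This is a sketch under that assumption; `metrics_healthy` is a hypothetical helper and is not part of the qpq tooling.

```python
import urllib.error
import urllib.request


def metrics_healthy(url, timeout=2.0):
    """Return True if the URL answers with a non-error HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # urlopen raises HTTPError for 4xx/5xx, similar to `curl -f`.
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, DNS failure, HTTP error, or bad URL.
        return False
```

A container health check could call `metrics_healthy("http://localhost:9090/metrics")` and exit non-zero when it returns `False`.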