quicproquo/docs/operations/monitoring.md
Christian Nennemann 91c5495ab7 docs: add operational runbook, Grafana dashboard, and production docker-compose
Add comprehensive operational documentation:
- docs/operations/backup-restore.md: SQLCipher, file backend, blob backup/restore
- docs/operations/key-rotation.md: auth token, TLS, federation, DB key, OPAQUE rotation
- docs/operations/incident-response.md: playbook for common incidents
- docs/operations/scaling-guide.md: resource sizing, scaling triggers, capacity planning
- docs/operations/monitoring.md: Prometheus metrics, alert rules, log monitoring
- docs/operations/dashboards/qpq-overview.json: Grafana dashboard template
- docs/operations/prometheus.yml + alerts: Prometheus scrape and alert config
- docs/operations/grafana-provisioning/: auto-provisioning for datasources and dashboards
- docker-compose.prod.yml: production stack (server + Prometheus + Grafana)
- .env.example: documented environment variable template
2026-03-04 20:30:57 +01:00


# Monitoring Guide
This document covers metrics collection, alerting, and dashboards for quicproquo.
## Enabling Metrics
The server exports Prometheus metrics via HTTP when configured:
```bash
# Environment variables
QPQ_METRICS_LISTEN=0.0.0.0:9090
QPQ_METRICS_ENABLED=true
# Or in qpq-server.toml
metrics_listen = "0.0.0.0:9090"
metrics_enabled = true
```
Metrics are served at `http://<metrics_listen>/metrics` in Prometheus exposition format.
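As a rough illustration of what a scraper sees at that endpoint, the sketch below parses unlabelled samples from the text exposition format into a dictionary. The `SAMPLE` payload and its `# HELP` text are made up for illustration (only the metric names come from this guide), and the parser deliberately skips labelled samples rather than handling the full format.

```python
def parse_metrics(text: str) -> dict:
    """Parse unlabelled samples from Prometheus text exposition format.

    Labelled samples (lines containing '{') are skipped; this is a sketch,
    not a complete parser for the format.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Skip labelled samples such as name{label="x"} 1.
        if "{" in line:
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        # parts[0] is the metric name, parts[1] the value;
        # an optional trailing timestamp is ignored.
        samples[parts[0]] = float(parts[1])
    return samples


# Hypothetical payload: metric names are from this guide, values are invented.
SAMPLE = """\
# HELP enqueue_total Total messages enqueued
# TYPE enqueue_total counter
enqueue_total 1234
# TYPE delivery_queue_depth gauge
delivery_queue_depth 17
"""

if __name__ == "__main__":
    for name, value in parse_metrics(SAMPLE).items():
        print(name, value)
```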
## Available Metrics
### Counters
| Metric | Description | Labels |
|--------|-------------|--------|
| `enqueue_total` | Total messages enqueued | - |
| `enqueue_bytes_total` | Total bytes enqueued | - |
| `fetch_total` | Total message fetches completed | - |
| `fetch_wait_total` | Total long-poll fetch waits | - |
| `key_package_upload_total` | Total MLS key package uploads | - |
| `auth_login_success_total` | Successful OPAQUE login completions | - |
| `auth_login_failure_total` | Failed login attempts | - |
| `rate_limit_hit_total` | Rate limit rejections | - |
### Gauges
| Metric | Description |
|--------|-------------|
| `delivery_queue_depth` | Current delivery queue depth (sampled) |
## Prometheus Configuration
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'qpq-server'
    static_configs:
      - targets: ['qpq-server:9090']
    scrape_interval: 10s
```
## Alert Rules
```yaml
# prometheus-alerts.yml
groups:
  - name: qpq-server
    rules:
      # Server down
      - alert: QpqServerDown
        expr: up{job="qpq-server"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "qpq-server is down"
          description: "Prometheus cannot scrape qpq-server metrics for > 1 minute."
      # High auth failure rate (potential brute force)
      - alert: QpqHighAuthFailureRate
        expr: rate(auth_login_failure_total[5m]) > 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High authentication failure rate"
          description: "{{ $value | printf \"%.1f\" }} auth failures/sec over 5 minutes."
      # Rate limiting active
      - alert: QpqRateLimitActive
        expr: rate(rate_limit_hit_total[5m]) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Rate limiting is actively rejecting requests"
          description: "{{ $value | printf \"%.1f\" }} rate limit hits/sec."
      # Delivery queue growing
      - alert: QpqDeliveryQueueHigh
        expr: delivery_queue_depth > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Delivery queue depth is high"
          description: "Queue depth: {{ $value }}. Clients may not be fetching."
      - alert: QpqDeliveryQueueCritical
        expr: delivery_queue_depth > 100000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Delivery queue depth is critical"
          description: "Queue depth: {{ $value }}. Investigate immediately."
      # No enqueue activity (service may be stuck)
      - alert: QpqNoEnqueueActivity
        expr: rate(enqueue_total[15m]) == 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "No messages enqueued in 30 minutes"
          description: "Check if the service is accepting connections."
      # Auth success ratio too low
      - alert: QpqLowAuthSuccessRatio
        expr: >
          rate(auth_login_success_total[5m])
          / (rate(auth_login_success_total[5m]) + rate(auth_login_failure_total[5m]))
          < 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Auth success ratio below 50%"
          description: "More than half of login attempts are failing."
```
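To make the arithmetic behind the rate-based rules concrete, here is a minimal sketch of a per-second counter rate and the success ratio used by `QpqLowAuthSuccessRatio`. It is a simplification: Prometheus's `rate()` also extrapolates to the window boundaries, which this version does not, and the function names are hypothetical. When there are no logins at all, the PromQL ratio is `0/0 = NaN`, which never satisfies `< 0.5`, so the alert stays silent; the sketch models that case as `None`.

```python
from typing import List, Optional, Tuple


def per_second_rate(samples: List[Tuple[float, float]]) -> float:
    """Approximate per-second increase of a counter.

    `samples` is a time-ordered list of (unix_timestamp, counter_value).
    A drop in value is treated as a counter reset (restart from zero).
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset the whole current value counts as new increase.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


def success_ratio(success_rate: float, failure_rate: float) -> Optional[float]:
    """Fraction of successful logins, or None when there was no login traffic."""
    total = success_rate + failure_rate
    if total == 0:
        return None
    return success_rate / total


if __name__ == "__main__":
    failures = [(0.0, 100.0), (60.0, 400.0), (120.0, 1300.0)]
    rate = per_second_rate(failures)
    print(f"{rate:.1f} failures/sec, fires QpqHighAuthFailureRate: {rate > 10}")
```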
## Key Dashboard Panels
See `dashboards/qpq-overview.json` for the full Grafana dashboard. Key panels:
### Message Throughput
- **Enqueue rate**: `rate(enqueue_total[5m])`
- **Fetch rate**: `rate(fetch_total[5m])`
- **Enqueue bandwidth**: `rate(enqueue_bytes_total[5m])`
### Authentication
- **Login success rate**: `rate(auth_login_success_total[5m])`
- **Login failure rate**: `rate(auth_login_failure_total[5m])`
- **Success ratio**: `rate(auth_login_success_total[5m]) / (rate(auth_login_success_total[5m]) + rate(auth_login_failure_total[5m]))`
### Delivery Queue
- **Queue depth**: `delivery_queue_depth`
- **Queue growth rate**: `deriv(delivery_queue_depth[10m])`
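Prometheus documents `deriv` as a simple linear regression over the samples in the range. The sketch below reproduces that idea as a least-squares slope and adds a hypothetical helper, `seconds_until`, that estimates when the queue would cross a threshold such as the 10000 used by `QpqDeliveryQueueHigh` if the current trend held. Both function names are illustrative and not part of the server or of Prometheus.

```python
from typing import List, Optional, Tuple


def least_squares_slope(samples: List[Tuple[float, float]]) -> float:
    """Per-second slope of (timestamp, value) samples via simple linear regression."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    return cov / var if var > 0 else 0.0


def seconds_until(samples: List[Tuple[float, float]], threshold: float) -> Optional[float]:
    """Estimated seconds until the gauge reaches `threshold` at the current slope.

    Returns 0.0 if the latest sample is already at or above the threshold,
    and None if the queue is flat or shrinking.
    """
    current = samples[-1][1]
    if current >= threshold:
        return 0.0
    slope = least_squares_slope(samples)
    if slope <= 0:
        return None
    return (threshold - current) / slope


if __name__ == "__main__":
    depth = [(0.0, 100.0), (60.0, 160.0), (120.0, 220.0)]
    print("growth per second:", least_squares_slope(depth))
    print("seconds until 10000:", seconds_until(depth, 10000.0))
```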
### Rate Limiting
- **Rate limit hits**: `rate(rate_limit_hit_total[5m])`
### Infrastructure (Node Exporter)
- CPU, memory, disk, network from `node_exporter`
## Grafana Dashboard
Import the dashboard from `dashboards/qpq-overview.json`:
1. Open Grafana -> Dashboards -> Import
2. Upload `docs/operations/dashboards/qpq-overview.json`
3. Select your Prometheus data source
4. Save
## Log Monitoring
The server logs through the `tracing` crate, filtered by the `RUST_LOG` environment variable:
```bash
# Production: info level with structured JSON output
RUST_LOG=info
# Debug specific modules
RUST_LOG=info,quicproquo_server::node_service=debug
# Verbose debugging
RUST_LOG=debug
```
### Key Log Messages to Monitor
| Log Pattern | Meaning | Action |
|-------------|---------|--------|
| `"TLS certificate expires within 30 days"` | Cert expiring soon | Rotate certificate |
| `"TLS certificate is self-signed"` | Self-signed cert in use | Replace with CA-signed cert in production |
| `"connection rate limit exceeded"` | IP being rate limited | Check for DDoS |
| `"running without QPQ_AUTH_TOKEN"` | Insecure mode | Must not appear in production |
| `"db_key is empty; SQL store will be plaintext"` | Unencrypted DB | Must not appear in production |
| `"shutdown signal received"` | Graceful shutdown started | Expected during deploys |
| `"generated and persisted new OPAQUE ServerSetup"` | Fresh OPAQUE setup | Expected on first start only |
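A small sketch of how the table above could drive an automated log scan: each pattern is matched as a plain substring and mapped to the action from the table. The surrounding log line format is an assumption (the example lines are invented), so a real deployment would likely match on structured fields instead.

```python
from typing import Iterable, List, Tuple

# Substring patterns and actions copied from the table in this guide.
LOG_PATTERNS = {
    "TLS certificate expires within 30 days": "Rotate certificate",
    "TLS certificate is self-signed": "Replace with CA-signed cert in production",
    "connection rate limit exceeded": "Check for DDoS",
    "running without QPQ_AUTH_TOKEN": "Must not appear in production",
    "db_key is empty; SQL store will be plaintext": "Must not appear in production",
    "shutdown signal received": "Expected during deploys",
    "generated and persisted new OPAQUE ServerSetup": "Expected on first start only",
}


def scan_logs(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (pattern, action) for every known pattern found in `lines`."""
    findings = []
    for line in lines:
        for pattern, action in LOG_PATTERNS.items():
            if pattern in line:
                findings.append((pattern, action))
    return findings


if __name__ == "__main__":
    # Invented example lines; only the quoted messages come from the table.
    example = [
        "2026-03-04T20:30:57Z WARN running without QPQ_AUTH_TOKEN",
        "2026-03-04T20:31:02Z INFO listening",
    ]
    for pattern, action in scan_logs(example):
        print(f"{pattern!r}: {action}")
```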
### Log Aggregation
For production, pipe logs to a log aggregator:
```bash
# Systemd -> journald -> Loki/Elasticsearch
journalctl -u qpq-server -f --output=json | \
promtail --stdin --client.url=http://loki:3100/loki/api/v1/push
# Docker -> Loki driver
docker run --log-driver=loki \
--log-opt loki-url="http://loki:3100/loki/api/v1/push" \
qpq-server
```
## Health Checking
The Docker image includes a basic health check that only verifies the TLS certificate file exists. For deeper health checks:
```bash
# Simple: check the process is running and port is open
ss -ulnp | grep 7000
# Metrics endpoint (if enabled)
curl -sf http://localhost:9090/metrics > /dev/null
# Full client connection test
qpq-client --server 127.0.0.1:7000 --auth-token "$TOKEN" --ping
```
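The metrics check can be taken one step further than a bare `curl` by confirming that expected metric names are actually present in the response. The sketch below assumes the metrics endpoint from this guide (`http://localhost:9090/metrics`); the `REQUIRED` list and function names are illustrative, and whether a counter appears before its first increment depends on the metrics library, so the required set may need adjusting.

```python
import urllib.error
import urllib.request
from typing import Iterable, List

# Illustrative subset of the metric names listed in this guide.
REQUIRED = ("enqueue_total", "fetch_total", "delivery_queue_depth")


def missing_metrics(body: str, required: Iterable[str] = REQUIRED) -> List[str]:
    """Return the required metric names that do not appear in `body`."""
    present = set()
    for line in body.splitlines():
        if line and not line.startswith("#"):
            # The metric name ends at the first '{' (labels) or space (value).
            present.add(line.split("{", 1)[0].split(" ", 1)[0])
    return [name for name in required if name not in present]


def probe(url: str = "http://localhost:9090/metrics", timeout: float = 3.0) -> bool:
    """True if the endpoint responds and exposes every required metric."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError):
        return False
    return not missing_metrics(body)


if __name__ == "__main__":
    # Exit code 0 when healthy, 1 otherwise, so it can back a container health check.
    raise SystemExit(0 if probe() else 1)
```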