Monitoring Guide
This document covers metrics collection, alerting, and dashboards for quicproquo.
Enabling Metrics
The server exports Prometheus metrics via HTTP when configured:
# Environment variables
QPQ_METRICS_LISTEN=0.0.0.0:9090
QPQ_METRICS_ENABLED=true
# Or in qpq-server.toml
metrics_listen = "0.0.0.0:9090"
metrics_enabled = true
Metrics are served at http://<metrics_listen>/metrics in Prometheus exposition format.
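To spot-check the endpoint outside Prometheus, the exposition text can be parsed by hand. The following is a minimal sketch, assuming the metrics are unlabelled as listed in this guide; `parse_metrics` and the sample payload (including its HELP text) are illustrative and not taken from the server.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse unlabelled samples from Prometheus text exposition format."""
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        name, _, rest = line.partition(" ")
        # This sketch handles bare metric names only; labelled samples are skipped.
        if "{" in name:
            continue
        fields = rest.split()
        if not fields:
            continue
        # The value is the first field; an optional timestamp may follow it.
        samples[name] = float(fields[0])
    return samples


# Hypothetical payload shaped like a scrape of the /metrics endpoint.
SAMPLE = """\
# HELP enqueue_total Total messages enqueued
# TYPE enqueue_total counter
enqueue_total 1027
# TYPE delivery_queue_depth gauge
delivery_queue_depth 42
"""

metrics = parse_metrics(SAMPLE)
print(metrics["enqueue_total"], metrics["delivery_queue_depth"])
```

In practice the text would come from an HTTP GET against the metrics address rather than a literal string.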
Available Metrics
Counters
| Metric | Description | Labels |
|---|---|---|
| enqueue_total | Total messages enqueued | - |
| enqueue_bytes_total | Total bytes enqueued | - |
| fetch_total | Total message fetches completed | - |
| fetch_wait_total | Total long-poll fetch waits | - |
| key_package_upload_total | Total MLS key package uploads | - |
| auth_login_success_total | Successful OPAQUE login completions | - |
| auth_login_failure_total | Failed login attempts | - |
| rate_limit_hit_total | Rate limit rejections | - |
Gauges
| Metric | Description |
|---|---|
| delivery_queue_depth | Current delivery queue depth (sampled) |
Prometheus Configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'qpq-server'
    static_configs:
      - targets: ['qpq-server:9090']
    scrape_interval: 10s
Alert Rules
# prometheus-alerts.yml
groups:
  - name: qpq-server
    rules:
      # Server down
      - alert: QpqServerDown
        expr: up{job="qpq-server"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "qpq-server is down"
          description: "Prometheus cannot scrape qpq-server metrics for > 1 minute."
      # High auth failure rate (potential brute force)
      - alert: QpqHighAuthFailureRate
        expr: rate(auth_login_failure_total[5m]) > 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High authentication failure rate"
          description: "{{ $value | printf \"%.1f\" }} auth failures/sec over 5 minutes."
      # Rate limiting active
      - alert: QpqRateLimitActive
        expr: rate(rate_limit_hit_total[5m]) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Rate limiting is actively rejecting requests"
          description: "{{ $value | printf \"%.1f\" }} rate limit hits/sec."
      # Delivery queue growing
      - alert: QpqDeliveryQueueHigh
        expr: delivery_queue_depth > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Delivery queue depth is high"
          description: "Queue depth: {{ $value }}. Clients may not be fetching."
      - alert: QpqDeliveryQueueCritical
        expr: delivery_queue_depth > 100000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Delivery queue depth is critical"
          description: "Queue depth: {{ $value }}. Investigate immediately."
      # No enqueue activity (service may be stuck)
      - alert: QpqNoEnqueueActivity
        expr: rate(enqueue_total[15m]) == 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "No messages enqueued in 30 minutes"
          description: "Check if the service is accepting connections."
      # Auth success ratio too low
      - alert: QpqLowAuthSuccessRatio
        expr: >
          rate(auth_login_success_total[5m])
          / (rate(auth_login_success_total[5m]) + rate(auth_login_failure_total[5m]))
          < 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Auth success ratio below 50%"
          description: "More than half of login attempts are failing."
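The QpqLowAuthSuccessRatio expression divides the success rate by the total login rate, so it is worth checking what happens when there is no login traffic at all. The sketch below mirrors that arithmetic under the assumption that both rates are plain per-second floats. The function names are hypothetical, and the `for: 10m` hold period is not modelled.

```python
import math


def auth_success_ratio(success_rate: float, failure_rate: float) -> float:
    """Ratio of successful logins to all login attempts (NaN when idle)."""
    total = success_rate + failure_rate
    if total == 0:
        # PromQL yields NaN for 0/0; Python would raise, so return NaN explicitly.
        return math.nan
    return success_rate / total


def low_ratio_condition(success_rate: float, failure_rate: float,
                        threshold: float = 0.5) -> bool:
    """True when the ratio is below the threshold used by the alert rule."""
    # Any comparison with NaN is False, so an idle server never matches.
    return auth_success_ratio(success_rate, failure_rate) < threshold


print(low_ratio_condition(1.0, 3.0))  # 25% of logins succeed
print(low_ratio_condition(0.0, 0.0))  # no login traffic at all
```

The idle case matters because a quiet period should not page anyone: with no attempts in the window the comparison does not match, so only real failing traffic can trigger the rule.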
Key Dashboard Panels
See dashboards/qpq-overview.json for the full Grafana dashboard. Key panels:
Message Throughput
- Enqueue rate: rate(enqueue_total[5m])
- Fetch rate: rate(fetch_total[5m])
- Enqueue bandwidth: rate(enqueue_bytes_total[5m])
Authentication
- Login success rate: rate(auth_login_success_total[5m])
- Login failure rate: rate(auth_login_failure_total[5m])
- Success ratio: rate(auth_login_success_total[5m]) / (rate(auth_login_success_total[5m]) + rate(auth_login_failure_total[5m]))
Delivery Queue
- Queue depth: delivery_queue_depth
- Queue growth rate: deriv(delivery_queue_depth[10m])
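Prometheus documents `deriv()` as a per-second derivative computed by simple linear regression over the samples in the range. The sketch below shows that calculation on a handful of queue-depth samples; `slope_per_second` and the sample values are illustrative assumptions, not server output.

```python
def slope_per_second(samples: list[tuple[float, float]]) -> float:
    """Least-squares slope of (timestamp_seconds, value) pairs."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    # Slope = covariance(t, v) / variance(t).
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("timestamps must not all be equal")
    return cov / var


# Hypothetical queue depth sampled every 10 s, growing by 50 messages per sample.
samples = [(0.0, 1000.0), (10.0, 1050.0), (20.0, 1100.0), (30.0, 1150.0)]
print(slope_per_second(samples))  # messages per second
```

A sustained positive slope means messages are arriving faster than clients fetch them, which is usually visible before the absolute depth thresholds are reached.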
Rate Limiting
- Rate limit hits: rate(rate_limit_hit_total[5m])
Infrastructure (Node Exporter)
- CPU, memory, disk, network from node_exporter
Grafana Dashboard
Import the dashboard from dashboards/qpq-overview.json:
- Open Grafana -> Dashboards -> Import
- Upload docs/operations/dashboards/qpq-overview.json
- Select your Prometheus data source
- Save
Log Monitoring
The server logs through the tracing crate; verbosity is controlled with the RUST_LOG environment variable:
# Production: info level with structured JSON output
RUST_LOG=info
# Debug specific modules
RUST_LOG=info,quicproquo_server::node_service=debug
# Verbose debugging
RUST_LOG=debug
Key Log Messages to Monitor
| Log Pattern | Meaning | Action |
|---|---|---|
| "TLS certificate expires within 30 days" | Cert expiring soon | Rotate certificate |
| "TLS certificate is self-signed" | Self-signed cert in use | Replace with CA-signed cert in production |
| "connection rate limit exceeded" | IP being rate limited | Check for DDoS |
| "running without QPQ_AUTH_TOKEN" | Insecure mode | Must not appear in production |
| "db_key is empty; SQL store will be plaintext" | Unencrypted DB | Must not appear in production |
| "shutdown signal received" | Graceful shutdown started | Expected during deploys |
| "generated and persisted new OPAQUE ServerSetup" | Fresh OPAQUE setup | Expected on first start only |
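The actionable patterns in the table can be checked with a simple substring scan. This is a minimal sketch: `LOG_ACTIONS` and `scan_logs` are hypothetical names, and the sample log lines are invented for illustration, since the real line format depends on how the tracing subscriber is configured.

```python
# Patterns and actions copied from the table above (actionable rows only).
LOG_ACTIONS = {
    "TLS certificate expires within 30 days": "Rotate certificate",
    "TLS certificate is self-signed": "Replace with CA-signed cert in production",
    "connection rate limit exceeded": "Check for DDoS",
    "running without QPQ_AUTH_TOKEN": "Must not appear in production",
    "db_key is empty; SQL store will be plaintext": "Must not appear in production",
}


def scan_logs(lines):
    """Return (pattern, action) for every log line containing a known pattern."""
    findings = []
    for line in lines:
        for pattern, action in LOG_ACTIONS.items():
            if pattern in line:
                findings.append((pattern, action))
    return findings


# Hypothetical log lines; only the quoted message text comes from this guide.
sample = [
    "INFO quicproquo_server: shutdown signal received",
    "WARN quicproquo_server: running without QPQ_AUTH_TOKEN",
]
for pattern, action in scan_logs(sample):
    print(f"{pattern}: {action}")
```

In a real deployment the same matching would more likely live in the log aggregator's alerting rules; the script form is only convenient for ad hoc checks against a captured log file.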
Log Aggregation
For production, pipe logs to a log aggregator:
# Systemd -> journald -> Loki/Elasticsearch
journalctl -u qpq-server -f --output=json | \
promtail --stdin --client.url=http://loki:3100/loki/api/v1/push
# Docker -> Loki driver
docker run --log-driver=loki \
--log-opt loki-url="http://loki:3100/loki/api/v1/push" \
qpq-server
Health Checking
The Docker image includes a basic health check (TLS cert file exists). For deeper health checks:
# Simple: check the process is running and port is open
ss -ulnp | grep 7000
# Metrics endpoint (if enabled)
curl -sf http://localhost:9090/metrics > /dev/null
# Full client connection test
qpq-client --server 127.0.0.1:7000 --auth-token "$TOKEN" --ping
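The metrics check above can also be scripted for use in an orchestrator or cron probe. The following is a sketch using only the Python standard library, assuming the metrics listener is reachable at the address configured earlier; `metrics_endpoint_healthy` is a hypothetical helper, not part of the project.

```python
import http.client
import urllib.request


def metrics_endpoint_healthy(url: str = "http://localhost:9090/metrics",
                             timeout: float = 2.0) -> bool:
    """Return True if the metrics endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, refused connections and timeouts are all OSError
        # subclasses; HTTPException covers malformed responses.
        return False
```

This only proves that the metrics listener is up. It says nothing about the QUIC port, so it complements the client ping rather than replacing it.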