quicproquo/docs/src/operations/monitoring.md
Christian Nennemann d073f614b3 docs: rewrite mdBook documentation for v2 architecture
Update 25+ files and add 6 new pages to reflect the v2 migration from
Cap'n Proto to Protobuf framing over QUIC. Integrates SDK and Operations
docs into the mdBook, restructures SUMMARY.md, and rewrites the wire
format, architecture, and protocol sections with accurate v2 content.
2026-03-04 22:02:31 +01:00


# Monitoring Guide

This document covers metrics collection, alerting, and dashboards for quicproquo server deployments.

## Enabling Metrics

The server exports Prometheus metrics via HTTP when configured:

```bash
# Environment variables
QPQ_METRICS_LISTEN=0.0.0.0:9090
QPQ_METRICS_ENABLED=true
```

```toml
# Or in qpq-server.toml
metrics_listen = "0.0.0.0:9090"
metrics_enabled = true
```

Metrics are served at `http://<metrics_listen>/metrics` in the Prometheus text exposition format.
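The exposition format is line-oriented plain text, which makes ad-hoc inspection easy. As an illustration, here is a minimal Python sketch that extracts metric values from a scrape body; the sample payload is invented for illustration, but the metric names match the tables below:

```python
# Minimal parser for the Prometheus text exposition format.
# The sample payload below is illustrative, not real server output.
sample = """\
# HELP enqueue_total Total messages enqueued
# TYPE enqueue_total counter
enqueue_total 1234
# HELP delivery_queue_depth Current delivery queue depth
# TYPE delivery_queue_depth gauge
delivery_queue_depth 42
"""

def parse_metrics(text: str) -> dict[str, float]:
    """Map metric name -> value, ignoring comment lines and labels."""
    metrics = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        # Strip any {label="..."} suffix from the metric name.
        name = name.split("{", 1)[0]
        metrics[name] = float(value)
    return metrics

print(parse_metrics(sample)["enqueue_total"])  # 1234.0
```

In practice you would feed this the body of `curl http://localhost:9090/metrics` rather than a hard-coded string.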

## Available Metrics

### Counters

| Metric | Description | Labels |
|--------|-------------|--------|
| `enqueue_total` | Total messages enqueued | - |
| `enqueue_bytes_total` | Total bytes enqueued | - |
| `fetch_total` | Total message fetches completed | - |
| `fetch_wait_total` | Total long-poll fetch waits | - |
| `key_package_upload_total` | Total MLS key package uploads | - |
| `auth_login_success_total` | Successful OPAQUE login completions | - |
| `auth_login_failure_total` | Failed login attempts | - |
| `rate_limit_hit_total` | Rate limit rejections | - |

### Gauges

| Metric | Description |
|--------|-------------|
| `delivery_queue_depth` | Current delivery queue depth (sampled) |

## Prometheus Configuration

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'qpq-server'
    static_configs:
      - targets: ['qpq-server:9090']
    scrape_interval: 10s
```

## Alert Rules

```yaml
# prometheus-alerts.yml
groups:
  - name: qpq-server
    rules:
      # Server down
      - alert: QpqServerDown
        expr: up{job="qpq-server"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "qpq-server is down"
          description: "Prometheus cannot scrape qpq-server metrics for > 1 minute."

      # High auth failure rate (potential brute force)
      - alert: QpqHighAuthFailureRate
        expr: rate(auth_login_failure_total[5m]) > 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High authentication failure rate"
          description: "{{ $value | printf \"%.1f\" }} auth failures/sec over 5 minutes."

      # Rate limiting active
      - alert: QpqRateLimitActive
        expr: rate(rate_limit_hit_total[5m]) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Rate limiting is actively rejecting requests"
          description: "{{ $value | printf \"%.1f\" }} rate limit hits/sec."

      # Delivery queue growing
      - alert: QpqDeliveryQueueHigh
        expr: delivery_queue_depth > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Delivery queue depth is high"
          description: "Queue depth: {{ $value }}. Clients may not be fetching."

      - alert: QpqDeliveryQueueCritical
        expr: delivery_queue_depth > 100000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Delivery queue depth is critical"
          description: "Queue depth: {{ $value }}. Investigate immediately."

      # No enqueue activity (service may be stuck)
      - alert: QpqNoEnqueueActivity
        expr: rate(enqueue_total[15m]) == 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "No messages enqueued in 30 minutes"
          description: "Check if the service is accepting connections."

      # Auth success ratio too low
      - alert: QpqLowAuthSuccessRatio
        expr: >
          rate(auth_login_success_total[5m])
          / (rate(auth_login_success_total[5m]) + rate(auth_login_failure_total[5m]))
          < 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Auth success ratio below 50%"
          description: "More than half of login attempts are failing."
```
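The success-ratio expression is plain arithmetic on two rates. As a sanity check, here is the same computation sketched in Python, with invented sample rates (note that PromQL yields NaN when both rates are zero, so the alert simply does not fire in that case):

```python
def auth_success_ratio(success_rate: float, failure_rate: float) -> float:
    """Mirror of the QpqLowAuthSuccessRatio expression:
    success / (success + failure), both as per-second rates."""
    total = success_rate + failure_rate
    if total == 0:
        # PromQL would produce NaN here (0/0); we treat "no attempts" as healthy.
        return 1.0
    return success_rate / total

# Example: 2 successful and 8 failed logins/sec -> ratio 0.2,
# which is below the 0.5 alert threshold.
ratio = auth_success_ratio(2.0, 8.0)
print(ratio, ratio < 0.5)  # 0.2 True
```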

## Key Dashboard Panels

See `dashboards/qpq-overview.json` for the full Grafana dashboard. Key panels:

### Message Throughput

- Enqueue rate: `rate(enqueue_total[5m])`
- Fetch rate: `rate(fetch_total[5m])`
- Enqueue bandwidth: `rate(enqueue_bytes_total[5m])`

### Authentication

- Login success rate: `rate(auth_login_success_total[5m])`
- Login failure rate: `rate(auth_login_failure_total[5m])`
- Success ratio: `rate(auth_login_success_total[5m]) / (rate(auth_login_success_total[5m]) + rate(auth_login_failure_total[5m]))`

### Delivery Queue

- Queue depth: `delivery_queue_depth`
- Queue growth rate: `deriv(delivery_queue_depth[10m])`
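`deriv()` fits a least-squares line through the sampled gauge values and reports its slope. A rough Python sketch of that computation, on invented queue-depth samples:

```python
def slope(samples: list[tuple[float, float]]) -> float:
    """Least-squares slope through (timestamp, value) pairs,
    approximating what PromQL's deriv() computes over a gauge."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den

# Invented samples: queue depth growing by 5 messages/sec over 60 seconds.
samples = [(0, 100.0), (15, 175.0), (30, 250.0), (45, 325.0), (60, 400.0)]
print(slope(samples))  # 5.0
```

A sustained positive slope alongside a flat `rate(fetch_total[5m])` is the signature of clients falling behind.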

### Rate Limiting

- Rate limit hits: `rate(rate_limit_hit_total[5m])`

### Infrastructure (Node Exporter)

- CPU, memory, disk, and network from `node_exporter`

## Grafana Dashboard

Import the dashboard from `dashboards/qpq-overview.json`:

1. Open Grafana -> Dashboards -> Import
2. Upload `docs/operations/dashboards/qpq-overview.json`
3. Select your Prometheus data source
4. Save

## Log Monitoring

The server uses the `tracing` crate; verbosity is controlled with the `RUST_LOG` environment variable:

```bash
# Production: info level with structured JSON output
RUST_LOG=info

# Debug specific modules
RUST_LOG=info,quicproquo_server::node_service=debug

# Verbose debugging
RUST_LOG=debug
```

### Key Log Messages to Monitor

| Log Pattern | Meaning | Action |
|-------------|---------|--------|
| `TLS certificate expires within 30 days` | Cert expiring soon | Rotate certificate |
| `TLS certificate is self-signed` | Self-signed cert in use | Replace with CA-signed cert in production |
| `connection rate limit exceeded` | IP being rate limited | Check for DDoS |
| `running without QPQ_AUTH_TOKEN` | Insecure mode | Must not appear in production |
| `db_key is empty; SQL store will be plaintext` | Unencrypted DB | Must not appear in production |
| `shutdown signal received` | Graceful shutdown started | Expected during deploys |
| `generated and persisted new OPAQUE ServerSetup` | Fresh OPAQUE setup | Expected on first start only |
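The table above can drive simple substring-based log alerts before a full aggregation pipeline is in place. A hedged Python sketch; the sample log lines are invented, and the pattern-to-severity mapping is an assumption for illustration, not server behaviour:

```python
# A subset of the patterns from the table above, with assumed severities.
PATTERNS = {
    "running without QPQ_AUTH_TOKEN": "critical",
    "db_key is empty; SQL store will be plaintext": "critical",
    "TLS certificate expires within 30 days": "warning",
    "connection rate limit exceeded": "warning",
}

def scan(lines: list[str]) -> list[tuple[str, str]]:
    """Return (severity, line) for every line matching a known pattern."""
    hits = []
    for line in lines:
        for pattern, severity in PATTERNS.items():
            if pattern in line:
                hits.append((severity, line))
    return hits

# Invented sample log lines for illustration.
logs = [
    "2026-03-04T22:00:00Z WARN connection rate limit exceeded for 203.0.113.7",
    "2026-03-04T22:00:05Z INFO fetch completed",
]
for severity, line in scan(logs):
    print(severity, "->", line)
```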

### Log Aggregation

For production, pipe logs to a log aggregator:

```bash
# Systemd -> journald -> Loki/Elasticsearch
journalctl -u qpq-server -f --output=json | \
  promtail --stdin --client.url=http://loki:3100/loki/api/v1/push

# Docker -> Loki driver
docker run --log-driver=loki \
  --log-opt loki-url="http://loki:3100/loki/api/v1/push" \
  qpq-server
```

## Health Checking

The Docker image includes only a basic health check, which verifies that the TLS certificate file exists. For deeper health checks:

```bash
# Simple: check the process is running and the UDP port is open
ss -ulnp | grep 5001

# Metrics endpoint (if enabled)
curl -sf http://localhost:9090/metrics > /dev/null

# Full client connection test
qpq-client --server 127.0.0.1:5001 --ping
```