quicproquo/docs/src/operations/monitoring.md
Christian Nennemann d073f614b3 docs: rewrite mdBook documentation for v2 architecture
Update 25+ files and add 6 new pages to reflect the v2 migration from
Cap'n Proto to Protobuf framing over QUIC. Integrates SDK and Operations
docs into the mdBook, restructures SUMMARY.md, and rewrites the wire
format, architecture, and protocol sections with accurate v2 content.
2026-03-04 22:02:31 +01:00


# Monitoring Guide

This document covers metrics collection, alerting, and dashboards for quicproquo server deployments.

## Enabling Metrics

The server exports Prometheus metrics via HTTP when configured:

```bash
# Environment variables
QPQ_METRICS_LISTEN=0.0.0.0:9090
QPQ_METRICS_ENABLED=true
```

```toml
# Or in qpq-server.toml
metrics_listen = "0.0.0.0:9090"
metrics_enabled = true
```

Metrics are served at `http://<metrics_listen>/metrics` in the Prometheus text exposition format.
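The exposition format is line-oriented plain text, which makes ad-hoc inspection easy. As an illustration, here is a minimal Python sketch that extracts metric values from a scrape body; the sample payload is invented for illustration, but the metric names match the tables below:

```python
# Minimal parser for the Prometheus text exposition format.
# The sample payload below is illustrative, not real server output.
sample = """\
# HELP enqueue_total Total messages enqueued
# TYPE enqueue_total counter
enqueue_total 1234
# HELP delivery_queue_depth Current delivery queue depth
# TYPE delivery_queue_depth gauge
delivery_queue_depth 42
"""

def parse_metrics(text: str) -> dict[str, float]:
    """Map metric name -> value, ignoring comment lines and labels."""
    metrics = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        # Strip any {label="..."} suffix from the metric name.
        name = name.split("{", 1)[0]
        metrics[name] = float(value)
    return metrics

print(parse_metrics(sample)["enqueue_total"])  # 1234.0
```

In practice you would feed this the body of `curl http://localhost:9090/metrics` rather than a hard-coded string.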

## Available Metrics

### Counters

| Metric | Description | Labels |
|--------|-------------|--------|
| `enqueue_total` | Total messages enqueued | - |
| `enqueue_bytes_total` | Total bytes enqueued | - |
| `fetch_total` | Total message fetches completed | - |
| `fetch_wait_total` | Total long-poll fetch waits | - |
| `key_package_upload_total` | Total MLS key package uploads | - |
| `auth_login_success_total` | Successful OPAQUE login completions | - |
| `auth_login_failure_total` | Failed login attempts | - |
| `rate_limit_hit_total` | Rate limit rejections | - |

### Gauges

| Metric | Description |
|--------|-------------|
| `delivery_queue_depth` | Current delivery queue depth (sampled) |

## Prometheus Configuration

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'qpq-server'
    static_configs:
      - targets: ['qpq-server:9090']
    scrape_interval: 10s
```

## Alert Rules

```yaml
# prometheus-alerts.yml
groups:
  - name: qpq-server
    rules:
      # Server down
      - alert: QpqServerDown
        expr: up{job="qpq-server"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "qpq-server is down"
          description: "Prometheus cannot scrape qpq-server metrics for > 1 minute."

      # High auth failure rate (potential brute force)
      - alert: QpqHighAuthFailureRate
        expr: rate(auth_login_failure_total[5m]) > 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High authentication failure rate"
          description: "{{ $value | printf \"%.1f\" }} auth failures/sec over 5 minutes."

      # Rate limiting active
      - alert: QpqRateLimitActive
        expr: rate(rate_limit_hit_total[5m]) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Rate limiting is actively rejecting requests"
          description: "{{ $value | printf \"%.1f\" }} rate limit hits/sec."

      # Delivery queue growing
      - alert: QpqDeliveryQueueHigh
        expr: delivery_queue_depth > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Delivery queue depth is high"
          description: "Queue depth: {{ $value }}. Clients may not be fetching."

      - alert: QpqDeliveryQueueCritical
        expr: delivery_queue_depth > 100000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Delivery queue depth is critical"
          description: "Queue depth: {{ $value }}. Investigate immediately."

      # No enqueue activity (service may be stuck)
      - alert: QpqNoEnqueueActivity
        expr: rate(enqueue_total[15m]) == 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "No messages enqueued in 30 minutes"
          description: "Check if the service is accepting connections."

      # Auth success ratio too low
      - alert: QpqLowAuthSuccessRatio
        expr: >
          rate(auth_login_success_total[5m])
          / (rate(auth_login_success_total[5m]) + rate(auth_login_failure_total[5m]))
          < 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Auth success ratio below 50%"
          description: "More than half of login attempts are failing."
```
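The success-ratio expression is plain arithmetic on two rates. As a sanity check, here is the same computation sketched in Python, with invented sample rates (note that PromQL yields NaN when both rates are zero, so the alert simply does not fire in that case):

```python
def auth_success_ratio(success_rate: float, failure_rate: float) -> float:
    """Mirror of the QpqLowAuthSuccessRatio expression:
    success / (success + failure), both as per-second rates."""
    total = success_rate + failure_rate
    if total == 0:
        # PromQL would produce NaN here (0/0); we treat "no attempts" as healthy.
        return 1.0
    return success_rate / total

# Example: 2 successful and 8 failed logins/sec -> ratio 0.2,
# which is below the 0.5 alert threshold.
ratio = auth_success_ratio(2.0, 8.0)
print(ratio, ratio < 0.5)  # 0.2 True
```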

## Key Dashboard Panels

See `dashboards/qpq-overview.json` for the full Grafana dashboard. Key panels:

### Message Throughput

- Enqueue rate: `rate(enqueue_total[5m])`
- Fetch rate: `rate(fetch_total[5m])`
- Enqueue bandwidth: `rate(enqueue_bytes_total[5m])`

### Authentication

- Login success rate: `rate(auth_login_success_total[5m])`
- Login failure rate: `rate(auth_login_failure_total[5m])`
- Success ratio: `rate(auth_login_success_total[5m]) / (rate(auth_login_success_total[5m]) + rate(auth_login_failure_total[5m]))`

### Delivery Queue

- Queue depth: `delivery_queue_depth`
- Queue growth rate: `deriv(delivery_queue_depth[10m])`
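`deriv()` fits a least-squares line through the sampled gauge values and reports its slope. A rough Python sketch of that computation, on invented queue-depth samples:

```python
def slope(samples: list[tuple[float, float]]) -> float:
    """Least-squares slope through (timestamp, value) pairs,
    approximating what PromQL's deriv() computes over a gauge."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den

# Invented samples: queue depth growing by 5 messages/sec over 60 seconds.
samples = [(0, 100.0), (15, 175.0), (30, 250.0), (45, 325.0), (60, 400.0)]
print(slope(samples))  # 5.0
```

A sustained positive slope alongside a flat `rate(fetch_total[5m])` is the signature of clients falling behind.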

### Rate Limiting

- Rate limit hits: `rate(rate_limit_hit_total[5m])`

### Infrastructure (Node Exporter)

- CPU, memory, disk, and network from `node_exporter`

## Grafana Dashboard

Import the dashboard from `dashboards/qpq-overview.json`:

1. Open Grafana -> Dashboards -> Import
2. Upload `docs/operations/dashboards/qpq-overview.json`
3. Select your Prometheus data source
4. Save

## Log Monitoring

The server uses the `tracing` crate; verbosity is controlled with the `RUST_LOG` environment variable:

```bash
# Production: info level with structured JSON output
RUST_LOG=info

# Debug specific modules
RUST_LOG=info,quicproquo_server::node_service=debug

# Verbose debugging
RUST_LOG=debug
```

### Key Log Messages to Monitor

| Log Pattern | Meaning | Action |
|-------------|---------|--------|
| `TLS certificate expires within 30 days` | Cert expiring soon | Rotate certificate |
| `TLS certificate is self-signed` | Self-signed cert in use | Replace with CA-signed cert in production |
| `connection rate limit exceeded` | IP being rate limited | Check for DDoS |
| `running without QPQ_AUTH_TOKEN` | Insecure mode | Must not appear in production |
| `db_key is empty; SQL store will be plaintext` | Unencrypted DB | Must not appear in production |
| `shutdown signal received` | Graceful shutdown started | Expected during deploys |
| `generated and persisted new OPAQUE ServerSetup` | Fresh OPAQUE setup | Expected on first start only |
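The table above can drive simple substring-based log alerts before a full aggregation pipeline is in place. A hedged Python sketch; the sample log lines are invented, and the pattern-to-severity mapping is an assumption for illustration, not server behaviour:

```python
# A subset of the patterns from the table above, with assumed severities.
PATTERNS = {
    "running without QPQ_AUTH_TOKEN": "critical",
    "db_key is empty; SQL store will be plaintext": "critical",
    "TLS certificate expires within 30 days": "warning",
    "connection rate limit exceeded": "warning",
}

def scan(lines: list[str]) -> list[tuple[str, str]]:
    """Return (severity, line) for every line matching a known pattern."""
    hits = []
    for line in lines:
        for pattern, severity in PATTERNS.items():
            if pattern in line:
                hits.append((severity, line))
    return hits

# Invented sample log lines for illustration.
logs = [
    "2026-03-04T22:00:00Z WARN connection rate limit exceeded for 203.0.113.7",
    "2026-03-04T22:00:05Z INFO fetch completed",
]
for severity, line in scan(logs):
    print(severity, "->", line)
```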

### Log Aggregation

For production, pipe logs to a log aggregator:

```bash
# Systemd -> journald -> Loki/Elasticsearch
journalctl -u qpq-server -f --output=json | \
  promtail --stdin --client.url=http://loki:3100/loki/api/v1/push

# Docker -> Loki driver
docker run --log-driver=loki \
  --log-opt loki-url="http://loki:3100/loki/api/v1/push" \
  qpq-server
```

## Health Checking

The Docker image includes only a basic health check, which verifies that the TLS certificate file exists. For deeper health checks:

```bash
# Simple: check the process is running and the UDP port is open
ss -ulnp | grep 5001

# Metrics endpoint (if enabled)
curl -sf http://localhost:9090/metrics > /dev/null

# Full client connection test
qpq-client --server 127.0.0.1:5001 --ping
```