Monitoring Guide

This document covers metrics collection, alerting, and dashboards for quicprochat.

Enabling Metrics

The server exports Prometheus metrics via HTTP when configured:

# Environment variables
QPC_METRICS_LISTEN=0.0.0.0:9090
QPC_METRICS_ENABLED=true

# Or in qpc-server.toml
metrics_listen = "0.0.0.0:9090"
metrics_enabled = true

Metrics are served at http://<metrics_listen>/metrics in Prometheus exposition format.
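For a quick sanity check, the endpoint's output can be inspected with standard tools. A minimal sketch, using a hypothetical sample payload (real data comes from `curl -s http://localhost:9090/metrics`; the metric values shown are invented):

```shell
# Extract one metric value from Prometheus exposition-format text.
# The sample payload is illustrative only.
sample='# HELP enqueue_total Total messages enqueued
# TYPE enqueue_total counter
enqueue_total 1234
# TYPE delivery_queue_depth gauge
delivery_queue_depth 42'

# Data lines are "<name>[{labels}] <value>"; comment lines start with '#'.
printf '%s\n' "$sample" | awk '$1 == "enqueue_total" { print $2 }'
```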

Available Metrics

Counters

| Metric | Description | Labels |
|--------|-------------|--------|
| enqueue_total | Total messages enqueued | - |
| enqueue_bytes_total | Total bytes enqueued | - |
| fetch_total | Total message fetches completed | - |
| fetch_wait_total | Total long-poll fetch waits | - |
| key_package_upload_total | Total MLS key package uploads | - |
| auth_login_success_total | Successful OPAQUE login completions | - |
| auth_login_failure_total | Failed login attempts | - |
| rate_limit_hit_total | Rate limit rejections | - |

Gauges

| Metric | Description |
|--------|-------------|
| delivery_queue_depth | Current delivery queue depth (sampled) |

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'qpc-server'
    static_configs:
      - targets: ['qpc-server:9090']
    scrape_interval: 10s
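For Prometheus to evaluate the alert rules in the next section, the rules file and an Alertmanager also need to be wired into the same configuration. A sketch; the `alertmanager:9093` target is an assumption to adjust for your deployment:

```yaml
# prometheus.yml (continued)
rule_files:
  - prometheus-alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
```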

Alert Rules

# prometheus-alerts.yml
groups:
  - name: qpc-server
    rules:
      # Server down
      - alert: QpcServerDown
        expr: up{job="qpc-server"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "qpc-server is down"
          description: "Prometheus cannot scrape qpc-server metrics for > 1 minute."

      # High auth failure rate (potential brute force)
      - alert: QpcHighAuthFailureRate
        expr: rate(auth_login_failure_total[5m]) > 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High authentication failure rate"
          description: "{{ $value | printf \"%.1f\" }} auth failures/sec over 5 minutes."

      # Rate limiting active
      - alert: QpcRateLimitActive
        expr: rate(rate_limit_hit_total[5m]) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Rate limiting is actively rejecting requests"
          description: "{{ $value | printf \"%.1f\" }} rate limit hits/sec."

      # Delivery queue growing
      - alert: QpcDeliveryQueueHigh
        expr: delivery_queue_depth > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Delivery queue depth is high"
          description: "Queue depth: {{ $value }}. Clients may not be fetching."

      - alert: QpcDeliveryQueueCritical
        expr: delivery_queue_depth > 100000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Delivery queue depth is critical"
          description: "Queue depth: {{ $value }}. Investigate immediately."

      # No enqueue activity (service may be stuck)
      - alert: QpcNoEnqueueActivity
        expr: rate(enqueue_total[15m]) == 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "No messages enqueued in 30 minutes"
          description: "Check if the service is accepting connections."

      # Auth success ratio too low
      - alert: QpcLowAuthSuccessRatio
        expr: >
          rate(auth_login_success_total[5m])
          / (rate(auth_login_success_total[5m]) + rate(auth_login_failure_total[5m]))
          < 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Auth success ratio below 50%"
          description: "More than half of login attempts are failing."
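To make the success-ratio expression concrete, here is the arithmetic rate() performs, computed in awk from two hypothetical counter samples taken 300 s apart (all numbers are invented for illustration):

```shell
# rate(x[5m]) ~= (later_sample - earlier_sample) / window_seconds
awk 'BEGIN {
  window = 300                            # 5m between the two samples
  success_rate = (1090 - 1000) / window   # rate(auth_login_success_total[5m]) = 0.30/s
  failure_rate = (335  - 200)  / window   # rate(auth_login_failure_total[5m]) = 0.45/s
  printf "%.2f\n", success_rate / (success_rate + failure_rate)
}'
# prints 0.40 -- below the 0.5 threshold, so the alert would fire
```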

Key Dashboard Panels

See dashboards/qpc-overview.json for the full Grafana dashboard. Key panels:

Message Throughput

  • Enqueue rate: rate(enqueue_total[5m])
  • Fetch rate: rate(fetch_total[5m])
  • Enqueue bandwidth: rate(enqueue_bytes_total[5m])

Authentication

  • Login success rate: rate(auth_login_success_total[5m])
  • Login failure rate: rate(auth_login_failure_total[5m])
  • Success ratio: rate(auth_login_success_total[5m]) / (rate(auth_login_success_total[5m]) + rate(auth_login_failure_total[5m]))

Delivery Queue

  • Queue depth: delivery_queue_depth
  • Queue growth rate: deriv(delivery_queue_depth[10m])

Rate Limiting

  • Rate limit hits: rate(rate_limit_hit_total[5m])

Infrastructure (Node Exporter)

  • CPU, memory, disk, network from node_exporter

Grafana Dashboard

Import the dashboard from dashboards/qpc-overview.json:

  1. Open Grafana -> Dashboards -> Import
  2. Upload docs/operations/dashboards/qpc-overview.json
  3. Select your Prometheus data source
  4. Save

Log Monitoring

The server uses tracing; log verbosity is controlled via the RUST_LOG environment variable:

# Production: info level with structured JSON output
RUST_LOG=info

# Debug specific modules
RUST_LOG=info,quicprochat_server::node_service=debug

# Verbose debugging
RUST_LOG=debug
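When the server runs under systemd (the qpc-server unit used elsewhere in this guide), the log level can be pinned with a drop-in unit file. A sketch; the drop-in path and unit name are assumptions for your deployment:

```ini
# /etc/systemd/system/qpc-server.service.d/logging.conf
[Service]
Environment=RUST_LOG=info
```

Run systemctl daemon-reload and restart the unit for the change to take effect.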

Key Log Messages to Monitor

| Log Pattern | Meaning | Action |
|-------------|---------|--------|
| "TLS certificate expires within 30 days" | Cert expiring soon | Rotate certificate |
| "TLS certificate is self-signed" | Self-signed cert in use | Replace with CA-signed cert in production |
| "connection rate limit exceeded" | IP being rate limited | Check for DDoS |
| "running without QPC_AUTH_TOKEN" | Insecure mode | Must not appear in production |
| "db_key is empty; SQL store will be plaintext" | Unencrypted DB | Must not appear in production |
| "shutdown signal received" | Graceful shutdown started | Expected during deploys |
| "generated and persisted new OPAQUE ServerSetup" | Fresh OPAQUE setup | Expected on first start only |
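A scan for the must-not-appear patterns can be scripted against captured log output. A minimal sketch; the log excerpt below is illustrative, not real server output:

```shell
# Count log lines matching patterns that must never appear in production.
logs='INFO listening on 0.0.0.0:7000
WARN running without QPC_AUTH_TOKEN
INFO shutdown signal received'

printf '%s\n' "$logs" \
  | grep -c -e 'running without QPC_AUTH_TOKEN' \
            -e 'db_key is empty'
# a non-zero count here should page someone
```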

Log Aggregation

For production, pipe logs to a log aggregator:

# Systemd -> journald -> Loki/Elasticsearch
journalctl -u qpc-server -f --output=json | \
  promtail --stdin --client.url=http://loki:3100/loki/api/v1/push

# Docker -> Loki driver
docker run --log-driver=loki \
  --log-opt loki-url="http://loki:3100/loki/api/v1/push" \
  qpc-server

Health Checking

The Docker image includes a basic health check (TLS cert file exists). For deeper health checks:

# Simple: check the process is running and port is open
ss -ulnp | grep 7000

# Metrics endpoint (if enabled)
curl -sf http://localhost:9090/metrics > /dev/null

# Full client connection test
qpc-client --server 127.0.0.1:7000 --auth-token "$TOKEN" --ping
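For liveness probes that should tolerate transient failures, any of the checks above can be wrapped in a small retry helper. A sketch; retry is a hypothetical helper written here, not part of qpc tooling:

```shell
# Run a check up to $1 times with $2 seconds between attempts;
# succeed as soon as one attempt passes.
retry() {
  attempts=$1; delay=$2; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# In practice:  retry 5 2 curl -sf http://localhost:9090/metrics
retry 3 0 true && echo "healthy"
```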