Monitoring Guide

This document covers metrics collection, alerting, and dashboards for quicprochat.

Enabling Metrics

The server exports Prometheus metrics via HTTP when configured:

# Environment variables
QPC_METRICS_LISTEN=0.0.0.0:9090
QPC_METRICS_ENABLED=true

# Or in qpc-server.toml
metrics_listen = "0.0.0.0:9090"
metrics_enabled = true

Metrics are served at http://<metrics_listen>/metrics in Prometheus exposition format.
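For a quick sanity check, the endpoint's output can be inspected with standard tools. A minimal sketch, using a hypothetical sample payload (real data comes from `curl -s http://localhost:9090/metrics`; the metric values shown are invented):

```shell
# Extract one metric value from Prometheus exposition-format text.
# The sample payload is illustrative only.
sample='# HELP enqueue_total Total messages enqueued
# TYPE enqueue_total counter
enqueue_total 1234
# TYPE delivery_queue_depth gauge
delivery_queue_depth 42'

# Data lines are "<name>[{labels}] <value>"; comment lines start with '#'.
printf '%s\n' "$sample" | awk '$1 == "enqueue_total" { print $2 }'
```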

Available Metrics

Counters

| Metric | Description | Labels |
|--------|-------------|--------|
| enqueue_total | Total messages enqueued | - |
| enqueue_bytes_total | Total bytes enqueued | - |
| fetch_total | Total message fetches completed | - |
| fetch_wait_total | Total long-poll fetch waits | - |
| key_package_upload_total | Total MLS key package uploads | - |
| auth_login_success_total | Successful OPAQUE login completions | - |
| auth_login_failure_total | Failed login attempts | - |
| rate_limit_hit_total | Rate limit rejections | - |

Gauges

| Metric | Description |
|--------|-------------|
| delivery_queue_depth | Current delivery queue depth (sampled) |

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'qpc-server'
    static_configs:
      - targets: ['qpc-server:9090']
    scrape_interval: 10s
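For Prometheus to evaluate the alert rules in the next section, the rules file and an Alertmanager also need to be wired into the same configuration. A sketch; the `alertmanager:9093` target is an assumption to adjust for your deployment:

```yaml
# prometheus.yml (continued)
rule_files:
  - prometheus-alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
```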

Alert Rules

# prometheus-alerts.yml
groups:
  - name: qpc-server
    rules:
      # Server down
      - alert: QpcServerDown
        expr: up{job="qpc-server"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "qpc-server is down"
          description: "Prometheus cannot scrape qpc-server metrics for > 1 minute."

      # High auth failure rate (potential brute force)
      - alert: QpcHighAuthFailureRate
        expr: rate(auth_login_failure_total[5m]) > 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High authentication failure rate"
          description: "{{ $value | printf \"%.1f\" }} auth failures/sec over 5 minutes."

      # Rate limiting active
      - alert: QpcRateLimitActive
        expr: rate(rate_limit_hit_total[5m]) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Rate limiting is actively rejecting requests"
          description: "{{ $value | printf \"%.1f\" }} rate limit hits/sec."

      # Delivery queue growing
      - alert: QpcDeliveryQueueHigh
        expr: delivery_queue_depth > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Delivery queue depth is high"
          description: "Queue depth: {{ $value }}. Clients may not be fetching."

      - alert: QpcDeliveryQueueCritical
        expr: delivery_queue_depth > 100000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Delivery queue depth is critical"
          description: "Queue depth: {{ $value }}. Investigate immediately."

      # No enqueue activity (service may be stuck)
      - alert: QpcNoEnqueueActivity
        expr: rate(enqueue_total[15m]) == 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "No messages enqueued in 30 minutes"
          description: "Check if the service is accepting connections."

      # Auth success ratio too low
      - alert: QpcLowAuthSuccessRatio
        expr: >
          rate(auth_login_success_total[5m])
          / (rate(auth_login_success_total[5m]) + rate(auth_login_failure_total[5m]))
          < 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Auth success ratio below 50%"
          description: "More than half of login attempts are failing."
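To make the success-ratio expression concrete, here is the arithmetic rate() performs, computed in awk from two hypothetical counter samples taken 300 s apart (all numbers are invented for illustration):

```shell
# rate(x[5m]) ~= (later_sample - earlier_sample) / window_seconds
awk 'BEGIN {
  window = 300                            # 5m between the two samples
  success_rate = (1090 - 1000) / window   # rate(auth_login_success_total[5m]) = 0.30/s
  failure_rate = (335  - 200)  / window   # rate(auth_login_failure_total[5m]) = 0.45/s
  printf "%.2f\n", success_rate / (success_rate + failure_rate)
}'
# prints 0.40 -- below the 0.5 threshold, so the alert would fire
```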

Key Dashboard Panels

See dashboards/qpc-overview.json for the full Grafana dashboard. Key panels:

Message Throughput

  • Enqueue rate: rate(enqueue_total[5m])
  • Fetch rate: rate(fetch_total[5m])
  • Enqueue bandwidth: rate(enqueue_bytes_total[5m])

Authentication

  • Login success rate: rate(auth_login_success_total[5m])
  • Login failure rate: rate(auth_login_failure_total[5m])
  • Success ratio: rate(auth_login_success_total[5m]) / (rate(auth_login_success_total[5m]) + rate(auth_login_failure_total[5m]))

Delivery Queue

  • Queue depth: delivery_queue_depth
  • Queue growth rate: deriv(delivery_queue_depth[10m])

Rate Limiting

  • Rate limit hits: rate(rate_limit_hit_total[5m])

Infrastructure (Node Exporter)

  • CPU, memory, disk, network from node_exporter

Grafana Dashboard

Import the dashboard from dashboards/qpc-overview.json:

  1. Open Grafana -> Dashboards -> Import
  2. Upload docs/operations/dashboards/qpc-overview.json
  3. Select your Prometheus data source
  4. Save

Log Monitoring

The server uses tracing; log verbosity is controlled via the RUST_LOG environment variable:

# Production: info level with structured JSON output
RUST_LOG=info

# Debug specific modules
RUST_LOG=info,quicprochat_server::node_service=debug

# Verbose debugging
RUST_LOG=debug
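When the server runs under systemd (the qpc-server unit used elsewhere in this guide), the log level can be pinned with a drop-in unit file. A sketch; the drop-in path and unit name are assumptions for your deployment:

```ini
# /etc/systemd/system/qpc-server.service.d/logging.conf
[Service]
Environment=RUST_LOG=info
```

Run systemctl daemon-reload and restart the unit for the change to take effect.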

Key Log Messages to Monitor

| Log Pattern | Meaning | Action |
|-------------|---------|--------|
| "TLS certificate expires within 30 days" | Cert expiring soon | Rotate certificate |
| "TLS certificate is self-signed" | Self-signed cert in use | Replace with CA-signed cert in production |
| "connection rate limit exceeded" | IP being rate limited | Check for DDoS |
| "running without QPC_AUTH_TOKEN" | Insecure mode | Must not appear in production |
| "db_key is empty; SQL store will be plaintext" | Unencrypted DB | Must not appear in production |
| "shutdown signal received" | Graceful shutdown started | Expected during deploys |
| "generated and persisted new OPAQUE ServerSetup" | Fresh OPAQUE setup | Expected on first start only |
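A scan for the must-not-appear patterns can be scripted against captured log output. A minimal sketch; the log excerpt below is illustrative, not real server output:

```shell
# Count log lines matching patterns that must never appear in production.
logs='INFO listening on 0.0.0.0:7000
WARN running without QPC_AUTH_TOKEN
INFO shutdown signal received'

printf '%s\n' "$logs" \
  | grep -c -e 'running without QPC_AUTH_TOKEN' \
            -e 'db_key is empty'
# a non-zero count here should page someone
```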

Log Aggregation

For production, pipe logs to a log aggregator:

# Systemd -> journald -> Loki/Elasticsearch
journalctl -u qpc-server -f --output=json | \
  promtail --stdin --client.url=http://loki:3100/loki/api/v1/push

# Docker -> Loki driver
docker run --log-driver=loki \
  --log-opt loki-url="http://loki:3100/loki/api/v1/push" \
  qpc-server

Health Checking

The Docker image includes a basic health check (TLS cert file exists). For deeper health checks:

# Simple: check the process is running and port is open
ss -ulnp | grep 7000

# Metrics endpoint (if enabled)
curl -sf http://localhost:9090/metrics > /dev/null

# Full client connection test
qpc-client --server 127.0.0.1:7000 --auth-token "$TOKEN" --ping
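For liveness probes that should tolerate transient failures, any of the checks above can be wrapped in a small retry helper. A sketch; retry is a hypothetical helper written here, not part of qpc tooling:

```shell
# Run a check up to $1 times with $2 seconds between attempts;
# succeed as soon as one attempt passes.
retry() {
  attempts=$1; delay=$2; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# In practice:  retry 5 2 curl -sf http://localhost:9090/metrics
retry 3 0 true && echo "healthy"
```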