# Monitoring Guide

This document covers metrics collection, alerting, and dashboards for quicprochat.

## Enabling Metrics

The server exports Prometheus metrics via HTTP when configured:

```bash
# Environment variables
QPC_METRICS_LISTEN=0.0.0.0:9090
QPC_METRICS_ENABLED=true

# Or in qpc-server.toml
metrics_listen = "0.0.0.0:9090"
metrics_enabled = true
```

Metrics are served at `/metrics` on the configured listen address (e.g. `http://localhost:9090/metrics`) in Prometheus exposition format.

## Available Metrics

### Counters

| Metric | Description | Labels |
|--------|-------------|--------|
| `enqueue_total` | Total messages enqueued | - |
| `enqueue_bytes_total` | Total bytes enqueued | - |
| `fetch_total` | Total message fetches completed | - |
| `fetch_wait_total` | Total long-poll fetch waits | - |
| `key_package_upload_total` | Total MLS key package uploads | - |
| `auth_login_success_total` | Successful OPAQUE login completions | - |
| `auth_login_failure_total` | Failed login attempts | - |
| `rate_limit_hit_total` | Rate limit rejections | - |

### Gauges

| Metric | Description |
|--------|-------------|
| `delivery_queue_depth` | Current delivery queue depth (sampled) |

## Prometheus Configuration

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'qpc-server'
    static_configs:
      - targets: ['qpc-server:9090']
    scrape_interval: 10s
```

## Alert Rules

```yaml
# prometheus-alerts.yml
groups:
  - name: qpc-server
    rules:
      # Server down
      - alert: QpcServerDown
        expr: up{job="qpc-server"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "qpc-server is down"
          description: "Prometheus cannot scrape qpc-server metrics for > 1 minute."

      # High auth failure rate (potential brute force)
      - alert: QpcHighAuthFailureRate
        expr: rate(auth_login_failure_total[5m]) > 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High authentication failure rate"
          description: "{{ $value | printf \"%.1f\" }} auth failures/sec over 5 minutes."

      # Rate limiting active
      - alert: QpcRateLimitActive
        expr: rate(rate_limit_hit_total[5m]) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Rate limiting is actively rejecting requests"
          description: "{{ $value | printf \"%.1f\" }} rate limit hits/sec."

      # Delivery queue growing
      - alert: QpcDeliveryQueueHigh
        expr: delivery_queue_depth > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Delivery queue depth is high"
          description: "Queue depth: {{ $value }}. Clients may not be fetching."

      - alert: QpcDeliveryQueueCritical
        expr: delivery_queue_depth > 100000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Delivery queue depth is critical"
          description: "Queue depth: {{ $value }}. Investigate immediately."

      # No enqueue activity (service may be stuck)
      - alert: QpcNoEnqueueActivity
        expr: rate(enqueue_total[15m]) == 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "No messages enqueued in 30 minutes"
          description: "Check if the service is accepting connections."

      # Auth success ratio too low
      - alert: QpcLowAuthSuccessRatio
        expr: >
          rate(auth_login_success_total[5m])
          / (rate(auth_login_success_total[5m]) + rate(auth_login_failure_total[5m]))
          < 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Auth success ratio below 50%"
          description: "More than half of login attempts are failing."
```

## Key Dashboard Panels

See `dashboards/qpc-overview.json` for the full Grafana dashboard.
Key panels:

### Message Throughput

- **Enqueue rate**: `rate(enqueue_total[5m])`
- **Fetch rate**: `rate(fetch_total[5m])`
- **Enqueue bandwidth**: `rate(enqueue_bytes_total[5m])`

### Authentication

- **Login success rate**: `rate(auth_login_success_total[5m])`
- **Login failure rate**: `rate(auth_login_failure_total[5m])`
- **Success ratio**: `rate(auth_login_success_total[5m]) / (rate(auth_login_success_total[5m]) + rate(auth_login_failure_total[5m]))`

### Delivery Queue

- **Queue depth**: `delivery_queue_depth`
- **Queue growth rate**: `deriv(delivery_queue_depth[10m])`

### Rate Limiting

- **Rate limit hits**: `rate(rate_limit_hit_total[5m])`

### Infrastructure (Node Exporter)

- CPU, memory, disk, network from `node_exporter`

## Grafana Dashboard

Import the dashboard from `dashboards/qpc-overview.json`:

1. Open Grafana -> Dashboards -> Import
2. Upload `docs/operations/dashboards/qpc-overview.json`
3. Select your Prometheus data source
4. Save

## Log Monitoring

The server uses `tracing`, with verbosity controlled by the `RUST_LOG` environment variable:

```bash
# Production: info level with structured JSON output
RUST_LOG=info

# Debug specific modules
RUST_LOG=info,quicprochat_server::node_service=debug

# Verbose debugging
RUST_LOG=debug
```

### Key Log Messages to Monitor

| Log Pattern | Meaning | Action |
|-------------|---------|--------|
| `"TLS certificate expires within 30 days"` | Cert expiring soon | Rotate certificate |
| `"TLS certificate is self-signed"` | Self-signed cert in use | Replace with CA-signed cert in production |
| `"connection rate limit exceeded"` | IP being rate limited | Check for DDoS |
| `"running without QPC_AUTH_TOKEN"` | Insecure mode | Must not appear in production |
| `"db_key is empty; SQL store will be plaintext"` | Unencrypted DB | Must not appear in production |
| `"shutdown signal received"` | Graceful shutdown started | Expected during deploys |
| `"generated and persisted new OPAQUE ServerSetup"` | Fresh OPAQUE setup | Expected on first start only |
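The "must not appear in production" rows above lend themselves to an automated check. Below is a minimal POSIX-sh sketch (the `scan_logs` helper name is illustrative, not part of the qpc tooling) that flags those patterns in a log stream and returns non-zero if an insecure-mode message is seen:

```bash
# scan_logs: read log lines on stdin, print flagged lines,
# and return 1 if any "must not appear in production" pattern appears.
scan_logs() {
  found=0
  while IFS= read -r line; do
    case "$line" in
      # Insecure configurations: fail the check.
      *"running without QPC_AUTH_TOKEN"*|*"db_key is empty; SQL store will be plaintext"*)
        echo "INSECURE: $line"
        found=1
        ;;
      # Certificate warnings: surface them, but do not fail.
      *"TLS certificate expires within 30 days"*|*"TLS certificate is self-signed"*)
        echo "WARN: $line"
        ;;
    esac
  done
  return "$found"
}
```

Fed from `journalctl -u qpc-server --since today | scan_logs`, a non-zero exit status can gate a deploy or page an operator.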
### Log Aggregation

For production, pipe logs to a log aggregator:

```bash
# Systemd -> journald -> Loki/Elasticsearch
journalctl -u qpc-server -f --output=json | \
  promtail --stdin --client.url=http://loki:3100/loki/api/v1/push

# Docker -> Loki driver
docker run --log-driver=loki \
  --log-opt loki-url="http://loki:3100/loki/api/v1/push" \
  qpc-server
```

## Health Checking

The Docker image includes a basic health check (TLS cert file exists). For deeper health checks:

```bash
# Simple: check the process is running and port is open
ss -ulnp | grep 7000

# Metrics endpoint (if enabled)
curl -sf http://localhost:9090/metrics > /dev/null

# Full client connection test
qpc-client --server 127.0.0.1:7000 --auth-token "$TOKEN" --ping
```
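The metrics-endpoint check above can be extended into a threshold probe on `delivery_queue_depth`, matching the warning-level queue alert. A minimal sketch, assuming the Prometheus exposition format described earlier (the `queue_depth_ok` helper name and its wiring are illustrative, not part of the qpc tooling):

```bash
# queue_depth_ok: read Prometheus exposition text on stdin; succeed only
# if delivery_queue_depth is present and at or below the threshold in $1.
queue_depth_ok() {
  threshold="$1"
  # Exposition lines look like: delivery_queue_depth 42
  # Comment lines (# HELP / # TYPE) have "#" in field 1 and never match.
  depth=$(awk '$1 == "delivery_queue_depth" { print $2 }')
  if [ -z "$depth" ]; then
    echo "delivery_queue_depth not found in metrics output" >&2
    return 2
  fi
  # Compare numerically via awk so float-formatted gauge values also work.
  awk -v d="$depth" -v t="$threshold" 'BEGIN { exit (d + 0 > t + 0) ? 1 : 0 }'
}
```

Wired into the earlier curl check, `curl -sf http://localhost:9090/metrics | queue_depth_ok 10000` fails when the gauge crosses the same 10000 boundary the warning alert uses, which makes it usable as a deeper container or load-balancer health probe.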