# Monitoring Guide
This document covers metrics collection, alerting, and dashboards for quicprochat.
## Enabling Metrics
The server exports Prometheus metrics via HTTP when configured:
```sh
# Environment variables
QPC_METRICS_LISTEN=0.0.0.0:9090
QPC_METRICS_ENABLED=true
```

```toml
# Or in qpc-server.toml
metrics_listen = "0.0.0.0:9090"
metrics_enabled = true
```
Metrics are served at `http://<metrics_listen>/metrics` in Prometheus exposition format.
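A scrape returns plain-text exposition format, so individual counters can be extracted with standard tools. A sketch, with a heredoc-style variable and illustrative values standing in for a live `curl -s http://localhost:9090/metrics`:

```shell
# Extract one counter from Prometheus exposition-format text.
# The sample below stands in for a live scrape; values are illustrative.
sample='# HELP enqueue_total Total messages enqueued
# TYPE enqueue_total counter
enqueue_total 12345
fetch_total 678'

# Comment lines start with "#", so a field match skips them.
printf '%s\n' "$sample" | awk '$1 == "enqueue_total" { print $2 }'
# -> 12345
```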
## Available Metrics

### Counters
| Metric | Description | Labels |
|---|---|---|
| `enqueue_total` | Total messages enqueued | - |
| `enqueue_bytes_total` | Total bytes enqueued | - |
| `fetch_total` | Total message fetches completed | - |
| `fetch_wait_total` | Total long-poll fetch waits | - |
| `key_package_upload_total` | Total MLS key package uploads | - |
| `auth_login_success_total` | Successful OPAQUE login completions | - |
| `auth_login_failure_total` | Failed login attempts | - |
| `rate_limit_hit_total` | Rate limit rejections | - |
### Gauges

| Metric | Description |
|---|---|
| `delivery_queue_depth` | Current delivery queue depth (sampled) |
## Prometheus Configuration

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'qpc-server'
    static_configs:
      - targets: ['qpc-server:9090']
    scrape_interval: 10s
```
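Prometheus only evaluates alert rules from files listed under `rule_files`, so `prometheus.yml` must also reference the alerts file. A sketch, assuming `prometheus-alerts.yml` sits next to `prometheus.yml`:

```yaml
# prometheus.yml (continued)
rule_files:
  - prometheus-alerts.yml
```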
## Alert Rules

```yaml
# prometheus-alerts.yml
groups:
  - name: qpc-server
    rules:
      # Server down
      - alert: QpcServerDown
        expr: up{job="qpc-server"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "qpc-server is down"
          description: "Prometheus cannot scrape qpc-server metrics for > 1 minute."

      # High auth failure rate (potential brute force)
      - alert: QpcHighAuthFailureRate
        expr: rate(auth_login_failure_total[5m]) > 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High authentication failure rate"
          description: "{{ $value | printf \"%.1f\" }} auth failures/sec over 5 minutes."

      # Rate limiting active
      - alert: QpcRateLimitActive
        expr: rate(rate_limit_hit_total[5m]) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Rate limiting is actively rejecting requests"
          description: "{{ $value | printf \"%.1f\" }} rate limit hits/sec."

      # Delivery queue growing
      - alert: QpcDeliveryQueueHigh
        expr: delivery_queue_depth > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Delivery queue depth is high"
          description: "Queue depth: {{ $value }}. Clients may not be fetching."

      - alert: QpcDeliveryQueueCritical
        expr: delivery_queue_depth > 100000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Delivery queue depth is critical"
          description: "Queue depth: {{ $value }}. Investigate immediately."

      # No enqueue activity (service may be stuck)
      - alert: QpcNoEnqueueActivity
        expr: rate(enqueue_total[15m]) == 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "No messages enqueued in 30 minutes"
          description: "Check if the service is accepting connections."

      # Auth success ratio too low
      - alert: QpcLowAuthSuccessRatio
        expr: >
          rate(auth_login_success_total[5m])
          / (rate(auth_login_success_total[5m]) + rate(auth_login_failure_total[5m]))
          < 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Auth success ratio below 50%"
          description: "More than half of login attempts are failing."
```
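The auth success ratio expression is used by both the alert and the dashboard, so it can be worth precomputing as a Prometheus recording rule that both consume. A sketch; the rule and group names here are suggestions, not an existing convention in this project:

```yaml
# prometheus-alerts.yml (additional group)
groups:
  - name: qpc-server-recording
    rules:
      # Precomputed 5m auth success ratio, reusable in alerts and panels.
      - record: qpc:auth_login_success:ratio_5m
        expr: >
          rate(auth_login_success_total[5m])
          / (rate(auth_login_success_total[5m]) + rate(auth_login_failure_total[5m]))
```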
## Key Dashboard Panels

See `dashboards/qpc-overview.json` for the full Grafana dashboard. Key panels:

### Message Throughput

- Enqueue rate: `rate(enqueue_total[5m])`
- Fetch rate: `rate(fetch_total[5m])`
- Enqueue bandwidth: `rate(enqueue_bytes_total[5m])`

### Authentication

- Login success rate: `rate(auth_login_success_total[5m])`
- Login failure rate: `rate(auth_login_failure_total[5m])`
- Success ratio: `rate(auth_login_success_total[5m]) / (rate(auth_login_success_total[5m]) + rate(auth_login_failure_total[5m]))`

### Delivery Queue

- Queue depth: `delivery_queue_depth`
- Queue growth rate: `deriv(delivery_queue_depth[10m])`

### Rate Limiting

- Rate limit hits: `rate(rate_limit_hit_total[5m])`

### Infrastructure (Node Exporter)

- CPU, memory, disk, and network from `node_exporter`
## Grafana Dashboard

Import the dashboard from `dashboards/qpc-overview.json`:

1. Open Grafana -> Dashboards -> Import
2. Upload `docs/operations/dashboards/qpc-overview.json`
3. Select your Prometheus data source
4. Save
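For repeatable deployments, the manual import can be replaced with Grafana's file-based dashboard provisioning. A sketch; the provider name and dashboard directory are assumptions to adapt to your setup:

```yaml
# /etc/grafana/provisioning/dashboards/qpc.yml
apiVersion: 1
providers:
  - name: qpc
    type: file
    options:
      # Copy qpc-overview.json into this directory.
      path: /var/lib/grafana/dashboards
```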
## Log Monitoring

The server uses `tracing`, configured via the `RUST_LOG` environment variable:

```sh
# Production: info level with structured JSON output
RUST_LOG=info

# Debug specific modules
RUST_LOG=info,quicprochat_server::node_service=debug

# Verbose debugging
RUST_LOG=debug
```
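Under systemd, `RUST_LOG` can be set persistently with a drop-in for the `qpc-server` unit. A sketch; the drop-in file name is illustrative:

```ini
# /etc/systemd/system/qpc-server.service.d/10-logging.conf
[Service]
Environment=RUST_LOG=info
```

Apply with `systemctl daemon-reload && systemctl restart qpc-server`.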
### Key Log Messages to Monitor

| Log Pattern | Meaning | Action |
|---|---|---|
| `"TLS certificate expires within 30 days"` | Cert expiring soon | Rotate certificate |
| `"TLS certificate is self-signed"` | Self-signed cert in use | Replace with CA-signed cert in production |
| `"connection rate limit exceeded"` | IP being rate limited | Check for DDoS |
| `"running without QPC_AUTH_TOKEN"` | Insecure mode | Must not appear in production |
| `"db_key is empty; SQL store will be plaintext"` | Unencrypted DB | Must not appear in production |
| `"shutdown signal received"` | Graceful shutdown started | Expected during deploys |
| `"generated and persisted new OPAQUE ServerSetup"` | Fresh OPAQUE setup | Expected on first start only |
### Log Aggregation

For production, pipe logs to a log aggregator:

```sh
# Systemd -> journald -> Loki/Elasticsearch
journalctl -u qpc-server -f --output=json | \
  promtail --stdin --client.url=http://loki:3100/loki/api/v1/push

# Docker -> Loki driver
docker run --log-driver=loki \
  --log-opt loki-url="http://loki:3100/loki/api/v1/push" \
  qpc-server
```
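Once logs are in Loki, the patterns from the table above can be queried with LogQL. An illustrative query; the `job` label value depends on how promtail or the Loki driver labels the stream:

```logql
{job="qpc-server"} |= "connection rate limit exceeded"
```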
## Health Checking

The Docker image includes a basic health check (TLS cert file exists). For deeper health checks:

```sh
# Simple: check the process is running and port is open
ss -ulnp | grep 7000

# Metrics endpoint (if enabled)
curl -sf http://localhost:9090/metrics > /dev/null

# Full client connection test
qpc-client --server 127.0.0.1:7000 --auth-token "$TOKEN" --ping
```
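The metrics-endpoint check above can double as a container health check. A sketch for Docker Compose; the service name, image tag, and intervals are assumptions:

```yaml
services:
  qpc-server:
    image: qpc-server
    healthcheck:
      # Succeeds only if the metrics endpoint answers (requires metrics enabled).
      test: ["CMD", "curl", "-sf", "http://localhost:9090/metrics"]
      interval: 30s
      timeout: 5s
      retries: 3
```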