# Monitoring Guide

This document covers metrics collection, alerting, and dashboards for quicprochat.

## Enabling Metrics

The server exports Prometheus metrics via HTTP when configured:
```bash
# Environment variables
QPC_METRICS_LISTEN=0.0.0.0:9090
QPC_METRICS_ENABLED=true

# Or in qpc-server.toml
metrics_listen = "0.0.0.0:9090"
metrics_enabled = true
```

Metrics are served at `http://<metrics_listen>/metrics` in Prometheus exposition format.
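The exposition format is plain text, so a health script can scrape it with `curl` and pull out a single series with standard tools. The sketch below substitutes a canned sample for a live scrape (the metric values are hypothetical):

```shell
# Hypothetical sample of what `curl -s http://localhost:9090/metrics` returns.
# HELP/TYPE comment lines precede each series; samples are `name value` pairs.
cat <<'EOF' > /tmp/qpc_metrics_sample.txt
# HELP enqueue_total Total messages enqueued
# TYPE enqueue_total counter
enqueue_total 1342
# HELP delivery_queue_depth Current delivery queue depth (sampled)
# TYPE delivery_queue_depth gauge
delivery_queue_depth 17
EOF

# Extract one metric's current value, as a health-check script might:
awk '$1 == "delivery_queue_depth" { print $2 }' /tmp/qpc_metrics_sample.txt
```

Against a live server, replace the here-doc with `curl -s http://localhost:9090/metrics`.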
## Available Metrics

### Counters

| Metric | Description | Labels |
|--------|-------------|--------|
| `enqueue_total` | Total messages enqueued | - |
| `enqueue_bytes_total` | Total bytes enqueued | - |
| `fetch_total` | Total message fetches completed | - |
| `fetch_wait_total` | Total long-poll fetch waits | - |
| `key_package_upload_total` | Total MLS key package uploads | - |
| `auth_login_success_total` | Successful OPAQUE login completions | - |
| `auth_login_failure_total` | Failed login attempts | - |
| `rate_limit_hit_total` | Rate limit rejections | - |
### Gauges

| Metric | Description |
|--------|-------------|
| `delivery_queue_depth` | Current delivery queue depth (sampled) |
## Prometheus Configuration

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'qpc-server'
    static_configs:
      - targets: ['qpc-server:9090']
    scrape_interval: 10s
```
## Alert Rules

```yaml
# prometheus-alerts.yml
groups:
  - name: qpc-server
    rules:
      # Server down
      - alert: QpcServerDown
        expr: up{job="qpc-server"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "qpc-server is down"
          description: "Prometheus cannot scrape qpc-server metrics for > 1 minute."

      # High auth failure rate (potential brute force)
      - alert: QpcHighAuthFailureRate
        expr: rate(auth_login_failure_total[5m]) > 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High authentication failure rate"
          description: "{{ $value | printf \"%.1f\" }} auth failures/sec over 5 minutes."

      # Rate limiting active
      - alert: QpcRateLimitActive
        expr: rate(rate_limit_hit_total[5m]) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Rate limiting is actively rejecting requests"
          description: "{{ $value | printf \"%.1f\" }} rate limit hits/sec."

      # Delivery queue growing
      - alert: QpcDeliveryQueueHigh
        expr: delivery_queue_depth > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Delivery queue depth is high"
          description: "Queue depth: {{ $value }}. Clients may not be fetching."

      - alert: QpcDeliveryQueueCritical
        expr: delivery_queue_depth > 100000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Delivery queue depth is critical"
          description: "Queue depth: {{ $value }}. Investigate immediately."

      # No enqueue activity (service may be stuck)
      - alert: QpcNoEnqueueActivity
        expr: rate(enqueue_total[15m]) == 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "No messages enqueued in 30 minutes"
          description: "Check if the service is accepting connections."

      # Auth success ratio too low
      - alert: QpcLowAuthSuccessRatio
        expr: >
          rate(auth_login_success_total[5m])
          / (rate(auth_login_success_total[5m]) + rate(auth_login_failure_total[5m]))
          < 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Auth success ratio below 50%"
          description: "More than half of login attempts are failing."
```
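As a sanity check on the auth success-ratio expression, here is the same arithmetic with hypothetical 5-minute rates (4 successful and 6 failed logins per second):

```shell
# Hypothetical 5m rates: 4 successful and 6 failed logins per second.
success_rate=4
failure_rate=6

# ratio = success / (success + failure); the alert fires when this drops below 0.5.
awk -v s="$success_rate" -v f="$failure_rate" \
    'BEGIN { r = s / (s + f); printf "%.2f\n", r; if (r < 0.5) print "below threshold: alert fires" }'
```

Note that when there are no login attempts at all, both rates are zero and the division yields NaN, which never compares below 0.5, so the alert stays silent rather than firing spuriously.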
## Key Dashboard Panels

See `dashboards/qpc-overview.json` for the full Grafana dashboard. Key panels:

### Message Throughput

- **Enqueue rate**: `rate(enqueue_total[5m])`
- **Fetch rate**: `rate(fetch_total[5m])`
- **Enqueue bandwidth**: `rate(enqueue_bytes_total[5m])`

### Authentication

- **Login success rate**: `rate(auth_login_success_total[5m])`
- **Login failure rate**: `rate(auth_login_failure_total[5m])`
- **Success ratio**: `rate(auth_login_success_total[5m]) / (rate(auth_login_success_total[5m]) + rate(auth_login_failure_total[5m]))`

### Delivery Queue

- **Queue depth**: `delivery_queue_depth`
- **Queue growth rate**: `deriv(delivery_queue_depth[10m])`
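The growth-rate panel is the per-second slope of the sampled depth; `deriv()` computes this as a least-squares fit over every sample in the window, but a two-point version illustrates the units. With hypothetical depths of 9000 ten minutes ago and 12000 now:

```shell
# Two hypothetical delivery_queue_depth samples, 10 minutes (600 s) apart:
d0=9000    # depth 10 minutes ago
d1=12000   # depth now
dt=600

# Slope in messages/sec, roughly what deriv(delivery_queue_depth[10m]) reports:
awk -v d0="$d0" -v d1="$d1" -v dt="$dt" 'BEGIN { printf "%.1f\n", (d1 - d0) / dt }'
```

A sustained positive slope means clients are falling behind on fetches even if the absolute depth has not yet crossed an alert threshold.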
### Rate Limiting

- **Rate limit hits**: `rate(rate_limit_hit_total[5m])`

### Infrastructure (Node Exporter)

- CPU, memory, disk, network from `node_exporter`
## Grafana Dashboard

Import the dashboard from `dashboards/qpc-overview.json`:

1. Open Grafana -> Dashboards -> Import
2. Upload `docs/operations/dashboards/qpc-overview.json`
3. Select your Prometheus data source
4. Save
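For unattended deployments, the same dashboard can be provisioned from disk instead of imported by hand, using Grafana's file-based dashboard provisioning. A minimal sketch (the provider name and watch path here are assumptions, not project conventions):

```yaml
# /etc/grafana/provisioning/dashboards/qpc.yaml
apiVersion: 1
providers:
  - name: qpc-dashboards   # arbitrary provider name
    type: file
    options:
      # Directory Grafana watches for dashboard JSON;
      # copy qpc-overview.json here.
      path: /var/lib/grafana/dashboards
```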
## Log Monitoring

The server uses `tracing`, configured through the `RUST_LOG` environment variable:

```bash
# Production: info level with structured JSON output
RUST_LOG=info

# Debug specific modules
RUST_LOG=info,quicprochat_server::node_service=debug

# Verbose debugging
RUST_LOG=debug
```
### Key Log Messages to Monitor

| Log Pattern | Meaning | Action |
|-------------|---------|--------|
| `"TLS certificate expires within 30 days"` | Cert expiring soon | Rotate certificate |
| `"TLS certificate is self-signed"` | Self-signed cert in use | Replace with CA-signed cert in production |
| `"connection rate limit exceeded"` | IP being rate limited | Check for DDoS |
| `"running without QPC_AUTH_TOKEN"` | Insecure mode | Must not appear in production |
| `"db_key is empty; SQL store will be plaintext"` | Unencrypted DB | Must not appear in production |
| `"shutdown signal received"` | Graceful shutdown started | Expected during deploys |
| `"generated and persisted new OPAQUE ServerSetup"` | Fresh OPAQUE setup | Expected on first start only |
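With JSON output these patterns can be matched mechanically. A sketch against hypothetical tracing-style JSON lines (the exact field layout is an assumption; check what your formatter actually emits):

```shell
# Hypothetical JSON log lines in the shape tracing's JSON formatter emits:
cat <<'EOF' > /tmp/qpc_log_sample.jsonl
{"timestamp":"2024-01-01T00:00:00Z","level":"INFO","fields":{"message":"listening on 0.0.0.0:7000"}}
{"timestamp":"2024-01-01T00:00:05Z","level":"WARN","fields":{"message":"TLS certificate expires within 30 days"}}
{"timestamp":"2024-01-01T00:00:09Z","level":"INFO","fields":{"message":"shutdown signal received"}}
EOF

# Surface only WARN/ERROR lines, as an alerting pipeline might:
grep -E '"level":"(WARN|ERROR)"' /tmp/qpc_log_sample.jsonl
```

The same pattern works against a live stream, e.g. `journalctl -u qpc-server -f --output=cat | grep ...`.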
### Log Aggregation

For production, pipe logs to a log aggregator:

```bash
# Systemd -> journald -> Loki/Elasticsearch
journalctl -u qpc-server -f --output=json | \
  promtail --stdin --client.url=http://loki:3100/loki/api/v1/push

# Docker -> Loki driver
docker run --log-driver=loki \
  --log-opt loki-url="http://loki:3100/loki/api/v1/push" \
  qpc-server
```
## Health Checking

The Docker image includes a basic health check (TLS cert file exists). For deeper health checks:

```bash
# Simple: check the process is running and the UDP port is open (QUIC runs over UDP)
ss -ulnp | grep 7000

# Metrics endpoint (if enabled)
curl -sf http://localhost:9090/metrics > /dev/null

# Full client connection test
qpc-client --server 127.0.0.1:7000 --auth-token "$TOKEN" --ping
```