quicproquo/docs/operations/scaling-guide.md (commit 91c5495ab7 by Christian Nennemann, 2026-03-04)
# Scaling Guide

This document covers resource sizing, scaling triggers, and capacity planning for quicproquo deployments.

## Architecture Overview

quicproquo runs as a single-process server handling QUIC connections. Key resource consumers:

- CPU: TLS 1.3 handshakes (QUIC), OPAQUE PAKE authentication, message routing
- Memory: in-memory session state (DashMap), QUIC connection state, delivery waiters, rate limit entries
- Disk I/O: SQLCipher reads/writes (WAL mode), blob storage, KT Merkle log
- Network: QUIC (UDP), metrics HTTP, optional WebSocket bridge

## Single-Node Sizing

### Minimum (Development / Small Team)

| Resource | Value |
|----------|-------|
| CPU | 1 vCPU |
| Memory | 512 MB |
| Disk | 10 GB SSD |
| Network | 100 Mbps |

Supports ~100 concurrent users, light message traffic.

### Medium (Production)

| Resource | Value |
|----------|-------|
| CPU | 2-4 vCPU |
| Memory | 2-4 GB |
| Disk | 50-100 GB NVMe SSD |
| Network | 1 Gbps |

Supports ~1,000-5,000 concurrent users.

### Large (High Traffic)

| Resource | Value |
|----------|-------|
| CPU | 8+ vCPU |
| Memory | 8-16 GB |
| Disk | 500 GB+ NVMe SSD (RAID 10) |
| Network | 10 Gbps |

Supports ~10,000+ concurrent users.
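
For provisioning scripts, the tiers above can be collapsed into a small lookup. A sketch; the `suggest_tier` function is ours, and its cut-offs simply follow the user counts quoted for each tier:

```sh
# Map an expected concurrent-user count to one of the sizing tiers above.
# Illustrative helper; thresholds follow the ~100 / ~1,000-5,000 / ~10,000+ figures.
suggest_tier() {
  users=$1
  if [ "$users" -le 100 ]; then
    echo "minimum: 1 vCPU, 512 MB RAM, 10 GB SSD"
  elif [ "$users" -le 5000 ]; then
    echo "medium: 2-4 vCPU, 2-4 GB RAM, 50-100 GB NVMe SSD"
  else
    echo "large: 8+ vCPU, 8-16 GB RAM, 500 GB+ NVMe SSD"
  fi
}

suggest_tier 3000   # prints the medium tier
```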

## Scaling Triggers

Monitor these metrics and scale when thresholds are exceeded:

| Metric | Warning | Critical | Action |
|--------|---------|----------|--------|
| CPU usage | > 70% sustained (5 min) | > 90% sustained | Add CPU or scale horizontally |
| Memory usage | > 75% | > 90% | Increase memory, check for leaks |
| Disk usage | > 70% | > 90% | Expand volume, clean old data |
| Disk I/O latency | > 5 ms p95 | > 20 ms p95 | Move to faster storage |
| `delivery_queue_depth` | > 10,000 | > 100,000 | Investigate stale queues |
| `rate_limit_hit_total` rate | > 100/min | > 1000/min | Investigate abuse, adjust limits |
| `auth_login_failure_total` rate | > 50/min | > 500/min | Potential brute-force attack |
| Connection count | > 80% of `max_concurrent_bidi_streams` | > 95% | Scale horizontally |
| TLS handshake latency | > 100 ms p95 | > 500 ms p95 | Add CPU, check network |
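
The warning/critical split in the table can be wired into a simple alert script. A sketch, assuming the metrics endpoint on `localhost:9090` used elsewhere in this guide; the `check_level` helper and the awk extraction are illustrative:

```sh
# Classify a metric value against the warning/critical thresholds above.
check_level() {
  value=$1; warn=$2; crit=$3
  if [ "$value" -ge "$crit" ]; then
    echo critical
  elif [ "$value" -ge "$warn" ]; then
    echo warning
  else
    echo ok
  fi
}

# Example: classify delivery_queue_depth (fetch commented out; needs a running server).
# depth=$(curl -s http://localhost:9090/metrics | awk '/^delivery_queue_depth/ {print int($2)}')
check_level 25000 10000 100000   # prints: warning
```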

## Vertical Scaling

### CPU Scaling

The server is async (Tokio) and benefits from multiple cores. QUIC TLS handshakes and OPAQUE computations are CPU-intensive.

```sh
# Check current CPU usage
top -bn1 -p $(pgrep qpq-server)

# For Docker: increase CPU limits
# docker-compose.prod.yml:
#   deploy:
#     resources:
#       limits:
#         cpus: '4'
```

### Memory Scaling

In-memory state scales linearly with concurrent connections:

- ~2-5 KB per active QUIC connection (quinn state)
- ~200 bytes per session entry (DashMap)
- ~100 bytes per rate limit entry
- ~100 bytes per delivery waiter

```sh
# Estimate memory for 10,000 connections:
# 10,000 * 5 KB = ~50 MB for connections
# 10,000 * 500 bytes = ~5 MB for sessions/rate limits
# SQLCipher connection pool: ~50 MB (4 connections, caches)
# Base process: ~30 MB
# Total: ~135 MB + headroom = 256-512 MB minimum
```
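
The same estimate can be parameterized by connection count. A sketch using the per-entry sizes above; decimal units, and `estimate_mem_mb` is our name, not part of the project:

```sh
# Rough memory estimate in MB for N concurrent connections,
# using the per-entry sizes listed above (decimal KB/MB).
estimate_mem_mb() {
  conns=$1
  conn_mb=$(( conns * 5000 / 1000000 ))   # ~5 KB quinn state per connection
  sess_mb=$(( conns * 500 / 1000000 ))    # ~500 B sessions, rate limits, waiters
  pool_mb=50                              # SQLCipher pool and caches
  base_mb=30                              # base process
  echo $(( conn_mb + sess_mb + pool_mb + base_mb ))
}

estimate_mem_mb 10000   # prints: 135
```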

### Disk I/O Scaling

SQLCipher uses WAL mode for concurrent reads. For write-heavy workloads:

```sh
# Check current I/O
iostat -x 1 5

# Move to NVMe if on spinning disk
# Increase WAL autocheckpoint threshold for burst writes
sqlite3 data/qpq.db "PRAGMA key='${QPQ_DB_KEY}'; PRAGMA wal_autocheckpoint=2000;"
```

## Horizontal Scaling

quicproquo does not yet have built-in multi-node clustering. For horizontal scaling, use these patterns:

### Load Balancer (UDP/QUIC)

Place a UDP load balancer in front of multiple qpq-server instances. Each instance runs independently with its own database.

```
                +-----------+
clients ------> | L4 LB     | ----> qpq-server-1 (db-1)
                | (UDP/QUIC)| ----> qpq-server-2 (db-2)
                +-----------+ ----> qpq-server-3 (db-3)
```

Requirements:

- Sticky sessions (by client IP or QUIC connection ID) so a client always reaches the same node
- Shared storage backend or federation between nodes
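
As a concrete example, nginx's stream module can provide the sticky UDP behavior (by client IP; it cannot key on the QUIC connection ID). A sketch; addresses, ports, and the upstream name are placeholders:

```nginx
# nginx.conf, stream context -- illustrative only
stream {
    upstream qpq_nodes {
        hash $remote_addr consistent;  # sticky by client IP
        server 10.0.1.1:7000;          # qpq-server-1
        server 10.0.1.2:7000;          # qpq-server-2
        server 10.0.1.3:7000;          # qpq-server-3
    }
    server {
        listen 7000 udp;
        proxy_pass qpq_nodes;
        proxy_timeout 300s;            # align with the QUIC idle timeout
    }
}
```

Note that a client whose IP changes (e.g. mobile roaming) is re-hashed to a different node, which defeats QUIC connection migration.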

### Federation for Multi-Node

Enable federation to relay messages between nodes:

```toml
# qpq-server.toml on node-1
[federation]
enabled = true
domain = "node1.chat.example.com"
listen = "0.0.0.0:7001"
federation_cert = "data/federation-cert.der"
federation_key = "data/federation-key.der"
federation_ca = "data/federation-ca.der"

[[federation.peers]]
domain = "node2.chat.example.com"
address = "10.0.1.2:7001"
```

### Shared Database (PostgreSQL)

For true horizontal scaling, migrate from SQLCipher to a shared PostgreSQL instance. This is not yet implemented but is the planned approach for multi-node deployments.

```
qpq-server-1 --\
qpq-server-2 ---+--> PostgreSQL (shared)
qpq-server-3 --/
```

## Connection Tuning

The server has these QUIC transport defaults:

| Parameter | Default | Tunable |
|-----------|---------|---------|
| Max idle timeout | 300 s (5 min) | Code change required |
| Max concurrent bidi streams | 1 per connection | Code change required |
| Max concurrent uni streams | 0 | Code change required |
| SQLCipher connection pool | 4 connections | Code change required |

For high connection counts, consider:

- Increasing the OS file descriptor limit: `ulimit -n 65536`
- Increasing UDP buffer sizes via sysctl:

```sh
# /etc/sysctl.d/99-qpq.conf
net.core.rmem_max = 26214400
net.core.wmem_max = 26214400
net.core.rmem_default = 1048576
net.core.wmem_default = 1048576
```

Apply with `sysctl -p /etc/sysctl.d/99-qpq.conf`.
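
A quick preflight check for the descriptor limit; the target value mirrors the `ulimit` advice above, and the script itself is illustrative:

```sh
# Warn when the soft file-descriptor limit is below the recommended 65536.
target=65536
cur=$(ulimit -n)
if [ "$cur" = "unlimited" ] || [ "$cur" -ge "$target" ]; then
  echo "nofile=$cur OK"
else
  echo "nofile=$cur is below recommended $target; raise it before going live"
fi
```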

## Docker Resource Limits

```yaml
# docker-compose.prod.yml
services:
  server:
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 4G
        reservations:
          cpus: '2'
          memory: 1G
    ulimits:
      nofile:
        soft: 65536
        hard: 65536
```

## Load Testing

Use the included test infrastructure to benchmark:

```sh
# Build the test client
cargo build --release --bin qpq-client

# Run concurrent connection test (example)
for i in $(seq 1 100); do
  qpq-client --server 127.0.0.1:7000 --auth-token "$QPQ_AUTH_TOKEN" &
done
wait

# Monitor during load test
watch -n1 'curl -s http://localhost:9090/metrics | grep -E "enqueue_total|fetch_total|delivery_queue_depth|rate_limit"'
```

## Capacity Planning Worksheet

| Parameter | Your Value |
|-----------|------------|
| Expected concurrent users | |
| Messages per user per hour | |
| Average message size (bytes) | |
| Blob uploads per day | |
| Average blob size (MB) | |
| Data retention (days) | |

Formulas:

```
Storage per day     = (users * msgs/hr * 24 * avg_msg_size) + (blob_uploads * avg_blob_size)
DB growth per month = storage_per_day * 30
Memory estimate     = (concurrent_users * 5 KB) + 256 MB base
CPU estimate        = 1 vCPU per ~2,500 concurrent connections (depends on message rate)
```
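
Plugging sample numbers into these formulas (all inputs illustrative; message sizes in bytes, blob sizes in MB, decimal units throughout):

```sh
# Worked example of the capacity formulas above (illustrative inputs).
users=1000          # expected concurrent users
msgs_per_hr=20      # messages per user per hour
msg_size=2000       # average message size, bytes
blobs_per_day=200   # blob uploads per day
blob_mb=3           # average blob size, MB

msg_bytes_day=$(( users * msgs_per_hr * 24 * msg_size ))
storage_mb_day=$(( msg_bytes_day / 1000000 + blobs_per_day * blob_mb ))
echo "storage/day: ${storage_mb_day} MB"                    # 1560 MB
echo "db/month:    $(( storage_mb_day * 30 )) MB"           # 46800 MB (~47 GB)
echo "memory est:  $(( users * 5000 / 1000000 + 256 )) MB"  # 261 MB
echo "cpu est:     $(( (users + 2499) / 2500 )) vCPU"       # 1 vCPU
```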