quicproquo/docs/operations/scaling-guide.md (commit 91c5495ab7 by Christian Nennemann, 2026-03-04)
# Scaling Guide

This document covers resource sizing, scaling triggers, and capacity planning for quicproquo deployments.

## Architecture Overview

quicproquo runs as a single-process server handling QUIC connections. Key resource consumers:

- CPU: TLS 1.3 handshakes (QUIC), OPAQUE PAKE authentication, message routing
- Memory: in-memory session state (DashMap), QUIC connection state, delivery waiters, rate limit entries
- Disk I/O: SQLCipher reads/writes (WAL mode), blob storage, KT Merkle log
- Network: QUIC (UDP), metrics HTTP, optional WebSocket bridge

## Single-Node Sizing

### Minimum (Development / Small Team)

| Resource | Value |
|----------|-------|
| CPU | 1 vCPU |
| Memory | 512 MB |
| Disk | 10 GB SSD |
| Network | 100 Mbps |

Supports ~100 concurrent users, light message traffic.

### Medium (Production)

| Resource | Value |
|----------|-------|
| CPU | 2-4 vCPU |
| Memory | 2-4 GB |
| Disk | 50-100 GB NVMe SSD |
| Network | 1 Gbps |

Supports ~1,000-5,000 concurrent users.

### Large (High Traffic)

| Resource | Value |
|----------|-------|
| CPU | 8+ vCPU |
| Memory | 8-16 GB |
| Disk | 500 GB+ NVMe SSD (RAID 10) |
| Network | 10 Gbps |

Supports ~10,000+ concurrent users.
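
For provisioning scripts, the tiers above can be collapsed into a small lookup. A sketch; the `suggest_tier` function is ours, and its cut-offs simply follow the user counts quoted for each tier:

```sh
# Map an expected concurrent-user count to one of the sizing tiers above.
# Illustrative helper; thresholds follow the ~100 / ~1,000-5,000 / ~10,000+ figures.
suggest_tier() {
  users=$1
  if [ "$users" -le 100 ]; then
    echo "minimum: 1 vCPU, 512 MB RAM, 10 GB SSD"
  elif [ "$users" -le 5000 ]; then
    echo "medium: 2-4 vCPU, 2-4 GB RAM, 50-100 GB NVMe SSD"
  else
    echo "large: 8+ vCPU, 8-16 GB RAM, 500 GB+ NVMe SSD"
  fi
}

suggest_tier 3000   # prints the medium tier
```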

## Scaling Triggers

Monitor these metrics and scale when thresholds are exceeded:

| Metric | Warning | Critical | Action |
|--------|---------|----------|--------|
| CPU usage | > 70% sustained (5 min) | > 90% sustained | Add CPU or scale horizontally |
| Memory usage | > 75% | > 90% | Increase memory, check for leaks |
| Disk usage | > 70% | > 90% | Expand volume, clean old data |
| Disk I/O latency | > 5 ms p95 | > 20 ms p95 | Move to faster storage |
| `delivery_queue_depth` | > 10,000 | > 100,000 | Investigate stale queues |
| `rate_limit_hit_total` rate | > 100/min | > 1000/min | Investigate abuse, adjust limits |
| `auth_login_failure_total` rate | > 50/min | > 500/min | Potential brute-force attack |
| Connection count | > 80% of `max_concurrent_bidi_streams` | > 95% | Scale horizontally |
| TLS handshake latency | > 100 ms p95 | > 500 ms p95 | Add CPU, check network |
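
The warning/critical split in the table can be wired into a simple alert script. A sketch, assuming the metrics endpoint on `localhost:9090` used elsewhere in this guide; the `check_level` helper and the awk extraction are illustrative:

```sh
# Classify a metric value against the warning/critical thresholds above.
check_level() {
  value=$1; warn=$2; crit=$3
  if [ "$value" -ge "$crit" ]; then
    echo critical
  elif [ "$value" -ge "$warn" ]; then
    echo warning
  else
    echo ok
  fi
}

# Example: classify delivery_queue_depth (fetch commented out; needs a running server).
# depth=$(curl -s http://localhost:9090/metrics | awk '/^delivery_queue_depth/ {print int($2)}')
check_level 25000 10000 100000   # prints: warning
```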

## Vertical Scaling

### CPU Scaling

The server is async (Tokio) and benefits from multiple cores. QUIC TLS handshakes and OPAQUE computations are CPU-intensive.

```sh
# Check current CPU usage
top -bn1 -p $(pgrep qpq-server)

# For Docker: increase CPU limits
# docker-compose.prod.yml:
#   deploy:
#     resources:
#       limits:
#         cpus: '4'
```

### Memory Scaling

In-memory state scales linearly with concurrent connections:

- ~2-5 KB per active QUIC connection (quinn state)
- ~200 bytes per session entry (DashMap)
- ~100 bytes per rate limit entry
- ~100 bytes per delivery waiter

```sh
# Estimate memory for 10,000 connections:
# 10,000 * 5 KB = ~50 MB for connections
# 10,000 * 500 bytes = ~5 MB for sessions/rate limits
# SQLCipher connection pool: ~50 MB (4 connections, caches)
# Base process: ~30 MB
# Total: ~135 MB + headroom = 256-512 MB minimum
```
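
The same estimate can be parameterized by connection count. A sketch using the per-entry sizes above; decimal units, and `estimate_mem_mb` is our name, not part of the project:

```sh
# Rough memory estimate in MB for N concurrent connections,
# using the per-entry sizes listed above (decimal KB/MB).
estimate_mem_mb() {
  conns=$1
  conn_mb=$(( conns * 5000 / 1000000 ))   # ~5 KB quinn state per connection
  sess_mb=$(( conns * 500 / 1000000 ))    # ~500 B sessions, rate limits, waiters
  pool_mb=50                              # SQLCipher pool and caches
  base_mb=30                              # base process
  echo $(( conn_mb + sess_mb + pool_mb + base_mb ))
}

estimate_mem_mb 10000   # prints: 135
```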

### Disk I/O Scaling

SQLCipher uses WAL mode for concurrent reads. For write-heavy workloads:

```sh
# Check current I/O
iostat -x 1 5

# Move to NVMe if on spinning disk
# Increase WAL autocheckpoint threshold for burst writes
sqlite3 data/qpq.db "PRAGMA key='${QPQ_DB_KEY}'; PRAGMA wal_autocheckpoint=2000;"
```

## Horizontal Scaling

quicproquo does not yet have built-in multi-node clustering. For horizontal scaling, use these patterns:

### Load Balancer (UDP/QUIC)

Place a UDP load balancer in front of multiple qpq-server instances. Each instance runs independently with its own database.

```
                +-----------+
clients ------> | L4 LB     | ----> qpq-server-1 (db-1)
                | (UDP/QUIC)| ----> qpq-server-2 (db-2)
                +-----------+ ----> qpq-server-3 (db-3)
```

Requirements:

- Sticky sessions (by client IP or QUIC connection ID) so a client always reaches the same node
- Shared storage backend or federation between nodes
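
As a concrete example, nginx's stream module can provide the sticky UDP behavior (by client IP; it cannot key on the QUIC connection ID). A sketch; addresses, ports, and the upstream name are placeholders:

```nginx
# nginx.conf, stream context -- illustrative only
stream {
    upstream qpq_nodes {
        hash $remote_addr consistent;  # sticky by client IP
        server 10.0.1.1:7000;          # qpq-server-1
        server 10.0.1.2:7000;          # qpq-server-2
        server 10.0.1.3:7000;          # qpq-server-3
    }
    server {
        listen 7000 udp;
        proxy_pass qpq_nodes;
        proxy_timeout 300s;            # align with the QUIC idle timeout
    }
}
```

Note that a client whose IP changes (e.g. mobile roaming) is re-hashed to a different node, which defeats QUIC connection migration.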

### Federation for Multi-Node

Enable federation to relay messages between nodes:

```toml
# qpq-server.toml on node-1
[federation]
enabled = true
domain = "node1.chat.example.com"
listen = "0.0.0.0:7001"
federation_cert = "data/federation-cert.der"
federation_key = "data/federation-key.der"
federation_ca = "data/federation-ca.der"

[[federation.peers]]
domain = "node2.chat.example.com"
address = "10.0.1.2:7001"
```

### Shared Database (PostgreSQL)

For true horizontal scaling, migrate from SQLCipher to a shared PostgreSQL instance. This is not yet implemented but is the planned approach for multi-node deployments.

```
qpq-server-1 --\
qpq-server-2 ---+--> PostgreSQL (shared)
qpq-server-3 --/
```

## Connection Tuning

The server has these QUIC transport defaults:

| Parameter | Default | Tunable |
|-----------|---------|---------|
| Max idle timeout | 300 s (5 min) | Code change required |
| Max concurrent bidi streams | 1 per connection | Code change required |
| Max concurrent uni streams | 0 | Code change required |
| SQLCipher connection pool | 4 connections | Code change required |

For high connection counts, consider:

- Increasing the OS file descriptor limit: `ulimit -n 65536`
- Increasing UDP buffer sizes via sysctl:

```sh
# /etc/sysctl.d/99-qpq.conf
net.core.rmem_max = 26214400
net.core.wmem_max = 26214400
net.core.rmem_default = 1048576
net.core.wmem_default = 1048576
```

Apply with `sysctl -p /etc/sysctl.d/99-qpq.conf`.
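
A quick preflight check for the descriptor limit; the target value mirrors the `ulimit` advice above, and the script itself is illustrative:

```sh
# Warn when the soft file-descriptor limit is below the recommended 65536.
target=65536
cur=$(ulimit -n)
if [ "$cur" = "unlimited" ] || [ "$cur" -ge "$target" ]; then
  echo "nofile=$cur OK"
else
  echo "nofile=$cur is below recommended $target; raise it before going live"
fi
```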

## Docker Resource Limits

```yaml
# docker-compose.prod.yml
services:
  server:
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 4G
        reservations:
          cpus: '2'
          memory: 1G
    ulimits:
      nofile:
        soft: 65536
        hard: 65536
```

## Load Testing

Use the included test infrastructure to benchmark:

```sh
# Build the test client
cargo build --release --bin qpq-client

# Run concurrent connection test (example)
for i in $(seq 1 100); do
  qpq-client --server 127.0.0.1:7000 --auth-token "$QPQ_AUTH_TOKEN" &
done
wait

# Monitor during load test
watch -n1 'curl -s http://localhost:9090/metrics | grep -E "enqueue_total|fetch_total|delivery_queue_depth|rate_limit"'
```

## Capacity Planning Worksheet

| Parameter | Your Value |
|-----------|------------|
| Expected concurrent users | |
| Messages per user per hour | |
| Average message size (bytes) | |
| Blob uploads per day | |
| Average blob size (MB) | |
| Data retention (days) | |

Formulas:

```
Storage per day     = (users * msgs/hr * 24 * avg_msg_size) + (blob_uploads * avg_blob_size)
DB growth per month = storage_per_day * 30
Memory estimate     = (concurrent_users * 5 KB) + 256 MB base
CPU estimate        = 1 vCPU per ~2,500 concurrent connections (depends on message rate)
```
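
Plugging sample numbers into these formulas (all inputs illustrative; message sizes in bytes, blob sizes in MB, decimal units throughout):

```sh
# Worked example of the capacity formulas above (illustrative inputs).
users=1000          # expected concurrent users
msgs_per_hr=20      # messages per user per hour
msg_size=2000       # average message size, bytes
blobs_per_day=200   # blob uploads per day
blob_mb=3           # average blob size, MB

msg_bytes_day=$(( users * msgs_per_hr * 24 * msg_size ))
storage_mb_day=$(( msg_bytes_day / 1000000 + blobs_per_day * blob_mb ))
echo "storage/day: ${storage_mb_day} MB"                    # 1560 MB
echo "db/month:    $(( storage_mb_day * 30 )) MB"           # 46800 MB (~47 GB)
echo "memory est:  $(( users * 5000 / 1000000 + 256 )) MB"  # 261 MB
echo "cpu est:     $(( (users + 2499) / 2500 )) vCPU"       # 1 vCPU
```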