# Scaling Guide
This document covers resource sizing, scaling triggers, and capacity planning for quicprochat deployments.
## Architecture Overview
quicprochat runs as a single-process server handling QUIC connections. Key resource consumers:
- **CPU**: TLS 1.3 handshakes (QUIC), OPAQUE PAKE authentication, message routing
- **Memory**: In-memory session state (DashMap), QUIC connection state, delivery waiters, rate limit entries
- **Disk I/O**: SQLCipher reads/writes (WAL mode), blob storage, KT Merkle log
- **Network**: QUIC (UDP), metrics HTTP, optional WebSocket bridge
## Single-Node Sizing
### Minimum (Development / Small Team)
| Resource | Value |
|----------|-------|
| CPU | 1 vCPU |
| Memory | 512 MB |
| Disk | 10 GB SSD |
| Network | 100 Mbps |
Supports ~100 concurrent users, light message traffic.
### Recommended (Production / Small-Medium)
| Resource | Value |
|----------|-------|
| CPU | 2-4 vCPU |
| Memory | 2-4 GB |
| Disk | 50-100 GB NVMe SSD |
| Network | 1 Gbps |
Supports ~1,000-5,000 concurrent users.
### Large (High Traffic)
| Resource | Value |
|----------|-------|
| CPU | 8+ vCPU |
| Memory | 8-16 GB |
| Disk | 500 GB+ NVMe SSD (RAID 10) |
| Network | 10 Gbps |
Supports ~10,000+ concurrent users.
## Scaling Triggers
Monitor these metrics and scale when thresholds are exceeded:
| Metric | Warning | Critical | Action |
|--------|---------|----------|--------|
| CPU usage | > 70% sustained (5 min) | > 90% sustained | Add CPU or scale horizontally |
| Memory usage | > 75% | > 90% | Increase memory, check for leaks |
| Disk usage | > 70% | > 90% | Expand volume, clean old data |
| Disk I/O latency | > 5 ms p95 | > 20 ms p95 | Move to faster storage |
| `delivery_queue_depth` | > 10,000 | > 100,000 | Investigate stale queues |
| `rate_limit_hit_total` rate | > 100/min | > 1000/min | Investigate abuse, adjust limits |
| `auth_login_failure_total` rate | > 50/min | > 500/min | Potential brute force attack |
| Connection count | > 80% of `max_concurrent_bidi_streams` | > 95% | Scale horizontally |
| TLS handshake latency | > 100 ms p95 | > 500 ms p95 | Add CPU, check network |
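If you alert with Prometheus, warning thresholds from the table can be expressed as alert rules. A sketch for two of them; group and alert names are hypothetical, and the `rate()` expression converts the counter to a per-minute rate:

```yaml
# Sketch of Prometheus alert rules matching the warning column above.
groups:
  - name: qpc-capacity            # hypothetical group name
    rules:
      - alert: DeliveryQueueDepthWarning
        expr: delivery_queue_depth > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "delivery_queue_depth above 10k; investigate stale queues"
      - alert: AuthLoginFailureSpike
        expr: rate(auth_login_failure_total[5m]) * 60 > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "auth login failures above 50/min; possible brute force"
```

Adapt metric names and thresholds to your existing alert-rules file before deploying.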
## Vertical Scaling
### CPU Scaling
The server is async (Tokio) and benefits from multiple cores. QUIC TLS handshakes and OPAQUE computations are CPU-intensive.
```bash
# Check current CPU usage
top -bn1 -p $(pgrep qpc-server)

# For Docker: increase CPU limits in docker-compose.prod.yml:
#   deploy:
#     resources:
#       limits:
#         cpus: '4'
```
### Memory Scaling
In-memory state scales linearly with concurrent connections:
- ~2-5 KB per active QUIC connection (quinn state)
- ~200 bytes per session entry (DashMap)
- ~100 bytes per rate limit entry
- ~100 bytes per delivery waiter
```bash
# Estimate memory for 10,000 connections:
# 10,000 * 5 KB = ~50 MB for connections
# 10,000 * 500 bytes = ~5 MB for sessions/rate limits
# SQLCipher connection pool: ~50 MB (4 connections, caches)
# Base process: ~30 MB
# Total: ~135 MB + headroom = 256-512 MB minimum
```
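The estimate above can be scripted. A minimal sketch using the per-entry figures from this section; the connection count is illustrative:

```bash
# Back-of-envelope memory estimate using the per-entry figures above.
CONNECTIONS=10000
CONN_KB=5         # quinn connection state, KB each
ENTRY_BYTES=500   # session + rate-limit + delivery-waiter entries, bytes each
POOL_MB=50        # SQLCipher pool and page caches
BASE_MB=30        # base process footprint

TOTAL_MB=$(( CONNECTIONS * CONN_KB / 1024 + CONNECTIONS * ENTRY_BYTES / 1024 / 1024 + POOL_MB + BASE_MB ))
echo "Estimated RSS: ~${TOTAL_MB} MB (add headroom: 256-512 MB minimum)"
```

Integer division makes this slightly conservative compared to the hand calculation above (~132 MB vs. ~135 MB); either way, the headroom dominates.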
### Disk I/O Scaling
SQLCipher uses WAL mode for concurrent reads. For write-heavy workloads:
```bash
# Check current I/O
iostat -x 1 5

# Move to NVMe if on spinning disk.
# Note: wal_autocheckpoint is per-connection, so it must be set by the
# server's own connections; running it from the CLI only affects that session.
sqlcipher data/qpc.db "PRAGMA key='${QPC_DB_KEY}'; PRAGMA wal_autocheckpoint=2000;"
```
## Horizontal Scaling
quicprochat does not yet have built-in multi-node clustering. For horizontal scaling, use these patterns:
### Load Balancer (UDP/QUIC)
Place a UDP load balancer in front of multiple qpc-server instances. Each instance runs independently with its own database.
```
                +------------+
clients ------> |   L4 LB    | ----> qpc-server-1 (db-1)
                | (UDP/QUIC) | ----> qpc-server-2 (db-2)
                +------------+ ----> qpc-server-3 (db-3)
```
**Requirements:**
- Sticky sessions (by client IP or QUIC connection ID) so a client always reaches the same node
- Shared storage backend or federation between nodes
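One way to get a sticky UDP layer-4 balancer is nginx's `stream` module, hashing by client IP so each client keeps hitting the same node. A sketch with hypothetical node addresses:

```nginx
# nginx L4 load balancer sketch (stream module; addresses are hypothetical).
stream {
    upstream qpc_nodes {
        hash $remote_addr consistent;   # sticky by client IP
        server 10.0.1.1:7000;
        server 10.0.1.2:7000;
        server 10.0.1.3:7000;
    }
    server {
        listen 7000 udp;
        proxy_pass qpc_nodes;
    }
}
```

Note that IP hashing breaks stickiness for clients behind a shared NAT that change source address; balancing on the QUIC connection ID avoids this but requires a QUIC-aware balancer.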
### Federation for Multi-Node
Enable federation to relay messages between nodes:
```toml
# qpc-server.toml on node-1
[federation]
enabled = true
domain = "node1.chat.example.com"
listen = "0.0.0.0:7001"
federation_cert = "data/federation-cert.der"
federation_key = "data/federation-key.der"
federation_ca = "data/federation-ca.der"
[[federation.peers]]
domain = "node2.chat.example.com"
address = "10.0.1.2:7001"
```
### Shared Database (PostgreSQL)
For true horizontal scaling, migrate from SQLCipher to a shared PostgreSQL instance. This is not yet implemented but is the planned approach for multi-node deployments.
```
qpc-server-1 --\
qpc-server-2 ---+--> PostgreSQL (shared)
qpc-server-3 --/
```
## Connection Tuning
The server has these QUIC transport defaults:
| Parameter | Default | Tunable |
|-----------|---------|---------|
| Max idle timeout | 300s (5 min) | Code change required |
| Max concurrent bidi streams | 1 per connection | Code change required |
| Max concurrent uni streams | 0 | Code change required |
| SQLCipher connection pool | 4 connections | Code change required |
For high connection counts, consider:
- Increasing the OS file descriptor limit: `ulimit -n 65536`
- Increasing UDP buffer sizes:
```bash
# /etc/sysctl.d/99-qpc.conf
net.core.rmem_max = 26214400
net.core.wmem_max = 26214400
net.core.rmem_default = 1048576
net.core.wmem_default = 1048576
```
```bash
sysctl -p /etc/sysctl.d/99-qpc.conf
```
## Docker Resource Limits
```yaml
# docker-compose.prod.yml
services:
  server:
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 4G
        reservations:
          cpus: '2'
          memory: 1G
    ulimits:
      nofile:
        soft: 65536
        hard: 65536
```
## Load Testing
Use the included test infrastructure to benchmark:
```bash
# Build the test client
cargo build --release --bin qpc-client

# Run concurrent connection test (example)
for i in $(seq 1 100); do
  qpc-client --server 127.0.0.1:7000 --auth-token "$QPC_AUTH_TOKEN" &
done
wait

# Monitor during load test
watch -n1 'curl -s http://localhost:9090/metrics | grep -E "enqueue_total|fetch_total|delivery_queue_depth|rate_limit"'
```
## Capacity Planning Worksheet
| Parameter | Your Value |
|-----------|-----------|
| Expected concurrent users | |
| Messages per user per hour | |
| Average message size (bytes) | |
| Blob uploads per day | |
| Average blob size (MB) | |
| Data retention (days) | |
**Formulas:**
```
Storage per day = (users * msgs/hr * 24 * avg_msg_size) + (blob_uploads * avg_blob_size)
DB growth per month = storage_per_day * 30
Memory estimate = (concurrent_users * 5 KB) + 256 MB base
CPU estimate = 1 vCPU per ~2,500 concurrent connections (depends on message rate)
```