# Scaling Guide

This document covers resource sizing, scaling triggers, and capacity planning for quicproquo deployments.

## Architecture Overview

quicproquo runs as a single-process server handling QUIC connections. Key resource consumers:

- **CPU**: TLS 1.3 handshakes (QUIC), OPAQUE PAKE authentication, message routing
- **Memory**: in-memory session state (DashMap), QUIC connection state, delivery waiters, rate-limit entries
- **Disk I/O**: SQLCipher reads/writes (WAL mode), blob storage, KT Merkle log
- **Network**: QUIC (UDP), metrics HTTP, optional WebSocket bridge
## Single-Node Sizing

### Minimum (Development / Small Team)

| Resource | Value |
|----------|-------|
| CPU | 1 vCPU |
| Memory | 512 MB |
| Disk | 10 GB SSD |
| Network | 100 Mbps |

Supports ~100 concurrent users with light message traffic.

### Recommended (Production / Small-Medium)

| Resource | Value |
|----------|-------|
| CPU | 2-4 vCPU |
| Memory | 2-4 GB |
| Disk | 50-100 GB NVMe SSD |
| Network | 1 Gbps |

Supports ~1,000-5,000 concurrent users.

### Large (High Traffic)

| Resource | Value |
|----------|-------|
| CPU | 8+ vCPU |
| Memory | 8-16 GB |
| Disk | 500 GB+ NVMe SSD (RAID 10) |
| Network | 10 Gbps |

Supports ~10,000+ concurrent users.
## Scaling Triggers

Monitor these metrics and scale when thresholds are exceeded:

| Metric | Warning | Critical | Action |
|--------|---------|----------|--------|
| CPU usage | > 70% sustained (5 min) | > 90% sustained | Add CPU or scale horizontally |
| Memory usage | > 75% | > 90% | Increase memory, check for leaks |
| Disk usage | > 70% | > 90% | Expand volume, clean old data |
| Disk I/O latency | > 5 ms p95 | > 20 ms p95 | Move to faster storage |
| `delivery_queue_depth` | > 10,000 | > 100,000 | Investigate stale queues |
| `rate_limit_hit_total` rate | > 100/min | > 1000/min | Investigate abuse, adjust limits |
| `auth_login_failure_total` rate | > 50/min | > 500/min | Potential brute-force attack |
| Connection count | > 80% of `max_concurrent_bidi_streams` | > 95% | Scale horizontally |
| TLS handshake latency | > 100 ms p95 | > 500 ms p95 | Add CPU, check network |
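The queue-depth trigger can be checked from the metrics endpoint with a small script. The helpers below are an illustrative sketch, not part of the qpq tooling; they assume only the `delivery_queue_depth` metric name from the table above and the metrics port used elsewhere in this guide.

```shell
#!/bin/sh
# queue_depth METRICS_TEXT: extract the delivery_queue_depth gauge from
# Prometheus text exposition format (metric name, space, value).
queue_depth() {
  printf '%s\n' "$1" | awk '$1 == "delivery_queue_depth" { print $2 }'
}

# check_depth VALUE: map the gauge onto the warning/critical thresholds above.
check_depth() {
  if [ "$1" -gt 100000 ]; then
    echo "CRITICAL: delivery_queue_depth=$1"
  elif [ "$1" -gt 10000 ]; then
    echo "WARNING: delivery_queue_depth=$1"
  else
    echo "OK: delivery_queue_depth=$1"
  fi
}

# Live usage (assumes the metrics listener on :9090):
# check_depth "$(queue_depth "$(curl -s http://localhost:9090/metrics)")"
```

A cron job wrapping this in an alert is a stopgap; the Prometheus alert rules shipped alongside this guide are the durable mechanism.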
## Vertical Scaling

### CPU Scaling

The server is async (Tokio) and benefits from multiple cores. QUIC TLS handshakes and OPAQUE computations are CPU-intensive.

```bash
# Check current CPU usage
top -bn1 -p $(pgrep qpq-server)

# For Docker: increase CPU limits in docker-compose.prod.yml:
# deploy:
#   resources:
#     limits:
#       cpus: '4'
```
### Memory Scaling

In-memory state scales roughly linearly with concurrent connections:

- ~2-5 KB per active QUIC connection (quinn state)
- ~200 bytes per session entry (DashMap)
- ~100 bytes per rate limit entry
- ~100 bytes per delivery waiter

```bash
# Estimate memory for 10,000 connections:
# 10,000 * 5 KB = ~50 MB for connections
# 10,000 * 500 bytes = ~5 MB for sessions/rate limits
# SQLCipher connection pool: ~50 MB (4 connections, caches)
# Base process: ~30 MB
# Total: ~135 MB + headroom = 256-512 MB minimum
```
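Using the per-connection figures above, the worked example can be turned into a one-line estimator. This is a sketch: the 80 MB constant folds in the connection-pool and base-process figures, and real usage varies with message rate and cache pressure.

```shell
#!/bin/sh
# estimate_memory_mb CONNECTIONS: rough RSS estimate from the figures above
# (~5 KB quinn state + ~500 B session/rate-limit entries per connection,
#  plus ~50 MB SQLCipher pool and ~30 MB base process, folded into 80 MB).
estimate_memory_mb() {
  echo $(( ($1 * 5500) / 1048576 + 80 ))
}

estimate_memory_mb 10000   # prints 132, close to the ~135 MB worked example
```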
### Disk I/O Scaling

SQLCipher uses WAL mode for concurrent reads. For write-heavy workloads:

```bash
# Check current I/O
iostat -x 1 5

# Move to NVMe if on spinning disk.
# Increase the WAL autocheckpoint threshold for burst writes
# (use the sqlcipher shell; a plain sqlite3 build cannot decrypt the database):
sqlcipher data/qpq.db "PRAGMA key='${QPQ_DB_KEY}'; PRAGMA wal_autocheckpoint=2000;"
```
## Horizontal Scaling

quicproquo does not yet have built-in multi-node clustering. For horizontal scaling, use these patterns:

### Load Balancer (UDP/QUIC)

Place a UDP load balancer in front of multiple qpq-server instances. Each instance runs independently with its own database.

```
                +-----------+
clients ------> |   L4 LB   | ----> qpq-server-1 (db-1)
                | (UDP/QUIC)| ----> qpq-server-2 (db-2)
                +-----------+       qpq-server-3 (db-3)
```

**Requirements:**

- Sticky sessions (by client IP or QUIC connection ID) so a client always reaches the same node
- Shared storage backend or federation between nodes
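One option for the L4 tier is nginx's `stream` module with consistent hashing on the client address for stickiness. A minimal sketch, assuming nginx is built with the stream module; the upstream addresses and timeout are illustrative, not from this deployment:

```nginx
# Hypothetical L4 config for fronting qpq-server nodes.
stream {
    upstream qpq_nodes {
        # Consistent hash on client IP keeps each client pinned to one node.
        hash $remote_addr consistent;
        server 10.0.1.1:7000;
        server 10.0.1.2:7000;
        server 10.0.1.3:7000;
    }

    server {
        listen 7000 udp;
        proxy_pass qpq_nodes;
        proxy_timeout 300s;   # match the server's QUIC idle timeout
    }
}
```

Note that nginx hashes on the client address, not the QUIC connection ID, so clients whose NAT binding changes mid-session may be re-routed to a different node.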
### Federation for Multi-Node

Enable federation to relay messages between nodes:

```toml
# qpq-server.toml on node-1
[federation]
enabled = true
domain = "node1.chat.example.com"
listen = "0.0.0.0:7001"
federation_cert = "data/federation-cert.der"
federation_key = "data/federation-key.der"
federation_ca = "data/federation-ca.der"

[[federation.peers]]
domain = "node2.chat.example.com"
address = "10.0.1.2:7001"
```
### Shared Database (PostgreSQL)

For true horizontal scaling, migrate from SQLCipher to a shared PostgreSQL instance. This is not yet implemented but is the planned approach for multi-node deployments.

```
qpq-server-1 --\
qpq-server-2 ---+--> PostgreSQL (shared)
qpq-server-3 --/
```
## Connection Tuning

The server uses these QUIC transport defaults:

| Parameter | Default | Tunable |
|-----------|---------|---------|
| Max idle timeout | 300 s (5 min) | Code change required |
| Max concurrent bidi streams | 1 per connection | Code change required |
| Max concurrent uni streams | 0 | Code change required |
| SQLCipher connection pool | 4 connections | Code change required |

For high connection counts, consider:

- Increasing the OS file descriptor limit: `ulimit -n 65536`
- Increasing UDP buffer sizes:

```bash
# /etc/sysctl.d/99-qpq.conf
net.core.rmem_max = 26214400
net.core.wmem_max = 26214400
net.core.rmem_default = 1048576
net.core.wmem_default = 1048576
```

```bash
sysctl -p /etc/sysctl.d/99-qpq.conf
```
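To confirm the tuning actually took effect, the values can be read back from `/proc` and compared against the targets above. The helper function is illustrative, not part of the qpq tooling:

```shell
#!/bin/sh
# check_buffer NAME CURRENT REQUIRED: warn when a kernel buffer is below target.
check_buffer() {
  if [ "$2" -lt "$3" ]; then
    echo "WARN: $1=$2 (want >= $3)"
  else
    echo "OK: $1=$2"
  fi
}

# Compare live values against the sysctl targets above (Linux only).
if [ -r /proc/sys/net/core/rmem_max ]; then
  check_buffer net.core.rmem_max "$(cat /proc/sys/net/core/rmem_max)" 26214400
  check_buffer net.core.wmem_max "$(cat /proc/sys/net/core/wmem_max)" 26214400
fi

# File descriptor limit for the current shell (check as the service user):
ulimit -n
```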
## Docker Resource Limits

```yaml
# docker-compose.prod.yml
services:
  server:
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 4G
        reservations:
          cpus: '2'
          memory: 1G
    ulimits:
      nofile:
        soft: 65536
        hard: 65536
```
## Load Testing

Use the included test infrastructure to benchmark:

```bash
# Build the test client
cargo build --release --bin qpq-client

# Run a concurrent connection test (example: 100 clients)
for i in $(seq 1 100); do
  qpq-client --server 127.0.0.1:7000 --auth-token "$QPQ_AUTH_TOKEN" &
done
wait

# Monitor during the load test
watch -n1 'curl -s http://localhost:9090/metrics | grep -E "enqueue_total|fetch_total|delivery_queue_depth|rate_limit"'
```
## Capacity Planning Worksheet

| Parameter | Your Value |
|-----------|------------|
| Expected concurrent users | |
| Messages per user per hour | |
| Average message size (bytes) | |
| Blob uploads per day | |
| Average blob size (MB) | |
| Data retention (days) | |

**Formulas:**

```
Storage per day     = (users * msgs/hr * 24 * avg_msg_size) + (blob_uploads * avg_blob_size)
DB growth per month = storage_per_day * 30
Memory estimate     = (concurrent_users * 5 KB) + 256 MB base
CPU estimate        = 1 vCPU per ~2,500 concurrent connections (depends on message rate)
```
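The formulas can be scripted for quick what-if runs. This sketch uses integer arithmetic and example values for the worksheet parameters; plug in your own numbers from the table above:

```shell
#!/bin/sh
# Rough capacity estimate from the worksheet formulas (integer math).
users=1000          # expected concurrent users
msgs_per_hr=20      # messages per user per hour
avg_msg_bytes=2048  # average message size (bytes)
blob_uploads=500    # blob uploads per day
avg_blob_mb=2       # average blob size (MB)

msg_bytes_per_day=$(( users * msgs_per_hr * 24 * avg_msg_bytes ))
blob_mb_per_day=$(( blob_uploads * avg_blob_mb ))
storage_mb_per_day=$(( msg_bytes_per_day / 1048576 + blob_mb_per_day ))
db_growth_mb_per_month=$(( storage_mb_per_day * 30 ))
memory_mb=$(( users * 5 / 1024 + 256 ))   # 5 KB per connection + 256 MB base

echo "storage/day:     ${storage_mb_per_day} MB"
echo "DB growth/month: ${db_growth_mb_per_month} MB"
echo "memory estimate: ${memory_mb} MB"
```

With the example values this yields roughly 1.9 GB of new storage per day, dominated by blob uploads; adjust the inputs before sizing disks.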