# Scaling Guide

This document covers resource sizing, scaling triggers, and capacity planning for quicprochat deployments.

## Architecture Overview

quicprochat runs as a single-process server handling QUIC connections. Key resource consumers:

- **CPU**: TLS 1.3 handshakes (QUIC), OPAQUE PAKE authentication, message routing
- **Memory**: In-memory session state (DashMap), QUIC connection state, delivery waiters, rate limit entries
- **Disk I/O**: SQLCipher reads/writes (WAL mode), blob storage, KT Merkle log
- **Network**: QUIC (UDP), metrics HTTP, optional WebSocket bridge

## Single-Node Sizing

### Minimum (Development / Small Team)

| Resource | Value |
|----------|-------|
| CPU | 1 vCPU |
| Memory | 512 MB |
| Disk | 10 GB SSD |
| Network | 100 Mbps |

Supports ~100 concurrent users with light message traffic.

### Recommended (Production / Small-Medium)

| Resource | Value |
|----------|-------|
| CPU | 2-4 vCPU |
| Memory | 2-4 GB |
| Disk | 50-100 GB NVMe SSD |
| Network | 1 Gbps |

Supports ~1,000-5,000 concurrent users.

### Large (High Traffic)

| Resource | Value |
|----------|-------|
| CPU | 8+ vCPU |
| Memory | 8-16 GB |
| Disk | 500 GB+ NVMe SSD (RAID 10) |
| Network | 10 Gbps |

Supports ~10,000+ concurrent users.
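The tiers above can be checked against a live host. A minimal sketch for Linux, using the Recommended-tier thresholds as illustrative values (adjust them to the tier you are targeting):

```shell
# Compare this host's CPU and memory against the Recommended tier
# (2-4 vCPU, 2-4 GB RAM). Thresholds are illustrative, not enforced limits.
set -u

cores=$(nproc)
mem_mb=$(awk '/MemTotal/ {printf "%d", $2 / 1024}' /proc/meminfo)

echo "CPU cores: ${cores} (recommended: >= 2)"
echo "Memory:    ${mem_mb} MB (recommended: >= 2048)"

[ "${cores}" -ge 2 ] || echo "WARN: below recommended CPU for production"
[ "${mem_mb}" -ge 2048 ] || echo "WARN: below recommended memory for production"
```

Run it on the target host before deploying; disk and network capacity still need to be checked against your provider's specifications.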
## Scaling Triggers

Monitor these metrics and scale when thresholds are exceeded:

| Metric | Warning | Critical | Action |
|--------|---------|----------|--------|
| CPU usage | > 70% sustained (5 min) | > 90% sustained | Add CPU or scale horizontally |
| Memory usage | > 75% | > 90% | Increase memory, check for leaks |
| Disk usage | > 70% | > 90% | Expand volume, clean old data |
| Disk I/O latency | > 5 ms p95 | > 20 ms p95 | Move to faster storage |
| `delivery_queue_depth` | > 10,000 | > 100,000 | Investigate stale queues |
| `rate_limit_hit_total` rate | > 100/min | > 1000/min | Investigate abuse, adjust limits |
| `auth_login_failure_total` rate | > 50/min | > 500/min | Potential brute-force attack |
| Connection count | > 80% of `max_concurrent_bidi_streams` | > 95% | Scale horizontally |
| TLS handshake latency | > 100 ms p95 | > 500 ms p95 | Add CPU, check network |

## Vertical Scaling

### CPU Scaling

The server is async (Tokio) and benefits from multiple cores. QUIC TLS handshakes and OPAQUE computations are CPU-intensive.

```bash
# Check current CPU usage
top -bn1 -p $(pgrep qpc-server)

# For Docker: increase CPU limits
# docker-compose.prod.yml:
#   deploy:
#     resources:
#       limits:
#         cpus: '4'
```

### Memory Scaling

In-memory state scales linearly with concurrent connections:

- ~2-5 KB per active QUIC connection (quinn state)
- ~200 bytes per session entry (DashMap)
- ~100 bytes per rate limit entry
- ~100 bytes per delivery waiter

```bash
# Estimate memory for 10,000 connections:
#   10,000 * 5 KB  = ~50 MB for connections
#   10,000 * 500 B = ~5 MB for sessions/rate limits
#   SQLCipher connection pool: ~50 MB (4 connections, caches)
#   Base process:  ~30 MB
#   Total: ~135 MB + headroom = 256-512 MB minimum
```

### Disk I/O Scaling

SQLCipher uses WAL mode for concurrent reads.
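One sizing note for the checkpoint tuning below: `wal_autocheckpoint` is measured in pages, not bytes. Assuming SQLite's default 4096-byte page size (an assumption; check `PRAGMA page_size` on your own database), a threshold converts to an approximate WAL size like this:

```shell
# Convert a wal_autocheckpoint value (in pages) into the approximate WAL
# size at which SQLite triggers a checkpoint. 4096-byte pages assumed.
pages=2000
page_size=4096
wal_kb=$(( pages * page_size / 1024 ))
echo "Checkpoint after roughly ${wal_kb} KB of WAL"
```

A larger threshold means fewer checkpoint pauses during write bursts, at the cost of a larger WAL file on disk.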
For write-heavy workloads:

```bash
# Check current I/O
iostat -x 1 5

# Move to NVMe if on spinning disk

# Increase WAL autocheckpoint threshold for burst writes
# (use the sqlcipher shell so PRAGMA key is honored)
sqlcipher data/qpc.db "PRAGMA key='${QPC_DB_KEY}'; PRAGMA wal_autocheckpoint=2000;"
```

## Horizontal Scaling

quicprochat does not yet have built-in multi-node clustering. For horizontal scaling, use these patterns:

### Load Balancer (UDP/QUIC)

Place a UDP load balancer in front of multiple qpc-server instances. Each instance runs independently with its own database.

```
                +-----------+
clients ------> |   L4 LB   | ----> qpc-server-1 (db-1)
                | (UDP/QUIC)| ----> qpc-server-2 (db-2)
                +-----------+ ----> qpc-server-3 (db-3)
```

**Requirements:**

- Sticky sessions (by client IP or QUIC connection ID) so a client always reaches the same node
- Shared storage backend or federation between nodes

### Federation for Multi-Node

Enable federation to relay messages between nodes:

```toml
# qpc-server.toml on node-1
[federation]
enabled = true
domain = "node1.chat.example.com"
listen = "0.0.0.0:7001"
federation_cert = "data/federation-cert.der"
federation_key = "data/federation-key.der"
federation_ca = "data/federation-ca.der"

[[federation.peers]]
domain = "node2.chat.example.com"
address = "10.0.1.2:7001"
```

### Shared Database (PostgreSQL)

For true horizontal scaling, migrate from SQLCipher to a shared PostgreSQL instance. This is not yet implemented but is the planned approach for multi-node deployments.
```
qpc-server-1 --\
qpc-server-2 ---+--> PostgreSQL (shared)
qpc-server-3 --/
```

## Connection Tuning

The server has these QUIC transport defaults:

| Parameter | Default | How to change |
|-----------|---------|---------------|
| Max idle timeout | 300 s (5 min) | Code change required |
| Max concurrent bidi streams | 1 per connection | Code change required |
| Max concurrent uni streams | 0 | Code change required |
| SQLCipher connection pool | 4 connections | Code change required |

For high connection counts, consider:

- Increasing the OS file descriptor limit: `ulimit -n 65536`
- Increasing UDP buffer sizes:

```bash
# /etc/sysctl.d/99-qpc.conf
net.core.rmem_max = 26214400
net.core.wmem_max = 26214400
net.core.rmem_default = 1048576
net.core.wmem_default = 1048576
```

```bash
sysctl -p /etc/sysctl.d/99-qpc.conf
```

## Docker Resource Limits

```yaml
# docker-compose.prod.yml
services:
  server:
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 4G
        reservations:
          cpus: '2'
          memory: 1G
    ulimits:
      nofile:
        soft: 65536
        hard: 65536
```

## Load Testing

Use the included test infrastructure to benchmark:

```bash
# Build the test client
cargo build --release --bin qpc-client

# Run a concurrent connection test (example)
for i in $(seq 1 100); do
  qpc-client --server 127.0.0.1:7000 --auth-token "$QPC_AUTH_TOKEN" &
done
wait

# Monitor during the load test
watch -n1 'curl -s http://localhost:9090/metrics | grep -E "enqueue_total|fetch_total|delivery_queue_depth|rate_limit"'
```

## Capacity Planning Worksheet

| Parameter | Your Value |
|-----------|------------|
| Expected concurrent users | |
| Messages per user per hour | |
| Average message size (bytes) | |
| Blob uploads per day | |
| Average blob size (MB) | |
| Data retention (days) | |

**Formulas:**

```
Storage per day     = (users * msgs/hr * 24 * avg_msg_size) + (blob_uploads * avg_blob_size)
DB growth per month = storage_per_day * 30
Memory estimate     = (concurrent_users * 5 KB) + 256 MB base
CPU estimate        = 1 vCPU per ~2,500 concurrent connections (depends on message rate)
```
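The worksheet formulas translate directly into a script. A sketch with placeholder inputs (every value below is an example to be replaced with your own numbers, not a recommendation):

```shell
# Capacity estimate from the worksheet formulas above.
users=2000            # expected concurrent users
msgs_per_hr=20        # messages per user per hour
avg_msg_b=512         # average message size, bytes
blobs_per_day=200     # blob uploads per day
avg_blob_mb=2         # average blob size, MB
retention_days=90     # data retention, days

msg_bytes_day=$(( users * msgs_per_hr * 24 * avg_msg_b ))
blob_bytes_day=$(( blobs_per_day * avg_blob_mb * 1024 * 1024 ))
storage_day_mb=$(( (msg_bytes_day + blob_bytes_day) / 1024 / 1024 ))
db_month_mb=$(( storage_day_mb * 30 ))
retained_gb=$(( storage_day_mb * retention_days / 1024 ))
mem_mb=$(( users * 5 / 1024 + 256 ))   # users * 5 KB + 256 MB base
cpus=$(( (users + 2499) / 2500 ))      # 1 vCPU per ~2,500 connections, rounded up

echo "Storage per day:  ~${storage_day_mb} MB"
echo "DB growth/month:  ~${db_month_mb} MB"
echo "Retained data:    ~${retained_gb} GB at ${retention_days} days"
echo "Memory estimate:  ~${mem_mb} MB"
echo "CPU estimate:     ~${cpus} vCPU"
```

Treat the output as a starting point and validate it with the load-testing procedure above before committing to hardware.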