# Incident Response Playbook

This document provides procedures for responding to common operational incidents in a quicprochat deployment.

## Severity Levels

| Level | Description | Response Time | Examples |
|-------|-------------|---------------|----------|
| P1 - Critical | Service down, data loss, key compromise | Immediate | Server crash loop, DB corruption, leaked secrets |
| P2 - Major | Degraded service, partial outage | 15 minutes | High latency, storage full, cert expiry |
| P3 - Minor | Non-critical issue, monitoring alert | 1 hour | Rate limit spikes, non-critical warnings |

## Incident: Server Not Starting

### Symptoms

- Server process exits immediately
- Logs show "TLS cert or key missing" or "production forbids" errors

### Diagnosis

```bash
# Check server logs
journalctl -u qpc-server --since "10 min ago" --no-pager

# Docker
docker compose logs --tail=50 server
```

### Common Causes and Fixes

**Missing TLS certificates (production mode)**

```bash
# Production requires pre-existing certs (no auto-generation)
ls -la data/server-cert.der data/server-key.der

# If missing, restore from backup or generate new ones
# See: key-rotation.md
```

**Missing auth token (production mode)**

```bash
# Production requires QPC_AUTH_TOKEN >= 16 chars, not "devtoken"
# (-n suppresses the trailing newline so the count is exact)
echo -n "$QPC_AUTH_TOKEN" | wc -c
```

**Database locked or corrupt**

```bash
# Check if another process holds the database
fuser data/qpc.db

# Verify database integrity (requires a SQLCipher-capable sqlite3 binary)
sqlite3 data/qpc.db "PRAGMA key='${QPC_DB_KEY}'; PRAGMA integrity_check;"
```

**Port already in use**

```bash
# Check if something is already bound to port 7000 (QUIC listens on UDP)
ss -tulnp | grep 7000
```

## Incident: Node Down / Unresponsive

### Symptoms

- Clients cannot connect
- Health check failures
- No new log entries

### Diagnosis

```bash
# 1. Check if the process is running
systemctl status qpc-server
# or: docker compose ps

# 2. Check resource usage
top -bn1 | grep qpc
df -h /var/lib/quicprochat
free -h

# 3. Check QUIC port is reachable
#    From another host (replace <server-host>):
nc -uzv <server-host> 7000

# 4. Check for OOM kills
dmesg | grep -i "out of memory\|oom" | tail -5
journalctl -k | grep -i oom
```

### Recovery

```bash
# Restart the service
systemctl restart qpc-server

# If OOM: raise the memory limit with a systemd drop-in
systemctl edit qpc-server
# Then add:
# [Service]
# MemoryMax=2G

# If disk full: see "Storage Full" incident below
```

## Incident: Storage Full

### Symptoms

- `enqueue` operations fail
- Logs show "No space left on device"
- `delivery_queue_depth` gauge rising

### Diagnosis

```bash
# Check disk usage
df -h /var/lib/quicprochat
du -sh /var/lib/quicprochat/*

# Check largest files
du -a /var/lib/quicprochat | sort -rn | head -20

# Check blob storage specifically
du -sh /var/lib/quicprochat/blobs/
find /var/lib/quicprochat/blobs/ -type f | wc -l
```

### Recovery

```bash
# 1. Identify and remove expired messages (the cleanup task handles this,
#    but if it's behind, you can trigger manual cleanup)

# For SQL backend: delete expired delivery messages
# (unquoted heredoc delimiter so ${QPC_DB_KEY} actually expands)
sqlite3 data/qpc.db <<EOF
PRAGMA key = '${QPC_DB_KEY}';
DELETE FROM delivery_queue
WHERE expires_at IS NOT NULL AND expires_at < unixepoch();
VACUUM;
EOF

# 2. Remove orphaned blobs (blobs not referenced by any message).
#    This is application-specific; coordinate with the codebase.
#    (One possible shape is sketched after this block.)

# 3. If the data partition is full, expand the volume
#    AWS EBS: aws ec2 modify-volume --volume-id vol-xxx --size 100
#    Then: resize2fs /dev/xvdf

# 4. Move to a larger disk
systemctl stop qpc-server
rsync -av /var/lib/quicprochat/ /mnt/new-volume/quicprochat/
# Update QPC_DATA_DIR and QPC_DB_PATH to point to the new location
systemctl start qpc-server
```
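Recovery step 2 is intentionally left application-specific. For reference only, one possible shape for an orphan sweep is sketched below; it assumes blob files are named by their blob ID and that `delivery_queue` carries a hypothetical `blob_id` column. Verify both assumptions against the actual schema before running anything destructive.

```bash
#!/usr/bin/env bash
# SKETCH ONLY: blob file naming and the blob_id column are assumptions,
# not confirmed schema. Adjust names and paths to match the codebase.
BLOB_DIR=/var/lib/quicprochat/blobs

# Blob IDs still referenced by pending messages (unquoted heredoc so the
# key expands; the awk filter drops the "ok" line SQLCipher may print)
sqlite3 data/qpc.db <<EOF | awk '$0 != "ok"' | sort > /tmp/referenced.txt
PRAGMA key = '${QPC_DB_KEY}';
SELECT DISTINCT blob_id FROM delivery_queue WHERE blob_id IS NOT NULL;
EOF

# Blob files actually on disk
find "$BLOB_DIR" -type f -printf '%f\n' | sort > /tmp/on-disk.txt

# Files on disk that no queue row references: review before deleting
comm -23 /tmp/on-disk.txt /tmp/referenced.txt > /tmp/orphaned.txt
echo "$(wc -l < /tmp/orphaned.txt) orphan candidate(s) listed in /tmp/orphaned.txt"

# After review, delete them:
# xargs -a /tmp/orphaned.txt -I{} rm -- "$BLOB_DIR/{}"
```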
### Prevention

- Set up disk usage alerts at 70% and 90% thresholds.
- Configure message TTL (`ttl_secs`) to auto-expire old messages.
- Schedule regular `VACUUM` on the SQLCipher database.

## Incident: DDoS / Connection Flood

### Symptoms

- `rate_limit_hit_total` counter spiking
- `auth_login_failure_total` counter spiking
- High CPU usage
- Legitimate clients cannot connect

### Diagnosis

```bash
# Check connection rate limit hits in metrics
curl -s http://localhost:9090/metrics | grep rate_limit

# Check auth failure rate
curl -s http://localhost:9090/metrics | grep auth_login_failure

# Check active connections (QUIC uses UDP)
ss -unp | grep 7000 | wc -l
```

### Mitigation

```bash
# 1. The server has built-in per-IP connection rate limiting.
#    Check the logs for "connection rate limit exceeded" messages.

# 2. Block offending IPs at the firewall level (replace <offending-ip>)
iptables -A INPUT -s <offending-ip> -j DROP

# 3. For volumetric attacks, use upstream DDoS protection
#    (Cloudflare Spectrum, AWS Shield, etc.)

# 4. If the server is overwhelmed, restart to clear state
systemctl restart qpc-server

# 5. Enable log redaction to reduce I/O pressure during attacks
#    Set QPC_REDACT_LOGS=true
```

## Incident: Key Compromise

### Auth Token Compromised

**Severity: P1**

```bash
# 1. Immediately rotate the auth token
NEW_TOKEN=$(openssl rand -base64 32)

# 2. Update server config and restart
#    See: key-rotation.md "Auth Token Rotation"

# 3. Notify all legitimate clients of the new token

# 4. Review logs for unauthorized access
journalctl -u qpc-server | grep "auth_login_success" | tail -100
```

### TLS Private Key Compromised

**Severity: P1**

```bash
# 1. Generate and install a new certificate immediately
#    See: key-rotation.md "TLS Certificate Rotation"

# 2. Revoke the compromised certificate with your CA
#    (procedure depends on your CA)

# 3. Restart the server with the new certificate
systemctl restart qpc-server

# 4. If clients pin certificates, notify them of the change
```

### Database Key Compromised

**Severity: P1**

```bash
# 1. Stop the server
systemctl stop qpc-server

# 2. Rekey the database immediately
#    See: key-rotation.md "Database Encryption Key Rotation"

# 3. Assess data exposure.
#    If the attacker had access to the database file, assume all
#    stored data (users, key packages, delivery queues) is compromised.

# 4. Consider notifying affected users
```

### OPAQUE ServerSetup Compromised

**Severity: P1**

```bash
# 1. Rotate the OPAQUE ServerSetup
#    See: key-rotation.md "OPAQUE ServerSetup Rotation"
#    WARNING: This invalidates ALL OPAQUE credentials.

# 2. All users must re-register with new credentials

# 3. Review logs for unauthorized OPAQUE authentications
```

## Incident: High Latency

### Symptoms

- Clients report slow message delivery
- `delivery_queue_depth` gauge is high
- Fetch operations are slow

### Diagnosis

```bash
# 1. Check system resources
top -bn1 | head -20
iostat -x 1 3

# 2. Check database performance
#    (unquoted heredoc delimiter so ${QPC_DB_KEY} actually expands)
sqlite3 data/qpc.db <<EOF
PRAGMA key = '${QPC_DB_KEY}';
PRAGMA integrity_check;
PRAGMA wal_checkpoint(PASSIVE);
-- Check table sizes
SELECT 'delivery_queue', count(*) FROM delivery_queue
UNION ALL SELECT 'key_packages', count(*) FROM key_packages
UNION ALL SELECT 'users', count(*) FROM users;
EOF

# 3. Check queue depth via metrics (see the sampling sketch below)
curl -s http://localhost:9090/metrics | grep delivery_queue_depth
```
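A single reading of the gauge cannot distinguish a backlog that is draining from one that is still growing; sampling it for a minute can. A small sketch using only the metrics endpoint shown above:

```bash
# Sample delivery_queue_depth every 10 s for one minute to see the trend
for i in $(seq 6); do
  printf '%s ' "$(date +%T)"
  curl -s http://localhost:9090/metrics | awk '/^delivery_queue_depth/ {print $2}'
  sleep 10
done
```

If successive samples rise while clients are connected, the server is not keeping up; if they rise only while clients are offline, the queue is behaving as designed.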
### Recovery

```bash
# 1. Checkpoint the WAL (reduces WAL file size)
sqlite3 data/qpc.db "PRAGMA key='${QPC_DB_KEY}'; PRAGMA wal_checkpoint(TRUNCATE);"

# 2. VACUUM to reclaim space and defragment
sqlite3 data/qpc.db "PRAGMA key='${QPC_DB_KEY}'; VACUUM;"

# 3. If the queue is huge, check for clients not fetching
#    (delivery_queue rows accumulate when clients are offline)

# 4. If I/O-bound: move the database to faster storage (SSD/NVMe)
```

## Incident: Certificate Expiring

### Symptoms

- Log warning: "TLS certificate expires within 30 days"
- Monitoring alert on certificate expiry

### Response

```bash
# 1. Check current certificate expiry
openssl x509 -inform DER -in data/server-cert.der -noout -enddate

# 2. Renew the certificate
#    See: key-rotation.md "TLS Certificate Rotation"

# 3. Verify the new certificate is loaded
journalctl -u qpc-server --since "1 min ago" | grep -i cert
```

## Post-Incident Checklist

After resolving any incident:

1. **Document** the incident: timeline, root cause, resolution steps
2. **Verify** the service is fully operational (check metrics, test client connections; see the smoke test after this list)
3. **Review** whether monitoring would have caught this earlier
4. **Update** alerts and thresholds based on findings
5. **Communicate** with affected users if there was data exposure or service disruption
6. **Schedule** follow-up actions (e.g., add monitoring, improve automation)
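Checklist step 2 can be scripted. A minimal smoke test, assuming the defaults used throughout this playbook (systemd unit `qpc-server`, QUIC on UDP port 7000, metrics on `localhost:9090`, DER certificate at `data/server-cert.der`):

```bash
# Post-incident smoke test: every line should succeed / print its message
systemctl is-active qpc-server                            # unit is running
ss -ulnp | grep -q ':7000 ' && echo "QUIC port bound"     # UDP listener present
curl -sf http://localhost:9090/metrics > /dev/null && echo "metrics endpoint up"
openssl x509 -inform DER -in data/server-cert.der -noout -checkend 2592000 \
  && echo "certificate valid for at least 30 days"
```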