
Incident Response Playbook

This document provides procedures for responding to common operational incidents in a quicprochat deployment.

Severity Levels

| Level | Description | Response Time | Examples |
|-------|-------------|---------------|----------|
| P1 - Critical | Service down, data loss, key compromise | Immediate | Server crash loop, DB corruption, leaked secrets |
| P2 - Major | Degraded service, partial outage | 15 minutes | High latency, storage full, cert expiry |
| P3 - Minor | Non-critical issue, monitoring alert | 1 hour | Rate limit spikes, non-critical warnings |

Incident: Server Not Starting

Symptoms

  • Server process exits immediately
  • Logs show "TLS cert or key missing" or "production forbids" errors

Diagnosis

# Check server logs
journalctl -u qpc-server --since "10 min ago" --no-pager

# Docker
docker compose logs --tail=50 server

Common Causes and Fixes

Missing TLS certificates (production mode)

# Production requires pre-existing certs (no auto-generation)
ls -la data/server-cert.der data/server-key.der

# If missing, restore from backup or generate new ones
# See: key-rotation.md
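If no backup exists and you need the service up while a proper certificate is arranged, a self-signed pair can bridge the gap. A minimal sketch, assuming the server loads DER-encoded certificate and key from the paths shown (verify against your deployment, and note that clients pinning the old certificate will reject the new one):

```shell
# Sketch: regenerate a self-signed certificate and key in DER form.
# ASSUMPTION: the server expects DER files at data/server-cert.der and
# data/server-key.der; adjust the CN and validity to your environment.
openssl req -x509 -newkey rsa:2048 -nodes -days 90 \
  -subj "/CN=qpc-server" \
  -keyout /tmp/qpc-key.pem -out /tmp/qpc-cert.pem

# Convert PEM -> DER at the paths the server reads
openssl x509 -in /tmp/qpc-cert.pem -outform DER -out data/server-cert.der
openssl pkey -in /tmp/qpc-key.pem -outform DER -out data/server-key.der
```

Treat this as a stopgap only; follow key-rotation.md for the proper replacement.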

Missing auth token (production mode)

# Production requires QPC_AUTH_TOKEN >= 16 chars and not the literal "devtoken"
printf %s "$QPC_AUTH_TOKEN" | wc -c   # printf avoids counting echo's trailing newline

Database locked or corrupt

# Check if another process holds the database
fuser data/qpc.db

# Verify database integrity
sqlite3 data/qpc.db "PRAGMA key='${QPC_DB_KEY}'; PRAGMA integrity_check;"
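If the integrity check reports errors, recent sqlite3/SQLCipher CLIs can often salvage readable rows into a fresh file. A sketch, assuming your CLI has the `.recover` dot-command (SQLite 3.29+) and that the key is applied before recovery via `-cmd` (on a stock sqlite3 CLI the `PRAGMA key` is a harmless no-op):

```shell
# Sketch: salvage rows from a damaged database into a new file.
# ASSUMPTION: a CLI with .recover; for an encrypted file the key must be
# set before .recover runs. NOTE: the recovered file is written
# unencrypted — rekey it before putting it back in service.
sqlite3 -cmd "PRAGMA key='${QPC_DB_KEY}';" data/qpc.db ".recover" \
  | sqlite3 data/qpc-recovered.db
sqlite3 data/qpc-recovered.db "PRAGMA integrity_check;"
```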

Port already in use

# Check if something is already listening on port 7000 (QUIC uses UDP,
# but check TCP too in case another service grabbed the same port number)
ss -tulnp | grep 7000

Incident: Node Down / Unresponsive

Symptoms

  • Clients cannot connect
  • Health check failures
  • No new log entries

Diagnosis

# 1. Check if the process is running
systemctl status qpc-server
# or: docker compose ps

# 2. Check resource usage
top -bn1 | grep qpc
df -h /var/lib/quicprochat
free -h

# 3. Check the QUIC port is reachable (UDP, so nc gives only a weak signal:
#    "open" just means no ICMP port-unreachable came back)
# From another host:
nc -uzv <server-ip> 7000

# 4. Check for OOM kills
dmesg | grep -i "out of memory\|oom" | tail -5
journalctl -k | grep -i oom

Recovery

# Restart the service
systemctl restart qpc-server

# If OOM: raise the memory limit with a systemd drop-in
systemctl edit qpc-server
# In the drop-in, add:
#   [Service]
#   MemoryMax=2G

# If disk full: see "Storage Full" incident below

Incident: Storage Full

Symptoms

  • enqueue operations fail
  • Logs show "No space left on device"
  • delivery_queue_depth gauge rising

Diagnosis

# Check disk usage
df -h /var/lib/quicprochat
du -sh /var/lib/quicprochat/*

# Check largest files
du -a /var/lib/quicprochat | sort -rn | head -20

# Check blob storage specifically
du -sh /var/lib/quicprochat/blobs/
find /var/lib/quicprochat/blobs/ -type f | wc -l

Recovery

# 1. Identify and remove expired messages (the cleanup task handles this,
#    but if it's behind, you can trigger manual cleanup)

# For SQL backend: delete expired delivery messages
# Unquoted heredoc delimiter so ${QPC_DB_KEY} actually expands
sqlite3 data/qpc.db <<EOF
PRAGMA key = '${QPC_DB_KEY}';
DELETE FROM delivery_queue WHERE expires_at IS NOT NULL AND expires_at < unixepoch();
VACUUM;
EOF

# 2. Remove orphaned blobs (blobs not referenced by any message)
# This is application-specific; coordinate with the codebase

# 3. If the data partition is full, expand the volume
# AWS EBS: aws ec2 modify-volume --volume-id vol-xxx --size 100
# Then: resize2fs /dev/xvdf

# 4. Move to a larger disk
systemctl stop qpc-server
rsync -av /var/lib/quicprochat/ /mnt/new-volume/quicprochat/
# Update QPC_DATA_DIR and QPC_DB_PATH to point to the new location
systemctl start qpc-server
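Step 2 above (orphaned blobs) can be sketched as a filename diff between disk and database, assuming blobs are stored one file per blob and referenced by filename from a database column. The `blob_id` column here is hypothetical; check the actual schema before running anything destructive:

```shell
# Sketch: list blob files on disk that no delivery_queue row references.
# ASSUMPTION: the blob_id column (holding the blob's filename) is
# hypothetical — verify against the real schema first.
sqlite3 data/qpc.db \
  "PRAGMA key='${QPC_DB_KEY}'; SELECT DISTINCT blob_id FROM delivery_queue;" \
  | sort > /tmp/referenced.txt

find /var/lib/quicprochat/blobs/ -type f -printf '%f\n' | sort > /tmp/on-disk.txt

# Present on disk but absent from the database: candidates for removal
comm -23 /tmp/on-disk.txt /tmp/referenced.txt
```

Review the output by hand before deleting; a blob referenced by a table the query does not cover would show up as a false positive.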

Prevention

  • Set up disk usage alerts at 70% and 90% thresholds.
  • Configure message TTL (ttl_secs) to auto-expire old messages.
  • Schedule regular VACUUM on the SQLCipher database.
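The third prevention item can be automated with a cron entry. A sketch, assuming an environment file at /etc/quicprochat/env (hypothetical path) that exports QPC_DB_PATH and QPC_DB_KEY, and a low-traffic window at 04:00:

```shell
# /etc/cron.d/qpc-vacuum — nightly WAL checkpoint + VACUUM.
# ASSUMPTION: /etc/quicprochat/env is hypothetical; adapt to wherever
# your deployment stores the database path and key.
0 4 * * * root . /etc/quicprochat/env && sqlite3 "$QPC_DB_PATH" "PRAGMA key='$QPC_DB_KEY'; PRAGMA wal_checkpoint(TRUNCATE); VACUUM;"
```

VACUUM rewrites the whole database file, so make sure the window does not overlap the delivery cleanup task.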

Incident: DDoS / Connection Flood

Symptoms

  • rate_limit_hit_total counter spiking
  • auth_login_failure_total counter spiking
  • High CPU usage
  • Legitimate clients cannot connect

Diagnosis

# Check connection rate limit hits in metrics
curl -s http://localhost:9090/metrics | grep rate_limit

# Check auth failure rate
curl -s http://localhost:9090/metrics | grep auth_login_failure

# Check active connections (QUIC uses UDP)
ss -unp | grep 7000 | wc -l

Mitigation

# 1. The server has built-in per-IP connection rate limiting.
#    Check the logs for "connection rate limit exceeded" messages.

# 2. Block offending IPs at the firewall level
iptables -A INPUT -s <attacker-ip> -j DROP

# 3. For volumetric attacks, use upstream DDoS protection
#    (Cloudflare Spectrum, AWS Shield, etc.)

# 4. If the server is overwhelmed, restart to clear state
systemctl restart qpc-server

# 5. Enable log redaction to reduce I/O pressure during attacks
# Set QPC_REDACT_LOGS=true
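For step 2, the offending IPs can be pulled from the logs before blocking. A sketch, assuming the client IP appears in dotted-quad form in the "connection rate limit exceeded" log lines (adjust the pattern to your actual log format; IPv6 needs a different regex):

```shell
# Sketch: rank source IPs appearing in rate-limit log lines.
# ASSUMPTION: exact log message text and IPv4-only matching.
journalctl -u qpc-server --since "10 min ago" --no-pager \
  | grep "connection rate limit exceeded" \
  | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' \
  | sort | uniq -c | sort -rn | head -10
```

Feed the top entries into the iptables DROP rule above, or into an ipset if the list is long.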

Incident: Key Compromise

Auth Token Compromised

Severity: P1

# 1. Immediately rotate the auth token
NEW_TOKEN=$(openssl rand -base64 32)

# 2. Update server config and restart
# See: key-rotation.md "Auth Token Rotation"

# 3. Notify all legitimate clients of the new token

# 4. Review logs for unauthorized access
journalctl -u qpc-server | grep "auth_login_success" | tail -100

TLS Private Key Compromised

Severity: P1

# 1. Generate and install a new certificate immediately
# See: key-rotation.md "TLS Certificate Rotation"

# 2. Revoke the compromised certificate with your CA
# (procedure depends on your CA)

# 3. Restart the server with the new certificate
systemctl restart qpc-server

# 4. If clients pin certificates, notify them of the change

Database Key Compromised

Severity: P1

# 1. Stop the server
systemctl stop qpc-server

# 2. Rekey the database immediately
# See: key-rotation.md "Database Encryption Key Rotation"

# 3. Assess data exposure
# If the attacker had access to the database file, assume all
# stored data (users, key packages, delivery queues) is compromised.

# 4. Consider notifying affected users

OPAQUE ServerSetup Compromised

Severity: P1

# 1. Rotate the OPAQUE ServerSetup
# See: key-rotation.md "OPAQUE ServerSetup Rotation"

# WARNING: rotating the ServerSetup invalidates ALL OPAQUE credentials.

# 2. All users must re-register with new credentials
# 3. Review logs for unauthorized OPAQUE authentications

Incident: High Latency

Symptoms

  • Clients report slow message delivery
  • delivery_queue_depth gauge is high
  • Fetch operations are slow

Diagnosis

# 1. Check system resources
top -bn1 | head -20
iostat -x 1 3

# 2. Check database performance
# Unquoted heredoc delimiter so ${QPC_DB_KEY} expands
sqlite3 data/qpc.db <<EOF
PRAGMA key = '${QPC_DB_KEY}';
PRAGMA integrity_check;
PRAGMA wal_checkpoint(PASSIVE);
-- Check table sizes
SELECT 'delivery_queue', count(*) FROM delivery_queue
UNION ALL SELECT 'key_packages', count(*) FROM key_packages
UNION ALL SELECT 'users', count(*) FROM users;
EOF

# 3. Check queue depth via metrics
curl -s http://localhost:9090/metrics | grep delivery_queue_depth

Recovery

# 1. Checkpoint the WAL (reduces WAL file size)
sqlite3 data/qpc.db "PRAGMA key='${QPC_DB_KEY}'; PRAGMA wal_checkpoint(TRUNCATE);"

# 2. VACUUM to reclaim space and defragment
sqlite3 data/qpc.db "PRAGMA key='${QPC_DB_KEY}'; VACUUM;"

# 3. If the queue is huge, check for clients not fetching
# (delivery_queue rows accumulate when clients are offline)

# 4. If I/O-bound: move database to faster storage (SSD/NVMe)

Incident: Certificate Expiring

Symptoms

  • Log warning: "TLS certificate expires within 30 days"
  • Monitoring alert on certificate expiry

Response

# 1. Check current certificate expiry
openssl x509 -inform DER -in data/server-cert.der -noout -enddate

# 2. Renew the certificate
# See: key-rotation.md "TLS Certificate Rotation"

# 3. Verify the new certificate is loaded
journalctl -u qpc-server --since "1 min ago" | grep -i cert
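The expiry date from the check above can be turned into a day count for scripting the monitoring alert. A sketch using GNU date (the DER path matches the earlier command):

```shell
# Sketch: print days until the certificate expires (GNU date required
# for `date -d`).
end=$(openssl x509 -inform DER -in data/server-cert.der -noout -enddate \
  | cut -d= -f2)
days=$(( ($(date -d "$end" +%s) - $(date +%s)) / 86400 ))
echo "certificate expires in ${days} days"
```

Wiring this into cron with a threshold test (e.g. alert when days < 30) covers deployments without the Prometheus expiry alert.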

Post-Incident Checklist

After resolving any incident:

  1. Document the incident: timeline, root cause, resolution steps
  2. Verify the service is fully operational (check metrics, test client connections)
  3. Review whether monitoring would have caught this earlier
  4. Update alerts and thresholds based on findings
  5. Communicate with affected users if there was data exposure or service disruption
  6. Schedule follow-up actions (e.g., add monitoring, improve automation)