
# Incident Response Playbook
This document provides procedures for responding to common operational incidents in a quicprochat deployment.
## Severity Levels
| Level | Description | Response Time | Examples |
|-------|-------------|---------------|----------|
| P1 - Critical | Service down, data loss, key compromise | Immediate | Server crash loop, DB corruption, leaked secrets |
| P2 - Major | Degraded service, partial outage | 15 minutes | High latency, storage full, cert expiry |
| P3 - Minor | Non-critical issue, monitoring alert | 1 hour | Rate limit spikes, non-critical warnings |
## Incident: Server Not Starting
### Symptoms
- Server process exits immediately
- Logs show "TLS cert or key missing" or "production forbids" errors
### Diagnosis
```bash
# Check server logs
journalctl -u qpc-server --since "10 min ago" --no-pager
# Docker
docker compose logs --tail=50 server
```
### Common Causes and Fixes
**Missing TLS certificates (production mode)**
```bash
# Production requires pre-existing certs (no auto-generation)
ls -la data/server-cert.der data/server-key.der
# If missing, restore from backup or generate new ones
# See: key-rotation.md
```
**Missing auth token (production mode)**
```bash
# Production requires QPC_AUTH_TOKEN of at least 16 characters, not "devtoken"
# (printf avoids counting the trailing newline that echo would add)
printf '%s' "$QPC_AUTH_TOKEN" | wc -c
```
**Database locked or corrupt**
```bash
# Check if another process holds the database
fuser data/qpc.db
# Verify database integrity
sqlite3 data/qpc.db "PRAGMA key='${QPC_DB_KEY}'; PRAGMA integrity_check;"
```
**Port already in use**
```bash
# Check if something is already listening on port 7000
ss -tlnp | grep 7000
```
## Incident: Node Down / Unresponsive
### Symptoms
- Clients cannot connect
- Health check failures
- No new log entries
### Diagnosis
```bash
# 1. Check if the process is running
systemctl status qpc-server
# or: docker compose ps
# 2. Check resource usage
top -bn1 | grep qpc
df -h /var/lib/quicprochat
free -h
# 3. Check that the QUIC port is reachable (UDP scans cannot distinguish
#    open from filtered, so treat "no response" as inconclusive)
# From another host:
nc -uzv <server-ip> 7000
# 4. Check for OOM kills
dmesg | grep -i "out of memory\|oom" | tail -5
journalctl -k | grep -i oom
```
### Recovery
```bash
# Restart the service
systemctl restart qpc-server
# If OOM: raise the memory limit via a drop-in override
systemctl edit qpc-server
# In the editor, add:
#   [Service]
#   MemoryMax=2G
# If disk full: see "Storage Full" incident below
```
## Incident: Storage Full
### Symptoms
- `enqueue` operations fail
- Logs show "No space left on device"
- `delivery_queue_depth` gauge rising
### Diagnosis
```bash
# Check disk usage
df -h /var/lib/quicprochat
du -sh /var/lib/quicprochat/*
# Check largest files
du -a /var/lib/quicprochat | sort -rn | head -20
# Check blob storage specifically
du -sh /var/lib/quicprochat/blobs/
find /var/lib/quicprochat/blobs/ -type f | wc -l
```
### Recovery
```bash
# 1. Identify and remove expired messages (the cleanup task handles this,
# but if it's behind, you can trigger manual cleanup)
# For SQL backend: delete expired delivery messages
# Unquoted heredoc delimiter so ${QPC_DB_KEY} expands before sqlite3 sees it
sqlite3 data/qpc.db <<EOF
PRAGMA key = '${QPC_DB_KEY}';
DELETE FROM delivery_queue WHERE expires_at IS NOT NULL AND expires_at < unixepoch();
VACUUM;
EOF
# 2. Remove orphaned blobs (blobs not referenced by any message)
# This is application-specific; coordinate with the codebase
# 3. If the data partition is full, expand the volume
# AWS EBS: aws ec2 modify-volume --volume-id vol-xxx --size 100
# Then: resize2fs /dev/xvdf
# 4. Move to a larger disk
systemctl stop qpc-server
rsync -av /var/lib/quicprochat/ /mnt/new-volume/quicprochat/
# Update QPC_DATA_DIR and QPC_DB_PATH to point to the new location
systemctl start qpc-server
```
### Prevention
- Set up disk usage alerts at 70% and 90% thresholds.
- Configure message TTL (`ttl_secs`) to auto-expire old messages.
- Schedule regular `VACUUM` on the SQLCipher database.
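The first prevention item can be wired up as a simple cron check. A minimal sketch, assuming the default data directory and plain stdout for alerting; substitute your real paging or alerting command for the `echo` lines.

```shell
DATA_DIR="${QPC_DATA_DIR:-/var/lib/quicprochat}"
[ -d "$DATA_DIR" ] || DATA_DIR=/   # fall back so the check still runs somewhere

disk_pct() {
  # Fill percentage (digits only) of the filesystem holding $1
  df --output=pcent "$1" | tail -n 1 | tr -dc '0-9'
}

pct=$(disk_pct "$DATA_DIR")
if [ "$pct" -ge 90 ]; then
  echo "CRITICAL: $DATA_DIR at ${pct}% - free space or expand the volume now"
elif [ "$pct" -ge 70 ]; then
  echo "WARNING: $DATA_DIR at ${pct}% - schedule cleanup"
fi
```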
## Incident: DDoS / Connection Flood
### Symptoms
- `rate_limit_hit_total` counter spiking
- `auth_login_failure_total` counter spiking
- High CPU usage
- Legitimate clients cannot connect
### Diagnosis
```bash
# Check connection rate limit hits in metrics
curl -s http://localhost:9090/metrics | grep rate_limit
# Check auth failure rate
curl -s http://localhost:9090/metrics | grep auth_login_failure
# Check UDP socket activity (QUIC multiplexes all connections over one
# listening UDP socket, so use server metrics for true connection counts)
ss -unp | grep 7000
```
### Mitigation
```bash
# 1. The server has built-in per-IP connection rate limiting.
# Check the logs for "connection rate limit exceeded" messages.
# 2. Block offending IPs at the firewall level
iptables -A INPUT -s <attacker-ip> -j DROP
# 3. For volumetric attacks, use upstream DDoS protection
# (Cloudflare Spectrum, AWS Shield, etc.)
# 4. If the server is overwhelmed, restart to clear state
systemctl restart qpc-server
# 5. Enable log redaction to reduce I/O pressure during attacks
# Set QPC_REDACT_LOGS=true
```
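For step 2, it helps to rank offenders before blocking. A hedged sketch, assuming the rate-limit log lines include the client's IPv4 address; verify the exact wording and format of your log output before relying on the grep pattern.

```shell
top_offenders() {
  # Read log lines on stdin; print "count ip" for the 5 busiest IPv4 sources
  grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' | sort | uniq -c | sort -rn | head -5
}

# journalctl -u qpc-server --since "10 min ago" \
#   | grep "connection rate limit exceeded" | top_offenders
# Review the list, then block, e.g.:
#   iptables -A INPUT -s <attacker-ip> -j DROP
```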
## Incident: Key Compromise
### Auth Token Compromised
**Severity: P1**
```bash
# 1. Immediately rotate the auth token
NEW_TOKEN=$(openssl rand -base64 32)
# 2. Update server config and restart
# See: key-rotation.md "Auth Token Rotation"
# 3. Notify all legitimate clients of the new token
# 4. Review logs for unauthorized access
journalctl -u qpc-server | grep "auth_login_success" | tail -100
```
### TLS Private Key Compromised
**Severity: P1**
```bash
# 1. Generate and install a new certificate immediately
# See: key-rotation.md "TLS Certificate Rotation"
# 2. Revoke the compromised certificate with your CA
# (procedure depends on your CA)
# 3. Restart the server with the new certificate
systemctl restart qpc-server
# 4. If clients pin certificates, notify them of the change
```
### Database Key Compromised
**Severity: P1**
```bash
# 1. Stop the server
systemctl stop qpc-server
# 2. Rekey the database immediately
# See: key-rotation.md "Database Encryption Key Rotation"
# 3. Assess data exposure
# If the attacker had access to the database file, assume all
# stored data (users, key packages, delivery queues) is compromised.
# 4. Consider notifying affected users
```
### OPAQUE ServerSetup Compromised
**Severity: P1**
```bash
# 1. Rotate the OPAQUE ServerSetup
# See: key-rotation.md "OPAQUE ServerSetup Rotation"
# WARNING: This invalidates ALL OPAQUE credentials.
# All users must re-register.
# 2. All users must re-register with new credentials
# 3. Review logs for unauthorized OPAQUE authentications
```
## Incident: High Latency
### Symptoms
- Clients report slow message delivery
- `delivery_queue_depth` gauge is high
- Fetch operations are slow
### Diagnosis
```bash
# 1. Check system resources
top -bn1 | head -20
iostat -x 1 3
# 2. Check database performance
# Unquoted heredoc delimiter so ${QPC_DB_KEY} expands before sqlite3 sees it
sqlite3 data/qpc.db <<EOF
PRAGMA key = '${QPC_DB_KEY}';
PRAGMA integrity_check;
PRAGMA wal_checkpoint(PASSIVE);
-- Check table sizes
SELECT 'delivery_queue', count(*) FROM delivery_queue
UNION ALL SELECT 'key_packages', count(*) FROM key_packages
UNION ALL SELECT 'users', count(*) FROM users;
EOF
# 3. Check queue depth via metrics
curl -s http://localhost:9090/metrics | grep delivery_queue_depth
```
### Recovery
```bash
# 1. Checkpoint the WAL (reduces WAL file size)
sqlite3 data/qpc.db "PRAGMA key='${QPC_DB_KEY}'; PRAGMA wal_checkpoint(TRUNCATE);"
# 2. VACUUM to reclaim space and defragment
sqlite3 data/qpc.db "PRAGMA key='${QPC_DB_KEY}'; VACUUM;"
# 3. If the queue is huge, check for clients not fetching
# (delivery_queue rows accumulate when clients are offline)
# 4. If I/O-bound: move database to faster storage (SSD/NVMe)
```
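To confirm the queue is actually draining after recovery, poll the gauge. A sketch assuming the metric name and port from the diagnosis step, parsing the Prometheus text exposition format:

```shell
queue_depth() {
  # Extract the gauge value from Prometheus text-format metrics on stdin
  awk '$1 == "delivery_queue_depth" { print $2 }'
}

# Poll every 30s; the number should trend down once clients start fetching:
# while true; do
#   curl -s http://localhost:9090/metrics | queue_depth
#   sleep 30
# done
```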
## Incident: Certificate Expiring
### Symptoms
- Log warning: "TLS certificate expires within 30 days"
- Monitoring alert on certificate expiry
### Response
```bash
# 1. Check current certificate expiry
openssl x509 -inform DER -in data/server-cert.der -noout -enddate
# 2. Renew the certificate
# See: key-rotation.md "TLS Certificate Rotation"
# 3. Verify the new certificate is loaded
journalctl -u qpc-server --since "1 min ago" | grep -i cert
```
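The expiry check above can be automated with openssl's `-checkend` flag, which exits nonzero when the certificate expires within the given window. A sketch using the production cert path from earlier; wire the `echo` into your alerting:

```shell
CERT=data/server-cert.der
WINDOW=$((30 * 24 * 3600))   # 30 days, matching the log warning threshold

# -checkend returns 0 if the cert is still valid past the window, 1 otherwise
if ! openssl x509 -inform DER -in "$CERT" -noout -checkend "$WINDOW" >/dev/null 2>&1; then
  echo "ALERT: $CERT expires within 30 days - rotate now (see key-rotation.md)"
fi
```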
## Post-Incident Checklist
After resolving any incident:
1. **Document** the incident: timeline, root cause, resolution steps
2. **Verify** the service is fully operational (check metrics, test client connections)
3. **Review** whether monitoring would have caught this earlier
4. **Update** alerts and thresholds based on findings
5. **Communicate** with affected users if there was data exposure or service disruption
6. **Schedule** follow-up actions (e.g., add monitoring, improve automation)