Files
quicproquo/docs/PRODUCTION-READINESS-AUDIT.md
Christian Nennemann 00b0aa92a1 feat: add protocol comparison docs, P2P crate, production audit, and design fixes
Add comprehensive documentation comparing quicnprotochat against classical
chat protocols (IRC+SSL, XMPP, Telegram) with diagrams and attack scenarios.
Promote comparison pages to top-level sidebar section. Include P2P transport
crate (iroh), production readiness audit, CI workflows, dependency policy,
and continued architecture improvements across all crates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 12:15:44 +01:00

128 lines
7.7 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Production Readiness Audit
This document summarizes issues and fixes needed to get quicnprotochat production-ready, based on a codebase review. It aligns with the existing [Production Readiness WBS](src/roadmap/production-readiness.md) and [Coding Standards](src/contributing/coding-standards.md).
---
## Critical (fix before production)
### 1. **Auth token and dev defaults**
- **README and example config** use `auth_token = "devtoken"` and `db_key = ""`.
- **Risk:** Deploying with default/example config allows weak or no auth and unencrypted DB.
- **Fix:** Require explicit `QUICNPROTOCHAT_AUTH_TOKEN` (or config) in production; reject empty or `"devtoken"` when a production mode/env is set. Document that `db_key` empty disables SQLCipher and is not acceptable for production.
### 2. **Database encryption optional**
- **`sql_store.rs`:** If `db_key` is empty, SQLCipher is not applied; DB is plaintext on disk.
- **Fix:** In production, require non-empty `db_key` (or fail startup with a clear error). Document in README and deployment docs.
### 3. **Secrets and generated files not ignored**
- **`.gitignore`** does not include `data/`, so `data/server-cert.der`, `data/server-key.der`, and `data/quicnprotochat.db` could be committed.
- **Fix:** Add `data/` (and any other dirs that hold certs, keys, or DBs) to `.gitignore`. Consider adding `*.der` and `*.db` if used only for local/dev.
### 4. **Dockerfile out of sync with workspace**
- **Workspace** has 5 members including `crates/quicnprotochat-p2p`.
- **Dockerfile** only copies 4 crate manifests and creates stub dirs for those 4; it never copies `quicnprotochat-p2p`.
- **Result:** `cargo build --release --bin quicnprotochat-server` can fail (missing workspace member) or behave inconsistently.
- **Fix:** Add `COPY crates/quicnprotochat-p2p/Cargo.toml` and a stub `crates/quicnprotochat-p2p/src` (or equivalent) in the dependency-cache layer so the workspace resolves. Ensure the final `COPY crates/ crates/` still brings in real p2p source.
### 5. **E2E test failing (rustls CryptoProvider)**
- **Symptom:** `e2e_happy_path_register_invite_join_send_recv` panics: *"Could not automatically determine the process-level CryptoProvider"*.
- **Cause:** rustls 0.23 requires a default `CryptoProvider` (e.g. `ring` or `aws-lc-rs`). In the test process, nothing calls `CryptoProvider::install_default()` before the client uses QUIC/rustls.
- **Fix:** In the E2E test (or in a shared test harness), call `rustls::crypto::ring::default_provider().install_default().ok()` (or the chosen provider) once at process start before any QUIC/rustls usage. Ensure the crate has exactly one of the `ring` / `aws-lc-rs` features so the default is unambiguous.
---
## High (security and reliability)
### 6. **Panic risk in client RPC path**
- **`quicnprotochat-client/src/lib.rs`:** `set_auth()` uses `.expect("init_auth must be called with a non-empty token before RPCs")`. If RPC is called without `init_auth`, the process panics.
- **Fix:** Replace with a `Result` or an error return (e.g. a dedicated error type) so callers get a recoverable error instead of a panic. Document that `init_auth` must be called before RPCs.
### 7. **Mutex `.unwrap()` in production paths**
- **`sql_store.rs`:** All `self.conn.lock().unwrap()` calls can panic if the mutex is poisoned.
- **`storage.rs` (file backend):** Same pattern with `.lock().unwrap()` on shared maps.
- **Coding standards:** Prefer handling `Result` from `lock()` (e.g. `lock().map_err(...)?`) or use a type that encapsulates poisoning so production paths dont panic on contention/poison.
### 8. **`unwrap()` in client library**
- **`lib.rs`:** `"0.0.0.0:0".parse().unwrap()` for the client endpoint. If parsing ever changed or failed, this would panic.
- **Fix:** Use `.context("parse client bind address")?` (or equivalent) so this is a proper error path.
### 9. **TLS certificate generation is silent on first run**
- **Server** auto-generates a self-signed cert if files are missing. Production readiness WBS says: *"Self-signed certificates acceptable for development; production deployments must use a CA-signed certificate or certificate pinning."*
- **Fix:** Add a startup check (e.g. env or config flag) that in production rejects auto-generation and requires existing cert/key paths. Log clearly when running with self-signed certs so operators know theyre in dev mode.
---
## Medium (hygiene and ops)
### 10. **No CI pipeline**
- **Production Readiness WBS** expects: GitHub Actions with `cargo test --workspace`, `cargo clippy`, `cargo fmt --check`, `cargo deny check`.
- **Current state:** No `.github/workflows` (or equivalent) found.
- **Fix:** Add a CI workflow that runs tests, clippy, fmt, and deny so every PR is validated.
### 11. **No CODEOWNERS**
- WBS requires CODEOWNERS for review ownership and security-sensitive changes.
- **Fix:** Add `.github/CODEOWNERS` mapping crates to owners.
### 12. **No dependency audit in CI**
- WBS mentions `cargo audit` in CI.
- **Fix:** Add a CI job that runs `cargo audit` (and optionally `cargo deny check`) and fails on known vulnerabilities or policy violations.
### 13. **No `deny.toml` / `deny.toml` config**
- Coding standards reference `cargo deny check`; no config file was found.
- **Fix:** Add `deny.toml` (or equivalent) and run `cargo deny check` in CI.
### 14. **Warnings in build**
- **Cap'n Proto generated code:** `unused_parens` in generated `.rs` files. Standards allow `#[allow(...)]` on generated code; consider suppressing in the codegen output or in the crate that includes it.
- **Server:** `SessionInfo` has `username` and `identity_key` never read (dead code). Either use them (e.g. audit logging) or remove/allow with a short comment.
- **E2E test:** Deprecated `cargo_bin`, `unused_mut`; trivial to fix.
- **openmls:** Future-incompat warning; track upstream and plan upgrade.
### 15. **Docker image runs as `nobody`**
- **Dockerfile** uses `USER nobody`. Good for not running as root, but `nobody` may not have a writable home or data dir.
- **Fix:** Ensure `QUICNPROTOCHAT_DATA_DIR` (and cert paths) point to a directory writable by `nobody`, or create a dedicated user/group with a known UID and use that in the Dockerfile and docs.
---
## Already in good shape
- **Auth token comparison:** Uses `subtle::ConstantTimeEq` (`ct_eq`) for the static token — good.
- **Input validation:** Recipient key length (32), payload size (5 MB), wire version, rate limiting, queue depth — present and consistent.
- **Structured logging:** `tracing` with env filter; no secret material in log messages in the reviewed paths.
- **Error handling:** RPC handlers return coded errors; no `unwrap()` on crypto in server RPC paths.
- **Health endpoint:** Server exposes a health RPC used by E2E and can be used for readiness probes.
---
## Summary checklist
| Area | Status | Action |
|-------------------|----------|--------|
| Auth / tokens | Fix | Require strong auth in prod; document devtoken / empty db_key |
| DB encryption | Fix | Require non-empty db_key in production |
| .gitignore | Fix | Add `data/` (and cert/DB patterns as needed) |
| Dockerfile | Fix | Include p2p crate in workspace build |
| E2E test | Fix | Set rustls CryptoProvider in test harness |
| Client panic | Improve | Replace expect with Result in set_auth |
| Mutex unwrap | Improve | Handle poison or use non-panicking API |
| TLS in production| Improve | Reject auto-generated cert in prod mode |
| CI / CODEOWNERS | Add | GitHub Actions, deny, audit, CODEOWNERS |
| Warnings | Clean up | Dead code, deprecated APIs, generated allows |
This audit should be revisited after implementing Phase 12 of the [Production Readiness WBS](src/roadmap/production-readiness.md) and before any production deployment.