# Production Readiness Audit This document summarizes issues and fixes needed to get quicproquo production-ready, based on a codebase review. It aligns with the existing [Production Readiness WBS](src/roadmap/production-readiness.md) and [Coding Standards](src/contributing/coding-standards.md). --- ## Critical (fix before production) ### 1. **Auth token and dev defaults** - **README and example config** use `auth_token = "devtoken"` and `db_key = ""`. - **Risk:** Deploying with default/example config allows weak or no auth and unencrypted DB. - **Fix:** Require explicit `QPQ_AUTH_TOKEN` (or config) in production; reject empty or `"devtoken"` when a production mode/env is set. Document that `db_key` empty disables SQLCipher and is not acceptable for production. ### 2. **Database encryption optional** - **`sql_store.rs`:** If `db_key` is empty, SQLCipher is not applied; DB is plaintext on disk. - **Fix:** In production, require non-empty `db_key` (or fail startup with a clear error). Document in README and deployment docs. ### 3. **Secrets and generated files not ignored** - **`.gitignore`** does not include `data/`, so `data/server-cert.der`, `data/server-key.der`, and `data/qpq.db` could be committed. - **Fix:** Add `data/` (and any other dirs that hold certs, keys, or DBs) to `.gitignore`. Consider adding `*.der` and `*.db` if used only for local/dev. ### 4. **Dockerfile out of sync with workspace** - **Workspace** has 5 members including `crates/quicproquo-p2p`. - **Dockerfile** only copies 4 crate manifests and creates stub dirs for those 4; it never copies `quicproquo-p2p`. - **Result:** `cargo build --release --bin quicproquo-server` can fail (missing workspace member) or behave inconsistently. - **Fix:** Add `COPY crates/quicproquo-p2p/Cargo.toml` and a stub `crates/quicproquo-p2p/src` (or equivalent) in the dependency-cache layer so the workspace resolves. Ensure the final `COPY crates/ crates/` still brings in real p2p source. ### 5. **E2E test failing (rustls CryptoProvider)** - **Symptom:** `e2e_happy_path_register_invite_join_send_recv` panics: *"Could not automatically determine the process-level CryptoProvider"*. - **Cause:** rustls 0.23 requires a default `CryptoProvider` (e.g. `ring` or `aws-lc-rs`). In the test process, nothing calls `CryptoProvider::install_default()` before the client uses QUIC/rustls. - **Fix:** In the E2E test (or in a shared test harness), call `rustls::crypto::ring::default_provider().install_default().ok()` (or the chosen provider) once at process start before any QUIC/rustls usage. Ensure the crate has exactly one of the `ring` / `aws-lc-rs` features so the default is unambiguous. --- ## High (security and reliability) ### 6. **Panic risk in client RPC path** - **`quicproquo-client/src/lib.rs`:** `set_auth()` uses `.expect("init_auth must be called with a non-empty token before RPCs")`. If RPC is called without `init_auth`, the process panics. - **Fix:** Replace with a `Result` or an error return (e.g. a dedicated error type) so callers get a recoverable error instead of a panic. Document that `init_auth` must be called before RPCs. ### 7. **Mutex `.unwrap()` in production paths** - **`sql_store.rs`:** All `self.conn.lock().unwrap()` calls can panic if the mutex is poisoned. - **`storage.rs` (file backend):** Same pattern with `.lock().unwrap()` on shared maps. - **Coding standards:** Prefer handling `Result` from `lock()` (e.g. `lock().map_err(...)?`) or use a type that encapsulates poisoning so production paths don’t panic on contention/poison. ### 8. **`unwrap()` in client library** - **`lib.rs`:** `"0.0.0.0:0".parse().unwrap()` for the client endpoint. If parsing ever changed or failed, this would panic. - **Fix:** Use `.context("parse client bind address")?` (or equivalent) so this is a proper error path. ### 9. **TLS certificate generation is silent on first run** - **Server** auto-generates a self-signed cert if files are missing. Production readiness WBS says: *"Self-signed certificates acceptable for development; production deployments must use a CA-signed certificate or certificate pinning."* - **Fix:** Add a startup check (e.g. env or config flag) that in production rejects auto-generation and requires existing cert/key paths. Log clearly when running with self-signed certs so operators know they’re in dev mode. --- ## Medium (hygiene and ops) ### 10. **No CI pipeline** - **Production Readiness WBS** expects: GitHub Actions with `cargo test --workspace`, `cargo clippy`, `cargo fmt --check`, `cargo deny check`. - **Current state:** No `.github/workflows` (or equivalent) found. - **Fix:** Add a CI workflow that runs tests, clippy, fmt, and deny so every PR is validated. ### 11. **No CODEOWNERS** - WBS requires CODEOWNERS for review ownership and security-sensitive changes. - **Fix:** Add `.github/CODEOWNERS` mapping crates to owners. ### 12. **No dependency audit in CI** - WBS mentions `cargo audit` in CI. - **Fix:** Add a CI job that runs `cargo audit` (and optionally `cargo deny check`) and fails on known vulnerabilities or policy violations. ### 13. **No `deny.toml` / `deny.toml` config** - Coding standards reference `cargo deny check`; no config file was found. - **Fix:** Add `deny.toml` (or equivalent) and run `cargo deny check` in CI. ### 14. **Warnings in build** - **Cap'n Proto generated code:** `unused_parens` in generated `.rs` files. Standards allow `#[allow(...)]` on generated code; consider suppressing in the codegen output or in the crate that includes it. - **Server:** `SessionInfo` has `username` and `identity_key` never read (dead code). Either use them (e.g. audit logging) or remove/allow with a short comment. - **E2E test:** Deprecated `cargo_bin`, `unused_mut`; trivial to fix. - **openmls:** Future-incompat warning; track upstream and plan upgrade. ### 15. **Docker image runs as `nobody`** - **Dockerfile** uses `USER nobody`. Good for not running as root, but `nobody` may not have a writable home or data dir. - **Fix:** Ensure `QPQ_DATA_DIR` (and cert paths) point to a directory writable by `nobody`, or create a dedicated user/group with a known UID and use that in the Dockerfile and docs. --- ## Already in good shape - **Auth token comparison:** Uses `subtle::ConstantTimeEq` (`ct_eq`) for the static token — good. - **Input validation:** Recipient key length (32), payload size (5 MB), wire version, rate limiting, queue depth — present and consistent. - **Structured logging:** `tracing` with env filter; no secret material in log messages in the reviewed paths. - **Error handling:** RPC handlers return coded errors; no `unwrap()` on crypto in server RPC paths. - **Health endpoint:** Server exposes a health RPC used by E2E and can be used for readiness probes. --- ## Summary checklist | Area | Status | Action | |-------------------|----------|--------| | Auth / tokens | Fix | Require strong auth in prod; document devtoken / empty db_key | | DB encryption | Fix | Require non-empty db_key in production | | .gitignore | Fix | Add `data/` (and cert/DB patterns as needed) | | Dockerfile | Fix | Include p2p crate in workspace build | | E2E test | Fix | Set rustls CryptoProvider in test harness | | Client panic | Improve | Replace expect with Result in set_auth | | Mutex unwrap | Improve | Handle poison or use non-panicking API | | TLS in production| Improve | Reject auto-generated cert in prod mode | | CI / CODEOWNERS | Add | GitHub Actions, deny, audit, CODEOWNERS | | Warnings | Clean up | Dead code, deprecated APIs, generated allows | This audit should be revisited after implementing Phase 1–2 of the [Production Readiness WBS](src/roadmap/production-readiness.md) and before any production deployment.