Files
quicproquo/docs/PRODUCTION-READINESS-AUDIT.md
Christian Nennemann 00b0aa92a1 feat: add protocol comparison docs, P2P crate, production audit, and design fixes
Add comprehensive documentation comparing quicnprotochat against classical
chat protocols (IRC+SSL, XMPP, Telegram) with diagrams and attack scenarios.
Promote comparison pages to top-level sidebar section. Include P2P transport
crate (iroh), production readiness audit, CI workflows, dependency policy,
and continued architecture improvements across all crates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 12:15:44 +01:00

7.7 KiB
Raw Permalink Blame History

Production Readiness Audit

This document summarizes issues and fixes needed to get quicnprotochat production-ready, based on a codebase review. It aligns with the existing Production Readiness WBS and Coding Standards.


Critical (fix before production)

1. Auth token and dev defaults

  • README and example config use auth_token = "devtoken" and db_key = "".
  • Risk: Deploying with default/example config allows weak or no auth and unencrypted DB.
  • Fix: Require explicit QUICNPROTOCHAT_AUTH_TOKEN (or config) in production; reject empty or "devtoken" when a production mode/env is set. Document that db_key empty disables SQLCipher and is not acceptable for production.

2. Database encryption optional

  • sql_store.rs: If db_key is empty, SQLCipher is not applied; DB is plaintext on disk.
  • Fix: In production, require non-empty db_key (or fail startup with a clear error). Document in README and deployment docs.

3. Secrets and generated files not ignored

  • .gitignore does not include data/, so data/server-cert.der, data/server-key.der, and data/quicnprotochat.db could be committed.
  • Fix: Add data/ (and any other dirs that hold certs, keys, or DBs) to .gitignore. Consider adding *.der and *.db if used only for local/dev.

4. Dockerfile out of sync with workspace

  • Workspace has 5 members including crates/quicnprotochat-p2p.
  • Dockerfile only copies 4 crate manifests and creates stub dirs for those 4; it never copies quicnprotochat-p2p.
  • Result: cargo build --release --bin quicnprotochat-server can fail (missing workspace member) or behave inconsistently.
  • Fix: Add COPY crates/quicnprotochat-p2p/Cargo.toml and a stub crates/quicnprotochat-p2p/src (or equivalent) in the dependency-cache layer so the workspace resolves. Ensure the final COPY crates/ crates/ still brings in real p2p source.

5. E2E test failing (rustls CryptoProvider)

  • Symptom: e2e_happy_path_register_invite_join_send_recv panics: "Could not automatically determine the process-level CryptoProvider".
  • Cause: rustls 0.23 requires a default CryptoProvider (e.g. ring or aws-lc-rs). In the test process, nothing calls CryptoProvider::install_default() before the client uses QUIC/rustls.
  • Fix: In the E2E test (or in a shared test harness), call rustls::crypto::ring::default_provider().install_default().ok() (or the chosen provider) once at process start before any QUIC/rustls usage. Ensure the crate has exactly one of the ring / aws-lc-rs features so the default is unambiguous.

High (security and reliability)

6. Panic risk in client RPC path

  • quicnprotochat-client/src/lib.rs: set_auth() uses .expect("init_auth must be called with a non-empty token before RPCs"). If RPC is called without init_auth, the process panics.
  • Fix: Replace with a Result or an error return (e.g. a dedicated error type) so callers get a recoverable error instead of a panic. Document that init_auth must be called before RPCs.

7. Mutex .unwrap() in production paths

  • sql_store.rs: All self.conn.lock().unwrap() calls can panic if the mutex is poisoned.
  • storage.rs (file backend): Same pattern with .lock().unwrap() on shared maps.
  • Coding standards: Prefer handling Result from lock() (e.g. lock().map_err(...)?) or use a type that encapsulates poisoning so production paths dont panic on contention/poison.

8. unwrap() in client library

  • lib.rs: "0.0.0.0:0".parse().unwrap() for the client endpoint. If parsing ever changed or failed, this would panic.
  • Fix: Use .context("parse client bind address")? (or equivalent) so this is a proper error path.

9. TLS certificate generation is silent on first run

  • Server auto-generates a self-signed cert if files are missing. Production readiness WBS says: "Self-signed certificates acceptable for development; production deployments must use a CA-signed certificate or certificate pinning."
  • Fix: Add a startup check (e.g. env or config flag) that in production rejects auto-generation and requires existing cert/key paths. Log clearly when running with self-signed certs so operators know theyre in dev mode.

Medium (hygiene and ops)

10. No CI pipeline

  • Production Readiness WBS expects: GitHub Actions with cargo test --workspace, cargo clippy, cargo fmt --check, cargo deny check.
  • Current state: No .github/workflows (or equivalent) found.
  • Fix: Add a CI workflow that runs tests, clippy, fmt, and deny so every PR is validated.

11. No CODEOWNERS

  • WBS requires CODEOWNERS for review ownership and security-sensitive changes.
  • Fix: Add .github/CODEOWNERS mapping crates to owners.

12. No dependency audit in CI

  • WBS mentions cargo audit in CI.
  • Fix: Add a CI job that runs cargo audit (and optionally cargo deny check) and fails on known vulnerabilities or policy violations.

13. No deny.toml / deny.toml config

  • Coding standards reference cargo deny check; no config file was found.
  • Fix: Add deny.toml (or equivalent) and run cargo deny check in CI.

14. Warnings in build

  • Cap'n Proto generated code: unused_parens in generated .rs files. Standards allow #[allow(...)] on generated code; consider suppressing in the codegen output or in the crate that includes it.
  • Server: SessionInfo has username and identity_key never read (dead code). Either use them (e.g. audit logging) or remove/allow with a short comment.
  • E2E test: Deprecated cargo_bin, unused_mut; trivial to fix.
  • openmls: Future-incompat warning; track upstream and plan upgrade.

15. Docker image runs as nobody

  • Dockerfile uses USER nobody. Good for not running as root, but nobody may not have a writable home or data dir.
  • Fix: Ensure QUICNPROTOCHAT_DATA_DIR (and cert paths) point to a directory writable by nobody, or create a dedicated user/group with a known UID and use that in the Dockerfile and docs.

Already in good shape

  • Auth token comparison: Uses subtle::ConstantTimeEq (ct_eq) for the static token — good.
  • Input validation: Recipient key length (32), payload size (5 MB), wire version, rate limiting, queue depth — present and consistent.
  • Structured logging: tracing with env filter; no secret material in log messages in the reviewed paths.
  • Error handling: RPC handlers return coded errors; no unwrap() on crypto in server RPC paths.
  • Health endpoint: Server exposes a health RPC used by E2E and can be used for readiness probes.

Summary checklist

Area Status Action
Auth / tokens Fix Require strong auth in prod; document devtoken / empty db_key
DB encryption Fix Require non-empty db_key in production
.gitignore Fix Add data/ (and cert/DB patterns as needed)
Dockerfile Fix Include p2p crate in workspace build
E2E test Fix Set rustls CryptoProvider in test harness
Client panic Improve Replace expect with Result in set_auth
Mutex unwrap Improve Handle poison or use non-panicking API
TLS in production Improve Reject auto-generated cert in prod mode
CI / CODEOWNERS Add GitHub Actions, deny, audit, CODEOWNERS
Warnings Clean up Dead code, deprecated APIs, generated allows

This audit should be revisited after implementing Phase 12 of the Production Readiness WBS and before any production deployment.