feat: add protocol comparison docs, P2P crate, production audit, and design fixes

Add comprehensive documentation comparing quicnprotochat against classical
chat protocols (IRC+SSL, XMPP, Telegram) with diagrams and attack scenarios.
Promote comparison pages to top-level sidebar section. Include P2P transport
crate (iroh), production readiness audit, CI workflows, dependency policy,
and continued architecture improvements across all crates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
2026-02-22 12:15:44 +01:00
parent 0bdc222724
commit 00b0aa92a1
28 changed files with 1566 additions and 340 deletions

View File

@@ -0,0 +1,127 @@
# Production Readiness Audit
This document summarizes issues and fixes needed to get quicnprotochat production-ready, based on a codebase review. It aligns with the existing [Production Readiness WBS](src/roadmap/production-readiness.md) and [Coding Standards](src/contributing/coding-standards.md).
---
## Critical (fix before production)
### 1. **Auth token and dev defaults**
- **README and example config** use `auth_token = "devtoken"` and `db_key = ""`.
- **Risk:** Deploying with default/example config allows weak or no auth and unencrypted DB.
- **Fix:** Require explicit `QUICNPROTOCHAT_AUTH_TOKEN` (or config) in production; reject empty or `"devtoken"` when a production mode/env is set. Document that `db_key` empty disables SQLCipher and is not acceptable for production.
### 2. **Database encryption optional**
- **`sql_store.rs`:** If `db_key` is empty, SQLCipher is not applied; DB is plaintext on disk.
- **Fix:** In production, require non-empty `db_key` (or fail startup with a clear error). Document in README and deployment docs.
### 3. **Secrets and generated files not ignored**
- **`.gitignore`** does not include `data/`, so `data/server-cert.der`, `data/server-key.der`, and `data/quicnprotochat.db` could be committed.
- **Fix:** Add `data/` (and any other dirs that hold certs, keys, or DBs) to `.gitignore`. Consider adding `*.der` and `*.db` if used only for local/dev.
### 4. **Dockerfile out of sync with workspace**
- **Workspace** has 5 members including `crates/quicnprotochat-p2p`.
- **Dockerfile** only copies 4 crate manifests and creates stub dirs for those 4; it never copies `quicnprotochat-p2p`.
- **Result:** `cargo build --release --bin quicnprotochat-server` can fail (missing workspace member) or behave inconsistently.
- **Fix:** Add `COPY crates/quicnprotochat-p2p/Cargo.toml` and a stub `crates/quicnprotochat-p2p/src` (or equivalent) in the dependency-cache layer so the workspace resolves. Ensure the final `COPY crates/ crates/` still brings in real p2p source.
### 5. **E2E test failing (rustls CryptoProvider)**
- **Symptom:** `e2e_happy_path_register_invite_join_send_recv` panics: *"Could not automatically determine the process-level CryptoProvider"*.
- **Cause:** rustls 0.23 requires a default `CryptoProvider` (e.g. `ring` or `aws-lc-rs`). In the test process, nothing calls `CryptoProvider::install_default()` before the client uses QUIC/rustls.
- **Fix:** In the E2E test (or in a shared test harness), call `rustls::crypto::ring::default_provider().install_default().ok()` (or the chosen provider) once at process start before any QUIC/rustls usage. Ensure the crate has exactly one of the `ring` / `aws-lc-rs` features so the default is unambiguous.
---
## High (security and reliability)
### 6. **Panic risk in client RPC path**
- **`quicnprotochat-client/src/lib.rs`:** `set_auth()` uses `.expect("init_auth must be called with a non-empty token before RPCs")`. If RPC is called without `init_auth`, the process panics.
- **Fix:** Replace with a `Result` or an error return (e.g. a dedicated error type) so callers get a recoverable error instead of a panic. Document that `init_auth` must be called before RPCs.
### 7. **Mutex `.unwrap()` in production paths**
- **`sql_store.rs`:** All `self.conn.lock().unwrap()` calls can panic if the mutex is poisoned.
- **`storage.rs` (file backend):** Same pattern with `.lock().unwrap()` on shared maps.
- **Coding standards:** Prefer handling `Result` from `lock()` (e.g. `lock().map_err(...)?`) or use a type that encapsulates poisoning so production paths dont panic on contention/poison.
### 8. **`unwrap()` in client library**
- **`lib.rs`:** `"0.0.0.0:0".parse().unwrap()` for the client endpoint. If parsing ever changed or failed, this would panic.
- **Fix:** Use `.context("parse client bind address")?` (or equivalent) so this is a proper error path.
### 9. **TLS certificate generation is silent on first run**
- **Server** auto-generates a self-signed cert if files are missing. Production readiness WBS says: *"Self-signed certificates acceptable for development; production deployments must use a CA-signed certificate or certificate pinning."*
- **Fix:** Add a startup check (e.g. env or config flag) that in production rejects auto-generation and requires existing cert/key paths. Log clearly when running with self-signed certs so operators know theyre in dev mode.
---
## Medium (hygiene and ops)
### 10. **No CI pipeline**
- **Production Readiness WBS** expects: GitHub Actions with `cargo test --workspace`, `cargo clippy`, `cargo fmt --check`, `cargo deny check`.
- **Current state:** No `.github/workflows` (or equivalent) found.
- **Fix:** Add a CI workflow that runs tests, clippy, fmt, and deny so every PR is validated.
### 11. **No CODEOWNERS**
- WBS requires CODEOWNERS for review ownership and security-sensitive changes.
- **Fix:** Add `.github/CODEOWNERS` mapping crates to owners.
### 12. **No dependency audit in CI**
- WBS mentions `cargo audit` in CI.
- **Fix:** Add a CI job that runs `cargo audit` (and optionally `cargo deny check`) and fails on known vulnerabilities or policy violations.
### 13. **No `deny.toml` / `deny.toml` config**
- Coding standards reference `cargo deny check`; no config file was found.
- **Fix:** Add `deny.toml` (or equivalent) and run `cargo deny check` in CI.
### 14. **Warnings in build**
- **Cap'n Proto generated code:** `unused_parens` in generated `.rs` files. Standards allow `#[allow(...)]` on generated code; consider suppressing in the codegen output or in the crate that includes it.
- **Server:** `SessionInfo` has `username` and `identity_key` never read (dead code). Either use them (e.g. audit logging) or remove/allow with a short comment.
- **E2E test:** Deprecated `cargo_bin`, `unused_mut`; trivial to fix.
- **openmls:** Future-incompat warning; track upstream and plan upgrade.
### 15. **Docker image runs as `nobody`**
- **Dockerfile** uses `USER nobody`. Good for not running as root, but `nobody` may not have a writable home or data dir.
- **Fix:** Ensure `QUICNPROTOCHAT_DATA_DIR` (and cert paths) point to a directory writable by `nobody`, or create a dedicated user/group with a known UID and use that in the Dockerfile and docs.
---
## Already in good shape
- **Auth token comparison:** Uses `subtle::ConstantTimeEq` (`ct_eq`) for the static token — good.
- **Input validation:** Recipient key length (32), payload size (5 MB), wire version, rate limiting, queue depth — present and consistent.
- **Structured logging:** `tracing` with env filter; no secret material in log messages in the reviewed paths.
- **Error handling:** RPC handlers return coded errors; no `unwrap()` on crypto in server RPC paths.
- **Health endpoint:** Server exposes a health RPC used by E2E and can be used for readiness probes.
---
## Summary checklist
| Area | Status | Action |
|-------------------|----------|--------|
| Auth / tokens | Fix | Require strong auth in prod; document devtoken / empty db_key |
| DB encryption | Fix | Require non-empty db_key in production |
| .gitignore | Fix | Add `data/` (and cert/DB patterns as needed) |
| Dockerfile | Fix | Include p2p crate in workspace build |
| E2E test | Fix | Set rustls CryptoProvider in test harness |
| Client panic | Improve | Replace expect with Result in set_auth |
| Mutex unwrap | Improve | Handle poison or use non-panicking API |
| TLS in production| Improve | Reject auto-generated cert in prod mode |
| CI / CODEOWNERS | Add | GitHub Actions, deny, audit, CODEOWNERS |
| Warnings | Clean up | Dead code, deprecated APIs, generated allows |
This audit should be revisited after implementing Phase 12 of the [Production Readiness WBS](src/roadmap/production-readiness.md) and before any production deployment.