feat: add protocol comparison docs, P2P crate, production audit, and design fixes
Add comprehensive documentation comparing quicnprotochat against classical chat protocols (IRC+SSL, XMPP, Telegram) with diagrams and attack scenarios. Promote comparison pages to top-level sidebar section. Include P2P transport crate (iroh), production readiness audit, CI workflows, dependency policy, and continued architecture improvements across all crates. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
127
docs/PRODUCTION-READINESS-AUDIT.md
Normal file
127
docs/PRODUCTION-READINESS-AUDIT.md
Normal file
@@ -0,0 +1,127 @@
|
||||
# Production Readiness Audit
|
||||
|
||||
This document summarizes issues and fixes needed to get quicnprotochat production-ready, based on a codebase review. It aligns with the existing [Production Readiness WBS](src/roadmap/production-readiness.md) and [Coding Standards](src/contributing/coding-standards.md).
|
||||
|
||||
---
|
||||
|
||||
## Critical (fix before production)
|
||||
|
||||
### 1. **Auth token and dev defaults**
|
||||
|
||||
- **README and example config** use `auth_token = "devtoken"` and `db_key = ""`.
|
||||
- **Risk:** Deploying with default/example config allows weak or no auth and unencrypted DB.
|
||||
- **Fix:** Require explicit `QUICNPROTOCHAT_AUTH_TOKEN` (or config) in production; reject empty or `"devtoken"` when a production mode/env is set. Document that `db_key` empty disables SQLCipher and is not acceptable for production.
|
||||
|
||||
### 2. **Database encryption optional**
|
||||
|
||||
- **`sql_store.rs`:** If `db_key` is empty, SQLCipher is not applied; DB is plaintext on disk.
|
||||
- **Fix:** In production, require non-empty `db_key` (or fail startup with a clear error). Document in README and deployment docs.
|
||||
|
||||
### 3. **Secrets and generated files not ignored**
|
||||
|
||||
- **`.gitignore`** does not include `data/`, so `data/server-cert.der`, `data/server-key.der`, and `data/quicnprotochat.db` could be committed.
|
||||
- **Fix:** Add `data/` (and any other dirs that hold certs, keys, or DBs) to `.gitignore`. Consider adding `*.der` and `*.db` if used only for local/dev.
|
||||
|
||||
### 4. **Dockerfile out of sync with workspace**
|
||||
|
||||
- **Workspace** has 5 members including `crates/quicnprotochat-p2p`.
|
||||
- **Dockerfile** only copies 4 crate manifests and creates stub dirs for those 4; it never copies `quicnprotochat-p2p`.
|
||||
- **Result:** `cargo build --release --bin quicnprotochat-server` can fail (missing workspace member) or behave inconsistently.
|
||||
- **Fix:** Add `COPY crates/quicnprotochat-p2p/Cargo.toml` and a stub `crates/quicnprotochat-p2p/src` (or equivalent) in the dependency-cache layer so the workspace resolves. Ensure the final `COPY crates/ crates/` still brings in real p2p source.
|
||||
|
||||
### 5. **E2E test failing (rustls CryptoProvider)**
|
||||
|
||||
- **Symptom:** `e2e_happy_path_register_invite_join_send_recv` panics: *"Could not automatically determine the process-level CryptoProvider"*.
|
||||
- **Cause:** rustls 0.23 requires a default `CryptoProvider` (e.g. `ring` or `aws-lc-rs`). In the test process, nothing calls `CryptoProvider::install_default()` before the client uses QUIC/rustls.
|
||||
- **Fix:** In the E2E test (or in a shared test harness), call `rustls::crypto::ring::default_provider().install_default().ok()` (or the chosen provider) once at process start before any QUIC/rustls usage. Ensure the crate has exactly one of the `ring` / `aws-lc-rs` features so the default is unambiguous.
|
||||
|
||||
---
|
||||
|
||||
## High (security and reliability)
|
||||
|
||||
### 6. **Panic risk in client RPC path**
|
||||
|
||||
- **`quicnprotochat-client/src/lib.rs`:** `set_auth()` uses `.expect("init_auth must be called with a non-empty token before RPCs")`. If RPC is called without `init_auth`, the process panics.
|
||||
- **Fix:** Replace with a `Result` or an error return (e.g. a dedicated error type) so callers get a recoverable error instead of a panic. Document that `init_auth` must be called before RPCs.
|
||||
|
||||
### 7. **Mutex `.unwrap()` in production paths**
|
||||
|
||||
- **`sql_store.rs`:** All `self.conn.lock().unwrap()` calls can panic if the mutex is poisoned.
|
||||
- **`storage.rs` (file backend):** Same pattern with `.lock().unwrap()` on shared maps.
|
||||
- **Coding standards:** Prefer handling `Result` from `lock()` (e.g. `lock().map_err(...)?`) or use a type that encapsulates poisoning so production paths don’t panic on contention/poison.
|
||||
|
||||
### 8. **`unwrap()` in client library**
|
||||
|
||||
- **`lib.rs`:** `"0.0.0.0:0".parse().unwrap()` for the client endpoint. If parsing ever changed or failed, this would panic.
|
||||
- **Fix:** Use `.context("parse client bind address")?` (or equivalent) so this is a proper error path.
|
||||
|
||||
### 9. **TLS certificate generation is silent on first run**
|
||||
|
||||
- **Server** auto-generates a self-signed cert if files are missing. Production readiness WBS says: *"Self-signed certificates acceptable for development; production deployments must use a CA-signed certificate or certificate pinning."*
|
||||
- **Fix:** Add a startup check (e.g. env or config flag) that in production rejects auto-generation and requires existing cert/key paths. Log clearly when running with self-signed certs so operators know they’re in dev mode.
|
||||
|
||||
---
|
||||
|
||||
## Medium (hygiene and ops)
|
||||
|
||||
### 10. **No CI pipeline**
|
||||
|
||||
- **Production Readiness WBS** expects: GitHub Actions with `cargo test --workspace`, `cargo clippy`, `cargo fmt --check`, `cargo deny check`.
|
||||
- **Current state:** No `.github/workflows` (or equivalent) found.
|
||||
- **Fix:** Add a CI workflow that runs tests, clippy, fmt, and deny so every PR is validated.
|
||||
|
||||
### 11. **No CODEOWNERS**
|
||||
|
||||
- WBS requires CODEOWNERS for review ownership and security-sensitive changes.
|
||||
- **Fix:** Add `.github/CODEOWNERS` mapping crates to owners.
|
||||
|
||||
### 12. **No dependency audit in CI**
|
||||
|
||||
- WBS mentions `cargo audit` in CI.
|
||||
- **Fix:** Add a CI job that runs `cargo audit` (and optionally `cargo deny check`) and fails on known vulnerabilities or policy violations.
|
||||
|
||||
### 13. **No `deny.toml` / `deny.toml` config**
|
||||
|
||||
- Coding standards reference `cargo deny check`; no config file was found.
|
||||
- **Fix:** Add `deny.toml` (or equivalent) and run `cargo deny check` in CI.
|
||||
|
||||
### 14. **Warnings in build**
|
||||
|
||||
- **Cap'n Proto generated code:** `unused_parens` in generated `.rs` files. Standards allow `#[allow(...)]` on generated code; consider suppressing in the codegen output or in the crate that includes it.
|
||||
- **Server:** `SessionInfo` has `username` and `identity_key` never read (dead code). Either use them (e.g. audit logging) or remove/allow with a short comment.
|
||||
- **E2E test:** Deprecated `cargo_bin`, `unused_mut`; trivial to fix.
|
||||
- **openmls:** Future-incompat warning; track upstream and plan upgrade.
|
||||
|
||||
### 15. **Docker image runs as `nobody`**
|
||||
|
||||
- **Dockerfile** uses `USER nobody`. Good for not running as root, but `nobody` may not have a writable home or data dir.
|
||||
- **Fix:** Ensure `QUICNPROTOCHAT_DATA_DIR` (and cert paths) point to a directory writable by `nobody`, or create a dedicated user/group with a known UID and use that in the Dockerfile and docs.
|
||||
|
||||
---
|
||||
|
||||
## Already in good shape
|
||||
|
||||
- **Auth token comparison:** Uses `subtle::ConstantTimeEq` (`ct_eq`) for the static token — good.
|
||||
- **Input validation:** Recipient key length (32), payload size (5 MB), wire version, rate limiting, queue depth — present and consistent.
|
||||
- **Structured logging:** `tracing` with env filter; no secret material in log messages in the reviewed paths.
|
||||
- **Error handling:** RPC handlers return coded errors; no `unwrap()` on crypto in server RPC paths.
|
||||
- **Health endpoint:** Server exposes a health RPC used by E2E and can be used for readiness probes.
|
||||
|
||||
---
|
||||
|
||||
## Summary checklist
|
||||
|
||||
| Area | Status | Action |
|
||||
|-------------------|----------|--------|
|
||||
| Auth / tokens | Fix | Require strong auth in prod; document devtoken / empty db_key |
|
||||
| DB encryption | Fix | Require non-empty db_key in production |
|
||||
| .gitignore | Fix | Add `data/` (and cert/DB patterns as needed) |
|
||||
| Dockerfile | Fix | Include p2p crate in workspace build |
|
||||
| E2E test | Fix | Set rustls CryptoProvider in test harness |
|
||||
| Client panic | Improve | Replace expect with Result in set_auth |
|
||||
| Mutex unwrap | Improve | Handle poison or use non-panicking API |
|
||||
| TLS in production| Improve | Reject auto-generated cert in prod mode |
|
||||
| CI / CODEOWNERS | Add | GitHub Actions, deny, audit, CODEOWNERS |
|
||||
| Warnings | Clean up | Dead code, deprecated APIs, generated allows |
|
||||
|
||||
This audit should be revisited after implementing Phase 1–2 of the [Production Readiness WBS](src/roadmap/production-readiness.md) and before any production deployment.
|
||||
Reference in New Issue
Block a user