DM channels (createChannel), channel authz, security/docs, future improvements

- Add createChannel RPC (node.capnp @18): create 1:1 channel, returns 16-byte channelId
- Store: create_channel(member_a, member_b), get_channel_members(channel_id)
- FileBackedStore: channels.bin; SqlStore: migration 003_channels, schema v4
- channel_ops: handle_create_channel (auth + identity, peerKey 32 bytes)
- Delivery authz: when channel_id.len() == 16, require caller and recipient are channel members (E022/E023)
- Error codes E022 CHANNEL_ACCESS_DENIED, E023 CHANNEL_NOT_FOUND
- SUMMARY: link Certificate lifecycle; security audit, future improvements, multi-agent plan docs
- Certificate lifecycle doc, SECURITY-AUDIT, FUTURE-IMPROVEMENTS, MULTI-AGENT-WORK-PLAN
- Client/core/tls/auth/server main: assorted fixes and updates from review and audit

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-23 22:54:28 +01:00
parent 6b8b61c6ae
commit 750b794342
40 changed files with 4715 additions and 152 deletions
--- a/docs/FUTURE-IMPROVEMENTS.md
+++ b/docs/FUTURE-IMPROVEMENTS.md
@@ -0,0 +1,182 @@
+# Future Improvements
+
+This document consolidates suggested improvements for quicnprotochat, drawn from the [roadmap](src/roadmap/milestones.md), [production readiness WBS](src/roadmap/production-readiness.md), [security audit](SECURITY-AUDIT.md), [production readiness audit](PRODUCTION-READINESS-AUDIT.md), and [future research](src/roadmap/future-research.md). Items are grouped by theme and ordered by impact and dependency.
+
+---
+
+## 1. Security and hardening
+
+### 1.1 M7 — Post-quantum MLS (next milestone)
+
+- **Goal:** Hybrid X25519 + ML-KEM-768 in the MLS crypto provider so group key material has post-quantum confidentiality.
+- **Ref:** [Milestones § M7](src/roadmap/milestones.md), [Hybrid KEM](src/protocol-layers/hybrid-kem.md).
+- **Status:** Hybrid KEM exists at the envelope level; integrate into OpenMLS provider and run full test suite.
+
+### 1.2 CA-signed TLS / certificate lifecycle
+
+- **Current:** Self-signed certs; client pins by using server cert as `ca_cert`.
+- **Improve:** Document or add support for CA-issued certs (e.g. Let's Encrypt), cert rotation, and optional OCSP/CRL. Keep pinning as the recommended option for single-server deployments.
+- **Ref:** [Threat model § Known gaps](src/cryptography/threat-model.md).
+
+### 1.3 Stronger credential binding
+
+- **Current:** MLS `BasicCredential` (raw Ed25519); no revocation or CA chain.
+- **Improve:** X.509-based MLS credentials, or Key Transparency / verifiable log for public keys to detect substitution.
+- **Ref:** [Threat model](src/cryptography/threat-model.md), [Future research](src/roadmap/future-research.md).
+
+### 1.4 Username enumeration
+
+- **Current:** OPAQUE login start uses `get_user_record`; timing or response shape might reveal user existence.
+- **Improve:** If user enumeration is in scope, consider constant-time or uniform response for unknown users (without weakening OPAQUE).
+- **Ref:** [Security audit § 8.3](SECURITY-AUDIT.md).
+
+---
+
+## 2. Authorization and abuse prevention
+
+### 2.1 Full AUTHZ plan (accounts, devices, tokens)
+
+- **Current:** Bearer/session tokens and identity binding; no formal account/device model.
+- **Improve:** Implement the [authz plan](src/roadmap/authz-plan.md): accounts, devices, device_id in Auth, per-account/per-device rate limits, and binding KeyPackage uploads to the authenticated account.
+- **Ref:** [Production readiness WBS](src/roadmap/production-readiness.md), [Threat model § No client auth on DS](src/cryptography/threat-model.md).
+
+### 2.2 Per-IP and connection limits
+
+- **Current:** Per-token rate limit; no per-IP or global connection cap.
+- **Improve:** Configurable per-IP rate limit and max concurrent QUIC connections to reduce DoS and resource exhaustion.
+- **Ref:** [Production readiness WBS § Abuse / DoS](src/roadmap/production-readiness.md).
+
+---
+
+## 3. Reliability and resilience
+
+### 3.1 Client offline queue and retry
+
+- **Current:** Retry with backoff for RPCs; no offline queue or gap detection.
+- **Improve:** Offline message queue, idempotent message IDs, and gap detection so clients can recover after long disconnects without duplicate or lost messages.
+- **Ref:** [Production readiness WBS § Client resilience](src/roadmap/production-readiness.md).
+
+### 3.2 Connection draining and graceful shutdown
+
+- **Current:** QUIC endpoint closed on ctrl_c; in-flight RPCs may be cut.
+- **Improve:** Draining period: stop accepting new connections, wait for in-flight RPCs (with timeout), then close. Document expected behaviour for load balancers.
+
+### 3.3 N-1 compatibility and wire versioning
+
+- **Current:** `CURRENT_WIRE_VERSION` and server-side check; no formal N-1 support policy.
+- **Improve:** Document supported client/server version matrix and how to deprecate old wire versions safely.
+- **Ref:** [Production readiness WBS § Compatibility](src/roadmap/production-readiness.md).
+
+---
+
+## 4. Operations and observability
+
+### 4.1 CI pipeline
+
+- **Add:** GitHub Actions (or equivalent) for:
+  - `cargo test --workspace`
+  - `cargo clippy`
+  - `cargo fmt --check`
+  - `cargo audit` (and optionally `cargo deny check`)
+- **Ref:** [Production readiness audit § 10](PRODUCTION-READINESS-AUDIT.md).
+
+### 4.2 CODEOWNERS and review policy
+
+- **Add:** `.github/CODEOWNERS` mapping crates to owners; document that security-sensitive changes (crypto, auth, wire format) require two reviewers.
+- **Ref:** [Production readiness WBS § Governance](src/roadmap/production-readiness.md).
+
+### 4.3 Dependency policy (deny.toml)
+
+- **Add:** `deny.toml` (or equivalent) for `cargo deny` (licenses, duplicate crates, banned crates, etc.) and run in CI.
+- **Ref:** [Production readiness audit § 13](PRODUCTION-READINESS-AUDIT.md).
+
+### 4.4 HTTP health endpoint (optional)
+
+- **Current:** Health is an RPC over QUIC; no separate HTTP endpoint.
+- **Improve:** Optional HTTP (e.g. port 8080) `/health` or `/ready` for load balancers and orchestrators that expect HTTP, or document that health is QUIC-only and how to probe it.
+
+### 4.5 Docker user and writable paths
+
+- **Current:** Image runs as `nobody`; data dir may not be writable.
+- **Improve:** Create a dedicated user/group in the image and set `QUICNPROTOCHAT_DATA_DIR` (and cert paths) to a directory writable by that user; document in deployment docs.
+- **Ref:** [Production readiness audit § 15](PRODUCTION-READINESS-AUDIT.md).
+
+---
+
+## 5. Features and product
+
+### 5.1 Private 1:1 channels (DM)
+
+- **Goal:** Channel creation, per-channel authz, TTL, and DM-specific flows so 1:1 chats are first-class and access-controlled.
+- **Ref:** [DM channels](src/roadmap/dm-channels.md), [Production readiness WBS](src/roadmap/production-readiness.md).
+
+### 5.2 MLS lifecycle (remove, update, proposals)
+
+- **Current:** Add member, send, receive; no remove/update or explicit proposal handling.
+- **Improve:** Member remove, credential update, and handling of MLS proposals (Remove, Update) for full group lifecycle.
+- **Ref:** [Milestones § M5](src/roadmap/milestones.md) (optional follow-ups).
+
+### 5.3 Sealed Sender and metadata resistance
+
+- **Goal:** Hide sender identity from the server (sender inside MLS ciphertext); optionally PIR for fetch so server does not learn which queue was accessed.
+- **Ref:** [Threat model § Future mitigations](src/cryptography/threat-model.md), [Future research](src/roadmap/future-research.md).
+
+### 5.4 Traffic analysis resistance
+
+- **Goal:** Padding and/or traffic shaping to reduce inference from message sizes and timing.
+- **Ref:** [Threat model § Future mitigations](src/cryptography/threat-model.md).
+
+---
+
+## 6. Transport and topology
+
+### 6.1 P2P / NAT traversal (iroh, LibP2P)
+
+- **Goal:** Direct peer-to-peer when possible; server as optional relay/rendezvous. Reduces single-point-of-failure and can improve latency.
+- **Ref:** [Future research § LibP2P / iroh](src/roadmap/future-research.md). The `quicnprotochat-p2p` crate is a starting point.
+
+### 6.2 WebTransport (browser client)
+
+- **Goal:** HTTP/3 + WebTransport endpoint so a web client can use the same RPC layer without raw QUIC in the browser.
+- **Ref:** [Future research § WebTransport](src/roadmap/future-research.md).
+
+### 6.3 Tor / I2P
+
+- **Goal:** Optional routing over Tor or I2P to hide client IP and reduce metadata leakage.
+- **Ref:** [Threat model § Future mitigations](src/cryptography/threat-model.md), [Future research](src/roadmap/future-research.md).
+
+---
+
+## 7. Code and maintenance
+
+### 7.1 Warnings and dead code
+
+- **Clean up:** Cap'n Proto generated `unused_parens`; `SessionInfo` dead fields (use or document); E2E deprecated `cargo_bin` and `unused_mut`; track openmls future-incompat.
+- **Ref:** [Production readiness audit § 14](PRODUCTION-READINESS-AUDIT.md).
+
+### 7.2 Integration and E2E coverage
+
+- **Add:** More integration tests (e.g. auth + delivery together, failure paths, concurrent register, rate limit, queue full). Broader E2E scenarios (multi-party, rejoin, key refresh).
+- **Ref:** [Multi-perspective review](SECURITY-AUDIT.md) maintainability section.
+
+---
+
+## Priority overview
+
+| Priority | Theme | Examples |
+|----------|--------|----------|
+| **High** | Security | M7 PQ, CA/pinning docs, AUTHZ plan, CI + audit |
+| **High** | Ops | CI, CODEOWNERS, deny.toml, Docker user/paths |
+| **Medium** | Reliability | Offline queue, draining, N-1 policy |
+| **Medium** | Features | DM channels, MLS remove/update |
+| **Lower** | Research | Sealed Sender, PIR, P2P, WebTransport, Tor |
+
+---
+
+## Related documents
+
+- [Milestones](src/roadmap/milestones.md) — M7 and beyond
+- [Production readiness WBS](src/roadmap/production-readiness.md) — phased hardening
+- [Future research](src/roadmap/future-research.md) — technologies and options
+- [Security audit](SECURITY-AUDIT.md) — recommendations and status
+- [Production readiness audit](PRODUCTION-READINESS-AUDIT.md) — checklist and fixes
--- a/docs/MULTI-AGENT-WORK-PLAN.md
+++ b/docs/MULTI-AGENT-WORK-PLAN.md
@@ -0,0 +1,106 @@
+# Multi-Agent Work Plan: Sections 1 (Security) + 5 (Features)
+
+This document splits work for **Future Improvements §1 (Security and hardening)** and **§5 (Features and product)** between two agents so they can work in parallel with minimal merge conflicts.
+
+---
+
+## Agent A: Security and hardening
+
+**Owns:** Server auth/OPAQUE, TLS config, core crypto (identity, keypackage, hybrid_kem), docs under `docs/src/cryptography/` and TLS/cert docs.
+
+### A1. 1.2 CA-signed TLS / certificate lifecycle
+- **Files:** `docs/src/getting-started/` (new or existing), `crates/quicnprotochat-server/src/tls.rs` (optional env), `README.md`.
+- **Tasks:**
+  1. Add **Certificate lifecycle** doc: using CA-issued certs (e.g. Let's Encrypt), cert rotation, OCSP/CRL optional. Recommend pinning for single-server.
+  2. Optional: server config or env to prefer CA-signed cert path (e.g. `QUICNPROTOCHAT_USE_CA_CERT=1` and read from a different path). Low priority if docs suffice.
+- **Deliverable:** `docs/src/getting-started/certificate-lifecycle.md` (or section in running-the-server) + README link.
+
+### A2. 1.4 Username enumeration (OPAQUE)
+- **Files:** `crates/quicnprotochat-server/src/node_service/auth_ops.rs`, `docs/SECURITY-AUDIT.md`.
+- **Tasks:**
+  1. Document the risk in SECURITY-AUDIT (already mentioned).
+  2. Optional mitigation: ensure `get_user_record` is always called before `ServerLogin::start` (already true). If desired, add a constant-time delay or dummy work when user not found so response timing does not leak existence. Keep OPAQUE security unchanged.
+- **Deliverable:** Doc update; optional small code change in `handle_opaque_login_start`.
+
+### A3. 1.1 M7 — Post-quantum MLS
+- **Files:** `crates/quicnprotochat-core/src/` (new or modified crypto provider), `crates/quicnprotochat-core/src/group.rs`, `crates/quicnprotochat-core/src/hybrid_kem.rs`, `crates/quicnprotochat-core/src/hybrid_crypto.rs`.
+- **Tasks:**
+  1. Implement a custom `OpenMlsCryptoProvider` (or adapter) that uses hybrid X25519 + ML-KEM-768 for MLS KEM (HPKE layer).
+  2. Wire hybrid shared secret derivation (see milestones M7) into the provider.
+  3. Run full test suite; ensure M3/M4/M5 tests pass.
+- **Deliverable:** Hybrid KEM in MLS path; tests green. Large change; coordinate with core crate.
+
+### A4. 1.3 Stronger credential binding
+- **Files:** Docs only for now.
+- **Tasks:** Add a short **Future research** subsection or ADR: X.509-based MLS credentials, or Key Transparency for public key binding. No code change in this round.
+- **Deliverable:** `docs/src/roadmap/future-research.md` or ADR update.
+
+---
+
+## Agent B: Features and product
+
+**Owns:** Cap'n Proto schema (node.capnp delivery/channel methods), server storage (Store trait, FileBackedStore, SqlStore), `node_service/delivery.rs`, `node_service/key_ops.rs` (if createChannel lives there), client commands for channels.
+
+### B1. 5.1 Private 1:1 channels (DM)
+- **Files:** `schemas/node.capnp`, `crates/quicnprotochat-server/src/storage.rs`, `crates/quicnprotochat-server/src/sql_store.rs`, `crates/quicnprotochat-server/src/node_service/delivery.rs`, new `crates/quicnprotochat-server/src/node_service/channel_ops.rs` (or add to delivery), migrations for channels table.
+- **Tasks:**
+  1. **Schema:** Add `createChannel @N (auth :Auth, peerKey :Data) -> (channelId :Data);` to `node.capnp`. Rebuild proto.
+  2. **Store trait:** Add `create_channel(&self, member_a: &[u8], member_b: &[u8]) -> Result<Vec<u8>, StorageError>`, `get_channel_members(&self, channel_id: &[u8]) -> Result<Option<(Vec<u8>, Vec<u8>)>, StorageError>`. Implement in FileBackedStore (in-memory map channel_id -> (a, b)) and SqlStore (channels table, unique on sorted (a,b)).
+  3. **Server:** Implement `handle_create_channel`: auth required, identity required; create channel with (caller_identity, peer_key); return 16-byte channel_id (e.g. UUID).
+  4. **Delivery authz:** When `channel_id.len() == 16`: call `get_channel_members`. If `Some((a, b))`, verify that the caller identity is one of a/b and `recipient_key` is the other; if that check fails, return E022 (CHANNEL_ACCESS_DENIED). If the channel is not found, return E023 (CHANNEL_NOT_FOUND). Legacy: an empty `channel_id` keeps the current behaviour (no channel check).
+  5. **Config:** Optional server flag to require channel authz for non-empty channel_id (default on).
+- **Deliverable:** createChannel RPC, channel storage, per-channel authz on enqueue/fetch/fetchWait; legacy mode when channel_id empty.
+- **Ref:** [DM channels design](src/roadmap/dm-channels.md).
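The delivery authz rule from task 4 can be sketched as below. This is a minimal illustration and not the server's implementation: `ChannelTable`, `ChannelAuthz`, and `check` are hypothetical names, the in-memory map stands in for the `Store` trait, and treating any `channel_id` that is not 16 bytes long as legacy is an assumption based on the `channel_id.len() == 16` condition.

```rust
use std::collections::HashMap;

/// Outcome of the per-channel delivery check (illustrative names).
#[derive(Debug, PartialEq, Eq)]
pub enum ChannelAuthz {
    Allowed,
    /// Would map to E022 CHANNEL_ACCESS_DENIED.
    AccessDenied,
    /// Would map to E023 CHANNEL_NOT_FOUND.
    NotFound,
}

/// Hypothetical in-memory stand-in for the channel part of the store:
/// channel_id -> (member_a, member_b).
#[derive(Default)]
pub struct ChannelTable {
    members: HashMap<Vec<u8>, (Vec<u8>, Vec<u8>)>,
}

impl ChannelTable {
    pub fn insert(&mut self, channel_id: Vec<u8>, a: Vec<u8>, b: Vec<u8>) {
        self.members.insert(channel_id, (a, b));
    }

    /// Checks that `caller` and `recipient` are the two members of the channel.
    pub fn check(&self, channel_id: &[u8], caller: &[u8], recipient: &[u8]) -> ChannelAuthz {
        // Assumption: only 16-byte ids are channel ids; anything else
        // (including the empty id) follows the legacy path with no check.
        if channel_id.len() != 16 {
            return ChannelAuthz::Allowed;
        }
        match self.members.get(channel_id) {
            None => ChannelAuthz::NotFound,
            Some((a, b)) => {
                // The caller must be one member and the recipient the other.
                let forward = caller == a.as_slice() && recipient == b.as_slice();
                let reverse = caller == b.as_slice() && recipient == a.as_slice();
                if forward || reverse {
                    ChannelAuthz::Allowed
                } else {
                    ChannelAuthz::AccessDenied
                }
            }
        }
    }
}
```

A real handler would presumably compare identities in constant time, as the server already does for tokens, rather than with `==`.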
+
+### B2. 5.2 MLS lifecycle (remove, update, proposals)
+- **Files:** `crates/quicnprotochat-core/src/group.rs`, client commands that use GroupMember.
+- **Tasks:**
+  1. Add `remove_member` (by index or identity) and `update_credential` / rekey using openmls APIs.
+  2. Handle incoming MLS proposals (Remove, Update) in `receive_message` path and apply to group state.
+  3. CLI: `remove` and `update` subcommands or options.
+- **Deliverable:** Members can be removed and credentials updated; proposals handled; CLI exposed.
+- **Ref:** OpenMLS API for `MlsGroup::remove_member`, `MlsGroup::process_pending_proposals`, etc.
+
+### B3. 5.3 Sealed Sender and 5.4 Traffic analysis
+- **Files:** Docs; optionally `crates/quicnprotochat-server`, `crates/quicnprotochat-client` for padding.
+- **Tasks:**
+  1. Document current `sealed_sender` behaviour (enqueue without identity binding) and that full “sender in ciphertext” is a future protocol change.
+  2. Optional: add optional payload padding (e.g. pad to next 256 bytes) or random delay in client send path for 5.4.
+- **Deliverable:** Doc update; optional padding/behaviour.
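The optional padding in task 2 ("pad to next 256 bytes") could look roughly like this. The framing is an assumption made for the sketch, not the project's wire format: a 4-byte big-endian length prefix, the payload, then zero fill up to the next multiple of 256. `pad`, `unpad`, and `PAD_BUCKET` are hypothetical names.

```rust
use std::convert::{TryFrom, TryInto};

/// Bucket size taken from the plan's example.
pub const PAD_BUCKET: usize = 256;

/// Pads `payload` so the result length is a multiple of `PAD_BUCKET`.
/// Layout (assumed): 4-byte big-endian payload length, payload, zero fill.
pub fn pad(payload: &[u8]) -> Vec<u8> {
    let len = u32::try_from(payload.len()).expect("payload longer than u32::MAX bytes");
    let unpadded = 4 + payload.len();
    // Round up to the next multiple of the bucket size.
    let padded = (unpadded + PAD_BUCKET - 1) / PAD_BUCKET * PAD_BUCKET;
    let mut out = Vec::with_capacity(padded);
    out.extend_from_slice(&len.to_be_bytes());
    out.extend_from_slice(payload);
    out.resize(padded, 0);
    out
}

/// Recovers the payload, or returns `None` if the buffer is malformed.
pub fn unpad(padded: &[u8]) -> Option<&[u8]> {
    let prefix: [u8; 4] = padded.get(..4)?.try_into().ok()?;
    let len = u32::from_be_bytes(prefix) as usize;
    let end = 4usize.checked_add(len)?;
    padded.get(4..end)
}
```

Padding only hides size from an observer if it is applied before encryption, so the padded bytes end up inside the ciphertext.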
+
+---
+
+## File ownership (avoid conflicts)
+
+| Area | Agent A | Agent B |
+|------|---------|---------|
+| `schemas/node.capnp` | — | Add createChannel |
+| `crates/quicnprotochat-server/src/node_service/auth_ops.rs` | 1.4 username enum | — |
+| `crates/quicnprotochat-server/src/node_service/delivery.rs` | — | 5.1 channel authz |
+| `crates/quicnprotochat-server/src/storage.rs` | — | 5.1 Store channel methods |
+| `crates/quicnprotochat-server/src/sql_store.rs` | — | 5.1 channels table + impl |
+| `crates/quicnprotochat-server/src/tls.rs` | 1.2 optional | — |
+| `crates/quicnprotochat-core/` | 1.1 M7, 1.3 doc | 5.2 group.rs |
+| `docs/` | 1.2, 1.3, 1.4 | 5.3/5.4 (B3 docs) |
+
+**Shared:** `docs/`, `README.md`. Prefer non-overlapping files (e.g. A adds `certificate-lifecycle.md`, B does not edit it).
+
+---
+
+## Order of operations (recommended)
+
+1. **Both:** Sync on schema and Store trait changes so B adds `createChannel` and channel methods without A touching the same trait.
+2. **Agent A:** Ship A1 (CA/TLS docs) and A2 (1.4 doc + optional code) first; then A3 (M7) in a follow-up PR/batch.
+3. **Agent B:** Ship B1 (createChannel + channel authz) first; then B2 (MLS remove/update); then B3 (docs/padding).
+
+---
+
+## Completion checklist
+
+- [ ] A1: CA-signed TLS / certificate lifecycle doc
+- [ ] A2: Username enumeration doc and/or mitigation
+- [ ] A3: M7 hybrid KEM in MLS provider
+- [ ] A4: 1.3 credential binding (docs)
+- [ ] B1: createChannel RPC + channel storage + delivery authz
+- [ ] B2: MLS remove/update and proposal handling
+- [ ] B3: Sealed Sender and traffic analysis (docs + optional padding)
--- a/docs/SECURITY-AUDIT.md
+++ b/docs/SECURITY-AUDIT.md
@@ -0,0 +1,226 @@
+# Security Audit
+
+This document is a security audit of the quicnprotochat codebase as of the audit date. It aligns with the [Threat Model](src/cryptography/threat-model.md) and [Production Readiness Audit](PRODUCTION-READINESS-AUDIT.md). The project has **not** undergone a formal third-party audit; this is an internal review.
+
+---
+
+## Executive Summary
+
+| Area | Finding | Assessment |
+|------|---------|----------|
+| Authentication & sessions | Token comparison constant-time; session tokens CSPRNG; OPAQUE used correctly; no secrets in logs | ✅ Strong |
+| Cryptography | MLS, Ed25519, hybrid KEM, zeroization where appropriate; Argon2/ChaCha20 for state | ✅ Strong |
+| Transport (TLS) | TLS 1.3 only; client verifies server cert; self-signed default is documented weakness | ⚠️ Known gap |
+| Authorization (DS/AS) | Enqueue/fetch/fetchWait/key ops require auth + identity binding (or sealed_sender); health unauthenticated by design | ✅ Appropriate |
+| Input validation & limits | Key/recipient length, payload/KeyPackage size, queue depth, rate limit, UTF-8 username | ✅ Good |
+| Secrets handling | No tokens/keys/passwords in logs; DB key optional (documented); state encryption optional | ✅ Good |
+| Dependency hygiene | No `cargo audit` in tree; recommend adding and running in CI | ⚠️ Recommendation |
+
+**Overall:** The design and implementation are security-conscious and match the documented threat model. Remaining risks are largely documented (self-signed TLS, metadata visibility, BasicCredential) or operational (deps, production config).
+
+---
+
+## 1. Authentication and Session Management
+
+### 1.1 Token comparison
+
+- **Location:** `crates/quicnprotochat-server/src/auth.rs`
+- **Finding:** Bearer token and identity key comparisons use `subtle::ConstantTimeEq` (`ct_eq`). Length is checked before comparison where applicable.
+- **Status:** ✅ No timing leakage from token or identity comparison.
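The idea behind a constant-time comparison can be shown with a small sketch. This is an illustration of the principle only; `ct_eq_sketch` is a hypothetical name, and the server uses `subtle::ConstantTimeEq`, which additionally guards against compiler optimisations that a hand-written loop like this one does not.

```rust
/// Compares two byte strings without returning early on the first mismatch.
/// Illustrative only: production code should keep using the `subtle` crate.
pub fn ct_eq_sketch(a: &[u8], b: &[u8]) -> bool {
    // The length check may return early: token and key lengths are fixed
    // and public, so this reveals nothing secret.
    if a.len() != b.len() {
        return false;
    }
    // Accumulate the XOR of every byte pair; the loop always runs to the end,
    // so the running time does not depend on where the inputs differ.
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}
```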
+
+### 1.2 Session token generation
+
+- **Location:** `crates/quicnprotochat-server/src/node_service/auth_ops.rs` (login finish)
+- **Finding:** Session tokens are 32 bytes from `rand::RngCore::fill_bytes(&mut rand::rngs::OsRng, &mut token)`. Stored in `sessions` with TTL (24h). Expired sessions are removed on next use.
+- **Status:** ✅ Cryptographically strong, opaque 32-byte token; valid until its 24h TTL expires.
+
+### 1.3 OPAQUE (RFC 9807)
+
+- **Location:** `crates/quicnprotochat-core/src/opaque_auth.rs`, server `auth_ops.rs`
+- **Finding:** Shared `OpaqueSuite` (Ristretto255, Triple-DH, Argon2id). Server never sees password. Registration and login flows use `ServerRegistration`/`ServerLogin` correctly. Pending login state is stored server-side and removed on consume. Identity key is bound at login finish; mismatch returns E016 and is not logged with secrets.
+- **DoS:** Pending-login check runs **before** expensive OPAQUE work (login start); repeated attempts for the same username within 60s are rejected early.
+- **Status:** ✅ Correct usage; DoS mitigation in place.
+
+### 1.4 Audit logging (no secrets)
+
+- **Finding:** Comments in `auth_ops.rs` and `delivery.rs` explicitly forbid logging tokens, passwords, or raw keys. Logged fields: username, recipient/key prefix (`fmt_hex`), payload length, seq, counts. Login success/failure and rate-limit hit are logged without session token or identity.
+- **Status:** ✅ No sensitive material in logs.
+
+---
+
+## 2. Cryptography
+
+### 2.1 MLS and identity
+
+- **Location:** `quicnprotochat-core` (group, identity, keypackage)
+- **Finding:** MLS ciphersuite `MLS_128_DHKEMX25519_AES128GCM_SHA256_Ed25519` (RFC 9420). Ed25519 identity seed stored in `Zeroizing<[u8; 32]>`; zeroize-on-drop. KeyPackages validated for ciphersuite before server stores. Single-use KeyPackage semantics enforced (consume-on-fetch).
+- **Status:** ✅ Aligns with key lifecycle and zeroization goals.
+
+### 2.2 Hybrid KEM (X25519 + ML-KEM-768)
+
+- **Location:** `crates/quicnprotochat-core/src/hybrid_kem.rs`
+- **Finding:** Hybrid keypair and shared secrets use `Zeroizing` where appropriate. HKDF domain separation (`quicnprotochat-hybrid-v1`). ChaCha20-Poly1305 for AEAD. Versioned envelope.
+- **Status:** ✅ PQ-ready envelope layer; secret handling is careful.
+
+### 2.3 Client state encryption (QPCE)
+
+- **Location:** `crates/quicnprotochat-client/src/client/state.rs`
+- **Finding:** Optional password protection: Argon2id (explicit parameters: 19 MiB memory, 2 iterations, 1 lane) for key derivation, ChaCha20-Poly1305, random salt and nonce. Derived key held in `Zeroizing` during use. Unencrypted state is a documented option (e.g. dev).
+- **Recommendation:** Implemented (see § 8.1): the Argon2 parameters are set explicitly in `quicnprotochat-client` so they are auditable.
+- **Status:** ✅ Appropriate for optional at-rest protection.
+
+---
+
+## 3. Transport (TLS / QUIC)
+
+### 3.1 Server TLS
+
+- **Location:** `crates/quicnprotochat-server/src/tls.rs`
+- **Finding:** TLS 1.3 only. No client cert. ALPN `capnp`. When not in production, missing cert/key triggers self-signed generation; key file permissions set to `0o600` on Unix. Production mode requires existing cert/key (no auto-generation).
+- **Status:** ✅ Matches documented design; self-signed limitation is documented in threat model.
+
+### 3.2 Client TLS
+
+- **Location:** `crates/quicnprotochat-client/src/client/rpc.rs`
+- **Finding:** Client loads CA cert from file, builds `RootCertStore` with that single cert, uses it for server verification. Server name from CLI/env is used for connection (SNI and cert verification). No custom bypass.
+- **Status:** ✅ Proper verification against provided CA; trust-on-first-use / self-signed caveat is documented.
+
+---
+
+## 4. Authorization and Access Control
+
+### 4.1 RPC auth matrix
+
+| RPC | Auth required | Identity binding |
+|-----|----------------|------------------|
+| health | No | N/A (liveness) |
+| opaqueLoginStart/Finish, opaqueRegisterStart/Finish | No (password/session flow) | After login |
+| uploadKeyPackage, fetchKeyPackage | Yes | Must match identity_key (or allow_insecure) |
+| enqueue | Yes | Must match recipient_key unless sealed_sender |
+| fetch, fetchWait | Yes | Must match recipient_key (or allow_insecure) |
+| uploadHybridKey, fetchHybridKey | Yes | Must match identity_key (or allow_insecure) |
+| publishEndpoint, resolveEndpoint | Yes | Publish: match identity_key; Resolve: any valid token |
+
+- **Finding:** Sensitive operations require `validate_auth_context` and, where relevant, `require_identity_or_request`. Fetch/fetchWait ensure the authenticated identity matches the requested recipient_key, so only the recipient (or someone with their session) can pull messages. With `sealed_sender`, enqueue only requires a valid token (no identity binding to sender).
+- **Status:** ✅ Authorization consistent with design; sealed_sender trade-off is documented.
+
+### 4.2 Rate limiting
+
+- **Location:** `crates/quicnprotochat-server/src/auth.rs`, `delivery.rs`
+- **Finding:** Per-token rate limit (e.g. 100 enqueues per 60s). Enqueue path checks before storage. Queue depth and payload size caps (1000 messages, 5 MB) enforced.
+- **Status:** ✅ Limits in place to curb abuse and DoS.
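A per-token limit such as "100 enqueues per 60s" can be sketched with a fixed-window counter. This is one possible scheme chosen for illustration; the audit does not say which algorithm the server uses, and `RateLimiter` and `allow` are hypothetical names.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Fixed-window rate limiter keyed by token bytes (illustrative sketch).
pub struct RateLimiter {
    max: u32,
    window: Duration,
    /// token -> (start of current window, requests counted in it)
    buckets: HashMap<Vec<u8>, (Instant, u32)>,
}

impl RateLimiter {
    pub fn new(max: u32, window: Duration) -> Self {
        RateLimiter { max, window, buckets: HashMap::new() }
    }

    /// Returns `true` if a request for `token` is allowed at time `now`.
    pub fn allow(&mut self, token: &[u8], now: Instant) -> bool {
        let entry = self.buckets.entry(token.to_vec()).or_insert((now, 0));
        // Start a fresh window once the previous one has fully elapsed.
        if now.duration_since(entry.0) >= self.window {
            *entry = (now, 0);
        }
        if entry.1 >= self.max {
            return false;
        }
        entry.1 += 1;
        true
    }
}
```

The check runs before any storage work, matching the "checks before storage" ordering described above. A long-running server would also need to evict stale buckets, which this sketch omits.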
+
+---
+
+## 5. Input Validation and Limits
+
+- **Identity/recipient keys:** Rejected unless length exactly 32 bytes (E004).
+- **Payload:** Non-empty; max 5 MB (E005, E006).
+- **KeyPackage:** Non-empty; max 1 MB (E007, E008); ciphersuite validated before store (E021).
+- **Username:** Non-empty; must be valid UTF-8 (E011, E020).
+- **Wire version:** Rejected if greater than `CURRENT_WIRE_VERSION` (E012).
+- **Cap'n Proto:** Server and client set `traversal_limit_in_words(Some(4 * 1024 * 1024))` (4M words = 32 MiB) to bound parsing DoS.
+- **Status:** ✅ Validation and limits are consistently applied.
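The key and payload checks, together with the traversal-limit arithmetic, can be written out as a short sketch. Several details are assumptions: the names are hypothetical, "5 MB" is read as 5 MiB, and the mapping of E005 to the empty-payload case and E006 to the oversize case is inferred from the order in the list above.

```rust
pub const KEY_LEN: usize = 32;
/// Assumption: "5 MB" is interpreted as 5 MiB.
pub const MAX_PAYLOAD_BYTES: usize = 5 * 1024 * 1024;
/// Cap'n Proto traversal limit: 4 Mi words of 8 bytes each.
pub const TRAVERSAL_LIMIT_WORDS: usize = 4 * 1024 * 1024;
pub const TRAVERSAL_LIMIT_BYTES: usize = TRAVERSAL_LIMIT_WORDS * 8; // 32 MiB

#[derive(Debug, PartialEq, Eq)]
pub enum ValidationError {
    /// E004: key is not exactly 32 bytes.
    BadKeyLength,
    /// Assumed to be E005: payload is empty.
    EmptyPayload,
    /// Assumed to be E006: payload exceeds the size cap.
    PayloadTooLarge,
}

/// Validates an enqueue request before any storage work is done.
pub fn validate_enqueue(recipient_key: &[u8], payload: &[u8]) -> Result<(), ValidationError> {
    if recipient_key.len() != KEY_LEN {
        return Err(ValidationError::BadKeyLength);
    }
    if payload.is_empty() {
        return Err(ValidationError::EmptyPayload);
    }
    if payload.len() > MAX_PAYLOAD_BYTES {
        return Err(ValidationError::PayloadTooLarge);
    }
    Ok(())
}
```

Under these assumptions the traversal limit (32 MiB) sits well above the largest accepted payload, so a valid message cannot trip the parser bound.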
+
+---
+
+## 6. Storage and Persistence
+
+### 6.1 Server
+
+- **File-backed store:** Mutex-protected; lock errors mapped to `StorageError` (no unwrap in hot path). OPAQUE server setup file permissions `0o600` on Unix.
+- **SQL store:** Optional SQLCipher with `db_key`; empty key = plaintext (documented). Production validation requires non-empty `db_key` when backend is SQL. User records use INSERT (no OR REPLACE) and unique constraint; duplicate user returns `StorageError::DuplicateUser` (E018).
+- **Status:** ✅ Matches production-readiness and auth design; DB encryption caveat documented.
+
+### 6.2 Client
+
+- **State file:** Optional QPCE encryption (Argon2id + ChaCha20-Poly1305). Unencrypted state contains identity seed; documented.
+- **Keystore:** Persisted for HPKE init keys so Welcome can be processed after restart; path and format documented.
+- **Status:** ✅ Acceptable for threat model; optional encryption and handling of secrets are clear.
+
+---
+
+## 7. Known Gaps (from Threat Model and Docs)
+
+These remain as documented, not new findings:
+
+1. **Self-signed TLS:** MITM possible on first connection if client does not pin or verify out-of-band. Mitigation: certificate pinning or CA-signed certs.
+2. **No client auth on DS (by design):** Anyone with a valid token can enqueue to any recipient_key when identity is not required (e.g. sealed_sender). Rate limit and queue/payload caps mitigate abuse.
+3. **BasicCredential only:** No CA or revocation; key substitution possible if AS is compromised. Mitigation: Key Transparency or X.509 credentials.
+4. **Metadata:** Server sees recipient_key, timing, sizes; Sealed Sender and PIR are future mitigations.
+
+---
+
+## 8. Recommendations
+
+### 8.1 High value
+
+- **Dependency audit:** Run `cargo install cargo-audit` then `cargo audit` locally (and in CI if available) to check for known-vulnerable dependencies. Fix or document any findings. See [Checking dependencies](#checking-dependencies) below.
+- **Argon2 params:** Implemented: client state KDF now uses explicit Argon2id parameters (19 MiB memory, 2 iterations, 1 lane) in `quicnprotochat-client` so they are auditable.
+
+### 8.2 Medium value
+
+- **Certificate pinning:** To pin the server, use the server's certificate as the client's `ca_cert` (e.g. copy `server-cert.der` from the server and pass it via `--ca-cert` or `QUICNPROTOCHAT_CA_CERT`). Do not use a general CA unless you intend to trust that CA for all servers. See [Certificate pinning](#certificate-pinning) below.
+- **Health endpoint:** The `health` RPC is unauthenticated by design for liveness probes and load balancers; this is documented in code and in this audit.
+
+### 8.3 Lower priority
+
+- **Cap'n Proto traversal limit:** Implemented: reduced to 4M words (32 MiB) with a named constant; trade-off documented in code.
+- **Username enumeration:** OPAQUE login start uses `get_user_record`; timing or response shape might still reveal user existence. Mitigation in scope: consider constant-time or uniform response for unknown users in a future revision (e.g. fixed dummy work when user not found) without weakening OPAQUE. Current code does not implement this.
+
+---
+
+## 9. Summary Table
+
+| Category | Item | Status |
+|----------|------|--------|
+| Auth | Constant-time token/identity comparison | ✅ |
+| Auth | Session token from CSPRNG, not logged | ✅ |
+| Auth | OPAQUE used correctly; no password on server | ✅ |
+| Auth | Pending-login DoS check before OPAQUE work | ✅ |
+| Auth | No secrets in audit logs | ✅ |
+| Crypto | MLS ciphersuite, KeyPackage validation | ✅ |
+| Crypto | Identity seed zeroized on drop | ✅ |
+| Crypto | Hybrid KEM and state encryption use Zeroizing | ✅ |
+| Transport | TLS 1.3 only; client verifies server cert | ✅ |
+| Transport | Self-signed default (documented weakness) | ⚠️ Known |
+| Authz | Enqueue/fetch/key/p2p require auth, plus identity binding where applicable | ✅ |
+| Authz | Rate limit and size/depth limits | ✅ |
+| Input | Lengths, sizes, UTF-8, wire version, traversal limit | ✅ |
+| Storage | Server: lock Result, DB key in prod, duplicate user | ✅ |
+| Storage | Client: optional QPCE; explicit Argon2 params; unencrypted state documented | ✅ |
+| Deps | Run `cargo audit` locally/CI (see [Checking dependencies](#checking-dependencies)) | ⚠️ Recommend |
+
+---
+
+## Checking dependencies
+
+To check for known vulnerabilities in dependencies:
+
+```bash
+cargo install cargo-audit
+cargo audit
+```
+
+Fix or document any reported issues. Running `cargo audit` in CI (e.g. GitHub Actions) is recommended.
+
+---
+
+## Certificate pinning
+
+The client trusts the server certificate(s) in the file given by `--ca-cert` (or `QUICNPROTOCHAT_CA_CERT`). To **pin** a specific server:
+
+1. Obtain the server's certificate (e.g. copy `data/server-cert.der` from the server, or export from your deployment).
+2. Use that file as the client's `ca_cert`. The client will only connect to a server that presents that exact certificate (or chain).
+3. Do not use a broad CA bundle as `ca_cert` unless you intend to trust any server certified by that CA.
+
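+This is out-of-band pinning: you deploy the server, then distribute its cert to clients over a channel you trust. It resembles trust-on-first-use only in that trust is anchored in one specific certificate rather than in a CA.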
+
+---
+
+## Related Documents
+
+- [Threat Model](src/cryptography/threat-model.md)
+- [Production Readiness Audit](PRODUCTION-READINESS-AUDIT.md)
+- [Cryptography Overview](src/cryptography/overview.md)
+- [Key Lifecycle and Zeroization](src/cryptography/key-lifecycle.md)
--- a/docs/src/SUMMARY.md
+++ b/docs/src/SUMMARY.md
@@ -17,6 +17,7 @@
 - [Building from Source](getting-started/building.md)
 - [Running the Server](getting-started/running-the-server.md)
 - [Running the Client](getting-started/running-the-client.md)
+- [Certificate Lifecycle and CA-Signed TLS](getting-started/certificate-lifecycle.md)
 - [Docker Deployment](getting-started/docker.md)
 - [Demo Walkthrough: Alice and Bob](getting-started/demo-walkthrough.md)

--- a/docs/src/architecture/crate-responsibilities.md
+++ b/docs/src/architecture/crate-responsibilities.md
@@ -1,9 +1,11 @@
 # Crate Responsibilities

-The quicnprotochat workspace is split into four crates with strict layering
-rules. Each crate owns one concern and depends only on the crates below it.
-This page documents what each crate provides, what it explicitly avoids, and
-how the crates relate to one another.
+The quicnprotochat workspace contains six crates. The main four (proto, core,
+server, client) follow strict layering rules; each owns one concern and depends
+only on the crates below it. The workspace also includes **quicnprotochat-gui**
+(Tauri desktop app) and **quicnprotochat-p2p** (P2P endpoint resolution). This
+page documents what each crate provides, what it explicitly avoids, and how the
+crates relate to one another.

 ---

@@ -198,6 +200,17 @@ group state to disk.

 ---

+## Other workspace crates
+
+| Crate                   | Role |
+|-------------------------|------|
+| **quicnprotochat-gui**  | Tauri 2 desktop application; provides a GUI on top of the client/core stack. |
+| **quicnprotochat-p2p**  | P2P endpoint publish/resolve; used by the server and clients for direct peer discovery. |
+
+These crates are optional for building and running the server and CLI client.
+
+---
+
 ## Layering Rules

 1. **proto** depends on nothing in-workspace. It is pure data definition.
@@ -207,6 +220,8 @@ group state to disk.
 4. **client** depends on **core** and **proto**. It does not depend on server.
 5. **server** and **client** never depend on each other. They communicate
   exclusively via the Cap'n Proto RPC wire protocol.
+6. **quicnprotochat-gui** and **quicnprotochat-p2p** are optional; they depend
+   on client/core/proto as needed and do not change the core layering.

 This layering ensures that:

--- a/docs/src/getting-started/certificate-lifecycle.md
+++ b/docs/src/getting-started/certificate-lifecycle.md
@@ -0,0 +1,75 @@
+# Certificate lifecycle and CA-signed TLS
+
+This page describes how to use CA-issued certificates with quicnprotochat and how to think about certificate pinning, rotation, and lifecycle.
+
+For basic server TLS setup (self-signed certs, generation), see [Running the Server](running-the-server.md#tls-certificate-handling).
+
+---
+
+## Current behaviour
+
+- **Server:** Uses a single TLS certificate and private key (DER format). If the files are missing and the server is not in production mode, it generates a self-signed certificate. Production mode (`QUICNPROTOCHAT_PRODUCTION=1`) requires existing cert and key files.
+- **Client:** Trusts exactly the roots in the file given by `--ca-cert` (or `QUICNPROTOCHAT_CA_CERT`). Typically this is the server's own certificate (pinning) or a CA that signed the server cert.
+
+---
+
+## Certificate pinning (recommended for single-server)
+
+To pin the server so the client only connects to that server:
+
+1. Copy the server's certificate file (e.g. `data/server-cert.der`) from the server (or your deployment).
+2. Use that file as the client's CA cert:
+   ```bash
+   quicnprotochat --ca-cert /path/to/server-cert.der ...
+   ```
+3. The client will only accept a connection if the server presents that exact certificate (or a chain ending in it). No separate CA bundle is required.
+
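+This is **certificate pinning with out-of-band distribution** rather than trust-on-first-use in the strict sense: the client never accepts an unknown certificate, and whoever deploys the server and distributes the cert to clients is the trust anchor. It suits single-server or small deployments.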
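Before handing a cert file to clients as `--ca-cert`, it can help to confirm that the distributed copy is byte-for-byte the file the server uses. The following is a minimal sketch, assuming `sha256sum` from GNU coreutils is available; the helper names `cert_fingerprint` and `same_cert` and the paths in the usage comment are illustrative and not part of quicnprotochat.

```shell
# Print the SHA-256 digest of a certificate file (hypothetical helper).
# Assumes `sha256sum` from GNU coreutils is on PATH.
cert_fingerprint() {
    sha256sum "$1" | cut -d' ' -f1
}

# Succeed only if both files have identical contents.
same_cert() {
    [ "$(cert_fingerprint "$1")" = "$(cert_fingerprint "$2")" ]
}

# Usage (paths are placeholders):
#   same_cert data/server-cert.der /path/to/client-copy.der \
#       && echo "pinned cert matches"
```

Because the pinned file is DER, the digest of the file is the same value that TLS tools usually report as the certificate's SHA-256 fingerprint, so it can also be compared over a separate channel such as a phone call or a signed message.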
+
+---
+
+## CA-issued certificates (e.g. Let's Encrypt)
+
+To use a certificate issued by a public CA (e.g. Let's Encrypt):
+
+1. **Obtain the certificate and key** using your preferred method (e.g. certbot, acme-client). The server expects:
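+   - Certificate in **DER** format (not PEM). Convert if needed; note that `openssl x509` converts only the first certificate in `fullchain.pem` (the leaf), so intermediates are not included in the output: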
+     ```bash
+     openssl x509 -in fullchain.pem -outform DER -out server-cert.der
+     ```
+   - Private key in **DER** format (PKCS#8). Convert if needed:
+     ```bash
+     openssl pkcs8 -topk8 -inform PEM -outform DER -in privkey.pem -out server-key.der -nocrypt
+     ```
+2. **Configure the server** to use those paths:
+   ```bash
+   export QUICNPROTOCHAT_TLS_CERT=/etc/quicnprotochat/server-cert.der
+   export QUICNPROTOCHAT_TLS_KEY=/etc/quicnprotochat/server-key.der
+   ```
+3. **Configure the client** to trust the CA that signed the server cert. Use the CA’s certificate (or the CA bundle) as `--ca-cert`:
+   ```bash
+   quicnprotochat --ca-cert /etc/ssl/certs/your-ca.der --server-name your.server.example ...
+   ```
+   The `--server-name` must match the certificate’s SAN (e.g. DNS name).
+
+**Note:** The server does not currently reload the certificate on SIGHUP or on a timer. To rotate a certificate, replace the cert/key files and restart the server; a reload mechanism is a possible future addition.
+
+---
+
+## Certificate rotation
+
+- **Manual rotation:** Replace `server-cert.der` and `server-key.der` on disk, then restart the server. Clients that pin the server cert must be given the new cert file before they can reconnect.
+- **Let’s Encrypt renewal:** After renewing (e.g. via certbot), convert the new cert and key to DER, replace the files, and restart the server. If clients use the CA cert (e.g. ISRG Root X1) as `--ca-cert`, they do not need updates when the server cert is renewed.
+- **OCSP / CRL:** The quicnprotochat server does not currently perform OCSP stapling or CRL checks. Revocation is handled by the client or by operational procedures (e.g. short-lived certs, rotation on compromise).
+
+---
+
+## Summary
+
+| Deployment style | Server cert | Client `--ca-cert` |
+|------------------|-------------|--------------------|
+| Pinned (single server) | Self-signed or any | Server’s cert file |
+| CA-issued | Let’s Encrypt (or other CA) | CA cert (or bundle) |
+| Production | Always use existing cert/key; set `QUICNPROTOCHAT_PRODUCTION=1` | CA or pinned server cert |
+
+For production, prefer either (a) certificate pinning with the server’s cert or (b) a CA-issued server cert with clients trusting the CA, and plan for rotation and restart (or future reload support).
--- a/docs/src/getting-started/running-the-server.md
+++ b/docs/src/getting-started/running-the-server.md
@@ -55,6 +55,14 @@ RUST_LOG=debug \
 cargo run -p quicnprotochat-server
 ```

+### Production deployment
+
+Set `QUICNPROTOCHAT_PRODUCTION=1` (or `true` / `yes`) so the server enforces production checks:
+
+- **Auth:** A non-empty `QUICNPROTOCHAT_AUTH_TOKEN` is required; the value `devtoken` is rejected.
+- **TLS:** Existing cert and key files are required (auto-generation is disabled).
+- **SQL store:** When `--store-backend=sql` is used, a non-empty `QUICNPROTOCHAT_DB_KEY` is required. An empty key leaves the database unencrypted on disk and is not acceptable for production.
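The checks above can be sketched as a pre-flight script that runs before the server starts. This is an illustration only: `check_production_env` is a hypothetical helper, not part of quicnprotochat, and it assumes the cert and key paths are supplied through `QUICNPROTOCHAT_TLS_CERT` and `QUICNPROTOCHAT_TLS_KEY`. The server performs its own validation regardless.

```shell
# Hypothetical pre-flight check that mirrors the documented production rules.
# It is not part of quicnprotochat; the server performs its own validation.
# Usage: check_production_env [store-backend]
check_production_env() {
    backend="${1:-}"

    # Auth: the token must be non-empty and must not be the rejected value.
    if [ -z "${QUICNPROTOCHAT_AUTH_TOKEN:-}" ] ||
       [ "${QUICNPROTOCHAT_AUTH_TOKEN}" = "devtoken" ]; then
        echo "QUICNPROTOCHAT_AUTH_TOKEN is empty or 'devtoken'" >&2
        return 1
    fi

    # TLS: cert and key files must already exist (no auto-generation).
    # Assumption: paths are supplied through these environment variables.
    for path in "${QUICNPROTOCHAT_TLS_CERT:-}" "${QUICNPROTOCHAT_TLS_KEY:-}"; do
        if [ -z "$path" ] || [ ! -f "$path" ]; then
            echo "TLS cert or key file is missing: '$path'" >&2
            return 1
        fi
    done

    # SQL store: an encryption key is required when the backend is sql.
    if [ "$backend" = "sql" ] && [ -z "${QUICNPROTOCHAT_DB_KEY:-}" ]; then
        echo "QUICNPROTOCHAT_DB_KEY is required for the sql backend" >&2
        return 1
    fi
    return 0
}
```

Calling `check_production_env sql` from a deployment script surfaces a missing setting with a specific message before the server process is launched.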
+
 ---

 ## TLS certificate handling
--- a/docs/src/introduction.md
+++ b/docs/src/introduction.md
@@ -90,7 +90,8 @@ Known limitations:
 - MLS credentials use `CredentialType::Basic` (raw public key). A production system would bind credentials to a certificate authority or use X.509 certificates.
 - The Delivery Service performs **no authentication** of the `recipientKey` field -- anyone who knows a recipient's public key can enqueue messages for them. Access control is a future milestone.
 - The HPKE init private key generated during `register-state` is held in-process memory (or on-disk via the key store). If the process exits before the corresponding Welcome is consumed, `join` will fail because the private key is lost.
- Group membership is currently limited to two-party groups in practice. Multi-party Commit fan-out is planned for milestone M5.
+
+Multi-party groups (N > 2) are supported (milestone M5): Commit fan-out, `send --all`, and epoch sync work for all members.

 For the full milestone tracker, see [Milestones](roadmap/milestones.md).

--- a/docs/src/protocol-layers/capn-proto.md
+++ b/docs/src/protocol-layers/capn-proto.md
@@ -175,10 +175,10 @@ pub fn from_bytes(bytes: &[u8]) -> Result<Reader<OwnedSegments>, capnp::Error>
 ```

 `from_bytes` uses `ReaderOptions::new()` with default limits:
- **Traversal limit**: 64 MiB (8 * 1024 * 1024 words)
+- **Traversal limit**: 32 MiB (4 * 1024 * 1024 words)
 - **Nesting limit**: 512 levels

-These defaults are reasonable for trusted data. For untrusted data from the network, callers should consider tightening `traversal_limit_in_words` to prevent denial-of-service via deeply nested or excessively large messages. The server enforces its own size limits: 5 MB per payload (`MAX_PAYLOAD_BYTES`) and 1 MB per KeyPackage (`MAX_KEYPACKAGE_BYTES`).
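+The traversal limit caps how much data a reader will traverse in a single message, and the nesting limit caps pointer depth; together they bound denial-of-service from excessively large or deeply nested Cap'n Proto messages. The server also enforces size limits: 5 MB per payload (`MAX_PAYLOAD_BYTES`) and 1 MB per KeyPackage (`MAX_KEYPACKAGE_BYTES`).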
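The word counts convert to bytes at 8 bytes per Cap'n Proto word. A small arithmetic sketch makes the old and new limits easy to check; `words_to_mib` is an illustrative helper, not a function from the codebase.

```shell
# A Cap'n Proto word is 8 bytes; convert a limit in words to MiB.
words_to_mib() {
    echo $(( $1 * 8 / 1024 / 1024 ))
}

words_to_mib $(( 4 * 1024 * 1024 ))   # prints 32
words_to_mib $(( 8 * 1024 * 1024 ))   # prints 64
```

At 32 MiB the traversal limit still sits well above the 5 MB payload limit.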

 ## The NodeService RPC interface