feat: add post-quantum hybrid KEM + SQLCipher persistence

Feature 1: Post-Quantum Hybrid KEM (X25519 + ML-KEM-768)
- Create hybrid_kem.rs with keygen, encrypt, decrypt + 11 unit tests
- Wire format: version(1) | x25519_eph_pk(32) | mlkem_ct(1088) | nonce(12) | ct
- Add uploadHybridKey/fetchHybridKey RPCs to node.capnp schema
- Server: hybrid key storage in FileBackedStore + RPC handlers
- Client: hybrid keypair in StoredState, auto-wrap/unwrap in send/recv/invite/join
- demo-group runs full hybrid PQ envelope round-trip

Feature 2: SQLCipher Persistence
- Extract Store trait from FileBackedStore API
- Create SqlStore (rusqlite + bundled-sqlcipher) with encrypted-at-rest SQLite
- Schema: key_packages, deliveries, hybrid_keys tables with indexes
- Server CLI: --store-backend=sql, --db-path, --db-key flags
- 5 unit tests for SqlStore (FIFO, round-trip, upsert, channel isolation)

Also includes: client lib.rs refactor, auth config, TOML config file support,
mdBook documentation, and various cleanups by user.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 08:07:48 +01:00
parent d1ddef4cea
commit f334ed3d43
81 changed files with 14502 additions and 2289 deletions
--- a/docs/src/roadmap/authz-plan.md
+++ b/docs/src/roadmap/authz-plan.md
@@ -0,0 +1,256 @@
+# Auth, Devices, and Tokens
+
+This page describes the authentication, device management, and authorisation
+design for quicnprotochat. It introduces account and device identities, gates
+server operations by authenticated identity, enforces rate and size limits, and
+binds MLS identity keys to accounts.
+
+This design cuts across milestones M4 through M6. For the broader production
+readiness plan, see [Production Readiness WBS](production-readiness.md).
+
+---
+
+## Goals
+
+1. **Introduce accounts and devices** with authenticated access to `NodeService`.
+2. **Gate operations by identity:** enqueue/fetch/fetchWait require a valid token
+   bound to the caller's account and device.
+3. **Enforce rate and size limits** per account, per device, and per IP.
+4. **Bind MLS identity keys to accounts:** a KeyPackage upload must be associated
+   with the uploading account, preventing impersonation.
+5. **Keep wire changes minimal and versioned:** the `Auth` struct is additive
+   and uses a version field for backward compatibility.
+
+---
+
+## Data Model (Server)
+
+### Accounts
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `account_id` | UUID | Unique account identifier |
+| `created_at` | Timestamp | Account creation time |
+| `status` | Enum | `active`, `suspended`, `deleted` |
+
+### Devices
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `device_id` | UUID | Unique device identifier |
+| `account_id` | UUID | Owning account (foreign key) |
+| `device_pubkey` | Ed25519 public key (32 bytes) | Device signing key |
+| `created_at` | Timestamp | Device registration time |
+| `status` | Enum | `active`, `revoked` |
+
+### Sessions / Tokens
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `session_id` | UUID | Unique session identifier |
+| `account_id` | UUID | Owning account |
+| `device_id` | UUID | Originating device |
+| `access_token` | Opaque bytes | Short-lived bearer token |
+| `refresh_token` | Opaque bytes | Long-lived token for renewal |
+| `expires_at` | Timestamp | Access token expiry |
+| `created_at` | Timestamp | Session creation time |
+
+### Identity Binding
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `account_id` | UUID | Owning account |
+| `mls_identity_key` | Ed25519 public key (32 bytes) | MLS credential public key |
+| `verified_fp` | SHA-256 fingerprint (32 bytes) | Fingerprint of the bound key |
+
+The identity binding table ensures that only the account that registered an
+Ed25519 public key can upload KeyPackages for that key. This prevents a
+compromised or malicious client from uploading KeyPackages under another
+account's identity.
+
+---
+
+## Wire / API Changes
+
+### Auth Struct
+
+A new `Auth` struct is added to all `NodeService` RPC methods:
+
+```capnp
+struct Auth {
+  version     @0 :UInt16;     # 0 = legacy (no auth), 1 = token-based
+  accessToken @1 :Data;       # opaque bearer token
+  deviceId    @2 :Data;       # optional UUID (16 bytes) for audit/rate limit
+}
+```
+
+The `Auth` struct is included as a parameter in `enqueue`, `fetch`, `fetchWait`,
+`uploadKeyPackage`, and `fetchKeyPackage`.
+
+### Versioning
+
+| Version | Meaning |
+|---------|---------|
+| 0 | Legacy mode: no authentication. The server may accept this in development but rejects it by default in production. |
+| 1 | Token-based authentication. `accessToken` is required and validated. |
+
+The server rejects any `version` value higher than its current maximum. This
+ensures that a newer client connecting to an older server fails cleanly rather
+than silently skipping auth.
+
+### Optional Device ID
+
+The `deviceId` field is optional. When present, the server uses it for:
+
+- Per-device rate limiting (in addition to per-account limits).
+- Audit logging (which device performed which operation).
+- Future: device revocation without revoking the entire account.
+
+---
+
+## Server Enforcement
+
+### Token Validation
+
+1. Extract `Auth` struct from the incoming RPC.
+2. If `version == 0` and server is in production mode, reject with
+   `AUTHENTICATION_REQUIRED`.
+3. If `version == 1`, validate `accessToken`:
+   - Token must exist in the session store.
+   - Token must not be expired (`expires_at > now`).
+   - Associated account must have `status == active`.
+   - Associated device (if `deviceId` present) must have `status == active`.
+4. Map validated token to `(account_id, device_id)` for downstream authorisation.
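+
+The validation steps above can be sketched in Rust roughly as follows. This is
+a minimal illustration, not the server's actual code: the `Session` layout, the
+`validate` function, the `production` flag, and every error variant other than
+`AuthenticationRequired` and `TokenExpired` are assumptions made for the
+example, and the in-memory `HashMap` stands in for the session store.
+
+```rust
+use std::collections::HashMap;
+
+/// Error codes for the sketch. Only `AuthenticationRequired` and
+/// `TokenExpired` are named on this page; the rest are hypothetical.
+#[derive(Debug, PartialEq)]
+pub enum AuthError {
+    AuthenticationRequired,
+    TokenExpired,
+    TokenInvalid,
+    AccountInactive,
+    DeviceInactive,
+    UnsupportedVersion,
+}
+
+/// Rust-side mirror of the `Auth` struct from the schema.
+pub struct Auth {
+    pub version: u16,
+    pub access_token: Vec<u8>,
+    pub device_id: Option<[u8; 16]>,
+}
+
+/// Hypothetical session record, flattened so the example stays short.
+pub struct Session {
+    pub account_id: [u8; 16],
+    pub device_id: [u8; 16],
+    pub expires_at: u64, // seconds since the Unix epoch
+    pub account_active: bool,
+    pub device_active: bool,
+}
+
+pub const MAX_AUTH_VERSION: u16 = 1;
+
+/// Returns `Ok(None)` for an allowed legacy call, or the caller's
+/// `(account_id, device_id)` for a valid version-1 token.
+pub fn validate(
+    auth: &Auth,
+    sessions: &HashMap<Vec<u8>, Session>,
+    now: u64,
+    production: bool,
+) -> Result<Option<([u8; 16], [u8; 16])>, AuthError> {
+    if auth.version > MAX_AUTH_VERSION {
+        return Err(AuthError::UnsupportedVersion);
+    }
+    if auth.version == 0 {
+        // Legacy mode: rejected in production, allowed in development.
+        return if production {
+            Err(AuthError::AuthenticationRequired)
+        } else {
+            Ok(None)
+        };
+    }
+    let session = sessions
+        .get(&auth.access_token)
+        .ok_or(AuthError::TokenInvalid)?;
+    // The token is valid only while `expires_at > now`.
+    if session.expires_at <= now {
+        return Err(AuthError::TokenExpired);
+    }
+    if !session.account_active {
+        return Err(AuthError::AccountInactive);
+    }
+    // The device status is checked only when the caller sent a `deviceId`.
+    if auth.device_id.is_some() && !session.device_active {
+        return Err(AuthError::DeviceInactive);
+    }
+    Ok(Some((session.account_id, session.device_id)))
+}
+```
+
+Returning the `(account_id, device_id)` pair keeps the identity-matching checks
+in the next section independent of how tokens are stored.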
+
+### Identity Matching
+
+- **uploadKeyPackage:** The `identityKey` in the RPC must match an identity
+  binding for the authenticated account. Reject with `IDENTITY_MISMATCH` if the
+  key is not bound to the caller's account.
+- **fetchKeyPackage:** No identity restriction (any authenticated client can
+  fetch any identity's KeyPackage -- this is required for the MLS add-member flow).
+- **enqueue:** If `channelId` is present, the caller's identity must be in the
+  channel membership. If `channelId` is absent (legacy mode), the operation is
+  allowed for any authenticated client.
+- **fetch / fetchWait:** The `recipientKey` must correspond to an identity bound
+  to the caller's account.
+
+### Rate Limits
+
+| Limit | Scope | Default |
+|-------|-------|---------|
+| Request rate | Per IP | 50 requests/second |
+| Request rate | Per account | 50 requests/second |
+| Request rate | Per device | 50 requests/second |
+| Payload size | Per RPC call | 5 MB |
+| KeyPackage TTL | Per package | 24 hours |
+| KeyPackage uploads | Per account | Configurable (prevents store exhaustion) |
+
+Rate limit counters use a sliding window. When a limit is exceeded, the server
+responds with `RATE_LIMITED` and includes a `Retry-After` hint.
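+
+One way to realise a sliding-window counter with a retry hint is a timestamp
+log per scope (one IP, account, or device). The sketch below is an assumption
+about how this could look, not the planned implementation; `SlidingWindow` and
+`check` are hypothetical names, and a production limiter would probably use a
+cheaper approximation than storing every timestamp.
+
+```rust
+use std::collections::VecDeque;
+use std::time::{Duration, Instant};
+
+/// Sliding-window log limiter for a single scope.
+pub struct SlidingWindow {
+    limit: usize,
+    window: Duration,
+    hits: VecDeque<Instant>,
+}
+
+impl SlidingWindow {
+    pub fn new(limit: usize, window: Duration) -> Self {
+        Self { limit, window, hits: VecDeque::new() }
+    }
+
+    /// Records a request at `now`, or returns `Err(retry_after)` when the
+    /// limit has already been reached inside the window.
+    pub fn check(&mut self, now: Instant) -> Result<(), Duration> {
+        // Drop timestamps that have left the window.
+        while let Some(&oldest) = self.hits.front() {
+            if now.duration_since(oldest) >= self.window {
+                self.hits.pop_front();
+            } else {
+                break;
+            }
+        }
+        if self.hits.len() >= self.limit {
+            // A slot frees up when the oldest hit leaves the window.
+            let retry_after = match self.hits.front() {
+                Some(&oldest) => self.window - now.duration_since(oldest),
+                None => self.window, // limit == 0: nothing is ever allowed
+            };
+            return Err(retry_after);
+        }
+        self.hits.push_back(now);
+        Ok(())
+    }
+}
+```
+
+With the defaults from the table, each scope would hold a
+`SlidingWindow::new(50, Duration::from_secs(1))`, and the `Err` value would
+feed the `Retry-After` hint.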
+
+### Audit Logging
+
+The following events are logged at audit level:
+
+- Authentication success (account, device, IP).
+- Authentication failure (reason, IP).
+- Token issuance and refresh (account, device).
+- KeyPackage upload (account, identity key fingerprint).
+- Enqueue (account, channel, recipient).
+- Fetch / fetchWait (account, recipient).
+- Rate limit exceeded (scope, account/IP, current rate).
+
+All audit log entries include a timestamp and correlation ID. Sensitive fields
+(token values, ciphertext, private keys) are never logged.
+
+---
+
+## Client Changes
+
+### Login / Register Flow
+
+1. **Register:** Client generates an Ed25519 identity keypair, sends the public
+   key to the server. Server creates an account, binds the identity key, and
+   returns an `(access_token, refresh_token)` pair.
+2. **Login:** Client presents credentials (initially: signed challenge from
+   device key). Server validates and issues tokens.
+3. **Token storage:** Access and refresh tokens stored in the client state file
+   (same location as identity keypair). The state file should be
+   permission-restricted (`0600`).
+4. **Token refresh:** Client detects `TOKEN_EXPIRED` errors and uses the refresh
+   token to obtain a new access token without re-authenticating.
+
+### RPC Integration
+
+Every RPC call includes the `Auth` struct:
+
+```rust
+// Pseudocode for client RPC calls
+let auth = Auth {
+    version: 1,
+    access_token: state.access_token.clone(),
+    device_id: Some(state.device_id),
+};
+node_service.enqueue(auth, recipient_key, channel_id, payload).await?;
+```
+
+### Identity Binding
+
+At registration, the client's Ed25519 public key is bound to the new account.
+The client must refuse to upload KeyPackages if the local identity key does not
+match the bound key -- this prevents accidental identity confusion after key
+rotation.
+
+---
+
+## Compatibility
+
+### Wire Version Field
+
+The `Auth` struct includes its own `version` field, independent of the delivery
+message version. This allows auth changes to evolve separately from the delivery
+protocol.
+
+### Legacy Support
+
+- `version == 0`: No auth. Server behaviour is configurable:
+  - **Development:** Allow legacy calls (default for `cargo run`).
+  - **Production:** Reject legacy calls (default for Docker deployment).
+- `version == 1`: Full auth. This is the target for M4+.
+
+### N-1 Integration Tests
+
+Compatibility testing covers:
+
+- New client (v1 auth) against new server -- expected: full auth flow works.
+- Old client (v0 legacy) against new server in dev mode -- expected: legacy
+  calls succeed.
+- Old client (v0 legacy) against new server in prod mode -- expected: clean
+  rejection with `AUTHENTICATION_REQUIRED`.
+- New client (v1 auth) against old server -- expected: server ignores unknown
+  `Auth` struct fields; operations succeed if server does not enforce auth.
+
+---
+
+## Implementation Sequence
+
+1. Extend Cap'n Proto schemas with the `Auth` struct and add it to all
+   `NodeService` methods.
+2. Implement token validation middleware in server RPC handlers; add an in-memory
+   token store (upgradeable to SQLite at M6).
+3. Bind `identityKey` to account on upload; enforce on fetch/enqueue.
+4. Add tests: unit tests for token validation; integration tests for auth
+   success and failure paths.
+5. Add rate limiting middleware with configurable thresholds.
+6. Add audit logging for all auth-related events.
+
+---
+
+## Cross-references
+
+- [Milestones](milestones.md) -- M4 and M6 deliverables
+- [Production Readiness WBS](production-readiness.md) -- Phase 3 (Auth/Device/Server Hardening)
+- [1:1 Channel Design](dm-channels.md) -- channel-level authz
+- [Wire Format: NodeService Schema](../wire-format/node-service-schema.md) -- RPC schema
+- [Coding Standards](../contributing/coding-standards.md) -- security-by-design requirements
--- a/docs/src/roadmap/dm-channels.md
+++ b/docs/src/roadmap/dm-channels.md
@@ -0,0 +1,261 @@
+# 1:1 Channel Design
+
+This page describes the design for first-class 1:1 (direct message) channels in
+quicnprotochat. Channels provide per-conversation authorisation, MLS-encrypted
+payloads, message retention with TTL eviction, and backward compatibility with
+the legacy delivery model.
+
+For the broader roadmap context, see [Milestones](milestones.md) and
+[Production Readiness WBS](production-readiness.md) (Phase 4).
+
+---
+
+## Goals
+
+1. **First-class 1:1 channels.** Each conversation between two participants has
+   a unique `channelId`, enabling per-channel authorisation, storage, and
+   eviction.
+2. **Per-channel authorisation.** The server enforces that only the two channel
+   members can enqueue and fetch messages for a given channel.
+3. **MLS-encrypted payloads.** All message content is MLS ciphertext. The server
+   never sees plaintext. Channel metadata (ID + participant keys) is the only
+   information the server holds.
+4. **7-day message retention.** Messages older than 7 days are evicted. This is
+   configurable but defaults to 7 days.
+5. **24-hour KeyPackage TTL.** KeyPackages expire after 24 hours. Clients must
+   rotate KeyPackages before expiry to remain reachable.
+
+---
+
+## Schema Changes (Cap'n Proto)
+
+### New Fields
+
+The following fields are added to the existing `NodeService` RPC methods:
+
+| RPC Method | New Field | Type | Description |
+|------------|-----------|------|-------------|
+| `enqueue` | `channelId` | `Data` (UUID, 16 bytes) | Target channel |
+| `fetch` | `channelId` | `Data` (UUID, 16 bytes) | Channel to fetch from |
+| `fetchWait` | `channelId` | `Data` (UUID, 16 bytes) | Channel to long-poll |
+| All messages | `version` | `UInt16` | Wire version for forward compat |
+
+### Version Field
+
+The `version` field on delivery messages allows the server to reject messages
+with unknown versions. The current version is `1`. Clients that do not set
+`channelId` are treated as version `0` (legacy mode).
+
+### New RPC Method
+
+A new `createChannel` method is added to `NodeService`:
+
+```capnp
+createChannel @N (
+  auth       :Auth,
+  peerKey    :Data     # Ed25519 public key of the other participant
+) -> (
+  channelId  :Data     # UUID, 16 bytes
+);
+```
+
+The server generates the `channelId`, stores the membership, and returns the ID
+to the caller. The peer discovers the channel when they receive a message
+addressed to it (or via a separate discovery mechanism in a future milestone).
+
+---
+
+## AuthZ Model
+
+### Channel Membership
+
+Each channel has exactly two members, identified by their Ed25519 public keys:
+
+```
+Channel {
+  channelId:  UUID (16 bytes)
+  members:    {a_key: Ed25519PubKey, b_key: Ed25519PubKey}
+  created_at: Timestamp
+}
+```
+
+The server stores this mapping and enforces it on every operation.
+
+### Enqueue Authorisation
+
+When a client calls `enqueue(auth, channelId, recipientKey, payload)`:
+
+1. Validate the `Auth` token (see [Auth, Devices, and Tokens](authz-plan.md)).
+2. Look up the channel by `channelId`.
+3. Verify that the caller's identity (from the token) is one of the channel's
+   two members.
+4. Verify that `recipientKey` is the *other* member of the channel (prevents
+   sending to yourself or to a non-member).
+5. Apply rate limits (50 requests/second per identity, 5 MB payload cap).
+6. Enqueue the payload.
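+
+Steps 3 and 4 reduce to a single membership lookup. The following Rust sketch
+shows one possible shape; `Channel`, `peer_of`, `authorise_enqueue`, and the
+`AuthzError` variants are hypothetical names chosen for the example, and token
+validation and rate limiting are assumed to have happened elsewhere.
+
+```rust
+pub type PubKey = [u8; 32];
+
+/// Hypothetical error codes for the two membership checks.
+#[derive(Debug, PartialEq)]
+pub enum AuthzError {
+    NotAMember,
+    BadRecipient,
+}
+
+pub struct Channel {
+    pub member_a: PubKey,
+    pub member_b: PubKey,
+}
+
+impl Channel {
+    /// Returns the other member if `key` belongs to the channel.
+    pub fn peer_of(&self, key: &PubKey) -> Option<&PubKey> {
+        if key == &self.member_a {
+            Some(&self.member_b)
+        } else if key == &self.member_b {
+            Some(&self.member_a)
+        } else {
+            None
+        }
+    }
+}
+
+/// Step 3: the caller must be a member. Step 4: the recipient must be the
+/// *other* member, which also rules out sending to yourself.
+pub fn authorise_enqueue(
+    channel: &Channel,
+    caller: &PubKey,
+    recipient: &PubKey,
+) -> Result<(), AuthzError> {
+    let peer = channel.peer_of(caller).ok_or(AuthzError::NotAMember)?;
+    if peer != recipient {
+        return Err(AuthzError::BadRecipient);
+    }
+    Ok(())
+}
+```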
+
+### Fetch Authorisation
+
+When a client calls `fetch(auth, channelId, recipientKey)` or
+`fetchWait(auth, channelId, recipientKey, timeout)`:
+
+1. Validate the `Auth` token.
+2. Verify that the caller's identity matches `recipientKey`.
+3. Verify that `recipientKey` is a member of the specified channel.
+4. Return messages for `(channelId, recipientKey)`, filtering out expired
+   messages (TTL check).
+
+---
+
+## Storage Model
+
+### Channels Table
+
+| Column | Type | Description |
+|--------|------|-------------|
+| `channel_id` | UUID (16 bytes) | Primary key |
+| `member_a_key` | Ed25519 public key (32 bytes) | First member |
+| `member_b_key` | Ed25519 public key (32 bytes) | Second member |
+| `created_at` | Timestamp | Channel creation time |
+
+A unique constraint on `(member_a_key, member_b_key)`, with the two keys stored
+in sorted order, prevents duplicate channels between the same pair of
+identities.
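+
+The sorting matters because `(alice, bob)` and `(bob, alice)` must map to the
+same row. A small Rust sketch of the normalisation, assuming the keys are
+compared as raw bytes (`canonical_pair` is a hypothetical helper):
+
+```rust
+pub type PubKey = [u8; 32];
+
+/// Orders two member keys lexicographically by byte value, so that both
+/// argument orders produce the same `(member_a_key, member_b_key)` pair.
+pub fn canonical_pair(x: PubKey, y: PubKey) -> (PubKey, PubKey) {
+    if x <= y {
+        (x, y)
+    } else {
+        (y, x)
+    }
+}
+```
+
+`createChannel` would apply this before the insert, letting the unique
+constraint reject a second channel regardless of who initiates it.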
+
+### Delivery Queue
+
+Messages are keyed by `(channelId, recipient_key)`:
+
+| Column | Type | Description |
+|--------|------|-------------|
+| `channel_id` | UUID (16 bytes) | Channel |
+| `recipient_key` | Ed25519 public key (32 bytes) | Intended recipient |
+| `payload` | Bytes | MLS ciphertext (opaque to server) |
+| `received_at` | Timestamp | Server receive time |
+| `sequence_no` | UInt64 | Per-channel, per-recipient monotonic counter |
+
+### TTL Eviction
+
+Messages are evicted in two ways:
+
+1. **Fetch-time check:** When a client fetches messages, the server filters out
+   any message where `received_at + TTL < now`. This is the primary eviction
+   path.
+2. **Background sweep:** A periodic task (configurable interval, default 1 hour)
+   scans for and deletes expired messages. This prevents unbounded storage
+   growth from inactive channels.
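+
+Both eviction paths can share one expiry predicate. The sketch below is a
+simplified in-memory illustration; `Queued`, `is_expired`, and `sweep` are
+hypothetical names, and a SQL-backed store would express the same rule as a
+`DELETE` with a timestamp comparison.
+
+```rust
+use std::time::{Duration, SystemTime};
+
+/// Default message retention: 7 days.
+pub const MESSAGE_TTL: Duration = Duration::from_secs(7 * 24 * 60 * 60);
+
+pub struct Queued {
+    pub payload: Vec<u8>,
+    pub received_at: SystemTime,
+}
+
+/// A message is expired when `received_at + TTL < now`.
+pub fn is_expired(received_at: SystemTime, ttl: Duration, now: SystemTime) -> bool {
+    received_at + ttl < now
+}
+
+/// Background sweep: removes expired messages and returns how many were
+/// dropped. The fetch path can apply `is_expired` as a filter instead.
+pub fn sweep(queue: &mut Vec<Queued>, ttl: Duration, now: SystemTime) -> usize {
+    let before = queue.len();
+    queue.retain(|m| !is_expired(m.received_at, ttl, now));
+    before - queue.len()
+}
+```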
+
+Default TTL values:
+
+| Entity | TTL | Configurable |
+|--------|-----|-------------|
+| Messages | 7 days | Yes |
+| KeyPackages | 24 hours | Yes |
+
+---
+
+## Flows
+
+### Create Channel
+
+```
+Alice                           Server                          Bob
+  |                                |                              |
+  |-- createChannel(auth, bob_key) |                              |
+  |                                |-- generate channelId         |
+  |                                |-- store {channelId,          |
+  |                                |    alice_key, bob_key}       |
+  |<- channelId ------------------|                              |
+  |                                |                              |
+```
+
+Alice receives the `channelId` and can now send messages to Bob on this channel.
+Bob discovers the channel when he receives the first message (the `channelId` is
+included in the delivery metadata).
+
+### Send (with AuthZ)
+
+```
+Alice                           Server
+  |                                |
+  |-- enqueue(auth, channelId,     |
+  |     bob_key, mls_ciphertext)   |
+  |                                |-- validate auth token
+  |                                |-- lookup channel membership
+  |                                |-- verify alice_key in members
+  |                                |-- verify bob_key is recipient
+  |                                |-- check rate limits
+  |                                |-- store (channelId, bob_key,
+  |                                |    payload, received_at, seq)
+  |<- ok (sequence_no) ------------|
+  |                                |
+```
+
+### Receive (with TTL)
+
+```
+Bob                             Server
+  |                                |
+  |-- fetchWait(auth, channelId,   |
+  |     bob_key, timeout)          |
+  |                                |-- validate auth token
+  |                                |-- verify bob_key in channel
+  |                                |-- query (channelId, bob_key)
+  |                                |-- filter: received_at + 7d > now
+  |                                |-- return non-expired messages
+  |<- messages[] ------------------|
+  |                                |
+```
+
+---
+
+## Backward Compatibility
+
+### Legacy Mode (channelId = nil)
+
+When `channelId` is empty or absent:
+
+- The server treats the request as a legacy delivery (pre-channel behavior).
+- Messages are routed solely by `recipientKey`, without channel-level authz.
+- This mode can be disabled in production via server configuration.
+
+### Version Negotiation
+
+The `version` field on delivery messages allows clean rejection of future schema
+changes:
+
+| Version | Behavior |
+|---------|----------|
+| 0 | Legacy mode: no `channelId`, no per-channel authz |
+| 1 | Channel-aware: `channelId` required, authz enforced |
+
+The server rejects messages with `version > max_supported`.
+
+---
+
+## Open Items
+
+These items are deferred to future milestones:
+
+- **Persistence backend:** The current `DashMap`-based store must be extended to
+  SQLite (or SQLCipher) for durable channel and delivery state. See
+  [Milestones: M6](milestones.md#m6----persistence-planned).
+- **Channel discovery API:** A dedicated RPC for Bob to discover channels he is
+  a member of, rather than relying on first-message discovery.
+- **Client UX:** Map each peer identity to its `channelId` (via discovery) and
+  cache the mapping in the client state file.
+- **Audit logging:** Log channel creation, authz failures, send/recv events with
+  redaction of ciphertext. See [Auth, Devices, and Tokens](authz-plan.md) for
+  the audit logging design.
+- **Multi-device:** A single account on multiple devices sharing the same
+  channel. Requires per-device delivery queues and MLS multi-device support.
+
+---
+
+## Cross-references
+
+- [Milestones](milestones.md) -- M4 (CLI subcommands) and M6 (persistence)
+- [Production Readiness WBS](production-readiness.md) -- Phase 4 (Delivery Semantics)
+- [Auth, Devices, and Tokens](authz-plan.md) -- token validation and identity binding
+- [Wire Format: Delivery Schema](../wire-format/delivery-schema.md) -- current delivery schema
+- [Wire Format: NodeService Schema](../wire-format/node-service-schema.md) -- RPC interface
+- [Architecture Overview](../architecture/overview.md) -- system diagram and service model
--- a/docs/src/roadmap/future-research.md
+++ b/docs/src/roadmap/future-research.md
@@ -0,0 +1,406 @@
+# Future Research Directions
+
+This page catalogues technologies and research directions that could strengthen
+quicnprotochat beyond the current [milestone plan](milestones.md). Each entry
+includes a brief description, the problem it solves, relevant crates or
+specifications, and how it maps to the project architecture.
+
+For the production readiness work breakdown, see
+[Production Readiness WBS](production-readiness.md).
+
+---
+
+## Transport and Networking
+
+### LibP2P / iroh (n0)
+
+**Problem:** The current architecture is strictly client-server. Clients behind
+NAT cannot communicate directly, and the server is a single point of failure for
+delivery.
+
+**Solution:** [LibP2P](https://libp2p.io/) and [iroh](https://iroh.computer/)
+(from n0) provide peer discovery, NAT traversal (hole-punching), and relay
+fallback. iroh is particularly interesting because it is Rust-native and built on
+QUIC, aligning with quicnprotochat's existing transport layer.
+
+**Architecture impact:** Move from pure client-server to a hybrid topology where
+peers communicate directly when possible and fall back to server relay when NAT
+traversal fails. The server role shifts from mandatory relay to optional
+rendezvous/relay node.
+
+**Crates:** `libp2p`, `iroh`, `iroh-net`
+
+### WebTransport (HTTP/3)
+
+**Problem:** Browser clients cannot use raw QUIC. The current stack requires a
+native Rust binary.
+
+**Solution:** [WebTransport](https://w3c.github.io/webtransport/) exposes
+QUIC-like semantics (multiplexed bidirectional streams, datagrams) to browsers
+over HTTP/3. A WebTransport endpoint alongside the existing QUIC listener would
+enable a web client without WebSocket degradation.
+
+**Architecture impact:** Add a second listener (HTTP/3 + WebTransport) that
+terminates WebTransport and bridges into the existing `NodeService` RPC layer.
+Cap'n Proto serialisation works in WASM via the `capnp` crate.
+
+**Crates:** `h3`, `h3-webtransport`, `wtransport`
+
+### Tor / I2P Integration
+
+**Problem:** MLS protects message content, but connection metadata (who connects
+to the server, when, how often) leaks to the server and network observers.
+
+**Solution:** Route client-server connections through
+[Tor](https://www.torproject.org/) onion services or
+[I2P](https://geti2p.net/) tunnels. This provides metadata resistance at the
+network layer.
+
+**Architecture impact:** The server exposes a `.onion` address (Tor) or an I2P
+destination. Clients connect through the anonymity network. Latency increases
+significantly, so this should be optional.
+
+**Crates:** `arti` (Tor client in Rust), `arti-client`
+
+---
+
+## Storage and Persistence
+
+### SQLCipher / libsql (Turso)
+
+**Problem:** At M6, quicnprotochat needs persistent storage for group state, key
+material, and message queues. Storing private keys in a plaintext SQLite database
+is insufficient.
+
+**Solution:** [SQLCipher](https://www.zetetic.net/sqlcipher/) provides
+transparent, page-level AES-256 encryption for SQLite. Alternatively,
+[libsql](https://turso.tech/libsql) (Turso) offers a SQLite fork with
+encryption, replication, and embedded server capabilities.
+
+**Architecture impact:** Replace the `sqlx` SQLite backend with SQLCipher.
+Encryption key derived from a user-provided passphrase (via Argon2id) or a
+hardware-backed key.
+
+**Crates:** `rusqlite` (with `bundled-sqlcipher` feature), `libsql`
+
+### CRDTs (Automerge / Yrs)
+
+**Problem:** Multi-device support requires synchronising state (group membership,
+read receipts, settings) across devices without a central authority resolving
+conflicts.
+
+**Solution:** Conflict-free replicated data types (CRDTs) allow concurrent edits
+to converge without coordination. [Automerge](https://automerge.org/) and
+[Yrs](https://docs.rs/yrs/) (Yjs in Rust) provide production-quality CRDT
+implementations.
+
+**Architecture impact:** Client-side state (contact list, group membership
+cache, read markers) stored as CRDT documents. Synchronisation happens over the
+existing MLS-encrypted channel, ensuring the server never sees the state.
+
+**Crates:** `automerge`, `yrs`
+
+### Object Storage (S3-compatible)
+
+**Problem:** Encrypted file and media attachments need a storage backend that
+the server can host without seeing the content.
+
+**Solution:** An S3-compatible object store (MinIO, Garage, or a cloud provider)
+for encrypted blobs. Clients encrypt attachments client-side (using a key derived
+from the MLS group secret) and upload the ciphertext. The server stores and
+serves opaque blobs.
+
+**Architecture impact:** Add a media upload/download RPC to `NodeService`. The
+server proxies to the object store or returns pre-signed URLs.
+
+**Crates:** `aws-sdk-s3`, `opendal`
+
+---
+
+## Cryptography and Privacy
+
+### ML-KEM + ML-DSA Hybrid (Post-Quantum MLS)
+
+**Problem:** Quantum computers threaten X25519 and Ed25519. While MLS content is
+protected by ephemeral key exchange, the init keys are vulnerable to
+harvest-now-decrypt-later attacks, and credential signatures become forgeable
+once a sufficiently capable quantum computer exists.
+
+**Solution:** Hybrid X25519 + ML-KEM-768 KEM for MLS init keys, and optionally
+hybrid Ed25519 + ML-DSA-65 for credential signatures. The `ml-kem` crate is
+already vendored in the workspace.
+
+**Architecture impact:** Custom `OpenMlsCryptoProvider` in `quicnprotochat-core`
+implementing the hybrid combiner. This is the M7 milestone -- see
+[Milestones](milestones.md#m7----post-quantum-planned) and
+[Hybrid KEM](../protocol-layers/hybrid-kem.md).
+
+**Crates:** `ml-kem`, `ml-dsa`
+
+**References:** NIST FIPS 203 (ML-KEM), `draft-ietf-tls-hybrid-design`
+
+### Private Information Retrieval (PIR)
+
+**Problem:** When a client fetches messages or KeyPackages, the server learns
+*which* recipient is requesting -- even though it cannot read the content.
+
+**Solution:** Private Information Retrieval (PIR) allows a client to fetch a
+record from the server without revealing which record was requested.
+[SealPIR](https://github.com/microsoft/SealPIR) and SimplePIR provide practical
+constructions.
+
+**Architecture impact:** Replace the `fetch` / `fetchKeyPackage` RPCs with PIR
+queries. This is a significant performance trade-off: PIR has high computational
+cost. It is better suited to KeyPackage fetch (small database) than to message
+fetch (large database), so KeyPackage fetch would be the first candidate.
+
+### Sealed Sender (Signal-style)
+
+**Problem:** The server sees `(sender, recipient, timestamp)` metadata on every
+enqueued message. Even without reading content, this metadata reveals social
+graphs.
+
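+**Solution:** [Sealed Sender](https://signal.org/blog/sealed-sender/), as
+introduced by Signal, hides the sender's identity from the server by carrying
+it inside the encrypted envelope. Applied here, the sender's identity would
+travel inside the MLS ciphertext, so the server routes by `recipientKey` only
+and cannot determine who sent the message.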
+
+**Architecture impact:** Modify the `enqueue` RPC to omit sender identity from
+the server-visible metadata. The sender identity is included only inside the
+MLS application message (encrypted).
+
+### Key Transparency (IETF draft)
+
+**Problem:** A compromised server could substitute public keys, performing a
+man-in-the-middle attack on MLS group formation.
+
+**Solution:** A verifiable, append-only log of public key bindings (similar to
+Certificate Transparency for TLS). Clients verify that the server's response
+matches the log before trusting a fetched KeyPackage.
+
+**Architecture impact:** Add a key transparency log (Merkle tree) alongside the
+Authentication Service. Clients verify inclusion proofs on every `fetchKeyPackage`
+response.
+
+**References:** `draft-ietf-keytrans-protocol`
+
+---
+
+## Identity and Authentication
+
+### DIDs (Decentralized Identifiers)
+
+**Problem:** User identities are currently bound to the server. If the server
+goes away, identities are lost.
+
+**Solution:** [Decentralized Identifiers](https://www.w3.org/TR/did-core/)
+(`did:key`, `did:web`) provide self-sovereign identity. A user's DID is derived
+from their Ed25519 public key and is portable across servers.
+
+**Architecture impact:** Replace raw Ed25519 public keys in MLS credentials with
+DID URIs. The server resolves DIDs to public keys for routing.
+
+**Crates:** `did-key`, `ssi`
+
+### OPAQUE (aPAKE)
+
+**Problem:** If quicnprotochat adds password-based account registration, the
+server must never see the password -- not even a hash.
+
+**Solution:** [OPAQUE](https://datatracker.ietf.org/doc/rfc9807/) is an
+asymmetric password-authenticated key exchange in which the password never
+leaves the client and the server stores only a one-way transformation of it.
+Precomputed dictionary attacks are not possible; an attacker must first steal
+the server's record and then attack each password offline.
+
+**Architecture impact:** Replace the registration/login flow with OPAQUE. The
+server stores an OPAQUE registration record; the client runs the OPAQUE protocol
+to authenticate and derive a session key.
+
+**Crates:** `opaque-ke`
+
+**References:** RFC 9807 (OPAQUE), RFC 9497 (OPRF, used by OPAQUE)
+
+### WebAuthn / Passkeys
+
+**Problem:** Password-based auth (even with OPAQUE) is vulnerable to phishing.
+Hardware-backed authentication provides stronger device binding.
+
+**Solution:** [WebAuthn](https://www.w3.org/TR/webauthn-3/) / Passkeys allow
+authentication via hardware tokens (YubiKey), platform authenticators (Touch ID,
+Windows Hello), or synced passkeys.
+
+**Architecture impact:** Add a WebAuthn registration/authentication flow to the
+account system. Requires a server-side WebAuthn relying party implementation.
+
+**Crates:** `webauthn-rs`
+
+### Verifiable Credentials (W3C VC)
+
+**Problem:** Proving attributes (organization membership, role, age) without
+revealing full identity.
+
+**Solution:** [Verifiable Credentials](https://www.w3.org/TR/vc-data-model/)
+allow a user to present cryptographic proofs of attributes issued by a trusted
+authority.
+
+**Architecture impact:** Extend MLS credentials with VC presentation. A group
+admin could require proof of organization membership before allowing join.
+
+---
+
+## Application Layer
+
+### Matrix-style Federation
+
+**Problem:** A single server is a single point of failure and a single point of
+trust. Users on different servers cannot communicate.
+
+**Solution:** Federation allows multiple quicnprotochat servers to exchange
+messages, similar to [Matrix](https://matrix.org/) homeserver federation. Each
+server manages its own users and relays messages to peer servers.
+
+**Architecture impact:** Major. Requires server-to-server protocol, distributed
+identity resolution, and cross-server MLS group management.
+
+### WASM Plugin System
+
+**Problem:** Extensibility (bots, bridges, custom message types) currently
+requires forking the codebase.
+
+**Solution:** A sandboxed WASM plugin system allows third-party extensions to run
+inside the client or server without access to private key material.
+
+**Architecture impact:** Define a plugin API (message hooks, command handlers).
+Plugins compiled to WASM and loaded at runtime via `wasmtime` or `wasmer`.
+
+**Crates:** `wasmtime`, `wasmer`, `extism`
+
+### Double-Ratchet DM Layer
+
+**Problem:** MLS is optimised for groups. For efficient 1:1 conversations, the
+Signal protocol (X3DH key agreement plus the Double Ratchet, formerly known as
+Axolotl) provides better performance characteristics (no tree overhead for two
+parties).
+
+**Solution:** Implement a double-ratchet layer for 1:1 DMs, using MLS only for
+groups with N > 2. The [1:1 Channel Design](dm-channels.md) currently uses MLS
+for DMs; this would be an optimisation.
+
+**References:** [The Double Ratchet Algorithm](https://signal.org/docs/specifications/doubleratchet/),
+[X3DH Key Agreement Protocol](https://signal.org/docs/specifications/x3dh/)
+
+---
+
+## Observability and Operations
+
+### OpenTelemetry (Tracing + Metrics)
+
+**Problem:** The current logging is `tracing`-based but lacks distributed
+tracing context and structured metrics export.
+
+**Solution:** [OpenTelemetry](https://opentelemetry.io/) provides a unified
+framework for distributed tracing, metrics, and log correlation. OTLP export
+enables integration with any observability backend.
+
+**Architecture impact:** Add `tracing-opentelemetry` and `opentelemetry-otlp`
+to the server. Instrument RPC handlers with spans. Export to Jaeger, Grafana
+Tempo, or any OTLP-compatible backend.
+
+**Crates:** `opentelemetry`, `opentelemetry-otlp`, `tracing-opentelemetry`
+
+### Prometheus + Grafana
+
+**Problem:** No quantitative visibility into server performance (throughput,
+latency, queue depth, epoch advancement rate).
+
+**Solution:** Export Prometheus metrics from the server. Visualise with Grafana
+dashboards.
+
+**Metrics to export:** message throughput (enqueue/fetch per second), RPC
+latency histograms, MLS epoch advancement rate, delivery queue depth, KeyPackage
+store size, active connections.
+
+**Crates:** `prometheus`, `metrics`, `metrics-exporter-prometheus`
+
+### Testcontainers-rs
+
+**Problem:** Integration tests currently run server and client in the same
+process (`tokio::spawn`). This does not test real network conditions, container
+startup, or multi-process interactions.
+
+**Solution:** [Testcontainers-rs](https://docs.rs/testcontainers/) runs Docker
+containers from Rust tests, enabling true end-to-end CI with real network
+boundaries.
+
+**Architecture impact:** Add testcontainers-based integration tests alongside
+the existing in-process tests. The Docker image is already maintained.
+
+**Crates:** `testcontainers`, `testcontainers-modules`
+
+---
+
+## Developer Experience
+
+### Tauri / Dioxus (Native GUI)
+
+**Problem:** The current interface is CLI-only. A graphical client would broaden
+the user base for testing and demonstration.
+
+**Solution:** [Tauri](https://tauri.app/) or [Dioxus](https://dioxuslabs.com/)
+provide native cross-platform GUI frameworks in Rust. The
+`quicnprotochat-core` crate can be shared directly with the GUI client.
+
+**Architecture impact:** Add a `quicnprotochat-gui` crate that depends on
+`quicnprotochat-core` and `quicnprotochat-proto`. The GUI drives the same
+`GroupMember` and RPC logic as the CLI client.
+
+**Crates:** `tauri`, `dioxus`
+
+### uniffi / diplomat (Mobile FFI)
+
+**Problem:** Mobile clients (iOS, Android) cannot use the Rust binary directly.
+
+**Solution:** [uniffi](https://github.com/mozilla/uniffi-rs) (Mozilla) and
+[diplomat](https://github.com/rust-diplomat/diplomat) generate foreign-language
+bindings from Rust definitions; uniffi targets Swift and Kotlin directly.
+
+**Architecture impact:** Expose `quicnprotochat-core` through a C-compatible FFI
+layer. Mobile apps call into the Rust crypto and protocol logic.
+
+**Crates:** `uniffi`, `diplomat`
+
+### Nix Flakes
+
+**Problem:** The development environment requires `capnp` (Cap'n Proto compiler),
+a specific Rust toolchain version, and test infrastructure. Setup varies across
+developer machines.
+
+**Solution:** [Nix flakes](https://nixos.wiki/wiki/Flakes) provide a
+reproducible, declarative development environment. A single `nix develop`
+command sets up the toolchain, `capnp`, and all dependencies.
+
+**Architecture impact:** Add `flake.nix` and `flake.lock` to the repository root.
+
+---
+
+## Top 5 Priority Implementations
+
+The following table ranks the most impactful technologies for near-term adoption,
+considering the current state of the codebase and the [milestone plan](milestones.md).
+
+| Priority | Technology | Why | Unlocks |
+|----------|-----------|-----|---------|
+| 1 | **Post-quantum hybrid KEM** | `ml-kem` is already vendored in the workspace. Completing the hybrid `OpenMlsCryptoProvider` makes quicnprotochat one of the first PQ MLS implementations. | M7 |
+| 2 | **SQLCipher persistence** | Encrypted-at-rest storage is the prerequisite for multi-device support, offline usage, and server restart survival. | M6 |
+| 3 | **OPAQUE auth** | Zero-knowledge password authentication is a massive security uplift for the account system. The server never sees or stores passwords. | Phase 3 (authz) |
+| 4 | **iroh / LibP2P** | NAT traversal and an optional P2P mesh make quicnprotochat deployable without centralised infrastructure. Aligns with the existing QUIC transport. | Beyond M7 |
+| 5 | **Sealed Sender + PIR** | Content encryption is table stakes. Metadata resistance (hiding who talks to whom) is the frontier of private messaging research. | Beyond M7 |
+
+---
+
+## Cross-references
+
+- [Milestones](milestones.md) -- current milestone tracker
+- [Production Readiness WBS](production-readiness.md) -- phased work breakdown
+- [Auth, Devices, and Tokens](authz-plan.md) -- OPAQUE integration point
+- [1:1 Channel Design](dm-channels.md) -- double-ratchet optimisation context
+- [Hybrid KEM](../protocol-layers/hybrid-kem.md) -- existing PQ design
+- [ADR-006: PQ Gap in Noise Transport](../design-rationale/adr-006-pq-gap.md) -- accepted PQ risk
+- [References](../appendix/references.md) -- standards and crate documentation
--- a/docs/src/roadmap/milestones.md
+++ b/docs/src/roadmap/milestones.md
@@ -0,0 +1,194 @@
+# Milestone Tracker
+
+This page tracks the project milestones for quicnprotochat, from initial transport
+layer through post-quantum cryptography. Each milestone produces production-ready,
+tested, deployable code -- see [Coding Standards](../contributing/coding-standards.md)
+for what that means in practice.
+
+---
+
+## Milestone Summary
+
+| # | Name | Status | What it adds |
+|---|------|--------|-------------|
+| M1 | QUIC/TLS Transport | **Complete** | QUIC + TLS 1.3 endpoint, length-prefixed framing, Ping/Pong |
+| M2 | Authentication Service | **Complete** | Ed25519 identity, KeyPackage generation, AS upload/fetch |
+| M3 | Delivery Service + MLS Groups | **Complete** | DS relay, GroupMember create/join/add/send/recv |
+| M4 | Group CLI Subcommands | **Next** | Persistent CLI (create-group, invite, join, send, recv); `demo-group` already available |
+| M5 | Multi-party Groups | Planned | N > 2 members, Commit fan-out, Proposal handling |
+| M6 | Persistence | Planned | SQLite key store, durable group state |
+| M7 | Post-quantum | Planned | PQ hybrid for MLS/HPKE (X25519 + ML-KEM-768) |
+
+---
+
+## M1 -- QUIC/TLS Transport (Complete)
+
+**Goal:** Two processes establish a QUIC connection over TLS 1.3 and exchange
+typed Cap'n Proto frames.
+
+**Deliverables:**
+
+- `schemas/envelope.capnp`: `Envelope` struct with `MsgType` enum (Ping/Pong at this stage)
+- `quicnprotochat-proto`: `build.rs` invoking `capnpc`, generated type re-exports,
+  canonical serialisation helpers
+- `quicnprotochat-core`: static X25519 keypair generation, Noise\_XX initiator and
+  responder, length-prefixed Cap'n Proto frame codec (Tokio `Encoder`/`Decoder`)
+- `quicnprotochat-server`: QUIC listener with TLS 1.3 (quinn/rustls), Ping/Pong
+  handler, one tokio task per connection
+- `quicnprotochat-client`: connects over QUIC, sends Ping, receives Pong, exits 0
+- Integration test: server and client in same test binary using `tokio::spawn`
+- `docker-compose.yml` running the server
+
+**Tests:** codec (7 unit tests), keypair (3 unit tests), Noise transport integration.
+
+**Branch:** `feat/m1-noise-transport`
+
+---
+
+## M2 -- Authentication Service (Complete)
+
+**Goal:** Clients register an Ed25519 identity and publish/fetch MLS KeyPackages
+via Cap'n Proto RPC.
+
+**Deliverables:**
+
+- `schemas/auth.capnp`: `AuthenticationService` interface (`uploadKeyPackage`,
+  `fetchKeyPackage`)
+- `quicnprotochat-core`: Ed25519 identity keypair generation, MLS KeyPackage
+  generation via `openmls`
+- `quicnprotochat-server`: AS RPC server with `DashMap` store, atomic consume-on-fetch
+- `quicnprotochat-client`: `register-state` and `fetch-key` CLI subcommands
+- Integration test: Alice uploads KeyPackage, Bob fetches it, fingerprints match
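+
+A minimal sketch of the consume-on-fetch behaviour, using only the standard
+library. The server is described as using a `DashMap`; the `Mutex<HashMap<..>>`
+used here, the `KeyPackageStore` name, and the choice to queue several
+KeyPackages per identity are assumptions made for illustration:
+
+```rust
+use std::collections::{HashMap, VecDeque};
+use std::sync::Mutex;
+
+/// Hypothetical in-memory store: identity key -> queue of serialised KeyPackages.
+#[derive(Default)]
+pub struct KeyPackageStore {
+    inner: Mutex<HashMap<Vec<u8>, VecDeque<Vec<u8>>>>,
+}
+
+impl KeyPackageStore {
+    /// Append a KeyPackage for the given identity (assumed append semantics).
+    pub fn upload(&self, identity_key: &[u8], key_package: Vec<u8>) {
+        let mut map = self.inner.lock().expect("store mutex poisoned");
+        map.entry(identity_key.to_vec())
+            .or_default()
+            .push_back(key_package);
+    }
+
+    /// Remove and return the oldest KeyPackage, so each one is handed out once.
+    pub fn fetch(&self, identity_key: &[u8]) -> Option<Vec<u8>> {
+        let mut map = self.inner.lock().expect("store mutex poisoned");
+        let queue = map.get_mut(identity_key)?;
+        let key_package = queue.pop_front();
+        // Drop empty queues so the map does not grow without bound.
+        if queue.is_empty() {
+            map.remove(identity_key);
+        }
+        key_package
+    }
+}
+```
+
+Because `fetch` removes the KeyPackage while holding the lock, two concurrent
+callers cannot receive the same package.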
+
+**Tests:** auth\_service.rs integration tests (upload, fetch, consume semantics).
+
+---
+
+## M3 -- Delivery Service + MLS Groups (Complete)
+
+**Goal:** Alice creates a group and adds Bob via MLS Welcome. Both exchange
+encrypted application messages through the Delivery Service.
+
+**Deliverables:**
+
+- Unified `NodeService` on port 7000 combining Authentication Service and Delivery
+  Service into a single Cap'n Proto RPC interface
+- `GroupMember` struct with full MLS lifecycle: `create_group`, `add_member`,
+  `join_from_welcome`, `send_message`, `receive_message`
+- DS relay with `enqueue`, `fetch`, and `fetchWait` (long-polling) operations
+- `demo-group` subcommand exercising the complete Alice/Bob flow in one process
+- Channel-aware delivery: messages routed by `(channelId, recipientKey)`
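+
+Routing by `(channelId, recipientKey)` can be pictured as one FIFO queue per
+key pair. The sketch below is an assumption-laden illustration, not the
+server's code: `DeliveryQueues`, `RouteKey`, and the drain-on-fetch behaviour
+are hypothetical.
+
+```rust
+use std::collections::{HashMap, VecDeque};
+
+/// Hypothetical routing key: (channel ID, recipient identity key).
+type RouteKey = (Vec<u8>, Vec<u8>);
+
+/// Minimal in-memory relay holding opaque MLS ciphertext per route.
+#[derive(Default)]
+pub struct DeliveryQueues {
+    queues: HashMap<RouteKey, VecDeque<Vec<u8>>>,
+}
+
+impl DeliveryQueues {
+    /// Append a payload to the queue for (channel, recipient).
+    pub fn enqueue(&mut self, channel_id: &[u8], recipient_key: &[u8], payload: Vec<u8>) {
+        self.queues
+            .entry((channel_id.to_vec(), recipient_key.to_vec()))
+            .or_default()
+            .push_back(payload);
+    }
+
+    /// Drain all pending payloads for (channel, recipient) in FIFO order.
+    pub fn fetch(&mut self, channel_id: &[u8], recipient_key: &[u8]) -> Vec<Vec<u8>> {
+        self.queues
+            .remove(&(channel_id.to_vec(), recipient_key.to_vec()))
+            .map(|queue| queue.into_iter().collect())
+            .unwrap_or_default()
+    }
+}
+```
+
+Keying on the pair keeps channels isolated: fetching for one channel never
+returns messages queued for the same recipient on another channel.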
+
+**Tests:** All passing -- codec (5+ tests), keypair (3 tests), group round-trip,
+group\_id lifecycle, MLS integration.
+
+**Key design decisions from M3:**
+
+1. **OpenMlsRustCrypto backend holds the HPKE init key in memory.** The same
+   `GroupMember` instance that generated the KeyPackage must process the
+   corresponding Welcome. If the process exits in between, the init private key
+   is lost. This is by design for M3; persistence comes at M6.
+
+2. **KeyPackage wire format: raw TLS-encoded bytes.** KeyPackages are serialised
+   using `tls_serialize_detached()` rather than wrapped in `MlsMessageOut`. This
+   avoids an extra layer of indirection and matches what `openmls` expects on the
+   receive side via `KeyPackageIn::tls_deserialize_exact()`.
+
+3. **openmls 0.5 API gotchas.** Several `openmls` methods changed signatures
+   between 0.4 and 0.5 (e.g., `MlsGroup::new` vs `MlsGroup::new_with_group_id`,
+   `BasicCredential::new` taking `Vec<u8>` directly). These differences are
+   documented inline in `quicnprotochat-core/src/group.rs`.
+
+**Branch:** `feat/m1-noise-transport`
+
+---
+
+## M4 -- Group CLI Subcommands (Next)
+
+**Goal:** Persistent, composable CLI subcommands for group operations, replacing
+the monolithic `demo-group` proof-of-concept.
+
+**Planned deliverables:**
+
+- `create-group` -- creates a new MLS group, stores state locally
+- `invite <identity>` -- adds a member by fetching their KeyPackage from the AS
+- `join` -- processes a Welcome message and joins an existing group
+- `send <message>` -- encrypts and enqueues an application message
+- `recv` -- fetches and decrypts pending messages (or long-polls with `fetchWait`)
+
+The `demo-group` subcommand remains available as a single-command demonstration
+of the full flow.
+
+---
+
+## M5 -- Multi-party Groups (Planned)
+
+**Goal:** Support groups with N > 2 members, including Commit fan-out and
+Proposal handling.
+
+**Planned deliverables:**
+
+- Commit fan-out through the DS to all group members
+- Proposal handling (Add, Remove, Update)
+- Epoch synchronisation across N members
+- Criterion benchmarks: key generation, encap/decap, group-add latency
+  (10/100/1000 members)
+
+---
+
+## M6 -- Persistence (Planned)
+
+**Goal:** Server survives restart. Client state persists across sessions.
+
+**Planned deliverables:**
+
+- `quicnprotochat-server`: SQLite via `sqlx` for AS key store and DS message log,
+  `migrations/` directory
+- `docker/Dockerfile`: multi-stage build (`rust:bookworm` builder, `debian:bookworm-slim` runtime)
+- `docker-compose.yml`: server + SQLite volume, healthcheck
+- Client reconnect with session resume (re-handshake + rejoin group epoch from
+  DS log)
+
+See [Future Research: SQLCipher](future-research.md#storage--persistence) for
+encrypted-at-rest options.
+
+---
+
+## M7 -- Post-quantum (Planned)
+
+**Goal:** Replace the MLS crypto backend with a hybrid X25519 + ML-KEM-768 KEM,
+providing post-quantum confidentiality for all group key material.
+
+**Planned deliverables:**
+
+- Custom `OpenMlsCryptoProvider` with hybrid KEM in `quicnprotochat-core`
+- Hybrid shared secret derivation:
+
+  ```
+  SharedSecret = HKDF-SHA256(
+    ikm  = X25519_ss || ML-KEM-768_ss,
+    info = "quicnprotochat-hybrid-v1",
+    len  = 32
+  )
+  ```
+
+- All M3/M4/M5 tests pass unchanged with the new ciphersuite
+- Follows the combiner approach from `draft-ietf-tls-hybrid-design`
+
+The `ml-kem` crate is already vendored in the workspace. See
+[Hybrid KEM](../protocol-layers/hybrid-kem.md) for the detailed design and
+[ADR-006: PQ Gap in Noise Transport](../design-rationale/adr-006-pq-gap.md) for
+the accepted residual risk in the transport layer.
+
+---
+
+## Cross-references
+
+- [Production Readiness WBS](production-readiness.md) -- phased work breakdown
+  for hardening beyond the milestone track
+- [Auth, Devices, and Tokens](authz-plan.md) -- authentication and authorisation
+  design that cuts across M4--M6
+- [1:1 Channel Design](dm-channels.md) -- DM channel schema and authz model
+- [Future Research](future-research.md) -- technology options for M6+ and beyond
+- [Testing Strategy](../contributing/testing.md) -- how tests are structured
+  across milestones
--- a/docs/src/roadmap/production-readiness.md
+++ b/docs/src/roadmap/production-readiness.md
@@ -0,0 +1,226 @@
+# Production Readiness WBS
+
+This page defines the work breakdown structure (WBS) for taking quicnprotochat
+from a proof-of-concept to a production-hardened system. It covers feature scope,
+security policy, phased delivery, and a planning checklist.
+
+For the milestone-by-milestone tracker, see [Milestones](milestones.md). This
+document focuses on the cross-cutting concerns that span multiple milestones.
+
+---
+
+## Feature Scope (Must-Have)
+
+These are the feature areas that must be addressed before quicnprotochat can be
+considered production-ready. Each area maps to one or more milestones or phases
+in the WBS below.
+
+| Area | Description | Primary Milestone |
+|------|-------------|-------------------|
+| **Identity / Auth** | Account creation, device registration, token-based RPC authentication, MLS identity binding | M4 + Phase 3 |
+| **Key / MLS Lifecycle** | KeyPackage rotation, epoch advancement, member removal, credential updates | M5 + Phase 2 |
+| **Transport / Delivery** | QUIC + TLS 1.3 hardening, ALPN enforcement, connection draining, reconnect | M1 (done) + Phase 2 |
+| **Private 1:1 Channels** | Channel creation, per-channel authz, TTL eviction, DM-specific flows | Phase 4 |
+| **Storage / Persistence** | SQLite (or SQLCipher) for AS, DS, client state; migrations; backup/restore | M6 + Phase 6 |
+| **Observability / Ops** | Structured logging, metrics, distributed tracing, healthcheck endpoints | Phase 6 |
+| **Client Resilience** | Offline queue, retry with backoff, idempotent message IDs, gap detection | Phase 4 |
+| **Compatibility / Protocols** | Wire versioning, N-1 client interoperability, ciphersuite negotiation | Phase 2 + Phase 5 |
+
+---
+
+## Security Plan (By Design)
+
+quicnprotochat follows a security-by-design philosophy. The standards below are
+non-negotiable -- see [Coding Standards](../contributing/coding-standards.md) for
+how they are enforced in code.
+
+### Governance
+
+- `CODEOWNERS` file mapping each crate to a responsible reviewer.
+- All PRs require at least one review from a crate owner.
+- Security-sensitive changes (crypto, auth, wire format) require two reviewers.
+- GPG-signed commits only.
+
+### Transport Policy
+
+- TLS 1.3 only (`rustls` configured with `TLS13` cipher suites exclusively).
+- ALPN token `b"capnp"` required; reject connections with mismatched ALPN.
+- Self-signed certificates acceptable for development; production deployments
+  must use a CA-signed certificate or certificate pinning.
+- Connection draining on shutdown (QUIC `CONNECTION_CLOSE`).
+
+### MLS Policy
+
+- Ciphersuite: `MLS_128_DHKEMX25519_AES128GCM_SHA256_Ed25519` (baseline).
+- Single-use KeyPackages (consumed on fetch, per RFC 9420).
+- KeyPackage TTL: 24 hours; clients must rotate before expiry.
+- Ciphersuite allowlist: server rejects KeyPackages with unknown ciphersuites.
+- No downgrade: once a group has used a ciphersuite, members cannot rejoin with
+  a weaker one.
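+
+The allowlist rule could be checked along these lines. `0x0001` is the
+RFC 9420 identifier for the baseline ciphersuite; the constant and function
+names are hypothetical, and a real server would read the ciphersuite from the
+parsed KeyPackage rather than take a bare `u16`.
+
+```rust
+/// RFC 9420 ciphersuite ID for MLS_128_DHKEMX25519_AES128GCM_SHA256_Ed25519.
+pub const MLS_128_DHKEMX25519_AES128GCM_SHA256_ED25519: u16 = 0x0001;
+
+/// Hypothetical allowlist; only the baseline suite is accepted here.
+pub const ALLOWED_CIPHERSUITES: &[u16] = &[MLS_128_DHKEMX25519_AES128GCM_SHA256_ED25519];
+
+/// Returns true if a KeyPackage advertising `ciphersuite` may be stored.
+pub fn is_allowed_ciphersuite(ciphersuite: u16) -> bool {
+    ALLOWED_CIPHERSUITES.contains(&ciphersuite)
+}
+```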
+
+### Input Validation
+
+- All incoming Cap'n Proto messages validated against schema before processing.
+- Maximum payload size: 5 MB per RPC call.
+- Group ID, identity key, and channel ID fields validated for correct length
+  (32 bytes, 32 bytes, 16 bytes respectively).
+- UTF-8 validation on all string fields.
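+
+A sketch of how the length, size, and UTF-8 rules above might be enforced
+before a handler runs. All names are hypothetical, and treating "5 MB" as
+5 * 1024 * 1024 bytes is an assumption:
+
+```rust
+/// Limits from the policy above (5 MB interpreted as MiB, an assumption).
+pub const MAX_PAYLOAD_BYTES: usize = 5 * 1024 * 1024;
+pub const GROUP_ID_LEN: usize = 32;
+pub const IDENTITY_KEY_LEN: usize = 32;
+pub const CHANNEL_ID_LEN: usize = 16;
+
+#[derive(Debug, PartialEq, Eq)]
+pub enum ValidationError {
+    BadLength { field: &'static str, expected: usize, actual: usize },
+    PayloadTooLarge { actual: usize },
+    InvalidUtf8 { field: &'static str },
+}
+
+fn check_len(field: &'static str, bytes: &[u8], expected: usize) -> Result<(), ValidationError> {
+    if bytes.len() == expected {
+        Ok(())
+    } else {
+        Err(ValidationError::BadLength { field, expected, actual: bytes.len() })
+    }
+}
+
+/// Validate the fixed-length identifiers and the payload size cap.
+pub fn validate_fields(
+    group_id: &[u8],
+    identity_key: &[u8],
+    channel_id: &[u8],
+    payload: &[u8],
+) -> Result<(), ValidationError> {
+    check_len("group_id", group_id, GROUP_ID_LEN)?;
+    check_len("identity_key", identity_key, IDENTITY_KEY_LEN)?;
+    check_len("channel_id", channel_id, CHANNEL_ID_LEN)?;
+    if payload.len() > MAX_PAYLOAD_BYTES {
+        return Err(ValidationError::PayloadTooLarge { actual: payload.len() });
+    }
+    Ok(())
+}
+
+/// Validate that a string field is well-formed UTF-8.
+pub fn validate_text<'a>(field: &'static str, bytes: &'a [u8]) -> Result<&'a str, ValidationError> {
+    std::str::from_utf8(bytes).map_err(|_| ValidationError::InvalidUtf8 { field })
+}
+```
+
+Returning a typed error per field keeps rejections auditable without echoing
+the offending bytes into logs.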
+
+### Secrets Management
+
+- All private key material wrapped in `Zeroizing<T>` (via the `zeroize` crate).
+- No secret material in log output at any level.
+- No `unwrap()` on cryptographic operations -- all errors are typed and propagated.
+- Constant-time comparison for authentication tokens and key fingerprints.
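+
+To illustrate the constant-time comparison rule, here is a standard-library
+sketch. It shows the idea only: the function name is hypothetical, the
+compiler gives no timing guarantees for this code, and production code would
+more likely rely on a vetted crate such as `subtle`.
+
+```rust
+/// Compare two byte strings without returning early on the first mismatch.
+/// The length check does leak whether the lengths differ, which is usually
+/// acceptable for fixed-size tokens and fingerprints.
+pub fn ct_eq(a: &[u8], b: &[u8]) -> bool {
+    if a.len() != b.len() {
+        return false;
+    }
+    // Accumulate the XOR of every byte pair; any difference sets a bit.
+    let mut diff = 0u8;
+    for (x, y) in a.iter().zip(b.iter()) {
+        diff |= x ^ y;
+    }
+    diff == 0
+}
+```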
+
+### Abuse / DoS Controls
+
+- Rate limiting: 50 requests/second per IP, per account, and per device.
+- Payload cap: 5 MB per message.
+- Connection limit: configurable max concurrent QUIC connections.
+- KeyPackage upload limit: configurable per account (prevents store exhaustion).
+- Long-poll timeout cap: server-enforced maximum for `fetchWait`.
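+
+One common way to implement a per-second request limit is a token bucket. The
+sketch below is an assumption about how the limit could be enforced, not the
+planned implementation; `TokenBucket` is a hypothetical name, and the caller
+passes the current time so the logic is easy to test.
+
+```rust
+use std::time::Instant;
+
+/// Token bucket holding up to `capacity` tokens, refilled continuously.
+pub struct TokenBucket {
+    capacity: f64,
+    tokens: f64,
+    refill_per_sec: f64,
+    last: Instant,
+}
+
+impl TokenBucket {
+    /// Create a full bucket that allows `rate_per_sec` requests per second.
+    pub fn new(rate_per_sec: f64, now: Instant) -> Self {
+        Self {
+            capacity: rate_per_sec,
+            tokens: rate_per_sec,
+            refill_per_sec: rate_per_sec,
+            last: now,
+        }
+    }
+
+    /// Take one token if available; returns false when the caller is over limit.
+    pub fn try_acquire(&mut self, now: Instant) -> bool {
+        let elapsed = now.saturating_duration_since(self.last).as_secs_f64();
+        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
+        // Never move the reference point backwards if `now` is stale.
+        self.last = self.last.max(now);
+        if self.tokens >= 1.0 {
+            self.tokens -= 1.0;
+            true
+        } else {
+            false
+        }
+    }
+}
+```
+
+To cover the per-IP, per-account, and per-device dimensions, a server could
+keep one bucket per key in a map and require all three to grant a token.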
+
+### Data Protection
+
+- MLS ciphertext is opaque to the server (DS never holds group keys).
+- Message retention: 7 days default, configurable.
+- KeyPackage retention: 24 hours (TTL eviction).
+- At-rest encryption for persistent storage (SQLCipher at M6).
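+
+The retention rules amount to dropping entries older than a TTL. A minimal
+sketch of such a sweep, assuming a hypothetical `Stored` record that carries
+its enqueue time:
+
+```rust
+use std::collections::VecDeque;
+use std::time::{Duration, Instant};
+
+/// Hypothetical stored item: opaque ciphertext plus the time it was enqueued.
+pub struct Stored {
+    pub enqueued_at: Instant,
+    pub payload: Vec<u8>,
+}
+
+/// Remove every entry whose age is at least `ttl`; returns how many were evicted.
+pub fn evict_expired(queue: &mut VecDeque<Stored>, ttl: Duration, now: Instant) -> usize {
+    let before = queue.len();
+    queue.retain(|item| now.saturating_duration_since(item.enqueued_at) < ttl);
+    before - queue.len()
+}
+```
+
+The same function can serve both a background sweep and a fetch-time check,
+with `ttl` set to 7 days for messages or 24 hours for KeyPackages.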
+
+### Logging Safety
+
+- Structured logging via `tracing` with `env-filter`.
+- Sensitive fields (keys, tokens, ciphertext) are never logged, even at `TRACE`.
+- Audit-level events: auth success/failure, token issuance, keypackage upload,
+  enqueue/fetch, rate limit hits.
+
+### Testing
+
+- Unit tests for all crypto operations (see [Testing Strategy](../contributing/testing.md)).
+- Integration tests for every RPC method.
+- Negative tests: malformed input, expired tokens, wrong identity, replay attempts.
+- N-1 compatibility tests (old client against new server).
+- Fuzzing targets for Cap'n Proto parsers and MLS message handling (Phase 5).
+
+---
+
+## Work Breakdown (6 Phases)
+
+### Phase 1 -- Baselines and Governance
+
+**Goal:** Establish project hygiene before adding features.
+
+| Task | Description |
+|------|-------------|
+| CODEOWNERS | Map crates to responsible reviewers |
+| CI pipeline | GitHub Actions: `cargo test --workspace`, `cargo clippy`, `cargo fmt --check`, `cargo deny check` |
+| SBOM generation | `cargo-cyclonedx` or `cargo-about` in CI; publish with each release |
+| Threat model | Document assets, adversaries, attack surface, trust boundaries; reference in [Threat Model](../cryptography/threat-model.md) |
+| Dependency audit | `cargo audit` in CI; pin all major versions per [Coding Standards](../contributing/coding-standards.md) |
+
+### Phase 2 -- Protocols and Core Hardening
+
+**Goal:** Lock down the wire format and cryptographic policy.
+
+| Task | Description |
+|------|-------------|
+| Wire versioning | Add `version` field to all Cap'n Proto structs; reject unknown versions |
+| Ciphersuite allowlist | Server rejects KeyPackages outside the allowed set |
+| Downgrade guards | Prevent epoch rollback; reject Commits with weaker ciphersuites |
+| ALPN enforcement | Reject connections without `b"capnp"` ALPN token |
+| Connection draining | Graceful QUIC `CONNECTION_CLOSE` on server shutdown |
+| KeyPackage rotation | Client-side timer to upload fresh KeyPackages before TTL expiry |
+
+### Phase 3 -- Auth, Device, and Server Hardening
+
+**Goal:** Add account/device identity and token-based authentication.
+
+See [Auth, Devices, and Tokens](authz-plan.md) for the full design.
+
+| Task | Description |
+|------|-------------|
+| Account + device model | `{account_id, device_id, device_pubkey}` with status lifecycle |
+| Token issuance | Access + refresh tokens; configurable expiry |
+| RPC auth middleware | Validate token on every RPC; map to account/device |
+| Identity binding | Bind MLS identity key to account; reject mismatched uploads |
+| Rate limiting | Per-IP, per-account, per-device counters |
+| Audit logging | Auth events, token lifecycle, rate limit hits |
+
+### Phase 4 -- Delivery Semantics and Client Resilience
+
+**Goal:** Reliable message delivery and 1:1 channels.
+
+See [1:1 Channel Design](dm-channels.md) for the DM-specific design.
+
+| Task | Description |
+|------|-------------|
+| Idempotent message IDs | Client-generated UUIDs; server deduplicates |
+| Ordering guarantees | Per-channel sequence numbers; client detects gaps |
+| Offline queue | Server retains messages for offline recipients (up to TTL) |
+| 1:1 channels | Channel creation, membership, per-channel authz |
+| TTL eviction | Background sweep + fetch-time check for expired messages |
+| Client retry | Exponential backoff with jitter on transient failures |
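+
+The client retry task can be sketched as a "full jitter" backoff: the delay
+ceiling doubles per attempt up to a cap, and the actual delay is a random
+fraction of that ceiling. The function name and parameters are hypothetical,
+and the random fraction is passed in as `jitter` (expected in `0.0..=1.0`)
+because the standard library has no random number generator.
+
+```rust
+use std::time::Duration;
+
+/// Delay before retry number `attempt` (0-based), using full jitter.
+pub fn backoff_delay(attempt: u32, base: Duration, cap: Duration, jitter: f64) -> Duration {
+    // Ceiling grows as base * 2^attempt, saturating instead of overflowing.
+    let ceiling = base.saturating_mul(2u32.saturating_pow(attempt)).min(cap);
+    // Clamp the caller-supplied fraction; treat NaN or infinity as "no jitter".
+    let fraction = if jitter.is_finite() { jitter.clamp(0.0, 1.0) } else { 1.0 };
+    Duration::from_nanos((ceiling.as_nanos() as f64 * fraction) as u64)
+}
+```
+
+With a 100 ms base and a 5 s cap, the ceiling runs 100 ms, 200 ms, 400 ms,
+800 ms, and so on, and stays at 5 s from the sixth retry onward.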
+
+### Phase 5 -- E2E Harness and Security Tests
+
+**Goal:** Automated end-to-end testing and security validation.
+
+| Task | Description |
+|------|-------------|
+| docker-compose testnet | Multi-node test environment with configurable topology |
+| Positive E2E tests | Full group lifecycle: register, create, invite, join, send, recv, leave |
+| Negative E2E tests | Expired tokens, wrong identity, replay, malformed messages |
+| Compat matrix | N-1 client/server version testing |
+| Fuzz targets | `cargo-fuzz` targets for Cap'n Proto parsers, MLS message handlers |
+| Golden-wire fixtures | Serialised test vectors for regression testing across versions |
+
+### Phase 6 -- Reliability, Performance, and Operations
+
+**Goal:** Production-grade operations and performance validation.
+
+| Task | Description |
+|------|-------------|
+| SQLite/SQLCipher persistence | AS key store, DS message log, client state (M6) |
+| Soak testing | 72-hour continuous operation under synthetic load |
+| Load testing | Throughput and latency benchmarks (Criterion + custom harness) |
+| Chaos testing | Network partitions, process crashes, disk full scenarios |
+| Backup / restore | SQLite backup with integrity verification |
+| Canary / rollback | Rolling deployment strategy with automatic rollback on failure |
+| Metrics + dashboards | Prometheus metrics, Grafana dashboards (see [Future Research](future-research.md)) |
+
+---
+
+## Planning Checklist
+
+Use this checklist when planning a new milestone or phase. Each item should have
+a documented decision before implementation begins.
+
+- [ ] **Release criteria / SLOs** -- Define what "done" means. Latency targets,
+      error rate thresholds, test coverage minimums.
+- [ ] **Threat model review** -- Update the [Threat Model](../cryptography/threat-model.md)
+      for any new attack surface introduced by this phase.
+- [ ] **Protocol policy** -- Ciphersuite allowlist, wire version, downgrade rules.
+- [ ] **Identity / auth model** -- Who authenticates, how, and what operations
+      are gated.
+- [ ] **Data model** -- Schema changes, migrations, backward compatibility.
+- [ ] **Abuse controls** -- Rate limits, size caps, connection limits for this phase.
+- [ ] **Observability contracts** -- What new metrics, logs, and traces are needed.
+- [ ] **Environments / secrets** -- Dev, staging, production configuration;
+      secret rotation plan.
+- [ ] **Testing matrix** -- Unit, integration, E2E, negative, fuzz, compat tests
+      for this phase.
+- [ ] **Rollout / ops** -- Deployment strategy, rollback plan, monitoring during
+      rollout.
+
+---
+
+## Cross-references
+
+- [Milestones](milestones.md) -- feature milestone tracker
+- [Auth, Devices, and Tokens](authz-plan.md) -- Phase 3 design
+- [1:1 Channel Design](dm-channels.md) -- Phase 4 design
+- [Future Research](future-research.md) -- technology options for Phase 6+
+- [Coding Standards](../contributing/coding-standards.md) -- engineering standards
+- [Testing Strategy](../contributing/testing.md) -- test structure and conventions
+- [Threat Model](../cryptography/threat-model.md) -- security analysis