feat: add post-quantum hybrid KEM + SQLCipher persistence

Feature 1 - Post-Quantum Hybrid KEM (X25519 + ML-KEM-768):
- Create hybrid_kem.rs with keygen, encrypt, decrypt + 11 unit tests
- Wire format: version(1) | x25519_eph_pk(32) | mlkem_ct(1088) | nonce(12) | ct
- Add uploadHybridKey/fetchHybridKey RPCs to node.capnp schema
- Server: hybrid key storage in FileBackedStore + RPC handlers
- Client: hybrid keypair in StoredState, auto-wrap/unwrap in send/recv/invite/join
- demo-group runs full hybrid PQ envelope round-trip

Feature 2 - SQLCipher Persistence:
- Extract Store trait from FileBackedStore API
- Create SqlStore (rusqlite + bundled-sqlcipher) with encrypted-at-rest SQLite
- Schema: key_packages, deliveries, hybrid_keys tables with indexes
- Server CLI: --store-backend=sql, --db-path, --db-key flags
- 5 unit tests for SqlStore (FIFO, round-trip, upsert, channel isolation)

Also includes: client lib.rs refactor, auth config, TOML config file support, mdBook documentation, and various cleanups by user.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 08:07:48 +01:00
parent d1ddef4cea
commit f334ed3d43
81 changed files with 14502 additions and 2289 deletions
--- a/docs/src/design-rationale/adr-001-noise-xx.md
+++ b/docs/src/design-rationale/adr-001-noise-xx.md
@@ -0,0 +1,118 @@
+# ADR-001: Noise\_XX for Transport Authentication
+
+**Status:** Accepted
+
+---
+
+## Context
+
+quicnprotochat needs mutual authentication at the transport layer: both client and server must prove their identity before any application data is exchanged. The standard solution is TLS with X.509 certificates, but this brings significant operational complexity:
+
+- A Certificate Authority (CA) must be operated, or certificates must be purchased from one.
+- Certificates must be provisioned, rotated, and revoked.
+- Client certificate authentication in TLS is cumbersome and poorly supported by many libraries.
+- The X.509 PKI is a large attack surface with a long history of CA compromises.
+
+An alternative is needed that provides mutual authentication with simpler key management, ideally using raw public keys rather than certificates.
+
+### Alternatives considered
+
+1. **TLS 1.3 with X.509 certificates.** Standard, widely deployed, but requires CA infrastructure. Client certificate authentication is possible but adds complexity. Later adopted for the QUIC transport (M3+), where server authentication is sufficient and client auth is handled at the application layer via the `Auth` struct.
+
+2. **TLS 1.3 with Raw Public Keys (RFC 7250).** Eliminates the CA dependency but has limited library support. The `rustls` crate did not support RPK at the time of the M1 design.
+
+3. **Noise Protocol Framework.** Purpose-built for authenticated key exchange using raw static keys. Multiple handshake patterns available. Mature specification with formal security analysis. Well-supported by the `snow` crate in Rust.
+
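+4. **WireGuard-style handshake.** Based on Noise\_IK. Assumes the initiator already knows the responder's static key. Offers weaker identity hiding for the initiator than XX: the initiator's static key is encrypted to the responder's static key without forward secrecy, so a later compromise of the responder's key reveals who connected.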
+
+---
+
+## Decision
+
+Use the **Noise\_XX handshake pattern** for the M1 transport layer. Both parties hold static X25519 keypairs that are registered out-of-band (e.g., via a future directory service, QR code, or manual configuration).
+
+### Why Noise\_XX specifically?
+
+The Noise Protocol Framework defines several handshake patterns, differing in which static keys are transmitted during the handshake:
+
+| Pattern | Initiator static key | Responder static key | Authentication and identity hiding |
+|---|---|---|---|
+| **NN** | Not transmitted | Not transmitted | No authentication |
+| **NK** | Not transmitted | Known to initiator | Server-only auth |
+| **KK** | Known to responder | Known to initiator | Mutual auth, no identity hiding |
+| **XX** | Transmitted (encrypted) | Transmitted (encrypted) | **Mutual auth + identity hiding for initiator** |
+| **IK** | Transmitted (encrypted) | Known to initiator | Mutual auth, initiator identity hidden from passive observers |
+
+**XX** was chosen because:
+
+1. **Mutual authentication.** Both parties prove possession of their static private keys during the handshake. The server verifies the client's identity, and the client verifies the server's identity.
+
+2. **Identity hiding for the initiator.** The initiator's static public key is transmitted encrypted under an ephemeral key, so a passive network observer cannot determine who is connecting. The responder's static key is also transmitted encrypted, though an active attacker performing a man-in-the-middle on the first message could learn it (this is inherent to any pattern where the responder's key is not pre-known).
+
+3. **No prior knowledge of static keys required.** Unlike IK or KK, the XX pattern does not require either party to know the other's static key before the handshake begins. This simplifies bootstrapping: a new client can open a connection without a prior key exchange, and each side checks the static key it receives against the keys registered out-of-band.
+
+4. **Three-message handshake.** XX completes in 3 messages (`-> e`, `<- e, ee, s, es`, `-> s, se`), which is one message (half a round-trip) more than IK's two-message handshake but provides stronger identity hiding guarantees.
+
+### Cryptographic parameters
+
+| Parameter | Value |
+|---|---|
+| Handshake pattern | `Noise_XX_25519_ChaChaPoly_SHA256` |
+| DH function | X25519 (Curve25519) |
+| AEAD cipher | ChaCha20-Poly1305 |
+| Hash function | SHA-256 |
+| Static key size | 32 bytes (X25519 public key) |
+| Ephemeral key size | 32 bytes (X25519 public key) |
+
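To make the pattern name and the three XX messages concrete, here is a minimal std-only sketch. It models the handshake tokens as plain data and does not perform any cryptography; the names `Token`, `Sender`, `xx_messages`, `protocol_name`, and `static_key_message` are hypothetical and are not part of `snow` or of this repository.

```rust
/// Handshake tokens used by the Noise XX pattern (illustrative model only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Token {
    E,  // send an ephemeral public key
    S,  // send the static public key (encrypted once a key is available)
    Ee, // DH(initiator ephemeral, responder ephemeral)
    Es, // DH(initiator ephemeral, responder static)
    Se, // DH(initiator static, responder ephemeral)
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Sender {
    Initiator,
    Responder,
}

/// The three XX handshake messages, in order.
pub fn xx_messages() -> Vec<(Sender, Vec<Token>)> {
    vec![
        (Sender::Initiator, vec![Token::E]),
        (Sender::Responder, vec![Token::E, Token::Ee, Token::S, Token::Es]),
        (Sender::Initiator, vec![Token::S, Token::Se]),
    ]
}

/// Builds a Noise protocol name from its four components.
pub fn protocol_name(pattern: &str, dh: &str, cipher: &str, hash: &str) -> String {
    format!("Noise_{}_{}_{}_{}", pattern, dh, cipher, hash)
}

/// Index of the handshake message in which `sender` transmits its static key.
pub fn static_key_message(sender: Sender) -> Option<usize> {
    xx_messages()
        .iter()
        .position(|(who, tokens)| *who == sender && tokens.contains(&Token::S))
}
```

In this model the responder's static key travels in the second message and the initiator's in the third, which is why the initiator's identity is only revealed after the responder has authenticated.
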
+### Implementation
+
+The Noise handshake is implemented using the `snow` crate (`snow 0.9`). Key source files:
+
+- `crates/quicnprotochat-core/src/noise.rs` -- `NoiseTransport` struct, handshake state machine, encrypted read/write methods.
+- `crates/quicnprotochat-core/src/codec.rs` -- `LengthPrefixedCodec` that frames Noise handshake and transport messages over TCP.
+- `crates/quicnprotochat-core/src/error.rs` -- `CoreError::Noise` variant for handshake and transport errors.
+
+---
+
+## Consequences
+
+### Benefits
+
+- **No CA infrastructure.** Key management is reduced to generating, storing, and distributing raw 32-byte X25519 public keys. No certificates, no expiration, no revocation lists.
+- **Simpler key management.** Each node has a single static X25519 keypair. The public key is its transport-layer identity.
+- **Identity hiding.** Passive network observers cannot determine which client is connecting to the server.
+- **Well-analyzed security.** The Noise Protocol Framework has formal security proofs (Kobeissi et al., 2019). The XX pattern specifically has been analyzed for identity hiding and key compromise impersonation resistance.
+- **Lightweight.** The `snow` crate is small, auditable, and in its default configuration has no transitive dependency on OpenSSL or ring (its default resolver uses pure-Rust cryptography).
+
+### Costs and trade-offs
+
+- **Three-message handshake.** XX requires 3 messages (1.5 round-trips) compared to TLS 1.3's 1-RTT handshake (or 0-RTT with resumption). This adds latency to connection establishment. In practice, this is only significant for short-lived connections.
+- **No PQ protection.** The Noise handshake uses classical X25519. A quantum-capable adversary mounting a harvest-now-decrypt-later attack could record the handshake today and later decrypt it, learning the static public keys of both parties. This is accepted as a known risk (see [ADR-006: PQ Gap](adr-006-pq-gap.md)).
+- **Out-of-band key distribution.** Without a CA or directory service, clients must obtain the server's static public key through some out-of-band mechanism. This is currently handled by hardcoding or configuration.
+- **Superseded for client-server transport.** With the move to QUIC + TLS 1.3 in M3, the Noise transport is no longer the primary client-server path. It remains available for direct peer-to-peer connections and as a fallback in environments where QUIC/UDP is blocked.
+
+### Residual risks
+
+- **Harvest-now-decrypt-later for metadata.** An adversary who records the Noise handshake today and obtains a quantum computer in the future could decrypt the handshake transcript, revealing the static public keys of both parties (identity metadata). However, no long-lived content secrets transit the handshake -- MLS provides its own key agreement. See [ADR-006](adr-006-pq-gap.md) for the full analysis.
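+- **Static key compromise.** If a party's static private key is compromised, an attacker can impersonate that party to everyone else until the key is replaced. Because XX authenticates through ephemeral-static DH (`es`, `se`) rather than static-static DH, the Noise specification rates its sender authentication as resistant to key compromise impersonation (KCI): the stolen key does not let the attacker impersonate *other* parties to the victim. Mitigated by key rotation and secure key storage.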
+
+---
+
+## Code references
+
+| File | Relevance |
+|---|---|
+| `crates/quicnprotochat-core/src/noise.rs` | `NoiseTransport` implementation: handshake, encrypted read/write |
+| `crates/quicnprotochat-core/src/codec.rs` | `LengthPrefixedCodec`: frames Noise messages over TCP |
+| `crates/quicnprotochat-core/src/error.rs` | `CoreError::Noise`, `CodecError` error types |
+
+---
+
+## Further reading
+
+- [Design Decisions Overview](overview.md) -- index of all ADRs
+- [ADR-003: RPC Inside the Noise Tunnel](adr-003-rpc-inside-noise.md) -- how Cap'n Proto RPC runs over the Noise channel
+- [ADR-006: PQ Gap in Noise Transport](adr-006-pq-gap.md) -- analysis of the post-quantum gap
+- [Framing Codec](../wire-format/framing-codec.md) -- the codec that frames Noise messages
+- [Protocol Layers Overview](../protocol-layers/overview.md) -- how Noise fits in the protocol stack
+- [Noise Protocol Framework specification](https://noiseprotocol.org/noise.html) -- upstream specification
--- a/docs/src/design-rationale/adr-002-capnproto.md
+++ b/docs/src/design-rationale/adr-002-capnproto.md
@@ -0,0 +1,140 @@
+# ADR-002: Cap'n Proto over MessagePack
+
+**Status:** Accepted
+
+---
+
+## Context
+
+quicnprotochat needs an efficient, typed wire format for client-server communication. The format must support:
+
+1. **Typed messages** with compile-time schema enforcement to eliminate hand-rolled serialisation bugs.
+2. **Schema evolution** so that new fields and methods can be added without breaking existing clients.
+3. **RPC support** for clean method dispatch, eliminating the need for manual message-type routing.
+4. **Efficient encoding** to minimize overhead on constrained networks and high-throughput server paths.
+5. **Canonical serialisation** so that identical logical messages produce identical byte sequences, enabling reliable signing.
+
+The original M0 prototype used MessagePack (via the `rmp-serde` crate) with hand-rolled dispatch based on integer message-type tags. This approach had several problems:
+
+- **No schema enforcement.** The wire format was defined implicitly by Rust `#[derive(Serialize, Deserialize)]` annotations. There was no single source of truth for the wire format, and changes to Rust struct layout silently changed the wire format.
+- **No RPC.** Message dispatch was a manual `match` on a `MsgType` enum. Adding a new message type required modifying the dispatch table in both client and server, with no compile-time guarantee that all cases were handled.
+- **No canonical form.** MessagePack's map encoding does not guarantee key ordering, so the same logical message could produce different byte sequences depending on the Rust `HashMap` iteration order. This made signing over serialised data unreliable.
+- **Deserialization overhead.** MessagePack requires a full decode pass that allocates and copies data. For a messaging system processing many small messages, this overhead is unnecessary.
+
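The canonical-form problem described above can be illustrated with a toy encoder. This is a sketch under stated assumptions: the byte layout is invented for the example and is neither MessagePack nor Cap'n Proto, and `encode_in_order` and `encode_canonical` are hypothetical names.

```rust
use std::collections::BTreeMap;

/// Encodes key/value pairs in the order given:
/// one length byte, the key bytes, then the value as little-endian u32.
/// Keys are assumed to be shorter than 256 bytes in this toy format.
pub fn encode_in_order(fields: &[(&str, u32)]) -> Vec<u8> {
    let mut out = Vec::new();
    for (key, value) in fields {
        out.push(key.len() as u8);
        out.extend_from_slice(key.as_bytes());
        out.extend_from_slice(&value.to_le_bytes());
    }
    out
}

/// Encodes the same pairs after sorting by key, so the caller's
/// iteration order can no longer influence the output bytes.
pub fn encode_canonical(fields: &[(&str, u32)]) -> Vec<u8> {
    let sorted: BTreeMap<&str, u32> = fields.iter().copied().collect();
    let ordered: Vec<(&str, u32)> = sorted.into_iter().collect();
    encode_in_order(&ordered)
}
```

Two logically identical messages built in different field orders give different bytes under `encode_in_order`, so a signature over one would not verify against the other; `encode_canonical` removes that dependence.
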
+### Alternatives considered
+
+1. **MessagePack (status quo).** Keep the existing format. Rejected because of the schema, dispatch, and canonicity problems described above.
+
+2. **Protocol Buffers (Protobuf).** Schema-defined, binary, widely used. However:
+   - Protobuf does not guarantee canonical serialisation (default value elision, field ordering, and unknown field handling can vary between implementations).
+   - Protobuf RPC requires a separate framework (gRPC), which brings in HTTP/2 and a heavy runtime.
+   - Protobuf deserialization requires an allocation and copy pass (not zero-copy).
+
+3. **FlatBuffers.** Zero-copy, schema-defined. However:
+   - No built-in RPC framework.
+   - The Rust crate ecosystem was less mature than Cap'n Proto at the time of evaluation.
+   - No canonical serialisation guarantee.
+
+4. **Cap'n Proto.** Zero-copy, schema-defined, canonical serialisation, built-in async RPC. The `capnp` and `capnp-rpc` Rust crates are mature and actively maintained.
+
+---
+
+## Decision
+
+Replace MessagePack with **Cap'n Proto** for all wire-format serialisation and RPC dispatch. Define all message types and service interfaces in `.capnp` schema files, and use the `capnpc` compiler for Rust code generation.
+
+### Key properties of Cap'n Proto
+
+**Zero-copy deserialization:**
+
+Cap'n Proto's wire format is designed so that the byte layout on the wire is identical to the byte layout in memory. A receiver can traverse the message in-place using pointer arithmetic, without allocating or copying data. For a messaging server that processes many small messages per second, this eliminates a significant class of allocation overhead.
+
+```text
+Traditional serialisation:  wire bytes -> decode -> allocate -> application struct
+Cap'n Proto:                wire bytes == application struct (traverse in place)
+```
+
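As a rough illustration of in-place traversal, the sketch below reads fields straight out of a received buffer through a borrowed view. The layout (8-byte channel id, 4-byte payload length, payload) is a hypothetical one chosen for the example; it is not Cap'n Proto's actual encoding, and `FrameView` is not a type from the `capnp` crate.

```rust
/// Borrowed view over a received buffer; fields are read in place.
/// Hypothetical layout: bytes 0..8 = channel id (little-endian u64),
/// bytes 8..12 = payload length (little-endian u32), then the payload.
pub struct FrameView<'a> {
    bytes: &'a [u8],
}

impl<'a> FrameView<'a> {
    /// Validates the lengths once, then hands out views into `bytes`.
    pub fn new(bytes: &'a [u8]) -> Option<Self> {
        if bytes.len() < 12 {
            return None;
        }
        let view = FrameView { bytes };
        if bytes.len() - 12 < view.payload_len() {
            return None;
        }
        Some(view)
    }

    fn payload_len(&self) -> usize {
        let b = self.bytes;
        u32::from_le_bytes([b[8], b[9], b[10], b[11]]) as usize
    }

    /// Reads the routing field directly from the wire bytes.
    pub fn channel_id(&self) -> u64 {
        let mut raw = [0u8; 8];
        raw.copy_from_slice(&self.bytes[0..8]);
        u64::from_le_bytes(raw)
    }

    /// Returns the payload as a slice of the original buffer (no copy).
    pub fn payload(&self) -> &'a [u8] {
        &self.bytes[12..12 + self.payload_len()]
    }
}
```

The payload slice points into the original buffer, which is the property a routing server benefits from when it forwards messages it never inspects.
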
+**Schema enforcement:**
+
+All messages and RPC interfaces are defined in `.capnp` schema files checked into the repository. The `capnpc` compiler generates Rust code with type-safe builders and readers. A mismatched field type or a reference to a field that does not exist in the schema is caught at compile time, not at runtime.
+
+**Canonical serialisation:**
+
+Cap'n Proto defines a canonical form for messages: a single segment in which objects are laid out in a deterministic order with deterministic padding. Canonical form is opt-in (ordinary builder output is not guaranteed to be canonical), so a message must be canonicalised before it is signed or verified. Two implementations that canonicalise the same logical message produce identical byte sequences. This property is essential for signing: the MLS layer signs over serialised Cap'n Proto data, and non-deterministic serialisation would make signature verification unreliable.
+
+**Built-in async RPC:**
+
+The `capnp-rpc` crate provides a full RPC framework built on top of Cap'n Proto serialisation. Features include:
+
+- **Method dispatch:** Each interface method has a unique ordinal, and the RPC runtime dispatches incoming calls to the correct handler automatically.
+- **Promise pipelining:** A client can call a method on the result of a previous call before the first call has returned. The RPC runtime resolves the pipeline when the result is available.
+- **Cancellation:** An in-flight RPC call can be cancelled by the client, and the server is notified.
+- **Level 1 RPC:** The `capnp-rpc` crate implements Cap'n Proto's Level 1 RPC protocol, which supports most features needed for client-server communication.
+
+**Schema evolution:**
+
+Cap'n Proto supports forward-compatible schema evolution:
+
+- New fields can be added to structs (with the next available field number). Old readers ignore unknown fields.
+- New methods can be added to interfaces (with the next available ordinal). Old clients cannot call them; old servers reject unknown method calls.
+- Fields and methods can never be removed or renumbered, but they can be deprecated.
+- The `version` field in the `Auth` struct provides application-level versioning on top of structural evolution.
+
+### Schema files
+
+The Cap'n Proto schemas are stored in the `schemas/` directory:
+
+| File | Content | Documentation |
+|---|---|---|
+| `schemas/envelope.capnp` | Legacy `Envelope` struct and `MsgType` enum | [Envelope Schema](../wire-format/envelope-schema.md) |
+| `schemas/auth.capnp` | `AuthenticationService` interface | [Auth Schema](../wire-format/auth-schema.md) |
+| `schemas/delivery.capnp` | `DeliveryService` interface | [Delivery Schema](../wire-format/delivery-schema.md) |
+| `schemas/node.capnp` | `NodeService` interface and `Auth` struct | [NodeService Schema](../wire-format/node-service-schema.md) |
+
+---
+
+## Consequences
+
+### Benefits
+
+- **Eliminated hand-rolled dispatch.** The manual `MsgType` match table is replaced by Cap'n Proto RPC's automatic method dispatch. Adding a new operation means adding a method to the `.capnp` schema and implementing the handler -- no dispatch table to update.
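+- **Compile-time type safety.** Schema violations are caught at compile time by the generated Rust code. A field type mismatch or a call to a method or field that the schema does not define is a compile error, not a runtime panic. Cap'n Proto has no required fields, so an unset field silently takes its default value.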
+- **Zero-copy performance.** The server avoids deserialization overhead for messages it routes but does not inspect (which is most messages, since the DS is MLS-unaware). The server can read the routing fields (recipient key, channel ID) directly from the wire bytes.
+- **Canonical form for signing.** MLS operations that sign over serialised data can rely on Cap'n Proto's canonical form producing deterministic byte sequences.
+- **Schema as documentation.** The `.capnp` files serve as the authoritative specification of the wire format, readable by both humans and tools.
+
+### Costs and trade-offs
+
+- **Build-time code generation.** The `capnpc` compiler must run during the build (via `build.rs` in `quicnprotochat-proto`). This adds a build dependency and increases compile times slightly.
+- **Learning curve.** Cap'n Proto's builder/reader API is different from typical `serde`-based Rust serialisation. Developers must learn the Cap'n Proto programming model (builders for construction, readers for traversal, owned messages for storage).
+- **Generated code verbosity.** The generated Rust code is verbose and not intended to be read directly. Application code interacts with it through the builder/reader traits.
+- **Smaller ecosystem than Protobuf.** Cap'n Proto has fewer users, fewer tutorials, and fewer third-party tools than Protobuf. However, the core Rust crates are well-maintained.
+- **No dynamic reflection in this workflow.** Unlike typical Protobuf usage (which supports `Any` and `DynamicMessage`), the generated-code workflow used here does not provide runtime reflection over unknown schemas. This has not been a limitation in practice.
+
+### Residual risks
+
+- **Crate maintenance.** The `capnp` and `capnp-rpc` crates are maintained primarily by David Renshaw. If maintenance lapses, the project would need to fork or switch serialisation formats. Mitigated by the crates' maturity and the relatively stable Cap'n Proto specification.
+- **RPC limitations.** The Rust `capnp-rpc` crate implements Level 1 of the Cap'n Proto RPC protocol. Level 3 features (three-party handoffs) are not supported. This has not been a limitation for quicnprotochat's client-server architecture.
+
+---
+
+## Code references
+
+| File | Relevance |
+|---|---|
+| `schemas/envelope.capnp` | Legacy Envelope struct definition |
+| `schemas/auth.capnp` | AuthenticationService RPC interface |
+| `schemas/delivery.capnp` | DeliveryService RPC interface |
+| `schemas/node.capnp` | NodeService unified RPC interface |
+| `crates/quicnprotochat-proto/build.rs` | Build script that invokes `capnpc` for code generation |
+| `crates/quicnprotochat-proto/src/lib.rs` | Re-exports generated Cap'n Proto modules |
+
+---
+
+## Further reading
+
+- [Design Decisions Overview](overview.md) -- index of all ADRs
+- [Wire Format Overview](../wire-format/overview.md) -- how Cap'n Proto fits in the serialisation pipeline
+- [ADR-003: RPC Inside the Noise Tunnel](adr-003-rpc-inside-noise.md) -- how Cap'n Proto RPC runs over the encrypted transport
+- [Why This Design, Not Signal/Matrix/...](why-not-signal.md) -- serialisation comparison against Protobuf and JSON
+- [Cap'n Proto encoding specification](https://capnproto.org/encoding.html) -- upstream specification
--- a/docs/src/design-rationale/adr-003-rpc-inside-noise.md
+++ b/docs/src/design-rationale/adr-003-rpc-inside-noise.md
@@ -0,0 +1,147 @@
+# ADR-003: RPC Inside the Noise Tunnel
+
+**Status:** Accepted
+
+---
+
+## Context
+
+Cap'n Proto RPC provides typed method dispatch, promise pipelining, and automatic serialisation -- but it has **no built-in transport security**. The RPC protocol assumes it operates over a trusted byte stream. If that byte stream is a raw TCP connection, all RPC traffic (interface IDs, method ordinals, parameters, return values) is transmitted in cleartext.
+
+quicnprotochat requires that all client-server communication be encrypted and authenticated. The question is: how should encryption and RPC be composed?
+
+### Alternatives considered
+
+1. **RPC over raw TCP, with application-level encryption.** Each RPC payload would be individually encrypted by the application before passing it to Cap'n Proto. This is complex, error-prone, and does not protect RPC metadata (method ordinals, message structure).
+
+2. **RPC over TLS.** Use TLS 1.3 as the transport for the Cap'n Proto RPC byte stream. This is the conventional approach for web services (gRPC uses TLS). However, in the M1 design, TLS with mutual authentication required CA infrastructure that we wanted to avoid (see [ADR-001](adr-001-noise-xx.md)).
+
+3. **RPC over Noise.** Use the Noise\_XX handshake to establish an encrypted, authenticated session, then feed the Cap'n Proto RPC byte stream through the Noise transport layer. The RPC layer is completely unaware of the encryption beneath it.
+
+4. **RPC over QUIC.** Use QUIC + TLS 1.3 as the transport. Cap'n Proto RPC operates over a QUIC bidirectional stream. This is the approach adopted in M3+.
+
+---
+
+## Decision
+
+Cap'n Proto RPC operates over the encrypted byte stream provided by the transport layer. The transport layer -- whether Noise\_XX (M1) or QUIC + TLS 1.3 (M3+) -- owns all security properties (confidentiality, integrity, authentication). Cap'n Proto owns all framing and dispatch properties (serialisation, method routing, schema enforcement).
+
+This is a **separation of concerns** at the protocol layer boundary:
+
+```text
+┌─────────────────────────────────┐
+│  Cap'n Proto RPC                │  Dispatch, serialisation, typing
+│  (capnp-rpc crate)             │
+├─────────────────────────────────┤
+│  Encrypted byte stream          │  Confidentiality, integrity, auth
+│  (Noise_XX or QUIC/TLS 1.3)    │
+├─────────────────────────────────┤
+│  TCP or UDP                     │  Reliable (TCP) or datagram (UDP)
+└─────────────────────────────────┘
+```
+
+### Noise transport path (M1)
+
+In the M1 stack, the composition works as follows:
+
+1. Client and server perform a Noise\_XX handshake over a TCP connection, establishing a shared session key.
+2. The resulting `NoiseTransport` wraps the TCP stream, providing `AsyncRead + AsyncWrite` that transparently encrypts/decrypts all data.
+3. Cap'n Proto RPC is instantiated over this `NoiseTransport`. The RPC runtime reads and writes to the `NoiseTransport` as if it were a plain byte stream.
+4. Each Noise message carrying RPC data is framed by the [LengthPrefixedCodec](../wire-format/framing-codec.md) on the TCP stream: the sender frames after encryption and the receiver de-frames before decryption.
+
+```text
+Client                                              Server
+  |                                                    |
+  |  --- Noise_XX handshake (3 messages) ----------->  |
+  |  <-- Noise_XX handshake -------------------------  |
+  |                                                    |
+  |  [Noise-encrypted Cap'n Proto RPC traffic]         |
+  |  --- uploadKeyPackage(identityKey, pkg, auth) -->  |
+  |  <-- (fingerprint) -------------------------------- |
+  |  --- enqueue(recipientKey, payload, ch, v, a) -->  |
+  |  <-- () ------------------------------------------ |
+  |  ...                                               |
+```
+
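Length-prefixed framing of the kind used in the Noise path can be sketched as follows. The 4-byte big-endian prefix is an assumption made for this example; the actual `LengthPrefixedCodec` may use a different prefix width or byte order, and `encode_frame` and `decode_frame` are hypothetical names.

```rust
/// Prepends a 4-byte big-endian length to `payload`.
/// Prefix width and byte order are assumptions made for this sketch.
pub fn encode_frame(payload: &[u8]) -> Vec<u8> {
    assert!(payload.len() <= u32::MAX as usize, "payload too large for a u32 prefix");
    let len = payload.len() as u32;
    let mut frame = Vec::with_capacity(4 + payload.len());
    frame.extend_from_slice(&len.to_be_bytes());
    frame.extend_from_slice(payload);
    frame
}

/// Splits one complete frame off the front of `buf`.
/// Returns `(payload, rest)`, or `None` if more bytes are needed.
pub fn decode_frame(buf: &[u8]) -> Option<(&[u8], &[u8])> {
    if buf.len() < 4 {
        return None;
    }
    let len = u32::from_be_bytes([buf[0], buf[1], buf[2], buf[3]]) as usize;
    let body = &buf[4..];
    if body.len() < len {
        return None;
    }
    Some(body.split_at(len))
}
```

Returning `None` on a short buffer lets a caller keep reading from the socket and retry, which is the usual way to recover message boundaries from a TCP byte stream.
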
+### QUIC transport path (M3+)
+
+In the M3+ stack, the composition is:
+
+1. Client connects to the server via QUIC, which performs a TLS 1.3 handshake internally.
+2. The client opens a bidirectional QUIC stream.
+3. Cap'n Proto RPC is instantiated over the QUIC stream. The `quinn` crate provides `AsyncRead + AsyncWrite` for each stream.
+4. The `LengthPrefixedCodec` is **not used** in this path -- a QUIC stream is already a reliable, ordered byte stream, and `capnp-rpc` handles message delimitation internally.
+
+```text
+Client                                              Server
+  |                                                    |
+  |  --- QUIC handshake (TLS 1.3) ----------------->  |
+  |  <-- QUIC handshake ----------------------------  |
+  |                                                    |
+  |  [QUIC-encrypted Cap'n Proto RPC traffic]          |
+  |  --- uploadKeyPackage(identityKey, pkg, auth) -->  |
+  |  <-- (fingerprint) -------------------------------- |
+  |  --- fetchWait(recipientKey, ch, v, t, a) ------>  |
+  |  <-- (payloads) ---------------------------------- |
+  |  ...                                               |
+```
+
+### Transport agnosticism
+
+The key architectural property is that **Cap'n Proto RPC is transport-agnostic**. The same RPC interface (`NodeService`) works identically over both transport paths. The server implementation does not know or care which transport the client used -- it receives the same typed method calls either way.
+
+This is achieved by abstracting the transport behind Rust's `AsyncRead + AsyncWrite` traits. The `capnp-rpc` crate accepts any type that implements these traits as its underlying stream.
+
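The layering idea can be shown with a small synchronous analogue. This sketch substitutes `std::io::Read + Write` for the async traits, and every name in it (`echo_call`, `Loopback`, `XorLayer`) is hypothetical. The XOR wrapper is not encryption; it only marks the place where a Noise or TLS layer would sit.

```rust
use std::collections::VecDeque;
use std::io::{self, Read, Write};

/// Upper layer: writes a request and reads back a reply of the same length.
/// It only sees `Read + Write`, never the transport behind it.
pub fn echo_call<T: Read + Write>(transport: &mut T, request: &[u8]) -> io::Result<Vec<u8>> {
    transport.write_all(request)?;
    transport.flush()?;
    let mut reply = vec![0u8; request.len()];
    transport.read_exact(&mut reply)?;
    Ok(reply)
}

/// In-memory loopback standing in for a plain byte stream.
#[derive(Default)]
pub struct Loopback {
    pub wire: VecDeque<u8>,
}

impl Write for Loopback {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.wire.extend(buf.iter().copied());
        Ok(buf.len())
    }
    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

impl Read for Loopback {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = buf.len().min(self.wire.len());
        for slot in buf.iter_mut().take(n) {
            *slot = self.wire.pop_front().expect("length checked above");
        }
        Ok(n)
    }
}

/// Toy wrapper that XORs every byte with a fixed key in both directions.
/// XOR is NOT encryption; a real transport would use an AEAD here.
pub struct XorLayer<T> {
    pub inner: T,
    pub key: u8,
}

impl<T: Write> Write for XorLayer<T> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        let masked: Vec<u8> = buf.iter().map(|b| b ^ self.key).collect();
        self.inner.write_all(&masked)?;
        Ok(buf.len())
    }
    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}

impl<T: Read> Read for XorLayer<T> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        for b in &mut buf[..n] {
            *b ^= self.key;
        }
        Ok(n)
    }
}
```

`echo_call` runs unchanged over `Loopback` and over `XorLayer<Loopback>`, which mirrors how the same RPC code runs over either transport path.
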
+---
+
+## Consequences
+
+### Benefits
+
+- **Clean layering.** Each layer has a single, well-defined responsibility. The transport layer does not need to understand Cap'n Proto. Cap'n Proto does not need to understand encryption. This makes each layer independently testable and replaceable.
+
+- **Transport flexibility.** Switching from Noise to QUIC (or adding a future transport) required no changes to the RPC interface or the application logic. Only the transport initialization code changed.
+
+- **Full metadata protection.** Because encryption wraps the entire RPC byte stream, not just individual payloads, RPC-level metadata is protected: interface IDs, method ordinals, parameter values, and return values. Message sizes and timing remain observable to a network adversary (see Residual risks).
+
+- **No double encryption.** The application layer does not need to encrypt RPC payloads separately. The transport layer provides confidentiality for the entire stream.
+
+- **Composable security.** The Noise/QUIC layer provides transport security (server authentication, channel confidentiality). MLS provides end-to-end security (group key agreement, forward secrecy, PCS). The RPC layer is the bridge between them, carrying MLS ciphertext as opaque blobs. No single layer needs to provide all security properties.
+
+### Costs and trade-offs
+
+- **No end-to-end RPC security.** The RPC layer trusts the transport for confidentiality. If the transport is compromised (e.g., a TLS vulnerability), all RPC traffic is exposed. This is mitigated by MLS providing a second layer of encryption for message content.
+
+- **Transport must be established first.** The Noise handshake or QUIC connection must complete before any RPC call can be made. This adds latency to the first interaction. In the QUIC path, this can be mitigated by 0-RTT resumption where it is enabled, at the cost of 0-RTT data being replayable.
+
+- **Debugging complexity.** Because all traffic is encrypted, debugging wire-level issues requires either decrypting the transport (which requires the session keys) or logging at the application layer. This is an inherent trade-off of transport encryption.
+
+### Residual risks
+
+- **Transport-layer vulnerability.** A vulnerability in `snow` (Noise) or `rustls` (TLS) could expose the RPC byte stream. Mitigated by keeping dependencies updated and by the fact that MLS ciphertext within the stream is independently encrypted.
+
+- **Side channels.** The transport encrypts content but may not fully hide message sizes or timing patterns. A sophisticated adversary could infer information from traffic analysis. This is a known limitation of any encrypted transport and is orthogonal to the RPC-inside-transport decision.
+
+---
+
+## Code references
+
+| File | Relevance |
+|---|---|
+| `crates/quicnprotochat-core/src/noise.rs` | `NoiseTransport`: encrypted `AsyncRead + AsyncWrite` wrapper |
+| `crates/quicnprotochat-core/src/codec.rs` | `LengthPrefixedCodec`: frames messages in the Noise path |
+| `crates/quicnprotochat-server/src/main.rs` | Server: accepts QUIC connections, instantiates Cap'n Proto RPC over QUIC streams |
+| `crates/quicnprotochat-client/src/main.rs` | Client: connects via QUIC, instantiates Cap'n Proto RPC client |
+| `schemas/node.capnp` | `NodeService` RPC interface definition |
+
+---
+
+## Further reading
+
+- [Design Decisions Overview](overview.md) -- index of all ADRs
+- [ADR-001: Noise\_XX for Transport Auth](adr-001-noise-xx.md) -- the Noise transport that RPC runs inside (M1)
+- [ADR-002: Cap'n Proto over MessagePack](adr-002-capnproto.md) -- why Cap'n Proto was chosen for serialisation
+- [Wire Format Overview](../wire-format/overview.md) -- the full serialisation pipeline
+- [Framing Codec](../wire-format/framing-codec.md) -- length-prefixed framing in the Noise path
+- [NodeService Schema](../wire-format/node-service-schema.md) -- the RPC interface that runs over the encrypted tunnel
+- [Protocol Layers Overview](../protocol-layers/overview.md) -- how all protocol layers compose
--- a/docs/src/design-rationale/adr-004-mls-unaware-ds.md
+++ b/docs/src/design-rationale/adr-004-mls-unaware-ds.md
@@ -0,0 +1,124 @@
+# ADR-004: MLS-Unaware Delivery Service
+
+**Status:** Accepted
+
+---
+
+## Context
+
+The Delivery Service (DS) is the server-side component that stores and forwards messages between clients. A fundamental design question is: **should the DS understand MLS messages?**
+
+An MLS-aware DS could inspect message types and perform optimizations:
+
+- **Fan-out:** When a client sends a Commit or Application message intended for all group members, an MLS-aware DS could parse the group membership and deliver the message to all members automatically, instead of requiring the client to enqueue separately for each recipient.
+- **Membership validation:** An MLS-aware DS could verify that a sender is actually a member of the group before accepting a message, preventing spam from non-members.
+- **Epoch filtering:** An MLS-aware DS could reject messages from stale epochs, reducing the processing burden on recipients.
+- **Tree optimization:** An MLS-aware DS could cache the ratchet tree and assist with tree synchronization.
+
+However, an MLS-aware DS would also:
+
+- Have access to MLS message metadata (group IDs, epoch numbers, sender positions in the tree).
+- Require an MLS library dependency on the server.
+- Be more complex to implement, test, and audit.
+- Potentially violate the MLS architecture's trust model.
+
+### What RFC 9420 says
+
+RFC 9420 Section 4 defines the DS as a component that:
+
+> "is responsible for ordering handshake messages and delivering them to each client."
+
+Critically, the RFC specifies that the DS **does not have access to group keys** and treats message content as opaque. The DS's role is limited to:
+
+1. Ordering: ensuring that handshake messages (Commits) are applied in a consistent order across all group members.
+2. Delivery: routing messages to the correct recipients.
+3. Optional: enforcing access control (e.g., only group members can send to the group).
+
+The RFC explicitly envisions that the DS operates on opaque blobs, not on decrypted MLS content.
+
+---
+
+## Decision
+
+The quicnprotochat Delivery Service is **MLS-unaware**. It routes opaque byte strings by `(recipientKey, channelId)` without parsing, inspecting, or validating any MLS content.
+
+### What the DS sees
+
+```text
+DS perspective:
+  enqueue(recipientKey=0x1234..., payload=<opaque bytes>, channelId=<uuid>, version=1)
+  fetch(recipientKey=0x1234..., channelId=<uuid>, version=1) -> [<opaque bytes>, ...]
+
+DS does NOT see:
+  - Whether the payload is a Welcome, Commit, or Application message
+  - The MLS group ID or epoch number
+  - The sender's position in the ratchet tree
+  - Any plaintext content
+```
+
+### Routing responsibility
+
+Because the DS does not parse MLS messages, the **client** is responsible for routing:
+
+| MLS Operation | Client's Routing Responsibility |
+|---|---|
+| `add_members()` | Enqueue the Welcome message to the new member's `recipientKey`. Enqueue the Commit to each existing member's `recipientKey`. |
+| `remove_members()` | Enqueue the Commit to each remaining member's `recipientKey`. |
+| `create_message()` | Enqueue the Application message to each group member's `recipientKey`. |
+| `self_update()` | Enqueue the Commit to each other member's `recipientKey`. |
+
+This means that sending a message to a group of n members requires n-1 enqueue calls (one per recipient, excluding the sender). The client must maintain its own copy of the group membership list.
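+
+The per-recipient routing described above can be sketched as follows. This is a minimal illustration, not the client's actual code: `EnqueueRequest` and `fan_out` are hypothetical names, and the membership list is assumed to be a plain slice of recipient keys that the client keeps up to date itself.
+
+```rust
+/// One enqueue call the client intends to make (hypothetical type).
+#[derive(Debug, Clone, PartialEq)]
+pub struct EnqueueRequest {
+    pub recipient_key: Vec<u8>,
+    pub channel_id: Vec<u8>,
+    pub payload: Vec<u8>,
+}
+
+/// Build one request per group member, skipping the sender.
+/// For n members this yields n-1 requests, each with its own payload copy.
+pub fn fan_out(
+    members: &[Vec<u8>],
+    sender: &[u8],
+    channel_id: &[u8],
+    payload: &[u8],
+) -> Vec<EnqueueRequest> {
+    members
+        .iter()
+        .filter(|m| m.as_slice() != sender)
+        .map(|m| EnqueueRequest {
+            recipient_key: m.clone(),
+            channel_id: channel_id.to_vec(),
+            payload: payload.to_vec(),
+        })
+        .collect()
+}
+```
+
+Each returned request would then be passed to the `enqueue` RPC. A member missing from `members` receives nothing, which is why the client's membership list has to stay accurate.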
+
+---
+
+## Consequences
+
+### Benefits
+
+- **Correct MLS architecture.** The DS does not hold group keys or inspect group state, which is the architecture recommended by RFC 9420 Section 4. A compromised DS learns nothing about message content or group structure beyond the routing metadata (recipient keys and channel IDs).
+
+- **Audit-friendly.** The DS's audit log is a simple append-only sequence of `(timestamp, recipientKey, channelId, payload_hash)` entries. There is no complex state machine to audit. The server's behavior is trivially verifiable: it accepts blobs and returns them in FIFO order.
+
+- **No MLS dependency on the server.** The server does not depend on `openmls` or any MLS library. This reduces the server's attack surface, compile time, and binary size. It also means the server is completely decoupled from MLS version upgrades.
+
+- **Simplicity.** The DS is a hash map of FIFO queues. The entire implementation fits in a few hundred lines of Rust. There are no edge cases around epoch transitions, tree synchronization, or membership conflicts.
+
+- **Protocol agnosticism.** The DS can carry any payload, not just MLS messages. Future protocol extensions (e.g., signaling for voice/video, file transfer metadata) can reuse the same delivery infrastructure without modification.
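+
+To illustrate the "hash map of FIFO queues" idea, here is a minimal sketch using only the standard library. It is an assumption-laden stand-in, not the server's storage code: `OpaqueQueues` is a hypothetical name, a plain `HashMap` replaces the concurrent map, and `fetch` is assumed to drain the queue.
+
+```rust
+use std::collections::{HashMap, VecDeque};
+
+/// Queue key: (recipient key bytes, channel id bytes).
+type QueueKey = (Vec<u8>, Vec<u8>);
+
+/// Hypothetical in-memory stand-in for the server's queue store.
+#[derive(Default)]
+pub struct OpaqueQueues {
+    queues: HashMap<QueueKey, VecDeque<Vec<u8>>>,
+}
+
+impl OpaqueQueues {
+    /// Append a payload to the recipient's queue without inspecting it.
+    pub fn enqueue(&mut self, recipient: &[u8], channel: &[u8], payload: Vec<u8>) {
+        self.queues
+            .entry((recipient.to_vec(), channel.to_vec()))
+            .or_default()
+            .push_back(payload);
+    }
+
+    /// Remove and return everything queued for this key, oldest first.
+    pub fn fetch(&mut self, recipient: &[u8], channel: &[u8]) -> Vec<Vec<u8>> {
+        self.queues
+            .remove(&(recipient.to_vec(), channel.to_vec()))
+            .map(|queue| queue.into_iter().collect())
+            .unwrap_or_default()
+    }
+}
+```
+
+Nothing in this sketch depends on the payload being an MLS message, which is the point of the protocol-agnosticism benefit above.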
+
+### Costs and trade-offs
+
+- **No server-side fan-out.** The client must enqueue separately for each recipient. For a group of n members, this means n-1 enqueue calls per message, compared to 1 call if the DS could fan out. Because each call carries its own copy of the payload along with the routing metadata, client upload bandwidth per message grows by a factor of roughly n-1, even though the payload bytes are identical in every call.
+
+- **No server-side membership validation.** The DS cannot verify that a sender is a member of the group. A malicious client could enqueue messages to any recipient key, potentially causing the recipient to process (and reject) invalid MLS messages. This is mitigated by MLS's own authentication: invalid messages are rejected during MLS processing.
+
+- **No server-side ordering guarantees.** RFC 9420 envisions the DS providing a consistent ordering of handshake messages. The current DS provides FIFO ordering per `(recipientKey, channelId)` queue, but it does not provide global ordering across all group members. MLS tolerates some reordering (Commits carry the epoch number, and clients can buffer messages for future epochs), but only one Commit can be applied per epoch, so two members who commit concurrently produce a conflict that the clients must resolve themselves.
+
+- **Client complexity.** The client must maintain the group membership list and perform per-recipient routing. This is additional state that the client must manage correctly. An incorrect membership list results in some members not receiving messages.
+
+### Residual risks
+
+- **Metadata exposure.** While the DS does not see message content, it does see routing metadata: which recipient keys receive messages, when, and on which channels. This metadata can reveal communication patterns. Mitigation: use channel IDs that are not correlated with real-world identifiers, and consider padding to hide message sizes.
+
+- **Denial of service.** Because the DS does not validate senders, a malicious client could flood a recipient's queue with garbage payloads. Mitigation: rate limiting (planned for a future milestone) and the `Auth` struct for sender identification.
+
+---
+
+## Code references
+
+| File | Relevance |
+|---|---|
+| `schemas/delivery.capnp` | DeliveryService RPC interface (opaque `Data` payloads) |
+| `schemas/node.capnp` | NodeService: `enqueue`, `fetch`, `fetchWait` methods |
+| `crates/quicnprotochat-server/src/storage.rs` | Server-side queue storage (DashMap-based FIFO queues) |
+| `crates/quicnprotochat-server/src/main.rs` | NodeService RPC handler implementation |
+
+---
+
+## Further reading
+
+- [Design Decisions Overview](overview.md) -- index of all ADRs
+- [Delivery Schema](../wire-format/delivery-schema.md) -- the DS RPC interface definition
+- [NodeService Schema](../wire-format/node-service-schema.md) -- the unified interface that includes DS methods
+- [ADR-005: Single-Use KeyPackages](adr-005-single-use-keypackages.md) -- related AS design decision
+- [Architecture Overview](../architecture/overview.md) -- system-level view showing DS in context
+- [Why This Design, Not Signal/Matrix/...](why-not-signal.md) -- broader protocol comparison
--- a/docs/src/design-rationale/adr-005-single-use-keypackages.md
+++ b/docs/src/design-rationale/adr-005-single-use-keypackages.md
@@ -0,0 +1,114 @@
+# ADR-005: Single-Use KeyPackages
+
+**Status:** Accepted
+
+---
+
+## Context
+
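+MLS (RFC 9420) intends each KeyPackage to be used only once (the RFC allows reuse only for a designated "last resort" KeyPackage). A KeyPackage contains the client's HPKE init key, which is used during the `add_members()` operation to encrypt the Welcome message. If the same KeyPackage is used twice, both Welcome messages are encrypted to the same init key, so the client cannot delete the corresponding private key after its first use, and a later compromise of that key exposes every Welcome that was encrypted to it. This weakens the forward secrecy of the initial key exchange.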
+
+The Authentication Service (AS) stores KeyPackages uploaded by clients and serves them to peers who want to add the client to a group. The design question is: **how should the AS enforce single-use semantics?**
+
+### Alternatives considered
+
+1. **Mark-as-used.** The AS could keep a "used" flag on each KeyPackage and reject subsequent fetch requests for packages already marked as used. This preserves the package on the server (for auditing or retry) but requires additional state tracking and introduces a race condition: if two peers fetch the same package concurrently, one of them will receive a "used" package unless the flag is set atomically with the first fetch.
+
+2. **Reference counting.** The AS could allow a KeyPackage to be fetched a configurable number of times. This would support use cases like "allow the same package to be used in N group additions." However, RFC 9420 intends KeyPackages to be single-use, so this approach would run against the specification.
+
+3. **Atomic removal on fetch.** The AS removes the KeyPackage from storage in the same operation that returns it. The first fetch succeeds and returns the package; subsequent fetches for the same package find nothing. This is the simplest approach and provides the strongest guarantee.
+
+---
+
+## Decision
+
+The Authentication Service **atomically removes** a KeyPackage when it is fetched. The `fetchKeyPackage` method is destructive: it returns the package and deletes it in a single operation. If no packages are stored for the requested identity, an empty response is returned.
+
+### Implementation
+
+The server stores KeyPackages in a per-identity queue (currently backed by `DashMap` with `Vec<Vec<u8>>` values). The `fetchKeyPackage` operation:
+
+1. Locks the entry for the requested identity key.
+2. Pops the first KeyPackage from the queue (FIFO order).
+3. Returns the popped package.
+4. Releases the lock.
+
+If the queue is empty (or no entry exists for the identity key), the method returns empty `Data`.
+
+```text
+Before fetch:
+  identity_key_0x1234 -> [KP_1, KP_2, KP_3]
+
+After fetchKeyPackage(identity_key=0x1234):
+  Returns: KP_1
+  identity_key_0x1234 -> [KP_2, KP_3]
+```
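+
+The fetch steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the server's implementation: `KeyPackageStore` is a hypothetical name, a `Mutex<HashMap<..>>` stands in for `DashMap`, a `VecDeque` replaces the `Vec` so the front pop is cheap, and `None` plays the role of the empty `Data` response.
+
+```rust
+use std::collections::{HashMap, VecDeque};
+use std::sync::Mutex;
+
+/// Hypothetical in-memory KeyPackage store, keyed by identity key bytes.
+#[derive(Default)]
+pub struct KeyPackageStore {
+    inner: Mutex<HashMap<Vec<u8>, VecDeque<Vec<u8>>>>,
+}
+
+impl KeyPackageStore {
+    /// Append a package to the back of the identity's queue.
+    pub fn upload(&self, identity_key: &[u8], package: Vec<u8>) {
+        let mut map = self.inner.lock().unwrap();
+        map.entry(identity_key.to_vec()).or_default().push_back(package);
+    }
+
+    /// Remove and return the oldest package for this identity.
+    /// The lock is held across lookup and removal, so two concurrent
+    /// callers can never receive the same package.
+    pub fn fetch(&self, identity_key: &[u8]) -> Option<Vec<u8>> {
+        let mut map = self.inner.lock().unwrap();
+        map.get_mut(identity_key)?.pop_front()
+    }
+}
+```
+
+Because removal and return happen under one lock acquisition, there is no separate "mark as used" step that could race with a second fetch.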
+
+### Client responsibilities
+
+Because the AS consumes KeyPackages on fetch, clients must manage their KeyPackage supply:
+
+1. **Pre-upload multiple KeyPackages.** After generating their identity, a client should upload several KeyPackages (e.g., 10-100) so that multiple peers can add them to groups concurrently.
+
+2. **Monitor supply.** Clients should periodically check (via a future monitoring endpoint or heuristic) whether their KeyPackage supply on the server is running low, and replenish by uploading more.
+
+3. **Handle empty responses.** A client trying to add a peer whose KeyPackage supply is exhausted will receive an empty response from `fetchKeyPackage`. The client should handle this gracefully -- e.g., by notifying the user that the peer needs to upload more KeyPackages.
+
+### Fingerprint for tamper detection
+
+The `uploadKeyPackage` method returns a SHA-256 fingerprint of the uploaded package. This fingerprint serves as a tamper-detection mechanism:
+
+1. The uploading client records the fingerprint.
+2. When a peer fetches the KeyPackage, they can compute the SHA-256 hash of the received package.
+3. If the fetched package's hash does not match the expected fingerprint (communicated out-of-band), the server may have tampered with the package.
+
+This is a defense-in-depth measure. In practice, MLS's own signature verification on KeyPackages also detects tampering, since the KeyPackage includes a signature over its contents using the uploader's Ed25519 identity key.
+
+---
+
+## Consequences
+
+### Benefits
+
+- **Forward secrecy of initial key exchange.** Each `add_members()` operation uses a fresh HPKE init key, so the shared secret derived from the Welcome message is unique. Compromising one group addition does not compromise others.
+
+- **Simplicity.** Atomic removal is the simplest possible implementation of single-use semantics. There is no "used" flag, no reference count, no expiration timer. The package is either in the store (available) or not (consumed).
+
+- **No race conditions.** Because removal is atomic with fetch, two concurrent fetches for the same identity key will each receive a different KeyPackage (or one will receive an empty response if only one package remains). There is no window where two fetchers could receive the same package.
+
+- **Compliance with RFC 9420.** The single-use semantics are a direct implementation of MLS's requirement that each KeyPackage's HPKE init key be used at most once.
+
+### Costs and trade-offs
+
+- **Client must manage supply.** Unlike a reusable credential, single-use KeyPackages are a consumable resource. Clients must proactively upload packages and monitor their supply. A client that goes offline for an extended period may exhaust its supply, becoming unreachable for new group additions.
+
+- **No retry after fetch.** If a client fetches a KeyPackage and then fails to complete the `add_members()` operation (e.g., due to a crash or network error), the KeyPackage is consumed and wasted. The client must fetch a new one and retry.
+
+- **Storage scaling.** If each client uploads N KeyPackages and there are M clients, the AS must store up to M * N packages. For reasonable values (e.g., 1000 clients, 100 packages each), this is 100,000 packages -- well within the capacity of an in-memory store. For larger deployments, persistent storage would be needed.
+
+### Residual risks
+
+- **KeyPackage exhaustion attack.** A malicious client could repeatedly fetch a target's KeyPackages without using them, draining the target's supply and preventing legitimate peers from adding the target to groups. Mitigation: rate limiting on `fetchKeyPackage` calls (planned for a future milestone) and the `Auth` struct for identifying and blocking abusive clients.
+
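+- **Server-side compromise.** KeyPackages contain only public keys, so an attacker who reads the AS store cannot use them to decrypt Welcome messages. A compromised AS could instead withhold KeyPackages or serve ones that the attacker generated. A substituted package fails MLS signature verification only if the fetching peer already knows the target's genuine identity key, so identity keys should be verified out-of-band. This risk is inherent to any prekey distribution service (Signal has the same risk with X3DH prekey bundles). MLS's post-compromise security limits the damage from a leaked member key once that member issues an update, but it does not evict an impostor who was added with a substituted KeyPackage.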
+
+---
+
+## Code references
+
+| File | Relevance |
+|---|---|
+| `schemas/auth.capnp` | `AuthenticationService` interface: `uploadKeyPackage`, `fetchKeyPackage` |
+| `schemas/node.capnp` | `NodeService` interface: same methods with `Auth` parameter |
+| `crates/quicnprotochat-server/src/storage.rs` | Server-side KeyPackage storage (DashMap-backed queue) |
+| `crates/quicnprotochat-server/src/main.rs` | RPC handler: `fetchKeyPackage` implementation with atomic removal |
+
+---
+
+## Further reading
+
+- [Design Decisions Overview](overview.md) -- index of all ADRs
+- [Auth Schema](../wire-format/auth-schema.md) -- the RPC interface for KeyPackage operations
+- [NodeService Schema](../wire-format/node-service-schema.md) -- the unified interface including auth methods
+- [ADR-004: MLS-Unaware Delivery Service](adr-004-mls-unaware-ds.md) -- related design decision for the DS
+- [Architecture Overview](../architecture/overview.md) -- system-level view showing the AS in context
--- a/docs/src/design-rationale/adr-006-pq-gap.md
+++ b/docs/src/design-rationale/adr-006-pq-gap.md
@@ -0,0 +1,119 @@
+# ADR-006: PQ Gap in Noise Transport
+
+**Status:** Accepted
+
+---
+
+## Context
+
+quicnprotochat's security architecture has two encryption layers:
+
+1. **Transport layer** (Noise\_XX or QUIC + TLS 1.3): encrypts the byte stream between client and server using classical Diffie-Hellman key exchange (X25519).
+2. **Content layer** (MLS, RFC 9420): provides end-to-end group key agreement using DHKEM(X25519, HKDF-SHA256) in the current ciphersuite, with a hybrid KEM (X25519 + ML-KEM-768) available at the envelope level and planned for integration into the MLS ciphersuite at M5.
+
+The content layer will have post-quantum protection from M5 onward via the hybrid KEM. However, the transport layer uses classical X25519 exclusively. This creates a **post-quantum gap**: the transport layer is vulnerable to a quantum adversary, even after the content layer is PQ-protected.
+
+### The threat: harvest-now, decrypt-later
+
+A quantum adversary who does not yet have a cryptographically relevant quantum computer (CRQC) can still:
+
+1. **Record** all encrypted traffic transiting the network today.
+2. **Store** the recordings until a CRQC becomes available.
+3. **Decrypt** the recorded traffic using Shor's algorithm to break X25519.
+
+This is known as the "harvest-now, decrypt-later" (HNDL) attack. The question is: **what is the practical impact of HNDL on quicnprotochat's transport layer?**
+
+### What a quantum adversary learns from breaking the transport
+
+If the Noise\_XX handshake is broken, the adversary learns:
+
+| Data | Sensitivity | Exposure |
+|---|---|---|
+| Static X25519 public keys of both parties | Identity metadata | Reveals which client connected to which server |
+| Timing and size of RPC calls | Traffic metadata | Reveals communication patterns |
+| Cap'n Proto RPC traffic (method calls, parameters) | Routing metadata | Reveals recipient keys, channel IDs, and message timestamps |
+| MLS ciphertext (payload bytes) | **Still encrypted** | MLS uses its own key agreement; breaking the transport does not break MLS |
+
+Critically, **no long-lived content secrets transit the Noise handshake**. The MLS key schedule derives group keys independently of the transport. Even with full transport decryption, the adversary sees only MLS ciphertext, which they cannot decrypt without breaking MLS's own key exchange (which will be PQ-protected from M5).
+
+### Why not use PQ-Noise now?
+
+The Noise Protocol Framework community has drafted extensions for post-quantum Noise (PQ-Noise), which replace or augment X25519 with PQ key exchange mechanisms (e.g., Kyber/ML-KEM). However:
+
+1. **The `snow` crate does not support PQ-Noise.** As of snow 0.9, there is no API for PQ handshake patterns. Adding PQ support would require forking `snow` or switching to a different Noise implementation.
+
+2. **PQ-Noise is not yet standardized.** The draft specifications (e.g., `draft-noise-pq`) are still evolving. Adopting an unstable specification risks incompatibility with future versions.
+
+3. **Performance and size concerns.** ML-KEM-768 ciphertexts are 1,088 bytes, and encapsulation keys are 1,184 bytes. These are significantly larger than X25519's 32-byte keys. In a Noise handshake, where multiple key exchanges occur, the handshake size and latency increase substantially.
+
+4. **The QUIC path uses TLS 1.3.** The primary transport in M3+ is QUIC + TLS 1.3, which has its own PQ migration path (via `rustls` and the `X25519MLKEM768` hybrid TLS key exchange group). This path is likely to receive PQ support before `snow` does.
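+
+To make the size concern in item 3 concrete, the sketch below splits a hybrid envelope using the field widths listed in this commit's message: `version(1) | x25519_eph_pk(32) | mlkem_ct(1088) | nonce(12) | ct`. The names `EnvelopeParts` and `split_envelope` are hypothetical and do not come from `hybrid_kem.rs`; only the field order and widths are taken from the commit message.
+
+```rust
+/// Field widths in bytes, as listed in the commit message's wire format.
+pub const VERSION_LEN: usize = 1;
+pub const X25519_PK_LEN: usize = 32;
+pub const MLKEM768_CT_LEN: usize = 1088;
+pub const NONCE_LEN: usize = 12;
+pub const HEADER_LEN: usize = VERSION_LEN + X25519_PK_LEN + MLKEM768_CT_LEN + NONCE_LEN;
+
+/// Borrowed view over the fields of one envelope (hypothetical type).
+#[derive(Debug, PartialEq)]
+pub struct EnvelopeParts<'a> {
+    pub version: u8,
+    pub x25519_eph_pk: &'a [u8],
+    pub mlkem_ct: &'a [u8],
+    pub nonce: &'a [u8],
+    pub ciphertext: &'a [u8],
+}
+
+/// Split an envelope into its fields; returns None if it is too short.
+pub fn split_envelope(bytes: &[u8]) -> Option<EnvelopeParts<'_>> {
+    if bytes.len() < HEADER_LEN {
+        return None;
+    }
+    let (version, rest) = bytes.split_at(VERSION_LEN);
+    let (x25519_eph_pk, rest) = rest.split_at(X25519_PK_LEN);
+    let (mlkem_ct, rest) = rest.split_at(MLKEM768_CT_LEN);
+    let (nonce, ciphertext) = rest.split_at(NONCE_LEN);
+    Some(EnvelopeParts {
+        version: version[0],
+        x25519_eph_pk,
+        mlkem_ct,
+        nonce,
+        ciphertext,
+    })
+}
+```
+
+With these widths the fixed header is 1,133 bytes, of which 1,088 are the ML-KEM-768 ciphertext; the same layout without that field would need only 45 bytes. A PQ-Noise handshake would pay a comparable per-exchange cost, plus the 1,184-byte encapsulation key.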
+
+---
+
+## Decision
+
+Accept the PQ gap in the Noise transport layer for milestones M1 through M5. The content layer (MLS) will be PQ-protected from M5 via the hybrid KEM. The transport layer will gain PQ protection when either:
+
+- The `snow` crate adds PQ-Noise support, or
+- The QUIC/TLS path gains PQ key exchange support via `rustls`, or
+- A PQ-Noise Rust implementation becomes available and is adopted.
+
+Until then, the transport layer uses classical X25519, and the PQ gap is accepted as a known, bounded risk.
+
+---
+
+## Consequences
+
+### What is protected
+
+- **Message content** is protected by MLS's own key agreement. Even if the transport is broken, MLS ciphertext remains secure (assuming MLS uses a PQ-safe ciphersuite, which is the plan for M5).
+- **MLS key material** (epoch secrets, application secrets) never transits the Noise handshake. It is derived from the MLS tree, not from the transport.
+- **Forward secrecy of content** is provided by MLS epoch ratcheting, independent of the transport.
+
+### What is exposed
+
+- **Identity metadata.** A quantum adversary who breaks the Noise handshake learns the static X25519 public keys of both parties. This reveals *which* client connected to *which* server, and *when*.
+- **Timing metadata.** The adversary learns the timing and size pattern of RPC calls, which can reveal communication patterns (e.g., "Alice and Bob exchanged messages at 3pm").
+- **Routing metadata.** The adversary learns the recipient keys and channel IDs in RPC calls (since Cap'n Proto RPC traffic is visible after transport decryption). This reveals *who* is communicating with *whom*, even though the message content remains encrypted by MLS.
+
+### Practical impact assessment
+
+| Risk Factor | Assessment |
+|---|---|
+| **Timeline to CRQC** | Most estimates place cryptographically relevant quantum computers at 10-20+ years away. The PQ gap is a near-term risk only for adversaries with very long storage horizons. |
+| **Value of metadata** | Identity and timing metadata is sensitive for high-value targets but less critical than message content for most users. |
+| **Content protection** | Message content is independently protected by MLS. Breaking the transport does not break content encryption. |
+| **Migration path** | PQ key exchange for TLS 1.3 is being standardized (ML-KEM in TLS). The QUIC/TLS path is likely to gain PQ protection before the Noise path. |
+| **Overall risk** | **Low to moderate.** The PQ gap exposes metadata only, not content. The risk is limited to adversaries who (a) can record traffic today, (b) will have a CRQC in the future, and (c) are interested in metadata about quicnprotochat users. |
+
+### Mitigation timeline
+
+| Milestone | Transport PQ Status | Content PQ Status |
+|---|---|---|
+| M1 | Classical X25519 (Noise) | Classical DHKEM (MLS) |
+| M2 | Classical X25519 (Noise) | Classical DHKEM (MLS) |
+| M3 | Classical X25519 (QUIC/TLS 1.3) | Classical DHKEM (MLS) + hybrid KEM at envelope level |
+| M4 | Classical X25519 (QUIC/TLS 1.3) | Classical DHKEM (MLS) + hybrid KEM at envelope level |
+| M5 | Classical X25519 (QUIC/TLS 1.3) | **PQ-protected** (hybrid KEM integrated into MLS ciphersuite) |
+| Future | **PQ-protected** (PQ key exchange in TLS or PQ-Noise) | PQ-protected |
+
+---
+
+## Code references
+
+| File | Relevance |
+|---|---|
+| `crates/quicnprotochat-core/src/noise.rs` | Noise\_XX handshake using classical X25519 (`Noise_XX_25519_ChaChaPoly_SHA256`) |
+| `crates/quicnprotochat-core/src/hybrid_kem.rs` | Hybrid KEM (X25519 + ML-KEM-768) for content-layer PQ protection |
+| `crates/quicnprotochat-server/src/main.rs` | QUIC server using `rustls` with classical TLS 1.3 |
+| `crates/quicnprotochat-client/src/main.rs` | QUIC client using `rustls` with classical TLS 1.3 |
+
+---
+
+## Further reading
+
+- [Design Decisions Overview](overview.md) -- index of all ADRs
+- [ADR-001: Noise\_XX for Transport Auth](adr-001-noise-xx.md) -- the Noise transport that has the PQ gap
+- [Why This Design, Not Signal/Matrix/...](why-not-signal.md) -- PQ readiness comparison across protocols
+- [Protocol Layers Overview](../protocol-layers/overview.md) -- how transport and content layers compose
+- [Noise Protocol Framework specification](https://noiseprotocol.org/noise.html) -- upstream Noise specification
--- a/docs/src/design-rationale/overview.md
+++ b/docs/src/design-rationale/overview.md
@@ -0,0 +1,63 @@
+# Design Decisions Overview
+
+This section collects the Architecture Decision Records (ADRs) that document the key design choices in quicnprotochat. Each ADR follows a standard format: context (why the decision was needed), decision (what was chosen), and consequences (trade-offs, benefits, and residual risks).
+
+These decisions are not immutable. Each ADR has a status field and can be superseded by a later ADR if circumstances change. The goal is to preserve the reasoning behind each choice so that future contributors understand *why* the system works the way it does, not just *how*.
+
+---
+
+## ADR index
+
+| ADR | Title | Status | One-line summary |
+|---|---|---|---|
+| [ADR-001](adr-001-noise-xx.md) | Noise\_XX for Transport Auth | Accepted | Mutual authentication via static X25519 keys; no CA infrastructure required. |
+| [ADR-002](adr-002-capnproto.md) | Cap'n Proto over MessagePack | Accepted | Zero-copy, schema-enforced serialisation with built-in async RPC replaces hand-rolled MessagePack dispatch. |
+| [ADR-003](adr-003-rpc-inside-noise.md) | RPC Inside the Noise Tunnel | Accepted | Cap'n Proto RPC operates over the encrypted byte stream; transport owns security, RPC owns dispatch. |
+| [ADR-004](adr-004-mls-unaware-ds.md) | MLS-Unaware Delivery Service | Accepted | The DS routes opaque blobs by recipient key; it never inspects MLS content. |
+| [ADR-005](adr-005-single-use-keypackages.md) | Single-Use KeyPackages | Accepted | The AS atomically removes a KeyPackage on fetch to preserve MLS forward secrecy. |
+| [ADR-006](adr-006-pq-gap.md) | PQ Gap in Noise Transport | Accepted | Classical X25519 in Noise is accepted for M1-M5; MLS content is PQ-protected separately. |
+
+---
+
+## Design comparison
+
+For a broader comparison of quicnprotochat's design against alternative messaging protocols (Signal, Matrix/Olm/Megolm), see [Why This Design, Not Signal/Matrix/...](why-not-signal.md).
+
+---
+
+## How to read an ADR
+
+Each ADR page follows this structure:
+
+1. **Status** -- One of: Proposed, Accepted, Deprecated, Superseded. All current ADRs are Accepted.
+2. **Context** -- The problem or force that motivated the decision. What constraints existed? What alternatives were considered?
+3. **Decision** -- The specific choice that was made. What was selected and what was rejected?
+4. **Consequences** -- The trade-offs that result from the decision. What are the benefits? What are the costs? What residual risks remain?
+5. **Code references** -- Links to the source files where the decision is implemented.
+
+---
+
+## Cross-cutting themes
+
+Several themes recur across multiple ADRs:
+
+### Layered security
+
+ADR-001, ADR-003, and ADR-006 all concern the separation between transport-layer security (Noise or QUIC/TLS) and application-layer security (MLS). The core principle is that **no single layer is trusted alone**. Transport encryption protects metadata and provides authentication; MLS provides end-to-end content encryption with forward secrecy and post-compromise security.
+
+### Server minimalism
+
+ADR-004 and ADR-005 reflect a design philosophy where the server does as little as possible. The DS does not parse MLS messages. The AS enforces single-use semantics through atomic removal rather than complex state tracking. This minimalism reduces the server's attack surface and makes it easier to audit.
+
+### Schema-first design
+
+ADR-002 and ADR-003 establish Cap'n Proto as the single source of truth for the wire format. Every message and RPC call is defined in `.capnp` schema files, which are checked into the repository and used for code generation. This eliminates the class of bugs that arises from hand-rolled serialisation and ensures that the wire format is documented, versioned, and evolvable.
+
+---
+
+## Further reading
+
+- [Why This Design, Not Signal/Matrix/...](why-not-signal.md) -- comparative analysis against alternative protocols
+- [Wire Format Overview](../wire-format/overview.md) -- the serialisation pipeline that implements these decisions
+- [Architecture Overview](../architecture/overview.md) -- system-level view
+- [Protocol Layers Overview](../protocol-layers/overview.md) -- how the protocol layers stack
--- a/docs/src/design-rationale/why-not-signal.md
+++ b/docs/src/design-rationale/why-not-signal.md
@@ -0,0 +1,164 @@
+# Why This Design, Not Signal/Matrix/...
+
+This page compares quicnprotochat's protocol choices against two widely deployed secure messaging systems -- the Signal Protocol and the Matrix ecosystem (Olm/Megolm) -- to explain why a different architecture was chosen. The comparison covers four dimensions: group key agreement, transport, serialisation, and overall trade-offs.
+
+---
+
+## Group key agreement
+
+The choice of group key agreement protocol is the most consequential architectural decision in any end-to-end encrypted group messenger. It determines the cryptographic properties available to the application, the cost of group operations, and the complexity of the client state machine.
+
+### Signal Protocol (Double Ratchet + X3DH + Sender Keys)
+
+The Signal Protocol was designed for **1:1 messaging** and later extended to groups via Sender Keys.
+
+**1:1 (Double Ratchet + X3DH):**
+
+- X3DH performs an initial key agreement between two parties using prekey bundles (analogous to MLS KeyPackages).
+- The Double Ratchet derives per-message keys using a combination of a Diffie-Hellman ratchet and a symmetric hash ratchet.
+- Provides forward secrecy (past messages are protected after key compromise) and post-compromise security (future messages are protected after a compromise is healed by a new DH exchange).
+- Well-studied and battle-tested for over a decade. Formal security analysis by Cohn-Gordon et al. (2017).
+
+**Groups (Sender Keys):**
+
+- Each group member generates a Sender Key and distributes it to all other members via pairwise Double Ratchet channels.
+- Sender Keys provide a symmetric ratchet for forward secrecy, but **no post-compromise security**. If a Sender Key is compromised, all future messages from that sender are compromised until the key is manually rotated.
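+- Group membership changes require O(n) pairwise Sender Key distributions per affected key. A new member must send its Sender Key to all n-1 other members and receive theirs. After a removal, each remaining member must generate a fresh Sender Key and redistribute it, because the departed member still holds the old ones.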
+- The pairwise key exchange for initial setup is O(n^2): each of n members must establish a Double Ratchet session with each of the other n-1 members.
+
+**Limitations for quicnprotochat's use case:**
+
+- O(n^2) pairwise setup cost limits practical group size.
+- No post-compromise security for groups is a significant gap.
+- The protocol requires a central server for X3DH prekey bundle distribution (similar to quicnprotochat's AS, but tightly coupled to the Signal server).
+
+### Matrix / Olm / Megolm
+
+The Matrix ecosystem uses two distinct cryptographic protocols:
+
+**Olm (1:1):**
+
+- An implementation of the Double Ratchet, similar to Signal's 1:1 protocol.
+- Used to establish pairwise encrypted channels between devices.
+- Provides forward secrecy and post-compromise security for 1:1 sessions.
+
+**Megolm (groups):**
+
+- A symmetric sender ratchet. Each sender in a group generates a Megolm session and distributes the initial ratchet state to all other members via Olm channels.
+- The ratchet is **forward-only**: it provides forward secrecy (a compromised ratchet state cannot decrypt past messages) but **no post-compromise security** (a compromised ratchet state decrypts all future messages from that sender until a new Megolm session is created).
+- Session rotation is typically triggered by membership changes or periodic timers, but it is not cryptographically enforced.
+
+**Additional Matrix-specific considerations:**
+
+- **Federation** adds significant complexity. Messages may traverse multiple homeservers, each of which sees encrypted ciphertext but also metadata (sender, recipient, room ID, timestamps). Federation increases metadata exposure compared to a single-server architecture.
+- **Eventually consistent state** model means that room membership, key sharing, and message ordering can diverge between homeservers. The client must reconcile these inconsistencies, adding complexity to the state machine.
+- **Device verification** is a persistent UX challenge. The cross-signing mechanism is powerful but difficult for users to understand.
+
+**Limitations for quicnprotochat's use case:**
+
+- No post-compromise security for groups (same limitation as Signal's Sender Keys).
+- Federation adds latency, metadata exposure, and state management complexity that quicnprotochat does not need.
+- JSON-based wire format is inefficient (see serialisation comparison below).
+
+### quicnprotochat: MLS (RFC 9420)
+
+quicnprotochat uses the **Messaging Layer Security (MLS)** protocol, standardized as RFC 9420 by the IETF.
+
+**Key properties:**
+
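+- **Native group key agreement.** MLS was designed from the ground up for groups, not bolted onto a pairwise protocol. The ratchet tree structure provides O(log n) cost for group operations (add, remove, update) when the tree is well populated, compared to O(n) or O(n^2) for pairwise-based schemes. Blank nodes left behind by removals can push the cost of an individual update toward O(n) until the tree is repopulated.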
+- **Post-compromise security.** Any group member can issue an Update proposal that replaces their leaf in the ratchet tree, generating a new group secret. This heals the tree: even if a member's key material was previously compromised, the new group secret is unknown to the attacker. This property is **not available** in Signal Sender Keys or Megolm.
+- **Forward secrecy.** Each epoch (a new group state after a Commit) derives fresh keys. Past epoch keys are deleted and cannot decrypt old messages.
+- **Single Commit to update all members.** A Commit message applies one or more proposals (Add, Remove, Update) atomically and is processed by all group members with a single message. No pairwise distribution is needed.
+- **Standardized.** RFC 9420 was published by the IETF in July 2023 after years of design, analysis, and interoperability testing. Multiple independent implementations exist (openmls, mls-rs, Cisco's MLS, etc.).
+
+**Cost of group operations:**
+
+| Operation | Signal (Sender Keys) | Matrix (Megolm) | MLS (quicnprotochat) |
+|---|---|---|---|
+| Add member | O(n) Sender Key distributions | O(n) Megolm session shares | O(log n) tree update |
+| Remove member | O(n) Sender Key rotations | O(n) new Megolm session | O(log n) tree update |
+| Update (PCS heal) | Not supported | Not supported (session rotation is coarse) | O(log n) path update |
+| Per-message encrypt | O(1) symmetric ratchet | O(1) symmetric ratchet | O(1) symmetric ratchet |
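+
+The asymptotic costs in the table can be turned into rough numbers with a few helper functions. This is a back-of-the-envelope sketch: the function names are hypothetical, and the tree figure assumes a balanced, fully populated ratchet tree, which real groups only approximate.
+
+```rust
+/// Messages one member must send to reach every other member pairwise: n-1.
+pub fn pairwise_distributions(n: u64) -> u64 {
+    n.saturating_sub(1)
+}
+
+/// Pairwise sessions needed so that every member can reach every other: n(n-1)/2.
+pub fn pairwise_sessions(n: u64) -> u64 {
+    n * n.saturating_sub(1) / 2
+}
+
+/// Depth of a balanced binary tree with n leaves: ceil(log2(n)).
+pub fn tree_path_len(n: u64) -> u32 {
+    if n <= 1 {
+        0
+    } else {
+        // Number of bits needed to represent n-1 equals ceil(log2(n)).
+        64 - (n - 1).leading_zeros()
+    }
+}
+```
+
+For a 1,000-member group these give 999 distributions per rotated key, 499,500 pairwise sessions for the initial setup, and a tree path of 10 nodes.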
+
+---
+
+## Transport comparison
+
+The transport layer determines how encrypted payloads reach the server and how client-server authentication is performed.
+
+| Property | Signal | Matrix | quicnprotochat |
+|---|---|---|---|
+| **Transport protocol** | TLS over TCP (HTTP/2) | HTTPS (TLS over TCP) | QUIC (UDP) + TLS 1.3 |
+| **Multiplexing** | HTTP/2 stream multiplexing | HTTP/1.1 or HTTP/2 | Native QUIC stream multiplexing |
+| **Head-of-line blocking** | Mitigated by HTTP/2 streams, but TCP HOL blocking remains | Same as Signal | Eliminated: QUIC streams are independent at the transport layer |
+| **Connection establishment** | 1-RTT (TLS 1.3) or 0-RTT (TLS resumption) | 1-RTT (TLS 1.3) or 0-RTT | 0-RTT capable (QUIC resumption) or 1-RTT |
+| **Client authentication** | Bearer tokens over TLS | Bearer tokens over TLS | TLS client certs (rustls/quinn) or bearer tokens via `Auth` struct |
+| **Fallback** | TCP only | TCP only | Noise\_XX over TCP (M1 stack) for environments where UDP/QUIC is blocked |
+
+**Why QUIC?**
+
+QUIC eliminates TCP head-of-line blocking, which is particularly important for a messaging application where multiple independent conversations may be active simultaneously. A lost packet in one QUIC stream does not block delivery of packets in other streams. QUIC also provides built-in connection migration (useful for mobile clients changing networks) and 0-RTT resumption for reduced latency on reconnection.
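+
+The head-of-line blocking argument can be illustrated with a small counting model. This is a sketch, not a network simulation: `Packet` and both functions are hypothetical, and the model assumes exactly one packet is lost and retransmitted after every other packet has arrived. It counts how many on-time packets must wait for that retransmission under each ordering rule.
+
```rust
/// One packet on the wire, tagged with the logical stream it belongs to.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Packet {
    pub stream_id: u32,
}

/// Single ordered byte stream (TCP-like): every packet sent after the lost
/// one is held back, whichever logical stream it belongs to.
pub fn blocked_single_ordered_stream(packets: &[Packet], lost: usize) -> usize {
    packets.len().saturating_sub(lost.saturating_add(1))
}

/// Per-stream ordering (QUIC-like): only later packets on the same stream
/// as the lost packet are held back.
pub fn blocked_per_stream_ordering(packets: &[Packet], lost: usize) -> usize {
    let lost_packet = match packets.get(lost) {
        Some(p) => p,
        None => return 0, // index out of range: nothing was lost
    };
    packets[lost + 1..]
        .iter()
        .filter(|p| p.stream_id == lost_packet.stream_id)
        .count()
}
```
+
+With three conversations interleaved as streams `1, 2, 3, 1, 2, 3` and the first packet lost, the single-stream model holds back five packets, while the per-stream model holds back only the one later packet on stream 1.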
+
+---
+
+## Serialisation comparison
+
+The serialisation format determines the overhead of encoding and decoding messages, the type safety of the wire format, and the feasibility of schema evolution.
+
+| Property | Signal (Protobuf) | Matrix (JSON) | quicnprotochat (Cap'n Proto) |
+|---|---|---|---|
+| **Format** | Binary, schema-defined | Text, schema-optional (JSON Schema exists but is not enforced by the wire format) | Binary, schema-defined |
+| **Deserialization cost** | Requires a decode pass (allocates and copies) | Requires a parse pass (allocates, copies, and handles UTF-8) | **Zero-copy**: the wire bytes are the in-memory representation. Readers traverse pointers in-place. |
+| **Schema enforcement** | Compile-time via protoc codegen | Runtime only (if at all) | Compile-time via capnpc codegen |
+| **Schema evolution** | Forward-compatible (unknown fields preserved) | Forward-compatible (unknown keys ignored) | Forward-compatible (unknown fields and methods ignored) |
+| **RPC support** | Separate framework (gRPC) | REST/HTTP (no built-in RPC) | **Built-in async RPC** (capnp-rpc). Method dispatch, pipelining, and cancellation are part of the serialisation layer. |
+| **Canonical form** | Not guaranteed (field ordering, default elision vary) | Not guaranteed (key ordering is implementation-dependent) | **Canonical form available** (a defined canonicalisation yields deterministic bytes for identical messages, but it must be requested explicitly). Suitable for signing. |
+| **Overhead** | Low (varint encoding, no field names on wire) | High (field names as strings, quoting, escaping, UTF-8) | Very low (8-byte aligned, fixed-width fields, pointer-based data) |
+
+**Why Cap'n Proto over Protobuf?**
+
+While Protobuf is a reasonable choice (and Signal uses it successfully), Cap'n Proto provides three features that are particularly valuable for quicnprotochat:
+
+1. **Zero-copy deserialization** eliminates a class of allocation and performance overhead. In a messaging system that processes many small messages, avoiding deserialization copies adds up.
+2. **Built-in RPC** means that Cap'n Proto is both the serialisation format and the RPC framework. There is no need for a separate gRPC or HTTP layer. The same `.capnp` schema file defines both the data structures and the service interface.
+3. **Canonical form** means that two implementations canonicalising the same logical message will generate identical bytes. This is important for signatures: the MLS layer signs over serialised data, and non-deterministic serialisation would make signature verification unreliable.
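+
+The zero-copy idea in point 1 can be sketched with the standard library alone. The record layout below is a hypothetical one chosen for illustration and is not Cap'n Proto's actual encoding; `RecordView` is likewise an assumed name. The point is that the reader validates bounds once and then hands out borrowed slices of the original buffer, with no decode pass and no allocation.
+
```rust
/// Borrowed view over a fixed-layout record. Hypothetical layout, NOT
/// Cap'n Proto's actual encoding:
///   bytes 0..8   sequence number (little-endian u64)
///   bytes 8..12  payload length  (little-endian u32)
///   bytes 12..   payload
#[derive(Debug, Clone, Copy)]
pub struct RecordView<'a> {
    buf: &'a [u8],
}

impl<'a> RecordView<'a> {
    const HEADER_LEN: usize = 12;

    /// Validate bounds once; the buffer itself is never copied.
    pub fn new(buf: &'a [u8]) -> Option<Self> {
        if buf.len() < Self::HEADER_LEN {
            return None;
        }
        let view = RecordView { buf };
        let end = Self::HEADER_LEN.checked_add(view.payload_len())?;
        if buf.len() < end {
            return None;
        }
        Some(view)
    }

    /// Read the fixed-width field straight from the wire bytes.
    pub fn sequence(&self) -> u64 {
        let mut b = [0u8; 8];
        b.copy_from_slice(&self.buf[0..8]);
        u64::from_le_bytes(b)
    }

    fn payload_len(&self) -> usize {
        let mut b = [0u8; 4];
        b.copy_from_slice(&self.buf[8..12]);
        u32::from_le_bytes(b) as usize
    }

    /// Borrow the payload in place: the returned slice points into `buf`.
    pub fn payload(&self) -> &'a [u8] {
        &self.buf[Self::HEADER_LEN..Self::HEADER_LEN + self.payload_len()]
    }
}
```
+
+A decode-pass format would instead allocate an owned struct and copy the payload into it. Here the slice returned by `payload()` shares its address with the input buffer, which is the property that matters when many small messages are processed.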
+
+---
+
+## Summary comparison table
+
+| Dimension | Signal | Matrix | quicnprotochat |
+|---|---|---|---|
+| **1:1 encryption** | Double Ratchet (FS + PCS) | Olm / Double Ratchet (FS + PCS) | MLS (FS + PCS) |
+| **Group encryption** | Sender Keys (FS only) | Megolm (FS only) | MLS (FS + PCS) |
+| **Group PCS** | No | No | **Yes** (any member can heal the tree) |
+| **Group op cost** | O(n) to O(n^2) | O(n) | **O(log n)** |
+| **Transport** | TLS/TCP (HTTP/2) | TLS/TCP (HTTPS) | **QUIC/UDP** (0-RTT, no HOL blocking) |
+| **Serialisation** | Protobuf | JSON | **Cap'n Proto** (zero-copy, canonical, built-in RPC) |
+| **Standardization** | De facto standard | Matrix spec (open, community-governed) | **IETF RFC 9420** (MLS) + Noise Protocol Framework |
+| **Federation** | No (centralized) | Yes (decentralized) | No (single server per deployment) |
+| **PQ readiness** | PQXDH (X3DH + ML-KEM) in 1:1, not in groups | Not yet | Hybrid KEM (X25519 + ML-KEM-768) at envelope layer; MLS PQ integration planned (M5) |
+| **Maturity** | 10+ years, billions of users | 7+ years, millions of users | Early development (M1-M3) |
+
+---
+
+## What quicnprotochat gives up
+
+No design is without trade-offs. Compared to Signal and Matrix, quicnprotochat:
+
+- **Has no federation.** A single server per deployment means no decentralized architecture. This is a deliberate simplification -- federation adds significant complexity and metadata exposure.
+- **Is less mature.** Signal and Matrix have years of production hardening, formal security audits, and battle-tested implementations. quicnprotochat is in early development.
+- **Has a smaller ecosystem.** Signal and Matrix have extensive client libraries, bridges, and integrations. quicnprotochat is a standalone Rust implementation.
+- **Requires MLS client complexity.** MLS clients must maintain a ratchet tree, process Commits, and handle epoch transitions. This is more complex than a simple symmetric ratchet (Sender Keys / Megolm), though the complexity buys post-compromise security.
+
+---
+
+## Further reading
+
+- [Design Decisions Overview](overview.md) -- index of all ADRs
+- [ADR-001: Noise\_XX for Transport Auth](adr-001-noise-xx.md) -- transport authentication choice
+- [ADR-002: Cap'n Proto over MessagePack](adr-002-capnproto.md) -- serialisation format choice
+- [Protocol Layers Overview](../protocol-layers/overview.md) -- how quicnprotochat's layers compose
+- [MLS (RFC 9420)](../protocol-layers/mls.md) -- deep dive into the MLS protocol layer
+- [Architecture Overview](../architecture/overview.md) -- system-level architecture