feat: add post-quantum hybrid KEM + SQLCipher persistence

Feature 1 - Post-Quantum Hybrid KEM (X25519 + ML-KEM-768):
- Create hybrid_kem.rs with keygen, encrypt, decrypt + 11 unit tests
- Wire format: version(1) | x25519_eph_pk(32) | mlkem_ct(1088) | nonce(12) | ct
- Add uploadHybridKey/fetchHybridKey RPCs to node.capnp schema
- Server: hybrid key storage in FileBackedStore + RPC handlers
- Client: hybrid keypair in StoredState, auto-wrap/unwrap in send/recv/invite/join
- demo-group runs full hybrid PQ envelope round-trip

Feature 2 - SQLCipher Persistence:
- Extract Store trait from FileBackedStore API
- Create SqlStore (rusqlite + bundled-sqlcipher) with encrypted-at-rest SQLite
- Schema: key_packages, deliveries, hybrid_keys tables with indexes
- Server CLI: --store-backend=sql, --db-path, --db-key flags
- 5 unit tests for SqlStore (FIFO, round-trip, upsert, channel isolation)

Also includes: client lib.rs refactor, auth config, TOML config file support,
mdBook documentation, and various cleanups by user.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
256
docs/src/roadmap/authz-plan.md
Normal file
@@ -0,0 +1,256 @@
|
||||
# Auth, Devices, and Tokens
|
||||
|
||||
This page describes the authentication, device management, and authorisation
|
||||
design for quicnprotochat. It introduces account and device identities, gates
|
||||
server operations by authenticated identity, enforces rate and size limits, and
|
||||
binds MLS identity keys to accounts.
|
||||
|
||||
This design cuts across milestones M4 through M6. For the broader production
|
||||
readiness plan, see [Production Readiness WBS](production-readiness.md).
|
||||
|
||||
---
|
||||
|
||||
## Goals
|
||||
|
||||
1. **Introduce accounts and devices** with authenticated access to `NodeService`.
|
||||
2. **Gate operations by identity:** enqueue/fetch/fetchWait require a valid token
|
||||
bound to the caller's account and device.
|
||||
3. **Enforce rate and size limits** per account, per device, and per IP.
|
||||
4. **Bind MLS identity keys to accounts:** a KeyPackage upload must be associated
|
||||
with the uploading account, preventing impersonation.
|
||||
5. **Keep wire changes minimal and versioned:** the `Auth` struct is additive
|
||||
and uses a version field for backward compatibility.
|
||||
|
||||
---
|
||||
|
||||
## Data Model (Server)
|
||||
|
||||
### Accounts
|
||||
|
||||
| Field | Type | Description |
|
||||
|-------|------|-------------|
|
||||
| `account_id` | UUID | Unique account identifier |
|
||||
| `created_at` | Timestamp | Account creation time |
|
||||
| `status` | Enum | `active`, `suspended`, `deleted` |
|
||||
|
||||
### Devices
|
||||
|
||||
| Field | Type | Description |
|
||||
|-------|------|-------------|
|
||||
| `device_id` | UUID | Unique device identifier |
|
||||
| `account_id` | UUID | Owning account (foreign key) |
|
||||
| `device_pubkey` | Ed25519 public key (32 bytes) | Device signing key |
|
||||
| `created_at` | Timestamp | Device registration time |
|
||||
| `status` | Enum | `active`, `revoked` |
|
||||
|
||||
### Sessions / Tokens
|
||||
|
||||
| Field | Type | Description |
|
||||
|-------|------|-------------|
|
||||
| `session_id` | UUID | Unique session identifier |
|
||||
| `account_id` | UUID | Owning account |
|
||||
| `device_id` | UUID | Originating device |
|
||||
| `access_token` | Opaque bytes | Short-lived bearer token |
|
||||
| `refresh_token` | Opaque bytes | Long-lived token for renewal |
|
||||
| `expires_at` | Timestamp | Access token expiry |
|
||||
| `created_at` | Timestamp | Session creation time |
|
||||
|
||||
### Identity Binding
|
||||
|
||||
| Field | Type | Description |
|
||||
|-------|------|-------------|
|
||||
| `account_id` | UUID | Owning account |
|
||||
| `mls_identity_key` | Ed25519 public key (32 bytes) | MLS credential public key |
|
||||
| `verified_fp` | SHA-256 fingerprint (32 bytes) | Fingerprint of the bound key |
|
||||
|
||||
The identity binding table ensures that only the account that registered an
|
||||
Ed25519 public key can upload KeyPackages for that key. This prevents a
|
||||
compromised or malicious client from uploading KeyPackages under another
|
||||
account's identity.
|
||||
|
||||
---
|
||||
|
||||
## Wire / API Changes
|
||||
|
||||
### Auth Struct
|
||||
|
||||
A new `Auth` struct is added to all `NodeService` RPC methods:
|
||||
|
||||
```capnp
|
||||
struct Auth {
|
||||
version @0 :UInt16; # 0 = legacy (no auth), 1 = token-based
|
||||
accessToken @1 :Data; # opaque bearer token
|
||||
deviceId @2 :Data; # optional UUID (16 bytes) for audit/rate limit
|
||||
}
|
||||
```
|
||||
|
||||
The `Auth` struct is included as a parameter in `enqueue`, `fetch`, `fetchWait`,
|
||||
`uploadKeyPackage`, and `fetchKeyPackage`.
|
||||
|
||||
### Versioning
|
||||
|
||||
| Version | Meaning |
|
||||
|---------|---------|
|
||||
| 0 | Legacy mode: no authentication. Server can allow-list in development but defaults to rejecting in production. |
|
||||
| 1 | Token-based authentication. `accessToken` is required and validated. |
|
||||
|
||||
The server rejects any `version` value higher than its current maximum. This
|
||||
ensures that a newer client connecting to an older server fails cleanly rather
|
||||
than silently skipping auth.
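
The version gate can be sketched as a single pure function. This is a minimal illustration, not the server's actual code: `MAX_AUTH_VERSION`, `AuthMode`, `AuthError`, and `classify_version` are hypothetical names, and `allow_legacy` stands in for the development/production switch.

```rust
/// Highest `Auth.version` this server understands (assumed constant).
const MAX_AUTH_VERSION: u16 = 1;

#[derive(Debug, PartialEq)]
enum AuthMode {
    /// version 0: no authentication performed.
    Legacy,
    /// version 1: `accessToken` must be validated next.
    Token,
}

#[derive(Debug, PartialEq)]
enum AuthError {
    AuthenticationRequired,
    UnsupportedVersion(u16),
}

/// Decide how to treat a request based only on `Auth.version`.
fn classify_version(version: u16, allow_legacy: bool) -> Result<AuthMode, AuthError> {
    match version {
        0 if allow_legacy => Ok(AuthMode::Legacy),
        0 => Err(AuthError::AuthenticationRequired),
        v if v <= MAX_AUTH_VERSION => Ok(AuthMode::Token),
        // Anything newer than the server knows is rejected, never skipped.
        v => Err(AuthError::UnsupportedVersion(v)),
    }
}
```

Keeping the decision free of I/O makes the reject-unknown-versions rule easy to unit test on its own.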
|
||||
|
||||
### Optional Device ID
|
||||
|
||||
The `deviceId` field is optional. When present, the server uses it for:
|
||||
|
||||
- Per-device rate limiting (in addition to per-account limits).
|
||||
- Audit logging (which device performed which operation).
|
||||
- Future: device revocation without revoking the entire account.
|
||||
|
||||
---
|
||||
|
||||
## Server Enforcement
|
||||
|
||||
### Token Validation
|
||||
|
||||
1. Extract `Auth` struct from the incoming RPC.
|
||||
2. If `version == 0` and server is in production mode, reject with
|
||||
`AUTHENTICATION_REQUIRED`.
|
||||
3. If `version == 1`, validate `accessToken`:
|
||||
- Token must exist in the session store.
|
||||
- Token must not be expired (`expires_at > now`).
|
||||
- Associated account must have `status == active`.
|
||||
- Associated device (if `deviceId` present) must have `status == active`.
|
||||
4. Map validated token to `(account_id, device_id)` for downstream authorisation.
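
Steps 3 and 4 above could look roughly like the following, assuming an in-memory store. All type and field names here (`AuthStore`, `Session`, `Status`, `TokenError`) are illustrative, timestamps are Unix seconds, and the check that a supplied `deviceId` matches the session's device is an extra assumption not stated in the list.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Active,
    /// Collapses `suspended`, `deleted`, and `revoked` for this sketch.
    Disabled,
}

#[derive(Debug, PartialEq)]
enum TokenError {
    UnknownToken,
    TokenExpired,
    AccountInactive,
    DeviceInactive,
}

/// Server-side session record; IDs are 16-byte UUIDs.
struct Session {
    account_id: [u8; 16],
    device_id: [u8; 16],
    expires_at: u64,
}

/// Hypothetical in-memory stores keyed by token or ID.
struct AuthStore {
    sessions: HashMap<Vec<u8>, Session>,
    accounts: HashMap<[u8; 16], Status>,
    devices: HashMap<[u8; 16], Status>,
}

impl AuthStore {
    /// Validate the token and map it to `(account_id, device_id)`.
    fn validate(
        &self,
        token: &[u8],
        device_id: Option<[u8; 16]>,
        now: u64,
    ) -> Result<([u8; 16], [u8; 16]), TokenError> {
        let session = self.sessions.get(token).ok_or(TokenError::UnknownToken)?;
        // The token is valid only while `expires_at > now`.
        if session.expires_at <= now {
            return Err(TokenError::TokenExpired);
        }
        if self.accounts.get(&session.account_id) != Some(&Status::Active) {
            return Err(TokenError::AccountInactive);
        }
        if let Some(id) = device_id {
            // Assumption: a supplied device must be the session's device.
            if id != session.device_id || self.devices.get(&id) != Some(&Status::Active) {
                return Err(TokenError::DeviceInactive);
            }
        }
        Ok((session.account_id, session.device_id))
    }
}
```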
|
||||
|
||||
### Identity Matching
|
||||
|
||||
- **uploadKeyPackage:** The `identityKey` in the RPC must match an identity
|
||||
binding for the authenticated account. Reject with `IDENTITY_MISMATCH` if the
|
||||
key is not bound to the caller's account.
|
||||
- **fetchKeyPackage:** No identity restriction (any authenticated client can
|
||||
fetch any identity's KeyPackage -- this is required for the MLS add-member flow).
|
||||
- **enqueue:** If `channelId` is present, the caller's identity must be in the
|
||||
channel membership. If `channelId` is absent (legacy mode), the operation is
|
||||
allowed for any authenticated client.
|
||||
- **fetch / fetchWait:** The `recipientKey` must correspond to an identity bound
|
||||
to the caller's account.
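
As an illustration of the ownership rule shared by `uploadKeyPackage` and `fetch` / `fetchWait`, here is a small sketch over an in-memory binding table. `Bindings` and `require_own_key` are hypothetical names, and reusing `IDENTITY_MISMATCH` for the fetch path is an assumption; `fetchKeyPackage` would simply skip this check.

```rust
use std::collections::HashMap;

type AccountId = [u8; 16];
type IdentityKey = [u8; 32];

#[derive(Debug, PartialEq)]
enum AuthzError {
    IdentityMismatch,
}

/// Identity binding table: which account registered which MLS identity key.
struct Bindings(HashMap<IdentityKey, AccountId>);

impl Bindings {
    /// True if `key` is bound to `account`.
    fn is_bound(&self, account: &AccountId, key: &IdentityKey) -> bool {
        self.0.get(key) == Some(account)
    }

    /// The key named in the RPC must belong to the authenticated caller.
    fn require_own_key(&self, caller: &AccountId, key: &IdentityKey) -> Result<(), AuthzError> {
        if self.is_bound(caller, key) {
            Ok(())
        } else {
            Err(AuthzError::IdentityMismatch)
        }
    }
}
```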
|
||||
|
||||
### Rate Limits
|
||||
|
||||
| Limit | Scope | Default |
|
||||
|-------|-------|---------|
|
||||
| Request rate | Per IP | 50 requests/second |
|
||||
| Request rate | Per account | 50 requests/second |
|
||||
| Request rate | Per device | 50 requests/second |
|
||||
| Payload size | Per RPC call | 5 MB |
|
||||
| KeyPackage TTL | Per package | 24 hours |
|
||||
| KeyPackage uploads | Per account | Configurable (prevents store exhaustion) |
|
||||
|
||||
Rate limit counters use a sliding window. When a limit is exceeded, the server
|
||||
responds with `RATE_LIMITED` and includes a `Retry-After` hint.
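
One way to realise a sliding-window counter with a retry hint is a queue of recent request times per scope. The sketch below is an assumption about the shape of such a limiter, not the planned implementation; `SlidingWindow` and `check` are made-up names, and one instance would exist per IP, account, or device.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Sliding-window counter for one scope (one IP, account, or device).
struct SlidingWindow {
    limit: usize,
    window: Duration,
    hits: VecDeque<Instant>,
}

impl SlidingWindow {
    fn new(limit: usize, window: Duration) -> Self {
        Self { limit, window, hits: VecDeque::new() }
    }

    /// Record a request at `now`, or return `Err(retry_after)` when over the limit.
    fn check(&mut self, now: Instant) -> Result<(), Duration> {
        // Drop hits that have left the window.
        while let Some(&oldest) = self.hits.front() {
            if now.duration_since(oldest) >= self.window {
                self.hits.pop_front();
            } else {
                break;
            }
        }
        if self.hits.len() >= self.limit {
            // A slot frees up when the oldest remaining hit leaves the window.
            let retry_after = match self.hits.front() {
                Some(&oldest) => self.window - now.duration_since(oldest),
                None => self.window, // only reachable when limit == 0
            };
            return Err(retry_after);
        }
        self.hits.push_back(now);
        Ok(())
    }
}
```

With the defaults in the table, each scope would use `SlidingWindow::new(50, Duration::from_secs(1))`. Memory per scope is bounded by `limit`, since rejected requests are not recorded.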
|
||||
|
||||
### Audit Logging
|
||||
|
||||
The following events are logged at audit level:
|
||||
|
||||
- Authentication success (account, device, IP).
|
||||
- Authentication failure (reason, IP).
|
||||
- Token issuance and refresh (account, device).
|
||||
- KeyPackage upload (account, identity key fingerprint).
|
||||
- Enqueue (account, channel, recipient).
|
||||
- Fetch / fetchWait (account, recipient).
|
||||
- Rate limit exceeded (scope, account/IP, current rate).
|
||||
|
||||
All audit log entries include a timestamp and correlation ID. Sensitive fields
|
||||
(token values, ciphertext, private keys) are never logged.
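
A simple way to make the "never logged" rule hard to violate is to wrap sensitive values in a type whose `Debug` output is redacted. The following is a sketch under that assumption; `Redacted` and `AuthSuccess` are hypothetical types, not part of the codebase.

```rust
use std::fmt;

/// Wrapper whose `Debug` output never reveals the inner bytes.
struct Redacted(Vec<u8>);

impl fmt::Debug for Redacted {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Only the length is printed, never the content.
        write!(f, "<redacted {} bytes>", self.0.len())
    }
}

/// Example audit event; the token can be carried around but not printed.
#[derive(Debug)]
struct AuthSuccess {
    correlation_id: u64,
    account_id: [u8; 16],
    access_token: Redacted,
}
```

Any audit event that derives `Debug` then inherits the redaction for free, so a stray `{:?}` in a log statement cannot leak a token.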
|
||||
|
||||
---
|
||||
|
||||
## Client Changes
|
||||
|
||||
### Login / Register Flow
|
||||
|
||||
1. **Register:** Client generates an Ed25519 identity keypair, sends the public
|
||||
key to the server. Server creates an account, binds the identity key, and
|
||||
returns an `(access_token, refresh_token)` pair.
|
||||
2. **Login:** Client presents credentials (initially: signed challenge from
|
||||
device key). Server validates and issues tokens.
|
||||
3. **Token storage:** Access and refresh tokens stored in the client state file
|
||||
(same location as identity keypair). The state file should be
|
||||
permission-restricted (`0600`).
|
||||
4. **Token refresh:** Client detects `TOKEN_EXPIRED` errors and uses the refresh
|
||||
token to obtain a new access token without re-authenticating.
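
The refresh step (item 4) can be expressed as a small wrapper around any RPC call. This is a hedged sketch: `RpcError`, `with_refresh`, and the closure-based shape are assumptions for illustration, and a real client would be async and would persist the new token to the state file.

```rust
#[derive(Debug, PartialEq)]
enum RpcError {
    TokenExpired,
    Other(String),
}

/// Run `call` with the current access token; on `TOKEN_EXPIRED`,
/// refresh once and retry. A second failure is returned to the caller.
fn with_refresh<T>(
    access_token: &mut Vec<u8>,
    mut call: impl FnMut(&[u8]) -> Result<T, RpcError>,
    mut refresh: impl FnMut() -> Result<Vec<u8>, RpcError>,
) -> Result<T, RpcError> {
    match call(access_token.as_slice()) {
        Err(RpcError::TokenExpired) => {
            // Exchange the refresh token for a new access token.
            *access_token = refresh()?;
            call(access_token.as_slice())
        }
        other => other,
    }
}
```

Retrying only once avoids a loop when the refresh token itself has been revoked.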
|
||||
|
||||
### RPC Integration
|
||||
|
||||
Every RPC call includes the `Auth` struct:
|
||||
|
||||
```rust
|
||||
// Pseudocode for client RPC calls
|
||||
let auth = Auth {
|
||||
version: 1,
|
||||
access_token: state.access_token.clone(),
|
||||
device_id: Some(state.device_id),
|
||||
};
|
||||
node_service.enqueue(auth, recipient_key, channel_id, payload).await?;
|
||||
```
|
||||
|
||||
### Identity Binding
|
||||
|
||||
At registration, the client's Ed25519 public key is bound to the new account.
|
||||
The client must refuse to upload KeyPackages if the local identity key does not
|
||||
match the bound key -- this prevents accidental identity confusion after key
|
||||
rotation.
|
||||
|
||||
---
|
||||
|
||||
## Compatibility
|
||||
|
||||
### Wire Version Field
|
||||
|
||||
The `Auth` struct includes its own `version` field, independent of the delivery
|
||||
message version. This allows auth changes to evolve separately from the delivery
|
||||
protocol.
|
||||
|
||||
### Legacy Support
|
||||
|
||||
- `version == 0`: No auth. Server behaviour is configurable:
|
||||
- **Development:** Allow legacy calls (default for `cargo run`).
|
||||
- **Production:** Reject legacy calls (default for Docker deployment).
|
||||
- `version == 1`: Full auth. This is the target for M4+.
|
||||
|
||||
### N-1 Integration Tests
|
||||
|
||||
Compatibility testing covers:
|
||||
|
||||
- New client (v1 auth) against new server -- expected: full auth flow works.
|
||||
- Old client (v0 legacy) against new server in dev mode -- expected: legacy
|
||||
calls succeed.
|
||||
- Old client (v0 legacy) against new server in prod mode -- expected: clean
|
||||
rejection with `AUTHENTICATION_REQUIRED`.
|
||||
- New client (v1 auth) against old server -- expected: server ignores unknown
|
||||
`Auth` struct fields; operations succeed if server does not enforce auth.
|
||||
|
||||
---
|
||||
|
||||
## Implementation Sequence
|
||||
|
||||
1. Extend Cap'n Proto schemas with the `Auth` struct and add it to all
|
||||
`NodeService` methods.
|
||||
2. Implement token validation middleware in server RPC handlers; add an in-memory
|
||||
token store (upgradeable to SQLite at M6).
|
||||
3. Bind `identityKey` to account on upload; enforce on fetch/enqueue.
|
||||
4. Add tests: unit tests for token validation; integration tests for auth
|
||||
success and failure paths.
|
||||
5. Add rate limiting middleware with configurable thresholds.
|
||||
6. Add audit logging for all auth-related events.
|
||||
|
||||
---
|
||||
|
||||
## Cross-references
|
||||
|
||||
- [Milestones](milestones.md) -- M4 and M6 deliverables
|
||||
- [Production Readiness WBS](production-readiness.md) -- Phase 3 (Auth/Device/Server Hardening)
|
||||
- [1:1 Channel Design](dm-channels.md) -- channel-level authz
|
||||
- [Wire Format: NodeService Schema](../wire-format/node-service-schema.md) -- RPC schema
|
||||
- [Coding Standards](../contributing/coding-standards.md) -- security-by-design requirements
|
||||
261
docs/src/roadmap/dm-channels.md
Normal file
@@ -0,0 +1,261 @@
|
||||
# 1:1 Channel Design
|
||||
|
||||
This page describes the design for first-class 1:1 (direct message) channels in
|
||||
quicnprotochat. Channels provide per-conversation authorisation, MLS-encrypted
|
||||
payloads, message retention with TTL eviction, and backward compatibility with
|
||||
the legacy delivery model.
|
||||
|
||||
For the broader roadmap context, see [Milestones](milestones.md) and
|
||||
[Production Readiness WBS](production-readiness.md) (Phase 4).
|
||||
|
||||
---
|
||||
|
||||
## Goals
|
||||
|
||||
1. **First-class 1:1 channels.** Each conversation between two participants has
|
||||
a unique `channelId`, enabling per-channel authorisation, storage, and
|
||||
eviction.
|
||||
2. **Per-channel authorisation.** The server enforces that only the two channel
|
||||
members can enqueue and fetch messages for a given channel.
|
||||
3. **MLS-encrypted payloads.** All message content is MLS ciphertext. The server
|
||||
never sees plaintext. Channel metadata (ID + participant keys) is the only
|
||||
information the server holds.
|
||||
4. **7-day message retention.** Messages older than 7 days are evicted. This is
|
||||
configurable but defaults to 7 days.
|
||||
5. **24-hour KeyPackage TTL.** KeyPackages expire after 24 hours. Clients must
|
||||
rotate KeyPackages before expiry to remain reachable.
|
||||
|
||||
---
|
||||
|
||||
## Schema Changes (Cap'n Proto)
|
||||
|
||||
### New Fields
|
||||
|
||||
The following fields are added to the existing `NodeService` RPC methods:
|
||||
|
||||
| RPC Method | New Field | Type | Description |
|
||||
|------------|-----------|------|-------------|
|
||||
| `enqueue` | `channelId` | `Data` (UUID, 16 bytes) | Target channel |
|
||||
| `fetch` | `channelId` | `Data` (UUID, 16 bytes) | Channel to fetch from |
|
||||
| `fetchWait` | `channelId` | `Data` (UUID, 16 bytes) | Channel to long-poll |
|
||||
| All messages | `version` | `UInt16` | Wire version for forward compat |
|
||||
|
||||
### Version Field
|
||||
|
||||
The `version` field on delivery messages allows the server to reject messages
|
||||
with unknown versions. The current version is `1`. Clients that do not set
|
||||
`channelId` are treated as version `0` (legacy mode).
|
||||
|
||||
### New RPC Method
|
||||
|
||||
A new `createChannel` method is added to `NodeService`:
|
||||
|
||||
```capnp
|
||||
createChannel @N (
|
||||
auth :Auth,
|
||||
peerKey :Data # Ed25519 public key of the other participant
|
||||
) -> (
|
||||
channelId :Data # UUID, 16 bytes
|
||||
);
|
||||
```
|
||||
|
||||
The server generates the `channelId`, stores the membership, and returns the ID
|
||||
to the caller. The peer discovers the channel when they receive a message
|
||||
addressed to it (or via a separate discovery mechanism in a future milestone).
|
||||
|
||||
---
|
||||
|
||||
## AuthZ Model
|
||||
|
||||
### Channel Membership
|
||||
|
||||
Each channel has exactly two members, identified by their Ed25519 public keys:
|
||||
|
||||
```
|
||||
Channel {
|
||||
channelId: UUID (16 bytes)
|
||||
members: {a_key: Ed25519PubKey, b_key: Ed25519PubKey}
|
||||
created_at: Timestamp
|
||||
}
|
||||
```
|
||||
|
||||
The server stores this mapping and enforces it on every operation.
|
||||
|
||||
### Enqueue Authorisation
|
||||
|
||||
When a client calls `enqueue(auth, channelId, recipientKey, payload)`:
|
||||
|
||||
1. Validate the `Auth` token (see [Auth, Devices, and Tokens](authz-plan.md)).
|
||||
2. Look up the channel by `channelId`.
|
||||
3. Verify that the caller's identity (from the token) is one of the channel's
|
||||
two members.
|
||||
4. Verify that `recipientKey` is the *other* member of the channel (prevents
|
||||
sending to yourself or to a non-member).
|
||||
5. Apply rate limits (50 requests/second per identity, 5 MB payload cap).
|
||||
6. Enqueue the payload.
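
The membership, recipient, and payload-size checks (steps 3 to 5, without the request-rate counter) might be sketched as below. The names are illustrative, and treating 5 MB as 5 × 1024 × 1024 bytes is an assumption.

```rust
type IdentityKey = [u8; 32];

struct Channel {
    member_a: IdentityKey,
    member_b: IdentityKey,
}

#[derive(Debug, PartialEq)]
enum EnqueueError {
    NotAMember,
    BadRecipient,
    PayloadTooLarge,
}

/// Payload cap; the exact byte count is an assumption.
const MAX_PAYLOAD_BYTES: usize = 5 * 1024 * 1024;

/// Caller must be a member, and the recipient must be the *other* member.
fn authorize_enqueue(
    ch: &Channel,
    caller: &IdentityKey,
    recipient: &IdentityKey,
    payload_len: usize,
) -> Result<(), EnqueueError> {
    let other = if *caller == ch.member_a {
        &ch.member_b
    } else if *caller == ch.member_b {
        &ch.member_a
    } else {
        return Err(EnqueueError::NotAMember);
    };
    if recipient != other {
        return Err(EnqueueError::BadRecipient);
    }
    if payload_len > MAX_PAYLOAD_BYTES {
        return Err(EnqueueError::PayloadTooLarge);
    }
    Ok(())
}
```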
|
||||
|
||||
### Fetch Authorisation
|
||||
|
||||
When a client calls `fetch(auth, channelId, recipientKey)` or
|
||||
`fetchWait(auth, channelId, recipientKey, timeout)`:
|
||||
|
||||
1. Validate the `Auth` token.
|
||||
2. Verify that the caller's identity matches `recipientKey`.
|
||||
3. Verify that `recipientKey` is a member of the specified channel.
|
||||
4. Return messages for `(channelId, recipientKey)`, filtering out expired
|
||||
messages (TTL check).
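
Steps 2 and 3 of the fetch path reduce to two membership tests. In this sketch, `caller_keys` is assumed to be the list of identity keys bound to the authenticated account, and the type and error names are hypothetical.

```rust
type IdentityKey = [u8; 32];

struct Channel {
    member_a: IdentityKey,
    member_b: IdentityKey,
}

#[derive(Debug, PartialEq)]
enum FetchError {
    NotYourKey,
    NotAMember,
}

/// The recipient key must belong to the caller and to the channel.
fn authorize_fetch(
    ch: &Channel,
    caller_keys: &[IdentityKey],
    recipient: &IdentityKey,
) -> Result<(), FetchError> {
    if !caller_keys.contains(recipient) {
        return Err(FetchError::NotYourKey);
    }
    if *recipient != ch.member_a && *recipient != ch.member_b {
        return Err(FetchError::NotAMember);
    }
    Ok(())
}
```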
|
||||
|
||||
---
|
||||
|
||||
## Storage Model
|
||||
|
||||
### Channels Table
|
||||
|
||||
| Column | Type | Description |
|
||||
|--------|------|-------------|
|
||||
| `channel_id` | UUID (16 bytes) | Primary key |
|
||||
| `member_a_key` | Ed25519 public key (32 bytes) | First member |
|
||||
| `member_b_key` | Ed25519 public key (32 bytes) | Second member |
|
||||
| `created_at` | Timestamp | Channel creation time |
|
||||
|
||||
A unique constraint on `(member_a_key, member_b_key)` (sorted) prevents
|
||||
duplicate channels between the same pair of identities.
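
The "sorted" part of the constraint means the pair is put into a canonical order before it is stored, so that Alice creating a channel with Bob and Bob creating one with Alice hit the same row. A minimal sketch, assuming byte-wise ordering of the keys and assuming that a channel with oneself is rejected:

```rust
use std::cmp::Ordering;

type IdentityKey = [u8; 32];

/// Order the two member keys so `(x, y)` and `(y, x)` map to the same row.
/// Returns `None` when both keys are equal.
fn canonical_pair(x: IdentityKey, y: IdentityKey) -> Option<(IdentityKey, IdentityKey)> {
    match x.cmp(&y) {
        Ordering::Less => Some((x, y)),
        Ordering::Greater => Some((y, x)),
        // Assumption: a channel needs two distinct identities.
        Ordering::Equal => None,
    }
}
```

The smaller key would be stored as `member_a_key` and the larger as `member_b_key`, which lets an ordinary unique index enforce the rule.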
|
||||
|
||||
### Delivery Queue
|
||||
|
||||
Messages are keyed by `(channelId, recipient_key)`:
|
||||
|
||||
| Column | Type | Description |
|
||||
|--------|------|-------------|
|
||||
| `channel_id` | UUID (16 bytes) | Channel |
|
||||
| `recipient_key` | Ed25519 public key (32 bytes) | Intended recipient |
|
||||
| `payload` | Bytes | MLS ciphertext (opaque to server) |
|
||||
| `received_at` | Timestamp | Server receive time |
|
||||
| `sequence_no` | UInt64 | Per-channel, per-recipient monotonic counter |
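
To illustrate the keying and the per-channel, per-recipient counter, here is an in-memory sketch. `DeliveryStore` and `Queue` are hypothetical names; starting sequence numbers at 0 and draining the queue on fetch are assumptions made for the example.

```rust
use std::collections::HashMap;

type ChannelId = [u8; 16];
type IdentityKey = [u8; 32];

/// One FIFO queue per `(channel_id, recipient_key)`.
#[derive(Default)]
struct Queue {
    next_seq: u64,
    items: Vec<(u64, Vec<u8>)>,
}

#[derive(Default)]
struct DeliveryStore {
    queues: HashMap<(ChannelId, IdentityKey), Queue>,
}

impl DeliveryStore {
    /// Append a payload and return its sequence number.
    fn enqueue(&mut self, ch: ChannelId, to: IdentityKey, payload: Vec<u8>) -> u64 {
        let q = self.queues.entry((ch, to)).or_default();
        let seq = q.next_seq;
        q.next_seq += 1;
        q.items.push((seq, payload));
        seq
    }

    /// Drain all queued payloads for the recipient, oldest first.
    fn fetch(&mut self, ch: ChannelId, to: IdentityKey) -> Vec<(u64, Vec<u8>)> {
        self.queues
            .get_mut(&(ch, to))
            .map(|q| std::mem::take(&mut q.items))
            .unwrap_or_default()
    }
}
```

Because the counter lives inside each queue, the two members of a channel have independent sequences, and draining a queue does not reset its counter.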
|
||||
|
||||
### TTL Eviction
|
||||
|
||||
Messages are evicted in two ways:
|
||||
|
||||
1. **Fetch-time check:** When a client fetches messages, the server filters out
|
||||
any message where `received_at + TTL < now`. This is the primary eviction
|
||||
path.
|
||||
2. **Background sweep:** A periodic task (configurable interval, default 1 hour)
|
||||
scans for and deletes expired messages. This prevents unbounded storage
|
||||
growth from inactive channels.
|
||||
|
||||
Default TTL values:
|
||||
|
||||
| Entity | TTL | Configurable |
|
||||
|--------|-----|-------------|
|
||||
| Messages | 7 days | Yes |
|
||||
| KeyPackages | 24 hours | Yes |
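
Both eviction paths can share one predicate. The sketch below assumes timestamps in Unix seconds and follows the rule stated above, under which a message is expired when `received_at + TTL < now`; the function and constant names are illustrative.

```rust
use std::time::Duration;

struct Message {
    received_at: u64, // Unix seconds
    payload: Vec<u8>,
}

/// Default message TTL of 7 days.
const MESSAGE_TTL: Duration = Duration::from_secs(7 * 24 * 60 * 60);

/// A message is live unless `received_at + ttl < now`.
fn is_live(msg: &Message, now: u64, ttl: Duration) -> bool {
    msg.received_at.saturating_add(ttl.as_secs()) >= now
}

/// Fetch-time check: return only non-expired messages.
fn fetch_live(queue: &[Message], now: u64, ttl: Duration) -> Vec<&Message> {
    queue.iter().filter(|m| is_live(m, now, ttl)).collect()
}

/// Background sweep: delete expired messages and report how many were removed.
fn sweep(queue: &mut Vec<Message>, now: u64, ttl: Duration) -> usize {
    let before = queue.len();
    queue.retain(|m| is_live(m, now, ttl));
    before - queue.len()
}
```

Sharing `is_live` keeps the fetch path and the sweep from disagreeing about the boundary case.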
|
||||
|
||||
---
|
||||
|
||||
## Flows
|
||||
|
||||
### Create Channel
|
||||
|
||||
```
|
||||
Alice Server Bob
|
||||
| | |
|
||||
|-- createChannel(auth, bob_key) | |
|
||||
| |-- generate channelId |
|
||||
| |-- store {channelId, |
|
||||
| | alice_key, bob_key} |
|
||||
|<- channelId ------------------| |
|
||||
| | |
|
||||
```
|
||||
|
||||
Alice receives the `channelId` and can now send messages to Bob on this channel.
|
||||
Bob discovers the channel when he receives the first message (the `channelId` is
|
||||
included in the delivery metadata).
|
||||
|
||||
### Send (with AuthZ)
|
||||
|
||||
```
|
||||
Alice Server
|
||||
| |
|
||||
|-- enqueue(auth, channelId, |
|
||||
| bob_key, mls_ciphertext) |
|
||||
| |-- validate auth token
|
||||
| |-- lookup channel membership
|
||||
| |-- verify alice_key in members
|
||||
| |-- verify bob_key is recipient
|
||||
| |-- check rate limits
|
||||
| |-- store (channelId, bob_key,
|
||||
| | payload, received_at, seq)
|
||||
|<- ok (sequence_no) ------------|
|
||||
| |
|
||||
```
|
||||
|
||||
### Receive (with TTL)
|
||||
|
||||
```
|
||||
Bob Server
|
||||
| |
|
||||
|-- fetchWait(auth, channelId, |
|
||||
| bob_key, timeout) |
|
||||
| |-- validate auth token
|
||||
| |-- verify bob_key in channel
|
||||
| |-- query (channelId, bob_key)
|
||||
| |-- filter: received_at + 7d > now
|
||||
| |-- return non-expired messages
|
||||
|<- messages[] ------------------|
|
||||
| |
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Backward Compatibility
|
||||
|
||||
### Legacy Mode (channelId = nil)
|
||||
|
||||
When `channelId` is empty or absent:
|
||||
|
||||
- The server treats the request as a legacy delivery (pre-channel behaviour).
|
||||
- Messages are routed solely by `recipientKey`, without channel-level authz.
|
||||
- This mode can be disabled in production via server configuration.
|
||||
|
||||
### Version Negotiation
|
||||
|
||||
The `version` field on delivery messages allows clean rejection of future schema
|
||||
changes:
|
||||
|
||||
| Version | Behavior |
|
||||
|---------|----------|
|
||||
| 0 | Legacy mode: no `channelId`, no per-channel authz |
|
||||
| 1 | Channel-aware: `channelId` required, authz enforced |
|
||||
|
||||
The server rejects messages with `version > max_supported`.
|
||||
|
||||
---
|
||||
|
||||
## Open Items
|
||||
|
||||
These items are deferred to future milestones:
|
||||
|
||||
- **Persistence backend:** The current `DashMap`-based store must be extended to
|
||||
SQLite (or SQLCipher) for durable channel and delivery state. See
|
||||
[Milestones: M6](milestones.md#m6----persistence-planned).
|
||||
- **Channel discovery API:** A dedicated RPC for Bob to discover channels he is
|
||||
a member of, rather than relying on first-message discovery.
|
||||
- **Client UX:** Map peer identity to `channelId` discovery; cache `channelId`
|
||||
in the client state file.
|
||||
- **Audit logging:** Log channel creation, authz failures, send/recv events with
|
||||
redaction of ciphertext. See [Auth, Devices, and Tokens](authz-plan.md) for
|
||||
the audit logging design.
|
||||
- **Multi-device:** A single account on multiple devices sharing the same
|
||||
channel. Requires per-device delivery queues and MLS multi-device support.
|
||||
|
||||
---
|
||||
|
||||
## Cross-references
|
||||
|
||||
- [Milestones](milestones.md) -- M4 (CLI subcommands) and M6 (persistence)
|
||||
- [Production Readiness WBS](production-readiness.md) -- Phase 4 (Delivery Semantics)
|
||||
- [Auth, Devices, and Tokens](authz-plan.md) -- token validation and identity binding
|
||||
- [Wire Format: Delivery Schema](../wire-format/delivery-schema.md) -- current delivery schema
|
||||
- [Wire Format: NodeService Schema](../wire-format/node-service-schema.md) -- RPC interface
|
||||
- [Architecture Overview](../architecture/overview.md) -- system diagram and service model
|
||||
406
docs/src/roadmap/future-research.md
Normal file
@@ -0,0 +1,406 @@
|
||||
# Future Research Directions
|
||||
|
||||
This page catalogues technologies and research directions that could strengthen
|
||||
quicnprotochat beyond the current [milestone plan](milestones.md). Each entry
|
||||
includes a brief description, the problem it solves, relevant crates or
|
||||
specifications, and how it maps to the project architecture.
|
||||
|
||||
For the production readiness work breakdown, see
|
||||
[Production Readiness WBS](production-readiness.md).
|
||||
|
||||
---
|
||||
|
||||
## Transport and Networking
|
||||
|
||||
### LibP2P / iroh (n0)
|
||||
|
||||
**Problem:** The current architecture is strictly client-server. Clients behind
|
||||
NAT cannot communicate directly, and the server is a single point of failure for
|
||||
delivery.
|
||||
|
||||
**Solution:** [LibP2P](https://libp2p.io/) and [iroh](https://iroh.computer/)
|
||||
(from n0) provide peer discovery, NAT traversal (hole-punching), and relay
|
||||
fallback. iroh is particularly interesting because it is Rust-native and built on
|
||||
QUIC, aligning with quicnprotochat's existing transport layer.
|
||||
|
||||
**Architecture impact:** Move from pure client-server to a hybrid topology where
|
||||
peers communicate directly when possible and fall back to server relay when NAT
|
||||
traversal fails. The server role shifts from mandatory relay to optional
|
||||
rendezvous/relay node.
|
||||
|
||||
**Crates:** `libp2p`, `iroh`, `iroh-net`
|
||||
|
||||
### WebTransport (HTTP/3)
|
||||
|
||||
**Problem:** Browser clients cannot use raw QUIC. The current stack requires a
|
||||
native Rust binary.
|
||||
|
||||
**Solution:** [WebTransport](https://w3c.github.io/webtransport/) exposes
|
||||
QUIC-like semantics (multiplexed bidirectional streams, datagrams) to browsers
|
||||
over HTTP/3. A WebTransport endpoint alongside the existing QUIC listener would
|
||||
enable a web client without WebSocket degradation.
|
||||
|
||||
**Architecture impact:** Add a second listener (HTTP/3 + WebTransport) that
|
||||
terminates WebTransport and bridges into the existing `NodeService` RPC layer.
|
||||
Cap'n Proto serialisation works in WASM via the `capnp` crate.
|
||||
|
||||
**Crates:** `h3`, `h3-webtransport`, `wtransport`
|
||||
|
||||
### Tor / I2P Integration
|
||||
|
||||
**Problem:** MLS protects message content, but connection metadata (who connects
|
||||
to the server, when, how often) leaks to the server and network observers.
|
||||
|
||||
**Solution:** Route client-server connections through
|
||||
[Tor](https://www.torproject.org/) onion services or
|
||||
[I2P](https://geti2p.net/) tunnels. This provides metadata resistance at the
|
||||
network layer.
|
||||
|
||||
**Architecture impact:** The server exposes a `.onion` address (Tor) or an I2P
|
||||
destination. Clients connect through the anonymity network. Latency increases
|
||||
significantly, so this should be optional.
|
||||
|
||||
**Crates:** `arti` (Tor client in Rust), `arti-client`
|
||||
|
||||
---
|
||||
|
||||
## Storage and Persistence
|
||||
|
||||
### SQLCipher / libsql (Turso)
|
||||
|
||||
**Problem:** At M6, quicnprotochat needs persistent storage for group state, key
|
||||
material, and message queues. Storing private keys in a plaintext SQLite database
|
||||
is insufficient.
|
||||
|
||||
**Solution:** [SQLCipher](https://www.zetetic.net/sqlcipher/) provides
|
||||
transparent, page-level AES-256 encryption for SQLite. Alternatively,
|
||||
[libsql](https://turso.tech/libsql) (Turso) offers a SQLite fork with
|
||||
encryption, replication, and embedded server capabilities.
|
||||
|
||||
**Architecture impact:** Replace the `sqlx` SQLite backend with SQLCipher.
|
||||
Encryption key derived from a user-provided passphrase (via Argon2id) or a
|
||||
hardware-backed key.
|
||||
|
||||
**Crates:** `rusqlite` (with `bundled-sqlcipher` feature), `libsql`
|
||||
|
||||
### CRDTs (Automerge / Yrs)
|
||||
|
||||
**Problem:** Multi-device support requires synchronising state (group membership,
|
||||
read receipts, settings) across devices without a central authority resolving
|
||||
conflicts.
|
||||
|
||||
**Solution:** Conflict-free replicated data types (CRDTs) allow concurrent edits
|
||||
to converge without coordination. [Automerge](https://automerge.org/) and
|
||||
[Yrs](https://docs.rs/yrs/) (Yjs in Rust) provide production-quality CRDT
|
||||
implementations.
|
||||
|
||||
**Architecture impact:** Client-side state (contact list, group membership
|
||||
cache, read markers) stored as CRDT documents. Synchronisation happens over the
|
||||
existing MLS-encrypted channel, ensuring the server never sees the state.
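
To show the convergence property on the simplest piece of client state, here is a hand-rolled sketch of read markers as a map of max-registers. It does not use the `automerge` or `yrs` APIs; `ReadMarkers` and its methods are hypothetical.

```rust
use std::collections::HashMap;

/// Per-channel read markers: channel ID -> highest sequence number read.
/// Merging takes the per-channel maximum, so merges are commutative,
/// associative, and idempotent.
#[derive(Clone, Debug, Default, PartialEq)]
struct ReadMarkers(HashMap<[u8; 16], u64>);

impl ReadMarkers {
    /// Advance the marker for `ch`; it never moves backwards.
    fn mark_read(&mut self, ch: [u8; 16], seq: u64) {
        let entry = self.0.entry(ch).or_insert(0);
        *entry = (*entry).max(seq);
    }

    /// Fold another device's state into this one.
    fn merge(&mut self, other: &ReadMarkers) {
        for (ch, &seq) in &other.0 {
            self.mark_read(*ch, seq);
        }
    }
}
```

Two devices that exchange their markers in either order end up with identical state, which is the property a full CRDT library would provide for richer documents.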
|
||||
|
||||
**Crates:** `automerge`, `yrs`
|
||||
|
||||
### Object Storage (S3-compatible)
|
||||
|
||||
**Problem:** Encrypted file and media attachments need a storage backend that
|
||||
the server can host without seeing the content.
|
||||
|
||||
**Solution:** An S3-compatible object store (MinIO, Garage, or a cloud provider)
|
||||
for encrypted blobs. Clients encrypt attachments client-side (using a key derived
|
||||
from the MLS group secret) and upload the ciphertext. The server stores and
|
||||
serves opaque blobs.
|
||||
|
||||
**Architecture impact:** Add a media upload/download RPC to `NodeService`. The
|
||||
server proxies to the object store or returns pre-signed URLs.
|
||||
|
||||
**Crates:** `aws-sdk-s3`, `opendal`
|
||||
|
||||
---
|
||||
|
||||
## Cryptography and Privacy
|
||||
|
||||
### ML-KEM + ML-DSA Hybrid (Post-Quantum MLS)
|
||||
|
||||
**Problem:** Quantum computers threaten X25519 and Ed25519. Key exchanges that
rely on X25519 alone, including those using MLS init keys, are exposed to
harvest-now-decrypt-later attacks, and Ed25519 credential signatures could be
forged once a sufficiently large quantum computer exists.
|
||||
|
||||
**Solution:** Hybrid X25519 + ML-KEM-768 KEM for MLS init keys, and optionally
|
||||
hybrid Ed25519 + ML-DSA-65 for credential signatures. The `ml-kem` crate is
|
||||
already vendored in the workspace.
|
||||
|
||||
**Architecture impact:** Custom `OpenMlsCryptoProvider` in `quicnprotochat-core`
|
||||
implementing the hybrid combiner. This is the M7 milestone -- see
|
||||
[Milestones](milestones.md#m7----post-quantum-planned) and
|
||||
[Hybrid KEM](../protocol-layers/hybrid-kem.md).
|
||||
|
||||
**Crates:** `ml-kem`, `ml-dsa`
|
||||
|
||||
**References:** NIST FIPS 203 (ML-KEM), `draft-ietf-tls-hybrid-design`
|
||||
|
||||
### Private Information Retrieval (PIR)
|
||||
|
||||
**Problem:** When a client fetches messages or KeyPackages, the server learns
|
||||
*which* recipient is requesting -- even though it cannot read the content.
|
||||
|
||||
**Solution:** Private Information Retrieval (PIR) allows a client to fetch a
|
||||
record from the server without revealing which record was requested.
|
||||
[SealPIR](https://github.com/microsoft/SealPIR) and SimplePIR provide practical
|
||||
constructions.
|
||||
|
||||
**Architecture impact:** Replace the `fetch` / `fetchKeyPackage` RPCs with PIR
queries. This is a significant performance trade-off because PIR has a high
computational cost. It is better suited to KeyPackage fetch (small database) and
should be attempted there before message fetch (large database).
|
||||
|
||||
### Sealed Sender (Signal-style)
|
||||
|
||||
**Problem:** The server sees `(sender, recipient, timestamp)` metadata on every
|
||||
enqueued message. Even without reading content, this metadata reveals social
|
||||
graphs.
|
||||
|
||||
**Solution:** [Sealed Sender](https://signal.org/blog/sealed-sender/) encrypts
|
||||
the sender's identity inside the MLS ciphertext. The server routes by
|
||||
`recipientKey` only and cannot determine who sent the message.
|
||||
|
||||
**Architecture impact:** Modify the `enqueue` RPC to omit sender identity from
|
||||
the server-visible metadata. The sender identity is included only inside the
|
||||
MLS application message (encrypted).
|
||||
|
||||
### Key Transparency (RFC draft)
|
||||
|
||||
**Problem:** A compromised server could substitute public keys, performing a
|
||||
man-in-the-middle attack on MLS group formation.
|
||||
|
||||
**Solution:** A verifiable, append-only log of public key bindings (similar to
|
||||
Certificate Transparency for TLS). Clients verify that the server's response
|
||||
matches the log before trusting a fetched KeyPackage.
|
||||
|
||||
**Architecture impact:** Add a key transparency log (Merkle tree) alongside the
|
||||
Authentication Service. Clients verify inclusion proofs on every `fetchKeyPackage`
|
||||
response.
|
||||
|
||||
**References:** `draft-ietf-keytrans-protocol`
|
||||
|
||||
---
|
||||
|
||||
## Identity and Authentication
|
||||
|
||||
### DIDs (Decentralized Identifiers)
|
||||
|
||||
**Problem:** User identities are currently bound to the server. If the server
|
||||
goes away, identities are lost.
|
||||
|
||||
**Solution:** [Decentralized Identifiers](https://www.w3.org/TR/did-core/)
|
||||
(`did:key`, `did:web`) provide self-sovereign identity. With `did:key`, a user's
DID is derived from their Ed25519 public key and is portable across servers.
|
||||
|
||||
**Architecture impact:** Replace raw Ed25519 public keys in MLS credentials with
|
||||
DID URIs. The server resolves DIDs to public keys for routing.
|
||||
|
||||
**Crates:** `did-key`, `ssi`
|
||||
|
||||
### OPAQUE (aPAKE)
|
||||
|
||||
**Problem:** If quicnprotochat adds password-based account registration, the
|
||||
server must never see the password -- not even a hash.
|
||||
|
||||
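**Solution:** [OPAQUE](https://datatracker.ietf.org/doc/rfc9807/) is an
asymmetric password-authenticated key exchange in which the server never sees the
password and stores only a registration record derived from it. Because the
record depends on a secret OPRF key, an attacker cannot precompute dictionaries
in advance; an offline dictionary attack becomes possible only after the
server's records are stolen.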
|
||||
|
||||
**Architecture impact:** Replace the registration/login flow with OPAQUE. The
|
||||
server stores an OPAQUE registration record; the client runs the OPAQUE protocol
|
||||
to authenticate and derive a session key.
|
||||
|
||||
**Crates:** `opaque-ke`
|
||||
|
||||
**References:** RFC 9497
|
||||
|
||||
### WebAuthn / Passkeys
|
||||
|
||||
**Problem:** Password-based auth (even with OPAQUE) is vulnerable to phishing.
|
||||
Hardware-backed authentication provides stronger device binding.
|
||||
|
||||
**Solution:** [WebAuthn](https://www.w3.org/TR/webauthn-3/) / Passkeys allow
|
||||
authentication via hardware tokens (YubiKey), platform authenticators (Touch ID,
|
||||
Windows Hello), or synced passkeys.
|
||||
|
||||
**Architecture impact:** Add a WebAuthn registration/authentication flow to the
|
||||
account system. Requires a server-side WebAuthn relying party implementation.
|
||||
|
||||
**Crates:** `webauthn-rs`
|
||||
|
||||
### Verifiable Credentials (W3C VC)
|
||||
|
||||
**Problem:** Proving attributes (organization membership, role, age) without
|
||||
revealing full identity.
|
||||
|
||||
**Solution:** [Verifiable Credentials](https://www.w3.org/TR/vc-data-model/)
|
||||
allow a user to present cryptographic proofs of attributes issued by a trusted
|
||||
authority.
|
||||
|
||||
**Architecture impact:** Extend MLS credentials with VC presentation. A group
|
||||
admin could require proof of organization membership before allowing join.
|
||||
|
||||
---
|
||||
|
||||
## Application Layer
|
||||
|
||||
### Matrix-style Federation
|
||||
|
||||
**Problem:** A single server is a single point of failure and a single point of
|
||||
trust. Users on different servers cannot communicate.
|
||||
|
||||
**Solution:** Federation allows multiple quicnprotochat servers to exchange
|
||||
messages, similar to [Matrix](https://matrix.org/) homeserver federation. Each
|
||||
server manages its own users and relays messages to peer servers.
|
||||
|
||||
**Architecture impact:** Major. Requires server-to-server protocol, distributed
|
||||
identity resolution, and cross-server MLS group management.
|
||||
|
||||
### WASM Plugin System
|
||||
|
||||
**Problem:** Extensibility (bots, bridges, custom message types) currently
|
||||
requires forking the codebase.
|
||||
|
||||
**Solution:** A sandboxed WASM plugin system allows third-party extensions to run
|
||||
inside the client or server without access to private key material.
|
||||
|
||||
**Architecture impact:** Define a plugin API (message hooks, command handlers).
|
||||
Plugins compiled to WASM and loaded at runtime via `wasmtime` or `wasmer`.
|
||||
|
||||
**Crates:** `wasmtime`, `wasmer`, `extism`
|
||||
|
||||
### Double-Ratchet DM Layer
|
||||
|
||||
**Problem:** MLS is optimised for groups. For 1:1 conversations, the Signal
protocol (X3DH key agreement followed by the Double Ratchet, formerly known as
Axolotl) avoids the ratchet-tree overhead that MLS carries even when a group
has only two parties.
|
||||
|
||||
**Solution:** Implement a double-ratchet layer for 1:1 DMs, using MLS only for
|
||||
groups with N > 2. The [1:1 Channel Design](dm-channels.md) currently uses MLS
|
||||
for DMs; this would be an optimisation.
|
||||
|
||||
**References:** [The Double Ratchet Algorithm](https://signal.org/docs/specifications/doubleratchet/),
|
||||
[X3DH Key Agreement Protocol](https://signal.org/docs/specifications/x3dh/)
|
||||
|
||||
---
|
||||
|
||||
## Observability and Operations
|
||||
|
||||
### OpenTelemetry (Tracing + Metrics)
|
||||
|
||||
**Problem:** The current logging is `tracing`-based but lacks distributed
|
||||
tracing context and structured metrics export.
|
||||
|
||||
**Solution:** [OpenTelemetry](https://opentelemetry.io/) provides a unified
|
||||
framework for distributed tracing, metrics, and log correlation. OTLP export
|
||||
enables integration with any observability backend.
|
||||
|
||||
**Architecture impact:** Add `tracing-opentelemetry` and `opentelemetry-otlp`
|
||||
to the server. Instrument RPC handlers with spans. Export to Jaeger, Grafana
|
||||
Tempo, or any OTLP-compatible backend.
|
||||
|
||||
**Crates:** `opentelemetry`, `opentelemetry-otlp`, `tracing-opentelemetry`
|
||||
|
||||
### Prometheus + Grafana
|
||||
|
||||
**Problem:** No quantitative visibility into server performance (throughput,
|
||||
latency, queue depth, epoch advancement rate).
|
||||
|
||||
**Solution:** Export Prometheus metrics from the server. Visualise with Grafana
|
||||
dashboards.
|
||||
|
||||
**Metrics to export:** message throughput (enqueue/fetch per second), RPC
|
||||
latency histograms, MLS epoch advancement rate, delivery queue depth, KeyPackage
|
||||
store size, active connections.
|
||||
|
||||
**Crates:** `prometheus`, `metrics`, `metrics-exporter-prometheus`
|
||||
|
||||
### Testcontainers-rs
|
||||
|
||||
**Problem:** Integration tests currently run server and client in the same
|
||||
process (`tokio::spawn`). This does not test real network conditions, container
|
||||
startup, or multi-process interactions.
|
||||
|
||||
**Solution:** [Testcontainers-rs](https://docs.rs/testcontainers/) runs Docker
|
||||
containers from Rust tests, enabling true end-to-end CI with real network
|
||||
boundaries.
|
||||
|
||||
**Architecture impact:** Add testcontainers-based integration tests alongside
|
||||
the existing in-process tests. The Docker image is already maintained.
|
||||
|
||||
**Crates:** `testcontainers`, `testcontainers-modules`
|
||||
|
||||
---
|
||||
|
||||
## Developer Experience
|
||||
|
||||
### Tauri / Dioxus (Native GUI)
|
||||
|
||||
**Problem:** The current interface is CLI-only. A graphical client would broaden
|
||||
the user base for testing and demonstration.
|
||||
|
||||
**Solution:** [Tauri](https://tauri.app/) or [Dioxus](https://dioxuslabs.com/)
|
||||
provide native cross-platform GUI frameworks in Rust. The
|
||||
`quicnprotochat-core` crate can be shared directly with the GUI client.
|
||||
|
||||
**Architecture impact:** Add a `quicnprotochat-gui` crate that depends on
|
||||
`quicnprotochat-core` and `quicnprotochat-proto`. The GUI drives the same
|
||||
`GroupMember` and RPC logic as the CLI client.
|
||||
|
||||
**Crates:** `tauri`, `dioxus`
|
||||
|
||||
### uniffi / diplomat (Mobile FFI)
|
||||
|
||||
**Problem:** Mobile clients (iOS, Android) cannot use the Rust binary directly.
|
||||
|
||||
**Solution:** [uniffi](https://github.com/mozilla/uniffi-rs) (Mozilla) and
[diplomat](https://github.com/rust-diplomat/diplomat) generate idiomatic Swift and
Kotlin bindings from Rust definitions.
|
||||
|
||||
**Architecture impact:** Expose `quicnprotochat-core` through a C-compatible FFI
|
||||
layer. Mobile apps call into the Rust crypto and protocol logic.
|
||||
|
||||
**Crates:** `uniffi`, `diplomat`
|
||||
|
||||
### Nix Flakes
|
||||
|
||||
**Problem:** The development environment requires `capnp` (Cap'n Proto compiler),
|
||||
a specific Rust toolchain version, and test infrastructure. Setup varies across
|
||||
developer machines.
|
||||
|
||||
**Solution:** [Nix flakes](https://nixos.wiki/wiki/Flakes) provide a
|
||||
reproducible, declarative development environment. A single `nix develop`
|
||||
command sets up the toolchain, `capnp`, and all dependencies.
|
||||
|
||||
**Architecture impact:** Add `flake.nix` and `flake.lock` to the repository root.
|
||||
|
||||
---
|
||||
|
||||
## Top 5 Priority Implementations
|
||||
|
||||
The following table ranks the most impactful technologies for near-term adoption,
|
||||
considering the current state of the codebase and the [milestone plan](milestones.md).
|
||||

| Priority | Technology | Why | Unlocks |
|----------|-----------|-----|---------|
| 1 | **Post-quantum hybrid KEM** | `ml-kem` is already vendored in the workspace. Completing the hybrid `OpenMlsCryptoProvider` makes quicnprotochat one of the first PQ MLS implementations. | M7 |
| 2 | **SQLCipher persistence** | Encrypted-at-rest storage is the prerequisite for multi-device support, offline usage, and server restart survival. | M6 |
| 3 | **OPAQUE auth** | Zero-knowledge password authentication is a massive security uplift for the account system. The server never sees or stores passwords. | Phase 3 (authz) |
| 4 | **iroh / LibP2P** | NAT traversal and an optional P2P mesh make quicnprotochat deployable without centralised infrastructure. Aligns with the existing QUIC transport. | Beyond M7 |
| 5 | **Sealed Sender + PIR** | Content encryption is table stakes. Metadata resistance (hiding who talks to whom) is the frontier of private messaging research. | Beyond M7 |

|
||||
---
|
||||
|
||||
## Cross-references
|
||||
|
||||
- [Milestones](milestones.md) -- current milestone tracker
|
||||
- [Production Readiness WBS](production-readiness.md) -- phased work breakdown
|
||||
- [Auth, Devices, and Tokens](authz-plan.md) -- OPAQUE integration point
|
||||
- [1:1 Channel Design](dm-channels.md) -- double-ratchet optimisation context
|
||||
- [Hybrid KEM](../protocol-layers/hybrid-kem.md) -- existing PQ design
|
||||
- [ADR-006: PQ Gap in Noise Transport](../design-rationale/adr-006-pq-gap.md) -- accepted PQ risk
|
||||
- [References](../appendix/references.md) -- standards and crate documentation
|
||||
194
docs/src/roadmap/milestones.md
Normal file
@@ -0,0 +1,194 @@
|
||||
# Milestone Tracker
|
||||
|
||||
This page tracks the project milestones for quicnprotochat, from initial transport
|
||||
layer through post-quantum cryptography. Each milestone produces production-ready,
|
||||
tested, deployable code -- see [Coding Standards](../contributing/coding-standards.md)
|
||||
for what that means in practice.
|
||||
|
||||
---
|
||||
|
||||
## Milestone Summary
|
||||

| # | Name | Status | What it adds |
|---|------|--------|-------------|
| M1 | QUIC/TLS Transport | **Complete** | QUIC + TLS 1.3 endpoint, length-prefixed framing, Ping/Pong |
| M2 | Authentication Service | **Complete** | Ed25519 identity, KeyPackage generation, AS upload/fetch |
| M3 | Delivery Service + MLS Groups | **Complete** | DS relay, GroupMember create/join/add/send/recv |
| M4 | Group CLI Subcommands | **Next** | Persistent CLI (create-group, invite, join, send, recv); `demo-group` already available |
| M5 | Multi-party Groups | Planned | N > 2 members, Commit fan-out, Proposal handling |
| M6 | Persistence | Planned | SQLite key store, durable group state |
| M7 | Post-quantum | Planned | PQ hybrid for MLS/HPKE (X25519 + ML-KEM-768) |

|
||||
---
|
||||
|
||||
## M1 -- QUIC/TLS Transport (Complete)
|
||||
|
||||
**Goal:** Two processes establish a QUIC connection over TLS 1.3 and exchange
|
||||
typed Cap'n Proto frames.
|
||||
|
||||
**Deliverables:**
|
||||
|
||||
- `schemas/envelope.capnp`: `Envelope` struct with `MsgType` enum (Ping/Pong at this stage)
|
||||
- `quicnprotochat-proto`: `build.rs` invoking `capnpc`, generated type re-exports,
|
||||
canonical serialisation helpers
|
||||
- `quicnprotochat-core`: static X25519 keypair generation, Noise\_XX initiator and
|
||||
responder, length-prefixed Cap'n Proto frame codec (Tokio `Encoder`/`Decoder`)
|
||||
- `quicnprotochat-server`: QUIC listener with TLS 1.3 (quinn/rustls), Ping to Pong
|
||||
handler, one tokio task per connection
|
||||
- `quicnprotochat-client`: connects over QUIC, sends Ping, receives Pong, exits 0
|
||||
- Integration test: server and client in same test binary using `tokio::spawn`
|
||||
- `docker-compose.yml` running the server
|
||||
|
||||
**Tests:** codec (7 unit tests), keypair (3 unit tests), Noise transport integration.
|
||||
|
||||
**Branch:** `feat/m1-noise-transport`
|
||||
|
||||
---
|
||||
|
||||
## M2 -- Authentication Service (Complete)
|
||||
|
||||
**Goal:** Clients register an Ed25519 identity and publish/fetch MLS KeyPackages
|
||||
via Cap'n Proto RPC.
|
||||
|
||||
**Deliverables:**
|
||||
|
||||
- `schemas/auth.capnp`: `AuthenticationService` interface (`uploadKeyPackage`,
|
||||
`fetchKeyPackage`)
|
||||
- `quicnprotochat-core`: Ed25519 identity keypair generation, MLS KeyPackage
|
||||
generation via `openmls`
|
||||
- `quicnprotochat-server`: AS RPC server with `DashMap` store, atomic consume-on-fetch
|
||||
- `quicnprotochat-client`: `register-state` and `fetch-key` CLI subcommands
|
||||
- Integration test: Alice uploads KeyPackage, Bob fetches it, fingerprints match
|
||||
|
||||
**Tests:** auth\_service.rs integration tests (upload, fetch, consume semantics).
|
||||
|
||||
---
|
||||
|
||||
## M3 -- Delivery Service + MLS Groups (Complete)
|
||||
|
||||
**Goal:** Alice creates a group and adds Bob via MLS Welcome. Both exchange
|
||||
encrypted application messages through the Delivery Service.
|
||||
|
||||
**Deliverables:**
|
||||
|
||||
- Unified `NodeService` on port 7000 combining Authentication Service and Delivery
|
||||
Service into a single Cap'n Proto RPC interface
|
||||
- `GroupMember` struct with full MLS lifecycle: `create_group`, `add_member`,
|
||||
`join_from_welcome`, `send_message`, `receive_message`
|
||||
- DS relay with `enqueue`, `fetch`, and `fetchWait` (long-polling) operations
|
||||
- `demo-group` subcommand exercising the complete Alice/Bob flow in one process
|
||||
- Channel-aware delivery: messages routed by `(channelId, recipientKey)`
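
Channel-aware routing as listed above can be pictured as one FIFO queue per `(channelId, recipientKey)` pair. The sketch below is an in-memory illustration under stated assumptions: `DeliveryQueues`, its methods, the 16-byte channel ID, and the drain-on-fetch behaviour are hypothetical and do not mirror the server's actual store.

```rust
use std::collections::{HashMap, VecDeque};

type ChannelId = [u8; 16]; // assumed width
type RecipientKey = [u8; 32]; // Ed25519 public key bytes

/// One FIFO queue of opaque MLS ciphertexts per (channel, recipient) pair.
#[derive(Default)]
struct DeliveryQueues {
    queues: HashMap<(ChannelId, RecipientKey), VecDeque<Vec<u8>>>,
}

impl DeliveryQueues {
    /// Append a payload; the server never inspects its contents.
    fn enqueue(&mut self, channel: ChannelId, recipient: RecipientKey, payload: Vec<u8>) {
        self.queues
            .entry((channel, recipient))
            .or_default()
            .push_back(payload);
    }

    /// Remove and return everything queued for this pair, oldest first.
    /// Messages for the same recipient on other channels are untouched.
    fn fetch(&mut self, channel: ChannelId, recipient: RecipientKey) -> Vec<Vec<u8>> {
        self.queues
            .remove(&(channel, recipient))
            .map(|q| q.into_iter().collect())
            .unwrap_or_default()
    }
}
```

Keying on the pair rather than on the recipient alone is what gives channel isolation: a fetch on one channel cannot drain another channel's backlog.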
|
||||
|
||||
**Tests:** All passing -- codec (5+ tests), keypair (3 tests), group round-trip,
|
||||
group\_id lifecycle, MLS integration.
|
||||
|
||||
**Key design decisions from M3:**
|
||||
|
||||
1. **OpenMlsRustCrypto backend holds the HPKE init key in memory.** The same
|
||||
`GroupMember` instance that generated the KeyPackage must process the
|
||||
corresponding Welcome. If the process exits in between, the init private key
|
||||
is lost. This is by design for M3; persistence comes at M6.
|
||||
|
||||
2. **KeyPackage wire format: raw TLS-encoded bytes.** KeyPackages are serialised
|
||||
using `tls_serialize_detached()` rather than wrapped in `MlsMessageOut`. This
|
||||
avoids an extra layer of indirection and matches what `openmls` expects on the
|
||||
receive side via `KeyPackageIn::tls_deserialize_exact()`.
|
||||
|
||||
3. **openmls 0.5 API gotchas.** Several `openmls` methods changed signatures
|
||||
between 0.4 and 0.5 (e.g., `MlsGroup::new` vs `MlsGroup::new_with_group_id`,
|
||||
`BasicCredential::new` taking `Vec<u8>` directly). These differences are
|
||||
documented inline in `quicnprotochat-core/src/group.rs`.
|
||||
|
||||
**Branch:** `feat/m1-noise-transport`
|
||||
|
||||
---
|
||||
|
||||
## M4 -- Group CLI Subcommands (Next)
|
||||
|
||||
**Goal:** Persistent, composable CLI subcommands for group operations, replacing
|
||||
the monolithic `demo-group` proof-of-concept.
|
||||
|
||||
**Planned deliverables:**
|
||||
|
||||
- `create-group` -- creates a new MLS group, stores state locally
|
||||
- `invite <identity>` -- adds a member by fetching their KeyPackage from the AS
|
||||
- `join` -- processes a Welcome message and joins an existing group
|
||||
- `send <message>` -- encrypts and enqueues an application message
|
||||
- `recv` -- fetches and decrypts pending messages (or long-polls with `fetchWait`)
|
||||
|
||||
The `demo-group` subcommand remains available as a single-command demonstration
|
||||
of the full flow.
|
||||
|
||||
---
|
||||
|
||||
## M5 -- Multi-party Groups (Planned)
|
||||
|
||||
**Goal:** Support groups with N > 2 members, including Commit fan-out and
|
||||
Proposal handling.
|
||||
|
||||
**Planned deliverables:**
|
||||
|
||||
- Commit fan-out through the DS to all group members
|
||||
- Proposal handling (Add, Remove, Update)
|
||||
- Epoch synchronisation across N members
|
||||
- Criterion benchmarks: key generation, encap/decap, group-add latency
|
||||
(10/100/1000 members)
|
||||
|
||||
---
|
||||
|
||||
## M6 -- Persistence (Planned)
|
||||
|
||||
**Goal:** Server survives restart. Client state persists across sessions.
|
||||
|
||||
**Planned deliverables:**
|
||||
|
||||
- `quicnprotochat-server`: SQLite via `sqlx` for AS key store and DS message log,
|
||||
`migrations/` directory
|
||||
- `docker/Dockerfile`: multi-stage build (`rust:bookworm` builder, `debian:bookworm-slim` runtime)
|
||||
- `docker-compose.yml`: server + SQLite volume, healthcheck
|
||||
- Client reconnect with session resume (re-handshake + rejoin group epoch from
|
||||
DS log)
|
||||
|
||||
See [Future Research: SQLCipher](future-research.md#storage--persistence) for
|
||||
encrypted-at-rest options.
|
||||
|
||||
---
|
||||
|
||||
## M7 -- Post-quantum (Planned)
|
||||
|
||||
**Goal:** Replace the MLS crypto backend with a hybrid X25519 + ML-KEM-768 KEM,
|
||||
providing post-quantum confidentiality for all group key material.
|
||||
|
||||
**Planned deliverables:**
|
||||
|
||||
- Custom `OpenMlsCryptoProvider` with hybrid KEM in `quicnprotochat-core`
|
||||
- Hybrid shared secret derivation:
|
||||
|
||||
```
SharedSecret = HKDF-SHA256(
    ikm  = X25519_ss || ML-KEM-768_ss,
    info = "quicnprotochat-hybrid-v1",
    len  = 32
)
```
|
||||
|
||||
- All M3/M4/M5 tests pass unchanged with the new ciphersuite
|
||||
- Follows the combiner approach from `draft-ietf-tls-hybrid-design`
|
||||
|
||||
The `ml-kem` crate is already vendored in the workspace. See
|
||||
[Hybrid KEM](../protocol-layers/hybrid-kem.md) for the detailed design and
|
||||
[ADR-006: PQ Gap in Noise Transport](../design-rationale/adr-006-pq-gap.md) for
|
||||
the accepted residual risk in the transport layer.
|
||||
|
||||
---
|
||||
|
||||
## Cross-references
|
||||
|
||||
- [Production Readiness WBS](production-readiness.md) -- phased work breakdown
|
||||
for hardening beyond the milestone track
|
||||
- [Auth, Devices, and Tokens](authz-plan.md) -- authentication and authorisation
|
||||
design that cuts across M4--M6
|
||||
- [1:1 Channel Design](dm-channels.md) -- DM channel schema and authz model
|
||||
- [Future Research](future-research.md) -- technology options for M6+ and beyond
|
||||
- [Testing Strategy](../contributing/testing.md) -- how tests are structured
|
||||
across milestones
|
||||
226
docs/src/roadmap/production-readiness.md
Normal file
@@ -0,0 +1,226 @@
|
||||
# Production Readiness WBS
|
||||
|
||||
This page defines the work breakdown structure (WBS) for taking quicnprotochat
|
||||
from a proof-of-concept to a production-hardened system. It covers feature scope,
|
||||
security policy, phased delivery, and a planning checklist.
|
||||
|
||||
For the milestone-by-milestone tracker, see [Milestones](milestones.md). This
|
||||
document focuses on the cross-cutting concerns that span multiple milestones.
|
||||
|
||||
---
|
||||
|
||||
## Feature Scope (Must-Have)
|
||||
|
||||
These are the feature areas that must be addressed before quicnprotochat can be
|
||||
considered production-ready. Each area maps to one or more milestones or phases
|
||||
in the WBS below.
|
||||

| Area | Description | Primary Milestone |
|------|-------------|-------------------|
| **Identity / Auth** | Account creation, device registration, token-based RPC authentication, MLS identity binding | M4 + Phase 3 |
| **Key / MLS Lifecycle** | KeyPackage rotation, epoch advancement, member removal, credential updates | M5 + Phase 2 |
| **Transport / Delivery** | QUIC + TLS 1.3 hardening, ALPN enforcement, connection draining, reconnect | M1 (done) + Phase 2 |
| **Private 1:1 Channels** | Channel creation, per-channel authz, TTL eviction, DM-specific flows | Phase 4 |
| **Storage / Persistence** | SQLite (or SQLCipher) for AS, DS, client state; migrations; backup/restore | M6 + Phase 6 |
| **Observability / Ops** | Structured logging, metrics, distributed tracing, healthcheck endpoints | Phase 6 |
| **Client Resilience** | Offline queue, retry with backoff, idempotent message IDs, gap detection | Phase 4 |
| **Compatibility / Protocols** | Wire versioning, N-1 client interoperability, ciphersuite negotiation | Phase 2 + Phase 5 |

|
||||
---
|
||||
|
||||
## Security Plan (By Design)
|
||||
|
||||
quicnprotochat follows a security-by-design philosophy. The standards below are
|
||||
non-negotiable -- see [Coding Standards](../contributing/coding-standards.md) for
|
||||
how they are enforced in code.
|
||||
|
||||
### Governance
|
||||
|
||||
- `CODEOWNERS` file mapping each crate to a responsible reviewer.
|
||||
- All PRs require at least one review from a crate owner.
|
||||
- Security-sensitive changes (crypto, auth, wire format) require two reviewers.
|
||||
- GPG-signed commits only.
|
||||
|
||||
### Transport Policy
|
||||
|
||||
- TLS 1.3 only (`rustls` configured with `TLS13` cipher suites exclusively).
|
||||
- ALPN token `b"capnp"` required; reject connections with mismatched ALPN.
|
||||
- Self-signed certificates acceptable for development; production deployments
|
||||
must use a CA-signed certificate or certificate pinning.
|
||||
- Connection draining on shutdown (QUIC `CONNECTION_CLOSE`).
|
||||
|
||||
### MLS Policy
|
||||
|
||||
- Ciphersuite: `MLS_128_DHKEMX25519_AES128GCM_SHA256_Ed25519` (baseline).
|
||||
- Single-use KeyPackages (consumed on fetch, per RFC 9420).
|
||||
- KeyPackage TTL: 24 hours; clients must rotate before expiry.
|
||||
- Ciphersuite allowlist: server rejects KeyPackages with unknown ciphersuites.
|
||||
- No downgrade: once a group has used a ciphersuite, members cannot rejoin with
|
||||
a weaker one.
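
The allowlist and TTL rules above amount to a small admission check on upload. The following is a sketch only: `check_key_package` and `KeyPackageRejection` are hypothetical names, the age is passed in by the caller rather than read from a real KeyPackage lifetime extension, and the only fact taken from the MLS specification is that ciphersuite `MLS_128_DHKEMX25519_AES128GCM_SHA256_Ed25519` has the value `0x0001` in RFC 9420.

```rust
use std::time::Duration;

/// RFC 9420 ciphersuite identifier for the baseline suite.
const MLS_128_DHKEMX25519_AES128GCM_SHA256_ED25519: u16 = 0x0001;

/// Suites the server is willing to store (baseline only in this sketch).
const ALLOWED_CIPHERSUITES: &[u16] = &[MLS_128_DHKEMX25519_AES128GCM_SHA256_ED25519];

/// KeyPackage TTL from the policy: 24 hours.
const KEY_PACKAGE_TTL: Duration = Duration::from_secs(24 * 60 * 60);

#[derive(Debug, PartialEq)]
enum KeyPackageRejection {
    UnknownCiphersuite(u16),
    Expired,
}

/// Accept a KeyPackage only if its suite is allowlisted and it is younger
/// than the TTL. `age` is the time since the client generated it.
fn check_key_package(ciphersuite: u16, age: Duration) -> Result<(), KeyPackageRejection> {
    if !ALLOWED_CIPHERSUITES.contains(&ciphersuite) {
        return Err(KeyPackageRejection::UnknownCiphersuite(ciphersuite));
    }
    if age >= KEY_PACKAGE_TTL {
        return Err(KeyPackageRejection::Expired);
    }
    Ok(())
}
```

The same check could run again at fetch time so that a package that expired while stored is never handed out.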
|
||||
|
||||
### Input Validation
|
||||
|
||||
- All incoming Cap'n Proto messages validated against schema before processing.
|
||||
- Maximum payload size: 5 MB per RPC call.
|
||||
- Group ID, identity key, and channel ID fields validated for correct length
|
||||
(32 bytes, 32 bytes, 16 bytes respectively).
|
||||
- UTF-8 validation on all string fields.
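
The length, size, and UTF-8 rules above can be expressed as a single validation pass that runs before any handler logic. This is a sketch under assumptions: `EnqueueRequest`, its `label` field, and `ValidationError` are hypothetical, and "5 MB" is read here as 5 * 1024 * 1024 bytes.

```rust
const MAX_PAYLOAD_BYTES: usize = 5 * 1024 * 1024; // assumed meaning of "5 MB"
const GROUP_ID_LEN: usize = 32;
const IDENTITY_KEY_LEN: usize = 32;
const CHANNEL_ID_LEN: usize = 16;

#[derive(Debug, PartialEq)]
enum ValidationError {
    BadLength { field: &'static str, expected: usize, actual: usize },
    PayloadTooLarge { actual: usize },
    InvalidUtf8 { field: &'static str },
}

/// Hypothetical request shape; the real RPC parameters may differ.
struct EnqueueRequest<'a> {
    group_id: &'a [u8],
    identity_key: &'a [u8],
    channel_id: &'a [u8],
    label: &'a [u8], // a string field as received on the wire
    payload: &'a [u8],
}

fn check_len(field: &'static str, bytes: &[u8], expected: usize) -> Result<(), ValidationError> {
    if bytes.len() == expected {
        Ok(())
    } else {
        Err(ValidationError::BadLength { field, expected, actual: bytes.len() })
    }
}

/// Reject the request on the first rule it violates.
fn validate(req: &EnqueueRequest<'_>) -> Result<(), ValidationError> {
    check_len("group_id", req.group_id, GROUP_ID_LEN)?;
    check_len("identity_key", req.identity_key, IDENTITY_KEY_LEN)?;
    check_len("channel_id", req.channel_id, CHANNEL_ID_LEN)?;
    if std::str::from_utf8(req.label).is_err() {
        return Err(ValidationError::InvalidUtf8 { field: "label" });
    }
    if req.payload.len() > MAX_PAYLOAD_BYTES {
        return Err(ValidationError::PayloadTooLarge { actual: req.payload.len() });
    }
    Ok(())
}
```

Returning a typed error rather than a boolean lets the audit log record which rule was violated without logging the field contents.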
|
||||
|
||||
### Secrets Management
|
||||
|
||||
- All private key material wrapped in `Zeroizing<T>` (via the `zeroize` crate).
|
||||
- No secret material in log output at any level.
|
||||
- No `unwrap()` on cryptographic operations -- all errors are typed and propagated.
|
||||
- Constant-time comparison for authentication tokens and key fingerprints.
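
To illustrate the constant-time comparison rule, the sketch below compares two byte strings without returning early at the first mismatch. `ct_eq` is a hypothetical helper written only to show the idea; the compiler gives no timing guarantee for code like this, so a production build would more likely rely on an audited crate such as `subtle`.

```rust
/// Compare two byte slices in time that depends only on their length,
/// not on where the first differing byte is.
fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    // Length is treated as public here: token and fingerprint sizes are fixed.
    if a.len() != b.len() {
        return false;
    }
    // Accumulate all differences instead of returning at the first one.
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b) {
        diff |= x ^ y;
    }
    diff == 0
}
```

A plain `a == b` on slices may stop at the first differing byte, which is the timing signal this rule is meant to remove.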
|
||||
|
||||
### Abuse / DoS Controls
|
||||
|
||||
- Rate limiting: 50 requests/second per IP, per account, and per device.
|
||||
- Payload cap: 5 MB per message.
|
||||
- Connection limit: configurable max concurrent QUIC connections.
|
||||
- KeyPackage upload limit: configurable per account (prevents store exhaustion).
|
||||
- Long-poll timeout cap: server-enforced maximum for `fetchWait`.
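
One common way to enforce a limit such as 50 requests/second is a token bucket. The sketch below is an assumption about how the limit could be implemented, not a description of the server: `TokenBucket` and its methods are hypothetical, and the burst capacity is simply set equal to the per-second rate.

```rust
use std::time::Instant;

/// Token bucket: holds up to `capacity` tokens and refills continuously.
struct TokenBucket {
    capacity: f64,
    refill_per_sec: f64,
    tokens: f64,
    last_refill: Instant,
}

impl TokenBucket {
    /// Start full, so a client may burst up to one second's worth of requests.
    fn new(rate_per_sec: f64, now: Instant) -> Self {
        TokenBucket {
            capacity: rate_per_sec,
            refill_per_sec: rate_per_sec,
            tokens: rate_per_sec,
            last_refill: now,
        }
    }

    /// Return true and consume one token if the request is allowed.
    /// `now` is passed in so the logic can be tested without sleeping.
    fn try_acquire(&mut self, now: Instant) -> bool {
        let elapsed = now.saturating_duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

To cover the per-IP, per-account, and per-device dimensions, the server could keep one bucket per key in a map and require all three to grant a token before handling the RPC.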
|
||||
|
||||
### Data Protection
|
||||
|
||||
- MLS ciphertext is opaque to the server (DS never holds group keys).
|
||||
- Message retention: 7 days default, configurable.
|
||||
- KeyPackage retention: 24 hours (TTL eviction).
|
||||
- At-rest encryption for persistent storage (SQLCipher at M6).
|
||||
|
||||
### Logging Safety
|
||||
|
||||
- Structured logging via `tracing` with `env-filter`.
|
||||
- Sensitive fields (keys, tokens, ciphertext) are never logged, even at `TRACE`.
|
||||
- Audit-level events: auth success/failure, token issuance, keypackage upload,
|
||||
enqueue/fetch, rate limit hits.
|
||||
|
||||
### Testing
|
||||
|
||||
- Unit tests for all crypto operations (see [Testing Strategy](../contributing/testing.md)).
|
||||
- Integration tests for every RPC method.
|
||||
- Negative tests: malformed input, expired tokens, wrong identity, replay attempts.
|
||||
- N-1 compatibility tests (old client against new server).
|
||||
- Fuzzing targets for Cap'n Proto parsers and MLS message handling (Phase 5).
|
||||
|
||||
---
|
||||
|
||||
## Work Breakdown (6 Phases)
|
||||
|
||||
### Phase 1 -- Baselines and Governance
|
||||
|
||||
**Goal:** Establish project hygiene before adding features.
|
||||

| Task | Description |
|------|-------------|
| CODEOWNERS | Map crates to responsible reviewers |
| CI pipeline | GitHub Actions: `cargo test --workspace`, `cargo clippy`, `cargo fmt --check`, `cargo deny check` |
| SBOM generation | `cargo-cyclonedx` or `cargo-about` in CI; publish with each release |
| Threat model | Document assets, adversaries, attack surface, trust boundaries; reference in [Threat Model](../cryptography/threat-model.md) |
| Dependency audit | `cargo audit` in CI; pin all major versions per [Coding Standards](../contributing/coding-standards.md) |

|
||||
### Phase 2 -- Protocols and Core Hardening
|
||||
|
||||
**Goal:** Lock down the wire format and cryptographic policy.
|
||||

| Task | Description |
|------|-------------|
| Wire versioning | Add `version` field to all Cap'n Proto structs; reject unknown versions |
| Ciphersuite allowlist | Server rejects KeyPackages outside the allowed set |
| Downgrade guards | Prevent epoch rollback; reject Commits with weaker ciphersuites |
| ALPN enforcement | Reject connections without `b"capnp"` ALPN token |
| Connection draining | Graceful QUIC `CONNECTION_CLOSE` on server shutdown |
| KeyPackage rotation | Client-side timer to upload fresh KeyPackages before TTL expiry |

|
||||
### Phase 3 -- Auth, Device, and Server Hardening
|
||||
|
||||
**Goal:** Add account/device identity and token-based authentication.
|
||||
|
||||
See [Auth, Devices, and Tokens](authz-plan.md) for the full design.
|
||||

| Task | Description |
|------|-------------|
| Account + device model | `{account_id, device_id, device_pubkey}` with status lifecycle |
| Token issuance | Access + refresh tokens; configurable expiry |
| RPC auth middleware | Validate token on every RPC; map to account/device |
| Identity binding | Bind MLS identity key to account; reject mismatched uploads |
| Rate limiting | Per-IP, per-account, per-device counters |
| Audit logging | Auth events, token lifecycle, rate limit hits |

|
||||
### Phase 4 -- Delivery Semantics and Client Resilience
|
||||
|
||||
**Goal:** Reliable message delivery and 1:1 channels.
|
||||
|
||||
See [1:1 Channel Design](dm-channels.md) for the DM-specific design.
|
||||

| Task | Description |
|------|-------------|
| Idempotent message IDs | Client-generated UUIDs; server deduplicates |
| Ordering guarantees | Per-channel sequence numbers; client detects gaps |
| Offline queue | Server retains messages for offline recipients (up to TTL) |
| 1:1 channels | Channel creation, membership, per-channel authz |
| TTL eviction | Background sweep + fetch-time check for expired messages |
| Client retry | Exponential backoff with jitter on transient failures |
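
The "Client retry" task can be illustrated with the full-jitter variant of exponential backoff: the delay ceiling doubles with each attempt up to a cap, and the actual delay is a random fraction of that ceiling. This is a sketch, not the client's retry policy: `backoff_delay` is a hypothetical helper, and the random fraction is supplied by the caller so the function stays deterministic and free of external crates.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based).
/// ceiling = min(cap, base * 2^attempt); delay = ceiling * unit_random,
/// where `unit_random` is a caller-supplied random value in [0, 1].
fn backoff_delay(attempt: u32, base: Duration, cap: Duration, unit_random: f64) -> Duration {
    // Saturating arithmetic keeps very large attempt counts from overflowing.
    let exponential = base.saturating_mul(2u32.saturating_pow(attempt));
    let ceiling = exponential.min(cap);
    // Treat NaN or out-of-range input as a value inside [0, 1].
    let fraction = if unit_random.is_finite() {
        unit_random.clamp(0.0, 1.0)
    } else {
        0.0
    };
    ceiling.mul_f64(fraction)
}
```

Randomising the whole delay, rather than adding a small jitter on top, spreads reconnecting clients out after a server restart instead of having them retry in lockstep.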
|
||||
|
||||
### Phase 5 -- E2E Harness and Security Tests
|
||||
|
||||
**Goal:** Automated end-to-end testing and security validation.
|
||||

| Task | Description |
|------|-------------|
| docker-compose testnet | Multi-node test environment with configurable topology |
| Positive E2E tests | Full group lifecycle: register, create, invite, join, send, recv, leave |
| Negative E2E tests | Expired tokens, wrong identity, replay, malformed messages |
| Compat matrix | N-1 client/server version testing |
| Fuzz targets | `cargo-fuzz` targets for Cap'n Proto parsers, MLS message handlers |
| Golden-wire fixtures | Serialised test vectors for regression testing across versions |

|
||||
### Phase 6 -- Reliability, Performance, and Operations
|
||||
|
||||
**Goal:** Production-grade operations and performance validation.
|
||||

| Task | Description |
|------|-------------|
| SQLite/SQLCipher persistence | AS key store, DS message log, client state (M6) |
| Soak testing | 72-hour continuous operation under synthetic load |
| Load testing | Throughput and latency benchmarks (Criterion + custom harness) |
| Chaos testing | Network partitions, process crashes, disk full scenarios |
| Backup / restore | SQLite backup with integrity verification |
| Canary / rollback | Rolling deployment strategy with automatic rollback on failure |
| Metrics + dashboards | Prometheus metrics, Grafana dashboards (see [Future Research](future-research.md)) |

|
||||
---
|
||||
|
||||
## Planning Checklist
|
||||
|
||||
Use this checklist when planning a new milestone or phase. Each item should have
|
||||
a documented decision before implementation begins.
|
||||
|
||||
- [ ] **Release criteria / SLOs** -- Define what "done" means. Latency targets,
|
||||
error rate thresholds, test coverage minimums.
|
||||
- [ ] **Threat model review** -- Update the [Threat Model](../cryptography/threat-model.md)
|
||||
for any new attack surface introduced by this phase.
|
||||
- [ ] **Protocol policy** -- Ciphersuite allowlist, wire version, downgrade rules.
|
||||
- [ ] **Identity / auth model** -- Who authenticates, how, and what operations
|
||||
are gated.
|
||||
- [ ] **Data model** -- Schema changes, migrations, backward compatibility.
|
||||
- [ ] **Abuse controls** -- Rate limits, size caps, connection limits for this phase.
|
||||
- [ ] **Observability contracts** -- What new metrics, logs, and traces are needed.
|
||||
- [ ] **Environments / secrets** -- Dev, staging, production configuration;
|
||||
secret rotation plan.
|
||||
- [ ] **Testing matrix** -- Unit, integration, E2E, negative, fuzz, compat tests
|
||||
for this phase.
|
||||
- [ ] **Rollout / ops** -- Deployment strategy, rollback plan, monitoring during
|
||||
rollout.
|
||||
|
||||
---
|
||||
|
||||
## Cross-references
|
||||
|
||||
- [Milestones](milestones.md) -- feature milestone tracker
|
||||
- [Auth, Devices, and Tokens](authz-plan.md) -- Phase 3 design
|
||||
- [1:1 Channel Design](dm-channels.md) -- Phase 4 design
|
||||
- [Future Research](future-research.md) -- technology options for Phase 6+
|
||||
- [Coding Standards](../contributing/coding-standards.md) -- engineering standards
|
||||
- [Testing Strategy](../contributing/testing.md) -- test structure and conventions
|
||||
- [Threat Model](../cryptography/threat-model.md) -- security analysis
|
||||