feat: add delivery sequence numbers + major server/client refactor

Delivery sequence numbers (MLS epoch ordering fix):
- schemas/node.capnp: add Envelope{seq,data} struct; enqueue returns seq:UInt64;
  fetch/fetchWait return List(Envelope) instead of List(Data)
- storage.rs: Store trait enqueue returns u64; fetch/fetch_limited return
  Vec<(u64, Vec<u8>)>; FileBackedStore gains QueueMapV3 with per-inbox seq
  counters and V2→V3 on-disk migration
- migrations/002_add_seq.sql: seq column, delivery_seq_counters table, index
- sql_store.rs: atomic UPSERT counter via RETURNING, ORDER BY seq, SCHEMA_VERSION→3
- node_service/delivery.rs: builds Envelope list; returns seq from enqueue
- client/rpc.rs: enqueue→u64, fetch_all/fetch_wait→Vec<(u64,Vec<u8>)>
- client/commands.rs: sort-by-seq before MLS processing; retry loop in cmd_recv
  and receive_pending_plaintexts for correct epoch ordering

Server refactor:
- Split monolithic main.rs into node_service/{mod,delivery,auth_ops,key_ops,p2p_ops}
- Add auth.rs (token validation, rate limiting), config.rs, metrics.rs, tls.rs
- Add SQL migrations runner (001_initial.sql, 002_add_seq.sql)
- OPAQUE PAKE login/registration, sealed-sender mode, queue depth limit (1000)

Client refactor:
- Split lib.rs into client/{commands,rpc,state,retry,hex,mod}
- Add cmd_whoami, cmd_health, cmd_check_key, cmd_ping subcommands
- Add cmd_register_user, cmd_login (OPAQUE), cmd_refresh_keypackage
- Hybrid PQ envelope (X25519 + ML-KEM-768) on all send/recv paths
- E2E test suite expanded

Other:
- quicnprotochat-gui: Tauri 2 desktop GUI skeleton (backend + HTML UI)
- quicnprotochat-p2p: iroh-based P2P transport stub
- quicnprotochat-core: app_message, hybrid_crypto modules; GroupMember API updates
- .github/workflows/size-lint.yml: binary size regression check
- docs: protocol comparison, roadmap updates, fully-operational checklist
2026-02-22 20:40:12 +01:00
parent b5b361e2ff
commit 6b8b61c6ae
56 changed files with 10693 additions and 3024 deletions


@@ -0,0 +1,135 @@
# Features Needed to Be Fully Operational
This checklist reflects the current state after M1–M3, M4-style CLI, M6 migrations, rich messaging, Sealed Sender, and GUI scaffold. It lists what is **done**, what is **partially done**, and what still **must be implemented** for a fully operational chat system.
---
## Summary Table
| Area | Status | Notes |
|------|--------|--------|
| Transport (QUIC/TLS) | Done | M1 |
| Auth service (KeyPackage, OPAQUE) | Done | M2 + register-user, login |
| Delivery + MLS groups (2-party) | Done | M3 |
| Group CLI (create, invite, join, send, recv, chat) | Done | M4-style |
| Server persistence (SQL + migrations) | Done | M6 migrations + runner |
| Client state persistence | Done | State file, DiskKeyStore, encrypted (QPCE) |
| Rich messaging (app payload schema) | Done | Chat, Reply, Reaction, ReadReceipt, Typing + sender |
| Sealed Sender | Done | Server config; enqueue without identity |
| Native GUI scaffold | Done | Tauri, whoami, health |
| **Multi-party groups (N > 2)** | Done | M5: Commit fan-out, send --all, epoch sync, three-party E2E |
| **KeyPackage rotation** | **To do** | Client upload before TTL (24h) |
| **Observability** | **To do** | Metrics (Prometheus), tracing (OpenTelemetry), health |
| **Client resilience** | **To do** | Retry/backoff, idempotent message IDs, gap detection |
| **1:1 channel semantics** | Partial | channelId in DS; per-channel authz/TTL not formalized |
| **Production hardening** | **To do** | CI, CODEOWNERS, SBOM, backup/restore, rate-limit tuning |
| **Post-quantum (M7)** | Next | Custom OpenMlsCryptoProvider with hybrid KEM |
---
## 1. Must-Have for “Fully Operational”
These are the features that, if missing, prevent the system from being considered fully operational for real use (multi-user groups, reliability, and operations).
### 1.1 Multi-party groups (M5)
**Current:** Core supports `add_member` and `merge_staged_commit`; client/server only exercise 2-party (creator + one joiner).
**To implement:**
- **Commit fan-out:** When creator invites a new member, the Commit must be delivered to **all existing members** (not just the creator). Client flow: after `add_member`, enqueue the Commit to each existing member's queue (by identity / recipient_key) in addition to sending the Welcome to the new member.
- **Proposal handling:** Ensure all members process Commits and Proposals (Add/Remove/Update) so epoch advancement is consistent; already partially in core (`merge_staged_commit`, `store_pending_proposal`).
- **CLI/API:** Extend `invite` so that after adding a member, the client fetches the list of existing members (e.g. from local group state) and enqueues the Commit to each. Optional: `recv` processes incoming Commits and updates local group state before returning application messages.
- **Tests:** E2E with 3+ members: create group, invite B, invite C, send from A, B, C; all receive and decrypt.
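The fan-out step can be sketched as a pure function that decides who gets the Commit and who gets the Welcome. This is only an illustration: `FanOutPlan` and `plan_fan_out` are hypothetical names, identities are modelled as raw key bytes, and it assumes the inviter merges its own Commit locally and so needs no delivery.

```rust
use std::collections::BTreeSet;

/// Who must receive what after an `add_member` call.
/// Hypothetical type for illustration, not the project's actual API.
#[derive(Debug, PartialEq, Eq)]
pub struct FanOutPlan {
    /// Existing members that need the Commit to advance their epoch.
    pub commit_recipients: Vec<Vec<u8>>,
    /// The newly added member, who receives the Welcome instead.
    pub welcome_recipient: Vec<u8>,
}

/// Builds the delivery plan for one invite.
///
/// `members_before_add` is the roster before the new member was added,
/// `own_key` is the inviter's identity key (assumed to merge the Commit
/// locally), and `new_member` is the invitee's identity key.
pub fn plan_fan_out(
    members_before_add: &[Vec<u8>],
    own_key: &[u8],
    new_member: &[u8],
) -> FanOutPlan {
    // BTreeSet removes duplicates and gives a deterministic enqueue order.
    let recipients: BTreeSet<&[u8]> = members_before_add
        .iter()
        .map(|k| k.as_slice())
        .filter(|k| *k != own_key && *k != new_member)
        .collect();

    FanOutPlan {
        commit_recipients: recipients.into_iter().map(|k| k.to_vec()).collect(),
        welcome_recipient: new_member.to_vec(),
    }
}
```

A caller would then enqueue the Commit once per entry in `commit_recipients` and the Welcome to `welcome_recipient`. Keeping the selection separate from the RPC calls makes the 3+ member case easy to unit-test without a server.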
### 1.2 KeyPackage rotation
**Current:** KeyPackages are single-use (consume-on-fetch). Server TTL (e.g. 24h) and client upload are in place, but there is no **scheduled client-side rotation**.
**To implement:**
- **Timer or on-demand:** Before KeyPackage TTL expires (e.g. 24h), client uploads a fresh KeyPackage (and optionally removes or replaces the old one). Can be a background task in the client (CLI daemon or GUI backend) or triggered when a “fetch key” fails with “no key”.
- **Documentation:** Document TTL and rotation in user/ops docs.
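One possible shape for the "before TTL expires" check is shown below. It is a sketch under assumptions: `rotation_due` is a hypothetical helper, and the safety margin (how long before expiry the client re-uploads) is a parameter the checklist does not specify.

```rust
use std::time::{Duration, SystemTime};

/// Returns true when a fresh KeyPackage should be uploaded.
///
/// `uploaded_at` is when the current KeyPackage was uploaded, `ttl` is the
/// server-side lifetime (e.g. 24h), and `margin` is how far ahead of expiry
/// the client wants to rotate. Hypothetical helper, not the project's API.
pub fn rotation_due(
    uploaded_at: SystemTime,
    now: SystemTime,
    ttl: Duration,
    margin: Duration,
) -> bool {
    // If the clock moved backwards, treat the KeyPackage as brand new
    // rather than failing.
    let age = now.duration_since(uploaded_at).unwrap_or(Duration::ZERO);
    // saturating_sub keeps the threshold at zero if margin exceeds ttl,
    // which means "always rotate".
    age >= ttl.saturating_sub(margin)
}
```

A background task (CLI daemon or GUI backend) could call this on a timer and upload a new KeyPackage when it returns true. The on-demand path triggered by a failed fetch would not need this check.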
### 1.3 Observability
**Current:** Health RPC and basic tracing exist; no structured metrics or distributed tracing.
**To implement:**
- **Metrics:** Prometheus (or equivalent) export for: enqueue/fetch rate, RPC latency histograms, queue depth per recipient, KeyPackage store size, active connections. See [Future Research](future-research.md).
- **Health:** Existing `health` RPC is sufficient; optionally add a simple HTTP health endpoint for load balancers (e.g. on a separate port).
- **Structured logging:** Ensure sensitive data is never logged; audit events (auth, enqueue, rate limit) as in [Production Readiness](production-readiness.md).
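To make the metrics bullet concrete, here is a dependency-free sketch of two counters rendered in the Prometheus text exposition format. `DeliveryMetrics` and the `qpc_*` metric names are invented for illustration; a real server would more likely use an existing Prometheus client crate and also export histograms and gauges.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Minimal counter set for the delivery service (hypothetical names).
#[derive(Default)]
pub struct DeliveryMetrics {
    enqueue_total: AtomicU64,
    fetch_total: AtomicU64,
}

impl DeliveryMetrics {
    /// Count one successful enqueue RPC.
    pub fn record_enqueue(&self) {
        self.enqueue_total.fetch_add(1, Ordering::Relaxed);
    }

    /// Count one successful fetch / fetchWait RPC.
    pub fn record_fetch(&self) {
        self.fetch_total.fetch_add(1, Ordering::Relaxed);
    }

    /// Render the counters as Prometheus text: a `# TYPE` line followed
    /// by `name value` for each metric.
    pub fn render(&self) -> String {
        format!(
            "# TYPE qpc_enqueue_total counter\nqpc_enqueue_total {}\n\
             # TYPE qpc_fetch_total counter\nqpc_fetch_total {}\n",
            self.enqueue_total.load(Ordering::Relaxed),
            self.fetch_total.load(Ordering::Relaxed),
        )
    }
}
```

Only aggregate counts are exposed here, which fits the requirement that sensitive data never reaches logs or metrics. Per-recipient labels such as queue depth would need care for the same reason.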
### 1.4 Client resilience
**Current:** Single attempt for send/recv; no retry, no idempotent message IDs, no gap detection.
**To implement:**
- **Retry with backoff:** On transient failures (network, server busy), retry with exponential backoff + jitter for enqueue, fetch, fetchWait.
- **Idempotent message IDs:** Client-generated message IDs (already in rich messaging); server-side deduplication by (recipient_key, channel_id, message_id) if desired, to avoid duplicate delivery on retry.
- **Gap detection (optional):** Per-channel sequence numbers or epoch checks so the client can detect missing Commits or messages and re-sync (e.g. re-fetch or rejoin).
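The retry bullet can be illustrated with a small delay calculator using exponential backoff and "full jitter". This is a sketch, not the project's retry module: `Backoff` is a hypothetical type, and it uses a tiny xorshift generator only to stay dependency-free, where a real client would probably use a proper random number crate.

```rust
use std::time::Duration;

/// Exponential backoff with full jitter: each delay is drawn roughly
/// uniformly from [0, min(cap, base * 2^attempt)]. Hypothetical type.
pub struct Backoff {
    base: Duration,
    cap: Duration,
    attempt: u32,
    rng_state: u64,
}

impl Backoff {
    pub fn new(base: Duration, cap: Duration, seed: u64) -> Self {
        // xorshift must not start from zero, so force the low bit on.
        Self { base, cap, attempt: 0, rng_state: seed | 1 }
    }

    /// Upper bound of the delay for a given attempt number (0-based).
    pub fn ceiling(&self, attempt: u32) -> Duration {
        // Saturate instead of overflowing for very large attempt counts.
        let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
        self.base.checked_mul(factor).unwrap_or(self.cap).min(self.cap)
    }

    /// Delay to sleep before the next retry; advances the attempt counter.
    pub fn next_delay(&mut self) -> Duration {
        let ceiling = self.ceiling(self.attempt);
        self.attempt = self.attempt.saturating_add(1);
        // xorshift64 step: cheap pseudo-randomness, not cryptographic.
        self.rng_state ^= self.rng_state << 13;
        self.rng_state ^= self.rng_state >> 7;
        self.rng_state ^= self.rng_state << 17;
        let max_ms = ceiling.as_millis() as u64;
        Duration::from_millis(self.rng_state % max_ms.saturating_add(1))
    }
}
```

A caller would wrap enqueue, fetch, or fetchWait in a loop that sleeps for `next_delay()` after each transient failure and gives up after a fixed number of attempts. Retrying enqueue is only safe against duplicates if the message ID deduplication described above is in place.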
---
## 2. Important for Production Readiness
Not strictly required for “operational” but expected for production deployments.
### 2.1 1:1 channel semantics (Phase 4)
**Current:** Delivery is per `(recipient_key, channel_id)`; channelId is used in enqueue/fetch. No formal per-channel authz or TTL.
**To implement:**
- **Per-channel authz:** Ensure fetch/fetchWait only return messages for channels the authenticated identity is allowed to read (e.g. identity bound to recipient_key or to a channel membership list).
- **TTL eviction:** Server already has message TTL (e.g. 7 days) and GC; document and optionally make TTL configurable per channel type.
### 2.2 Wire versioning and protocol hardening (Phase 2)
**Current:** Wire version is checked on enqueue/fetch (e.g. `CURRENT_WIRE_VERSION`). Ciphersuite allowlist and ALPN are partially in place.
**To implement:**
- **Ciphersuite allowlist:** Server rejects KeyPackages with unknown ciphersuites.
- **Downgrade guards:** Reject Commits with weaker ciphersuites once a group has advanced.
- **Connection draining:** Graceful QUIC `CONNECTION_CLOSE` on server shutdown.
### 2.3 Production hardening (Phase 1 + 6)
- **CODEOWNERS:** Map crates to reviewers.
- **CI:** `cargo test --workspace`, `cargo clippy`, `cargo fmt --check`, `cargo audit`, optional `cargo deny`.
- **SBOM:** e.g. `cargo-cyclonedx` or `cargo-about` in CI.
- **Backup/restore:** SQLite/SQLCipher backup and integrity verification for server DB.
- **Rate limiting:** Already per-token; optionally add per-IP and per-account limits and document.
---
## 3. Roadmap and Documentation Updates
- **Milestones doc:** Mark M4 as **Complete** (CLI subcommands exist). Mark M6 as **Complete** (migrations + runner; server and client persistence in place). Leave M5 as **Next** and M7 as **Planned**.
- **README:** Update milestone table to reflect M4 and M6 complete; add one line on migrations (e.g. “Server supports SQL migrations under `quicnprotochat-server/migrations/`”).
- **Migration convention:** Document in README or a dev doc: add new migrations as `NNN_name.sql`, add to `MIGRATIONS` in `sql_store.rs`, bump `SCHEMA_VERSION`.
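The `NNN_name.sql` convention lends itself to a small validation helper. The sketch below is an assumption about how such a check could look: `parse_migration_name` and `pending` are hypothetical functions, and it treats `applied` as the highest migration number already run, which may not match how `SCHEMA_VERSION` is tracked in `sql_store.rs`.

```rust
/// Parses `NNN_name.sql` into (number, name). Returns None when the file
/// name does not follow the convention. Hypothetical helper.
pub fn parse_migration_name(file: &str) -> Option<(u32, &str)> {
    let stem = file.strip_suffix(".sql")?;
    let (num, name) = stem.split_once('_')?;
    let well_formed =
        num.len() == 3 && num.bytes().all(|b| b.is_ascii_digit()) && !name.is_empty();
    if !well_formed {
        return None;
    }
    Some((num.parse().ok()?, name))
}

/// Returns the migration files that still need to run, in ascending
/// order, given the highest migration number already applied.
/// Files that do not match the convention are ignored.
pub fn pending<'a>(files: &[&'a str], applied: u32) -> Vec<&'a str> {
    let mut todo: Vec<(u32, &'a str)> = files
        .iter()
        .filter_map(|f| parse_migration_name(f).map(|(n, _)| (n, *f)))
        .filter(|(n, _)| *n > applied)
        .collect();
    // Sort by number so execution order does not depend on listing order.
    todo.sort();
    todo.into_iter().map(|(_, f)| f).collect()
}
```

A check like this could run in a unit test over the `migrations/` directory listing, catching a misnamed file before it is silently skipped by the runner.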
---
## 4. Optional / Later
- **Post-quantum (M7):** Custom `OpenMlsCryptoProvider` with hybrid X25519 + ML-KEM-768 for MLS HPKE; all M3–M5 tests pass with PQ backend.
- **GUI completion:** Full flows (login, conversation list, chat view with send/recv, settings); long-lived connection and streaming recv.
- **WebTransport + WASM:** Browser client.
- **iroh / P2P:** NAT traversal and optional direct peer-to-peer delivery.
---
## Priority Order for “Fully Operational”
1. **M5 Multi-party groups** — Commit fan-out and client flow for N > 2.
2. **KeyPackage rotation** — Client upload before TTL.
3. **Observability** — Metrics + health + safe logging.
4. **Client resilience** — Retry, backoff, idempotent message IDs.
5. **Docs** — Update milestones and README (M4, M6, migrations).
6. **Production hardening** — CI, CODEOWNERS, SBOM, backup, rate-limit docs.
Once items 1–5 are in place, the system can be considered **fully operational** for multi-user group chat with durable state and observable, resilient clients. Item 6 and the optional items bring it to **production-ready** and beyond.


@@ -14,10 +14,10 @@ for what that means in practice.
| M1 | QUIC/TLS Transport | **Complete** | QUIC + TLS 1.3 endpoint, length-prefixed framing, Ping/Pong |
| M2 | Authentication Service | **Complete** | Ed25519 identity, KeyPackage generation, AS upload/fetch |
| M3 | Delivery Service + MLS Groups | **Complete** | DS relay, GroupMember create/join/add/send/recv |
-| M4 | Group CLI Subcommands | **Next** | Persistent CLI (create-group, invite, join, send, recv); `demo-group` already available |
-| M5 | Multi-party Groups | Planned | N > 2 members, Commit fan-out, Proposal handling |
-| M6 | Persistence | Planned | SQLite key store, durable group state |
-| M7 | Post-quantum | Planned | PQ hybrid for MLS/HPKE (X25519 + ML-KEM-768) |
+| M4 | Group CLI Subcommands | **Complete** | Persistent CLI (create-group, invite, join, send, recv), OPAQUE login |
+| M5 | Multi-party Groups | **Complete** | N > 2 members, Commit fan-out, send --all, epoch sync |
+| M6 | Persistence | **Complete** | SQLite/SQLCipher, migrations, durable server + client state |
+| M7 | Post-quantum | **Next** | PQ hybrid for MLS/HPKE (X25519 + ML-KEM-768) |
---
@@ -103,63 +103,45 @@ group\_id lifecycle, MLS integration.
---
-## M4 -- Group CLI Subcommands (Next)
+## M4 -- Group CLI Subcommands (Complete)
**Goal:** Persistent, composable CLI subcommands for group operations, replacing
the monolithic `demo-group` proof-of-concept.
-**Planned deliverables:**
-- `create-group` -- creates a new MLS group, stores state locally
-- `invite <identity>` -- adds a member by fetching their KeyPackage from the AS
-- `join` -- processes a Welcome message and joins an existing group
-- `send <message>` -- encrypts and enqueues an application message
-- `recv` -- fetches and decrypts pending messages (or long-polls with `fetchWait`)
-The `demo-group` subcommand remains available as a single-command demonstration
-of the full flow.
+**Deliverables:** `create-group`, `invite`, `join`, `send`, `recv`, `chat`;
+OPAQUE `register-user` and `login`; `demo-group` remains for single-command demo.
---
-## M5 -- Multi-party Groups (Planned)
+## M5 -- Multi-party Groups (Complete)
**Goal:** Support groups with N > 2 members, including Commit fan-out and
-Proposal handling.
+epoch synchronisation.
-**Planned deliverables:**
-- Commit fan-out through the DS to all group members
-- Proposal handling (Add, Remove, Update)
-- Epoch synchronisation across N members
-- Criterion benchmarks: key generation, encap/decap, group-add latency
-(10/100/1000 members)
+**Deliverables:** Commit fan-out to existing members on invite; `send --all`;
+`cmd_join` processes all queued payloads (Welcome + Commits); three-party E2E
+passing. Proposal handling (Remove, Update) and Criterion benchmarks are
+optional follow-ups.
---
-## M6 -- Persistence (Planned)
+## M6 -- Persistence (Complete)
**Goal:** Server survives restart. Client state persists across sessions.
-**Planned deliverables:**
-- `quicnprotochat-server`: SQLite via `sqlx` for AS key store and DS message log,
-`migrations/` directory
-- `docker/Dockerfile`: multi-stage build (`rust:bookworm` builder, `debian:bookworm-slim` runtime)
-- `docker-compose.yml`: server + SQLite volume, healthcheck
-- Client reconnect with session resume (re-handshake + rejoin group epoch from
-DS log)
+**Deliverables:** SQLite/SQLCipher via rusqlite, `migrations/` directory and
+migration runner; client state file and DiskKeyStore (encrypted QPCE optional).
+See [Future Research: SQLCipher](future-research.md#storage--persistence) for
+encrypted-at-rest options.
---
-## M7 -- Post-quantum (Planned)
+## M7 -- Post-quantum (Next)
**Goal:** Replace the MLS crypto backend with a hybrid X25519 + ML-KEM-768 KEM,
providing post-quantum confidentiality for all group key material.
-**Planned deliverables:**
+**Deliverables:**
- Custom `OpenMlsCryptoProvider` with hybrid KEM in `quicnprotochat-core`
- Hybrid shared secret derivation: