# Roadmap — quicproquo

> From proof-of-concept to production-grade E2E encrypted messaging.
>
> Each phase is designed to be tackled sequentially. Items within a phase
> can be parallelised. Check the box when done.

---

## Phase 1 — Production Hardening (Critical)

Eliminate all crash paths, enforce secure defaults, fix deployment blockers.

- [ ] **1.1 Remove `.unwrap()` / `.expect()` from production paths**
  - Replace `AUTH_CONTEXT.read().expect()` in client RPC with proper `Result`
  - Replace `"0.0.0.0:0".parse().unwrap()` in client with fallible parse
  - Replace `Mutex::lock().unwrap()` in server storage with `.map_err()`
  - Audit: `grep -rn 'unwrap()\|expect(' crates/` outside `#[cfg(test)]`

- [ ] **1.2 Enforce secure defaults in production mode**
  - Reject startup if `QPQ_PRODUCTION=true` and `auth_token` is empty or `"devtoken"`
  - Require non-empty `db_key` when using SQL backend in production
  - Refuse to auto-generate TLS certs in production mode (require existing cert+key)
  - Already partially implemented — verify and harden the validation in `config.rs`

- [ ] **1.3 Fix `.gitignore`**
  - Add `data/`, `*.der`, `*.pem`, `*.db`, `*.bin` (state files), `*.ks` (keystores)
  - Verify no secrets are already tracked (quote the globs so the shell does not expand them): `git ls-files -- data/ '*.der' '*.pem' '*.db' '*.bin' '*.ks'`

- [ ] **1.4 Fix Dockerfile**
  - Sync workspace members (handle excluded `p2p` crate)
  - Create dedicated user/group instead of `nobody`
  - Set writable `QPQ_DATA_DIR` with correct permissions
  - Test: `docker build -t qpq-server . && docker run --rm -it qpq-server --help`

- [ ] **1.5 TLS certificate lifecycle**
  - Document CA-signed cert setup (Let's Encrypt / custom CA)
  - Add `--tls-required` flag that refuses to start without valid cert
  - Log clear warning when using self-signed certs
  - Document certificate rotation procedure

---

## Phase 2 — Test & CI Maturity

Build confidence before adding features.

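
To make the retry-logic tests in item 2.2 below concrete, here is a minimal sketch of a backoff calculation that a unit test can pin down exactly. `backoff_delay`, `BASE_MS`, and `MAX_MS` are hypothetical names and values, not the client's actual API, and the jitter factor is passed in as a parameter (an assumption made so the test does not depend on a random source).

```rust
use std::time::Duration;

// Hypothetical constants; the real client's values may differ.
const BASE_MS: u64 = 100;
const MAX_MS: u64 = 10_000;

/// Delay before retry number `attempt` (0-based): BASE_MS * 2^attempt,
/// capped at MAX_MS, then scaled by `jitter` clamped to [0.0, 1.0].
fn backoff_delay(attempt: u32, jitter: f64) -> Duration {
    // checked_shl is None once the shift reaches 64 bits; treat that as "huge".
    let factor = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
    // saturating_mul keeps very large attempts from overflowing before the cap.
    let capped = BASE_MS.saturating_mul(factor).min(MAX_MS);
    let jitter = jitter.clamp(0.0, 1.0);
    Duration::from_millis((capped as f64 * jitter) as u64)
}
```

With the jitter fixed at `1.0`, a test can assert the exact doubling sequence and the cap; with `0.5` it can check that jitter scales the delay rather than adding to it.
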
- [ ] **2.1 Expand E2E test coverage**
  - Auth failure scenarios (wrong password, expired token, invalid token)
  - Message ordering verification (send N messages, verify seq numbers)
  - Concurrent clients (3+ members in group, simultaneous send/recv)
  - OPAQUE registration + login full flow
  - Queue full behavior (>1000 messages)
  - Rate limiting behavior (>100 enqueues/minute)
  - Reconnection after server restart
  - KeyPackage exhaustion (fetch when none available)

- [ ] **2.2 Add unit tests for untested paths**
  - Client retry logic (exponential backoff, jitter, retriable classification)
  - REPL input parsing edge cases (empty input, special characters, `/` commands)
  - State file encryption/decryption round-trip with bad password
  - Token cache expiry
  - Conversation store migrations

- [ ] **2.3 CI hardening**
  - Add `.github/CODEOWNERS` (crypto, auth, wire-format require 2 reviewers)
  - Ensure `cargo deny check` runs on every PR (already in CI — verify)
  - Add `cargo audit` as blocking check (already in CI — verify)
  - Add coverage reporting (tarpaulin or llvm-cov)
  - Add CI job for Docker build validation

- [ ] **2.4 Clean up build warnings**
  - Fix Cap'n Proto generated `unused_parens` warnings
  - Remove dead code / unused imports
  - Address `openmls` future-incompat warnings
  - Target: `cargo clippy --workspace -- -D warnings` passes clean

---

## Phase 3 — Client SDKs: Native QUIC + Cap'n Proto Everywhere

**No REST gateway. No protocol dilution.** The `.capnp` schemas are the interface
definition. Every SDK speaks native QUIC + Cap'n Proto. The project name stays honest.

### Why this matters

The name is **quic**n**proto**chat — the protocol IS the product. Instead of adding an
HTTP translation layer that loses zero-copy performance and adds base64 overhead, we
invest in making the native protocol accessible from every language that has
QUIC + Cap'n Proto support, and provide WASM/FFI for the crypto layer.

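
To put a number on the base64 overhead mentioned above: standard padded base64 (RFC 4648) encodes every 3 bytes as 4 ASCII characters, so a binary payload grows by roughly a third before any JSON escaping or HTTP headers are counted. The sketch below assumes a hypothetical gateway that would carry ciphertext as base64 text; the function names are illustrative only.

```rust
/// Length in characters of the padded base64 encoding of `n` bytes:
/// every 3-byte group, including a final partial one, becomes 4 characters.
fn base64_len(n: usize) -> usize {
    (n + 2) / 3 * 4
}

/// Size increase in percent relative to the raw binary payload.
/// Only meaningful for `n > 0`.
fn overhead_percent(n: usize) -> f64 {
    (base64_len(n) - n) as f64 * 100.0 / n as f64
}
```

For a 1024-byte ciphertext, `base64_len` gives 1368 characters, about 33.6% more than the raw payload that Cap'n Proto would place on the wire as-is.
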
### Architecture

```
Server:  QUIC + Cap'n Proto (single protocol, no gateway)

Client SDKs:
  ┌─── Rust      quinn + capnp-rpc            (existing, reference impl)
  ├─── Go        quic-go + go-capnp           (native, high confidence)
  ├─── Python    aioquic + pycapnp            (native QUIC, manual framing)
  ├─── C/C++     msquic/ngtcp2 + capnproto    (reference impl, full RPC)
  └─── Browser   WebTransport + capnp (WASM)  (QUIC transport, no HTTP needed)

Crypto layer (client-side MLS, shared across all SDKs):
  ┌─── Rust crate    (native, existing)
  ├─── WASM module   (browsers, Node.js, Deno)
  └─── C FFI         (Swift, Kotlin, Python, Go via cgo)
```

### Language support reality check

| Language | QUIC | Cap'n Proto | RPC | Confidence |
|----------|------|-------------|-----|------------|
| **Rust** | quinn ✅ | capnp-rpc ✅ | Full ✅ | Existing |
| **Go** | quic-go ✅ | go-capnp ✅ | Level 1 ✅ | High |
| **Python** | aioquic ✅ | pycapnp ⚠️ | Manual framing | Medium |
| **C/C++** | msquic/ngtcp2 ✅ | capnproto ✅ | Full ✅ | High |
| **Browser** | WebTransport ✅ | WASM ✅ | Via WASM bridge | Medium |

### Implementation

- [ ] **3.1 Go SDK (`quicproquo-go`)**
  - Generate Go types: `capnp compile -ogo schemas/node.capnp`
  - QUIC transport: `quic-go` with TLS 1.3 + ALPN `"capnp"`
  - Cap'n Proto RPC framing over QUIC bidirectional stream
  - Auth context: bearer token + session management
  - Retry with exponential backoff (mirror Rust client pattern)
  - Publish: `go get git.xorwell.de/c/quicproquo-go`
  - Example: CLI client matching Rust feature set

- [ ] **3.2 Python SDK (`quicproquo-py`)**
  - QUIC transport: `aioquic` with custom Cap'n Proto stream handler
  - Cap'n Proto serialization: `pycapnp` for message types
  - Manual RPC framing: length-prefixed request/response over QUIC stream
  - Async/await API matching the Rust client patterns
  - Crypto: PyO3 bindings to `quicproquo-core` for MLS operations
  - Publish: PyPI `quicproquo`
  - Example: async bot client

- [ ] **3.3 C FFI layer (`quicproquo-ffi`)**
  - New crate in workspace: `crates/quicproquo-ffi`
  - `cbindgen` to generate `quicproquo.h` C header
  - Crypto functions: `qpc_identity_new()`, `qpc_group_create()`, `qpc_encrypt()`, `qpc_decrypt()`, `qpc_key_package_generate()`
  - Transport functions: `qpc_connect()`, `qpc_enqueue()`, `qpc_fetch()`, `qpc_fetch_wait()` (bundles QUIC + Cap'n Proto internally)
  - Memory: caller-allocated buffers with length, no ownership transfer
  - Builds as `libquicproquo.so` / `.dylib` / `.dll`
  - Swift and Kotlin wrapper examples using the C header

- [ ] **3.4 WASM compilation of `quicproquo-core`**
  - `wasm-pack build` target for browser + Node.js
  - Crypto-only: `GroupMember`, `IdentityKeypair`, `AppMessage`, `hybrid_encrypt/decrypt`, `generate_key_package`
  - Transport NOT included (browsers use WebTransport, see Phase 3.5)
  - Publish to npm: `@quicproquo/core`
  - TypeScript type definitions auto-generated via `wasm-bindgen`

- [ ] **3.5 WebTransport server endpoint**
  - Add HTTP/3 + WebTransport listener to server (same QUIC stack via quinn)
  - Cap'n Proto RPC framed over WebTransport bidirectional streams
  - Same auth, same storage, same RPC handlers — just a different stream source
  - Browsers connect via `new WebTransport("https://server:7443")`
  - ALPN negotiation: `"h3"` for WebTransport, `"capnp"` for native QUIC
  - Configurable port: `--webtransport-listen 0.0.0.0:7443`
  - Feature-flagged: `--features webtransport`

- [ ] **3.6 TypeScript/JavaScript SDK (`@quicproquo/client`)**
  - WebTransport for QUIC connectivity (no HTTP fallback)
  - WASM module (Phase 3.4) for MLS crypto
  - Cap'n Proto serialization via WASM bridge
  - Handles: auth flow, key upload, message send/receive, group management
  - Publish to npm: `@quicproquo/client`
  - Example: browser chat UI

- [ ] **3.7 SDK documentation and schema publishing**
  - Publish `.capnp` schemas as the canonical API contract
  - Document the QUIC + Cap'n Proto connection pattern for each language
  - Provide a "build your own SDK" guide (QUIC stream → Cap'n Proto RPC bootstrap)
  - Reference implementation checklist: connect, auth, upload key, enqueue, fetch

---

## Phase 4 — Trust & Security Infrastructure

Address the security gaps required for real-world deployment.

- [ ] **4.1 Third-party cryptographic audit**
  - Scope: MLS integration, OPAQUE flow, hybrid KEM, key lifecycle, zeroization
  - Firms: NCC Group, Trail of Bits, Cure53
  - Budget and timeline: typically 4–6 weeks, $50K–$150K
  - Publish report publicly (builds trust)

- [ ] **4.2 Key Transparency / revocation**
  - Replace `BasicCredential` with X.509-based MLS credentials
  - Or: verifiable key directory (Merkle tree, auditable log)
  - Users can verify peer keys haven't been substituted (MITM detection)
  - Revocation mechanism for compromised keys

- [ ] **4.3 Client authentication on Delivery Service**
  - Currently server trusts claimed identity key on enqueue
  - Bind enqueue operations to the authenticated session's identity key
  - Prevent: client A fetching/sending as client B's identity
  - Backward compat: sealed_sender mode for anonymous enqueue

- [ ] **4.4 M7 — Post-quantum MLS integration**
  - Integrate hybrid KEM (X25519 + ML-KEM-768) into the OpenMLS crypto provider
  - Group key material gets post-quantum confidentiality
  - Full test suite with PQ ciphersuite
  - Ref: existing `hybrid_kem.rs` and `hybrid_crypto.rs`

- [ ] **4.5 Username enumeration mitigation**
  - Constant-time or uniform response for unknown users during OPAQUE login
  - Prevent timing side-channels that reveal user existence

---

## Phase 5 — Features & UX

Make it a product people want to use.

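
Item 5.4 in the list below proposes `Edit` and `Delete` variants with client-side tombstones. A minimal sketch of how a client might apply them to local state follows. Only the two variant shapes come from this roadmap; the `Text` variant, the field types, `Entry`, and `Conversation` are assumptions for illustration, and checking that the editor is the original sender is deliberately left out.

```rust
use std::collections::BTreeMap;

/// Hypothetical subset of `AppMessage`; field types are assumed.
#[derive(Debug, Clone, PartialEq)]
enum AppMessage {
    Text { content: String },
    Edit { target_seq: u64, new_content: String },
    Delete { target_seq: u64 },
}

/// Local view of one stored message.
#[derive(Debug, Clone, PartialEq)]
enum Entry {
    Visible(String),
    Tombstone,
}

/// Client-side conversation state keyed by sequence number.
#[derive(Default)]
struct Conversation {
    entries: BTreeMap<u64, Entry>,
}

impl Conversation {
    /// Apply a decrypted message that arrived with sequence number `seq`.
    fn apply(&mut self, seq: u64, msg: AppMessage) {
        match msg {
            AppMessage::Text { content } => {
                self.entries.insert(seq, Entry::Visible(content));
            }
            AppMessage::Edit { target_seq, new_content } => {
                // Only messages that exist and are not deleted can be edited.
                if let Some(Entry::Visible(text)) = self.entries.get_mut(&target_seq) {
                    *text = new_content;
                }
            }
            AppMessage::Delete { target_seq } => {
                // Keep a tombstone so a late Edit cannot resurrect the message.
                self.entries.insert(target_seq, Entry::Tombstone);
            }
        }
    }
}
```

Because edits and deletions travel as ordinary encrypted application messages in this sketch, the server only ever sees ciphertext, which matches the "server doesn't know about edits" goal.
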
- [ ] **5.1 Multi-device support**
  - Account → multiple devices, each with own Ed25519 key + MLS KeyPackages
  - Device graph management (add device, remove device, list devices)
  - Messages delivered to all devices of a user
  - `device_id` field already in Auth struct — wire it through

- [ ] **5.2 Account recovery**
  - Recovery codes or backup key (encrypted, stored by user)
  - Option: server-assisted recovery with security questions (lower security)
  - MLS state re-establishment after device loss

- [ ] **5.3 Full MLS lifecycle**
  - Member removal (Remove proposal → Commit → fan-out)
  - Credential update (Update proposal for key rotation)
  - Explicit proposal handling (queue proposals, batch commit)
  - Group metadata (name, description, avatar hash)

- [ ] **5.4 Message editing and deletion**
  - New `AppMessage` variants: `Edit { target_seq, new_content }`, `Delete { target_seq }`
  - Client-side tombstones, server doesn't know about edits

- [ ] **5.5 File and media transfer**
  - Upload encrypted blob → get content hash
  - Share hash + symmetric key inside MLS message
  - Download by hash, decrypt client-side
  - Size limits, content-type validation

- [ ] **5.6 Abuse prevention and moderation**
  - Block user (client-side, suppress display)
  - Report message (encrypted report to admin key)
  - Admin tools: ban user, delete account, audit log

- [ ] **5.7 Offline message queue (client-side)**
  - Queue messages when disconnected, send on reconnect
  - Idempotent message IDs to prevent duplicates
  - Gap detection: compare local seq with server seq

---

## Phase 6 — Scale & Operations

Prepare for real traffic.

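
Item 6.1 below calls for a sliding window with configurable thresholds. The sketch here shows the sliding-window-log idea in a single process so the semantics are easy to test; it is not the server's current limiter, `SlidingWindow` is a hypothetical name, and a multi-node deployment would keep the same per-key log in the shared store the item mentions. Time is passed in explicitly, an assumption that keeps the example deterministic.

```rust
use std::collections::{HashMap, VecDeque};
use std::time::{Duration, Instant};

/// Sliding-window-log limiter: at most `limit` events per `window` per key.
struct SlidingWindow {
    limit: usize,
    window: Duration,
    events: HashMap<String, VecDeque<Instant>>,
}

impl SlidingWindow {
    fn new(limit: usize, window: Duration) -> Self {
        Self { limit, window, events: HashMap::new() }
    }

    /// Returns true and records the event if `key` is under its limit at `now`.
    fn allow(&mut self, key: &str, now: Instant) -> bool {
        let log = self.events.entry(key.to_owned()).or_default();
        // Drop timestamps that have slid out of the window.
        while let Some(&oldest) = log.front() {
            if now.duration_since(oldest) >= self.window {
                log.pop_front();
            } else {
                break;
            }
        }
        if log.len() < self.limit {
            log.push_back(now);
            true
        } else {
            false
        }
    }
}
```

Unlike a fixed per-minute counter, the window moves with each request, so a client cannot send a full burst at the end of one minute and another at the start of the next.
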
- [ ] **6.1 Distributed rate limiting**
  - Current: in-memory per-process, lost on restart
  - Move to Redis or shared state for multi-node deployments
  - Sliding window with configurable thresholds

- [ ] **6.2 Multi-node / horizontal scaling**
  - Stateless server design (already mostly there — state is in storage backend)
  - Shared PostgreSQL or CockroachDB backend (replace SQLite)
  - Message queue fan-out (Redis pub/sub or NATS for cross-node notification)
  - Load balancer health check via QUIC RPC `health()` or Prometheus `/metrics`

- [ ] **6.3 Operational runbook**
  - Backup / restore procedures (SQLCipher, file backend)
  - Key rotation (auth token, TLS cert, DB encryption key)
  - Incident response playbook
  - Scaling guide (when to add nodes, resource sizing)
  - Monitoring dashboard templates (Grafana + Prometheus)

- [ ] **6.4 Connection draining and graceful shutdown**
  - Stop accepting new connections on SIGTERM
  - Wait for in-flight RPCs (configurable timeout, default 30s)
  - Drain WebTransport sessions with close frame
  - Document expected behavior for load balancers (health → unhealthy first)

- [ ] **6.5 Request-level timeouts**
  - Per-RPC timeout (prevent slow clients from holding resources)
  - Database query timeout
  - Overall request deadline propagation

- [ ] **6.6 Observability enhancements**
  - Request correlation IDs (trace across RPC → storage)
  - Storage operation latency metrics
  - Per-endpoint latency histograms
  - Structured audit log to persistent storage (not just stdout)
  - OpenTelemetry integration

---

## Phase 7 — Platform Expansion & Research

Long-term vision for wide adoption.

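
Item 7.7 below lists padding messages to a uniform size. One common way to do this is to pad the plaintext to the next multiple of a bucket size before encryption, so ciphertext lengths reveal only the bucket. The layout below (4-byte big-endian length, payload, zero fill) and the names `pad` and `unpad` are assumptions for illustration, not a wire format this project has defined.

```rust
use std::convert::TryFrom;

/// Pad `msg` to the next multiple of `bucket` bytes.
/// Returns None if `bucket` is zero or the message length does not fit in u32.
fn pad(msg: &[u8], bucket: usize) -> Option<Vec<u8>> {
    if bucket == 0 {
        return None;
    }
    let len = u32::try_from(msg.len()).ok()?;
    let framed = 4 + msg.len();
    // Round the framed size up to a whole number of buckets.
    let total = (framed + bucket - 1) / bucket * bucket;
    let mut out = Vec::with_capacity(total);
    out.extend_from_slice(&len.to_be_bytes());
    out.extend_from_slice(msg);
    out.resize(total, 0);
    Some(out)
}

/// Recover the original payload; None if the buffer is truncated or malformed.
fn unpad(padded: &[u8]) -> Option<&[u8]> {
    let head = <[u8; 4]>::try_from(padded.get(..4)?).ok()?;
    let len = usize::try_from(u32::from_be_bytes(head)).ok()?;
    padded.get(4..)?.get(..len)
}
```

With a 256-byte bucket, a 2-byte message and a 200-byte message both become 256 bytes of plaintext, while a 253-byte message spills into the 512-byte bucket. Smaller buckets save bandwidth but leak more about message length.
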
- [ ] **7.1 Mobile clients (iOS + Android)**
  - Use C FFI (Phase 3.3) for crypto + transport (single library)
  - Push notifications via APNs / FCM (server sends notification on enqueue)
  - Background QUIC connection for message polling
  - Biometric auth for local key storage (Keychain / Android Keystore)

- [ ] **7.2 Web client (browser)**
  - Use WASM (Phase 3.4) for crypto
  - Use WebTransport (Phase 3.5) for native QUIC transport
  - Cap'n Proto via WASM bridge (Phase 3.6)
  - IndexedDB for local state persistence
  - Service Worker for background notifications
  - Progressive Web App (PWA) support

- [ ] **7.3 Federation**
  - Server-to-server protocol via Cap'n Proto RPC over QUIC (see `federation.capnp`)
  - `relayEnqueue`, `proxyFetchKeyPackage`, `federationHealth` methods
  - Identity resolution across federated servers
  - MLS group spanning multiple servers
  - Trust model for federated deployments

- [ ] **7.4 Sealed Sender**
  - Sender identity inside MLS ciphertext only (server can't see who sent)
  - Requires: sender certificate + encrypted sender proof
  - Ref: Signal's Sealed Sender design

- [ ] **7.5 Additional language SDKs**
  - Java/Kotlin: JNI bindings to C FFI (Phase 3.3) + native QUIC (netty-quic)
  - Swift: Swift wrapper over C FFI + Network.framework QUIC
  - Ruby: FFI bindings via `quicproquo-ffi`
  - Evaluate demand-driven — only build SDKs people request

- [ ] **7.6 P2P / NAT traversal**
  - Direct peer-to-peer via iroh (foundation exists in `quicproquo-p2p`)
  - Server as fallback relay only
  - Reduces latency and single-point-of-failure
  - Ref: `FUTURE-IMPROVEMENTS.md § 6.1`

- [ ] **7.7 Traffic analysis resistance**
  - Padding messages to uniform size
  - Decoy traffic to mask timing patterns
  - Optional Tor/I2P routing for IP privacy
  - Ref: `FUTURE-IMPROVEMENTS.md § 5.4, 6.3`

---

## Summary Timeline

| Phase | Focus | Estimated Effort |
|-------|-------|-----------------|
| **1** | Production Hardening | 1–2 days |
| **2** | Test & CI Maturity | 2–3 days |
| **3** | Client SDKs (Go, Python, WASM, FFI, WebTransport) | 5–8 days |
| **4** | Trust & Security Infrastructure | 2–4 days (excl. audit) |
| **5** | Features & UX | 5–7 days |
| **6** | Scale & Operations | 3–5 days |
| **7** | Platform Expansion & Research | ongoing |

---

## Related Documents

- [Future Improvements](docs/FUTURE-IMPROVEMENTS.md) — consolidated improvement list
- [Production Readiness Audit](docs/PRODUCTION-READINESS-AUDIT.md) — specific blockers
- [Security Audit](docs/SECURITY-AUDIT.md) — findings and recommendations
- [Milestone Tracker](docs/src/roadmap/milestones.md) — M1–M7 status
- [Auth, Devices, and Tokens](docs/src/roadmap/authz-plan.md) — authorization design
- [DM Channel Design](docs/src/roadmap/dm-channels.md) — 1:1 channel spec