# Roadmap — quicproquo

> From proof-of-concept to production-grade E2E encrypted messaging.
>
> Each phase is designed to be tackled sequentially. Items within a phase
> can be parallelised. Check the box when done.

---

## Phase 1 — Production Hardening (Critical)

Eliminate all crash paths, enforce secure defaults, fix deployment blockers.

- [ ] **1.1 Remove `.unwrap()` / `.expect()` from production paths**
  - Replace `AUTH_CONTEXT.read().expect()` in client RPC with proper `Result`
  - Replace `"0.0.0.0:0".parse().unwrap()` in client with fallible parse
  - Replace `Mutex::lock().unwrap()` in server storage with `.map_err()`
  - Audit: `grep -rn 'unwrap()\|expect(' crates/` outside `#[cfg(test)]`
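
As an illustration of item 1.1, the sketch below replaces a panicking parse and a panicking lock with `Result`-returning helpers. `StorageError`, `parse_bind_addr`, and `queue_depth` are hypothetical names for this example, not identifiers taken from the crates.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;
use std::sync::Mutex;

/// Hypothetical error type for this sketch; the real crates may already define one.
#[derive(Debug, PartialEq)]
pub enum StorageError {
    /// A thread panicked while holding the lock.
    LockPoisoned,
    /// The configured bind address could not be parsed.
    BadAddress(String),
}

/// Parse a bind address without panicking on malformed input.
pub fn parse_bind_addr(s: &str) -> Result<SocketAddr, StorageError> {
    s.parse::<SocketAddr>()
        .map_err(|e| StorageError::BadAddress(format!("{s}: {e}")))
}

/// Read the queue depth for a recipient, mapping a poisoned lock to an
/// error value instead of unwrapping it.
pub fn queue_depth(
    queues: &Mutex<HashMap<String, Vec<Vec<u8>>>>,
    recipient: &str,
) -> Result<usize, StorageError> {
    let guard = queues.lock().map_err(|_| StorageError::LockPoisoned)?;
    Ok(guard.get(recipient).map_or(0, Vec::len))
}
```

Callers can then propagate the error with `?` and turn it into an RPC error response rather than crashing the process.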

- [ ] **1.2 Enforce secure defaults in production mode**
  - Reject startup if `QPQ_PRODUCTION=true` and `auth_token` is empty or `"devtoken"`
  - Require non-empty `db_key` when using SQL backend in production
  - Refuse to auto-generate TLS certs in production mode (require existing cert+key)
  - Already partially implemented — verify and harden the validation in `config.rs`
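
The rules in item 1.2 could be checked in one place at startup, roughly as sketched here. The `Config` struct and its field names are assumptions made for the example and may not match what `config.rs` actually contains.

```rust
/// Hypothetical subset of the server configuration; field names are
/// assumptions and may not match `config.rs`.
#[derive(Debug, Clone, Default)]
pub struct Config {
    pub production: bool,      // e.g. derived from QPQ_PRODUCTION=true
    pub auth_token: String,
    pub sql_backend: bool,
    pub db_key: String,
    pub tls_cert_exists: bool, // whether cert and key files are already on disk
}

/// Collect every violated rule so the operator sees all problems at once.
pub fn validate_production(cfg: &Config) -> Result<(), Vec<String>> {
    if !cfg.production {
        return Ok(());
    }
    let mut errors = Vec::new();
    if cfg.auth_token.is_empty() || cfg.auth_token == "devtoken" {
        errors.push("auth_token must be set and must not be \"devtoken\"".to_string());
    }
    if cfg.sql_backend && cfg.db_key.is_empty() {
        errors.push("db_key must be non-empty for the SQL backend".to_string());
    }
    if !cfg.tls_cert_exists {
        errors.push("TLS cert and key must already exist in production".to_string());
    }
    if errors.is_empty() {
        Ok(())
    } else {
        Err(errors)
    }
}
```

Returning all violations together, rather than failing on the first, tends to shorten the fix-and-restart loop during deployment.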

- [ ] **1.3 Fix `.gitignore`**
  - Add `data/`, `*.der`, `*.pem`, `*.db`, `*.bin` (state files), `*.ks` (keystores)
  - Verify no secrets are already tracked: `git ls-files -- data/ '*.der' '*.pem' '*.db' '*.bin' '*.ks'`

- [ ] **1.4 Fix Dockerfile**
  - Sync workspace members with `Cargo.toml` (note that `quicproquo-p2p` is a member again as of F0, no longer excluded)
  - Create dedicated user/group instead of `nobody`
  - Set writable `QPQ_DATA_DIR` with correct permissions
  - Test: `docker build -t qpq-server . && docker run --rm qpq-server --help`

- [ ] **1.5 TLS certificate lifecycle**
  - Document CA-signed cert setup (Let's Encrypt / custom CA)
  - Add `--tls-required` flag that refuses to start without valid cert
  - Log clear warning when using self-signed certs
  - Document certificate rotation procedure

---

## Phase 2 — Test & CI Maturity

Build confidence before adding features.

- [ ] **2.1 Expand E2E test coverage**
  - Auth failure scenarios (wrong password, expired token, invalid token)
  - Message ordering verification (send N messages, verify seq numbers)
  - Concurrent clients (3+ members in group, simultaneous send/recv)
  - OPAQUE registration + login full flow
  - Queue full behavior (>1000 messages)
  - Rate limiting behavior (>100 enqueues/minute)
  - Reconnection after server restart
  - KeyPackage exhaustion (fetch when none available)

- [ ] **2.2 Add unit tests for untested paths**
  - Client retry logic (exponential backoff, jitter, retriable classification)
  - REPL input parsing edge cases (empty input, special characters, `/` commands)
  - State file encryption/decryption round-trip, plus decryption failure with a wrong password
  - Token cache expiry
  - Conversation store migrations
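
Retry logic becomes much easier to unit-test when the random jitter is passed in rather than generated inside the function. The following is a minimal sketch under that assumption; `Backoff` and its fields are hypothetical and the real client's policy may differ.

```rust
use std::time::Duration;

/// Hypothetical retry policy; the defaults in the real client may differ.
pub struct Backoff {
    pub base: Duration,
    pub max: Duration,
}

impl Backoff {
    /// Delay ceiling for retry number `attempt` (0-based):
    /// base * 2^attempt, capped at `max`.
    pub fn ceiling(&self, attempt: u32) -> Duration {
        let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
        self.base
            .checked_mul(factor)
            .unwrap_or(self.max)
            .min(self.max)
    }

    /// "Full jitter": scale the ceiling by a caller-supplied random value in
    /// [0.0, 1.0]. Taking the random value as a parameter keeps this
    /// function deterministic, so tests can pin it.
    pub fn delay(&self, attempt: u32, unit_random: f64) -> Duration {
        self.ceiling(attempt).mul_f64(unit_random.clamp(0.0, 1.0))
    }
}
```

In production code the `unit_random` argument would come from a random number generator; in tests it can be fixed to `0.0` and `1.0` to check both ends of the range.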

- [ ] **2.3 CI hardening**
  - Add `.github/CODEOWNERS` (crypto, auth, wire-format require 2 reviewers)
  - Ensure `cargo deny check` runs on every PR (already in CI — verify)
  - Add `cargo audit` as blocking check (already in CI — verify)
  - Add coverage reporting (tarpaulin or llvm-cov)
  - Add CI job for Docker build validation

- [ ] **2.4 Clean up build warnings**
  - Fix Cap'n Proto generated `unused_parens` warnings
  - Remove dead code / unused imports
  - Address `openmls` future-incompat warnings
  - Target: `cargo clippy --workspace -- -D warnings` passes clean

---

## Phase 3 — Client SDKs: Native QUIC + Cap'n Proto Everywhere

**No REST gateway. No protocol dilution.** The `.capnp` schemas are the
interface definition. Every SDK speaks native QUIC + Cap'n Proto. The
project name stays honest.

### Why this matters

The project's original name was **quic**n**proto**chat: the protocol IS the
product. Instead of adding an HTTP translation layer that loses zero-copy
performance and adds base64 overhead, we invest in making the native protocol
accessible from every language that has QUIC + Cap'n Proto support, and
provide WASM/FFI for the crypto layer.

### Architecture

```
  Server: QUIC + Cap'n Proto (single protocol, no gateway)

  Client SDKs:
    ┌─── Rust         quinn + capnp-rpc          (existing, reference impl)
    ├─── Go           quic-go + go-capnp          (native, high confidence)
    ├─── Python       aioquic + pycapnp            (native QUIC, manual framing)
    ├─── C/C++        msquic/ngtcp2 + capnproto    (reference impl, full RPC)
    └─── Browser      WebTransport + capnp (WASM)  (QUIC transport, no HTTP needed)

  Crypto layer (client-side MLS, shared across all SDKs):
    ┌─── Rust crate   (native, existing)
    ├─── WASM module  (browsers, Node.js, Deno)
    └─── C FFI        (Swift, Kotlin, Python, Go via cgo)
```

### Language support reality check

| Language | QUIC | Cap'n Proto | RPC | Confidence |
|----------|------|-------------|-----|------------|
| **Rust** | quinn ✅ | capnp-rpc ✅ | Full ✅ | Existing |
| **Go** | quic-go ✅ | go-capnp ✅ | Level 1 ✅ | High |
| **Python** | aioquic ✅ | pycapnp ⚠️ | Manual framing | Medium |
| **C/C++** | msquic/ngtcp2 ✅ | capnproto ✅ | Full ✅ | High |
| **Browser** | WebTransport ✅ | WASM ✅ | Via WASM bridge | Medium |

### Implementation

- [ ] **3.1 Go SDK (`quicproquo-go`)**
  - Generate Go types: `capnp compile -ogo schemas/node.capnp`
  - QUIC transport: `quic-go` with TLS 1.3 + ALPN `"capnp"`
  - Cap'n Proto RPC framing over QUIC bidirectional stream
  - Auth context: bearer token + session management
  - Retry with exponential backoff (mirror Rust client pattern)
  - Publish: `go get git.xorwell.de/c/quicproquo-go`
  - Example: CLI client matching Rust feature set

- [ ] **3.2 Python SDK (`quicproquo-py`)**
  - QUIC transport: `aioquic` with custom Cap'n Proto stream handler
  - Cap'n Proto serialization: `pycapnp` for message types
  - Manual RPC framing: length-prefixed request/response over QUIC stream
  - Async/await API matching the Rust client patterns
  - Crypto: PyO3 bindings to `quicproquo-core` for MLS operations
  - Publish: PyPI `quicproquo`
  - Example: async bot client
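
The length-prefixed framing mentioned for the Python SDK is language-independent, so it is sketched here in Rust to stay consistent with the reference client. The 4-byte big-endian prefix and the 1 MiB cap are assumptions for illustration; the roadmap does not fix either value.

```rust
use std::io::{self, Read, Write};

/// Upper bound on a single frame; an assumption for this sketch.
pub const MAX_FRAME_LEN: u32 = 1 << 20;

/// Write one frame: 4-byte big-endian length, then the payload.
pub fn write_frame<W: Write>(w: &mut W, payload: &[u8]) -> io::Result<()> {
    if payload.len() > MAX_FRAME_LEN as usize {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "frame too large"));
    }
    let len = payload.len() as u32;
    w.write_all(&len.to_be_bytes())?;
    w.write_all(payload)
}

/// Read one frame, rejecting oversized lengths before allocating.
pub fn read_frame<R: Read>(r: &mut R) -> io::Result<Vec<u8>> {
    let mut header = [0u8; 4];
    r.read_exact(&mut header)?;
    let len = u32::from_be_bytes(header);
    if len > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "frame too large"));
    }
    let mut payload = vec![0u8; len as usize];
    r.read_exact(&mut payload)?;
    Ok(payload)
}
```

A Python implementation would mirror the same byte layout, with the serialized Cap'n Proto message as the payload.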

- [ ] **3.3 C FFI layer (`quicproquo-ffi`)**
  - New crate in workspace: `crates/quicproquo-ffi`
  - `cbindgen` to generate `quicproquo.h` C header
  - Crypto functions: `qpc_identity_new()`, `qpc_group_create()`,
    `qpc_encrypt()`, `qpc_decrypt()`, `qpc_key_package_generate()`
  - Transport functions: `qpc_connect()`, `qpc_enqueue()`, `qpc_fetch()`,
    `qpc_fetch_wait()` (bundles QUIC + Cap'n Proto internally)
  - Memory: caller-allocated buffers with length, no ownership transfer
  - Builds as `libquicproquo.so` / `.dylib` / `.dll`
  - Swift and Kotlin wrapper examples using the C header
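
The "caller-allocated buffers with length, no ownership transfer" rule from item 3.3 might look like the sketch below. The status codes and `qpc_demo_fill` are hypothetical and only demonstrate the calling convention; they are not part of the planned header.

```rust
use std::os::raw::c_int;

/// Status codes for this sketch (hypothetical; not the real header).
pub const QPC_OK: c_int = 0;
pub const QPC_ERR_NULL: c_int = -1;
pub const QPC_ERR_BUFFER_TOO_SMALL: c_int = -2;

/// Copy `src` into a caller-allocated buffer.
///
/// `out_written` always receives the number of bytes required, so a caller
/// that gets `QPC_ERR_BUFFER_TOO_SMALL` can retry with a larger buffer.
///
/// # Safety
/// `out` must be valid for writes of `out_cap` bytes and `out_written`
/// must point to a valid `usize`.
unsafe fn copy_out(src: &[u8], out: *mut u8, out_cap: usize, out_written: *mut usize) -> c_int {
    if out.is_null() || out_written.is_null() {
        return QPC_ERR_NULL;
    }
    unsafe { *out_written = src.len() };
    if src.len() > out_cap {
        return QPC_ERR_BUFFER_TOO_SMALL;
    }
    unsafe { std::ptr::copy_nonoverlapping(src.as_ptr(), out, src.len()) };
    QPC_OK
}

/// Hypothetical exported function that writes a fixed demo payload.
/// A real export would also carry the no_mangle attribute so the symbol
/// name is stable for C callers.
///
/// # Safety
/// Same requirements as `copy_out`.
pub unsafe extern "C" fn qpc_demo_fill(
    out: *mut u8,
    out_cap: usize,
    out_written: *mut usize,
) -> c_int {
    unsafe { copy_out(b"demo-payload", out, out_cap, out_written) }
}
```

Because the library never allocates memory that the caller must free, Swift and Kotlin wrappers need no matching `free` function for these calls.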

- [ ] **3.4 WASM compilation of `quicproquo-core`**
  - `wasm-pack build` target for browser + Node.js
  - Crypto-only: `GroupMember`, `IdentityKeypair`, `AppMessage`,
    `hybrid_encrypt/decrypt`, `generate_key_package`
  - Transport NOT included (browsers use WebTransport, see Phase 3.5)
  - Publish to npm: `@quicproquo/core`
  - TypeScript type definitions auto-generated via `wasm-bindgen`

- [ ] **3.5 WebTransport server endpoint**
  - Add HTTP/3 + WebTransport listener to server (same QUIC stack via quinn)
  - Cap'n Proto RPC framed over WebTransport bidirectional streams
  - Same auth, same storage, same RPC handlers — just a different stream source
  - Browsers connect via `new WebTransport("https://server:7443")`
  - ALPN negotiation: `"h3"` for WebTransport, `"capnp"` for native QUIC
  - Configurable port: `--webtransport-listen 0.0.0.0:7443`
  - Feature-flagged: `--features webtransport`

- [ ] **3.6 TypeScript/JavaScript SDK (`@quicproquo/client`)**
  - WebTransport for QUIC connectivity (no HTTP fallback)
  - WASM module (Phase 3.4) for MLS crypto
  - Cap'n Proto serialization via WASM bridge
  - Handles: auth flow, key upload, message send/receive, group management
  - Publish to npm: `@quicproquo/client`
  - Example: browser chat UI

- [ ] **3.7 SDK documentation and schema publishing**
  - Publish `.capnp` schemas as the canonical API contract
  - Document the QUIC + Cap'n Proto connection pattern for each language
  - Provide a "build your own SDK" guide (QUIC stream → Cap'n Proto RPC bootstrap)
  - Reference implementation checklist: connect, auth, upload key, enqueue, fetch

---

## Phase 4 — Trust & Security Infrastructure

Address the security gaps required for real-world deployment.

- [ ] **4.1 Third-party cryptographic audit**
  - Scope: MLS integration, OPAQUE flow, hybrid KEM, key lifecycle, zeroization
  - Firms: NCC Group, Trail of Bits, Cure53
  - Timeline and budget: typically 4–6 weeks, $50K–$150K
  - Publish report publicly (builds trust)

- [ ] **4.2 Key Transparency / revocation**
  - Replace `BasicCredential` with X.509-based MLS credentials
  - Or: verifiable key directory (Merkle tree, auditable log)
  - Users can verify peer keys haven't been substituted (MITM detection)
  - Revocation mechanism for compromised keys

- [ ] **4.3 Client authentication on Delivery Service**
  - Currently the server trusts the claimed identity key on enqueue
  - Bind enqueue operations to the authenticated session's identity key
  - Prevent client A from fetching or sending under client B's identity
  - Backward compat: sealed_sender mode for anonymous enqueue

- [ ] **4.4 M7 — Post-quantum MLS integration**
  - Integrate hybrid KEM (X25519 + ML-KEM-768) into the OpenMLS crypto provider
  - Group key material gets post-quantum confidentiality
  - Full test suite with PQ ciphersuite
  - Ref: existing `hybrid_kem.rs` and `hybrid_crypto.rs`

- [ ] **4.5 Username enumeration mitigation**
  - Constant-time or uniform response for unknown users during OPAQUE login
  - Prevent timing side-channels that reveal user existence

---

## Phase 5 — Features & UX

Make it a product people want to use.

- [ ] **5.1 Multi-device support**
  - Account → multiple devices, each with own Ed25519 key + MLS KeyPackages
  - Device graph management (add device, remove device, list devices)
  - Messages delivered to all devices of a user
  - `device_id` field already in Auth struct — wire it through

- [ ] **5.2 Account recovery**
  - Recovery codes or backup key (encrypted, stored by user)
  - Option: server-assisted recovery with security questions (lower security)
  - MLS state re-establishment after device loss

- [ ] **5.3 Full MLS lifecycle**
  - Member removal (Remove proposal → Commit → fan-out)
  - Credential update (Update proposal for key rotation)
  - Explicit proposal handling (queue proposals, batch commit)
  - Group metadata (name, description, avatar hash)

- [ ] **5.4 Message editing and deletion**
  - New `AppMessage` variants: `Edit { target_seq, new_content }`, `Delete { target_seq }`
  - Client-side tombstones, server doesn't know about edits
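
One way the client could apply the proposed variants to its local history is sketched below. Only `Edit { target_seq, new_content }` and `Delete { target_seq }` come from the item above; the `Text` variant, the `Stored` struct, and `apply` are assumptions for the example, and sender authorization checks are deliberately left out.

```rust
use std::collections::BTreeMap;

/// Hypothetical shape of the message enum; the real `AppMessage` has more variants.
#[derive(Debug, Clone, PartialEq)]
pub enum AppMessage {
    Text { content: String },
    Edit { target_seq: u64, new_content: String },
    Delete { target_seq: u64 },
}

/// Local view of one stored message; a `None` body marks a tombstone.
#[derive(Debug, Clone, PartialEq)]
pub struct Stored {
    pub body: Option<String>,
    pub edited: bool,
}

/// Apply a decrypted message to the local history, keyed by sequence number.
pub fn apply(history: &mut BTreeMap<u64, Stored>, seq: u64, msg: AppMessage) {
    match msg {
        AppMessage::Text { content } => {
            history.insert(seq, Stored { body: Some(content), edited: false });
        }
        AppMessage::Edit { target_seq, new_content } => {
            // Ignore edits for unknown or already-deleted messages.
            if let Some(s) = history.get_mut(&target_seq) {
                if s.body.is_some() {
                    s.body = Some(new_content);
                    s.edited = true;
                }
            }
        }
        AppMessage::Delete { target_seq } => {
            // Keep a tombstone so a late-arriving edit cannot resurrect it.
            if let Some(s) = history.get_mut(&target_seq) {
                s.body = None;
            }
        }
    }
}
```

Since edits and deletes travel as ordinary encrypted application messages, the server sees nothing beyond another ciphertext.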

- [ ] **5.5 File and media transfer**
  - Upload encrypted blob → get content hash
  - Share hash + symmetric key inside MLS message
  - Download by hash, decrypt client-side
  - Size limits, content-type validation

- [ ] **5.6 Abuse prevention and moderation**
  - Block user (client-side, suppress display)
  - Report message (encrypted report to admin key)
  - Admin tools: ban user, delete account, audit log

- [ ] **5.7 Offline message queue (client-side)**
  - Queue messages when disconnected, send on reconnect
  - Idempotent message IDs to prevent duplicates
  - Gap detection: compare local seq with server seq
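
The offline queue and gap detection from item 5.7 can be sketched with standard collections. `Outbox`, the `u128` message ID, and the `first` parameter of `missing_seqs` are assumptions for the example; the roadmap does not specify an ID format or where sequence numbers start.

```rust
use std::collections::{BTreeSet, HashSet, VecDeque};

/// Client-side outbox keyed by a caller-chosen message ID (hypothetical type).
#[derive(Default)]
pub struct Outbox {
    pending: VecDeque<(u128, Vec<u8>)>,
    known_ids: HashSet<u128>,
}

impl Outbox {
    /// Queue a message while offline; returns false if this ID was already queued.
    pub fn push(&mut self, id: u128, payload: Vec<u8>) -> bool {
        if !self.known_ids.insert(id) {
            return false;
        }
        self.pending.push_back((id, payload));
        true
    }

    /// Hand over everything queued, in order, for sending on reconnect.
    /// IDs stay in `known_ids`, so a repeated submit of the same ID is ignored.
    pub fn drain(&mut self) -> Vec<(u128, Vec<u8>)> {
        self.pending.drain(..).collect()
    }
}

/// Sequence numbers in `first..=server_latest` that are missing locally.
pub fn missing_seqs(local: &BTreeSet<u64>, first: u64, server_latest: u64) -> Vec<u64> {
    (first..=server_latest)
        .filter(|s| !local.contains(s))
        .collect()
}
```

The server would still need to deduplicate by the same ID, since a message may have been delivered just before the connection dropped.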

---

## Phase 6 — Scale & Operations

Prepare for real traffic.

- [ ] **6.1 Distributed rate limiting**
  - Current: in-memory per-process, lost on restart
  - Move to Redis or shared state for multi-node deployments
  - Sliding window with configurable thresholds
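
For reference, the sliding-window logic itself is small. The sketch below is an in-process version with hypothetical names, which is exactly the limitation item 6.1 describes; a multi-node deployment would keep the same algorithm but hold the timestamps in shared storage.

```rust
use std::collections::{HashMap, VecDeque};

/// In-memory sliding-window limiter (hypothetical; state is lost on restart).
pub struct SlidingWindow {
    limit: usize,
    window_secs: u64,
    hits: HashMap<String, VecDeque<u64>>,
}

impl SlidingWindow {
    pub fn new(limit: usize, window_secs: u64) -> Self {
        Self { limit, window_secs, hits: HashMap::new() }
    }

    /// Record a hit at `now_secs` and report whether it is allowed.
    /// Time is passed in so tests do not depend on the wall clock.
    pub fn allow(&mut self, key: &str, now_secs: u64) -> bool {
        let q = self.hits.entry(key.to_string()).or_default();
        // Drop timestamps that have left the window.
        while let Some(&t) = q.front() {
            if now_secs.saturating_sub(t) >= self.window_secs {
                q.pop_front();
            } else {
                break;
            }
        }
        if q.len() >= self.limit {
            return false;
        }
        q.push_back(now_secs);
        true
    }
}
```

Rejected requests are not recorded, so a client that keeps hammering the limiter does not extend its own lockout.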

- [ ] **6.2 Multi-node / horizontal scaling**
  - Stateless server design (already mostly there — state is in storage backend)
  - Shared PostgreSQL or CockroachDB backend (replace SQLite)
  - Message queue fan-out (Redis pub/sub or NATS for cross-node notification)
  - Load balancer health check via QUIC RPC `health()` or Prometheus `/metrics`

- [ ] **6.3 Operational runbook**
  - Backup / restore procedures (SQLCipher, file backend)
  - Key rotation (auth token, TLS cert, DB encryption key)
  - Incident response playbook
  - Scaling guide (when to add nodes, resource sizing)
  - Monitoring dashboard templates (Grafana + Prometheus)

- [ ] **6.4 Connection draining and graceful shutdown**
  - Stop accepting new connections on SIGTERM
  - Wait for in-flight RPCs (configurable timeout, default 30s)
  - Drain WebTransport sessions with close frame
  - Document expected behavior for load balancers (health → unhealthy first)

- [ ] **6.5 Request-level timeouts**
  - Per-RPC timeout (prevent slow clients from holding resources)
  - Database query timeout
  - Overall request deadline propagation

- [ ] **6.6 Observability enhancements**
  - Request correlation IDs (trace across RPC → storage)
  - Storage operation latency metrics
  - Per-endpoint latency histograms
  - Structured audit log to persistent storage (not just stdout)
  - OpenTelemetry integration

---

## Phase 7 — Platform Expansion & Research

Long-term vision for wide adoption.

- [ ] **7.1 Mobile clients (iOS + Android)**
  - Use C FFI (Phase 3.3) for crypto + transport (single library)
  - Push notifications via APNs / FCM (server sends notification on enqueue)
  - Background QUIC connection for message polling
  - Biometric auth for local key storage (Keychain / Android Keystore)

- [ ] **7.2 Web client (browser)**
  - Use WASM (Phase 3.4) for crypto
  - Use WebTransport (Phase 3.5) for native QUIC transport
  - Cap'n Proto via WASM bridge (Phase 3.6)
  - IndexedDB for local state persistence
  - Service Worker for background notifications
  - Progressive Web App (PWA) support

- [ ] **7.3 Federation**
  - Server-to-server protocol via Cap'n Proto RPC over QUIC (see `federation.capnp`)
  - `relayEnqueue`, `proxyFetchKeyPackage`, `federationHealth` methods
  - Identity resolution across federated servers
  - MLS group spanning multiple servers
  - Trust model for federated deployments

- [ ] **7.4 Sealed Sender**
  - Sender identity inside MLS ciphertext only (server can't see who sent)
  - Requires: sender certificate + encrypted sender proof
  - Ref: Signal's Sealed Sender design

- [ ] **7.5 Additional language SDKs**
  - Java/Kotlin: JNI bindings to C FFI (Phase 3.3) + native QUIC (Netty QUIC codec)
  - Swift: Swift wrapper over C FFI + Network.framework QUIC
  - Ruby: FFI bindings via `quicproquo-ffi`
  - Evaluate by demand: only build SDKs that people request

- [ ] **7.6 P2P / NAT traversal**
  - Direct peer-to-peer via iroh (foundation exists in `quicproquo-p2p`)
  - Server as fallback relay only
  - Reduces latency and removes the server as a single point of failure
  - Ref: `FUTURE-IMPROVEMENTS.md § 6.1`

- [ ] **7.7 Traffic analysis resistance**
  - Padding messages to uniform size
  - Decoy traffic to mask timing patterns
  - Optional Tor/I2P routing for IP privacy
  - Ref: `FUTURE-IMPROVEMENTS.md § 5.4, 6.3`

---

## Phase 8 — Freifunk / Community Mesh Networking

Make qpq a first-class citizen on decentralised, community-operated wireless
networks (Freifunk, BATMAN-adv/Babel routing, OpenWrt). Multiple qpq nodes form
a federated mesh; clients auto-discover nearby nodes via mDNS; the network
functions without any central infrastructure or internet uplink.

### Architecture

```
  Client A ─── mDNS discovery ──► nearby qpq node (LAN / mesh)
                                        │
                               Cap'n Proto federation
                                        │
                               remote qpq node (across mesh)
```

- [x] **F0 — Re-include `quicproquo-p2p` in workspace; fix ALPN strings**
  - Moved `crates/quicproquo-p2p` from `exclude` back into `[workspace] members`
  - Fixed ALPN `b"quicnprotochat/p2p/1"` → `b"quicproquo/p2p/1"` (breaking wire change)
  - Fixed federation ALPN `b"qnpc-fed"` → `b"quicproquo/federation/1"`
  - Feature-gated behind `--features mesh` on client (keeps iroh out of default builds)

- [x] **F1 — Federation routing in message delivery**
  - `handle_enqueue` and `handle_batch_enqueue` call `federation::routing::resolve_destination()`
  - Recipients with a remote home server are relayed via `FederationClient::relay_enqueue()`
  - mTLS mutual authentication between nodes (both present client certs, validated against shared CA)
  - Config: `QPQ_FEDERATION_LISTEN`, `QPQ_LOCAL_DOMAIN`, `QPQ_FEDERATION_CERT/KEY/CA`

- [x] **F2 — mDNS local peer discovery**
  - Server announces `_quicproquo._udp.local.` on startup via `mdns-sd`
  - Client: `MeshDiscovery::start()` browses for nearby nodes (feature-gated)
  - REPL commands: `/mesh peers` (scan + list), `/mesh server <host:port>` (note address)
  - Nodes announce: `ver=1`, `server=<host:port>`, `domain=<local_domain>` TXT records

- [ ] **F3 — Self-sovereign mesh identity**
  - Keypair = identity; OPAQUE password auth becomes optional (opt-in for managed deployments)
  - `--mesh` startup mode: no AS required, nodes accept any verifiable keypair
  - Bootstrap trust via out-of-band key fingerprint exchange (QR code or short code)

- [ ] **F4 — Store-and-forward with TTL**
  - Add `ttl_secs: u32` to `Envelope` in `node.capnp`
  - Relay nodes hold messages for offline peers up to TTL, then discard
  - Gossip-style propagation: each hop decrements a hop counter
  - Enables asynchronous messaging across intermittently connected mesh segments
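
The relay decision for store-and-forward could be expressed as a pure function, as in this sketch. Only `ttl_secs: u32` comes from the item above; the other `Envelope` fields, `RelayDecision`, and `decide` are hypothetical, and the real schema lives in `node.capnp`.

```rust
/// Hypothetical in-memory view of a relayed envelope.
#[derive(Debug, Clone, PartialEq)]
pub struct Envelope {
    pub ttl_secs: u32,    // from the roadmap item
    pub hops_left: u8,    // assumed hop counter
    pub received_at: u64, // local receive time, in seconds
    pub payload: Vec<u8>,
}

#[derive(Debug, PartialEq)]
pub enum RelayDecision {
    /// Send this updated envelope to the next hop.
    Forward(Envelope),
    /// Keep it until the peer comes online or the TTL expires.
    Hold,
    /// TTL expired or hop budget exhausted.
    Discard,
}

pub fn decide(env: &Envelope, now: u64, peer_online: bool) -> RelayDecision {
    let age = now.saturating_sub(env.received_at);
    if age >= u64::from(env.ttl_secs) || env.hops_left == 0 {
        return RelayDecision::Discard;
    }
    if !peer_online {
        return RelayDecision::Hold;
    }
    let mut next = env.clone();
    next.hops_left -= 1;
    // age < ttl_secs here, so the cast cannot truncate.
    next.ttl_secs -= age as u32;
    next.received_at = now;
    RelayDecision::Forward(next)
}
```

Reducing the TTL by the time spent in the local queue keeps the total lifetime bounded across hops without requiring synchronized clocks.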

- [ ] **F5 — Lightweight broadcast channels**
  - No MLS overhead; symmetric group key distributed out-of-band
  - Gossip delivery: node broadcasts to all peers, peers re-broadcast once
  - Loop prevention via bloom filter on seen message IDs
  - Suitable for community bulletin boards, emergency broadcasts on mesh
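
Loop prevention with a bloom filter on seen message IDs might be sketched as follows, using only the standard library. `SeenFilter` and its parameters are hypothetical. Note that a bloom filter can report false positives, so a small fraction of new messages would be wrongly treated as already seen and not re-broadcast.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Fixed-size bloom filter over message IDs (hypothetical sketch).
/// `DefaultHasher` is fine in-process, but its output is not guaranteed
/// stable across Rust versions, so the bit array must not be shared
/// between nodes.
pub struct SeenFilter {
    bits: Vec<u64>,
    num_bits: usize,
    num_hashes: u64,
}

impl SeenFilter {
    pub fn new(num_bits: usize, num_hashes: u64) -> Self {
        let num_bits = num_bits.max(64);
        Self {
            bits: vec![0; (num_bits + 63) / 64],
            num_bits,
            num_hashes: num_hashes.max(1),
        }
    }

    /// Derive the i-th bit index by hashing the index together with the ID.
    fn index(&self, id: &[u8], i: u64) -> usize {
        let mut h = DefaultHasher::new();
        i.hash(&mut h);
        id.hash(&mut h);
        (h.finish() % self.num_bits as u64) as usize
    }

    /// Returns true if `id` was (probably) seen before, and marks it as
    /// seen either way. A node re-broadcasts only when this returns false.
    pub fn check_and_insert(&mut self, id: &[u8]) -> bool {
        let mut seen = true;
        for i in 0..self.num_hashes {
            let idx = self.index(id, i);
            let (word, bit) = (idx / 64, 1u64 << (idx % 64));
            if self.bits[word] & bit == 0 {
                seen = false;
                self.bits[word] |= bit;
            }
        }
        seen
    }
}
```

Because bits are never cleared, a long-running node would need to rotate or reset the filter periodically to keep the false-positive rate low.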

- [ ] **F6 — Extended `/mesh` REPL commands**
  - `/mesh dm <fingerprint>` — direct message to peer by key fingerprint (P2P path)
  - `/mesh broadcast <channel>` — publish to a symmetric broadcast channel
  - `/mesh auto` — auto-select server with lowest RTT from discovered peers
  - Auto-reconnect: if current server unreachable, fall back to next discovered peer

- [ ] **F7 — OpenWrt cross-compilation guide**
  - Musl static builds: `x86_64-unknown-linux-musl`, `armv7-unknown-linux-musleabihf`, `mips-unknown-linux-musl`
  - Strip binary: `--release` + `strip` → target size < 5 MB for flash storage
  - `opkg` package manifest for OpenWrt feed
  - `procd` init script + `uci` config file for OpenWrt integration
  - CI job: cross-compile and size-check on every release tag

- [ ] **F8 — Traffic analysis resistance for mesh**
  - Uniform message padding to nearest 256-byte boundary (hides message size)
  - Configurable decoy traffic rate (fake messages to mask send timing)
  - Optional onion routing: 3-hop relay through other mesh nodes (no Tor dependency)
  - Ref: Phase 7.7 for server-side traffic analysis resistance
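
Padding to a 256-byte boundary needs some way to recover the original length. The sketch below assumes a 4-byte big-endian length prefix followed by zero fill; that layout is an illustration, not a format defined by the project. Padding would have to be applied before encryption so the prefix itself is not visible on the wire.

```rust
/// Padding granularity from the roadmap item.
pub const BLOCK: usize = 256;

/// Frame as: 4-byte big-endian length, payload, zero padding up to the
/// next multiple of `BLOCK`. Returns `None` if the payload is too large
/// for the 32-bit length prefix.
pub fn pad(payload: &[u8]) -> Option<Vec<u8>> {
    if payload.len() > u32::MAX as usize {
        return None;
    }
    let unpadded = payload.len().checked_add(4)?;
    let total = unpadded.checked_add(BLOCK - 1)? / BLOCK * BLOCK;
    let mut out = Vec::with_capacity(total);
    out.extend_from_slice(&(payload.len() as u32).to_be_bytes());
    out.extend_from_slice(payload);
    out.resize(total, 0);
    Some(out)
}

/// Recover the payload; returns `None` for malformed input.
pub fn unpad(padded: &[u8]) -> Option<&[u8]> {
    if padded.len() < 4 || padded.len() % BLOCK != 0 {
        return None;
    }
    let len = u32::from_be_bytes([padded[0], padded[1], padded[2], padded[3]]) as usize;
    let end = 4usize.checked_add(len)?;
    padded.get(4..end)
}
```

With this layout a 2-byte message and a 200-byte message both produce 256 bytes, while a 253-byte message spills into a second block and produces 512.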

---

## Summary Timeline

| Phase | Focus | Estimated Effort |
|-------|-------|-----------------|
| **1** | Production Hardening | 1–2 days |
| **2** | Test & CI Maturity | 2–3 days |
| **3** | Client SDKs (Go, Python, WASM, FFI, WebTransport) | 5–8 days |
| **4** | Trust & Security Infrastructure | 2–4 days (excl. audit) |
| **5** | Features & UX | 5–7 days |
| **6** | Scale & Operations | 3–5 days |
| **7** | Platform Expansion & Research | ongoing |
| **8** | Freifunk / Community Mesh Networking | not yet estimated |

---

## Related Documents

- [Future Improvements](docs/FUTURE-IMPROVEMENTS.md) — consolidated improvement list
- [Production Readiness Audit](docs/PRODUCTION-READINESS-AUDIT.md) — specific blockers
- [Security Audit](docs/SECURITY-AUDIT.md) — findings and recommendations
- [Milestone Tracker](docs/src/roadmap/milestones.md) — M1–M7 status
- [Auth, Devices, and Tokens](docs/src/roadmap/authz-plan.md) — authorization design
- [DM Channel Design](docs/src/roadmap/dm-channels.md) — 1:1 channel spec