quicproquo/ROADMAP.md
Christian Nennemann 4694a3098b docs: comprehensive update for sprints 1-9
Update README, ROADMAP, and mdBook to reflect all sprint deliverables:
rich messaging, file transfer, disappearing messages, Go/TypeScript SDKs,
C FFI, mesh networking (identity, store-and-forward, broadcast), and
security hardening. Add 6 new mdBook guides (REPL reference, Go SDK,
TypeScript SDK + browser demo, rich messaging, file transfer, mesh
networking). Check off 16 completed ROADMAP items across phases 3-9.
2026-03-04 02:10:20 +01:00

# Roadmap — quicproquo
> From proof-of-concept to production-grade E2E encrypted messaging.
>
> Each phase is designed to be tackled sequentially. Items within a phase
> can be parallelised. Check the box when done.
---
## Phase 1 — Production Hardening (Critical)
Eliminate all crash paths, enforce secure defaults, fix deployment blockers.
- [ ] **1.1 Remove `.unwrap()` / `.expect()` from production paths**
- Replace `AUTH_CONTEXT.read().expect()` in client RPC with proper `Result`
- Replace `"0.0.0.0:0".parse().unwrap()` in client with fallible parse
- Replace `Mutex::lock().unwrap()` in server storage with `.map_err()`
- Audit: `grep -rn 'unwrap()\|expect(' crates/` outside `#[cfg(test)]`
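For item 1.1, the two replacement patterns (a fallible address parse and a lock that reports poisoning instead of panicking) could look roughly like the sketch below. The helper names, the `u64` payload, and the `String` error type are placeholders for illustration, not the project's actual API.

```rust
use std::net::SocketAddr;
use std::sync::Mutex;

/// Parse a bind address without panicking on malformed input.
/// `parse_bind_addr` is a hypothetical helper name.
fn parse_bind_addr(s: &str) -> Result<SocketAddr, String> {
    s.parse::<SocketAddr>()
        .map_err(|e| format!("invalid bind address {s:?}: {e}"))
}

/// Read a value behind a mutex, turning lock poisoning into an error
/// instead of a panic. The `u64` stands in for real storage state.
fn read_guarded(state: &Mutex<u64>) -> Result<u64, String> {
    let guard = state
        .lock()
        .map_err(|_| "storage mutex poisoned".to_string())?;
    Ok(*guard)
}
```

Callers then propagate the error with `?` and decide at the RPC boundary how to report it.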
- [ ] **1.2 Enforce secure defaults in production mode**
- Reject startup if `QPQ_PRODUCTION=true` and `auth_token` is empty or `"devtoken"`
- Require non-empty `db_key` when using SQL backend in production
- Refuse to auto-generate TLS certs in production mode (require existing cert+key)
- Already partially implemented — verify and harden the validation in `config.rs`
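A minimal sketch of the startup checks in item 1.2, assuming a simplified `Config` struct. The field names here are illustrative and may not match what `config.rs` actually defines.

```rust
/// Hypothetical subset of the server configuration.
struct Config {
    production: bool,      // e.g. derived from QPQ_PRODUCTION=true
    auth_token: String,
    sql_backend: bool,
    db_key: String,
    tls_cert_exists: bool, // cert + key already present on disk
}

/// Collect every violated production rule so the operator sees them all at once.
fn validate_production(cfg: &Config) -> Result<(), Vec<&'static str>> {
    if !cfg.production {
        return Ok(()); // development mode: no extra requirements
    }
    let mut errors = Vec::new();
    if cfg.auth_token.is_empty() || cfg.auth_token == "devtoken" {
        errors.push("auth_token must be set to a non-default value");
    }
    if cfg.sql_backend && cfg.db_key.is_empty() {
        errors.push("db_key is required for the SQL backend");
    }
    if !cfg.tls_cert_exists {
        errors.push("TLS cert and key must already exist (no auto-generation)");
    }
    if errors.is_empty() { Ok(()) } else { Err(errors) }
}
```

Returning all violations together, rather than failing on the first, keeps the fix-restart loop short for operators.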
- [ ] **1.3 Fix `.gitignore`**
- Add `data/`, `*.der`, `*.pem`, `*.db`, `*.bin` (state files), `*.ks` (keystores)
- Verify no secrets are already tracked: `git ls-files data/ *.der *.db`
- [ ] **1.4 Fix Dockerfile**
- Sync workspace members (handle excluded `p2p` crate)
- Create dedicated user/group instead of `nobody`
- Set writable `QPQ_DATA_DIR` with correct permissions
- Test: `docker build . && docker run --rm -it qpq-server --help`
- [ ] **1.5 TLS certificate lifecycle**
- Document CA-signed cert setup (Let's Encrypt / custom CA)
- Add `--tls-required` flag that refuses to start without valid cert
- Log clear warning when using self-signed certs
- Document certificate rotation procedure
---
## Phase 2 — Test & CI Maturity
Build confidence before adding features.
- [ ] **2.1 Expand E2E test coverage**
- Auth failure scenarios (wrong password, expired token, invalid token)
- Message ordering verification (send N messages, verify seq numbers)
- Concurrent clients (3+ members in group, simultaneous send/recv)
- OPAQUE registration + login full flow
- Queue full behavior (>1000 messages)
- Rate limiting behavior (>100 enqueues/minute)
- Reconnection after server restart
- KeyPackage exhaustion (fetch when none available)
- [ ] **2.2 Add unit tests for untested paths**
- Client retry logic (exponential backoff, jitter, retriable classification)
- REPL input parsing edge cases (empty input, special characters, `/` commands)
- State file encryption/decryption round-trip with bad password
- Token cache expiry
- Conversation store migrations
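To make the retry logic in item 2.2 unit-testable, one option is to pass the jitter in as a parameter so tests can pin it. The sketch below assumes a "half to full" jitter scheme; the function name, the per-mille parameter, and the scheme itself are assumptions, not a description of the existing client.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): `base * 2^attempt`,
/// capped at `max`, then scaled into the range [50%, 100%] by the jitter.
/// `jitter_permille` is a caller-supplied random value in 0..=1000.
fn backoff_delay(attempt: u32, base: Duration, max: Duration, jitter_permille: u32) -> Duration {
    let exponential = base.saturating_mul(2u32.saturating_pow(attempt));
    let capped = exponential.min(max);
    let scale = 500 + jitter_permille.min(1000) / 2; // 500..=1000
    capped.saturating_mul(scale) / 1000
}
```

In production the caller would draw `jitter_permille` from an RNG; in tests it is fixed, which makes the cap and the doubling easy to assert.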
- [ ] **2.3 CI hardening**
- Add `.github/CODEOWNERS` (crypto, auth, wire-format require 2 reviewers)
- Ensure `cargo deny check` runs on every PR (already in CI — verify)
- Add `cargo audit` as blocking check (already in CI — verify)
- Add coverage reporting (tarpaulin or llvm-cov)
- Add CI job for Docker build validation
- [ ] **2.4 Clean up build warnings**
- Fix Cap'n Proto generated `unused_parens` warnings
- Remove dead code / unused imports
- Address `openmls` future-incompat warnings
- Target: `cargo clippy --workspace -- -D warnings` passes clean
---
## Phase 3 — Client SDKs: Native QUIC + Cap'n Proto Everywhere
**No REST gateway. No protocol dilution.** The `.capnp` schemas are the
interface definition. Every SDK speaks native QUIC + Cap'n Proto. The
project name stays honest.
### Why this matters
The name is **quic**n**proto**chat — the protocol IS the product. Instead
of adding an HTTP translation layer that loses zero-copy performance and
adds base64 overhead, we invest in making the native protocol accessible
from every language that has QUIC + Cap'n Proto support, and provide
WASM/FFI for the crypto layer.
### Architecture
```
Server: QUIC + Cap'n Proto (single protocol, no gateway)
Client SDKs:
┌─── Rust quinn + capnp-rpc (existing, reference impl)
├─── Go quic-go + go-capnp (native, high confidence)
├─── Python aioquic + pycapnp (native QUIC, manual framing)
├─── C/C++ msquic/ngtcp2 + capnproto (reference impl, full RPC)
└─── Browser WebTransport + capnp (WASM) (QUIC transport, no HTTP needed)
Crypto layer (client-side MLS, shared across all SDKs):
┌─── Rust crate (native, existing)
├─── WASM module (browsers, Node.js, Deno)
└─── C FFI (Swift, Kotlin, Python, Go via cgo)
```
### Language support reality check
| Language | QUIC | Cap'n Proto | RPC | Confidence |
|----------|------|-------------|-----|------------|
| **Rust** | quinn ✅ | capnp-rpc ✅ | Full ✅ | Existing |
| **Go** | quic-go ✅ | go-capnp ✅ | Level 1 ✅ | High |
| **Python** | aioquic ✅ | pycapnp ⚠️ | Manual framing | Medium |
| **C/C++** | msquic/ngtcp2 ✅ | capnproto ✅ | Full ✅ | High |
| **Browser** | WebTransport ✅ | WASM ✅ | Via WASM bridge | Medium |
### Implementation
- [x] **3.1 Go SDK (`quicproquo-go`)**
- Generated Go types from `node.capnp` (6487-line codegen, all 24 RPC methods)
- QUIC transport via `quic-go` with TLS 1.3 + ALPN `"capnp"`
- High-level `qpq` package: Connect, Health, ResolveUser, CreateChannel, Send/SendWithTTL, Receive/ReceiveWait, DeleteAccount, OPAQUE auth
- Example CLI in `sdks/go/cmd/example/`
- [ ] **3.2 Python SDK (`quicproquo-py`)**
- QUIC transport: `aioquic` with custom Cap'n Proto stream handler
- Cap'n Proto serialization: `pycapnp` for message types
- Manual RPC framing: length-prefixed request/response over QUIC stream
- Async/await API matching the Rust client patterns
- Crypto: PyO3 bindings to `quicproquo-core` for MLS operations
- Publish: PyPI `quicproquo`
- Example: async bot client
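The manual framing in item 3.2 is shown here in Rust for reference, since any SDK would have to match it byte for byte. The 4-byte big-endian length prefix is an assumption; the roadmap does not fix the prefix width or byte order.

```rust
use std::convert::{TryFrom, TryInto};

/// Prefix `payload` with its length as a 4-byte big-endian integer.
/// Returns `None` if the payload does not fit in a `u32` length.
fn encode_frame(payload: &[u8]) -> Option<Vec<u8>> {
    let len = u32::try_from(payload.len()).ok()?;
    let mut out = Vec::with_capacity(4 + payload.len());
    out.extend_from_slice(&len.to_be_bytes());
    out.extend_from_slice(payload);
    Some(out)
}

/// Try to split one complete frame off the front of `buf`.
/// Returns the payload and the number of bytes consumed,
/// or `None` if more data has to arrive first.
fn decode_frame(buf: &[u8]) -> Option<(&[u8], usize)> {
    let header: [u8; 4] = buf.get(..4)?.try_into().ok()?;
    let len = u32::from_be_bytes(header) as usize;
    let end = 4usize.checked_add(len)?;
    let payload = buf.get(4..end)?;
    Some((payload, end))
}
```

A real decoder would also enforce a maximum frame size before buffering, so a hostile peer cannot announce a multi-gigabyte frame.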
- [x] **3.3 C FFI layer (`quicproquo-ffi`)**
- `crates/quicproquo-ffi` with 7 extern "C" functions: connect, login, send, receive, disconnect, last_error, free_string
- Builds as `libquicproquo_ffi.so` / `.dylib` / `.dll`
- Python ctypes wrapper in `examples/python/qpq_client.py`
- [x] **3.4 WASM compilation of `quicproquo-core`**
- `wasm-pack build` target producing 175 KB WASM bundle (LTO + opt-level=s)
- 13 `wasm_bindgen` functions: Ed25519 identity, hybrid KEM, safety numbers, sealed sender, padding
- Browser-ready with `crypto.getRandomValues()` RNG
- Published as `sdks/typescript/wasm-crypto/`
- [ ] **3.5 WebTransport server endpoint**
- Add HTTP/3 + WebTransport listener to server (same QUIC stack via quinn)
- Cap'n Proto RPC framed over WebTransport bidirectional streams
- Same auth, same storage, same RPC handlers — just a different stream source
- Browsers connect via `new WebTransport("https://server:7443")`
- ALPN negotiation: `"h3"` for WebTransport, `"capnp"` for native QUIC
- Configurable port: `--webtransport-listen 0.0.0.0:7443`
- Feature-flagged: `--features webtransport`
- [x] **3.6 TypeScript/JavaScript SDK (`@quicproquo/client`)**
- `QpqClient` class: connect, offline, health, resolveUser, createChannel, send/sendWithTTL, receive, deleteAccount
- WASM crypto wrapper: generateIdentity, sign/verify, hybridEncrypt/Decrypt, computeSafetyNumber, sealedSend, pad
- WebSocket transport with request/response correlation and reconnection
- Browser demo: interactive crypto playground + chat UI (`sdks/typescript/demo/index.html`)
- [ ] **3.7 SDK documentation and schema publishing**
- Publish `.capnp` schemas as the canonical API contract
- Document the QUIC + Cap'n Proto connection pattern for each language
- Provide a "build your own SDK" guide (QUIC stream → Cap'n Proto RPC bootstrap)
- Reference implementation checklist: connect, auth, upload key, enqueue, fetch
---
## Phase 4 — Trust & Security Infrastructure
Address the security gaps required for real-world deployment.
- [ ] **4.1 Third-party cryptographic audit**
- Scope: MLS integration, OPAQUE flow, hybrid KEM, key lifecycle, zeroization
- Firms: NCC Group, Trail of Bits, Cure53
- Budget and timeline: typically 4-6 weeks, $50K-$150K
- Publish report publicly (builds trust)
- [ ] **4.2 Key Transparency / revocation**
- Replace `BasicCredential` with X.509-based MLS credentials
- Or: verifiable key directory (Merkle tree, auditable log)
- Users can verify peer keys haven't been substituted (MITM detection)
- Revocation mechanism for compromised keys
- [x] **4.3 Client authentication on Delivery Service**
- DS sender identity binding with explicit audit logging
- `sender_prefix` tracking in enqueue/batch_enqueue RPCs
- Sender identity derived from authenticated session
- [ ] **4.4 M7 — Post-quantum MLS integration**
- Integrate hybrid KEM (X25519 + ML-KEM-768) into the OpenMLS crypto provider
- Group key material gets post-quantum confidentiality
- Full test suite with PQ ciphersuite
- Ref: existing `hybrid_kem.rs` and `hybrid_crypto.rs`
- [x] **4.5 Username enumeration mitigation**
- 5 ms timing floor on `resolveUser` responses
- Rate limiting to prevent bulk enumeration attacks
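The timing floor in item 4.5 can be illustrated with a small wrapper that pads every lookup to a minimum duration. This is a blocking, std-only sketch; the server itself is presumably async and would use its runtime's sleep instead, and `with_timing_floor` is a hypothetical name.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Run `lookup`, then sleep until at least `floor` has elapsed in total,
/// so "user exists" and "user missing" take the same minimum time.
fn with_timing_floor<T>(floor: Duration, lookup: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let result = lookup();
    // `checked_sub` yields `None` when the lookup already took longer than the floor.
    if let Some(remaining) = floor.checked_sub(start.elapsed()) {
        thread::sleep(remaining);
    }
    result
}
```

A floor only hides differences smaller than itself; lookups that can exceed 5 ms still leak timing, which is why it is paired with rate limiting.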
---
## Phase 5 — Features & UX
Make it a product people want to use.
- [ ] **5.1 Multi-device support**
- Account → multiple devices, each with own Ed25519 key + MLS KeyPackages
- Device graph management (add device, remove device, list devices)
- Messages delivered to all devices of a user
- `device_id` field already in Auth struct — wire it through
- [ ] **5.2 Account recovery**
- Recovery codes or backup key (encrypted, stored by user)
- Option: server-assisted recovery with security questions (lower security)
- MLS state re-establishment after device loss
- [ ] **5.3 Full MLS lifecycle**
- Member removal (Remove proposal → Commit → fan-out)
- Credential update (Update proposal for key rotation)
- Explicit proposal handling (queue proposals, batch commit)
- Group metadata (name, description, avatar hash)
- [x] **5.4 Message editing and deletion**
- `Edit` (0x06) and `Delete` (0x07) message types in `AppMessage`
- `/edit <index> <text>` and `/delete <index>` REPL commands (own messages only)
- Database update/removal on incoming edit/delete
- [x] **5.5 File and media transfer**
- `uploadBlob` / `downloadBlob` RPCs with 256 KB chunked streaming
- SHA-256 content-addressable storage with hash verification
- `FileRef` (0x08) message type with blob_id, filename, file_size, mime_type
- `/send-file <path>` and `/download <index>` REPL commands with progress bars
- 50 MB max file size, automatic MIME detection via `mime_guess`
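The chunking arithmetic behind item 5.5 is sketched below, assuming "KB" and "MB" mean 1024-based units (the roadmap does not say). `chunk_ranges` is a hypothetical helper, not an existing function.

```rust
const CHUNK_SIZE: u64 = 256 * 1024;          // 256 KB per chunk (assumed 1024-based)
const MAX_FILE_SIZE: u64 = 50 * 1024 * 1024; // 50 MB upload limit (assumed 1024-based)

/// Split a file of `file_size` bytes into `(offset, length)` chunk ranges.
/// Returns `None` when the file exceeds the size limit.
fn chunk_ranges(file_size: u64) -> Option<Vec<(u64, u64)>> {
    if file_size > MAX_FILE_SIZE {
        return None;
    }
    let mut ranges = Vec::new();
    let mut offset = 0;
    while offset < file_size {
        // The last chunk may be shorter than CHUNK_SIZE.
        let len = CHUNK_SIZE.min(file_size - offset);
        ranges.push((offset, len));
        offset += len;
    }
    Some(ranges)
}
```

Under these assumptions a maximum-size file is exactly 200 chunks, which also bounds the number of progress-bar updates.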
- [ ] **5.6 Abuse prevention and moderation**
- Block user (client-side, suppress display)
- Report message (encrypted report to admin key)
- Admin tools: ban user, delete account, audit log
- [ ] **5.7 Offline message queue (client-side)**
- Queue messages when disconnected, send on reconnect
- Idempotent message IDs to prevent duplicates
- Gap detection: compare local seq with server seq
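For item 5.7, the client-side pieces could be as small as an outbox that refuses duplicate IDs plus a gap check against the server's sequence number. The `u128` message IDs and all names below are placeholders; the server would still need to deduplicate on the same ID for delivery to be idempotent end to end.

```rust
use std::collections::{HashSet, VecDeque};
use std::ops::RangeInclusive;

/// Client-side outbox: messages wait here while disconnected.
#[derive(Default)]
struct Outbox {
    pending: VecDeque<(u128, Vec<u8>)>,
    seen: HashSet<u128>,
}

impl Outbox {
    /// Queue a message; returns `false` if this ID was already queued.
    fn push(&mut self, id: u128, body: Vec<u8>) -> bool {
        if !self.seen.insert(id) {
            return false;
        }
        self.pending.push_back((id, body));
        true
    }

    /// Hand back everything in FIFO order, to be sent on reconnect.
    fn drain(&mut self) -> Vec<(u128, Vec<u8>)> {
        self.pending.drain(..).collect()
    }
}

/// Sequence numbers the client still has to fetch, if any.
fn missing_seqs(local_seq: u64, server_seq: u64) -> Option<RangeInclusive<u64>> {
    (server_seq > local_seq).then(|| local_seq + 1..=server_seq)
}
```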
---
## Phase 6 — Scale & Operations
Prepare for real traffic.
- [ ] **6.1 Distributed rate limiting**
- Current: in-memory per-process, lost on restart
- Move to Redis or shared state for multi-node deployments
- Sliding window with configurable thresholds
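The sliding-window algorithm named in item 6.1 is sketched here in-process. A Redis-backed version would keep the same per-key timestamp list in shared storage; this sketch only shows the window logic, and the struct and method names are assumptions.

```rust
use std::collections::VecDeque;

/// Sliding-window limiter over millisecond timestamps supplied by the caller.
struct SlidingWindow {
    window_ms: u64,
    limit: usize,
    hits: VecDeque<u64>,
}

impl SlidingWindow {
    fn new(window_ms: u64, limit: usize) -> Self {
        SlidingWindow { window_ms, limit, hits: VecDeque::new() }
    }

    /// Record a request at `now_ms` if the window has room; report whether it was allowed.
    fn allow(&mut self, now_ms: u64) -> bool {
        // Drop hits that have aged out of the window.
        while let Some(&oldest) = self.hits.front() {
            if now_ms.saturating_sub(oldest) >= self.window_ms {
                self.hits.pop_front();
            } else {
                break;
            }
        }
        if self.hits.len() < self.limit {
            self.hits.push_back(now_ms);
            true
        } else {
            false
        }
    }
}
```

For the limit mentioned in Phase 2 (100 enqueues per minute) this would be `SlidingWindow::new(60_000, 100)` per client.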
- [ ] **6.2 Multi-node / horizontal scaling**
- Stateless server design (already mostly there — state is in storage backend)
- Shared PostgreSQL or CockroachDB backend (replace SQLite)
- Message queue fan-out (Redis pub/sub or NATS for cross-node notification)
- Load balancer health check via QUIC RPC `health()` or Prometheus `/metrics`
- [ ] **6.3 Operational runbook**
- Backup / restore procedures (SQLCipher, file backend)
- Key rotation (auth token, TLS cert, DB encryption key)
- Incident response playbook
- Scaling guide (when to add nodes, resource sizing)
- Monitoring dashboard templates (Grafana + Prometheus)
- [ ] **6.4 Connection draining and graceful shutdown**
- Stop accepting new connections on SIGTERM
- Wait for in-flight RPCs (configurable timeout, default 30s)
- Drain WebTransport sessions with close frame
- Document expected behavior for load balancers (health → unhealthy first)
- [ ] **6.5 Request-level timeouts**
- Per-RPC timeout (prevent slow clients from holding resources)
- Database query timeout
- Overall request deadline propagation
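One way to read "deadline propagation" in item 6.5: the RPC handler fixes an absolute deadline once, and every sub-operation (for example a database query) gets the smaller of its own timeout and whatever is left. The `Deadline` type below is a hypothetical sketch of that idea.

```rust
use std::time::{Duration, Instant};

/// Absolute point in time by which the whole request must finish.
#[derive(Clone, Copy)]
struct Deadline(Instant);

impl Deadline {
    fn after(timeout: Duration) -> Self {
        Deadline(Instant::now() + timeout)
    }

    /// Time left, or `None` once the deadline has passed.
    fn remaining(&self) -> Option<Duration> {
        self.0
            .checked_duration_since(Instant::now())
            .filter(|left| !left.is_zero())
    }

    /// Budget for one step: its own timeout, clipped to the overall deadline.
    fn budget(&self, step_timeout: Duration) -> Option<Duration> {
        self.remaining().map(|left| left.min(step_timeout))
    }
}
```

A handler that gets `None` from `budget` can fail fast instead of starting work it cannot finish.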
- [ ] **6.6 Observability enhancements**
- Request correlation IDs (trace across RPC → storage)
- Storage operation latency metrics
- Per-endpoint latency histograms
- Structured audit log to persistent storage (not just stdout)
- OpenTelemetry integration
---
## Phase 7 — Platform Expansion & Research
Long-term vision for wide adoption.
- [ ] **7.1 Mobile clients (iOS + Android)**
- Use C FFI (Phase 3.3) for crypto + transport (single library)
- Push notifications via APNs / FCM (server sends notification on enqueue)
- Background QUIC connection for message polling
- Biometric auth for local key storage (Keychain / Android Keystore)
- [ ] **7.2 Web client (browser)**
- Use WASM (Phase 3.4) for crypto
- Use WebTransport (Phase 3.5) for native QUIC transport
- Cap'n Proto via WASM bridge (Phase 3.6)
- IndexedDB for local state persistence
- Service Worker for background notifications
- Progressive Web App (PWA) support
- [ ] **7.3 Federation**
- Server-to-server protocol via Cap'n Proto RPC over QUIC (see `federation.capnp`)
- `relayEnqueue`, `proxyFetchKeyPackage`, `federationHealth` methods
- Identity resolution across federated servers
- MLS group spanning multiple servers
- Trust model for federated deployments
- [x] **7.4 Sealed Sender**
- Sender identity inside MLS ciphertext only (server can't see who sent)
- `sealed_sender` module in quicproquo-core with seal/unseal API
- WASM-accessible via `wasm_bindgen` for browser use
- [ ] **7.5 Additional language SDKs**
- Java/Kotlin: JNI bindings to C FFI (Phase 3.3) + native QUIC (netty-quic)
- Swift: Swift wrapper over C FFI + Network.framework QUIC
- Ruby: FFI bindings via `quicproquo-ffi`
- Evaluate demand-driven — only build SDKs people request
- [ ] **7.6 P2P / NAT traversal**
- Direct peer-to-peer via iroh (foundation exists in `quicproquo-p2p`)
- Server as fallback relay only
- Reduces latency and single-point-of-failure
- Ref: `FUTURE-IMPROVEMENTS.md § 6.1`
- [ ] **7.7 Traffic analysis resistance**
- Padding messages to uniform size
- Decoy traffic to mask timing patterns
- Optional Tor/I2P routing for IP privacy
- Ref: `FUTURE-IMPROVEMENTS.md § 5.4, 6.3`
---
## Phase 8 — Freifunk / Community Mesh Networking
Make qpq a first-class citizen on decentralised, community-operated wireless
networks (Freifunk, BATMAN-adv/Babel routing, OpenWrt). Multiple qpq nodes form
a federated mesh; clients auto-discover nearby nodes via mDNS; the network
functions without any central infrastructure or internet uplink.
### Architecture
```
Client A ─── mDNS discovery ──► nearby qpq node (LAN / mesh)
                                        │
                              Cap'n Proto federation
                                        │
                                        ▼
                               remote qpq node (across mesh)
```
- [x] **F0 — Re-include `quicproquo-p2p` in workspace; fix ALPN strings**
- Moved `crates/quicproquo-p2p` from `exclude` back into `[workspace] members`
- Fixed ALPN `b"quicnprotochat/p2p/1"` → `b"quicproquo/p2p/1"` (breaking wire change)
- Fixed federation ALPN `b"qnpc-fed"` → `b"quicproquo/federation/1"`
- Feature-gated behind `--features mesh` on client (keeps iroh out of default builds)
- [x] **F1 — Federation routing in message delivery**
- `handle_enqueue` and `handle_batch_enqueue` call `federation::routing::resolve_destination()`
- Recipients with a remote home server are relayed via `FederationClient::relay_enqueue()`
- mTLS mutual authentication between nodes (both present client certs, validated against shared CA)
- Config: `QPQ_FEDERATION_LISTEN`, `QPQ_LOCAL_DOMAIN`, `QPQ_FEDERATION_CERT/KEY/CA`
- [x] **F2 — mDNS local peer discovery**
- Server announces `_quicproquo._udp.local.` on startup via `mdns-sd`
- Client: `MeshDiscovery::start()` browses for nearby nodes (feature-gated)
- REPL commands: `/mesh peers` (scan + list), `/mesh server <host:port>` (note address)
- Nodes announce: `ver=1`, `server=<host:port>`, `domain=<local_domain>` TXT records
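A client consuming the F2 announcement has to turn the TXT entries into something typed. The sketch below parses plain `key=value` strings and does not use the `mdns-sd` API at all; the struct, the function name, and the sample host are assumptions for illustration.

```rust
/// Fields a node announces in its mDNS TXT records (item F2).
#[derive(Debug, PartialEq)]
struct NodeAnnouncement {
    ver: u32,
    server: String,
    domain: String,
}

/// Parse `key=value` TXT entries. Unknown keys are ignored;
/// a missing or malformed required key yields `None`.
fn parse_txt(entries: &[&str]) -> Option<NodeAnnouncement> {
    let (mut ver, mut server, mut domain) = (None, None, None);
    for entry in entries {
        if let Some((key, value)) = entry.split_once('=') {
            match key {
                "ver" => ver = value.parse::<u32>().ok(),
                "server" => server = Some(value.to_string()),
                "domain" => domain = Some(value.to_string()),
                _ => {} // forward compatibility: skip keys we do not know
            }
        }
    }
    Some(NodeAnnouncement { ver: ver?, server: server?, domain: domain? })
}
```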
- [x] **F3 — Self-sovereign mesh identity**
- Ed25519 keypair-based identity independent of AS registration
- JSON-persisted seed + known peers directory
- Sign/verify operations for mesh authenticity (`crates/quicproquo-p2p/src/identity.rs`)
- [x] **F4 — Store-and-forward with TTL**
- `MeshEnvelope` with TTL-based expiry, hop_count tracking, max_hops routing limit
- SHA-256 deduplication ID prevents relay loops
- Ed25519 signature verification on envelopes
- `MeshStore` in-memory queue with per-recipient capacity limits and TTL-based GC
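The relay decision in F4 combines three checks: expiry, hop limit, and deduplication. The sketch below uses a simplified envelope with a precomputed dedup ID and an absolute expiry time; the real `MeshEnvelope` layout may differ, and signature verification is left out entirely.

```rust
use std::collections::HashSet;

/// Simplified stand-in for `MeshEnvelope` (payload and signature omitted).
struct Envelope {
    dedup_id: [u8; 32], // SHA-256 deduplication ID, computed elsewhere
    expires_at: u64,    // absolute expiry in seconds (assumed TTL representation)
    hop_count: u8,
    max_hops: u8,
}

/// Decide whether to relay `env`. On acceptance, bump the hop count
/// and remember the ID so a looping copy is dropped next time.
fn accept_for_relay(env: &mut Envelope, now: u64, seen: &mut HashSet<[u8; 32]>) -> bool {
    if now >= env.expires_at || env.hop_count >= env.max_hops {
        return false; // expired, or out of hops
    }
    if !seen.insert(env.dedup_id) {
        return false; // already relayed: this is how loops are broken
    }
    env.hop_count += 1;
    true
}
```

The `seen` set itself needs TTL-based cleanup in practice, otherwise it grows without bound on a long-running node.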
- [x] **F5 — Lightweight broadcast channels**
- Symmetric ChaCha20-Poly1305 encrypted channels (no MLS overhead)
- Topic-based pub/sub via `BroadcastChannel` and `BroadcastManager`
- Subscribe/unsubscribe, create, publish API on `P2pNode`
- [x] **F6 — Extended `/mesh` REPL commands**
- `/mesh send <peer_id> <msg>` — direct P2P message via iroh
- `/mesh broadcast <topic> <msg>` — publish to broadcast channel
- `/mesh subscribe <topic>` — join broadcast channel
- `/mesh route` — show routing table
- `/mesh identity` — show mesh identity info
- `/mesh store` — show store-and-forward statistics
- [ ] **F7 — OpenWrt cross-compilation guide**
- Musl static builds: `x86_64-unknown-linux-musl`, `armv7-unknown-linux-musleabihf`, `mips-unknown-linux-musl`
- Strip binary: `--release` + `strip` → target size < 5 MB for flash storage
- `opkg` package manifest for OpenWrt feed
- `procd` init script + `uci` config file for OpenWrt integration
- CI job: cross-compile and size-check on every release tag
- [ ] **F8 — Traffic analysis resistance for mesh**
- Uniform message padding to nearest 256-byte boundary (hides message size)
- Configurable decoy traffic rate (fake messages to mask send timing)
- Optional onion routing: 3-hop relay through other mesh nodes (no Tor dependency)
- Ref: Phase 7.7 for server-side traffic analysis resistance
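Padding to a 256-byte boundary (F8) needs some way to recover the original length. The sketch below assumes a 4-byte big-endian length prefix followed by zero fill; the roadmap only fixes the boundary, so the layout and function names are assumptions.

```rust
use std::convert::{TryFrom, TryInto};

const BLOCK: usize = 256;

/// Pad `msg` to a multiple of 256 bytes: [4-byte length][msg][zero fill].
fn pad(msg: &[u8]) -> Option<Vec<u8>> {
    let len = u32::try_from(msg.len()).ok()?;
    let unpadded = msg.len().checked_add(4)?;
    // Round up to the next multiple of BLOCK.
    let total = unpadded.checked_add(BLOCK - 1)? / BLOCK * BLOCK;
    let mut out = Vec::with_capacity(total);
    out.extend_from_slice(&len.to_be_bytes());
    out.extend_from_slice(msg);
    out.resize(total, 0);
    Some(out)
}

/// Recover the original message from a padded buffer.
fn unpad(padded: &[u8]) -> Option<&[u8]> {
    let header: [u8; 4] = padded.get(..4)?.try_into().ok()?;
    let len = u32::from_be_bytes(header) as usize;
    let end = 4usize.checked_add(len)?;
    padded.get(4..end)
}
```

Padding must happen before encryption so the ciphertext length, not just the plaintext, lands on the boundary.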
---
## Phase 9 — Developer Experience & Community Growth
Features designed to attract contributors, create demo/showcase potential,
and lower the barrier to entry for non-crypto developers.
- [ ] **9.1 Criterion Benchmark Suite (`qpq-bench`)**
- Criterion benchmarks for all crypto primitives: hybrid KEM encap/decap,
MLS group-add at 10/100/1000 members, epoch rotation, Noise_XX handshake
- CI publishes HTML benchmark reports as GitHub Actions artifacts
- Citable numbers — no other project benchmarks MLS + PQ-KEM in Rust
- [x] **9.2 Safety Numbers (key verification)**
- 60-digit numeric code derived from two identity keys (Signal-style)
- `/verify <username>` REPL command for out-of-band verification
- Available in WASM via `compute_safety_number` binding
- [ ] **9.3 Full-Screen TUI (Ratatui + Crossterm)**
- `qpq tui` launches a full-screen terminal UI: message pane, input bar,
channel sidebar with unread counts, MLS epoch indicator
- Feature-gated `--features tui` to keep ratatui/crossterm out of default builds
- Existing REPL and CLI subcommands are unaffected
- [ ] **9.4 Delivery Proof Canary Tokens**
- Server signs `Ed25519(SHA-256(message_id || recipient || timestamp))` on enqueue
- Sender stores proof locally — cryptographic evidence the server queued the message
- Cap'n Proto schema gains optional `deliveryProof: Data` on enqueue response
- [ ] **9.5 Verifiable Transcript Archive**
- `GroupMember::export_transcript(path, password)` writes encrypted, tamper-evident
message archive (CBOR records, Argon2id + ChaCha20-Poly1305, Merkle chain)
- `qpq export verify` CLI command independently verifies chain integrity
- Useful for legal discovery, audit, or personal backup
- [ ] **9.6 Key Transparency (Merkle-Log Identity Binding)**
- Append-only Merkle log of (username, identity_key) bindings in the AS
- Clients receive inclusion proofs alongside key fetches
- Any client can independently audit the full identity history
- Lightweight subset of RFC 9162 adapted for identity keys
- [x] **9.7 Dynamic Server Plugin System**
- Server loads `.so`/`.dylib` plugins at runtime via `--plugin-dir`
- C-compatible `HookVTable` via `extern "C"` — plugins in any language
- 6 hook points: on_message_enqueue, on_batch_enqueue, on_auth, on_channel_created, on_fetch, on_user_registered
- Example plugins: logging plugin, rate limit plugin (512 KiB payload enforcement)
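To give a feel for the C-compatible hook table in item 9.7, here is a one-hook sketch. The field, the hook signature, and the "non-zero rejects" convention are all invented for illustration; the real `HookVTable` has six hook points and may look quite different.

```rust
/// Hypothetical one-entry hook table with a C-compatible layout.
#[repr(C)]
struct HookVTable {
    /// Called before a message is queued; a non-zero return rejects it.
    on_message_enqueue: Option<extern "C" fn(payload: *const u8, len: usize) -> i32>,
}

const MAX_PAYLOAD: usize = 512 * 1024; // 512 KiB, as in the example rate limit plugin

/// Plugin side: reject oversized payloads without reading them.
extern "C" fn limit_payload(_payload: *const u8, len: usize) -> i32 {
    if len > MAX_PAYLOAD { 1 } else { 0 }
}

/// Server side: run the hook if the plugin installed one.
fn run_enqueue_hook(vtable: &HookVTable, payload: &[u8]) -> bool {
    match vtable.on_message_enqueue {
        Some(hook) => hook(payload.as_ptr(), payload.len()) == 0,
        None => true, // no hook installed: allow
    }
}
```

Using `Option<extern "C" fn ...>` maps a missing hook to a null function pointer on the C side, so plugins only fill in the hooks they care about.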
- [ ] **9.8 PQ Noise Transport Layer**
- Hybrid `Noise_XX + ML-KEM-768` handshake for post-quantum transport security
- Closes the harvest-now-decrypt-later gap on handshake metadata (ADR-006)
- Feature-gated `--features pq-noise`; classical Noise_XX default preserved
- May require extending or forking `snow` crate's `CryptoResolver`
---
## Summary Timeline
| Phase | Focus | Estimated Effort |
|-------|-------|-----------------|
| **1** | Production Hardening | 1-2 days |
| **2** | Test & CI Maturity | 2-3 days |
| **3** | Client SDKs (Go, Python, WASM, FFI, WebTransport) | 5-8 days |
| **4** | Trust & Security Infrastructure | 2-4 days (excl. audit) |
| **5** | Features & UX | 5-7 days |
| **6** | Scale & Operations | 3-5 days |
| **7** | Platform Expansion & Research | ongoing |
| **8** | Freifunk / Community Mesh | ongoing |
| **9** | Developer Experience & Community Growth | 3-5 days |
---
## Related Documents
- [Future Improvements](docs/FUTURE-IMPROVEMENTS.md) — consolidated improvement list
- [Production Readiness Audit](docs/PRODUCTION-READINESS-AUDIT.md) — specific blockers
- [Security Audit](docs/SECURITY-AUDIT.md) — findings and recommendations
- [Milestone Tracker](docs/src/roadmap/milestones.md) — M1-M7 status
- [Auth, Devices, and Tokens](docs/src/roadmap/authz-plan.md) — authorization design
- [DM Channel Design](docs/src/roadmap/dm-channels.md) — 1:1 channel spec