chore: fix all clippy warnings across workspace
380
docs/V2-DESIGN-ANALYSIS.md
Normal file
@@ -0,0 +1,380 @@
# quicproquo v2: Design Analysis & Recommendations

> Multi-perspective retrospective of the v1 architecture.
> Produced 2026-03-04 by four parallel analysis agents examining server,
> client/UX, crypto/security, and project structure/DX.

---

## Executive Summary

quicproquo v1 demonstrates strong fundamentals: QUIC-native transport, RFC 9420
MLS group encryption, post-quantum hybrid KEM, OPAQUE zero-knowledge auth, and a
working multi-language SDK surface. These are the right bets and put the project
ahead of most open-source messengers on the crypto front.

However, three architectural choices limit the path to production:

1. **capnp-rpc is `!Send`**: forces single-threaded RPC handling, blocking
   scalability.
2. **Monolithic client with global state**: business logic is tangled into the
   REPL, duplicated across TUI/GUI/Web, and cannot be used as a library.
3. **Poll-based delivery**: 1-second polling wastes bandwidth and adds latency;
   no server-push channel exists.

A v2 should keep the crypto stack (MLS + hybrid PQ KEM + OPAQUE), keep QUIC, but
rearchitect the RPC layer, extract an SDK crate, and add push-based delivery.

---

## Part 1: What Works Well

### Transport & Protocol
- **QUIC (quinn) + TLS 1.3**: correct choice. Built-in encryption, connection
  migration, 0-RTT potential. No reason to change.
- **Cap'n Proto schemas as API contract**: zero-copy wire format, compact
  binary, schema evolution via ordinals. The *schemas* are good; the *RPC
  runtime* is the problem.

### Cryptography
- **MLS (RFC 9420, openmls)**: only IETF-standard group E2E protocol. No
  realistic alternative for groups > 2 members. Test suite is thorough (1005
  lines covering 2-party, 3-party, hybrid, removal, leave, stale epoch).
- **Hybrid PQ KEM (X25519 + ML-KEM-768)**: forward-thinking dual-algorithm
  protection. Well-implemented with versioned wire format, proper zeroization,
  and 12 targeted tests. Ahead of Signal (PQXDH, late 2023) and Matrix (no PQ).
- **OPAQUE (RFC 9807)**: server never sees passwords. Ristretto255 + Argon2id
  is best-in-class.
- **Sealed sender, safety numbers, message padding**: all clean, simple,
  correct. Safety numbers use 5200 HMAC-SHA256 iterations, matching the
  iteration count of Signal's fingerprint derivation.
- **Zeroization discipline**: secrets wrapped in `Zeroizing`, Debug impls
  redact keys, no `.unwrap()` in crypto paths.
- **WASM feature gating**: `core/native` cleanly separates WASM-safe crypto
  from native-only modules (MLS, OPAQUE, filesystem).

### Server Design
- **Store trait abstraction**: 30+ methods, clean backend swap (SqlStore vs
  FileBackedStore). Well-factored.
- **OPAQUE auth with timing floors**: `resolveUser`/`resolveIdentity` mask
  lookup timing to prevent username enumeration.
- **Delivery proofs**: Ed25519-signed receipt of server acceptance. Clients get
  cryptographic evidence.
- **`wasNew` flag on createChannel**: elegantly solves the dual-MLS-group race
  condition where both DM parties try to initialize.
- **Plugin hooks (C-ABI)**: `#![no_std]` vtable, zero dependencies, chained
  hooks with continue/reject protocol. Clean extensibility.
- **Production config validation**: enforces encrypted storage, strong auth
  tokens, pre-existing TLS certs.
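
The timing-floor idea behind `resolveUser`/`resolveIdentity` can be sketched as follows. This is a minimal, std-only illustration and not the server's actual code: the helper name `with_timing_floor` is hypothetical, and an async server would presumably await a timer instead of blocking the thread.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Runs `lookup`, then waits until at least `floor` has elapsed, so that
/// "user found" and "user missing" take a similar amount of wall-clock time.
/// Hypothetical helper; blocking sleep is used only to keep the sketch std-only.
pub fn with_timing_floor<T>(floor: Duration, lookup: impl FnOnce() -> T) -> T {
    let started = Instant::now();
    let result = lookup();
    // `checked_sub` returns None when the lookup already took longer than the floor.
    if let Some(remaining) = floor.checked_sub(started.elapsed()) {
        thread::sleep(remaining);
    }
    result
}
```

A floor only hides timing differences that are smaller than the floor itself, so it would need to sit above the slowest expected lookup.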

### Client & DX
- **Zero-config local dev**: `qpq --username alice --password pass` auto-starts
  server, generates TLS certs, registers, and logs in. Genuinely excellent.
- **Encrypted-at-rest everything**: state file (QPCE), conversation DB
  (SQLCipher), session cache. Argon2id + ChaCha20-Poly1305 throughout.
- **Playbook system**: YAML-scripted command execution with assertions. Great
  for CI/integration testing.
- **Conversation store**: SQLite with deduplication, outbox for offline
  queuing, activity tracking.
- **Conventional commits, GPG-signed**: consistent `feat:`/`fix:`/`docs:`
  discipline.
- **Security lints enforced by build**: `clippy::unwrap_used = "deny"`,
  `unsafe_code = "warn"`.

---

## Part 2: What Needs Rethinking

### 2.1 RPC Layer: capnp-rpc is the #1 Scalability Bottleneck

**Problem:** `capnp-rpc` uses `Rc` internally and is `!Send`. Everything runs on
a `LocalSet` with `spawn_local`. All 27 RPC methods serialize through a single
thread. No work-stealing, no multi-core utilization.

**Impact:** With 1000+ concurrent clients, the single-threaded executor cannot
keep up. A slow `fetchWait` (30s timeout) blocks the entire connection.

**Also:** The WebSocket bridge (`ws_bridge.rs`, 645 lines) exists solely because
Cap'n Proto cannot run in browsers. This duplicates handler logic and creates
maintenance burden.
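
Why an internal `Rc` pins work to one thread can be shown with the standard library alone. The sketch below is illustrative and does not use `capnp-rpc`; the function names are hypothetical. State behind `Arc` may move to a worker thread, while the same state behind `Rc` is `!Send` and has to stay where it was created.

```rust
use std::rc::Rc;
use std::sync::Arc;
use std::thread;

/// Compiles only for types that may be moved to another thread.
fn assert_send<T: Send>(_value: &T) {}

/// Shared state behind `Arc` is `Send`, so a worker thread can own it.
pub fn sum_on_worker(shared: Arc<Vec<u64>>) -> u64 {
    assert_send(&shared);
    thread::spawn(move || shared.iter().sum::<u64>())
        .join()
        .expect("worker thread panicked")
}

/// The same state behind `Rc` is `!Send`; it must be used on the current thread.
pub fn sum_on_current_thread(local: Rc<Vec<u64>>) -> u64 {
    // assert_send(&local); // would not compile: `Rc<Vec<u64>>` is not `Send`
    local.iter().sum()
}
```

A future that holds an `Rc` across an `.await` inherits the same restriction, which is why such futures need `spawn_local` rather than a work-stealing spawn.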

### 2.2 Client Architecture: Monolith with Global State

**Problem:** `AUTH_CONTEXT` is a process-wide `RwLock<Option<ClientAuth>>`.
Business logic (MLS processing, sealed sender, hybrid decryption, message
routing) lives inside `repl.rs`'s `poll_messages()`, a 100-line function that
mixes transport, crypto, routing, and storage.

**Impact:** Every frontend (REPL, TUI, GUI, Web) must reimplement message
processing. The TUI already duplicates it. The GUI stub and mobile PoC would need
yet another copy. Client cannot be used as a library.

### 2.3 Delivery Model: Poll-Based, No Push Channel

**Problem:** Client polls every 1 second with `fetch_wait(timeout_ms=0)`, so it
never actually long-polls. Constant network traffic even when idle. ~1 second
latency for message delivery.

**Also:** `fetch` is destructive (drains queue). If the client crashes between
receive and processing, messages are lost.

### 2.4 Connection Model: Single Stream

**Problem:** `max_concurrent_bidi_streams(1)` means the entire QUIC connection is
effectively single-stream. A blocking `fetchWait` prevents all other RPCs.

### 2.5 Storage: Single Mutex-Guarded SQLite Connection

**Problem:** `SqlStore` uses `Mutex<Connection>`. Every database operation
acquires a global lock. Under concurrent load, all storage access serializes.

**Also:** `FileBackedStore` flushes the entire map on every write (O(n) I/O).
Sessions are in-memory only, so a server restart forces all clients to re-login.

### 2.6 Key Management Gaps

- **DiskKeyStore**: HPKE private keys stored as plaintext bincode on disk. No
  encryption at rest.
- **MLS group state**: `GroupMember` holds `MlsGroup` in memory only. Process
  crash loses all group state.
- **Token zeroization**: `AuthContext.token`, `ClientAuth.access_token` are not
  wrapped in `Zeroizing`.
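
For the token-zeroization gap, the intended behaviour looks roughly like the wrapper below. It is a std-only stand-in written for illustration: `SecretToken` is a hypothetical type, and in the real codebase the `Zeroizing` wrapper already used for other secrets would presumably be the better choice.

```rust
use std::fmt;
use std::sync::atomic::{compiler_fence, Ordering};

/// Hypothetical stand-in for a `Zeroizing<Vec<u8>>`-style wrapper:
/// redacts itself in debug output and wipes its bytes when dropped.
pub struct SecretToken(Vec<u8>);

impl SecretToken {
    pub fn new(bytes: Vec<u8>) -> Self {
        Self(bytes)
    }

    /// Explicit accessor so every read of the secret is easy to grep for.
    pub fn expose(&self) -> &[u8] {
        &self.0
    }
}

impl fmt::Debug for SecretToken {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str("SecretToken(<redacted>)")
    }
}

impl Drop for SecretToken {
    fn drop(&mut self) {
        for byte in self.0.iter_mut() {
            // Volatile write so the compiler does not elide the wipe as a dead store.
            unsafe { std::ptr::write_volatile(byte, 0) };
        }
        compiler_fence(Ordering::SeqCst);
    }
}
```

This only wipes the final buffer; copies left behind by earlier reallocations or by serialization are out of its reach, which is the same class of problem noted for intermediate buffers during deserialization.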

### 2.7 Workspace Bloat

12 crates for a project at this maturity is excessive. Several are thin stubs
(`quicproquo-gen`, `quicproquo-bot` at 354 lines) or broken (`quicproquo-gui`
fails `cargo build --workspace`).

---

## Part 3: v2 Architecture Recommendations

### 3.1 Replace capnp-rpc with a Send-Compatible RPC Framework

**Recommendation:** Switch to **tonic (gRPC)** or a custom framing layer.

| Dimension | capnp-rpc (v1) | tonic/gRPC (v2) |
|-----------|---------------|-----------------|
| Threading | `!Send`, single-threaded | `Send + Sync`, multi-threaded |
| Browser | Requires WS bridge | grpc-web native |
| Streaming | Not supported | Built-in |
| Middleware | None (copy-paste auth) | Interceptors/layers |
| Ecosystem | Niche | Massive (every language) |

**Alternative:** Keep Cap'n Proto *schemas* for serialization (zero-copy
advantage) but replace capnp-rpc with custom framing over QUIC streams. This
preserves the wire format while gaining `Send` compatibility.

The WS bridge would be eliminated entirely: grpc-web or WebTransport gives
browsers direct access.

### 3.2 Extract an SDK Crate (Most Important Client Change)

Create `quicproquo-sdk` that owns all business logic:

```
quicproquo-sdk/
  src/
    client.rs        -- QpqClient: connect, login, send, receive
    events.rs        -- ClientEvent enum (push-based)
    conversation.rs  -- ConversationHandle, group management
    crypto.rs        -- MLS pipeline, sealed sender, hybrid decryption
    sync.rs          -- message sync, offline queue, retry
```

All frontends become thin shells:

```
CLI/REPL  -> calls sdk
TUI       -> calls sdk
Tauri GUI -> calls sdk (via Tauri commands)
Mobile    -> calls sdk (via C FFI)
Web/WASM  -> calls sdk (compiled to wasm32)
```

**Key API shape:**
```rust
// Signatures only; bodies and supporting types are omitted.
pub struct QpqClient { /* session, rpc, crypto pipeline */ }

impl QpqClient {
    pub async fn connect(config: ClientConfig) -> Result<Self>;
    // Takes `&mut self` so the auth context lives on this instance.
    pub async fn login(&mut self, username: &str, password: &str) -> Result<()>;
    pub async fn dm(&mut self, username: &str) -> Result<ConversationHandle>;
    pub async fn create_group(&mut self, name: &str) -> Result<ConversationHandle>;
    pub async fn send(&mut self, text: &str) -> Result<MessageId>;
    pub fn subscribe(&self) -> Receiver<ClientEvent>;
}
```

No global state. No `AUTH_CONTEXT`. Auth context is per-`QpqClient` instance.
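
One way `subscribe()` could hand events to several frontends is a small fan-out hub. The sketch below is an assumption-laden illustration using only `std::sync::mpsc`: the `EventHub` type and the `ClientEvent` variants are hypothetical, and a real SDK would more likely use an async channel.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::Mutex;

/// Hypothetical event variants; the real `ClientEvent` enum is not specified here.
#[derive(Debug, Clone, PartialEq)]
pub enum ClientEvent {
    MessageReceived { conversation: String, text: String },
    Typing { username: String },
}

/// Per-client event fan-out: no global state, one hub per client instance.
#[derive(Default)]
pub struct EventHub {
    subscribers: Mutex<Vec<Sender<ClientEvent>>>,
}

impl EventHub {
    /// Registers a new subscriber and returns its receiving end.
    pub fn subscribe(&self) -> Receiver<ClientEvent> {
        let (tx, rx) = channel();
        self.subscribers
            .lock()
            .unwrap_or_else(|poisoned| poisoned.into_inner())
            .push(tx);
        rx
    }

    /// Delivers a copy of `event` to every live subscriber and
    /// forgets subscribers whose receiver has been dropped.
    pub fn publish(&self, event: ClientEvent) {
        let mut subscribers = self
            .subscribers
            .lock()
            .unwrap_or_else(|poisoned| poisoned.into_inner());
        subscribers.retain(|tx| tx.send(event.clone()).is_ok());
    }
}
```

Because the hub is owned by the client instance, two clients in one process (for example in parallel tests) never see each other's events.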

### 3.3 Add Push-Based Delivery

**Recommendation:** Dedicated QUIC unidirectional stream for server-push
notifications.

```
Client opens bidi stream 0 -> RPC channel (request/response)
Server opens uni stream 3  -> push notifications (new message, typing, etc.)
```

(Stream IDs follow RFC 9000: 0 is the first client-initiated bidirectional
stream, 3 is the first server-initiated unidirectional stream.)

Benefits:
- Near-immediate message delivery (no polling interval)
- No idle network traffic
- Typing indicators delivered in real-time
- Graceful degradation: fall back to long-poll if push stream fails

**Also:** Make `peek` + `ack` the default delivery pattern (not destructive
`fetch`). Add idempotency keys to prevent duplicate messages on retry.
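
The `peek` + `ack` pattern with idempotency keys can be sketched as an in-memory queue. This is a minimal illustration under assumptions: the `DeliveryQueue` type, its sequence numbering, and the string keys are hypothetical and not taken from the server's store.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical per-recipient queue: reads are non-destructive,
/// and messages are removed only after the client acknowledges them.
pub struct DeliveryQueue {
    next_seq: u64,
    pending: VecDeque<(u64, Vec<u8>)>,
    seen_keys: HashSet<String>,
}

impl DeliveryQueue {
    pub fn new() -> Self {
        Self { next_seq: 0, pending: VecDeque::new(), seen_keys: HashSet::new() }
    }

    /// Accepts a message once per idempotency key.
    /// Returns false when the key was seen before (a client retry).
    pub fn enqueue(&mut self, idempotency_key: &str, payload: Vec<u8>) -> bool {
        if !self.seen_keys.insert(idempotency_key.to_string()) {
            return false;
        }
        self.pending.push_back((self.next_seq, payload));
        self.next_seq += 1;
        true
    }

    /// Returns up to `limit` messages without removing them.
    pub fn peek(&self, limit: usize) -> Vec<(u64, Vec<u8>)> {
        self.pending.iter().take(limit).cloned().collect()
    }

    /// Removes every message with a sequence number <= `up_to`.
    pub fn ack(&mut self, up_to: u64) {
        while self.pending.front().map_or(false, |(seq, _)| *seq <= up_to) {
            self.pending.pop_front();
        }
    }
}
```

A client that crashes after `peek` but before `ack` simply sees the same messages again on reconnect. In a real store the set of seen keys would need an expiry policy, since this sketch lets it grow without bound.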

### 3.4 Multi-Stream Connections

Allow 4-8 concurrent bidirectional QUIC streams per connection. This enables:
- Pipelined RPCs (send while fetching)
- Concurrent blob upload + chat
- `fetchWait` on one stream without blocking others

### 3.5 Storage Improvements

| Change | Rationale |
|--------|-----------|
| Drop `FileBackedStore` | O(n) flush per write, no federation support |
| Connection pool for SQLite | Replace `Mutex<Connection>` with r2d2/deadpool |
| Persist sessions to DB | Server restart shouldn't force re-login |
| Encrypt DiskKeyStore at rest | HPKE private keys in plaintext is a real vuln |
| Persist MLS group state | Process crash shouldn't lose group state |
| Atomic keystore writes | tempfile-then-rename pattern |
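
The tempfile-then-rename pattern from the last row can be sketched with the standard library. The function name `write_atomically` and the `.tmp` suffix are illustrative assumptions; the sketch also assumes the temporary file and the target live on the same filesystem, which is what makes the rename a single-step replacement.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Writes `bytes` to a sibling temporary file, flushes it to disk, then
/// renames it over `path`, so readers see either the old or the new file
/// and never a partially written one.
pub fn write_atomically(path: &Path, bytes: &[u8]) -> io::Result<()> {
    // Hypothetical naming scheme: same path with a `.tmp` extension.
    let tmp = path.with_extension("tmp");
    {
        let mut file = File::create(&tmp)?;
        file.write_all(bytes)?;
        // Make sure the data is on disk before the rename makes it visible.
        file.sync_all()?;
    }
    fs::rename(&tmp, path)
}
```

For full crash durability the parent directory would also need to be synced after the rename; that step is omitted here to keep the sketch portable.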

### 3.6 Crypto Stack Refinements

The algorithms are correct. The refinements are operational:

| Change | Rationale |
|--------|-----------|
| Typed MLS error variants | Stop losing error info via `format!("{e:?}")` |
| Formalize hybrid PQ ciphersuite ID | Replace length-based key detection |
| Remove all InsecureServerCertVerifier | No TLS bypass on any platform |
| Add passkey/WebAuthn alt-auth | Better UX for GUI/mobile, no password to forget |
| Consider Double Ratchet for 1:1 DMs | MLS is over-engineered for 2-party; DR gives better per-message forward secrecy |
| Token/session secret zeroization | `AuthContext.token` et al. need `Zeroizing` wrappers |
| Fix serde deserialization of secrets | Intermediate non-zeroized `Vec<u8>` in `IdentityKeypair::deserialize` |
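
What "typed MLS error variants" buys over `format!("{e:?}")` is that callers can match on the failure instead of parsing a string. The enum below is a hypothetical sketch: the variant names are invented for illustration and do not mirror the openmls error types.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical typed error; variants are illustrative, not the openmls API.
#[derive(Debug)]
pub enum MlsError {
    StaleEpoch { expected: u64, got: u64 },
    UnknownSender,
    /// Keeps the underlying error as a value instead of flattening it to a string.
    Backend(Box<dyn Error + Send + Sync>),
}

impl fmt::Display for MlsError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            MlsError::StaleEpoch { expected, got } => {
                write!(f, "stale epoch: expected {}, got {}", expected, got)
            }
            MlsError::UnknownSender => f.write_str("unknown sender"),
            MlsError::Backend(inner) => write!(f, "MLS backend error: {}", inner),
        }
    }
}

impl Error for MlsError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            MlsError::Backend(inner) => {
                let source: &(dyn Error + 'static) = &**inner;
                Some(source)
            }
            _ => None,
        }
    }
}

/// Callers can branch on the variant, e.g. to trigger a resync on a stale epoch.
pub fn is_stale_epoch(err: &MlsError) -> bool {
    matches!(err, MlsError::StaleEpoch { .. })
}
```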

### 3.7 Workspace Restructuring

**Reduce from 12 to 8 crates:**

```
quicproquo-core        -- crypto primitives (keep)
quicproquo-proto       -- schema codegen (keep)
quicproquo-plugin-api  -- #![no_std] C-ABI (keep)
quicproquo-kt          -- key transparency (keep)
quicproquo-sdk         -- NEW: business logic library
quicproquo-server      -- server binary (keep)
quicproquo-client      -- CLI/TUI binary, depends on sdk (keep, slimmed)
quicproquo-p2p         -- mesh networking (keep, feature-flagged)
```

**Merge/remove:**
- `bot` -> `sdk::bot` module
- `ffi` -> `sdk` with `--features c-ffi`
- `gen` -> `scripts/` or `xtask`
- `gui` -> `apps/gui/` outside workspace (Tauri project)
- `mobile` -> `examples/` (research spike)

**Add `default-members` under `[workspace]`** so `cargo build` doesn't attempt GUI.
**Add `justfile`** with `build`, `test`, `test-e2e`, `build-wasm`, `docker`.
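
In Cargo, `default-members` is a key of the `[workspace]` table. A root manifest along these lines would keep the GUI in the workspace during the transition while excluding it from a plain `cargo build`. The directory paths are assumptions for illustration; the actual layout of the repository is not shown in this document.

```toml
# Hypothetical root Cargo.toml; crate paths are illustrative.
[workspace]
members = [
    "crates/quicproquo-core",
    "crates/quicproquo-proto",
    "crates/quicproquo-sdk",
    "crates/quicproquo-server",
    "crates/quicproquo-client",
    "crates/quicproquo-gui",
]
# Built by `cargo build` / `cargo test` when run from the workspace root
# without `--workspace` or `-p`.
default-members = [
    "crates/quicproquo-core",
    "crates/quicproquo-proto",
    "crates/quicproquo-sdk",
    "crates/quicproquo-server",
    "crates/quicproquo-client",
]
```

`cargo build --workspace` still builds every member, so CI jobs that pass `--workspace` would continue to hit the GUI until it moves out of the workspace.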

### 3.8 Plugin System Evolution

| Change | Rationale |
|--------|-----------|
| Add `version: u32` to `HookVTable` | ABI stability: check version on load |
| Config passthrough | `qpq_plugin_init(vtable, config_json)` |
| Async hooks | Plugins that call external services shouldn't block Tokio |
| Evaluate WASM plugins | Sandboxed community plugins (keep C-ABI for first-party) |
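
A version check on load might look like the sketch below. Only the `version: u32` field and the `HookVTable` name come from the table above; the hook signature, the `PLUGIN_ABI_VERSION` constant, and the `validate_vtable` function are assumptions made for the example.

```rust
/// Hypothetical ABI version the host was compiled against.
pub const PLUGIN_ABI_VERSION: u32 = 1;

/// Sketch of a versioned vtable. `version` comes first so the host can read it
/// before trusting the rest of the layout. The hook signature is illustrative.
#[repr(C)]
pub struct HookVTable {
    pub version: u32,
    pub on_message: Option<extern "C" fn(payload: *const u8, len: usize) -> i32>,
}

#[derive(Debug, PartialEq)]
pub enum LoadError {
    VersionMismatch { host: u32, plugin: u32 },
    MissingHook,
}

/// Rejects plugins built against a different ABI version or lacking the hook.
pub fn validate_vtable(vtable: &HookVTable) -> Result<(), LoadError> {
    if vtable.version != PLUGIN_ABI_VERSION {
        return Err(LoadError::VersionMismatch {
            host: PLUGIN_ABI_VERSION,
            plugin: vtable.version,
        });
    }
    if vtable.on_message.is_none() {
        return Err(LoadError::MissingHook);
    }
    Ok(())
}

/// Example hook that accepts every message (0 meaning "continue" is an assumption).
pub extern "C" fn accept_all(_payload: *const u8, _len: usize) -> i32 {
    0
}
```

An exact-match check is the strictest policy; a host could instead accept any plugin version up to its own if older layouts stay a prefix of newer ones.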

### 3.9 Federation Improvements

| Change | Rationale |
|--------|-----------|
| DNS SRV / .well-known discovery | Static peer config doesn't scale |
| Persistent relay queue with retry | Messages to offline peers are currently lost |
| Deterministic channel ID derivation | Avoid cross-server channel conflicts |
| Keep mDNS as optional mesh feature | Not for internet-scale, but good for LAN |
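
Deterministic channel ID derivation needs a canonical input first, so that both servers compute the same bytes regardless of who initiates. The sketch below covers only that canonicalization step under stated assumptions: the label string, the `user@server` identity format, and the function name are hypothetical, and the resulting bytes would still have to be fed through a cryptographic hash (for example SHA-256 from a vetted crate) to obtain the ID.

```rust
/// Builds an order-independent, unambiguous preimage for a two-party channel.
/// Hypothetical scheme: domain label, then both identities sorted and
/// length-prefixed so that ("ab", "c") and ("a", "bc") cannot collide.
pub fn dm_channel_preimage(a: &str, b: &str) -> Vec<u8> {
    // Sort so that either party (or either server) derives the same bytes.
    let (first, second) = if a <= b { (a, b) } else { (b, a) };

    let mut out = Vec::new();
    out.extend_from_slice(b"qpq-dm-channel-v1"); // illustrative domain-separation label
    for part in [first, second] {
        out.extend_from_slice(&(part.len() as u32).to_be_bytes());
        out.extend_from_slice(part.as_bytes());
    }
    out
}
```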

### 3.10 Test & CI Improvements

| Change | Rationale |
|--------|-----------|
| Per-client auth context | Removes `--test-threads 1` constraint |
| Mock server for client unit tests | Fast tests without spawning real server |
| Fuzz testing (cargo-fuzz) | Hybrid KEM, sealed sender, padding, Cap'n Proto deser |
| WS bridge unit tests | 645 lines, zero tests, security-critical |
| WASM + Go SDK in CI | Currently untested in CI |
| Separate E2E from unit test CI job | Different speed, different failure modes |
| macOS CI | FFI/mobile cross-compilation validation |
| Release automation | Binary artifacts, Docker tags, WASM npm publish |

---

## Part 4: Ecosystem Positioning

### Don't compete with Signal or Matrix directly.

**Target: Privacy-first messaging infrastructure for developers and
organizations.**

quicproquo's differentiators (QUIC-native transport, post-quantum crypto, MLS,
plugin system, multi-language SDKs, embeddable architecture) point toward an
infrastructure play, not a consumer app.

Think: *"the Postgres of E2E encrypted messaging"*, a high-quality open-source
server and protocol that other projects build on.

| Segment | Value Proposition |
|---------|-------------------|
| **Developer tool** | API-first messenger for encrypted bots and integrations |
| **Embeddable** | C FFI + WASM + Go SDK for embedding in other apps |
| **Enterprise** | On-prem, plugins for compliance/audit, OPAQUE zero-knowledge auth |
| **Research** | Post-quantum crypto, MLS reference implementation, mesh networking |

---

## Part 5: Priority Ordering

### Phase 1: Foundation (unblocks everything else)
1. Replace capnp-rpc with Send-compatible framework
2. Extract SDK crate from client
3. Per-client auth context (no global state)

### Phase 2: Reliability
4. Push-based delivery (QUIC uni-stream)
5. Multi-stream connections
6. Persist sessions + MLS group state
7. Encrypt DiskKeyStore at rest
8. peek+ack as default delivery

### Phase 3: Polish
9. Workspace restructuring (12 -> 8 crates)
10. TUI as primary interactive mode (built on SDK)
11. Plugin system v2 (versioning, config, async)
12. Federation retry queue + discovery

### Phase 4: Ecosystem
13. Full MLS in WASM (browser E2E)
14. WebTransport (eliminate WS bridge)
15. Tauri GUI (built on SDK)
16. Release automation + expanded CI

---

## Appendix: Analysis Sources

This document was produced by four parallel analysis agents:

| Agent | Scope | Files Read |
|-------|-------|-----------|
| server-analyst | Transport, RPC, delivery, storage, federation | 27 server .rs files, 4 schemas, core transport |
| client-analyst | REPL, UX, state, multi-platform, SDK design | All client .rs, GUI, mobile, TS demo |
| security-analyst | MLS, OPAQUE, hybrid KEM, keystore, identity | All core .rs, review doc |
| dx-analyst | Workspace, build, tests, plugins, CI, ecosystem | All Cargo.toml, tests, CI, plugins, SDKs |