quicproquo v2 — Design Analysis & Recommendations

Multi-perspective retrospective of the v1 architecture. Produced 2026-03-04 by four parallel analysis agents examining server, client/UX, crypto/security, and project structure/DX.


Executive Summary

quicproquo v1 demonstrates strong fundamentals: QUIC-native transport, RFC 9420 MLS group encryption, post-quantum hybrid KEM, OPAQUE zero-knowledge auth, and a working multi-language SDK surface. These are the right bets and put the project ahead of most open-source messengers on the crypto front.

However, three architectural choices limit the path to production:

  1. capnp-rpc is !Send — forces single-threaded RPC handling, blocking scalability.
  2. Monolithic client with global state — business logic is tangled into the REPL, duplicated across TUI/GUI/Web, and cannot be used as a library.
  3. Poll-based delivery — 1-second polling wastes bandwidth and adds latency; no server-push channel exists.

A v2 should keep the crypto stack (MLS + hybrid PQ KEM + OPAQUE), keep QUIC, but rearchitect the RPC layer, extract an SDK crate, and add push-based delivery.


Part 1 — What Works Well

Transport & Protocol

  • QUIC (quinn) + TLS 1.3 — correct choice. Built-in encryption, connection migration, 0-RTT potential. No reason to change.
  • Cap'n Proto schemas as API contract — zero-copy wire format, compact binary, schema evolution via ordinals. The schemas are good; the RPC runtime is the problem.

Cryptography

  • MLS (RFC 9420, openmls) — only IETF-standard group E2E protocol. No realistic alternative for groups > 2 members. Test suite is thorough (1005 lines covering 2-party, 3-party, hybrid, removal, leave, stale epoch).
  • Hybrid PQ KEM (X25519 + ML-KEM-768) — forward-thinking dual-algorithm protection. Well-implemented with versioned wire format, proper zeroization, and 12 targeted tests. Ahead of Signal (PQXDH, late 2023) and Matrix (no PQ).
  • OPAQUE (RFC 9807) — server never sees passwords. Ristretto255 + Argon2id is best-in-class.
  • Sealed sender, safety numbers, message padding — all clean, simple, correct. Safety numbers use 5200 HMAC-SHA256 iterations, matching the iteration count of Signal's safety numbers.
  • Zeroization discipline — secrets wrapped in Zeroizing, Debug impls redact keys, no .unwrap() in crypto paths.
  • WASM feature gating — core/native cleanly separates WASM-safe crypto from native-only modules (MLS, OPAQUE, filesystem).

Server Design

  • Store trait abstraction — 30+ methods, clean backend swap (SqlStore vs FileBackedStore). Well-factored.
  • OPAQUE auth with timing floors — resolveUser/resolveIdentity mask lookup timing to prevent username enumeration.
  • Delivery proofs — Ed25519-signed receipt of server acceptance. Clients get cryptographic evidence.
  • wasNew flag on createChannel — elegantly solves the dual-MLS-group race condition where both DM parties try to initialize.
  • Plugin hooks (C-ABI) — #![no_std] vtable, zero dependencies, chained hooks with continue/reject protocol. Clean extensibility.
  • Production config validation — enforces encrypted storage, strong auth tokens, pre-existing TLS certs.
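
The timing-floor idea from the OPAQUE bullet above can be sketched as follows. This is an illustration under assumptions, not the server's actual code: `with_timing_floor`, the 20 ms floor, and the `HashMap` directory are hypothetical, and the sketch blocks a thread where an async server would await a timer instead.

```rust
use std::collections::HashMap;
use std::thread;
use std::time::{Duration, Instant};

/// Runs `lookup`, then sleeps until at least `floor` has elapsed since the
/// call started, so hits and misses take roughly the same wall-clock time.
pub fn with_timing_floor<T>(floor: Duration, lookup: impl FnOnce() -> T) -> T {
    let started = Instant::now();
    let result = lookup();
    // `checked_sub` yields None when the lookup already took longer than the floor.
    if let Some(remaining) = floor.checked_sub(started.elapsed()) {
        thread::sleep(remaining);
    }
    result
}

/// Hypothetical resolver: a known and an unknown username both return only
/// after the floor has passed. The 20 ms value is an arbitrary example.
pub fn resolve_user(directory: &HashMap<String, u64>, name: &str) -> Option<u64> {
    with_timing_floor(Duration::from_millis(20), || directory.get(name).copied())
}
```

A floor only hides timing differences smaller than itself, so it has to exceed the slowest realistic lookup to be effective.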

Client & DX

  • Zero-config local dev — qpq --username alice --password pass auto-starts server, generates TLS certs, registers, and logs in. Genuinely excellent.
  • Encrypted-at-rest everything — state file (QPCE), conversation DB (SQLCipher), session cache. Argon2id + ChaCha20-Poly1305 throughout.
  • Playbook system — YAML-scripted command execution with assertions. Great for CI/integration testing.
  • Conversation store — SQLite with deduplication, outbox for offline queuing, activity tracking.
  • Conventional commits, GPG-signed — consistent feat:/fix:/docs: discipline.
  • Security lints enforced by build — clippy::unwrap_used = "deny", unsafe_code = "warn".

Part 2 — What Needs Rethinking

2.1 RPC Layer: capnp-rpc is the #1 Scalability Bottleneck

Problem: capnp-rpc uses Rc internally and is !Send. Everything runs on a LocalSet with spawn_local. All 27 RPC methods serialize through a single thread. No work-stealing, no multi-core utilization.

Impact: With 1000+ concurrent clients, the single-threaded executor cannot keep up. A slow fetchWait (30s timeout) blocks the entire connection.

Also: The WebSocket bridge (ws_bridge.rs, 645 lines) exists solely because Cap'n Proto cannot run in browsers. This duplicates handler logic and creates maintenance burden.
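
To make the Send constraint concrete, here is a small std-only sketch. It does not use capnp-rpc or Tokio; `Counters` and `handle_on_workers` are hypothetical names. State held behind `Arc<Mutex<..>>` can be handed to any number of worker threads, while the same state behind `Rc` is rejected at compile time, which is the restriction that pins v1 handlers to one thread.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Compiles only for types that are allowed to move to another thread.
fn assert_send<T: Send>() {}

/// Hypothetical handler state shared by all RPC handlers.
#[derive(Default)]
pub struct Counters {
    pub handled: u64,
}

/// Spreads simulated requests over `workers` OS threads and returns the total
/// number handled.
pub fn handle_on_workers(workers: usize, requests_per_worker: u64) -> u64 {
    assert_send::<Arc<Mutex<Counters>>>();
    // assert_send::<std::rc::Rc<Counters>>(); // would not compile: `Rc` is `!Send`

    let shared = Arc::new(Mutex::new(Counters::default()));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                for _ in 0..requests_per_worker {
                    shared.lock().expect("counter lock poisoned").handled += 1;
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("worker panicked");
    }
    let total = shared.lock().expect("counter lock poisoned").handled;
    total
}
```

A multi-threaded async runtime places the same `Send` bound on spawned tasks, so handler state has to look like the `Arc` variant before work can be spread across cores.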

2.2 Client Architecture: Monolith with Global State

Problem: AUTH_CONTEXT is a process-wide RwLock<Option<ClientAuth>>. Business logic (MLS processing, sealed sender, hybrid decryption, message routing) lives inside repl.rs's poll_messages() — a 100-line function that mixes transport, crypto, routing, and storage.

Impact: Every frontend (REPL, TUI, GUI, Web) must reimplement message processing. The TUI already duplicates it. The GUI stub and mobile PoC would need yet another copy. Client cannot be used as a library.

2.3 Delivery Model: Poll-Based, No Push Channel

Problem: Client polls every 1 second with fetch_wait(timeout_ms=0) — never actually long-polls. Constant network traffic even when idle. ~1 second latency for message delivery.

Also: fetch is destructive (drains queue). If the client crashes between receive and processing, messages are lost.

2.4 Connection Model: Single Stream

Problem: max_concurrent_bidi_streams(1) means the entire QUIC connection is effectively single-stream. A blocking fetchWait prevents all other RPCs.

2.5 Storage: Single Mutex-Guarded SQLite Connection

Problem: SqlStore uses Mutex<Connection>. Every database operation acquires a global lock. Under concurrent load, all storage access serializes.

Also: FileBackedStore flushes the entire map on every write (O(n) I/O). Sessions are in-memory only — server restart forces all clients to re-login.

2.6 Key Management Gaps

  • DiskKeyStore — HPKE private keys stored as plaintext bincode on disk. No encryption at rest.
  • MLS group state — GroupMember holds MlsGroup in memory only. Process crash loses all group state.
  • Token zeroization — AuthContext.token, ClientAuth.access_token are not wrapped in Zeroizing.
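
For the token zeroization gap, the sketch below shows the behaviour a zeroizing wrapper provides: the bytes are wiped on drop and never appear in debug output. `SecretToken` is a hypothetical, hand-rolled stand-in written only for illustration. In the real codebase the Zeroizing wrapper already used for other secrets would be the natural choice, not least because this version needs an `unsafe` block.

```rust
use std::fmt;

/// Illustration-only secret holder: wipes its bytes on drop, redacts Debug.
pub struct SecretToken(Vec<u8>);

impl SecretToken {
    pub fn new(bytes: Vec<u8>) -> Self {
        SecretToken(bytes)
    }

    /// Explicit accessor so every read of the secret is easy to grep for.
    pub fn expose(&self) -> &[u8] {
        &self.0
    }
}

impl Drop for SecretToken {
    fn drop(&mut self) {
        for byte in self.0.iter_mut() {
            // Volatile write so the compiler cannot optimise the wipe away.
            unsafe { std::ptr::write_volatile(byte, 0) };
        }
    }
}

impl fmt::Debug for SecretToken {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        // Never print the token itself.
        f.write_str("SecretToken(<redacted>)")
    }
}
```

The wipe covers only the buffer the wrapper owns at drop time. Copies made earlier, for example by cloning or by a reallocating `Vec`, are outside its reach.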

2.7 Workspace Bloat

12 crates for a project at this maturity is excessive. Several are thin stubs (quicproquo-gen, quicproquo-bot at 354 lines) or broken (quicproquo-gui fails cargo build --workspace).


Part 3 — v2 Architecture Recommendations

3.1 Replace capnp-rpc with a Send-Compatible RPC Framework

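Recommendation: Switch to tonic (gRPC) or a custom framing layer. Note that tonic's built-in transport is HTTP/2 over TCP, so the gRPC option needs a plan for how it coexists with the QUIC transport that v2 intends to keep.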

| Dimension | capnp-rpc (v1) | tonic/gRPC (v2) |
|---|---|---|
| Threading | !Send, single-threaded | Send + Sync, multi-threaded |
| Browser | Requires WS bridge | grpc-web native |
| Streaming | Not supported | Built-in |
| Middleware | None (copy-paste auth) | Interceptors/layers |
| Ecosystem | Niche | Massive (every language) |

Alternative: Keep Cap'n Proto schemas for serialization (zero-copy advantage) but replace capnp-rpc with custom framing over QUIC streams. This preserves the wire format while gaining Send compatibility.

The WS bridge would be eliminated entirely — grpc-web or WebTransport gives browsers direct access.

3.2 Extract an SDK Crate (Most Important Client Change)

Create quicproquo-sdk that owns all business logic:

quicproquo-sdk/
  src/
    client.rs       -- QpqClient: connect, login, send, receive
    events.rs       -- ClientEvent enum (push-based)
    conversation.rs -- ConversationHandle, group management
    crypto.rs       -- MLS pipeline, sealed sender, hybrid decryption
    sync.rs         -- message sync, offline queue, retry

All frontends become thin shells:

CLI/REPL  -> calls sdk
TUI       -> calls sdk
Tauri GUI -> calls sdk (via Tauri commands)
Mobile    -> calls sdk (via C FFI)
Web/WASM  -> calls sdk (compiled to wasm32)

Key API shape:

pub struct QpqClient { /* session, rpc, crypto pipeline */ }

impl QpqClient {
    pub async fn connect(config: ClientConfig) -> Result<Self>;
    pub async fn login(&mut self, username: &str, password: &str) -> Result<()>;
    pub async fn dm(&mut self, username: &str) -> Result<ConversationHandle>;
    pub async fn create_group(&mut self, name: &str) -> Result<ConversationHandle>;
    pub async fn send(&mut self, text: &str) -> Result<MessageId>;
    pub fn subscribe(&self) -> Receiver<ClientEvent>;
}

No global state. No AUTH_CONTEXT. Auth context is per-QpqClient instance.
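
A minimal sketch of what per-instance auth state means in practice. It is synchronous and skips the network entirely; the `ClientAuth` fields, `ClientError`, and `whoami` are hypothetical and only loosely follow the API shape above. Each client carries its own credentials, so several clients can live in one process, or one test binary, without sharing a lock.

```rust
/// Hypothetical per-session credentials (field names are illustrative).
#[derive(Clone, Debug, PartialEq)]
pub struct ClientAuth {
    pub username: String,
    pub access_token: Vec<u8>,
}

#[derive(Debug, PartialEq)]
pub enum ClientError {
    NotLoggedIn,
}

/// Auth lives inside the client instance instead of a process-wide static.
#[derive(Default)]
pub struct QpqClient {
    auth: Option<ClientAuth>,
}

impl QpqClient {
    pub fn new() -> Self {
        Self::default()
    }

    /// Stand-in for the real login flow: it only records the resulting session.
    pub fn login(&mut self, username: &str, access_token: Vec<u8>) {
        self.auth = Some(ClientAuth {
            username: username.to_string(),
            access_token,
        });
    }

    /// Authenticated calls read this instance's auth context and nothing else.
    pub fn whoami(&self) -> Result<&str, ClientError> {
        self.auth
            .as_ref()
            .map(|auth| auth.username.as_str())
            .ok_or(ClientError::NotLoggedIn)
    }
}
```

With this shape, tests can create one client per test case and run in parallel, since no client can observe another's login.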

3.3 Add Push-Based Delivery

Recommendation: Dedicated QUIC unidirectional stream for server-push notifications.

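Client opens bidi stream 0 -> RPC channel (request/response)
Server opens uni stream 3  -> push notifications (new message, typing, etc.)
(QUIC stream IDs: 0 is the first client-initiated bidirectional stream, 3 is the first server-initiated unidirectional stream.)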

Benefits:

  • Near-immediate message delivery: latency is bounded by the network, not by the 1-second polling interval
  • No idle network traffic
  • Typing indicators delivered in real-time
  • Graceful degradation: fall back to long-poll if push stream fails
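
On the client side, push delivery pairs naturally with the event subscription from section 3.2. The sketch below uses std channels to show the fan-out only; `EventHub` and the `ClientEvent` variants are hypothetical, and a real SDK would feed `publish` from the server's push stream and would more likely use an async channel.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical event set; the real enum would cover more cases.
#[derive(Debug, Clone, PartialEq)]
pub enum ClientEvent {
    MessageReceived { conversation: String, body: String },
    Typing { conversation: String, username: String },
}

/// Fans each pushed event out to every frontend that subscribed.
#[derive(Default)]
pub struct EventHub {
    subscribers: Vec<Sender<ClientEvent>>,
}

impl EventHub {
    /// Registers a new listener and hands back its receiving end.
    pub fn subscribe(&mut self) -> Receiver<ClientEvent> {
        let (tx, rx) = channel();
        self.subscribers.push(tx);
        rx
    }

    /// Delivers `event` to all live subscribers and forgets closed ones.
    pub fn publish(&mut self, event: ClientEvent) {
        self.subscribers
            .retain(|tx| tx.send(event.clone()).is_ok());
    }
}
```

Frontends then block or poll on their own receiver, and nothing is sent over the network while the conversation is idle.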

Also: Make peek + ack the default delivery pattern (not destructive fetch). Add idempotency keys to prevent duplicate messages on retry.
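
The peek + ack pattern with idempotency keys can be sketched with an in-memory queue. `DeliveryQueue` and its method names are hypothetical, and the real store would persist both the queue and the key set. The point is that reading does not remove anything: a message leaves the queue only after the client confirms it, and a retried send with the same key is accepted once.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical per-recipient queue with non-destructive reads.
pub struct DeliveryQueue {
    next_seq: u64,
    pending: VecDeque<(u64, Vec<u8>)>,
    seen_keys: HashSet<String>,
}

impl DeliveryQueue {
    pub fn new() -> Self {
        DeliveryQueue {
            next_seq: 0,
            pending: VecDeque::new(),
            seen_keys: HashSet::new(),
        }
    }

    /// Stores the payload unless `idempotency_key` was accepted before.
    /// Returns true if the message was newly stored.
    pub fn enqueue(&mut self, idempotency_key: &str, payload: Vec<u8>) -> bool {
        if !self.seen_keys.insert(idempotency_key.to_string()) {
            return false; // duplicate from a client retry
        }
        self.pending.push_back((self.next_seq, payload));
        self.next_seq += 1;
        true
    }

    /// Returns up to `limit` messages without removing them.
    pub fn peek(&self, limit: usize) -> Vec<(u64, Vec<u8>)> {
        self.pending.iter().take(limit).cloned().collect()
    }

    /// Removes every message with a sequence number <= `up_to`.
    pub fn ack(&mut self, up_to: u64) {
        while self.pending.front().map_or(false, |(seq, _)| *seq <= up_to) {
            self.pending.pop_front();
        }
    }
}
```

A client that crashes between `peek` and `ack` sees the same messages again on restart, so it needs its own deduplication, which the v1 conversation store already provides. The key set here grows without bound; a production version would expire old keys.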

3.4 Multi-Stream Connections

Allow 4-8 concurrent bidirectional QUIC streams per connection. This enables:

  • Pipelined RPCs (send while fetching)
  • Concurrent blob upload + chat
  • fetchWait on one stream without blocking others

3.5 Storage Improvements

| Change | Rationale |
|---|---|
| Drop FileBackedStore | O(n) flush per write, no federation support |
| Connection pool for SQLite | Replace Mutex<Connection> with r2d2/deadpool |
| Persist sessions to DB | Server restart shouldn't force re-login |
| Encrypt DiskKeyStore at rest | HPKE private keys in plaintext is a real vuln |
| Persist MLS group state | Process crash shouldn't lose group state |
| Atomic keystore writes | tempfile-then-rename pattern |
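
The tempfile-then-rename pattern from the last row looks roughly like this with the standard library alone. `write_atomically` and the `.tmp` suffix are illustrative choices, not existing project code.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Writes `contents` to a sibling temp file, flushes it to disk, then renames
/// it over `path`. Readers see either the old file or the complete new one.
pub fn write_atomically(path: &Path, contents: &[u8]) -> io::Result<()> {
    // Same directory as the target, so the rename stays on one filesystem.
    let tmp = path.with_extension("tmp");
    {
        let mut file = File::create(&tmp)?;
        file.write_all(contents)?;
        file.sync_all()?; // make sure the data is on disk before publishing it
    }
    fs::rename(&tmp, path)
}
```

A fixed temp name assumes a single writer per keystore; concurrent writers would need unique temp names. For full crash durability on POSIX systems the containing directory should also be synced after the rename.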

3.6 Crypto Stack Refinements

The algorithms are correct. The refinements are operational:

| Change | Rationale |
|---|---|
| Typed MLS error variants | Stop losing error info via format!("{e:?}") |
| Formalize hybrid PQ ciphersuite ID | Replace length-based key detection |
| Remove all InsecureServerCertVerifier | No TLS bypass on any platform |
| Add passkey/WebAuthn alt-auth | Better UX for GUI/mobile, no password to forget |
| Consider Double Ratchet for 1:1 DMs | MLS is over-engineered for 2-party; DR gives better per-message forward secrecy |
| Token/session secret zeroization | AuthContext.token et al. need Zeroizing wrappers |
| Fix serde deserialization of secrets | Intermediate non-zeroized Vec<u8> in IdentityKeypair::deserialize |
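
As an illustration of the first row, typed variants let callers branch on what went wrong instead of parsing a debug string. The variants below are hypothetical; the real set would mirror whatever failures the MLS layer actually reports.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical typed error for the MLS pipeline.
#[derive(Debug, PartialEq)]
pub enum MlsError {
    StaleEpoch { got: u64, current: u64 },
    UnknownSender,
    Decrypt(String),
}

impl fmt::Display for MlsError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self {
            MlsError::StaleEpoch { got, current } => write!(
                f,
                "stale epoch: message is from epoch {}, group is at {}",
                got, current
            ),
            MlsError::UnknownSender => f.write_str("sender is not a group member"),
            MlsError::Decrypt(reason) => write!(f, "decryption failed: {}", reason),
        }
    }
}

impl Error for MlsError {}

/// Example check that produces a typed error rather than a formatted string.
pub fn check_epoch(got: u64, current: u64) -> Result<(), MlsError> {
    if got < current {
        Err(MlsError::StaleEpoch { got, current })
    } else {
        Ok(())
    }
}

/// Callers can decide on recovery by matching the variant.
pub fn should_retry_after_sync(err: &MlsError) -> bool {
    matches!(err, MlsError::StaleEpoch { .. })
}
```

With string errors, the retry decision above would need substring matching on debug output, which breaks silently whenever the upstream formatting changes.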

3.7 Workspace Restructuring

Reduce from 12 to 8 crates:

quicproquo-core        -- crypto primitives (keep)
quicproquo-proto       -- schema codegen (keep)
quicproquo-plugin-api  -- #![no_std] C-ABI (keep)
quicproquo-kt          -- key transparency (keep)
quicproquo-sdk         -- NEW: business logic library
quicproquo-server      -- server binary (keep)
quicproquo-client      -- CLI/TUI binary, depends on sdk (keep, slimmed)
quicproquo-p2p         -- mesh networking (keep, feature-flagged)

Merge/remove:

  • bot -> sdk::bot module
  • ffi -> sdk with --features c-ffi
  • gen -> scripts/ or xtask
  • gui -> apps/gui/ outside workspace (Tauri project)
  • mobile -> examples/ (research spike)

Set default-members under [workspace] so that a plain cargo build doesn't attempt the GUI. Add a justfile with build, test, test-e2e, build-wasm, and docker recipes.

3.8 Plugin System Evolution

| Change | Rationale |
|---|---|
| Add version: u32 to HookVTable | ABI stability (check version on load) |
| Config passthrough | qpq_plugin_init(vtable, config_json) |
| Async hooks | Plugins that call external services shouldn't block Tokio |
| Evaluate WASM plugins | Sandboxed community plugins (keep C-ABI for first-party) |
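
The version check from the first row could look like the sketch below. The struct layout, the single `on_message` hook, the constant, and the 0 = continue return convention are all assumptions made for illustration; the real HookVTable has its own fields.

```rust
/// ABI version the host was compiled against (hypothetical value).
pub const HOOK_ABI_VERSION: u32 = 1;

/// Hypothetical layout: `version` comes first so the host can read it before
/// trusting any other field.
#[repr(C)]
pub struct HookVTable {
    pub version: u32,
    pub on_message: Option<extern "C" fn(*const u8, usize) -> i32>,
}

#[derive(Debug, PartialEq)]
pub enum PluginLoadError {
    VersionMismatch { host: u32, plugin: u32 },
}

/// Rejects a plugin whose vtable was built for a different ABI version.
pub fn check_vtable(vtable: &HookVTable) -> Result<(), PluginLoadError> {
    if vtable.version != HOOK_ABI_VERSION {
        return Err(PluginLoadError::VersionMismatch {
            host: HOOK_ABI_VERSION,
            plugin: vtable.version,
        });
    }
    Ok(())
}

/// Example hook that lets every message through (0 = continue, by assumption).
pub extern "C" fn allow_all(_payload: *const u8, _len: usize) -> i32 {
    0
}
```

Checking the version before calling any hook means a layout change becomes a load-time error rather than undefined behaviour at the first call.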

3.9 Federation Improvements

| Change | Rationale |
|---|---|
| DNS SRV / .well-known discovery | Static peer config doesn't scale |
| Persistent relay queue with retry | Messages to offline peers are currently lost |
| Deterministic channel ID derivation | Avoid cross-server channel conflicts |
| Keep mDNS as optional mesh feature | Not for internet-scale, but good for LAN |

3.10 Test & CI Improvements

| Change | Rationale |
|---|---|
| Per-client auth context | Removes --test-threads 1 constraint |
| Mock server for client unit tests | Fast tests without spawning real server |
| Fuzz testing (cargo-fuzz) | Hybrid KEM, sealed sender, padding, Cap'n Proto deser |
| WS bridge unit tests | 645 lines, zero tests, security-critical |
| WASM + Go SDK in CI | Currently untested in CI |
| Separate E2E from unit test CI job | Different speed, different failure modes |
| macOS CI | FFI/mobile cross-compilation validation |
| Release automation | Binary artifacts, Docker tags, WASM npm publish |

Part 4 — Ecosystem Positioning

Don't compete with Signal or Matrix directly.

Target: Privacy-first messaging infrastructure for developers and organizations.

quicproquo's differentiators — QUIC-native transport, post-quantum crypto, MLS, plugin system, multi-language SDKs, embeddable architecture — point toward an infrastructure play, not a consumer app.

Think: "the Postgres of E2E encrypted messaging" — a high-quality open-source server and protocol that other projects build on.

| Segment | Value Proposition |
|---|---|
| Developer tool | API-first messenger for encrypted bots and integrations |
| Embeddable | C FFI + WASM + Go SDK for embedding in other apps |
| Enterprise | On-prem, plugins for compliance/audit, OPAQUE zero-knowledge auth |
| Research | Post-quantum crypto, MLS reference implementation, mesh networking |

Part 5 — Priority Ordering

Phase 1: Foundation (unblocks everything else)

  1. Replace capnp-rpc with Send-compatible framework
  2. Extract SDK crate from client
  3. Per-client auth context (no global state)

Phase 2: Reliability

  1. Push-based delivery (QUIC uni-stream)
  2. Multi-stream connections
  3. Persist sessions + MLS group state
  4. Encrypt DiskKeyStore at rest
  5. peek+ack as default delivery

Phase 3: Polish

  1. Workspace restructuring (12 -> 8 crates)
  2. TUI as primary interactive mode (built on SDK)
  3. Plugin system v2 (versioning, config, async)
  4. Federation retry queue + discovery

Phase 4: Ecosystem

  1. Full MLS in WASM (browser E2E)
  2. WebTransport (eliminate WS bridge)
  3. Tauri GUI (built on SDK)
  4. Release automation + expanded CI

Appendix — Analysis Sources

This document was produced by four parallel analysis agents:

| Agent | Scope | Files Read |
|---|---|---|
| server-analyst | Transport, RPC, delivery, storage, federation | 27 server .rs files, 4 schemas, core transport |
| client-analyst | REPL, UX, state, multi-platform, SDK design | All client .rs, GUI, mobile, TS demo |
| security-analyst | MLS, OPAQUE, hybrid KEM, keystore, identity | All core .rs, review doc |
| dx-analyst | Workspace, build, tests, plugins, CI, ecosystem | All Cargo.toml, tests, CI, plugins, SDKs |