feat: add post-quantum hybrid KEM + SQLCipher persistence

Feature 1 — Post-Quantum Hybrid KEM (X25519 + ML-KEM-768):
- Create hybrid_kem.rs with keygen, encrypt, decrypt + 11 unit tests
- Wire format: version(1) | x25519_eph_pk(32) | mlkem_ct(1088) | nonce(12) | ct
- Add uploadHybridKey/fetchHybridKey RPCs to node.capnp schema
- Server: hybrid key storage in FileBackedStore + RPC handlers
- Client: hybrid keypair in StoredState, auto-wrap/unwrap in send/recv/invite/join
- demo-group runs full hybrid PQ envelope round-trip

Feature 2 — SQLCipher Persistence:
- Extract Store trait from FileBackedStore API
- Create SqlStore (rusqlite + bundled-sqlcipher) with encrypted-at-rest SQLite
- Schema: key_packages, deliveries, hybrid_keys tables with indexes
- Server CLI: --store-backend=sql, --db-path, --db-key flags
- 5 unit tests for SqlStore (FIFO, round-trip, upsert, channel isolation)

Also includes: client lib.rs refactor, auth config, TOML config file support,
mdBook documentation, and various cleanups by user.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 08:07:48 +01:00
parent d1ddef4cea
commit f334ed3d43
81 changed files with 14502 additions and 2289 deletions

@@ -0,0 +1,406 @@
# Future Research Directions
This page catalogues technologies and research directions that could strengthen
quicnprotochat beyond the current [milestone plan](milestones.md). Each entry
includes a brief description, the problem it solves, relevant crates or
specifications, and how it maps to the project architecture.
For the production readiness work breakdown, see
[Production Readiness WBS](production-readiness.md).
---
## Transport and Networking
### LibP2P / iroh (n0)
**Problem:** The current architecture is strictly client-server. Clients behind
NAT cannot communicate directly, and the server is a single point of failure for
delivery.
**Solution:** [LibP2P](https://libp2p.io/) and [iroh](https://iroh.computer/)
(from n0) provide peer discovery, NAT traversal (hole-punching), and relay
fallback. iroh is particularly interesting because it is Rust-native and built on
QUIC, aligning with quicnprotochat's existing transport layer.
**Architecture impact:** Move from pure client-server to a hybrid topology where
peers communicate directly when possible and fall back to server relay when NAT
traversal fails. The server role shifts from mandatory relay to optional
rendezvous/relay node.
**Crates:** `libp2p`, `iroh`, `iroh-net`
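
The direct-first, relay-fallback decision described above can be sketched as a small pure function. This is only an illustration: `Route`, `PunchResult`, `choose_route`, and the time budget are hypothetical names, not part of the `libp2p` or `iroh` APIs, and a real client would obtain the hole-punching outcome from whichever library is adopted.

```rust
use std::time::Duration;

/// How a message would reach a peer in the hybrid topology (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Route {
    /// Hole-punching succeeded in time; talk to the peer directly.
    Direct { peer_addr: String },
    /// NAT traversal failed or was too slow; use the server as a relay.
    Relay { server_addr: String },
}

/// Outcome of a hole-punching attempt (hypothetical type).
pub enum PunchResult {
    Connected { peer_addr: String, elapsed: Duration },
    Failed,
}

/// Prefer the direct path when it came up within `budget`, otherwise relay.
pub fn choose_route(punch: PunchResult, budget: Duration, server_addr: &str) -> Route {
    match punch {
        PunchResult::Connected { peer_addr, elapsed } if elapsed <= budget => {
            Route::Direct { peer_addr }
        }
        // Covers both an explicit failure and a connection that arrived too late.
        _ => Route::Relay { server_addr: server_addr.to_string() },
    }
}
```

Keeping the decision separate from the networking code makes the fallback policy easy to unit-test without real sockets.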
### WebTransport (HTTP/3)
**Problem:** Browser clients cannot use raw QUIC. The current stack requires a
native Rust binary.
**Solution:** [WebTransport](https://w3c.github.io/webtransport/) exposes
QUIC-like semantics (multiplexed bidirectional streams, datagrams) to browsers
over HTTP/3. A WebTransport endpoint alongside the existing QUIC listener would
enable a web client without WebSocket degradation.
**Architecture impact:** Add a second listener (HTTP/3 + WebTransport) that
terminates WebTransport and bridges into the existing `NodeService` RPC layer.
Cap'n Proto serialisation works in WASM via the `capnp` crate.
**Crates:** `h3`, `h3-webtransport`, `wtransport`
### Tor / I2P Integration
**Problem:** MLS protects message content, but connection metadata (who connects
to the server, when, how often) leaks to the server and network observers.
**Solution:** Route client-server connections through
[Tor](https://www.torproject.org/) onion services or
[I2P](https://geti2p.net/) tunnels. This provides metadata resistance at the
network layer.
**Architecture impact:** The server exposes a `.onion` address (Tor) or an I2P
destination. Clients connect through the anonymity network. Latency increases
significantly, so this should be optional.
**Crates:** `arti` (Tor client in Rust), `arti-client`
---
## Storage and Persistence
### SQLCipher / libsql (Turso)
**Problem:** At M6, quicnprotochat needs persistent storage for group state, key
material, and message queues. Storing private keys in a plaintext SQLite database
is insufficient.
**Solution:** [SQLCipher](https://www.zetetic.net/sqlcipher/) provides
transparent, page-level AES-256 encryption for SQLite. Alternatively,
[libsql](https://turso.tech/libsql) (Turso) offers a SQLite fork with
encryption, replication, and embedded server capabilities.
**Architecture impact:** Replace the `sqlx` SQLite backend with SQLCipher.
Encryption key derived from a user-provided passphrase (via Argon2id) or a
hardware-backed key.
**Crates:** `rusqlite` (with `bundled-sqlcipher` feature), `libsql`
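
The commit message above mentions extracting a `Store` trait so that a file-backed and a SQLCipher-backed implementation can be swapped. The sketch below shows one possible shape for such a trait, with an in-memory implementation standing in for the encrypted database. The method names, the `MemStore` type, and the key layout are assumptions made for illustration; the project's actual trait may differ.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical storage abstraction; a SQLCipher-backed type would implement
/// the same trait on top of the `deliveries` and `hybrid_keys` tables.
pub trait Store {
    fn enqueue(&mut self, channel: &[u8], recipient: &[u8], payload: Vec<u8>);
    /// Drains the queue for one (channel, recipient) pair in FIFO order.
    fn fetch(&mut self, channel: &[u8], recipient: &[u8]) -> Vec<Vec<u8>>;
    /// Upsert: a second upload for the same identity replaces the first.
    fn upload_hybrid_key(&mut self, identity: &[u8], public_key: Vec<u8>);
    fn fetch_hybrid_key(&self, identity: &[u8]) -> Option<Vec<u8>>;
}

/// In-memory stand-in used here only to exercise the trait.
#[derive(Default)]
pub struct MemStore {
    deliveries: HashMap<(Vec<u8>, Vec<u8>), VecDeque<Vec<u8>>>,
    hybrid_keys: HashMap<Vec<u8>, Vec<u8>>,
}

impl Store for MemStore {
    fn enqueue(&mut self, channel: &[u8], recipient: &[u8], payload: Vec<u8>) {
        self.deliveries
            .entry((channel.to_vec(), recipient.to_vec()))
            .or_default()
            .push_back(payload);
    }

    fn fetch(&mut self, channel: &[u8], recipient: &[u8]) -> Vec<Vec<u8>> {
        self.deliveries
            .remove(&(channel.to_vec(), recipient.to_vec()))
            .map(|queue| queue.into_iter().collect())
            .unwrap_or_default()
    }

    fn upload_hybrid_key(&mut self, identity: &[u8], public_key: Vec<u8>) {
        self.hybrid_keys.insert(identity.to_vec(), public_key);
    }

    fn fetch_hybrid_key(&self, identity: &[u8]) -> Option<Vec<u8>> {
        self.hybrid_keys.get(identity).cloned()
    }
}
```

Server code written against `&mut dyn Store` (or a generic `S: Store`) would not need to know which backend the `--store-backend` flag selected.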
### CRDTs (Automerge / Yrs)
**Problem:** Multi-device support requires synchronising state (group membership,
read receipts, settings) across devices without a central authority resolving
conflicts.
**Solution:** Conflict-free replicated data types (CRDTs) allow concurrent edits
to converge without coordination. [Automerge](https://automerge.org/) and
[Yrs](https://docs.rs/yrs/) (Yjs in Rust) provide production-quality CRDT
implementations.
**Architecture impact:** Client-side state (contact list, group membership
cache, read markers) stored as CRDT documents. Synchronisation happens over the
existing MLS-encrypted channel, ensuring the server never sees the state.
**Crates:** `automerge`, `yrs`
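
To make the convergence property concrete, here is a minimal hand-rolled state-based CRDT for read markers: each conversation keeps the highest sequence number read, and merging takes the per-conversation maximum. It does not use the `automerge` or `yrs` APIs; `ReadMarkers` and its methods are hypothetical names chosen for this sketch.

```rust
use std::collections::HashMap;

/// Per-conversation "highest sequence number read" (hypothetical type).
/// Merging by maximum is commutative, associative, and idempotent, which is
/// what lets replicas converge without coordination.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct ReadMarkers {
    highest_read: HashMap<String, u64>,
}

impl ReadMarkers {
    /// Records a local read; markers never move backwards.
    pub fn mark_read(&mut self, conversation: &str, seq: u64) {
        let slot = self.highest_read.entry(conversation.to_string()).or_insert(0);
        *slot = (*slot).max(seq);
    }

    /// Folds another device's state into this one.
    pub fn merge(&mut self, other: &ReadMarkers) {
        for (conversation, &seq) in &other.highest_read {
            self.mark_read(conversation, seq);
        }
    }

    /// Returns 0 for conversations that have no marker yet.
    pub fn get(&self, conversation: &str) -> u64 {
        self.highest_read.get(conversation).copied().unwrap_or(0)
    }
}
```

Two devices that exchange their `ReadMarkers` in either order end up with the same state, so the sync messages can travel over the MLS channel without any ordering guarantee.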
### Object Storage (S3-compatible)
**Problem:** Encrypted file and media attachments need a storage backend that
the server can host without seeing the content.
**Solution:** An S3-compatible object store (MinIO, Garage, or a cloud provider)
for encrypted blobs. Clients encrypt attachments client-side (using a key derived
from the MLS group secret) and upload the ciphertext. The server stores and
serves opaque blobs.
**Architecture impact:** Add a media upload/download RPC to `NodeService`. The
server proxies to the object store or returns pre-signed URLs.
**Crates:** `aws-sdk-s3`, `opendal`
---
## Cryptography and Privacy
### ML-KEM + ML-DSA Hybrid (Post-Quantum MLS)
**Problem:** Quantum computers threaten X25519 and Ed25519. MLS content is
protected by ephemeral key exchange, but the X25519 init keys are exposed to
harvest-now-decrypt-later attacks, and Ed25519 credential signatures could be
forged once a capable quantum computer exists.
**Solution:** Hybrid X25519 + ML-KEM-768 KEM for MLS init keys, and optionally
hybrid Ed25519 + ML-DSA-65 for credential signatures. The `ml-kem` crate is
already vendored in the workspace.
**Architecture impact:** Custom `OpenMlsCryptoProvider` in `quicnprotochat-core`
implementing the hybrid combiner. This is the M7 milestone -- see
[Milestones](milestones.md#m7----post-quantum-planned) and
[Hybrid KEM](../protocol-layers/hybrid-kem.md).
**Crates:** `ml-kem`, `ml-dsa`
**References:** NIST FIPS 203 (ML-KEM), `draft-ietf-tls-hybrid-design`
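
For orientation, the commit message above gives the hybrid envelope layout as `version(1) | x25519_eph_pk(32) | mlkem_ct(1088) | nonce(12) | ct`. The sketch below splits a buffer along those boundaries. The type and function names are hypothetical, no version value is validated because the source does not state one, and the real `hybrid_kem.rs` parser may be structured differently.

```rust
pub const X25519_PK_LEN: usize = 32;
pub const MLKEM768_CT_LEN: usize = 1088;
pub const NONCE_LEN: usize = 12;
/// One version byte plus the three fixed-size fields.
pub const HEADER_LEN: usize = 1 + X25519_PK_LEN + MLKEM768_CT_LEN + NONCE_LEN;

/// Borrowed view over the fields of one envelope (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct HybridEnvelope<'a> {
    pub version: u8,
    pub x25519_eph_pk: &'a [u8],
    pub mlkem_ct: &'a [u8],
    pub nonce: &'a [u8],
    /// Everything after the fixed header, including any AEAD tag.
    pub ciphertext: &'a [u8],
}

#[derive(Debug, PartialEq, Eq)]
pub enum ParseError {
    TooShort { got: usize, need: usize },
}

/// Splits `buf` into its fields without copying or decrypting anything.
pub fn parse_envelope(buf: &[u8]) -> Result<HybridEnvelope<'_>, ParseError> {
    if buf.len() < HEADER_LEN {
        return Err(ParseError::TooShort { got: buf.len(), need: HEADER_LEN });
    }
    let (version, rest) = buf.split_at(1);
    let (x25519_eph_pk, rest) = rest.split_at(X25519_PK_LEN);
    let (mlkem_ct, rest) = rest.split_at(MLKEM768_CT_LEN);
    let (nonce, ciphertext) = rest.split_at(NONCE_LEN);
    Ok(HybridEnvelope { version: version[0], x25519_eph_pk, mlkem_ct, nonce, ciphertext })
}
```

The fixed header comes to 1133 bytes, most of it the ML-KEM-768 ciphertext, which is the per-message size cost of the hybrid scheme.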
### Private Information Retrieval (PIR)
**Problem:** When a client fetches messages or KeyPackages, the server learns
*which* recipient is requesting -- even though it cannot read the content.
**Solution:** Private Information Retrieval (PIR) allows a client to fetch a
record from the server without revealing which record was requested.
[SealPIR](https://github.com/microsoft/SealPIR) and SimplePIR provide practical
constructions.
**Architecture impact:** Replace the `fetch` / `fetchKeyPackage` RPCs with PIR
queries. This is a significant performance trade-off: PIR has high computational
cost that grows with database size. KeyPackage fetch (small database) is
therefore a better first target than message fetch (large database).
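
To illustrate the idea of fetching a record without revealing its index, the sketch below implements the classic two-server XOR scheme. This is deliberately not SealPIR or SimplePIR, which are single-server constructions built on lattice cryptography. The toy version assumes two servers that hold the same database and do not collude, and it expects the caller to supply a uniformly random mask from a cryptographic RNG. All names are hypothetical.

```rust
pub const RECORD_LEN: usize = 8;
pub type Record = [u8; RECORD_LEN];

/// Client side: the two queries differ only at `index`, and each one on its
/// own is a uniformly random bit vector (given a uniformly random mask).
pub fn make_queries(random_mask: &[bool], index: usize) -> (Vec<bool>, Vec<bool>) {
    let first = random_mask.to_vec();
    let mut second = first.clone();
    second[index] = !second[index];
    (first, second)
}

/// Server side: XOR together every record selected by the query.
/// Note the cost is linear in the database size for every query.
pub fn server_answer(db: &[Record], query: &[bool]) -> Record {
    assert_eq!(db.len(), query.len(), "query must have one bit per record");
    let mut acc = [0u8; RECORD_LEN];
    for (record, &selected) in db.iter().zip(query.iter()) {
        if selected {
            for (a, b) in acc.iter_mut().zip(record.iter()) {
                *a ^= *b;
            }
        }
    }
    acc
}

/// Client side: every record except the wanted one cancels out.
pub fn recover(answer_a: Record, answer_b: Record) -> Record {
    let mut out = answer_a;
    for (o, b) in out.iter_mut().zip(answer_b.iter()) {
        *o ^= *b;
    }
    out
}
```

Each server touches every record it is asked about, which is why the database size dominates the cost and a small KeyPackage store is the more realistic starting point.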
### Sealed Sender (Signal-style)
**Problem:** The server sees `(sender, recipient, timestamp)` metadata on every
enqueued message. Even without reading content, this metadata reveals social
graphs.
**Solution:** [Sealed Sender](https://signal.org/blog/sealed-sender/) encrypts
the sender's identity inside the MLS ciphertext. The server routes by
`recipientKey` only and cannot determine who sent the message.
**Architecture impact:** Modify the `enqueue` RPC to omit sender identity from
the server-visible metadata. The sender identity is included only inside the
MLS application message (encrypted).
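
One way to picture the change is to separate what the server stores from what only the recipient can read. In the sketch below the sender identity is length-prefixed into the inner payload before encryption, and the outer envelope carries only the recipient key. The types, the length-prefix encoding, and the `encrypt` callback (which stands in for MLS application-message encryption) are all assumptions for illustration.

```rust
/// Plaintext that exists only on the endpoints (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct InnerPayload {
    pub sender_identity: Vec<u8>,
    pub body: Vec<u8>,
}

impl InnerPayload {
    /// Layout assumed here: u16 big-endian identity length | identity | body.
    pub fn encode(&self) -> Vec<u8> {
        assert!(self.sender_identity.len() <= u16::MAX as usize, "identity too long");
        let len = self.sender_identity.len() as u16;
        let mut out = Vec::with_capacity(2 + self.sender_identity.len() + self.body.len());
        out.extend_from_slice(&len.to_be_bytes());
        out.extend_from_slice(&self.sender_identity);
        out.extend_from_slice(&self.body);
        out
    }

    pub fn decode(buf: &[u8]) -> Option<Self> {
        if buf.len() < 2 {
            return None;
        }
        let len = u16::from_be_bytes([buf[0], buf[1]]) as usize;
        let rest = &buf[2..];
        if rest.len() < len {
            return None;
        }
        Some(Self { sender_identity: rest[..len].to_vec(), body: rest[len..].to_vec() })
    }
}

/// What the server would store: note there is no sender field at all.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SealedEnvelope {
    pub recipient_key: Vec<u8>,
    pub ciphertext: Vec<u8>,
}

/// `encrypt` is a placeholder for the real MLS encryption step.
pub fn seal(
    recipient_key: &[u8],
    payload: &InnerPayload,
    encrypt: impl Fn(&[u8]) -> Vec<u8>,
) -> SealedEnvelope {
    SealedEnvelope {
        recipient_key: recipient_key.to_vec(),
        ciphertext: encrypt(&payload.encode()),
    }
}
```

The recipient decrypts first and only then learns the sender from `InnerPayload::decode`; the server routes on `recipient_key` alone.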
### Key Transparency (RFC draft)
**Problem:** A compromised server could substitute public keys, performing a
man-in-the-middle attack on MLS group formation.
**Solution:** A verifiable, append-only log of public key bindings (similar to
Certificate Transparency for TLS). Clients verify that the server's response
matches the log before trusting a fetched KeyPackage.
**Architecture impact:** Add a key transparency log (Merkle tree) alongside the
Authentication Service. Clients verify inclusion proofs on every `fetchKeyPackage`
response.
**References:** `draft-ietf-keytrans-protocol`
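
The core client-side check is a Merkle inclusion proof: recompute the path from the fetched leaf to the root and compare it with a root the client already trusts. The sketch below shows that check for a tree with a power-of-two number of leaves. It is a simplification, not the structure specified in `draft-ietf-keytrans-protocol`, and it uses the standard library's `DefaultHasher` purely as a non-cryptographic stand-in; a real log would use SHA-256 or similar. All function names are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// 64-bit stand-in digest. NOT collision-resistant; illustration only.
pub type Digest = u64;

fn hash_leaf(data: &[u8]) -> Digest {
    let mut h = DefaultHasher::new();
    h.write_u8(0x00); // domain separation: leaf
    h.write(data);
    h.finish()
}

fn hash_node(left: Digest, right: Digest) -> Digest {
    let mut h = DefaultHasher::new();
    h.write_u8(0x01); // domain separation: interior node
    h.write_u64(left);
    h.write_u64(right);
    h.finish()
}

/// All tree levels, leaves first. Assumes a power-of-two leaf count.
fn levels(leaves: &[&[u8]]) -> Vec<Vec<Digest>> {
    assert!(leaves.len().is_power_of_two(), "sketch supports 2^k leaves only");
    let mut levels = vec![leaves.iter().map(|leaf| hash_leaf(leaf)).collect::<Vec<Digest>>()];
    while levels.last().unwrap().len() > 1 {
        let next: Vec<Digest> = levels
            .last()
            .unwrap()
            .chunks(2)
            .map(|pair| hash_node(pair[0], pair[1]))
            .collect();
        levels.push(next);
    }
    levels
}

pub fn merkle_root(leaves: &[&[u8]]) -> Digest {
    levels(leaves).last().unwrap()[0]
}

/// Log side: sibling digests from the leaf up to (not including) the root.
pub fn inclusion_proof(leaves: &[&[u8]], mut index: usize) -> Vec<Digest> {
    let all = levels(leaves);
    let mut proof = Vec::new();
    for level in &all[..all.len() - 1] {
        proof.push(level[index ^ 1]);
        index >>= 1;
    }
    proof
}

/// Client side: recompute the root from the leaf and the sibling path.
pub fn verify_inclusion(leaf: &[u8], mut index: usize, proof: &[Digest], root: Digest) -> bool {
    let mut acc = hash_leaf(leaf);
    for &sibling in proof {
        acc = if index & 1 == 0 { hash_node(acc, sibling) } else { hash_node(sibling, acc) };
        index >>= 1;
    }
    acc == root
}
```

The proof holds only one digest per tree level, so verification cost grows logarithmically with the number of logged keys.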
---
## Identity and Authentication
### DIDs (Decentralized Identifiers)
**Problem:** User identities are currently bound to the server. If the server
goes away, identities are lost.
**Solution:** [Decentralized Identifiers](https://www.w3.org/TR/did-core/)
(`did:key`, `did:web`) provide self-sovereign identity. A user's DID is derived
from their Ed25519 public key and is portable across servers.
**Architecture impact:** Replace raw Ed25519 public keys in MLS credentials with
DID URIs. The server resolves DIDs to public keys for routing.
**Crates:** `did-key`, `ssi`
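
As a concrete example of deriving a DID from a key, the `did:key` method encodes an Ed25519 public key by prepending the multicodec prefix `0xed 0x01`, base58btc-encoding the result, and adding the multibase prefix `z`. The sketch below does this with only the standard library. The function names are hypothetical, and a production client would more likely rely on a maintained crate than on a hand-written base58 encoder.

```rust
const BASE58_ALPHABET: &[u8; 58] =
    b"123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz";

/// Base58 (Bitcoin alphabet) encoding of an arbitrary byte string.
pub fn base58btc(input: &[u8]) -> String {
    // Little-endian base-58 digits of the big-endian input number.
    let mut digits: Vec<u8> = Vec::new();
    for &byte in input {
        let mut carry = byte as u32;
        for d in digits.iter_mut() {
            carry += (*d as u32) << 8;
            *d = (carry % 58) as u8;
            carry /= 58;
        }
        while carry > 0 {
            digits.push((carry % 58) as u8);
            carry /= 58;
        }
    }
    // Each leading zero byte is encoded as a literal '1'.
    let zeros = input.iter().take_while(|&&b| b == 0).count();
    let mut out = String::with_capacity(zeros + digits.len());
    for _ in 0..zeros {
        out.push('1');
    }
    for &d in digits.iter().rev() {
        out.push(BASE58_ALPHABET[d as usize] as char);
    }
    out
}

/// Builds a `did:key` identifier from a raw 32-byte Ed25519 public key.
pub fn did_key_from_ed25519(public_key: &[u8; 32]) -> String {
    let mut bytes = Vec::with_capacity(2 + public_key.len());
    bytes.extend_from_slice(&[0xed, 0x01]); // multicodec prefix for ed25519-pub
    bytes.extend_from_slice(public_key);
    format!("did:key:z{}", base58btc(&bytes)) // 'z' marks base58btc in multibase
}
```

Because the identifier is a pure function of the public key, any server can resolve it back to the key without a registry lookup, which is what makes the identity portable.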
### OPAQUE (aPAKE)
**Problem:** If quicnprotochat adds password-based account registration, the
server must never see the password -- not even a hash.
**Solution:** [OPAQUE](https://datatracker.ietf.org/doc/rfc9807/) is an
asymmetric password-authenticated key exchange where the server stores only a
one-way transformation of the password and never sees the plaintext. An attacker
cannot precompute dictionary tables before a server compromise; any offline
guessing has to start after the registration record is stolen.
**Architecture impact:** Replace the registration/login flow with OPAQUE. The
server stores an OPAQUE registration record; the client runs the OPAQUE protocol
to authenticate and derive a session key.
**Crates:** `opaque-ke`
**References:** RFC 9807 (OPAQUE), RFC 9497 (the OPRF it builds on)
### WebAuthn / Passkeys
**Problem:** Password-based auth (even with OPAQUE) is vulnerable to phishing.
Hardware-backed authentication provides stronger device binding.
**Solution:** [WebAuthn](https://www.w3.org/TR/webauthn-3/) / Passkeys allow
authentication via hardware tokens (YubiKey), platform authenticators (Touch ID,
Windows Hello), or synced passkeys.
**Architecture impact:** Add a WebAuthn registration/authentication flow to the
account system. Requires a server-side WebAuthn relying party implementation.
**Crates:** `webauthn-rs`
### Verifiable Credentials (W3C VC)
**Problem:** Proving attributes (organization membership, role, age) without
revealing full identity.
**Solution:** [Verifiable Credentials](https://www.w3.org/TR/vc-data-model/)
allow a user to present cryptographic proofs of attributes issued by a trusted
authority.
**Architecture impact:** Extend MLS credentials with VC presentation. A group
admin could require proof of organization membership before allowing join.
---
## Application Layer
### Matrix-style Federation
**Problem:** A single server is a single point of failure and a single point of
trust. Users on different servers cannot communicate.
**Solution:** Federation allows multiple quicnprotochat servers to exchange
messages, similar to [Matrix](https://matrix.org/) homeserver federation. Each
server manages its own users and relays messages to peer servers.
**Architecture impact:** Major. Requires server-to-server protocol, distributed
identity resolution, and cross-server MLS group management.
### WASM Plugin System
**Problem:** Extensibility (bots, bridges, custom message types) currently
requires forking the codebase.
**Solution:** A sandboxed WASM plugin system allows third-party extensions to run
inside the client or server without access to private key material.
**Architecture impact:** Define a plugin API (message hooks, command handlers).
Plugins compiled to WASM and loaded at runtime via `wasmtime` or `wasmer`.
**Crates:** `wasmtime`, `wasmer`, `extism`
### Double-Ratchet DM Layer
**Problem:** MLS is optimised for groups. For 1:1 conversations, the Signal
protocol (X3DH key agreement plus the Double Ratchet, formerly called Axolotl)
has better performance characteristics because two parties incur no tree overhead.
**Solution:** Implement a double-ratchet layer for 1:1 DMs, using MLS only for
groups with N > 2. The [1:1 Channel Design](dm-channels.md) currently uses MLS
for DMs; this would be an optimisation.
**References:** [The Double Ratchet Algorithm](https://signal.org/docs/specifications/doubleratchet/),
[X3DH Key Agreement Protocol](https://signal.org/docs/specifications/x3dh/)
---
## Observability and Operations
### OpenTelemetry (Tracing + Metrics)
**Problem:** The current logging is `tracing`-based but lacks distributed
tracing context and structured metrics export.
**Solution:** [OpenTelemetry](https://opentelemetry.io/) provides a unified
framework for distributed tracing, metrics, and log correlation. OTLP export
enables integration with any observability backend.
**Architecture impact:** Add `tracing-opentelemetry` and `opentelemetry-otlp`
to the server. Instrument RPC handlers with spans. Export to Jaeger, Grafana
Tempo, or any OTLP-compatible backend.
**Crates:** `opentelemetry`, `opentelemetry-otlp`, `tracing-opentelemetry`
### Prometheus + Grafana
**Problem:** No quantitative visibility into server performance (throughput,
latency, queue depth, epoch advancement rate).
**Solution:** Export Prometheus metrics from the server. Visualise with Grafana
dashboards.
**Metrics to export:** message throughput (enqueue/fetch per second), RPC
latency histograms, MLS epoch advancement rate, delivery queue depth, KeyPackage
store size, active connections.
**Crates:** `prometheus`, `metrics`, `metrics-exporter-prometheus`
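
To show what an RPC latency histogram looks like on the wire, the sketch below accumulates observations into cumulative buckets and renders them in the Prometheus text exposition format. It is hand-rolled with the standard library only and does not use the `prometheus` or `metrics` crate APIs; the `Histogram` type and the metric name `rpc_latency_seconds` are hypothetical.

```rust
/// Minimal cumulative histogram (hypothetical type, not a crate API).
pub struct Histogram {
    name: &'static str,
    /// Upper bounds of the finite buckets, in ascending order.
    bounds: Vec<f64>,
    /// counts[i] = number of observations <= bounds[i] (cumulative).
    counts: Vec<u64>,
    sum: f64,
    total: u64,
}

impl Histogram {
    pub fn new(name: &'static str, bounds: &[f64]) -> Self {
        Self { name, bounds: bounds.to_vec(), counts: vec![0; bounds.len()], sum: 0.0, total: 0 }
    }

    pub fn observe(&mut self, value: f64) {
        self.sum += value;
        self.total += 1;
        // Buckets are cumulative, so one observation may increment several.
        for (bound, count) in self.bounds.iter().zip(self.counts.iter_mut()) {
            if value <= *bound {
                *count += 1;
            }
        }
    }

    /// Renders the histogram as it would appear on a `/metrics` endpoint.
    pub fn render(&self) -> String {
        let mut out = format!("# TYPE {} histogram\n", self.name);
        for (bound, count) in self.bounds.iter().zip(self.counts.iter()) {
            out.push_str(&format!("{}_bucket{{le=\"{}\"}} {}\n", self.name, bound, count));
        }
        // The implicit +Inf bucket always equals the total observation count.
        out.push_str(&format!("{}_bucket{{le=\"+Inf\"}} {}\n", self.name, self.total));
        out.push_str(&format!("{}_sum {}\n", self.name, self.sum));
        out.push_str(&format!("{}_count {}\n", self.name, self.total));
        out
    }
}
```

An RPC handler would call `observe` with the elapsed seconds for each request, and dashboards would derive latency quantiles from the bucket counters.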
### Testcontainers-rs
**Problem:** Integration tests currently run server and client in the same
process (`tokio::spawn`). This does not test real network conditions, container
startup, or multi-process interactions.
**Solution:** [Testcontainers-rs](https://docs.rs/testcontainers/) runs Docker
containers from Rust tests, enabling true end-to-end CI with real network
boundaries.
**Architecture impact:** Add testcontainers-based integration tests alongside
the existing in-process tests. The Docker image is already maintained.
**Crates:** `testcontainers`, `testcontainers-modules`
---
## Developer Experience
### Tauri / Dioxus (Native GUI)
**Problem:** The current interface is CLI-only. A graphical client would broaden
the user base for testing and demonstration.
**Solution:** [Tauri](https://tauri.app/) or [Dioxus](https://dioxuslabs.com/)
provide native cross-platform GUI frameworks in Rust. The
`quicnprotochat-core` crate can be shared directly with the GUI client.
**Architecture impact:** Add a `quicnprotochat-gui` crate that depends on
`quicnprotochat-core` and `quicnprotochat-proto`. The GUI drives the same
`GroupMember` and RPC logic as the CLI client.
**Crates:** `tauri`, `dioxus`
### uniffi / diplomat (Mobile FFI)
**Problem:** Mobile clients (iOS, Android) cannot use the Rust binary directly.
**Solution:** [uniffi](https://github.com/mozilla/uniffi-rs) (Mozilla) and
[diplomat](https://github.com/rust-diplomat/diplomat) generate idiomatic Swift and
Kotlin bindings from Rust definitions.
**Architecture impact:** Expose `quicnprotochat-core` through a C-compatible FFI
layer. Mobile apps call into the Rust crypto and protocol logic.
**Crates:** `uniffi`, `diplomat`
### Nix Flakes
**Problem:** The development environment requires `capnp` (Cap'n Proto compiler),
a specific Rust toolchain version, and test infrastructure. Setup varies across
developer machines.
**Solution:** [Nix flakes](https://nixos.wiki/wiki/Flakes) provide a
reproducible, declarative development environment. A single `nix develop`
command sets up the toolchain, `capnp`, and all dependencies.
**Architecture impact:** Add `flake.nix` and `flake.lock` to the repository root.
---
## Top 5 Priority Implementations
The following table ranks the most impactful technologies for near-term adoption,
considering the current state of the codebase and the [milestone plan](milestones.md).
| Priority | Technology | Why | Unlocks |
|----------|-----------|-----|---------|
| 1 | **Post-quantum hybrid KEM** | `ml-kem` is already vendored in the workspace. Completing the hybrid `OpenMlsCryptoProvider` makes quicnprotochat one of the first PQ MLS implementations. | M7 |
| 2 | **SQLCipher persistence** | Encrypted-at-rest storage is the prerequisite for multi-device support, offline usage, and server restart survival. | M6 |
| 3 | **OPAQUE auth** | Zero-knowledge password authentication is a massive security uplift for the account system. The server never sees or stores passwords. | Phase 3 (authz) |
| 4 | **iroh / LibP2P** | NAT traversal and an optional P2P mesh make quicnprotochat deployable without centralised infrastructure. Aligns with the existing QUIC transport. | Beyond M7 |
| 5 | **Sealed Sender + PIR** | Content encryption is table stakes. Metadata resistance (hiding who talks to whom) is the frontier of private messaging research. | Beyond M7 |
---
## Cross-references
- [Milestones](milestones.md) -- current milestone tracker
- [Production Readiness WBS](production-readiness.md) -- phased work breakdown
- [Auth, Devices, and Tokens](authz-plan.md) -- OPAQUE integration point
- [1:1 Channel Design](dm-channels.md) -- double-ratchet optimisation context
- [Hybrid KEM](../protocol-layers/hybrid-kem.md) -- existing PQ design
- [ADR-006: PQ Gap in Noise Transport](../design-rationale/adr-006-pq-gap.md) -- accepted PQ risk
- [References](../appendix/references.md) -- standards and crate documentation