feat: add post-quantum hybrid KEM + SQLCipher persistence

Feature 1 -- Post-Quantum Hybrid KEM (X25519 + ML-KEM-768):
- Create hybrid_kem.rs with keygen, encrypt, decrypt + 11 unit tests
- Wire format: version(1) | x25519_eph_pk(32) | mlkem_ct(1088) | nonce(12) | ct
- Add uploadHybridKey/fetchHybridKey RPCs to node.capnp schema
- Server: hybrid key storage in FileBackedStore + RPC handlers
- Client: hybrid keypair in StoredState, auto-wrap/unwrap in send/recv/invite/join
- demo-group runs full hybrid PQ envelope round-trip

Feature 2 -- SQLCipher Persistence:
- Extract Store trait from FileBackedStore API
- Create SqlStore (rusqlite + bundled-sqlcipher) with encrypted-at-rest SQLite
- Schema: key_packages, deliveries, hybrid_keys tables with indexes
- Server CLI: --store-backend=sql, --db-path, --db-key flags
- 5 unit tests for SqlStore (FIFO, round-trip, upsert, channel isolation)

Also includes: client lib.rs refactor, auth config, TOML config file support,
mdBook documentation, and various cleanups by user.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 08:07:48 +01:00
parent d1ddef4cea
commit f334ed3d43
81 changed files with 14502 additions and 2289 deletions
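
The wire format named in the commit message (`version(1) | x25519_eph_pk(32) | mlkem_ct(1088) | nonce(12) | ct`) can be illustrated with a small parser. This is a sketch only, not the code in `hybrid_kem.rs`: the `Envelope` struct, the `EnvelopeError` enum, and the `parse_envelope` function are hypothetical names, and the accepted version byte is passed in as a parameter because the commit message does not state its value.

```rust
// Field widths as listed in the commit message's wire format.
const VERSION_LEN: usize = 1;
const X25519_PK_LEN: usize = 32;
const MLKEM768_CT_LEN: usize = 1088;
const NONCE_LEN: usize = 12;
const HEADER_LEN: usize = VERSION_LEN + X25519_PK_LEN + MLKEM768_CT_LEN + NONCE_LEN;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum EnvelopeError {
    TooShort { actual: usize },
    UnsupportedVersion(u8),
}

/// Hypothetical borrowed view over one hybrid envelope.
#[derive(Debug)]
pub struct Envelope<'a> {
    pub version: u8,
    pub x25519_eph_pk: &'a [u8],
    pub mlkem_ct: &'a [u8],
    pub nonce: &'a [u8],
    pub ciphertext: &'a [u8],
}

/// Splits `bytes` into the fixed-width header fields and the trailing ciphertext.
/// `supported_version` is an assumption: the real version byte is not given here.
pub fn parse_envelope(bytes: &[u8], supported_version: u8) -> Result<Envelope<'_>, EnvelopeError> {
    if bytes.len() < HEADER_LEN {
        return Err(EnvelopeError::TooShort { actual: bytes.len() });
    }
    let (version, rest) = bytes.split_at(VERSION_LEN);
    if version[0] != supported_version {
        return Err(EnvelopeError::UnsupportedVersion(version[0]));
    }
    let (x25519_eph_pk, rest) = rest.split_at(X25519_PK_LEN);
    let (mlkem_ct, rest) = rest.split_at(MLKEM768_CT_LEN);
    let (nonce, ciphertext) = rest.split_at(NONCE_LEN);
    Ok(Envelope {
        version: version[0],
        x25519_eph_pk,
        mlkem_ct,
        nonce,
        ciphertext,
    })
}
```

Checking the total length once up front keeps every later `split_at` call panic-free, so a truncated envelope surfaces as a typed error rather than a crash.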
--- a/docs/src/roadmap/production-readiness.md
+++ b/docs/src/roadmap/production-readiness.md
@@ -0,0 +1,226 @@
+# Production Readiness WBS
+
+This page defines the work breakdown structure (WBS) for taking quicnprotochat
+from a proof-of-concept to a production-hardened system. It covers feature scope,
+security policy, phased delivery, and a planning checklist.
+
+For the milestone-by-milestone tracker, see [Milestones](milestones.md). This
+document focuses on the cross-cutting concerns that span multiple milestones.
+
+---
+
+## Feature Scope (Must-Have)
+
+These are the feature areas that must be addressed before quicnprotochat can be
+considered production-ready. Each area maps to one or more milestones or phases
+in the WBS below.
+
+| Area | Description | Primary Milestone / Phase |
+|------|-------------|-------------------|
+| **Identity / Auth** | Account creation, device registration, token-based RPC authentication, MLS identity binding | M4 + Phase 3 |
+| **Key / MLS Lifecycle** | KeyPackage rotation, epoch advancement, member removal, credential updates | M5 + Phase 2 |
+| **Transport / Delivery** | QUIC + TLS 1.3 hardening, ALPN enforcement, connection draining, reconnect | M1 (done) + Phase 2 |
+| **Private 1:1 Channels** | Channel creation, per-channel authz, TTL eviction, DM-specific flows | Phase 4 |
+| **Storage / Persistence** | SQLite (or SQLCipher) for the Authentication Service (AS), Delivery Service (DS), and client state; migrations; backup/restore | M6 + Phase 6 |
+| **Observability / Ops** | Structured logging, metrics, distributed tracing, healthcheck endpoints | Phase 6 |
+| **Client Resilience** | Offline queue, retry with backoff, idempotent message IDs, gap detection | Phase 4 |
+| **Compatibility / Protocols** | Wire versioning, N-1 client interoperability, ciphersuite negotiation | Phase 2 + Phase 5 |
+
+---
+
+## Security Plan (By Design)
+
+quicnprotochat follows a security-by-design philosophy. The standards below are
+non-negotiable -- see [Coding Standards](../contributing/coding-standards.md) for
+how they are enforced in code.
+
+### Governance
+
+- `CODEOWNERS` file mapping each crate to a responsible reviewer.
+- All PRs require at least one review from a crate owner.
+- Security-sensitive changes (crypto, auth, wire format) require two reviewers.
+- GPG-signed commits only.
+
+### Transport Policy
+
+- TLS 1.3 only (`rustls` configured with `TLS13` cipher suites exclusively).
+- ALPN token `b"capnp"` required; reject connections with mismatched ALPN.
+- Self-signed certificates acceptable for development; production deployments
+  must use a CA-signed certificate or certificate pinning.
+- Connection draining on shutdown (QUIC `CONNECTION_CLOSE`).
+
+### MLS Policy
+
+- Ciphersuite: `MLS_128_DHKEMX25519_AES128GCM_SHA256_Ed25519` (baseline).
+- Single-use KeyPackages (consumed on fetch, per RFC 9420).
+- KeyPackage TTL: 24 hours; clients must rotate before expiry.
+- Ciphersuite allowlist: server rejects KeyPackages with unknown ciphersuites.
+- No downgrade: once a group has used a ciphersuite, members cannot rejoin with
+  a weaker one.
+
+### Input Validation
+
+- All incoming Cap'n Proto messages validated against schema before processing.
+- Maximum payload size: 5 MB per RPC call.
+- Group ID, identity key, and channel ID fields validated for correct length
+  (32 bytes, 32 bytes, 16 bytes respectively).
+- UTF-8 validation on all string fields.
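
The length, size, and UTF-8 rules above could be enforced by a single validation pass before a handler touches the request. The sketch below is illustrative only: `EnqueueRequest`, its `label` field, `ValidationError`, and `validate` are hypothetical names, and "5 MB" is read as 5 * 1024 * 1024 bytes, which is an assumption.

```rust
// Field lengths from the Input Validation policy.
const GROUP_ID_LEN: usize = 32;
const IDENTITY_KEY_LEN: usize = 32;
const CHANNEL_ID_LEN: usize = 16;
// Assumption: "5 MB" is interpreted as 5 MiB.
const MAX_PAYLOAD_BYTES: usize = 5 * 1024 * 1024;

/// Hypothetical typed error, so callers never need `unwrap()`.
#[derive(Debug, PartialEq, Eq)]
pub enum ValidationError {
    BadLength { field: &'static str, expected: usize, actual: usize },
    PayloadTooLarge { actual: usize },
    InvalidUtf8 { field: &'static str },
}

/// Hypothetical request shape; the real RPC structs live in the Cap'n Proto schema.
pub struct EnqueueRequest<'a> {
    pub group_id: &'a [u8],
    pub identity_key: &'a [u8],
    pub channel_id: &'a [u8],
    pub label: &'a [u8], // stands in for any string field
    pub payload: &'a [u8],
}

fn check_len(field: &'static str, bytes: &[u8], expected: usize) -> Result<(), ValidationError> {
    if bytes.len() == expected {
        Ok(())
    } else {
        Err(ValidationError::BadLength { field, expected, actual: bytes.len() })
    }
}

/// Applies every rule and returns the first violation found.
pub fn validate(req: &EnqueueRequest<'_>) -> Result<(), ValidationError> {
    check_len("group_id", req.group_id, GROUP_ID_LEN)?;
    check_len("identity_key", req.identity_key, IDENTITY_KEY_LEN)?;
    check_len("channel_id", req.channel_id, CHANNEL_ID_LEN)?;
    if req.payload.len() > MAX_PAYLOAD_BYTES {
        return Err(ValidationError::PayloadTooLarge { actual: req.payload.len() });
    }
    std::str::from_utf8(req.label)
        .map_err(|_| ValidationError::InvalidUtf8 { field: "label" })?;
    Ok(())
}
```

Returning a typed error per rule also gives the negative tests listed under Testing something precise to assert against.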
+
+### Secrets Management
+
+- All private key material wrapped in `Zeroizing<T>` (via the `zeroize` crate).
+- No secret material in log output at any level.
+- No `unwrap()` on cryptographic operations -- all errors are typed and propagated.
+- Constant-time comparison for authentication tokens and key fingerprints.
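
The constant-time comparison rule can be illustrated with a loop that never exits early on the first mismatching byte. This is a teaching sketch under stated assumptions, not the project's implementation: `ct_eq` is a hypothetical helper, input lengths are treated as public, and the compiler gives no hard guarantee about timing. A vetted crate such as `subtle` (its `ConstantTimeEq` trait) is the more defensible choice for production code.

```rust
use std::hint::black_box;

/// Compares two byte strings while touching every byte pair.
/// Assumption: the lengths themselves are not secret.
pub fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff: u8 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        // Accumulate differences instead of branching on them.
        diff |= x ^ y;
    }
    // black_box is a best-effort hint that discourages the optimizer from
    // reintroducing an early exit; it is not a formal guarantee.
    black_box(diff) == 0
}
```

A plain `==` on slices may return as soon as one byte differs, which can leak how long a matching prefix an attacker has guessed.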
+
+### Abuse / DoS Controls
+
+- Rate limiting: 50 requests/second, enforced independently per IP, per account, and per device.
+- Payload cap: 5 MB per message.
+- Connection limit: configurable max concurrent QUIC connections.
+- KeyPackage upload limit: configurable per account (prevents store exhaustion).
+- Long-poll timeout cap: server-enforced maximum for `fetchWait`.
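
One common way to implement a requests-per-second limit is a token bucket keyed by the identity being limited. The sketch below assumes that technique; the document does not say which algorithm the server uses. `RateLimiter`, `Bucket`, the string key scheme (`"ip:…"`, `"device:…"`), and the burst size are all hypothetical. The current time is passed in explicitly so the behaviour is deterministic under test.

```rust
use std::collections::HashMap;
use std::time::Instant;

struct Bucket {
    tokens: f64,
    last_refill: Instant,
}

/// Hypothetical token-bucket limiter with one bucket per key.
pub struct RateLimiter {
    rate_per_sec: f64,
    burst: f64,
    buckets: HashMap<String, Bucket>,
}

impl RateLimiter {
    pub fn new(rate_per_sec: f64, burst: f64) -> Self {
        Self { rate_per_sec, burst, buckets: HashMap::new() }
    }

    /// Returns true if the request identified by `key` may proceed at `now`.
    pub fn allow(&mut self, key: &str, now: Instant) -> bool {
        let (rate, burst) = (self.rate_per_sec, self.burst);
        // A key seen for the first time starts with a full bucket.
        let bucket = self
            .buckets
            .entry(key.to_owned())
            .or_insert(Bucket { tokens: burst, last_refill: now });
        // Refill in proportion to elapsed time, capped at the burst size.
        let elapsed = now.saturating_duration_since(bucket.last_refill).as_secs_f64();
        bucket.tokens = (bucket.tokens + elapsed * rate).min(burst);
        bucket.last_refill = now;
        if bucket.tokens >= 1.0 {
            bucket.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

Calling `allow` once per dimension (IP, account, device) and rejecting the request if any call returns false would give independent limits for each. A real server would also need to evict idle buckets so the map itself cannot be used for memory exhaustion.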
+
+### Data Protection
+
+- MLS ciphertext is opaque to the server (DS never holds group keys).
+- Message retention: 7 days default, configurable.
+- KeyPackage retention: 24 hours (TTL eviction).
+- At-rest encryption for persistent storage (SQLCipher at M6).
+
+### Logging Safety
+
+- Structured logging via `tracing`, filtered with the `env-filter` feature of `tracing-subscriber`.
+- Sensitive fields (keys, tokens, ciphertext) are never logged, even at `TRACE`.
+- Audit-level events: auth success/failure, token issuance, keypackage upload,
+  enqueue/fetch, rate limit hits.
+
+### Testing
+
+- Unit tests for all crypto operations (see [Testing Strategy](../contributing/testing.md)).
+- Integration tests for every RPC method.
+- Negative tests: malformed input, expired tokens, wrong identity, replay attempts.
+- N-1 compatibility tests (old client against new server).
+- Fuzzing targets for Cap'n Proto parsers and MLS message handling (Phase 5).
+
+---
+
+## Work Breakdown (6 Phases)
+
+### Phase 1 -- Baselines and Governance
+
+**Goal:** Establish project hygiene before adding features.
+
+| Task | Description |
+|------|-------------|
+| CODEOWNERS | Map crates to responsible reviewers |
+| CI pipeline | GitHub Actions: `cargo test --workspace`, `cargo clippy`, `cargo fmt --check`, `cargo deny check` |
+| SBOM generation | `cargo-cyclonedx` or `cargo-about` in CI; publish with each release |
+| Threat model | Document assets, adversaries, attack surface, trust boundaries; reference in [Threat Model](../cryptography/threat-model.md) |
+| Dependency audit | `cargo audit` in CI; pin all major versions per [Coding Standards](../contributing/coding-standards.md) |
+
+### Phase 2 -- Protocols and Core Hardening
+
+**Goal:** Lock down the wire format and cryptographic policy.
+
+| Task | Description |
+|------|-------------|
+| Wire versioning | Add `version` field to all Cap'n Proto structs; reject unknown versions |
+| Ciphersuite allowlist | Server rejects KeyPackages outside the allowed set |
+| Downgrade guards | Prevent epoch rollback; reject Commits with weaker ciphersuites |
+| ALPN enforcement | Reject connections without `b"capnp"` ALPN token |
+| Connection draining | Graceful QUIC `CONNECTION_CLOSE` on server shutdown |
+| KeyPackage rotation | Client-side timer to upload fresh KeyPackages before TTL expiry |
+
+### Phase 3 -- Auth, Device, and Server Hardening
+
+**Goal:** Add account/device identity and token-based authentication.
+
+See [Auth, Devices, and Tokens](authz-plan.md) for the full design.
+
+| Task | Description |
+|------|-------------|
+| Account + device model | `{account_id, device_id, device_pubkey}` with status lifecycle |
+| Token issuance | Access + refresh tokens; configurable expiry |
+| RPC auth middleware | Validate token on every RPC; map to account/device |
+| Identity binding | Bind MLS identity key to account; reject mismatched uploads |
+| Rate limiting | Per-IP, per-account, per-device counters |
+| Audit logging | Auth events, token lifecycle, rate limit hits |
+
+### Phase 4 -- Delivery Semantics and Client Resilience
+
+**Goal:** Reliable message delivery and 1:1 channels.
+
+See [1:1 Channel Design](dm-channels.md) for the DM-specific design.
+
+| Task | Description |
+|------|-------------|
+| Idempotent message IDs | Client-generated UUIDs; server deduplicates |
+| Ordering guarantees | Per-channel sequence numbers; client detects gaps |
+| Offline queue | Server retains messages for offline recipients (up to TTL) |
+| 1:1 channels | Channel creation, membership, per-channel authz |
+| TTL eviction | Background sweep + fetch-time check for expired messages |
+| Client retry | Exponential backoff with jitter on transient failures |
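
The client retry task (exponential backoff with jitter) can be sketched as two small functions. This assumes the "full jitter" variant, where the delay is drawn uniformly between zero and the exponential ceiling; the document does not specify which variant is intended. `backoff_ceiling` and `jittered_delay` are hypothetical names, and the random value is supplied by the caller so the sketch needs no external crate.

```rust
use std::time::Duration;

/// Upper bound for the delay before retry number `attempt` (0-based):
/// base * 2^attempt, capped at `cap`.
pub fn backoff_ceiling(attempt: u32, base: Duration, cap: Duration) -> Duration {
    // Saturate instead of overflowing for very large attempt counts.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(cap).min(cap)
}

/// Full jitter: pick a delay in [0, ceiling] from a caller-supplied random
/// number. Taking `random` as a parameter is an assumption made so the
/// function stays deterministic and testable.
pub fn jittered_delay(attempt: u32, base: Duration, cap: Duration, random: u64) -> Duration {
    let ceiling_ms = backoff_ceiling(attempt, base, cap).as_millis() as u64;
    Duration::from_millis(random % ceiling_ms.saturating_add(1))
}
```

Jitter spreads retries from many clients over time, so a brief server outage is not followed by a synchronized burst of reconnects.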
+
+### Phase 5 -- E2E Harness and Security Tests
+
+**Goal:** Automated end-to-end testing and security validation.
+
+| Task | Description |
+|------|-------------|
+| docker-compose testnet | Multi-node test environment with configurable topology |
+| Positive E2E tests | Full group lifecycle: register, create, invite, join, send, recv, leave |
+| Negative E2E tests | Expired tokens, wrong identity, replay, malformed messages |
+| Compat matrix | N-1 client/server version testing |
+| Fuzz targets | `cargo-fuzz` targets for Cap'n Proto parsers, MLS message handlers |
+| Golden-wire fixtures | Serialised test vectors for regression testing across versions |
+
+### Phase 6 -- Reliability, Performance, and Operations
+
+**Goal:** Production-grade operations and performance validation.
+
+| Task | Description |
+|------|-------------|
+| SQLite/SQLCipher persistence | AS key store, DS message log, client state (M6) |
+| Soak testing | 72-hour continuous operation under synthetic load |
+| Load testing | Throughput and latency benchmarks (Criterion + custom harness) |
+| Chaos testing | Network partitions, process crashes, disk full scenarios |
+| Backup / restore | SQLite backup with integrity verification |
+| Canary / rollback | Rolling deployment strategy with automatic rollback on failure |
+| Metrics + dashboards | Prometheus metrics, Grafana dashboards (see [Future Research](future-research.md)) |
+
+---
+
+## Planning Checklist
+
+Use this checklist when planning a new milestone or phase. Each item should have
+a documented decision before implementation begins.
+
+- [ ] **Release criteria / SLOs** -- Define what "done" means: latency targets,
+      error-rate thresholds, and test coverage minimums.
+- [ ] **Threat model review** -- Update the [Threat Model](../cryptography/threat-model.md)
+      for any new attack surface introduced by this phase.
+- [ ] **Protocol policy** -- Ciphersuite allowlist, wire version, downgrade rules.
+- [ ] **Identity / auth model** -- Who authenticates, how, and what operations
+      are gated.
+- [ ] **Data model** -- Schema changes, migrations, backward compatibility.
+- [ ] **Abuse controls** -- Rate limits, size caps, connection limits for this phase.
+- [ ] **Observability contracts** -- What new metrics, logs, and traces are needed.
+- [ ] **Environments / secrets** -- Dev, staging, production configuration;
+      secret rotation plan.
+- [ ] **Testing matrix** -- Unit, integration, E2E, negative, fuzz, compat tests
+      for this phase.
+- [ ] **Rollout / ops** -- Deployment strategy, rollback plan, monitoring during
+      rollout.
+
+---
+
+## Cross-references
+
+- [Milestones](milestones.md) -- feature milestone tracker
+- [Auth, Devices, and Tokens](authz-plan.md) -- Phase 3 design
+- [1:1 Channel Design](dm-channels.md) -- Phase 4 design
+- [Future Research](future-research.md) -- technology options for Phase 6+
+- [Coding Standards](../contributing/coding-standards.md) -- engineering standards
+- [Testing Strategy](../contributing/testing.md) -- test structure and conventions
+- [Threat Model](../cryptography/threat-model.md) -- security analysis