quicproquo/docs/src/roadmap/production-readiness.md
Christian Nennemann d073f614b3 docs: rewrite mdBook documentation for v2 architecture
Update 25+ files and add 6 new pages to reflect the v2 migration from
Cap'n Proto to Protobuf framing over QUIC. Integrates SDK and Operations
docs into the mdBook, restructures SUMMARY.md, and rewrites the wire
format, architecture, and protocol sections with accurate v2 content.
2026-03-04 22:02:31 +01:00

Production Readiness WBS

This page defines the work breakdown structure (WBS) for taking quicproquo from a proof-of-concept to a production-hardened system. It covers feature scope, security policy, phased delivery, and a planning checklist.

For the milestone-by-milestone tracker, see Milestones. This document focuses on the cross-cutting concerns that span multiple milestones.


Feature Scope (Must-Have)

These are the feature areas that must be addressed before quicproquo can be considered production-ready. Each area maps to one or more milestones or phases in the WBS below.

| Area | Description | Primary Milestone |
|---|---|---|
| Identity / Auth | Account creation, device registration, token-based RPC authentication, MLS identity binding | M4 + Phase 3 |
| Key / MLS Lifecycle | KeyPackage rotation, epoch advancement, member removal, credential updates | M5 + Phase 2 |
| Transport / Delivery | QUIC + TLS 1.3 hardening, ALPN enforcement, connection draining, reconnect | M1 (done) + Phase 2 |
| Private 1:1 Channels | Channel creation, per-channel authz, TTL eviction, DM-specific flows | Phase 4 |
| Storage / Persistence | SQLite (or SQLCipher) for AS, DS, client state; migrations; backup/restore | M6 + Phase 6 |
| Observability / Ops | Structured logging, metrics, distributed tracing, healthcheck endpoints | Phase 6 |
| Client Resilience | Offline queue, retry with backoff, idempotent message IDs, gap detection | Phase 4 |
| Compatibility / Protocols | Wire versioning, N-1 client interoperability, ciphersuite negotiation | Phase 2 + Phase 5 |

Security Plan (By Design)

quicproquo follows a security-by-design philosophy. The standards below are non-negotiable -- see Coding Standards for how they are enforced in code.

Governance

  • CODEOWNERS file mapping each crate to a responsible reviewer.
  • All PRs require at least one review from a crate owner.
  • Security-sensitive changes (crypto, auth, wire format) require two reviewers.
  • GPG-signed commits only.

Transport Policy

  • TLS 1.3 only (rustls configured with TLS13 cipher suites exclusively).
  • ALPN token b"qpq" required; reject connections whose ALPN is missing or mismatched.
  • Self-signed certificates acceptable for development; production deployments must use a CA-signed certificate or certificate pinning.
  • Connection draining on shutdown (QUIC CONNECTION_CLOSE).
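
The ALPN rule above can be sketched as a small check that runs once the TLS handshake has completed. This is a minimal illustration only: `AlpnError` and `check_alpn` are hypothetical names, and the `negotiated` argument stands in for whatever the TLS or QUIC stack reports as the negotiated protocol.

```rust
/// ALPN protocol identifier required by the transport policy.
const REQUIRED_ALPN: &[u8] = b"qpq";

/// Why a connection was refused at the ALPN check (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum AlpnError {
    /// The peer negotiated no application protocol at all.
    Missing,
    /// The peer negotiated a protocol other than `qpq`.
    Mismatch(Vec<u8>),
}

/// Accepts a connection only if the negotiated ALPN is exactly `qpq`.
/// `negotiated` is assumed to come from the TLS stack after the handshake.
fn check_alpn(negotiated: Option<&[u8]>) -> Result<(), AlpnError> {
    match negotiated {
        Some(proto) if proto == REQUIRED_ALPN => Ok(()),
        Some(proto) => Err(AlpnError::Mismatch(proto.to_vec())),
        None => Err(AlpnError::Missing),
    }
}
```

A server would call `check_alpn` before dispatching any RPC and close the connection on `Err`.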

MLS Policy

  • Ciphersuite: MLS_128_DHKEMX25519_AES128GCM_SHA256_Ed25519 (baseline).
  • Single-use KeyPackages (consumed on fetch, per RFC 9420).
  • KeyPackage TTL: 24 hours; clients must rotate before expiry.
  • Ciphersuite allowlist: server rejects KeyPackages with unknown ciphersuites.
  • No downgrade: once a group has used a ciphersuite, members cannot rejoin with a weaker one.
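
As a rough sketch of the allowlist and TTL rules above, a server-side KeyPackage check might look like the following. The function and error names are hypothetical, and the KeyPackage age is passed in directly rather than parsed from a real MLS structure. The value 0x0001 is the identifier RFC 9420 assigns to the baseline ciphersuite.

```rust
use std::time::Duration;

/// RFC 9420 identifier for MLS_128_DHKEMX25519_AES128GCM_SHA256_Ed25519.
const BASELINE_SUITE: u16 = 0x0001;

/// Ciphersuites the server accepts; only the baseline in this sketch.
const ALLOWED_SUITES: &[u16] = &[BASELINE_SUITE];

/// KeyPackage lifetime from the MLS policy (24 hours).
const KEY_PACKAGE_TTL: Duration = Duration::from_secs(24 * 60 * 60);

/// Hypothetical error type for rejected KeyPackages.
#[derive(Debug, PartialEq, Eq)]
enum KeyPackageError {
    UnknownCiphersuite(u16),
    Expired,
}

/// Rejects KeyPackages whose ciphersuite is not on the allowlist or whose
/// age has reached the TTL. `age` is assumed to be computed by the caller.
fn check_key_package(suite: u16, age: Duration) -> Result<(), KeyPackageError> {
    if !ALLOWED_SUITES.contains(&suite) {
        return Err(KeyPackageError::UnknownCiphersuite(suite));
    }
    if age >= KEY_PACKAGE_TTL {
        return Err(KeyPackageError::Expired);
    }
    Ok(())
}
```

The same check could run at upload time and again at fetch time, so that an expired KeyPackage is never handed to a group creator.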

Input Validation

  • All incoming Protobuf messages validated against schema before processing.
  • Maximum payload size: 5 MB per RPC call.
  • Group ID, identity key, and channel ID fields validated for correct length (32 bytes, 32 bytes, 16 bytes respectively).
  • UTF-8 validation on all string fields.
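
The length, size, and UTF-8 rules above could be expressed as a handful of small validators. This is a sketch under stated assumptions: the names are hypothetical, and "5 MB" is interpreted here as 5 × 1024 × 1024 bytes, which the policy does not specify.

```rust
/// Assumption: "5 MB" is read as 5 MiB.
const MAX_PAYLOAD_BYTES: usize = 5 * 1024 * 1024;
const GROUP_ID_LEN: usize = 32;
const IDENTITY_KEY_LEN: usize = 32;
const CHANNEL_ID_LEN: usize = 16;

/// Hypothetical error type for rejected input.
#[derive(Debug, PartialEq, Eq)]
enum ValidationError {
    PayloadTooLarge { len: usize },
    BadLength { field: &'static str, expected: usize, actual: usize },
    InvalidUtf8 { field: &'static str },
}

/// Rejects payloads larger than the per-RPC cap.
fn check_payload(payload: &[u8]) -> Result<(), ValidationError> {
    if payload.len() > MAX_PAYLOAD_BYTES {
        return Err(ValidationError::PayloadTooLarge { len: payload.len() });
    }
    Ok(())
}

/// Checks that a fixed-size field has exactly the expected length.
fn check_len(field: &'static str, bytes: &[u8], expected: usize) -> Result<(), ValidationError> {
    if bytes.len() == expected {
        Ok(())
    } else {
        Err(ValidationError::BadLength { field, expected, actual: bytes.len() })
    }
}

/// Returns the field as `&str` only if it is valid UTF-8.
fn check_utf8<'a>(field: &'static str, bytes: &'a [u8]) -> Result<&'a str, ValidationError> {
    std::str::from_utf8(bytes).map_err(|_| ValidationError::InvalidUtf8 { field })
}
```

A handler would typically call `check_payload` first, then `check_len` for each identifier (for example `check_len("channel_id", id, CHANNEL_ID_LEN)`), and only then touch the message contents.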

Secrets Management

  • All private key material wrapped in Zeroizing<T> (via the zeroize crate).
  • No secret material in log output at any level.
  • No unwrap() on cryptographic operations -- all errors are typed and propagated.
  • Constant-time comparison for authentication tokens and key fingerprints.
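
To illustrate what constant-time comparison means for tokens and fingerprints, the sketch below accumulates differences over the whole input instead of returning at the first mismatching byte. It is hand-rolled for illustration only; the compiler is free to optimize it, so a production build would more plausibly rely on a vetted crate such as `subtle`. The function name `ct_eq` is hypothetical here.

```rust
/// Compares two byte strings without exiting early on the first difference.
/// The lengths are treated as public information in this sketch.
fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    // OR together the XOR of every byte pair; the result is zero only if
    // all pairs were equal, and the loop always visits every byte.
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}
```

The point of the pattern is that the time taken does not depend on where the first differing byte sits, which is what a plain `==` on slices does not guarantee.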

Abuse / DoS Controls

  • Rate limiting: 50 requests/second per IP, per account, and per device.
  • Payload cap: 5 MB per message.
  • Connection limit: configurable max concurrent QUIC connections.
  • KeyPackage upload limit: configurable per account (prevents store exhaustion).
  • Long-poll timeout cap: server-enforced maximum for fetchWait.
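
One common way to implement a requests-per-second limit like the one above is a token bucket per key. The sketch below is an assumption about the mechanism, not a description of the server: the type names are hypothetical, the burst capacity is assumed to equal the per-second rate, and the current time is passed in so the logic can be tested deterministically.

```rust
use std::collections::HashMap;
use std::time::Instant;

/// Rate from the abuse-control policy: 50 requests per second.
const RATE_PER_SEC: f64 = 50.0;

/// Per-key bucket state (hypothetical type).
struct Bucket {
    tokens: f64,
    last: Instant,
}

/// Token-bucket limiter keyed by an arbitrary string such as
/// "ip:...", "account:...", or "device:...".
struct RateLimiter {
    capacity: f64,
    refill_per_sec: f64,
    buckets: HashMap<String, Bucket>,
}

impl RateLimiter {
    /// Assumption: burst capacity equals the per-second rate.
    fn new(rate_per_sec: f64) -> Self {
        Self { capacity: rate_per_sec, refill_per_sec: rate_per_sec, buckets: HashMap::new() }
    }

    /// Returns true if a request for `key` is allowed at time `now`.
    fn allow(&mut self, key: &str, now: Instant) -> bool {
        let capacity = self.capacity;
        let bucket = self
            .buckets
            .entry(key.to_owned())
            .or_insert(Bucket { tokens: capacity, last: now });
        // Refill in proportion to elapsed time, capped at capacity.
        let elapsed = now.saturating_duration_since(bucket.last).as_secs_f64();
        bucket.tokens = (bucket.tokens + elapsed * self.refill_per_sec).min(capacity);
        bucket.last = now;
        if bucket.tokens >= 1.0 {
            bucket.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

Under this design a request would be admitted only if the IP, account, and device keys all pass, which matches the three counters named in the policy.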

Data Protection

  • MLS ciphertext is opaque to the server (DS never holds group keys).
  • Message retention: 7 days default, configurable.
  • KeyPackage retention: 24 hours (TTL eviction).
  • At-rest encryption for persistent storage (SQLCipher at M6).

Logging Safety

  • Structured logging via tracing with env-filter.
  • Sensitive fields (keys, tokens, ciphertext) are never logged, even at TRACE.
  • Audit-level events: auth success/failure, token issuance, keypackage upload, enqueue/fetch, rate limit hits.

Testing

  • Unit tests for all crypto operations (see Testing Strategy).
  • Integration tests for every RPC method.
  • Negative tests: malformed input, expired tokens, wrong identity, replay attempts.
  • N-1 compatibility tests (old client against new server).
  • Fuzzing targets for Protobuf parsers and MLS message handling (Phase 5).

Work Breakdown (6 Phases)

Phase 1 -- Baselines and Governance

Goal: Establish project hygiene before adding features.

| Task | Description |
|---|---|
| CODEOWNERS | Map crates to responsible reviewers |
| CI pipeline | GitHub Actions: cargo test --workspace, cargo clippy, cargo fmt --check, cargo deny check |
| SBOM generation | cargo-cyclonedx or cargo-about in CI; publish with each release |
| Threat model | Document assets, adversaries, attack surface, trust boundaries; reference in Threat Model |
| Dependency audit | cargo audit in CI; pin all major versions per Coding Standards |

Phase 2 -- Protocols and Core Hardening

Goal: Lock down the wire format and cryptographic policy.

| Task | Description |
|---|---|
| Wire versioning | Version field in all Protobuf frames; reject unknown versions |
| Ciphersuite allowlist | Server rejects KeyPackages outside the allowed set |
| Downgrade guards | Prevent epoch rollback; reject Commits with weaker ciphersuites |
| ALPN enforcement | Reject connections without b"qpq" ALPN token |
| Connection draining | Graceful QUIC CONNECTION_CLOSE on server shutdown |
| KeyPackage rotation | Client-side timer to upload fresh KeyPackages before TTL expiry |
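
The wire-versioning and epoch-rollback tasks above could be sketched as two small guards. The version numbers are assumptions chosen for illustration (a current version of 2 with one older version accepted for N-1 interoperability), and all names are hypothetical.

```rust
/// Assumption: the current wire version is 2 and N-1 (version 1) is accepted.
const CURRENT_WIRE_VERSION: u32 = 2;
const MIN_WIRE_VERSION: u32 = CURRENT_WIRE_VERSION - 1;

/// Hypothetical error type for protocol policy violations.
#[derive(Debug, PartialEq, Eq)]
enum PolicyError {
    UnsupportedWireVersion(u32),
    EpochRollback { current: u64, proposed: u64 },
}

/// Rejects frames whose version field is outside the supported window.
fn check_wire_version(version: u32) -> Result<(), PolicyError> {
    if (MIN_WIRE_VERSION..=CURRENT_WIRE_VERSION).contains(&version) {
        Ok(())
    } else {
        Err(PolicyError::UnsupportedWireVersion(version))
    }
}

/// Rejects any state change that would not move the group epoch forward.
fn check_epoch(current: u64, proposed: u64) -> Result<(), PolicyError> {
    if proposed > current {
        Ok(())
    } else {
        Err(PolicyError::EpochRollback { current, proposed })
    }
}
```

Both guards are cheap enough to run before any further parsing of the frame or Commit.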

Phase 3 -- Auth, Device, and Server Hardening

Goal: Add account/device identity and token-based authentication.

See Auth, Devices, and Tokens for the full design.

| Task | Description |
|---|---|
| Account + device model | {account_id, device_id, device_pubkey} with status lifecycle |
| Token issuance | Access + refresh tokens; configurable expiry |
| RPC auth middleware | Validate token on every RPC; map to account/device |
| Identity binding | Bind MLS identity key to account; reject mismatched uploads |
| Rate limiting | Per-IP, per-account, per-device counters |
| Audit logging | Auth events, token lifecycle, rate limit hits |
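
The identity-binding task above can be illustrated with a small registry that remembers which identity key belongs to which account. This sketch assumes a bind-on-first-upload model, which the design document referenced above may or may not use; `IdentityBindings` and `BindingError` are hypothetical names.

```rust
use std::collections::HashMap;

/// Hypothetical error returned when an upload carries the wrong identity key.
#[derive(Debug, PartialEq, Eq)]
enum BindingError {
    KeyMismatch,
}

/// Maps each account to its bound 32-byte MLS identity key.
#[derive(Default)]
struct IdentityBindings {
    by_account: HashMap<String, [u8; 32]>,
}

impl IdentityBindings {
    /// Assumption: the first upload binds the key; later uploads must match it.
    fn bind_or_verify(&mut self, account_id: &str, identity_key: [u8; 32]) -> Result<(), BindingError> {
        match self.by_account.get(account_id) {
            Some(bound) if *bound == identity_key => Ok(()),
            Some(_) => Err(BindingError::KeyMismatch),
            None => {
                self.by_account.insert(account_id.to_owned(), identity_key);
                Ok(())
            }
        }
    }
}
```

In a real server the account ID would come from the validated token rather than from the request body, so a client cannot upload KeyPackages under another account's identity.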

Phase 4 -- Delivery Semantics and Client Resilience

Goal: Reliable message delivery and 1:1 channels.

See 1:1 Channel Design for the DM-specific design.

| Task | Description |
|---|---|
| Idempotent message IDs | Client-generated UUIDs; server deduplicates |
| Ordering guarantees | Per-channel sequence numbers; client detects gaps |
| Offline queue | Server retains messages for offline recipients (up to TTL) |
| 1:1 channels | Channel creation, membership, per-channel authz |
| TTL eviction | Background sweep + fetch-time check for expired messages |
| Client retry | Exponential backoff with jitter on transient failures |
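
Two of the tasks above, server-side deduplication and client-side gap detection, can be sketched with standard collections. The types are hypothetical; message IDs are modelled as raw 16-byte UUID values, and sequence numbers are assumed to increase by one per message within a channel.

```rust
use std::collections::HashSet;

/// Server side: remembers message IDs so that retries are not stored twice.
#[derive(Default)]
struct Deduplicator {
    seen: HashSet<[u8; 16]>,
}

impl Deduplicator {
    /// Returns true the first time an ID is seen, false for a repeat.
    fn accept(&mut self, message_id: [u8; 16]) -> bool {
        self.seen.insert(message_id)
    }
}

/// Result of checking one received sequence number.
#[derive(Debug, PartialEq, Eq)]
enum SeqCheck {
    InOrder,
    /// Below the cursor: a duplicate or a late arrival.
    Old,
    /// Sequence numbers `missing_from..=missing_to` were skipped.
    Gap { missing_from: u64, missing_to: u64 },
}

/// Client side: tracks the next expected sequence number for one channel.
struct ChannelCursor {
    next_expected: u64,
}

impl ChannelCursor {
    fn observe(&mut self, seq: u64) -> SeqCheck {
        if seq < self.next_expected {
            SeqCheck::Old
        } else if seq == self.next_expected {
            self.next_expected += 1;
            SeqCheck::InOrder
        } else {
            let gap = SeqCheck::Gap { missing_from: self.next_expected, missing_to: seq - 1 };
            self.next_expected = seq + 1;
            gap
        }
    }
}
```

On a `Gap`, a client could re-fetch the missing range before presenting later messages. A production deduplicator would also need to expire old IDs, for example alongside message TTL eviction, so the set does not grow without bound.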

Phase 5 -- E2E Harness and Security Tests

Goal: Automated end-to-end testing and security validation.

| Task | Description |
|---|---|
| docker-compose testnet | Multi-node test environment with configurable topology |
| Positive E2E tests | Full group lifecycle: register, create, invite, join, send, recv, leave |
| Negative E2E tests | Expired tokens, wrong identity, replay, malformed messages |
| Compat matrix | N-1 client/server version testing |
| Fuzz targets | cargo-fuzz targets for Protobuf parsers, MLS message handlers |
| Golden-wire fixtures | Serialised test vectors for regression testing across versions |

Phase 6 -- Reliability, Performance, and Operations

Goal: Production-grade operations and performance validation.

| Task | Description |
|---|---|
| SQLite/SQLCipher persistence | AS key store, DS message log, client state (M6) |
| Soak testing | 72-hour continuous operation under synthetic load |
| Load testing | Throughput and latency benchmarks (Criterion + custom harness) |
| Chaos testing | Network partitions, process crashes, disk full scenarios |
| Backup / restore | SQLite backup with integrity verification |
| Canary / rollback | Rolling deployment strategy with automatic rollback on failure |
| Metrics + dashboards | Prometheus metrics, Grafana dashboards (see Future Research) |

Planning Checklist

Use this checklist when planning a new milestone or phase. Each item should have a documented decision before implementation begins.

  • Release criteria / SLOs -- Define what "done" means: latency targets, error-rate thresholds, and test coverage minimums.
  • Threat model review -- Update the Threat Model for any new attack surface introduced by this phase.
  • Protocol policy -- Ciphersuite allowlist, wire version, downgrade rules.
  • Identity / auth model -- Who authenticates, how, and what operations are gated.
  • Data model -- Schema changes, migrations, backward compatibility.
  • Abuse controls -- Rate limits, size caps, connection limits for this phase.
  • Observability contracts -- What new metrics, logs, and traces are needed.
  • Environments / secrets -- Dev, staging, production configuration; secret rotation plan.
  • Testing matrix -- Unit, integration, E2E, negative, fuzz, compat tests for this phase.
  • Rollout / ops -- Deployment strategy, rollback plan, monitoring during rollout.

Cross-references