Production Readiness WBS
This page defines the work breakdown structure (WBS) for taking quicproquo from a proof-of-concept to a production-hardened system. It covers feature scope, security policy, phased delivery, and a planning checklist.
For the milestone-by-milestone tracker, see Milestones. This document focuses on the cross-cutting concerns that span multiple milestones.
Feature Scope (Must-Have)
These are the feature areas that must be addressed before quicproquo can be considered production-ready. Each area maps to one or more milestones or phases in the WBS below.
| Area | Description | Primary Milestone |
|---|---|---|
| Identity / Auth | Account creation, device registration, token-based RPC authentication, MLS identity binding | M4 + Phase 3 |
| Key / MLS Lifecycle | KeyPackage rotation, epoch advancement, member removal, credential updates | M5 + Phase 2 |
| Transport / Delivery | QUIC + TLS 1.3 hardening, ALPN enforcement, connection draining, reconnect | M1 (done) + Phase 2 |
| Private 1:1 Channels | Channel creation, per-channel authz, TTL eviction, DM-specific flows | Phase 4 |
| Storage / Persistence | SQLite (or SQLCipher) for AS, DS, client state; migrations; backup/restore | M6 + Phase 6 |
| Observability / Ops | Structured logging, metrics, distributed tracing, healthcheck endpoints | Phase 6 |
| Client Resilience | Offline queue, retry with backoff, idempotent message IDs, gap detection | Phase 4 |
| Compatibility / Protocols | Wire versioning, N-1 client interoperability, ciphersuite negotiation | Phase 2 + Phase 5 |
Security Plan (By Design)
quicproquo follows a security-by-design philosophy. The standards below are non-negotiable -- see Coding Standards for how they are enforced in code.
Governance
- `CODEOWNERS` file mapping each crate to a responsible reviewer.
- All PRs require at least one review from a crate owner.
- Security-sensitive changes (crypto, auth, wire format) require two reviewers.
- GPG-signed commits only.
Transport Policy
- TLS 1.3 only (`rustls` configured with `TLS13` cipher suites exclusively).
- ALPN token `b"qpq"` required; reject connections with mismatched ALPN.
- Self-signed certificates acceptable for development; production deployments must use a CA-signed certificate or certificate pinning.
- Connection draining on shutdown (QUIC `CONNECTION_CLOSE`).
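
The ALPN rule can also be enforced as an explicit check after the handshake. This is a minimal sketch. It assumes the QUIC layer exposes the negotiated protocol as an `Option<&[u8]>`. `check_alpn` and `TransportError` are hypothetical names, not part of quicproquo or any library. The server would still advertise `qpq` through rustls' `ServerConfig::alpn_protocols`. Depending on the TLS stack, though, a client that offers no ALPN at all may complete the handshake anyway, and this check closes that case.

```rust
/// ALPN token required by the transport policy.
pub const ALPN_QPQ: &[u8] = b"qpq";

#[derive(Debug, PartialEq, Eq)]
pub enum TransportError {
    /// The peer finished the handshake without negotiating any ALPN protocol.
    MissingAlpn,
    /// The peer negotiated some protocol other than `qpq`.
    WrongAlpn(Vec<u8>),
}

/// Accept a connection only if the negotiated ALPN protocol is exactly `qpq`.
/// (Hypothetical helper; the caller would close the connection on `Err`.)
pub fn check_alpn(negotiated: Option<&[u8]>) -> Result<(), TransportError> {
    match negotiated {
        Some(p) if p == ALPN_QPQ => Ok(()),
        Some(p) => Err(TransportError::WrongAlpn(p.to_vec())),
        None => Err(TransportError::MissingAlpn),
    }
}
```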
MLS Policy
- Ciphersuite: `MLS_128_DHKEMX25519_AES128GCM_SHA256_Ed25519` (baseline).
- Single-use KeyPackages (consumed on fetch, per RFC 9420).
- KeyPackage TTL: 24 hours; clients must rotate before expiry.
- Ciphersuite allowlist: server rejects KeyPackages with unknown ciphersuites.
- No downgrade: once a group has used a ciphersuite, members cannot rejoin with a weaker one.
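
The allowlist and the 24-hour TTL can be combined into one server-side KeyPackage admission check. This sketch assumes the server can read the ciphersuite code point and the upload time from stored KeyPackage metadata; the actual parsing would be left to the MLS library. `check_key_package` and `KeyPackageError` are hypothetical names. `0x0001` is the RFC 9420 registry value for the baseline suite.

```rust
use std::time::{Duration, SystemTime};

/// RFC 9420 code point for MLS_128_DHKEMX25519_AES128GCM_SHA256_Ed25519.
pub const MLS_BASELINE_SUITE: u16 = 0x0001;

/// Ciphersuites the server accepts (baseline only, per the MLS policy).
pub const ALLOWED_SUITES: &[u16] = &[MLS_BASELINE_SUITE];

/// KeyPackage lifetime from the MLS policy: 24 hours.
pub const KEY_PACKAGE_TTL: Duration = Duration::from_secs(24 * 60 * 60);

#[derive(Debug, PartialEq, Eq)]
pub enum KeyPackageError {
    UnknownCiphersuite(u16),
    Expired,
}

/// Admission check for a KeyPackage, usable at upload time and again at fetch time.
pub fn check_key_package(
    suite: u16,
    uploaded_at: SystemTime,
    now: SystemTime,
) -> Result<(), KeyPackageError> {
    if !ALLOWED_SUITES.contains(&suite) {
        return Err(KeyPackageError::UnknownCiphersuite(suite));
    }
    // If the clock appears to have gone backwards, treat the package as brand new.
    let age = now.duration_since(uploaded_at).unwrap_or(Duration::ZERO);
    if age >= KEY_PACKAGE_TTL {
        return Err(KeyPackageError::Expired);
    }
    Ok(())
}
```

Running the check at fetch time as well as at upload prevents an expired KeyPackage from being handed out before the TTL sweep has removed it.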
Input Validation
- All incoming Protobuf messages validated against schema before processing.
- Maximum payload size: 5 MB per RPC call.
- Group ID, identity key, and channel ID fields validated for correct length (32 bytes, 32 bytes, 16 bytes respectively).
- UTF-8 validation on all string fields.
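
The size, length, and UTF-8 rules are easiest to keep consistent when each one has a single helper at the decoding boundary. This is a minimal sketch with hypothetical names. It assumes "5 MB" means 5 MiB and that the fixed-size IDs arrive as variable-length Protobuf `bytes` fields.

```rust
/// 5 MB payload cap; this sketch assumes MiB (5 * 1024 * 1024 bytes).
pub const MAX_PAYLOAD_BYTES: usize = 5 * 1024 * 1024;
pub const GROUP_ID_LEN: usize = 32;
pub const IDENTITY_KEY_LEN: usize = 32;
pub const CHANNEL_ID_LEN: usize = 16;

#[derive(Debug, PartialEq, Eq)]
pub enum ValidationError {
    PayloadTooLarge(usize),
    BadLength { field: &'static str, expected: usize, got: usize },
    InvalidUtf8(&'static str),
}

/// Reject oversized payloads before doing any decoding work.
pub fn check_payload_size(len: usize) -> Result<(), ValidationError> {
    if len > MAX_PAYLOAD_BYTES {
        Err(ValidationError::PayloadTooLarge(len))
    } else {
        Ok(())
    }
}

/// Turn a variable-length `bytes` field into a fixed-size ID, rejecting wrong lengths.
pub fn fixed<const N: usize>(field: &'static str, bytes: &[u8]) -> Result<[u8; N], ValidationError> {
    if bytes.len() != N {
        return Err(ValidationError::BadLength { field, expected: N, got: bytes.len() });
    }
    let mut out = [0u8; N];
    out.copy_from_slice(bytes);
    Ok(out)
}

/// Check that a field holds well-formed UTF-8 and borrow it as `&str`.
pub fn check_utf8<'a>(field: &'static str, bytes: &'a [u8]) -> Result<&'a str, ValidationError> {
    std::str::from_utf8(bytes).map_err(|_| ValidationError::InvalidUtf8(field))
}
```

Converting IDs to `[u8; N]` at the boundary lets downstream code rely on the type instead of re-checking lengths.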
Secrets Management
- All private key material wrapped in `Zeroizing<T>` (via the `zeroize` crate).
- No secret material in log output at any level.
- No `unwrap()` on cryptographic operations -- all errors are typed and propagated.
- Constant-time comparison for authentication tokens and key fingerprints.
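
The constant-time rule exists because an early-exit `==` on tokens can leak, through timing, how many leading bytes matched. Below is a sketch of the idea in plain `std`; `ct_eq` is a hypothetical name. In practice a vetted crate such as `subtle` (its `ConstantTimeEq` trait) is the safer choice, because the compiler gives no hard guarantee that a hand-written loop stays branch-free.

```rust
/// Compare two byte strings without exiting early on the first mismatch.
///
/// Illustrative only: optimisers may still transform this loop, which is why
/// production code should prefer a dedicated constant-time crate.
pub fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    // Length is treated as public: tokens and fingerprints have fixed, known sizes.
    if a.len() != b.len() {
        return false;
    }
    // OR together the XOR of every byte pair, so all bytes are always visited.
    let diff = a.iter().zip(b.iter()).fold(0u8, |acc, (x, y)| acc | (x ^ y));
    // `black_box` discourages the optimiser from reasoning about `diff`.
    std::hint::black_box(diff) == 0
}
```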
Abuse / DoS Controls
- Rate limiting: 50 requests/second per IP, per account, and per device.
- Payload cap: 5 MB per message.
- Connection limit: configurable max concurrent QUIC connections.
- KeyPackage upload limit: configurable per account (prevents store exhaustion).
- Long-poll timeout cap: server-enforced maximum for `fetchWait`.
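
The per-IP, per-account, and per-device limits can share one token-bucket implementation keyed by dimension. This is a minimal sketch. `RateLimiter` and `RateKey` are hypothetical, and the burst size is an assumption, since the policy only fixes 50 requests/second. Time is passed in explicitly so the logic can be tested deterministically.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::Instant;

/// The dimension a counter is keyed on; each gets its own bucket.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub enum RateKey {
    Ip(IpAddr),
    Account(String),
    Device(String),
}

struct Bucket {
    tokens: f64,
    last_refill: Instant,
}

/// Token bucket: each key refills at `rate_per_sec`, capped at `burst` tokens.
pub struct RateLimiter {
    rate_per_sec: f64,
    burst: f64,
    buckets: HashMap<RateKey, Bucket>,
}

impl RateLimiter {
    pub fn new(rate_per_sec: f64, burst: f64) -> Self {
        RateLimiter { rate_per_sec, burst, buckets: HashMap::new() }
    }

    /// Top up one key's bucket for the time elapsed since its last refill.
    fn refill(&mut self, key: &RateKey, now: Instant) -> f64 {
        let (rate, burst) = (self.rate_per_sec, self.burst);
        let bucket = self
            .buckets
            .entry(key.clone())
            .or_insert(Bucket { tokens: burst, last_refill: now });
        let elapsed = now.saturating_duration_since(bucket.last_refill).as_secs_f64();
        bucket.tokens = (bucket.tokens + elapsed * rate).min(burst);
        // Never move the refill clock backwards if calls arrive out of order.
        bucket.last_refill = bucket.last_refill.max(now);
        bucket.tokens
    }

    /// Admit a request only if every key has a token, then charge one to each.
    /// A rejected request consumes nothing.
    pub fn allow(&mut self, keys: &[RateKey], now: Instant) -> bool {
        let admitted = keys.iter().all(|k| self.refill(k, now) >= 1.0);
        if admitted {
            for k in keys {
                if let Some(bucket) = self.buckets.get_mut(k) {
                    bucket.tokens -= 1.0;
                }
            }
        }
        admitted
    }
}
```

A request is charged against all three buckets only when all three have capacity. A client rejected on its account limit therefore does not also drain its IP budget. A production limiter would also evict idle buckets, so that the map itself cannot become a memory-exhaustion vector.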
Data Protection
- MLS ciphertext is opaque to the server (DS never holds group keys).
- Message retention: 7 days default, configurable.
- KeyPackage retention: 24 hours (TTL eviction).
- At-rest encryption for persistent storage (SQLCipher at M6).
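
Retention for both stored messages and KeyPackages reduces to one age check. It is applied both by a background sweep and at fetch time, which is the Phase 4 "TTL eviction" task. This is a minimal sketch over an in-memory queue. `Stored`, `sweep`, and `fetch` are hypothetical; the real store would be the SQLite/SQLCipher backend.

```rust
use std::time::{Duration, SystemTime};

pub const MESSAGE_RETENTION: Duration = Duration::from_secs(7 * 24 * 60 * 60);
pub const KEY_PACKAGE_RETENTION: Duration = Duration::from_secs(24 * 60 * 60);

pub struct Stored {
    /// Opaque MLS ciphertext; the server never decrypts it.
    pub ciphertext: Vec<u8>,
    pub stored_at: SystemTime,
}

/// True once an item has been held for at least `ttl`.
pub fn is_expired(stored_at: SystemTime, ttl: Duration, now: SystemTime) -> bool {
    // If `stored_at` lies in the future (clock skew), keep the item.
    now.duration_since(stored_at).map(|age| age >= ttl).unwrap_or(false)
}

/// Background sweep: drop expired items and report how many were removed.
pub fn sweep(queue: &mut Vec<Stored>, ttl: Duration, now: SystemTime) -> usize {
    let before = queue.len();
    queue.retain(|m| !is_expired(m.stored_at, ttl, now));
    before - queue.len()
}

/// Fetch-time check: never return an item the sweep has not reached yet.
pub fn fetch(queue: &[Stored], ttl: Duration, now: SystemTime) -> Vec<&[u8]> {
    queue
        .iter()
        .filter(|m| !is_expired(m.stored_at, ttl, now))
        .map(|m| m.ciphertext.as_slice())
        .collect()
}
```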
Logging Safety
- Structured logging via `tracing` with `env-filter`.
- Sensitive fields (keys, tokens, ciphertext) are never logged, even at `TRACE`.
- Audit-level events: auth success/failure, token issuance, KeyPackage upload, enqueue/fetch, rate limit hits.
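
One way to make "never logged, even at `TRACE`" hold by construction is to wrap sensitive values in a type whose `Debug` and `Display` output is a fixed placeholder. A `tracing` field recorded with `?` (Debug) or `%` (Display) then cannot print them. This is a sketch; `Redacted` and `AuthRequest` are hypothetical names.

```rust
use std::fmt;

/// Formats as a placeholder no matter what it wraps.
pub struct Redacted<T>(pub T);

impl<T> fmt::Debug for Redacted<T> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str("[REDACTED]")
    }
}

impl<T> fmt::Display for Redacted<T> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str("[REDACTED]")
    }
}

/// A request type that can be logged wholesale with `{:?}` without leaking the token.
#[derive(Debug)]
pub struct AuthRequest {
    pub account_id: u64,
    pub access_token: Redacted<String>,
}
```

The wrapper only controls formatting. The value is still in memory, so key material would additionally sit inside `Zeroizing<T>`.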
Testing
- Unit tests for all crypto operations (see Testing Strategy).
- Integration tests for every RPC method.
- Negative tests: malformed input, expired tokens, wrong identity, replay attempts.
- N-1 compatibility tests (old client against new server).
- Fuzzing targets for Protobuf parsers and MLS message handling (Phase 5).
Work Breakdown (6 Phases)
Phase 1 -- Baselines and Governance
Goal: Establish project hygiene before adding features.
| Task | Description |
|---|---|
| CODEOWNERS | Map crates to responsible reviewers |
| CI pipeline | GitHub Actions: `cargo test --workspace`, `cargo clippy`, `cargo fmt --check`, `cargo deny check` |
| SBOM generation | `cargo-cyclonedx` or `cargo-about` in CI; publish with each release |
| Threat model | Document assets, adversaries, attack surface, trust boundaries; reference in Threat Model |
| Dependency audit | `cargo audit` in CI; pin all major versions per Coding Standards |
Phase 2 -- Protocols and Core Hardening
Goal: Lock down the wire format and cryptographic policy.
| Task | Description |
|---|---|
| Wire versioning | Version field in all Protobuf frames; reject unknown versions |
| Ciphersuite allowlist | Server rejects KeyPackages outside the allowed set |
| Downgrade guards | Prevent epoch rollback; reject Commits with weaker ciphersuites |
| ALPN enforcement | Reject connections without the `b"qpq"` ALPN token |
| Connection draining | Graceful QUIC `CONNECTION_CLOSE` on server shutdown |
| KeyPackage rotation | Client-side timer to upload fresh KeyPackages before TTL expiry |
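
The wire-version check and the epoch-rollback guard are both small gates that run before any handler logic. This sketch rests on a few assumptions. The version numbers are placeholders. The accepted window is taken to be the current version plus N-1, per the compatibility requirement. Epochs are tracked per 32-byte group ID. All names are hypothetical.

```rust
use std::collections::HashMap;

/// Placeholder version numbers: current wire version and the oldest still accepted (N-1).
pub const WIRE_VERSION_CURRENT: u32 = 2;
pub const WIRE_VERSION_MIN: u32 = WIRE_VERSION_CURRENT - 1;

#[derive(Debug, PartialEq, Eq)]
pub enum ProtocolError {
    UnknownVersion(u32),
    EpochRollback { group_epoch: u64, message_epoch: u64 },
}

/// Reject frames whose version field falls outside the supported window.
pub fn check_wire_version(v: u32) -> Result<(), ProtocolError> {
    if (WIRE_VERSION_MIN..=WIRE_VERSION_CURRENT).contains(&v) {
        Ok(())
    } else {
        Err(ProtocolError::UnknownVersion(v))
    }
}

/// Tracks the highest epoch accepted per group and refuses anything older.
#[derive(Default)]
pub struct EpochGuard {
    latest: HashMap<[u8; 32], u64>, // group ID -> highest accepted epoch
}

impl EpochGuard {
    pub fn observe(&mut self, group_id: [u8; 32], epoch: u64) -> Result<(), ProtocolError> {
        let current = self.latest.entry(group_id).or_insert(epoch);
        if epoch < *current {
            return Err(ProtocolError::EpochRollback { group_epoch: *current, message_epoch: epoch });
        }
        *current = epoch;
        Ok(())
    }
}
```

Equal epochs are accepted because many application messages are sent within one epoch. Only a strictly older epoch is treated as a rollback.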
Phase 3 -- Auth, Device, and Server Hardening
Goal: Add account/device identity and token-based authentication.
See Auth, Devices, and Tokens for the full design.
| Task | Description |
|---|---|
| Account + device model | {account_id, device_id, device_pubkey} with status lifecycle |
| Token issuance | Access + refresh tokens; configurable expiry |
| RPC auth middleware | Validate token on every RPC; map to account/device |
| Identity binding | Bind MLS identity key to account; reject mismatched uploads |
| Rate limiting | Per-IP, per-account, per-device counters |
| Audit logging | Auth events, token lifecycle, rate limit hits |
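
The following sketches the per-RPC middleware and the identity-binding rule. It assumes a hypothetical scheme where sessions are looked up by token ID. A real deployment would instead look up a hash of the token and compare secrets in constant time, as the Secrets Management rules require. `AuthState`, `Principal`, and the other names are illustrative only.

```rust
use std::collections::HashMap;
use std::time::SystemTime;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum DeviceStatus {
    Active,
    Revoked,
}

/// The authenticated caller that handlers receive.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Principal {
    pub account_id: u64,
    pub device_id: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum AuthError {
    UnknownToken,
    Expired,
    DeviceInactive,
    IdentityMismatch,
}

struct Session {
    principal: Principal,
    expires_at: SystemTime,
}

#[derive(Default)]
pub struct AuthState {
    sessions: HashMap<String, Session>,    // token ID -> session (hypothetical scheme)
    devices: HashMap<u64, DeviceStatus>,   // device_id -> lifecycle status
    identity_keys: HashMap<u64, [u8; 32]>, // account_id -> bound MLS identity key
}

impl AuthState {
    pub fn issue(&mut self, token_id: &str, principal: Principal, expires_at: SystemTime) {
        self.devices.entry(principal.device_id).or_insert(DeviceStatus::Active);
        self.sessions.insert(token_id.to_string(), Session { principal, expires_at });
    }

    pub fn revoke_device(&mut self, device_id: u64) {
        self.devices.insert(device_id, DeviceStatus::Revoked);
    }

    pub fn bind_identity(&mut self, account_id: u64, key: [u8; 32]) {
        self.identity_keys.insert(account_id, key);
    }

    /// Run on every RPC before dispatch: map a token to its account and device.
    pub fn authenticate(&self, token_id: &str, now: SystemTime) -> Result<Principal, AuthError> {
        let s = self.sessions.get(token_id).ok_or(AuthError::UnknownToken)?;
        if now >= s.expires_at {
            return Err(AuthError::Expired);
        }
        match self.devices.get(&s.principal.device_id) {
            Some(DeviceStatus::Active) => Ok(s.principal),
            _ => Err(AuthError::DeviceInactive),
        }
    }

    /// Reject KeyPackage uploads whose MLS identity key is not the one bound to the account.
    pub fn check_upload(&self, who: Principal, identity_key: &[u8; 32]) -> Result<(), AuthError> {
        match self.identity_keys.get(&who.account_id) {
            Some(bound) if bound == identity_key => Ok(()),
            _ => Err(AuthError::IdentityMismatch),
        }
    }
}
```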
Phase 4 -- Delivery Semantics and Client Resilience
Goal: Reliable message delivery and 1:1 channels.
See 1:1 Channel Design for the DM-specific design.
| Task | Description |
|---|---|
| Idempotent message IDs | Client-generated UUIDs; server deduplicates |
| Ordering guarantees | Per-channel sequence numbers; client detects gaps |
| Offline queue | Server retains messages for offline recipients (up to TTL) |
| 1:1 channels | Channel creation, membership, per-channel authz |
| TTL eviction | Background sweep + fetch-time check for expired messages |
| Client retry | Exponential backoff with jitter on transient failures |
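
Idempotent IDs and gap detection split naturally between server and client. This is a minimal sketch. It assumes client message IDs are UUIDs held as `u128` and that the server assigns per-channel sequence numbers starting at 1. All names are hypothetical.

```rust
use std::collections::HashMap;

/// Server side: assigns per-channel sequence numbers and absorbs retried sends.
#[derive(Default)]
pub struct ChannelLog {
    next_seq: u64,
    seen: HashMap<u128, u64>, // client message ID -> assigned sequence number
}

impl ChannelLog {
    /// A retry with an already-seen ID gets its original sequence number back
    /// instead of creating a duplicate entry.
    pub fn enqueue(&mut self, msg_id: u128) -> u64 {
        if let Some(&seq) = self.seen.get(&msg_id) {
            return seq;
        }
        self.next_seq += 1;
        self.seen.insert(msg_id, self.next_seq);
        self.next_seq
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum Delivery {
    InOrder,
    Duplicate,
    /// Sequence numbers `from..=to` are missing and should be re-fetched.
    Gap { from: u64, to: u64 },
}

/// Client side: tracks the last contiguous sequence number for one channel.
#[derive(Default)]
pub struct GapDetector {
    last_seen: u64,
}

impl GapDetector {
    pub fn on_receive(&mut self, seq: u64) -> Delivery {
        if seq <= self.last_seen {
            Delivery::Duplicate
        } else if seq == self.last_seen + 1 {
            self.last_seen = seq;
            Delivery::InOrder
        } else {
            // `last_seen` is not advanced: the client re-fetches from `from`
            // onward (including this message) and accepts them in order.
            Delivery::Gap { from: self.last_seen + 1, to: seq - 1 }
        }
    }
}
```

The `seen` map would need pruning on the same TTL as the messages themselves. Otherwise the deduplication table grows without bound.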
Phase 5 -- E2E Harness and Security Tests
Goal: Automated end-to-end testing and security validation.
| Task | Description |
|---|---|
| docker-compose testnet | Multi-node test environment with configurable topology |
| Positive E2E tests | Full group lifecycle: register, create, invite, join, send, recv, leave |
| Negative E2E tests | Expired tokens, wrong identity, replay, malformed messages |
| Compat matrix | N-1 client/server version testing |
| Fuzz targets | cargo-fuzz targets for Protobuf parsers, MLS message handlers |
| Golden-wire fixtures | Serialised test vectors for regression testing across versions |
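
A golden-wire test pins the exact bytes an encoder produces, so an accidental schema or field-number change fails CI. This minimal sketch uses a hand-written encoder for a hypothetical `Envelope { uint32 version = 1; bytes payload = 2; }` message. In the real crate, the bytes would come from the generated Protobuf types (for example prost's `Message::encode_to_vec`, if prost is the library in use). The fixtures would live in files checked into the repository.

```rust
/// Append `v` as a Protobuf base-128 varint (low 7-bit group first).
fn put_varint(out: &mut Vec<u8>, mut v: u64) {
    while v >= 0x80 {
        out.push((v as u8 & 0x7f) | 0x80);
        v >>= 7;
    }
    out.push(v as u8);
}

/// Hand-encode the hypothetical `Envelope` message in field-number order.
pub fn encode_envelope(version: u32, payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    put_varint(&mut out, 1 << 3); // field 1, wire type 0 (varint)
    put_varint(&mut out, u64::from(version));
    put_varint(&mut out, (2 << 3) | 2); // field 2, wire type 2 (length-delimited)
    put_varint(&mut out, payload.len() as u64);
    out.extend_from_slice(payload);
    out
}

pub fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

/// Golden fixture: the bytes `encode_envelope(2, b"hi")` must keep producing.
pub const ENVELOPE_V2_HI: &str = "080212026869";
```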
Phase 6 -- Reliability, Performance, and Operations
Goal: Production-grade operations and performance validation.
| Task | Description |
|---|---|
| SQLite/SQLCipher persistence | AS key store, DS message log, client state (M6) |
| Soak testing | 72-hour continuous operation under synthetic load |
| Load testing | Throughput and latency benchmarks (Criterion + custom harness) |
| Chaos testing | Network partitions, process crashes, disk full scenarios |
| Backup / restore | SQLite backup with integrity verification |
| Canary / rollback | Rolling deployment strategy with automatic rollback on failure |
| Metrics + dashboards | Prometheus metrics, Grafana dashboards (see Future Research) |
Planning Checklist
Use this checklist when planning a new milestone or phase. Each item should have a documented decision before implementation begins.
- Release criteria / SLOs -- Define what "done" means. Latency targets, error rate thresholds, test coverage minimums.
- Threat model review -- Update the Threat Model for any new attack surface introduced by this phase.
- Protocol policy -- Ciphersuite allowlist, wire version, downgrade rules.
- Identity / auth model -- Who authenticates, how, and what operations are gated.
- Data model -- Schema changes, migrations, backward compatibility.
- Abuse controls -- Rate limits, size caps, connection limits for this phase.
- Observability contracts -- What new metrics, logs, and traces are needed.
- Environments / secrets -- Dev, staging, production configuration; secret rotation plan.
- Testing matrix -- Unit, integration, E2E, negative, fuzz, compat tests for this phase.
- Rollout / ops -- Deployment strategy, rollback plan, monitoring during rollout.
Cross-references
- Milestones -- feature milestone tracker
- Auth, Devices, and Tokens -- Phase 3 design
- 1:1 Channel Design -- Phase 4 design
- Future Research -- technology options for Phase 6+
- Coding Standards -- engineering standards
- Testing Strategy -- test structure and conventions
- Threat Model -- security analysis