# Production Readiness WBS

This page defines the work breakdown structure (WBS) for taking quicprochat
from a proof-of-concept to a production-hardened system. It covers feature scope,
security policy, phased delivery, and a planning checklist.

For the milestone-by-milestone tracker, see [Milestones](milestones.md). This
document focuses on the cross-cutting concerns that span multiple milestones.

---

## Feature Scope (Must-Have)

These are the feature areas that must be addressed before quicprochat can be
considered production-ready. Each area maps to one or more milestones or phases
in the WBS below.

| Area | Description | Primary Milestone |
|------|-------------|-------------------|
| **Identity / Auth** | Account creation, device registration, token-based RPC authentication, MLS identity binding | M4 + Phase 3 |
| **Key / MLS Lifecycle** | KeyPackage rotation, epoch advancement, member removal, credential updates | M5 + Phase 2 |
| **Transport / Delivery** | QUIC + TLS 1.3 hardening, ALPN enforcement, connection draining, reconnect | M1 (done) + Phase 2 |
| **Private 1:1 Channels** | Channel creation, per-channel authz, TTL eviction, DM-specific flows | Phase 4 |
| **Storage / Persistence** | SQLite (or SQLCipher) for AS, DS, client state; migrations; backup/restore | M6 + Phase 6 |
| **Observability / Ops** | Structured logging, metrics, distributed tracing, healthcheck endpoints | Phase 6 |
| **Client Resilience** | Offline queue, retry with backoff, idempotent message IDs, gap detection | Phase 4 |
| **Compatibility / Protocols** | Wire versioning, N-1 client interoperability, ciphersuite negotiation | Phase 2 + Phase 5 |

---

## Security Plan (By Design)

quicprochat follows a security-by-design philosophy. The standards below are
non-negotiable -- see [Coding Standards](../contributing/coding-standards.md) for
how they are enforced in code.

### Governance

- `CODEOWNERS` file mapping each crate to a responsible reviewer.
- All PRs require at least one review from a crate owner.
- Security-sensitive changes (crypto, auth, wire format) require two reviewers.
- GPG-signed commits only.

### Transport Policy

- TLS 1.3 only (`rustls` restricted to the `TLS13` protocol version and its cipher suites; QUIC requires TLS 1.3 in any case).
- ALPN token `b"qpc"` required; reject connections with mismatched ALPN.
- Self-signed certificates acceptable for development; production deployments
  must use a CA-signed certificate or certificate pinning.
- Connection draining on shutdown (QUIC `CONNECTION_CLOSE`).
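
The transport rules above can be read as a single accept/reject decision per connection. The sketch below is illustrative only: in practice the TLS library enforces the version and ALPN at configuration time, and the `TlsVersion`, `TransportError`, and `check_transport` names are hypothetical, not part of the quicprochat codebase.

```rust
/// TLS versions a handshake might report (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TlsVersion {
    Tls12,
    Tls13,
}

/// Reasons a connection is rejected by the transport policy.
#[derive(Debug, PartialEq, Eq)]
pub enum TransportError {
    UnsupportedTlsVersion(TlsVersion),
    MissingAlpn,
    WrongAlpn(Vec<u8>),
}

/// The ALPN token named in the transport policy.
pub const ALPN_QPC: &[u8] = b"qpc";

/// Accept a connection only if it negotiated TLS 1.3 and the expected ALPN token.
pub fn check_transport(
    version: TlsVersion,
    negotiated_alpn: Option<&[u8]>,
) -> Result<(), TransportError> {
    if version != TlsVersion::Tls13 {
        return Err(TransportError::UnsupportedTlsVersion(version));
    }
    match negotiated_alpn {
        None => Err(TransportError::MissingAlpn),
        Some(alpn) if alpn == ALPN_QPC => Ok(()),
        Some(other) => Err(TransportError::WrongAlpn(other.to_vec())),
    }
}
```

Returning a typed error rather than a bare `bool` keeps the rejection reason available for audit logging.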

### MLS Policy

- Ciphersuite: `MLS_128_DHKEMX25519_AES128GCM_SHA256_Ed25519` (baseline).
- Single-use KeyPackages (consumed on fetch, per RFC 9420).
- KeyPackage TTL: 24 hours; clients must rotate before expiry.
- Ciphersuite allowlist: server rejects KeyPackages with unknown ciphersuites.
- No downgrade: once a group has used a ciphersuite, members cannot rejoin with
  a weaker one.
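
As a rough illustration of the allowlist and TTL rules, the sketch below checks KeyPackage metadata on upload or fetch. It assumes the server tracks the ciphersuite ID and upload time for each KeyPackage; `KeyPackageMeta` and `check_key_package` are hypothetical names. The value `0x0001` is the RFC 9420 identifier for the baseline ciphersuite.

```rust
use std::time::{Duration, SystemTime};

/// RFC 9420 identifier for MLS_128_DHKEMX25519_AES128GCM_SHA256_Ed25519.
pub const BASELINE_CIPHERSUITE: u16 = 0x0001;
/// Ciphersuites the server accepts (only the baseline in this sketch).
pub const ALLOWED_CIPHERSUITES: &[u16] = &[BASELINE_CIPHERSUITE];
/// KeyPackage lifetime from the policy: 24 hours.
pub const KEY_PACKAGE_TTL: Duration = Duration::from_secs(24 * 60 * 60);

/// Metadata the server is assumed to keep per KeyPackage (hypothetical).
pub struct KeyPackageMeta {
    pub ciphersuite: u16,
    pub uploaded_at: SystemTime,
}

#[derive(Debug, PartialEq, Eq)]
pub enum KeyPackageError {
    UnknownCiphersuite(u16),
    Expired,
}

/// Reject KeyPackages outside the allowlist or at/after their TTL.
pub fn check_key_package(
    meta: &KeyPackageMeta,
    now: SystemTime,
) -> Result<(), KeyPackageError> {
    if !ALLOWED_CIPHERSUITES.contains(&meta.ciphersuite) {
        return Err(KeyPackageError::UnknownCiphersuite(meta.ciphersuite));
    }
    // An upload time in the future (clock skew) is treated as age zero.
    let age = now
        .duration_since(meta.uploaded_at)
        .unwrap_or(Duration::ZERO);
    if age >= KEY_PACKAGE_TTL {
        Err(KeyPackageError::Expired)
    } else {
        Ok(())
    }
}
```

Passing `now` as a parameter keeps the check deterministic in tests.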

### Input Validation

- All incoming Protobuf messages validated against schema before processing.
- Maximum payload size: 5 MB per RPC call.
- Group ID, identity key, and channel ID fields validated for correct length
  (32 bytes, 32 bytes, 16 bytes respectively).
- UTF-8 validation on all string fields.
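
A minimal sketch of these checks, applied to an already-decoded request, might look like the following. The `RawRequest` shape and the `display_name` field are assumptions for illustration, and "5 MB" is interpreted here as 5 MiB; the real limit may be defined differently.

```rust
/// Assumption: "5 MB" is read as 5 MiB for this sketch.
pub const MAX_PAYLOAD_BYTES: usize = 5 * 1024 * 1024;

#[derive(Debug, PartialEq, Eq)]
pub enum ValidationError {
    PayloadTooLarge { len: usize },
    BadLength { field: &'static str, expected: usize, actual: usize },
    InvalidUtf8 { field: &'static str },
}

/// Hypothetical view of a decoded request; field names are illustrative.
pub struct RawRequest<'a> {
    pub group_id: &'a [u8],
    pub identity_key: &'a [u8],
    pub channel_id: &'a [u8],
    pub display_name: &'a [u8],
    pub payload: &'a [u8],
}

/// Check that a fixed-size field has exactly the expected length.
fn check_len(
    field: &'static str,
    bytes: &[u8],
    expected: usize,
) -> Result<(), ValidationError> {
    if bytes.len() == expected {
        Ok(())
    } else {
        Err(ValidationError::BadLength { field, expected, actual: bytes.len() })
    }
}

/// Apply the size, length, and UTF-8 rules in order; the first failure wins.
pub fn validate(req: &RawRequest<'_>) -> Result<(), ValidationError> {
    if req.payload.len() > MAX_PAYLOAD_BYTES {
        return Err(ValidationError::PayloadTooLarge { len: req.payload.len() });
    }
    check_len("group_id", req.group_id, 32)?;
    check_len("identity_key", req.identity_key, 32)?;
    check_len("channel_id", req.channel_id, 16)?;
    std::str::from_utf8(req.display_name)
        .map_err(|_| ValidationError::InvalidUtf8 { field: "display_name" })?;
    Ok(())
}
```

Checking the payload size first means oversized requests are rejected before any per-field work is done.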

### Secrets Management

- All private key material wrapped in `Zeroizing<T>` (via the `zeroize` crate).
- No secret material in log output at any level.
- No `unwrap()` on cryptographic operations -- all errors are typed and propagated.
- Constant-time comparison for authentication tokens and key fingerprints.
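
To illustrate what "constant-time comparison" means, the sketch below accumulates differences across the whole input instead of returning at the first mismatch. It is a teaching example only: an optimising compiler may still introduce data-dependent branches, so production code would more plausibly rely on an audited crate such as `subtle`. The `ct_eq` name is hypothetical.

```rust
/// Compare two byte slices without an early exit on the first mismatch.
///
/// The length check does return early, on the assumption that token and
/// fingerprint lengths are public.
pub fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        // OR together the XOR of every byte pair; any mismatch sets a bit.
        diff |= x ^ y;
    }
    diff == 0
}
```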

### Abuse / DoS Controls

- Rate limiting: 50 requests/second per IP, per account, and per device.
- Payload cap: 5 MB per message.
- Connection limit: configurable max concurrent QUIC connections.
- KeyPackage upload limit: configurable per account (prevents store exhaustion).
- Long-poll timeout cap: server-enforced maximum for `fetchWait`.
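
One common way to implement per-key rate limits is a token bucket. The sketch below is a minimal in-memory version under stated assumptions: the burst capacity equals the per-second rate, buckets are never evicted, and the `LimitKey` and `RateLimiter` types are hypothetical. It is not a statement of how the server does or will do it.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::Instant;

/// A single token bucket; capacity equals the per-second rate (assumption).
pub struct TokenBucket {
    capacity: f64,
    refill_per_sec: f64,
    tokens: f64,
    last: Instant,
}

impl TokenBucket {
    pub fn new(rate_per_sec: f64, now: Instant) -> Self {
        Self { capacity: rate_per_sec, refill_per_sec: rate_per_sec, tokens: rate_per_sec, last: now }
    }

    /// Refill according to elapsed time, then try to spend one token.
    pub fn try_acquire(&mut self, now: Instant) -> bool {
        let elapsed = now.saturating_duration_since(self.last).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        if now > self.last {
            self.last = now;
        }
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

/// The three dimensions named in the policy (hypothetical key type).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum LimitKey {
    Ip(IpAddr),
    Account(String),
    Device(String),
}

pub struct RateLimiter {
    rate_per_sec: f64,
    buckets: HashMap<LimitKey, TokenBucket>,
}

impl RateLimiter {
    pub fn new(rate_per_sec: f64) -> Self {
        Self { rate_per_sec, buckets: HashMap::new() }
    }

    /// A request passes only if every applicable bucket has a token.
    /// Every bucket is charged even when one fails, so a client throttled
    /// on one dimension still consumes budget on the others.
    pub fn allow(&mut self, keys: &[LimitKey], now: Instant) -> bool {
        let rate = self.rate_per_sec;
        let mut allowed = true;
        for key in keys {
            let bucket = self
                .buckets
                .entry(key.clone())
                .or_insert_with(|| TokenBucket::new(rate, now));
            allowed &= bucket.try_acquire(now);
        }
        allowed
    }
}
```

With `RateLimiter::new(50.0)`, a key can burst 50 requests and then sustains 50 requests per second, matching the limit stated above.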

### Data Protection

- MLS ciphertext is opaque to the server (DS never holds group keys).
- Message retention: 7 days default, configurable.
- KeyPackage retention: 24 hours (TTL eviction).
- At-rest encryption for persistent storage (SQLCipher at M6).
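
The retention rule can be sketched as an age check that is shared by a background sweep and a fetch-time filter. The `StoredMessage` type and the in-memory `Vec` queue are assumptions for illustration; a persistent store would express the same check as a query.

```rust
use std::time::{Duration, SystemTime};

/// Default message retention from the policy: 7 days.
pub const DEFAULT_MESSAGE_RETENTION: Duration = Duration::from_secs(7 * 24 * 60 * 60);

/// Hypothetical stored record; the server only ever sees opaque ciphertext.
pub struct StoredMessage {
    pub id: u64,
    pub stored_at: SystemTime,
    pub ciphertext: Vec<u8>,
}

/// A message is expired once its age reaches the retention period.
/// A timestamp in the future (clock skew) is treated as not expired.
pub fn is_expired(msg: &StoredMessage, now: SystemTime, retention: Duration) -> bool {
    match now.duration_since(msg.stored_at) {
        Ok(age) => age >= retention,
        Err(_) => false,
    }
}

/// Drop expired messages in place and return how many were removed.
pub fn sweep(queue: &mut Vec<StoredMessage>, now: SystemTime, retention: Duration) -> usize {
    let before = queue.len();
    queue.retain(|m| !is_expired(m, now, retention));
    before - queue.len()
}
```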

### Logging Safety

- Structured logging via `tracing`, with level filtering from the `env-filter` feature of `tracing-subscriber`.
- Sensitive fields (keys, tokens, ciphertext) are never logged, even at `TRACE`.
- Audit-level events: auth success/failure, token issuance, KeyPackage upload,
  enqueue/fetch, rate-limit hits.
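
One way to make "never logged" hard to violate by accident is a wrapper type whose `Debug` and `Display` output is always a placeholder. The sketch below uses only the standard library; `Redacted` and `AuthEvent` are hypothetical names, not existing quicprochat types.

```rust
use std::fmt;

/// Wrapper for sensitive values; formatting never reveals the contents.
pub struct Redacted<T>(T);

impl<T> Redacted<T> {
    pub fn new(value: T) -> Self {
        Redacted(value)
    }

    /// Explicit, greppable access to the inner value.
    pub fn expose(&self) -> &T {
        &self.0
    }
}

impl<T> fmt::Debug for Redacted<T> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str("[REDACTED]")
    }
}

impl<T> fmt::Display for Redacted<T> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str("[REDACTED]")
    }
}

/// Example audit event: deriving `Debug` is safe because the token is wrapped.
#[derive(Debug)]
pub struct AuthEvent {
    pub account_id: String,
    pub token: Redacted<String>,
    pub success: bool,
}
```

Because the derived `Debug` delegates to the wrapper, an event logged with `{:?}` prints the placeholder in place of the token.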

### Testing

- Unit tests for all crypto operations (see [Testing Strategy](../contributing/testing.md)).
- Integration tests for every RPC method.
- Negative tests: malformed input, expired tokens, wrong identity, replay attempts.
- N-1 compatibility tests (old client against new server).
- Fuzzing targets for Protobuf parsers and MLS message handling (Phase 5).

---

## Work Breakdown (6 Phases)

### Phase 1 -- Baselines and Governance

**Goal:** Establish project hygiene before adding features.

| Task | Description |
|------|-------------|
| CODEOWNERS | Map crates to responsible reviewers |
| CI pipeline | GitHub Actions: `cargo test --workspace`, `cargo clippy`, `cargo fmt --check`, `cargo deny check` |
| SBOM generation | `cargo-cyclonedx` or `cargo-about` in CI; publish with each release |
| Threat model | Document assets, adversaries, attack surface, trust boundaries; reference in [Threat Model](../cryptography/threat-model.md) |
| Dependency audit | `cargo audit` in CI; pin all major versions per [Coding Standards](../contributing/coding-standards.md) |

### Phase 2 -- Protocols and Core Hardening

**Goal:** Lock down the wire format and cryptographic policy.

| Task | Description |
|------|-------------|
| Wire versioning | Version field in all Protobuf frames; reject unknown versions |
| Ciphersuite allowlist | Server rejects KeyPackages outside the allowed set |
| Downgrade guards | Prevent epoch rollback; reject Commits with weaker ciphersuites |
| ALPN enforcement | Reject connections without `b"qpc"` ALPN token |
| Connection draining | Graceful QUIC `CONNECTION_CLOSE` on server shutdown |
| KeyPackage rotation | Client-side timer to upload fresh KeyPackages before TTL expiry |
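
Two of the guards in this phase reduce to small comparisons. The sketch below assumes a numeric wire version carried in each frame and a group epoch visible to whoever performs the check; the supported version list, `PolicyError`, and both function names are hypothetical.

```rust
/// Wire versions this build understands (assumption: only version 1 exists).
pub const SUPPORTED_WIRE_VERSIONS: &[u32] = &[1];

#[derive(Debug, PartialEq, Eq)]
pub enum PolicyError {
    UnknownWireVersion(u32),
    EpochRollback { current: u64, proposed: u64 },
}

/// Reject frames whose version field is not explicitly supported.
pub fn check_wire_version(version: u32) -> Result<(), PolicyError> {
    if SUPPORTED_WIRE_VERSIONS.contains(&version) {
        Ok(())
    } else {
        Err(PolicyError::UnknownWireVersion(version))
    }
}

/// Reject any attempt to move a group to the same or an earlier epoch.
pub fn check_epoch_advance(current: u64, proposed: u64) -> Result<(), PolicyError> {
    if proposed > current {
        Ok(())
    } else {
        Err(PolicyError::EpochRollback { current, proposed })
    }
}
```

An explicit allowlist of versions, rather than a "less than or equal to latest" comparison, makes it straightforward to retire an old version later.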

### Phase 3 -- Auth, Device, and Server Hardening

**Goal:** Add account/device identity and token-based authentication.

See [Auth, Devices, and Tokens](authz-plan.md) for the full design.

| Task | Description |
|------|-------------|
| Account + device model | `{account_id, device_id, device_pubkey}` with status lifecycle |
| Token issuance | Access + refresh tokens; configurable expiry |
| RPC auth middleware | Validate token on every RPC; map to account/device |
| Identity binding | Bind MLS identity key to account; reject mismatched uploads |
| Rate limiting | Per-IP, per-account, per-device counters |
| Audit logging | Auth events, token lifecycle, rate limit hits |
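
The middleware task amounts to turning a presented token into an account/device context or a typed error. The following is a minimal in-memory sketch under loud assumptions: tokens are opaque strings stored directly in a map (a real store would more likely keep a hash of the token), and `TokenStore`, `TokenRecord`, and `AuthContext` are hypothetical names rather than the design in the auth plan.

```rust
use std::collections::HashMap;
use std::time::SystemTime;

/// Identity attached to a request once its token has been validated.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AuthContext {
    pub account_id: String,
    pub device_id: String,
}

/// What the server is assumed to remember about an issued access token.
pub struct TokenRecord {
    pub ctx: AuthContext,
    pub expires_at: SystemTime,
    pub revoked: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum AuthError {
    MissingToken,
    UnknownToken,
    Expired,
    Revoked,
}

#[derive(Default)]
pub struct TokenStore {
    tokens: HashMap<String, TokenRecord>,
}

impl TokenStore {
    pub fn insert(&mut self, token: &str, record: TokenRecord) {
        self.tokens.insert(token.to_string(), record);
    }

    /// Validate the token presented on an RPC and map it to account/device.
    pub fn authenticate(
        &self,
        token: Option<&str>,
        now: SystemTime,
    ) -> Result<AuthContext, AuthError> {
        let token = token.ok_or(AuthError::MissingToken)?;
        let record = self.tokens.get(token).ok_or(AuthError::UnknownToken)?;
        if record.revoked {
            return Err(AuthError::Revoked);
        }
        if now >= record.expires_at {
            return Err(AuthError::Expired);
        }
        Ok(record.ctx.clone())
    }
}
```

Each `AuthError` variant corresponds to an event that the audit logging task in this phase would record.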

### Phase 4 -- Delivery Semantics and Client Resilience

**Goal:** Reliable message delivery and 1:1 channels.

See [1:1 Channel Design](dm-channels.md) for the DM-specific design.

| Task | Description |
|------|-------------|
| Idempotent message IDs | Client-generated UUIDs; server deduplicates |
| Ordering guarantees | Per-channel sequence numbers; client detects gaps |
| Offline queue | Server retains messages for offline recipients (up to TTL) |
| 1:1 channels | Channel creation, membership, per-channel authz |
| TTL eviction | Background sweep + fetch-time check for expired messages |
| Client retry | Exponential backoff with jitter on transient failures |
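
Three of the tasks above are small enough to sketch together: server-side deduplication by message ID, client-side gap detection on per-channel sequence numbers, and a retry delay with full jitter. All type and function names are hypothetical, IDs are assumed to be 16-byte UUIDs, and the random value is supplied by the caller so the sketch stays free of external crates.

```rust
use std::collections::HashSet;
use std::time::Duration;

/// Server side: remember accepted message IDs so client retries are no-ops.
#[derive(Default)]
pub struct Deduplicator {
    seen: HashSet<[u8; 16]>,
}

impl Deduplicator {
    /// Returns true if the ID is new and the message should be enqueued.
    pub fn accept(&mut self, id: [u8; 16]) -> bool {
        self.seen.insert(id)
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum SeqOutcome {
    InOrder,
    /// Sequence number below the next expected one: a duplicate or a late arrival.
    OldOrDuplicate,
    /// Inclusive range of sequence numbers that were skipped.
    Gap { missing_from: u64, missing_to: u64 },
}

/// Client side: track the next expected sequence number for one channel.
pub struct GapDetector {
    next_expected: u64,
}

impl GapDetector {
    pub fn new(first: u64) -> Self {
        Self { next_expected: first }
    }

    pub fn observe(&mut self, seq: u64) -> SeqOutcome {
        if seq < self.next_expected {
            SeqOutcome::OldOrDuplicate
        } else if seq == self.next_expected {
            self.next_expected += 1;
            SeqOutcome::InOrder
        } else {
            let gap = SeqOutcome::Gap { missing_from: self.next_expected, missing_to: seq - 1 };
            self.next_expected = seq + 1;
            gap
        }
    }
}

/// Full-jitter backoff: a delay in [0, min(cap, base * 2^attempt)].
/// `unit` is a caller-supplied random value in [0, 1].
pub fn backoff_delay(attempt: u32, base: Duration, cap: Duration, unit: f64) -> Duration {
    let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
    let ceiling = base.saturating_mul(factor).min(cap);
    // Guard against NaN or out-of-range input before scaling.
    let unit = if unit.is_finite() { unit.clamp(0.0, 1.0) } else { 0.0 };
    ceiling.mul_f64(unit)
}
```

After a gap is reported, the detector moves past it, so a late arrival of a missing message shows up as `OldOrDuplicate`; a client that needs to fill gaps would track the missing range separately.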

### Phase 5 -- E2E Harness and Security Tests

**Goal:** Automated end-to-end testing and security validation.

| Task | Description |
|------|-------------|
| docker-compose testnet | Multi-node test environment with configurable topology |
| Positive E2E tests | Full group lifecycle: register, create, invite, join, send, receive, leave |
| Negative E2E tests | Expired tokens, wrong identity, replay, malformed messages |
| Compat matrix | N-1 client/server version testing |
| Fuzz targets | `cargo-fuzz` targets for Protobuf parsers, MLS message handlers |
| Golden-wire fixtures | Serialised test vectors for regression testing across versions |

### Phase 6 -- Reliability, Performance, and Operations

**Goal:** Production-grade operations and performance validation.

| Task | Description |
|------|-------------|
| SQLite/SQLCipher persistence | AS key store, DS message log, client state (M6) |
| Soak testing | 72-hour continuous operation under synthetic load |
| Load testing | Throughput and latency benchmarks (Criterion + custom harness) |
| Chaos testing | Network partitions, process crashes, disk full scenarios |
| Backup / restore | SQLite backup with integrity verification |
| Canary / rollback | Rolling deployment strategy with automatic rollback on failure |
| Metrics + dashboards | Prometheus metrics, Grafana dashboards (see [Future Research](future-research.md)) |

---

## Planning Checklist

Use this checklist when planning a new milestone or phase. Each item should have
a documented decision before implementation begins.

- [ ] **Release criteria / SLOs** -- Define what "done" means: latency targets,
      error-rate thresholds, test-coverage minimums.
- [ ] **Threat model review** -- Update the [Threat Model](../cryptography/threat-model.md)
      for any new attack surface introduced by this phase.
- [ ] **Protocol policy** -- Ciphersuite allowlist, wire version, downgrade rules.
- [ ] **Identity / auth model** -- Who authenticates, how, and what operations
      are gated.
- [ ] **Data model** -- Schema changes, migrations, backward compatibility.
- [ ] **Abuse controls** -- Rate limits, size caps, connection limits for this phase.
- [ ] **Observability contracts** -- What new metrics, logs, and traces are needed.
- [ ] **Environments / secrets** -- Dev, staging, production configuration;
      secret rotation plan.
- [ ] **Testing matrix** -- Unit, integration, E2E, negative, fuzz, compat tests
      for this phase.
- [ ] **Rollout / ops** -- Deployment strategy, rollback plan, monitoring during
      rollout.

---

## Cross-references

- [Milestones](milestones.md) -- feature milestone tracker
- [Auth, Devices, and Tokens](authz-plan.md) -- Phase 3 design
- [1:1 Channel Design](dm-channels.md) -- Phase 4 design
- [Future Research](future-research.md) -- technology options for Phase 6+
- [Coding Standards](../contributing/coding-standards.md) -- engineering standards
- [Testing Strategy](../contributing/testing.md) -- test structure and conventions
- [Threat Model](../cryptography/threat-model.md) -- security analysis