diff --git a/SPRINTS.md b/SPRINTS.md
new file mode 100644
index 0000000..ce36639
--- /dev/null
+++ b/SPRINTS.md
@@ -0,0 +1,229 @@
+# quicprochat — Sprint Plan
+
+> 7 sprints synthesized from code audit, architecture analysis, and ecosystem research.
+> Each sprint is ~1 week. Sprints are ordered by priority and dependency.
+
+---
+
+## Sprint 1 — Bug Fixes & Code Quality (Quick Wins)
+
+Fix all known bugs, clippy warnings, and dead code before building on top.
+
+- [x] **1.1 Fix boolean logic bug in TUI**
+  - `crates/quicprochat-client/src/client/v2_tui.rs:832` — remove `|| true`
+  - Cursor positioning always executes regardless of input state
+
+- [x] **1.2 Fix unwrap violations in P2P router**
+  - `crates/quicprochat-p2p/src/routing.rs:416,419` — `.lock().unwrap()` on Mutex
+  - Replace with `.expect("lock poisoned")` or proper error handling
+
+- [x] **1.3 Remove placeholder assertion in WebTransport**
+  - `crates/quicprochat-server/src/webtransport.rs:418` — `assert!(true);`
+
+- [x] **1.4 Wire up unused metrics**
+  - `record_storage_latency()` — instrument storage layer calls
+  - `record_uptime_seconds()` — add periodic heartbeat task in server main loop
+
+- [x] **1.5 Wire up or remove unused config fields**
+  - `EffectiveConfig::webtransport_listen` — connect to WebTransport listener
+  - `EffectiveConfig::rpc_timeout_secs` — apply as per-RPC deadline
+  - `EffectiveConfig::storage_timeout_secs` — apply as DB query timeout
+
+- [x] **1.6 Fix remaining clippy warnings**
+  - Reduce function arity (2 functions with 8-9 args → use config/param structs)
+  - Remove useless `format!()` call
+  - Collapse nested conditionals
+  - Rename `from_str` method to avoid `FromStr` trait confusion
+
+---
+
+## Sprint 2 — OpenMLS 0.5 → 0.8 Migration
+
+**CRITICAL**: OpenMLS 0.7.2 includes security patches. Staying on 0.5 is a risk.
+
+- [x] **2.1 Migrate StorageProvider trait**
+  - Old `OpenMlsKeyStore` → new `StorageProvider` (most invasive change)
+  - Rework `DiskKeyStore` integration (must keep bincode serialization)
+  - Update all `group.rs` calls that interact with the key store
+
+- [x] **2.2 Update MLS API calls**
+  - `self_update()` / `propose_self_update()` — add `LeafNodeParameters` arg
+  - `join_by_external_commit()` — add optional LeafNode params
+  - `Sender::NewMember` → split into `NewMemberProposal` / `NewMemberCommit`
+
+- [x] **2.3 Handle GREASE support**
+  - New variants in `ProposalType`, `ExtensionType`, `CredentialType`
+  - Update match arms to handle unknown/GREASE values
+
+- [x] **2.4 Update AAD handling**
+  - AAD no longer persisted — set before every API call generating `MlsMessageOut`
+
+- [x] **2.5 Verify FIPS 203 alignment**
+  - Confirm ML-KEM-768 parameters match final FIPS 203 (not draft)
+  - Review hybrid KEM against RFC 9794 combination methods
+
+- [x] **2.6 Full test suite pass**
+  - All 301 tests must pass with OpenMLS 0.8
+  - Run crypto benchmarks to check for performance regressions
+
+---
+
+## Sprint 3 — Client Resilience
+
+Currently, network glitches cause the client to hang. This blocks v2 launch.
+
+- [x] **3.1 Auto-reconnect with backoff**
+  - Integrate existing `retry.rs` into `RpcClient::call()` path
+  - Exponential backoff with jitter (already implemented, not wired)
+  - Configurable max retries and backoff ceiling
+
+- [x] **3.2 Push subscription recovery**
+  - Detect broken push stream and re-subscribe automatically
+  - Buffer missed events during reconnection window
+
+- [x] **3.3 Heartbeat / keepalive**
+  - Periodic QUIC ping in TUI and REPL modes
+  - Detect dead connections before user notices
+
+- [x] **3.4 SDK disconnect lifecycle**
+  - Add `QpcClient::disconnect()` for clean shutdown
+  - Proper state machine: Connected → Reconnecting → Disconnected
+
+- [x] **3.5 Connection status UI**
+  - TUI: show connection state in status bar (Connected / Reconnecting / Offline)
+  - REPL: print status change notifications
+
+---
+
+## Sprint 4 — Server Hardening
+
+Fix graceful shutdown and wire up timeouts for production readiness.
+
+- [x] **4.1 In-flight RPC tracking**
+  - Replace fixed 30s shutdown delay with actual in-flight RPC counter
+  - Drain when counter reaches zero (with configurable max wait)
+
+- [x] **4.2 Apply request-level timeouts**
+  - Wire `rpc_timeout_secs` config into per-RPC deadline enforcement
+  - Wire `storage_timeout_secs` into DB query timeouts
+  - Cancel long-running operations cleanly
+
+- [x] **4.3 Plugin shutdown hooks**
+  - Add `on_shutdown` hook to `HookVTable`
+  - Call plugin shutdown before server exits
+
+- [x] **4.4 Federation drain during shutdown**
+  - Stop accepting federation relay requests on SIGTERM
+  - Wait for in-flight federation RPCs before exit
+
+- [x] **4.5 Connection draining improvements**
+  - Send QUIC CONNECTION_CLOSE with application reason
+  - WebTransport: send close frame before dropping sessions
+
+---
+
+## Sprint 5 — Test Coverage & CI Hardening
+
+Address the major test coverage gaps identified in the audit.
+
+- [x] **5.1 RPC framing unit tests**
+  - `crates/quicprochat-rpc/src/framing.rs` — encode/decode edge cases
+  - Malformed frames, truncated input, max-size payloads
+  - Fuzzing harness for frame parser
+
+- [x] **5.2 SDK state machine tests**
+  - `crates/quicprochat-sdk/src/conversation.rs` — conversation lifecycle
+  - `crates/quicprochat-sdk/src/groups.rs` — group join/leave/update
+  - `crates/quicprochat-sdk/src/messaging.rs` — send/receive/queue
+
+- [x] **5.3 Server domain service tests**
+  - `crates/quicprochat-server/src/domain/` — all service modules
+  - Test business logic without DB (mock storage trait)
+
+- [x] **5.4 Integration tests**
+  - Reconnection scenario (kill server, restart, verify client recovers)
+  - Graceful shutdown (send SIGTERM during active RPCs, verify drain)
+  - Multi-node federation relay (if federation wired in Sprint 6)
+
+- [x] **5.5 CI hardening**
+  - Add MSRV check (Rust 1.75 or declared minimum)
+  - Add cross-platform CI (macOS, Windows — at least build check)
+  - Add cargo-fuzz for crypto and parsing code
+  - Add MIRI for unsafe code in plugin-api/FFI
+
+---
+
+## Sprint 6 — Federation & P2P Integration
+
+Wire up the scaffolded federation and P2P code into working features.
+
+- [x] **6.1 Federation message routing**
+  - Wire `federation::routing::resolve_destination()` into `handle_enqueue`
+  - Route messages to remote home servers via `FederationClient::relay_enqueue()`
+  - Resolve protocol mismatch (Cap'n Proto federation vs Protobuf main RPC)
+
+- [x] **6.2 Federation identity resolution**
+  - Cross-server user lookup (`user@remote-server`)
+  - KeyPackage fetching across federated nodes
+
+- [x] **6.3 P2P client integration**
+  - Wire iroh P2P into client as transport option
+  - Fallback logic: prefer P2P direct → fall back to server relay
+  - mDNS discovery in client (already scaffolded, needs activation)
+
+- [x] **6.4 Multipath QUIC evaluation**
+  - Research draft-ietf-quic-multipath (likely RFC in 2026)
+  - Prototype: use multiple paths for mesh relay resilience
+  - Decision: adopt or defer based on quinn support
+
+- [x] **6.5 Federation integration tests**
+  - Two-server test: register on A, send to user on B, verify delivery
+  - mTLS mutual auth verification
+  - Partition tolerance (one node goes down, messages queue)
+
+---
+
+## Sprint 7 — Documentation, Polish & Future Prep
+
+Final polish and forward-looking improvements.
+
+- [x] **7.1 Crate-level documentation**
+  - Add module-level docs to `quicprochat-plugin-api`, `quicprochat-rpc`, `quicprochat-sdk`
+  - Doc comments for all public APIs in domain services
+
+- [x] **7.2 Refactor high-arity functions** (none found — already clean)
+  - Consolidate 8-9 parameter functions into config/param structs
+  - Improve builder patterns where appropriate
+
+- [ ] **7.3 Review RFC 9750 (MLS Architecture)** (deferred — requires manual review)
+  - Verify quicprochat's AS/DS split aligns with RFC 9750 recommendations
+  - Document any deviations and rationale
+
+- [ ] **7.4 Desktop client evaluation** (deferred — requires Tauri prototype)
+  - Prototype Tauri v2 desktop shell wrapping the TUI or a web UI
+  - Evaluate effort to ship cross-platform desktop client
+
+- [x] **7.5 Security pre-audit prep**
+  - Document all crypto boundaries and trust assumptions
+  - Create threat model document
+  - Prepare scope document for external auditors (Roadmap item 4.1)
+  - Budget: NCC Group / Trail of Bits / Cure53 ($50K–$150K, 4-6 weeks)
+
+- [ ] **7.6 Repository rename** (requires GitHub admin action)
+  - Rename GitHub repository from `quicproquo` → `quicprochat`
+  - Update all GitHub URLs, CI badge links, go.mod import paths
+  - Set up redirect from old repo name
+
+---
+
+## Sprint Summary
+
+| Sprint | Focus | Risk | Key Deliverable |
+|--------|-------|------|----------------|
+| **1** | Bug fixes & code quality | Low | Zero clippy warnings, metrics wired |
+| **2** | OpenMLS 0.5 → 0.8 | High | Security patches applied, FIPS 203 verified |
+| **3** | Client resilience | Medium | Auto-reconnect, heartbeat, status UI |
+| **4** | Server hardening | Medium | Real graceful shutdown, timeouts enforced |
+| **5** | Test coverage & CI | Low | Unit tests for SDK/RPC/domain, fuzzing |
+| **6** | Federation & P2P | High | Working cross-server messaging, P2P fallback |
+| **7** | Docs, polish & audit prep | Low | Audit-ready, desktop prototype |