diff --git a/docs/plans/mesh-protocol-gaps.md b/docs/plans/mesh-protocol-gaps.md
new file mode 100644
index 0000000..711471c
--- /dev/null
+++ b/docs/plans/mesh-protocol-gaps.md
@@ -0,0 +1,323 @@
+# Mesh Protocol Gaps: Honest Assessment & Action Plan
+
+> **Goal:** Identify real weaknesses in QuicProChat's mesh protocol compared to
+> Reticulum, Meshtastic, and LXMF. Plan concrete improvements.
+>
+> Created: 2026-03-30
+
+---
+
+## Executive Summary
+
+QuicProChat has strong cryptography (MLS, PQ-KEM) but **real gaps** in the mesh layer:
+
+| Gap | Severity | Status |
+|-----|----------|--------|
+| MLS overhead too large for LoRa | **Critical** | Needs design work |
+| No lightweight messaging mode | **High** | Not started |
+| KeyPackage distribution over mesh | **High** | Not solved |
+| Announce/routing not battle-tested | **Medium** | S3 done, needs real-world test |
+| No DTN bundle protocol integration | **Medium** | Not started |
+| Battery/duty-cycle optimization | **Medium** | Basic tracker exists |
+
+---
+
+## Gap 1: MLS Overhead is Prohibitive for Constrained Links
+
+### The Problem
+
+**MLS was designed for Internet messaging, not LoRa.**
+
+Estimated sizes (approximate, not yet benchmarked):
+
+| Component | Size (bytes) | LoRa SF12/BW125 airtime |
+|-----------|--------------|------------------------|
+| MLS KeyPackage | ~500-800 | 80-130 seconds |
+| MLS Welcome | ~1000-2000 | 160-320 seconds |
+| MLS Commit | ~200-500 | 32-80 seconds |
+| MLS ApplicationMessage | ~100-200 | 16-32 seconds |
+| **MeshEnvelope overhead** | ~170 (CBOR) | 27 seconds |
+| **Reticulum LXMF message** | ~100-150 | 16-24 seconds |
+| **Meshtastic payload** | ~237 max | 38 seconds |
+
+**The math doesn't work:**
+
+- LoRa SF12/BW125: ~51 byte MTU, ~300 bps effective
+- EU868 duty cycle: 1% = 36 seconds TX per hour
+- **One MLS KeyPackage = 10-20 fragments = entire hour's duty budget**
+
+### Current State
+
+- MeshEnvelope uses CBOR, ~170 bytes overhead for a short message
+- MLS operations happen at application layer, not optimized for mesh
+- No fallback to lighter crypto for constrained links
+
+### Proposed Solutions
+
+#### Option A: Hybrid Crypto Modes (Recommended)
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│  Mode Selection Based on Transport Capability                   │
+├─────────────────────────────────────────────────────────────────┤
+│                                                                 │
+│  QUIC/TCP/WiFi (>10 kbps):                                      │
+│    → Full MLS groups with PQ-KEM                                │
+│    → KeyPackage distribution via server                         │
+│    → Standard protocol                                          │
+│                                                                 │
+│  LoRa/Serial (<1 kbps):                                         │
+│    → "MLS-Lite" mode:                                           │
+│      • Pre-shared group epoch key (exchanged out-of-band)       │
+│      • ChaCha20-Poly1305 symmetric encryption                   │
+│      • Ed25519 signatures (64 bytes)                            │
+│      • No per-message KeyPackage exchange                       │
+│      • Manual key rotation via QR code or faster link           │
+│                                                                 │
+│  Upgrade path:                                                  │
+│    When faster transport available → full MLS epoch sync        │
+│                                                                 │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+**Trade-off:** Lose automatic PCS on constrained links. Gain usability.
+
+#### Option B: Compressed MLS (Research)
+
+- Strip unused extensions from KeyPackages
+- Use shorter credential identifiers (16 bytes instead of 32)
+- Batch multiple KeyPackages into single transfer over fast link
+- Cache and reuse KeyPackages more aggressively
+
+**Trade-off:** Still large. May not be enough for SF12 LoRa.
+
+#### Option C: LXMF-Compatible Mode
+
+Implement Reticulum's LXMF format as an alternative wire format:
+
+```rust
+pub struct LxmfMessage {
+    destination: [u8; 16], // Truncated hash
+    source: [u8; 16],
+    signature: [u8; 64],   // Ed25519
+    payload: Vec<u8>,      // msgpack: {timestamp, content, title, fields}
+}
+// Total: ~100-150 bytes for short message
+```
+
+**Trade-off:** Lose MLS group properties. Gain Reticulum interop and efficiency.
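The fragment counts and duty-cycle budget quoted under "The Problem" can be cross-checked with a few lines of arithmetic. The sketch below is a rough aid, not a measurement: it uses the time-on-air formula published in the Semtech SX127x datasheets, and every radio setting in it (explicit header, CRC on, coding rate 4/5, 8-symbol preamble, low-data-rate optimization at SF11/SF12) is an assumption rather than something this plan specifies. The function names are hypothetical.

```rust
/// Number of radio frames needed to carry `payload_len` bytes when each
/// frame holds at most `mtu` bytes (ceiling division).
fn fragments(payload_len: usize, mtu: usize) -> usize {
    assert!(mtu > 0, "mtu must be non-zero");
    (payload_len + mtu - 1) / mtu
}

/// LoRa time-on-air in seconds for a single frame, following the formula in
/// the Semtech SX127x datasheets.
/// Assumptions (not from the plan): explicit header, CRC enabled, and
/// low-data-rate optimization switched on for SF11/SF12 at <= 125 kHz.
/// `coding_rate` is 1..=4, meaning 4/5..=4/8.
fn time_on_air_s(payload_len: u32, sf: u32, bw_hz: f64, coding_rate: u32, preamble_syms: u32) -> f64 {
    let t_sym = f64::from(1u32 << sf) / bw_hz; // duration of one symbol
    let de = if sf >= 11 && bw_hz <= 125_000.0 { 1.0 } else { 0.0 };
    // 8*PL - 4*SF + 28 + 16*CRC - 20*IH, with CRC = 1 and IH = 0
    let numerator = 8.0 * f64::from(payload_len) - 4.0 * f64::from(sf) + 28.0 + 16.0;
    let denominator = 4.0 * (f64::from(sf) - 2.0 * de);
    let coded = (numerator / denominator).ceil() * f64::from(coding_rate + 4);
    let payload_syms = 8.0 + coded.max(0.0);
    (f64::from(preamble_syms) + 4.25 + payload_syms) * t_sym
}

/// Hours of duty-cycle budget consumed by `airtime_s` seconds of transmission
/// (for example `duty_cycle = 0.01` for the EU868 1% limit).
fn duty_budget_hours(airtime_s: f64, duty_cycle: f64) -> f64 {
    airtime_s / (3600.0 * duty_cycle)
}
```

Under these assumed settings, `time_on_air_s(51, 12, 125_000.0, 1, 8)` evaluates to about 2.47 seconds for one full 51-byte frame, and `fragments(500, 51)` gives 10 frames for the smallest KeyPackage estimate. Per-fragment headers and acknowledgements are not modelled, so real hardware numbers should replace this once the benchmark action item is done.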
+
+### Action Items
+
+- [ ] **Measure actual MLS sizes** in current implementation (benchmark)
+- [ ] **Design MLS-Lite spec** for constrained links
+- [ ] **Implement transport capability negotiation** in TransportManager
+- [ ] **Add `--constrained` mode** to MeshEnvelope for minimal overhead
+
+---
+
+## Gap 2: KeyPackage Distribution Over Mesh
+
+### The Problem
+
+MLS requires pre-positioned KeyPackages for adding members to groups. On Internet:
+server stores KeyPackages, clients fetch on demand. On mesh: **no server**.
+
+Current flow (broken for pure mesh):
+```
+Alice wants to add Bob to group:
+1. Alice fetches Bob's KeyPackage from server ← requires Internet
+2. Alice creates Welcome + Commit
+3. Alice sends to Bob via mesh
+```
+
+### Proposed Solution: Announce-Based KeyPackage Distribution
+
+```
+Bob announces on mesh:
+1. MeshAnnounce includes: identity_key, capabilities, AND current_keypackage_hash
+2. Nearby nodes cache Bob's latest KeyPackage (if they have it)
+3. Alice receives Bob's announce, requests KeyPackage via mesh RPC
+
+KeyPackage propagation:
+1. Bob periodically broadcasts KeyPackage update (larger message, less frequent)
+2. Nodes with capacity (CAP_STORE) cache KeyPackages for relaying
+3. TTL-based expiry (KeyPackages are single-use, but we can cache N of them)
+```
+
+### Action Items
+
+- [ ] **Extend MeshAnnounce** with optional `keypackage_hash` field
+- [ ] **Add KeyPackage request/response** to mesh protocol
+- [ ] **Implement KeyPackage cache** in MeshStore (separate from message queue)
+- [ ] **Design KeyPackage refresh protocol** for mesh-only scenarios
+
+---
+
+## Gap 3: No DTN/Bundle Protocol Integration
+
+### The Problem
+
+NASA/IETF Bundle Protocol (RFC 9171) is the standard for delay-tolerant networking.
+Reticulum effectively reinvented it. QuicProChat should learn from both.
+
+Key DTN concepts we're missing:
+
+| Concept | DTN/BPv7 | Reticulum | QuicProChat |
+|---------|----------|-----------|-------------|
+| **Custody transfer** | Yes | No | No |
+| **Fragmentation at bundle layer** | Yes | No | Yes (LoRa transport) |
+| **Convergence layer adapters** | Formal spec | Interfaces | MeshTransport trait |
+| **Routing protocols** | CGR, Epidemic | Announce-based | Announce-based |
+| **Priority scheduling** | Yes | No | No |
+
+### Proposed Improvements
+
+1. **Priority levels in MeshEnvelope** (emergency > data > announce)
+2. **Custody transfer option**: intermediate node takes responsibility
+3. **Better congestion control**: backpressure signals in announce
+
+### Action Items
+
+- [ ] **Add priority field** to MeshEnvelope
+- [ ] **Research custody transfer**: is it worth the complexity?
+- [ ] **Implement priority queue** in MeshStore and DutyCycleTracker
+
+---
+
+## Gap 4: Battery/Duty-Cycle Optimization
+
+### The Problem
+
+Briar drains roughly 4x more battery due to constant BT scanning. We claim to be
+better but haven't proven it.
+
+Current state:
+- DutyCycleTracker enforces EU868 1% limit
+- Announce interval is configurable (default 10 min)
+- No adaptive power management
+
+### Proposed Improvements
+
+1. **Adaptive announce interval**: more frequent when active, less frequent when idle
+2. **Listen-before-talk**: don't TX if channel is busy (LoRa CAD)
+3. **Scheduled wake windows**: coordinate with peers for efficient sync
+4. **Power profiles**: "always-on", "hourly-sync", "manual-only"
+
+### Action Items
+
+- [ ] **Implement CAD (Channel Activity Detection)** in LoRaTransport
+- [ ] **Add power profile config** to P2pNode
+- [ ] **Measure actual power consumption** with real hardware
+
+---
+
+## Gap 5: Real-World Testing
+
+### The Problem
+
+All our mesh code runs against mocks. We claim LoRa support but haven't tested
+with real radios.
+
+### Testing Plan
+
+| Test | Hardware | Status |
+|------|----------|--------|
+| LoRa point-to-point | 2x SX1262 dev boards | Not started |
+| LoRa multi-hop | 3x SX1262, different rooms | Not started |
+| Mixed transport | LoRa + WiFi relay | Not started |
+| Outdoor range test | LoRa, line-of-sight 1km | Not started |
+| Duty cycle compliance | SDR spectrum analyzer | Not started |
+
+### Action Items
+
+- [ ] **Procure hardware**: 3x Heltec LoRa32 or similar
+- [ ] **Implement UART LoRaTransport** for real modems
+- [ ] **Create test harness** for automated multi-node testing
+- [ ] **Document actual performance** numbers
+
+---
+
+## Gap 6: Comparison Claims Need Verification
+
+### The Problem
+
+Our positioning doc claims superiority over Meshtastic/Reticulum/Briar, but:
+
+- We haven't measured our actual overhead vs. theirs
+- We haven't tested interop scenarios
+- We haven't run security analysis against their threat models
+
+### Verification Plan
+
+| Claim | How to Verify |
+|-------|---------------|
+| "MLS is better than shared-key AES" | Threat model comparison doc |
+| "Multi-hop works" | Integration test with 5+ nodes |
+| "LoRa-ready" | Actual LoRa hardware test |
+| "Post-quantum protects groups" | Verify hybrid KEM in MLS path |
+| "Relay nodes can't read content" | Formal verification of E2E path |
+
+### Action Items
+
+- [ ] **Create benchmark suite** comparing message sizes
+- [ ] **Write threat model comparison** doc (Meshtastic CVEs, Reticulum link-level)
+- [ ] **Fuzz test** mesh envelope parsing
+- [ ] **Get external review** of mesh crypto design
+
+---
+
+## Implementation Priority
+
+### Phase 1: Make It Work (Next 2 Sprints)
+
+1. **S4: Multi-hop routing**: complete the core mesh functionality
+2. **S5: Truncated addresses**: reduce envelope overhead
+3. **Measure actual sizes**: know the real numbers
+
+### Phase 2: Make It Efficient (Following 2 Sprints)
+
+4. **Design MLS-Lite**: spec for constrained links
+5. **Priority queue**: emergency messages first
+6. **Hardware testing**: real LoRa validation
+
+### Phase 3: Make It Production-Ready
+
+7. **KeyPackage distribution**: mesh-native key exchange
+8. **Power profiles**: battery optimization
+9. **External review**: security audit of mesh layer
+
+---
+
+## Success Metrics
+
+| Metric | Current | Target |
+|--------|---------|--------|
+| MeshEnvelope overhead (short msg) | ~170 bytes | <100 bytes |
+| Time to send "hello" over SF12 LoRa | ~27 sec | <15 sec |
+| KeyPackage exchange over mesh | Not possible | Works |
+| Multi-hop message delivery | Mock only | Real hardware |
+| Battery life (mesh mode) | Unknown | Measured & documented |
+
+---
+
+## Honest Assessment
+
+**What we do well:**
+- MLS group crypto is genuinely better than Meshtastic/Reticulum
+- Transport abstraction is clean
+- Announce protocol is solid
+
+**What we need to fix:**
+- MLS overhead makes LoRa impractical for group setup
+- No solution for KeyPackage distribution without server
+- No real-world testing yet
+
+**What we should acknowledge in marketing:**
+- "Best crypto for mesh" is true, but with caveats
+- "LoRa-ready" means "designed for LoRa, pending optimization"
+- We're research-stage, not production-ready
+
+---
+
+*Last updated: 2026-03-30*
diff --git a/docs/plans/mls-lite-design.md b/docs/plans/mls-lite-design.md
new file mode 100644
index 0000000..3e9db27
--- /dev/null
+++ b/docs/plans/mls-lite-design.md
@@ -0,0 +1,325 @@
+# MLS-Lite: Lightweight Crypto for Constrained Mesh Links
+
+> **Goal:** Define a symmetric encryption mode that works on LoRa SF12 (51-byte MTU)
+> while preserving as much MLS security as possible and enabling upgrade to full MLS
+> when faster transports are available.
+>
+> Created: 2026-03-30 | Status: Design Draft
+
+---
+
+## Problem Statement
+
+Full MLS is impractical on constrained links:
+
+| MLS Operation | Size (bytes) | SF12 Fragments | TX Time (1% duty) |
+|---------------|--------------|----------------|-------------------|
+| KeyPackage | 500-800 | 10-16 | 10-16 hours |
+| Welcome | 1000-2000 | 20-40 | 20-40 hours |
+| Commit | 200-500 | 4-10 | 4-10 hours |
+| AppMessage | 100-200 | 2-4 | 2-4 hours |
+
+**Result:** Group setup over LoRa takes days. Messages take hours. Unusable.
+
+---
+
+## Design Goals
+
+1. **Short message overhead:** <50 bytes for a "hello" message (fits SF12 MTU unfragmented)
+2. **Group encryption:** Shared symmetric key, not just link encryption
+3. **Sender authentication:** Ed25519 signature (64 bytes, fragmentable)
+4. **Upgrade path:** Seamless transition to full MLS when faster link available
+5. **No KeyPackage exchange:** Use pre-shared secrets or out-of-band key exchange
+
+---
+
+## MLS-Lite Protocol
+
+### Mode Selection
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│  TransportManager                                           │
+├─────────────────────────────────────────────────────────────┤
+│  On send(destination, payload):                             │
+│                                                             │
+│  1. Check best route to destination                         │
+│  2. Get transport bitrate:                                  │
+│     - QUIC/TCP (>10 kbps) → full MLS                        │
+│     - LoRa SF7-9 (1-10 kbps) → MLS-Lite + signatures        │
+│     - LoRa SF10-12 (<1 kbps) → MLS-Lite, no signatures      │
+│                                                             │
+│  3. Wrap payload in appropriate envelope                    │
+│  4. Fragment if needed for transport MTU                    │
+│                                                             │
+└─────────────────────────────────────────────────────────────┘
+```
+
+### MLS-Lite Envelope (Minimal Mode)
+
+For SF12 LoRa where every byte counts:
+
+```rust
+pub struct MlsLiteEnvelope {
+    // Header: 25 bytes
+    pub version: u8,            // 1 byte: 0x02 = MLS-Lite
+    pub flags: u8,              // 1 byte: [has_sig, priority(2), reserved(5)]
+    pub group_id: [u8; 8],      // 8 bytes: truncated group identifier
+    pub sender_addr: [u8; 4],   // 4 bytes: truncated sender address
+    pub seq: u32,               // 4 bytes: sequence number (replay protection)
+    pub epoch: u16,             // 2 bytes: key epoch (for rotation)
+    pub nonce: [u8; 5],         // 5 bytes: ChaCha20 nonce suffix (7-byte prefix is derived per epoch)
+
+    // Payload: variable
+    pub ciphertext: Vec<u8>,    // ChaCha20-Poly1305 encrypted
+                                // includes 16-byte auth tag
+
+    // Optional signature: 64 bytes (if has_sig flag set)
+    pub signature: Option<[u8; 64]>,
+}
+// Minimal overhead: 25 bytes header + 16 bytes tag = 41 bytes
+// With signature: 105 bytes total overhead
+```
+
+### Encryption Details
+
+```
+Key derivation:
+  group_secret = HKDF-SHA256(
+    ikm = pre_shared_key || group_id,
+    salt = "quicprochat-mls-lite-v1",
+    info = epoch.to_be_bytes(),
+    length = 39                          // 32-byte key + 7-byte nonce prefix
+  )
+
+  encryption_key = group_secret[0..32]   // ChaCha20 key
+  nonce_prefix = group_secret[32..39]    // 7 bytes
+
+Full nonce (12 bytes):
+  nonce = nonce_prefix || envelope.nonce
+
+Encrypt:
+  ciphertext = ChaCha20-Poly1305(
+    key = encryption_key,
+    nonce = nonce,
+    plaintext = payload,
+    aad = header_bytes  // version, flags, group_id, sender_addr, seq, epoch
+  )
+```
+
+### Key Exchange (Out-of-Band)
+
+MLS-Lite groups are established via:
+
+1. **QR Code:** Scan to join group (contains group_secret + group_id)
+2. **NFC Tap:** Bump phones to exchange group key
+3. **Voice Readout:** 24-word mnemonic for group secret
+4. **Faster Link:** Full MLS setup over QUIC, then extract epoch key for MLS-Lite
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│  Key Exchange Flow                                          │
+├─────────────────────────────────────────────────────────────┤
+│                                                             │
+│  Option A: QR Code (in-person)                              │
+│    Alice generates: QR(group_id || group_secret)            │
+│    Bob scans → joins MLS-Lite group                         │
+│                                                             │
+│  Option B: MLS Bootstrap (hybrid)                           │
+│    1. Alice & Bob establish full MLS group over Internet    │
+│    2. Export current epoch key as MLS-Lite group_secret     │
+│    3. Both can now communicate over LoRa using MLS-Lite     │
+│    4. When Internet available, re-sync to full MLS          │
+│                                                             │
+│  Option C: Pre-Shared Key (deployment)                      │
+│    Org distributes group_secret to all devices              │
+│    Like Meshtastic channel key, but with replay protection  │
+│                                                             │
+└─────────────────────────────────────────────────────────────┘
+```
+
+### Key Rotation
+
+MLS-Lite does NOT have automatic post-compromise security. Manual rotation:
+
+```
+Rotation trigger:
+  - Periodic (e.g., weekly)
+  - Member leaves group
+  - Suspected compromise
+
+Rotation process:
+  1. New group_secret generated (QR code, or via full MLS if available)
+  2. epoch incremented
+  3. Old key deleted after grace period
+  4. Devices that miss rotation must re-join
+```
+
+### Upgrade to Full MLS
+
+When faster transport becomes available:
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│  MLS-Lite → MLS Upgrade                                     │
+├─────────────────────────────────────────────────────────────┤
+│                                                             │
+│  1. Device detects QUIC/TCP connectivity                    │
+│  2. Contacts server, fetches peer KeyPackages               │
+│  3. Creates full MLS group with same group_id               │
+│  4. Sends MLS Welcome to all known members                  │
+│  5. Members upgrade to full MLS                             │
+│  6. MLS-Lite continues in parallel for LoRa-only members    │
+│                                                             │
+│  Bridging:                                                  │
+│  - Gateway nodes (CAP_GATEWAY) translate between modes      │
+│  - Full MLS message → re-encrypt as MLS-Lite for LoRa       │
+│  - MLS-Lite message → forward as MLS AppMessage             │
+│                                                             │
+└─────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## Security Analysis
+
+### What MLS-Lite Provides
+
+| Property | Full MLS | MLS-Lite | Notes |
+|----------|----------|----------|-------|
+| **Confidentiality** | ✓ | ✓ | ChaCha20-Poly1305 |
+| **Integrity** | ✓ | ✓ | Poly1305 MAC |
+| **Replay protection** | ✓ | ✓ | Sequence numbers |
+| **Sender auth (group)** | ✓ | ✓ | Only group members can encrypt |
+| **Sender auth (individual)** | ✓ | Optional | Ed25519 signature (64 bytes) |
+| **Forward secrecy** | ✓ | Partial | Only on manual epoch rotation |
+| **Post-compromise security** | ✓ | ✗ | No automatic healing |
+| **Transcript consistency** | ✓ | ✗ | No ratchet tree |
+| **Deniability** | ✗ | ✗ | Neither provides this |
+
+### Threat Model
+
+**Protected against:**
+- Passive eavesdropping (including by a quantum adversary, provided group_secret was exchanged over a PQ-safe channel)
+- Message replay (sequence numbers)
+- Message tampering (AEAD)
+- Outsider injection (need group_secret)
+
+**NOT protected against:**
+- Compromised group member reading all traffic (no PCS)
+- Long-term key compromise without manual rotation
+- Relay node with group_secret (but they're in the group anyway)
+
+### Comparison to Meshtastic
+
+| Property | Meshtastic | MLS-Lite |
+|----------|------------|----------|
+| **Encryption** | AES-256-CTR | ChaCha20-Poly1305 |
+| **Authentication** | None (shared key) | Optional Ed25519 |
+| **Replay protection** | None | Sequence numbers |
+| **Key rotation** | Manual | Manual (epoch field) |
+| **Overhead** | 16 bytes (header) | 41 bytes (no sig), 105 bytes (with sig) |
+| **Upgrade path** | None | → Full MLS |
+
+MLS-Lite offers stronger security properties than Meshtastic's shared-key crypto
+(authenticated encryption, replay protection, optional sender signatures) at the
+cost of 25 more bytes of per-message overhead, which still fits similar constraints.
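The sequence-number replay protection listed in the tables above requires the receiver to remember which `seq` values it has already accepted from each sender. One common way to do that on a lossy, reordering link is a sliding-window bitmap, sketched below. This is an illustration only: the `ReplayWindow` name, the 64-entry window, and the decision to ignore `u32` wrap-around are assumptions, not part of the MLS-Lite draft. A real receiver would presumably keep one window per (group_id, sender_addr) pair and reset it when the epoch changes.

```rust
/// Sliding-window replay filter for a single sender.
/// Hypothetical helper: the name and the 64-entry window are illustrative.
#[derive(Debug, Default)]
struct ReplayWindow {
    highest: u32,  // highest sequence number accepted so far
    bitmap: u64,   // bit i set => sequence number (highest - i) was seen
    started: bool, // false until the first message arrives
}

impl ReplayWindow {
    const SIZE: u32 = 64;

    /// Returns true if `seq` is fresh (and records it); returns false if it
    /// is a replay or is too old to fall inside the window.
    fn accept(&mut self, seq: u32) -> bool {
        if !self.started {
            self.started = true;
            self.highest = seq;
            self.bitmap = 1;
            return true;
        }
        if seq > self.highest {
            // Newer than anything seen: slide the window forward.
            let shift = seq - self.highest;
            self.bitmap = if shift >= Self::SIZE { 0 } else { self.bitmap << shift };
            self.bitmap |= 1;
            self.highest = seq;
            return true;
        }
        let age = self.highest - seq;
        if age >= Self::SIZE {
            return false; // older than the window can track
        }
        let mask = 1u64 << age;
        if self.bitmap & mask != 0 {
            return false; // already seen: replay
        }
        self.bitmap |= mask; // late but fresh: record it
        true
    }
}
```

The window tolerates out-of-order delivery (a message numbered 11 arriving after 12 is still accepted once), which matters on multi-hop mesh paths. The check should only run after the AEAD tag verifies, so that forged packets cannot advance the window.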
+
+---
+
+## Wire Format
+
+### MLS-Lite Envelope (CBOR)
+
+```
+MlsLiteEnvelope = {
+  0: uint,              ; version (0x02)
+  1: uint,              ; flags
+  2: bytes .size 8,     ; group_id
+  3: bytes .size 4,     ; sender_addr
+  4: uint,              ; seq
+  5: uint,              ; epoch
+  6: bytes .size 5,     ; nonce
+  7: bytes,             ; ciphertext (includes 16-byte tag)
+  ? 8: bytes .size 64   ; signature (optional)
+}
+```
+
+Estimated sizes:
+- Minimal (1-byte payload): ~50 bytes (fits SF12 unfragmented!)
+- Short message (20 bytes): ~70 bytes (2 fragments on SF12)
+- With signature: add 64 bytes
+
+### MeshEnvelope Mode Flag
+
+Extend MeshEnvelope to indicate crypto mode:
+
+```rust
+pub struct MeshEnvelope {
+    // ... existing fields ...
+
+    /// Crypto mode: 0x00 = full MLS, 0x02 = MLS-Lite
+    pub crypto_mode: u8,
+}
+```
+
+---
+
+## Implementation Plan
+
+### Phase 1: Core MLS-Lite
+
+1. [ ] Define `MlsLiteEnvelope` struct
+2. [ ] Implement key derivation (HKDF)
+3. [ ] Implement encrypt/decrypt (ChaCha20-Poly1305)
+4. [ ] Add sequence number tracking (replay window)
+5. [ ] Add CBOR serialization
+6. [ ] Unit tests
+
+### Phase 2: Integration
+
+1. [ ] Add `crypto_mode` to TransportManager routing decisions
+2. [ ] Implement QR code key exchange (generate/scan)
+3. [ ] Add `/mesh lite-create ` REPL command
+4. [ ] Add `/mesh lite-join ` REPL command
+5. [ ] Integration tests with LoRaMockMedium
+
+### Phase 3: Gateway/Bridge
+
+1. [ ] Implement MLS → MLS-Lite translation in gateway nodes
+2. [ ] Add CAP_GATEWAY capability flag
+3. [ ] Handle epoch sync between modes
+4. [ ] End-to-end test: QUIC client → gateway → LoRa client
+
+---
+
+## Open Questions
+
+1. **Signature vs. no signature?**
+   - Signatures add 64 bytes (1-2 extra fragments on SF12)
+   - Without signatures, any group member can spoof any sender
+   - Proposal: configurable, default to signatures on SF7-9, skip on SF10-12
+
+2. **Epoch sync without server?**
+   - How do LoRa-only nodes learn about epoch changes?
+   - Proposal: Include epoch in announce, peers relay epoch updates
+
+3. **Post-quantum group_secret?**
+   - MLS-Lite uses symmetric crypto (quantum-safe for confidentiality)
+   - Key exchange is vulnerable if using X25519
+   - Proposal: QR code includes ML-KEM-768 encapsulation for PQ key exchange
+
+4. **Compatibility with Reticulum/LXMF?**
+   - Should we use msgpack instead of CBOR for LXMF compat?
+   - Should we implement LXMF as an additional mode?
+
+---
+
+## References
+
+- [MLS RFC 9420](https://datatracker.ietf.org/doc/rfc9420/): Full MLS spec
+- [ChaCha20-Poly1305 RFC 8439](https://datatracker.ietf.org/doc/rfc8439/)
+- [HKDF RFC 5869](https://datatracker.ietf.org/doc/rfc5869/)
+- [Meshtastic Encryption](https://meshtastic.org/docs/overview/encryption/)
+- [Reticulum LXMF](https://github.com/markqvist/LXMF)
+
+---
+
+*Last updated: 2026-03-30*