docs: add mesh protocol gap analysis and MLS-Lite design

Honest assessment of QuicProChat vs Reticulum/Meshtastic/Briar:
- MLS overhead (500-800 byte KeyPackages) impractical for SF12 LoRa
- KeyPackage distribution over mesh unsolved
- No lightweight mode for constrained links

MLS-Lite design proposes 41-byte overhead symmetric mode:
- ChaCha20-Poly1305 with HKDF key derivation
- Optional Ed25519 signatures
- Upgrade path to full MLS when faster transport available
- QR code / out-of-band key exchange
2026-03-30 23:29:44 +02:00
parent f9ac921a0c
commit 01bc2a4273
2 changed files with 648 additions and 0 deletions
--- a/docs/plans/mesh-protocol-gaps.md
+++ b/docs/plans/mesh-protocol-gaps.md
@@ -0,0 +1,323 @@
 # Mesh Protocol Gaps — Honest Assessment & Action Plan
 > **Goal:** Identify real weaknesses in QuicProChat's mesh protocol compared to
 > Reticulum, Meshtastic, and LXMF. Plan concrete improvements.
 >
 > Created: 2026-03-30
 ---
 ## Executive Summary
 QuicProChat has strong cryptography (MLS, PQ-KEM) but **real gaps** in the mesh layer:
 | Gap | Severity | Status |
 |-----|----------|--------|
 | MLS overhead too large for LoRa | **Critical** | Needs design work |
 | No lightweight messaging mode | **High** | Not started |
 | KeyPackage distribution over mesh | **High** | Not solved |
 | Announce/routing not battle-tested | **Medium** | S3 done, needs real-world test |
 | No DTN bundle protocol integration | **Medium** | Not started |
 | Battery/duty-cycle optimization | **Medium** | Basic tracker exists |
 ---
 ## Gap 1: MLS Overhead is Prohibitive for Constrained Links
 ### The Problem
 **MLS was designed for Internet messaging, not LoRa.**
 Estimated sizes and airtimes (approximate, not yet benchmarked):
 | Component | Size (bytes) | LoRa SF12/BW125 airtime (est.) |
 |-----------|--------------|------------------------|
 | MLS KeyPackage | ~500-800 | 80-130 seconds |
 | MLS Welcome | ~1000-2000 | 160-320 seconds |
 | MLS Commit | ~200-500 | 32-80 seconds |
 | MLS ApplicationMessage | ~100-200 | 16-32 seconds |
 | **MeshEnvelope overhead** | ~170 (CBOR) | 27 seconds |
 | **Reticulum LXMF message** | ~100-150 | 16-24 seconds |
 | **Meshtastic payload** | ~237 max | 38 seconds |
 **The math doesn't work:**
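 - LoRa SF12/BW125: ~51 byte MTU, ~300 bps nominal PHY rate (the airtimes above work out to ~50 bps effective, i.e. ~0.16 s/byte, a conservative estimate)
 - EU868 duty cycle: 1% = 36 seconds TX per hour
 - **One MLS KeyPackage = 10-16 fragments = at least an entire hour's duty budget** (2-4 hours at the airtimes estimated above; more fragments once per-fragment headers are counted)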
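The airtime figures above are estimates. As a cross-check, the sketch below computes raw PHY time-on-air from the LoRa modem formula in the Semtech SX127x datasheet, plus how many full frames fit in a duty-cycle budget. The struct, constant, and function names are hypothetical (not QuicProChat APIs), and the radio settings (coding rate 4/5, 8-symbol preamble, explicit header, CRC on, low-data-rate optimisation on) are assumptions. The result is a lower bound: it ignores fragment headers, acknowledgements, and retransmissions, so it comes out lower than the conservative table values.

```rust
/// Radio parameters for one LoRa frame. The values in `SF12_BW125` are
/// assumptions for illustration, not settings read from the codebase.
#[derive(Clone, Copy)]
pub struct LoraParams {
    pub sf: u32,               // spreading factor (7..=12)
    pub bw_hz: f64,            // bandwidth in Hz
    pub cr: u32,               // coding rate index: 1 = 4/5 ... 4 = 4/8
    pub preamble_syms: u32,    // programmed preamble length in symbols
    pub explicit_header: bool, // explicit PHY header present
    pub crc: bool,             // payload CRC enabled
    pub low_dr_opt: bool,      // low data rate optimisation
}

pub const SF12_BW125: LoraParams = LoraParams {
    sf: 12,
    bw_hz: 125_000.0,
    cr: 1,
    preamble_syms: 8,
    explicit_header: true,
    crc: true,
    low_dr_opt: true,
};

/// Time-on-air in seconds for a single frame carrying `payload_len` bytes.
pub fn airtime_secs(p: LoraParams, payload_len: u32) -> f64 {
    let t_sym = f64::from(1u32 << p.sf) / p.bw_hz;
    let t_preamble = (f64::from(p.preamble_syms) + 4.25) * t_sym;
    let crc_bits: i64 = if p.crc { 16 } else { 0 };
    let ih: i64 = if p.explicit_header { 0 } else { 1 };
    let de: i64 = if p.low_dr_opt { 1 } else { 0 };
    let sf = i64::from(p.sf);
    let num = 8 * i64::from(payload_len) - 4 * sf + 28 + crc_bits - 20 * ih;
    let den = 4 * (sf - 2 * de);
    // ceil(num / den), clamped at zero as in the datasheet formula
    let blocks = if num > 0 { (num + den - 1) / den } else { 0 };
    let payload_syms = 8 + blocks * (i64::from(p.cr) + 4);
    t_preamble + payload_syms as f64 * t_sym
}

/// Number of MTU-sized fragments needed for `total_len` bytes
/// (ignores any per-fragment header).
pub fn fragments_needed(total_len: usize, mtu: usize) -> usize {
    (total_len + mtu - 1) / mtu
}

/// Whole frames of `payload_len` bytes that fit in one hour at `duty_cycle`.
pub fn frames_per_hour(p: LoraParams, payload_len: u32, duty_cycle: f64) -> u32 {
    let budget_secs = 3600.0 * duty_cycle;
    (budget_secs / airtime_secs(p, payload_len)).floor() as u32
}
```

Under these assumptions a full 51-byte frame takes roughly 2.5 s, so `frames_per_hour(SF12_BW125, 51, 0.01)` leaves room for about 14 frames per hour, which is in the same range as the 10-16 fragments of one KeyPackage.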
 ### Current State
 - MeshEnvelope uses CBOR, ~170 bytes overhead for a short message
 - MLS operations happen at application layer, not optimized for mesh
 - No fallback to lighter crypto for constrained links
 ### Proposed Solutions
 #### Option A: Hybrid Crypto Modes (Recommended)
 ```
 ┌─────────────────────────────────────────────────────────────────┐
 │  Mode Selection Based on Transport Capability                   │
 ├─────────────────────────────────────────────────────────────────┤
 │                                                                 │
 │  QUIC/TCP/WiFi (>10 kbps):                                     │
 │    → Full MLS groups with PQ-KEM                               │
 │    → KeyPackage distribution via server                        │
 │    → Standard protocol                                          │
 │                                                                 │
 │  LoRa/Serial (<1 kbps):                                        │
 │    → "MLS-Lite" mode:                                          │
 │      • Pre-shared group epoch key (exchanged out-of-band)      │
 │      • ChaCha20-Poly1305 symmetric encryption                  │
 │      • Ed25519 signatures (64 bytes)                           │
 │      • No per-message KeyPackage exchange                      │
 │      • Manual key rotation via QR code or faster link          │
 │                                                                 │
 │  Upgrade path:                                                  │
 │    When faster transport available → full MLS epoch sync       │
 │                                                                 │
 └─────────────────────────────────────────────────────────────────┘
 ```
 **Trade-off:** Lose automatic post-compromise security (PCS) on constrained links. Gain usability.
 #### Option B: Compressed MLS (Research)
 - Strip unused extensions from KeyPackages
 - Use shorter credential identifiers (16 bytes instead of 32)
 - Batch multiple KeyPackages into single transfer over fast link
 - Cache and reuse KeyPackages more aggressively
 **Trade-off:** Still large. May not be enough for SF12 LoRa.
 #### Option C: LXMF-Compatible Mode
 Implement Reticulum's LXMF format as an alternative wire format:
 ```rust
 pub struct LxmfMessage {
    destination: [u8; 16],   // Truncated hash
    source: [u8; 16],
    signature: [u8; 64],     // Ed25519
    payload: Vec<u8>,        // msgpack: {timestamp, content, title, fields}
 }
 // Total: ~100-150 bytes for short message
 ```
 **Trade-off:** Lose MLS group properties. Gain Reticulum interop and efficiency.
 ### Action Items
 - [ ] **Measure actual MLS sizes** in current implementation (benchmark)
 - [ ] **Design MLS-Lite spec** for constrained links
 - [ ] **Implement transport capability negotiation** in TransportManager
 - [ ] **Add `--constrained` mode** to MeshEnvelope for minimal overhead
 ---
 ## Gap 2: KeyPackage Distribution Over Mesh
 ### The Problem
 MLS requires pre-positioned KeyPackages for adding members to groups. On Internet:
 server stores KeyPackages, clients fetch on demand. On mesh: **no server**.
 Current flow (broken for pure mesh):
 ```
 Alice wants to add Bob to group:
 1. Alice fetches Bob's KeyPackage from server    ← requires Internet
 2. Alice creates Welcome + Commit
 3. Alice sends to Bob via mesh
 ```
 ### Proposed Solution: Announce-Based KeyPackage Distribution
 ```
 Bob announces on mesh:
 1. MeshAnnounce includes: identity_key, capabilities, AND current_keypackage_hash
 2. Nearby nodes cache Bob's latest KeyPackage (if they have it)
 3. Alice receives Bob's announce, requests KeyPackage via mesh RPC
 KeyPackage propagation:
 1. Bob periodically broadcasts KeyPackage update (larger message, less frequent)
 2. Nodes with capacity (CAP_STORE) cache KeyPackages for relaying
 3. TTL-based expiry (KeyPackages are single-use, but we can cache N of them)
 ```
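The caching step in the flow above (store up to N KeyPackages per peer, expire by TTL, hand each one out once) could look roughly like the following. This is a minimal sketch: `KeyPackageCache`, `IdentityKey`, and the method names are hypothetical and do not describe the existing MeshStore API, and time is passed in as plain seconds to keep the example deterministic.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical 32-byte identity key used as the cache index.
pub type IdentityKey = [u8; 32];

struct CachedKeyPackage {
    bytes: Vec<u8>,
    expires_at: u64, // seconds, same clock as the `now` arguments
}

/// Bounded, TTL-based cache of serialized KeyPackages per identity.
pub struct KeyPackageCache {
    per_identity_cap: usize,
    ttl_secs: u64,
    entries: HashMap<IdentityKey, VecDeque<CachedKeyPackage>>,
}

impl KeyPackageCache {
    pub fn new(per_identity_cap: usize, ttl_secs: u64) -> Self {
        Self {
            per_identity_cap: per_identity_cap.max(1),
            ttl_secs,
            entries: HashMap::new(),
        }
    }

    /// Store a KeyPackage heard on the mesh; the oldest entry is evicted
    /// when the per-identity limit is reached.
    pub fn insert(&mut self, id: IdentityKey, bytes: Vec<u8>, now: u64) {
        let queue = self.entries.entry(id).or_default();
        while queue.len() >= self.per_identity_cap {
            queue.pop_front();
        }
        queue.push_back(CachedKeyPackage {
            bytes,
            expires_at: now + self.ttl_secs,
        });
    }

    /// Hand out one unexpired KeyPackage and remove it, since KeyPackages
    /// are single-use. Expired entries found on the way are dropped.
    pub fn take(&mut self, id: &IdentityKey, now: u64) -> Option<Vec<u8>> {
        let queue = self.entries.get_mut(id)?;
        while let Some(front) = queue.pop_front() {
            if front.expires_at > now {
                return Some(front.bytes);
            }
        }
        None
    }
}
```

Removing an entry on `take` keeps a relay from serving the same KeyPackage to two requesters, at the cost of running dry faster when the peer cannot refresh its KeyPackages over the mesh.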
 ### Action Items
 - [ ] **Extend MeshAnnounce** with optional `keypackage_hash` field
 - [ ] **Add KeyPackage request/response** to mesh protocol
 - [ ] **Implement KeyPackage cache** in MeshStore (separate from message queue)
 - [ ] **Design KeyPackage refresh protocol** for mesh-only scenarios
 ---
 ## Gap 3: No DTN/Bundle Protocol Integration
 ### The Problem
 NASA/IETF Bundle Protocol (RFC 9171) is the standard for delay-tolerant networking.
 Reticulum effectively reinvented it. QuicProChat should learn from both.
 Key DTN concepts we're missing:
 | Concept | DTN/BPv7 | Reticulum | QuicProChat |
 |---------|----------|-----------|-------------|
 | **Custody transfer** | Yes | No | No |
 | **Fragmentation at bundle layer** | Yes | No | Yes (LoRa transport) |
 | **Convergence layer adapters** | Formal spec | Interfaces | MeshTransport trait |
 | **Routing protocols** | CGR, EPIDEMIC | Announce-based | Announce-based |
 | **Priority scheduling** | Yes | No | No |
 ### Proposed Improvements
 1. **Priority levels in MeshEnvelope** (emergency > data > announce)
 2. **Custody transfer option** — intermediate node takes responsibility
 3. **Better congestion control** — backpressure signals in announce
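One way to realise the priority levels in item 1 is a max-heap keyed on priority with first-in-first-out ordering inside each level. The sketch below assumes three levels named after the list above; `Priority`, `TxQueue`, and their methods are hypothetical names, not existing MeshStore or DutyCycleTracker types.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// Priority levels from the proposal; a higher value is sent first.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Priority {
    Announce = 0,
    Data = 1,
    Emergency = 2,
}

#[derive(Debug, PartialEq, Eq)]
struct Queued {
    priority: Priority,
    order: u64, // insertion counter, used to keep FIFO order per level
    payload: Vec<u8>,
}

impl Ord for Queued {
    fn cmp(&self, other: &Self) -> Ordering {
        // Higher priority wins; within a level the older entry wins.
        self.priority
            .cmp(&other.priority)
            .then_with(|| other.order.cmp(&self.order))
    }
}

impl PartialOrd for Queued {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

/// Outbound queue that releases emergency traffic before data and announces.
#[derive(Default)]
pub struct TxQueue {
    heap: BinaryHeap<Queued>,
    next_order: u64,
}

impl TxQueue {
    pub fn push(&mut self, priority: Priority, payload: Vec<u8>) {
        let order = self.next_order;
        self.next_order += 1;
        self.heap.push(Queued { priority, order, payload });
    }

    pub fn pop(&mut self) -> Option<(Priority, Vec<u8>)> {
        self.heap.pop().map(|q| (q.priority, q.payload))
    }
}
```

A strict ordering like this can starve announces under sustained data load, so a real scheduler would probably reserve a small share of the duty budget for lower levels.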
 ### Action Items
 - [ ] **Add priority field** to MeshEnvelope
 - [ ] **Research custody transfer** — is it worth the complexity?
 - [ ] **Implement priority queue** in MeshStore and DutyCycleTracker
 ---
 ## Gap 4: Battery/Duty-Cycle Optimization
 ### The Problem
 Briar reportedly drains battery up to 4x faster due to constant BT scanning. We claim to be better but
 haven't proven it.
 Current state:
 - DutyCycleTracker enforces EU868 1% limit
 - Announce interval is configurable (default 10 min)
 - No adaptive power management
 ### Proposed Improvements
 1. **Adaptive announce interval** — more frequent during activity, less frequent when idle
 2. **Listen-before-talk** — don't TX if channel is busy (LoRa CAD)
 3. **Scheduled wake windows** — coordinate with peers for efficient sync
 4. **Power profiles** — "always-on", "hourly-sync", "manual-only"
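Items 1 and 4 can be combined into a single scheduling function. The sketch below is one possible policy, not a measured or tuned one: the profile names mirror the list above, while `PowerProfile`, `next_announce_interval`, the doubling rule, and the 8x cap are all assumptions made for illustration.

```rust
use std::time::Duration;

/// Power profiles named after the proposal above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PowerProfile {
    AlwaysOn,
    HourlySync,
    ManualOnly,
}

/// Returns how long to wait before the next announce, or `None` if the
/// node should only announce on explicit user request.
///
/// `base` is the configured announce interval and `idle_for` is the time
/// since the last sent or received application message.
pub fn next_announce_interval(
    profile: PowerProfile,
    base: Duration,
    idle_for: Duration,
) -> Option<Duration> {
    match profile {
        PowerProfile::ManualOnly => None,
        PowerProfile::HourlySync => Some(Duration::from_secs(3600)),
        PowerProfile::AlwaysOn => {
            // Back off exponentially: double the interval for every full
            // `base` period without activity, capped at 8x.
            let idle_periods = idle_for.as_secs() / base.as_secs().max(1);
            let factor = 1u32 << idle_periods.min(3);
            Some(base * factor)
        }
    }
}
```

With the default 10-minute interval, a node that has been idle for 25 minutes would wait 40 minutes before its next announce, and any activity resets it to 10 minutes.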
 ### Action Items
 - [ ] **Implement CAD (Channel Activity Detection)** in LoRaTransport
 - [ ] **Add power profile config** to P2pNode
 - [ ] **Measure actual power consumption** with real hardware
 ---
 ## Gap 5: Real-World Testing
 ### The Problem
 All our mesh code runs against mocks. We claim LoRa support but haven't tested
 with real radios.
 ### Testing Plan
 | Test | Hardware | Status |
 |------|----------|--------|
 | LoRa point-to-point | 2x SX1262 dev boards | Not started |
 | LoRa multi-hop | 3x SX1262, different rooms | Not started |
 | Mixed transport | LoRa + WiFi relay | Not started |
 | Outdoor range test | LoRa, line-of-sight 1km | Not started |
 | Duty cycle compliance | SDR spectrum analyzer | Not started |
 ### Action Items
 - [ ] **Procure hardware** — 3x Heltec LoRa32 or similar
 - [ ] **Implement UART LoRaTransport** for real modems
 - [ ] **Create test harness** for automated multi-node testing
 - [ ] **Document actual performance** numbers
 ---
 ## Gap 6: Comparison Claims Need Verification
 ### The Problem
 Our positioning doc claims superiority over Meshtastic/Reticulum/Briar, but:
 - We haven't measured our actual overhead vs. theirs
 - We haven't tested interop scenarios
 - We haven't run security analysis against their threat models
 ### Verification Plan
 | Claim | How to Verify |
 |-------|---------------|
 | "MLS is better than shared-key AES" | Threat model comparison doc |
 | "Multi-hop works" | Integration test with 5+ nodes |
 | "LoRa-ready" | Actual LoRa hardware test |
 | "Post-quantum protects groups" | Verify hybrid KEM in MLS path |
 | "Relay nodes can't read content" | Formal verification of E2E path |
 ### Action Items
 - [ ] **Create benchmark suite** comparing message sizes
 - [ ] **Write threat model comparison** doc (Meshtastic CVEs, Reticulum link-level)
 - [ ] **Fuzz test** mesh envelope parsing
 - [ ] **Get external review** of mesh crypto design
 ---
 ## Implementation Priority
 ### Phase 1: Make It Work (Next 2 Sprints)
 1. **S4: Multi-hop routing** — complete the core mesh functionality
 2. **S5: Truncated addresses** — reduce envelope overhead
 3. **Measure actual sizes** — know the real numbers
 ### Phase 2: Make It Efficient (Following 2 Sprints)
 4. **Design MLS-Lite** — spec for constrained links
 5. **Priority queue** — emergency messages first
 6. **Hardware testing** — real LoRa validation
 ### Phase 3: Make It Production-Ready
 7. **KeyPackage distribution** — mesh-native key exchange
 8. **Power profiles** — battery optimization
 9. **External review** — security audit of mesh layer
 ---
 ## Success Metrics
 | Metric | Current | Target |
 |--------|---------|--------|
 | MeshEnvelope overhead (short msg) | ~170 bytes | <100 bytes |
 | Time to send "hello" over SF12 LoRa | ~27 sec | <15 sec |
 | KeyPackage exchange over mesh | Not possible | Works |
 | Multi-hop message delivery | Mock only | Real hardware |
 | Battery life (mesh mode) | Unknown | Measured & documented |
 ---
 ## Honest Assessment
 **What we do well:**
 - MLS group crypto is genuinely better than Meshtastic/Reticulum
 - Transport abstraction is clean
 - Announce protocol is solid
 **What we need to fix:**
 - MLS overhead makes LoRa impractical for group setup
 - No solution for KeyPackage distribution without server
 - No real-world testing yet
 **What we should acknowledge in marketing:**
 - "Best crypto for mesh" is true, but with caveats
 - "LoRa-ready" means "designed for LoRa, pending optimization"
 - We're research-stage, not production-ready
 ---
 *Last updated: 2026-03-30*
--- a/docs/plans/mls-lite-design.md
+++ b/docs/plans/mls-lite-design.md
@@ -0,0 +1,325 @@
 # MLS-Lite: Lightweight Crypto for Constrained Mesh Links
 > **Goal:** Define a symmetric encryption mode that works on LoRa SF12 (51-byte MTU)
 > while preserving as much MLS security as possible and enabling upgrade to full MLS
 > when faster transports are available.
 >
 > Created: 2026-03-30 | Status: Design Draft
 ---
 ## Problem Statement
 Full MLS is impractical on constrained links:
 | MLS Operation | Size (bytes) | SF12 Fragments | Time to send at 1% duty cycle |
 |---------------|--------------|----------------|-------------------|
 | KeyPackage | 500-800 | 10-16 | ~2-4 hours |
 | Welcome | 1000-2000 | 20-40 | ~4-9 hours |
 | Commit | 200-500 | 4-10 | ~1-2 hours |
 | AppMessage | 100-200 | 2-4 | ~0.5-1 hour |
 **Result:** Using the airtime estimates from mesh-protocol-gaps.md (~0.16 s/byte at SF12/BW125, 36 s of TX per hour), adding a single member (KeyPackage + Welcome + Commit) takes roughly 7-15 hours, and one message can take up to an hour. Unusable.
 ---
 ## Design Goals
 1. **Short message overhead:** <50 bytes for a "hello" message (fits SF12 MTU unfragmented)
 2. **Group encryption:** Shared symmetric key, not just link encryption
 3. **Sender authentication:** Optional Ed25519 signature (64 bytes, fragmentable)
 4. **Upgrade path:** Seamless transition to full MLS when faster link available
 5. **No KeyPackage exchange:** Use pre-shared secrets or out-of-band key exchange
 ---
 ## MLS-Lite Protocol
 ### Mode Selection
 ```
 ┌─────────────────────────────────────────────────────────────┐
 │                    TransportManager                         │
 ├─────────────────────────────────────────────────────────────┤
 │  On send(destination, payload):                             │
 │                                                             │
 │    1. Check best route to destination                       │
 │    2. Get transport bitrate:                                │
 │       - QUIC/TCP (>10 kbps) → full MLS                     │
 │       - LoRa SF7-9 (1-10 kbps) → MLS-Lite + signatures     │
 │       - LoRa SF10-12 (<1 kbps) → MLS-Lite, no signatures   │
 │                                                             │
 │    3. Wrap payload in appropriate envelope                  │
 │    4. Fragment if needed for transport MTU                  │
 │                                                             │
 └─────────────────────────────────────────────────────────────┘
 ```
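The bitrate thresholds in step 2 map directly onto a small selection function. This is a sketch under the thresholds shown in the diagram; `CryptoMode` and `select_crypto_mode` are hypothetical names rather than existing TransportManager APIs, and a link at exactly 10 kbps is treated as belonging to the middle tier.

```rust
/// Crypto mode chosen for a route, following the three tiers in the diagram.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CryptoMode {
    /// Full MLS (QUIC/TCP, above 10 kbps).
    FullMls,
    /// MLS-Lite with Ed25519 signatures (1 to 10 kbps, e.g. LoRa SF7-9).
    MlsLiteSigned,
    /// MLS-Lite without signatures (below 1 kbps, e.g. LoRa SF10-12).
    MlsLiteUnsigned,
}

/// Pick a crypto mode from the estimated bitrate of the best route.
pub fn select_crypto_mode(bitrate_bps: u32) -> CryptoMode {
    match bitrate_bps {
        b if b > 10_000 => CryptoMode::FullMls,
        b if b >= 1_000 => CryptoMode::MlsLiteSigned,
        _ => CryptoMode::MlsLiteUnsigned,
    }
}

/// Fixed per-message overhead of the MLS-Lite envelope in bytes
/// (25-byte header + 16-byte tag, plus 64 bytes when signed).
/// Returns `None` for full MLS, whose overhead is not fixed.
pub fn lite_overhead_bytes(mode: CryptoMode) -> Option<usize> {
    match mode {
        CryptoMode::FullMls => None,
        CryptoMode::MlsLiteSigned => Some(25 + 16 + 64),
        CryptoMode::MlsLiteUnsigned => Some(25 + 16),
    }
}
```

Selecting on measured bitrate rather than transport type lets a slow serial link and a fast LoRa setting land in the tier that fits them.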
 ### MLS-Lite Envelope (Minimal Mode)
 For SF12 LoRa where every byte counts:
 ```rust
 pub struct MlsLiteEnvelope {
    // Header: 25 bytes
    pub version: u8,              // 1 byte: 0x02 = MLS-Lite
    pub flags: u8,                // 1 byte: [has_sig, priority(2), reserved(5)]
    pub group_id: [u8; 8],        // 8 bytes: truncated group identifier
    pub sender_addr: [u8; 4],     // 4 bytes: truncated sender address
    pub seq: u32,                 // 4 bytes: sequence number (replay protection)
    pub epoch: u16,               // 2 bytes: key epoch (for rotation)
    pub nonce: [u8; 5],           // 5 bytes: ChaCha20 nonce suffix (7-byte prefix is derived per epoch via HKDF)
    // Payload: variable
    pub ciphertext: Vec<u8>,      // ChaCha20-Poly1305 encrypted
                                  // includes 16-byte auth tag
    // Optional signature: 64 bytes (if has_sig flag set)
    pub signature: Option<[u8; 64]>,
 }
 // Minimal overhead: 25 bytes header + 16 bytes tag = 41 bytes
 // With signature: 105 bytes total overhead
 ```
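The 25-byte header only stays at 25 bytes if it is packed as raw fields. The sketch below shows one possible fixed-layout encoding with big-endian integers. The bit positions inside `flags` (has_sig in the top bit, priority in the next two) are an assumption, since the struct comment only gives field widths, and `LiteHeader`, `make_flags`, and the constants are hypothetical names.

```rust
pub const HEADER_LEN: usize = 25;
pub const TAG_LEN: usize = 16;
pub const SIG_LEN: usize = 64;

// Assumed bit layout of `flags`: [has_sig:1][priority:2][reserved:5]
const FLAG_HAS_SIG: u8 = 0b1000_0000;
const PRIORITY_MASK: u8 = 0b0110_0000;
const PRIORITY_SHIFT: u8 = 5;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LiteHeader {
    pub version: u8,
    pub flags: u8,
    pub group_id: [u8; 8],
    pub sender_addr: [u8; 4],
    pub seq: u32,
    pub epoch: u16,
    pub nonce: [u8; 5],
}

/// Build the flags byte; priority values above 3 are truncated to 2 bits.
pub fn make_flags(has_sig: bool, priority: u8) -> u8 {
    let sig = if has_sig { FLAG_HAS_SIG } else { 0 };
    sig | ((priority & 0b11) << PRIORITY_SHIFT)
}

impl LiteHeader {
    /// Pack the header into exactly 25 bytes, integers in big-endian order.
    pub fn to_bytes(&self) -> [u8; HEADER_LEN] {
        let mut out = [0u8; HEADER_LEN];
        out[0] = self.version;
        out[1] = self.flags;
        out[2..10].copy_from_slice(&self.group_id);
        out[10..14].copy_from_slice(&self.sender_addr);
        out[14..18].copy_from_slice(&self.seq.to_be_bytes());
        out[18..20].copy_from_slice(&self.epoch.to_be_bytes());
        out[20..25].copy_from_slice(&self.nonce);
        out
    }

    /// Inverse of `to_bytes`.
    pub fn from_bytes(b: &[u8; HEADER_LEN]) -> Self {
        let mut group_id = [0u8; 8];
        group_id.copy_from_slice(&b[2..10]);
        let mut sender_addr = [0u8; 4];
        sender_addr.copy_from_slice(&b[10..14]);
        let mut nonce = [0u8; 5];
        nonce.copy_from_slice(&b[20..25]);
        Self {
            version: b[0],
            flags: b[1],
            group_id,
            sender_addr,
            seq: u32::from_be_bytes([b[14], b[15], b[16], b[17]]),
            epoch: u16::from_be_bytes([b[18], b[19]]),
            nonce,
        }
    }

    pub fn has_sig(&self) -> bool {
        self.flags & FLAG_HAS_SIG != 0
    }

    pub fn priority(&self) -> u8 {
        (self.flags & PRIORITY_MASK) >> PRIORITY_SHIFT
    }
}
```

A packed layout like this is what the 41-byte and 105-byte overhead figures correspond to; any self-describing encoding adds framing bytes on top.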
 ### Encryption Details
 ```
 Key derivation:
  group_secret = HKDF-SHA256(
    ikm = pre_shared_key || group_id,
    salt = "quicprochat-mls-lite-v1",
    info = epoch.to_be_bytes(),
    length = 39
  )
  encryption_key = group_secret[0..32]   // 32 bytes: ChaCha20 key
  nonce_prefix = group_secret[32..39]    // 7 bytes
 Full nonce (12 bytes):
  nonce = nonce_prefix || envelope.nonce
 Encrypt:
  ciphertext = ChaCha20-Poly1305(
    key = encryption_key,
    nonce = nonce,
    plaintext = payload,
    aad = header_bytes  // version, flags, group_id, sender_addr, seq, epoch
  )
 ```
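The byte-level bookkeeping in the scheme above (splitting the 39-byte HKDF output and assembling the 12-byte nonce) is easy to get wrong by one, so here is a std-only sketch of just that part. The HKDF and ChaCha20-Poly1305 calls themselves are left out, as they would come from a crypto library. The function names are hypothetical, and filling the 5-byte suffix from a 40-bit message counter is an assumption; the design does not say how the suffix is chosen.

```rust
/// Split the 39-byte HKDF output into the 32-byte ChaCha20 key and the
/// 7-byte nonce prefix.
pub fn split_group_secret(okm: &[u8; 39]) -> ([u8; 32], [u8; 7]) {
    let mut key = [0u8; 32];
    key.copy_from_slice(&okm[..32]);
    let mut prefix = [0u8; 7];
    prefix.copy_from_slice(&okm[32..]);
    (key, prefix)
}

/// Assemble the full 12-byte nonce: 7-byte per-epoch prefix followed by
/// the 5-byte suffix carried in the envelope.
pub fn full_nonce(prefix: &[u8; 7], suffix: &[u8; 5]) -> [u8; 12] {
    let mut nonce = [0u8; 12];
    nonce[..7].copy_from_slice(prefix);
    nonce[7..].copy_from_slice(suffix);
    nonce
}

/// Assumed suffix scheme: a 40-bit big-endian message counter.
/// Returns `None` once the counter no longer fits, at which point the
/// epoch key must be rotated instead of wrapping around.
pub fn counter_suffix(counter: u64) -> Option<[u8; 5]> {
    if counter >= (1u64 << 40) {
        return None;
    }
    let b = counter.to_be_bytes();
    Some([b[3], b[4], b[5], b[6], b[7]])
}
```

Because every group member encrypts under the same `encryption_key`, a (key, nonce) pair must never repeat across senders either. A plain per-sender counter does not guarantee that on its own, so the suffix would also need to encode something sender-specific; this sketch leaves that open.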
 ### Key Exchange (Out-of-Band)
 MLS-Lite groups are established via:
 1. **QR Code:** Scan to join group (contains group_secret + group_id)
 2. **NFC Tap:** Bump phones to exchange group key
 3. **Voice Readout:** 24-word mnemonic for group secret
 4. **Faster Link:** Full MLS setup over QUIC, then extract epoch key for MLS-Lite
 ```
 ┌─────────────────────────────────────────────────────────────┐
 │                    Key Exchange Flow                         │
 ├─────────────────────────────────────────────────────────────┤
 │                                                             │
 │  Option A: QR Code (in-person)                              │
 │    Alice generates: QR(group_id || group_secret)            │
 │    Bob scans → joins MLS-Lite group                         │
 │                                                             │
 │  Option B: MLS Bootstrap (hybrid)                           │
 │    1. Alice & Bob establish full MLS group over Internet    │
 │    2. Export current epoch key as MLS-Lite group_secret     │
 │    3. Both can now communicate over LoRa using MLS-Lite     │
 │    4. When Internet available, re-sync to full MLS          │
 │                                                             │
 │  Option C: Pre-Shared Key (deployment)                      │
 │    Org distributes group_secret to all devices              │
 │    Like Meshtastic channel key, but with replay protection  │
 │                                                             │
 └─────────────────────────────────────────────────────────────┘
 ```
 ### Key Rotation
 MLS-Lite does NOT have automatic post-compromise security. Manual rotation:
 ```
 Rotation trigger:
  - Periodic (e.g., weekly)
  - Member leaves group
  - Suspected compromise
 Rotation process:
  1. New group_secret generated (QR code, or via full MLS if available)
  2. epoch incremented
  3. Old key deleted after grace period
  4. Devices that miss rotation must re-join
 ```
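Steps 2 and 3 of the rotation process imply that a receiver briefly holds two keys. A minimal sketch of that bookkeeping follows; `EpochKeys` and its methods are hypothetical names, time is passed in as plain seconds, and a real implementation would also zeroize the old key on deletion.

```rust
/// Holds the current epoch key and, during the grace period, the previous one.
pub struct EpochKeys {
    current: (u16, [u8; 32]),
    /// (epoch, key, delete_at) for the key being phased out.
    previous: Option<(u16, [u8; 32], u64)>,
    grace_secs: u64,
}

impl EpochKeys {
    pub fn new(epoch: u16, key: [u8; 32], grace_secs: u64) -> Self {
        Self { current: (epoch, key), previous: None, grace_secs }
    }

    /// Install a freshly generated key as the next epoch. The old key stays
    /// usable for decryption until `now + grace_secs`.
    pub fn rotate(&mut self, new_key: [u8; 32], now: u64) {
        let (old_epoch, old_key) = self.current;
        self.previous = Some((old_epoch, old_key, now + self.grace_secs));
        self.current = (old_epoch.wrapping_add(1), new_key);
    }

    /// Key to decrypt a message tagged with `epoch`, if we still hold it.
    /// Drops the previous key once its grace period has passed.
    pub fn key_for(&mut self, epoch: u16, now: u64) -> Option<[u8; 32]> {
        if let Some((_, _, delete_at)) = self.previous {
            if now >= delete_at {
                self.previous = None;
            }
        }
        if epoch == self.current.0 {
            return Some(self.current.1);
        }
        match self.previous {
            Some((e, k, _)) if e == epoch => Some(k),
            _ => None,
        }
    }

    /// Epoch to use when sending.
    pub fn current_epoch(&self) -> u16 {
        self.current.0
    }
}
```

A message tagged with an epoch the device does not hold returns `None`, which is the point where a device that missed the rotation has to re-join.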
 ### Upgrade to Full MLS
 When faster transport becomes available:
 ```
 ┌─────────────────────────────────────────────────────────────┐
 │                    MLS-Lite → MLS Upgrade                    │
 ├─────────────────────────────────────────────────────────────┤
 │                                                             │
 │  1. Device detects QUIC/TCP connectivity                    │
 │  2. Contacts server, fetches peer KeyPackages               │
 │  3. Creates full MLS group with same group_id               │
 │  4. Sends MLS Welcome to all known members                  │
 │  5. Members upgrade to full MLS                             │
 │  6. MLS-Lite continues in parallel for LoRa-only members   │
 │                                                             │
 │  Bridging:                                                  │
 │    - Gateway nodes (CAP_GATEWAY) translate between modes    │
 │    - Full MLS message → re-encrypt as MLS-Lite for LoRa     │
 │    - MLS-Lite message → forward as MLS AppMessage           │
 │                                                             │
 └─────────────────────────────────────────────────────────────┘
 ```
 ---
 ## Security Analysis
 ### What MLS-Lite Provides
 | Property | Full MLS | MLS-Lite | Notes |
 |----------|----------|----------|-------|
 | **Confidentiality** | ✓ | ✓ | ChaCha20-Poly1305 |
 | **Integrity** | ✓ | ✓ | Poly1305 MAC |
 | **Replay protection** | ✓ | ✓ | Sequence numbers |
 | **Sender auth (group)** | ✓ | ✓ | Only group members can encrypt |
 | **Sender auth (individual)** | ✓ | Optional | Ed25519 signature (64 bytes) |
 | **Forward secrecy** | ✓ | Partial | Only on manual epoch rotation |
 | **Post-compromise security** | ✓ | ✗ | No automatic healing |
 | **Transcript consistency** | ✓ | ✗ | No ratchet tree |
 | **Deniability** | ✗ | ✗ | Neither provides this |
 ### Threat Model
 **Protected against:**
 - Passive eavesdropping, including by a quantum adversary as long as group_secret itself was exchanged out-of-band or via a PQ KEM
 - Message replay (sequence numbers)
 - Message tampering (AEAD)
 - Outsider injection (need group_secret)
 **NOT protected against:**
 - Compromised group member reading all traffic (no PCS)
 - Long-term key compromise without manual rotation
 - Relay node with group_secret (but they're in the group anyway)
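Replay protection by sequence number needs a little state per sender, because mesh delivery can reorder messages. The sketch below uses a 64-entry sliding-window bitmap in the style of IPsec and DTLS anti-replay; the window size and the names `ReplayGuard` and `accept` are assumptions, not part of the design above.

```rust
use std::collections::HashMap;

/// Sliding window over the last 64 sequence numbers of one sender.
#[derive(Default)]
struct Window {
    highest: u32,
    bitmap: u64, // bit i set = (highest - i) has been seen
    seen_any: bool,
}

impl Window {
    fn check_and_update(&mut self, seq: u32) -> bool {
        if !self.seen_any {
            self.seen_any = true;
            self.highest = seq;
            self.bitmap = 1;
            return true;
        }
        if seq > self.highest {
            // Newer than anything seen: slide the window forward.
            let shift = seq - self.highest;
            self.bitmap = if shift >= 64 { 1 } else { (self.bitmap << shift) | 1 };
            self.highest = seq;
            return true;
        }
        let offset = self.highest - seq;
        if offset >= 64 {
            return false; // too old to track, reject
        }
        let bit = 1u64 << offset;
        if self.bitmap & bit != 0 {
            return false; // already seen, replay
        }
        self.bitmap |= bit;
        true
    }
}

/// Per-(epoch, sender) replay tracking.
#[derive(Default)]
pub struct ReplayGuard {
    windows: HashMap<(u16, [u8; 4]), Window>,
}

impl ReplayGuard {
    /// Returns true if `seq` is fresh for this sender and records it.
    /// Call this only after the AEAD tag has verified, so that forged
    /// packets cannot advance the window.
    pub fn accept(&mut self, epoch: u16, sender: [u8; 4], seq: u32) -> bool {
        self.windows
            .entry((epoch, sender))
            .or_default()
            .check_and_update(seq)
    }
}
```

With 4-byte truncated sender addresses, two members can collide on `sender_addr`, in which case they would share a window and some legitimate messages could be rejected.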
 ### Comparison to Meshtastic
 | Property | Meshtastic | MLS-Lite |
 |----------|------------|----------|
 | **Encryption** | AES-256-CTR | ChaCha20-Poly1305 |
 | **Authentication** | None (shared key) | Optional Ed25519 |
 | **Replay protection** | None | Sequence numbers |
 | **Key rotation** | Manual | Manual (epoch field) |
 | **Overhead** | 16 bytes (header) | 41 bytes (no sig), 105 bytes (with sig) |
 | **Upgrade path** | None | → Full MLS |
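 MLS-Lite adds authenticated encryption, replay protection, and optional sender signatures over Meshtastic's channel crypto, at the cost of 25 more bytes of per-message overhead (41 vs. 16).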
 ---
 ## Wire Format
 ### MLS-Lite Envelope (CBOR)
 ```
 MlsLiteEnvelope = {
  0: uint,           ; version (0x02)
  1: uint,           ; flags
  2: bytes .size 8,  ; group_id
  3: bytes .size 4,  ; sender_addr
  4: uint,           ; seq
  5: uint,           ; epoch
  6: bytes .size 5,  ; nonce
  7: bytes,          ; ciphertext (includes 16-byte tag)
  ? 8: bytes .size 64 ; signature (optional)
 }
 ```
 Estimated sizes:
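 - Minimal (1-byte payload): ~51 bytes with CBOR framing (right at the SF12 MTU; fits unfragmented only while flags, seq, and epoch stay below 24)
 - 5-byte "hello": ~55 bytes (2 fragments on SF12 unless a packed, non-CBOR header is used)
 - Short message (20 bytes): ~71 bytes (2 fragments on SF12)
 - With signature: add ~67 bytes (64-byte signature plus CBOR key and length prefix)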
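The size estimates can be checked by adding up the CBOR framing defined in RFC 8949: unsigned integers below 24 take one byte, larger ones take 2, 3, 5, or 9 bytes, and byte strings carry a length prefix encoded the same way. The sketch below applies those rules to the map shown above. The function names are hypothetical, and it assumes definite-length encoding with the integer keys 0 to 8.

```rust
/// Encoded length of a CBOR unsigned integer (also the length of a
/// definite-length map or byte-string header carrying that count).
fn uint_len(v: u64) -> usize {
    match v {
        0..=23 => 1,
        24..=0xFF => 2,
        0x100..=0xFFFF => 3,
        0x1_0000..=0xFFFF_FFFF => 5,
        _ => 9,
    }
}

/// Encoded length of a CBOR byte string holding `n` bytes.
fn bstr_len(n: usize) -> usize {
    uint_len(n as u64) + n
}

/// Total encoded size of an MlsLiteEnvelope for the given field values.
/// `payload_len` is the plaintext length; the 16-byte tag is added here.
pub fn cbor_envelope_len(
    flags: u8,
    seq: u32,
    epoch: u16,
    payload_len: usize,
    signed: bool,
) -> usize {
    let entries: usize = if signed { 9 } else { 8 };
    let mut len = uint_len(entries as u64); // map header
    len += entries; // integer keys 0..=8, one byte each
    len += uint_len(0x02); // version
    len += uint_len(u64::from(flags));
    len += bstr_len(8); // group_id
    len += bstr_len(4); // sender_addr
    len += uint_len(u64::from(seq));
    len += uint_len(u64::from(epoch));
    len += bstr_len(5); // nonce
    len += bstr_len(payload_len + 16); // ciphertext incl. tag
    if signed {
        len += bstr_len(64); // signature
    }
    len
}
```

The CBOR framing costs about 10 bytes over the packed 41-byte layout, and the total grows as `seq` and `epoch` increase, which matters when the target is a 51-byte MTU.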
 ### MeshEnvelope Mode Flag
 Extend MeshEnvelope to indicate crypto mode:
 ```rust
 pub struct MeshEnvelope {
    // ... existing fields ...
    /// Crypto mode: 0x00 = full MLS, 0x02 = MLS-Lite
    pub crypto_mode: u8,
 }
 ```
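Since `crypto_mode` arrives as a raw byte from the wire, receivers need a defined behaviour for values other than 0x00 and 0x02. A small sketch of strict parsing follows; the enum and error type names are hypothetical, and rejecting unknown values (rather than falling back to a default mode) is a design assumption.

```rust
use std::convert::TryFrom;

/// Crypto mode byte carried in MeshEnvelope.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum CryptoMode {
    FullMls = 0x00,
    MlsLite = 0x02,
}

/// Error returned for a mode byte this node does not understand.
#[derive(Debug, PartialEq, Eq)]
pub struct UnknownCryptoMode(pub u8);

impl TryFrom<u8> for CryptoMode {
    type Error = UnknownCryptoMode;

    fn try_from(byte: u8) -> Result<Self, Self::Error> {
        match byte {
            0x00 => Ok(CryptoMode::FullMls),
            0x02 => Ok(CryptoMode::MlsLite),
            other => Err(UnknownCryptoMode(other)),
        }
    }
}
```

Failing closed on unknown values keeps a future mode byte from being silently decrypted with the wrong scheme by older nodes.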
 ---
 ## Implementation Plan
 ### Phase 1: Core MLS-Lite
 1. [ ] Define `MlsLiteEnvelope` struct
 2. [ ] Implement key derivation (HKDF)
 3. [ ] Implement encrypt/decrypt (ChaCha20-Poly1305)
 4. [ ] Add sequence number tracking (replay window)
 5. [ ] Add CBOR serialization
 6. [ ] Unit tests
 ### Phase 2: Integration
 1. [ ] Add `crypto_mode` to TransportManager routing decisions
 2. [ ] Implement QR code key exchange (generate/scan)
 3. [ ] Add `/mesh lite-create <name>` REPL command
 4. [ ] Add `/mesh lite-join <qr-data>` REPL command
 5. [ ] Integration tests with LoRaMockMedium
 ### Phase 3: Gateway/Bridge
 1. [ ] Implement MLS → MLS-Lite translation in gateway nodes
 2. [ ] Add CAP_GATEWAY capability flag
 3. [ ] Handle epoch sync between modes
 4. [ ] End-to-end test: QUIC client → gateway → LoRa client
 ---
 ## Open Questions
 1. **Signature vs. no signature?**
   - Signatures add 64 bytes (1-2 extra fragments on SF12)
   - Without signatures, any group member can spoof any sender
   - Proposal: configurable, default to signatures on SF7-9, skip on SF10-12
 2. **Epoch sync without server?**
   - How do LoRa-only nodes learn about epoch changes?
   - Proposal: Include epoch in announce, peers relay epoch updates
 3. **Post-quantum group_secret?**
   - MLS-Lite uses symmetric crypto (quantum-safe for confidentiality)
   - Key exchange is vulnerable if using X25519
   - Proposal: QR code includes ML-KEM-768 encapsulation for PQ key exchange
 4. **Compatibility with Reticulum/LXMF?**
   - Should we use msgpack instead of CBOR for LXMF compat?
   - Should we implement LXMF as an additional mode?
 ---
 ## References
 - [MLS RFC 9420](https://datatracker.ietf.org/doc/rfc9420/) — Full MLS spec
 - [ChaCha20-Poly1305 RFC 8439](https://datatracker.ietf.org/doc/rfc8439/)
 - [HKDF RFC 5869](https://datatracker.ietf.org/doc/rfc5869/)
 - [Meshtastic Encryption](https://meshtastic.org/docs/overview/encryption/)
 - [Reticulum LXMF](https://github.com/markqvist/LXMF)
 ---
 *Last updated: 2026-03-30*