docs: add mesh protocol gap analysis and MLS-Lite design

Honest assessment of QuicProChat vs Reticulum/Meshtastic/Briar:
- MLS overhead (500-800 byte KeyPackages) impractical for SF12 LoRa
- KeyPackage distribution over mesh unsolved
- No lightweight mode for constrained links

MLS-Lite design proposes 41-byte overhead symmetric mode:
- ChaCha20-Poly1305 with HKDF key derivation
- Optional Ed25519 signatures
- Upgrade path to full MLS when faster transport available
- QR code / out-of-band key exchange
2026-03-30 23:29:44 +02:00
parent f9ac921a0c
commit 01bc2a4273
2 changed files with 648 additions and 0 deletions
--- a/docs/plans/mesh-protocol-gaps.md
+++ b/docs/plans/mesh-protocol-gaps.md
@@ -0,0 +1,323 @@
+# Mesh Protocol Gaps — Honest Assessment & Action Plan
+
+> **Goal:** Identify real weaknesses in QuicProChat's mesh protocol compared to
+> Reticulum, Meshtastic, and LXMF. Plan concrete improvements.
+>
+> Created: 2026-03-30
+
+---
+
+## Executive Summary
+
+QuicProChat has strong cryptography (MLS, PQ-KEM) but **real gaps** in the mesh layer:
+
+| Gap | Severity | Status |
+|-----|----------|--------|
+| MLS overhead too large for LoRa | **Critical** | Needs design work |
+| No lightweight messaging mode | **High** | Not started |
+| KeyPackage distribution over mesh | **High** | Not solved |
+| Announce/routing not battle-tested | **Medium** | S3 done, needs real-world test |
+| No DTN bundle protocol integration | **Medium** | Not started |
+| Battery/duty-cycle optimization | **Medium** | Basic tracker exists |
+
+---
+
+## Gap 1: MLS Overhead is Prohibitive for Constrained Links
+
+### The Problem
+
+**MLS was designed for Internet messaging, not LoRa.**
+
+Estimated sizes (approximate; actual sizes not yet benchmarked):
+
+| Component | Size (bytes) | LoRa SF12/BW125 airtime |
+|-----------|--------------|------------------------|
+| MLS KeyPackage | ~500-800 | 80-130 seconds |
+| MLS Welcome | ~1000-2000 | 160-320 seconds |
+| MLS Commit | ~200-500 | 32-80 seconds |
+| MLS ApplicationMessage | ~100-200 | 16-32 seconds |
+| **MeshEnvelope overhead** | ~170 (CBOR) | 27 seconds |
+| **Reticulum LXMF message** | ~100-150 | 16-24 seconds |
+| **Meshtastic payload** | ~237 max | 38 seconds |
+
+**The math doesn't work:**
+
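+- LoRa SF12/BW125: ~51 byte MTU, ~300 bps effective (the airtime column above
+  assumes ~0.16 s per byte, i.e. ~50 bps, so these two figures still need reconciling)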
+- EU868 duty cycle: 1% = 36 seconds TX per hour
+- **One MLS KeyPackage = 10-20 fragments = 80-130 seconds of airtime, roughly 2-4x the entire hour's duty budget**
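+
+A rough sketch of the arithmetic behind these bullets, using the MTU and duty
+cycle quoted in this section. The function names are illustrative and not part
+of the codebase. Throughput is a parameter so that measured values can be plugged
+in later, and airtime is payload-only (no preamble or per-fragment headers), so
+real numbers will be worse:
+
+```rust
+/// Number of link-layer fragments needed for `payload_len` bytes at a given MTU.
+pub fn fragments_needed(payload_len: usize, mtu: usize) -> usize {
+    assert!(mtu > 0, "MTU must be non-zero");
+    (payload_len + mtu - 1) / mtu // ceiling division
+}
+
+/// Payload-only airtime in seconds at `effective_bps` bits per second.
+/// Ignores preamble, headers, and per-fragment overhead.
+pub fn airtime_secs(payload_len: usize, effective_bps: f64) -> f64 {
+    (payload_len as f64 * 8.0) / effective_bps
+}
+
+/// Transmit budget in seconds per hour for a duty-cycle fraction (0.01 = 1%).
+pub fn hourly_budget_secs(duty_cycle: f64) -> f64 {
+    3600.0 * duty_cycle
+}
+
+/// Fraction of the hourly transmit budget consumed by one message.
+/// A value above 1.0 means the message cannot be sent within one hour.
+pub fn budget_fraction(payload_len: usize, effective_bps: f64, duty_cycle: f64) -> f64 {
+    airtime_secs(payload_len, effective_bps) / hourly_budget_secs(duty_cycle)
+}
+```
+
+With the 51-byte MTU, a 500-800 byte KeyPackage needs 10-16 fragments before any
+fragment headers are added, and the 1% EU868 duty cycle leaves 36 seconds of
+transmit time per hour.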
+
+### Current State
+
+- MeshEnvelope uses CBOR, ~170 bytes overhead for a short message
+- MLS operations happen at application layer, not optimized for mesh
+- No fallback to lighter crypto for constrained links
+
+### Proposed Solutions
+
+#### Option A: Hybrid Crypto Modes (Recommended)
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│  Mode Selection Based on Transport Capability                   │
+├─────────────────────────────────────────────────────────────────┤
+│                                                                 │
+│  QUIC/TCP/WiFi (>10 kbps):                                      │
+│    → Full MLS groups with PQ-KEM                                │
+│    → KeyPackage distribution via server                         │
+│    → Standard protocol                                          │
+│                                                                 │
+│  LoRa/Serial (<1 kbps):                                         │
+│    → "MLS-Lite" mode:                                           │
+│      • Pre-shared group epoch key (exchanged out-of-band)       │
+│      • ChaCha20-Poly1305 symmetric encryption                   │
+│      • Ed25519 signatures (64 bytes)                            │
+│      • No per-message KeyPackage exchange                       │
+│      • Manual key rotation via QR code or faster link           │
+│                                                                 │
+│  Upgrade path:                                                  │
+│    When faster transport available → full MLS epoch sync        │
+│                                                                 │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+**Trade-off:** Lose automatic post-compromise security (PCS) on constrained links, since keys only rotate manually. Gain usability.
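+
+The mode selection in the diagram could be sketched as below. This is a minimal
+illustration, not the TransportManager API: `CryptoMode`, `select_mode`,
+`best_mode`, and `should_upgrade` are hypothetical names. The diagram does not say
+what happens between 1 and 10 kbps, so the sketch assumes such links count as
+constrained:
+
+```rust
+/// Crypto mode used on a link. Variant names are illustrative.
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub enum CryptoMode {
+    /// Full MLS groups with PQ-KEM.
+    FullMls,
+    /// Pre-shared epoch key with symmetric encryption (the proposed "MLS-Lite").
+    MlsLite,
+}
+
+/// Links faster than this get full MLS (the ">10 kbps" row in the diagram).
+pub const FULL_MLS_MIN_BPS: u32 = 10_000;
+
+/// Pick a mode from a link's bit rate.
+/// ASSUMPTION: the 1-10 kbps range is unspecified in the plan; this sketch
+/// treats everything at or below 10 kbps as constrained.
+pub fn select_mode(link_bps: u32) -> CryptoMode {
+    if link_bps > FULL_MLS_MIN_BPS {
+        CryptoMode::FullMls
+    } else {
+        CryptoMode::MlsLite
+    }
+}
+
+/// Mode for a node with several transports: the fastest link decides.
+/// Returns `None` when the node has no links at all.
+pub fn best_mode(link_rates_bps: &[u32]) -> Option<CryptoMode> {
+    link_rates_bps.iter().copied().max().map(select_mode)
+}
+
+/// Upgrade path: true when a group currently on MLS-Lite gains a link that is
+/// fast enough for a full MLS epoch sync.
+pub fn should_upgrade(current: CryptoMode, link_rates_bps: &[u32]) -> bool {
+    current == CryptoMode::MlsLite
+        && best_mode(link_rates_bps) == Some(CryptoMode::FullMls)
+}
+```
+
+Keeping the threshold in one constant makes it easy to revisit once real link
+measurements exist.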
+
+#### Option B: Compressed MLS (Research)
+
+- Strip unused extensions from KeyPackages
+- Use shorter credential identifiers (16 bytes instead of 32)
+- Batch multiple KeyPackages into single transfer over fast link
+- Cache and reuse KeyPackages more aggressively
+
+**Trade-off:** Still large. May not be enough for SF12 LoRa.
+
+#### Option C: LXMF-Compatible Mode
+
+Implement Reticulum's LXMF format as an alternative wire format:
+
+```rust
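+/// Sketch of an LXMF message as an alternative wire format.
+pub struct LxmfMessage {
+    destination: [u8; 16],   // Truncated destination hash
+    source: [u8; 16],        // Truncated source hash
+    signature: [u8; 64],     // Ed25519 signature
+    payload: Vec<u8>,        // msgpack list: [timestamp, content, title, fields]
+}
+// Fixed fields: 16 + 16 + 64 = 96 bytes; total ~100-150 bytes for a short message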
+```
+
+**Trade-off:** Lose MLS group properties. Gain Reticulum interop and efficiency.
+
+### Action Items
+
+- [ ] **Measure actual MLS sizes** in current implementation (benchmark)
+- [ ] **Design MLS-Lite spec** for constrained links
+- [ ] **Implement transport capability negotiation** in TransportManager
+- [ ] **Add `--constrained` mode** to MeshEnvelope for minimal overhead
+
+---
+
+## Gap 2: KeyPackage Distribution Over Mesh
+
+### The Problem
+
+MLS requires pre-positioned KeyPackages for adding members to groups. On the
+Internet, a server stores KeyPackages and clients fetch them on demand. On a mesh
+there is **no server**.
+
+Current flow (broken for pure mesh):
+```
+Alice wants to add Bob to group:
+1. Alice fetches Bob's KeyPackage from server    ← requires Internet
+2. Alice creates Welcome + Commit
+3. Alice sends to Bob via mesh
+```
+
+### Proposed Solution: Announce-Based KeyPackage Distribution
+
+```
+Bob announces on mesh:
+1. MeshAnnounce includes: identity_key, capabilities, AND current_keypackage_hash
+2. Nearby nodes cache Bob's latest KeyPackage (if they have it)
+3. Alice receives Bob's announce, requests KeyPackage via mesh RPC
+
+KeyPackage propagation:
+1. Bob periodically broadcasts KeyPackage update (larger message, less frequent)
+2. Nodes with capacity (CAP_STORE) cache KeyPackages for relaying
+3. TTL-based expiry (KeyPackages are single-use, but we can cache N of them)
+```
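+
+The caching side of this proposal (cache N KeyPackages per peer, expire them by
+TTL, hand each one out once) might look roughly like the sketch below. All names
+are hypothetical and not the MeshStore API. The 32-byte identity key size is an
+assumption, and checking a package against the announced `keypackage_hash` is
+left out because it needs a hash crate:
+
+```rust
+use std::collections::{HashMap, VecDeque};
+use std::time::{Duration, Instant};
+
+/// ASSUMPTION: identities are 32-byte public keys.
+pub type IdentityKey = [u8; 32];
+
+struct CachedKeyPackage {
+    bytes: Vec<u8>,
+    expires_at: Instant,
+}
+
+/// Per-identity cache of serialized KeyPackages with TTL-based expiry.
+pub struct KeyPackageCache {
+    per_identity_limit: usize,
+    ttl: Duration,
+    entries: HashMap<IdentityKey, VecDeque<CachedKeyPackage>>,
+}
+
+impl KeyPackageCache {
+    pub fn new(per_identity_limit: usize, ttl: Duration) -> Self {
+        assert!(per_identity_limit > 0, "limit must be at least 1");
+        Self { per_identity_limit, ttl, entries: HashMap::new() }
+    }
+
+    /// Store a KeyPackage, dropping expired ones and evicting the oldest
+    /// entry when the per-identity limit is reached.
+    pub fn insert(&mut self, id: IdentityKey, bytes: Vec<u8>, now: Instant) {
+        let queue = self.entries.entry(id).or_default();
+        queue.retain(|kp| kp.expires_at > now);
+        if queue.len() >= self.per_identity_limit {
+            queue.pop_front();
+        }
+        queue.push_back(CachedKeyPackage { bytes, expires_at: now + self.ttl });
+    }
+
+    /// Remove and return the oldest unexpired KeyPackage for `id`.
+    /// It is removed because KeyPackages are single-use.
+    pub fn take(&mut self, id: &IdentityKey, now: Instant) -> Option<Vec<u8>> {
+        let queue = self.entries.get_mut(id)?;
+        queue.retain(|kp| kp.expires_at > now);
+        let taken = queue.pop_front();
+        let empty = queue.is_empty();
+        if empty {
+            self.entries.remove(id);
+        }
+        taken.map(|kp| kp.bytes)
+    }
+}
+```
+
+Passing `now` explicitly keeps the expiry logic testable without sleeping, which
+matters while the mesh code still runs against mocks.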
+
+### Action Items
+
+- [ ] **Extend MeshAnnounce** with optional `keypackage_hash` field
+- [ ] **Add KeyPackage request/response** to mesh protocol
+- [ ] **Implement KeyPackage cache** in MeshStore (separate from message queue)
+- [ ] **Design KeyPackage refresh protocol** for mesh-only scenarios
+
+---
+
+## Gap 3: No DTN/Bundle Protocol Integration
+
+### The Problem
+
+NASA/IETF Bundle Protocol (RFC 9171) is the standard for delay-tolerant networking.
+Reticulum effectively reinvented it. QuicProChat should learn from both.
+
+Key DTN concepts we're missing:
+
+| Concept | DTN/BPv7 | Reticulum | QuicProChat |
+|---------|----------|-----------|-------------|
+| **Custody transfer** | Yes | No | No |
+| **Fragmentation at bundle layer** | Yes | No | Yes (LoRa transport) |
+| **Convergence layer adapters** | Formal spec | Interfaces | MeshTransport trait |
+| **Routing protocols** | CGR, Epidemic | Announce-based | Announce-based |
+| **Priority scheduling** | Yes | No | No |
+
+### Proposed Improvements
+
+1. **Priority levels in MeshEnvelope** (emergency > data > announce)
+2. **Custody transfer option** — intermediate node takes responsibility
+3. **Better congestion control** — backpressure signals in announce
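+
+The first improvement (emergency > data > announce) can be illustrated with a
+small outbound queue. This is a sketch under assumptions: `Priority`,
+`OutboundQueue`, and the FIFO tie-break within a priority level are illustrative
+choices, not the existing MeshStore or DutyCycleTracker design:
+
+```rust
+use std::cmp::Ordering;
+use std::collections::BinaryHeap;
+
+/// Priority levels, lowest first so the derived ordering ranks Emergency highest.
+#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
+pub enum Priority {
+    Announce,
+    Data,
+    Emergency,
+}
+
+#[derive(Debug, PartialEq, Eq)]
+struct Queued {
+    priority: Priority,
+    seq: u64, // insertion counter, used for FIFO order within a priority
+    payload: Vec<u8>,
+}
+
+impl Ord for Queued {
+    fn cmp(&self, other: &Self) -> Ordering {
+        // Higher priority wins; within a priority the lower seq (older) wins.
+        self.priority
+            .cmp(&other.priority)
+            .then_with(|| other.seq.cmp(&self.seq))
+    }
+}
+
+impl PartialOrd for Queued {
+    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
+        Some(self.cmp(other))
+    }
+}
+
+/// Outbound queue that always yields the most urgent, oldest message first.
+#[derive(Default)]
+pub struct OutboundQueue {
+    heap: BinaryHeap<Queued>,
+    next_seq: u64,
+}
+
+impl OutboundQueue {
+    pub fn push(&mut self, priority: Priority, payload: Vec<u8>) {
+        let seq = self.next_seq;
+        self.next_seq += 1;
+        self.heap.push(Queued { priority, seq, payload });
+    }
+
+    pub fn pop(&mut self) -> Option<(Priority, Vec<u8>)> {
+        self.heap.pop().map(|q| (q.priority, q.payload))
+    }
+}
+```
+
+A strict priority order like this can starve announces under sustained load, so a
+real implementation would probably need an aging or quota rule on top.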
+
+### Action Items
+
+- [ ] **Add priority field** to MeshEnvelope
+- [ ] **Research custody transfer** — is it worth the complexity?
+- [ ] **Implement priority queue** in MeshStore and DutyCycleTracker
+
+---
+
+## Gap 4: Battery/Duty-Cycle Optimization
+
+### The Problem
+
+Briar reportedly drains the battery about 4x faster because of constant Bluetooth
+scanning. We claim to be better but haven't proven it.
+
+Current state:
+- DutyCycleTracker enforces EU868 1% limit
+- Announce interval is configurable (default 10 min)
+- No adaptive power management
+
+### Proposed Improvements
+
+1. **Adaptive announce interval** — more frequent when activity, less when idle
+2. **Listen-before-talk** — don't TX if channel is busy (LoRa CAD)
+3. **Scheduled wake windows** — coordinate with peers for efficient sync
+4. **Power profiles** — "always-on", "hourly-sync", "manual-only"
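+
+One possible shape for the adaptive announce interval is exponential backoff
+between a minimum and a maximum: reset to the minimum on activity, double after
+each idle announce. This is only a sketch. `AnnounceScheduler` is a hypothetical
+name, and the doubling policy and bounds are assumptions rather than a decided
+design:
+
+```rust
+use std::time::Duration;
+
+/// Announce interval that shortens on activity and backs off while idle.
+pub struct AnnounceScheduler {
+    min: Duration,
+    max: Duration,
+    current: Duration,
+}
+
+impl AnnounceScheduler {
+    pub fn new(min: Duration, max: Duration) -> Self {
+        assert!(!min.is_zero() && min <= max, "need 0 < min <= max");
+        Self { min, max, current: min }
+    }
+
+    /// Interval to wait before the next announce.
+    pub fn interval(&self) -> Duration {
+        self.current
+    }
+
+    /// Call when traffic was sent or received: announce more often again.
+    pub fn on_activity(&mut self) {
+        self.current = self.min;
+    }
+
+    /// Call after an announce that saw no traffic since the previous one:
+    /// double the interval, capped at `max`.
+    pub fn on_idle_announce(&mut self) -> Duration {
+        self.current = (self.current * 2).min(self.max);
+        self.current
+    }
+}
+```
+
+For example, with a 2-minute minimum and the current 10-minute default as the
+maximum, an idle node would reach the maximum after three idle announces.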
+
+### Action Items
+
+- [ ] **Implement CAD (Channel Activity Detection)** in LoRaTransport
+- [ ] **Add power profile config** to P2pNode
+- [ ] **Measure actual power consumption** with real hardware
+
+---
+
+## Gap 5: Real-World Testing
+
+### The Problem
+
+All our mesh code runs against mocks. We claim LoRa support but haven't tested
+with real radios.
+
+### Testing Plan
+
+| Test | Hardware | Status |
+|------|----------|--------|
+| LoRa point-to-point | 2x SX1262 dev boards | Not started |
+| LoRa multi-hop | 3x SX1262, different rooms | Not started |
+| Mixed transport | LoRa + WiFi relay | Not started |
+| Outdoor range test | LoRa, line-of-sight 1km | Not started |
+| Duty cycle compliance | SDR spectrum analyzer | Not started |
+
+### Action Items
+
+- [ ] **Procure hardware** — 3x Heltec LoRa32 or similar
+- [ ] **Implement UART LoRaTransport** for real modems
+- [ ] **Create test harness** for automated multi-node testing
+- [ ] **Document actual performance** numbers
+
+---
+
+## Gap 6: Comparison Claims Need Verification
+
+### The Problem
+
+Our positioning doc claims superiority over Meshtastic/Reticulum/Briar, but:
+
+- We haven't measured our actual overhead vs. theirs
+- We haven't tested interop scenarios
+- We haven't run security analysis against their threat models
+
+### Verification Plan
+
+| Claim | How to Verify |
+|-------|---------------|
+| "MLS is better than shared-key AES" | Threat model comparison doc |
+| "Multi-hop works" | Integration test with 5+ nodes |
+| "LoRa-ready" | Actual LoRa hardware test |
+| "Post-quantum protects groups" | Verify hybrid KEM in MLS path |
+| "Relay nodes can't read content" | Formal verification of E2E path |
+
+### Action Items
+
+- [ ] **Create benchmark suite** comparing message sizes
+- [ ] **Write threat model comparison** doc (Meshtastic CVEs, Reticulum link-level)
+- [ ] **Fuzz test** mesh envelope parsing
+- [ ] **Get external review** of mesh crypto design
+
+---
+
+## Implementation Priority
+
+### Phase 1: Make It Work (Next 2 Sprints)
+
+1. **S4: Multi-hop routing** — complete the core mesh functionality
+2. **S5: Truncated addresses** — reduce envelope overhead
+3. **Measure actual sizes** — know the real numbers
+
+### Phase 2: Make It Efficient (Following 2 Sprints)
+
+4. **Design MLS-Lite** — spec for constrained links
+5. **Priority queue** — emergency messages first
+6. **Hardware testing** — real LoRa validation
+
+### Phase 3: Make It Production-Ready
+
+7. **KeyPackage distribution** — mesh-native key exchange
+8. **Power profiles** — battery optimization
+9. **External review** — security audit of mesh layer
+
+---
+
+## Success Metrics
+
+| Metric | Current | Target |
+|--------|---------|--------|
+| MeshEnvelope overhead (short msg) | ~170 bytes | <100 bytes |
+| Time to send "hello" over SF12 LoRa | ~27 sec | <15 sec |
+| KeyPackage exchange over mesh | Not possible | Works |
+| Multi-hop message delivery | Mock only | Real hardware |
+| Battery life (mesh mode) | Unknown | Measured & documented |
+
+---
+
+## Honest Assessment
+
+**What we do well:**
+- MLS group crypto is genuinely better than Meshtastic/Reticulum
+- Transport abstraction is clean
+- Announce protocol is solid
+
+**What we need to fix:**
+- MLS overhead makes LoRa impractical for group setup
+- No solution for KeyPackage distribution without server
+- No real-world testing yet
+
+**What we should acknowledge in marketing:**
+- "Best crypto for mesh" is true, but with caveats
+- "LoRa-ready" means "designed for LoRa, pending optimization"
+- We're research-stage, not production-ready
+
+---
+
+*Last updated: 2026-03-30*