# Mesh Protocol Gaps — Honest Assessment & Action Plan

> **Goal:** Identify real weaknesses in QuicProChat's mesh protocol compared to
> Reticulum, Meshtastic, and LXMF. Plan concrete improvements.
>
> Created: 2026-03-30

---

## Executive Summary

QuicProChat has strong cryptography (MLS, PQ-KEM) but **real gaps** in the mesh layer:

| Gap | Severity | Status |
|-----|----------|--------|
| MLS overhead too large for LoRa | **Critical** | **MEASURED** — see actual sizes below |
| No lightweight messaging mode | **High** | **DONE** — MLS-Lite implemented |
| KeyPackage distribution over mesh | **High** | Not solved |
| Announce/routing not battle-tested | **Medium** | S3-S4 done, needs real-world test |
| No DTN bundle protocol integration | **Medium** | Priority field added |
| Battery/duty-cycle optimization | **Medium** | Basic tracker exists |

---

## Gap 1: MLS Overhead is Prohibitive for Constrained Links

### The Problem

**MLS was designed for Internet messaging, not LoRa.**

### Actual Measured Sizes (2026-03-30)

| Component | Size (bytes) | LoRa SF12 fragments | Est. TX airtime |
|-----------|--------------|---------------------|------------|
| **MLS KeyPackage** | 306 | 6 | ~4 sec |
| **MLS Welcome** | 840 | 17 | ~10 sec |
| **MLS Commit (add)** | 736 | 15 | ~9 sec |
| **MLS AppMessage (5B)** | 143 | 3 | ~2 sec |
| **MLS Commit (update)** | 544 | 11 | ~7 sec |
| **MLS KeyPackage (PQ)** | 2,676 | 53 | ~32 sec |
| **MLS Welcome (PQ)** | 5,504 | 108 | ~65 sec |
| **MeshEnvelope V1 (CBOR)** | 410 | 9 | ~5 sec |
| **MeshEnvelope V2 (truncated)** | 336 | 7 | ~4 sec |
| **MLS-Lite (no sig)** | 129 | 3 | ~2 sec |
| **MLS-Lite (with sig)** | 262 | 6 | ~4 sec |
| Reticulum LXMF | ~100-150 | 2-3 | ~1-2 sec |
| Meshtastic max | 237 | 5 | ~3 sec |

**Key insights:**

- Classical MLS is **viable** for LoRa — 6 fragments for KeyPackage
- Post-quantum hybrid MLS is **prohibitive** — 53+ fragments for KeyPackage
- MLS-Lite matches Meshtastic efficiency while adding proper auth
- **Total group setup** (KeyPackage + Welcome): ~23 fragments, ~14 sec

**The math NOW works for classical MLS on LoRa:**

- LoRa SF12/BW125: ~51 byte MTU, ~300 bps effective
- EU868 duty cycle: 1% = 36 seconds TX per hour
- **One MLS KeyPackage = 6 fragments = 4 sec = acceptable**
- **Group setup = 14 sec = ~39% of the 36-second hourly duty budget, but feasible**
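
The fragment counts and budget share above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming the ~51-byte MTU is fully available for payload (no per-fragment header, which would raise the counts) and taking the airtime figures from the table as given. The function and constant names are hypothetical and not part of the codebase.

```rust
/// Assumed usable payload per LoRa frame at SF12/BW125 (figure quoted above).
pub const LORA_SF12_MTU: usize = 51;
/// EU868 1% duty cycle: 36 seconds of transmit time per hour.
pub const DUTY_BUDGET_SECS: f64 = 36.0;

/// Number of fragments needed for `size` bytes (ceiling division),
/// ignoring any per-fragment header overhead.
pub fn fragments(size: usize, mtu: usize) -> usize {
    (size + mtu - 1) / mtu
}

/// Fraction of the hourly duty budget consumed by `airtime_secs` of TX time.
pub fn duty_budget_share(airtime_secs: f64) -> f64 {
    airtime_secs / DUTY_BUDGET_SECS
}
```

For example, `fragments(306, LORA_SF12_MTU)` gives 6 for the classical KeyPackage, and `duty_budget_share(14.0)` gives roughly 0.39 for the KeyPackage + Welcome setup.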

**Post-quantum is still problematic for constrained links.**

### Current State (Updated 2026-03-30)

- ✅ MeshEnvelope V1 uses CBOR, ~410 bytes for empty payload
- ✅ MeshEnvelope V2 uses truncated 16-byte addresses, ~336 bytes (~18% savings)
- ✅ MLS-Lite implemented: ~129 bytes without signature, ~262 with
- ✅ Classical MLS KeyPackage measured at 306 bytes (much better than expected)
- ⚠️ PQ-hybrid MLS still large (2.6KB KeyPackage)

### Proposed Solutions

#### Option A: Hybrid Crypto Modes (Recommended)

```
┌─────────────────────────────────────────────────────────────────┐
│  Mode Selection Based on Transport Capability                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  QUIC/TCP/WiFi (>10 kbps):                                     │
│    → Full MLS groups with PQ-KEM                               │
│    → KeyPackage distribution via server                        │
│    → Standard protocol                                          │
│                                                                 │
│  LoRa/Serial (<1 kbps):                                        │
│    → "MLS-Lite" mode:                                          │
│      • Pre-shared group epoch key (exchanged out-of-band)      │
│      • ChaCha20-Poly1305 symmetric encryption                  │
│      • Ed25519 signatures (64 bytes)                           │
│      • No per-message KeyPackage exchange                      │
│      • Manual key rotation via QR code or faster link          │
│                                                                 │
│  Upgrade path:                                                  │
│    When faster transport available → full MLS epoch sync       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

**Trade-off:** Lose automatic post-compromise security (PCS) on constrained links. Gain usability.
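
The mode selection in the diagram could be sketched roughly as follows. This is an illustration only: `CryptoMode`, `FAST_LINK_BPS`, and both functions are hypothetical names, and the diagram does not say what happens between 1 and 10 kbps, so this sketch assumes anything at or below 10 kbps falls back to MLS-Lite.

```rust
/// Crypto modes from the Option A diagram (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CryptoMode {
    /// Full MLS groups with PQ-KEM (QUIC/TCP/WiFi).
    FullMls,
    /// Pre-shared epoch key, ChaCha20-Poly1305, Ed25519 signatures (LoRa/serial).
    MlsLite,
}

/// Assumed threshold: links faster than this get full MLS.
pub const FAST_LINK_BPS: u32 = 10_000;

/// Pick a mode for a single link from its estimated bitrate.
/// Assumption: the unspecified 1-10 kbps band is treated as constrained.
pub fn select_mode(link_bps: u32) -> CryptoMode {
    if link_bps > FAST_LINK_BPS {
        CryptoMode::FullMls
    } else {
        CryptoMode::MlsLite
    }
}

/// Upgrade path: use the mode of the fastest transport currently available.
/// Returns `None` when no transport is up.
pub fn select_for_transports(link_bps: &[u32]) -> Option<CryptoMode> {
    link_bps.iter().copied().max().map(select_mode)
}
```

With only a ~300 bps LoRa link this yields `MlsLite`; as soon as a WiFi link appears in the list, the result becomes `FullMls`, which is the point where a full MLS epoch sync would be triggered.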

#### Option B: Compressed MLS (Research)

- Strip unused extensions from KeyPackages
- Use shorter credential identifiers (16 bytes instead of 32)
- Batch multiple KeyPackages into single transfer over fast link
- Cache and reuse KeyPackages more aggressively

**Trade-off:** Still large. May not be enough for SF12 LoRa.

#### Option C: LXMF-Compatible Mode

Implement Reticulum's LXMF format as an alternative wire format:

```rust
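/// LXMF-style message layout (Reticulum's LXMF documentation is authoritative).
pub struct LxmfMessage {
    destination: [u8; 16],   // Truncated destination hash
    source: [u8; 16],        // Truncated source hash
    signature: [u8; 64],     // Ed25519 signature
    payload: Vec<u8>,        // msgpack list: [timestamp, content, title, fields]
}
// Fixed overhead: 16 + 16 + 64 = 96 bytes
// Total: ~100-150 bytes for a short message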
```

**Trade-off:** Lose MLS group properties. Gain Reticulum interop and efficiency.
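
To make the size estimate concrete, here is a minimal framing sketch for the struct above. It assumes the fields are simply concatenated on the wire in declaration order, treats the msgpack payload as opaque bytes, and does not verify the signature. The `encode`/`decode` methods and the `LXMF_FIXED_OVERHEAD` constant are hypothetical, not an existing API.

```rust
use std::convert::TryInto;

/// Fixed bytes before the payload: destination + source + signature.
pub const LXMF_FIXED_OVERHEAD: usize = 16 + 16 + 64;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LxmfMessage {
    pub destination: [u8; 16],
    pub source: [u8; 16],
    pub signature: [u8; 64],
    pub payload: Vec<u8>, // already msgpack-encoded by the caller
}

impl LxmfMessage {
    /// Concatenate the fields in order (assumed wire layout).
    pub fn encode(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(LXMF_FIXED_OVERHEAD + self.payload.len());
        out.extend_from_slice(&self.destination);
        out.extend_from_slice(&self.source);
        out.extend_from_slice(&self.signature);
        out.extend_from_slice(&self.payload);
        out
    }

    /// Split a frame back into fields; `None` if it is shorter than the
    /// fixed overhead. The signature is NOT checked here.
    pub fn decode(bytes: &[u8]) -> Option<Self> {
        if bytes.len() < LXMF_FIXED_OVERHEAD {
            return None;
        }
        let (destination, rest) = bytes.split_at(16);
        let (source, rest) = rest.split_at(16);
        let (signature, payload) = rest.split_at(64);
        Some(Self {
            destination: destination.try_into().ok()?,
            source: source.try_into().ok()?,
            signature: signature.try_into().ok()?,
            payload: payload.to_vec(),
        })
    }
}
```

Under these assumptions every message carries 96 bytes before the payload, so the ~100-150 byte estimate leaves only a few dozen bytes for the msgpack-encoded content.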

### Action Items

- [x] **Measure actual MLS sizes** — done, see table above
- [x] **Design MLS-Lite spec** — `docs/plans/mls-lite-design.md`
- [x] **Implement MLS-Lite** — `crates/quicprochat-p2p/src/mls_lite.rs`
- [x] **Implement MeshEnvelope V2** — truncated addresses, priority field
- [ ] **Implement transport capability negotiation** in TransportManager
- [ ] **Test MLS-Lite vs full MLS on real LoRa**

---

## Gap 2: KeyPackage Distribution Over Mesh

### The Problem

MLS requires pre-positioned KeyPackages for adding members to groups. On the Internet,
a server stores KeyPackages and clients fetch them on demand. On a mesh there is **no server**.

Current flow (broken for pure mesh):
```
Alice wants to add Bob to group:
1. Alice fetches Bob's KeyPackage from server    ← requires Internet
2. Alice creates Welcome + Commit
3. Alice sends to Bob via mesh
```

### Proposed Solution: Announce-Based KeyPackage Distribution

```
Bob announces on mesh:
1. MeshAnnounce includes: identity_key, capabilities, AND current_keypackage_hash
2. Nearby nodes cache Bob's latest KeyPackage (if they have it)
3. Alice receives Bob's announce, requests KeyPackage via mesh RPC

KeyPackage propagation:
1. Bob periodically broadcasts KeyPackage update (larger message, less frequent)
2. Nodes with capacity (CAP_STORE) cache KeyPackages for relaying
3. TTL-based expiry (KeyPackages are single-use, but we can cache N of them)
```
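
The caching side of this proposal (a relay with CAP_STORE holding up to N single-use KeyPackages per peer, with TTL expiry) might look like the sketch below. `KeyPackageCache`, `NodeId`, and the method names are hypothetical; the 16-byte key is an assumption borrowed from the truncated addresses in MeshEnvelope V2, and KeyPackages are treated as opaque bytes.

```rust
use std::collections::{HashMap, VecDeque};
use std::time::{Duration, Instant};

/// Assumed 16-byte truncated mesh address.
pub type NodeId = [u8; 16];

struct Entry {
    bytes: Vec<u8>,
    expires_at: Instant,
}

/// Per-peer cache of opaque, single-use KeyPackages.
pub struct KeyPackageCache {
    per_node_limit: usize,
    ttl: Duration,
    entries: HashMap<NodeId, VecDeque<Entry>>,
}

impl KeyPackageCache {
    pub fn new(per_node_limit: usize, ttl: Duration) -> Self {
        Self {
            per_node_limit: per_node_limit.max(1),
            ttl,
            entries: HashMap::new(),
        }
    }

    /// Store a KeyPackage for `node`, dropping expired entries and evicting
    /// the oldest one if the per-node limit is reached.
    pub fn insert(&mut self, node: NodeId, bytes: Vec<u8>, now: Instant) {
        let queue = self.entries.entry(node).or_default();
        queue.retain(|e| e.expires_at > now);
        if queue.len() >= self.per_node_limit {
            queue.pop_front();
        }
        queue.push_back(Entry { bytes, expires_at: now + self.ttl });
    }

    /// Hand out the oldest unexpired KeyPackage and remove it, since each
    /// KeyPackage should be used at most once.
    pub fn take(&mut self, node: &NodeId, now: Instant) -> Option<Vec<u8>> {
        let queue = self.entries.get_mut(node)?;
        queue.retain(|e| e.expires_at > now);
        queue.pop_front().map(|e| e.bytes)
    }
}
```

Passing `now` explicitly keeps the cache testable without sleeping. A real implementation would also need to validate the KeyPackage against the hash carried in the announce, which this sketch leaves out.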

### Action Items

- [ ] **Extend MeshAnnounce** with optional `keypackage_hash` field
- [ ] **Add KeyPackage request/response** to mesh protocol
- [ ] **Implement KeyPackage cache** in MeshStore (separate from message queue)
- [ ] **Design KeyPackage refresh protocol** for mesh-only scenarios

---

## Gap 3: No DTN/Bundle Protocol Integration

### The Problem

NASA/IETF Bundle Protocol (RFC 9171) is the standard for delay-tolerant networking.
Reticulum effectively reinvented it. QuicProChat should learn from both.

Key DTN concepts we're missing:

| Concept | DTN/BPv7 | Reticulum | QuicProChat |
|---------|----------|-----------|-------------|
| **Custody transfer** | Yes | No | No |
| **Fragmentation at bundle layer** | Yes | No | Yes (LoRa transport) |
| **Convergence layer adapters** | Formal spec | Interfaces | MeshTransport trait |
| **Routing protocols** | CGR, Epidemic | Announce-based | Announce-based |
| **Priority scheduling** | Yes | No | No |

### Proposed Improvements

1. **Priority levels in MeshEnvelope** (emergency > data > announce)
2. **Custody transfer option** — intermediate node takes responsibility
3. **Better congestion control** — backpressure signals in announce
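
The first improvement (emergency > data > announce) can be sketched as a priority outbox that stays first-in, first-out within each level. This is a minimal illustration using the standard library's `BinaryHeap`; `Priority`, `PriorityOutbox`, and the sequence-number tie-break are assumptions, not the existing MeshStore design.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// Priority levels, lowest to highest (hypothetical encoding).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Priority {
    Announce = 0,
    Data = 1,
    Emergency = 2,
}

struct Queued {
    priority: Priority,
    seq: u64, // insertion order, used as a FIFO tie-break
    payload: Vec<u8>,
}

impl Ord for Queued {
    fn cmp(&self, other: &Self) -> Ordering {
        // Higher priority first; within a level, the older (lower seq) first.
        self.priority
            .cmp(&other.priority)
            .then_with(|| other.seq.cmp(&self.seq))
    }
}
impl PartialOrd for Queued {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl PartialEq for Queued {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Queued {}

/// Outbound queue that always yields the most urgent message next.
#[derive(Default)]
pub struct PriorityOutbox {
    heap: BinaryHeap<Queued>,
    next_seq: u64,
}

impl PriorityOutbox {
    pub fn push(&mut self, priority: Priority, payload: Vec<u8>) {
        let seq = self.next_seq;
        self.next_seq += 1;
        self.heap.push(Queued { priority, seq, payload });
    }

    pub fn pop(&mut self) -> Option<(Priority, Vec<u8>)> {
        self.heap.pop().map(|q| (q.priority, q.payload))
    }
}
```

A strict ordering like this can starve announces under sustained data load; whether that is acceptable, or whether aging is needed, is a design question the sketch does not answer.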

### Action Items

- [x] **Add priority field** to MeshEnvelope (done in MeshEnvelope V2)
- [ ] **Research custody transfer** — is it worth the complexity?
- [ ] **Implement priority queue** in MeshStore and DutyCycleTracker

---

## Gap 4: Battery/Duty-Cycle Optimization

### The Problem

Briar reportedly drains the battery about 4x faster because of constant Bluetooth
scanning. We claim to do better but haven't proven it.

Current state:
- DutyCycleTracker enforces EU868 1% limit
- Announce interval is configurable (default 10 min)
- No adaptive power management
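
For reference, the 1% limit amounts to a rolling budget of 36 seconds of airtime per hour. The sketch below shows one way to enforce it with a sliding window; it is an illustration under that assumption, not the actual DutyCycleTracker, and `DutyBudget` and its methods are hypothetical names.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Sliding-window airtime budget (hypothetical stand-in for a duty-cycle tracker).
pub struct DutyBudget {
    window: Duration,
    budget: Duration,
    sent: VecDeque<(Instant, Duration)>, // (start time, airtime)
}

impl DutyBudget {
    /// EU868 1% duty cycle: 36 s of TX per rolling hour.
    pub fn eu868() -> Self {
        Self {
            window: Duration::from_secs(3600),
            budget: Duration::from_secs(36),
            sent: VecDeque::new(),
        }
    }

    /// Airtime used inside the window ending at `now`.
    fn used(&mut self, now: Instant) -> Duration {
        while let Some(&(at, _)) = self.sent.front() {
            if now.saturating_duration_since(at) >= self.window {
                self.sent.pop_front();
            } else {
                break;
            }
        }
        self.sent.iter().map(|&(_, airtime)| airtime).sum()
    }

    /// Record the transmission and return true if it fits the budget;
    /// otherwise return false and record nothing.
    pub fn try_send(&mut self, airtime: Duration, now: Instant) -> bool {
        if self.used(now) + airtime > self.budget {
            return false;
        }
        self.sent.push_back((now, airtime));
        true
    }
}
```

With the ~14 sec group setup figure, two setups fit in an hour and a third is refused until the first ages out of the window.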

### Proposed Improvements

1. **Adaptive announce interval** — more frequent when there is activity, less frequent when idle
2. **Listen-before-talk** — don't TX if channel is busy (LoRa CAD)
3. **Scheduled wake windows** — coordinate with peers for efficient sync
4. **Power profiles** — "always-on", "hourly-sync", "manual-only"
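
Items 1 and 4 could be combined into a single policy function, sketched below. The profile names come from the list above and the 10-minute default from the current configuration; everything else (halving the interval within a minute of activity, doubling per idle period, the one-hour cap, and all identifiers) is a hypothetical policy chosen for illustration.

```rust
use std::time::Duration;

/// Power profiles named in the proposal (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PowerProfile {
    AlwaysOn,
    HourlySync,
    ManualOnly,
}

/// Current default announce interval (10 min).
pub const DEFAULT_INTERVAL_SECS: u64 = 600;
/// Assumed upper bound for backed-off announces.
pub const MAX_INTERVAL_SECS: u64 = 3600;
/// Assumed definition of "recent activity".
const ACTIVE_WINDOW_SECS: u64 = 60;

/// Next announce interval, or `None` if announces are only sent on request.
pub fn announce_interval(profile: PowerProfile, idle_for: Duration) -> Option<Duration> {
    let secs = match profile {
        PowerProfile::ManualOnly => return None,
        PowerProfile::HourlySync => MAX_INTERVAL_SECS,
        // Recent traffic: announce twice as often as the default.
        PowerProfile::AlwaysOn if idle_for.as_secs() < ACTIVE_WINDOW_SECS => {
            DEFAULT_INTERVAL_SECS / 2
        }
        // Idle: double the interval per elapsed default period, capped.
        PowerProfile::AlwaysOn => {
            let doublings = (idle_for.as_secs() / DEFAULT_INTERVAL_SECS).min(3) as u32;
            (DEFAULT_INTERVAL_SECS << doublings).min(MAX_INTERVAL_SECS)
        }
    };
    Some(Duration::from_secs(secs))
}
```

Whether exponential backoff is the right curve is something the planned power measurements would have to confirm; the sketch only shows where such a policy would plug in.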

### Action Items

- [ ] **Implement CAD (Channel Activity Detection)** in LoRaTransport
- [ ] **Add power profile config** to P2pNode
- [ ] **Measure actual power consumption** with real hardware

---

## Gap 5: Real-World Testing

### The Problem

All our mesh code runs against mocks. We claim LoRa support but haven't tested
with real radios.

### Testing Plan

| Test | Hardware | Status |
|------|----------|--------|
| LoRa point-to-point | 2x SX1262 dev boards | Not started |
| LoRa multi-hop | 3x SX1262, different rooms | Not started |
| Mixed transport | LoRa + WiFi relay | Not started |
| Outdoor range test | LoRa, line-of-sight 1km | Not started |
| Duty cycle compliance | SDR spectrum analyzer | Not started |

### Action Items

- [ ] **Procure hardware** — 3x Heltec LoRa32 or similar
- [ ] **Implement UART LoRaTransport** for real modems
- [ ] **Create test harness** for automated multi-node testing
- [ ] **Document actual performance** numbers

---

## Gap 6: Comparison Claims Need Verification

### The Problem

Our positioning doc claims superiority over Meshtastic/Reticulum/Briar, but:

- We haven't measured our actual overhead vs. theirs
- We haven't tested interop scenarios
- We haven't run security analysis against their threat models

### Verification Plan

| Claim | How to Verify |
|-------|---------------|
| "MLS is better than shared-key AES" | Threat model comparison doc |
| "Multi-hop works" | Integration test with 5+ nodes |
| "LoRa-ready" | Actual LoRa hardware test |
| "Post-quantum protects groups" | Verify hybrid KEM in MLS path |
| "Relay nodes can't read content" | Formal verification of E2E path |

### Action Items

- [ ] **Create benchmark suite** comparing message sizes
- [ ] **Write threat model comparison** doc (Meshtastic CVEs, Reticulum link-level)
- [ ] **Fuzz test** mesh envelope parsing
- [ ] **Get external review** of mesh crypto design

---

## Implementation Priority

### Phase 1: Make It Work (Next 2 Sprints)

1. **S4: Multi-hop routing** — complete the core mesh functionality
2. **S5: Truncated addresses** — reduce envelope overhead
3. **Measure actual sizes** — know the real numbers

### Phase 2: Make It Efficient (Following 2 Sprints)

4. **Design MLS-Lite** — spec for constrained links
5. **Priority queue** — emergency messages first
6. **Hardware testing** — real LoRa validation

### Phase 3: Make It Production-Ready

7. **KeyPackage distribution** — mesh-native key exchange
8. **Power profiles** — battery optimization
9. **External review** — security audit of mesh layer

---

## Success Metrics

| Metric | Previous | Current | Target |
|--------|----------|---------|--------|
| MeshEnvelope overhead (empty) | ~410 bytes | ~336 (V2) | ✅ Done |
| MLS-Lite message (no sig) | N/A | ~129 bytes | ✅ Done |
| Time to send "hello" over SF12 LoRa | ~27 sec | ~4 sec (MLS-Lite) | ✅ Done |
| KeyPackage exchange over mesh | Not possible | Pending | Works |
| Multi-hop message delivery | Mock only | Code complete | Real hardware |
| Battery life (mesh mode) | Unknown | Unknown | Measured |

---

## Honest Assessment

**What we do well:**
- MLS group crypto is genuinely better than Meshtastic/Reticulum
- Transport abstraction is clean
- Announce protocol is solid
- **NEW: Classical MLS KeyPackage (306B) is actually LoRa-viable**
- **NEW: MLS-Lite provides Meshtastic-level efficiency with real auth**

**What we still need to fix:**
- No solution for KeyPackage distribution without server
- No real-world testing with actual LoRa hardware
- Post-quantum hybrid mode too large for constrained links

**What we can now claim:**
- "MLS on LoRa" — YES, classical MLS works with ~14 sec group setup
- "MLS-Lite for constrained" — YES, ~2-4 sec messages with auth
- "Post-quantum on LoRa" — NO, hybrid mode is impractical (2.6KB KeyPackage)
- "Production-ready" — NO, still research-stage, pending hardware tests

---

*Last updated: 2026-03-30*