Christian Nennemann 01bc2a4273 docs: add mesh protocol gap analysis and MLS-Lite design

Honest assessment of QuicProChat vs Reticulum/Meshtastic/Briar:
- MLS overhead (500-800 byte KeyPackages) impractical for SF12 LoRa
- KeyPackage distribution over mesh unsolved
- No lightweight mode for constrained links

MLS-Lite design proposes 41-byte overhead symmetric mode:
- ChaCha20-Poly1305 with HKDF key derivation
- Optional Ed25519 signatures
- Upgrade path to full MLS when faster transport available
- QR code / out-of-band key exchange

2026-03-30 23:29:44 +02:00

Mesh Protocol Gaps — Honest Assessment & Action Plan

Goal: Identify real weaknesses in QuicProChat's mesh protocol compared to Reticulum, Meshtastic, and LXMF. Plan concrete improvements.

Created: 2026-03-30

Executive Summary

QuicProChat has strong cryptography (MLS, PQ-KEM) but real gaps in the mesh layer:

| Gap | Severity | Status |
| --- | --- | --- |
| MLS overhead too large for LoRa | Critical | Needs design work |
| No lightweight messaging mode | High | Not started |
| KeyPackage distribution over mesh | High | Not solved |
| Announce/routing not battle-tested | Medium | S3 done, needs real-world test |
| No DTN bundle protocol integration | Medium | Not started |
| Battery/duty-cycle optimization | Medium | Basic tracker exists |

Gap 1: MLS Overhead is Prohibitive for Constrained Links

The Problem

MLS was designed for Internet messaging, not LoRa.

Measured sizes (approximate):

| Component | Size (bytes) | LoRa SF12/BW125 airtime |
| --- | --- | --- |
| MLS KeyPackage | ~500-800 | 80-130 seconds |
| MLS Welcome | ~1000-2000 | 160-320 seconds |
| MLS Commit | ~200-500 | 32-80 seconds |
| MLS ApplicationMessage | ~100-200 | 16-32 seconds |
| MeshEnvelope overhead | ~170 (CBOR) | 27 seconds |
| Reticulum LXMF message | ~100-150 | 16-24 seconds |
| Meshtastic payload | ~237 max | 38 seconds |

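The math doesn't work:

LoRa SF12/BW125: ~51 byte MTU, ~300 bps effective
EU868 duty cycle: 1% = 36 seconds TX per hour
One MLS KeyPackage (~500-800 bytes) = at least 10-16 fragments of 51 bytes, more once per-fragment headers are counted
At the airtimes in the table above (80-130 seconds), one KeyPackage costs two to four hours of duty budget

Note that the airtime column above corresponds to about 50 bps (170 bytes in 27 seconds), not the ~300 bps quoted here. At 300 bps a KeyPackage would take roughly 13-21 seconds. Hardware measurement should settle which figure is right.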
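
The arithmetic above can be reproduced with a few helper functions. This is a minimal sketch, not QuicProChat code: all function names are hypothetical, the fragment count ignores per-fragment headers (so it is a lower bound), and the effective bit rate is left as a parameter because the ~300 bps figure and the airtime table (about 50 bps) disagree.

```rust
/// Lower bound on the number of fragments for `payload_len` bytes at a given MTU.
/// Per-fragment headers are ignored (assumption), so real counts will be higher.
pub fn fragments_needed(payload_len: usize, mtu: usize) -> usize {
    assert!(mtu > 0, "MTU must be non-zero");
    (payload_len + mtu - 1) / mtu
}

/// Airtime in seconds for `payload_len` bytes at an effective throughput in bits/s.
pub fn airtime_secs(payload_len: usize, effective_bps: f64) -> f64 {
    (payload_len as f64 * 8.0) / effective_bps
}

/// Transmit budget in seconds per hour for a duty-cycle fraction (0.01 = 1%).
pub fn hourly_budget_secs(duty_cycle: f64) -> f64 {
    3600.0 * duty_cycle
}

/// Hours of duty budget consumed by one transmission lasting `airtime` seconds.
pub fn budget_hours(airtime: f64, duty_cycle: f64) -> f64 {
    airtime / hourly_budget_secs(duty_cycle)
}
```

For example, `fragments_needed(800, 51)` gives 16 fragments, and `budget_hours(80.0, 0.01)` shows that an 80-second transmission uses a little over two hours of a 1% duty budget.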

Current State

MeshEnvelope uses CBOR, ~170 bytes overhead for a short message
MLS operations happen at application layer, not optimized for mesh
No fallback to lighter crypto for constrained links

Proposed Solutions

Option A: Hybrid Crypto Modes (Recommended)

┌─────────────────────────────────────────────────────────────────┐
│  Mode Selection Based on Transport Capability                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  QUIC/TCP/WiFi (>10 kbps):                                     │
│    → Full MLS groups with PQ-KEM                               │
│    → KeyPackage distribution via server                        │
│    → Standard protocol                                          │
│                                                                 │
│  LoRa/Serial (<1 kbps):                                        │
│    → "MLS-Lite" mode:                                          │
│      • Pre-shared group epoch key (exchanged out-of-band)      │
│      • ChaCha20-Poly1305 symmetric encryption                  │
│      • Ed25519 signatures (64 bytes)                           │
│      • No per-message KeyPackage exchange                      │
│      • Manual key rotation via QR code or faster link          │
│                                                                 │
│  Upgrade path:                                                  │
│    When faster transport available → full MLS epoch sync       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Trade-off: Lose automatic post-compromise security (PCS) on constrained links. Gain usability.
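
The mode selection in the diagram could be expressed roughly as follows. This is a sketch under assumptions: `CryptoMode` and `select_mode` are hypothetical names, and because the diagram only covers links above 10 kbps and below 1 kbps, the range in between returns `None` rather than guessing a policy.

```rust
/// Crypto mode chosen per transport (hypothetical type, not from the codebase).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CryptoMode {
    /// Full MLS groups with PQ-KEM, KeyPackages fetched via server.
    FullMls,
    /// Pre-shared epoch key with ChaCha20-Poly1305, no KeyPackage exchange.
    MlsLite,
}

/// Picks a mode from the link's bit rate, using the thresholds in the diagram.
/// Returns `None` for 1-10 kbps, which the design does not specify.
pub fn select_mode(link_bps: u32) -> Option<CryptoMode> {
    const FAST_BPS: u32 = 10_000; // ">10 kbps" in the diagram
    const SLOW_BPS: u32 = 1_000; // "<1 kbps" in the diagram
    if link_bps > FAST_BPS {
        Some(CryptoMode::FullMls)
    } else if link_bps < SLOW_BPS {
        Some(CryptoMode::MlsLite)
    } else {
        None
    }
}

/// True when a group running MLS-Lite should sync to a full MLS epoch
/// because a faster transport has become available (the "upgrade path").
pub fn should_upgrade(current: CryptoMode, available_bps: u32) -> bool {
    current == CryptoMode::MlsLite && select_mode(available_bps) == Some(CryptoMode::FullMls)
}
```

Returning `Option` keeps the unspecified middle range visible to callers, so the MLS-Lite spec has to decide it explicitly.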

Option B: Compressed MLS (Research)

Strip unused extensions from KeyPackages
Use shorter credential identifiers (16 bytes instead of 32)
Batch multiple KeyPackages into single transfer over fast link
Cache and reuse KeyPackages more aggressively

Trade-off: Still large. May not be enough for SF12 LoRa.

Option C: LXMF-Compatible Mode

Implement Reticulum's LXMF format as an alternative wire format:

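/// LXMF-style message layout, with fields as listed in this document.
pub struct LxmfMessage {
    destination: [u8; 16],   // Truncated destination hash
    source: [u8; 16],        // Truncated source hash
    signature: [u8; 64],     // Ed25519 signature
    payload: Vec<u8>,        // msgpack: {timestamp, content, title, fields}
}
// Fixed overhead: 16 + 16 + 64 = 96 bytes before the payload
// Total: ~100-150 bytes for short message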

Trade-off: Lose MLS group properties. Gain Reticulum interop and efficiency.

Action Items

Measure actual MLS sizes in current implementation (benchmark)
Design MLS-Lite spec for constrained links
Implement transport capability negotiation in TransportManager
Add --constrained mode to MeshEnvelope for minimal overhead

Gap 2: KeyPackage Distribution Over Mesh

The Problem

MLS requires pre-positioned KeyPackages for adding members to groups. On the Internet, a server stores KeyPackages and clients fetch them on demand. On a mesh there is no server.

Current flow (broken for pure mesh):

Alice wants to add Bob to group:
1. Alice fetches Bob's KeyPackage from server    ← requires Internet
2. Alice creates Welcome + Commit
3. Alice sends to Bob via mesh

Proposed Solution: Announce-Based KeyPackage Distribution

Bob announces on mesh:
1. MeshAnnounce includes: identity_key, capabilities, AND current_keypackage_hash
2. Nearby nodes cache Bob's latest KeyPackage (if they have it)
3. Alice receives Bob's announce, requests KeyPackage via mesh RPC

KeyPackage propagation:
1. Bob periodically broadcasts KeyPackage update (larger message, less frequent)
2. Nodes with capacity (CAP_STORE) cache KeyPackages for relaying
3. TTL-based expiry (KeyPackages are single-use, but we can cache N of them)
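
The caching rules above (keep up to N KeyPackages per identity, expire by TTL, hand each one out once) can be sketched like this. Everything here is an assumption for illustration: `KeyPackageCache`, its methods, the 32-byte identity key and the caller-supplied `now` timestamp in seconds are hypothetical and not the MeshStore API.

```rust
use std::collections::{HashMap, VecDeque};

/// Identity key used as the cache index (32 bytes is an assumption).
pub type IdentityKey = [u8; 32];

struct CachedKeyPackage {
    bytes: Vec<u8>,
    expires_at: u64, // seconds, same clock as the `now` arguments
}

/// Per-identity KeyPackage cache with a size cap and TTL-based expiry.
pub struct KeyPackageCache {
    max_per_identity: usize,
    ttl_secs: u64,
    entries: HashMap<IdentityKey, VecDeque<CachedKeyPackage>>,
}

impl KeyPackageCache {
    pub fn new(max_per_identity: usize, ttl_secs: u64) -> Self {
        Self { max_per_identity, ttl_secs, entries: HashMap::new() }
    }

    /// Stores a KeyPackage, evicting the oldest one once the cap is exceeded.
    pub fn insert(&mut self, id: IdentityKey, bytes: Vec<u8>, now: u64) {
        let queue = self.entries.entry(id).or_default();
        queue.push_back(CachedKeyPackage { bytes, expires_at: now + self.ttl_secs });
        while queue.len() > self.max_per_identity {
            queue.pop_front();
        }
    }

    /// Removes and returns the oldest unexpired KeyPackage for `id`.
    /// Removal on read reflects that KeyPackages are single-use.
    pub fn take(&mut self, id: &IdentityKey, now: u64) -> Option<Vec<u8>> {
        let queue = self.entries.get_mut(id)?;
        queue.retain(|entry| entry.expires_at > now);
        let taken = queue.pop_front().map(|entry| entry.bytes);
        let empty = queue.is_empty();
        if empty {
            self.entries.remove(id);
        }
        taken
    }
}
```

Passing `now` in explicitly keeps the sketch deterministic and testable; a real node would read it from its own clock.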

Action Items

Extend MeshAnnounce with optional keypackage_hash field
Add KeyPackage request/response to mesh protocol
Implement KeyPackage cache in MeshStore (separate from message queue)
Design KeyPackage refresh protocol for mesh-only scenarios

Gap 3: No DTN/Bundle Protocol Integration

The Problem

NASA/IETF Bundle Protocol (RFC 9171) is the standard for delay-tolerant networking. Reticulum effectively reinvented it. QuicProChat should learn from both.

Key DTN concepts we're missing:

| Concept | DTN/BPv7 | Reticulum | QuicProChat |
| --- | --- | --- | --- |
| Custody transfer | Yes | No | No |
| Fragmentation at bundle layer | Yes | No | Yes (LoRa transport) |
| Convergence layer adapters | Formal spec | Interfaces | MeshTransport trait |
| Routing protocols | CGR, Epidemic | Announce-based | Announce-based |
| Priority scheduling | Yes | No | No |

Proposed Improvements

Priority levels in MeshEnvelope (emergency > data > announce)
Custody transfer option — intermediate node takes responsibility
Better congestion control — backpressure signals in announce
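
The priority ordering proposed above (emergency > data > announce) maps naturally onto a binary heap. The following is a sketch only: `Priority` and `TxQueue` are hypothetical names, and first-in-first-out ordering within one priority level is an added assumption, not something the design states.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// Priority levels, declared lowest first so the derived `Ord` ranks
/// `Emergency` above `Data` above `Announce`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Priority {
    Announce,
    Data,
    Emergency,
}

struct Queued {
    priority: Priority,
    seq: u64, // insertion counter, used to keep FIFO order within a level
    payload: Vec<u8>,
}

impl Ord for Queued {
    fn cmp(&self, other: &Self) -> Ordering {
        // Higher priority wins; for equal priority the lower `seq` wins.
        self.priority
            .cmp(&other.priority)
            .then_with(|| other.seq.cmp(&self.seq))
    }
}

impl PartialOrd for Queued {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl PartialEq for Queued {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}

impl Eq for Queued {}

/// Outgoing message queue that always yields the most urgent message next.
#[derive(Default)]
pub struct TxQueue {
    heap: BinaryHeap<Queued>,
    next_seq: u64,
}

impl TxQueue {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn push(&mut self, priority: Priority, payload: Vec<u8>) {
        let seq = self.next_seq;
        self.next_seq += 1;
        self.heap.push(Queued { priority, seq, payload });
    }

    pub fn pop(&mut self) -> Option<(Priority, Vec<u8>)> {
        self.heap.pop().map(|q| (q.priority, q.payload))
    }
}
```

With a scarce duty budget, popping from such a queue means an emergency message is never stuck behind queued announces.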

Action Items

Add priority field to MeshEnvelope
Research custody transfer — is it worth the complexity?
Implement priority queue in MeshStore and DutyCycleTracker

Gap 4: Battery/Duty-Cycle Optimization

The Problem

Briar reportedly uses about 4x the battery because of constant Bluetooth scanning. We claim to do better but haven't proven it.

Current state:

DutyCycleTracker enforces EU868 1% limit
Announce interval is configurable (default 10 min)
No adaptive power management
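
For reference, the EU868 1% rule can be enforced with a sliding one-hour window. The sketch below illustrates the rule itself and is not the project's DutyCycleTracker: the type name `DutyCycleSketch`, its methods and the millisecond timestamps passed in by the caller are all assumptions.

```rust
use std::collections::VecDeque;

/// Sliding-window duty-cycle limiter (illustrative, hypothetical type).
pub struct DutyCycleSketch {
    window_ms: u64,
    budget_ms: u64,
    log: VecDeque<(u64, u64)>, // (start time in ms, airtime in ms)
}

impl DutyCycleSketch {
    pub fn new(window_ms: u64, duty_cycle_percent: u64) -> Self {
        Self {
            window_ms,
            budget_ms: window_ms * duty_cycle_percent / 100,
            log: VecDeque::new(),
        }
    }

    /// 1% of one hour, i.e. 36 seconds of transmit time per hour.
    pub fn eu868_one_percent() -> Self {
        Self::new(3_600_000, 1)
    }

    /// Airtime used inside the window ending at `now_ms`.
    fn used_ms(&mut self, now_ms: u64) -> u64 {
        // Drop transmissions that started a full window or more ago.
        while let Some(&(start, _)) = self.log.front() {
            if start + self.window_ms <= now_ms {
                self.log.pop_front();
            } else {
                break;
            }
        }
        self.log.iter().map(|&(_, airtime)| airtime).sum()
    }

    /// Records the transmission and returns true if it fits the budget,
    /// otherwise returns false and records nothing.
    pub fn try_transmit(&mut self, now_ms: u64, airtime_ms: u64) -> bool {
        if self.used_ms(now_ms) + airtime_ms > self.budget_ms {
            return false;
        }
        self.log.push_back((now_ms, airtime_ms));
        true
    }
}
```

A 27-second transmission leaves only 9 seconds of budget for the rest of the hour, which is why envelope overhead matters so much on this link.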

Proposed Improvements

Adaptive announce interval — more frequent when there is activity, less frequent when idle
Listen-before-talk — don't TX if channel is busy (LoRa CAD)
Scheduled wake windows — coordinate with peers for efficient sync
Power profiles — "always-on", "hourly-sync", "manual-only"
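
The power profiles and the adaptive announce interval could be combined in one small policy function. This is a sketch under stated assumptions: `PowerProfile` and `announce_interval` are hypothetical names, the 10-minute base comes from the current default, and halving the interval when active or doubling it when idle is an arbitrary illustrative choice.

```rust
use std::time::Duration;

/// The three profiles named above (hypothetical type, not from the codebase).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PowerProfile {
    AlwaysOn,
    HourlySync,
    ManualOnly,
}

/// Returns the time until the next announce, or `None` if announces are
/// only sent on user request. `recent_messages` is the number of messages
/// seen since the previous announce.
pub fn announce_interval(profile: PowerProfile, recent_messages: u32) -> Option<Duration> {
    // Current default announce interval: 10 minutes.
    const BASE: Duration = Duration::from_secs(600);
    match profile {
        PowerProfile::ManualOnly => None,
        PowerProfile::HourlySync => Some(Duration::from_secs(3600)),
        PowerProfile::AlwaysOn => {
            if recent_messages > 0 {
                Some(BASE / 2) // active: announce more often (assumed factor)
            } else {
                Some(BASE * 2) // idle: back off (assumed factor)
            }
        }
    }
}
```

The actual factors should come out of the power measurements listed in the action items, since more frequent announces also eat into the duty budget.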

Action Items

Implement CAD (Channel Activity Detection) in LoRaTransport
Add power profile config to P2pNode
Measure actual power consumption with real hardware

Gap 5: Real-World Testing

The Problem

All our mesh code runs against mocks. We claim LoRa support but haven't tested with real radios.

Testing Plan

| Test | Hardware | Status |
| --- | --- | --- |
| LoRa point-to-point | 2x SX1262 dev boards | Not started |
| LoRa multi-hop | 3x SX1262, different rooms | Not started |
| Mixed transport | LoRa + WiFi relay | Not started |
| Outdoor range test | LoRa, line-of-sight 1 km | Not started |
| Duty cycle compliance | SDR spectrum analyzer | Not started |

Action Items

Procure hardware — 3x Heltec LoRa32 or similar
Implement UART LoRaTransport for real modems
Create test harness for automated multi-node testing
Document actual performance numbers

Gap 6: Comparison Claims Need Verification

The Problem

Our positioning doc claims superiority over Meshtastic/Reticulum/Briar, but:

We haven't measured our actual overhead vs. theirs
We haven't tested interop scenarios
We haven't run security analysis against their threat models

Verification Plan

| Claim | How to Verify |
| --- | --- |
| "MLS is better than shared-key AES" | Threat model comparison doc |
| "Multi-hop works" | Integration test with 5+ nodes |
| "LoRa-ready" | Actual LoRa hardware test |
| "Post-quantum protects groups" | Verify hybrid KEM in MLS path |
| "Relay nodes can't read content" | Formal verification of E2E path |

Action Items

Create benchmark suite comparing message sizes
Write threat model comparison doc (Meshtastic CVEs, Reticulum link-level)
Fuzz test mesh envelope parsing
Get external review of mesh crypto design

Implementation Priority

Phase 1: Make It Work (Next 2 Sprints)

S4: Multi-hop routing — complete the core mesh functionality
S5: Truncated addresses — reduce envelope overhead
Measure actual sizes — know the real numbers

Phase 2: Make It Efficient (Following 2 Sprints)

Design MLS-Lite — spec for constrained links
Priority queue — emergency messages first
Hardware testing — real LoRa validation

Phase 3: Make It Production-Ready

KeyPackage distribution — mesh-native key exchange
Power profiles — battery optimization
External review — security audit of mesh layer

Success Metrics

| Metric | Current | Target |
| --- | --- | --- |
| MeshEnvelope overhead (short msg) | ~170 bytes | <100 bytes |
| Time to send "hello" over SF12 LoRa | ~27 sec | <15 sec |
| KeyPackage exchange over mesh | Not possible | Works |
| Multi-hop message delivery | Mock only | Real hardware |
| Battery life (mesh mode) | Unknown | Measured & documented |

Honest Assessment

What we do well:

MLS group crypto is genuinely better than Meshtastic/Reticulum
Transport abstraction is clean
Announce protocol is solid

What we need to fix:

MLS overhead makes LoRa impractical for group setup
No solution for KeyPackage distribution without server
No real-world testing yet

What we should acknowledge in marketing:

"Best crypto for mesh" is true, but with caveats
"LoRa-ready" means "designed for LoRa, pending optimization"
We're research-stage, not production-ready

Last updated: 2026-03-30
