# Status Log

## 2026-04-11 — Observability & MeshNode run() wiring

### Completed
- **observability.rs** — new module with health checks, Prometheus text export, HTTP server
  - `NodeHealth` struct with per-subsystem health checks (transport, routing, store)
  - `HealthStatus` enum (Healthy/Degraded/Draining/Unhealthy) with HTTP status codes
  - `prometheus_text()` — renders `MetricsSnapshot` in Prometheus exposition format
  - `HealthServer` — lightweight TCP-based HTTP server for `/healthz` and `/metricsz`
- **MeshNode.run()** — starts background tasks and returns a `RunHandle`
  - Periodic GC task (store, routing table, rate limiters) with configurable interval
  - Health/metrics HTTP server (optional, via `MeshNodeBuilder.health_listen()`)
  - Shutdown coordination via `watch` channel
- **RunHandle** — public API for interacting with a running node
  - `.node()` — access to the MeshNode
  - `.health()` — current health snapshot
  - `.metrics_snapshot()` — current metrics
  - `.health_addr()` — bound health server address
  - `.shutdown()` — graceful shutdown (signals tasks + drains transports)
- **Tracing spans** — `#[tracing::instrument]` on `process_incoming()` and `send()`
  - Includes sender/dest address and payload length as span fields
  - GC cycle wrapped in `mesh_gc` info span
- **Draining flag** — `AtomicBool` for shutdown awareness; health endpoint returns 503 while draining

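To illustrate how the draining flag and the per-subsystem checks could combine into one HTTP status, here is a minimal sketch. Only the four variant names and the 503-while-draining behaviour come from the entry above. The other status codes, the `overall_status` helper, and the rule for telling Degraded from Unhealthy are assumptions made for illustration, not the module's actual logic.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Variant names follow the log entry; everything else here is a sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HealthStatus {
    Healthy,
    Degraded,
    Draining,
    Unhealthy,
}

impl HealthStatus {
    /// Assumed mapping: only 503-on-drain is stated in the log.
    fn http_status(self) -> u16 {
        match self {
            HealthStatus::Healthy | HealthStatus::Degraded => 200,
            HealthStatus::Draining | HealthStatus::Unhealthy => 503,
        }
    }
}

/// Hypothetical helper: the draining flag wins, otherwise count failed subsystems.
fn overall_status(draining: &AtomicBool, subsystems_ok: &[bool]) -> HealthStatus {
    if draining.load(Ordering::Acquire) {
        return HealthStatus::Draining;
    }
    let failed = subsystems_ok.iter().filter(|ok| !**ok).count();
    match failed {
        0 => HealthStatus::Healthy,
        n if n == subsystems_ok.len() => HealthStatus::Unhealthy,
        _ => HealthStatus::Degraded,
    }
}

fn main() {
    let draining = AtomicBool::new(false);

    // One of three subsystems (transport, routing, store) is failing.
    let status = overall_status(&draining, &[true, true, false]);
    println!("{:?} -> HTTP {}", status, status.http_status());

    // Once shutdown starts, the flag overrides the subsystem results.
    draining.store(true, Ordering::Release);
    let status = overall_status(&draining, &[true, true, true]);
    println!("{:?} -> HTTP {}", status, status.http_status());
}
```

Checking the flag first means a load balancer stops routing to the node as soon as the drain begins, even if every subsystem is still healthy.
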
### Test Coverage
- 232 total tests passing (212 lib + 3 fapp_flow + 1 meshservice + 16 multi_node)
- 7 new observability unit tests (health healthy/degraded/draining, prometheus format)
- Full workspace `cargo check` clean

### What's Next
1. Wire `MeshNode.run()` into an example binary or the server
2. Announce loop task (periodic re-announce to neighbors)
3. Grafana dashboard for mesh metrics
4. Integration test for health HTTP endpoint

---

## 2026-04-01 — meshservice workspace integration

### Completed
- **Workspace** — `crates/meshservice/` is a workspace member (`Cargo.toml`); `cargo check -p meshservice` and full `cargo check --workspace` succeed.
- **P2P bridge test** — `crates/quicprochat-p2p/tests/meshservice_tcp_transport.rs`: same Ed25519 seed for `MeshIdentity` and `meshservice::ServiceIdentity`; FAPP announce encoded with `meshservice::wire`, sent over `TcpTransport`, decoded and handled by `ServiceRouter` + `FappService::relay()`.
- **Client command engine** — `SlashCommand::MeshTrace` / `MeshStats` wired through `Command` and `execute_slash` (fixes non-exhaustive match); playbook steps `mesh-trace` / `mesh-stats` added.

### Integration notes
- **Transport**: `meshservice` is transport-agnostic; carry `wire::encode` bytes inside `MeshEnvelope` / mesh ALPN (`quicprochat/mesh/1`) for production — not yet a direct dependency from `quicprochat-p2p` lib code.
- **FAPP duplication**: `quicprochat-p2p::fapp` (legacy mesh FAPP) and `meshservice::services::fapp` (generic service layer) coexist; long-term alignment TBD.

---

## 2026-04-01 — Production Infrastructure Sprint

### Completed
- **Error handling** — `error.rs`: Structured error types with context for all subsystems
  - MeshError, TransportError, RoutingError, CryptoError, ProtocolError, StoreError, ConfigError
  - ErrorContext trait for chaining errors with context
  - Helper methods for common error construction

- **Configuration** — `config.rs`: Runtime config with TOML parsing
  - MeshConfig, IdentityConfig, AnnounceConfig, RoutingConfig, StoreConfig
  - TransportConfig (QUIC/TCP/LoRa), CryptoConfig, RateLimitConfig, LoggingConfig
  - Validation with meaningful error messages
  - MeshConfig::constrained() preset for low-resource devices

- **Metrics/Observability** — `metrics.rs`: Counter/Gauge/Histogram primitives
  - Per-transport metrics (sent/received/errors/bytes)
  - Routing metrics (table size, lookups, misses)
  - Store metrics (stored/delivered/expired)
  - Crypto metrics (encryptions, failures, replay detections)
  - JSON-serializable MetricsSnapshot for export

- **Rate limiting** — `rate_limit.rs`: DoS protection
  - TokenBucket with configurable refill rate
  - Per-peer limiters for messages, announces, KeyPackage requests
  - DutyCycleTracker for LoRa EU868 compliance
  - BackpressureController with priority-based shedding

- **Persistence** — `persistence.rs`: Durable storage
  - AppendLog with JSON entries and compaction
  - PersistentRoutingTable with TTL-based expiry
  - PersistentMessageStore for offline delivery
  - Atomic file operations with fsync

- **Graceful shutdown** — `shutdown.rs`: Coordinated termination
  - ShutdownCoordinator with phase transitions (Draining → Persisting → Cleanup → Complete)
  - TaskGuard RAII for tracking active tasks
  - ConnectionDrainer for clean connection teardown
  - ShutdownHooks for persist/cleanup callbacks

- **Integration tests** — `tests/multi_node.rs`: 16 production scenarios
  - Rate limiting per-peer isolation
  - Store-and-forward, message dedup, GC
  - Envelope V2 signatures, forwarding, broadcast
  - Config validation, TOML roundtrip
  - Shutdown coordination, concurrent access

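The rate-limiting bullets above name a token bucket with a configurable refill rate and per-peer isolation. The following is a minimal sketch of that general technique, not the code in `rate_limit.rs`: the field names, the `PerPeerLimiter` wrapper, and the choice to pass time in as a plain number (so the behaviour is deterministic) are all assumptions.

```rust
use std::collections::HashMap;

/// Sketch of a token bucket; timestamps are supplied by the caller in seconds.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_refill_secs: f64,
}

impl TokenBucket {
    fn new(capacity: f64, refill_per_sec: f64, now_secs: f64) -> Self {
        Self { capacity, tokens: capacity, refill_per_sec, last_refill_secs: now_secs }
    }

    /// Refill according to elapsed time (capped at capacity), then take one token.
    fn try_acquire(&mut self, now_secs: f64) -> bool {
        let elapsed = (now_secs - self.last_refill_secs).max(0.0);
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill_secs = self.last_refill_secs.max(now_secs);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

/// Hypothetical per-peer wrapper: each peer id gets its own independent bucket.
struct PerPeerLimiter {
    capacity: f64,
    refill_per_sec: f64,
    buckets: HashMap<String, TokenBucket>,
}

impl PerPeerLimiter {
    fn new(capacity: f64, refill_per_sec: f64) -> Self {
        Self { capacity, refill_per_sec, buckets: HashMap::new() }
    }

    fn allow(&mut self, peer: &str, now_secs: f64) -> bool {
        let (capacity, refill) = (self.capacity, self.refill_per_sec);
        self.buckets
            .entry(peer.to_string())
            .or_insert_with(|| TokenBucket::new(capacity, refill, now_secs))
            .try_acquire(now_secs)
    }
}

fn main() {
    // Burst of 2 messages, refilled at 1 message per second.
    let mut limiter = PerPeerLimiter::new(2.0, 1.0);
    for &t in &[0.0, 0.0, 0.0, 1.0] {
        println!("t={}: peer-a allowed = {}", t, limiter.allow("peer-a", t));
    }
    // A noisy peer-a does not consume peer-b's budget.
    println!("t=1: peer-b allowed = {}", limiter.allow("peer-b", 1.0));
}
```

Keeping one bucket per peer is what gives the per-peer isolation listed under the integration tests: exhausting one peer's tokens leaves every other peer unaffected.
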
### Test Coverage
- 189 unit tests + 16 integration tests = **205 total**
- All passing

### What's Next
1. Wire new modules into P2pNode startup
2. Add tracing spans for distributed tracing
3. Health check HTTP endpoint
4. Prometheus metrics export

---

## 2026-04-01 — MeshNode: Production Integration

### Completed
- **MeshNode** — `mesh_node.rs`: Production-ready node integrating all subsystems
  - `MeshNodeBuilder`: Fluent API for configuration
  - `MeshConfig` integration for all settings
  - `MeshMetrics` tracking for all operations
  - Rate limiting on incoming messages via `RateLimiter`
  - Backpressure control via `BackpressureController`
  - Graceful shutdown via `ShutdownCoordinator`
  - Optional `FappRouter` based on capabilities
  - `MeshRouter` for envelope routing
  - `TransportManager` for multi-transport support

### Key APIs
```rust
// Build a mesh node
let node = MeshNodeBuilder::new()
    .config(config)
    .identity(identity)
    .fapp_relay()
    .fapp_patient()
    .build()
    .await?;

// Process incoming with rate limiting + metrics
let action = node.process_incoming(&sender_addr, envelope)?;

// Garbage collection
node.gc()?;

// Graceful shutdown
node.shutdown().await;
```

### Test Coverage
- 222 total tests (203 lib + 3 fapp_flow + 16 multi_node)
- 5 new mesh_node tests

---

## 2026-04-01 — FAPP: Complete E2E Flow

### Completed (Latest)
- **E2E Encryption** — `fapp.rs`: SlotReserve/SlotConfirm with X25519 + ChaCha20-Poly1305
  - `PatientEphemeralKey`: generates X25519 keypair for reservation
  - `TherapistCrypto`: decrypts reserves, creates confirms with forward secrecy
  - `PatientCrypto`: creates reserves, decrypts confirmations
  - Each confirmation uses fresh ephemeral key for forward secrecy

- **FappRouter Reserve/Confirm** — `fapp_router.rs`:
  - `DeliverReserve` / `DeliverConfirm` action variants
  - `process_slot_reserve()`: routes to therapist or floods
  - `process_slot_confirm()`: delivers to patient
  - `send_reserve()` / `send_confirm()`: capability-checked sends
  - `send_response()`: relay-to-patient response routing

- **Integration Tests** — `tests/fapp_flow.rs`:
  - `full_fapp_flow_announce_query_reserve_confirm`: Complete flow from announce to confirmed appointment
  - `fapp_rejection_flow`: Tests therapist declining a reservation
  - `fapp_query_filters`: Tests Fachrichtung, PLZ, and other filters

### Test Coverage
- 217 total tests (198 lib + 3 fapp_flow + 16 multi_node)
- 31 FAPP-specific tests (24 fapp + 7 fapp_router)

### What's Next
1. Wire FappRouter into P2pNode startup
2. LoRa testing for FAPP messages

---

## 2026-03-31 — FAPP: Free Appointment Propagation Protocol

### Completed
- **Protocol spec** — `docs/specs/fapp-protocol.md`: decentralized psychotherapy appointment discovery over mesh
- **Rust module** — `crates/quicprochat-p2p/src/fapp.rs`: full data structures, store, query matching, signature verification
- **Message types**: SlotAnnounce, SlotQuery, SlotResponse, SlotReserve, SlotConfirm
- **Domain model**: Fachrichtung, Modalitaet, Kostentraeger, SlotType (German enum names for domain concepts)
- **FappStore**: in-memory cache with dedup (therapist_address + sequence), TTL expiry, signature verification, capacity limits
- **Query matching**: filter by Fachrichtung, Modalitaet, Kostentraeger, PLZ prefix, time range, SlotType, max_results
- **Privacy model**: therapist identity public (Approbation-bound), patient queries anonymous

### Design Decisions
- Extends announce.rs capability bitfield with CAP_FAPP_THERAPIST (0x0100), CAP_FAPP_RELAY (0x0200), CAP_FAPP_PATIENT (0x0400)
- Uses same signing pattern as MeshAnnounce: hop_count excluded from signature, forwarding nodes don't re-sign
- CBOR wire format consistent with existing envelope/announce code
- Location hint is PLZ only (e.g. "80331") — never exact address
- Anti-spam: Approbation hash binding, signature verification, sequence-based dedup, rate limiting, TTL enforcement

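The three FAPP capability bits listed above occupy distinct positions, so a node can hold any combination of roles. A minimal sketch of testing them follows. The constant values are taken from the entry. The `u16` width and the `has_cap` helper are assumptions, since the log does not state the bitfield's type or how `announce.rs` checks it.

```rust
// Values from the design notes; the u16 width is an assumption.
const CAP_FAPP_THERAPIST: u16 = 0x0100;
const CAP_FAPP_RELAY: u16 = 0x0200;
const CAP_FAPP_PATIENT: u16 = 0x0400;

/// Hypothetical helper: true when every bit in `flag` is set in `caps`.
fn has_cap(caps: u16, flag: u16) -> bool {
    (caps & flag) == flag
}

fn main() {
    // A node that relays FAPP traffic and also acts as a patient.
    let caps = CAP_FAPP_RELAY | CAP_FAPP_PATIENT;

    println!("caps      = {:#06x}", caps);
    println!("therapist = {}", has_cap(caps, CAP_FAPP_THERAPIST));
    println!("relay     = {}", has_cap(caps, CAP_FAPP_RELAY));
    println!("patient   = {}", has_cap(caps, CAP_FAPP_PATIENT));
}
```
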
---

## 2026-03-30 — Mesh Protocol Infrastructure Sprint

### Completed (Latest)
- **KeyPackage distribution** — `keypackage_cache.rs` + `mesh_protocol.rs`
  - MeshAnnounce extended with `keypackage_hash` field
  - KeyPackageRequest/Response/Unavailable messages
  - KeyPackageCache with TTL, per-address limits, LRU eviction
- **Transport capability negotiation** — `transport.rs` TransportCapability
  - Auto-classification: Unconstrained/Medium/Constrained/SeverelyConstrained
  - CryptoMode recommendation per capability level
  - TransportManager.recommended_crypto(), select_for_size()
- **MLS-Lite upgrade path** — `crypto_negotiation.rs`
  - GroupCryptoState tracks current mode
  - MlsLiteBootstrap derives MLS-Lite keys from MLS epoch secret
  - Enables same group to use full MLS on WiFi, MLS-Lite on LoRa

### Previously Completed
- **S4: Multi-hop routing** — `MeshRouter` with `send()`, `handle_incoming()`, `forward()`, `drain_store_for()`
- **S4: REPL commands** — `/mesh trace <address>` and `/mesh stats`
- **S5: Truncated addresses** — `MeshEnvelopeV2` with 16-byte addresses (~18% smaller)
- **MLS-Lite** — Lightweight symmetric mode for constrained links (`mls_lite.rs`)
- **Size measurements** — Actual MLS and envelope sizes benchmarked

### Actual Measured Sizes (Key Finding!)

| Component | Size | LoRa SF12 fragments |
|-----------|------|---------------------|
| MLS KeyPackage | 306 bytes | 6 |
| MLS Welcome | 840 bytes | 17 |
| MLS-Lite (no sig) | 129 bytes | 3 |
| MLS-Lite (with sig) | 262 bytes | 6 |
| MeshEnvelope V1 | 410 bytes | 9 |
| MeshEnvelope V2 | 336 bytes | 7 |
| MLS KeyPackage (PQ hybrid) | 2,676 bytes | 53 |

**Key insight:** Classical MLS is actually LoRa-viable! 6 fragments for KeyPackage, ~14 sec for group setup at 1% duty. PQ hybrid remains impractical.

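The fragment counts in the table are all reproduced by ceiling division with a 51-byte payload per fragment. That payload size is inferred from the numbers and is not stated anywhere in this log, so treat the constant below as an assumption. The sketch only recomputes the column; it says nothing about how the LoRa framing actually splits messages.

```rust
/// Assumed usable payload per SF12 fragment, inferred from the table above.
const ASSUMED_FRAGMENT_PAYLOAD: usize = 51;

/// Ceiling division: number of fragments needed to carry `size` bytes.
fn fragments_needed(size: usize, payload: usize) -> usize {
    (size + payload - 1) / payload
}

fn main() {
    // (component, measured size in bytes, fragment count reported in the table)
    let rows = [
        ("MLS KeyPackage", 306, 6),
        ("MLS Welcome", 840, 17),
        ("MLS-Lite (no sig)", 129, 3),
        ("MLS-Lite (with sig)", 262, 6),
        ("MeshEnvelope V1", 410, 9),
        ("MeshEnvelope V2", 336, 7),
        ("MLS KeyPackage (PQ hybrid)", 2676, 53),
    ];

    for (name, size, reported) in rows.iter() {
        let computed = fragments_needed(*size, ASSUMED_FRAGMENT_PAYLOAD);
        println!("{:<28} {:>5} B  computed {:>2}  reported {:>2}", name, size, computed, reported);
    }
}
```
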
### What's Next
1. KeyPackage distribution over mesh (announce-based)
2. Transport capability negotiation
3. Real hardware testing (LoRa boards)
4. MLS-Lite upgrade path to full MLS

---

## 2026-03-30 — Mesh Protocol Gap Analysis

### Completed
- Created `docs/plans/mesh-protocol-gaps.md` — honest assessment of QuicProChat vs. Reticulum/Meshtastic/Briar
- Created `docs/src/design-rationale/mesh-protocol-comparison.md` — technical comparison document
- Updated `docs/positioning.md` — sharper messaging + honest limitations

### Key Insight
QuicProChat has **best-in-class crypto** AND **viable mesh efficiency** (for classical MLS). PQ hybrid mode needs constrained-link fallback.

### Open Design Questions
- How to distribute KeyPackages over mesh without server?
- Should we implement LXMF compatibility for Reticulum interop?

---

## 2026-03-30 — Sprint 6: LoRa transport & integration demo

### Completed
- Added `transport_lora.rs`: `LoRaConfig`, Semtech-style airtime estimate, `DutyCycleTracker` (rolling 1 h window, `eu868_one_percent()`), `LoRaMockMedium` + `LoRaTransport` implementing `MeshTransport` (`lora` name for `TransportManager`), LR framing with automatic fragmentation/reassembly, tests (mock roundtrip, fragmentation, duty accounting, `split_for_mtu`).
- Example `mesh_lora_relay_demo`: A (LoRa mock) → B (relay) → C (TCP) and reply path; `scripts/mesh-demo.sh` runs it.
- Wired `pub mod transport_lora` in `lib.rs`.
- Adjusted `cbor_smaller_than_json` to assert CBOR is materially smaller than JSON (fixed overhead dominates; a strict half-JSON threshold failed on current envelope sizes).

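A 1% duty cycle over a rolling one-hour window leaves 36 seconds of airtime per hour. The sketch below shows one way to account for that budget. It is not the `DutyCycleTracker` from `transport_lora.rs`: the type name `DutyCycleSketch`, its fields and methods, and the use of caller-supplied millisecond timestamps are all assumptions chosen to keep the example deterministic.

```rust
use std::collections::VecDeque;

/// Hypothetical rolling-window airtime accountant (not the crate's tracker).
struct DutyCycleSketch {
    window_ms: u64,
    budget_ms: u64,
    /// (start timestamp, airtime) for each transmission still inside the window.
    sent: VecDeque<(u64, u64)>,
}

impl DutyCycleSketch {
    /// 1% of a one-hour window: 3_600_000 ms / 100 = 36_000 ms of airtime.
    fn one_percent_hourly() -> Self {
        let window_ms = 3_600_000;
        Self { window_ms, budget_ms: window_ms / 100, sent: VecDeque::new() }
    }

    /// Drop transmissions that left the window, then sum the remaining airtime.
    fn used_ms(&mut self, now_ms: u64) -> u64 {
        while let Some(&(started, _)) = self.sent.front() {
            if now_ms.saturating_sub(started) >= self.window_ms {
                self.sent.pop_front();
            } else {
                break;
            }
        }
        self.sent.iter().map(|&(_, airtime)| airtime).sum()
    }

    /// Record the transmission only if it fits in the remaining budget.
    fn try_transmit(&mut self, now_ms: u64, airtime_ms: u64) -> bool {
        if self.used_ms(now_ms) + airtime_ms > self.budget_ms {
            return false;
        }
        self.sent.push_back((now_ms, airtime_ms));
        true
    }
}

fn main() {
    let mut duty = DutyCycleSketch::one_percent_hourly();
    println!("30 s at t=0:      {}", duty.try_transmit(0, 30_000));
    println!("10 s at t=1 s:    {}", duty.try_transmit(1_000, 10_000));
    println!(" 6 s at t=1 s:    {}", duty.try_transmit(1_000, 6_000));
    println!("30 s at t=1 hour: {}", duty.try_transmit(3_600_000, 30_000));
}
```
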
### What's Next
- Optional: UART-backed `LoRaTransport` behind a feature flag (modem-specific framing).
- Hardware runbook: replace mock medium with RNode / SX1262 serial when available.

## 2026-03-30 — Sprint 3: Announce & Discovery Protocol

### Completed
- Created `MeshAnnounce` struct with Ed25519 signed announcements, CBOR wire format, hop forwarding
- Created `compute_address()` — SHA-256 truncation of identity key to 16-byte mesh address
- Created `RoutingTable` with `RoutingEntry` — keyed by 16-byte address, supports lookup by address or full key, TTL-based expiry, sequence-based stale rejection
- Created `AnnounceDedup` for loop prevention (address+sequence deduplication)
- Created `AnnounceConfig` with sensible defaults (10min interval, 30min max age, 8 max hops)
- Created `create_announce()` and `process_received_announce()` — complete announce processing pipeline (verify, expiry check, dedup, routing update, propagation decision)
- Capability flags: CAP_RELAY, CAP_STORE, CAP_GATEWAY, CAP_CONSTRAINED
- Tests: 17 tests across 3 modules covering signature verification, tampering, forwarding, expiry, dedup, routing updates, stale rejection, CBOR roundtrip, address determinism
- Updated lib.rs with `announce`, `announce_protocol`, `routing_table` modules

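Two of the checks in the announce pipeline above, address+sequence deduplication and sequence-based stale rejection, can be sketched with standard collections. This is an illustration of the two ideas only. The type names `DedupSketch` and `RouteSketch`, the `u64` sequence type, and the method names are assumptions and do not mirror `AnnounceDedup` or `RoutingTable`.

```rust
use std::collections::{HashMap, HashSet};

/// 16-byte mesh address, as described in the entry above.
type MeshAddress = [u8; 16];

/// Hypothetical loop-prevention set keyed by (address, sequence).
#[derive(Default)]
struct DedupSketch {
    seen: HashSet<(MeshAddress, u64)>,
}

impl DedupSketch {
    /// True only the first time a given (address, sequence) pair arrives.
    fn first_seen(&mut self, addr: MeshAddress, seq: u64) -> bool {
        self.seen.insert((addr, seq))
    }
}

/// Hypothetical routing state: remembers the newest sequence per address.
#[derive(Default)]
struct RouteSketch {
    latest_seq: HashMap<MeshAddress, u64>,
}

impl RouteSketch {
    /// Accept an announce only if it is newer than what is already recorded.
    fn accept(&mut self, addr: MeshAddress, seq: u64) -> bool {
        match self.latest_seq.get(&addr) {
            Some(&known) if seq <= known => false,
            _ => {
                self.latest_seq.insert(addr, seq);
                true
            }
        }
    }
}

fn main() {
    let node_a: MeshAddress = [0xAA; 16];
    let mut dedup = DedupSketch::default();
    let mut routes = RouteSketch::default();

    // The same announce arriving over two paths is processed once.
    println!("seq 5 first copy:  {}", dedup.first_seen(node_a, 5));
    println!("seq 5 second copy: {}", dedup.first_seen(node_a, 5));

    // An older announce does not overwrite a newer route.
    println!("route seq 5: {}", routes.accept(node_a, 5));
    println!("route seq 4: {}", routes.accept(node_a, 4));
    println!("route seq 6: {}", routes.accept(node_a, 6));
}
```

The two checks answer different questions: the dedup set stops the same announce from being forwarded in a loop, while the sequence comparison stops a delayed, older announce from replacing fresher routing information.
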
### What's Next
- S4: Multi-Hop Routing
- Integrate announce protocol with TransportManager for actual broadcast/receive loops
- Add tokio async announce loop (periodic re-announce, GC timer)

### Notes
- Signature excludes `hop_count` (same design as MeshEnvelope) so forwarding doesn't break verification
- Protocol engine uses free functions rather than a stateful struct — simpler, more testable
- Cannot run `cargo test` in this environment (no C toolchain / linker available)

## 2026-03-30 — Sprint 2: Transport Abstraction Layer

### Completed
- Created `MeshTransport` trait with `send`, `recv`, `discover`, `close` methods
- Created `TransportAddr` enum for transport-agnostic addressing (Iroh, Socket, LoRa, Serial, Raw)
- Created `TransportInfo` struct for transport capability metadata
- Implemented `IrohTransport` wrapping iroh `Endpoint` with same length-prefixed framing as `P2pNode`
- Implemented `TcpTransport` using tokio `TcpListener`/`TcpStream` with length-prefixed framing
- Implemented `TransportManager` for multi-transport routing based on address type
- Added `async-trait` dependency, enabled tokio `net` + `io-util` features
- Tests: TransportAddr Display formatting, TCP roundtrip, TransportManager routing, error cases

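Both transports above use length-prefixed framing, which lets a receiver split a byte stream back into discrete messages. Here is a minimal synchronous sketch of the technique using only the standard library. The 4-byte big-endian prefix, the 1 MiB limit, and the function names are assumptions. The log does not state the prefix width, and the real transports are async on tokio.

```rust
use std::io::{self, Cursor, Read, Write};

/// Assumed upper bound on a single frame, to reject absurd length prefixes.
const MAX_FRAME_LEN: usize = 1 << 20;

/// Write a 4-byte big-endian length followed by the payload.
fn write_frame<W: Write>(writer: &mut W, payload: &[u8]) -> io::Result<()> {
    if payload.len() > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "payload too large"));
    }
    writer.write_all(&(payload.len() as u32).to_be_bytes())?;
    writer.write_all(payload)
}

/// Read one frame: the length prefix first, then exactly that many bytes.
fn read_frame<R: Read>(reader: &mut R) -> io::Result<Vec<u8>> {
    let mut len_buf = [0u8; 4];
    reader.read_exact(&mut len_buf)?;
    let len = u32::from_be_bytes(len_buf) as usize;
    if len > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "frame exceeds limit"));
    }
    let mut payload = vec![0u8; len];
    reader.read_exact(&mut payload)?;
    Ok(payload)
}

fn main() -> io::Result<()> {
    // Two messages written back to back into one byte stream.
    let mut wire = Vec::new();
    write_frame(&mut wire, b"hello")?;
    write_frame(&mut wire, b"mesh")?;

    // The reader recovers the message boundaries from the prefixes.
    let mut reader = Cursor::new(wire);
    let first = read_frame(&mut reader)?;
    let second = read_frame(&mut reader)?;
    println!("{}", String::from_utf8_lossy(&first));
    println!("{}", String::from_utf8_lossy(&second));
    Ok(())
}
```

Checking the decoded length against a limit before allocating keeps a corrupt or hostile prefix from triggering a multi-gigabyte allocation.
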
### What's Next
- S3: Announce & Discovery Protocol
- Future: integrate transport layer into `HybridRouter` / replace direct iroh usage

### Notes
- New transport layer sits alongside existing `P2pNode` — no breaking changes
- `IrohTransport` uses separate ALPN (`quicprochat/mesh/1`) to avoid conflicts with `P2pNode`
- Cannot run `cargo test`/`cargo clippy` in this environment (no Rust toolchain installed)