feat: add observability module and wire MeshNode run() with background tasks

Add health checks (/healthz), Prometheus metrics export (/metricsz),
and tracing spans to the P2P mesh node. MeshNode.run() starts the
periodic GC task and, if configured, the health server as background
tasks, and returns a RunHandle for lifecycle management. The health
endpoint returns 503 while the node drains during graceful shutdown.
2026-04-11 17:52:03 +02:00
parent 95ce8898fd
commit da0085f1a6
4 changed files with 592 additions and 2 deletions

@@ -1,5 +1,41 @@
# Status Log
## 2026-04-11 — Observability & MeshNode run() wiring
### Completed
- **observability.rs** — new module with health checks, Prometheus text export, HTTP server
  - `NodeHealth` struct with per-subsystem health checks (transport, routing, store)
  - `HealthStatus` enum (Healthy/Degraded/Draining/Unhealthy) with HTTP status codes
  - `prometheus_text()` — renders `MetricsSnapshot` in Prometheus exposition format
  - `HealthServer` — lightweight TCP-based HTTP server for `/healthz` and `/metricsz`
- **MeshNode.run()** — starts background tasks and returns a `RunHandle`
  - Periodic GC task (store, routing table, rate limiters) with configurable interval
  - Health/metrics HTTP server (optional, via `MeshNodeBuilder.health_listen()`)
  - Shutdown coordination via a `watch` channel
- **RunHandle** — public API for interacting with a running node
  - `.node()` — access to the MeshNode
  - `.health()` — current health snapshot
  - `.metrics_snapshot()` — current metrics
  - `.health_addr()` — bound health server address
  - `.shutdown()` — graceful shutdown (signals tasks + drains transports)
- **Tracing spans** — `#[tracing::instrument]` on `process_incoming()` and `send()`
  - Includes sender/dest address and payload length as span fields
  - GC cycle wrapped in `mesh_gc` info span
- **Draining flag** — `AtomicBool` for shutdown awareness; the health endpoint returns 503 while draining
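The run()/RunHandle lifecycle and the draining flag described above can be sketched with the standard library alone. This is not the module's code: the real implementation uses a tokio `watch` channel and async tasks, while this sketch swaps in a `std::sync::mpsc` channel and a thread so it compiles standalone. `SketchHandle`, the counter standing in for GC work, the worst-subsystem-wins aggregation, and the 200/503 codes for `Degraded`/`Unhealthy` are assumptions; only the 503-while-draining behaviour comes from the log.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::mpsc::{self, RecvTimeoutError, Sender};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// The four states listed in the log.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HealthStatus {
    Healthy,
    Degraded,
    Draining,
    Unhealthy,
}

impl HealthStatus {
    /// Assumed mapping: only Draining -> 503 is stated in the log;
    /// Degraded -> 200 and Unhealthy -> 503 are illustrative guesses.
    pub fn http_code(self) -> u16 {
        match self {
            HealthStatus::Healthy | HealthStatus::Degraded => 200,
            HealthStatus::Draining | HealthStatus::Unhealthy => 503,
        }
    }
}

/// Hypothetical handle: shared draining flag, a GC counter, and a shutdown sender.
pub struct SketchHandle {
    draining: Arc<AtomicBool>,
    gc_runs: Arc<AtomicU64>,
    stop_tx: Sender<()>,
    gc_task: Option<JoinHandle<()>>,
}

impl SketchHandle {
    /// Stand-in for MeshNode.run(): spawns a periodic GC loop on a thread.
    pub fn start(gc_interval: Duration) -> Self {
        let draining = Arc::new(AtomicBool::new(false));
        let gc_runs = Arc::new(AtomicU64::new(0));
        let (stop_tx, stop_rx) = mpsc::channel::<()>();
        let runs = Arc::clone(&gc_runs);
        let gc_task = thread::spawn(move || loop {
            // Wait one interval; a stop message or a dropped sender ends the loop.
            match stop_rx.recv_timeout(gc_interval) {
                Err(RecvTimeoutError::Timeout) => {
                    // Real code would prune the store, routing table, and rate limiters here.
                    runs.fetch_add(1, Ordering::Relaxed);
                }
                Ok(()) | Err(RecvTimeoutError::Disconnected) => break,
            }
        });
        SketchHandle { draining, gc_runs, stop_tx, gc_task: Some(gc_task) }
    }

    /// Draining overrides everything; otherwise the worst subsystem status wins.
    pub fn health(&self, subsystems: &[HealthStatus]) -> HealthStatus {
        if self.draining.load(Ordering::SeqCst) {
            return HealthStatus::Draining;
        }
        if subsystems.contains(&HealthStatus::Unhealthy) {
            HealthStatus::Unhealthy
        } else if subsystems.contains(&HealthStatus::Degraded) {
            HealthStatus::Degraded
        } else {
            HealthStatus::Healthy
        }
    }

    /// Number of completed GC cycles so far.
    pub fn gc_runs(&self) -> u64 {
        self.gc_runs.load(Ordering::Relaxed)
    }

    /// Mark draining first (so /healthz flips to 503), then stop and join the GC task.
    pub fn shutdown(&mut self) {
        self.draining.store(true, Ordering::SeqCst);
        let _ = self.stop_tx.send(());
        if let Some(task) = self.gc_task.take() {
            let _ = task.join();
        }
    }
}

fn main() {
    let mut handle = SketchHandle::start(Duration::from_millis(10));
    let subsystems = [HealthStatus::Healthy, HealthStatus::Degraded, HealthStatus::Healthy];
    println!("{:?}", handle.health(&subsystems)); // Degraded
    thread::sleep(Duration::from_millis(50));
    handle.shutdown();
    let status = handle.health(&subsystems);
    println!("{:?} -> {}", status, status.http_code()); // Draining -> 503
}
```

Setting the draining flag before stopping the GC task is deliberate in this sketch: the health endpoint can report 503 for the whole drain window rather than only after background work has stopped.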
### Test Coverage
- 232 total tests passing (212 lib + 3 fapp_flow + 1 meshservice + 16 multi_node)
- 7 new observability unit tests (health: healthy/degraded/draining; Prometheus format)
- Full workspace `cargo check` clean
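A Prometheus-format test like the one listed above would check text of roughly this shape. The sketch assumes a `MetricsSnapshot` with three made-up fields and `mesh_`-prefixed metric names, since the log does not show the real fields; the `# HELP` / `# TYPE` / sample-line layout is the standard Prometheus text exposition format.

```rust
use std::fmt::Write;

/// Hypothetical snapshot; the real MetricsSnapshot fields are not shown in the log.
pub struct MetricsSnapshot {
    pub messages_sent: u64,
    pub messages_received: u64,
    pub routing_table_size: u64,
}

/// Append one metric family: HELP and TYPE comment lines, then a single sample line.
fn write_metric(out: &mut String, name: &str, kind: &str, help: &str, value: u64) {
    // Writing into a String cannot fail, so unwrapping the fmt::Result is safe.
    writeln!(out, "# HELP {name} {help}").unwrap();
    writeln!(out, "# TYPE {name} {kind}").unwrap();
    writeln!(out, "{name} {value}").unwrap();
}

/// Render the snapshot; counters use the conventional `_total` suffix.
pub fn prometheus_text(s: &MetricsSnapshot) -> String {
    let mut out = String::new();
    write_metric(&mut out, "mesh_messages_sent_total", "counter", "Messages sent by this node.", s.messages_sent);
    write_metric(&mut out, "mesh_messages_received_total", "counter", "Messages received by this node.", s.messages_received);
    write_metric(&mut out, "mesh_routing_table_size", "gauge", "Entries currently in the routing table.", s.routing_table_size);
    out
}

fn main() {
    let snap = MetricsSnapshot { messages_sent: 42, messages_received: 40, routing_table_size: 7 };
    print!("{}", prometheus_text(&snap));
}
```

A format test can then assert on whole lines (for example that `# TYPE mesh_messages_sent_total counter` is present) rather than parsing the output.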
### What's Next
1. Wire `MeshNode.run()` into an example binary or the server
2. Announce loop task (periodic re-announce to neighbors)
3. Grafana dashboard for mesh metrics
4. Integration test for health HTTP endpoint
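For item 4, one possible shape for the health-endpoint integration test is to bind to port 0 on loopback, serve a single request on a background thread, and read the raw HTTP/1.1 response. `serve_one`, `get`, and `check` are hypothetical stand-ins, since the log does not show the `HealthServer` API; `/metricsz` is left out for brevity, and the 404 for unknown paths is an assumption. The 200/503 split follows the draining behaviour described in this entry.

```rust
use std::io::{BufRead, BufReader, Read, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

/// Hypothetical stand-in for HealthServer: answers one connection, then returns.
fn serve_one(listener: &TcpListener, draining: &AtomicBool) -> std::io::Result<()> {
    let (mut stream, _) = listener.accept()?;
    let path = {
        let mut reader = BufReader::new(&stream);
        let mut request_line = String::new();
        reader.read_line(&mut request_line)?;
        // Consume the remaining header lines up to the blank line, so no unread
        // request bytes are left when the socket closes.
        let mut header = String::new();
        while reader.read_line(&mut header)? > 0 && header != "\r\n" {
            header.clear();
        }
        // The request line looks like "GET /healthz HTTP/1.1".
        request_line.split_whitespace().nth(1).unwrap_or("").to_string()
    };
    let (status, body) = match path.as_str() {
        "/healthz" if draining.load(Ordering::SeqCst) => ("503 Service Unavailable", "draining"),
        "/healthz" => ("200 OK", "ok"),
        _ => ("404 Not Found", "not found"),
    };
    let response = format!(
        "HTTP/1.1 {status}\r\nContent-Type: text/plain\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{body}",
        body.len()
    );
    stream.write_all(response.as_bytes())
}

/// Client side: send one GET in a single write and read until the server closes.
fn get(addr: SocketAddr, path: &str) -> std::io::Result<String> {
    let mut stream = TcpStream::connect(addr)?;
    let request = format!("GET {path} HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n");
    stream.write_all(request.as_bytes())?;
    let mut response = String::new();
    stream.read_to_string(&mut response)?;
    Ok(response)
}

/// Serve one request with the given draining state and return the raw response.
fn check(draining: bool, path: &str) -> String {
    // Port 0 asks the OS for a free ephemeral port.
    let listener = TcpListener::bind("127.0.0.1:0").expect("bind");
    let addr = listener.local_addr().expect("local_addr");
    let flag = AtomicBool::new(draining);
    let server = thread::spawn(move || serve_one(&listener, &flag));
    let response = get(addr, path).expect("request");
    server.join().expect("server thread").expect("serve");
    response
}

fn main() {
    println!("{}", check(false, "/healthz").lines().next().unwrap()); // HTTP/1.1 200 OK
    println!("{}", check(true, "/healthz").lines().next().unwrap()); // HTTP/1.1 503 Service Unavailable
}
```

Binding before spawning the server thread means the client can connect immediately (the connection waits in the listen backlog), so the test needs no sleeps or retries.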
---
## 2026-04-01 — meshservice workspace integration
### Completed