feat: add draft data, gap analysis report, and workspace config

2026-04-06 18:47:15 +02:00
parent 4f310407b0
commit 2506b6325a
189 changed files with 62649 additions and 0 deletions

@@ -0,0 +1,60 @@
# Research Brief
## Problem framing
Fact: the analyzer identifies "Agent Error Recovery and Rollback" as a critical gap in the current IETF AI/agent landscape, especially within autonomous netops. Fact: the gap statement is specific. Current drafts discuss communication and coordination, but they do not define a common mechanism for machine-actionable failure signaling, rollback boundaries, or coordinated recovery across dependent agents.
Inference: this is a good topic for a first draft because it is narrower and more defensible than a full agent orchestration architecture, while still addressing a real interoperability and safety problem. Hypothesis: the best initial document is an experimental protocol or profile for failure, checkpoint, rollback-request, and rollback-result semantics, not a complete workflow language.
## Evidence from existing drafts
Fact: the gap report cites only six extracted ideas that partially touch this area. The strongest adjacent ideas are "Task-Oriented Multi-Agent Recovery Framework", "Inter-Agent Communication Protocol Requirements", and "State Consistency Management" from `draft-yue-anima-agent-recovery-networks`, plus "Mandatory restrictive failure behavior" from `draft-srijal-agents-policy`.
Fact: adjacent drafts in the space include `draft-li-dmsc-macp`, `draft-fu-nmop-agent-communication-framework`, `draft-mallick-muacp`, and `draft-zyyhl-agent-networks-framework`. These appear to focus on collaboration or communication frameworks, not interoperable rollback semantics.
Fact: the landscape overview shows high activity and overlap in adjacent categories, but little maturity on recovery. `draft-li-dmsc-macp` scores well overall, while `draft-fu-nmop-agent-communication-framework` is relevant but has lower maturity. Inference: this suggests there is ecosystem pressure for operational coordination, yet no shared recovery core has emerged.
Fact: the ideas corpus also shows related building blocks such as agent context propagation, working memory, authorization profiles, attestation, and policy enforcement. These matter because rollback decisions depend on shared execution context and trustworthy signaling, even if the rollback draft should not standardize those mechanisms itself.
## Overlap and adjacent work
Fact: `holistic-agent-ecosystem-draft-outlines.md` already frames recovery as part of a broader family and recommends using an execution-evidence substrate such as ECT rather than inventing a second DAG or token format. That same document suggests rollback should be represented through explicit checkpoint, error, rollback-request, and rollback-result events.
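To make the four event kinds concrete, the following is a minimal sketch of how checkpoint, error, rollback-request, and rollback-result events might be linked by parent references. Every name here (`EventKind`, `RecoveryEvent`, `causal_chain`, the field names) is hypothetical and chosen for illustration; none of it is taken from ECT or from any cited draft.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List, Optional


class EventKind(Enum):
    # The four event kinds named in the outline document.
    CHECKPOINT = "checkpoint"
    ERROR = "error"
    ROLLBACK_REQUEST = "rollback-request"
    ROLLBACK_RESULT = "rollback-result"


@dataclass(frozen=True)
class RecoveryEvent:
    # All field names are assumptions, not a proposed wire format.
    event_id: str
    kind: EventKind
    task_id: str
    parent_event_id: Optional[str] = None  # event this one responds to


def causal_chain(events: Iterable[RecoveryEvent], leaf_id: str) -> List[str]:
    """Follow parent links from a leaf event back to its root.

    Returns the event kinds in causal (root-first) order.
    """
    by_id = {e.event_id: e for e in events}
    kinds: List[str] = []
    seen = set()
    current = by_id.get(leaf_id)
    while current is not None and current.event_id not in seen:
        seen.add(current.event_id)  # guard against accidental cycles
        kinds.append(current.kind.value)
        parent = current.parent_event_id
        current = by_id.get(parent) if parent is not None else None
    return list(reversed(kinds))


events = [
    RecoveryEvent("e1", EventKind.CHECKPOINT, "t1"),
    RecoveryEvent("e2", EventKind.ERROR, "t1", "e1"),
    RecoveryEvent("e3", EventKind.ROLLBACK_REQUEST, "t1", "e2"),
    RecoveryEvent("e4", EventKind.ROLLBACK_RESULT, "t1", "e3"),
]
print(causal_chain(events, "e4"))
```

Modeling recovery as ordinary events with a parent link is one way to stay compatible with an existing execution-evidence substrate instead of defining a second graph format.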
Inference: the closest collision risk is not another rollback standard, but accidental overreach into three nearby topics:
- full task DAG and orchestration semantics
- human override and intervention
- dynamic trust and assurance
Inference: the architect should treat those as interfaces, not as primary scope. The rollback draft should define how recovery interacts with dependencies and checkpoints, while leaving workflow planning, trust scoring, and human escalation to companion work or future drafts.
## Gaps and unresolved questions
Fact: the current evidence does not yet establish a canonical wire format or transport for rollback signaling. Fact: the analyzer materials argue for reusing adjacent execution-evidence work, but do not prove that one specific substrate is mature enough to normatively depend on.
Open questions:
- What is the minimum mandatory information in a failure signal? Task identifier, parent dependency, failure class, reversibility, checkpoint reference, and rollback scope are likely candidates, but the exact set still needs comparison against existing drafts.
- Should rollback scope be defined as explicit dependency closure, implementation-local policy, or both?
- How should partially completed downstream actions be marked when they are not cleanly reversible?
- Which failures require automatic circuit breaking versus optional operator or policy input?
- Can the draft stay protocol-agnostic while still being testable by independent implementers?
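The first two open questions can be explored with a small sketch. It assumes a failure signal carrying the candidate fields listed above, and it treats rollback scope as the dependency closure of the failed task, optionally narrowed by an implementation-local policy (one possible reading of "both"). The field names, the `failure_class` values, and the choice to compute scope rather than carry it in the signal are all assumptions, not conclusions from the evidence.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set


@dataclass(frozen=True)
class FailureSignal:
    # Candidate fields from the open question; names are hypothetical.
    task_id: str
    parent_task_id: Optional[str]
    failure_class: str  # e.g. "timeout"; no taxonomy is defined yet
    reversible: bool
    checkpoint_ref: Optional[str]


def dependency_closure(failed_task: str,
                       dependents: Dict[str, List[str]]) -> Set[str]:
    """Return the failed task plus every task that transitively depends on it."""
    scope: Set[str] = set()
    stack = [failed_task]
    while stack:
        task = stack.pop()
        if task in scope:
            continue  # already visited; also protects against cycles
        scope.add(task)
        stack.extend(dependents.get(task, []))
    return scope


def rollback_scope(signal: FailureSignal,
                   dependents: Dict[str, List[str]],
                   local_policy: Optional[Callable[[str], bool]] = None) -> Set[str]:
    """Closure is the upper bound; a local policy may only narrow it."""
    closure = dependency_closure(signal.task_id, dependents)
    if local_policy is None:
        return closure
    return {task for task in closure if local_policy(task)}


# Maps each task to the tasks that depend on it (hypothetical graph).
dependents = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
signal = FailureSignal("b", "a", "timeout", True, "ckpt-7")

print(sorted(rollback_scope(signal, dependents)))
print(sorted(rollback_scope(signal, dependents, lambda t: t != "d")))
```

Treating the closure as an upper bound keeps the interoperable part deterministic and testable, while still leaving room for local policy to exclude tasks that cannot or should not be reverted.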
## Additional data worth investigating
- Verify whether WIMSE or ECT-related drafts already define reusable execution identifiers, parent linkage, or signed event records that would let this draft avoid inventing its own carrier.
- Inspect `draft-yue-anima-agent-recovery-networks` directly for concrete recovery states, not just its analyzer summary.
- Compare `draft-li-dmsc-macp` and `draft-fu-nmop-agent-communication-framework` for any existing error taxonomy, dependency model, or task lifecycle signaling.
- Search the ideas set for `checkpoint`, `rollback`, `error`, `failure`, `compensation`, and `circuit breaker` to see whether additional partially related mechanisms were missed by the headline gap report.
## Recommendation to the architect
Design the first draft as a narrowly scoped experimental specification for coordinated recovery semantics in multi-agent execution. Keep the document centered on:
- failure and checkpoint vocabulary
- task state transitions
- rollback request and result signaling
- dependency-aware rollback scope
- minimal security requirements for authentic and authorized recovery events
Avoid defining a new identity system, full orchestration language, human override workflow, or trust-scoring model. If a reusable execution-evidence substrate exists, bind to it; otherwise define a minimal abstract event model that can later be profiled onto specific carriers.
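As a starting point for the "task state transitions" item, the sketch below shows what a small, explicit transition table could look like. The state names and the allowed transitions are placeholders invented for illustration; the concrete recovery states in `draft-yue-anima-agent-recovery-networks` have not yet been inspected and may differ.

```python
from enum import Enum


class TaskState(Enum):
    # Hypothetical states; not taken from any existing draft.
    RUNNING = "running"
    CHECKPOINTED = "checkpointed"
    FAILED = "failed"
    ROLLING_BACK = "rolling-back"
    ROLLED_BACK = "rolled-back"
    ROLLBACK_FAILED = "rollback-failed"


# Allowed transitions: current state -> set of permitted next states.
ALLOWED = {
    TaskState.RUNNING: {TaskState.CHECKPOINTED, TaskState.FAILED},
    TaskState.CHECKPOINTED: {TaskState.RUNNING, TaskState.FAILED},
    TaskState.FAILED: {TaskState.ROLLING_BACK},
    TaskState.ROLLING_BACK: {TaskState.ROLLED_BACK, TaskState.ROLLBACK_FAILED},
    TaskState.ROLLED_BACK: set(),       # terminal
    TaskState.ROLLBACK_FAILED: set(),   # terminal; escalation is out of scope
}


def transition(current: TaskState, target: TaskState) -> TaskState:
    """Return the target state if the move is allowed, else raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


state = TaskState.RUNNING
for nxt in (TaskState.CHECKPOINTED, TaskState.FAILED, TaskState.ROLLING_BACK,
            TaskState.ROLLED_BACK):
    state = transition(state, nxt)
print(state.value)
```

A closed table of this kind is easy for independent implementers to test against, which supports the goal of staying protocol-agnostic without becoming untestable.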