feat: add draft data, gap analysis report, and workspace config

2026-04-06 18:47:15 +02:00
parent 4f310407b0
commit 2506b6325a
189 changed files with 62649 additions and 0 deletions

@@ -0,0 +1,70 @@
# User Spec
## Topic
Agent Error Recovery and Rollback for Multi-Agent Systems
## Goal
Produce a credible IETF-style Internet-Draft for a narrowly scoped mechanism that standardizes how cooperating agents report failures, define rollback scope, and execute coordinated recovery without cascading damage.
## Intended status
Experimental.
Rationale: the problem is clearly real and under-specified, but the ecosystem is still young and the mechanism should not pretend to have full deployment consensus yet.
## Problem to solve
Current AI-agent and autonomous-operations drafts define communication, identity, and orchestration patterns, but the landscape analysis shows no common mechanism for:
- signaling execution failure in a machine-actionable way
- declaring rollback boundaries and blast radius
- coordinating rollback across dependent agents
- recording recovery outcomes for audit and future trust decisions
This creates high interoperability and safety risk for autonomous systems that act across multiple services or domains.
## What must be true in the final draft
- The draft stays tightly scoped to recovery and rollback semantics, not a full agent architecture.
- The mechanism is protocol-agnostic enough to work across multiple agent ecosystems.
- The draft defines concrete states, triggers, and recovery procedures that two implementers could follow consistently.
- Security Considerations meaningfully address spoofed rollback, unauthorized override, replay, and denial-of-service by false failure signaling.
- The text is shaped like a real Internet-Draft, not a product design memo.
- The draft clearly states what is in scope now and what is deferred to later work such as richer workflow orchestration or dynamic trust scoring.
## Constraints
- scope constraints
  Keep this to rollback and recovery coordination. Do not absorb lifecycle management, full workflow DAG standardization, or human override into the core mechanism except where needed as interfaces.
- compatibility constraints
  Reuse adjacent concepts where possible from existing IETF-style work on execution evidence, attestation, or agent communication. Do not invent a full new identity or transport stack.
- terminology constraints
  Use conservative standards language. Prefer terms like agent, execution, checkpoint, rollback set, dependency, and recovery record. Avoid buzzwords and branding.
## Source materials to prioritize
- `/home/c/projects/ietf-draft-analyzer/data/reports/gaps.md`
- `/home/c/projects/ietf-draft-analyzer/data/reports/holistic-agent-ecosystem-draft-outlines.md`
- `/home/c/projects/ietf-draft-analyzer/data/reports/ideas.md`
- `/home/c/projects/ietf-draft-analyzer/data/reports/overview.md`
- `draft-yue-anima-agent-recovery-networks`
- `draft-li-dmsc-macp`
- `draft-fu-nmop-agent-communication-framework`
- `draft-srijal-agents-policy`
- related WIMSE or ECT materials when they help avoid redefining execution evidence
## Success criteria
- A reader can tell exactly what an agent must emit or process when a task fails.
- A reader can tell how rollback scope is determined and how dependent agents respond.
- The draft includes enough structure to support interoperability testing later.
- Specialist reviewers can criticize the draft on substance rather than on missing basic sections or obvious ambiguity.
## Questions for the team
- What is the smallest interoperable core for rollback semantics?
- Should checkpoints and recovery records be abstract objects, protocol messages, or profileable metadata on top of another carrier?
- What information is mandatory in a failure signal versus optional?
- How should rollback interact with partially completed downstream work?

@@ -0,0 +1,27 @@
# Cycle Status
## Summary
- cycle: agent-error-recovery-rollback
- version: v1
- last updated: 2026-03-02 18:00 UTC
## Artifact Status
- `00-user-spec.md`: written
- `10-research-brief.md`: written
- `20-architecture-brief.md`: written
- `30-outline.md`: written
- `40-draft-v1.md`: written
- `50-reviews-v1/security.md`: written
- `50-reviews-v1/software.md`: written
- `50-reviews-v1/architecture.md`: written
- `50-reviews-v1/ietf-senior.md`: written
- `55-review-synthesis-v1.md`: written
- `60-revision-plan-v1.md`: written
## Notes
- written means the artifact contains substantive content.
- stub means the file exists but still appears to be a placeholder.
- missing means the expected file has not been created.

@@ -0,0 +1,27 @@
# Cycle Status
## Summary
- cycle: agent-error-recovery-rollback
- version: v2
- last updated: 2026-03-02 18:06 UTC
## Artifact Status
- `00-user-spec.md`: written
- `10-research-brief.md`: written
- `20-architecture-brief.md`: written
- `30-outline.md`: written
- `40-draft-v2.md`: written
- `50-reviews-v2/security.md`: stub
- `50-reviews-v2/software.md`: stub
- `50-reviews-v2/architecture.md`: stub
- `50-reviews-v2/ietf-senior.md`: stub
- `55-review-synthesis-v2.md`: stub
- `60-revision-plan-v2.md`: stub
## Notes
- written means the artifact contains substantive content.
- stub means the file exists but still appears to be a placeholder.
- missing means the expected file has not been created.

@@ -0,0 +1,60 @@
# Research Brief
## Problem framing
Fact: the analyzer identifies Agent Error Recovery and Rollback as a critical gap in the current IETF AI/agent landscape, especially within autonomous netops. Fact: the gap statement is specific: current drafts discuss communication and coordination, but do not define a common mechanism for machine-actionable failure signaling, rollback boundaries, or coordinated recovery across dependent agents.
Inference: this is a good first draft topic because it is narrower and more defensible than a full agent orchestration architecture, while still addressing a real interoperability and safety problem. Hypothesis: the best initial document is an experimental protocol or profile for failure, checkpoint, rollback-request, and rollback-result semantics, not a complete workflow language.
## Evidence from existing drafts
Fact: the gap report cites only six extracted ideas that partially touch this area. The strongest adjacent ideas are "Task-Oriented Multi-Agent Recovery Framework", "Inter-Agent Communication Protocol Requirements", and "State Consistency Management" from `draft-yue-anima-agent-recovery-networks`, plus "Mandatory restrictive failure behavior" from `draft-srijal-agents-policy`.
Fact: adjacent drafts in the space include `draft-li-dmsc-macp`, `draft-fu-nmop-agent-communication-framework`, `draft-mallick-muacp`, and `draft-zyyhl-agent-networks-framework`. These appear to focus on collaboration or communication frameworks, not interoperable rollback semantics.
Fact: the landscape overview shows high activity and overlap in adjacent categories, but not maturity on recovery. `draft-li-dmsc-macp` scores well overall, while `draft-fu-nmop-agent-communication-framework` is relevant but of lower maturity. This suggests there is ecosystem pressure for operational coordination, yet no shared recovery core has emerged.
Fact: the ideas corpus also shows related building blocks such as agent context propagation, working memory, authorization profiles, attestation, and policy enforcement. These matter because rollback decisions depend on shared execution context and trustworthy signaling, even if the rollback draft should not standardize those mechanisms itself.
## Overlap and adjacent work
Fact: `holistic-agent-ecosystem-draft-outlines.md` already frames recovery as part of a broader family and recommends using an execution-evidence substrate such as ECT rather than inventing a second DAG or token format. That same document suggests rollback should be represented through explicit checkpoint, error, rollback-request, and rollback-result events.
Inference: the closest collision risk is not another rollback standard, but accidental overreach into three nearby topics:
- full task DAG and orchestration semantics
- human override and intervention
- dynamic trust and assurance
Inference: the architect should treat those as interfaces, not as primary scope. The rollback draft should define how recovery interacts with dependencies and checkpoints, while leaving workflow planning, trust scoring, and human escalation to companion work or future drafts.
## Gaps and unresolved questions
Fact: the current evidence does not yet establish a canonical wire format or transport for rollback signaling. Fact: the analyzer materials argue for reusing adjacent execution-evidence work, but do not prove that one specific substrate is mature enough to normatively depend on.
Open questions:
- What is the minimum mandatory information in a failure signal? Task identifier, parent dependency, failure class, reversibility, checkpoint reference, and rollback scope are likely candidates, but the exact set still needs comparison against existing drafts.
- Should rollback scope be defined as explicit dependency closure, implementation-local policy, or both?
- How should partially completed downstream actions be marked when they are not cleanly reversible?
- Which failures require automatic circuit breaking versus optional operator or policy input?
- Can the draft stay protocol-agnostic while still being testable by independent implementers?
## Additional data worth investigating
- Verify whether WIMSE or ECT-related drafts already define reusable execution identifiers, parent linkage, or signed event records that would let this draft avoid inventing its own carrier.
- Inspect `draft-yue-anima-agent-recovery-networks` directly for concrete recovery states, not just its analyzer summary.
- Compare `draft-li-dmsc-macp` and `draft-fu-nmop-agent-communication-framework` for any existing error taxonomy, dependency model, or task lifecycle signaling.
- Search the ideas set for `checkpoint`, `rollback`, `error`, `failure`, `compensation`, and `circuit breaker` to see whether additional partially related mechanisms were missed by the headline gap report.
## Recommendation to the architect
Design the first draft as a narrowly scoped experimental specification for coordinated recovery semantics in multi-agent execution. Keep the document centered on:
- failure and checkpoint vocabulary
- task state transitions
- rollback request and result signaling
- dependency-aware rollback scope
- minimal security requirements for authentic and authorized recovery events
Avoid defining a new identity system, full orchestration language, human override workflow, or trust-scoring model. If a reusable execution-evidence substrate exists, bind to it; otherwise define a minimal abstract event model that can later be profiled onto specific carriers.

@@ -0,0 +1,121 @@
# Architecture Brief
## Scope
Define an experimental, protocol-agnostic recovery model for multi-agent execution that standardizes:
- failure signaling
- checkpoint references
- rollback request and rollback result semantics
- dependency-aware rollback scope
- minimum task state transitions relevant to recovery
The document should be narrow enough that an existing agent protocol or execution-evidence carrier can adopt it as a profile or extension.
## Non-goals
- defining a full workflow or DAG language
- defining human override or approval workflows beyond a hook for escalation
- defining identity, authentication, or attestation systems
- defining global trust scoring or reputation exchange
- defining scheduler behavior, quota fairness, or resource arbitration beyond optional future hooks
## Terminology and actors
- `agent`: autonomous software entity performing one or more tasks
- `task`: a discrete unit of work whose execution and outcome can be referenced
- `dependency`: another task whose outcome affects whether the current task may continue or must roll back
- `checkpoint`: a recorded pre-action or recovery-safe state from which rollback may proceed
- `failure event`: a machine-actionable signal that a task or dependency failed
- `rollback set`: the set of tasks and effects that the sender requests to revert or compensate
- `recovery record`: a record of rollback attempt, success, partial success, or failure
- `coordinator`: optional role that computes rollback scope across multiple dependent agents
Actors:
- originating agent that detects failure
- dependent agent that receives failure or rollback signals
- optional coordination service or gateway
- policy authority or operator only when automatic rollback is disallowed
## Protocol or data model shape
Use an abstract event model with four core event types:
1. `checkpoint`
2. `failure`
3. `rollback-request`
4. `rollback-result`
Each event should carry a minimum common envelope:
- event identifier
- task identifier
- workflow or execution context identifier if available
- sender identity reference
- timestamp
- referenced parent task or dependency identifiers where relevant
Event-specific content:
- `checkpoint`: checkpoint identifier, reversibility class, optional expiry
- `failure`: failure class, severity, reversibility indicator, blast-radius hint, failed dependency reference
- `rollback-request`: target checkpoint or rollback boundary, requested rollback scope, reason code, urgency, idempotency token
- `rollback-result`: outcome status, actual scope applied, partial rollback indicators, residual risk or manual follow-up required
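To make the envelope and event-specific content concrete, here is a minimal sketch of the common envelope plus a `failure` event as Python dataclasses. The attribute names, the UUID event identifier, and the choice of which failure fields are optional are all assumptions for illustration; this brief does not yet fix field names or a mandatory/optional split.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional
import uuid


@dataclass(frozen=True)
class Envelope:
    """Common envelope; attribute names are hypothetical."""
    task_id: str
    sender: str  # sender identity reference, opaque to this model
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    context_id: Optional[str] = None  # workflow or execution context, if available
    parent_task_ids: tuple[str, ...] = ()  # parent task or dependency references


@dataclass(frozen=True)
class FailureEvent:
    """Failure event content; the optional/required split here is an assumption."""
    envelope: Envelope
    failure_class: str
    reversible: bool
    severity: Optional[str] = None
    blast_radius_hint: Optional[str] = None
    failed_dependency: Optional[str] = None
```

The other three event types would wrap the same `Envelope` in the same way, which keeps the carrier-independent part of the model in one place.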
State model:
- `pending`
- `running`
- `completed`
- `failed`
- `rollback-requested`
- `rolled-back`
- `rollback-failed`
- `compensation-required`
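The list above names the states but not the allowed transitions between them. A minimal sketch of one possible transition table follows; the specific edges are an assumption for illustration and would need to be settled in the draft itself.

```python
from enum import Enum


class TaskState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    ROLLBACK_REQUESTED = "rollback-requested"
    ROLLED_BACK = "rolled-back"
    ROLLBACK_FAILED = "rollback-failed"
    COMPENSATION_REQUIRED = "compensation-required"


# Hypothetical transition table: the brief lists the states, not the edges.
ALLOWED = {
    TaskState.PENDING: {TaskState.RUNNING},
    TaskState.RUNNING: {TaskState.COMPLETED, TaskState.FAILED},
    TaskState.COMPLETED: {TaskState.ROLLBACK_REQUESTED},
    TaskState.FAILED: {TaskState.ROLLBACK_REQUESTED},
    TaskState.ROLLBACK_REQUESTED: {
        TaskState.ROLLED_BACK,
        TaskState.ROLLBACK_FAILED,
        TaskState.COMPENSATION_REQUIRED,
    },
    TaskState.ROLLED_BACK: set(),
    TaskState.ROLLBACK_FAILED: {TaskState.COMPENSATION_REQUIRED},
    TaskState.COMPENSATION_REQUIRED: set(),
}


def transition(current: TaskState, target: TaskState) -> TaskState:
    """Return the target state, or raise if the edge is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Writing the edges down as data makes it easy for two implementers to compare their assumptions, for example whether a `completed` task may enter `rollback-requested` at all.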
Design choice: keep the carrier abstract in this first draft, but include a section describing how the model may bind to existing execution-evidence formats if such a substrate is available and sufficiently mature.
## Normative requirements candidates
- Agents MUST emit a failure event when a task failure can affect dependent execution outside local process scope.
- Failure events MUST identify the failed task and SHOULD identify affected dependencies when known.
- Rollback requests MUST be idempotent and uniquely identifiable.
- Agents receiving a rollback request MUST return a rollback result, even when rollback is refused or only partially completed.
- A rollback result MUST indicate one of: success, partial success, refusal, irreversible, or failure.
- Agents MUST NOT claim successful rollback unless the referenced effects were actually reverted or explicitly compensated.
- If a task is not reversible, the agent MUST signal that fact explicitly rather than silently ignoring rollback.
- Implementations SHOULD support checkpoint references when a task has externally visible side effects.
- The specification SHOULD allow policy-controlled escalation rather than requiring automatic rollback for every failure.
- The document MUST distinguish rollback of prior effects from cancellation of work that has not yet executed.
## Security, privacy, and abuse considerations
- unauthorized rollback requests could be used as denial-of-service
- spoofed failure signals could trigger cascading rollback
- replayed rollback requests could repeatedly unwind completed work
- rollback metadata may expose internal topology or sensitive task relationships
- partial rollback can create inconsistent downstream state that attackers can exploit
- signed or otherwise authenticated event carriage is strongly preferred, but the draft should avoid redefining base authentication
- the draft should require clear handling of refusal, partial rollback, and policy escalation to avoid silent unsafe states
Privacy is a secondary but real concern: task identifiers, dependency graphs, and failure reasons can leak operational details.
## IANA impact
Most likely minimal for the first version.
If the draft defines abstract event or reason-code registries, keep them compact:
- rollback event types
- failure classes
- rollback outcome codes
If an existing registry from an underlying carrier can be reused, prefer that.
## Open design questions
- Should rollback scope be defined normatively as dependency closure, or left partially implementation-specific with mandatory disclosure of actual scope?
- Is a separate `cancellation` event needed, or is that explicitly out of scope for this draft?
- How much of checkpoint semantics should be mandatory versus profile-specific?
- Can one draft stay both carrier-agnostic and implementable, or does it need a non-normative binding example to avoid vagueness?

@@ -0,0 +1,79 @@
# Draft Outline
## Abstract
State that the document defines experimental recovery semantics for multi-agent task execution, including failure signaling, rollback requests, rollback results, and checkpoint references. Make clear it is protocol-agnostic and intended to improve interoperable recovery behavior across agent ecosystems.
## Section plan
1. Introduction
2. Terminology
3. Problem Statement and Design Goals
4. Recovery Model Overview
5. Event Types and Required Fields
6. Task States and Recovery Procedures
7. Rollback Scope and Dependency Handling
8. Error Conditions and Partial Rollback
9. Security Considerations
10. Privacy Considerations
11. IANA Considerations
12. References
## Author guidance by section
### 1. Introduction
Explain why autonomous multi-agent systems need interoperable recovery behavior. Keep this grounded in failure propagation and operational safety, not generic AI rhetoric.
### 2. Terminology
Define only the core terms needed for this document: task, dependency, checkpoint, failure event, rollback set, recovery record, coordinator. Keep terms stable and conservative.
### 3. Problem Statement and Design Goals
Describe the exact gap: current drafts define communication and orchestration patterns, but no common rollback semantics. Include explicit goals such as idempotency, partial rollback transparency, and protocol-agnostic applicability.
### 4. Recovery Model Overview
Describe the model at a high level before any field-level detail. Separate local failure handling from cross-agent recovery signaling. Make clear what this document does not define.
### 5. Event Types and Required Fields
Define `checkpoint`, `failure`, `rollback-request`, and `rollback-result`. This section must specify required versus optional fields and avoid vague "metadata may include" language where interoperability depends on a field.
### 6. Task States and Recovery Procedures
Define the state transitions relevant to failure and rollback. Include procedure ordering: detect failure, emit failure event, decide rollback scope, send rollback request, emit rollback result. If escalation is possible, say when.
### 7. Rollback Scope and Dependency Handling
Define how dependencies influence rollback. Be explicit about direct versus transitive effects, what happens when scope is uncertain, and how actual applied scope is reported back.
### 8. Error Conditions and Partial Rollback
Handle non-reversible tasks, refusal, timeout, duplicate requests, and partial success. This section is important for implementability and must not collapse into generic prose.
### 9. Security Considerations
Address spoofing, replay, unauthorized rollback, false failure signaling, topology leakage, and abuse of partial rollback states. The section should be mechanism-specific.
### 10. Privacy Considerations
Address exposure of task identifiers, failure causes, dependency graphs, and sensitive operational details.
### 11. IANA Considerations
Either clearly say none, or request small registries for failure classes and rollback outcomes. Do not hand-wave this.
### 12. References
Use placeholders where necessary, but include adjacent drafts that informed the design and any underlying execution-evidence substrate if referenced.
## Issues that must not be hand-waved
- what fields are mandatory in each event
- what counts as a successful versus partial rollback
- how rollback requests remain idempotent
- what an agent does when a requested rollback is impossible
- how dependency-driven rollback scope is determined and reported
- what security properties the mechanism relies on from lower layers

@@ -0,0 +1,216 @@
# Draft
## Abstract
This document defines experimental recovery semantics for multi-agent task execution. It specifies common event types for failure signaling, checkpoint reference, rollback requests, and rollback results so that cooperating agents can coordinate recovery after operational faults. The mechanism is protocol-agnostic and is intended to be profiled onto existing agent communication or execution-evidence substrates. The goal is to improve interoperability when autonomous systems must contain failures, report rollback scope, and communicate partial or unsuccessful recovery without silent divergence.
## 1. Introduction
Multi-agent systems increasingly perform coordinated work across services, tools, and administrative domains. In such systems, one task failure can invalidate downstream work, require compensating actions, or force a broader rollback of externally visible effects. Existing drafts define communication frameworks, discovery, identity, and broader orchestration concepts, but they do not define a shared recovery core that independent implementations can follow.
Absent common recovery semantics, one implementation may silently retry while another expects explicit rollback, and a third may report only local failure without describing downstream consequences. That mismatch creates interoperability risk and operational safety risk, especially when agents act without immediate human supervision.
This document defines a narrow recovery model for cross-agent failure handling. It does not define a full workflow language, a transport binding, or a human override system. Instead, it defines event semantics and minimum procedure rules so that agents can exchange recovery-relevant information consistently.
## 2. Terminology
The key words "MUST", "MUST NOT", "SHOULD", "SHOULD NOT", and "MAY" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals.
Agent: an autonomous software entity that performs one or more tasks and may exchange recovery events with peers.
Task: a discrete unit of work whose execution and outcome can be identified.
Dependency: a relationship in which one task relies on the prior completion, state, or side effects of another task.
Checkpoint: a recorded state or recovery-safe reference from which rollback can proceed.
Failure Event: a machine-actionable record that a task or dependency failed in a way that can affect other participants.
Rollback Set: the set of tasks, effects, or checkpoints that a rollback request identifies as the intended recovery scope.
Recovery Record: a record of rollback attempt, refusal, partial rollback, success, or failure.
Coordinator: an optional component that computes or distributes rollback scope across multiple agents.
Compensation: a follow-up action that mitigates an irreversible effect when direct rollback is not possible.
## 3. Problem Statement
Current agent ecosystems have uneven support for failure handling. Some drafts discuss task coordination or operational recovery, but the analyzed landscape still lacks a common method to express:
- that a task failed in a cross-agent relevant way,
- which dependencies are affected,
- which checkpoint or rollback boundary should be used, and
- whether rollback succeeded, only partially succeeded, or was impossible.
The absence of these common semantics makes independent implementation difficult. An originating agent may believe it has requested rollback, while a receiving agent may treat the same signal as informational. Similarly, partial rollback can leave downstream agents operating on inconsistent assumptions if outcome reporting is underspecified.
The design goals for this document are:
- protocol-agnostic applicability,
- minimal mandatory fields for interoperability,
- idempotent rollback requests,
- explicit reporting of partial or impossible rollback, and
- compatibility with existing lower-layer identity and integrity mechanisms.
## 4. Recovery Model Overview
This document defines four event types:
- `checkpoint`
- `failure`
- `rollback-request`
- `rollback-result`
These events MAY be carried in a message protocol, stored as execution records, or embedded in a larger workflow substrate. This document does not standardize the carrier. It standardizes the meaning of the events and the minimum information needed for interoperable recovery behavior.
Each event has a common envelope containing:
- an event identifier,
- a task identifier,
- a sender identity reference,
- a timestamp, and
- any relevant workflow or execution context identifier.
The recovery model assumes that a failure can be local or cross-agent relevant. Local failures that cannot affect any external dependency do not require signaling under this document. When a failure can affect dependent work outside local scope, the originating agent MUST emit a `failure` event.
If rollback is needed, the requester sends a `rollback-request` identifying the requested scope. The receiver returns a `rollback-result` stating whether the requested recovery succeeded, partially succeeded, was refused, was irreversible, or failed.
## 5. Event Types and Required Fields
### 5.1 Checkpoint
A `checkpoint` event identifies a recovery-safe reference that later rollback may target. A checkpoint event MUST include:
- event identifier,
- task identifier,
- checkpoint identifier,
- sender identity reference,
- timestamp.
A checkpoint event SHOULD include reversibility class and MAY include checkpoint expiry or retention information.
### 5.2 Failure
A `failure` event reports a task failure that can affect dependent execution outside local process scope. A failure event MUST include:
- event identifier,
- failed task identifier,
- sender identity reference,
- timestamp,
- failure class,
- reversibility indicator.
A failure event SHOULD include affected dependency identifiers when known, and MAY include severity, blast-radius hint, or checkpoint reference.
### 5.3 Rollback Request
A `rollback-request` event asks another participant to revert or compensate previously applied effects. A rollback request MUST include:
- event identifier,
- requester identity reference,
- target task identifier or checkpoint identifier,
- requested rollback scope,
- idempotency token,
- timestamp.
A rollback request SHOULD include reason code and urgency. A rollback request MAY include dependency evidence or policy reference supporting the request.
### 5.4 Rollback Result
A `rollback-result` event reports the outcome of processing a rollback request. A rollback result MUST include:
- event identifier,
- referenced rollback-request identifier,
- responder identity reference,
- outcome code,
- timestamp,
- actual scope applied.
The outcome code MUST be one of:
- `success`
- `partial-success`
- `refused`
- `irreversible`
- `failure`
A rollback result SHOULD include residual risk description when the result is not `success`. A rollback result MAY include compensation details.
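The following non-normative sketch shows how an implementation might check the MUST-level fields of Sections 5.1 through 5.4 before accepting an event. The dictionary key names (`event_id`, `rollback_scope`, and so on) are hypothetical spellings of the fields listed above; this document does not define a wire encoding.

```python
# Key names are hypothetical; the draft defines field meanings, not spellings.
REQUIRED_FIELDS = {
    "checkpoint": {"event_id", "task_id", "checkpoint_id", "sender", "timestamp"},
    "failure": {"event_id", "task_id", "sender", "timestamp",
                "failure_class", "reversibility"},
    "rollback-request": {"event_id", "requester", "rollback_scope",
                         "idempotency_token", "timestamp"},
    "rollback-result": {"event_id", "request_id", "responder", "outcome",
                        "timestamp", "actual_scope"},
}
OUTCOME_CODES = {"success", "partial-success", "refused", "irreversible", "failure"}


def validate_event(event_type: str, event: dict) -> list[str]:
    """Return a list of problems; an empty list means all MUST fields are present."""
    if event_type not in REQUIRED_FIELDS:
        return [f"unknown event type: {event_type}"]
    present = set(event)
    problems = [f"missing field: {name}"
                for name in sorted(REQUIRED_FIELDS[event_type] - present)]
    # A rollback request needs a target task identifier or a checkpoint identifier.
    if event_type == "rollback-request" and not (
        {"target_task_id", "target_checkpoint_id"} & present
    ):
        problems.append("missing target: need target_task_id or target_checkpoint_id")
    if event_type == "rollback-result" and "outcome" in event:
        if event["outcome"] not in OUTCOME_CODES:
            problems.append(f"invalid outcome code: {event['outcome']}")
    return problems
```

Returning a list of problems rather than raising on the first one is a design choice for the sketch: it lets a receiver report every missing field in a single refusal.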
## 6. Task States and Recovery Procedures
For purposes of this document, relevant task states are:
- `pending`
- `running`
- `completed`
- `failed`
- `rollback-requested`
- `rolled-back`
- `rollback-failed`
- `compensation-required`
When an agent detects a task failure that can affect external dependents, it MUST transition the affected task to `failed` and emit a `failure` event. If policy permits automatic recovery, the originating agent or coordinator SHOULD determine the rollback set and issue one or more `rollback-request` events. If policy does not permit automatic rollback, the implementation SHOULD enter a local hold or escalation path rather than silently continuing.
An agent receiving a `rollback-request` MUST process duplicate requests idempotently. If the request can be honored, the agent applies rollback or compensation as appropriate and emits a `rollback-result`. If the request cannot be honored because the effect is irreversible or the requester is not authorized, the agent MUST emit a `rollback-result` with the appropriate outcome code (`irreversible` or `refused`, respectively).
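As a non-normative illustration of idempotent processing, the sketch below caches one `rollback-result` per idempotency token so that a duplicate request returns the earlier result without re-applying any effect. The class name, the `apply_rollback` hook, the dictionary keys, and the in-memory cache are assumptions; a real deployment would likely need durable storage.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RollbackResponder:
    """Caches one rollback-result per idempotency token (in-memory, illustrative)."""
    apply_rollback: Callable[[dict], dict]  # hypothetical hook: request -> result
    _results: dict = field(default_factory=dict)

    def handle(self, request: dict) -> dict:
        token = request["idempotency_token"]
        if token in self._results:
            # Duplicate request: return the earlier result, do not re-apply effects.
            return self._results[token]
        result = self.apply_rollback(request)
        self._results[token] = result
        return result
```

Keying the cache on the idempotency token alone assumes tokens are unique across requesters; an implementation that cannot assume this could key on the requester identity and token together.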
This document distinguishes rollback from cancellation. Cancellation of work not yet started is out of scope except where a local implementation uses cancellation internally to satisfy a rollback request.
## 7. Rollback Scope and Dependency Handling
Rollback scope is central to interoperability. A rollback request MUST identify either:
- a target checkpoint, or
- an explicit rollback set.
When transitive dependencies are known, the requester SHOULD include them or indicate that transitive evaluation is required. When dependency knowledge is incomplete, the requester MUST still identify the minimum known affected scope and the responder MUST report the actual scope applied in the rollback result.
An implementation MUST NOT report successful rollback for effects outside the applied scope. If only part of the requested rollback set is reversed, the responder MUST return `partial-success` and describe any remaining irreversible or uncompensated effects.
A coordinator MAY compute rollback scope across multiple agents, but this document does not require a coordinator role. Peers can interoperate directly as long as they provide the required event information.
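The following non-normative sketch shows one way a requester or coordinator might derive a rollback set from known dependencies, and how a responder might compare requested and applied scope. The "task to dependents" map and the function names are assumptions; this document does not define how dependency knowledge is represented.

```python
from collections import deque


def transitive_dependents(failed_task: str, dependents: dict[str, list[str]]) -> set[str]:
    """Breadth-first walk over a 'task -> tasks that depend on it' map."""
    seen: set[str] = set()
    queue = deque([failed_task])
    while queue:
        task = queue.popleft()
        for child in dependents.get(task, []):
            # Skip already-visited tasks so cyclic dependency data terminates.
            if child not in seen and child != failed_task:
                seen.add(child)
                queue.append(child)
    return seen


def classify_outcome(requested: set[str], applied: set[str]) -> str:
    """Compare requested scope with the scope actually reverted.

    Only distinguishes success, partial-success, and failure; deciding between
    refused, irreversible, and failure needs information this sketch lacks.
    """
    if requested <= applied:
        return "success"
    if applied & requested:
        return "partial-success"
    return "failure"
```

In this sketch `success` is reported only when every requested task is in the applied scope, which matches the rule that a responder must not claim success for effects it did not revert.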
## 8. Error Conditions and Partial Rollback
The following conditions require explicit handling:
- duplicate rollback requests,
- timeout while waiting for rollback completion,
- refusal due to insufficient authorization,
- irreversible effects,
- partial rollback where some effects are reversed and others remain,
- failure of the rollback procedure itself.
If a requested rollback is impossible, the responding agent MUST indicate `irreversible` or `failure` as appropriate and SHOULD indicate whether compensation is available. If a request is refused for policy reasons, the agent MUST indicate `refused` and SHOULD include a reason that is usable by the requester or an external policy authority.
Implementations SHOULD avoid silent downgrade from rollback to best-effort local cleanup. If only local cleanup occurred, the rollback result SHOULD say so clearly.
## 9. Security Considerations
Unauthorized rollback requests can be used to deny service or corrupt coordinated work. Implementations therefore need an authenticated and authorized carriage for the events defined here, even though this document does not define the underlying security protocol.
Spoofed failure events can trigger unnecessary rollback. Replay of old rollback requests can repeatedly unwind valid work. Implementations SHOULD provide replay resistance and SHOULD bind requests and results to stable task and requester identifiers.
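One possible shape for replay resistance, shown as a non-normative sketch, is to combine a freshness window on the request timestamp with a record of tokens already seen per requester. The class name, the 300-second default, and the use of wall-clock time are assumptions, and the check presumes the request was already authenticated by the underlying carrier.

```python
import time
from typing import Optional


class ReplayGuard:
    """Flags rollback requests that are stale or whose token was already seen."""

    def __init__(self, max_age_seconds: float = 300.0):
        self.max_age_seconds = max_age_seconds  # assumed freshness window
        self._seen: dict[tuple[str, str], float] = {}

    def accept(self, requester: str, token: str, timestamp: float,
               now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        # Entries outside the window can be dropped: the age check rejects them anyway.
        self._seen = {k: t for k, t in self._seen.items()
                      if abs(now - t) <= self.max_age_seconds}
        if abs(now - timestamp) > self.max_age_seconds:
            return False
        key = (requester, token)
        if key in self._seen:
            return False
        self._seen[key] = timestamp
        return True
```

A `False` return here means "do not apply rollback effects again". For a duplicate inside the freshness window, a responder can still answer with its earlier rollback result so that idempotent retries by a legitimate requester keep working.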
Partial rollback is itself a security concern because it can leave downstream systems in an inconsistent state that an attacker can exploit. For that reason, responders MUST explicitly report residual scope and any remaining irreversible effects.
Failure and rollback metadata can also reveal topology, task dependencies, and operational weaknesses. Deployments SHOULD minimize unnecessary disclosure and SHOULD apply least-privilege access to recovery records.
## 10. Privacy Considerations
Task identifiers, failure classes, dependency relationships, and reason codes may expose sensitive operational details. In some deployments, these details can reveal user behavior, internal service structure, or policy logic.
Implementations SHOULD disclose only the information necessary for interoperable recovery. If a deployment requires broader analytics or audit retention, that policy is deployment-specific and outside the scope of this document.
## 11. IANA Considerations
This document currently requests no IANA action.
Future versions may request compact registries for failure classes, rollback outcome codes, or event type identifiers if implementation experience shows that fixed interoperation points are needed.
## 12. References
- [RFC2119] Key words for use in RFCs to Indicate Requirement Levels.
- [RFC8174] Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words.
- Placeholder reference for adjacent execution-evidence substrate, if adopted.
- Placeholder reference for `draft-yue-anima-agent-recovery-networks`.
- Placeholder reference for `draft-li-dmsc-macp`.
- Placeholder reference for `draft-fu-nmop-agent-communication-framework`.


@@ -0,0 +1,242 @@
# Draft
## Abstract
This document defines experimental recovery semantics for multi-agent task execution. It specifies interoperable event semantics for failure signaling, checkpoint reference, rollback requests, and rollback results so that cooperating agents can coordinate recovery after operational faults. The mechanism is carrier-agnostic and is intended to be profiled onto existing agent communication or execution-evidence substrates. It addresses an interoperability gap in current agent systems: different implementations can detect the same failure yet diverge materially in how they request rollback, report applied scope, and disclose partial or irreversible outcomes.
## 1. Introduction
Multi-agent systems increasingly perform coordinated work across services, tools, and administrative domains. In such systems, one task failure can invalidate downstream work, require compensating actions, or force a broader rollback of externally visible effects. Existing drafts define communication frameworks, discovery, identity, and broader orchestration concepts, but they do not yet provide a small interoperable recovery core that independent implementations can share.
Without common recovery behavior, one implementation may silently retry while another expects explicit rollback, and a third may report only local failure without describing downstream consequences. Those differences are not just operationally inconvenient; they create genuine safety and interoperability risk when agents act without immediate human supervision.
This document therefore defines an abstract recovery protocol model for cross-agent failure handling. It does not define a workflow language, a transport binding, or a human override system. It does define required event meaning, minimum fields, authorization and replay expectations, rollback-scope reporting, and outcome reporting sufficient for interoperable recovery behavior.
The intended status of this document is Experimental.
## 2. Terminology
The key words "MUST", "MUST NOT", "SHOULD", "SHOULD NOT", and "MAY" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals.
Agent: an autonomous software entity that performs one or more tasks and may exchange recovery events with peers.
Task: a discrete unit of work whose execution and outcome can be identified.
Dependency: a relationship in which one task relies on the prior completion, state, or side effects of another task.
Checkpoint: a recorded recovery-safe reference from which rollback or compensation planning can proceed.
Failure Event: a machine-actionable record indicating that a task or dependency failed in a way that can affect other participants.
Rollback Set: the abstract set of task identifiers, checkpoint identifiers, or effect identifiers that a rollback request identifies as in scope.
Recovery Record: a record of rollback attempt, refusal, partial rollback, success, or failure.
Compensation: a follow-up action that mitigates an irreversible effect when direct rollback is not possible.
## 3. Problem Statement
Current agent ecosystems have uneven support for failure handling. Some drafts discuss task coordination or operational recovery, but the analyzed landscape still lacks a common method to express:
- that a task failed in a cross-agent relevant way,
- which dependencies are affected,
- which checkpoint or rollback boundary should be used,
- what rollback scope is being requested, and
- whether rollback succeeded, only partially succeeded, was refused, or was impossible.
The absence of these common semantics makes independent implementation difficult. An originating agent may believe it has requested rollback, while a receiving agent may treat the same signal as informational. Similarly, partial rollback can leave downstream agents operating on inconsistent assumptions if outcome reporting is underspecified.
The design goals for this document are:
- protocol-agnostic applicability,
- minimal mandatory fields for interoperability,
- idempotent rollback requests,
- explicit authorization and replay handling,
- explicit reporting of partial or impossible rollback, and
- compatibility with existing lower-layer identity and integrity mechanisms.
## 4. Recovery Model Overview
This document defines four event types:
- `checkpoint`
- `failure`
- `rollback-request`
- `rollback-result`
These events MAY be carried in a message protocol, stored as execution records, or embedded in a larger workflow substrate. This document does not standardize the carrier. It standardizes the abstract protocol behavior and the minimum information needed for interoperable recovery.
Each event has a common envelope containing:
- an event identifier,
- a task identifier,
- a sender identity reference,
- a timestamp, and
- any relevant workflow or execution context identifier.
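As a non-normative illustration, the common envelope could be represented as follows. This is a minimal sketch: the field names, the `event_type` discriminator, the UUID event identifier, and the RFC 3339 UTC timestamp are all assumptions made for the example and are not defined by this document.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional
import uuid

# The four event types named in this section.
EVENT_TYPES = ("checkpoint", "failure", "rollback-request", "rollback-result")


@dataclass(frozen=True)
class Envelope:
    """Common envelope shared by all recovery events (hypothetical field names)."""
    event_id: str
    event_type: str                    # assumption: carried alongside the envelope fields
    task_id: str
    sender: str                        # sender identity reference
    timestamp: str                     # assumption: RFC 3339 timestamp in UTC
    context_id: Optional[str] = None   # workflow or execution context, if relevant


def new_envelope(event_type: str, task_id: str, sender: str,
                 context_id: Optional[str] = None) -> Envelope:
    """Build an envelope with a fresh event identifier and the current UTC time."""
    if event_type not in EVENT_TYPES:
        raise ValueError(f"unknown event type: {event_type!r}")
    return Envelope(
        event_id=str(uuid.uuid4()),
        event_type=event_type,
        task_id=task_id,
        sender=sender,
        timestamp=datetime.now(timezone.utc).isoformat(),
        context_id=context_id,
    )
```

A carrier profile would be free to choose different names and encodings; the sketch only shows that the five envelope elements fit a small, immutable record.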
The recovery model assumes that a failure can be local or cross-agent relevant. Local failures that cannot affect any external dependency do not require signaling under this document. When a failure can affect dependent work outside local scope, the originating agent MUST emit a `failure` event.
If rollback is needed, the requester sends a `rollback-request` identifying the requested scope. The receiver evaluates authorization, replay status, and local reversibility before acting. The receiver then returns a `rollback-result` stating whether the requested recovery succeeded, partially succeeded, was refused, was impossible, or failed.
## 5. Event Types and Required Fields
### 5.1 Checkpoint
A `checkpoint` event identifies a recovery-safe reference that later rollback may target. A checkpoint event MUST include:
- event identifier,
- task identifier,
- checkpoint identifier,
- sender identity reference,
- timestamp.
A checkpoint event SHOULD include reversibility class and MAY include checkpoint expiry or retention information.
### 5.2 Failure
A `failure` event reports a task failure that can affect dependent execution outside local process scope. A failure event MUST include:
- event identifier,
- failed task identifier,
- sender identity reference,
- timestamp,
- failure class,
- reversibility indicator.
A failure event SHOULD include affected dependency identifiers when known, and MAY include severity, blast-radius hint, or checkpoint reference.
### 5.3 Rollback Request
A `rollback-request` event asks another participant to revert or compensate previously applied effects. A rollback request MUST include:
- event identifier,
- requester identity reference,
- target task identifier or checkpoint identifier,
- requested rollback scope,
- idempotency token,
- timestamp.
A rollback request SHOULD include reason code and urgency. A rollback request MAY include dependency evidence or policy reference supporting the request.
Before applying rollback, a receiver MUST evaluate whether the requester is authorized to request rollback for the identified scope. If authorization fails, the receiver MUST NOT apply rollback and MUST emit a `rollback-result` with outcome `refused`.
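The authorization requirement above can be sketched, non-normatively, as a gate that runs before any rollback action. The dictionary keys and the `is_authorized` and `apply_rollback` callables are hypothetical stand-ins for deployment-specific policy and rollback logic.

```python
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]


def handle_rollback_request(
    request: Event,
    is_authorized: Callable[[str, List[str]], bool],
    apply_rollback: Callable[[Event], Event],
) -> Event:
    """Evaluate authorization first; apply rollback only if the check passes."""
    if not is_authorized(request["requester"], request["scope"]):
        # Authorization failed: rollback is not applied and the outcome is `refused`.
        return {
            "type": "rollback-result",
            "ref": request["event_id"],
            "outcome": "refused",
            "actual_scope": [],
            "refusal_reason": "requester not authorized for requested scope",
        }
    # Authorized: delegate to the local rollback logic, which builds the result.
    return apply_rollback(request)
```

Keeping the check in one place makes it harder for a code path to reach rollback logic without an authorization decision having been made.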
### 5.4 Rollback Result
A `rollback-result` event reports the outcome of processing a rollback request. A rollback result MUST include:
- event identifier,
- referenced rollback-request identifier,
- responder identity reference,
- outcome code,
- timestamp,
- actual scope applied.
The outcome code MUST be one of:
- `success`
- `partial-success`
- `refused`
- `irreversible`
- `failure`
If the outcome code is not `success`, the rollback result MUST include enough detail to indicate remaining unapplied scope, residual irreversible effects, or refusal reason. A rollback result MAY include compensation details.
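A receiver or test harness might check a `rollback-result` against the rules in this section roughly as follows. This is a non-normative sketch; the key names (`request_ref`, `remaining_scope`, `residual_effects`, `refusal_reason`, and so on) are assumptions chosen for the example.

```python
from typing import Any, Dict, List

OUTCOMES = {"success", "partial-success", "refused", "irreversible", "failure"}

# Hypothetical key names for the six required fields.
REQUIRED_FIELDS = ("event_id", "request_ref", "responder",
                   "outcome", "timestamp", "actual_scope")

# Hypothetical key names for the detail required on non-success outcomes.
DETAIL_FIELDS = ("remaining_scope", "residual_effects", "refusal_reason")


def validate_rollback_result(result: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the result passes these checks."""
    problems = [f"missing field: {name}"
                for name in REQUIRED_FIELDS if name not in result]
    outcome = result.get("outcome")
    if outcome is not None and outcome not in OUTCOMES:
        problems.append(f"unknown outcome code: {outcome!r}")
    if outcome in OUTCOMES and outcome != "success":
        # Non-success outcomes need at least one non-empty detail field.
        if not any(result.get(name) for name in DETAIL_FIELDS):
            problems.append("non-success outcome lacks remaining scope, "
                            "residual effects, or refusal reason")
    return problems
```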
## 6. Task States and Recovery Procedures
For purposes of this document, relevant task states are:
- `pending`
- `running`
- `completed`
- `failed`
- `rollback-requested`
- `rolled-back`
- `rollback-failed`
- `compensation-required`
When an agent detects a task failure that can affect external dependents, it MUST transition the affected task to `failed` and emit a `failure` event. If policy permits automatic recovery, the originating agent SHOULD determine the rollback set and issue one or more `rollback-request` events. If policy does not permit automatic rollback, the implementation SHOULD enter a local hold or escalation path rather than silently continuing.
An agent receiving a `rollback-request` MUST process duplicate requests idempotently. To do so, the receiver MUST correlate the request identifier and idempotency token and MUST reject or safely ignore stale replayed requests according to local replay policy. A request that is recognized as stale replay MUST NOT cause a second rollback action.
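One way to satisfy the duplicate and replay rule is sketched below, non-normatively. It assumes that the pair of request identifier and idempotency token is the correlation key, that staleness is judged by request age against a local limit, and that timestamps are numeric seconds; all three are illustrative choices, not requirements of this document.

```python
from typing import Any, Callable, Dict, Tuple

Event = Dict[str, Any]


class ReplayGuard:
    """Remembers processed rollback requests so duplicates never re-run rollback."""

    def __init__(self, max_age_seconds: float = 300.0) -> None:
        self.max_age_seconds = max_age_seconds          # assumed local replay policy
        self._seen: Dict[Tuple[str, str], Event] = {}

    def process(self, request: Event, now: float,
                apply_rollback: Callable[[Event], Event]) -> Event:
        key = (request["event_id"], request["idempotency_token"])
        if key in self._seen:
            # Duplicate: return the recorded result without a second rollback action.
            return self._seen[key]
        if now - request["timestamp"] > self.max_age_seconds:
            # Stale and never processed here: reject without acting.
            return {"ref": request["event_id"], "outcome": "refused",
                    "actual_scope": [], "refusal_reason": "stale request"}
        result = apply_rollback(request)
        self._seen[key] = result
        return result
```

A production implementation would also need to bound and persist the record of seen requests; the sketch omits both for brevity.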
If the request is authorized and can be honored, the agent applies rollback or compensation as appropriate and emits a `rollback-result`. If the request cannot be honored because the requester is not authorized, the effect is irreversible, or rollback processing fails, the agent MUST emit a `rollback-result` with the corresponding outcome code (`refused`, `irreversible`, or `failure`, respectively).
This document distinguishes rollback from cancellation. Cancellation of work not yet started is out of scope except where a local implementation uses cancellation internally while fulfilling a rollback request.
### 6.1 State Transition Guidance
| Current State | Trigger | Next State | Required Output |
|---|---|---|---|
| `running` | cross-agent relevant failure detected | `failed` | `failure` |
| `completed` | authorized rollback requested | `rollback-requested` | none immediately |
| `rollback-requested` | rollback fully applied | `rolled-back` | `rollback-result(success)` |
| `rollback-requested` | rollback partially applied | `compensation-required` | `rollback-result(partial-success)` |
| `rollback-requested` | rollback impossible | `rollback-failed` or `compensation-required` | `rollback-result(irreversible)` |
| `rollback-requested` | processing failure | `rollback-failed` | `rollback-result(failure)` |
This table is intentionally minimal. Local implementations MAY track finer-grained states, but interoperable outputs MUST remain consistent with the transitions above.
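For interoperability testing, the table can be encoded directly as a lookup. The sketch below is non-normative; the trigger labels are hypothetical shorthand for the "Trigger" column, and `None` stands for "none immediately".

```python
from typing import Dict, Optional, Set, Tuple

# (current state, trigger label) -> (allowed next states, required output)
TRANSITIONS: Dict[Tuple[str, str], Tuple[Set[str], Optional[str]]] = {
    ("running", "failure-detected"):
        ({"failed"}, "failure"),
    ("completed", "authorized-rollback-request"):
        ({"rollback-requested"}, None),
    ("rollback-requested", "rollback-fully-applied"):
        ({"rolled-back"}, "rollback-result(success)"),
    ("rollback-requested", "rollback-partially-applied"):
        ({"compensation-required"}, "rollback-result(partial-success)"),
    ("rollback-requested", "rollback-impossible"):
        ({"rollback-failed", "compensation-required"}, "rollback-result(irreversible)"),
    ("rollback-requested", "processing-failure"):
        ({"rollback-failed"}, "rollback-result(failure)"),
}


def check_transition(current: str, trigger: str, next_state: str) -> Optional[str]:
    """Return the required output for a listed transition, or raise ValueError."""
    entry = TRANSITIONS.get((current, trigger))
    if entry is None or next_state not in entry[0]:
        raise ValueError(f"transition not listed: {current} --{trigger}--> {next_state}")
    return entry[1]
```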
## 7. Rollback Scope and Dependency Handling
Rollback scope is central to interoperability. A rollback request MUST identify either:
- a target checkpoint, or
- an explicit rollback set.
At minimum, a rollback set MUST identify one or more affected task identifiers, checkpoint identifiers, or effect identifiers. When transitive dependencies are known, the requester SHOULD indicate whether the scope includes only direct dependencies or includes transitive dependencies as well.
When dependency knowledge is incomplete, the requester MUST still identify the minimum known affected scope, and the responder MUST report the actual scope applied in the rollback result. A responder MUST NOT report successful rollback for effects outside the applied scope.
If only part of the requested rollback set is reversed, the responder MUST return `partial-success` and MUST describe any remaining irreversible or uncompensated effects.
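The scope-reporting rules in this section can be illustrated, non-normatively, by comparing the requested rollback set with what was actually reversed. The function name, parameter names, and result keys are hypothetical. When nothing was applied, the correct code (`refused`, `irreversible`, or `failure`) depends on the cause, so the sketch leaves that choice to the caller.

```python
from typing import Any, Dict, Iterable, Sequence


def summarize_scope(requested: Iterable[str], applied: Iterable[str],
                    residual_effects: Sequence[str] = (),
                    nothing_applied_outcome: str = "failure") -> Dict[str, Any]:
    """Derive an outcome code and scope report from requested versus applied scope."""
    requested_set = set(requested)
    applied_set = set(applied)
    if not requested_set:
        raise ValueError("a rollback set must name at least one identifier")
    remaining = requested_set - applied_set
    if not remaining and not residual_effects:
        outcome = "success"
    elif requested_set & applied_set:
        # Some requested scope was reversed, but scope or effects remain.
        outcome = "partial-success"
    else:
        outcome = nothing_applied_outcome
    return {
        "outcome": outcome,
        "actual_scope": sorted(applied_set),      # report only what was applied
        "remaining_scope": sorted(remaining),
        "residual_effects": list(residual_effects),
    }
```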
## 8. Error Conditions and Partial Rollback
The following conditions require explicit handling:
- duplicate rollback requests,
- stale replay of prior rollback requests,
- timeout while waiting for rollback completion,
- refusal due to insufficient authorization,
- irreversible effects,
- partial rollback where some effects are reversed and others remain,
- failure of the rollback procedure itself.
If a requested rollback is impossible, the responding agent MUST indicate `irreversible` or `failure` as appropriate and SHOULD indicate whether compensation is available. If a request times out after some scope has been applied, the responder SHOULD return `partial-success` rather than silently collapsing to generic failure.
Implementations SHOULD avoid silent downgrade from rollback to best-effort local cleanup. If only local cleanup occurred, the rollback result SHOULD say so clearly.
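The timeout rule above might be realized as in the following non-normative sketch, which applies undo steps in order and stops at a deadline. The step representation, the injectable `clock`, and the choice of `failure` when the deadline passes before any step runs are assumptions made for illustration.

```python
import time
from typing import Any, Callable, Dict, List, Sequence, Tuple

Step = Tuple[str, Callable[[], None]]   # (scope item identifier, undo action)


def rollback_with_deadline(steps: Sequence[Step], deadline: float,
                           clock: Callable[[], float] = time.monotonic
                           ) -> Dict[str, Any]:
    """Apply undo steps in order until done or until the deadline is reached."""
    applied: List[str] = []
    for index, (item_id, undo) in enumerate(steps):
        if clock() >= deadline:
            remaining = [pending_id for pending_id, _ in steps[index:]]
            return {
                # Some scope already reversed: report partial-success, not failure.
                "outcome": "partial-success" if applied else "failure",
                "actual_scope": applied,
                "remaining_scope": remaining,
                "detail": "timed out before rollback completed",
            }
        undo()
        applied.append(item_id)
    return {"outcome": "success", "actual_scope": applied, "remaining_scope": []}
```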
### 8.1 Non-Normative Example Flow
Agent A executes task `t-17`, which depends on Agent B having applied task `t-12`. Agent B later detects that `t-12` wrote invalid external state and emits `failure(failed-task=t-12, affected-dependency=t-17)`. Agent A determines that rollback is required for `t-17` and sends `rollback-request(request-id=r-8, target-task=t-17, scope={t-17, ckpt-17-precommit}, idempotency-token=abc123)`.
The peer that receives the request evaluates requester authorization and replay status and applies rollback to `t-17`, but it cannot reverse one externally visible notification. It therefore emits `rollback-result(ref=r-8, outcome=partial-success, actual-scope={t-17, ckpt-17-precommit}, residual=notification already delivered)`. A downstream relying party can now distinguish partial rollback from full recovery and act accordingly.
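The three events in this flow could be written as plain records, as in the non-normative sketch below. The key names mirror the shorthand used in the example; the JSON encoding and the sender labels are assumptions; and, like the prose example, the records omit envelope fields such as event identifiers and timestamps.

```python
import json

# Agent B reports that t-12 failed and that t-17 depends on it.
failure = {
    "type": "failure",
    "sender": "agent-b",
    "failed-task": "t-12",
    "affected-dependency": ["t-17"],
}

# Agent A requests rollback of t-17 and its pre-commit checkpoint.
rollback_request = {
    "type": "rollback-request",
    "requester": "agent-a",
    "request-id": "r-8",
    "target-task": "t-17",
    "scope": ["t-17", "ckpt-17-precommit"],
    "idempotency-token": "abc123",
}

# The responder reversed the requested scope but one notification remains.
rollback_result = {
    "type": "rollback-result",
    "ref": "r-8",
    "outcome": "partial-success",
    "actual-scope": ["t-17", "ckpt-17-precommit"],
    "residual": "notification already delivered",
}


def correlates(request: dict, result: dict) -> bool:
    """True when the result references the given rollback request."""
    return result["ref"] == request["request-id"]


if __name__ == "__main__":
    for event in (failure, rollback_request, rollback_result):
        print(json.dumps(event, sort_keys=True))
```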
## 9. Security Considerations
Unauthorized rollback requests can be used to deny service or corrupt coordinated work. Implementations therefore need authenticated carriage and explicit authorization checks for the events defined here, even though this document does not define the underlying security protocol.
Spoofed failure events can trigger unnecessary rollback. Replay of old rollback requests can repeatedly unwind valid work. Implementations MUST prevent replayed requests from causing repeated rollback actions and SHOULD bind requests and results to stable task and requester identifiers.
Partial rollback is itself a security concern because it can leave downstream systems in an inconsistent state that an attacker can exploit. For that reason, responders MUST explicitly report residual scope and any remaining irreversible effects.
Failure and rollback metadata can also reveal topology, task dependencies, and operational weaknesses. Deployments SHOULD minimize unnecessary disclosure and SHOULD apply least-privilege access to recovery records.
## 10. Privacy Considerations
Task identifiers, failure classes, dependency relationships, and reason codes may expose sensitive operational details. In some deployments, these details can reveal user behavior, internal service structure, or policy logic.
Implementations SHOULD disclose only the information necessary for interoperable recovery. If a deployment requires broader analytics or audit retention, that policy is deployment-specific and outside the scope of this document.
## 11. IANA Considerations
This document currently requests no IANA action.
Future versions may request compact registries for failure classes, rollback outcome codes, or event type identifiers if implementation experience shows that fixed interoperation points are needed.
## 12. References
- [RFC2119] Key words for use in RFCs to Indicate Requirement Levels.
- [RFC8174] Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words.
- Placeholder reference for adjacent execution-evidence substrate, if adopted.
- Placeholder reference for `draft-yue-anima-agent-recovery-networks`.
- Placeholder reference for `draft-li-dmsc-macp`.
- Placeholder reference for `draft-fu-nmop-agent-communication-framework`.


@@ -0,0 +1,24 @@
# Architecture Review
## Findings
### Medium: the draft is mostly well scoped, but it wavers between abstract event semantics and protocol behavior
The document says it is carrier-agnostic and not a transport binding, which is correct. However, several MUST-level statements already imply protocol behavior. That is acceptable, but the architecture should acknowledge that the document defines an abstract protocol model, not only vocabulary.
### Medium: coordinator role is introduced but not integrated into the model
The coordinator is defined as optional, yet no section explains how peers distinguish coordinator-computed scope from sender-local scope. That leaves a conceptual hole in the actor model.
### Medium: cancellation is declared out of scope, but the boundary with rollback is not fully clean
The text says cancellation of work not yet started is out of scope, except when used internally to satisfy rollback. That line is defensible, but it should be expressed more rigorously to prevent readers from assuming cancellation semantics are standardized here.
## Open questions
- Should the draft describe itself as an abstract recovery protocol profile rather than only "semantics"?
- Does the optional coordinator need one or two normative constraints, or should it be deferred entirely?
## Residual risk
Scope discipline is good overall. The main remaining architectural risk is ambiguity about whether this document is merely descriptive or actually defines interoperable protocol behavior. It should explicitly choose the latter in a carefully bounded way.


@@ -0,0 +1,28 @@
# IETF Senior Review
## Findings
### High: the draft still reads more like a design sketch than a publishable Internet-Draft
The overall structure is right, but several sections stop at high-level intent. A publishable draft needs more disciplined distinction between required behavior, optional behavior, and explanatory rationale. Sections 5 through 8 are closest to publishable, but they still need slightly more rigor.
### Medium: the abstract is acceptable but could better state the interoperability problem and deployment value
The current abstract says what the document defines, but it could more directly explain why existing agent systems fail to interoperate during recovery and why this document matters.
### Medium: References and IANA sections are too provisional
It is fine to keep placeholders at this stage, but the text currently signals that core dependencies are undecided. Before wider circulation, the draft should either name the expected adjacent substrate or state clearly that no substrate dependency is required.
### Medium: terminology is mostly clean, but some items still need RFC-style definition form
The terms are understandable, yet a few are written more like explanations than stable definitions. Tightening the definition style would help the document feel more standards-native.
## Open questions
- Does the draft intend to progress as a standalone individual draft or as part of a family with a shared terminology base?
- Should the document explicitly call itself Experimental in the introduction rather than only in external cycle metadata?
## Residual publishability risk
This is a credible start. The remaining publishability risk is not the idea; it is the need for one more iteration of standards-style precision and dependency cleanup.


@@ -0,0 +1,28 @@
# Security Review
## Findings
### High: rollback authorization is left entirely to the lower layer without a required authorization decision point
The draft says recovery events need authenticated and authorized carriage, but it never states when a receiver is required to evaluate authorization before acting on a `rollback-request`. Two compliant implementations could therefore both authenticate the requester yet differ on whether task-level rollback authority is required. The draft should require an explicit authorization check before any irreversible rollback action is attempted.
### High: replay protection is mentioned but underspecified for interoperable use
The draft says implementations SHOULD provide replay resistance, but `rollback-request` already defines an idempotency token and stable identifiers. That is enough structure to make stronger requirements possible. Without a minimum replay-handling rule, an attacker can reuse stale rollback requests in a way that different implementations will treat inconsistently.
### Medium: failure-event spoofing risk is identified, but the draft does not require correlation between failure and rollback flows
An attacker who can inject a plausible `failure` event may induce unnecessary rollback decisions. The draft should at least require that a `rollback-request` reference a specific task or failure context and that receivers preserve the linkage in the `rollback-result`.
### Medium: partial rollback can leave exploitable inconsistent state, but no minimum disclosure is mandated
The draft correctly notes the risk, yet "residual risk description" is only a SHOULD. For partial-success and irreversible outcomes, a stronger requirement is warranted so downstream agents can react safely.
## Open questions
- Should authorization be expressed as a generic requirement only, or should the document define a task-scope authorization concept for rollback actions?
- Should replay resistance be a MUST for all deployments, or only when rollback has externally visible effects?
## Residual risk
Even with the fixes above, the draft will still depend heavily on lower-layer identity and authorization systems. That is acceptable, but the security section should say so more concretely and bind protocol behavior to those assumptions.


@@ -0,0 +1,28 @@
# Software Review
## Findings
### High: required fields are defined, but no concrete message shape or example flow is provided
The event model is understandable, but two implementers could still serialize or correlate it differently. A non-normative example showing `failure -> rollback-request -> rollback-result` with task identifiers, dependency references, and partial-success handling would materially reduce ambiguity.
### High: task state transitions are incomplete at the procedure level
The draft lists states but does not specify enough transition rules. For example, can a task move from `completed` directly to `rollback-requested`? Can `compensation-required` be terminal? Can `rollback-failed` later transition to `rolled-back` after manual intervention? Without a transition table or explicit rules, interoperability tests will be hard to design.
### Medium: rollback scope remains too abstract for independent implementations
The draft requires a target checkpoint or explicit rollback set, but it does not describe the structure of a rollback set or how direct and transitive dependencies are represented. The draft needs at least a minimal abstract shape for scope membership.
### Medium: timeout behavior is named but not operationalized
Timeout is listed as an error condition, but no rule says whether timeout yields `failure`, `partial-success`, or local retry. This will fragment behavior.
## Open questions
- Is a compact transition table sufficient, or does the draft need a separate state machine subsection?
- Should rollback set representation be a list of task identifiers, checkpoint identifiers, or both?
## Residual risk
The current draft is close to implementable, but it still needs one more layer of precision around flow shape and state progression before two vendors would likely build compatible behavior.


@@ -0,0 +1,7 @@
# Architecture Review
## Findings
## Open questions
## Residual risk


@@ -0,0 +1,7 @@
# IETF Senior Review
## Findings
## Open questions
## Residual publishability risk


@@ -0,0 +1,7 @@
# Security Review
## Findings
## Open questions
## Residual risk


@@ -0,0 +1,7 @@
# Software Review
## Findings
## Open questions
## Residual risk


@@ -0,0 +1,26 @@
# Review Synthesis
## Blocking findings
- Add an explicit authorization-decision requirement before acting on rollback requests. The security review correctly identifies this as the biggest missing control.
- Tighten replay handling by linking idempotency, request identity, and stale-request rejection into one interoperable rule.
- Add one concrete non-normative flow example and a compact transition table. The software review is right that the draft is still too abstract for two independent implementations.
## Major findings
- Clarify whether the document is an abstract protocol model or only event vocabulary. The architecture review recommends choosing the former in a bounded way.
- Specify minimum disclosure rules for partial-success, irreversible, and refused outcomes so downstream agents can react safely.
- Clarify rollback-scope representation at the abstract level: what a rollback set minimally contains and how direct versus transitive scope is reported.
- Improve the abstract and introduction to frame the interoperability problem more directly.
## Minor findings
- Tighten terminology definitions into more RFC-like form.
- Clarify the coordinator role or remove it if not needed in this revision.
- Clarify the cancellation boundary.
- Reduce placeholder feel in References and dependency text.
## Conflicts resolved
- No meaningful reviewer conflict exists on scope. All reviewers favor keeping the document narrow.
- The only tension is between remaining carrier-agnostic and becoming implementable. Resolution: keep the model carrier-agnostic, but add one non-normative example and stronger abstract structure rather than binding to a specific substrate in v1.


@@ -0,0 +1,9 @@
# Review Synthesis
## Blocking findings
## Major findings
## Minor findings
## Conflicts resolved


@@ -0,0 +1,28 @@
# Revision Plan
## Blocking changes
- Add a normative requirement that receivers evaluate authorization before honoring a rollback request.
- Add a normative replay-handling rule tying request identity, idempotency token, and stale-request rejection together.
- Add a compact state-transition table covering normal failure, rollback request, partial success, irreversible outcome, and compensation-required cases.
- Add one non-normative end-to-end example flow with concrete identifiers and a partial-success outcome.
## High-value improvements
- Clarify rollback-set structure and how transitive scope is represented or reported.
- Strengthen `rollback-result` requirements for partial-success, refused, and irreversible outcomes.
- Tighten the abstract, introduction, and terminology wording to sound more like an actual I-D.
- Either define the coordinator role more clearly or remove it from this version.
## Deferred items
- Binding to a specific execution-evidence substrate
- Human override or operator approval flow
- Registries for failure classes and rollback outcomes unless implementation feedback requires them
## Draft order for next iteration
1. Revise abstract and terminology.
2. Revise Sections 5 through 8 for authorization, replay, scope shape, and state transitions.
3. Add non-normative example flow.
4. Revisit Security, Privacy, IANA, and References after the protocol text settles.


@@ -0,0 +1,9 @@
# Revision Plan
## Blocking changes
## High-value improvements
## Deferred items
## Draft order for next iteration