# Draft

## Abstract

This document defines experimental recovery semantics for multi-agent task execution. It specifies common event types for failure signaling, checkpoint reference, rollback requests, and rollback results so that cooperating agents can coordinate recovery after operational faults. The mechanism is protocol-agnostic and is intended to be profiled onto existing agent communication or execution-evidence substrates. The goal is to improve interoperability when autonomous systems must contain failures, report rollback scope, and communicate partial or unsuccessful recovery without silent divergence.

## 1. Introduction

Multi-agent systems increasingly perform coordinated work across services, tools, and administrative domains. In such systems, one task failure can invalidate downstream work, require compensating actions, or force a broader rollback of externally visible effects. Existing drafts define communication frameworks, discovery, identity, and broader orchestration concepts, but they do not define a shared recovery core that independent implementations can follow.

Absent common recovery semantics, one implementation may silently retry while another expects explicit rollback, and a third may report only local failure without describing downstream consequences. That mismatch creates interoperability risk and operational safety risk, especially when agents act without immediate human supervision.

This document defines a narrow recovery model for cross-agent failure handling. It does not define a full workflow language, a transport binding, or a human override system. Instead, it defines event semantics and minimum procedure rules so that agents can exchange recovery-relevant information consistently.

## 2. Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

Agent: an autonomous software entity that performs one or more tasks and may exchange recovery events with peers.

Task: a discrete unit of work whose execution and outcome can be identified.

Dependency: a relationship in which one task relies on the prior completion, state, or side effects of another task.

Checkpoint: a recorded state or recovery-safe reference from which rollback can proceed.

Failure Event: a machine-actionable record that a task or dependency failed in a way that can affect other participants.

Rollback Set: the set of tasks, effects, or checkpoints that a rollback request identifies as the intended recovery scope.

Recovery Record: a record of rollback attempt, refusal, partial rollback, success, or failure.

Coordinator: an optional component that computes or distributes rollback scope across multiple agents.

Compensation: a follow-up action that mitigates an irreversible effect when direct rollback is not possible.

## 3. Problem Statement

Current agent ecosystems have uneven support for failure handling. Some drafts discuss task coordination or operational recovery, but none of the drafts surveyed defines a common way to express:

- that a task failed in a cross-agent relevant way,
- which dependencies are affected,
- which checkpoint or rollback boundary should be used, and
- whether rollback succeeded, only partially succeeded, or was impossible.

The absence of these common semantics makes independent implementation difficult. An originating agent may believe it has requested rollback, while a receiving agent may treat the same signal as informational. Similarly, partial rollback can leave downstream agents operating on inconsistent assumptions if outcome reporting is underspecified.

The design goals for this document are:

- protocol-agnostic applicability,
- minimal mandatory fields for interoperability,
- idempotent rollback requests,
- explicit reporting of partial or impossible rollback, and
- compatibility with existing lower-layer identity and integrity mechanisms.

## 4. Recovery Model Overview

This document defines four event types:

- `checkpoint`
- `failure`
- `rollback-request`
- `rollback-result`

These events MAY be carried in a message protocol, stored as execution records, or embedded in a larger workflow substrate. This document does not standardize the carrier. It standardizes the meaning of the events and the minimum information needed for interoperable recovery behavior.

Each event has a common envelope containing:

- an event identifier,
- a task identifier,
- a sender identity reference,
- a timestamp, and
- any relevant workflow or execution context identifier.
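The envelope fields listed above can be pictured with the following non-normative sketch. The field names (`event_id`, `task_id`, `sender`, `timestamp`, `context_id`), the use of UUIDs, the timestamp format, and the JSON encoding are all assumptions made for illustration; this document does not define a wire format.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional

# The four event types defined in this section.
EVENT_TYPES = {"checkpoint", "failure", "rollback-request", "rollback-result"}


def _now() -> str:
    # UTC timestamp in ISO 8601 form; the format is an assumption.
    return datetime.now(timezone.utc).isoformat()


@dataclass(frozen=True)
class Envelope:
    """Common envelope; every field name here is hypothetical."""
    event_type: str
    task_id: str
    sender: str                        # sender identity reference
    context_id: Optional[str] = None   # workflow or execution context, if any
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(default_factory=_now)

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


env = Envelope(event_type="failure", task_id="task-17", sender="agent-a")
```

A carrier profile would map these fields onto its own message or record structure.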

The recovery model distinguishes local failures from failures that are relevant to other agents. A local failure that cannot affect any external dependency does not require signaling under this document. When a failure can affect dependent work outside local scope, the originating agent MUST emit a `failure` event.

If rollback is needed, the requester sends a `rollback-request` identifying the requested scope. The receiver returns a `rollback-result` stating whether the requested recovery succeeded, partially succeeded, was refused, was irreversible, or failed.

## 5. Event Types and Required Fields

### 5.1 Checkpoint

A `checkpoint` event identifies a recovery-safe reference that later rollback may target. A checkpoint event MUST include:

- event identifier,
- task identifier,
- checkpoint identifier,
- sender identity reference,
- timestamp.

A checkpoint event SHOULD include a reversibility class and MAY include checkpoint expiry or retention information.
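As a non-normative illustration, a checkpoint record with the optional expiry might be modeled as follows. The field names, the example reversibility value, and the rule that a checkpoint without expiry stays usable indefinitely are assumptions of this sketch, not definitions from this document.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class Checkpoint:
    # Required fields from Section 5.1; names are hypothetical.
    event_id: str
    task_id: str
    checkpoint_id: str
    sender: str
    timestamp: datetime
    # Optional fields; the set of reversibility values is assumed.
    reversibility: Optional[str] = None
    expires_at: Optional[datetime] = None

    def usable_at(self, when: datetime) -> bool:
        """Treat a checkpoint without expiry as usable indefinitely."""
        return self.expires_at is None or when < self.expires_at


created = datetime(2025, 1, 1, tzinfo=timezone.utc)
cp = Checkpoint(
    event_id="evt-1",
    task_id="task-17",
    checkpoint_id="cp-3",
    sender="agent-a",
    timestamp=created,
    reversibility="full",
    expires_at=created + timedelta(hours=1),
)
```

A requester could call `usable_at` before targeting the checkpoint in a rollback request.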

### 5.2 Failure

A `failure` event reports a task failure that can affect dependent execution outside local process scope. A failure event MUST include:

- event identifier,
- failed task identifier,
- sender identity reference,
- timestamp,
- failure class,
- reversibility indicator.

A failure event SHOULD include affected dependency identifiers when known, and MAY include a severity, a blast-radius hint, or a checkpoint reference.
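A receiver might check the required fields before acting on a failure event. The sketch below is non-normative; the dictionary keys and the example failure class are hypothetical, since this document defines neither field names nor a failure class vocabulary.

```python
# Required fields of a failure event, using hypothetical key names.
REQUIRED_FAILURE_FIELDS = (
    "event_id",
    "failed_task_id",
    "sender",
    "timestamp",
    "failure_class",
    "reversibility",
)


def missing_failure_fields(event: dict) -> list:
    """Return the required field names that are absent or empty."""
    return [
        name
        for name in REQUIRED_FAILURE_FIELDS
        if event.get(name) in (None, "")
    ]


failure = {
    "event_id": "evt-2",
    "failed_task_id": "task-17",
    "sender": "agent-a",
    "timestamp": "2025-01-01T00:00:00Z",
    "failure_class": "dependency-unavailable",  # illustrative value only
    "reversibility": "reversible",              # illustrative value only
    "affected_dependencies": ["task-18", "task-19"],  # SHOULD, when known
}
```

An event with a non-empty result from `missing_failure_fields` would be rejected or logged rather than used to trigger rollback.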

### 5.3 Rollback Request

A `rollback-request` event asks another participant to revert or compensate previously applied effects. A rollback request MUST include:

- event identifier,
- requester identity reference,
- target task identifier or checkpoint identifier,
- requested rollback scope,
- idempotency token,
- timestamp.

A rollback request SHOULD include a reason code and an urgency indicator. A rollback request MAY include dependency evidence or a policy reference supporting the request.
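The idempotency token lets a responder recognize a retry of the same request. One possible approach, shown here as a non-normative sketch, derives the token from the request content so that a retry with a new event identifier still carries the same token. Deriving the token this way, and all field names, are assumptions; a random token stored by the requester would serve equally well.

```python
import hashlib
import json


def idempotency_token(requester: str, target: str, scope: list) -> str:
    """Derive a stable token from requester, target, and sorted scope."""
    canonical = json.dumps(
        {"requester": requester, "target": target, "scope": sorted(scope)},
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_rollback_request(event_id, requester, target, scope, timestamp):
    """Assemble the required fields; key names are hypothetical."""
    return {
        "event_id": event_id,
        "requester": requester,
        "target": target,  # task identifier or checkpoint identifier
        "scope": sorted(scope),
        "idempotency_token": idempotency_token(requester, target, scope),
        "timestamp": timestamp,
    }


first = build_rollback_request(
    "evt-3", "agent-a", "cp-3", ["task-18", "task-17"], "2025-01-01T00:00:00Z"
)
retry = build_rollback_request(
    "evt-4", "agent-a", "cp-3", ["task-17", "task-18"], "2025-01-01T00:00:05Z"
)
```

A content-derived token cannot distinguish two intentionally separate rollbacks with identical scope, so a deployment that needs that distinction would add a requester-chosen nonce.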

### 5.4 Rollback Result

A `rollback-result` event reports the outcome of processing a rollback request. A rollback result MUST include:

- event identifier,
- referenced rollback-request identifier,
- responder identity reference,
- outcome code,
- timestamp,
- actual scope applied.

The outcome code MUST be one of:

- `success`
- `partial-success`
- `refused`
- `irreversible`
- `failure`

A rollback result SHOULD include a residual risk description when the result is not `success`. A rollback result MAY include compensation details.
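The closed set of outcome codes maps naturally onto an enumeration. The following non-normative sketch uses hypothetical names (`Outcome`, `parse_outcome`, `needs_residual_risk`) to show how a receiver might reject unknown codes and decide when a residual risk description is expected.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    PARTIAL_SUCCESS = "partial-success"
    REFUSED = "refused"
    IRREVERSIBLE = "irreversible"
    FAILURE = "failure"


def parse_outcome(code: str) -> Outcome:
    """Reject any code outside the five values listed above."""
    try:
        return Outcome(code)
    except ValueError:
        raise ValueError(f"unknown rollback outcome code: {code!r}") from None


def needs_residual_risk(outcome: Outcome) -> bool:
    # Residual risk SHOULD be described for every outcome except success.
    return outcome is not Outcome.SUCCESS
```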

## 6. Task States and Recovery Procedures

For purposes of this document, relevant task states are:

- `pending`
- `running`
- `completed`
- `failed`
- `rollback-requested`
- `rolled-back`
- `rollback-failed`
- `compensation-required`

When an agent detects a task failure that can affect external dependents, it MUST transition the affected task to `failed` and emit a `failure` event. If policy permits automatic recovery, the originating agent or coordinator SHOULD determine the rollback set and issue one or more `rollback-request` events. If policy does not permit automatic rollback, the implementation SHOULD enter a local hold or escalation path rather than silently continuing.
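This document lists the task states but does not define a complete transition table. The sketch below shows one plausible reading, offered only as an assumption for illustration; the `ALLOWED` table and the `transition` helper are hypothetical and carry no normative weight.

```python
# One possible transition table; the edges are assumptions of this sketch.
ALLOWED = {
    "pending": {"running"},
    "running": {"completed", "failed"},
    "completed": {"rollback-requested"},
    "failed": {"rollback-requested"},
    "rollback-requested": {
        "rolled-back",
        "rollback-failed",
        "compensation-required",
    },
    "rolled-back": set(),
    "rollback-failed": {"rollback-requested", "compensation-required"},
    "compensation-required": {"rolled-back", "rollback-failed"},
}


def transition(current: str, new: str) -> str:
    """Return the new state, or raise if the move is not in the table."""
    if new not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {new}")
    return new


state = "running"
state = transition(state, "failed")              # failure detected
state = transition(state, "rollback-requested")  # policy permits rollback
```

Rejecting moves that are not in the table gives an implementation a single place to detect a task that would otherwise continue silently after a failure.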

An agent receiving a `rollback-request` MUST process duplicate requests idempotently. If the request can be honored, the agent applies rollback or compensation as appropriate and emits a `rollback-result`. If the request cannot be honored because the effect is irreversible or the requester is not authorized, the agent MUST emit a `rollback-result` with the appropriate outcome code (`irreversible` or `refused`, respectively).
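Idempotent processing can be sketched as a responder that records the result for each idempotency token and returns the recorded result for any duplicate. This is a non-normative example; the class name, the key names, and the in-memory cache are assumptions, and a real deployment would likely need a durable, concurrency-safe store.

```python
from typing import Callable, Dict


class RollbackResponder:
    """Caches results by idempotency token; all names are hypothetical."""

    def __init__(self, apply_rollback: Callable[[dict], dict]):
        self._apply = apply_rollback
        self._results: Dict[str, dict] = {}

    def handle(self, request: dict) -> dict:
        token = request["idempotency_token"]
        if token in self._results:
            # Duplicate request: return the recorded result, do not re-apply.
            return self._results[token]
        result = self._apply(request)
        self._results[token] = result
        return result


calls = []


def fake_apply(request: dict) -> dict:
    # Stand-in for the real rollback or compensation logic.
    calls.append(request["event_id"])
    return {
        "request_id": request["event_id"],
        "outcome": "success",
        "actual_scope": request["scope"],
    }


responder = RollbackResponder(fake_apply)
req = {"event_id": "evt-3", "idempotency_token": "tok-1", "scope": ["task-17"]}
first = responder.handle(req)
second = responder.handle(dict(req, event_id="evt-4"))  # retry, new event id
```

In this sketch the second call returns the first result unchanged and the rollback logic runs only once.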

This document distinguishes rollback from cancellation. Cancellation of work not yet started is out of scope except where a local implementation uses cancellation internally to satisfy a rollback request.

## 7. Rollback Scope and Dependency Handling

Rollback scope is central to interoperability. A rollback request MUST identify either:

- a target checkpoint, or
- an explicit rollback set.

When transitive dependencies are known, the requester SHOULD include them or indicate that transitive evaluation is required. When dependency knowledge is incomplete, the requester MUST still identify the minimum known affected scope and the responder MUST report the actual scope applied in the rollback result.

An implementation MUST NOT report successful rollback for effects outside the applied scope. If only part of the requested rollback set is reversed, the responder MUST return `partial-success` and describe any remaining irreversible or uncompensated effects.
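The relationship between requested scope, applied scope, and outcome can be sketched with set arithmetic. The example is non-normative: the function name and result keys are hypothetical, and mapping "nothing was reversed" to `failure` is an assumption, since `irreversible` or `refused` may be the better code depending on the cause.

```python
def summarize_scope(requested: set, reversed_items: set) -> dict:
    """Compare the requested rollback set with what was actually reversed.

    Items reversed outside the requested set are ignored here; how to
    report them is left open by this sketch.
    """
    applied = requested & reversed_items
    remaining = requested - reversed_items
    if not remaining:
        outcome = "success"
    elif applied:
        outcome = "partial-success"
    else:
        outcome = "failure"  # assumption; the cause may call for another code
    return {
        "outcome": outcome,
        "actual_scope": sorted(applied),
        "remaining": sorted(remaining),
    }


result = summarize_scope(
    {"task-17", "task-18", "task-19"},
    {"task-17", "task-19"},
)
```

Here `task-18` was not reversed, so the result is `partial-success` with `task-18` listed as remaining.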

A coordinator MAY compute rollback scope across multiple agents, but this document does not require a coordinator role. Peers can interoperate directly as long as they provide the required event information.

## 8. Error Conditions and Partial Rollback

The following conditions require explicit handling:

- duplicate rollback requests,
- timeout while waiting for rollback completion,
- refusal due to insufficient authorization,
- irreversible effects,
- partial rollback where some effects are reversed and others remain,
- failure of the rollback procedure itself.

If a requested rollback is impossible, the responding agent MUST indicate `irreversible` or `failure` as appropriate and SHOULD indicate whether compensation is available. If a request is refused for policy reasons, the agent MUST indicate `refused` and SHOULD include a reason that is usable by the requester or an external policy authority.

Implementations SHOULD avoid silent downgrade from rollback to best-effort local cleanup. If only local cleanup occurred, the rollback result SHOULD state this explicitly.

## 9. Security Considerations

Unauthorized rollback requests can be used to deny service or corrupt coordinated work. Implementations therefore need a carrier that authenticates senders and authorizes the events defined here, even though this document does not define the underlying security protocol.

Spoofed failure events can trigger unnecessary rollback. Replay of old rollback requests can repeatedly unwind valid work. Implementations SHOULD provide replay resistance and SHOULD bind requests and results to stable task and requester identifiers.
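One way to approach replay resistance is to combine a freshness window with a record of tokens already seen per requester. The sketch below is non-normative; the `ReplayGuard` name, the 300-second window, and the use of the request timestamp are assumptions, and the timestamp is only meaningful if the carrier protects its integrity.

```python
class ReplayGuard:
    """Classifies requests as fresh, duplicate, or stale (hypothetical)."""

    def __init__(self, max_age_seconds: float = 300.0):
        self.max_age = max_age_seconds
        self._seen = {}  # (requester, token) -> time first accepted

    def check(self, requester: str, token: str,
              sent_at: float, now: float) -> str:
        key = (requester, token)
        if key in self._seen:
            # Already processed: answer from the recorded result instead.
            return "duplicate"
        if now - sent_at > self.max_age:
            # Too old to accept as a new request.
            return "stale"
        self._seen[key] = now
        return "fresh"


guard = ReplayGuard(max_age_seconds=300.0)
r1 = guard.check("agent-a", "tok-1", sent_at=1000.0, now=1010.0)
r2 = guard.check("agent-a", "tok-1", sent_at=1000.0, now=1020.0)
r3 = guard.check("agent-a", "tok-2", sent_at=1000.0, now=2000.0)
```

Keying the record on both requester and token binds the check to a stable requester identifier; the record grows without bound in this sketch, so a deployment would need a retention policy.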

Partial rollback is itself a security concern because it can leave downstream systems in an inconsistent state that an attacker can exploit. For that reason, responders MUST explicitly report residual scope and any remaining irreversible effects.

Failure and rollback metadata can also reveal topology, task dependencies, and operational weaknesses. Deployments SHOULD minimize unnecessary disclosure and SHOULD apply least-privilege access to recovery records.

## 10. Privacy Considerations

Task identifiers, failure classes, dependency relationships, and reason codes may expose sensitive operational details. In some deployments, these details can reveal user behavior, internal service structure, or policy logic.

Implementations SHOULD disclose only the information necessary for interoperable recovery. If a deployment requires broader analytics or audit retention, that policy is deployment-specific and outside the scope of this document.

## 11. IANA Considerations

This document currently requests no IANA action.

Future versions may request compact registries for failure classes, rollback outcome codes, or event type identifiers if implementation experience shows that fixed interoperation points are needed.

## 12. References

- [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.
- [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, May 2017.
- Placeholder reference for adjacent execution-evidence substrate, if adopted.
- Placeholder reference for `draft-yue-anima-agent-recovery-networks`.
- Placeholder reference for `draft-li-dmsc-macp`.
- Placeholder reference for `draft-fu-nmop-agent-communication-framework`.