---
title: "Agent Error Recovery and Rollback (AERR)"
abbrev: "AERR"
category: std
docname: draft-aerr-agent-error-recovery-rollback-00
submissiontype: IETF
number:
date:
v: 3
area: "OPS"
workgroup: "NMOP"
keyword:
 - error recovery
 - rollback
 - circuit breaker
 - agentic workflows
 - execution context

author:
 - fullname: Generated by IETF Draft Analyzer
   organization: Independent
   email: placeholder@example.com

normative:
  RFC7519:
  RFC7515:
  RFC9110:
  I-D.nennemann-wimse-ect:
    title: "Execution Context Tokens for Distributed Agentic Workflows"
    target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/

informative:
  I-D.nennemann-agent-dag-hitl-safety:
    title: "Agent Context Policy Token: DAG Delegation with Human Override"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/

--- abstract

This document defines the Agent Error Recovery and Rollback (AERR) protocol, a standard for handling errors, cascading failures, and rollback in multi-agent systems. AERR defines three mechanisms: state checkpoints recorded as Execution Context Token (ECT) DAG nodes, a circuit breaker pattern to contain cascading failures, and a rollback protocol that walks the ECT DAG backwards to revert agent actions to a known-good state. By building on ECT, AERR inherits cryptographic audit trails, assurance levels, and DAG validation without inventing parallel infrastructure.

--- middle

# Introduction

The IETF AI/agent landscape includes 60 drafts on autonomous network operations but none that standardize error recovery. When an autonomous agent misconfigures a router, allocates resources incorrectly, or triggers a cascade of failures across a multi-agent system, there is no standard mechanism for detecting the failure, containing its blast radius, or reverting to a safe state.

AERR borrows proven patterns from distributed systems -- checkpoints from database transactions, circuit breakers from microservice architectures, rollback from version control -- and adapts them for AI agent workflows. Rather than inventing its own audit and tracing layer, AERR records all checkpoints, errors, and rollbacks as ECT DAG nodes {{I-D.nennemann-wimse-ect}}, giving every recovery action a cryptographic proof chain.

Design principles:

1. Agents that take consequential actions MUST be able to undo them, or MUST declare them irreversible upfront.

2. Failure containment takes priority over failure diagnosis.

3. The protocol adds minimal overhead to the happy path.

# Conventions and Definitions

{::boilerplate bcp14-tagged}

Checkpoint:
: An ECT recording an agent's state hash before a consequential action, providing a restore point for rollback.

Circuit Breaker:
: A mechanism that stops an agent from propagating requests to a failing downstream agent, preventing cascading failures.

Rollback:
: The process of reverting an agent's actions and state to a previously recorded checkpoint, walking the ECT DAG backwards.

Blast Radius:
: The set of agents and systems affected by a single agent's failure, determinable by traversing the ECT DAG forward from the failing node.

# Problem Statement

Consider a network operations scenario: Agent A instructs Agent B to update firewall rules, which causes Agent C's traffic monitoring to fail, which causes Agent D to misclassify traffic. Today each agent handles errors independently. There is no standard way for Agent D to signal that the root cause is upstream, for the cascade to be halted, or for the chain of actions to be rolled back.

The ECT DAG {{I-D.nennemann-wimse-ect}} already records causal ordering of agent actions via `par` references. AERR adds checkpoint semantics, error propagation, and rollback operations on top of this existing structure.
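
The forward traversal that determines a blast radius can be illustrated with a short, non-normative Python sketch. It assumes each ECT is available as a plain dictionary with a `jti`, a `par` list of parent `jti` values, and a hypothetical `agent` field naming the issuer; none of these representations are defined by this document.

~~~python
from collections import defaultdict, deque


def blast_radius(ects, failing_jti):
    """Return the agents reachable forward from failing_jti in the DAG.

    `ects` is a list of dicts with illustrative keys: "jti", "par"
    (list of parent jti values) and "agent" (issuer identifier).
    """
    children = defaultdict(list)  # parent jti -> jti values that cite it
    agent_of = {}
    for ect in ects:
        agent_of[ect["jti"]] = ect["agent"]
        for parent in ect.get("par", []):
            children[parent].append(ect["jti"])

    affected = set()
    seen = {failing_jti}
    queue = deque([failing_jti])
    while queue:  # breadth-first walk along reversed `par` edges
        jti = queue.popleft()
        if jti in agent_of:
            affected.add(agent_of[jti])
        for child in children[jti]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return affected


# The scenario from the Problem Statement, plus an unrelated branch (x1).
workflow = [
    {"jti": "a1", "par": [], "agent": "agent-a"},
    {"jti": "b1", "par": ["a1"], "agent": "agent-b"},
    {"jti": "c1", "par": ["b1"], "agent": "agent-c"},
    {"jti": "d1", "par": ["c1"], "agent": "agent-d"},
    {"jti": "x1", "par": ["a1"], "agent": "agent-x"},
]
print(sorted(blast_radius(workflow, "b1")))
# prints ['agent-b', 'agent-c', 'agent-d']
~~~

Because `par` points from an action to its predecessors, the sketch first inverts those references into a child index. The unrelated branch `x1` is excluded because it does not descend from the failing node.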

# Checkpoint Mechanism {#checkpoints}

An AERR-compliant agent MUST create a checkpoint ECT before any action it classifies as consequential. An action is consequential if it modifies external state (e.g., network config, database records, API calls with side effects).

## Checkpoint as ECT

A checkpoint is an ECT with:

- `exec_act`: `"aerr:checkpoint"`
- `par`: the `jti` of the preceding task ECT in the workflow
- `out_hash`: SHA-256 hash of the agent's state snapshot at checkpoint time (for rollback integrity verification)

The `ext` claim carries AERR-specific metadata:

~~~json
{
  "ext": {
    "aerr.action_type": "config_update",
    "aerr.target": "router-07.example.com",
    "aerr.reversible": true,
    "aerr.rollback_uri": "https://agent-b.example.com/aerr/rollback",
    "aerr.ttl": 86400
  }
}
~~~
{: #fig-checkpoint title="Checkpoint ECT Extension Claims"}

The `aerr.reversible` field MUST be present. If `false`, the agent declares that this action cannot be automatically undone and rollback requests MUST be escalated to a human operator via the HITL mechanism {{I-D.nennemann-agent-dag-hitl-safety}}.

Agents MAY create hierarchical checkpoints using the ECT DAG: a parent checkpoint ECT with `par` references to multiple child checkpoint ECTs. Rolling back the parent rolls back all children.

## Checkpoint Storage

Checkpoint ECTs MUST be stored for at least the duration specified by `aerr.ttl`. At L3 {{I-D.nennemann-wimse-ect}}, checkpoints are automatically preserved in the audit ledger. At L1 and L2, agents MUST store checkpoints in durable local storage that survives agent restarts.

# Error Signaling {#error-signals}

When an agent detects an error, it MUST produce an error ECT and propagate it to affected agents in the DAG.
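
Error handling depends on a checkpoint having been recorded first, so the following non-normative Python sketch shows how the checkpoint claims from {{checkpoints}} might be assembled. The function name, the sorted-key JSON canonicalization of the state snapshot, the hex encoding of `out_hash`, and the representation of `par` as a list are all assumptions made for illustration. Signing the claims as an ECT is out of scope here.

~~~python
import hashlib
import json
import uuid


def build_checkpoint_claims(state, parent_jti, target, action_type,
                            reversible, rollback_uri=None, ttl=86400):
    """Assemble unsigned checkpoint claims (illustrative only)."""
    # Canonicalize the snapshot so equal states always hash identically.
    # Sorted-key JSON is an assumption; no encoding is mandated here.
    snapshot = json.dumps(state, sort_keys=True,
                          separators=(",", ":")).encode("utf-8")
    ext = {
        "aerr.action_type": action_type,
        "aerr.target": target,
        "aerr.reversible": reversible,  # always present, even when False
        "aerr.ttl": ttl,
    }
    if reversible and rollback_uri is not None:
        ext["aerr.rollback_uri"] = rollback_uri
    return {
        "jti": str(uuid.uuid4()),
        "exec_act": "aerr:checkpoint",
        "par": [parent_jti],
        "out_hash": hashlib.sha256(snapshot).hexdigest(),
        "ext": ext,
    }


claims = build_checkpoint_claims(
    state={"acl": ["permit 10.0.0.0/8"], "version": 41},
    parent_jti="550e8400-e29b-41d4-a716-446655440000",
    target="router-07.example.com",
    action_type="config_update",
    reversible=True,
    rollback_uri="https://agent-b.example.com/aerr/rollback",
)
print(json.dumps(claims["ext"], indent=2))
~~~

Hashing a canonical form means that two snapshots with the same content produce the same `out_hash` regardless of key order, which is what makes the hash usable for integrity verification after a rollback.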

## Error ECT

An error signal is an ECT with:

- `exec_act`: `"aerr:error"`
- `par`: the `jti` of the checkpoint ECT associated with the failing action

The `ext` claim carries error details:

~~~json
{
  "ext": {
    "aerr.severity": "critical",
    "aerr.error_type": "action_failed",
    "aerr.description": "BGP session did not establish",
    "aerr.checkpoint_id": "550e8400-e29b-41d4-a716-446655440001",
    "aerr.upstream_errors": []
  }
}
~~~
{: #fig-error title="Error ECT Extension Claims"}

Severity levels: `info`, `warning`, `error`, `critical`.

Error types: `action_failed`, `timeout`, `constraint_violation`, `resource_exhausted`, `upstream_cascade`, `circuit_open`, `unknown`.

## Error Propagation via DAG

When an agent receives an error ECT caused by an action it initiated, it MUST either:

(a) Attempt automatic rollback of its checkpoint ({{rollback}}), or

(b) Escalate to its operator if the action was irreversible.

The `aerr.upstream_errors` array allows agents to chain error context by referencing `jti` values of predecessor error ECTs, building a causal trace from symptom to root cause through the DAG.

## HITL Escalation

When an error requires human intervention, the error ECT SHOULD trigger a HITL rule per {{I-D.nennemann-agent-dag-hitl-safety}}. Example policy:

~~~json
{
  "hitl": {
    "rules": [{
      "id": "r-critical-error",
      "trigger": {
        "kind": "keyword_match",
        "op": "eq",
        "value": "critical",
        "input_ref": "ext.aerr.severity"
      },
      "required_role": "operator:oncall",
      "action": "escalate",
      "allow_override": true,
      "override_action": "continue"
    }]
  }
}
~~~
{: #fig-hitl-error title="HITL Policy for Critical Errors"}

# Circuit Breaker Pattern {#circuit-breaker}

Each agent MUST implement a circuit breaker for every downstream agent it communicates with.

## States

CLOSED (normal):
: Requests flow through. The agent tracks the error rate over a sliding window (default: 60 seconds).

OPEN (failure detected):
: When the error rate exceeds a threshold (default: 50% over the window), the breaker opens.
  All requests to the downstream agent are immediately rejected with `aerr.error_type`: `circuit_open`. The agent MUST produce an error ECT and emit it to upstream peers.

HALF-OPEN (recovery probe):
: After a cooldown period (default: 30 seconds), the breaker allows a single probe request. If it succeeds, the breaker returns to CLOSED. If it fails, it returns to OPEN with doubled cooldown (exponential backoff, max 300 seconds).

## State Change ECTs

Each circuit breaker state change MUST produce an ECT:

- `exec_act`: `"aerr:circuit_open"`, `"aerr:circuit_half_open"`, or `"aerr:circuit_closed"`
- `par`: the `jti` of the error ECT that triggered the transition

This records the health topology of the agent network in the ECT DAG, queryable from the audit ledger at L3.

## Observability

Agents MUST expose circuit breaker state at:

~~~
GET /aerr/circuits
~~~

Response:

~~~json
{
  "circuits": [{
    "downstream_agent": "spiffe://example.com/agent/router-mgr",
    "state": "open",
    "error_rate": 0.75,
    "last_failure_ect": "550e8400-e29b-41d4-a716-446655440099",
    "cooldown_remaining_s": 22
  }]
}
~~~
{: #fig-circuits title="Circuit Breaker Status"}

# Rollback Protocol {#rollback}

## Rollback Request

A rollback is initiated by sending an HTTP POST to the target agent's rollback endpoint:

~~~
POST /aerr/rollback HTTP/1.1
Content-Type: application/json
Execution-Context:

{
  "rollback_id": "urn:uuid:...",
  "checkpoint_id": "550e8400-e29b-41d4-a716-446655440001",
  "reason": "Upstream action caused cascading failure",
  "cascade": true
}
~~~
{: #fig-rollback-req title="Rollback Request"}

The request MUST include an ECT in the Execution-Context header with `exec_act`: `"aerr:rollback_request"` and `par` referencing the error ECT that motivated the rollback.

When `cascade` is `true`, the receiving agent MUST also initiate rollback of any downstream checkpoints created as a consequence of the checkpointed action. The ECT DAG's `par` chain identifies these downstream actions.
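
The selection of downstream checkpoints for a cascading rollback can be illustrated with a non-normative Python sketch. It assumes ECTs are available as dictionaries with `jti`, `exec_act`, and a `par` list. It also assumes that the most-downstream checkpoints are reverted first, which is one plausible reading of walking the DAG backwards and is not an ordering mandated by this document.

~~~python
from collections import defaultdict


def cascade_rollback_order(ects, checkpoint_jti):
    """List checkpoint jti values to roll back, most-downstream first."""
    children = defaultdict(list)  # parent jti -> jti values that cite it
    act = {}
    for ect in ects:
        act[ect["jti"]] = ect["exec_act"]
        for parent in ect.get("par", []):
            children[parent].append(ect["jti"])

    order, seen = [], set()

    def visit(jti):
        if jti in seen:
            return
        seen.add(jti)
        for child in children[jti]:
            visit(child)
        # Post-order: a checkpoint is emitted after all its descendants.
        if act.get(jti) == "aerr:checkpoint":
            order.append(jti)

    visit(checkpoint_jti)
    return order


# Hypothetical workflow: B, C and D each checkpoint before acting.
workflow = [
    {"jti": "cp-b", "exec_act": "aerr:checkpoint", "par": ["task-a"]},
    {"jti": "act-b", "exec_act": "update_firewall", "par": ["cp-b"]},
    {"jti": "cp-c", "exec_act": "aerr:checkpoint", "par": ["act-b"]},
    {"jti": "act-c", "exec_act": "monitor_traffic", "par": ["cp-c"]},
    {"jti": "cp-d", "exec_act": "aerr:checkpoint", "par": ["act-c"]},
]
print(cascade_rollback_order(workflow, "cp-b"))
# prints ['cp-d', 'cp-c', 'cp-b']
~~~

Reverting dependents before the action they depend on avoids leaving a downstream agent operating on state that has already been withdrawn. An implementation with deep DAGs would likely replace the recursion with an explicit stack.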

## Rollback Response

The agent produces a rollback result ECT with:

- `exec_act`: `"aerr:rollback_complete"` (or `"aerr:rollback_escalated"`)
- `par`: the `jti` of the rollback request ECT
- `out_hash`: SHA-256 hash of the agent's state after rollback

~~~json
{
  "ext": {
    "aerr.rollback_id": "urn:uuid:...",
    "aerr.status": "completed",
    "aerr.state_hash_before": "sha256:...",
    "aerr.state_hash_after": "sha256:...",
    "aerr.cascaded": [
      {"agent": "spiffe://example.com/agent/monitor", "status": "completed"},
      {"agent": "spiffe://example.com/agent/classify", "status": "escalated"}
    ]
  }
}
~~~
{: #fig-rollback-resp title="Rollback Result ECT"}

Status values: `completed`, `partial`, `escalated`, `failed`. `escalated` means the action was irreversible and a human operator has been notified via HITL. `partial` means some but not all downstream rollbacks succeeded.

## Idempotency

Agents MUST implement idempotent rollback: receiving the same `rollback_id` twice MUST return the same result without re-executing the rollback.

# Security Considerations

Rollback requests are sensitive operations. Agents MUST authenticate rollback requests via the ECT signature chain. Agents SHOULD authorize rollback requests only from agents whose ECTs appear in the same workflow DAG (identified by `wid`).

Checkpoint ECTs contain `out_hash` of agent state but not the state itself. Agents MUST encrypt stored state snapshots at rest.

Circuit breaker status exposes system health topology. The `/aerr/circuits` endpoint SHOULD be access-controlled.

Malicious agents could emit false error ECTs to trigger rollbacks. Agents SHOULD verify that error ECTs reference valid checkpoint `jti` values from their own workflow DAG before initiating rollback. At L2 and L3, ECT signatures prevent forgery.

# IANA Considerations

This document requests the following IANA registrations:

1. An "AERR Error Type" registry under Specification Required policy.
   Initial entries: `action_failed`, `timeout`, `constraint_violation`, `resource_exhausted`, `upstream_cascade`, `circuit_open`, `unknown`.

2. Registration of `exec_act` values `aerr:checkpoint`, `aerr:error`, `aerr:rollback_request`, `aerr:rollback_complete`, `aerr:rollback_escalated`, `aerr:circuit_open`, `aerr:circuit_half_open`, `aerr:circuit_closed` in a future ECT action type registry.

--- back

# Acknowledgments
{:numbered="false"}

This document builds on the Execution Context Token specification {{I-D.nennemann-wimse-ect}} for DAG-based audit trails and the Agent Context Policy Token {{I-D.nennemann-agent-dag-hitl-safety}} for HITL escalation of irreversible actions.