Internet-Draft                                             AI/Agent WG
Intended status: Standards Track                            March 2026
Expires: September 15, 2026

                Agent Error Recovery and Rollback (AERR)
              draft-aerr-agent-error-recovery-rollback-00

Abstract

   This document defines the Agent Error Recovery and Rollback (AERR)
   protocol, a lightweight standard for handling errors, cascading
   failures, and rollback in multi-agent systems. Autonomous AI agents
   increasingly make unsupervised decisions, yet no standard exists for
   how agents checkpoint state, signal errors to peers, contain
   cascading failures, or roll back autonomous decisions gone wrong.
   AERR defines three mechanisms: state checkpoints that agents create
   before consequential actions, a circuit breaker pattern to contain
   cascading failures across agent networks, and a rollback protocol
   for reverting agent actions to a known-good state. The protocol is
   transport-agnostic and builds on JSON and standard HTTP semantics.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79. This document is intended to have
   Standards Track status. Distribution of this memo is unlimited.

Table of Contents

   1. Introduction
   2. Terminology
   3. Problem Statement
   4. Checkpoint Mechanism
   5. Error Signaling
   6. Circuit Breaker Pattern
   7. Rollback Protocol
   8. Security Considerations
   9. IANA Considerations

1. Introduction

   The IETF AI/agent landscape includes 60 drafts on autonomous network
   operations but none that standardize error recovery. When an
   autonomous agent misconfigures a router, allocates resources
   incorrectly, or triggers an unintended cascade of actions across a
   multi-agent system, there is currently no standard mechanism for
   detecting the failure, containing its blast radius, or reverting to
   a safe state.

   AERR borrows proven patterns from distributed systems: checkpoints
   from database transactions, circuit breakers from microservice
   architectures, and rollback from version control.
   It adapts these patterns to the specific needs of AI agents, where
   actions may be partially reversible and where the agent that caused
   the error may not be the best one to fix it.

   Design principles:

   1. Agents that take consequential actions MUST be able to undo them,
      or MUST declare them irreversible upfront.
   2. Failure containment takes priority over failure diagnosis.
   3. The protocol adds minimal overhead to the happy path.

2. Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

   Checkpoint: A snapshot of an agent's state and the external effects
      of its actions at a point in time, sufficient to restore the
      system to that state.

   Circuit Breaker: A mechanism that stops an agent from propagating
      requests to a failing downstream agent, preventing cascading
      failures.

   Rollback: The process of reverting an agent's actions and state to a
      previously recorded checkpoint.

   Blast Radius: The set of agents and systems affected by a single
      agent's failure.

3. Problem Statement

   Consider a network operations scenario: Agent A instructs Agent B to
   update firewall rules, which causes Agent C's traffic monitoring to
   fail, which causes Agent D to misclassify traffic patterns. Today
   each agent handles errors independently with no coordination. There
   is no standard way for Agent D to signal that the root cause is
   upstream, for the cascade to be halted, or for the chain of actions
   to be rolled back.

   The only existing draft that partially addresses this space
   (draft-yue-anima-agent-recovery-networks) focuses on mobile network
   fault recovery and does not provide general-purpose error recovery
   primitives usable across agent types.

4. Checkpoint Mechanism

   An AERR-compliant agent MUST create a checkpoint before any action
   it classifies as "consequential."
   An action is consequential if it modifies external state (e.g.,
   network config, database records, API calls with side effects).

   A checkpoint is a JSON object:

   {
     "checkpoint_id": "urn:uuid:...",
     "agent_id": "urn:uuid:...",
     "timestamp": "2026-03-01T12:00:00Z",
     "action": {
       "type": "config_update",
       "target": "router-07.example.com",
       "description": "Update BGP peer config"
     },
     "reversible": true,
     "rollback_procedure": {
       "method": "POST",
       "uri": "https://agent-b.example.com/aerr/rollback",
       "payload_ref": "urn:uuid:...prior-config-snapshot"
     },
     "state_hash": "sha256:abcdef...",
     "ttl": 86400
   }

   The "reversible" field MUST be present. If false, the agent declares
   that this action cannot be automatically undone and rollback
   requests for this checkpoint MUST be escalated to a human operator.

   The "state_hash" provides integrity verification: the agent hashes
   its relevant state at checkpoint time so that rollback can verify it
   is restoring to an authentic prior state.

   Checkpoints MUST be stored for at least the duration specified by
   "ttl" (seconds). Agents SHOULD store checkpoints in durable storage
   that survives agent restarts.

   Agents MAY create hierarchical checkpoints where a parent checkpoint
   groups multiple child checkpoints from a multi-step operation.
   Rolling back the parent rolls back all children.

5. Error Signaling

   When an agent detects an error, it MUST emit an AERR error signal to
   all agents in the current action chain. The error signal is an HTTP
   POST to each peer's AERR endpoint:

   POST /aerr/error HTTP/1.1
   Content-Type: application/json

   {
     "error_id": "urn:uuid:...",
     "source_agent": "urn:uuid:...",
     "severity": "critical",
     "checkpoint_id": "urn:uuid:...",
     "error_type": "action_failed",
     "description": "BGP session did not establish after config update",
     "timestamp": "2026-03-01T12:05:00Z",
     "upstream_errors": []
   }

   Severity levels: "info", "warning", "error", "critical".
   Error types: "action_failed", "timeout", "constraint_violation",
   "resource_exhausted", "upstream_cascade", "circuit_open", "unknown".

   When an agent receives an error signal caused by an action it
   initiated, it MUST either:

   (a) Attempt automatic rollback of its checkpoint, or
   (b) Escalate to its operator if the action was irreversible.

   The "upstream_errors" array allows agents to chain error context,
   building a causal trace from the symptom back to the root cause.

6. Circuit Breaker Pattern

   Each agent MUST implement a circuit breaker for every downstream
   agent it communicates with. The circuit breaker has three states:

   CLOSED (normal operation): Requests flow through. The agent tracks
      the error rate over a sliding window (default: 60s).

   OPEN (failure detected): When the error rate exceeds a threshold
      (default: 50% over the window), the circuit breaker opens. All
      requests to the downstream agent are immediately rejected with
      error_type "circuit_open". The agent MUST emit an error signal to
      upstream peers.

   HALF-OPEN (recovery probe): After a cooldown period (default: 30s),
      the circuit breaker allows a single probe request. If it
      succeeds, the breaker returns to CLOSED. If it fails, it returns
      to OPEN with a doubled cooldown (exponential backoff, max 300s).

   Agents MUST expose circuit breaker state at:

   GET /aerr/circuits

   Response:

   {
     "circuits": [
       {
         "downstream_agent": "urn:uuid:...",
         "state": "open",
         "error_rate": 0.75,
         "last_failure": "2026-03-01T12:05:00Z",
         "cooldown_remaining_s": 22
       }
     ]
   }

   This enables monitoring systems and upstream agents to understand
   the health topology of the agent network.

7. Rollback Protocol

   A rollback is initiated by sending an HTTP POST to the target
   agent's rollback endpoint:

   POST /aerr/rollback HTTP/1.1
   Content-Type: application/json

   {
     "rollback_id": "urn:uuid:...",
     "checkpoint_id": "urn:uuid:...",
     "reason": "Upstream action caused cascading failure",
     "initiator": "urn:uuid:...",
     "cascade": true
   }

   When "cascade" is true, the receiving agent MUST also initiate
   rollback of any downstream checkpoints that were created as a
   consequence of the checkpointed action. This enables a single
   rollback request to unwind an entire chain of agent actions.

   The agent MUST respond with a rollback result:

   {
     "rollback_id": "urn:uuid:...",
     "status": "completed",
     "checkpoint_id": "urn:uuid:...",
     "state_hash_before": "sha256:...",
     "state_hash_after": "sha256:...",
     "cascaded_rollbacks": [
       {"agent_id": "urn:uuid:...", "status": "completed"},
       {"agent_id": "urn:uuid:...", "status": "escalated"}
     ]
   }

   Rollback status values: "completed", "partial", "escalated",
   "failed". "escalated" means the action was irreversible and a human
   operator has been notified. "partial" means some but not all
   downstream rollbacks succeeded.

   Agents MUST implement idempotent rollback: receiving the same
   rollback_id twice MUST return the same result without re-executing
   the rollback.

8. Security Considerations

   Rollback requests are sensitive operations. Agents MUST authenticate
   rollback requests using mutual TLS or signed JWTs. Agents SHOULD
   accept rollback requests only from agents in the same action chain
   (identified by checkpoint lineage).

   Checkpoint data may contain sensitive system state. Agents MUST
   encrypt stored checkpoints at rest and MUST NOT include checkpoint
   contents in error signals.

   Circuit breaker state is observable information about system health.
   The /aerr/circuits endpoint SHOULD be access-controlled to prevent
   adversaries from mapping system topology.

   Malicious agents could send false error signals to trigger
   unnecessary rollbacks.
   Agents SHOULD verify that error signals reference valid checkpoint
   IDs from their own action chains before initiating rollback.

9. IANA Considerations

   This document requests that IANA establish the following:

   1. An "AERR Error Type" registry under Specification Required
      policy. Initial entries: "action_failed", "timeout",
      "constraint_violation", "resource_exhausted", "upstream_cascade",
      "circuit_open", "unknown".
   2. An "AERR Severity Level" registry under Specification Required
      policy. Initial entries: "info", "warning", "error", "critical".
   3. Well-known URI registrations for "aerr/error", "aerr/rollback",
      and "aerr/circuits" per RFC 8615.

Author's Address

   Generated by IETF Draft Analyzer
   2026-03-01
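
   Implementation note (non-normative): the circuit breaker state
   machine of Section 6 could be sketched roughly as follows. This is
   an illustration under stated assumptions, not part of the protocol.
   The class name, method names, and injectable clock are hypothetical.
   The draft does not define a minimum sample count, so in this sketch
   a single failure in an otherwise empty window opens the breaker.
   Resetting the cooldown to its default after a successful probe is
   likewise an assumption.

```python
import time
from collections import deque


class CircuitBreaker:
    """Sketch of the CLOSED / OPEN / HALF-OPEN states from Section 6."""

    def __init__(self, window_s=60.0, threshold=0.5, cooldown_s=30.0,
                 max_cooldown_s=300.0, clock=time.monotonic):
        self.window_s = window_s              # sliding window (default 60s)
        self.threshold = threshold            # error rate that opens the breaker
        self.base_cooldown_s = cooldown_s     # initial cooldown (default 30s)
        self.cooldown_s = cooldown_s          # current cooldown, doubles on failed probe
        self.max_cooldown_s = max_cooldown_s  # backoff cap (300s)
        self.clock = clock                    # injectable for testing (assumption)
        self.state = "closed"
        self.opened_at = None
        self.results = deque()                # (timestamp, succeeded) pairs

    def _error_rate(self, now):
        # Drop results that have fallen out of the sliding window.
        while self.results and now - self.results[0][0] > self.window_s:
            self.results.popleft()
        if not self.results:
            return 0.0
        failures = sum(1 for _, ok in self.results if not ok)
        return failures / len(self.results)

    def _open(self, now):
        self.state = "open"
        self.opened_at = now

    def allow_request(self):
        """Return True if a request may be sent to the downstream agent."""
        now = self.clock()
        if self.state == "open" and now - self.opened_at >= self.cooldown_s:
            # Cooldown elapsed: let exactly one probe request through.
            self.state = "half-open"
            return True
        return self.state == "closed"

    def record(self, succeeded):
        """Record the outcome of a request that was allowed through."""
        now = self.clock()
        if self.state == "half-open":
            if succeeded:
                # Probe succeeded: close and reset backoff (assumption).
                self.state = "closed"
                self.cooldown_s = self.base_cooldown_s
                self.results.clear()
            else:
                # Probe failed: reopen with a doubled, capped cooldown.
                self.cooldown_s = min(self.cooldown_s * 2, self.max_cooldown_s)
                self._open(now)
            return
        self.results.append((now, succeeded))
        if self.state == "closed" and self._error_rate(now) > self.threshold:
            self._open(now)
```

   A caller would check allow_request() before contacting the
   downstream agent, reject the request locally with error_type
   "circuit_open" when it returns False, and call record() with the
   outcome of each request that was sent. Emitting the error signal to
   upstream peers when the breaker opens is left out of the sketch.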