Internet-Draft                                           AI/Agent WG
Intended status: Standards Track                          March 2026
Expires: September 15, 2026


         Agent Error Recovery and Rollback (AERR)
         draft-aerr-agent-error-recovery-rollback-00

Abstract

   This document defines the Agent Error Recovery and Rollback
   (AERR) protocol, a lightweight standard for handling errors,
   cascading failures, and rollback in multi-agent systems.
   Autonomous AI agents increasingly make unsupervised decisions,
   yet no standard exists for how agents checkpoint state, signal
   errors to peers, contain cascading failures, or roll back
   autonomous decisions gone wrong.  AERR defines three mechanisms:
   state checkpoints that agents create before consequential
   actions, a circuit breaker pattern to contain cascading failures
   across agent networks, and a rollback protocol for reverting
   agent actions to a known-good state.  The protocol is transport-
   agnostic and builds on JSON and standard HTTP semantics.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   This document is intended to have Standards Track status.
   Distribution of this memo is unlimited.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Problem Statement
   4.  Checkpoint Mechanism
   5.  Error Signaling
   6.  Circuit Breaker Pattern
   7.  Rollback Protocol
   8.  Security Considerations
   9.  IANA Considerations

1.  Introduction

   The IETF AI/agent landscape includes 60 drafts on autonomous
   network operations but none that standardize error recovery.
   When an autonomous agent misconfigures a router, allocates
   resources incorrectly, or triggers an unintended cascade of
   actions across a multi-agent system, there is currently no
   standard mechanism for detecting the failure, containing its
   blast radius, or reverting to a safe state.

   AERR borrows proven patterns from distributed systems:
   checkpoints from database transactions, circuit breakers from
   microservice architectures, and rollback from version control.
   It adapts these patterns to the specific needs of AI agents,
   where actions may be partially reversible and where the agent
   that caused the error may not be the best one to fix it.

   Design principles:
   1. Agents that take consequential actions MUST be able to undo
      them, or MUST declare them irreversible upfront.
   2. Failure containment takes priority over failure diagnosis.
   3. The protocol adds minimal overhead to the happy path.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
   NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described
   in RFC 2119 [RFC2119].

   Checkpoint: A snapshot of an agent's state and the external
   effects of its actions at a point in time, sufficient to
   restore the system to that state.

   Circuit Breaker: A mechanism that stops an agent from
   propagating requests to a failing downstream agent, preventing
   cascading failures.

   Rollback: The process of reverting an agent's actions and state
   to a previously recorded checkpoint.

   Blast Radius: The set of agents and systems affected by a
   single agent's failure.

3.  Problem Statement

   Consider a network operations scenario: Agent A instructs
   Agent B to update firewall rules, which causes Agent C's
   traffic monitoring to fail, which causes Agent D to
   misclassify traffic patterns.  Today each agent handles errors
   independently with no coordination.  There is no standard way
   for Agent D to signal that the root cause is upstream, for the
   cascade to be halted, or for the chain of actions to be rolled
   back.

   The only existing draft that partially addresses this space
   (draft-yue-anima-agent-recovery-networks) focuses on mobile
   network fault recovery and does not provide general-purpose
   error recovery primitives usable across agent types.

4.  Checkpoint Mechanism

   An AERR-compliant agent MUST create a checkpoint before any
   action it classifies as "consequential."  An action is
   consequential if it modifies external state (e.g., network
   configuration or database records) or invokes an API with side
   effects.

   A checkpoint is a JSON object:

      {
        "checkpoint_id": "urn:uuid:...",
        "agent_id": "urn:uuid:...",
        "timestamp": "2026-03-01T12:00:00Z",
        "action": {
          "type": "config_update",
          "target": "router-07.example.com",
          "description": "Update BGP peer config"
        },
        "reversible": true,
        "rollback_procedure": {
          "method": "POST",
          "uri": "https://agent-b.example.com/aerr/rollback",
          "payload_ref": "urn:uuid:...prior-config-snapshot"
        },
        "state_hash": "sha256:abcdef...",
        "ttl": 86400
      }

   The "reversible" field MUST be present.  If false, the agent
   declares that this action cannot be automatically undone and
   rollback requests for this checkpoint MUST be escalated to a
   human operator.

   The "state_hash" provides integrity verification: the agent
   hashes its relevant state at checkpoint time so that rollback
   can verify it is restoring to an authentic prior state.

   Checkpoints MUST be stored for at least the duration specified
   by "ttl" (seconds).  Agents SHOULD store checkpoints in durable
   storage that survives agent restarts.
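
   As a non-normative illustration, the following Python sketch
   builds a checkpoint object and computes its "state_hash".  The
   helper names (state_hash, make_checkpoint, may_discard) and the
   canonical JSON encoding used as hash input are assumptions of
   this sketch; this document does not define how state is
   serialized before hashing.

```python
import hashlib
import json
import uuid
from datetime import datetime, timedelta, timezone

TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def state_hash(state: dict) -> str:
    """Hash a deterministic JSON encoding of the agent's relevant state."""
    # Assumption: sorted keys and fixed separators serve as the canonical form.
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_checkpoint(agent_id, action, state, rollback_procedure=None,
                    ttl=86400, now=None):
    """Build a checkpoint with the fields shown in the example above."""
    now = now or datetime.now(timezone.utc)
    checkpoint = {
        "checkpoint_id": f"urn:uuid:{uuid.uuid4()}",
        "agent_id": agent_id,
        "timestamp": now.strftime(TS_FORMAT),
        "action": action,
        # Assumption: reversible only when a rollback procedure is supplied.
        "reversible": rollback_procedure is not None,
        "state_hash": state_hash(state),
        "ttl": ttl,
    }
    if rollback_procedure is not None:
        checkpoint["rollback_procedure"] = rollback_procedure
    return checkpoint


def may_discard(checkpoint: dict, now: datetime) -> bool:
    """True once the retention period given by "ttl" has elapsed."""
    created = datetime.strptime(checkpoint["timestamp"], TS_FORMAT)
    created = created.replace(tzinfo=timezone.utc)
    return now >= created + timedelta(seconds=checkpoint["ttl"])
```

   An agent would call make_checkpoint() immediately before the
   consequential action and persist the result.  may_discard()
   models only the minimum retention period; an implementation is
   free to keep checkpoints longer.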

   Agents MAY create hierarchical checkpoints where a parent
   checkpoint groups multiple child checkpoints from a multi-step
   operation.  Rolling back the parent rolls back all children.
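
   A minimal Python sketch of hierarchical rollback follows.  The
   Checkpoint class, its "undo" hook, and the choice to undo
   children in reverse creation order before the parent are
   assumptions made for illustration; this document only requires
   that rolling back the parent rolls back all children.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Checkpoint:
    checkpoint_id: str
    # Hypothetical hook that reverts this checkpoint's own action.
    undo: Optional[Callable[[], None]] = None
    children: List["Checkpoint"] = field(default_factory=list)


def rollback_tree(cp: Checkpoint) -> List[str]:
    """Roll back all children (latest first), then the checkpoint itself.

    Returns the checkpoint IDs in the order they were rolled back.
    """
    order: List[str] = []
    for child in reversed(cp.children):
        order.extend(rollback_tree(child))
    if cp.undo is not None:
        cp.undo()
    order.append(cp.checkpoint_id)
    return order
```

   Undoing the most recent step first mirrors how nested
   transactions are usually unwound, so that each undo runs against
   the state its action originally produced.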

5.  Error Signaling

   When an agent detects an error, it MUST emit an AERR error
   signal to all agents in the current action chain.  The error
   signal is an HTTP POST to each peer's AERR endpoint:

      POST /aerr/error HTTP/1.1
      Content-Type: application/json

      {
        "error_id": "urn:uuid:...",
        "source_agent": "urn:uuid:...",
        "severity": "critical",
        "checkpoint_id": "urn:uuid:...",
        "error_type": "action_failed",
        "description": "BGP session did not establish after config update",
        "timestamp": "2026-03-01T12:05:00Z",
        "upstream_errors": []
      }

   Severity levels: "info", "warning", "error", "critical".

   Error types: "action_failed", "timeout", "constraint_violation",
   "resource_exhausted", "upstream_cascade", "circuit_open"
   (Section 6), "unknown".

   When an agent receives an error signal caused by an action it
   initiated, it MUST either:
   (a) Attempt automatic rollback of its checkpoint, or
   (b) Escalate to its operator if the action was irreversible.

   The "upstream_errors" array allows agents to chain error
   context, building a causal trace from the symptom back to the
   root cause.
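
   The following Python sketch illustrates one way to chain error
   context.  It assumes that each "upstream_errors" entry is a copy
   of an earlier signal without its own "upstream_errors" member,
   and that entries are ordered from root cause to most recent;
   this document does not define the element format or ordering.
   The function names are hypothetical.

```python
import uuid
from datetime import datetime, timezone


def make_error_signal(source_agent, checkpoint_id, error_type, description,
                      severity="error", upstream_errors=None, now=None):
    """Build an error signal with the fields shown in the example above."""
    now = now or datetime.now(timezone.utc)
    return {
        "error_id": f"urn:uuid:{uuid.uuid4()}",
        "source_agent": source_agent,
        "severity": severity,
        "checkpoint_id": checkpoint_id,
        "error_type": error_type,
        "description": description,
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "upstream_errors": list(upstream_errors or []),
    }


def wrap_upstream(received, source_agent, checkpoint_id, description):
    """Emit an "upstream_cascade" signal that extends the received trace."""
    # Flatten: carry the received signal's trace, then the signal itself.
    summary = {k: v for k, v in received.items() if k != "upstream_errors"}
    trace = list(received["upstream_errors"]) + [summary]
    return make_error_signal(source_agent, checkpoint_id, "upstream_cascade",
                             description, severity=received["severity"],
                             upstream_errors=trace)


def root_cause(signal):
    """Return the earliest error in the causal trace."""
    trace = signal["upstream_errors"]
    return trace[0] if trace else signal
```

   In the scenario of Section 3, Agent C would wrap the signal it
   receives from Agent B, and Agent D would wrap Agent C's signal,
   so that root_cause() applied to Agent D's signal yields Agent
   B's original "action_failed" error.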

6.  Circuit Breaker Pattern

   Each agent MUST implement a circuit breaker for every downstream
   agent it communicates with.  The circuit breaker has three
   states:

   CLOSED (normal operation): Requests flow through.  The agent
   tracks the error rate over a sliding window (default: 60s).

   OPEN (failure detected): When the error rate exceeds a
   threshold (default: 50% over the window), the circuit breaker
   opens.  All requests to the downstream agent are immediately
   rejected with error_type "circuit_open".  The agent MUST emit
   an error signal to upstream peers.

   HALF-OPEN (recovery probe): After a cooldown period (default:
   30s), the circuit breaker allows a single probe request.  If it
   succeeds, the breaker returns to CLOSED.  If it fails, it
   returns to OPEN with a doubled cooldown (exponential backoff,
   max 300s).
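
   The three states can be sketched as a small state machine.  The
   Python below is a non-normative illustration using the default
   values given above.  The class and method names, the lowercase
   state strings (in particular "half_open"), the injected clock,
   and resetting the cooldown to its default after a successful
   probe are assumptions of this sketch.

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, window_s=60.0, threshold=0.5, cooldown_s=30.0,
                 max_cooldown_s=300.0, clock=time.monotonic):
        self.window_s = window_s
        self.threshold = threshold
        self.max_cooldown_s = max_cooldown_s
        self.clock = clock
        self.state = "closed"
        self._events = deque()          # (timestamp, succeeded) pairs
        self._base_cooldown = cooldown_s
        self._cooldown = cooldown_s
        self._opened_at = None
        self._probe_in_flight = False

    def error_rate(self):
        """Fraction of failed requests within the sliding window."""
        cutoff = self.clock() - self.window_s
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()
        if not self._events:
            return 0.0
        failures = sum(1 for _, ok in self._events if not ok)
        return failures / len(self._events)

    def allow_request(self):
        """Return True if a request may be sent downstream."""
        if (self.state == "open"
                and self.clock() - self._opened_at >= self._cooldown):
            self.state = "half_open"
            self._probe_in_flight = False
        if self.state == "half_open":
            if self._probe_in_flight:
                return False            # only a single probe is allowed
            self._probe_in_flight = True
            return True
        return self.state == "closed"

    def record(self, ok):
        """Record the outcome of a request that was allowed through."""
        now = self.clock()
        if self.state == "half_open":
            self._probe_in_flight = False
            if ok:
                self.state = "closed"
                self._cooldown = self._base_cooldown
                self._events.clear()
            else:
                # Failed probe: reopen with a doubled, capped cooldown.
                self._cooldown = min(self._cooldown * 2, self.max_cooldown_s)
                self._open(now)
            return
        self._events.append((now, ok))
        if self.state == "closed" and self.error_rate() > self.threshold:
            self._open(now)

    def _open(self, now):
        self.state = "open"
        self._opened_at = now
```

   A caller checks allow_request() before each downstream request
   and reports the outcome through record().  Because no minimum
   sample count is specified, a single failure in an otherwise
   empty window opens the breaker in this sketch; deployments may
   want to require a minimum number of requests first.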

   Agents MUST expose circuit breaker state at:

      GET /aerr/circuits

   Response:
      {
        "circuits": [
          {
            "downstream_agent": "urn:uuid:...",
            "state": "open",
            "error_rate": 0.75,
            "last_failure": "2026-03-01T12:05:00Z",
            "cooldown_remaining_s": 22
          }
        ]
      }

   This enables monitoring systems and upstream agents to
   understand the health topology of the agent network.

7.  Rollback Protocol

   A rollback is initiated by sending an HTTP POST to the target
   agent's rollback endpoint:

      POST /aerr/rollback HTTP/1.1
      Content-Type: application/json

      {
        "rollback_id": "urn:uuid:...",
        "checkpoint_id": "urn:uuid:...",
        "reason": "Upstream action caused cascading failure",
        "initiator": "urn:uuid:...",
        "cascade": true
      }

   When "cascade" is true, the receiving agent MUST also initiate
   rollback of any downstream checkpoints that were created as a
   consequence of the checkpointed action.  This enables a single
   rollback request to unwind an entire chain of agent actions.
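
   How an agent learns which downstream checkpoints resulted from
   its action is not specified here.  The Python sketch below
   assumes the agent keeps an index from each of its checkpoint IDs
   to the (agent ID, checkpoint ID) pairs created downstream, and
   derives each downstream "rollback_id" deterministically so that
   a retried request maps to the same downstream requests.  The
   function name, the index structure, and the use of UUID version
   5 are all assumptions of this sketch.

```python
import uuid
from typing import Dict, List, Tuple

# Hypothetical index: own checkpoint ID -> [(agent ID, checkpoint ID), ...]
DownstreamIndex = Dict[str, List[Tuple[str, str]]]


def cascade_requests(request: dict,
                     index: DownstreamIndex) -> List[Tuple[str, dict]]:
    """Build the rollback requests to send downstream for one request.

    Returns (target agent ID, request body) pairs; sending is left out.
    """
    if not request.get("cascade"):
        return []
    out = []
    for agent_id, child_cp in index.get(request["checkpoint_id"], []):
        # Deterministic ID: same parent request + child -> same rollback_id.
        name = request["rollback_id"] + "|" + child_cp
        child_id = uuid.uuid5(uuid.NAMESPACE_URL, name)
        out.append((agent_id, {
            "rollback_id": f"urn:uuid:{child_id}",
            "checkpoint_id": child_cp,
            "reason": request["reason"],
            "initiator": request["initiator"],
            "cascade": True,
        }))
    return out
```

   Each returned body has the same shape as the rollback request
   shown above and would be POSTed to the corresponding agent's
   rollback endpoint.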

   The agent MUST respond with a rollback result:

      {
        "rollback_id": "urn:uuid:...",
        "status": "partial",
        "checkpoint_id": "urn:uuid:...",
        "state_hash_before": "sha256:...",
        "state_hash_after": "sha256:...",
        "cascaded_rollbacks": [
          {"agent_id": "urn:uuid:...", "status": "completed"},
          {"agent_id": "urn:uuid:...", "status": "escalated"}
        ]
      }

   Rollback status values: "completed", "partial", "escalated",
   "failed".

   "escalated" means the action was irreversible and a human
   operator has been notified.  "partial" means some but not all
   downstream rollbacks succeeded.

   Agents MUST implement idempotent rollback: receiving the same
   rollback_id twice MUST return the same result without re-
   executing the rollback.
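
   One way to meet this requirement is to record each result under
   its "rollback_id" and replay it on repeat requests.  The Python
   sketch below is illustrative only: the RollbackHandler class and
   its "execute" callback are hypothetical, and the in-memory
   dictionary stands in for the durable, concurrency-safe store a
   real agent would need.

```python
from typing import Callable, Dict


class RollbackHandler:
    def __init__(self, execute: Callable[[dict], dict]):
        # Hypothetical callback that performs the rollback and
        # returns the rollback result object.
        self._execute = execute
        self._results: Dict[str, dict] = {}

    def handle(self, request: dict) -> dict:
        """Process a rollback request at most once per rollback_id."""
        rollback_id = request["rollback_id"]
        if rollback_id in self._results:
            # Duplicate request: replay the stored result unchanged.
            return self._results[rollback_id]
        result = self._execute(request)
        self._results[rollback_id] = result
        return result
```

   Stored results would need to be retained at least as long as
   the initiator may retry the request.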

8.  Security Considerations

   Rollback requests are sensitive operations.  Agents MUST
   authenticate rollback requests using mutual TLS or signed JWTs.
   Agents SHOULD accept rollback requests only from agents in the
   same action chain (identified by checkpoint lineage).

   Checkpoint data may contain sensitive system state.  Agents
   MUST encrypt stored checkpoints at rest and MUST NOT include
   checkpoint contents in error signals.

   Circuit breaker state is observable information about system
   health.  The /aerr/circuits endpoint SHOULD be access-
   controlled to prevent adversaries from mapping system topology.

   Malicious agents could send false error signals to trigger
   unnecessary rollbacks.  Agents SHOULD verify that error signals
   reference valid checkpoint IDs from their own action chains
   before initiating rollback.

9.  IANA Considerations

   This document requests IANA establish the following:

   1. An "AERR Error Type" registry under Specification Required
      policy.  Initial entries: "action_failed", "timeout",
      "constraint_violation", "resource_exhausted",
      "upstream_cascade", "circuit_open", "unknown".

   2. An "AERR Severity Level" registry under Specification
      Required policy.  Initial entries: "info", "warning",
      "error", "critical".

   3. Well-known URI registrations for "aerr/error",
      "aerr/rollback", and "aerr/circuits" per RFC 8615.

Author's Address

   Generated by IETF Draft Analyzer
   2026-03-01