Internet-Draft                                               AI/Agent WG
Intended status: Standards Track                              March 2026
Expires: September 15, 2026


                Agent Error Recovery and Rollback (AERR)
              draft-aerr-agent-error-recovery-rollback-00

Abstract

This document defines the Agent Error Recovery and Rollback
(AERR) protocol, a lightweight standard for handling errors,
cascading failures, and rollback in multi-agent systems.
Autonomous AI agents increasingly make unsupervised decisions,
yet no standard exists for how agents checkpoint state, signal
errors to peers, contain cascading failures, or roll back
autonomous decisions gone wrong. AERR defines three mechanisms:
state checkpoints that agents create before consequential
actions, a circuit breaker pattern to contain cascading failures
across agent networks, and a rollback protocol for reverting
agent actions to a known-good state. The protocol is transport-
agnostic and builds on JSON and standard HTTP semantics.

Status of This Memo

This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.

This document is intended to have Standards Track status.
Distribution of this memo is unlimited.

Table of Contents

1. Introduction
2. Terminology
3. Problem Statement
4. Checkpoint Mechanism
5. Error Signaling
6. Circuit Breaker Pattern
7. Rollback Protocol
8. Security Considerations
9. IANA Considerations

1. Introduction

The IETF AI/agent landscape includes 60 drafts on autonomous
network operations but none that standardize error recovery.
When an autonomous agent misconfigures a router, allocates
resources incorrectly, or triggers an unintended cascade of
actions across a multi-agent system, there is currently no
standard mechanism for detecting the failure, containing its
blast radius, or reverting to a safe state.

AERR borrows proven patterns from distributed systems:
checkpoints from database transactions, circuit breakers from
microservice architectures, and rollback from version control.
It adapts these patterns to the specific needs of AI agents,
where actions may be partially reversible and where the agent
that caused the error may not be the best one to fix it.

Design principles:
1. Agents that take consequential actions MUST be able to undo
them, or MUST declare them irreversible upfront.
2. Failure containment takes priority over failure diagnosis.
3. The protocol adds minimal overhead to the happy path.

2. Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED",
"MAY", and "OPTIONAL" in this document are to be interpreted as
described in BCP 14 [RFC2119] [RFC8174] when, and only when, they
appear in all capitals, as shown here.

Checkpoint: A snapshot of an agent's state and the external
effects of its actions at a point in time, sufficient to
restore the system to that state.

Circuit Breaker: A mechanism that stops an agent from
propagating requests to a failing downstream agent, preventing
cascading failures.

Rollback: The process of reverting an agent's actions and state
to a previously recorded checkpoint.

Blast Radius: The set of agents and systems affected by a
single agent's failure.

3. Problem Statement

Consider a network operations scenario: Agent A instructs
Agent B to update firewall rules, which causes Agent C's
traffic monitoring to fail, which causes Agent D to
misclassify traffic patterns. Today each agent handles errors
independently with no coordination. There is no standard way
for Agent D to signal that the root cause is upstream, for the
cascade to be halted, or for the chain of actions to be rolled
back.

The only existing draft that partially addresses this space
(draft-yue-anima-agent-recovery-networks) focuses on mobile
network fault recovery and does not provide general-purpose
error recovery primitives usable across agent types.

4. Checkpoint Mechanism

An AERR-compliant agent MUST create a checkpoint before any
action it classifies as "consequential." An action is
consequential if it modifies external state (e.g., network
config, database records, API calls with side effects).

A checkpoint is a JSON object:

   {
     "checkpoint_id": "urn:uuid:...",
     "agent_id": "urn:uuid:...",
     "timestamp": "2026-03-01T12:00:00Z",
     "action": {
       "type": "config_update",
       "target": "router-07.example.com",
       "description": "Update BGP peer config"
     },
     "reversible": true,
     "rollback_procedure": {
       "method": "POST",
       "uri": "https://agent-b.example.com/aerr/rollback",
       "payload_ref": "urn:uuid:...prior-config-snapshot"
     },
     "state_hash": "sha256:abcdef...",
     "ttl": 86400
   }

The "reversible" field MUST be present. If false, the agent
declares that this action cannot be automatically undone and
rollback requests for this checkpoint MUST be escalated to a
human operator.

The "state_hash" provides integrity verification: the agent
hashes its relevant state at checkpoint time so that rollback
can verify it is restoring to an authentic prior state.

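As a non-normative illustration of the checkpoint object and the
"state_hash" check described above, the following Python sketch
builds a checkpoint and compares a restored state against it. The
helper names (state_hash, make_checkpoint, verify_state) are
hypothetical. Two points are assumptions that this document does
not specify: the hash input is taken to be a canonical JSON
encoding with sorted keys, and "reversible" is derived from
whether a rollback procedure was supplied.

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone


def state_hash(state):
    """Hash a canonical JSON encoding of the state (assumed hash input)."""
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_checkpoint(agent_id, action, state, rollback_procedure=None, ttl=86400):
    """Build a checkpoint object with the fields shown in Section 4."""
    checkpoint = {
        "checkpoint_id": uuid.uuid4().urn,  # e.g. "urn:uuid:..."
        "agent_id": agent_id,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "action": action,
        # Assumption: no rollback procedure means the action is irreversible.
        "reversible": rollback_procedure is not None,
        "state_hash": state_hash(state),
        "ttl": ttl,
    }
    if rollback_procedure is not None:
        checkpoint["rollback_procedure"] = rollback_procedure
    return checkpoint


def verify_state(checkpoint, restored_state):
    """Return True if the restored state hashes to the recorded state_hash."""
    return checkpoint["state_hash"] == state_hash(restored_state)
```

Sorting keys makes the hash independent of dictionary ordering, so
two agents serializing the same state in a different key order
would still agree on the hash.
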
Checkpoints MUST be stored for at least the duration specified
by "ttl" (seconds). Agents SHOULD store checkpoints in durable
storage that survives agent restarts.

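The retention rule above can be sketched as follows. This is a
non-normative example; it assumes that "ttl" is counted from the
checkpoint's "timestamp" field, which this document does not
state explicitly, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def checkpoint_expiry(checkpoint):
    """Earliest time at which the checkpoint may be discarded (assumed: timestamp + ttl)."""
    created = datetime.strptime(
        checkpoint["timestamp"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return created + timedelta(seconds=checkpoint["ttl"])


def may_discard(checkpoint, now=None):
    """Return True once the minimum retention period has elapsed."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now >= checkpoint_expiry(checkpoint)
```
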
Agents MAY create hierarchical checkpoints where a parent
checkpoint groups multiple child checkpoints from a multi-step
operation. Rolling back the parent rolls back all children.

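One possible way to order a hierarchical rollback is sketched
below (non-normative). It assumes that children are undone
newest-first, before their parent; this document requires only
that all children are rolled back and does not mandate an order.
The "children" mapping from a parent checkpoint ID to its child
IDs in creation order is a hypothetical data structure.

```python
def rollback_order(checkpoint_id, children):
    """List checkpoint IDs in rollback order: children newest-first, then the parent."""
    order = []
    for child in reversed(children.get(checkpoint_id, [])):
        order.extend(rollback_order(child, children))
    order.append(checkpoint_id)
    return order


# A three-step operation whose second step has a nested sub-step.
children = {"op": ["step1", "step2"], "step2": ["step2a"]}
plan = rollback_order("op", children)
```

Here "plan" lists the nested sub-step first and the parent
operation last, mirroring how nested transactions are unwound.
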
5. Error Signaling

When an agent detects an error, it MUST emit an AERR error
signal to all agents in the current action chain. The error
signal is an HTTP POST to each peer's AERR endpoint:

   POST /aerr/error HTTP/1.1
   Content-Type: application/json

   {
     "error_id": "urn:uuid:...",
     "source_agent": "urn:uuid:...",
     "severity": "critical",
     "checkpoint_id": "urn:uuid:...",
     "error_type": "action_failed",
     "description": "BGP session did not establish after config update",
     "timestamp": "2026-03-01T12:05:00Z",
     "upstream_errors": []
   }

Severity levels: "info", "warning", "error", "critical".

Error types: "action_failed", "timeout", "constraint_violation",
"resource_exhausted", "upstream_cascade", "circuit_open",
"unknown".

When an agent receives an error signal caused by an action it
initiated, it MUST either:
(a) Attempt automatic rollback of its checkpoint, or
(b) Escalate to its operator if the action was irreversible.

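The decision between (a) and (b) can be sketched as a small
dispatch function (non-normative). The function name, the
"own_checkpoints" mapping from checkpoint ID to checkpoint
object, and the "ignore" outcome for signals that do not
reference one of the agent's own checkpoints are assumptions
made for this example.

```python
def respond_to_error(signal, own_checkpoints):
    """Choose a response to an error signal based on the referenced checkpoint."""
    checkpoint = own_checkpoints.get(signal.get("checkpoint_id"))
    if checkpoint is None:
        # Assumption: the signal does not concern an action this agent initiated.
        return "ignore"
    if checkpoint["reversible"]:
        return "rollback"  # option (a): attempt automatic rollback
    return "escalate"      # option (b): hand over to the operator
```
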
The "upstream_errors" array allows agents to chain error
context, building a causal trace from the symptom back to the
root cause.

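The following non-normative sketch shows one way to build and
walk such a causal trace. It assumes that "upstream_errors"
carries complete error signal objects; this document does not say
whether the array holds full objects or only error IDs. The
function names are hypothetical.

```python
def wrap_error(local_error, upstream_signal):
    """Return a copy of local_error that records the received signal as its cause."""
    wrapped = dict(local_error)
    wrapped["upstream_errors"] = [upstream_signal]
    return wrapped


def root_causes(signal):
    """Follow upstream_errors until reaching signals that have no upstream cause."""
    upstream = signal.get("upstream_errors", [])
    if not upstream:
        return [signal]
    causes = []
    for parent in upstream:
        causes.extend(root_causes(parent))
    return causes


# Agent B's failure propagates to C and then to D.
err_b = {"error_id": "urn:uuid:b", "error_type": "action_failed", "upstream_errors": []}
err_c = wrap_error({"error_id": "urn:uuid:c", "error_type": "upstream_cascade"}, err_b)
err_d = wrap_error({"error_id": "urn:uuid:d", "error_type": "upstream_cascade"}, err_c)
```

Calling root_causes(err_d) walks from the symptom at Agent D back
to the original failure reported by Agent B.
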
6. Circuit Breaker Pattern

Each agent MUST implement a circuit breaker for every downstream
agent it communicates with. The circuit breaker has three
states:

CLOSED (normal operation): Requests flow through. The agent
tracks the error rate over a sliding window (default: 60s).

OPEN (failure detected): When the error rate exceeds a
threshold (default: 50% over the window), the circuit breaker
opens. All requests to the downstream agent are immediately
rejected with error_type "circuit_open". The agent MUST emit
an error signal to upstream peers.

HALF-OPEN (recovery probe): After a cooldown period (default:
30s), the circuit breaker allows a single probe request. If it
succeeds, the breaker returns to CLOSED. If it fails, it
returns to OPEN with a doubled cooldown (exponential backoff,
max 300s).

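The three states above can be sketched as a small state machine
(non-normative). The class and method names are hypothetical.
Several details are assumptions not fixed by this document: state
names are represented as lowercase strings, the cooldown resets
to its default after a successful probe, the sliding window is
cleared when the breaker closes, and no minimum sample count is
required before the error rate is evaluated. The clock is
injectable so that the behavior can be exercised without waiting.

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, window_s=60.0, threshold=0.5, cooldown_s=30.0,
                 max_cooldown_s=300.0, clock=time.monotonic):
        self.window_s = window_s
        self.threshold = threshold
        self.base_cooldown_s = cooldown_s
        self.cooldown_s = cooldown_s
        self.max_cooldown_s = max_cooldown_s
        self.clock = clock
        self.state = "closed"
        self.opened_at = None
        self.probe_in_flight = False
        self.results = deque()  # (timestamp, succeeded) pairs within the window

    def error_rate(self):
        """Fraction of failed requests within the sliding window."""
        cutoff = self.clock() - self.window_s
        while self.results and self.results[0][0] < cutoff:
            self.results.popleft()
        if not self.results:
            return 0.0
        failures = sum(1 for _, ok in self.results if not ok)
        return failures / len(self.results)

    def allow_request(self):
        """Return True if a request may be sent to the downstream agent."""
        if self.state == "open" and self.clock() - self.opened_at >= self.cooldown_s:
            self.state = "half-open"
            self.probe_in_flight = False
        if self.state == "half-open":
            if self.probe_in_flight:
                return False  # only a single probe is allowed
            self.probe_in_flight = True
            return True
        return self.state == "closed"

    def record(self, succeeded):
        """Record the outcome of a request that was allowed through."""
        if self.state == "half-open":
            if succeeded:
                self.state = "closed"
                self.cooldown_s = self.base_cooldown_s  # assumption: reset backoff
                self.results.clear()
            else:
                self.cooldown_s = min(self.cooldown_s * 2, self.max_cooldown_s)
                self._open()
            return
        self.results.append((self.clock(), succeeded))
        if self.state == "closed" and self.error_rate() > self.threshold:
            self._open()

    def _open(self):
        self.state = "open"
        self.opened_at = self.clock()
```

A caller would check allow_request() before contacting the
downstream agent, answer with error_type "circuit_open" when it
returns False, and pass each outcome to record(). Emitting the
error signal to upstream peers when the breaker opens is left out
of this sketch.
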
Agents MUST expose circuit breaker state at:

   GET /aerr/circuits

Response:
   {
     "circuits": [
       {
         "downstream_agent": "urn:uuid:...",
         "state": "open",
         "error_rate": 0.75,
         "last_failure": "2026-03-01T12:05:00Z",
         "cooldown_remaining_s": 22
       }
     ]
   }

This enables monitoring systems and upstream agents to
understand the health topology of the agent network.

7. Rollback Protocol

A rollback is initiated by sending an HTTP POST to the target
agent's rollback endpoint:

   POST /aerr/rollback HTTP/1.1
   Content-Type: application/json

   {
     "rollback_id": "urn:uuid:...",
     "checkpoint_id": "urn:uuid:...",
     "reason": "Upstream action caused cascading failure",
     "initiator": "urn:uuid:...",
     "cascade": true
   }

When "cascade" is true, the receiving agent MUST also initiate
rollback of any downstream checkpoints that were created as a
consequence of the checkpointed action. This enables a single
rollback request to unwind an entire chain of agent actions.

The agent MUST respond with a rollback result:

   {
     "rollback_id": "urn:uuid:...",
     "status": "partial",
     "checkpoint_id": "urn:uuid:...",
     "state_hash_before": "sha256:...",
     "state_hash_after": "sha256:...",
     "cascaded_rollbacks": [
       {"agent_id": "urn:uuid:...", "status": "completed"},
       {"agent_id": "urn:uuid:...", "status": "escalated"}
     ]
   }

Rollback status values: "completed", "partial", "escalated",
"failed".

"escalated" means the action was irreversible and a human
operator has been notified. "partial" means some but not all
downstream rollbacks succeeded.

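One way to derive the top-level "status" from the local outcome
and the "cascaded_rollbacks" entries is sketched below
(non-normative). The function name is hypothetical, and two
choices are assumptions of this example: a local outcome other
than "completed" is reported unchanged, and a successful local
rollback with any unsuccessful downstream rollback is reported as
"partial".

```python
def overall_status(own_status, cascaded):
    """Combine the local rollback outcome with the downstream results."""
    if own_status != "completed":
        return own_status  # assumption: local failure or escalation dominates
    if all(entry["status"] == "completed" for entry in cascaded):
        return "completed"
    return "partial"
```
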
Agents MUST implement idempotent rollback: receiving the same
rollback_id twice MUST return the same result without
re-executing the rollback.

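A minimal, non-normative sketch of idempotent handling keeps the
result of each rollback keyed by its rollback_id. The class name
and the "execute" callable that performs the actual rollback are
hypothetical. The in-memory dictionary is an assumption made for
brevity; an agent that must survive restarts would need to keep
these results in durable storage.

```python
class RollbackHandler:
    def __init__(self, execute):
        self._execute = execute  # hypothetical callable: request -> result dict
        self._results = {}       # rollback_id -> stored result

    def handle(self, request):
        """Execute a rollback once per rollback_id and replay the stored result."""
        rollback_id = request["rollback_id"]
        if rollback_id in self._results:
            return self._results[rollback_id]
        result = self._execute(request)
        self._results[rollback_id] = result
        return result
```
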
8. Security Considerations

Rollback requests are sensitive operations. Agents MUST
authenticate rollback requests using mutual TLS or signed JWTs.
Only agents in the same action chain (identified by checkpoint
lineage) SHOULD be authorized to request rollback.

Checkpoint data may contain sensitive system state. Agents
MUST encrypt stored checkpoints at rest and MUST NOT include
checkpoint contents in error signals.

Circuit breaker state is observable information about system
health. The /aerr/circuits endpoint SHOULD be access-
controlled to prevent adversaries from mapping system topology.

Malicious agents could send false error signals to trigger
unnecessary rollbacks. Agents SHOULD verify that error signals
reference valid checkpoint IDs from their own action chains
before initiating rollback.

9. IANA Considerations

This document requests that IANA establish the following:

1. An "AERR Error Type" registry under Specification Required
policy. Initial entries: "action_failed", "timeout",
"constraint_violation", "resource_exhausted",
"upstream_cascade", "circuit_open", "unknown".

2. An "AERR Severity Level" registry under Specification
Required policy. Initial entries: "info", "warning",
"error", "critical".

3. Well-known URI registrations for "aerr/error",
"aerr/rollback", and "aerr/circuits" per RFC 8615.

Author's Address

Generated by IETF Draft Analyzer
2026-03-01