feat: add draft data, gap analysis report, and workspace config
Some checks failed
CI / test (3.11) (push) Failing after 1m37s
CI / test (3.12) (push) Failing after 57s

2026-04-06 18:47:15 +02:00
parent 4f310407b0
commit 2506b6325a
189 changed files with 62649 additions and 0 deletions


@@ -0,0 +1,309 @@
Internet-Draft AI/Agent WG
Intended status: Standards Track March 2026
Expires: September 15, 2026
Agent Error Recovery and Rollback (AERR)
draft-aerr-agent-error-recovery-rollback-00
Abstract
This document defines the Agent Error Recovery and Rollback
(AERR) protocol, a lightweight standard for handling errors,
cascading failures, and rollback in multi-agent systems.
Autonomous AI agents increasingly make unsupervised decisions,
yet no standard exists for how agents checkpoint state, signal
errors to peers, contain cascading failures, or roll back
autonomous decisions gone wrong. AERR defines three mechanisms:
state checkpoints that agents create before consequential
actions, a circuit breaker pattern to contain cascading failures
across agent networks, and a rollback protocol for reverting
agent actions to a known-good state. The protocol is transport-
agnostic and builds on JSON and standard HTTP semantics.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
This document is intended to have Standards Track status.
Distribution of this memo is unlimited.
Table of Contents
1. Introduction
2. Terminology
3. Problem Statement
4. Checkpoint Mechanism
5. Error Signaling
6. Circuit Breaker Pattern
7. Rollback Protocol
8. Security Considerations
9. IANA Considerations
1. Introduction
The IETF AI/agent landscape includes 60 drafts on autonomous
network operations but none that standardize error recovery.
When an autonomous agent misconfigures a router, allocates
resources incorrectly, or triggers an unintended cascade of
actions across a multi-agent system, there is currently no
standard mechanism for detecting the failure, containing its
blast radius, or reverting to a safe state.
AERR borrows proven patterns from distributed systems:
checkpoints from database transactions, circuit breakers from
microservice architectures, and rollback from version control.
It adapts these patterns to the specific needs of AI agents,
where actions may be partially reversible and where the agent
that caused the error may not be the best one to fix it.
Design principles:
1. Agents that take consequential actions MUST be able to undo
them, or MUST declare them irreversible upfront.
2. Failure containment takes priority over failure diagnosis.
3. The protocol adds minimal overhead to the happy path.
2. Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described
in RFC 2119 [RFC2119].
Checkpoint: A snapshot of an agent's state and the external
effects of its actions at a point in time, sufficient to
restore the system to that state.
Circuit Breaker: A mechanism that stops an agent from
propagating requests to a failing downstream agent, preventing
cascading failures.
Rollback: The process of reverting an agent's actions and state
to a previously recorded checkpoint.
Blast Radius: The set of agents and systems affected by a
single agent's failure.
3. Problem Statement
Consider a network operations scenario: Agent A instructs
Agent B to update firewall rules, which causes Agent C's
traffic monitoring to fail, which causes Agent D to
misclassify traffic patterns. Today each agent handles errors
independently with no coordination. There is no standard way
for Agent D to signal that the root cause is upstream, for the
cascade to be halted, or for the chain of actions to be rolled
back.
The only existing draft that partially addresses this space
(draft-yue-anima-agent-recovery-networks) focuses on mobile
network fault recovery and does not provide general-purpose
error recovery primitives usable across agent types.
4. Checkpoint Mechanism
An AERR-compliant agent MUST create a checkpoint before any
action it classifies as "consequential." An action is
consequential if it modifies external state (e.g., network
config, database records, API calls with side effects).
A checkpoint is a JSON object:
{
  "checkpoint_id": "urn:uuid:...",
  "agent_id": "urn:uuid:...",
  "timestamp": "2026-03-01T12:00:00Z",
  "action": {
    "type": "config_update",
    "target": "router-07.example.com",
    "description": "Update BGP peer config"
  },
  "reversible": true,
  "rollback_procedure": {
    "method": "POST",
    "uri": "https://agent-b.example.com/aerr/rollback",
    "payload_ref": "urn:uuid:...prior-config-snapshot"
  },
  "state_hash": "sha256:abcdef...",
  "ttl": 86400
}
The "reversible" field MUST be present. If false, the agent
declares that this action cannot be automatically undone and
rollback requests for this checkpoint MUST be escalated to a
human operator.
The "state_hash" provides integrity verification: the agent
hashes its relevant state at checkpoint time so that rollback
can verify it is restoring to an authentic prior state.
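A minimal sketch of how an agent might build such a checkpoint
follows. It assumes the relevant state is JSON-serializable and that
the hash is taken over a sorted-key, compact JSON encoding. This
document does not specify a canonicalization, so that choice and the
helper names (state_hash, make_checkpoint) are illustrative only.

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone


def state_hash(state: dict) -> str:
    """Hash a JSON-serializable state (canonical encoding is an assumption)."""
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_checkpoint(agent_id, action, state, reversible,
                    rollback_procedure=None, ttl=86400):
    """Build a checkpoint object with the fields shown in this section."""
    checkpoint = {
        "checkpoint_id": uuid.uuid4().urn,  # yields "urn:uuid:..."
        "agent_id": agent_id,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "action": action,
        "reversible": reversible,  # always present, even when False
        "state_hash": state_hash(state),
        "ttl": ttl,
    }
    if rollback_procedure is not None:
        checkpoint["rollback_procedure"] = rollback_procedure
    return checkpoint
```

Sorting keys makes the hash independent of dictionary ordering, so
two agents serializing the same state would arrive at the same value.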
Checkpoints MUST be stored for at least the duration specified
by "ttl" (seconds). Agents SHOULD store checkpoints in durable
storage that survives agent restarts.
Agents MAY create hierarchical checkpoints where a parent
checkpoint groups multiple child checkpoints from a multi-step
operation. Rolling back the parent rolls back all children.
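The parent/child relationship can be sketched as a small tree walk.
This document only states that rolling back the parent rolls back
all children; undoing children in reverse creation order (most
recent first) is an assumption made here, as are the Checkpoint
class and the rollback_order helper.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    checkpoint_id: str
    # Children in creation order (hypothetical in-memory representation).
    children: list["Checkpoint"] = field(default_factory=list)


def rollback_order(parent: Checkpoint) -> list[str]:
    """Return checkpoint IDs in the order they would be rolled back."""
    order = []
    # Assumption: undo the most recently created child first.
    for child in reversed(parent.children):
        order.extend(rollback_order(child))
    # The parent is finished only after all of its children are undone.
    order.append(parent.checkpoint_id)
    return order
```

For a parent "p" with children "c1" and "c2", where "c2" has a child
"g", this yields g, c2, c1, p.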
5. Error Signaling
When an agent detects an error, it MUST emit an AERR error
signal to all agents in the current action chain. The error
signal is an HTTP POST to each peer's AERR endpoint:
POST /aerr/error HTTP/1.1
Content-Type: application/json

{
  "error_id": "urn:uuid:...",
  "source_agent": "urn:uuid:...",
  "severity": "critical",
  "checkpoint_id": "urn:uuid:...",
  "error_type": "action_failed",
  "description": "BGP session did not establish after config update",
  "timestamp": "2026-03-01T12:05:00Z",
  "upstream_errors": []
}
Severity levels: "info", "warning", "error", "critical".
Error types: "action_failed", "timeout", "constraint_violation",
"resource_exhausted", "upstream_cascade", "circuit_open" (see
Section 6), "unknown".
When an agent receives an error signal for an error caused by an
action it initiated, it MUST either:
(a) Attempt automatic rollback to the corresponding checkpoint, or
(b) Escalate to its operator if the action was irreversible.
The "upstream_errors" array allows agents to chain error
context, building a causal trace from the symptom back to the
root cause.
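One way to chain error context is sketched below. This document does
not define the element format of "upstream_errors"; the sketch
assumes each element is a prior error signal with its own
"upstream_errors" array removed, ordered from root cause to most
recent. The helper names are hypothetical.

```python
import uuid
from datetime import datetime, timezone


def make_error_signal(source_agent, checkpoint_id, error_type, description,
                      severity="error", upstream=None):
    """Build an error signal, extending the causal trace of `upstream`."""
    chain = []
    if upstream is not None:
        # Flatten: keep the upstream signal's trace, then append the
        # upstream signal itself without its nested array.
        flat = {k: v for k, v in upstream.items() if k != "upstream_errors"}
        chain = list(upstream["upstream_errors"]) + [flat]
    return {
        "error_id": uuid.uuid4().urn,
        "source_agent": source_agent,
        "severity": severity,
        "checkpoint_id": checkpoint_id,
        "error_type": error_type,
        "description": description,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "upstream_errors": chain,
    }


def root_cause(signal):
    """Return the earliest error in the trace, or the signal itself."""
    trace = signal["upstream_errors"]
    return trace[0] if trace else signal
```

In the scenario of Section 3, Agent D would pass the signal it
received from Agent C as `upstream`, so that any recipient can walk
the trace back to the firewall change.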
6. Circuit Breaker Pattern
Each agent MUST implement a circuit breaker for every downstream
agent it communicates with. The circuit breaker has three
states:
CLOSED (normal operation): Requests flow through. The agent
tracks the error rate over a sliding window (default: 60s).
OPEN (failure detected): When the error rate exceeds a
threshold (default: 50% over the window), the circuit breaker
opens. All requests to the downstream agent are immediately
rejected with error_type "circuit_open". The agent MUST emit
an error signal to upstream peers.
HALF-OPEN (recovery probe): After a cooldown period (default:
30s), the circuit breaker allows a single probe request. If it
succeeds, the breaker returns to CLOSED. If it fails, it
returns to OPEN with a doubled cooldown (exponential backoff,
max 300s).
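The three states can be sketched as a small state machine. The
defaults below come from the text above. The class and method names,
the injectable clock, and the decision to reset the cooldown and
clear the window when the breaker closes are assumptions of this
sketch, not requirements of this document.

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, window_s=60.0, threshold=0.5, cooldown_s=30.0,
                 max_cooldown_s=300.0, clock=time.monotonic):
        self.window_s = window_s
        self.threshold = threshold
        self.base_cooldown_s = cooldown_s
        self.cooldown_s = cooldown_s
        self.max_cooldown_s = max_cooldown_s
        self.clock = clock
        self.state = "closed"
        self.opened_at = None
        self.samples = deque()  # (timestamp, ok) pairs in the window

    def _error_rate(self):
        now = self.clock()
        # Drop samples that have slid out of the window.
        while self.samples and now - self.samples[0][0] > self.window_s:
            self.samples.popleft()
        if not self.samples:
            return 0.0
        failures = sum(1 for _, ok in self.samples if not ok)
        return failures / len(self.samples)

    def _open(self, now):
        self.state = "open"
        self.opened_at = now

    def allow_request(self):
        """Return True if a request may be sent downstream."""
        if self.state == "open" and self.clock() - self.opened_at >= self.cooldown_s:
            self.state = "half_open"  # exactly one probe is let through
            return True
        return self.state == "closed"

    def record(self, ok):
        """Record the outcome of a request that was sent downstream."""
        now = self.clock()
        if self.state == "half_open":
            if ok:
                self.state = "closed"
                self.cooldown_s = self.base_cooldown_s  # assumption: reset
                self.samples.clear()
            else:
                # Failed probe: back to OPEN with a doubled, capped cooldown.
                self.cooldown_s = min(self.cooldown_s * 2, self.max_cooldown_s)
                self._open(now)
            return
        self.samples.append((now, ok))
        if self.state == "closed" and self._error_rate() > self.threshold:
            self._open(now)
```

A caller would check allow_request() before each downstream request
and, when it returns False, reject locally with error_type
"circuit_open". Note that with no minimum sample count a single
failure in an otherwise empty window opens the breaker; a deployment
may want an additional guard, which this document does not define.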
Agents MUST expose circuit breaker state at:
GET /aerr/circuits
Response:
{
  "circuits": [
    {
      "downstream_agent": "urn:uuid:...",
      "state": "open",
      "error_rate": 0.75,
      "last_failure": "2026-03-01T12:05:00Z",
      "cooldown_remaining_s": 22
    }
  ]
}
This enables monitoring systems and upstream agents to
understand the health topology of the agent network.
7. Rollback Protocol
A rollback is initiated by sending an HTTP POST to the target
agent's rollback endpoint:
POST /aerr/rollback HTTP/1.1
Content-Type: application/json

{
  "rollback_id": "urn:uuid:...",
  "checkpoint_id": "urn:uuid:...",
  "reason": "Upstream action caused cascading failure",
  "initiator": "urn:uuid:...",
  "cascade": true
}
When "cascade" is true, the receiving agent MUST also initiate
rollback of any downstream checkpoints that were created as a
consequence of the checkpointed action. This enables a single
rollback request to unwind an entire chain of agent actions.
The agent MUST respond with a rollback result:
{
  "rollback_id": "urn:uuid:...",
  "status": "partial",
  "checkpoint_id": "urn:uuid:...",
  "state_hash_before": "sha256:...",
  "state_hash_after": "sha256:...",
  "cascaded_rollbacks": [
    {"agent_id": "urn:uuid:...", "status": "completed"},
    {"agent_id": "urn:uuid:...", "status": "escalated"}
  ]
}
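Rollback status values: "completed", "partial", "escalated",
"failed".
"completed" means the checkpointed action and any requested
downstream rollbacks were reverted. "partial" means some but not
all downstream rollbacks succeeded. "escalated" means the action
was irreversible and a human operator has been notified. "failed"
means the rollback was attempted but could not be carried out.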
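The overall status of a cascading rollback could be derived from the
local result and the entries in "cascaded_rollbacks" roughly as
follows. Only the "partial" rule is taken from this document; the
precedence given to a non-completed local result and the handling of
the case where no downstream rollback succeeded are assumptions of
this sketch, as is the function name.

```python
def overall_status(local_status, cascaded):
    """Combine a local rollback status with downstream results."""
    # Assumption: if the local rollback did not complete, report that
    # status directly regardless of downstream outcomes.
    if local_status != "completed":
        return local_status
    statuses = [entry["status"] for entry in cascaded]
    succeeded = sum(1 for s in statuses if s == "completed")
    if succeeded == len(statuses):
        return "completed"  # also covers the case of no downstream agents
    if succeeded > 0:
        return "partial"  # some but not all downstream rollbacks succeeded
    # Assumption: no downstream rollback succeeded.
    if all(s == "escalated" for s in statuses):
        return "escalated"
    return "failed"
```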
Agents MUST implement idempotent rollback: receiving the same
rollback_id twice MUST return the same result without re-
executing the rollback.
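Idempotency can be achieved by caching results keyed on rollback_id,
as in the sketch below. The RollbackHandler class, the `revert`
callback, and the in-memory result cache are hypothetical; a real
agent would likely keep results in the same durable storage as its
checkpoints. Cascading is omitted for brevity.

```python
class RollbackHandler:
    def __init__(self, checkpoints, revert):
        self.checkpoints = checkpoints  # checkpoint_id -> checkpoint object
        self.revert = revert            # callable(checkpoint), hypothetical
        self.results = {}               # rollback_id -> result already sent

    def handle(self, request):
        rollback_id = request["rollback_id"]
        # Replayed request: return the stored result, do not re-execute.
        if rollback_id in self.results:
            return self.results[rollback_id]

        checkpoint = self.checkpoints.get(request["checkpoint_id"])
        if checkpoint is None:
            status = "failed"
        elif not checkpoint["reversible"]:
            status = "escalated"  # operator notification is out of scope here
        else:
            try:
                self.revert(checkpoint)
                status = "completed"
            except Exception:
                status = "failed"

        result = {
            "rollback_id": rollback_id,
            "status": status,
            "checkpoint_id": request["checkpoint_id"],
        }
        self.results[rollback_id] = result
        return result
```

Storing the result before responding means that a retry caused by a
lost response sees the original outcome rather than triggering a
second revert.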
8. Security Considerations
Rollback requests are sensitive operations. Agents MUST
authenticate rollback requests using mutual TLS or signed JWTs.
Agents SHOULD authorize rollback requests only from agents in the
same action chain (identified by checkpoint lineage).
Checkpoint data may contain sensitive system state. Agents
MUST encrypt stored checkpoints at rest and MUST NOT include
checkpoint contents in error signals.
Circuit breaker state is observable information about system
health. The /aerr/circuits endpoint SHOULD be access-
controlled to prevent adversaries from mapping system topology.
Malicious agents could send false error signals to trigger
unnecessary rollbacks. Agents SHOULD verify that error signals
reference valid checkpoint IDs from their own action chains
before initiating rollback.
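The check described above might look like the following sketch. It
assumes the agent keeps a set of its own checkpoint IDs; the function
name is hypothetical. The check is a plausibility filter only and
does not replace authenticating the sender.

```python
SEVERITIES = {"info", "warning", "error", "critical"}


def is_plausible_signal(signal, own_checkpoint_ids):
    """Return True if an error signal references one of our checkpoints."""
    checkpoint_id = signal.get("checkpoint_id")
    if not isinstance(checkpoint_id, str):
        return False
    if checkpoint_id not in own_checkpoint_ids:
        return False  # unknown checkpoint: do not initiate rollback
    return signal.get("severity") in SEVERITIES
```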
9. IANA Considerations
This document requests IANA establish the following:
1. An "AERR Error Type" registry under Specification Required
policy. Initial entries: "action_failed", "timeout",
"constraint_violation", "resource_exhausted",
"upstream_cascade", "circuit_open", "unknown".
2. An "AERR Severity Level" registry under Specification
Required policy. Initial entries: "info", "warning",
"error", "critical".
3. Well-known URI registrations for "aerr/error",
"aerr/rollback", and "aerr/circuits" per RFC 8615.
Author's Address
Generated by IETF Draft Analyzer
2026-03-01