feat: add draft data, gap analysis report, and workspace config
@@ -0,0 +1,309 @@
Internet-Draft                                               AI/Agent WG
Intended status: Standards Track                              March 2026
Expires: September 15, 2026


Agent Error Recovery and Rollback (AERR)
draft-aerr-agent-error-recovery-rollback-00

Abstract

This document defines the Agent Error Recovery and Rollback
(AERR) protocol, a lightweight standard for handling errors,
cascading failures, and rollback in multi-agent systems.
Autonomous AI agents increasingly make unsupervised decisions,
yet no standard exists for how agents checkpoint state, signal
errors to peers, contain cascading failures, or roll back
autonomous decisions gone wrong. AERR defines three mechanisms:
state checkpoints that agents create before consequential
actions, a circuit breaker pattern to contain cascading failures
across agent networks, and a rollback protocol for reverting
agent actions to a known-good state. The protocol is
transport-agnostic and builds on JSON and standard HTTP semantics.

Status of This Memo

This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.

This document is intended to have Standards Track status.
Distribution of this memo is unlimited.

Table of Contents

1. Introduction
2. Terminology
3. Problem Statement
4. Checkpoint Mechanism
5. Error Signaling
6. Circuit Breaker Pattern
7. Rollback Protocol
8. Security Considerations
9. IANA Considerations

1. Introduction

The IETF AI/agent landscape includes 60 drafts on autonomous
network operations but none that standardize error recovery.
When an autonomous agent misconfigures a router, allocates
resources incorrectly, or triggers an unintended cascade of
actions across a multi-agent system, there is currently no
standard mechanism for detecting the failure, containing its
blast radius, or reverting to a safe state.

AERR borrows proven patterns from distributed systems:
checkpoints from database transactions, circuit breakers from
microservice architectures, and rollback from version control.
It adapts these patterns to the specific needs of AI agents,
where actions may be partially reversible and where the agent
that caused the error may not be the best one to fix it.

Design principles:
1. Agents that take consequential actions MUST be able to undo
   them, or MUST declare them irreversible upfront.
2. Failure containment takes priority over failure diagnosis.
3. The protocol adds minimal overhead to the happy path.

2. Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described
in RFC 2119 [RFC2119].

Checkpoint: A snapshot of an agent's state and the external
effects of its actions at a point in time, sufficient to
restore the system to that state.

Circuit Breaker: A mechanism that stops an agent from
propagating requests to a failing downstream agent, preventing
cascading failures.

Rollback: The process of reverting an agent's actions and state
to a previously recorded checkpoint.

Blast Radius: The set of agents and systems affected by a
single agent's failure.

3. Problem Statement

Consider a network operations scenario: Agent A instructs
Agent B to update firewall rules, which causes Agent C's
traffic monitoring to fail, which causes Agent D to
misclassify traffic patterns. Today each agent handles errors
independently with no coordination. There is no standard way
for Agent D to signal that the root cause is upstream, for the
cascade to be halted, or for the chain of actions to be rolled
back.

The only existing draft that partially addresses this space
(draft-yue-anima-agent-recovery-networks) focuses on mobile
network fault recovery and does not provide general-purpose
error recovery primitives usable across agent types.

4. Checkpoint Mechanism

An AERR-compliant agent MUST create a checkpoint before any
action it classifies as "consequential." An action is
consequential if it modifies external state, for example by
changing network configuration, writing database records, or
making API calls with side effects.

A checkpoint is a JSON object:

{
  "checkpoint_id": "urn:uuid:...",
  "agent_id": "urn:uuid:...",
  "timestamp": "2026-03-01T12:00:00Z",
  "action": {
    "type": "config_update",
    "target": "router-07.example.com",
    "description": "Update BGP peer config"
  },
  "reversible": true,
  "rollback_procedure": {
    "method": "POST",
    "uri": "https://agent-b.example.com/aerr/rollback",
    "payload_ref": "urn:uuid:...prior-config-snapshot"
  },
  "state_hash": "sha256:abcdef...",
  "ttl": 86400
}

The "reversible" field MUST be present. If it is false, the agent
declares that this action cannot be automatically undone, and
rollback requests for this checkpoint MUST be escalated to a
human operator.

The "state_hash" provides integrity verification: the agent
hashes its relevant state at checkpoint time so that rollback
can verify it is restoring to an authentic prior state.

Checkpoints MUST be stored for at least the number of seconds
specified by "ttl". Agents SHOULD store checkpoints in durable
storage that survives agent restarts.

Agents MAY create hierarchical checkpoints where a parent
checkpoint groups multiple child checkpoints from a multi-step
operation. Rolling back the parent rolls back all children.

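The checkpoint fields described in Section 4 can be illustrated
with a minimal Python sketch. It is not part of the protocol. The
function names (state_hash, make_checkpoint, verify_restored_state,
is_expired) are hypothetical, and two choices are assumptions that
this document does not specify: the state is hashed over a canonical
JSON encoding, and "reversible" is inferred from whether a rollback
procedure was supplied.

```python
import hashlib
import json
import uuid
from datetime import datetime, timedelta, timezone

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def state_hash(state):
    """Hash a canonical JSON encoding of the state (assumed encoding)."""
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_checkpoint(agent_id, action, state, rollback_procedure=None,
                    ttl=86400, now=None):
    """Build a checkpoint object before a consequential action."""
    now = now or datetime.now(timezone.utc)
    checkpoint = {
        "checkpoint_id": f"urn:uuid:{uuid.uuid4()}",
        "agent_id": agent_id,
        "timestamp": now.strftime(TIME_FORMAT),
        "action": action,
        # "reversible" MUST always be present. Inferring it from the
        # presence of a rollback procedure is an assumption of this sketch.
        "reversible": rollback_procedure is not None,
        "state_hash": state_hash(state),
        "ttl": ttl,
    }
    if rollback_procedure is not None:
        checkpoint["rollback_procedure"] = rollback_procedure
    return checkpoint


def verify_restored_state(checkpoint, restored_state):
    """Check that a restored state matches the hash taken at checkpoint time."""
    return state_hash(restored_state) == checkpoint["state_hash"]


def is_expired(checkpoint, now=None):
    """True once more than "ttl" seconds have passed since the timestamp."""
    now = now or datetime.now(timezone.utc)
    created = datetime.strptime(checkpoint["timestamp"], TIME_FORMAT)
    created = created.replace(tzinfo=timezone.utc)
    return now > created + timedelta(seconds=checkpoint["ttl"])
```

An agent would call make_checkpoint before the action and keep the
result in durable storage until is_expired reports true. A
canonical encoding matters because two agents that serialize the
same state with different key order would otherwise compute
different hashes.
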
5. Error Signaling

When an agent detects an error, it MUST emit an AERR error
signal to all agents in the current action chain. The error
signal is an HTTP POST to each peer's AERR endpoint:

POST /aerr/error HTTP/1.1
Content-Type: application/json

{
  "error_id": "urn:uuid:...",
  "source_agent": "urn:uuid:...",
  "severity": "critical",
  "checkpoint_id": "urn:uuid:...",
  "error_type": "action_failed",
  "description": "BGP session did not establish after config update",
  "timestamp": "2026-03-01T12:05:00Z",
  "upstream_errors": []
}

Severity levels: "info", "warning", "error", "critical".

Error types: "action_failed", "timeout", "constraint_violation",
"resource_exhausted", "upstream_cascade", "circuit_open" (see
Section 6), "unknown".

When an agent receives an error signal caused by an action it
initiated, it MUST either:
(a) Attempt automatic rollback of its checkpoint, or
(b) Escalate to its operator if the action was irreversible.

The "upstream_errors" array allows agents to chain error
context, building a causal trace from the symptom back to the
root cause.

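The causal trace described in Section 5 can be sketched as follows.
This is an illustration under stated assumptions, not a normative
encoding: this document does not define what each entry of
"upstream_errors" contains, so the sketch assumes each entry is
itself a complete error signal. The helper names make_error_signal
and root_causes are hypothetical.

```python
import uuid
from datetime import datetime, timezone

SEVERITIES = ("info", "warning", "error", "critical")
# "circuit_open" is the error type used by the circuit breaker in Section 6.
ERROR_TYPES = ("action_failed", "timeout", "constraint_violation",
               "resource_exhausted", "upstream_cascade", "circuit_open",
               "unknown")


def make_error_signal(source_agent, checkpoint_id, error_type, description,
                      severity="error", upstream_errors=None):
    """Build an error signal; only the checkpoint ID is included,
    never the checkpoint contents."""
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity}")
    if error_type not in ERROR_TYPES:
        raise ValueError(f"unknown error_type: {error_type}")
    return {
        "error_id": f"urn:uuid:{uuid.uuid4()}",
        "source_agent": source_agent,
        "severity": severity,
        "checkpoint_id": checkpoint_id,
        "error_type": error_type,
        "description": description,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        # Assumption: each entry is a full error signal received from upstream.
        "upstream_errors": list(upstream_errors or []),
    }


def root_causes(signal):
    """Walk "upstream_errors" back to the signals that have no upstream cause."""
    upstream = signal["upstream_errors"]
    if not upstream:
        return [signal]
    roots = []
    for parent in upstream:
        roots.extend(root_causes(parent))
    return roots
```

In the scenario from Section 3, Agent D would wrap the signal it
received from Agent C, which in turn wraps the one from Agent B, so
that root_causes on Agent D's signal returns Agent B's original
"action_failed" signal.
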
6. Circuit Breaker Pattern

Each agent MUST implement a circuit breaker for every downstream
agent it communicates with. The circuit breaker has three
states:

CLOSED (normal operation): Requests flow through. The agent
tracks the error rate over a sliding window (default: 60s).

OPEN (failure detected): When the error rate exceeds a
threshold (default: 50% over the window), the circuit breaker
opens. All requests to the downstream agent are immediately
rejected with error_type "circuit_open". The agent MUST emit
an error signal to upstream peers.

HALF-OPEN (recovery probe): After a cooldown period (default:
30s), the circuit breaker allows a single probe request. If it
succeeds, the breaker returns to CLOSED. If it fails, it
returns to OPEN with a doubled cooldown (exponential backoff,
max 300s).

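The three states above can be sketched as a small state machine.
This is a minimal illustration, not a reference implementation. The
class and method names (CircuitBreaker, allow_request, record) and
the injectable clock are hypothetical. The sketch also assumes that
no minimum number of requests is needed before the error rate is
evaluated, which this document leaves open, and it omits the error
signal that MUST be emitted to upstream peers when the breaker opens.

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, window_s=60.0, threshold=0.5, cooldown_s=30.0,
                 max_cooldown_s=300.0, clock=time.monotonic):
        self.window_s = window_s
        self.threshold = threshold
        self.base_cooldown_s = cooldown_s
        self.max_cooldown_s = max_cooldown_s
        self.clock = clock
        self.state = "closed"
        self.cooldown_s = cooldown_s
        self.opened_at = None
        self.probe_in_flight = False
        self.events = deque()  # (timestamp, succeeded) pairs in the window

    def error_rate(self):
        """Fraction of failed requests within the sliding window."""
        now = self.clock()
        while self.events and now - self.events[0][0] > self.window_s:
            self.events.popleft()
        if not self.events:
            return 0.0
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / len(self.events)

    def allow_request(self):
        """Return True if a request may be sent to the downstream agent."""
        now = self.clock()
        if self.state == "open" and now - self.opened_at >= self.cooldown_s:
            self.state = "half-open"
            self.probe_in_flight = False
        if self.state == "half-open":
            if self.probe_in_flight:
                return False  # only a single probe is allowed
            self.probe_in_flight = True
            return True
        return self.state == "closed"

    def record(self, succeeded):
        """Record the outcome of a request that was allowed through."""
        now = self.clock()
        if self.state == "half-open":
            self.probe_in_flight = False
            if succeeded:
                self.state = "closed"
                self.cooldown_s = self.base_cooldown_s
                self.events.clear()
            else:
                # Failed probe: reopen with a doubled cooldown, capped.
                self.cooldown_s = min(self.cooldown_s * 2, self.max_cooldown_s)
                self._open(now)
            return
        self.events.append((now, succeeded))
        if self.state == "closed" and self.error_rate() > self.threshold:
            self._open(now)

    def _open(self, now):
        self.state = "open"
        self.opened_at = now
```

A caller would check allow_request before each downstream request
and, when it returns False, reject the request locally with
error_type "circuit_open". Passing a fake clock makes the timing
behavior testable without waiting for real cooldowns.
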
Agents MUST expose circuit breaker state at:

GET /aerr/circuits

Response:
{
  "circuits": [
    {
      "downstream_agent": "urn:uuid:...",
      "state": "open",
      "error_rate": 0.75,
      "last_failure": "2026-03-01T12:05:00Z",
      "cooldown_remaining_s": 22
    }
  ]
}

This enables monitoring systems and upstream agents to
understand the health topology of the agent network.

7. Rollback Protocol

A rollback is initiated by sending an HTTP POST to the target
agent's rollback endpoint:

POST /aerr/rollback HTTP/1.1
Content-Type: application/json

{
  "rollback_id": "urn:uuid:...",
  "checkpoint_id": "urn:uuid:...",
  "reason": "Upstream action caused cascading failure",
  "initiator": "urn:uuid:...",
  "cascade": true
}

When "cascade" is true, the receiving agent MUST also initiate
rollback of any downstream checkpoints that were created as a
consequence of the checkpointed action. This enables a single
rollback request to unwind an entire chain of agent actions.

The agent MUST respond with a rollback result:

{
  "rollback_id": "urn:uuid:...",
  "status": "completed",
  "checkpoint_id": "urn:uuid:...",
  "state_hash_before": "sha256:...",
  "state_hash_after": "sha256:...",
  "cascaded_rollbacks": [
    {"agent_id": "urn:uuid:...", "status": "completed"},
    {"agent_id": "urn:uuid:...", "status": "escalated"}
  ]
}

Rollback status values: "completed", "partial", "escalated",
"failed".

"escalated" means the action was irreversible and a human
operator has been notified. "partial" means some but not all
downstream rollbacks succeeded.

Agents MUST implement idempotent rollback: receiving the same
rollback_id twice MUST return the same result without
re-executing the rollback.

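The idempotency requirement in Section 7 can be met by caching one
result per rollback_id, as in the following sketch. The
RollbackHandler class and its callables (restore, notify_operator,
cascade) are hypothetical names introduced for illustration. The
sketch makes several assumptions that this document does not
settle: an unknown checkpoint yields "failed", the local rollback
runs before downstream ones, any downstream status other than
"completed" makes the overall status "partial", and the state hash
fields of the result are omitted for brevity.

```python
class RollbackHandler:
    """Caches one result per rollback_id so repeats are not re-executed."""

    def __init__(self, checkpoints, restore, notify_operator, cascade=None):
        self.checkpoints = checkpoints          # checkpoint_id -> checkpoint
        self.restore = restore                  # callable(checkpoint) -> bool
        self.notify_operator = notify_operator  # callable(checkpoint, reason)
        self.cascade = cascade                  # callable(checkpoint) -> list
        self.results = {}                       # rollback_id -> stored result

    def handle(self, request):
        rollback_id = request["rollback_id"]
        if rollback_id in self.results:
            # Same rollback_id seen before: return the stored result as-is.
            return self.results[rollback_id]

        checkpoint = self.checkpoints.get(request["checkpoint_id"])
        cascaded = []
        if checkpoint is None:
            status = "failed"
        elif not checkpoint["reversible"]:
            # Irreversible actions are escalated to a human operator.
            self.notify_operator(checkpoint, request["reason"])
            status = "escalated"
        elif not self.restore(checkpoint):
            status = "failed"
        else:
            status = "completed"
            if request.get("cascade") and self.cascade is not None:
                cascaded = self.cascade(checkpoint)
                if any(c["status"] != "completed" for c in cascaded):
                    status = "partial"

        result = {
            "rollback_id": rollback_id,
            "status": status,
            "checkpoint_id": request["checkpoint_id"],
            "cascaded_rollbacks": cascaded,
        }
        self.results[rollback_id] = result
        return result
```

The in-memory dictionary is enough to show the idea. A deployed
agent would presumably keep the results in the same durable storage
as its checkpoints so that a retry after a restart still returns
the original result.
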
8. Security Considerations

Rollback requests are sensitive operations. Agents MUST
authenticate rollback requests using mutual TLS or signed JWTs.
Agents SHOULD authorize rollback requests only from agents in
the same action chain (identified by checkpoint lineage).

Checkpoint data may contain sensitive system state. Agents
MUST encrypt stored checkpoints at rest and MUST NOT include
checkpoint contents in error signals.

Circuit breaker state is observable information about system
health. The /aerr/circuits endpoint SHOULD be access-controlled
to prevent adversaries from mapping system topology.

Malicious agents could send false error signals to trigger
unnecessary rollbacks. Agents SHOULD verify that error signals
reference valid checkpoint IDs from their own action chains
before initiating rollback.

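The checkpoint ID check recommended above could look like the
following sketch. It assumes the sender has already been
authenticated by other means (mutual TLS or a signed JWT) and that
the agent keeps a record of which peers belong to each action
chain. The function name and the shape of both lookup tables are
hypothetical.

```python
def should_act_on(signal, own_checkpoints, chain_members):
    """Decide whether an error signal may trigger a rollback.

    own_checkpoints: set of checkpoint IDs this agent created.
    chain_members:   checkpoint ID -> set of agent IDs in its action chain
                     (an assumed bookkeeping structure).
    """
    checkpoint_id = signal.get("checkpoint_id")
    if checkpoint_id not in own_checkpoints:
        return False  # unknown or foreign checkpoint: ignore the signal
    # The sender must belong to the action chain of that checkpoint.
    return signal.get("source_agent") in chain_members.get(checkpoint_id, set())
```

Signals that fail this check would be logged and dropped rather
than acted on, so that a forged signal cannot by itself cause a
rollback.
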
9. IANA Considerations

This document requests that IANA establish the following:

1. An "AERR Error Type" registry under Specification Required
   policy. Initial entries: "action_failed", "timeout",
   "constraint_violation", "resource_exhausted",
   "upstream_cascade", "circuit_open", "unknown".

2. An "AERR Severity Level" registry under Specification
   Required policy. Initial entries: "info", "warning",
   "error", "critical".

3. A well-known URI registration for the suffix "aerr" per
   RFC 8615. Registered suffixes cannot contain "/", so the
   "error", "rollback", and "circuits" resources are paths
   beneath this single suffix.

Author's Address

Generated by IETF Draft Analyzer
2026-03-01