fullname: Generated by IETF Draft Analyzer
organization: Independent
email: placeholder@example.com
normative:
  RFC7519:
  RFC7515:
  RFC9110:
  I-D.nennemann-wimse-ect:
    title: "Execution Context Tokens for Distributed Agentic Workflows"
    target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/
informative:
  I-D.nennemann-agent-dag-hitl-safety:
    title: "Agent Context Policy Token: DAG Delegation with Human Override"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/
--- abstract
This document defines the Agent Error Recovery and Rollback (AERR) protocol, a standard for handling errors, cascading failures, and rollback in multi-agent systems. AERR defines three mechanisms: state checkpoints recorded as Execution Context Token (ECT) DAG nodes, a circuit breaker pattern to contain cascading failures, and a rollback protocol that walks the ECT DAG backwards to revert agent actions to a known-good state. By building on ECT, AERR inherits cryptographic audit trails, assurance levels, and DAG validation without inventing parallel infrastructure.
--- middle
Introduction
The IETF AI/agent landscape includes 60 drafts on autonomous network operations but none that standardize error recovery. When an autonomous agent misconfigures a router, allocates resources incorrectly, or triggers a cascade of failures across a multi-agent system, there is no standard mechanism for detecting the failure, containing its blast radius, or reverting to a safe state.
AERR borrows proven patterns from distributed systems -- checkpoints from database transactions, circuit breakers from microservice architectures, rollback from version control -- and adapts them for AI agent workflows. Rather than inventing its own audit and tracing layer, AERR records all checkpoints, errors, and rollbacks as ECT DAG nodes {{I-D.nennemann-wimse-ect}}, giving every recovery action a cryptographic proof chain.
Design principles:
- Agents that take consequential actions MUST be able to undo them, or MUST declare them irreversible upfront.
- Failure containment takes priority over failure diagnosis.
- The protocol adds minimal overhead to the happy path.
Conventions and Definitions
{::boilerplate bcp14-tagged}
- Checkpoint:
- An ECT recording an agent's state hash before a consequential action, providing a restore point for rollback.
- Circuit Breaker:
- A mechanism that stops an agent from propagating requests to a failing downstream agent, preventing cascading failures.
- Rollback:
- The process of reverting an agent's actions and state to a previously recorded checkpoint, walking the ECT DAG backwards.
- Blast Radius:
- The set of agents and systems affected by a single agent's failure, determinable by traversing the ECT DAG forward from the failing node.
Problem Statement
Consider a network operations scenario: Agent A instructs Agent B to update firewall rules, which causes Agent C's traffic monitoring to fail, which causes Agent D to misclassify traffic. Today each agent handles errors independently. There is no standard way for Agent D to signal that the root cause is upstream, for the cascade to be halted, or for the chain of actions to be rolled back.
The ECT DAG {{I-D.nennemann-wimse-ect}} already records causal
ordering of agent actions via par references. AERR adds
checkpoint semantics, error propagation, and rollback operations
on top of this existing structure.
Checkpoint Mechanism
An AERR-compliant agent MUST create a checkpoint ECT before any action it classifies as consequential. An action is consequential if it modifies external state, for example by changing network configuration, writing database records, or making API calls with side effects.
Checkpoint as ECT
A checkpoint is an ECT with:
- exec_act: "aerr:checkpoint"
- par: the jti of the preceding task ECT in the workflow
- out_hash: SHA-256 hash of the agent's state snapshot at checkpoint time (for rollback integrity verification)
The ext claim carries AERR-specific metadata:
{
"ext": {
"aerr.action_type": "config_update",
"aerr.target": "router-07.example.com",
"aerr.reversible": true,
"aerr.rollback_uri": "https://agent-b.example.com/aerr/rollback",
"aerr.ttl": 86400
}
}
{: #fig-checkpoint title="Checkpoint ECT Extension Claims"}
The aerr.reversible field MUST be present. If false, the
agent declares that this action cannot be automatically undone
and rollback requests MUST be escalated to a human operator via
the HITL mechanism {{I-D.nennemann-agent-dag-hitl-safety}}.
Agents MAY create hierarchical checkpoints using the ECT DAG: a
parent checkpoint ECT with par references to multiple child
checkpoint ECTs. Rolling back the parent rolls back all children.
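As an illustration only, the checkpoint claims described above could be assembled as in the following Python sketch. The function names, the canonical JSON encoding of the state snapshot, the hex encoding of out_hash, and the representation of par as a list are assumptions of this sketch, not requirements of this document. Other ECT claims and the signature are omitted.

```python
import hashlib
import json
import uuid


def state_hash(state: dict) -> str:
    """Hex SHA-256 over a canonical JSON encoding of the state snapshot.

    Assumption: sorted-key compact JSON is used as the canonical form.
    """
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_checkpoint_claims(state, parent_jti, action_type, target,
                            reversible, rollback_uri=None, ttl=86400):
    """Assemble the AERR-relevant claims of a checkpoint ECT (unsigned)."""
    ext = {
        "aerr.action_type": action_type,
        "aerr.target": target,
        "aerr.reversible": reversible,  # always emitted: the field MUST be present
        "aerr.ttl": ttl,
    }
    if reversible and rollback_uri is not None:
        ext["aerr.rollback_uri"] = rollback_uri
    return {
        "jti": str(uuid.uuid4()),
        "exec_act": "aerr:checkpoint",
        "par": [parent_jti],  # assumption: par is carried as a list of jti values
        "out_hash": state_hash(state),
        "ext": ext,
    }
```

Hashing a canonical encoding keeps the hash stable across key ordering, so the same state yields the same out_hash when it is recomputed after a rollback.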
Checkpoint Storage
Checkpoint ECTs MUST be stored for at least the duration specified
by aerr.ttl. At L3 {{I-D.nennemann-wimse-ect}}, checkpoints
are automatically preserved in the audit ledger. At L1 and L2,
agents MUST store checkpoints in durable local storage that
survives agent restarts.
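One possible way to meet the L1/L2 storage requirement is a small on-disk store, sketched below with Python's standard sqlite3 module. The class name, table layout, the interpretation of aerr.ttl as seconds, and the choice to count the TTL from the time of storage are all assumptions of this sketch.

```python
import json
import sqlite3
import time


class CheckpointStore:
    """Hypothetical durable store for checkpoint ECT claims."""

    def __init__(self, path):
        # A file path gives storage that survives restarts.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            "jti TEXT PRIMARY KEY, claims TEXT NOT NULL, "
            "expires_at REAL NOT NULL)"
        )
        self.db.commit()

    def put(self, claims, now=None):
        now = time.time() if now is None else now
        # Assumption: the TTL is in seconds and starts when the checkpoint is stored.
        expires_at = now + claims["ext"]["aerr.ttl"]
        self.db.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
            (claims["jti"], json.dumps(claims), expires_at),
        )
        self.db.commit()

    def get(self, jti):
        row = self.db.execute(
            "SELECT claims FROM checkpoints WHERE jti = ?", (jti,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def purge_expired(self, now=None):
        """Delete checkpoints whose TTL has elapsed; return how many were removed."""
        now = time.time() if now is None else now
        cur = self.db.execute(
            "DELETE FROM checkpoints WHERE expires_at < ?", (now,)
        )
        self.db.commit()
        return cur.rowcount
```

Purging is kept as an explicit step so that a checkpoint is never dropped before its aerr.ttl has elapsed.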
Error Signaling
When an agent detects an error, it MUST produce an error ECT and propagate it to affected agents in the DAG.
Error ECT
An error signal is an ECT with:
- exec_act: "aerr:error"
- par: the jti of the checkpoint ECT associated with the failing action
The ext claim carries error details:
{
"ext": {
"aerr.severity": "critical",
"aerr.error_type": "action_failed",
"aerr.description": "BGP session did not establish",
"aerr.checkpoint_id": "550e8400-e29b-41d4-a716-446655440001",
"aerr.upstream_errors": []
}
}
{: #fig-error title="Error ECT Extension Claims"}
Severity levels: info, warning, error, critical.
Error types: action_failed, timeout, constraint_violation,
resource_exhausted, upstream_cascade, circuit_open, unknown.
Error Propagation via DAG
When an agent receives an error ECT caused by an action it initiated, it MUST either:
(a) Attempt automatic rollback of its checkpoint ({{rollback}}), or
(b) Escalate to its operator if the action was irreversible.
The aerr.upstream_errors array allows agents to chain error
context by referencing jti values of predecessor error ECTs,
building a causal trace from symptom to root cause through the
DAG.
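The causal trace can be followed programmatically. The sketch below assumes a hypothetical in-memory index (errors_by_jti) that maps each known error ECT's jti to its claims, and walks aerr.upstream_errors until it reaches errors that have no known upstream error. Those are reported as candidate root causes.

```python
def trace_root_causes(error_jti, errors_by_jti):
    """Follow aerr.upstream_errors back to errors with no known upstream.

    errors_by_jti is a hypothetical index: jti -> error ECT claims.
    References to error ECTs missing from the index are ignored, so the
    result is the farthest point the local view of the DAG can reach.
    """
    roots, seen, stack = [], set(), [error_jti]
    while stack:
        jti = stack.pop()
        if jti in seen:
            continue  # guards against malformed cyclic references
        seen.add(jti)
        ect = errors_by_jti.get(jti)
        upstream = ect["ext"].get("aerr.upstream_errors", []) if ect else []
        known = [u for u in upstream if u in errors_by_jti]
        if known:
            stack.extend(known)
        else:
            roots.append(jti)
    return roots
```

In the Problem Statement scenario, starting from Agent D's error ECT and following the chain through Agent C's would end at Agent B's error, provided each agent listed its predecessor in aerr.upstream_errors.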
HITL Escalation
When an error requires human intervention, the error ECT SHOULD trigger a HITL rule per {{I-D.nennemann-agent-dag-hitl-safety}}. Example policy:
{
"hitl": {
"rules": [{
"id": "r-critical-error",
"trigger": {
"kind": "keyword_match",
"op": "eq",
"value": "critical",
"input_ref": "ext.aerr.severity"
},
"required_role": "operator:oncall",
"action": "escalate",
"allow_override": true,
"override_action": "continue"
}]
}
}
{: #fig-hitl-error title="HITL Policy for Critical Errors"}
Circuit Breaker Pattern
Each agent MUST implement a circuit breaker for every downstream agent it communicates with.
States
- CLOSED (normal):
- Requests flow through. The agent tracks the error rate over a sliding window (default: 60 seconds).
- OPEN (failure detected):
- When the error rate exceeds a threshold (default: 50% over the
window), the breaker opens. All requests to the downstream
agent are immediately rejected with
aerr.error_type: circuit_open. The agent MUST produce an error ECT and emit it to upstream peers.
- HALF-OPEN (recovery probe):
- After a cooldown period (default: 30 seconds), the breaker allows a single probe request. If it succeeds, the breaker returns to CLOSED. If it fails, it returns to OPEN with doubled cooldown (exponential backoff, max 300 seconds).
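The three states can be sketched as a small state machine. The Python class below is a minimal illustration using the default values given above. The class and method names are hypothetical, ECT emission on state changes is omitted, and resetting the cooldown to its base value after a successful probe is an assumption of this sketch.

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, window_s=60.0, threshold=0.5, cooldown_s=30.0,
                 max_cooldown_s=300.0, clock=time.monotonic):
        self.window_s = window_s
        self.threshold = threshold
        self.base_cooldown_s = cooldown_s
        self.cooldown_s = cooldown_s
        self.max_cooldown_s = max_cooldown_s
        self.clock = clock
        self.state = "closed"
        self.opened_at = 0.0
        self.results = deque()  # (timestamp, succeeded) within the window

    def allow_request(self):
        if self.state == "closed":
            return True
        if (self.state == "open"
                and self.clock() - self.opened_at >= self.cooldown_s):
            self.state = "half_open"  # admit exactly one probe request
            return True
        return False  # open, or half-open with a probe already in flight

    def record(self, succeeded):
        now = self.clock()
        if self.state == "open":
            return  # ignore late results from requests sent before opening
        if self.state == "half_open":
            if succeeded:
                self.state = "closed"
                self.cooldown_s = self.base_cooldown_s  # assumption: reset
                self.results.clear()
            else:
                # Exponential backoff, capped at the maximum cooldown.
                self.cooldown_s = min(self.cooldown_s * 2, self.max_cooldown_s)
                self._open(now)
            return
        self.results.append((now, succeeded))
        while self.results and now - self.results[0][0] > self.window_s:
            self.results.popleft()
        failures = sum(1 for _, ok in self.results if not ok)
        if failures / len(self.results) > self.threshold:
            self._open(now)

    def _open(self, now):
        self.state = "open"
        self.opened_at = now
```

With no minimum sample count, a single failed request in an otherwise empty window is a 100% error rate and opens the breaker. A deployment might add a minimum request count before evaluating the threshold, which this document does not specify.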
State Change ECTs
Each circuit breaker state change MUST produce an ECT:
- exec_act: "aerr:circuit_open", "aerr:circuit_half_open", or "aerr:circuit_closed"
- par: the jti of the error ECT that triggered the transition
This records the health topology of the agent network in the ECT DAG, queryable from the audit ledger at L3.
Observability
Agents MUST expose circuit breaker state at:
GET /aerr/circuits
Response:
{
"circuits": [{
"downstream_agent": "spiffe://example.com/agent/router-mgr",
"state": "open",
"error_rate": 0.75,
"last_failure_ect": "550e8400-e29b-41d4-a716-446655440099",
"cooldown_remaining_s": 22
}]
}
{: #fig-circuits title="Circuit Breaker Status"}
Rollback Protocol
Rollback Request
A rollback is initiated by sending an HTTP POST to the target agent's rollback endpoint:
POST /aerr/rollback HTTP/1.1
Content-Type: application/json
Execution-Context: <rollback-request-ECT>

{
"rollback_id": "urn:uuid:...",
"checkpoint_id": "550e8400-e29b-41d4-a716-446655440001",
"reason": "Upstream action caused cascading failure",
"cascade": true
}
{: #fig-rollback-req title="Rollback Request"}
The request MUST include an ECT in the Execution-Context header
with exec_act: "aerr:rollback_request" and par referencing
the error ECT that motivated the rollback.
When cascade is true, the receiving agent MUST also initiate
rollback of any downstream checkpoints created as a consequence
of the checkpointed action. The ECT DAG's par chain identifies
these downstream actions.
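Finding the downstream checkpoints for a cascading rollback amounts to a forward traversal of the DAG. The sketch below assumes the agent has a list of ECT claim dictionaries for the workflow, with par represented as a list of jti values. The function name and data layout are hypothetical.

```python
from collections import defaultdict, deque


def downstream_checkpoints(checkpoint_jti, ects):
    """Return jti values of checkpoint ECTs that descend from checkpoint_jti.

    Results are in breadth-first order, nearest descendants first.
    """
    # Invert the par references: parent jti -> list of child jti values.
    children = defaultdict(list)
    for ect in ects:
        for parent in ect.get("par", []):
            children[parent].append(ect["jti"])
    by_jti = {ect["jti"]: ect for ect in ects}

    found, seen, queue = [], {checkpoint_jti}, deque([checkpoint_jti])
    while queue:
        for child in children[queue.popleft()]:
            if child in seen:
                continue
            seen.add(child)
            queue.append(child)
            if by_jti[child].get("exec_act") == "aerr:checkpoint":
                found.append(child)
    return found
```

This document does not specify the order in which cascaded rollbacks are performed. An implementation might process the returned list in reverse, so that the most distant effects are undone first.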
Rollback Response
The agent produces a rollback result ECT with:
- exec_act: "aerr:rollback_complete" (or "aerr:rollback_escalated")
- par: the jti of the rollback request ECT
- out_hash: SHA-256 hash of the agent's state after rollback
{
"ext": {
"aerr.rollback_id": "urn:uuid:...",
"aerr.status": "completed",
"aerr.state_hash_before": "sha256:...",
"aerr.state_hash_after": "sha256:...",
"aerr.cascaded": [
{"agent": "spiffe://example.com/agent/monitor", "status": "completed"},
{"agent": "spiffe://example.com/agent/classify", "status": "escalated"}
]
}
}
{: #fig-rollback-resp title="Rollback Result ECT"}
Status values: completed, partial, escalated, failed.
escalated means the action was irreversible and a human operator
has been notified via HITL. partial means some but not all
downstream rollbacks succeeded.
Idempotency
Agents MUST implement idempotent rollback: receiving the same
rollback_id twice MUST return the same result without
re-executing the rollback.
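A minimal way to obtain this behavior is to cache results by rollback_id, as in the sketch below. The class name and the do_rollback callable are hypothetical. The cache here is in memory only, whereas a real agent would likely need durable storage and protection against concurrent duplicate requests.

```python
class RollbackHandler:
    """Hypothetical wrapper that makes a rollback routine idempotent."""

    def __init__(self, do_rollback):
        # do_rollback: callable taking the request dict, returning a result dict.
        self._do_rollback = do_rollback
        self._results = {}  # rollback_id -> result of the first execution

    def handle(self, request):
        rollback_id = request["rollback_id"]
        if rollback_id in self._results:
            # Replayed request: return the stored result, do not re-execute.
            return self._results[rollback_id]
        result = self._do_rollback(request)
        self._results[rollback_id] = result
        return result
```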
Security Considerations
Rollback requests are sensitive operations. Agents MUST
authenticate rollback requests via the ECT signature chain.
Agents SHOULD authorize rollback requests only from agents whose
ECTs appear in the same workflow DAG (identified by wid).
Checkpoint ECTs contain out_hash of agent state but not the
state itself. Agents MUST encrypt stored state snapshots at rest.
Circuit breaker status exposes system health topology. The
/aerr/circuits endpoint SHOULD be access-controlled.
Malicious agents could emit false error ECTs to trigger rollbacks.
Agents SHOULD verify that error ECTs reference valid checkpoint
jti values from their own workflow DAG before initiating
rollback. At L2 and L3, ECT signatures prevent forgery.
IANA Considerations
This document requests the following IANA registrations:
- An "AERR Error Type" registry under Specification Required policy. Initial entries: action_failed, timeout, constraint_violation, resource_exhausted, upstream_cascade, circuit_open, unknown.
- Registration of exec_act values aerr:checkpoint, aerr:error, aerr:rollback_request, aerr:rollback_complete, aerr:rollback_escalated, aerr:circuit_open, aerr:circuit_half_open, aerr:circuit_closed in a future ECT action type registry.
--- back
Acknowledgments
{:numbered="false"}
This document builds on the Execution Context Token specification {{I-D.nennemann-wimse-ect}} for DAG-based audit trails and the Agent Context Policy Token {{I-D.nennemann-agent-dag-hitl-safety}} for HITL escalation of irreversible actions.