---
title: "Agent Error Recovery and Rollback (AERR)"
abbrev: "AERR"
category: std
docname: draft-aerr-agent-error-recovery-rollback-00
submissiontype: IETF
number:
date:
v: 3
area: "OPS"
workgroup: "NMOP"
keyword:
  - error recovery
  - rollback
  - circuit breaker
  - agentic workflows
  - execution context

author:
  -
    fullname: Generated by IETF Draft Analyzer
    organization: Independent
    email: placeholder@example.com

normative:
  RFC7519:
  RFC7515:
  RFC9110:
  I-D.nennemann-wimse-ect:
    title: "Execution Context Tokens for Distributed Agentic Workflows"
    target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/

informative:
  I-D.nennemann-agent-dag-hitl-safety:
    title: "Agent Context Policy Token: DAG Delegation with Human Override"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/

--- abstract

This document defines the Agent Error Recovery and Rollback (AERR)
protocol, a standard for handling errors, cascading failures, and
rollback in multi-agent systems. AERR defines three mechanisms:
state checkpoints recorded as Execution Context Token (ECT) DAG
nodes, a circuit breaker pattern to contain cascading failures,
and a rollback protocol that walks the ECT DAG backwards to revert
agent actions to a known-good state. By building on ECT, AERR
inherits cryptographic audit trails, assurance levels, and DAG
validation without inventing parallel infrastructure.

--- middle

# Introduction

The IETF AI/agent landscape includes 60 drafts on autonomous
network operations but none that standardize error recovery. When
an autonomous agent misconfigures a router, allocates resources
incorrectly, or triggers a cascade of failures across a multi-agent
system, there is no standard mechanism for detecting the failure,
containing its blast radius, or reverting to a safe state.

AERR borrows proven patterns from distributed systems -- checkpoints
from database transactions, circuit breakers from microservice
architectures, rollback from version control -- and adapts them for
AI agent workflows. Rather than inventing its own audit and
tracing layer, AERR records all checkpoints, errors, and rollbacks
as ECT DAG nodes {{I-D.nennemann-wimse-ect}}, giving every
recovery action a cryptographic proof chain.

Design principles:

1. Agents that take consequential actions MUST be able to undo
   them, or MUST declare them irreversible upfront.
2. Failure containment takes priority over failure diagnosis.
3. The protocol adds minimal overhead to the happy path.

# Conventions and Definitions

{::boilerplate bcp14-tagged}

Checkpoint:
: An ECT recording an agent's state hash before a consequential
  action, providing a restore point for rollback.

Circuit Breaker:
: A mechanism that stops an agent from propagating requests to a
  failing downstream agent, preventing cascading failures.

Rollback:
: The process of reverting an agent's actions and state to a
  previously recorded checkpoint, walking the ECT DAG backwards.

Blast Radius:
: The set of agents and systems affected by a single agent's
  failure, determinable by traversing the ECT DAG forward from the
  failing node.

# Problem Statement

Consider a network operations scenario: Agent A instructs Agent B
to update firewall rules, which causes Agent C's traffic monitoring
to fail, which causes Agent D to misclassify traffic. Today each
agent handles errors independently. There is no standard way for
Agent D to signal that the root cause is upstream, for the cascade
to be halted, or for the chain of actions to be rolled back.

The ECT DAG {{I-D.nennemann-wimse-ect}} already records causal
ordering of agent actions via `par` references. AERR adds
checkpoint semantics, error propagation, and rollback operations
on top of this existing structure.

# Checkpoint Mechanism {#checkpoints}

An AERR-compliant agent MUST create a checkpoint ECT before any
action it classifies as consequential. An action is consequential
if it modifies external state (e.g., network config, database
records, API calls with side effects).

## Checkpoint as ECT

A checkpoint is an ECT with:

- `exec_act`: `"aerr:checkpoint"`
- `par`: the `jti` of the preceding task ECT in the workflow
- `out_hash`: SHA-256 hash of the agent's state snapshot at
  checkpoint time (for rollback integrity verification)

The `ext` claim carries AERR-specific metadata:

~~~json
{
  "ext": {
    "aerr.action_type": "config_update",
    "aerr.target": "router-07.example.com",
    "aerr.reversible": true,
    "aerr.rollback_uri": "https://agent-b.example.com/aerr/rollback",
    "aerr.ttl": 86400
  }
}
~~~
{: #fig-checkpoint title="Checkpoint ECT Extension Claims"}

The `aerr.reversible` field MUST be present. If `false`, the
agent declares that this action cannot be automatically undone
and rollback requests MUST be escalated to a human operator via
the HITL mechanism {{I-D.nennemann-agent-dag-hitl-safety}}.

Agents MAY create hierarchical checkpoints using the ECT DAG: a
parent checkpoint ECT with `par` references to multiple child
checkpoint ECTs. Rolling back the parent rolls back all children.

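
As an illustration of the claims listed in this section, the following
Python sketch assembles the unsigned claim set of a checkpoint ECT. The
function names, the canonical-JSON serialization of the snapshot, the
hex encoding of `out_hash`, and the encoding of `par` as a list are
assumptions made for this example; they are not defined by this
document, and signing the ECT is out of scope here.

```python
import hashlib
import json
import uuid


def state_hash(snapshot: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the snapshot (assumed encoding)."""
    canonical = json.dumps(snapshot, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_checkpoint_claims(parent_jti, snapshot, target, action_type,
                            reversible, rollback_uri=None, ttl=86400):
    """Assemble the unsigned claims of an aerr:checkpoint ECT (hypothetical helper)."""
    ext = {
        "aerr.action_type": action_type,
        "aerr.target": target,
        "aerr.reversible": reversible,  # MUST always be present
        "aerr.ttl": ttl,
    }
    # A rollback endpoint is only meaningful for reversible actions.
    if reversible and rollback_uri is not None:
        ext["aerr.rollback_uri"] = rollback_uri
    return {
        "jti": str(uuid.uuid4()),
        "exec_act": "aerr:checkpoint",
        "par": [parent_jti],  # list encoding is an assumption
        "out_hash": state_hash(snapshot),
        "ext": ext,
    }
```

Sorting keys before hashing makes the hash independent of dictionary
ordering, so two agents that serialize the same state obtain the same
`out_hash`.
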
## Checkpoint Storage

Checkpoint ECTs MUST be stored for at least the duration specified
by `aerr.ttl`. At L3 {{I-D.nennemann-wimse-ect}}, checkpoints
are automatically preserved in the audit ledger. At L1 and L2,
agents MUST store checkpoints in durable local storage that
survives agent restarts.

# Error Signaling {#error-signals}

When an agent detects an error, it MUST produce an error ECT and
propagate it to affected agents in the DAG.

## Error ECT

An error signal is an ECT with:

- `exec_act`: `"aerr:error"`
- `par`: the `jti` of the checkpoint ECT associated with the
  failing action

The `ext` claim carries error details:

~~~json
{
  "ext": {
    "aerr.severity": "critical",
    "aerr.error_type": "action_failed",
    "aerr.description": "BGP session did not establish",
    "aerr.checkpoint_id": "550e8400-e29b-41d4-a716-446655440001",
    "aerr.upstream_errors": []
  }
}
~~~
{: #fig-error title="Error ECT Extension Claims"}

Severity levels: `info`, `warning`, `error`, `critical`.

Error types: `action_failed`, `timeout`, `constraint_violation`,
`resource_exhausted`, `upstream_cascade`, `circuit_open`, `unknown`.

## Error Propagation via DAG

When an agent receives an error ECT caused by an action it
initiated, it MUST either:

(a) Attempt automatic rollback of its checkpoint ({{rollback}}), or

(b) Escalate to its operator if the action was irreversible.

The `aerr.upstream_errors` array allows agents to chain error
context by referencing `jti` values of predecessor error ECTs,
building a causal trace from symptom to root cause through the
DAG.

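
The causal trace described above can be followed mechanically. The
sketch below walks `aerr.upstream_errors` from a symptom back to the
errors that have no known upstream error. The function name and the
`errors_by_jti` index (a mapping from `jti` to error ECT claims held
locally) are hypothetical and serve only to illustrate the traversal.

```python
def trace_root_causes(error_jti, errors_by_jti):
    """Follow aerr.upstream_errors from a symptom to its root-cause errors.

    errors_by_jti maps jti -> error ECT claims (hypothetical local index).
    Returns the jti values of errors that have no known upstream error.
    """
    roots, seen, stack = [], set(), [error_jti]
    while stack:
        jti = stack.pop()
        if jti in seen:
            continue  # guard against repeated or cyclic references
        seen.add(jti)
        ext = errors_by_jti.get(jti, {}).get("ext", {})
        upstream = ext.get("aerr.upstream_errors", [])
        known = [u for u in upstream if u in errors_by_jti]
        if known:
            stack.extend(known)
        else:
            roots.append(jti)  # deepest error this agent knows about
    return roots
```

Applied to the scenario in the problem statement, an error reported by
Agent D that references Agent C's error, which in turn references
Agent B's error, resolves to Agent B's error as the single root cause.
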
## HITL Escalation

When an error requires human intervention, the error ECT SHOULD
trigger a HITL rule per {{I-D.nennemann-agent-dag-hitl-safety}}.
Example policy:

~~~json
{
  "hitl": {
    "rules": [{
      "id": "r-critical-error",
      "trigger": {
        "kind": "keyword_match",
        "op": "eq",
        "value": "critical",
        "input_ref": "ext.aerr.severity"
      },
      "required_role": "operator:oncall",
      "action": "escalate",
      "allow_override": true,
      "override_action": "continue"
    }]
  }
}
~~~
{: #fig-hitl-error title="HITL Policy for Critical Errors"}

# Circuit Breaker Pattern {#circuit-breaker}

Each agent MUST implement a circuit breaker for every downstream
agent it communicates with.

## States

CLOSED (normal):
: Requests flow through. The agent tracks the error rate over a
  sliding window (default: 60 seconds).

OPEN (failure detected):
: When the error rate exceeds a threshold (default: 50% over the
  window), the breaker opens. All requests to the downstream
  agent are immediately rejected with `aerr.error_type`:
  `circuit_open`. The agent MUST produce an error ECT and emit
  it to upstream peers.

HALF-OPEN (recovery probe):
: After a cooldown period (default: 30 seconds), the breaker
  allows a single probe request. If it succeeds, the breaker
  returns to CLOSED. If it fails, it returns to OPEN with doubled
  cooldown (exponential backoff, max 300 seconds).

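
The three states and their transitions can be sketched as a small
state machine. This is a minimal, single-threaded illustration using
the default values given above; the class and method names are
hypothetical, the injectable `clock` parameter exists only to make the
sketch testable, and the emission of error and state change ECTs is
omitted.

```python
import time
from collections import deque


class CircuitBreaker:
    """Per-downstream-agent breaker (illustrative sketch, not normative)."""

    def __init__(self, window_s=60.0, threshold=0.5, cooldown_s=30.0,
                 max_cooldown_s=300.0, clock=time.monotonic):
        self.window_s = window_s
        self.threshold = threshold
        self.base_cooldown_s = cooldown_s
        self.cooldown_s = cooldown_s
        self.max_cooldown_s = max_cooldown_s
        self.clock = clock
        self.state = "closed"
        self.opened_at = None
        self.samples = deque()  # (timestamp, succeeded) pairs in the window

    def allow_request(self):
        """Return True if a request may be sent downstream right now."""
        if self.state == "open":
            if self.clock() - self.opened_at >= self.cooldown_s:
                self.state = "half_open"  # admit exactly one probe
                return True
            return False
        if self.state == "half_open":
            return False  # a probe is already in flight
        return True

    def record(self, succeeded):
        """Record the outcome of a request that was allowed through."""
        now = self.clock()
        if self.state == "open":
            return  # late results do not affect an open breaker
        if self.state == "half_open":
            if succeeded:
                self.state = "closed"
                self.cooldown_s = self.base_cooldown_s
            else:
                # Failed probe: reopen with doubled cooldown, capped.
                self.cooldown_s = min(self.cooldown_s * 2, self.max_cooldown_s)
                self.state, self.opened_at = "open", now
            self.samples.clear()
            return
        self.samples.append((now, succeeded))
        while now - self.samples[0][0] > self.window_s:
            self.samples.popleft()  # drop samples outside the sliding window
        failures = sum(1 for _, ok in self.samples if not ok)
        if failures / len(self.samples) > self.threshold:
            self.state, self.opened_at = "open", now
            self.samples.clear()
```

Read literally, a single failure in an otherwise empty window yields an
error rate of 100% and opens the breaker. The sketch keeps that
behavior and does not add a minimum sample count.
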
## State Change ECTs

Each circuit breaker state change MUST produce an ECT:

- `exec_act`: `"aerr:circuit_open"`, `"aerr:circuit_half_open"`,
  or `"aerr:circuit_closed"`
- `par`: the `jti` of the error ECT that triggered the transition

This records the health topology of the agent network in the ECT
DAG, queryable from the audit ledger at L3.

## Observability

Agents MUST expose circuit breaker state at:

~~~
GET /aerr/circuits
~~~

Response:

~~~json
{
  "circuits": [{
    "downstream_agent": "spiffe://example.com/agent/router-mgr",
    "state": "open",
    "error_rate": 0.75,
    "last_failure_ect": "550e8400-e29b-41d4-a716-446655440099",
    "cooldown_remaining_s": 22
  }]
}
~~~
{: #fig-circuits title="Circuit Breaker Status"}

# Rollback Protocol {#rollback}

## Rollback Request

A rollback is initiated by sending an HTTP POST to the target
agent's rollback endpoint:

~~~
POST /aerr/rollback HTTP/1.1
Content-Type: application/json
Execution-Context: <rollback-request-ECT>

{
  "rollback_id": "urn:uuid:...",
  "checkpoint_id": "550e8400-e29b-41d4-a716-446655440001",
  "reason": "Upstream action caused cascading failure",
  "cascade": true
}
~~~
{: #fig-rollback-req title="Rollback Request"}

The request MUST include an ECT in the Execution-Context header
with `exec_act`: `"aerr:rollback_request"` and `par` referencing
the error ECT that motivated the rollback.

When `cascade` is `true`, the receiving agent MUST also initiate
rollback of any downstream checkpoints created as a consequence
of the checkpointed action. These downstream checkpoints are the
checkpoint ECTs whose `par` chain leads back to the checkpoint
being rolled back.

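
To illustrate how a receiving agent might find the checkpoints covered
by a cascading rollback, the sketch below traverses the DAG forward
from a checkpoint and collects every checkpoint ECT reachable through
`par` references. The function name, the in-memory list of claim
dictionaries, and the list encoding of `par` are assumptions for this
example. The order in which the collected checkpoints are rolled back
is not addressed.

```python
from collections import defaultdict, deque


def downstream_checkpoints(checkpoint_jti, ects):
    """Return jti values of checkpoint ECTs causally downstream of checkpoint_jti.

    ects is an iterable of ECT claim dicts from one workflow (hypothetical input).
    """
    children = defaultdict(list)  # parent jti -> jti values that reference it
    by_jti = {}
    for ect in ects:
        by_jti[ect["jti"]] = ect
        for parent in ect.get("par", []):
            children[parent].append(ect["jti"])

    found, seen, queue = [], {checkpoint_jti}, deque([checkpoint_jti])
    while queue:
        for child in children[queue.popleft()]:
            if child in seen:
                continue  # a node can be reachable via several parents
            seen.add(child)
            queue.append(child)
            if by_jti[child].get("exec_act") == "aerr:checkpoint":
                found.append(child)
    return found
```

The same forward traversal, without the `exec_act` filter, yields the
blast radius of a failing node as defined in this document.
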
## Rollback Response

The agent produces a rollback result ECT with:

- `exec_act`: `"aerr:rollback_complete"` (or `"aerr:rollback_escalated"`)
- `par`: the `jti` of the rollback request ECT
- `out_hash`: SHA-256 hash of the agent's state after rollback

~~~json
{
  "ext": {
    "aerr.rollback_id": "urn:uuid:...",
    "aerr.status": "completed",
    "aerr.state_hash_before": "sha256:...",
    "aerr.state_hash_after": "sha256:...",
    "aerr.cascaded": [
      {"agent": "spiffe://example.com/agent/monitor", "status": "completed"},
      {"agent": "spiffe://example.com/agent/classify", "status": "escalated"}
    ]
  }
}
~~~
{: #fig-rollback-resp title="Rollback Result ECT"}

Status values: `completed`, `partial`, `escalated`, `failed`.

`escalated` means the action was irreversible and a human operator
has been notified via HITL. `partial` means some but not all
downstream rollbacks succeeded.

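
Since the checkpoint's `out_hash` is recorded for rollback integrity
verification, one plausible check after a rollback is to hash the
restored state and compare it with the checkpoint's `out_hash`. The
sketch below illustrates that comparison. The function names, the
canonical-JSON serialization, and the hex encoding are assumptions of
this example. That a successful rollback reproduces the checkpoint hash
exactly is a reading of this document, not a stated requirement.

```python
import hashlib
import json


def state_hash(snapshot: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the snapshot (assumed encoding)."""
    canonical = json.dumps(snapshot, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def restored_state_matches(checkpoint_claims: dict, restored_snapshot: dict) -> bool:
    """True if the restored state hashes to the checkpoint's out_hash."""
    return checkpoint_claims["out_hash"] == state_hash(restored_snapshot)
```

An agent could use the outcome to choose between reporting `completed`
and `failed` in `aerr.status`.
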
## Idempotency

Agents MUST implement idempotent rollback: receiving the same
`rollback_id` twice MUST return the same result without
re-executing the rollback.

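
One way to meet this requirement is to cache the result of each
rollback under its `rollback_id`. The following sketch shows the idea
with an in-memory dictionary. The class name and the
`perform_rollback` callable are hypothetical, and a real agent would
need durable storage and protection against concurrent duplicate
requests, neither of which is shown.

```python
class RollbackHandler:
    """Idempotent rollback dispatch keyed by rollback_id (illustrative sketch)."""

    def __init__(self, perform_rollback):
        # perform_rollback(checkpoint_id, cascade) -> result dict (hypothetical)
        self._perform_rollback = perform_rollback
        self._results = {}  # rollback_id -> result of the first execution

    def handle(self, request: dict) -> dict:
        rollback_id = request["rollback_id"]
        if rollback_id in self._results:
            # Duplicate request: return the stored result, do not re-execute.
            return self._results[rollback_id]
        result = self._perform_rollback(
            request["checkpoint_id"], request.get("cascade", False))
        self._results[rollback_id] = result
        return result
```
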
# Security Considerations

Rollback requests are sensitive operations. Agents MUST
authenticate rollback requests via the ECT signature chain. Agents
SHOULD authorize rollback requests only from agents whose ECTs
appear in the same workflow DAG (identified by `wid`).

Checkpoint ECTs contain the `out_hash` of the agent state but not
the state itself. Agents MUST encrypt stored state snapshots at rest.

Circuit breaker status exposes system health topology. The
`/aerr/circuits` endpoint SHOULD be access-controlled.

Malicious agents could emit false error ECTs to trigger rollbacks.
Agents SHOULD verify that error ECTs reference valid checkpoint
`jti` values from their own workflow DAG before initiating
rollback. At L2 and L3, ECT signatures prevent forgery.

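
The check on incoming error ECTs can be sketched as follows. The
function name, the `local_ects` index (a mapping from `jti` to the
claims of ECTs the agent already holds), and the list encoding of
`par` are assumptions for illustration. Signature verification is
assumed to have happened before this function is called.

```python
def error_references_known_checkpoint(error_claims: dict,
                                      local_ects: dict,
                                      wid: str) -> bool:
    """True if an error ECT points at a checkpoint held for the same workflow.

    local_ects maps jti -> ECT claims known to this agent (hypothetical index).
    """
    if error_claims.get("exec_act") != "aerr:error":
        return False
    if error_claims.get("wid") != wid:
        return False  # error belongs to a different workflow
    for parent in error_claims.get("par", []):
        checkpoint = local_ects.get(parent)
        if (checkpoint is not None
                and checkpoint.get("exec_act") == "aerr:checkpoint"
                and checkpoint.get("wid") == wid):
            return True
    return False
```

An agent would decline to start a rollback when this returns `False`,
which limits what a forged or misdirected error ECT can trigger.
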
# IANA Considerations

This document requests the following IANA registrations:

1. An "AERR Error Type" registry under Specification Required
   policy. Initial entries: `action_failed`, `timeout`,
   `constraint_violation`, `resource_exhausted`,
   `upstream_cascade`, `circuit_open`, `unknown`.

2. Registration of `exec_act` values `aerr:checkpoint`,
   `aerr:error`, `aerr:rollback_request`, `aerr:rollback_complete`,
   `aerr:rollback_escalated`, `aerr:circuit_open`,
   `aerr:circuit_half_open`, `aerr:circuit_closed` in a future ECT
   action type registry.

--- back

# Acknowledgments
{:numbered="false"}

This document builds on the Execution Context Token specification
{{I-D.nennemann-wimse-ect}} for DAG-based audit trails and the
Agent Context Policy Token {{I-D.nennemann-agent-dag-hitl-safety}}
for HITL escalation of irreversible actions.