feat: add draft data, gap analysis report, and workspace config
@@ -0,0 +1,397 @@
---
title: "Agent Error Recovery and Rollback (AERR)"
abbrev: "AERR"
category: std
docname: draft-aerr-agent-error-recovery-rollback-00
submissiontype: IETF
number:
date:
v: 3
area: "OPS"
workgroup: "NMOP"
keyword:
  - error recovery
  - rollback
  - circuit breaker
  - agentic workflows
  - execution context

author:
  -
    fullname: Generated by IETF Draft Analyzer
    organization: Independent
    email: placeholder@example.com

normative:
  RFC7519:
  RFC7515:
  RFC9110:
  I-D.nennemann-wimse-ect:
    title: "Execution Context Tokens for Distributed Agentic Workflows"
    target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/

informative:
  I-D.nennemann-agent-dag-hitl-safety:
    title: "Agent Context Policy Token: DAG Delegation with Human Override"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/

--- abstract

This document defines the Agent Error Recovery and Rollback (AERR)
protocol, a standard for handling errors, cascading failures, and
rollback in multi-agent systems. AERR defines three mechanisms:
state checkpoints recorded as Execution Context Token (ECT) DAG
nodes, a circuit breaker pattern to contain cascading failures,
and a rollback protocol that walks the ECT DAG backwards to revert
agent actions to a known-good state. By building on ECT, AERR
inherits cryptographic audit trails, assurance levels, and DAG
validation without inventing parallel infrastructure.

--- middle

# Introduction

The IETF AI/agent landscape includes 60 drafts on autonomous
network operations but none that standardize error recovery. When
an autonomous agent misconfigures a router, allocates resources
incorrectly, or triggers a cascade of failures across a multi-agent
system, there is no standard mechanism for detecting the failure,
containing its blast radius, or reverting to a safe state.

AERR borrows proven patterns from distributed systems -- checkpoints
from database transactions, circuit breakers from microservice
architectures, rollback from version control -- and adapts them for
AI agent workflows. Rather than inventing its own audit and
tracing layer, AERR records all checkpoints, errors, and rollbacks
as ECT DAG nodes {{I-D.nennemann-wimse-ect}}, giving every
recovery action a cryptographic proof chain.

Design principles:

1. Agents that take consequential actions MUST be able to undo
   them, or MUST declare them irreversible upfront.
2. Failure containment takes priority over failure diagnosis.
3. The protocol adds minimal overhead to the happy path.

# Conventions and Definitions

{::boilerplate bcp14-tagged}

Checkpoint:
: An ECT recording an agent's state hash before a consequential
  action, providing a restore point for rollback.

Circuit Breaker:
: A mechanism that stops an agent from propagating requests to a
  failing downstream agent, preventing cascading failures.

Rollback:
: The process of reverting an agent's actions and state to a
  previously recorded checkpoint, walking the ECT DAG backwards.

Blast Radius:
: The set of agents and systems affected by a single agent's
  failure, determinable by traversing the ECT DAG forward from the
  failing node.

# Problem Statement

Consider a network operations scenario: Agent A instructs Agent B
to update firewall rules, which causes Agent C's traffic monitoring
to fail, which causes Agent D to misclassify traffic. Today each
agent handles errors independently. There is no standard way for
Agent D to signal that the root cause is upstream, for the cascade
to be halted, or for the chain of actions to be rolled back.

The ECT DAG {{I-D.nennemann-wimse-ect}} already records causal
ordering of agent actions via `par` references. AERR adds
checkpoint semantics, error propagation, and rollback operations
on top of this existing structure.

# Checkpoint Mechanism {#checkpoints}

An AERR-compliant agent MUST create a checkpoint ECT before any
action it classifies as consequential. An action is consequential
if it modifies external state (e.g., network config, database
records, API calls with side effects).

## Checkpoint as ECT

A checkpoint is an ECT with:

- `exec_act`: `"aerr:checkpoint"`
- `par`: the `jti` of the preceding task ECT in the workflow
- `out_hash`: SHA-256 hash of the agent's state snapshot at
  checkpoint time (for rollback integrity verification)

The `ext` claim carries AERR-specific metadata:

~~~json
{
  "ext": {
    "aerr.action_type": "config_update",
    "aerr.target": "router-07.example.com",
    "aerr.reversible": true,
    "aerr.rollback_uri": "https://agent-b.example.com/aerr/rollback",
    "aerr.ttl": 86400
  }
}
~~~
{: #fig-checkpoint title="Checkpoint ECT Extension Claims"}

The `aerr.reversible` field MUST be present. If `false`, the
agent declares that this action cannot be automatically undone
and rollback requests MUST be escalated to a human operator via
the HITL mechanism {{I-D.nennemann-agent-dag-hitl-safety}}.

Agents MAY create hierarchical checkpoints using the ECT DAG: a
parent checkpoint ECT with `par` references to multiple child
checkpoint ECTs. Rolling back the parent rolls back all children.

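As a non-normative illustration, the checkpoint claims listed above could be assembled as in the following Python sketch. Several details are assumptions made only for the example and are not defined by this document: the state snapshot is serialized as canonical JSON before hashing, `out_hash` is hex-encoded, `par` is modeled as a list, and the helper names `state_hash` and `build_checkpoint_claims` are hypothetical. Signing and the remaining ECT claims are omitted.

```python
import hashlib
import json
import uuid


def state_hash(state: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the state snapshot.

    Assumption: canonical JSON (sorted keys, no extra whitespace) and hex
    output; the document only says "SHA-256 hash of the state snapshot".
    """
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_checkpoint_claims(state, parent_jti, action_type, target,
                            reversible, rollback_uri=None, ttl=86400):
    """Assemble the AERR-relevant claims of a checkpoint ECT (unsigned)."""
    ext = {
        "aerr.action_type": action_type,
        "aerr.target": target,
        "aerr.reversible": reversible,  # MUST always be present
        "aerr.ttl": ttl,
    }
    if rollback_uri is not None:
        ext["aerr.rollback_uri"] = rollback_uri
    return {
        "jti": str(uuid.uuid4()),
        "exec_act": "aerr:checkpoint",
        "par": [parent_jti],            # assumption: par modeled as a list
        "out_hash": state_hash(state),
        "ext": ext,
    }
```

An agent would call `build_checkpoint_claims` immediately before the consequential action and persist both the claims and the snapshot, so that a later rollback can recompute the hash and compare it with `out_hash`.
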
## Checkpoint Storage

Checkpoint ECTs MUST be stored for at least the duration specified
by `aerr.ttl`. At L3 {{I-D.nennemann-wimse-ect}}, checkpoints
are automatically preserved in the audit ledger. At L1 and L2,
agents MUST store checkpoints in durable local storage that
survives agent restarts.

# Error Signaling {#error-signals}

When an agent detects an error, it MUST produce an error ECT and
propagate it to affected agents in the DAG.

## Error ECT

An error signal is an ECT with:

- `exec_act`: `"aerr:error"`
- `par`: the `jti` of the checkpoint ECT associated with the
  failing action

The `ext` claim carries error details:

~~~json
{
  "ext": {
    "aerr.severity": "critical",
    "aerr.error_type": "action_failed",
    "aerr.description": "BGP session did not establish",
    "aerr.checkpoint_id": "550e8400-e29b-41d4-a716-446655440001",
    "aerr.upstream_errors": []
  }
}
~~~
{: #fig-error title="Error ECT Extension Claims"}

Severity levels: `info`, `warning`, `error`, `critical`.

Error types: `action_failed`, `timeout`, `constraint_violation`,
`resource_exhausted`, `upstream_cascade`, `circuit_open`, `unknown`.

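The following non-normative Python sketch shows one way to assemble and validate the error claims described above. The function name `build_error_claims` is hypothetical, `par` is modeled as a list, and `aerr.checkpoint_id` is assumed to carry the same value as the checkpoint's `jti`; none of these choices is mandated by this document. The set of error types includes `circuit_open`, which the circuit breaker section and the IANA registry use.

```python
import uuid

SEVERITIES = ("info", "warning", "error", "critical")
ERROR_TYPES = (
    "action_failed", "timeout", "constraint_violation",
    "resource_exhausted", "upstream_cascade", "circuit_open", "unknown",
)


def build_error_claims(checkpoint_jti, severity, error_type,
                       description, upstream_errors=()):
    """Assemble the AERR-relevant claims of an error ECT (unsigned)."""
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity!r}")
    if error_type not in ERROR_TYPES:
        raise ValueError(f"unknown error type: {error_type!r}")
    return {
        "jti": str(uuid.uuid4()),
        "exec_act": "aerr:error",
        "par": [checkpoint_jti],  # checkpoint of the failing action
        "ext": {
            "aerr.severity": severity,
            "aerr.error_type": error_type,
            "aerr.description": description,
            # Assumption: checkpoint_id repeats the checkpoint's jti.
            "aerr.checkpoint_id": checkpoint_jti,
            "aerr.upstream_errors": list(upstream_errors),
        },
    }
```

Rejecting unknown severities and error types at construction time keeps malformed values out of the DAG; a receiver could apply the same two checks before acting on an error ECT.
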
## Error Propagation via DAG

When an agent receives an error ECT caused by an action it
initiated, it MUST either:

(a) Attempt automatic rollback of its checkpoint ({{rollback}}), or

(b) Escalate to its operator if the action was irreversible.

The `aerr.upstream_errors` array allows agents to chain error
context by referencing `jti` values of predecessor error ECTs,
building a causal trace from symptom to root cause through the
DAG.

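To make the causal trace concrete, the non-normative Python sketch below follows `aerr.upstream_errors` references from a symptom back to its root cause(s). It assumes the agent holds the relevant error ECT claims in a local mapping keyed by `jti`; the function name `trace_root_causes` and that mapping are hypothetical. An error whose upstream references are empty, or unknown locally, is treated as a root.

```python
def trace_root_causes(error_jti, errors_by_jti):
    """Return jti values of error ECTs with no known upstream error.

    errors_by_jti maps jti -> error ECT claims, each carrying
    ext["aerr.upstream_errors"] as a list of jti values.
    """
    roots, seen, stack = [], set(), [error_jti]
    while stack:
        jti = stack.pop()
        if jti in seen:
            continue  # guard against malformed (cyclic) references
        seen.add(jti)
        upstream = errors_by_jti[jti]["ext"]["aerr.upstream_errors"]
        known = [u for u in upstream if u in errors_by_jti]
        if not known:
            roots.append(jti)  # nothing further upstream: a root cause
        stack.extend(known)
    return roots


# Usage: the Agent B -> C -> D cascade from the problem statement.
errors = {
    "err-b": {"ext": {"aerr.upstream_errors": []}},
    "err-c": {"ext": {"aerr.upstream_errors": ["err-b"]}},
    "err-d": {"ext": {"aerr.upstream_errors": ["err-c"]}},
}
root_causes = trace_root_causes("err-d", errors)  # ["err-b"]
```
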
## HITL Escalation

When an error requires human intervention, the error ECT SHOULD
trigger a HITL rule per {{I-D.nennemann-agent-dag-hitl-safety}}.
Example policy:

~~~json
{
  "hitl": {
    "rules": [{
      "id": "r-critical-error",
      "trigger": {
        "kind": "keyword_match",
        "op": "eq",
        "value": "critical",
        "input_ref": "ext.aerr.severity"
      },
      "required_role": "operator:oncall",
      "action": "escalate",
      "allow_override": true,
      "override_action": "continue"
    }]
  }
}
~~~
{: #fig-hitl-error title="HITL Policy for Critical Errors"}

# Circuit Breaker Pattern {#circuit-breaker}

Each agent MUST implement a circuit breaker for every downstream
agent it communicates with.

## States

CLOSED (normal):
: Requests flow through. The agent tracks the error rate over a
  sliding window (default: 60 seconds).

OPEN (failure detected):
: When the error rate exceeds a threshold (default: 50% over the
  window), the breaker opens. All requests to the downstream
  agent are immediately rejected with `aerr.error_type`:
  `circuit_open`. The agent MUST produce an error ECT and emit
  it to upstream peers.

HALF-OPEN (recovery probe):
: After a cooldown period (default: 30 seconds), the breaker
  allows a single probe request. If it succeeds, the breaker
  returns to CLOSED. If it fails, it returns to OPEN with doubled
  cooldown (exponential backoff, max 300 seconds).

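The three states above can be sketched as a small state machine. The Python below is a non-normative illustration using the stated defaults (60-second window, 50% threshold, 30-second cooldown, 300-second cap). The class and method names are hypothetical, resetting the cooldown to its initial value on recovery is an assumption, and emission of the required error and state-change ECTs is left out.

```python
import time
from collections import deque


class CircuitBreaker:
    """Per-downstream-agent breaker: closed -> open -> half_open -> ..."""

    def __init__(self, window_s=60.0, threshold=0.5, cooldown_s=30.0,
                 max_cooldown_s=300.0, clock=time.monotonic):
        self.window_s = window_s
        self.threshold = threshold
        self.base_cooldown_s = cooldown_s
        self.cooldown_s = cooldown_s
        self.max_cooldown_s = max_cooldown_s
        self.clock = clock
        self.state = "closed"
        self.opened_at = None
        self.samples = deque()  # (timestamp, ok) pairs inside the window

    def allow_request(self):
        """Return True if a request may be sent downstream right now."""
        if self.state == "open":
            if self.clock() - self.opened_at >= self.cooldown_s:
                self.state = "half_open"  # cooldown over: one probe
                return True
            return False  # caller rejects with error type circuit_open
        if self.state == "half_open":
            return False  # the single probe is already in flight
        return True

    def record(self, ok):
        """Record the outcome of a request that was allowed through."""
        now = self.clock()
        if self.state == "open":
            return  # late result from before the breaker opened
        if self.state == "half_open":
            if ok:
                self.state = "closed"
                self.cooldown_s = self.base_cooldown_s  # assumption
                self.samples.clear()
            else:
                # Failed probe: reopen with doubled, capped cooldown.
                self.cooldown_s = min(self.cooldown_s * 2,
                                      self.max_cooldown_s)
                self._open(now)
            return
        self.samples.append((now, ok))
        while self.samples and now - self.samples[0][0] > self.window_s:
            self.samples.popleft()
        failures = sum(1 for _, success in self.samples if not success)
        if failures / len(self.samples) > self.threshold:
            self._open(now)

    def _open(self, now):
        self.state = "open"
        self.opened_at = now
```

Taken literally, the threshold rule opens the breaker on a single failed request when it is the only sample in the window; an implementation may want a minimum sample count before evaluating the error rate, which this document does not specify. The injectable `clock` parameter exists only to make the sketch testable.
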
## State Change ECTs

Each circuit breaker state change MUST produce an ECT:

- `exec_act`: `"aerr:circuit_open"`, `"aerr:circuit_half_open"`,
  or `"aerr:circuit_closed"`
- `par`: the `jti` of the error ECT that triggered the transition

This records the health topology of the agent network in the ECT
DAG, queryable from the audit ledger at L3.

## Observability

Agents MUST expose circuit breaker state at:

~~~
GET /aerr/circuits
~~~

Response:

~~~json
{
  "circuits": [{
    "downstream_agent": "spiffe://example.com/agent/router-mgr",
    "state": "open",
    "error_rate": 0.75,
    "last_failure_ect": "550e8400-e29b-41d4-a716-446655440099",
    "cooldown_remaining_s": 22
  }]
}
~~~
{: #fig-circuits title="Circuit Breaker Status"}

# Rollback Protocol {#rollback}

## Rollback Request

A rollback is initiated by sending an HTTP POST to the target
agent's rollback endpoint:

~~~
POST /aerr/rollback HTTP/1.1
Content-Type: application/json
Execution-Context: <rollback-request-ECT>

{
  "rollback_id": "urn:uuid:...",
  "checkpoint_id": "550e8400-e29b-41d4-a716-446655440001",
  "reason": "Upstream action caused cascading failure",
  "cascade": true
}
~~~
{: #fig-rollback-req title="Rollback Request"}

The request MUST include an ECT in the Execution-Context header
with `exec_act`: `"aerr:rollback_request"` and `par` referencing
the error ECT that motivated the rollback.

When `cascade` is `true`, the receiving agent MUST also initiate
rollback of any downstream checkpoints created as a consequence
of the checkpointed action. These downstream checkpoints are the
checkpoint ECTs whose `par` chain leads back to the checkpoint
being rolled back.

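As a non-normative illustration of the cascade rule, the Python sketch below inverts the `par` references and walks the DAG forward from a checkpoint, collecting every descendant checkpoint ECT. The same forward traversal yields the blast radius defined earlier. The function name `downstream_checkpoints`, the in-memory `ects` mapping, the `update_firewall` action name, and the modeling of `par` as a list are assumptions for the example.

```python
from collections import deque


def downstream_checkpoints(checkpoint_jti, ects):
    """Return jti values of checkpoint ECTs that descend from checkpoint_jti.

    ects maps jti -> claims; each claims dict has "exec_act" and a
    "par" list naming its predecessor ECTs.
    """
    # Invert par (child -> parents) into parent -> children.
    children = {}
    for jti, claims in ects.items():
        for parent in claims.get("par", []):
            children.setdefault(parent, []).append(jti)

    found, seen = [], {checkpoint_jti}
    queue = deque([checkpoint_jti])
    while queue:  # breadth-first walk forward through the DAG
        for child in children.get(queue.popleft(), []):
            if child in seen:
                continue
            seen.add(child)
            queue.append(child)
            if ects[child]["exec_act"] == "aerr:checkpoint":
                found.append(child)
    return found


# Usage: Agent B's checkpoint leads to checkpoints at Agents C and D.
ects = {
    "cp-b": {"exec_act": "aerr:checkpoint", "par": ["task-a"]},
    "act-b": {"exec_act": "update_firewall", "par": ["cp-b"]},
    "cp-c": {"exec_act": "aerr:checkpoint", "par": ["act-b"]},
    "cp-d": {"exec_act": "aerr:checkpoint", "par": ["cp-c"]},
    "cp-x": {"exec_act": "aerr:checkpoint", "par": ["task-z"]},
}
to_cascade = downstream_checkpoints("cp-b", ects)  # ["cp-c", "cp-d"]
```

The order in which cascaded rollbacks are executed is not specified by this document; the sketch returns checkpoints in discovery order only.
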
## Rollback Response

The agent produces a rollback result ECT with:

- `exec_act`: `"aerr:rollback_complete"` (or `"aerr:rollback_escalated"`)
- `par`: the `jti` of the rollback request ECT
- `out_hash`: SHA-256 hash of the agent's state after rollback

~~~json
{
  "ext": {
    "aerr.rollback_id": "urn:uuid:...",
    "aerr.status": "completed",
    "aerr.state_hash_before": "sha256:...",
    "aerr.state_hash_after": "sha256:...",
    "aerr.cascaded": [
      {"agent": "spiffe://example.com/agent/monitor", "status": "completed"},
      {"agent": "spiffe://example.com/agent/classify", "status": "escalated"}
    ]
  }
}
~~~
{: #fig-rollback-resp title="Rollback Result ECT"}

Status values: `completed`, `partial`, `escalated`, `failed`.

`escalated` means the action was irreversible and a human operator
has been notified via HITL. `partial` means some but not all
downstream rollbacks succeeded.

## Idempotency

Agents MUST implement idempotent rollback: receiving the same
`rollback_id` twice MUST return the same result without
re-executing the rollback.

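One way to meet the idempotency requirement is to key stored results by `rollback_id`, as in the non-normative Python sketch below. The class name `RollbackHandler` and the `do_rollback` callable are hypothetical, and the in-memory dictionary stands in for the durable store and concurrency control that a real agent would need.

```python
class RollbackHandler:
    """Replays the stored result when a rollback_id is seen again."""

    def __init__(self, do_rollback):
        # Hypothetical callable: checkpoint_id -> result dict.
        self._do_rollback = do_rollback
        self._results = {}  # rollback_id -> result of the first execution

    def handle(self, request):
        rollback_id = request["rollback_id"]
        if rollback_id in self._results:
            # Duplicate request: same answer, no re-execution.
            return self._results[rollback_id]
        result = self._do_rollback(request["checkpoint_id"])
        self._results[rollback_id] = result
        return result
```

Because the result is recorded only after `do_rollback` returns, a crash mid-rollback leaves no entry behind; how to recover from that case is outside the scope of this sketch.
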
# Security Considerations

Rollback requests are sensitive operations. Agents MUST
authenticate rollback requests via the ECT signature chain.
Agents SHOULD authorize a rollback request only if the requesting
agent's ECTs appear in the same workflow DAG (identified by `wid`).

Checkpoint ECTs contain the `out_hash` of the agent's state but
not the state itself. Agents MUST encrypt stored state snapshots
at rest.

Circuit breaker status exposes system health topology. The
`/aerr/circuits` endpoint SHOULD be access-controlled.

Malicious agents could emit false error ECTs to trigger rollbacks.
Agents SHOULD verify that error ECTs reference valid checkpoint
`jti` values from their own workflow DAG before initiating
rollback. At L2 and L3, ECT signatures prevent forgery.

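The check on incoming error ECTs can be illustrated with the non-normative Python sketch below. It covers only the reference check: signature verification is assumed to have happened already. The function name `may_trigger_rollback`, the `local_checkpoints` mapping of the agent's own checkpoint claims, and the modeling of `par` as a list are assumptions for the example.

```python
def may_trigger_rollback(error_claims, local_checkpoints, workflow_id):
    """True only if the error ECT points at one of this agent's own
    checkpoints within the given workflow (wid)."""
    if error_claims.get("exec_act") != "aerr:error":
        return False
    if error_claims.get("wid") != workflow_id:
        return False
    return any(
        parent in local_checkpoints
        and local_checkpoints[parent].get("wid") == workflow_id
        for parent in error_claims.get("par", [])
    )
```

An error ECT that fails this check would be ignored or logged rather than acted on, so a peer outside the workflow cannot cause a rollback merely by naming a plausible-looking `jti`.
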
# IANA Considerations

This document requests the following IANA registrations:

1. An "AERR Error Type" registry under Specification Required
   policy. Initial entries: `action_failed`, `timeout`,
   `constraint_violation`, `resource_exhausted`,
   `upstream_cascade`, `circuit_open`, `unknown`.

2. Registration of `exec_act` values `aerr:checkpoint`,
   `aerr:error`, `aerr:rollback_request`, `aerr:rollback_complete`,
   `aerr:rollback_escalated`, `aerr:circuit_open`,
   `aerr:circuit_half_open`, `aerr:circuit_closed` in a future
   ECT action type registry.

--- back

# Acknowledgments
{:numbered="false"}

This document builds on the Execution Context Token specification
{{I-D.nennemann-wimse-ect}} for DAG-based audit trails and the
Agent Context Policy Token {{I-D.nennemann-agent-dag-hitl-safety}}
for HITL escalation of irreversible actions.