---
title: "Agent Error Recovery and Rollback (AERR)"
abbrev: "AERR"
category: std
docname: draft-aerr-agent-error-recovery-rollback-00
submissiontype: IETF
number:
date:
v: 3
area: "OPS"
workgroup: "NMOP"
keyword:
  - error recovery
  - rollback
  - circuit breaker
  - agentic workflows
  - execution context

author:
  -
    fullname: Generated by IETF Draft Analyzer
    organization: Independent
    email: placeholder@example.com

normative:
  RFC7519:
  RFC7515:
  RFC9110:
  I-D.nennemann-wimse-ect:
    title: "Execution Context Tokens for Distributed Agentic Workflows"
    target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/

informative:
  I-D.nennemann-agent-dag-hitl-safety:
    title: "Agent Context Policy Token: DAG Delegation with Human Override"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/

--- abstract

This document defines the Agent Error Recovery and Rollback (AERR)
protocol, a standard for handling errors, cascading failures, and
rollback in multi-agent systems.  AERR defines three mechanisms:
state checkpoints recorded as Execution Context Token (ECT) DAG
nodes, a circuit breaker pattern to contain cascading failures,
and a rollback protocol that walks the ECT DAG backwards to revert
agent actions to a known-good state.  By building on ECT, AERR
inherits cryptographic audit trails, assurance levels, and DAG
validation without inventing parallel infrastructure.

--- middle

# Introduction

The IETF AI/agent landscape includes 60 drafts on autonomous
network operations but none that standardize error recovery.  When
an autonomous agent misconfigures a router, allocates resources
incorrectly, or triggers a cascade of failures across a multi-agent
system, there is no standard mechanism for detecting the failure,
containing its blast radius, or reverting to a safe state.

AERR borrows proven patterns from distributed systems -- checkpoints
from database transactions, circuit breakers from microservice
architectures, rollback from version control -- and adapts them for
AI agent workflows.  Rather than inventing its own audit and
tracing layer, AERR records all checkpoints, errors, and rollbacks
as ECT DAG nodes {{I-D.nennemann-wimse-ect}}, giving every
recovery action a cryptographic proof chain.

Design principles:

1. Agents that take consequential actions MUST be able to undo
   them, or MUST declare them irreversible upfront.
2. Failure containment takes priority over failure diagnosis.
3. The protocol adds minimal overhead to the happy path.

# Conventions and Definitions

{::boilerplate bcp14-tagged}

Checkpoint:
: An ECT recording an agent's state hash before a consequential
  action, providing a restore point for rollback.

Circuit Breaker:
: A mechanism that stops an agent from propagating requests to a
  failing downstream agent, preventing cascading failures.

Rollback:
: The process of reverting an agent's actions and state to a
  previously recorded checkpoint, walking the ECT DAG backwards.

Blast Radius:
: The set of agents and systems affected by a single agent's
  failure, determinable by traversing the ECT DAG forward from the
  failing node.

# Problem Statement

Consider a network operations scenario: Agent A instructs Agent B
to update firewall rules, which causes Agent C's traffic monitoring
to fail, which causes Agent D to misclassify traffic.  Today each
agent handles errors independently.  There is no standard way for
Agent D to signal that the root cause is upstream, for the cascade
to be halted, or for the chain of actions to be rolled back.

The ECT DAG {{I-D.nennemann-wimse-ect}} already records causal
ordering of agent actions via `par` references.  AERR adds
checkpoint semantics, error propagation, and rollback operations
on top of this existing structure.
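
Because `par` references encode causal ordering, the blast radius of a failure can be computed with a forward traversal of the DAG. The following is a minimal sketch, not a normative algorithm. It assumes ECT payloads are available as decoded dictionaries, that `par` is a list of parent `jti` values, and that the standard JWT `iss` claim identifies the emitting agent. The function names are hypothetical.

```python
from collections import defaultdict, deque


def blast_radius(ects, failing_jti):
    """Return the set of jti values reachable forward from failing_jti.

    Assumption: each ECT is a dict with a `jti` and a `par` list.
    """
    # Invert the `par` (child -> parent) links into parent -> children.
    children = defaultdict(list)
    for ect in ects:
        for parent in ect.get("par", []):
            children[parent].append(ect["jti"])

    affected, queue = set(), deque([failing_jti])
    while queue:
        current = queue.popleft()
        for child in children[current]:
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected


def affected_agents(ects, failing_jti):
    """Map the affected ECTs to the agents that issued them (assumes `iss`)."""
    by_jti = {ect["jti"]: ect for ect in ects}
    return {by_jti[jti]["iss"] for jti in blast_radius(ects, failing_jti)}
```

In the firewall scenario above, calling `affected_agents` with the `jti` of Agent B's action would yield Agents C and D.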

# Checkpoint Mechanism {#checkpoints}

An AERR-compliant agent MUST create a checkpoint ECT before any
action it classifies as consequential.  An action is consequential
if it modifies external state, e.g., by changing network
configuration, writing database records, or making API calls with
side effects.

## Checkpoint as ECT

A checkpoint is an ECT with:

- `exec_act`: `"aerr:checkpoint"`
- `par`: the `jti` of the preceding task ECT in the workflow
- `out_hash`: SHA-256 hash of the agent's state snapshot at
  checkpoint time (for rollback integrity verification)

The `ext` claim carries AERR-specific metadata:

~~~json
{
  "ext": {
    "aerr.action_type": "config_update",
    "aerr.target": "router-07.example.com",
    "aerr.reversible": true,
    "aerr.rollback_uri": "https://agent-b.example.com/aerr/rollback",
    "aerr.ttl": 86400
  }
}
~~~
{: #fig-checkpoint title="Checkpoint ECT Extension Claims"}

The `aerr.reversible` field MUST be present.  If `false`, the
agent declares that this action cannot be automatically undone
and rollback requests MUST be escalated to a human operator via
the HITL mechanism {{I-D.nennemann-agent-dag-hitl-safety}}.
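
As an illustration, the claims of a checkpoint ECT could be assembled as follows. This is a sketch under stated assumptions: the state snapshot is a JSON-serializable dictionary, `out_hash` is encoded as a hex SHA-256 digest of a canonical JSON form (the actual encoding is defined by the ECT specification), and `par` is a list. Signing the token is out of scope here. The helper names `state_hash` and `build_checkpoint_claims` are hypothetical.

```python
import hashlib
import json
import uuid


def state_hash(snapshot: dict) -> str:
    """SHA-256 over a canonical JSON form so key order does not matter."""
    canonical = json.dumps(snapshot, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_checkpoint_claims(parent_jti, snapshot, *, action_type, target,
                            reversible, rollback_uri=None, ttl=86400):
    """Build the unsigned claims of a checkpoint ECT."""
    ext = {
        "aerr.action_type": action_type,
        "aerr.target": target,
        "aerr.reversible": reversible,  # MUST always be present
        "aerr.ttl": ttl,
    }
    if rollback_uri is not None:
        ext["aerr.rollback_uri"] = rollback_uri
    return {
        "jti": str(uuid.uuid4()),
        "exec_act": "aerr:checkpoint",
        "par": [parent_jti],
        "out_hash": state_hash(snapshot),
        "ext": ext,
    }
```

Hashing a canonical form means that the same state always yields the same `out_hash`, which is what allows the hash to be compared after a rollback.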

Agents MAY create hierarchical checkpoints using the ECT DAG:
multiple child checkpoint ECTs whose `par` references the same
parent checkpoint ECT.  Rolling back the parent rolls back all
children.

## Checkpoint Storage

Checkpoint ECTs MUST be stored for at least the duration specified
by `aerr.ttl`.  At assurance level L3 {{I-D.nennemann-wimse-ect}},
checkpoints are automatically preserved in the audit ledger.  At
L1 and L2, agents MUST store checkpoints in durable local storage
that survives agent restarts.

# Error Signaling {#error-signals}

When an agent detects an error, it MUST produce an error ECT and
propagate it to affected agents in the DAG.

## Error ECT

An error signal is an ECT with:

- `exec_act`: `"aerr:error"`
- `par`: the `jti` of the checkpoint ECT associated with the
  failing action

The `ext` claim carries error details:

~~~json
{
  "ext": {
    "aerr.severity": "critical",
    "aerr.error_type": "action_failed",
    "aerr.description": "BGP session did not establish",
    "aerr.checkpoint_id": "550e8400-e29b-41d4-a716-446655440001",
    "aerr.upstream_errors": []
  }
}
~~~
{: #fig-error title="Error ECT Extension Claims"}

Severity levels: `info`, `warning`, `error`, `critical`.

Error types: `action_failed`, `timeout`, `constraint_violation`,
`resource_exhausted`, `upstream_cascade`, `circuit_open`, `unknown`.

## Error Propagation via DAG

When an agent receives an error ECT caused by an action it
initiated, it MUST either:

(a) Attempt automatic rollback of its checkpoint ({{rollback}}), or

(b) Escalate to its operator if the action was irreversible.

The `aerr.upstream_errors` array allows agents to chain error
context by referencing `jti` values of predecessor error ECTs,
building a causal trace from symptom to root cause through the
DAG.
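
The causal trace can be followed programmatically by walking `aerr.upstream_errors` from a symptom until errors with no upstream references are reached. The sketch below assumes error ECTs are available as decoded dictionaries keyed by `jti`. The function name `root_causes` is hypothetical. References to errors the agent has not seen are skipped rather than treated as roots.

```python
def root_causes(error_ects, symptom_jti):
    """Return jti values of errors with no upstream errors, starting
    from the error ECT identified by symptom_jti."""
    by_jti = {ect["jti"]: ect for ect in error_ects}
    roots, seen, stack = [], set(), [symptom_jti]
    while stack:
        jti = stack.pop()
        if jti in seen or jti not in by_jti:
            continue  # already visited, or not known to this agent
        seen.add(jti)
        upstream = by_jti[jti]["ext"].get("aerr.upstream_errors", [])
        if upstream:
            stack.extend(upstream)
        else:
            roots.append(jti)
    return roots
```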

## HITL Escalation

When an error requires human intervention, the error ECT SHOULD
trigger a HITL rule per {{I-D.nennemann-agent-dag-hitl-safety}}.
Example policy:

~~~json
{
  "hitl": {
    "rules": [{
      "id": "r-critical-error",
      "trigger": {
        "kind": "keyword_match",
        "op": "eq",
        "value": "critical",
        "input_ref": "ext.aerr.severity"
      },
      "required_role": "operator:oncall",
      "action": "escalate",
      "allow_override": true,
      "override_action": "continue"
    }]
  }
}
~~~
{: #fig-hitl-error title="HITL Policy for Critical Errors"}
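
How a trigger such as the one above is evaluated is defined by the HITL specification, not by this document. Purely as an illustration, the sketch below assumes that the first segment of `input_ref` names a top-level claim and that the remainder is a flat key inside it, so `ext.aerr.severity` resolves to the `aerr.severity` member of `ext`. Only the `eq` operator is handled. Both function names are hypothetical.

```python
def resolve_input_ref(claims, input_ref):
    """Resolve e.g. 'ext.aerr.severity' against decoded ECT claims.

    Assumption: split on the first dot only, because AERR keys such as
    'aerr.severity' themselves contain dots.
    """
    claim, _, key = input_ref.partition(".")
    value = claims.get(claim)
    if not key:
        return value
    return value.get(key) if isinstance(value, dict) else None


def rule_triggers(rule, claims):
    """Return True if the rule's trigger matches the given claims."""
    trigger = rule["trigger"]
    if trigger["op"] != "eq":
        raise ValueError("only the 'eq' operator is sketched here")
    return resolve_input_ref(claims, trigger["input_ref"]) == trigger["value"]
```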

# Circuit Breaker Pattern {#circuit-breaker}

Each agent MUST implement a circuit breaker for every downstream
agent it communicates with.

## States

CLOSED (normal):
: Requests flow through.  The agent tracks the error rate over a
  sliding window (default: 60 seconds).

OPEN (failure detected):
: When the error rate exceeds a threshold (default: 50% over the
  window), the breaker opens.  All requests to the downstream
  agent are immediately rejected with `aerr.error_type`:
  `circuit_open`.  The agent MUST produce an error ECT and emit
  it to upstream peers.

HALF-OPEN (recovery probe):
: After a cooldown period (default: 30 seconds), the breaker
  allows a single probe request.  If it succeeds, the breaker
  returns to CLOSED.  If it fails, it returns to OPEN with doubled
  cooldown (exponential backoff, max 300 seconds).
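
The three states can be sketched as a small state machine using the defaults given above. This is an illustrative sketch rather than a reference implementation. The class and method names are hypothetical, the clock is injectable for testing, and two behaviors are assumptions not stated in this document: the cooldown resets to its default when the breaker closes, and no minimum sample count is required before the error rate is evaluated.

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, window_s=60.0, threshold=0.5, cooldown_s=30.0,
                 max_cooldown_s=300.0, clock=time.monotonic):
        self.window_s = window_s
        self.threshold = threshold
        self.base_cooldown_s = cooldown_s
        self.cooldown_s = cooldown_s
        self.max_cooldown_s = max_cooldown_s
        self.clock = clock
        self.state = "closed"
        self.opened_at = 0.0
        self.probe_in_flight = False
        self.results = deque()  # (timestamp, succeeded) pairs

    def allow_request(self):
        """Return True if a request may be sent downstream."""
        now = self.clock()
        if self.state == "open" and now - self.opened_at >= self.cooldown_s:
            self.state = "half_open"
            self.probe_in_flight = False
        if self.state == "closed":
            return True
        if self.state == "half_open" and not self.probe_in_flight:
            self.probe_in_flight = True  # allow exactly one probe
            return True
        return False  # caller rejects with error type "circuit_open"

    def record(self, succeeded):
        """Record the outcome of a request that was allowed through."""
        now = self.clock()
        if self.state == "open":
            return  # ignore late results while open
        if self.state == "half_open":
            if succeeded:
                self.state = "closed"
                self.cooldown_s = self.base_cooldown_s
                self.results.clear()
            else:
                self.cooldown_s = min(self.cooldown_s * 2,
                                      self.max_cooldown_s)
                self._open(now)
            return
        self.results.append((now, succeeded))
        while self.results and now - self.results[0][0] > self.window_s:
            self.results.popleft()  # drop samples outside the window
        failures = sum(1 for _, ok in self.results if not ok)
        if failures / len(self.results) > self.threshold:
            self._open(now)

    def _open(self, now):
        self.state = "open"
        self.opened_at = now
        self.probe_in_flight = False
```

A caller would check `allow_request()` before each downstream call and report the outcome with `record()`. Emitting the error ECT when the breaker opens is left to the caller.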

## State Change ECTs

Each circuit breaker state change MUST produce an ECT:

- `exec_act`: `"aerr:circuit_open"`, `"aerr:circuit_half_open"`,
  or `"aerr:circuit_closed"`
- `par`: the `jti` of the error ECT that triggered the transition

This records the health topology of the agent network in the ECT
DAG, queryable from the audit ledger at L3.

## Observability

Agents MUST expose circuit breaker state at:

~~~
GET /aerr/circuits
~~~

Response:

~~~json
{
  "circuits": [{
    "downstream_agent": "spiffe://example.com/agent/router-mgr",
    "state": "open",
    "error_rate": 0.75,
    "last_failure_ect": "550e8400-e29b-41d4-a716-446655440099",
    "cooldown_remaining_s": 22
  }]
}
~~~
{: #fig-circuits title="Circuit Breaker Status"}

# Rollback Protocol {#rollback}

## Rollback Request

A rollback is initiated by sending an HTTP POST to the target
agent's rollback endpoint:

~~~
POST /aerr/rollback HTTP/1.1
Host: agent-b.example.com
Content-Type: application/json
Execution-Context: <rollback-request-ECT>

{
  "rollback_id": "urn:uuid:...",
  "checkpoint_id": "550e8400-e29b-41d4-a716-446655440001",
  "reason": "Upstream action caused cascading failure",
  "cascade": true
}
~~~
{: #fig-rollback-req title="Rollback Request"}

The request MUST include an ECT in the Execution-Context header
with `exec_act`: `"aerr:rollback_request"` and `par` referencing
the error ECT that motivated the rollback.
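
A client could construct such a request as sketched below, using only the Python standard library. The sketch assumes `request_ect` is an already signed, serialized ECT with `exec_act` set to `"aerr:rollback_request"`. Producing that token is out of scope, and the function name `build_rollback_request` is hypothetical.

```python
import json
import urllib.request
import uuid


def build_rollback_request(rollback_uri, checkpoint_id, reason,
                           request_ect, cascade=True):
    """Build (but do not send) the HTTP POST for a rollback request."""
    body = {
        "rollback_id": uuid.uuid4().urn,  # "urn:uuid:<random UUID>"
        "checkpoint_id": checkpoint_id,
        "reason": reason,
        "cascade": cascade,
    }
    return urllib.request.Request(
        rollback_uri,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Execution-Context": request_ect,
        },
        method="POST",
    )
```

The returned object can be sent with `urllib.request.urlopen`. A caller that retries after a timeout would need to resend the same body, since a fresh `rollback_id` identifies a new rollback.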

When `cascade` is `true`, the receiving agent MUST also initiate
rollback of any downstream checkpoints created as a consequence
of the checkpointed action.  The ECT DAG's `par` chain identifies
these downstream actions.
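
One way to derive the cascade set is a depth-first walk over the inverted `par` links, keeping only checkpoint ECTs. The sketch below returns them most-downstream first, on the assumption (not mandated by this document) that later actions should be undone before the actions that caused them. ECTs are assumed to be decoded dictionaries with `par` as a list. The function name `cascade_targets` is hypothetical.

```python
from collections import defaultdict


def cascade_targets(ects, checkpoint_jti):
    """Return jti values of downstream checkpoints, descendants first."""
    by_jti = {ect["jti"]: ect for ect in ects}
    children = defaultdict(list)
    for ect in ects:
        for parent in ect.get("par", []):
            children[parent].append(ect["jti"])

    order, seen = [], set()

    def visit(jti):
        if jti in seen:
            return
        seen.add(jti)
        for child in children.get(jti, []):
            visit(child)
        order.append(jti)  # post-order: descendants are appended first

    visit(checkpoint_jti)
    return [
        jti for jti in order
        if jti != checkpoint_jti
        and by_jti[jti]["exec_act"] == "aerr:checkpoint"
    ]
```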

## Rollback Response

The agent produces a rollback result ECT with:

- `exec_act`: `"aerr:rollback_complete"` (or `"aerr:rollback_escalated"`)
- `par`: the `jti` of the rollback request ECT
- `out_hash`: SHA-256 hash of the agent's state after rollback

~~~json
{
  "ext": {
    "aerr.rollback_id": "urn:uuid:...",
    "aerr.status": "completed",
    "aerr.state_hash_before": "sha256:...",
    "aerr.state_hash_after": "sha256:...",
    "aerr.cascaded": [
      {"agent": "spiffe://example.com/agent/monitor", "status": "completed"},
      {"agent": "spiffe://example.com/agent/classify", "status": "escalated"}
    ]
  }
}
~~~
{: #fig-rollback-resp title="Rollback Result ECT"}

Status values: `completed`, `partial`, `escalated`, `failed`.

`escalated` means the action was irreversible and a human operator
has been notified via HITL.  `partial` means some but not all
downstream rollbacks succeeded.

## Idempotency

Agents MUST implement idempotent rollback: receiving the same
`rollback_id` twice MUST return the same result without
re-executing the rollback.
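
A minimal way to meet this requirement is to cache results by `rollback_id`, as sketched below. The class name `RollbackHandler` and the injected `execute` callable are hypothetical. The in-memory dictionary is an assumption made for brevity. A real agent would likely persist results so that idempotency survives restarts.

```python
import threading


class RollbackHandler:
    def __init__(self, execute):
        self._execute = execute  # callable performing the actual rollback
        self._results = {}       # rollback_id -> stored result
        self._lock = threading.Lock()

    def handle(self, request):
        """Run the rollback once per rollback_id and replay the result."""
        rollback_id = request["rollback_id"]
        with self._lock:
            if rollback_id not in self._results:
                self._results[rollback_id] = self._execute(request)
            return self._results[rollback_id]
```

Holding the lock while executing serializes all rollbacks on the agent. That keeps the sketch simple, at the cost of concurrency.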

# Security Considerations

Rollback requests are sensitive operations.  Agents MUST
authenticate rollback requests via the ECT signature chain.  Agents
SHOULD authorize a rollback request only if the requester's ECTs
appear in the same workflow DAG (identified by `wid`).

Checkpoint ECTs contain the `out_hash` of agent state but not the
state itself, so the snapshots needed for rollback are held by the
agent.  Agents MUST encrypt stored state snapshots at rest.

Circuit breaker status exposes system health topology.  The
`/aerr/circuits` endpoint SHOULD be access-controlled.

Malicious agents could emit false error ECTs to trigger rollbacks.
Agents SHOULD verify that error ECTs reference valid checkpoint
`jti` values from their own workflow DAG before initiating
rollback.  At L2 and L3, ECT signatures prevent forgery.
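
The checks described in this section could be combined as sketched below. The sketch assumes ECT signatures have already been verified, that claims are decoded dictionaries, that `par` is a list, and that the agent keeps its own checkpoints in a mapping keyed by `jti`. The function name `may_rollback` is hypothetical.

```python
def may_rollback(request_ect, error_ect, local_checkpoints):
    """Decide whether a verified rollback request should be honored."""
    checkpoint_id = error_ect.get("ext", {}).get("aerr.checkpoint_id")
    checkpoint = local_checkpoints.get(checkpoint_id)
    if checkpoint is None:
        return False  # error does not reference one of our checkpoints
    wid = checkpoint.get("wid")
    return (
        wid is not None
        # requester and error must belong to the checkpoint's workflow
        and request_ect.get("wid") == wid
        and error_ect.get("wid") == wid
        and request_ect.get("exec_act") == "aerr:rollback_request"
        # the request must name the motivating error ECT as a parent
        and error_ect.get("jti") in request_ect.get("par", [])
    )
```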

# IANA Considerations

This document requests the following IANA registrations:

1. An "AERR Error Type" registry under Specification Required
   policy.  Initial entries: `action_failed`, `timeout`,
   `constraint_violation`, `resource_exhausted`,
   `upstream_cascade`, `circuit_open`, `unknown`.

2. Registration of `exec_act` values `aerr:checkpoint`,
   `aerr:error`, `aerr:rollback_request`, `aerr:rollback_complete`,
   `aerr:rollback_escalated`, `aerr:circuit_open`,
   `aerr:circuit_half_open`, `aerr:circuit_closed` in a future ECT
   action type registry.

--- back

# Acknowledgments
{:numbered="false"}

This document builds on the Execution Context Token specification
{{I-D.nennemann-wimse-ect}} for DAG-based audit trails and the
Agent Context Policy Token {{I-D.nennemann-agent-dag-hitl-safety}}
for HITL escalation of irreversible actions.