---
title: "Agent Failure Cascade Prevention and Rollback"
abbrev: "Agent Cascade Prevention"
category: std
docname: draft-nennemann-agent-cascade-prevention-00
submissiontype: IETF
number:
date:
v: 3
area: "OPS"
workgroup: "NMOP"
keyword:
 - cascade prevention
 - circuit breaker
 - rollback
 - failure domain
 - agent recovery

author:
 -
    fullname: Christian Nennemann
    organization: Independent Researcher
    email: ietf@nennemann.de

normative:
  RFC2119:
  RFC8174:
  RFC7519:
  RFC7515:
  RFC9110:
  I-D.nennemann-wimse-ect:
    title: "Execution Context Tokens for Distributed Agentic Workflows"
    target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/
  I-D.nennemann-agent-dag-hitl-safety:
    title: "Agent Context Policy Token: DAG Delegation with Human Override"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/

informative:
  I-D.nennemann-agent-gap-analysis:
    title: "Gap Analysis of IETF Standards for Autonomous AI Agent Networking"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-gap-analysis/

--- abstract

This document defines protocols for preventing agent failures from
cascading across interconnected autonomous systems, together with
standardized mechanisms for real-time rollback of incorrect agent
decisions. It specifies a circuit breaker protocol with well-defined
state transitions, failure domain isolation through bulkhead patterns,
cascade detection via error rate and latency analysis, and a
distributed rollback coordination protocol that walks the Execution
Context Token (ECT) DAG backwards to revert agent actions to a
known-good state. This document absorbs and supersedes the concepts
introduced in the earlier AERR and ATD proposals.

--- middle

# Introduction

Autonomous AI agents increasingly operate in interconnected
multi-agent systems where a single agent's failure can propagate
through the network, causing widespread service disruption. The IETF
gap analysis {{I-D.nennemann-agent-gap-analysis}} identified two
critical gaps in existing standards:

- **Gap 2 (Cascade Prevention)**: No standard mechanism exists for
  containing failures within agent ecosystems. When one agent fails,
  dependent agents continue sending requests to the failing agent,
  amplifying the failure across the system.

- **Gap 4 (Rollback)**: No standard protocol exists for reverting
  incorrect agent decisions. When an autonomous agent misconfigures
  a network device or makes an erroneous API call, there is no
  interoperable way to undo the action or coordinate rollback across
  multiple affected agents.

This document addresses both gaps by defining:

1. A circuit breaker protocol that stops failure propagation between
   agents.
2. Failure domain isolation mechanisms that contain the blast radius.
3. Cascade detection signals that identify propagating failures early.
4. A distributed rollback protocol that coordinates state reversion
   across multiple agents using the ECT DAG
   {{I-D.nennemann-wimse-ect}}.

This specification absorbs and supersedes the concepts from the
earlier Agent Error Recovery and Rollback (AERR) and Agent Task DAG
(ATD) proposals, consolidating cascade prevention and rollback into a
single coherent protocol built on ECT infrastructure.

Design principles:

1. Agents that take consequential actions MUST be able to undo them,
   or MUST declare them irreversible upfront.
2. Failure containment takes priority over failure diagnosis.
3. The protocol adds minimal overhead to the happy path.
4. All cascade prevention and rollback actions are recorded as ECT
   nodes, providing a cryptographic audit trail.

# Terminology

{::boilerplate bcp14-tagged}

Circuit Breaker:
: A mechanism that stops an agent from propagating requests to a
  failing downstream agent, preventing cascading failures. Modeled
  after the electrical circuit breaker pattern used in microservice
  architectures.

Failure Domain:
: A bounded set of agents and resources within which a failure is
  contained. Failures within a domain MUST NOT propagate beyond the
  domain boundary without explicit escalation.

Blast Radius:
: The set of agents and systems affected by a single agent's failure,
  determinable by traversing the ECT DAG forward from the failing
  node.

Cascade Detection:
: The process of identifying that a failure is propagating across
  agent boundaries, using signals such as error rate spikes, latency
  increases, and resource exhaustion patterns.

Rollback Coordinator:
: An agent or orchestrator responsible for coordinating distributed
  rollback across multiple agents in a workflow, ensuring consistency
  and resolving conflicts.

Checkpoint:
: An ECT node recording an agent's state hash before a consequential
  action, providing a restore point for rollback.

Compensating Action:
: An action that semantically reverses the effect of a prior action
  when direct state restoration is not possible (e.g., deleting a
  resource that was created, rather than restoring a pre-creation
  snapshot).

Recovery Point:
: The most recent checkpoint in the ECT DAG to which an agent or
  workflow can be safely rolled back without violating consistency
  constraints.

# Failure Cascade Prevention

## Cascade Model

When an agent fails in a multi-agent system, the failure can
propagate through multiple vectors. The following diagram
illustrates a typical cascade scenario:

~~~
Agent A          Agent B          Agent C          Agent D
   |                |                |                |
   |    request     |                |                |
   |--------------->|                |                |
   |                |    request     |                |
   |                |--------------->|                |
   |                |                |    request     |
   |                |                |--------------->|
   |                |                |                |
   |                |                |    FAILURE     |
   |                |                |<--- X ---------|
   |                |                |                |
   |                | error/timeout  |                |
   |                |<---------------|                |
   |                |                |                |
   | error/timeout  |                |                |
   |<---------------|                |                |
   |                |                |                |
   |  [CASCADE: all agents impacted by D's failure]   |
   |                |                |                |
~~~
{: #fig-cascade title="Failure Cascade Propagation"}

### Failure Domain Taxonomy

Failures in agent ecosystems fall into the following categories:

Agent-Local Failure:
: A failure confined to a single agent instance (e.g., out-of-memory,
  logic error). The blast radius is limited to the agent itself and
  its immediate callers.

Service Failure:
: A failure affecting all instances of a particular agent service
  (e.g., model endpoint unavailable). The blast radius includes all
  agents that depend on the failing service.

Infrastructure Failure:
: A failure in shared infrastructure (e.g., network partition,
  certificate authority unavailable). The blast radius may span
  multiple failure domains.

Semantic Failure:
: An agent produces incorrect output without raising an error (e.g.,
  misconfiguration, wrong decision). This is the hardest category
  to detect and may propagate silently through the DAG.

### Propagation Vectors in Agent Ecosystems

Failures propagate through the following vectors:

1. **Synchronous request chains**: An agent blocks waiting for a
   failing downstream agent, causing its own callers to time out.

2. **Shared state corruption**: An agent writes incorrect data to a
   shared store, causing other agents reading that data to fail or
   make incorrect decisions.

3. **Resource exhaustion**: A failing agent consumes excessive
   resources (connections, memory, compute), starving healthy agents.

4. **Retry amplification**: Multiple agents retry requests to a
   failing agent simultaneously, overwhelming it further.

## Circuit Breaker Protocol

Each agent MUST implement a circuit breaker for every downstream
agent it communicates with.

### States

The circuit breaker has three states:

CLOSED (normal operation):
: Requests flow through normally. The agent tracks the error rate
  over a sliding window (default: 60 seconds).

OPEN (failure detected):
: When the error rate exceeds the configured threshold (default: 50%
  over the window), the breaker opens. All requests to the
  downstream agent are immediately rejected locally. The agent
  MUST emit an ECT with `exec_act` value `"circuit_breaker_open"`.

HALF_OPEN (recovery probe):
: After a cooldown period (default: 30 seconds), the breaker
  transitions to HALF_OPEN and allows a single probe request. If
  the probe succeeds, the breaker returns to CLOSED. If the probe
  fails, the breaker returns to OPEN with doubled cooldown
  (exponential backoff, maximum 300 seconds).

When a probe succeeds in the HALF_OPEN state, the breaker returns
to CLOSED and the agent MUST emit an ECT with `exec_act` value
`"circuit_breaker_close"`.

### State Transition Rules

~~~
            error_rate > threshold
 CLOSED ────────────────────────────────► OPEN
    ▲                                       │
    │ probe succeeds                        │ cooldown expires
    │                                       ▼
    └─────────────────────────────────  HALF_OPEN
                                            │
                              probe fails   │
                                            ▼
                                          OPEN
                                    (cooldown *= 2,
                                     max 300s)
~~~
{: #fig-circuit-fsm title="Circuit Breaker State Machine"}

The following rules govern state transitions:

1. CLOSED to OPEN: The error rate over the sliding window exceeds
   the configured threshold. The agent MUST emit a
   `"circuit_breaker_open"` ECT and reject all subsequent requests
   to the downstream agent.

2. OPEN to HALF_OPEN: The cooldown timer expires. The agent MUST
   allow exactly one probe request through.

3. HALF_OPEN to CLOSED: The probe request succeeds. The agent MUST
   emit a `"circuit_breaker_close"` ECT and resume normal operation.
   The error rate counters MUST be reset.

4. HALF_OPEN to OPEN: The probe request fails. The cooldown period
   MUST be doubled (up to a maximum of 300 seconds).

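The state machine above can be sketched in code. The following
non-normative Python sketch uses this section's defaults (50%
threshold, 60-second window, 30-second cooldown doubling to a
300-second cap); class and method names are illustrative, and the
points where `"circuit_breaker_open"` / `"circuit_breaker_close"`
ECTs would be emitted are marked with comments.

```python
import time

class CircuitBreaker:
    """Per-downstream-agent breaker implementing the CLOSED / OPEN /
    HALF_OPEN transitions described above (illustrative sketch)."""

    def __init__(self, threshold=0.5, window_s=60,
                 cooldown_s=30, max_cooldown_s=300):
        self.threshold = threshold
        self.window_s = window_s
        self.base_cooldown_s = cooldown_s
        self.cooldown_s = cooldown_s
        self.max_cooldown_s = max_cooldown_s
        self.state = "CLOSED"
        self.opened_at = None
        self.events = []  # (timestamp, ok) samples inside the window

    def _error_rate(self, now):
        # Drop samples that fell out of the sliding window.
        self.events = [(t, ok) for (t, ok) in self.events
                       if now - t <= self.window_s]
        if not self.events:
            return 0.0
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / len(self.events)

    def allow_request(self, now=None):
        now = time.monotonic() if now is None else now
        if self.state == "OPEN" and now - self.opened_at >= self.cooldown_s:
            self.state = "HALF_OPEN"
            return True          # OPEN -> HALF_OPEN: the single probe
        return self.state == "CLOSED"

    def record(self, ok, now=None):
        now = time.monotonic() if now is None else now
        if self.state == "HALF_OPEN":
            if ok:
                # HALF_OPEN -> CLOSED: emit "circuit_breaker_close" ECT
                self.state = "CLOSED"
                self.cooldown_s = self.base_cooldown_s
                self.events = []  # reset error-rate counters
            else:
                # HALF_OPEN -> OPEN: double the cooldown, capped at 300s
                self.state = "OPEN"
                self.opened_at = now
                self.cooldown_s = min(self.cooldown_s * 2,
                                      self.max_cooldown_s)
            return
        self.events.append((now, ok))
        if self.state == "CLOSED" and self._error_rate(now) > self.threshold:
            # CLOSED -> OPEN: emit "circuit_breaker_open" ECT
            self.state = "OPEN"
            self.opened_at = now
```

An error rate of exactly 50% does not trip the breaker; the rate must
exceed the threshold, matching rule 1 above.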
### Circuit Breaker Registration and Discovery

Agents MUST expose circuit breaker state at a well-known endpoint:

~~~
GET /.well-known/cascade/circuits HTTP/1.1
Host: agent.example.com
~~~

Response:

~~~json
{
  "circuits": [
    {
      "downstream_agent": "spiffe://example.com/agent/router-mgr",
      "state": "open",
      "error_rate": 0.75,
      "window_s": 60,
      "last_failure_ect": "550e8400-e29b-41d4-a716-446655440099",
      "cooldown_remaining_s": 22
    }
  ]
}
~~~
{: #fig-circuits title="Circuit Breaker Status Endpoint"}

### ECT Integration

Each circuit breaker state change MUST produce an ECT node:

~~~json
{
  "jti": "cb-open-uuid",
  "exec_act": "circuit_breaker_open",
  "par": ["error-ect-uuid"],
  "ext": {
    "cascade.downstream_agent":
      "spiffe://example.com/agent/router-mgr",
    "cascade.error_rate": 0.75,
    "cascade.window_s": 60,
    "cascade.cooldown_s": 30
  }
}
~~~
{: #fig-cb-ect title="Circuit Breaker Open ECT"}

~~~json
{
  "jti": "cb-close-uuid",
  "exec_act": "circuit_breaker_close",
  "par": ["cb-open-uuid"],
  "ext": {
    "cascade.downstream_agent":
      "spiffe://example.com/agent/router-mgr",
    "cascade.total_cooldown_s": 30
  }
}
~~~
{: #fig-cb-close-ect title="Circuit Breaker Close ECT"}

## Failure Domain Isolation

### Blast Radius Containment Strategies

Agents MUST implement the following containment strategies:

1. **Request rejection at the boundary**: When a circuit breaker
   opens, the agent MUST return a structured error to its callers
   indicating that the downstream dependency is unavailable, rather
   than propagating the failure.

2. **Timeout enforcement**: Agents MUST enforce timeouts on all
   downstream requests. The timeout MUST be shorter than the
   caller's timeout to prevent timeout cascades.

3. **Graceful degradation**: When a non-critical downstream agent
   is unavailable, agents SHOULD continue operating with reduced
   functionality rather than failing entirely.

### Domain Boundary Enforcement

Failure domains are defined by the workflow topology in the ECT DAG.
Each workflow (identified by the `wid` claim) constitutes a failure
domain. Cross-workflow failures MUST be escalated through the HITL
mechanism {{I-D.nennemann-agent-dag-hitl-safety}} rather than
propagating automatically.

Agents at domain boundaries MUST:

1. Validate all incoming requests against the circuit breaker state
   of their downstream dependencies before accepting work.
2. Emit a `"circuit_breaker_open"` ECT when rejecting work due to
   downstream unavailability.
3. Report domain health status via the circuits endpoint.

### Bulkhead Patterns for Agent Pools

When multiple workflows share a common agent pool, the pool MUST
implement bulkhead isolation:

1. **Connection limits**: Each workflow MUST have a maximum number
   of concurrent connections to the shared agent pool.

2. **Queue isolation**: Each workflow's requests MUST be queued
   independently, preventing one workflow's backlog from blocking
   others.

3. **Resource quotas**: Shared agent pools SHOULD enforce per-workflow
   resource quotas (CPU, memory, request rate).

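The connection-limit and queue-isolation requirements above can be
sketched as follows. This non-normative Python sketch keys both a
concurrency limit and a bounded queue on the workflow identifier
(`wid`); the class name, limits, and `submit` interface are
illustrative assumptions, not part of the protocol.

```python
import threading
import queue

class WorkflowBulkhead:
    """Per-workflow isolation for a shared agent pool: a bounded
    semaphore caps concurrent connections, and each workflow gets
    its own bounded queue so one workflow's backlog cannot block
    another's (illustrative sketch)."""

    def __init__(self, max_concurrent=4, max_queued=16):
        self.max_concurrent = max_concurrent
        self.max_queued = max_queued
        self._slots = {}    # wid -> BoundedSemaphore (connection limit)
        self._queues = {}   # wid -> bounded Queue (queue isolation)
        self._lock = threading.Lock()

    def _for_wid(self, wid):
        with self._lock:
            if wid not in self._slots:
                self._slots[wid] = threading.BoundedSemaphore(
                    self.max_concurrent)
                self._queues[wid] = queue.Queue(maxsize=self.max_queued)
            return self._slots[wid], self._queues[wid]

    def submit(self, wid, request):
        """Queue a request for one workflow; a full queue sheds load
        for that workflow only, never for its neighbours."""
        _, q = self._for_wid(wid)
        try:
            q.put_nowait(request)
            return True
        except queue.Full:
            return False
```

Rejecting `submit` when one workflow's queue is full, while another
workflow's requests still succeed, is precisely the isolation property
required by item 2 above.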
## Cascade Detection

### Detection Signals

Agents MUST monitor the following signals for cascade detection:

Error Rate:
: The ratio of failed requests to total requests over a sliding
  window. An error rate exceeding the circuit breaker threshold
  indicates a potential cascade.

Latency Spike:
: A sudden increase in response latency (e.g., p99 latency exceeding
  3x the baseline) indicates downstream congestion or failure.
  Agents SHOULD track latency baselines using exponentially weighted
  moving averages.

Resource Exhaustion:
: Thread pool saturation, connection pool exhaustion, or memory
  pressure above configured thresholds indicates that a cascade is
  consuming resources.

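The latency-spike signal above can be sketched with an exponentially
weighted moving average. This non-normative sketch uses a 3x spike
factor matching the guidance above; the class name and the smoothing
constant `alpha` are illustrative assumptions.

```python
class LatencySpikeDetector:
    """Tracks a latency baseline with an exponentially weighted
    moving average (EWMA) and flags samples exceeding `factor` times
    the baseline (illustrative sketch)."""

    def __init__(self, alpha=0.2, factor=3.0):
        self.alpha = alpha        # EWMA smoothing constant
        self.factor = factor      # spike threshold multiplier
        self.baseline = None

    def observe(self, latency_s):
        if self.baseline is None:
            self.baseline = latency_s   # first sample seeds the baseline
            return False
        spike = latency_s > self.factor * self.baseline
        # Update the baseline only from non-spike samples so that an
        # ongoing cascade does not drag the baseline upwards.
        if not spike:
            self.baseline = ((1 - self.alpha) * self.baseline
                             + self.alpha * latency_s)
        return spike
```

Excluding spike samples from the baseline update is a deliberate
choice: a sustained cascade would otherwise raise the baseline until
the condition stops firing.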
### Propagation Tracking via ECT DAG Analysis

Orchestrators SHOULD analyze the ECT DAG to detect cascading
patterns:

1. **Error clustering**: Multiple `"circuit_breaker_open"` ECTs
   referencing the same downstream agent within a short window
   indicate a shared dependency failure.

2. **Depth-first propagation**: Errors propagating along `par`
   chains in the DAG indicate a synchronous cascade.

3. **Breadth-first propagation**: Multiple sibling nodes in the
   DAG failing concurrently indicate a shared infrastructure
   failure.

### Alert Format and Escalation

When cascade detection identifies a propagating failure, the
detecting agent MUST emit a cascade alert ECT:

~~~json
{
  "exec_act": "cascade_detected",
  "ext": {
    "cascade.pattern": "depth_first",
    "cascade.affected_agents": 4,
    "cascade.root_cause_ect": "error-ect-uuid",
    "cascade.blast_radius": [
      "spiffe://example.com/agent/a",
      "spiffe://example.com/agent/b",
      "spiffe://example.com/agent/c"
    ]
  }
}
~~~
{: #fig-cascade-alert title="Cascade Alert ECT"}

Cascade alerts with more than three affected agents SHOULD trigger
HITL escalation per {{I-D.nennemann-agent-dag-hitl-safety}}.

# Real-Time Rollback

## Rollback Model

Rollback reverses the effects of agent actions by walking the ECT
DAG backwards from the point of failure to the nearest valid
recovery point.

### Walking the ECT DAG Backwards

The rollback process uses `par` references to locate the recovery
point and the affected sub-DAG:

1. Identify the failing ECT node.
2. Find the checkpoint ECT associated with the failing action
   (referenced via `par`).
3. Traverse the DAG forward from that checkpoint (i.e., follow
   `par` references in the reverse, child direction) to identify
   all downstream actions that were caused by the checkpointed
   action.
4. Issue rollback requests to each affected agent in reverse
   topological order.

~~~
Checkpoint A ──► Action A1 ──► Checkpoint B ──► Action B1
                                     │
                                     └──► Action B2

Rollback order: B2, B1, B, A1, A (reverse topological)
~~~
{: #fig-rollback-order title="Rollback Order via DAG Traversal"}

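The traversal above amounts to a reverse topological sort of the
affected sub-DAG. A minimal non-normative sketch using Python's
standard `graphlib`, with the DAG from the figure encoded in the
parent-to-child direction (the reverse of `par` references); the
node names are illustrative:

```python
from graphlib import TopologicalSorter

def rollback_order(children):
    """Return nodes in reverse topological order, so that every
    action is rolled back before anything it was caused by.
    `children` maps each node to the nodes it caused (the reverse
    of their `par` references)."""
    ts = TopologicalSorter()
    for node, kids in children.items():
        for kid in kids:
            ts.add(kid, node)   # kid depends on node
        ts.add(node)
    # static_order() yields parents first; reverse it for rollback.
    return list(ts.static_order())[::-1]

# The DAG from the figure:
#   Checkpoint A -> Action A1 -> Checkpoint B -> {Action B1, Action B2}
dag = {
    "ckpt-A": ["act-A1"],
    "act-A1": ["ckpt-B"],
    "ckpt-B": ["act-B1", "act-B2"],
}
order = rollback_order(dag)
```

The result matches the figure's "B2, B1, B, A1, A" ordering up to the
arbitrary ordering of the sibling leaves B1 and B2, which have no
dependency between them.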
### Compensating Actions vs State Restoration

Rollback can be performed through two mechanisms:

State Restoration:
: The agent restores its state from the checkpoint snapshot. This
  is the preferred mechanism when the checkpoint contains a complete
  state snapshot (verified via `out_hash`).

Compensating Action:
: When state restoration is not possible (e.g., the action involved
  an external API call), the agent executes a compensating action
  that semantically reverses the original action. Compensating
  actions MUST be recorded as ECT nodes with `exec_act` value
  `"compensate"`.

### Rollback Scope

Rollback can be scoped to three levels:

Single Agent:
: Only the specified agent's checkpoint is rolled back. No
  downstream propagation occurs.

Sub-DAG:
: The checkpoint and all downstream checkpoints in the sub-DAG
  are rolled back. This is the default when `cascade` is `true`.

Full Workflow:
: All checkpoints in the workflow are rolled back and the workflow
  is terminated. This requires Rollback Coordinator authorization.

## Checkpoint Protocol

### Checkpoint Creation

An agent MUST create a checkpoint ECT before any consequential
action. An action is consequential if it modifies external state
(network configuration, database records, API calls with side
effects).

A checkpoint is an ECT with:

- `exec_act`: `"checkpoint"`
- `par`: the ECT of the action being checkpointed
- `out_hash`: SHA-256 hash of the agent's state snapshot

~~~json
{
  "jti": "ckpt-uuid",
  "exec_act": "checkpoint",
  "par": ["action-ect-uuid"],
  "out_hash": "sha256:...",
  "ext": {
    "cascade.reversible": true,
    "cascade.rollback_uri":
      "https://agent-b.example.com/.well-known/cascade/rollback",
    "cascade.target": "router-07.example.com",
    "cascade.description": "Update BGP peer configuration",
    "cascade.ttl": 86400
  }
}
~~~
{: #fig-checkpoint title="Checkpoint ECT"}

The `cascade.reversible` field MUST be present. If `false`, the
agent declares that this action cannot be automatically undone and
rollback requests MUST be escalated to a human operator via the
HITL mechanism {{I-D.nennemann-agent-dag-hitl-safety}}.

### Checkpoint Storage and Retrieval

Checkpoint ECTs MUST be stored for at least the duration specified
by `cascade.ttl`. Agents MUST store checkpoints in durable storage
that survives agent restarts.

Agents MUST expose a checkpoint retrieval endpoint:

~~~
GET /.well-known/cascade/checkpoints/{jti} HTTP/1.1
Host: agent.example.com
~~~

The response MUST include the checkpoint ECT and its verification
status (whether `out_hash` matches the currently stored state
snapshot).

### Checkpoint Verification

Before executing a rollback, the agent MUST verify the checkpoint
integrity:

1. Retrieve the checkpoint ECT.
2. Verify the ECT signature chain (L2/L3).
3. Verify that the stored state snapshot matches `out_hash`.
4. Verify that the checkpoint has not expired (`cascade.ttl`).

If verification fails, the agent MUST reject the rollback request
and emit an error ECT.

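Steps 3 and 4 above can be sketched as follows; step 2,
signature-chain verification, is out of scope for this sketch. The
`issued_at` field is an illustrative stand-in for the checkpoint's
issuance timestamp, and the function name is an assumption, not part
of the protocol.

```python
import hashlib
import time

def verify_checkpoint(ect, snapshot_bytes, now=None):
    """Non-normative sketch of checkpoint verification steps 3-4:
    compare the stored snapshot against `out_hash`, then check
    `cascade.ttl` expiry relative to the (assumed) issuance time."""
    now = time.time() if now is None else now
    # Step 3: the snapshot must hash to exactly out_hash.
    digest = "sha256:" + hashlib.sha256(snapshot_bytes).hexdigest()
    if digest != ect["out_hash"]:
        return False, "snapshot does not match out_hash"
    # Step 4: the checkpoint must not have outlived its TTL.
    if now - ect["issued_at"] > ect["ext"]["cascade.ttl"]:
        return False, "checkpoint expired"
    return True, "ok"
```

On either failure the caller would reject the rollback request and
emit an error ECT, per the paragraph above.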
## Distributed Rollback Coordination

### Rollback Coordinator Role

For rollbacks spanning multiple agents (sub-DAG or full workflow
scope), a Rollback Coordinator MUST be designated. The coordinator
is typically the orchestrator or the agent that initiated the
workflow.

The coordinator is responsible for:

1. Computing the blast radius by traversing the ECT DAG.
2. Determining the rollback order (reverse topological sort).
3. Issuing rollback requests to each affected agent.
4. Tracking rollback progress and handling failures.
5. Emitting the final rollback completion ECT.

### Two-Phase Rollback Protocol

Distributed rollback follows a two-phase protocol:

**Phase 1: Prepare**

The coordinator sends a prepare request to each affected agent:

~~~
POST /.well-known/cascade/rollback/prepare HTTP/1.1
Host: agent-b.example.com
Content-Type: application/json
Execution-Context: <prepare-ect>

{
  "rollback_id": "urn:uuid:...",
  "checkpoint_id": "ckpt-uuid",
  "scope": "sub_dag"
}
~~~
{: #fig-prepare title="Rollback Prepare Request"}

Each agent MUST respond with either:

- `"prepared"`: The agent has verified its checkpoint and is ready
  to roll back.
- `"cannot_prepare"`: The agent cannot roll back (e.g., checkpoint
  expired, irreversible action).

**Phase 2: Execute**

If all agents respond `"prepared"`, the coordinator sends execute
requests in reverse topological order:

~~~
POST /.well-known/cascade/rollback HTTP/1.1
Host: agent-b.example.com
Content-Type: application/json
Execution-Context: <rollback-ect>

{
  "rollback_id": "urn:uuid:...",
  "checkpoint_id": "ckpt-uuid",
  "phase": "execute"
}
~~~
{: #fig-execute title="Rollback Execute Request"}

If any agent responds `"cannot_prepare"` in Phase 1, the
coordinator MUST either:

- Proceed with partial rollback (if the unprepared agent is not
  on the critical path), or
- Abort the rollback and escalate to HITL.

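The coordinator's side of the two-phase exchange can be sketched as
follows. This non-normative sketch abstracts the two HTTP endpoints
above into hypothetical `prepare()` and `execute()` methods; the
function signature and the "aborted"/"completed" return values are
illustrative assumptions.

```python
def coordinate_rollback(agents, rollback_id, checkpoint_ids):
    """Two-phase rollback sketch. `agents` is already ordered in
    reverse topological order; `checkpoint_ids` maps agent name to
    the checkpoint being rolled back."""
    # Phase 1: every affected agent verifies its checkpoint.
    votes = {a.name: a.prepare(rollback_id, checkpoint_ids[a.name])
             for a in agents}
    if any(v == "cannot_prepare" for v in votes.values()):
        # Abort here; a real coordinator could instead fall back to
        # partial rollback or HITL escalation, as described above.
        return "aborted", votes
    # Phase 2: execute in reverse topological order.
    for a in agents:
        a.execute(rollback_id, checkpoint_ids[a.name])
    return "completed", votes
```

Because no `execute()` is issued until every vote is `"prepared"`, a
single expired checkpoint or irreversible action stops the whole
rollback before any state is touched.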
### Partial Rollback Handling

When a distributed rollback cannot be completed fully, the
coordinator MUST:

1. Roll back all agents that responded `"prepared"`.
2. Record the partial rollback result in the ECT DAG.
3. Emit an ECT with `exec_act` value `"rollback_complete"` and
   `cascade.status` set to `"partial"`.
4. Include the list of agents that could not be rolled back in
   the `cascade.failed_agents` extension claim.

### Conflict Resolution During Concurrent Rollbacks

When multiple rollback requests target overlapping portions of the
ECT DAG:

1. The rollback with the broader scope takes precedence (full
   workflow > sub-DAG > single agent).
2. If scopes are equal, the earlier rollback request (by timestamp)
   takes precedence.
3. The losing rollback request MUST be rejected with an error
   indicating the conflicting rollback ID.

Agents MUST implement idempotent rollback: receiving the same
`rollback_id` twice MUST return the same result without
re-executing the rollback.

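The precedence rule and the idempotency requirement above can be
sketched compactly. The scope ranking, the request shape (dicts with
`"scope"` and `"ts"` fields), and the class name are illustrative
assumptions.

```python
# Broader scope outranks narrower, per rule 1 above.
SCOPE_RANK = {"full_workflow": 3, "sub_dag": 2, "single": 1}

def takes_precedence(req_a, req_b):
    """True if rollback request `a` wins over `b`: broader scope
    first; for equal scopes, the earlier timestamp wins (rule 2)."""
    return ((SCOPE_RANK[req_a["scope"]], -req_a["ts"])
            > (SCOPE_RANK[req_b["scope"]], -req_b["ts"]))

class RollbackReceiver:
    """Idempotent rollback: the same rollback_id always returns the
    cached result without re-executing (illustrative sketch)."""

    def __init__(self):
        self._results = {}  # rollback_id -> cached result

    def handle(self, rollback_id, do_rollback):
        if rollback_id not in self._results:
            self._results[rollback_id] = do_rollback()
        return self._results[rollback_id]
```

Caching by `rollback_id` also makes coordinator retries safe: a
duplicated execute request after a lost response replays the stored
result instead of reverting state twice.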
## Rollback Evidence

### ECT Nodes for Rollback Actions

Each rollback action MUST produce ECT nodes for audit:

Rollback Start:
: `exec_act`: `"rollback_start"`; `par` references the error ECT
  that triggered the rollback.

~~~json
{
  "jti": "rb-start-uuid",
  "exec_act": "rollback_start",
  "par": ["error-ect-uuid"],
  "ext": {
    "cascade.rollback_id": "urn:uuid:...",
    "cascade.checkpoint_id": "ckpt-uuid",
    "cascade.scope": "sub_dag",
    "cascade.reason": "Upstream cascading failure"
  }
}
~~~
{: #fig-rb-start title="Rollback Start ECT"}

Rollback Complete:
: `exec_act`: `"rollback_complete"`; `par` references the rollback
  start ECT.

~~~json
{
  "jti": "rb-complete-uuid",
  "exec_act": "rollback_complete",
  "par": ["rb-start-uuid"],
  "out_hash": "sha256:...",
  "ext": {
    "cascade.rollback_id": "urn:uuid:...",
    "cascade.status": "completed",
    "cascade.state_hash_before": "sha256:...",
    "cascade.state_hash_after": "sha256:...",
    "cascade.cascaded": [
      {
        "agent": "spiffe://example.com/agent/monitor",
        "status": "completed"
      },
      {
        "agent": "spiffe://example.com/agent/classify",
        "status": "escalated"
      }
    ]
  }
}
~~~
{: #fig-rb-complete title="Rollback Complete ECT"}

### Rollback Audit Trail

The complete rollback audit trail is captured in the ECT DAG:

~~~
error ECT
    │
    ▼
rollback_start ECT
    │
    ├──► agent-A rollback_complete ECT
    │
    ├──► agent-B rollback_complete ECT
    │
    └──► agent-C compensate ECT
~~~
{: #fig-rb-audit title="Rollback Audit Trail in ECT DAG"}

Status values for individual agent rollbacks: `completed`,
`partial`, `escalated`, `failed`.

# ECT Integration

This document defines the following new `exec_act` values for use
in ECT nodes {{I-D.nennemann-wimse-ect}}:

| `exec_act` Value | Description |
|------------------|-------------|
| `circuit_breaker_open` | Circuit breaker transitioned to OPEN state |
| `circuit_breaker_close` | Circuit breaker transitioned to CLOSED state |
| `checkpoint` | State snapshot before consequential action |
| `rollback_start` | Rollback initiated for a checkpoint |
| `rollback_complete` | Rollback finished (with status) |
| `compensate` | Compensating action executed in lieu of state restoration |
| `cascade_detected` | Cascading failure pattern detected |
{: #fig-exec-act-values title="New exec_act Values"}

This document defines the following new `ext` claims for failure
context:

| Claim | Type | Description |
|-------|------|-------------|
| `cascade.downstream_agent` | string | SPIFFE ID of the downstream agent |
| `cascade.error_rate` | number | Error rate that triggered the circuit breaker |
| `cascade.window_s` | number | Sliding window duration in seconds |
| `cascade.cooldown_s` | number | Cooldown duration in seconds |
| `cascade.total_cooldown_s` | number | Total cooldown elapsed before recovery |
| `cascade.reversible` | boolean | Whether the checkpointed action can be undone |
| `cascade.rollback_uri` | string | URI for rollback requests |
| `cascade.target` | string | Target system of the checkpointed action |
| `cascade.ttl` | number | Checkpoint time-to-live in seconds |
| `cascade.rollback_id` | string | Unique identifier for a rollback operation |
| `cascade.checkpoint_id` | string | JTI of the checkpoint being rolled back |
| `cascade.scope` | string | Rollback scope: single, sub_dag, full_workflow |
| `cascade.status` | string | Rollback result status |
| `cascade.reason` | string | Human-readable reason for the action |
| `cascade.pattern` | string | Detected cascade pattern type |
| `cascade.affected_agents` | number | Count of agents affected by cascade |
| `cascade.root_cause_ect` | string | JTI of the root-cause error ECT |
| `cascade.blast_radius` | array | SPIFFE IDs of affected agents |
| `cascade.cascaded` | array | Per-agent rollback results |
| `cascade.failed_agents` | array | Agents that could not be rolled back |
| `cascade.state_hash_before` | string | State hash before rollback |
| `cascade.state_hash_after` | string | State hash after rollback |
| `cascade.description` | string | Human-readable description |
{: #fig-ext-claims title="New ext Claims for Cascade Prevention"}

# Security Considerations

## Rollback Weaponization

Malicious agents could attempt to force unnecessary rollbacks to
disrupt workflows. Mitigations:

1. Rollback requests MUST be authenticated via the ECT signature
   chain. Only agents whose ECTs appear in the same workflow DAG
   (identified by `wid`) are authorized to request rollback.

2. Rollback requests from outside the originating workflow MUST be
   rejected with HTTP 403.

3. Agents SHOULD implement rate limiting on rollback requests to
   prevent denial of service through rollback flooding.

4. The two-phase rollback protocol provides a prepare phase in which
   agents can validate the rollback request before committing.

## Circuit Breaker Manipulation

An adversary could attempt to manipulate circuit breaker state,
either to prevent legitimate circuit breaking or to force
unnecessary circuit breaks:

1. **False error injection**: A malicious agent could emit false
   error ECTs to trigger circuit breakers. At L2/L3
   {{I-D.nennemann-wimse-ect}}, ECT signatures prevent forgery.
   Agents SHOULD verify that error ECTs reference valid `par`
   values within their own workflow DAG.

2. **Circuit breaker suppression**: An adversary could attempt to
   reset circuit breakers by sending successful probe responses.
   Agents MUST only accept probe responses from the actual
   downstream agent (verified via ECT identity binding).

3. **Status endpoint abuse**: The `/.well-known/cascade/circuits`
   endpoint reveals system health topology. This endpoint MUST
   require authentication and SHOULD be restricted to agents within
   the same administrative domain.

## Checkpoint Integrity

Checkpoint state snapshots contain sensitive system state. Agents
MUST:

1. Encrypt stored checkpoint state at rest.
2. Reference checkpoint state in ECTs only via `out_hash`;
   checkpoint contents MUST NOT be included in ECT claims.
3. Verify `out_hash` integrity before executing rollback to prevent
   rollback to a tampered state.
4. Enforce checkpoint storage quotas to prevent checkpoint flooding
   attacks.
5. Purge expired checkpoints (past `cascade.ttl`).

# IANA Considerations

## Registration of exec_act Values

This document requests registration of the following `exec_act`
values in the ECT exec_act registry:

| Value | Description | Reference |
|-------|-------------|-----------|
| `circuit_breaker_open` | Circuit breaker transitioned to OPEN | This document |
| `circuit_breaker_close` | Circuit breaker transitioned to CLOSED | This document |
| `checkpoint` | State snapshot before consequential action | This document |
| `rollback_start` | Rollback operation initiated | This document |
| `rollback_complete` | Rollback operation finished | This document |
| `compensate` | Compensating action executed | This document |
| `cascade_detected` | Cascading failure pattern detected | This document |
{: #fig-iana-exec-act title="exec_act Value Registrations"}

## Registration of ext Claims

This document requests registration of the `ext` claims listed in
{{fig-ext-claims}} in the ECT extension claims registry. All claims
use the `cascade.` namespace prefix.

## Well-Known URI Registration

This document requests registration of the following well-known URI
suffixes per {{RFC9110}}:

| URI Suffix | Description | Reference |
|------------|-------------|-----------|
| `cascade/circuits` | Circuit breaker status | This document |
| `cascade/rollback` | Rollback request endpoint | This document |
| `cascade/rollback/prepare` | Rollback prepare endpoint | This document |
| `cascade/checkpoints` | Checkpoint retrieval | This document |
{: #fig-iana-uris title="Well-Known URI Registrations"}

--- back

# Acknowledgments
{:numbered="false"}

This document absorbs and supersedes concepts from the earlier Agent
Error Recovery and Rollback (AERR) and Agent Task DAG (ATD) proposals.
It builds on the Execution Context Token specification
{{I-D.nennemann-wimse-ect}} for DAG-based audit trails and the Agent
Context Policy Token {{I-D.nennemann-agent-dag-hitl-safety}} for HITL
escalation of irreversible actions. The circuit breaker pattern is
adapted from microservice architecture best practices.