---
title: "Agent Failure Cascade Prevention and Rollback"
abbrev: "Agent Cascade Prevention"
category: std
docname: draft-nennemann-agent-cascade-prevention-00
submissiontype: IETF
number:
date:
v: 3
area: "OPS"
workgroup: "NMOP"
keyword:
 - cascade prevention
 - circuit breaker
 - rollback
 - failure domain
 - agent recovery

author:
 -
    fullname: Christian Nennemann
    organization: Independent Researcher
    email: ietf@nennemann.de

normative:
  RFC2119:
  RFC8174:
  RFC7519:
  RFC7515:
  RFC9110:
  I-D.nennemann-wimse-ect:
    title: "Execution Context Tokens for Distributed Agentic Workflows"
    target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/
  I-D.nennemann-agent-dag-hitl-safety:
    title: "Agent Context Policy Token: DAG Delegation with Human Override"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/

informative:
  I-D.nennemann-agent-gap-analysis:
    title: "Gap Analysis of IETF Standards for Autonomous AI Agent Networking"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-gap-analysis/

--- abstract

This document defines protocols for preventing agent failures from
cascading across interconnected autonomous systems and standardized
mechanisms for real-time rollback of incorrect agent decisions. It
specifies a circuit breaker protocol with well-defined state
transitions, failure domain isolation through bulkhead patterns, cascade
detection via error rate and latency analysis, and a distributed
rollback coordination protocol that walks the Execution Context Token
(ECT) DAG backwards to revert agent actions to a known-good state.
This document absorbs and supersedes the concepts introduced in earlier
AERR and ATD proposals.

--- middle

# Introduction

Autonomous AI agents increasingly operate in interconnected
multi-agent systems where a single agent's failure can propagate
through the network, causing widespread service disruption. The IETF
gap analysis {{I-D.nennemann-agent-gap-analysis}} identified two
critical gaps in existing standards:

- **Gap 2 (Cascade Prevention)**: No standard mechanism exists for
  containing failures within agent ecosystems. When one agent fails,
  dependent agents continue sending requests to the failing agent,
  amplifying the failure across the system.

- **Gap 4 (Rollback)**: No standard protocol exists for reverting
  incorrect agent decisions. When an autonomous agent misconfigures
  a network device or makes an erroneous API call, there is no
  interoperable way to undo the action or coordinate rollback across
  multiple affected agents.

This document addresses both gaps by defining:

1. A circuit breaker protocol that stops failure propagation between
   agents.
2. Failure domain isolation mechanisms that contain blast radius.
3. Cascade detection signals that identify propagating failures early.
4. A distributed rollback protocol that coordinates state reversion
   across multiple agents using the ECT DAG
   {{I-D.nennemann-wimse-ect}}.

This specification absorbs and supersedes the concepts from the earlier
Agent Error Recovery and Rollback (AERR) and Agent Task DAG (ATD)
proposals, consolidating cascade prevention and rollback into a single
coherent protocol built on ECT infrastructure.

Design principles:

1. Agents that take consequential actions MUST be able to undo them,
   or MUST declare them irreversible upfront.
2. Failure containment takes priority over failure diagnosis.
3. The protocol adds minimal overhead to the happy path.
4. All cascade prevention and rollback actions are recorded as ECT
   nodes, providing a cryptographic audit trail.

# Terminology

{::boilerplate bcp14-tagged}

Circuit Breaker:
: A mechanism that stops an agent from propagating requests to a
  failing downstream agent, preventing cascading failures. Modeled
  after the electrical circuit breaker, a resilience pattern widely
  adopted in microservice architectures.

Failure Domain:
: A bounded set of agents and resources within which a failure is
  contained. Failures within a domain MUST NOT propagate beyond the
  domain boundary without explicit escalation.

Blast Radius:
: The set of agents and systems affected by a single agent's failure,
  determinable by traversing the ECT DAG forward from the failing
  node.

Cascade Detection:
: The process of identifying that a failure is propagating across
  agent boundaries, using signals such as error rate spikes, latency
  increases, and resource exhaustion patterns.

Rollback Coordinator:
: An agent or orchestrator responsible for coordinating distributed
  rollback across multiple agents in a workflow, ensuring consistency
  and resolving conflicts.

Checkpoint:
: An ECT node recording an agent's state hash before a consequential
  action, providing a restore point for rollback.

Compensating Action:
: An action that semantically reverses the effect of a prior action
  when direct state restoration is not possible (e.g., deleting a
  resource that was created, rather than restoring a pre-creation
  snapshot).

Recovery Point:
: The most recent checkpoint in the ECT DAG to which an agent or
  workflow can be safely rolled back without violating consistency
  constraints.

# Failure Cascade Prevention

## Cascade Model

When an agent fails in a multi-agent system, the failure can
propagate through multiple vectors. The following diagram
illustrates a typical cascade scenario:

~~~
Agent A          Agent B          Agent C          Agent D
   |                |                |                |
   |    request     |                |                |
   |--------------->|                |                |
   |                |    request     |                |
   |                |--------------->|                |
   |                |                |    request     |
   |                |                |--------------->|
   |                |                |                |
   |                |                |    FAILURE     |
   |                |                |<--- X ---------|
   |                |                |                |
   |                | error/timeout  |                |
   |                |<---------------|                |
   |                |                |                |
   | error/timeout  |                |                |
   |<---------------|                |                |
   |                |                |                |
   |  [CASCADE: all agents impacted by D's failure]   |
   |                |                |                |
~~~
{: #fig-cascade title="Failure Cascade Propagation"}

### Failure Domain Taxonomy

Failures in agent ecosystems fall into the following categories:

Agent-Local Failure:
: A failure confined to a single agent instance (e.g., out-of-memory,
  logic error). The blast radius is limited to the agent itself and
  its immediate callers.

Service Failure:
: A failure affecting all instances of a particular agent service
  (e.g., model endpoint unavailable). The blast radius includes all
  agents that depend on the failing service.

Infrastructure Failure:
: A failure in shared infrastructure (e.g., network partition,
  certificate authority unavailable). The blast radius may span
  multiple failure domains.

Semantic Failure:
: An agent produces incorrect output without raising an error (e.g.,
  misconfiguration, wrong decision). This is the hardest category
  to detect and may propagate silently through the DAG.

### Propagation Vectors in Agent Ecosystems

Failures propagate through the following vectors:

1. **Synchronous request chains**: An agent blocks waiting for a
   failing downstream agent, causing its own callers to time out.

2. **Shared state corruption**: An agent writes incorrect data to a
   shared store, causing other agents reading that data to fail or
   make incorrect decisions.

3. **Resource exhaustion**: A failing agent consumes excessive
   resources (connections, memory, compute), starving healthy agents.

4. **Retry amplification**: Multiple agents retry requests to a
   failing agent simultaneously, overwhelming it further.

## Circuit Breaker Protocol

Each agent MUST implement a circuit breaker for every downstream
agent it communicates with.

### States

The circuit breaker has three states:

CLOSED (normal):
: Requests flow through normally. The agent tracks the error rate
  over a sliding window (default: 60 seconds).

OPEN (failure detected):
: When the error rate exceeds the configured threshold (default: 50%
  over the window), the breaker opens. All requests to the
  downstream agent are immediately rejected locally. The agent
  MUST emit an ECT with `exec_act` value `"circuit_breaker_open"`.

HALF_OPEN (recovery probe):
: After a cooldown period (default: 30 seconds), the breaker
  transitions to HALF_OPEN and allows a single probe request. If
  the probe succeeds, the breaker returns to CLOSED and the agent
  MUST emit an ECT with `exec_act` value `"circuit_breaker_close"`.
  If the probe fails, the breaker returns to OPEN with doubled
  cooldown (exponential backoff, maximum 300 seconds).

### State Transition Rules

~~~
              error_rate > threshold
  CLOSED ────────────────────────────────► OPEN
    ▲                                        │
    │ probe succeeds                         │ cooldown expires
    │                                        ▼
    └──────────────────────────────── HALF_OPEN
                                             │
                                 probe fails │
                                             ▼
                                           OPEN
                                     (cooldown *= 2,
                                      max 300s)
~~~
{: #fig-circuit-fsm title="Circuit Breaker State Machine"}

The following rules govern state transitions:

1. CLOSED to OPEN: The error rate over the sliding window exceeds
   the configured threshold. The agent MUST emit a
   `"circuit_breaker_open"` ECT and reject all subsequent requests
   to the downstream agent.

2. OPEN to HALF_OPEN: The cooldown timer expires. The agent MUST
   allow exactly one probe request through.

3. HALF_OPEN to CLOSED: The probe request succeeds. The agent MUST
   emit a `"circuit_breaker_close"` ECT and resume normal operation.
   The error rate counters MUST be reset.

4. HALF_OPEN to OPEN: The probe request fails. The cooldown period
   MUST be doubled (up to a maximum of 300 seconds).

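The transition rules above can be condensed into a small state machine. The following is a non-normative Python sketch (class and attribute names are illustrative, not part of this protocol); emitting the corresponding ECTs is indicated only as comments:

```python
import time

class CircuitBreaker:
    """Non-normative sketch of the CLOSED/OPEN/HALF_OPEN state machine."""

    def __init__(self, threshold=0.5, window_s=60, cooldown_s=30,
                 max_cooldown_s=300):
        self.threshold = threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.max_cooldown_s = max_cooldown_s
        self.state = "CLOSED"
        self.opened_at = 0.0
        self.samples = []  # (timestamp, ok) pairs inside the sliding window

    def _error_rate(self, now):
        # Drop samples that fell out of the sliding window, then compute
        # failed / total over what remains.
        self.samples = [(t, ok) for t, ok in self.samples
                        if now - t <= self.window_s]
        if not self.samples:
            return 0.0
        return sum(1 for _, ok in self.samples if not ok) / len(self.samples)

    def allow_request(self, now=None):
        now = time.monotonic() if now is None else now
        if self.state == "OPEN":
            if now - self.opened_at >= self.cooldown_s:
                self.state = "HALF_OPEN"  # cooldown expired: admit one probe
                return True
            return False                  # reject locally while OPEN
        if self.state == "HALF_OPEN":
            return False                  # a probe is already in flight
        return True                       # CLOSED

    def record(self, ok, now=None):
        now = time.monotonic() if now is None else now
        if self.state == "HALF_OPEN":
            if ok:
                self.state = "CLOSED"     # emit "circuit_breaker_close" ECT
                self.samples = []         # reset error rate counters
            else:
                self.state = "OPEN"       # probe failed: double the cooldown
                self.opened_at = now
                self.cooldown_s = min(self.cooldown_s * 2,
                                      self.max_cooldown_s)
            return
        self.samples.append((now, ok))
        if self.state == "CLOSED" and self._error_rate(now) > self.threshold:
            self.state = "OPEN"           # emit "circuit_breaker_open" ECT
            self.opened_at = now
```

Note that the sketch admits exactly one probe per HALF_OPEN episode and applies the capped exponential backoff from rule 4; a production breaker would also enforce a minimum request volume before tripping.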
### Circuit Breaker Registration and Discovery

Agents MUST expose circuit breaker state at a well-known endpoint:

~~~
GET /.well-known/cascade/circuits HTTP/1.1
~~~

Response:

~~~json
{
  "circuits": [
    {
      "downstream_agent": "spiffe://example.com/agent/router-mgr",
      "state": "open",
      "error_rate": 0.75,
      "window_s": 60,
      "last_failure_ect": "550e8400-e29b-41d4-a716-446655440099",
      "cooldown_remaining_s": 22
    }
  ]
}
~~~
{: #fig-circuits title="Circuit Breaker Status Endpoint"}

### ECT Integration

Each circuit breaker state change MUST produce an ECT node:

~~~json
{
  "jti": "cb-open-uuid",
  "exec_act": "circuit_breaker_open",
  "par": ["error-ect-uuid"],
  "ext": {
    "cascade.downstream_agent":
      "spiffe://example.com/agent/router-mgr",
    "cascade.error_rate": 0.75,
    "cascade.window_s": 60,
    "cascade.cooldown_s": 30
  }
}
~~~
{: #fig-cb-ect title="Circuit Breaker Open ECT"}

~~~json
{
  "jti": "cb-close-uuid",
  "exec_act": "circuit_breaker_close",
  "par": ["cb-open-uuid"],
  "ext": {
    "cascade.downstream_agent":
      "spiffe://example.com/agent/router-mgr",
    "cascade.total_cooldown_s": 30
  }
}
~~~
{: #fig-cb-close-ect title="Circuit Breaker Close ECT"}

## Failure Domain Isolation

### Blast Radius Containment Strategies

Agents MUST implement the following containment strategies:

1. **Request rejection at the boundary**: When a circuit breaker
   opens, the agent MUST return a structured error to its callers
   indicating that the downstream dependency is unavailable, rather
   than propagating the failure.

2. **Timeout enforcement**: Agents MUST enforce timeouts on all
   downstream requests. The timeout MUST be shorter than the
   caller's timeout to prevent timeout cascades.

3. **Graceful degradation**: When a non-critical downstream agent
   is unavailable, agents SHOULD continue operating with reduced
   functionality rather than failing entirely.

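Strategy 2 amounts to deadline propagation: each hop reserves headroom so it can fail and report before its own caller gives up. A non-normative sketch (the function name and headroom values are illustrative):

```python
def downstream_timeout(caller_timeout_s, headroom_s=0.5, floor_s=0.1):
    """Budget a downstream timeout strictly shorter than the caller's.

    Reserves `headroom_s` for local processing and error handling so a
    downstream timeout is reported to the caller instead of cascading
    into the caller's own timeout.
    """
    budget = caller_timeout_s - headroom_s
    if budget < floor_s:
        # Not enough budget left to call downstream at all: fail fast
        # at this hop rather than propagate a guaranteed timeout.
        raise TimeoutError("insufficient deadline budget for downstream call")
    return budget

# A three-hop chain: each hop gets a strictly smaller timeout.
t_a = 5.0
t_b = downstream_timeout(t_a)  # 4.5
t_c = downstream_timeout(t_b)  # 4.0
```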
### Domain Boundary Enforcement

Failure domains are defined by the workflow topology in the ECT DAG.
Each workflow (identified by the `wid` claim) constitutes a failure
domain. Cross-workflow failures MUST be escalated through the HITL
mechanism {{I-D.nennemann-agent-dag-hitl-safety}} rather than
propagating automatically.

Agents at domain boundaries MUST:

1. Validate all incoming requests against the circuit breaker state
   of their downstream dependencies before accepting work.
2. Emit a `"circuit_breaker_open"` ECT when rejecting work due to
   downstream unavailability.
3. Report domain health status via the circuits endpoint.

### Bulkhead Patterns for Agent Pools

When multiple workflows share a common agent pool, the pool MUST
implement bulkhead isolation:

1. **Connection limits**: Each workflow MUST have a maximum number
   of concurrent connections to the shared agent pool.

2. **Queue isolation**: Each workflow's requests MUST be queued
   independently, preventing one workflow's backlog from blocking
   others.

3. **Resource quotas**: Shared agent pools SHOULD enforce per-workflow
   resource quotas (CPU, memory, request rate).

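Requirements 1 and 2 above can be sketched as per-workflow semaphores and queues in front of a shared pool. This is a non-normative illustration (class name, limit, and return values are invented for the sketch):

```python
import threading
import queue

class BulkheadPool:
    """Sketch: per-workflow connection limits and isolated queues
    for a shared agent pool."""

    def __init__(self, per_workflow_limit=4):
        self.limit = per_workflow_limit
        self._sems = {}    # wid -> Semaphore (connection limit)
        self._queues = {}  # wid -> Queue (queue isolation)
        self._lock = threading.Lock()

    def _slot(self, wid):
        with self._lock:
            if wid not in self._sems:
                self._sems[wid] = threading.Semaphore(self.limit)
                self._queues[wid] = queue.Queue()
            return self._sems[wid], self._queues[wid]

    def submit(self, wid, request):
        sem, q = self._slot(wid)
        if not sem.acquire(blocking=False):
            # This workflow is at its connection limit; other
            # workflows' semaphores and queues are unaffected.
            return "rejected"
        q.put(request)  # queued independently per workflow
        return "accepted"

    def complete(self, wid):
        sem, _ = self._slot(wid)
        sem.release()  # free one connection slot for this workflow
```

Because each workflow has its own semaphore and queue, one workflow exhausting its limit leaves the others' admission decisions untouched.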
## Cascade Detection

### Detection Signals

Agents MUST monitor the following signals for cascade detection:

Error Rate:
: The ratio of failed requests to total requests over a sliding
  window. An error rate exceeding the circuit breaker threshold
  indicates a potential cascade.

Latency Spike:
: A sudden increase in response latency (e.g., p99 latency exceeding
  3x the baseline) indicates downstream congestion or failure.
  Agents SHOULD track latency baselines using exponentially weighted
  moving averages.

Resource Exhaustion:
: Thread pool saturation, connection pool exhaustion, or memory
  pressure above configured thresholds indicates that a cascade is
  consuming resources.

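The latency-spike signal with an exponentially weighted moving average baseline can be sketched as follows (non-normative; the alpha and spike factor are illustrative defaults):

```python
class LatencySpikeDetector:
    """Sketch: EWMA latency baseline with a 3x spike threshold."""

    def __init__(self, alpha=0.1, spike_factor=3.0):
        self.alpha = alpha
        self.spike_factor = spike_factor
        self.baseline = None

    def observe(self, latency_s):
        """Return True if this sample is a spike against the baseline."""
        if self.baseline is None:
            self.baseline = latency_s  # first sample seeds the baseline
            return False
        spike = latency_s > self.spike_factor * self.baseline
        if not spike:
            # Fold only normal samples into the EWMA, so a sustained
            # outage does not drag the baseline up and mask itself.
            self.baseline += self.alpha * (latency_s - self.baseline)
        return spike
```

Excluding spike samples from the baseline update is a design choice: it keeps the baseline representative of healthy operation at the cost of adapting more slowly to legitimate load changes.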
### Propagation Tracking via ECT DAG Analysis

Orchestrators SHOULD analyze the ECT DAG to detect cascading
patterns:

1. **Error clustering**: Multiple `"circuit_breaker_open"` ECTs
   referencing the same downstream agent within a short window
   indicate a shared dependency failure.

2. **Depth-first propagation**: Errors propagating along `par`
   chains in the DAG indicate a synchronous cascade.

3. **Breadth-first propagation**: Multiple sibling nodes in the
   DAG failing concurrently indicate a shared infrastructure
   failure.

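The error-clustering pattern (item 1) can be sketched as a scan over emitted ECTs, grouped by the `cascade.downstream_agent` claim. Non-normative; the `iat` field and the thresholds are assumptions of this sketch:

```python
from collections import defaultdict

def cluster_errors(ects, window_s=60, min_cluster=2):
    """Sketch: flag shared-dependency failures by clustering
    "circuit_breaker_open" ECTs on the same downstream agent.

    `ects` is a list of claim dicts with "iat" (epoch seconds),
    "exec_act", and the "cascade.downstream_agent" ext claim.
    Returns {agent: largest burst size within one window}.
    """
    by_agent = defaultdict(list)
    for ect in ects:
        if ect.get("exec_act") != "circuit_breaker_open":
            continue
        agent = ect["ext"]["cascade.downstream_agent"]
        by_agent[agent].append(ect["iat"])

    clusters = {}
    for agent, times in by_agent.items():
        times.sort()
        # Largest number of opens that fall inside one sliding window.
        best = max(sum(1 for u in times if 0 <= u - t <= window_s)
                   for t in times)
        if best >= min_cluster:
            clusters[agent] = best
    return clusters
```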
### Alert Format and Escalation

When cascade detection identifies a propagating failure, the
detecting agent MUST emit a cascade alert ECT:

~~~json
{
  "exec_act": "cascade_detected",
  "ext": {
    "cascade.pattern": "depth_first",
    "cascade.affected_agents": 4,
    "cascade.root_cause_ect": "error-ect-uuid",
    "cascade.blast_radius": [
      "spiffe://example.com/agent/a",
      "spiffe://example.com/agent/b",
      "spiffe://example.com/agent/c"
    ]
  }
}
~~~
{: #fig-cascade-alert title="Cascade Alert ECT"}

Cascade alerts with more than 3 affected agents SHOULD trigger
HITL escalation per {{I-D.nennemann-agent-dag-hitl-safety}}.

# Real-Time Rollback

## Rollback Model

Rollback reverses the effects of agent actions by walking the ECT
DAG backwards from the point of failure to the nearest valid
recovery point.

### Walking the ECT DAG Backwards

The rollback process follows `par` references in reverse:

1. Identify the failing ECT node.
2. Find the checkpoint ECT associated with the failing action
   (referenced via `par`).
3. Traverse the DAG forward from the checkpoint, following the
   inverse of the `par` references, to identify all downstream
   actions that were caused by the checkpointed action.
4. Issue rollback requests to each affected agent in reverse
   topological order.

~~~
Checkpoint A ──► Action A1 ──► Checkpoint B ──► Action B1
                                    │
                                    └──► Action B2

Rollback order: B2, B1, B, A1, A (reverse topological)
~~~
{: #fig-rollback-order title="Rollback Order via DAG Traversal"}

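Computing such an order is a post-order traversal over the inverted `par` edges: every node's descendants are emitted before the node itself. A non-normative sketch (node identifiers stand in for ECT `jti` values):

```python
def rollback_order(nodes, par):
    """Sketch: compute a reverse topological order over ECT nodes.

    `par` maps each node to the list of its `par` (parent) references.
    Children (later actions) are emitted before the actions that
    caused them, so rollback undoes effects before their causes.
    """
    # Invert the par edges: parent -> list of children.
    children = {n: [] for n in nodes}
    for n in nodes:
        for p in par.get(n, []):
            children[p].append(n)

    order, seen = [], set()

    def visit(n):
        if n in seen:
            return
        seen.add(n)
        for c in children[n]:
            visit(c)
        order.append(n)  # post-order: all descendants appended first

    for n in nodes:
        visit(n)
    return order
```

For the DAG in {{fig-rollback-order}} this yields an order in which both B1 and B2 precede Checkpoint B, which precedes A1 and Checkpoint A; any such order is a valid reverse topological sort.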
### Compensating Actions vs State Restoration

Rollback can be performed through two mechanisms:

State Restoration:
: The agent restores its state from the checkpoint snapshot. This
  is the preferred mechanism when the checkpoint contains a complete
  state snapshot (verified via `out_hash`).

Compensating Action:
: When state restoration is not possible (e.g., the action involved
  an external API call), the agent executes a compensating action
  that semantically reverses the original action. Compensating
  actions MUST be recorded as ECT nodes with `exec_act` value
  `"compensate"`.

### Rollback Scope

Rollback can be scoped to three levels:

Single Agent:
: Only the specified agent's checkpoint is rolled back. No
  downstream propagation occurs.

Sub-DAG:
: The checkpoint and all downstream checkpoints in the sub-DAG
  are rolled back. This is the default when `cascade` is `true`.

Full Workflow:
: All checkpoints in the workflow are rolled back and the workflow
  is terminated. This requires Rollback Coordinator authorization.

## Checkpoint Protocol

### Checkpoint Creation

An agent MUST create a checkpoint ECT before any consequential
action. An action is consequential if it modifies external state
(network configuration, database records, API calls with side
effects).

A checkpoint is an ECT with:

- `exec_act`: `"checkpoint"`
- `par`: the ECT of the action being checkpointed
- `out_hash`: SHA-256 hash of the agent's state snapshot

~~~json
{
  "jti": "ckpt-uuid",
  "exec_act": "checkpoint",
  "par": ["action-ect-uuid"],
  "out_hash": "sha256:...",
  "ext": {
    "cascade.reversible": true,
    "cascade.rollback_uri":
      "https://agent-b.example.com/.well-known/cascade/rollback",
    "cascade.target": "router-07.example.com",
    "cascade.description": "Update BGP peer configuration",
    "cascade.ttl": 86400
  }
}
~~~
{: #fig-checkpoint title="Checkpoint ECT"}

The `cascade.reversible` field MUST be present. If `false`, the
agent declares that this action cannot be automatically undone and
rollback requests MUST be escalated to a human operator via the
HITL mechanism {{I-D.nennemann-agent-dag-hitl-safety}}.

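Assembling the claim set above is mechanical once the state snapshot is serialized. A non-normative sketch (the function name and parameters are illustrative; signing and the surrounding JWT envelope are out of scope here):

```python
import hashlib
import uuid

def make_checkpoint(action_jti, state_snapshot: bytes, rollback_uri,
                    target, description, reversible=True, ttl_s=86400):
    """Sketch: build the claim set of a checkpoint ECT with the
    `out_hash` of the serialized state snapshot."""
    return {
        "jti": str(uuid.uuid4()),
        "exec_act": "checkpoint",
        "par": [action_jti],
        "out_hash": "sha256:" + hashlib.sha256(state_snapshot).hexdigest(),
        "ext": {
            "cascade.reversible": reversible,
            "cascade.rollback_uri": rollback_uri,
            "cascade.target": target,
            "cascade.description": description,
            "cascade.ttl": ttl_s,
        },
    }
```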
### Checkpoint Storage and Retrieval

Checkpoint ECTs MUST be stored for at least the duration specified
by `cascade.ttl`. Agents MUST store checkpoints in durable storage
that survives agent restarts.

Agents MUST expose a checkpoint retrieval endpoint:

~~~
GET /.well-known/cascade/checkpoints/{jti} HTTP/1.1
~~~

The response MUST include the checkpoint ECT and its verification
status (whether `out_hash` matches the current stored state snapshot).

### Checkpoint Verification

Before executing a rollback, the agent MUST verify the checkpoint
integrity:

1. Retrieve the checkpoint ECT.
2. Verify the ECT signature chain (L2/L3).
3. Verify that the stored state snapshot matches `out_hash`.
4. Verify that the checkpoint has not expired (`cascade.ttl`).

If verification fails, the agent MUST reject the rollback request
and emit an error ECT.

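The four verification steps can be sketched as a single check function. Non-normative; signature-chain verification (step 2) belongs to the ECT layer and is stubbed as a flag here, and the return strings are invented for the sketch:

```python
import hashlib
import time

def verify_checkpoint(ckpt, stored_snapshot: bytes, created_at,
                      now=None, signature_ok=True):
    """Sketch of the checkpoint verification steps; returns "ok" or
    the first failing check."""
    now = time.time() if now is None else now
    if not signature_ok:                               # step 2 (stubbed)
        return "invalid_signature"
    expected = "sha256:" + hashlib.sha256(stored_snapshot).hexdigest()
    if ckpt["out_hash"] != expected:                   # step 3
        return "state_mismatch"
    if now - created_at > ckpt["ext"]["cascade.ttl"]:  # step 4
        return "expired"
    return "ok"
```

On any result other than `"ok"`, the agent would reject the rollback request and emit an error ECT as required above.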
## Distributed Rollback Coordination

### Rollback Coordinator Role

For rollbacks spanning multiple agents (sub-DAG or full workflow
scope), a Rollback Coordinator MUST be designated. The coordinator
is typically the orchestrator or the agent that initiated the
workflow.

The coordinator is responsible for:

1. Computing the blast radius by traversing the ECT DAG.
2. Determining rollback order (reverse topological sort).
3. Issuing rollback requests to each affected agent.
4. Tracking rollback progress and handling failures.
5. Emitting the final rollback completion ECT.

### Two-Phase Rollback Protocol

Distributed rollback follows a two-phase protocol:

**Phase 1: Prepare**

The coordinator sends a prepare request to each affected agent:

~~~
POST /.well-known/cascade/rollback/prepare HTTP/1.1
Content-Type: application/json
Execution-Context: <prepare-ect>

{
  "rollback_id": "urn:uuid:...",
  "checkpoint_id": "ckpt-uuid",
  "scope": "sub_dag"
}
~~~
{: #fig-prepare title="Rollback Prepare Request"}

Each agent MUST respond with either:

- `"prepared"`: The agent has verified its checkpoint and is ready
  to roll back.
- `"cannot_prepare"`: The agent cannot roll back (e.g., checkpoint
  expired, irreversible action).

**Phase 2: Execute**

If all agents respond `"prepared"`, the coordinator sends execute
requests in reverse topological order:

~~~
POST /.well-known/cascade/rollback HTTP/1.1
Content-Type: application/json
Execution-Context: <rollback-ect>

{
  "rollback_id": "urn:uuid:...",
  "checkpoint_id": "ckpt-uuid",
  "phase": "execute"
}
~~~
{: #fig-execute title="Rollback Execute Request"}

If any agent responds `"cannot_prepare"` in Phase 1, the
coordinator MUST either:

- Proceed with partial rollback (if the unprepared agent is not
  on the critical path), or
- Abort the rollback and escalate to HITL.

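The coordinator side of the two phases can be sketched as follows. Non-normative: `prepare`, `execute`, and `escalate` are caller-supplied callbacks standing in for the HTTP requests and the HITL hook, and this sketch always aborts on any blocker rather than attempting a partial rollback:

```python
def coordinate_rollback(agents_in_reverse_topo_order,
                        prepare, execute, escalate):
    """Sketch of the two-phase rollback coordinator.

    Phase 1 collects prepare votes from every affected agent;
    Phase 2 executes only if all agents voted "prepared".
    """
    votes = {a: prepare(a) for a in agents_in_reverse_topo_order}
    blockers = [a for a, v in votes.items() if v != "prepared"]
    if blockers:
        # A real coordinator may instead proceed partially when the
        # blockers are off the critical path (see above).
        escalate(blockers)
        return "aborted"
    for agent in agents_in_reverse_topo_order:
        execute(agent)  # reverse topological order preserved
    return "completed"
```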
### Partial Rollback Handling

When a distributed rollback cannot be completed fully, the
coordinator MUST:

1. Roll back all agents that responded `"prepared"`.
2. Record the partial rollback result in the ECT DAG.
3. Emit an ECT with `exec_act` value `"rollback_complete"` and
   `cascade.status` set to `"partial"`.
4. Include the list of agents that could not be rolled back in
   the `cascade.failed_agents` extension claim.

### Conflict Resolution During Concurrent Rollbacks

When multiple rollback requests target overlapping portions of the
ECT DAG:

1. The rollback with the broader scope takes precedence (full
   workflow > sub-DAG > single agent).
2. If scopes are equal, the earlier rollback request (by timestamp)
   takes precedence.
3. The losing rollback request MUST be rejected with an error
   indicating the conflicting rollback ID.

Agents MUST implement idempotent rollback: receiving the same
`rollback_id` twice MUST return the same result without
re-executing the rollback.

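The precedence rules and the idempotency requirement can be sketched together. Non-normative: the dict field names mirror this document's claims, while the ranking table, `ts` field, and class name are inventions of the sketch:

```python
SCOPE_RANK = {"full_workflow": 2, "sub_dag": 1, "single": 0}

def winner(rb_a, rb_b):
    """Sketch of the precedence rules: broader scope wins; on equal
    scope, the earlier timestamp wins. Each rollback is a dict with
    "scope", "ts", and "rollback_id"."""
    key = lambda rb: (-SCOPE_RANK[rb["scope"]], rb["ts"])
    return min(rb_a, rb_b, key=key)

class IdempotentRollbackHandler:
    """Replaying the same rollback_id returns the cached result
    without re-executing the rollback."""

    def __init__(self, do_rollback):
        self.do_rollback = do_rollback
        self.results = {}

    def handle(self, rollback_id):
        if rollback_id not in self.results:
            self.results[rollback_id] = self.do_rollback(rollback_id)
        return self.results[rollback_id]
```

The losing request of `winner` would be rejected with an error naming the winning rollback ID, per rule 3.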
## Rollback Evidence

### ECT Nodes for Rollback Actions

Each rollback action MUST produce ECT nodes for audit:

Rollback Start:
: `exec_act`: `"rollback_start"`, `par` references the error ECT
  that triggered the rollback.

~~~json
{
  "jti": "rb-start-uuid",
  "exec_act": "rollback_start",
  "par": ["error-ect-uuid"],
  "ext": {
    "cascade.rollback_id": "urn:uuid:...",
    "cascade.checkpoint_id": "ckpt-uuid",
    "cascade.scope": "sub_dag",
    "cascade.reason": "Upstream cascading failure"
  }
}
~~~
{: #fig-rb-start title="Rollback Start ECT"}

Rollback Complete:
: `exec_act`: `"rollback_complete"`, `par` references the rollback
  start ECT.

~~~json
{
  "jti": "rb-complete-uuid",
  "exec_act": "rollback_complete",
  "par": ["rb-start-uuid"],
  "out_hash": "sha256:...",
  "ext": {
    "cascade.rollback_id": "urn:uuid:...",
    "cascade.status": "completed",
    "cascade.state_hash_before": "sha256:...",
    "cascade.state_hash_after": "sha256:...",
    "cascade.cascaded": [
      {
        "agent": "spiffe://example.com/agent/monitor",
        "status": "completed"
      },
      {
        "agent": "spiffe://example.com/agent/classify",
        "status": "escalated"
      }
    ]
  }
}
~~~
{: #fig-rb-complete title="Rollback Complete ECT"}

### Rollback Audit Trail

The complete rollback audit trail is captured in the ECT DAG:

~~~
error ECT
    │
    ▼
rollback_start ECT
    │
    ├──► agent-A rollback_complete ECT
    │
    ├──► agent-B rollback_complete ECT
    │
    └──► agent-C compensate ECT
~~~
{: #fig-rb-audit title="Rollback Audit Trail in ECT DAG"}

Status values for individual agent rollbacks: `completed`,
`partial`, `escalated`, `failed`.

# ECT Integration

This document defines the following new `exec_act` values for use
in ECT nodes {{I-D.nennemann-wimse-ect}}:

| `exec_act` Value | Description |
|-----------------|-------------|
| `circuit_breaker_open` | Circuit breaker transitioned to OPEN state |
| `circuit_breaker_close` | Circuit breaker transitioned to CLOSED state |
| `checkpoint` | State snapshot before consequential action |
| `rollback_start` | Rollback initiated for a checkpoint |
| `rollback_complete` | Rollback finished (with status) |
| `compensate` | Compensating action executed in lieu of state restoration |
| `cascade_detected` | Cascading failure pattern detected |
{: #fig-exec-act-values title="New exec_act Values"}

This document defines the following new `ext` claims for failure
context:

| Claim | Type | Description |
|-------|------|-------------|
| `cascade.downstream_agent` | string | SPIFFE ID of the downstream agent |
| `cascade.error_rate` | number | Error rate that triggered the circuit breaker |
| `cascade.window_s` | number | Sliding window duration in seconds |
| `cascade.cooldown_s` | number | Cooldown duration in seconds |
| `cascade.reversible` | boolean | Whether the checkpointed action can be undone |
| `cascade.rollback_uri` | string | URI for rollback requests |
| `cascade.target` | string | Target system of the checkpointed action |
| `cascade.ttl` | number | Checkpoint time-to-live in seconds |
| `cascade.rollback_id` | string | Unique identifier for a rollback operation |
| `cascade.checkpoint_id` | string | JTI of the checkpoint being rolled back |
| `cascade.scope` | string | Rollback scope: single, sub_dag, full_workflow |
| `cascade.status` | string | Rollback result status |
| `cascade.reason` | string | Human-readable reason for the action |
| `cascade.pattern` | string | Detected cascade pattern type |
| `cascade.affected_agents` | number | Count of agents affected by cascade |
| `cascade.blast_radius` | array | SPIFFE IDs of affected agents |
| `cascade.cascaded` | array | Per-agent rollback results |
| `cascade.failed_agents` | array | Agents that could not be rolled back |
| `cascade.state_hash_before` | string | State hash before rollback |
| `cascade.state_hash_after` | string | State hash after rollback |
| `cascade.description` | string | Human-readable description |
{: #fig-ext-claims title="New ext Claims for Cascade Prevention"}

# Security Considerations

## Rollback Weaponization

Malicious agents could attempt to force unnecessary rollbacks to
disrupt workflows. Mitigations:

1. Rollback requests MUST be authenticated via the ECT signature
   chain. Only agents whose ECTs appear in the same workflow DAG
   (identified by `wid`) are authorized to request rollback.

2. Rollback requests from outside the originating workflow MUST be
   rejected with HTTP 403.

3. Agents SHOULD implement rate limiting on rollback requests to
   prevent denial-of-service through rollback flooding.

4. The two-phase rollback protocol provides a prepare phase where
   agents can validate the rollback request before committing.

## Circuit Breaker Manipulation

An adversary could attempt to manipulate circuit breaker state to
either prevent legitimate circuit breaking or force unnecessary
circuit breaks:

1. **False error injection**: A malicious agent could emit false
   error ECTs to trigger circuit breakers. At L2/L3
   {{I-D.nennemann-wimse-ect}}, ECT signatures prevent forgery.
   Agents SHOULD verify that error ECTs reference valid `par`
   values within their own workflow DAG.

2. **Circuit breaker suppression**: An adversary could attempt to
   reset circuit breakers by sending successful probe responses.
   Agents MUST only accept probe responses from the actual
   downstream agent (verified via ECT identity binding).

3. **Status endpoint abuse**: The `/.well-known/cascade/circuits`
   endpoint reveals system health topology. This endpoint MUST
   require authentication and SHOULD be restricted to agents within
   the same administrative domain.

## Checkpoint Integrity

Checkpoint state snapshots contain sensitive system state. Agents
MUST:

1. Encrypt stored checkpoint state at rest.
2. Reference checkpoint state in ECTs via `out_hash` only; checkpoint
   contents MUST NOT be included in ECT claims.
3. Verify `out_hash` integrity before executing rollback to prevent
   rollback to a tampered state.
4. Enforce checkpoint storage quotas to prevent checkpoint flooding
   attacks.
5. Purge expired checkpoints (past `cascade.ttl`).

# IANA Considerations

## Registration of exec_act Values

This document requests registration of the following `exec_act`
values in the ECT exec_act registry:

| Value | Description | Reference |
|-------|-------------|-----------|
| `circuit_breaker_open` | Circuit breaker transitioned to OPEN | This document |
| `circuit_breaker_close` | Circuit breaker transitioned to CLOSED | This document |
| `checkpoint` | State snapshot before consequential action | This document |
| `rollback_start` | Rollback operation initiated | This document |
| `rollback_complete` | Rollback operation finished | This document |
| `compensate` | Compensating action executed | This document |
| `cascade_detected` | Cascading failure pattern detected | This document |
{: #fig-iana-exec-act title="exec_act Value Registrations"}

## Registration of ext Claims

This document requests registration of the `ext` claims listed in
{{fig-ext-claims}} in the ECT extension claims registry. All claims
use the `cascade.` namespace prefix.

## Well-Known URI Registration

This document requests registration of the following well-known URI
suffixes per {{RFC9110}}:

| URI Suffix | Description | Reference |
|------------|-------------|-----------|
| `cascade/circuits` | Circuit breaker status | This document |
| `cascade/rollback` | Rollback request endpoint | This document |
| `cascade/rollback/prepare` | Rollback prepare endpoint | This document |
| `cascade/checkpoints` | Checkpoint retrieval | This document |
{: #fig-iana-uris title="Well-Known URI Registrations"}

--- back

# Acknowledgments
{:numbered="false"}

This document absorbs and supersedes concepts from the earlier Agent
Error Recovery and Rollback (AERR) and Agent Task DAG (ATD) proposals.
It builds on the Execution Context Token specification
{{I-D.nennemann-wimse-ect}} for DAG-based audit trails and the Agent
Context Policy Token {{I-D.nennemann-agent-dag-hitl-safety}} for HITL
escalation of irreversible actions. The circuit breaker pattern is
adapted from microservice architecture best practices.