Workgroup: NMOP
Internet-Draft: draft-nennemann-agent-cascade-prevention-00
Published: March 2026
Intended Status: Standards Track
Expires: 7 September 2026
Author: C. Nennemann, Independent Researcher

Agent Failure Cascade Prevention and Rollback

Abstract

This document defines protocols for preventing agent failures from cascading across interconnected autonomous systems and standardized mechanisms for real-time rollback of incorrect agent decisions. It specifies a circuit breaker protocol with well-defined state transitions, failure domain isolation through bulkhead patterns, cascade detection via error rate and latency analysis, and a distributed rollback coordination protocol that walks the Execution Context Token (ECT) DAG backwards to revert agent actions to a known-good state. This document absorbs and supersedes the concepts introduced in the earlier Agent Error Recovery and Rollback (AERR) and Agent Task DAG (ATD) proposals.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 7 September 2026.

Table of Contents

1.  Introduction
2.  Terminology
3.  Failure Cascade Prevention
    3.1.  Cascade Model
    3.2.  Circuit Breaker Protocol
    3.3.  Failure Domain Isolation
    3.4.  Cascade Detection
4.  Real-Time Rollback
    4.1.  Rollback Model
    4.2.  Checkpoint Protocol
    4.3.  Distributed Rollback Coordination
    4.4.  Rollback Evidence
5.  ECT Integration
6.  Security Considerations
7.  IANA Considerations
8.  References
    8.1.  Normative References
    8.2.  Informative References
Acknowledgments
Author's Address

1. Introduction

Autonomous AI agents increasingly operate in interconnected multi-agent systems where a single agent's failure can propagate through the network, causing widespread service disruption. The IETF gap analysis [I-D.nennemann-agent-gap-analysis] identified two critical gaps in existing standards: the lack of a mechanism to prevent agent failures from cascading across interconnected systems, and the lack of a standardized mechanism for rolling back incorrect agent decisions in real time.

This document addresses both gaps by defining:

  1. A circuit breaker protocol that stops failure propagation between agents.

  2. Failure domain isolation mechanisms that contain blast radius.

  3. Cascade detection signals that identify propagating failures early.

  4. A distributed rollback protocol that coordinates state reversion across multiple agents using the ECT DAG [I-D.nennemann-wimse-ect].

This specification absorbs and supersedes the concepts from the earlier Agent Error Recovery and Rollback (AERR) and Agent Task DAG (ATD) proposals, consolidating cascade prevention and rollback into a single coherent protocol built on ECT infrastructure.

Design principles:

  1. Agents that take consequential actions MUST be able to undo them, or MUST declare them irreversible upfront.

  2. Failure containment takes priority over failure diagnosis.

  3. The protocol adds minimal overhead to the happy path.

  4. All cascade prevention and rollback actions are recorded as ECT nodes, providing a cryptographic audit trail.

2. Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

Circuit Breaker:

A mechanism that stops an agent from propagating requests to a failing downstream agent, preventing cascading failures. Modeled after the electrical circuit breaker, a pattern widely used in microservice architectures.

Failure Domain:

A bounded set of agents and resources within which a failure is contained. Failures within a domain MUST NOT propagate beyond the domain boundary without explicit escalation.

Blast Radius:

The set of agents and systems affected by a single agent's failure, determinable by traversing the ECT DAG forward from the failing node.

Cascade Detection:

The process of identifying that a failure is propagating across agent boundaries, using signals such as error rate spikes, latency increases, and resource exhaustion patterns.

Rollback Coordinator:

An agent or orchestrator responsible for coordinating distributed rollback across multiple agents in a workflow, ensuring consistency and resolving conflicts.

Checkpoint:

An ECT node recording an agent's state hash before a consequential action, providing a restore point for rollback.

Compensating Action:

An action that semantically reverses the effect of a prior action when direct state restoration is not possible (e.g., deleting a resource that was created, rather than restoring a pre-creation snapshot).

Recovery Point:

The most recent checkpoint in the ECT DAG to which an agent or workflow can be safely rolled back without violating consistency constraints.

3. Failure Cascade Prevention

3.1. Cascade Model

When an agent fails in a multi-agent system, the failure can propagate through multiple vectors. The following diagram illustrates a typical cascade scenario:

  Agent A          Agent B          Agent C          Agent D
     |                |                |                |
     | request        |                |                |
     |--------------->|                |                |
     |                | request        |                |
     |                |--------------->|                |
     |                |                | request        |
     |                |                |--------------->|
     |                |                |                |
     |                |                |    FAILURE     |
     |                |                |<--- X ---------|
     |                |                |                |
     |                |  error/timeout |                |
     |                |<---------------|                |
     |                |                |                |
     |  error/timeout |                |                |
     |<---------------|                |                |
     |                |                |                |
     | [CASCADE: all agents impacted by D's failure]    |
     |                |                |                |
Figure 1: Failure Cascade Propagation

3.1.1. Failure Domain Taxonomy

Failures in agent ecosystems fall into the following categories:

Agent-Local Failure:

A failure confined to a single agent instance (e.g., out-of-memory, logic error). The blast radius is limited to the agent itself and its immediate callers.

Service Failure:

A failure affecting all instances of a particular agent service (e.g., model endpoint unavailable). The blast radius includes all agents that depend on the failing service.

Infrastructure Failure:

A failure in shared infrastructure (e.g., network partition, certificate authority unavailable). The blast radius may span multiple failure domains.

Semantic Failure:

An agent produces incorrect output without raising an error (e.g., misconfiguration, wrong decision). This is the hardest category to detect and may propagate silently through the DAG.

3.1.2. Propagation Vectors in Agent Ecosystems

Failures propagate through the following vectors:

  1. Synchronous request chains: An agent blocks waiting for a failing downstream agent, causing its own callers to time out.

  2. Shared state corruption: An agent writes incorrect data to a shared store, causing other agents reading that data to fail or make incorrect decisions.

  3. Resource exhaustion: A failing agent consumes excessive resources (connections, memory, compute), starving healthy agents.

  4. Retry amplification: Multiple agents retry requests to a failing agent simultaneously, overwhelming it further.

3.2. Circuit Breaker Protocol

Each agent MUST implement a circuit breaker for every downstream agent it communicates with.

3.2.1. States

The circuit breaker has three states: CLOSED, OPEN, and HALF_OPEN. CLOSED appears twice below to distinguish normal operation from operation after recovery:

CLOSED (normal):

Requests flow through normally. The agent tracks the error rate over a sliding window (default: 60 seconds).

OPEN (failure detected):

When the error rate exceeds the configured threshold (default: 50% over the window), the breaker opens. All requests to the downstream agent are immediately rejected locally. The agent MUST emit an ECT with exec_act value "circuit_breaker_open".

HALF_OPEN (recovery probe):

After a cooldown period (default: 30 seconds), the breaker transitions to HALF_OPEN and allows a single probe request. If the probe succeeds, the breaker returns to CLOSED. If the probe fails, the breaker returns to OPEN with doubled cooldown (exponential backoff, maximum 300 seconds).

CLOSED (recovered):

When a probe succeeds in the HALF_OPEN state, the breaker returns to CLOSED and the agent MUST emit an ECT with exec_act value "circuit_breaker_close".

3.2.2. State Transition Rules

                error_rate > threshold
  CLOSED ────────────────────────────────► OPEN
    ▲                                        │
    │  probe succeeds                        │ cooldown expires
    │                                        ▼
    └──────────────────────────────── HALF_OPEN
                                         │
                             probe fails │
                                         ▼
                                       OPEN
                                 (cooldown *= 2,
                                  max 300s)
Figure 2: Circuit Breaker State Machine

The following rules govern state transitions:

  1. CLOSED to OPEN: The error rate over the sliding window exceeds the configured threshold. The agent MUST emit a "circuit_breaker_open" ECT and reject all subsequent requests to the downstream agent.

  2. OPEN to HALF_OPEN: The cooldown timer expires. The agent MUST allow exactly one probe request through.

  3. HALF_OPEN to CLOSED: The probe request succeeds. The agent MUST emit a "circuit_breaker_close" ECT and resume normal operation. The error rate counters MUST be reset.

  4. HALF_OPEN to OPEN: The probe request fails. The cooldown period MUST be doubled (up to a maximum of 300 seconds).
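
The following non-normative Python sketch illustrates these transition rules. The class and method names are illustrative, and the points at which "circuit_breaker_open" and "circuit_breaker_close" ECTs would be emitted are marked in comments:

import time

class CircuitBreaker:
    """Per-downstream-agent circuit breaker (non-normative sketch)."""

    def __init__(self, threshold=0.5, window_s=60, cooldown_s=30,
                 max_cooldown_s=300):
        self.threshold = threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.max_cooldown_s = max_cooldown_s
        self.state = "CLOSED"
        self.opened_at = None
        self.events = []   # (timestamp, ok) samples in the sliding window

    def _error_rate(self, now):
        # Drop samples that have aged out of the sliding window.
        self.events = [(t, ok) for t, ok in self.events
                       if now - t <= self.window_s]
        if not self.events:
            return 0.0
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / len(self.events)

    def allow_request(self, now=None):
        now = time.monotonic() if now is None else now
        if self.state == "OPEN":
            if now - self.opened_at >= self.cooldown_s:
                self.state = "HALF_OPEN"
                return True        # rule 2: exactly one probe request
            return False           # rule 1: reject locally while OPEN
        if self.state == "HALF_OPEN":
            return False           # the single probe is already in flight
        return True                # CLOSED: normal flow

    def record(self, ok, now=None):
        now = time.monotonic() if now is None else now
        if self.state == "HALF_OPEN":
            if ok:                 # rule 3: emit circuit_breaker_close ECT
                self.state = "CLOSED"
                self.events = []   # reset error rate counters
            else:                  # rule 4: exponential backoff
                self.cooldown_s = min(self.cooldown_s * 2,
                                      self.max_cooldown_s)
                self.state, self.opened_at = "OPEN", now
            return
        self.events.append((now, ok))
        if self.state == "CLOSED" and self._error_rate(now) > self.threshold:
            self.state, self.opened_at = "OPEN", now
            # rule 1: emit circuit_breaker_open ECT here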

3.2.3. Circuit Breaker Registration and Discovery

Agents MUST expose circuit breaker state at a well-known endpoint:

GET /.well-known/cascade/circuits HTTP/1.1

Response:

{
  "circuits": [
    {
      "downstream_agent": "spiffe://example.com/agent/router-mgr",
      "state": "open",
      "error_rate": 0.75,
      "window_s": 60,
      "last_failure_ect": "550e8400-e29b-41d4-a716-446655440099",
      "cooldown_remaining_s": 22
    }
  ]
}
Figure 3: Circuit Breaker Status Endpoint

3.2.4. ECT Integration

Each circuit breaker state change MUST produce an ECT node:

{
  "jti": "cb-open-uuid",
  "exec_act": "circuit_breaker_open",
  "par": ["error-ect-uuid"],
  "ext": {
    "cascade.downstream_agent":
      "spiffe://example.com/agent/router-mgr",
    "cascade.error_rate": 0.75,
    "cascade.window_s": 60,
    "cascade.cooldown_s": 30
  }
}
Figure 4: Circuit Breaker Open ECT

{
  "jti": "cb-close-uuid",
  "exec_act": "circuit_breaker_close",
  "par": ["cb-open-uuid"],
  "ext": {
    "cascade.downstream_agent":
      "spiffe://example.com/agent/router-mgr",
    "cascade.total_cooldown_s": 30
  }
}
Figure 5: Circuit Breaker Close ECT

3.3. Failure Domain Isolation

3.3.1. Blast Radius Containment Strategies

Agents MUST implement the following containment strategies:

  1. Request rejection at the boundary: When a circuit breaker opens, the agent MUST return a structured error to its callers indicating that the downstream dependency is unavailable, rather than propagating the failure.

  2. Timeout enforcement: Agents MUST enforce timeouts on all downstream requests. The timeout MUST be shorter than the caller's timeout to prevent timeout cascades (see the sketch after this list).

  3. Graceful degradation: When a non-critical downstream agent is unavailable, agents SHOULD continue operating with reduced functionality rather than failing entirely.
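
As a non-normative illustration of the timeout rule in item 2, a downstream timeout can be derived from the caller's remaining deadline minus a safety margin; the margin value below is an example, not a value mandated by this document:

def downstream_timeout(caller_timeout_s: float,
                       margin_s: float = 0.5) -> float:
    """Return a timeout strictly shorter than the caller's, leaving the
    agent time to return a structured error before its own caller's
    deadline expires (non-normative sketch)."""
    budget = caller_timeout_s - margin_s
    if budget <= 0:
        raise TimeoutError("insufficient time budget for downstream call")
    return budget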

3.3.2. Domain Boundary Enforcement

Failure domains are defined by the workflow topology in the ECT DAG. Each workflow (identified by the wid claim) constitutes a failure domain. Cross-workflow failures MUST be escalated through the human-in-the-loop (HITL) mechanism [I-D.nennemann-agent-dag-hitl-safety] rather than propagating automatically.

Agents at domain boundaries MUST:

  1. Validate all incoming requests against the circuit breaker state of their downstream dependencies before accepting work.

  2. Emit a "circuit_breaker_open" ECT when rejecting work due to downstream unavailability.

  3. Report domain health status via the circuits endpoint.

3.3.3. Bulkhead Patterns for Agent Pools

When multiple workflows share a common agent pool, the pool MUST implement bulkhead isolation:

  1. Connection limits: Each workflow MUST have a maximum number of concurrent connections to the shared agent pool.

  2. Queue isolation: Each workflow's requests MUST be queued independently, preventing one workflow's backlog from blocking others.

  3. Resource quotas: Shared agent pools SHOULD enforce per-workflow resource quotas (CPU, memory, request rate).
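
The following non-normative sketch illustrates bulkhead isolation for a shared pool, assuming an asyncio-based implementation; class names and limits are illustrative:

import asyncio

class Bulkhead:
    """Per-workflow connection cap and independent queue (sketch)."""

    def __init__(self, max_concurrent=8, max_queued=64):
        self.slots = asyncio.Semaphore(max_concurrent)
        self.backlog = asyncio.Queue(maxsize=max_queued)

class BulkheadedPool:
    def __init__(self):
        self.by_workflow = {}   # wid -> Bulkhead

    async def submit(self, wid, handler, request):
        bh = self.by_workflow.setdefault(wid, Bulkhead())
        try:
            # A full backlog rejects only this workflow's requests;
            # other workflows' queues are unaffected.
            bh.backlog.put_nowait(request)
        except asyncio.QueueFull:
            raise RuntimeError(f"workflow {wid} backlog full")
        async with bh.slots:    # per-workflow concurrency cap
            try:
                return await handler(request)
            finally:
                bh.backlog.get_nowait()   # release the backlog slot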

3.4. Cascade Detection

3.4.1. Detection Signals

Agents MUST monitor the following signals for cascade detection:

Error Rate:

The ratio of failed requests to total requests over a sliding window. An error rate exceeding the circuit breaker threshold indicates a potential cascade.

Latency Spike:

A sudden increase in response latency (e.g., p99 latency exceeding 3x the baseline) indicates downstream congestion or failure. Agents SHOULD track latency baselines using exponentially weighted moving averages (see the sketch after this list).

Resource Exhaustion:

Thread pool saturation, connection pool exhaustion, or memory pressure above configured thresholds indicates that a cascade is consuming resources.
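
The following non-normative sketch illustrates the latency-spike signal, tracking a baseline with an exponentially weighted moving average; the smoothing factor and the 3x spike multiplier are illustrative defaults:

class LatencySpikeDetector:
    """EWMA latency baseline with a spike test (non-normative sketch)."""

    def __init__(self, alpha=0.1, spike_factor=3.0):
        self.alpha = alpha                # EWMA smoothing factor
        self.spike_factor = spike_factor
        self.baseline = None              # seeded by the first sample

    def observe(self, latency_s: float) -> bool:
        """Return True if this sample constitutes a latency spike."""
        if self.baseline is None:
            self.baseline = latency_s
            return False
        spike = latency_s > self.spike_factor * self.baseline
        # Update the baseline only from non-spike samples so a cascade
        # in progress does not drag the baseline upward.
        if not spike:
            self.baseline = (self.alpha * latency_s
                             + (1 - self.alpha) * self.baseline)
        return spike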

3.4.2. Propagation Tracking via ECT DAG Analysis

Orchestrators SHOULD analyze the ECT DAG to detect cascading patterns:

  1. Error clustering: Multiple "circuit_breaker_open" ECTs referencing the same downstream agent within a short window indicate a shared dependency failure.

  2. Depth-first propagation: Errors propagating along par chains in the DAG indicate a synchronous cascade.

  3. Breadth-first propagation: Multiple sibling nodes in the DAG failing concurrently indicate a shared infrastructure failure.
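
The following non-normative sketch illustrates the error-clustering check in item 1, modeling ECTs as plain dictionaries carrying exec_act, an iat timestamp [RFC7519], and ext claims:

from collections import Counter

def shared_dependency_failures(ects, now, window_s=60, min_opens=3):
    """Return downstream agents named by at least min_opens recent
    circuit_breaker_open ECTs -- a shared-dependency failure signal
    (non-normative sketch)."""
    opens = Counter(
        e["ext"]["cascade.downstream_agent"]
        for e in ects
        if e.get("exec_act") == "circuit_breaker_open"
        and now - e.get("iat", now) <= window_s
    )
    return [agent for agent, count in opens.items() if count >= min_opens]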

3.4.3. Alert Format and Escalation

When cascade detection identifies a propagating failure, the detecting agent MUST emit a cascade alert ECT:

{
  "exec_act": "cascade_detected",
  "ext": {
    "cascade.pattern": "depth_first",
    "cascade.affected_agents": 4,
    "cascade.root_cause_ect": "error-ect-uuid",
    "cascade.blast_radius": [
      "spiffe://example.com/agent/a",
      "spiffe://example.com/agent/b",
      "spiffe://example.com/agent/c"
    ]
  }
}
Figure 6: Cascade Alert ECT

Cascade alerts with more than 3 affected agents SHOULD trigger HITL escalation per [I-D.nennemann-agent-dag-hitl-safety].

4. Real-Time Rollback

4.1. Rollback Model

Rollback reverses the effects of agent actions by walking the ECT DAG backwards from the point of failure to the nearest valid recovery point.

4.1.1. Walking the ECT DAG Backwards

The rollback process traverses par references to determine what must be reverted, and in what order:

  1. Identify the failing ECT node.

  2. Find the checkpoint ECT associated with the failing action (referenced via par).

  3. Traverse the DAG forward from the checkpoint, visiting every node whose par chain leads back to it, to identify all downstream actions that were caused by the checkpointed action.

  4. Issue rollback requests to each affected agent in reverse topological order.

  Checkpoint A ──► Action A1 ──► Checkpoint B ──► Action B1
                                      │
                                      └──► Action B2

  Rollback order: B2, B1, B, A1, A (reverse topological)
Figure 7: Rollback Order via DAG Traversal
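
The following non-normative sketch computes the rollback order of Figure 7 from par references, modeled as a mapping from each node's jti to the list of jtis in its par claim:

def rollback_order(par, root):
    """Return the descendants of `root` (inclusive) in reverse
    topological order, most-downstream actions first (non-normative
    sketch). `par` maps each ECT jti to the jtis in its par claim."""
    # Invert par edges to obtain parent -> children links.
    children = {}
    for node, parents in par.items():
        for p in parents:
            children.setdefault(p, []).append(node)

    order, seen = [], set()

    def visit(node):   # post-order DFS: children before their parent
        if node in seen:
            return
        seen.add(node)
        for child in children.get(node, []):
            visit(child)
        order.append(node)

    visit(root)
    return order       # Figure 7: [B1, B2, B, A1, A] (siblings in
                       # either order)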

4.1.2. Compensating Actions vs State Restoration

Rollback can be performed through two mechanisms:

State Restoration:

The agent restores its state from the checkpoint snapshot. This is the preferred mechanism when the checkpoint contains a complete state snapshot (verified via out_hash).

Compensating Action:

When state restoration is not possible (e.g., the action involved an external API call), the agent executes a compensating action that semantically reverses the original action. Compensating actions MUST be recorded as ECT nodes with exec_act value "compensate".

4.1.3. Rollback Scope

Rollback can be scoped to three levels:

Single Agent:

Only the specified agent's checkpoint is rolled back. No downstream propagation occurs.

Sub-DAG:

The checkpoint and all downstream checkpoints in the sub-DAG are rolled back. This is the default scope when a cascading rollback is requested.

Full Workflow:

All checkpoints in the workflow are rolled back and the workflow is terminated. This requires Rollback Coordinator authorization.

4.2. Checkpoint Protocol

4.2.1. Checkpoint Creation

An agent MUST create a checkpoint ECT before any consequential action. An action is consequential if it modifies external state (network configuration, database records, API calls with side effects).

A checkpoint is an ECT with:

  • exec_act: "checkpoint"

  • par: the ECT of the action being checkpointed

  • out_hash: SHA-256 hash of the agent's state snapshot

{
  "jti": "ckpt-uuid",
  "exec_act": "checkpoint",
  "par": ["action-ect-uuid"],
  "out_hash": "sha256:...",
  "ext": {
    "cascade.reversible": true,
    "cascade.rollback_uri":
      "https://agent-b.example.com/.well-known/cascade/rollback",
    "cascade.target": "router-07.example.com",
    "cascade.description": "Update BGP peer configuration",
    "cascade.ttl": 86400
  }
}
Figure 8: Checkpoint ECT

The cascade.reversible field MUST be present. If false, the agent declares that this action cannot be automatically undone and rollback requests MUST be escalated to a human operator via the HITL mechanism [I-D.nennemann-agent-dag-hitl-safety].
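
The following non-normative sketch assembles the claims of such a checkpoint ECT using standard-library hashing and UUID helpers; signing the result per [I-D.nennemann-wimse-ect] is elided:

import hashlib
import uuid

def make_checkpoint(action_jti: str, state_snapshot: bytes,
                    rollback_uri: str, target: str, description: str,
                    reversible: bool = True, ttl_s: int = 86400) -> dict:
    """Build the claims of a checkpoint ECT (non-normative sketch;
    the result must still be signed as an ECT before use)."""
    digest = hashlib.sha256(state_snapshot).hexdigest()
    return {
        "jti": str(uuid.uuid4()),
        "exec_act": "checkpoint",
        "par": [action_jti],
        "out_hash": f"sha256:{digest}",
        "ext": {
            "cascade.reversible": reversible,   # MUST be present
            "cascade.rollback_uri": rollback_uri,
            "cascade.target": target,
            "cascade.description": description,
            "cascade.ttl": ttl_s,
        },
    }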

4.2.2. Checkpoint Storage and Retrieval

Checkpoint ECTs MUST be stored for at least the duration specified by cascade.ttl. Agents MUST store checkpoints in durable storage that survives agent restarts.

Agents MUST expose a checkpoint retrieval endpoint:

GET /.well-known/cascade/checkpoints/{jti} HTTP/1.1

The response MUST include the checkpoint ECT and its verification status (whether out_hash matches the current stored state snapshot).

4.2.3. Checkpoint Verification

Before executing a rollback, the agent MUST verify the checkpoint integrity:

  1. Retrieve the checkpoint ECT.

  2. Verify the ECT signature chain (L2/L3).

  3. Verify that the stored state snapshot matches out_hash.

  4. Verify that the checkpoint has not expired (cascade.ttl).

If verification fails, the agent MUST reject the rollback request and emit an error ECT.
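
The following non-normative sketch illustrates these verification steps; signature-chain verification is delegated to a caller-supplied function, since it is defined by [I-D.nennemann-wimse-ect] rather than by this document:

import hashlib
import time

def verify_checkpoint(ckpt: dict, stored_snapshot: bytes,
                      created_at: float, verify_signature_chain) -> None:
    """Raise ValueError if the checkpoint must not be used for rollback
    (non-normative sketch)."""
    # Steps 1-2: retrieve (done by the caller) and verify the ECT
    # signature chain (L2/L3).
    if not verify_signature_chain(ckpt):
        raise ValueError("ECT signature chain invalid")
    # Step 3: the stored state snapshot must match out_hash.
    digest = "sha256:" + hashlib.sha256(stored_snapshot).hexdigest()
    if digest != ckpt["out_hash"]:
        raise ValueError("state snapshot does not match out_hash")
    # Step 4: the checkpoint must not be past its TTL.
    if time.time() - created_at > ckpt["ext"]["cascade.ttl"]:
        raise ValueError("checkpoint expired (cascade.ttl)")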

4.3. Distributed Rollback Coordination

4.3.1. Rollback Coordinator Role

For rollbacks spanning multiple agents (sub-DAG or full workflow scope), a Rollback Coordinator MUST be designated. The coordinator is typically the orchestrator or the agent that initiated the workflow.

The coordinator is responsible for:

  1. Computing the blast radius by traversing the ECT DAG.

  2. Determining rollback order (reverse topological sort).

  3. Issuing rollback requests to each affected agent.

  4. Tracking rollback progress and handling failures.

  5. Emitting the final rollback completion ECT.

4.3.2. Two-Phase Rollback Protocol

Distributed rollback follows a two-phase protocol:

Phase 1: Prepare

The coordinator sends a prepare request to each affected agent:

POST /.well-known/cascade/rollback/prepare HTTP/1.1
Content-Type: application/json
Execution-Context: <prepare-ect>

{
  "rollback_id": "urn:uuid:...",
  "checkpoint_id": "ckpt-uuid",
  "scope": "sub_dag"
}
Figure 9: Rollback Prepare Request

Each agent MUST respond with either:

  • "prepared": The agent has verified its checkpoint and is ready to roll back.

  • "cannot_prepare": The agent cannot roll back (e.g., checkpoint expired, irreversible action).

Phase 2: Execute

If all agents respond "prepared", the coordinator sends execute requests in reverse topological order:

POST /.well-known/cascade/rollback HTTP/1.1
Content-Type: application/json
Execution-Context: <rollback-ect>

{
  "rollback_id": "urn:uuid:...",
  "checkpoint_id": "ckpt-uuid",
  "phase": "execute"
}
Figure 10: Rollback Execute Request

If any agent responds "cannot_prepare" in Phase 1, the coordinator MUST either:

  • Proceed with partial rollback (if the unprepared agent is not on the critical path), or

  • Abort the rollback and escalate to HITL.
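
The following non-normative sketch shows a coordinator driving both phases; each agent object is assumed to wrap the HTTP endpoints of Figures 9 and 10, and the critical-path test is supplied by the caller:

import uuid

def coordinate_rollback(agents, checkpoint_id, scope="sub_dag",
                        is_critical=lambda agent: True):
    """Two-phase rollback over `agents`, which are given in reverse
    topological order (non-normative sketch)."""
    rollback_id = f"urn:uuid:{uuid.uuid4()}"
    req = {"rollback_id": rollback_id,
           "checkpoint_id": checkpoint_id,
           "scope": scope}

    # Phase 1: Prepare -- collect per-agent readiness.
    prepared, refused = [], []
    for agent in agents:
        (prepared if agent.prepare(req) == "prepared"
         else refused).append(agent)

    # A refusal on the critical path aborts the rollback; escalation
    # to HITL would follow.
    if any(is_critical(a) for a in refused):
        return {"rollback_id": rollback_id, "status": "escalated"}

    # Phase 2: Execute, prepared agents only, already in reverse
    # topological order.
    for agent in prepared:
        agent.execute({**req, "phase": "execute"})

    return {"rollback_id": rollback_id,
            "status": "completed" if not refused else "partial"}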

4.3.3. Partial Rollback Handling

When a distributed rollback cannot be completed fully, the coordinator MUST:

  1. Roll back all agents that responded "prepared".

  2. Record the partial rollback result in the ECT DAG.

  3. Emit an ECT with exec_act value "rollback_complete" and cascade.status set to "partial".

  4. Include the list of agents that could not be rolled back in the cascade.failed_agents extension claim.

4.3.4. Conflict Resolution During Concurrent Rollbacks

When multiple rollback requests target overlapping portions of the ECT DAG:

  1. The rollback with the broader scope takes precedence (full workflow > sub-DAG > single agent).

  2. If scopes are equal, the earlier rollback request (by timestamp) takes precedence.

  3. The losing rollback request MUST be rejected with an error indicating the conflicting rollback ID.

Agents MUST implement idempotent rollback: receiving the same rollback_id twice MUST return the same result without re-executing the rollback.
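
The following non-normative sketch illustrates the precedence rule and the idempotency requirement; requests are modeled as dictionaries with rollback_id, scope, and ts (timestamp) fields, and a production implementation would persist the result cache durably:

SCOPE_RANK = {"full_workflow": 3, "sub_dag": 2, "single": 1}

def winning_rollback(a, b):
    """Pick which of two overlapping rollback requests proceeds:
    broader scope wins; on a scope tie, the earlier timestamp wins
    (non-normative sketch)."""
    key_a = (-SCOPE_RANK[a["scope"]], a["ts"])
    key_b = (-SCOPE_RANK[b["scope"]], b["ts"])
    return a if key_a <= key_b else b

_results = {}   # rollback_id -> cached result

def handle_rollback(req, do_rollback):
    """Idempotent execute handler: a repeated rollback_id returns the
    cached result without re-executing the rollback."""
    rid = req["rollback_id"]
    if rid not in _results:
        _results[rid] = do_rollback(req)
    return _results[rid]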

4.4. Rollback Evidence

4.4.1. ECT Nodes for Rollback Actions

Each rollback action MUST produce ECT nodes for audit:

Rollback Start:

exec_act: "rollback_start", par references the error ECT that triggered the rollback.

{
  "jti": "rb-start-uuid",
  "exec_act": "rollback_start",
  "par": ["error-ect-uuid"],
  "ext": {
    "cascade.rollback_id": "urn:uuid:...",
    "cascade.checkpoint_id": "ckpt-uuid",
    "cascade.scope": "sub_dag",
    "cascade.reason": "Upstream cascading failure"
  }
}
Figure 11: Rollback Start ECT

Rollback Complete:

exec_act: "rollback_complete", par references the rollback start ECT.

{
  "jti": "rb-complete-uuid",
  "exec_act": "rollback_complete",
  "par": ["rb-start-uuid"],
  "out_hash": "sha256:...",
  "ext": {
    "cascade.rollback_id": "urn:uuid:...",
    "cascade.status": "completed",
    "cascade.state_hash_before": "sha256:...",
    "cascade.state_hash_after": "sha256:...",
    "cascade.cascaded": [
      {
        "agent": "spiffe://example.com/agent/monitor",
        "status": "completed"
      },
      {
        "agent": "spiffe://example.com/agent/classify",
        "status": "escalated"
      }
    ]
  }
}
Figure 12: Rollback Complete ECT

4.4.2. Rollback Audit Trail

The complete rollback audit trail is captured in the ECT DAG:

  error ECT
     │
     ▼
  rollback_start ECT
     │
     ├──► agent-A rollback_complete ECT
     │
     ├──► agent-B rollback_complete ECT
     │
     └──► agent-C compensate ECT
Figure 13: Rollback Audit Trail in ECT DAG

Status values for individual agent rollbacks: completed, partial, escalated, failed.

5. ECT Integration

This document defines the following new exec_act values for use in ECT nodes [I-D.nennemann-wimse-ect]:

Table 1: New exec_act Values

+-----------------------+----------------------------------------------+
| exec_act Value        | Description                                  |
+-----------------------+----------------------------------------------+
| circuit_breaker_open  | Circuit breaker transitioned to OPEN state   |
| circuit_breaker_close | Circuit breaker transitioned to CLOSED state |
| checkpoint            | State snapshot before consequential action   |
| rollback_start        | Rollback initiated for a checkpoint          |
| rollback_complete     | Rollback finished (with status)              |
| compensate            | Compensating action executed in lieu of      |
|                       | state restoration                            |
| cascade_detected      | Cascading failure pattern detected           |
+-----------------------+----------------------------------------------+

This document defines the following new ext claims for failure context:

Table 2: New ext Claims for Cascade Prevention

+----------------------------+---------+---------------------------------+
| Claim                      | Type    | Description                     |
+----------------------------+---------+---------------------------------+
| cascade.downstream_agent   | string  | SPIFFE ID of the downstream     |
|                            |         | agent                           |
| cascade.error_rate         | number  | Error rate that triggered the   |
|                            |         | circuit breaker                 |
| cascade.window_s           | number  | Sliding window duration in      |
|                            |         | seconds                         |
| cascade.cooldown_s         | number  | Cooldown duration in seconds    |
| cascade.reversible         | boolean | Whether the checkpointed action |
|                            |         | can be undone                   |
| cascade.rollback_uri       | string  | URI for rollback requests       |
| cascade.target             | string  | Target system of the            |
|                            |         | checkpointed action             |
| cascade.ttl                | number  | Checkpoint time-to-live in      |
|                            |         | seconds                         |
| cascade.rollback_id        | string  | Unique identifier for a         |
|                            |         | rollback operation              |
| cascade.checkpoint_id      | string  | JTI of the checkpoint being     |
|                            |         | rolled back                     |
| cascade.scope              | string  | Rollback scope: single,         |
|                            |         | sub_dag, full_workflow          |
| cascade.status             | string  | Rollback result status          |
| cascade.reason             | string  | Human-readable reason for the   |
|                            |         | action                          |
| cascade.pattern            | string  | Detected cascade pattern type   |
| cascade.affected_agents    | number  | Count of agents affected by     |
|                            |         | cascade                         |
| cascade.blast_radius       | array   | SPIFFE IDs of affected agents   |
| cascade.cascaded           | array   | Per-agent rollback results      |
| cascade.failed_agents      | array   | Agents that could not be rolled |
|                            |         | back                            |
| cascade.state_hash_before  | string  | State hash before rollback      |
| cascade.state_hash_after   | string  | State hash after rollback       |
| cascade.description        | string  | Human-readable description      |
+----------------------------+---------+---------------------------------+

6. Security Considerations

6.1. Rollback Weaponization

Malicious agents could attempt to force unnecessary rollbacks to disrupt workflows. Mitigations:

  1. Rollback requests MUST be authenticated via the ECT signature chain. Only agents whose ECTs appear in the same workflow DAG (identified by wid) are authorized to request rollback.

  2. Rollback requests from outside the originating workflow MUST be rejected with HTTP 403.

  3. Agents SHOULD implement rate limiting on rollback requests to prevent denial-of-service through rollback flooding.

  4. The two-phase rollback protocol provides a prepare phase where agents can validate the rollback request before committing.

6.2. Circuit Breaker Manipulation

An adversary could attempt to manipulate circuit breaker state to either prevent legitimate circuit breaking or force unnecessary circuit breaks:

  1. False error injection: A malicious agent could emit false error ECTs to trigger circuit breakers. At L2/L3 [I-D.nennemann-wimse-ect], ECT signatures prevent forgery. Agents SHOULD verify that error ECTs reference valid par values within their own workflow DAG.

  2. Circuit breaker suppression: An adversary could attempt to reset circuit breakers by sending successful probe responses. Agents MUST only accept probe responses from the actual downstream agent (verified via ECT identity binding).

  3. Status endpoint abuse: The /.well-known/cascade/circuits endpoint reveals system health topology. This endpoint MUST require authentication and SHOULD be restricted to agents within the same administrative domain.

6.3. Checkpoint Integrity

Checkpoint state snapshots contain sensitive system state. Agents MUST:

  1. Encrypt stored checkpoint state at rest.

  2. Reference checkpoint state in ECTs via out_hash only; checkpoint contents MUST NOT be included in ECT claims.

  3. Verify out_hash integrity before executing rollback to prevent rollback to a tampered state.

  4. Enforce checkpoint storage quotas to prevent checkpoint flooding attacks.

  5. Purge expired checkpoints (past cascade.ttl).

7. IANA Considerations

7.1. Registration of exec_act Values

This document requests registration of the following exec_act values in the ECT exec_act registry:

Table 3: exec_act Value Registrations

+-----------------------+---------------------------------+---------------+
| Value                 | Description                     | Reference     |
+-----------------------+---------------------------------+---------------+
| circuit_breaker_open  | Circuit breaker transitioned to | This document |
|                       | OPEN                            |               |
| circuit_breaker_close | Circuit breaker transitioned to | This document |
|                       | CLOSED                          |               |
| checkpoint            | State snapshot before           | This document |
|                       | consequential action            |               |
| rollback_start        | Rollback operation initiated    | This document |
| rollback_complete     | Rollback operation finished     | This document |
| compensate            | Compensating action executed    | This document |
| cascade_detected      | Cascading failure pattern       | This document |
|                       | detected                        |               |
+-----------------------+---------------------------------+---------------+

7.2. Registration of ext Claims

This document requests registration of the ext claims listed in Table 2 in the ECT extension claims registry. All claims use the cascade. namespace prefix.

7.3. Well-Known URI Registration

This document requests registration of the following well-known URI suffixes per [RFC9110]:

Table 4: Well-Known URI Registrations

+--------------------------+---------------------------+---------------+
| URI Suffix               | Description               | Reference     |
+--------------------------+---------------------------+---------------+
| cascade/circuits         | Circuit breaker status    | This document |
| cascade/rollback         | Rollback request endpoint | This document |
| cascade/rollback/prepare | Rollback prepare endpoint | This document |
| cascade/checkpoints      | Checkpoint retrieval      | This document |
+--------------------------+---------------------------+---------------+

8. References

8.1. Normative References

[I-D.nennemann-agent-dag-hitl-safety]
"Agent Context Policy Token: DAG Delegation with Human Override", Work in Progress, <https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/>.
[I-D.nennemann-wimse-ect]
"Execution Context Tokens for Distributed Agentic Workflows", Work in Progress, <https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/>.
[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/rfc/rfc2119>.
[RFC7515]
Jones, M., Bradley, J., and N. Sakimura, "JSON Web Signature (JWS)", RFC 7515, DOI 10.17487/RFC7515, May 2015, <https://www.rfc-editor.org/rfc/rfc7515>.
[RFC7519]
Jones, M., Bradley, J., and N. Sakimura, "JSON Web Token (JWT)", RFC 7519, DOI 10.17487/RFC7519, May 2015, <https://www.rfc-editor.org/rfc/rfc7519>.
[RFC8174]
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017, <https://www.rfc-editor.org/rfc/rfc8174>.
[RFC9110]
Fielding, R., Ed., Nottingham, M., Ed., and J. Reschke, Ed., "HTTP Semantics", STD 97, RFC 9110, DOI 10.17487/RFC9110, June 2022, <https://www.rfc-editor.org/rfc/rfc9110>.

8.2. Informative References

[I-D.nennemann-agent-gap-analysis]
"Gap Analysis of IETF Standards for Autonomous AI Agent Networking", Work in Progress, <https://datatracker.ietf.org/doc/draft-nennemann-agent-gap-analysis/>.

Acknowledgments

This document absorbs and supersedes concepts from the earlier Agent Error Recovery and Rollback (AERR) and Agent Task DAG (ATD) proposals. It builds on the Execution Context Token specification [I-D.nennemann-wimse-ect] for DAG-based audit trails and the Agent Context Policy Token [I-D.nennemann-agent-dag-hitl-safety] for HITL escalation of irreversible actions. The circuit breaker pattern is adapted from microservice architecture best practices.

Author's Address

Christian Nennemann
Independent Researcher