---
title: "Agent Failure Cascade Prevention and Rollback"
abbrev: "Agent Cascade Prevention"
category: std
docname: draft-nennemann-agent-cascade-prevention-00
submissiontype: IETF
number:
date:
v: 3
area: "OPS"
workgroup: "NMOP"
keyword:
 - cascade prevention
 - circuit breaker
 - rollback
 - failure domain
 - agent recovery

author:
 -
    fullname: Christian Nennemann
    organization: Independent Researcher
    email: ietf@nennemann.de

normative:
  RFC2119:
  RFC8174:
  RFC7519:
  RFC7515:
  RFC9110:
  I-D.nennemann-wimse-ect:
    title: "Execution Context Tokens for Distributed Agentic Workflows"
    target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/
  I-D.nennemann-agent-dag-hitl-safety:
    title: "Agent Context Policy Token: DAG Delegation with Human Override"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/

informative:
  I-D.nennemann-agent-gap-analysis:
    title: "Gap Analysis of IETF Standards for Autonomous AI Agent Networking"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-gap-analysis/

--- abstract

This document defines protocols for preventing agent failures from
cascading across interconnected autonomous systems, together with
standardized mechanisms for real-time rollback of incorrect agent
decisions. It specifies a circuit breaker protocol with well-defined
state transitions, failure domain isolation through bulkhead patterns,
cascade detection via error rate and latency analysis, and a
distributed rollback coordination protocol that walks the Execution
Context Token (ECT) DAG backwards to revert agent actions to a
known-good state. This document absorbs and supersedes the concepts
introduced in the earlier AERR and ATD proposals.

--- middle

# Introduction

Autonomous AI agents increasingly operate in interconnected
multi-agent systems where a single agent's failure can propagate
through the network, causing widespread service disruption. The IETF
gap analysis {{I-D.nennemann-agent-gap-analysis}} identified two
critical gaps in existing standards:

- **Gap 2 (Cascade Prevention)**: No standard mechanism exists for
  containing failures within agent ecosystems. When one agent fails,
  dependent agents continue sending requests to the failing agent,
  amplifying the failure across the system.

- **Gap 4 (Rollback)**: No standard protocol exists for reverting
  incorrect agent decisions. When an autonomous agent misconfigures
  a network device or makes an erroneous API call, there is no
  interoperable way to undo the action or coordinate rollback across
  multiple affected agents.

This document addresses both gaps by defining:

1. A circuit breaker protocol that stops failure propagation between
   agents.
2. Failure domain isolation mechanisms that contain the blast radius.
3. Cascade detection signals that identify propagating failures early.
4. A distributed rollback protocol that coordinates state reversion
   across multiple agents using the ECT DAG
   {{I-D.nennemann-wimse-ect}}.

This specification absorbs and supersedes the concepts from the
earlier Agent Error Recovery and Rollback (AERR) and Agent Task DAG
(ATD) proposals, consolidating cascade prevention and rollback into a
single coherent protocol built on ECT infrastructure.

Design principles:

1. Agents that take consequential actions MUST be able to undo them,
   or MUST declare them irreversible upfront.
2. Failure containment takes priority over failure diagnosis.
3. The protocol adds minimal overhead to the happy path.
4. All cascade prevention and rollback actions are recorded as ECT
   nodes, providing a cryptographic audit trail.

# Terminology

{::boilerplate bcp14-tagged}

Circuit Breaker:
: A mechanism that stops an agent from propagating requests to a
  failing downstream agent, preventing cascading failures. Modeled
  after the electrical circuit breaker pattern used in microservice
  architectures.

Failure Domain:
: A bounded set of agents and resources within which a failure is
  contained. Failures within a domain MUST NOT propagate beyond the
  domain boundary without explicit escalation.

Blast Radius:
: The set of agents and systems affected by a single agent's failure,
  determinable by traversing the ECT DAG forward from the failing
  node.

Cascade Detection:
: The process of identifying that a failure is propagating across
  agent boundaries, using signals such as error rate spikes, latency
  increases, and resource exhaustion patterns.

Rollback Coordinator:
: An agent or orchestrator responsible for coordinating distributed
  rollback across multiple agents in a workflow, ensuring consistency
  and resolving conflicts.

Checkpoint:
: An ECT node recording an agent's state hash before a consequential
  action, providing a restore point for rollback.

Compensating Action:
: An action that semantically reverses the effect of a prior action
  when direct state restoration is not possible (e.g., deleting a
  resource that was created, rather than restoring a pre-creation
  snapshot).

Recovery Point:
: The most recent checkpoint in the ECT DAG to which an agent or
  workflow can be safely rolled back without violating consistency
  constraints.

# Failure Cascade Prevention

## Cascade Model

When an agent fails in a multi-agent system, the failure can
propagate through multiple vectors. The following diagram
illustrates a typical cascade scenario:

~~~
Agent A          Agent B          Agent C          Agent D
   |                |                |                |
   |    request     |                |                |
   |--------------->|                |                |
   |                |    request     |                |
   |                |--------------->|                |
   |                |                |    request     |
   |                |                |--------------->|
   |                |                |                |
   |                |                |    FAILURE     |
   |                |                |<--- X ---------|
   |                |                |                |
   |                | error/timeout  |                |
   |                |<---------------|                |
   |                |                |                |
   | error/timeout  |                |                |
   |<---------------|                |                |
   |                |                |                |
   |  [CASCADE: all agents impacted by D's failure]   |
   |                |                |                |
~~~
{: #fig-cascade title="Failure Cascade Propagation"}

### Failure Domain Taxonomy

Failures in agent ecosystems fall into the following categories:

Agent-Local Failure:
: A failure confined to a single agent instance (e.g., out-of-memory,
  logic error). The blast radius is limited to the agent itself and
  its immediate callers.

Service Failure:
: A failure affecting all instances of a particular agent service
  (e.g., model endpoint unavailable). The blast radius includes all
  agents that depend on the failing service.

Infrastructure Failure:
: A failure in shared infrastructure (e.g., network partition,
  certificate authority unavailable). The blast radius may span
  multiple failure domains.

Semantic Failure:
: An agent produces incorrect output without raising an error (e.g.,
  misconfiguration, wrong decision). This is the hardest category
  to detect and may propagate silently through the DAG.

### Propagation Vectors in Agent Ecosystems

Failures propagate through the following vectors:

1. **Synchronous request chains**: An agent blocks waiting for a
   failing downstream agent, causing its own callers to time out.

2. **Shared state corruption**: An agent writes incorrect data to a
   shared store, causing other agents reading that data to fail or
   make incorrect decisions.

3. **Resource exhaustion**: A failing agent consumes excessive
   resources (connections, memory, compute), starving healthy agents.

4. **Retry amplification**: Multiple agents retry requests to a
   failing agent simultaneously, overwhelming it further.

## Circuit Breaker Protocol

Each agent MUST implement a circuit breaker for every downstream
agent it communicates with.

### States

The circuit breaker has three states:

CLOSED (normal operation):
: Requests flow through normally. The agent tracks the error rate
  over a sliding window (default: 60 seconds).

OPEN (failure detected):
: When the error rate exceeds the configured threshold (default: 50%
  over the window), the breaker opens. All requests to the
  downstream agent are immediately rejected locally. The agent
  MUST emit an ECT with `exec_act` value `"circuit_breaker_open"`.

HALF_OPEN (recovery probe):
: After a cooldown period (default: 30 seconds), the breaker
  transitions to HALF_OPEN and allows a single probe request. If
  the probe succeeds, the breaker returns to CLOSED. If the probe
  fails, the breaker returns to OPEN with doubled cooldown
  (exponential backoff, maximum 300 seconds).

When a probe succeeds in the HALF_OPEN state, the breaker returns
to CLOSED and the agent MUST emit an ECT with `exec_act` value
`"circuit_breaker_close"`.

### State Transition Rules

~~~
            error_rate > threshold
 CLOSED ────────────────────────────────► OPEN
    ▲                                       │
    │ probe succeeds                        │ cooldown expires
    │                                       ▼
    └─────────────────────────────────  HALF_OPEN
                                            │
                              probe fails   │
                                            ▼
                                          OPEN
                                    (cooldown *= 2,
                                     max 300s)
~~~
{: #fig-circuit-fsm title="Circuit Breaker State Machine"}

The following rules govern state transitions:

1. CLOSED to OPEN: The error rate over the sliding window exceeds
   the configured threshold. The agent MUST emit a
   `"circuit_breaker_open"` ECT and reject all subsequent requests
   to the downstream agent.

2. OPEN to HALF_OPEN: The cooldown timer expires. The agent MUST
   allow exactly one probe request through.

3. HALF_OPEN to CLOSED: The probe request succeeds. The agent MUST
   emit a `"circuit_breaker_close"` ECT and resume normal operation.
   The error rate counters MUST be reset.

4. HALF_OPEN to OPEN: The probe request fails. The cooldown period
   MUST be doubled (up to a maximum of 300 seconds).

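The state machine above can be sketched in code. The following
non-normative Python sketch uses this section's defaults (50%
threshold, 60-second window, 30-second cooldown doubling to a
300-second cap); class and method names are illustrative, and the
points where `"circuit_breaker_open"` / `"circuit_breaker_close"`
ECTs would be emitted are marked with comments.

```python
import time

class CircuitBreaker:
    """Per-downstream-agent breaker implementing the CLOSED / OPEN /
    HALF_OPEN transitions described above (illustrative sketch)."""

    def __init__(self, threshold=0.5, window_s=60,
                 cooldown_s=30, max_cooldown_s=300):
        self.threshold = threshold
        self.window_s = window_s
        self.base_cooldown_s = cooldown_s
        self.cooldown_s = cooldown_s
        self.max_cooldown_s = max_cooldown_s
        self.state = "CLOSED"
        self.opened_at = None
        self.events = []  # (timestamp, ok) samples inside the window

    def _error_rate(self, now):
        # Drop samples that fell out of the sliding window.
        self.events = [(t, ok) for (t, ok) in self.events
                       if now - t <= self.window_s]
        if not self.events:
            return 0.0
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / len(self.events)

    def allow_request(self, now=None):
        now = time.monotonic() if now is None else now
        if self.state == "OPEN" and now - self.opened_at >= self.cooldown_s:
            self.state = "HALF_OPEN"
            return True          # OPEN -> HALF_OPEN: the single probe
        return self.state == "CLOSED"

    def record(self, ok, now=None):
        now = time.monotonic() if now is None else now
        if self.state == "HALF_OPEN":
            if ok:
                # HALF_OPEN -> CLOSED: emit "circuit_breaker_close" ECT
                self.state = "CLOSED"
                self.cooldown_s = self.base_cooldown_s
                self.events = []  # reset error-rate counters
            else:
                # HALF_OPEN -> OPEN: double the cooldown, capped at 300s
                self.state = "OPEN"
                self.opened_at = now
                self.cooldown_s = min(self.cooldown_s * 2,
                                      self.max_cooldown_s)
            return
        self.events.append((now, ok))
        if self.state == "CLOSED" and self._error_rate(now) > self.threshold:
            # CLOSED -> OPEN: emit "circuit_breaker_open" ECT
            self.state = "OPEN"
            self.opened_at = now
```

An error rate of exactly 50% does not trip the breaker; the rate must
exceed the threshold, matching rule 1 above.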
### Circuit Breaker Registration and Discovery

Agents MUST expose circuit breaker state at a well-known endpoint:

~~~
GET /.well-known/cascade/circuits HTTP/1.1
Host: agent.example.com
~~~

Response:

~~~json
{
  "circuits": [
    {
      "downstream_agent": "spiffe://example.com/agent/router-mgr",
      "state": "open",
      "error_rate": 0.75,
      "window_s": 60,
      "last_failure_ect": "550e8400-e29b-41d4-a716-446655440099",
      "cooldown_remaining_s": 22
    }
  ]
}
~~~
{: #fig-circuits title="Circuit Breaker Status Endpoint"}

### ECT Integration

Each circuit breaker state change MUST produce an ECT node:

~~~json
{
  "jti": "cb-open-uuid",
  "exec_act": "circuit_breaker_open",
  "par": ["error-ect-uuid"],
  "ext": {
    "cascade.downstream_agent":
      "spiffe://example.com/agent/router-mgr",
    "cascade.error_rate": 0.75,
    "cascade.window_s": 60,
    "cascade.cooldown_s": 30
  }
}
~~~
{: #fig-cb-ect title="Circuit Breaker Open ECT"}

~~~json
{
  "jti": "cb-close-uuid",
  "exec_act": "circuit_breaker_close",
  "par": ["cb-open-uuid"],
  "ext": {
    "cascade.downstream_agent":
      "spiffe://example.com/agent/router-mgr",
    "cascade.total_cooldown_s": 30
  }
}
~~~
{: #fig-cb-close-ect title="Circuit Breaker Close ECT"}

## Failure Domain Isolation

### Blast Radius Containment Strategies

Agents MUST implement the following containment strategies:

1. **Request rejection at the boundary**: When a circuit breaker
   opens, the agent MUST return a structured error to its callers
   indicating that the downstream dependency is unavailable, rather
   than propagating the failure.

2. **Timeout enforcement**: Agents MUST enforce timeouts on all
   downstream requests. The timeout MUST be shorter than the
   caller's timeout to prevent timeout cascades.

3. **Graceful degradation**: When a non-critical downstream agent
   is unavailable, agents SHOULD continue operating with reduced
   functionality rather than failing entirely.

### Domain Boundary Enforcement

Failure domains are defined by the workflow topology in the ECT DAG.
Each workflow (identified by the `wid` claim) constitutes a failure
domain. Cross-workflow failures MUST be escalated through the HITL
mechanism {{I-D.nennemann-agent-dag-hitl-safety}} rather than
propagating automatically.

Agents at domain boundaries MUST:

1. Validate all incoming requests against the circuit breaker state
   of their downstream dependencies before accepting work.
2. Emit a `"circuit_breaker_open"` ECT when rejecting work due to
   downstream unavailability.
3. Report domain health status via the circuits endpoint.

### Bulkhead Patterns for Agent Pools

When multiple workflows share a common agent pool, the pool MUST
implement bulkhead isolation:

1. **Connection limits**: Each workflow MUST have a maximum number
   of concurrent connections to the shared agent pool.

2. **Queue isolation**: Each workflow's requests MUST be queued
   independently, preventing one workflow's backlog from blocking
   others.

3. **Resource quotas**: Shared agent pools SHOULD enforce per-workflow
   resource quotas (CPU, memory, request rate).

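The connection-limit and queue-isolation requirements above can be
sketched as follows. This non-normative Python sketch keys both a
concurrency limit and a bounded queue on the workflow identifier
(`wid`); the class name, limits, and `submit` interface are
illustrative assumptions, not part of the protocol.

```python
import threading
import queue

class WorkflowBulkhead:
    """Per-workflow isolation for a shared agent pool: a bounded
    semaphore caps concurrent connections, and each workflow gets
    its own bounded queue so one workflow's backlog cannot block
    another's (illustrative sketch)."""

    def __init__(self, max_concurrent=4, max_queued=16):
        self.max_concurrent = max_concurrent
        self.max_queued = max_queued
        self._slots = {}    # wid -> BoundedSemaphore (connection limit)
        self._queues = {}   # wid -> bounded Queue (queue isolation)
        self._lock = threading.Lock()

    def _for_wid(self, wid):
        with self._lock:
            if wid not in self._slots:
                self._slots[wid] = threading.BoundedSemaphore(
                    self.max_concurrent)
                self._queues[wid] = queue.Queue(maxsize=self.max_queued)
            return self._slots[wid], self._queues[wid]

    def submit(self, wid, request):
        """Queue a request for one workflow; a full queue sheds load
        for that workflow only, never for its neighbours."""
        _, q = self._for_wid(wid)
        try:
            q.put_nowait(request)
            return True
        except queue.Full:
            return False
```

Rejecting `submit` when one workflow's queue is full, while another
workflow's requests still succeed, is precisely the isolation property
required by item 2 above.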
## Cascade Detection

### Detection Signals

Agents MUST monitor the following signals for cascade detection:

Error Rate:
: The ratio of failed requests to total requests over a sliding
  window. An error rate exceeding the circuit breaker threshold
  indicates a potential cascade.

Latency Spike:
: A sudden increase in response latency (e.g., p99 latency exceeding
  3x the baseline) indicates downstream congestion or failure.
  Agents SHOULD track latency baselines using exponentially weighted
  moving averages.

Resource Exhaustion:
: Thread pool saturation, connection pool exhaustion, or memory
  pressure above configured thresholds indicates that a cascade is
  consuming resources.

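The latency-spike signal above can be sketched with an exponentially
weighted moving average. This non-normative sketch uses a 3x spike
factor matching the guidance above; the class name and the smoothing
constant `alpha` are illustrative assumptions.

```python
class LatencySpikeDetector:
    """Tracks a latency baseline with an exponentially weighted
    moving average (EWMA) and flags samples exceeding `factor` times
    the baseline (illustrative sketch)."""

    def __init__(self, alpha=0.2, factor=3.0):
        self.alpha = alpha        # EWMA smoothing constant
        self.factor = factor      # spike threshold multiplier
        self.baseline = None

    def observe(self, latency_s):
        if self.baseline is None:
            self.baseline = latency_s   # first sample seeds the baseline
            return False
        spike = latency_s > self.factor * self.baseline
        # Update the baseline only from non-spike samples so that an
        # ongoing cascade does not drag the baseline upwards.
        if not spike:
            self.baseline = ((1 - self.alpha) * self.baseline
                             + self.alpha * latency_s)
        return spike
```

Excluding spike samples from the baseline update is a deliberate
choice: a sustained cascade would otherwise raise the baseline until
the condition stops firing.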
### Propagation Tracking via ECT DAG Analysis

Orchestrators SHOULD analyze the ECT DAG to detect cascading
patterns:

1. **Error clustering**: Multiple `"circuit_breaker_open"` ECTs
   referencing the same downstream agent within a short window
   indicate a shared dependency failure.

2. **Depth-first propagation**: Errors propagating along `par`
   chains in the DAG indicate a synchronous cascade.

3. **Breadth-first propagation**: Multiple sibling nodes in the
   DAG failing concurrently indicate a shared infrastructure
   failure.

### Alert Format and Escalation

When cascade detection identifies a propagating failure, the
detecting agent MUST emit a cascade alert ECT:

~~~json
{
  "exec_act": "cascade_detected",
  "ext": {
    "cascade.pattern": "depth_first",
    "cascade.affected_agents": 4,
    "cascade.root_cause_ect": "error-ect-uuid",
    "cascade.blast_radius": [
      "spiffe://example.com/agent/a",
      "spiffe://example.com/agent/b",
      "spiffe://example.com/agent/c"
    ]
  }
}
~~~
{: #fig-cascade-alert title="Cascade Alert ECT"}

Cascade alerts with more than three affected agents SHOULD trigger
HITL escalation per {{I-D.nennemann-agent-dag-hitl-safety}}.

# Real-Time Rollback

## Rollback Model

Rollback reverses the effects of agent actions by walking the ECT
DAG backwards from the point of failure to the nearest valid
recovery point.

### Walking the ECT DAG Backwards

The rollback process uses `par` references to locate the recovery
point and the affected sub-DAG:

1. Identify the failing ECT node.
2. Find the checkpoint ECT associated with the failing action
   (referenced via `par`).
3. Traverse the DAG forward from that checkpoint (i.e., follow
   `par` references in the reverse, child direction) to identify
   all downstream actions that were caused by the checkpointed
   action.
4. Issue rollback requests to each affected agent in reverse
   topological order.

~~~
Checkpoint A ──► Action A1 ──► Checkpoint B ──► Action B1
                                     │
                                     └──► Action B2

Rollback order: B2, B1, B, A1, A (reverse topological)
~~~
{: #fig-rollback-order title="Rollback Order via DAG Traversal"}

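The traversal above amounts to a reverse topological sort of the
affected sub-DAG. A minimal non-normative sketch using Python's
standard `graphlib`, with the DAG from the figure encoded in the
parent-to-child direction (the reverse of `par` references); the
node names are illustrative:

```python
from graphlib import TopologicalSorter

def rollback_order(children):
    """Return nodes in reverse topological order, so that every
    action is rolled back before anything it was caused by.
    `children` maps each node to the nodes it caused (the reverse
    of their `par` references)."""
    ts = TopologicalSorter()
    for node, kids in children.items():
        for kid in kids:
            ts.add(kid, node)   # kid depends on node
        ts.add(node)
    # static_order() yields parents first; reverse it for rollback.
    return list(ts.static_order())[::-1]

# The DAG from the figure:
#   Checkpoint A -> Action A1 -> Checkpoint B -> {Action B1, Action B2}
dag = {
    "ckpt-A": ["act-A1"],
    "act-A1": ["ckpt-B"],
    "ckpt-B": ["act-B1", "act-B2"],
}
order = rollback_order(dag)
```

The result matches the figure's "B2, B1, B, A1, A" ordering up to the
arbitrary ordering of the sibling leaves B1 and B2, which have no
dependency between them.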
### Compensating Actions vs State Restoration

Rollback can be performed through two mechanisms:

State Restoration:
: The agent restores its state from the checkpoint snapshot. This
  is the preferred mechanism when the checkpoint contains a complete
  state snapshot (verified via `out_hash`).

Compensating Action:
: When state restoration is not possible (e.g., the action involved
  an external API call), the agent executes a compensating action
  that semantically reverses the original action. Compensating
  actions MUST be recorded as ECT nodes with `exec_act` value
  `"compensate"`.

### Rollback Scope

Rollback can be scoped to three levels:

Single Agent:
: Only the specified agent's checkpoint is rolled back. No
  downstream propagation occurs.

Sub-DAG:
: The checkpoint and all downstream checkpoints in the sub-DAG
  are rolled back. This is the default when `cascade` is `true`.

Full Workflow:
: All checkpoints in the workflow are rolled back and the workflow
  is terminated. This requires Rollback Coordinator authorization.

## Checkpoint Protocol

### Checkpoint Creation

An agent MUST create a checkpoint ECT before any consequential
action. An action is consequential if it modifies external state
(network configuration, database records, API calls with side
effects).

A checkpoint is an ECT with:

- `exec_act`: `"checkpoint"`
- `par`: the ECT of the action being checkpointed
- `out_hash`: SHA-256 hash of the agent's state snapshot

~~~json
{
  "jti": "ckpt-uuid",
  "exec_act": "checkpoint",
  "par": ["action-ect-uuid"],
  "out_hash": "sha256:...",
  "ext": {
    "cascade.reversible": true,
    "cascade.rollback_uri":
      "https://agent-b.example.com/.well-known/cascade/rollback",
    "cascade.target": "router-07.example.com",
    "cascade.description": "Update BGP peer configuration",
    "cascade.ttl": 86400
  }
}
~~~
{: #fig-checkpoint title="Checkpoint ECT"}

The `cascade.reversible` field MUST be present. If `false`, the
agent declares that this action cannot be automatically undone and
rollback requests MUST be escalated to a human operator via the
HITL mechanism {{I-D.nennemann-agent-dag-hitl-safety}}.

### Checkpoint Storage and Retrieval

Checkpoint ECTs MUST be stored for at least the duration specified
by `cascade.ttl`. Agents MUST store checkpoints in durable storage
that survives agent restarts.

Agents MUST expose a checkpoint retrieval endpoint:

~~~
GET /.well-known/cascade/checkpoints/{jti} HTTP/1.1
Host: agent.example.com
~~~

The response MUST include the checkpoint ECT and its verification
status (whether `out_hash` matches the currently stored state
snapshot).

### Checkpoint Verification

Before executing a rollback, the agent MUST verify the checkpoint
integrity:

1. Retrieve the checkpoint ECT.
2. Verify the ECT signature chain (L2/L3).
3. Verify that the stored state snapshot matches `out_hash`.
4. Verify that the checkpoint has not expired (`cascade.ttl`).

If verification fails, the agent MUST reject the rollback request
and emit an error ECT.

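Steps 3 and 4 above can be sketched as follows; step 2,
signature-chain verification, is out of scope for this sketch. The
`issued_at` field is an illustrative stand-in for the checkpoint's
issuance timestamp, and the function name is an assumption, not part
of the protocol.

```python
import hashlib
import time

def verify_checkpoint(ect, snapshot_bytes, now=None):
    """Non-normative sketch of checkpoint verification steps 3-4:
    compare the stored snapshot against `out_hash`, then check
    `cascade.ttl` expiry relative to the (assumed) issuance time."""
    now = time.time() if now is None else now
    # Step 3: the snapshot must hash to exactly out_hash.
    digest = "sha256:" + hashlib.sha256(snapshot_bytes).hexdigest()
    if digest != ect["out_hash"]:
        return False, "snapshot does not match out_hash"
    # Step 4: the checkpoint must not have outlived its TTL.
    if now - ect["issued_at"] > ect["ext"]["cascade.ttl"]:
        return False, "checkpoint expired"
    return True, "ok"
```

On either failure the caller would reject the rollback request and
emit an error ECT, per the paragraph above.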
## Distributed Rollback Coordination

### Rollback Coordinator Role

For rollbacks spanning multiple agents (sub-DAG or full workflow
scope), a Rollback Coordinator MUST be designated. The coordinator
is typically the orchestrator or the agent that initiated the
workflow.

The coordinator is responsible for:

1. Computing the blast radius by traversing the ECT DAG.
2. Determining the rollback order (reverse topological sort).
3. Issuing rollback requests to each affected agent.
4. Tracking rollback progress and handling failures.
5. Emitting the final rollback completion ECT.

### Two-Phase Rollback Protocol

Distributed rollback follows a two-phase protocol:

**Phase 1: Prepare**

The coordinator sends a prepare request to each affected agent:

~~~
POST /.well-known/cascade/rollback/prepare HTTP/1.1
Host: agent-b.example.com
Content-Type: application/json
Execution-Context: <prepare-ect>

{
  "rollback_id": "urn:uuid:...",
  "checkpoint_id": "ckpt-uuid",
  "scope": "sub_dag"
}
~~~
{: #fig-prepare title="Rollback Prepare Request"}

Each agent MUST respond with either:

- `"prepared"`: The agent has verified its checkpoint and is ready
  to roll back.
- `"cannot_prepare"`: The agent cannot roll back (e.g., checkpoint
  expired, irreversible action).

**Phase 2: Execute**

If all agents respond `"prepared"`, the coordinator sends execute
requests in reverse topological order:

~~~
POST /.well-known/cascade/rollback HTTP/1.1
Host: agent-b.example.com
Content-Type: application/json
Execution-Context: <rollback-ect>

{
  "rollback_id": "urn:uuid:...",
  "checkpoint_id": "ckpt-uuid",
  "phase": "execute"
}
~~~
{: #fig-execute title="Rollback Execute Request"}

If any agent responds `"cannot_prepare"` in Phase 1, the
coordinator MUST either:

- Proceed with partial rollback (if the unprepared agent is not
  on the critical path), or
- Abort the rollback and escalate to HITL.

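The coordinator's side of the two-phase exchange can be sketched as
follows. This non-normative sketch abstracts the two HTTP endpoints
above into hypothetical `prepare()` and `execute()` methods; the
function signature and the "aborted"/"completed" return values are
illustrative assumptions.

```python
def coordinate_rollback(agents, rollback_id, checkpoint_ids):
    """Two-phase rollback sketch. `agents` is already ordered in
    reverse topological order; `checkpoint_ids` maps agent name to
    the checkpoint being rolled back."""
    # Phase 1: every affected agent verifies its checkpoint.
    votes = {a.name: a.prepare(rollback_id, checkpoint_ids[a.name])
             for a in agents}
    if any(v == "cannot_prepare" for v in votes.values()):
        # Abort here; a real coordinator could instead fall back to
        # partial rollback or HITL escalation, as described above.
        return "aborted", votes
    # Phase 2: execute in reverse topological order.
    for a in agents:
        a.execute(rollback_id, checkpoint_ids[a.name])
    return "completed", votes
```

Because no `execute()` is issued until every vote is `"prepared"`, a
single expired checkpoint or irreversible action stops the whole
rollback before any state is touched.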
### Partial Rollback Handling

When a distributed rollback cannot be completed fully, the
coordinator MUST:

1. Roll back all agents that responded `"prepared"`.
2. Record the partial rollback result in the ECT DAG.
3. Emit an ECT with `exec_act` value `"rollback_complete"` and
   `cascade.status` set to `"partial"`.
4. Include the list of agents that could not be rolled back in
   the `cascade.failed_agents` extension claim.

### Conflict Resolution During Concurrent Rollbacks

When multiple rollback requests target overlapping portions of the
ECT DAG:

1. The rollback with the broader scope takes precedence (full
   workflow > sub-DAG > single agent).
2. If scopes are equal, the earlier rollback request (by timestamp)
   takes precedence.
3. The losing rollback request MUST be rejected with an error
   indicating the conflicting rollback ID.

Agents MUST implement idempotent rollback: receiving the same
`rollback_id` twice MUST return the same result without
re-executing the rollback.

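The precedence rule and the idempotency requirement above can be
sketched compactly. The scope ranking, the request shape (dicts with
`"scope"` and `"ts"` fields), and the class name are illustrative
assumptions.

```python
# Broader scope outranks narrower, per rule 1 above.
SCOPE_RANK = {"full_workflow": 3, "sub_dag": 2, "single": 1}

def takes_precedence(req_a, req_b):
    """True if rollback request `a` wins over `b`: broader scope
    first; for equal scopes, the earlier timestamp wins (rule 2)."""
    return ((SCOPE_RANK[req_a["scope"]], -req_a["ts"])
            > (SCOPE_RANK[req_b["scope"]], -req_b["ts"]))

class RollbackReceiver:
    """Idempotent rollback: the same rollback_id always returns the
    cached result without re-executing (illustrative sketch)."""

    def __init__(self):
        self._results = {}  # rollback_id -> cached result

    def handle(self, rollback_id, do_rollback):
        if rollback_id not in self._results:
            self._results[rollback_id] = do_rollback()
        return self._results[rollback_id]
```

Caching by `rollback_id` also makes coordinator retries safe: a
duplicated execute request after a lost response replays the stored
result instead of reverting state twice.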
## Rollback Evidence

### ECT Nodes for Rollback Actions

Each rollback action MUST produce ECT nodes for audit:

Rollback Start:
: `exec_act`: `"rollback_start"`; `par` references the error ECT
  that triggered the rollback.

~~~json
{
  "jti": "rb-start-uuid",
  "exec_act": "rollback_start",
  "par": ["error-ect-uuid"],
  "ext": {
    "cascade.rollback_id": "urn:uuid:...",
    "cascade.checkpoint_id": "ckpt-uuid",
    "cascade.scope": "sub_dag",
    "cascade.reason": "Upstream cascading failure"
  }
}
~~~
{: #fig-rb-start title="Rollback Start ECT"}

Rollback Complete:
: `exec_act`: `"rollback_complete"`; `par` references the rollback
  start ECT.

~~~json
{
  "jti": "rb-complete-uuid",
  "exec_act": "rollback_complete",
  "par": ["rb-start-uuid"],
  "out_hash": "sha256:...",
  "ext": {
    "cascade.rollback_id": "urn:uuid:...",
    "cascade.status": "completed",
    "cascade.state_hash_before": "sha256:...",
    "cascade.state_hash_after": "sha256:...",
    "cascade.cascaded": [
      {
        "agent": "spiffe://example.com/agent/monitor",
        "status": "completed"
      },
      {
        "agent": "spiffe://example.com/agent/classify",
        "status": "escalated"
      }
    ]
  }
}
~~~
{: #fig-rb-complete title="Rollback Complete ECT"}

### Rollback Audit Trail

The complete rollback audit trail is captured in the ECT DAG:

~~~
error ECT
    │
    ▼
rollback_start ECT
    │
    ├──► agent-A rollback_complete ECT
    │
    ├──► agent-B rollback_complete ECT
    │
    └──► agent-C compensate ECT
~~~
{: #fig-rb-audit title="Rollback Audit Trail in ECT DAG"}

Status values for individual agent rollbacks: `completed`,
`partial`, `escalated`, `failed`.

# ECT Integration

This document defines the following new `exec_act` values for use
in ECT nodes {{I-D.nennemann-wimse-ect}}:

| `exec_act` Value | Description |
|------------------|-------------|
| `circuit_breaker_open` | Circuit breaker transitioned to OPEN state |
| `circuit_breaker_close` | Circuit breaker transitioned to CLOSED state |
| `checkpoint` | State snapshot before consequential action |
| `rollback_start` | Rollback initiated for a checkpoint |
| `rollback_complete` | Rollback finished (with status) |
| `compensate` | Compensating action executed in lieu of state restoration |
| `cascade_detected` | Cascading failure pattern detected |
{: #fig-exec-act-values title="New exec_act Values"}

This document defines the following new `ext` claims for failure
context:

| Claim | Type | Description |
|-------|------|-------------|
| `cascade.downstream_agent` | string | SPIFFE ID of the downstream agent |
| `cascade.error_rate` | number | Error rate that triggered the circuit breaker |
| `cascade.window_s` | number | Sliding window duration in seconds |
| `cascade.cooldown_s` | number | Cooldown duration in seconds |
| `cascade.total_cooldown_s` | number | Total cooldown elapsed before recovery |
| `cascade.reversible` | boolean | Whether the checkpointed action can be undone |
| `cascade.rollback_uri` | string | URI for rollback requests |
| `cascade.target` | string | Target system of the checkpointed action |
| `cascade.ttl` | number | Checkpoint time-to-live in seconds |
| `cascade.rollback_id` | string | Unique identifier for a rollback operation |
| `cascade.checkpoint_id` | string | JTI of the checkpoint being rolled back |
| `cascade.scope` | string | Rollback scope: single, sub_dag, full_workflow |
| `cascade.status` | string | Rollback result status |
| `cascade.reason` | string | Human-readable reason for the action |
| `cascade.pattern` | string | Detected cascade pattern type |
| `cascade.affected_agents` | number | Count of agents affected by cascade |
| `cascade.root_cause_ect` | string | JTI of the root-cause error ECT |
| `cascade.blast_radius` | array | SPIFFE IDs of affected agents |
| `cascade.cascaded` | array | Per-agent rollback results |
| `cascade.failed_agents` | array | Agents that could not be rolled back |
| `cascade.state_hash_before` | string | State hash before rollback |
| `cascade.state_hash_after` | string | State hash after rollback |
| `cascade.description` | string | Human-readable description |
{: #fig-ext-claims title="New ext Claims for Cascade Prevention"}

# Security Considerations

## Rollback Weaponization

Malicious agents could attempt to force unnecessary rollbacks to
disrupt workflows. Mitigations:

1. Rollback requests MUST be authenticated via the ECT signature
   chain. Only agents whose ECTs appear in the same workflow DAG
   (identified by `wid`) are authorized to request rollback.

2. Rollback requests from outside the originating workflow MUST be
   rejected with HTTP 403.

3. Agents SHOULD implement rate limiting on rollback requests to
   prevent denial of service through rollback flooding.

4. The two-phase rollback protocol provides a prepare phase in which
   agents can validate the rollback request before committing.

## Circuit Breaker Manipulation

An adversary could attempt to manipulate circuit breaker state,
either to prevent legitimate circuit breaking or to force
unnecessary circuit breaks:

1. **False error injection**: A malicious agent could emit false
   error ECTs to trigger circuit breakers. At L2/L3
   {{I-D.nennemann-wimse-ect}}, ECT signatures prevent forgery.
   Agents SHOULD verify that error ECTs reference valid `par`
   values within their own workflow DAG.

2. **Circuit breaker suppression**: An adversary could attempt to
   reset circuit breakers by sending successful probe responses.
   Agents MUST only accept probe responses from the actual
   downstream agent (verified via ECT identity binding).

3. **Status endpoint abuse**: The `/.well-known/cascade/circuits`
   endpoint reveals system health topology. This endpoint MUST
   require authentication and SHOULD be restricted to agents within
   the same administrative domain.

## Checkpoint Integrity

Checkpoint state snapshots contain sensitive system state. Agents
MUST:

1. Encrypt stored checkpoint state at rest.
2. Reference checkpoint state in ECTs only via `out_hash`;
   checkpoint contents MUST NOT be included in ECT claims.
3. Verify `out_hash` integrity before executing rollback to prevent
   rollback to a tampered state.
4. Enforce checkpoint storage quotas to prevent checkpoint flooding
   attacks.
5. Purge expired checkpoints (past `cascade.ttl`).

# IANA Considerations

## Registration of exec_act Values

This document requests registration of the following `exec_act`
values in the ECT exec_act registry:

| Value | Description | Reference |
|-------|-------------|-----------|
| `circuit_breaker_open` | Circuit breaker transitioned to OPEN | This document |
| `circuit_breaker_close` | Circuit breaker transitioned to CLOSED | This document |
| `checkpoint` | State snapshot before consequential action | This document |
| `rollback_start` | Rollback operation initiated | This document |
| `rollback_complete` | Rollback operation finished | This document |
| `compensate` | Compensating action executed | This document |
| `cascade_detected` | Cascading failure pattern detected | This document |
{: #fig-iana-exec-act title="exec_act Value Registrations"}

## Registration of ext Claims

This document requests registration of the `ext` claims listed in
{{fig-ext-claims}} in the ECT extension claims registry. All claims
use the `cascade.` namespace prefix.

## Well-Known URI Registration

This document requests registration of the following well-known URI
suffixes per {{RFC9110}}:

| URI Suffix | Description | Reference |
|------------|-------------|-----------|
| `cascade/circuits` | Circuit breaker status | This document |
| `cascade/rollback` | Rollback request endpoint | This document |
| `cascade/rollback/prepare` | Rollback prepare endpoint | This document |
| `cascade/checkpoints` | Checkpoint retrieval | This document |
{: #fig-iana-uris title="Well-Known URI Registrations"}

--- back

# Acknowledgments
{:numbered="false"}

This document absorbs and supersedes concepts from the earlier Agent
Error Recovery and Rollback (AERR) and Agent Task DAG (ATD) proposals.
It builds on the Execution Context Token specification
{{I-D.nennemann-wimse-ect}} for DAG-based audit trails and the Agent
Context Policy Token {{I-D.nennemann-agent-dag-hitl-safety}} for HITL
escalation of irreversible actions. The circuit breaker pattern is
adapted from microservice architecture best practices.