---
title: "Agent Failure Cascade Prevention and Rollback"
abbrev: "Agent Cascade Prevention"
category: std
docname: draft-nennemann-agent-cascade-prevention-00
submissiontype: IETF
number:
date:
v: 3
area: "OPS"
workgroup: "NMOP"
keyword:
 - cascade prevention
 - circuit breaker
 - rollback
 - failure domain
 - agent recovery

author:
 -
    fullname: Christian Nennemann
    organization: Independent Researcher
    email: ietf@nennemann.de

normative:
  RFC2119:
  RFC8174:
  RFC7519:
  RFC7515:
  RFC9110:
  I-D.nennemann-wimse-ect:
    title: "Execution Context Tokens for Distributed Agentic Workflows"
    target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/
  I-D.nennemann-agent-dag-hitl-safety:
    title: "Agent Context Policy Token: DAG Delegation with Human Override"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/

informative:
  I-D.nennemann-agent-gap-analysis:
    title: "Gap Analysis of IETF Standards for Autonomous AI Agent Networking"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-gap-analysis/

--- abstract

This document defines protocols for preventing agent failures from
cascading across interconnected autonomous systems and standardized
mechanisms for real-time rollback of incorrect agent decisions. It
specifies a circuit breaker protocol with well-defined state
transitions, failure domain isolation through bulkhead patterns, cascade
detection via error rate and latency analysis, and a distributed
rollback coordination protocol that walks the Execution Context Token
(ECT) DAG backwards to revert agent actions to a known-good state.
This document absorbs and supersedes the concepts introduced in earlier
AERR and ATD proposals.

--- middle

# Introduction

Autonomous AI agents increasingly operate in interconnected
multi-agent systems where a single agent's failure can propagate
through the network, causing widespread service disruption. The IETF
gap analysis {{I-D.nennemann-agent-gap-analysis}} identified two
critical gaps in existing standards:

- **Gap 2 (Cascade Prevention)**: No standard mechanism exists for
  containing failures within agent ecosystems. When one agent fails,
  dependent agents continue sending requests to the failing agent,
  amplifying the failure across the system.

- **Gap 4 (Rollback)**: No standard protocol exists for reverting
  incorrect agent decisions. When an autonomous agent misconfigures
  a network device or makes an erroneous API call, there is no
  interoperable way to undo the action or coordinate rollback across
  multiple affected agents.

This document addresses both gaps by defining:

1. A circuit breaker protocol that stops failure propagation between
   agents.
2. Failure domain isolation mechanisms that contain blast radius.
3. Cascade detection signals that identify propagating failures early.
4. A distributed rollback protocol that coordinates state reversion
   across multiple agents using the ECT DAG
   {{I-D.nennemann-wimse-ect}}.

This specification absorbs and supersedes the concepts from the earlier
Agent Error Recovery and Rollback (AERR) and Agent Task DAG (ATD)
proposals, consolidating cascade prevention and rollback into a single
coherent protocol built on ECT infrastructure.

Design principles:

1. Agents that take consequential actions MUST be able to undo them,
   or MUST declare them irreversible upfront.
2. Failure containment takes priority over failure diagnosis.
3. The protocol adds minimal overhead to the happy path.
4. All cascade prevention and rollback actions are recorded as ECT
   nodes, providing a cryptographic audit trail.

# Terminology

{::boilerplate bcp14-tagged}

Circuit Breaker:
: A mechanism that stops an agent from propagating requests to a
  failing downstream agent, preventing cascading failures. Modeled
  after the electrical circuit breaker, a resilience pattern widely
  adopted in microservice architectures.

Failure Domain:
: A bounded set of agents and resources within which a failure is
  contained. Failures within a domain MUST NOT propagate beyond the
  domain boundary without explicit escalation.

Blast Radius:
: The set of agents and systems affected by a single agent's failure,
  determinable by traversing the ECT DAG forward from the failing
  node.

Cascade Detection:
: The process of identifying that a failure is propagating across
  agent boundaries, using signals such as error rate spikes, latency
  increases, and resource exhaustion patterns.

Rollback Coordinator:
: An agent or orchestrator responsible for coordinating distributed
  rollback across multiple agents in a workflow, ensuring consistency
  and resolving conflicts.

Checkpoint:
: An ECT node recording an agent's state hash before a consequential
  action, providing a restore point for rollback.

Compensating Action:
: An action that semantically reverses the effect of a prior action
  when direct state restoration is not possible (e.g., deleting a
  resource that was created, rather than restoring a pre-creation
  snapshot).

Recovery Point:
: The most recent checkpoint in the ECT DAG to which an agent or
  workflow can be safely rolled back without violating consistency
  constraints.

# Failure Cascade Prevention

## Cascade Model

When an agent fails in a multi-agent system, the failure can
propagate through multiple vectors. The following diagram
illustrates a typical cascade scenario:

~~~
Agent A          Agent B          Agent C          Agent D
   |                |                |                |
   |    request     |                |                |
   |--------------->|                |                |
   |                |    request     |                |
   |                |--------------->|                |
   |                |                |    request     |
   |                |                |--------------->|
   |                |                |                |
   |                |                |    FAILURE     |
   |                |                |<--- X ---------|
   |                |                |                |
   |                | error/timeout  |                |
   |                |<---------------|                |
   |                |                |                |
   | error/timeout  |                |                |
   |<---------------|                |                |
   |                |                |                |
   |  [CASCADE: all agents impacted by D's failure]   |
   |                |                |                |
~~~
{: #fig-cascade title="Failure Cascade Propagation"}

### Failure Domain Taxonomy

Failures in agent ecosystems fall into the following categories:

Agent-Local Failure:
: A failure confined to a single agent instance (e.g., out-of-memory,
  logic error). The blast radius is limited to the agent itself and
  its immediate callers.

Service Failure:
: A failure affecting all instances of a particular agent service
  (e.g., model endpoint unavailable). The blast radius includes all
  agents that depend on the failing service.

Infrastructure Failure:
: A failure in shared infrastructure (e.g., network partition,
  certificate authority unavailable). The blast radius may span
  multiple failure domains.

Semantic Failure:
: An agent produces incorrect output without raising an error (e.g.,
  misconfiguration, wrong decision). This is the hardest category
  to detect and may propagate silently through the DAG.

### Propagation Vectors in Agent Ecosystems

Failures propagate through the following vectors:

1. **Synchronous request chains**: An agent blocks waiting for a
   failing downstream agent, causing its own callers to time out.

2. **Shared state corruption**: An agent writes incorrect data to a
   shared store, causing other agents reading that data to fail or
   make incorrect decisions.

3. **Resource exhaustion**: A failing agent consumes excessive
   resources (connections, memory, compute), starving healthy agents.

4. **Retry amplification**: Multiple agents retry requests to a
   failing agent simultaneously, overwhelming it further.

## Circuit Breaker Protocol

Each agent MUST implement a circuit breaker for every downstream
agent it communicates with.

### States

The circuit breaker has three states:

CLOSED (normal):
: Requests flow through normally. The agent tracks the error rate
  over a sliding window (default: 60 seconds).

OPEN (failure detected):
: When the error rate exceeds the configured threshold (default: 50%
  over the window), the breaker opens. All requests to the
  downstream agent are immediately rejected locally. The agent
  MUST emit an ECT with `exec_act` value `"circuit_breaker_open"`.

HALF_OPEN (recovery probe):
: After a cooldown period (default: 30 seconds), the breaker
  transitions to HALF_OPEN and allows a single probe request. If
  the probe succeeds, the breaker returns to CLOSED and the agent
  MUST emit an ECT with `exec_act` value `"circuit_breaker_close"`.
  If the probe fails, the breaker returns to OPEN with doubled
  cooldown (exponential backoff, maximum 300 seconds).

### State Transition Rules

~~~
              error_rate > threshold
  CLOSED ────────────────────────────────► OPEN
    ▲                                        │
    │ probe succeeds                         │ cooldown expires
    │                                        ▼
    └──────────────────────────────── HALF_OPEN
                                             │
                                 probe fails │
                                             ▼
                                           OPEN
                                     (cooldown *= 2,
                                      max 300s)
~~~
{: #fig-circuit-fsm title="Circuit Breaker State Machine"}

The following rules govern state transitions:

1. CLOSED to OPEN: The error rate over the sliding window exceeds
   the configured threshold. The agent MUST emit a
   `"circuit_breaker_open"` ECT and reject all subsequent requests
   to the downstream agent.

2. OPEN to HALF_OPEN: The cooldown timer expires. The agent MUST
   allow exactly one probe request through.

3. HALF_OPEN to CLOSED: The probe request succeeds. The agent MUST
   emit a `"circuit_breaker_close"` ECT and resume normal operation.
   The error rate counters MUST be reset.

4. HALF_OPEN to OPEN: The probe request fails. The cooldown period
   MUST be doubled (up to a maximum of 300 seconds).

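The transition rules above can be condensed into a small state machine. The following is a non-normative Python sketch (class and attribute names are illustrative, not part of this protocol); emitting the corresponding ECTs is indicated only as comments:

```python
import time

class CircuitBreaker:
    """Non-normative sketch of the CLOSED/OPEN/HALF_OPEN state machine."""

    def __init__(self, threshold=0.5, window_s=60, cooldown_s=30,
                 max_cooldown_s=300):
        self.threshold = threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.max_cooldown_s = max_cooldown_s
        self.state = "CLOSED"
        self.opened_at = 0.0
        self.samples = []  # (timestamp, ok) pairs inside the sliding window

    def _error_rate(self, now):
        # Drop samples that fell out of the sliding window, then compute
        # failed / total over what remains.
        self.samples = [(t, ok) for t, ok in self.samples
                        if now - t <= self.window_s]
        if not self.samples:
            return 0.0
        return sum(1 for _, ok in self.samples if not ok) / len(self.samples)

    def allow_request(self, now=None):
        now = time.monotonic() if now is None else now
        if self.state == "OPEN":
            if now - self.opened_at >= self.cooldown_s:
                self.state = "HALF_OPEN"  # cooldown expired: admit one probe
                return True
            return False                  # reject locally while OPEN
        if self.state == "HALF_OPEN":
            return False                  # a probe is already in flight
        return True                       # CLOSED

    def record(self, ok, now=None):
        now = time.monotonic() if now is None else now
        if self.state == "HALF_OPEN":
            if ok:
                self.state = "CLOSED"     # emit "circuit_breaker_close" ECT
                self.samples = []         # reset error rate counters
            else:
                self.state = "OPEN"       # probe failed: double the cooldown
                self.opened_at = now
                self.cooldown_s = min(self.cooldown_s * 2,
                                      self.max_cooldown_s)
            return
        self.samples.append((now, ok))
        if self.state == "CLOSED" and self._error_rate(now) > self.threshold:
            self.state = "OPEN"           # emit "circuit_breaker_open" ECT
            self.opened_at = now
```

Note that the sketch admits exactly one probe per HALF_OPEN episode and applies the capped exponential backoff from rule 4; a production breaker would also enforce a minimum request volume before tripping.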
### Circuit Breaker Registration and Discovery

Agents MUST expose circuit breaker state at a well-known endpoint:

~~~
GET /.well-known/cascade/circuits HTTP/1.1
~~~

Response:

~~~json
{
  "circuits": [
    {
      "downstream_agent": "spiffe://example.com/agent/router-mgr",
      "state": "open",
      "error_rate": 0.75,
      "window_s": 60,
      "last_failure_ect": "550e8400-e29b-41d4-a716-446655440099",
      "cooldown_remaining_s": 22
    }
  ]
}
~~~
{: #fig-circuits title="Circuit Breaker Status Endpoint"}

### ECT Integration

Each circuit breaker state change MUST produce an ECT node:

~~~json
{
  "jti": "cb-open-uuid",
  "exec_act": "circuit_breaker_open",
  "par": ["error-ect-uuid"],
  "ext": {
    "cascade.downstream_agent":
      "spiffe://example.com/agent/router-mgr",
    "cascade.error_rate": 0.75,
    "cascade.window_s": 60,
    "cascade.cooldown_s": 30
  }
}
~~~
{: #fig-cb-ect title="Circuit Breaker Open ECT"}

~~~json
{
  "jti": "cb-close-uuid",
  "exec_act": "circuit_breaker_close",
  "par": ["cb-open-uuid"],
  "ext": {
    "cascade.downstream_agent":
      "spiffe://example.com/agent/router-mgr",
    "cascade.total_cooldown_s": 30
  }
}
~~~
{: #fig-cb-close-ect title="Circuit Breaker Close ECT"}

## Failure Domain Isolation

### Blast Radius Containment Strategies

Agents MUST implement the following containment strategies:

1. **Request rejection at the boundary**: When a circuit breaker
   opens, the agent MUST return a structured error to its callers
   indicating that the downstream dependency is unavailable, rather
   than propagating the failure.

2. **Timeout enforcement**: Agents MUST enforce timeouts on all
   downstream requests. The timeout MUST be shorter than the
   caller's timeout to prevent timeout cascades.

3. **Graceful degradation**: When a non-critical downstream agent
   is unavailable, agents SHOULD continue operating with reduced
   functionality rather than failing entirely.

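Strategy 2 amounts to deadline propagation: each hop reserves headroom so it can fail and report before its own caller gives up. A non-normative sketch (the function name and headroom values are illustrative):

```python
def downstream_timeout(caller_timeout_s, headroom_s=0.5, floor_s=0.1):
    """Budget a downstream timeout strictly shorter than the caller's.

    Reserves `headroom_s` for local processing and error handling so a
    downstream timeout is reported to the caller instead of cascading
    into the caller's own timeout.
    """
    budget = caller_timeout_s - headroom_s
    if budget < floor_s:
        # Not enough budget left to call downstream at all: fail fast
        # at this hop rather than propagate a guaranteed timeout.
        raise TimeoutError("insufficient deadline budget for downstream call")
    return budget

# A three-hop chain: each hop gets a strictly smaller timeout.
t_a = 5.0
t_b = downstream_timeout(t_a)  # 4.5
t_c = downstream_timeout(t_b)  # 4.0
```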
### Domain Boundary Enforcement

Failure domains are defined by the workflow topology in the ECT DAG.
Each workflow (identified by the `wid` claim) constitutes a failure
domain. Cross-workflow failures MUST be escalated through the HITL
mechanism {{I-D.nennemann-agent-dag-hitl-safety}} rather than
propagating automatically.

Agents at domain boundaries MUST:

1. Validate all incoming requests against the circuit breaker state
   of their downstream dependencies before accepting work.
2. Emit a `"circuit_breaker_open"` ECT when rejecting work due to
   downstream unavailability.
3. Report domain health status via the circuits endpoint.

### Bulkhead Patterns for Agent Pools

When multiple workflows share a common agent pool, the pool MUST
implement bulkhead isolation:

1. **Connection limits**: Each workflow MUST have a maximum number
   of concurrent connections to the shared agent pool.

2. **Queue isolation**: Each workflow's requests MUST be queued
   independently, preventing one workflow's backlog from blocking
   others.

3. **Resource quotas**: Shared agent pools SHOULD enforce per-workflow
   resource quotas (CPU, memory, request rate).

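Requirements 1 and 2 above can be sketched as per-workflow semaphores and queues in front of a shared pool. This is a non-normative illustration (class name, limit, and return values are invented for the sketch):

```python
import threading
import queue

class BulkheadPool:
    """Sketch: per-workflow connection limits and isolated queues
    for a shared agent pool."""

    def __init__(self, per_workflow_limit=4):
        self.limit = per_workflow_limit
        self._sems = {}    # wid -> Semaphore (connection limit)
        self._queues = {}  # wid -> Queue (queue isolation)
        self._lock = threading.Lock()

    def _slot(self, wid):
        with self._lock:
            if wid not in self._sems:
                self._sems[wid] = threading.Semaphore(self.limit)
                self._queues[wid] = queue.Queue()
            return self._sems[wid], self._queues[wid]

    def submit(self, wid, request):
        sem, q = self._slot(wid)
        if not sem.acquire(blocking=False):
            # This workflow is at its connection limit; other
            # workflows' semaphores and queues are unaffected.
            return "rejected"
        q.put(request)  # queued independently per workflow
        return "accepted"

    def complete(self, wid):
        sem, _ = self._slot(wid)
        sem.release()  # free one connection slot for this workflow
```

Because each workflow has its own semaphore and queue, one workflow exhausting its limit leaves the others' admission decisions untouched.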
## Cascade Detection

### Detection Signals

Agents MUST monitor the following signals for cascade detection:

Error Rate:
: The ratio of failed requests to total requests over a sliding
  window. An error rate exceeding the circuit breaker threshold
  indicates a potential cascade.

Latency Spike:
: A sudden increase in response latency (e.g., p99 latency exceeding
  3x the baseline) indicates downstream congestion or failure.
  Agents SHOULD track latency baselines using exponentially weighted
  moving averages.

Resource Exhaustion:
: Thread pool saturation, connection pool exhaustion, or memory
  pressure above configured thresholds indicates that a cascade is
  consuming resources.

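The latency-spike signal with an exponentially weighted moving average baseline can be sketched as follows (non-normative; the alpha and spike factor are illustrative defaults):

```python
class LatencySpikeDetector:
    """Sketch: EWMA latency baseline with a 3x spike threshold."""

    def __init__(self, alpha=0.1, spike_factor=3.0):
        self.alpha = alpha
        self.spike_factor = spike_factor
        self.baseline = None

    def observe(self, latency_s):
        """Return True if this sample is a spike against the baseline."""
        if self.baseline is None:
            self.baseline = latency_s  # first sample seeds the baseline
            return False
        spike = latency_s > self.spike_factor * self.baseline
        if not spike:
            # Fold only normal samples into the EWMA, so a sustained
            # outage does not drag the baseline up and mask itself.
            self.baseline += self.alpha * (latency_s - self.baseline)
        return spike
```

Excluding spike samples from the baseline update is a design choice: it keeps the baseline representative of healthy operation at the cost of adapting more slowly to legitimate load changes.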
### Propagation Tracking via ECT DAG Analysis

Orchestrators SHOULD analyze the ECT DAG to detect cascading
patterns:

1. **Error clustering**: Multiple `"circuit_breaker_open"` ECTs
   referencing the same downstream agent within a short window
   indicate a shared dependency failure.

2. **Depth-first propagation**: Errors propagating along `par`
   chains in the DAG indicate a synchronous cascade.

3. **Breadth-first propagation**: Multiple sibling nodes in the
   DAG failing concurrently indicate a shared infrastructure
   failure.

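The error-clustering pattern (item 1) can be sketched as a scan over emitted ECTs, grouped by the `cascade.downstream_agent` claim. Non-normative; the `iat` field and the thresholds are assumptions of this sketch:

```python
from collections import defaultdict

def cluster_errors(ects, window_s=60, min_cluster=2):
    """Sketch: flag shared-dependency failures by clustering
    "circuit_breaker_open" ECTs on the same downstream agent.

    `ects` is a list of claim dicts with "iat" (epoch seconds),
    "exec_act", and the "cascade.downstream_agent" ext claim.
    Returns {agent: largest burst size within one window}.
    """
    by_agent = defaultdict(list)
    for ect in ects:
        if ect.get("exec_act") != "circuit_breaker_open":
            continue
        agent = ect["ext"]["cascade.downstream_agent"]
        by_agent[agent].append(ect["iat"])

    clusters = {}
    for agent, times in by_agent.items():
        times.sort()
        # Largest number of opens that fall inside one sliding window.
        best = max(sum(1 for u in times if 0 <= u - t <= window_s)
                   for t in times)
        if best >= min_cluster:
            clusters[agent] = best
    return clusters
```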
### Alert Format and Escalation

When cascade detection identifies a propagating failure, the
detecting agent MUST emit a cascade alert ECT:

~~~json
{
  "exec_act": "cascade_detected",
  "ext": {
    "cascade.pattern": "depth_first",
    "cascade.affected_agents": 4,
    "cascade.root_cause_ect": "error-ect-uuid",
    "cascade.blast_radius": [
      "spiffe://example.com/agent/a",
      "spiffe://example.com/agent/b",
      "spiffe://example.com/agent/c"
    ]
  }
}
~~~
{: #fig-cascade-alert title="Cascade Alert ECT"}

Cascade alerts with more than 3 affected agents SHOULD trigger
HITL escalation per {{I-D.nennemann-agent-dag-hitl-safety}}.

# Real-Time Rollback

## Rollback Model

Rollback reverses the effects of agent actions by walking the ECT
DAG backwards from the point of failure to the nearest valid
recovery point.

### Walking the ECT DAG Backwards

The rollback process follows `par` references in reverse:

1. Identify the failing ECT node.
2. Find the checkpoint ECT associated with the failing action
   (referenced via `par`).
3. Traverse the DAG forward from the checkpoint, following the
   inverse of the `par` references, to identify all downstream
   actions that were caused by the checkpointed action.
4. Issue rollback requests to each affected agent in reverse
   topological order.

~~~
Checkpoint A ──► Action A1 ──► Checkpoint B ──► Action B1
                                    │
                                    └──► Action B2

Rollback order: B2, B1, B, A1, A (reverse topological)
~~~
{: #fig-rollback-order title="Rollback Order via DAG Traversal"}

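Computing such an order is a post-order traversal over the inverted `par` edges: every node's descendants are emitted before the node itself. A non-normative sketch (node identifiers stand in for ECT `jti` values):

```python
def rollback_order(nodes, par):
    """Sketch: compute a reverse topological order over ECT nodes.

    `par` maps each node to the list of its `par` (parent) references.
    Children (later actions) are emitted before the actions that
    caused them, so rollback undoes effects before their causes.
    """
    # Invert the par edges: parent -> list of children.
    children = {n: [] for n in nodes}
    for n in nodes:
        for p in par.get(n, []):
            children[p].append(n)

    order, seen = [], set()

    def visit(n):
        if n in seen:
            return
        seen.add(n)
        for c in children[n]:
            visit(c)
        order.append(n)  # post-order: all descendants appended first

    for n in nodes:
        visit(n)
    return order
```

For the DAG in {{fig-rollback-order}} this yields an order in which both B1 and B2 precede Checkpoint B, which precedes A1 and Checkpoint A; any such order is a valid reverse topological sort.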
### Compensating Actions vs State Restoration

Rollback can be performed through two mechanisms:

State Restoration:
: The agent restores its state from the checkpoint snapshot. This
  is the preferred mechanism when the checkpoint contains a complete
  state snapshot (verified via `out_hash`).

Compensating Action:
: When state restoration is not possible (e.g., the action involved
  an external API call), the agent executes a compensating action
  that semantically reverses the original action. Compensating
  actions MUST be recorded as ECT nodes with `exec_act` value
  `"compensate"`.

### Rollback Scope

Rollback can be scoped to three levels:

Single Agent:
: Only the specified agent's checkpoint is rolled back. No
  downstream propagation occurs.

Sub-DAG:
: The checkpoint and all downstream checkpoints in the sub-DAG
  are rolled back. This is the default when `cascade` is `true`.

Full Workflow:
: All checkpoints in the workflow are rolled back and the workflow
  is terminated. This requires Rollback Coordinator authorization.

## Checkpoint Protocol

### Checkpoint Creation

An agent MUST create a checkpoint ECT before any consequential
action. An action is consequential if it modifies external state
(network configuration, database records, API calls with side
effects).

A checkpoint is an ECT with:

- `exec_act`: `"checkpoint"`
- `par`: the ECT of the action being checkpointed
- `out_hash`: SHA-256 hash of the agent's state snapshot

~~~json
{
  "jti": "ckpt-uuid",
  "exec_act": "checkpoint",
  "par": ["action-ect-uuid"],
  "out_hash": "sha256:...",
  "ext": {
    "cascade.reversible": true,
    "cascade.rollback_uri":
      "https://agent-b.example.com/.well-known/cascade/rollback",
    "cascade.target": "router-07.example.com",
    "cascade.description": "Update BGP peer configuration",
    "cascade.ttl": 86400
  }
}
~~~
{: #fig-checkpoint title="Checkpoint ECT"}

The `cascade.reversible` field MUST be present. If `false`, the
agent declares that this action cannot be automatically undone and
rollback requests MUST be escalated to a human operator via the
HITL mechanism {{I-D.nennemann-agent-dag-hitl-safety}}.

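Assembling the claim set above is mechanical once the state snapshot is serialized. A non-normative sketch (the function name and parameters are illustrative; signing and the surrounding JWT envelope are out of scope here):

```python
import hashlib
import uuid

def make_checkpoint(action_jti, state_snapshot: bytes, rollback_uri,
                    target, description, reversible=True, ttl_s=86400):
    """Sketch: build the claim set of a checkpoint ECT with the
    `out_hash` of the serialized state snapshot."""
    return {
        "jti": str(uuid.uuid4()),
        "exec_act": "checkpoint",
        "par": [action_jti],
        "out_hash": "sha256:" + hashlib.sha256(state_snapshot).hexdigest(),
        "ext": {
            "cascade.reversible": reversible,
            "cascade.rollback_uri": rollback_uri,
            "cascade.target": target,
            "cascade.description": description,
            "cascade.ttl": ttl_s,
        },
    }
```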
### Checkpoint Storage and Retrieval

Checkpoint ECTs MUST be stored for at least the duration specified
by `cascade.ttl`. Agents MUST store checkpoints in durable storage
that survives agent restarts.

Agents MUST expose a checkpoint retrieval endpoint:

~~~
GET /.well-known/cascade/checkpoints/{jti} HTTP/1.1
~~~

The response MUST include the checkpoint ECT and its verification
status (whether `out_hash` matches the current stored state snapshot).

### Checkpoint Verification

Before executing a rollback, the agent MUST verify the checkpoint
integrity:

1. Retrieve the checkpoint ECT.
2. Verify the ECT signature chain (L2/L3).
3. Verify that the stored state snapshot matches `out_hash`.
4. Verify that the checkpoint has not expired (`cascade.ttl`).

If verification fails, the agent MUST reject the rollback request
and emit an error ECT.

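The four verification steps can be sketched as a single check function. Non-normative; signature-chain verification (step 2) belongs to the ECT layer and is stubbed as a flag here, and the return strings are invented for the sketch:

```python
import hashlib
import time

def verify_checkpoint(ckpt, stored_snapshot: bytes, created_at,
                      now=None, signature_ok=True):
    """Sketch of the checkpoint verification steps; returns "ok" or
    the first failing check."""
    now = time.time() if now is None else now
    if not signature_ok:                               # step 2 (stubbed)
        return "invalid_signature"
    expected = "sha256:" + hashlib.sha256(stored_snapshot).hexdigest()
    if ckpt["out_hash"] != expected:                   # step 3
        return "state_mismatch"
    if now - created_at > ckpt["ext"]["cascade.ttl"]:  # step 4
        return "expired"
    return "ok"
```

On any result other than `"ok"`, the agent would reject the rollback request and emit an error ECT as required above.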
## Distributed Rollback Coordination

### Rollback Coordinator Role

For rollbacks spanning multiple agents (sub-DAG or full workflow
scope), a Rollback Coordinator MUST be designated. The coordinator
is typically the orchestrator or the agent that initiated the
workflow.

The coordinator is responsible for:

1. Computing the blast radius by traversing the ECT DAG.
2. Determining rollback order (reverse topological sort).
3. Issuing rollback requests to each affected agent.
4. Tracking rollback progress and handling failures.
5. Emitting the final rollback completion ECT.

### Two-Phase Rollback Protocol

Distributed rollback follows a two-phase protocol:

**Phase 1: Prepare**

The coordinator sends a prepare request to each affected agent:

~~~
POST /.well-known/cascade/rollback/prepare HTTP/1.1
Content-Type: application/json
Execution-Context: <prepare-ect>

{
  "rollback_id": "urn:uuid:...",
  "checkpoint_id": "ckpt-uuid",
  "scope": "sub_dag"
}
~~~
{: #fig-prepare title="Rollback Prepare Request"}

Each agent MUST respond with either:

- `"prepared"`: The agent has verified its checkpoint and is ready
  to roll back.
- `"cannot_prepare"`: The agent cannot roll back (e.g., checkpoint
  expired, irreversible action).

**Phase 2: Execute**

If all agents respond `"prepared"`, the coordinator sends execute
requests in reverse topological order:

~~~
POST /.well-known/cascade/rollback HTTP/1.1
Content-Type: application/json
Execution-Context: <rollback-ect>

{
  "rollback_id": "urn:uuid:...",
  "checkpoint_id": "ckpt-uuid",
  "phase": "execute"
}
~~~
{: #fig-execute title="Rollback Execute Request"}

If any agent responds `"cannot_prepare"` in Phase 1, the
coordinator MUST either:

- Proceed with partial rollback (if the unprepared agent is not
  on the critical path), or
- Abort the rollback and escalate to HITL.

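The coordinator side of the two phases can be sketched as follows. Non-normative: `prepare`, `execute`, and `escalate` are caller-supplied callbacks standing in for the HTTP requests and the HITL hook, and this sketch always aborts on any blocker rather than attempting a partial rollback:

```python
def coordinate_rollback(agents_in_reverse_topo_order,
                        prepare, execute, escalate):
    """Sketch of the two-phase rollback coordinator.

    Phase 1 collects prepare votes from every affected agent;
    Phase 2 executes only if all agents voted "prepared".
    """
    votes = {a: prepare(a) for a in agents_in_reverse_topo_order}
    blockers = [a for a, v in votes.items() if v != "prepared"]
    if blockers:
        # A real coordinator may instead proceed partially when the
        # blockers are off the critical path (see above).
        escalate(blockers)
        return "aborted"
    for agent in agents_in_reverse_topo_order:
        execute(agent)  # reverse topological order preserved
    return "completed"
```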
### Partial Rollback Handling

When a distributed rollback cannot be completed fully, the
coordinator MUST:

1. Roll back all agents that responded `"prepared"`.
2. Record the partial rollback result in the ECT DAG.
3. Emit an ECT with `exec_act` value `"rollback_complete"` and
   `cascade.status` set to `"partial"`.
4. Include the list of agents that could not be rolled back in
   the `cascade.failed_agents` extension claim.

### Conflict Resolution During Concurrent Rollbacks

When multiple rollback requests target overlapping portions of the
ECT DAG:

1. The rollback with the broader scope takes precedence (full
   workflow > sub-DAG > single agent).
2. If scopes are equal, the earlier rollback request (by timestamp)
   takes precedence.
3. The losing rollback request MUST be rejected with an error
   indicating the conflicting rollback ID.

Agents MUST implement idempotent rollback: receiving the same
`rollback_id` twice MUST return the same result without
re-executing the rollback.

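The precedence rules and the idempotency requirement can be sketched together. Non-normative: the dict field names mirror this document's claims, while the ranking table, `ts` field, and class name are inventions of the sketch:

```python
SCOPE_RANK = {"full_workflow": 2, "sub_dag": 1, "single": 0}

def winner(rb_a, rb_b):
    """Sketch of the precedence rules: broader scope wins; on equal
    scope, the earlier timestamp wins. Each rollback is a dict with
    "scope", "ts", and "rollback_id"."""
    key = lambda rb: (-SCOPE_RANK[rb["scope"]], rb["ts"])
    return min(rb_a, rb_b, key=key)

class IdempotentRollbackHandler:
    """Replaying the same rollback_id returns the cached result
    without re-executing the rollback."""

    def __init__(self, do_rollback):
        self.do_rollback = do_rollback
        self.results = {}

    def handle(self, rollback_id):
        if rollback_id not in self.results:
            self.results[rollback_id] = self.do_rollback(rollback_id)
        return self.results[rollback_id]
```

The losing request of `winner` would be rejected with an error naming the winning rollback ID, per rule 3.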
## Rollback Evidence

### ECT Nodes for Rollback Actions

Each rollback action MUST produce ECT nodes for audit:

Rollback Start:
: `exec_act`: `"rollback_start"`, `par` references the error ECT
  that triggered the rollback.

~~~json
{
  "jti": "rb-start-uuid",
  "exec_act": "rollback_start",
  "par": ["error-ect-uuid"],
  "ext": {
    "cascade.rollback_id": "urn:uuid:...",
    "cascade.checkpoint_id": "ckpt-uuid",
    "cascade.scope": "sub_dag",
    "cascade.reason": "Upstream cascading failure"
  }
}
~~~
{: #fig-rb-start title="Rollback Start ECT"}

Rollback Complete:
: `exec_act`: `"rollback_complete"`, `par` references the rollback
  start ECT.

~~~json
{
  "jti": "rb-complete-uuid",
  "exec_act": "rollback_complete",
  "par": ["rb-start-uuid"],
  "out_hash": "sha256:...",
  "ext": {
    "cascade.rollback_id": "urn:uuid:...",
    "cascade.status": "completed",
    "cascade.state_hash_before": "sha256:...",
    "cascade.state_hash_after": "sha256:...",
    "cascade.cascaded": [
      {
        "agent": "spiffe://example.com/agent/monitor",
        "status": "completed"
      },
      {
        "agent": "spiffe://example.com/agent/classify",
        "status": "escalated"
      }
    ]
  }
}
~~~
{: #fig-rb-complete title="Rollback Complete ECT"}

### Rollback Audit Trail

The complete rollback audit trail is captured in the ECT DAG:

~~~
error ECT
    │
    ▼
rollback_start ECT
    │
    ├──► agent-A rollback_complete ECT
    │
    ├──► agent-B rollback_complete ECT
    │
    └──► agent-C compensate ECT
~~~
{: #fig-rb-audit title="Rollback Audit Trail in ECT DAG"}

Status values for individual agent rollbacks: `completed`,
`partial`, `escalated`, `failed`.

# ECT Integration

This document defines the following new `exec_act` values for use
in ECT nodes {{I-D.nennemann-wimse-ect}}:

| `exec_act` Value | Description |
|-----------------|-------------|
| `circuit_breaker_open` | Circuit breaker transitioned to OPEN state |
| `circuit_breaker_close` | Circuit breaker transitioned to CLOSED state |
| `checkpoint` | State snapshot before consequential action |
| `rollback_start` | Rollback initiated for a checkpoint |
| `rollback_complete` | Rollback finished (with status) |
| `compensate` | Compensating action executed in lieu of state restoration |
| `cascade_detected` | Cascading failure pattern detected |
{: #fig-exec-act-values title="New exec_act Values"}

This document defines the following new `ext` claims for failure
context:

| Claim | Type | Description |
|-------|------|-------------|
| `cascade.downstream_agent` | string | SPIFFE ID of the downstream agent |
| `cascade.error_rate` | number | Error rate that triggered the circuit breaker |
| `cascade.window_s` | number | Sliding window duration in seconds |
| `cascade.cooldown_s` | number | Cooldown duration in seconds |
| `cascade.reversible` | boolean | Whether the checkpointed action can be undone |
| `cascade.rollback_uri` | string | URI for rollback requests |
| `cascade.target` | string | Target system of the checkpointed action |
| `cascade.ttl` | number | Checkpoint time-to-live in seconds |
| `cascade.rollback_id` | string | Unique identifier for a rollback operation |
| `cascade.checkpoint_id` | string | JTI of the checkpoint being rolled back |
| `cascade.scope` | string | Rollback scope: single, sub_dag, full_workflow |
| `cascade.status` | string | Rollback result status |
| `cascade.reason` | string | Human-readable reason for the action |
| `cascade.pattern` | string | Detected cascade pattern type |
| `cascade.affected_agents` | number | Count of agents affected by cascade |
| `cascade.blast_radius` | array | SPIFFE IDs of affected agents |
| `cascade.cascaded` | array | Per-agent rollback results |
| `cascade.failed_agents` | array | Agents that could not be rolled back |
| `cascade.state_hash_before` | string | State hash before rollback |
| `cascade.state_hash_after` | string | State hash after rollback |
| `cascade.description` | string | Human-readable description |
{: #fig-ext-claims title="New ext Claims for Cascade Prevention"}

# Security Considerations

## Rollback Weaponization

Malicious agents could attempt to force unnecessary rollbacks to
disrupt workflows. Mitigations:

1. Rollback requests MUST be authenticated via the ECT signature
   chain. Only agents whose ECTs appear in the same workflow DAG
   (identified by `wid`) are authorized to request rollback.

2. Rollback requests from outside the originating workflow MUST be
   rejected with HTTP 403.

3. Agents SHOULD implement rate limiting on rollback requests to
   prevent denial-of-service through rollback flooding.

4. The two-phase rollback protocol provides a prepare phase where
   agents can validate the rollback request before committing.

## Circuit Breaker Manipulation

An adversary could attempt to manipulate circuit breaker state to
either prevent legitimate circuit breaking or force unnecessary
circuit breaks:

1. **False error injection**: A malicious agent could emit false
   error ECTs to trigger circuit breakers. At L2/L3
   {{I-D.nennemann-wimse-ect}}, ECT signatures prevent forgery.
   Agents SHOULD verify that error ECTs reference valid `par`
   values within their own workflow DAG.

2. **Circuit breaker suppression**: An adversary could attempt to
   reset circuit breakers by sending successful probe responses.
   Agents MUST only accept probe responses from the actual
   downstream agent (verified via ECT identity binding).

3. **Status endpoint abuse**: The `/.well-known/cascade/circuits`
   endpoint reveals system health topology. This endpoint MUST
   require authentication and SHOULD be restricted to agents within
   the same administrative domain.

## Checkpoint Integrity

Checkpoint state snapshots contain sensitive system state. Agents
MUST:

1. Encrypt stored checkpoint state at rest.
2. Reference checkpoint state in ECTs via `out_hash` only; checkpoint
   contents MUST NOT be included in ECT claims.
3. Verify `out_hash` integrity before executing rollback to prevent
   rollback to a tampered state.
4. Enforce checkpoint storage quotas to prevent checkpoint flooding
   attacks.
5. Purge expired checkpoints (past `cascade.ttl`).

# IANA Considerations

## Registration of exec_act Values

This document requests registration of the following `exec_act`
values in the ECT exec_act registry:

| Value | Description | Reference |
|-------|-------------|-----------|
| `circuit_breaker_open` | Circuit breaker transitioned to OPEN | This document |
| `circuit_breaker_close` | Circuit breaker transitioned to CLOSED | This document |
| `checkpoint` | State snapshot before consequential action | This document |
| `rollback_start` | Rollback operation initiated | This document |
| `rollback_complete` | Rollback operation finished | This document |
| `compensate` | Compensating action executed | This document |
| `cascade_detected` | Cascading failure pattern detected | This document |
{: #fig-iana-exec-act title="exec_act Value Registrations"}

## Registration of ext Claims

This document requests registration of the `ext` claims listed in
{{fig-ext-claims}} in the ECT extension claims registry. All claims
use the `cascade.` namespace prefix.

## Well-Known URI Registration

This document requests registration of the following well-known URI
suffixes per {{RFC9110}}:

| URI Suffix | Description | Reference |
|------------|-------------|-----------|
| `cascade/circuits` | Circuit breaker status | This document |
| `cascade/rollback` | Rollback request endpoint | This document |
| `cascade/rollback/prepare` | Rollback prepare endpoint | This document |
| `cascade/checkpoints` | Checkpoint retrieval | This document |
{: #fig-iana-uris title="Well-Known URI Registrations"}

--- back

# Acknowledgments
{:numbered="false"}

This document absorbs and supersedes concepts from the earlier Agent
Error Recovery and Rollback (AERR) and Agent Task DAG (ATD) proposals.
It builds on the Execution Context Token specification
{{I-D.nennemann-wimse-ect}} for DAG-based audit trails and the Agent
Context Policy Token {{I-D.nennemann-agent-dag-hitl-safety}} for HITL
escalation of irreversible actions. The circuit breaker pattern is
adapted from microservice architecture best practices.