fullname: TBD
organization: Independent
email: placeholder@example.com
normative:
  RFC2119:
  RFC8174:
  RFC8446:
  I-D.nennemann-wimse-ect:
    title: "Execution Context Tokens for Distributed Agentic Workflows"
    target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/
  I-D.nennemann-agent-dag-hitl-safety:
    title: "Agent Context Policy Token: DAG Delegation with Human Override"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/
informative:
--- abstract
This document defines the Agent Task DAG (ATD) specification:
execution semantics, checkpoints, error signaling, circuit
breakers, and rollback for agent workflows. ATD does not define a
new DAG or token format. It defines when agents MUST emit ECT
nodes, what those nodes mean, and how to recover when things go
wrong. Checkpoints, errors, and rollback results are ECT nodes
with specific exec_act values and ext claims. Rollback walks
the ECT DAG backwards. Circuit breakers contain cascading
failures. Resource hints enable scheduling. The protocol is
transport-agnostic and builds on ECT for evidence and ACP-DAG-HITL
for policy.
--- middle
Introduction
Autonomous agents increasingly make unsupervised decisions, yet no standard exists for how agents checkpoint state, signal errors to peers, contain cascading failures, or roll back decisions gone wrong.
ATD borrows proven patterns from distributed systems: checkpoints from database transactions, circuit breakers from microservice architectures, and rollback from version control. It adapts these to agent workflows where actions may be partially reversible and where the agent that caused the error may not be the best one to fix it.
ATD does not define a new DAG format. The ECT DAG {{I-D.nennemann-wimse-ect}} IS the execution graph. ATD defines the semantics of specific node types within that graph.
Design principles:
- Agents that take consequential actions MUST be able to undo them, or MUST declare them irreversible upfront.
- Failure containment takes priority over failure diagnosis.
- The protocol adds minimal overhead to the happy path.
Conventions and Definitions
{::boilerplate bcp14-tagged}
- Checkpoint: An ECT node recording agent state before a consequential action, sufficient to restore the system to that state.
- Circuit Breaker: A mechanism that stops an agent from propagating requests to a failing downstream agent, preventing cascading failures.
- Rollback: The process of reverting an agent's actions and state to a previously recorded checkpoint.
- Blast Radius: The set of agents and systems affected by a single failure.
Node States
Each task node in the ECT DAG has an implicit state derived from subsequent ECT nodes:
- pending: A delegation node exists in ACP-DAG-HITL but no corresponding ECT has been emitted.
- running: An ECT with exec_act matching the task type has been emitted but no completion or error ECT follows.
- done: A completion ECT (or the next par-linked ECT) exists.
- failed: An atd:error ECT references this node.
- rolled_back: An atd:rollback_result ECT references this node's checkpoint.
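The state derivation above can be sketched in Python. This is a minimal illustration, not a normative algorithm: ECTs are modeled as plain dicts, the function name `derive_state` is hypothetical, and two points the text leaves open are filled with assumptions. First, any par-linked child whose exec_act does not start with `atd:` is treated as evidence of completion. Second, when several states apply, the precedence is rolled_back, then failed, then done.

```python
from typing import Iterable, Mapping


def derive_state(task_jti: str, ects: Iterable[Mapping]) -> str:
    """Derive the implicit state of a task node from the ECTs seen so far.

    Assumptions (not fixed by the text): precedence is
    rolled_back > failed > done > running, and a non-"atd:" child
    counts as the "next par-linked ECT" that marks completion.
    """
    ects = list(ects)
    if not any(e.get("jti") == task_jti for e in ects):
        return "pending"  # no ECT emitted for this task yet

    # ECTs that name this task in their par array.
    children = [e for e in ects if task_jti in e.get("par", [])]
    checkpoints = {
        e["jti"]
        for e in children
        if e.get("exec_act") == "atd:checkpoint" and "jti" in e
    }

    for e in ects:
        if (
            e.get("exec_act") == "atd:rollback_result"
            and e.get("ext", {}).get("atd.checkpoint_id") in checkpoints
        ):
            return "rolled_back"
    if any(e.get("exec_act") == "atd:error" for e in children):
        return "failed"
    if any(not e.get("exec_act", "").startswith("atd:") for e in children):
        return "done"
    return "running"
```

Because the state is recomputed from the DAG on every call, no separate state store has to be kept consistent with the ECT ledger.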
Checkpoint Mechanism
An ATD-compliant agent MUST create a checkpoint before any action it classifies as consequential. An action is consequential if it modifies external state, for example network configuration, database records, or the target of an API call with side effects.
A checkpoint is an ECT with:
- exec_act: "atd:checkpoint"
- par: the ECT of the action being checkpointed
{
  "jti": "ckpt-uuid",
  "exec_act": "atd:checkpoint",
  "par": ["action-ect-uuid"],
  "out_hash": "sha256-of-agent-state-snapshot",
  "ext": {
    "atd.reversible": true,
    "atd.rollback_uri": "https://agent-b.example.com/atd/rollback",
    "atd.target": "router-07.example.com",
    "atd.description": "Update BGP peer config",
    "atd.ttl": 86400
  }
}
{: #fig-checkpoint title="Checkpoint ECT"}
The atd.reversible field MUST be present. If false, the agent
declares that this action cannot be automatically undone and
rollback requests MUST be escalated per the ACP-DAG-HITL
unreachable_human policy.
The out_hash provides integrity verification: the agent hashes
its state at checkpoint time so that rollback can verify it is
restoring to an authentic prior state.
Checkpoints MUST be stored for at least atd.ttl seconds. Agents
SHOULD store checkpoints in durable storage that survives restarts.
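As an illustration of how an agent might assemble a checkpoint and later verify a snapshot against out_hash, consider the following Python sketch. The helper names (`snapshot_hash`, `build_checkpoint`, `verify_snapshot`) are hypothetical. The sketch also assumes that out_hash is the hex-encoded SHA-256 of a canonical JSON serialization of the state; the actual encoding is governed by the ECT specification and may differ.

```python
import hashlib
import json
import uuid
from typing import Optional


def snapshot_hash(state: dict) -> str:
    """Hash a state snapshot (assumed encoding: hex SHA-256 of sorted-key JSON)."""
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_checkpoint(
    action_jti: str,
    state: dict,
    *,
    reversible: bool,
    target: str,
    description: str,
    ttl_s: int,
    rollback_uri: Optional[str] = None,
) -> dict:
    """Build the claims of a checkpoint ECT as a plain dict (unsigned)."""
    ext = {
        "atd.reversible": reversible,  # MUST always be present
        "atd.target": target,
        "atd.description": description,
        "atd.ttl": ttl_s,
    }
    if rollback_uri is not None:
        ext["atd.rollback_uri"] = rollback_uri
    return {
        "jti": str(uuid.uuid4()),
        "exec_act": "atd:checkpoint",
        "par": [action_jti],
        "out_hash": snapshot_hash(state),
        "ext": ext,
    }


def verify_snapshot(checkpoint: dict, state: dict) -> bool:
    """Check that a stored snapshot still matches the checkpoint's out_hash."""
    return checkpoint["out_hash"] == snapshot_hash(state)
```

Serializing with sorted keys makes the hash independent of dict ordering, so a snapshot reloaded from durable storage hashes to the same value as the one taken at checkpoint time.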
Hierarchical Checkpoints
Agents MAY create hierarchical checkpoints where a parent groups
multiple child checkpoints from a multi-step operation. Rolling
back the parent rolls back all children. The parent checkpoint's
par array references all child checkpoint jti values.
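A parent checkpoint can be expanded into the leaf checkpoints it covers by following par entries that are themselves checkpoints. The Python sketch below is one possible approach. The function name `rollback_targets` is hypothetical, and the newest-first ordering (reverse of the par array) is an assumption; this document does not prescribe an order.

```python
from typing import Dict, List


def rollback_targets(ckpt_jti: str, checkpoints: Dict[str, dict]) -> List[str]:
    """Return the leaf checkpoint jti values covered by ckpt_jti.

    `checkpoints` maps jti -> checkpoint ECT claims. A par entry that is
    itself a known checkpoint is treated as a child; any other par entry
    (for example an action ECT) is ignored.
    """
    seen = set()
    order: List[str] = []

    def visit(jti: str) -> None:
        if jti in seen:  # guard against visiting a shared child twice
            return
        seen.add(jti)
        children = [p for p in checkpoints[jti].get("par", []) if p in checkpoints]
        if not children:
            order.append(jti)  # leaf: an ordinary checkpoint
            return
        for child in reversed(children):  # assumed order: newest first
            visit(child)

    visit(ckpt_jti)
    return order
```

For a parent whose par array is `["c1", "c2"]`, the sketch yields `["c2", "c1"]`, so the later step of the multi-step operation is undone first.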
Error Signaling
When an agent detects an error, it MUST emit an error ECT:
- exec_act: "atd:error"
- par: the ECT of the failed action
{
  "jti": "error-uuid",
  "exec_act": "atd:error",
  "par": ["failed-action-ect-uuid"],
  "ext": {
    "atd.severity": "critical",
    "atd.error_type": "action_failed",
    "atd.description": "BGP session did not establish",
    "atd.checkpoint_id": "ckpt-uuid",
    "atd.upstream_errors": []
  }
}
{: #fig-error title="Error ECT"}
Severity levels: info, warning, error, critical.
Error types: action_failed, timeout, constraint_violation,
resource_exhausted, upstream_cascade, unknown.
When an agent receives an error signal caused by an action it initiated, it MUST either:
(a) attempt automatic rollback to its checkpoint, or
(b) escalate per the ACP-DAG-HITL rules if the action was irreversible.
The atd.upstream_errors array allows agents to chain error
context, building a causal trace from symptom to root cause.
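The following Python sketch shows one way to build error ECTs that chain upstream context, plus the rollback-or-escalate decision. It rests on assumptions: the entries of atd.upstream_errors are taken to be the jti values of earlier error ECTs ordered from root cause to most recent (the example above only shows an empty array), and `build_error` and `react_to_error` are hypothetical names.

```python
import uuid
from typing import Optional

SEVERITIES = ("info", "warning", "error", "critical")
ERROR_TYPES = (
    "action_failed", "timeout", "constraint_violation",
    "resource_exhausted", "upstream_cascade", "unknown",
)


def build_error(
    failed_jti: str,
    severity: str,
    error_type: str,
    description: str,
    checkpoint_id: Optional[str] = None,
    caused_by: Optional[dict] = None,
) -> dict:
    """Build error ECT claims, extending the causal chain of `caused_by`."""
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity}")
    if error_type not in ERROR_TYPES:
        raise ValueError(f"unknown error type: {error_type}")

    upstream = []
    if caused_by is not None:
        # Assumed format: jti values, root cause first.
        upstream = list(caused_by["ext"].get("atd.upstream_errors", []))
        upstream.append(caused_by["jti"])

    ext = {
        "atd.severity": severity,
        "atd.error_type": error_type,
        "atd.description": description,
        "atd.upstream_errors": upstream,
    }
    if checkpoint_id is not None:
        ext["atd.checkpoint_id"] = checkpoint_id
    return {
        "jti": str(uuid.uuid4()),
        "exec_act": "atd:error",
        "par": [failed_jti],
        "ext": ext,
    }


def react_to_error(checkpoint: dict) -> str:
    """Choose between options (a) and (b) for an action this agent initiated."""
    return "rollback" if checkpoint["ext"]["atd.reversible"] else "escalate"
```

With this layout, the last entry of atd.upstream_errors is the immediate cause and the first entry is the root cause, so a reader can walk the trace in either direction.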
HITL Escalation on Error
Error ECTs MAY trigger ACP-DAG-HITL rules. A deployment can define HITL rules such as:
{
  "id": "r-critical-error",
  "trigger": {
    "kind": "keyword_match",
    "op": "eq",
    "value": "critical",
    "input_ref": "atd.severity"
  },
  "required_role": "operator:oncall",
  "action": "escalate",
  "allow_override": true,
  "override_action": "continue"
}
{: #fig-error-hitl title="HITL Rule for Critical Errors"}
Circuit Breaker Pattern
Each agent MUST implement a circuit breaker for every downstream agent it communicates with. The circuit breaker has three states:
- CLOSED (normal): Requests flow through. The agent tracks the error rate over a sliding window (default: 60 seconds).
- OPEN (failure detected): When the error rate exceeds a threshold (default: 50%), the breaker opens. All requests are immediately rejected. The agent MUST emit a circuit breaker ECT:
{
  "exec_act": "atd:circuit_open",
  "ext": {
    "atd.downstream_agent": "spiffe://example.com/agent/b",
    "atd.error_rate": 0.75,
    "atd.window_s": 60
  }
}
{: #fig-circuit title="Circuit Breaker ECT"}
- HALF-OPEN (recovery probe): After a cooldown period (default: 30 seconds), the breaker allows one probe request. If it succeeds, the breaker returns to CLOSED. If it fails, it returns to OPEN with a doubled cooldown (exponential backoff, maximum 300 seconds).
Circuit breaker thresholds can be configured as ACP-DAG-HITL node constraints:
{
  "constraints": {
    "atd.circuit_threshold": 0.5,
    "atd.circuit_window_s": 60
  }
}
{: #fig-circuit-policy title="Circuit Breaker Policy"}
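The three-state machine described in this section can be sketched as a small Python class. This is an illustrative sketch under stated assumptions, not a reference implementation: the class and method names are hypothetical, the clock is injectable so the behavior can be tested, and `min_samples` is an extra knob that is not part of this document (its default of 1 keeps the behavior literal to the text).

```python
import time
from collections import deque


class CircuitBreaker:
    """Per-downstream breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED or OPEN."""

    def __init__(self, threshold=0.5, window_s=60.0, cooldown_s=30.0,
                 max_cooldown_s=300.0, min_samples=1, clock=time.monotonic):
        self.threshold = threshold
        self.window_s = window_s
        self.base_cooldown_s = cooldown_s
        self.cooldown_s = cooldown_s
        self.max_cooldown_s = max_cooldown_s
        self.min_samples = min_samples  # hypothetical knob, not in the draft
        self.clock = clock
        self.state = "CLOSED"
        self.opened_at = 0.0
        self.samples = deque()  # (timestamp, ok) pairs inside the window

    def allow(self) -> bool:
        """Return True if a request may be sent to the downstream agent."""
        if self.state == "OPEN" and self.clock() - self.opened_at >= self.cooldown_s:
            self.state = "HALF_OPEN"
            return True  # the single probe request
        return self.state == "CLOSED"

    def error_rate(self) -> float:
        if not self.samples:
            return 0.0
        return sum(1 for _, ok in self.samples if not ok) / len(self.samples)

    def record(self, ok: bool) -> None:
        """Record the outcome of a request that allow() let through."""
        now = self.clock()
        if self.state == "HALF_OPEN":
            if ok:  # probe succeeded: close and reset the backoff
                self.state = "CLOSED"
                self.cooldown_s = self.base_cooldown_s
                self.samples.clear()
            else:  # probe failed: reopen with a doubled, capped cooldown
                self.cooldown_s = min(self.cooldown_s * 2, self.max_cooldown_s)
                self.state, self.opened_at = "OPEN", now
            return
        self.samples.append((now, ok))
        while self.samples and now - self.samples[0][0] > self.window_s:
            self.samples.popleft()  # drop samples older than the window
        if (self.state == "CLOSED" and len(self.samples) >= self.min_samples
                and self.error_rate() > self.threshold):
            self.state, self.opened_at = "OPEN", now

    def circuit_open_ect(self, downstream_agent: str) -> dict:
        """Claims for the atd:circuit_open ECT emitted when the breaker opens."""
        return {
            "exec_act": "atd:circuit_open",
            "ext": {
                "atd.downstream_agent": downstream_agent,
                "atd.error_rate": self.error_rate(),
                "atd.window_s": self.window_s,
            },
        }
```

A caller would check `allow()` before each request, call `record()` with the outcome, and emit `circuit_open_ect()` on the transition to OPEN. With `min_samples` left at 1, a single failed request on an otherwise idle link opens the breaker, so deployments may prefer a higher value.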
Rollback Protocol
A rollback is initiated by emitting a rollback request ECT and sending an HTTP POST to the target agent's rollback endpoint:
POST /atd/rollback HTTP/1.1
Content-Type: application/json
Execution-Context: <rollback-request-ect>
- exec_act: "atd:rollback_request"
- par: the checkpoint ECT to roll back to
{
  "exec_act": "atd:rollback_request",
  "par": ["ckpt-uuid"],
  "ext": {
    "atd.reason": "Upstream action caused cascading failure",
    "atd.cascade": true
  }
}
{: #fig-rollback-req title="Rollback Request ECT"}
When atd.cascade is true, the receiving agent MUST also
initiate rollback of any downstream checkpoints created as a
consequence of the checkpointed action.
The agent MUST respond with a rollback result ECT:
- exec_act: "atd:rollback_result"
- par: the rollback request ECT
{
  "exec_act": "atd:rollback_result",
  "par": ["rollback-request-uuid"],
  "out_hash": "sha256-of-restored-state",
  "ext": {
    "atd.status": "completed",
    "atd.checkpoint_id": "ckpt-uuid",
    "atd.cascaded": [
      {"agent": "spiffe://example.com/agent/c", "status": "completed"},
      {"agent": "spiffe://example.com/agent/d", "status": "escalated"}
    ]
  }
}
{: #fig-rollback-result title="Rollback Result ECT"}
Status values: completed, partial, escalated, failed.
escalated means the action was irreversible and a human operator
has been notified per ACP-DAG-HITL unreachable_human policy.
Agents MUST implement idempotent rollback: receiving the same
rollback request ECT jti twice MUST return the same result.
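Idempotency can be achieved by caching the result ECT under the request's jti, as in the Python sketch below. The names are hypothetical: `RollbackHandler` is illustrative, and `restore` stands in for whatever agent-specific routine restores state and returns the hash of the restored state. Cascading, authentication, and the `partial` status are omitted to keep the sketch short.

```python
import uuid
from typing import Callable, Dict


class RollbackHandler:
    """Handle atd:rollback_request ECTs with at-most-once restore semantics."""

    def __init__(self, checkpoints: Dict[str, dict],
                 restore: Callable[[dict], str]):
        self.checkpoints = checkpoints  # checkpoint jti -> checkpoint claims
        self.restore = restore          # hypothetical: returns restored-state hash
        self.results: Dict[str, dict] = {}  # request jti -> cached result ECT

    def handle(self, request: dict) -> dict:
        req_jti = request["jti"]
        if req_jti in self.results:
            # Replay of a request already processed: return the same result.
            return self.results[req_jti]

        ckpt_id = request["par"][0]
        checkpoint = self.checkpoints.get(ckpt_id)
        result = {
            "jti": str(uuid.uuid4()),
            "exec_act": "atd:rollback_result",
            "par": [req_jti],
            "ext": {"atd.checkpoint_id": ckpt_id},
        }

        if checkpoint is None:
            status = "failed"  # unknown or expired checkpoint
        elif not checkpoint["ext"]["atd.reversible"]:
            status = "escalated"  # irreversible: hand off to a human
        else:
            try:
                result["out_hash"] = self.restore(checkpoint)
                status = "completed"
            except Exception:
                status = "failed"

        result["ext"]["atd.status"] = status
        self.results[req_jti] = result
        return result
```

Caching failures as well as successes is a design choice of this sketch: a retry after a failed rollback would need a new request jti. A production agent would also persist the cache so that idempotency survives restarts.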
Resource Hints
Agents MAY declare resource requirements as ECT extension claims or ACP-DAG-HITL node constraints:
{
  "constraints": {
    "atd.resource_cpu": "2",
    "atd.resource_memory_mb": 4096,
    "atd.resource_timeout_s": 300,
    "atd.resource_priority": "high"
  }
}
{: #fig-resources title="Resource Hints as Node Constraints"}
Orchestrators (e.g., Kubernetes schedulers, agent gateways) MAY use these hints for scheduling and quota enforcement. Resource hints are advisory; agents MUST NOT depend on them for correctness.
Security Considerations
Rollback requests are sensitive operations. Agents MUST
authenticate rollback requests using the ECT identity binding
(L2/L3). Agents SHOULD authorize rollback requests only from
agents in the same workflow (wid) that have checkpoint lineage
in the DAG.
Checkpoint data may contain sensitive system state. Agents MUST encrypt stored checkpoints at rest and MUST NOT include checkpoint contents in error ECTs.
Circuit breaker state reveals system health topology. The
atd:circuit_open ECT is part of the audit trail; access to the
audit ledger SHOULD be controlled.
Malicious agents could send false error ECTs to trigger
unnecessary rollbacks. Agents SHOULD verify that error ECTs
reference valid par values within their own workflow DAG.
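The par check in the previous paragraph could look like the following Python sketch. The function name `error_is_plausible` is hypothetical, and the sketch assumes each ECT is available as a dict carrying a wid claim. It covers only the structural check and does not replace signature or identity verification of the ECT itself.

```python
from typing import Iterable, Mapping


def error_is_plausible(error_ect: Mapping, workflow_ects: Iterable[Mapping],
                       wid: str) -> bool:
    """Accept an error ECT only if every par entry exists in this workflow."""
    if error_ect.get("exec_act") != "atd:error":
        return False
    if error_ect.get("wid") != wid:
        return False  # error claims to belong to a different workflow
    known = {e["jti"] for e in workflow_ects
             if e.get("wid") == wid and "jti" in e}
    parents = error_ect.get("par", [])
    # An error with no parents, or with unknown parents, is rejected.
    return bool(parents) and all(p in known for p in parents)
```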
IANA Considerations
This document requests registration of the following exec_act
values in a future ECT action type registry:
- atd:checkpoint
- atd:error
- atd:circuit_open
- atd:rollback_request
- atd:rollback_result
--- back
Acknowledgments
{:numbered="false"}
ATD builds on ECT {{I-D.nennemann-wimse-ect}} for execution evidence and ACP-DAG-HITL {{I-D.nennemann-agent-dag-hitl-safety}} for delegation policy. The circuit breaker pattern is adapted from microservice architecture best practices.