fullname: TBD
organization: Independent
email: placeholder@example.com
normative:
  RFC2119:
  RFC8174:
  RFC8446:
  RFC9110:
  RFC8615:
  I-D.nennemann-wimse-ect:
    title: "Execution Context Tokens for Distributed Agentic Workflows"
    target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/
  I-D.nennemann-agent-dag-hitl-safety:
    title: "Agent Context Policy Token: DAG Delegation with Human Override"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/
informative:
--- abstract
This document defines the Agent Task DAG (ATD) specification:
execution semantics, checkpoints, error signaling, circuit
breakers, and rollback for agent workflows. ATD does not define a
new DAG or token format. It defines when agents MUST emit ECT
nodes, what those nodes mean, and how to recover when things go
wrong. Checkpoints, errors, and rollback results are ECT nodes
with specific exec_act values and ext claims. Rollback walks
the ECT DAG backwards. Circuit breakers contain cascading
failures. Resource hints enable scheduling. The protocol is
transport-agnostic and builds on ECT for evidence and ACP-DAG-HITL
for policy.
--- middle
Introduction
Autonomous agents increasingly make unsupervised decisions, yet no standard exists for how agents checkpoint state, signal errors to peers, contain cascading failures, or roll back decisions gone wrong.
ATD borrows proven patterns from distributed systems: checkpoints from database transactions, circuit breakers from microservice architectures, and rollback from version control. It adapts these to agent workflows where actions may be partially reversible and where the agent that caused the error may not be the best one to fix it.
ATD does not define a new DAG format. The ECT DAG {{I-D.nennemann-wimse-ect}} IS the execution graph. ATD defines the semantics of specific node types within that graph.
Design principles:
- Agents that take consequential actions MUST be able to undo them, or MUST declare them irreversible upfront.
- Failure containment takes priority over failure diagnosis.
- The protocol adds minimal overhead to the happy path.
Conventions and Definitions
{::boilerplate bcp14-tagged}
- Checkpoint:
- An ECT node recording agent state before a consequential action, sufficient to restore the system to that state.
- Circuit Breaker:
- A mechanism that stops an agent from propagating requests to a failing downstream agent, preventing cascading failures.
- Rollback:
- The process of reverting an agent's actions and state to a previously recorded checkpoint.
- Blast Radius:
- The set of agents and systems affected by a single failure.
- Consequential Action:
- An action that modifies external state (network configuration, database records, API calls with side effects) such that reversal requires explicit effort.
Execution Semantics
Topological Order
Tasks in the ECT DAG MUST execute in topological order: a task
MUST NOT begin execution until all tasks referenced by its ECT
par claims are in state done.
Two tasks with no common ancestor in the DAG (no shared par
lineage) MAY execute concurrently. Orchestrators SHOULD
exploit this parallelism for performance.
Circular dependencies are prohibited. Agents MUST reject ACP-DAG-HITL delegation DAGs containing cycles.
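The ordering and cycle rules above can be sketched in Python with the standard-library graphlib module. This is a minimal illustration, not a normative algorithm: the par_map layout (a task id mapped to the ids in its par claim) and the function name execution_batches are assumptions made for this example.

```python
from graphlib import CycleError, TopologicalSorter


def execution_batches(par_map):
    """Group task ids into batches whose parents are all done.

    par_map maps a task id to the ids listed in its par claim
    (hypothetical in-memory layout for this sketch).
    """
    sorter = TopologicalSorter(par_map)
    try:
        sorter.prepare()  # raises CycleError if the graph contains a cycle
    except CycleError as exc:
        raise ValueError(f"DAG rejected, cycle found: {exc.args[1]}") from exc
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every parent of these is done
        batches.append(ready)
        sorter.done(*ready)  # marking them done unblocks their children
    return batches
```

For example, execution_batches({"b": ["a"], "c": ["a"], "d": ["b", "c"]}) yields "a" first, then "b" and "c", then "d". A graph such as {"a": ["b"], "b": ["a"]} is rejected with ValueError.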
Workflow Boundary ECTs
When a workflow begins, the initiating agent MUST emit:
{
"exec_act": "atd:workflow_start",
"ext": {
"atd.wf_id": "wf-uuid",
"atd.description": "BGP failover workflow",
"atd.node_count": 5
}
}
{: #fig-wf-start title="Workflow Start ECT"}
When the workflow reaches a terminal state (all leaf nodes complete or any node failed with no rollback path), the orchestrator MUST emit:
{
"exec_act": "atd:workflow_complete",
"par": ["wf-start-ect-uuid"],
"ext": {
"atd.wf_id": "wf-uuid",
"atd.terminal_status": "success",
"atd.elapsed_s": 42
}
}
{: #fig-wf-complete title="Workflow Complete ECT"}
Terminal status values: success, partial, failed,
rolled_back, escalated.
Node States
Each task node in the ECT DAG has an implicit state derived from subsequent ECT nodes:
- pending: A delegation node exists in ACP-DAG-HITL but no corresponding ECT has been emitted.
- running: An ECT matching the task type has been emitted but no completion or error ECT follows.
- done: A completion ECT (or the next par-linked ECT) exists.
- failed: An atd:error ECT references this node.
- rolled_back: An atd:rollback_result ECT references this node's checkpoint.
- escalated: The task failed and a human has been notified per HITL escalation rules.
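One possible way to derive these implicit states from a list of ECTs is sketched below. Several points are assumptions made for illustration: ECTs are plain dicts, any child ECT whose exec_act does not start with "atd:" counts as completion evidence, the set escalated_jtis stands in for whatever record of human notification a deployment keeps, and the precedence escalated, rolled_back, failed, done, running is a choice of this sketch.

```python
def derive_state(task_jti, ects, escalated_jtis=frozenset()):
    """Derive the implicit state of one task node from the ECTs seen so far."""
    if all(e["jti"] != task_jti for e in ects):
        return "pending"  # delegated, but no ECT emitted for the task yet
    # ECTs that list this task in their par claim.
    children = [e for e in ects if task_jti in e.get("par", [])]
    checkpoints = {e["jti"] for e in children
                   if e["exec_act"] == "atd:checkpoint"}
    failed = any(e["exec_act"] == "atd:error" for e in children)
    rolled_back = any(
        e["exec_act"] == "atd:rollback_result"
        and e.get("ext", {}).get("atd.checkpoint_id") in checkpoints
        for e in ects
    )
    if failed and task_jti in escalated_jtis:
        return "escalated"
    if rolled_back:
        return "rolled_back"
    if failed:
        return "failed"
    # Assumption: a non-ATD child ECT is the "next par-linked ECT".
    if any(not e["exec_act"].startswith("atd:") for e in children):
        return "done"
    return "running"
```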
Checkpoint Mechanism
Checkpoint Placement Policy
An ATD-compliant agent MUST create a checkpoint before any action it classifies as consequential. The following actions are always consequential and MUST be checkpointed:
- Any modification to network device configuration.
- Any write to a shared database or external data store.
- Any API call with side effects (non-idempotent HTTP methods).
- Any delegation to another agent that will itself take consequential actions.
The following SHOULD be checkpointed:
- Long-running computations (> atd.resource_timeout_s).
- Actions that cannot be verified without external state.
The following are exempt from checkpoint requirements:
- Read-only queries.
- Sending notifications with no side effects.
- Internal state computations with no external observable effect.
Checkpoint ECT Format
A checkpoint is an ECT with:
- exec_act: "atd:checkpoint"
- par: the ECT of the action being checkpointed
{
"jti": "ckpt-uuid",
"exec_act": "atd:checkpoint",
"par": ["action-ect-uuid"],
"out_hash": "sha256-of-agent-state-snapshot",
"ext": {
"atd.reversible": true,
"atd.rollback_uri": "https://agent-b.example.com/.well-known/atd/rollback",
"atd.target": "router-07.example.com",
"atd.description": "Update BGP peer config",
"atd.ttl": 86400
}
}
{: #fig-checkpoint title="Checkpoint ECT"}
The atd.reversible field MUST be present. If false, the agent
declares that this action cannot be automatically undone and
rollback requests MUST be escalated per the ACP-DAG-HITL
unreachable_human policy.
The out_hash provides integrity verification: the agent hashes
its state at checkpoint time so that rollback can verify it is
restoring to an authentic prior state.
Checkpoints MUST be stored for at least atd.ttl seconds. Agents
SHOULD store checkpoints in durable storage that survives restarts.
The rollback URI MUST be a well-known URI per {{RFC8615}} at the
path /.well-known/atd/rollback.
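As a hedged illustration, a checkpoint ECT payload with its out_hash could be assembled as follows. The canonicalization (JSON with sorted keys), the hex encoding of the digest, and the helper names state_hash, make_checkpoint, and verify_restored_state are assumptions of this sketch; this document does not prescribe them.

```python
import hashlib
import json
import uuid
from urllib.parse import urlparse

ROLLBACK_PATH = "/.well-known/atd/rollback"


def state_hash(state):
    """SHA-256 over a canonical JSON form of the state (assumed encoding)."""
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_checkpoint(action_jti, state, rollback_uri, reversible, ttl_s):
    """Build the claims of an atd:checkpoint ECT (signing is out of scope)."""
    if urlparse(rollback_uri).path != ROLLBACK_PATH:
        raise ValueError("rollback URI must use " + ROLLBACK_PATH)
    return {
        "jti": str(uuid.uuid4()),
        "exec_act": "atd:checkpoint",
        "par": [action_jti],
        "out_hash": state_hash(state),
        "ext": {
            "atd.reversible": reversible,  # MUST be present
            "atd.rollback_uri": rollback_uri,
            "atd.ttl": ttl_s,
        },
    }


def verify_restored_state(checkpoint, restored_state):
    """Check that a restored state matches the hash recorded at checkpoint time."""
    return state_hash(restored_state) == checkpoint["out_hash"]
```

Because keys are sorted before hashing, two state dicts with the same content but different key order produce the same out_hash.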
Hierarchical Checkpoints
Agents MAY create hierarchical checkpoints where a parent groups
multiple child checkpoints from a multi-step operation. Rolling
back the parent rolls back all children. The parent checkpoint's
par array references all child checkpoint jti values.
Checkpoint exec_act Table
| exec_act value | When emitted | Required ext fields |
|---|---|---|
| atd:checkpoint | Before consequential action | atd.reversible, atd.rollback_uri, atd.ttl |
| atd:error | On failure detection | atd.severity, atd.error_type, atd.checkpoint_id |
| atd:circuit_open | When error rate exceeds threshold | atd.downstream_agent, atd.error_rate, atd.window_s |
| atd:circuit_close | When probe succeeds in HALF-OPEN | atd.downstream_agent, atd.cooldown_s |
| atd:rollback_request | To initiate rollback | atd.reason, atd.cascade |
| atd:rollback_result | Rollback complete or failed | atd.status, atd.checkpoint_id, atd.cascaded |
| atd:workflow_start | Workflow begins | atd.wf_id, atd.description |
| atd:workflow_complete | Workflow terminal | atd.wf_id, atd.terminal_status |
{: #fig-actions title="ATD exec_act Values"}
Error Signaling
When an agent detects an error, it MUST emit an error ECT:
- exec_act: "atd:error"
- par: the ECT of the failed action
{
"jti": "error-uuid",
"exec_act": "atd:error",
"par": ["failed-action-ect-uuid"],
"ext": {
"atd.severity": "critical",
"atd.error_type": "action_failed",
"atd.description": "BGP session did not establish",
"atd.checkpoint_id": "ckpt-uuid",
"atd.upstream_errors": []
}
}
{: #fig-error title="Error ECT"}
Severity levels (in increasing order): info, warning,
error, critical.
Error types: action_failed, timeout, constraint_violation,
resource_exhausted, upstream_cascade, unknown.
When an agent receives an error signal caused by an action it initiated, it MUST either:
(a) Attempt automatic rollback of its checkpoint, or
(b) Escalate per ACP-DAG-HITL rules if the action was irreversible.
The atd.upstream_errors array allows agents to chain error
context, building a causal trace from symptom to root cause.
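The chaining idea can be illustrated with a short traversal. This sketch assumes that each entry of atd.upstream_errors is the jti of another error ECT; this document does not fix the element format, so that layout and the name causal_trace are hypothetical.

```python
def causal_trace(error_jti, errors_by_jti):
    """Walk atd.upstream_errors depth-first, from symptom toward root cause.

    errors_by_jti maps an error ECT jti to its claims (assumed layout).
    Unknown or already visited jti values are skipped.
    """
    trace, seen, stack = [], set(), [error_jti]
    while stack:
        jti = stack.pop()
        if jti in seen or jti not in errors_by_jti:
            continue
        seen.add(jti)
        trace.append(jti)
        upstream = errors_by_jti[jti].get("ext", {}).get("atd.upstream_errors", [])
        stack.extend(reversed(upstream))  # keep the listed order when popping
    return trace
```

The last entries of the returned list are the errors with no upstream references, which are the candidate root causes.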
HITL Escalation on Error
Error ECTs with severity critical SHOULD trigger HITL
escalation. Deployments SHOULD define ACP-DAG-HITL rules such
as:
{
"id": "r-critical-error",
"trigger": {
"kind": "keyword_match",
"op": "eq",
"value": "critical",
"input_ref": "atd.severity"
},
"required_role": "operator:oncall",
"action": "escalate",
"allow_override": true,
"override_action": "continue"
}
{: #fig-error-hitl title="HITL Rule for Critical Errors"}
Circuit Breaker Pattern
Each agent MUST implement a circuit breaker for every downstream agent it communicates with. The circuit breaker has three states:
- CLOSED (normal):
- Requests flow through. The agent tracks the error rate over a sliding window (default: 60 seconds).
- OPEN (failure detected):
- When the error rate exceeds a threshold (default: 50%), the breaker opens. All requests are immediately rejected. The agent MUST emit a circuit breaker open ECT:
{
"exec_act": "atd:circuit_open",
"ext": {
"atd.downstream_agent": "spiffe://example.com/agent/b",
"atd.error_rate": 0.75,
"atd.window_s": 60
}
}
{: #fig-circuit-open title="Circuit Breaker Open ECT"}
- HALF-OPEN (recovery probe):
- After a cooldown period (default: 30s), the breaker allows one probe request. If it succeeds, the breaker returns to CLOSED and MUST emit:
{
"exec_act": "atd:circuit_close",
"ext": {
"atd.downstream_agent": "spiffe://example.com/agent/b",
"atd.cooldown_s": 30
}
}
{: #fig-circuit-close title="Circuit Breaker Close ECT"}
If the probe fails, the breaker returns to OPEN with doubled cooldown (exponential backoff, max 300s).
Circuit Breaker State Machine
         error_rate > threshold
CLOSED ─────────────────────────► OPEN
  ▲                                 │
  │ probe success                   │ cooldown expires
  │                                 ▼
  └────────────────────────── HALF-OPEN
         probe failure ──► OPEN (cooldown * 2)
{: #fig-fsm title="Circuit Breaker State Machine"}
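The state machine can be sketched as a small Python class using the defaults given above (60 second window, 50% threshold, 30 second cooldown, 300 second cap). The class and method names, the injectable clock, and the choice to return the exec_act to emit are assumptions of this example. The text does not say whether a failed probe emits another ECT, so the sketch emits none in that case, and it applies no minimum sample count before opening.

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, threshold=0.5, window_s=60, cooldown_s=30,
                 max_cooldown_s=300, clock=time.monotonic):
        self.threshold = threshold
        self.window_s = window_s
        self.base_cooldown_s = cooldown_s
        self.cooldown_s = cooldown_s
        self.max_cooldown_s = max_cooldown_s
        self.clock = clock
        self.state = "CLOSED"
        self.opened_at = None
        self.samples = deque()  # (timestamp, failed) pairs inside the window

    def _error_rate(self):
        cutoff = self.clock() - self.window_s
        while self.samples and self.samples[0][0] < cutoff:
            self.samples.popleft()  # drop outcomes older than the window
        if not self.samples:
            return 0.0
        return sum(failed for _, failed in self.samples) / len(self.samples)

    def _open(self):
        self.state = "OPEN"
        self.opened_at = self.clock()

    def allow_request(self):
        """Return True if a request (or the single probe) may be sent."""
        if self.state == "OPEN" and self.clock() - self.opened_at >= self.cooldown_s:
            self.state = "HALF-OPEN"  # cooldown expired: allow one probe
            return True
        return self.state == "CLOSED"

    def record(self, failed):
        """Record an outcome; return the exec_act to emit, or None."""
        if self.state == "HALF-OPEN":
            if failed:  # probe failed: back to OPEN with doubled cooldown
                self.cooldown_s = min(self.cooldown_s * 2, self.max_cooldown_s)
                self._open()
                return None
            self.state = "CLOSED"
            self.cooldown_s = self.base_cooldown_s
            self.samples.clear()
            return "atd:circuit_close"
        self.samples.append((self.clock(), failed))
        if self.state == "CLOSED" and self._error_rate() > self.threshold:
            self._open()
            return "atd:circuit_open"
        return None
```

A caller would check allow_request() before contacting the downstream agent, then pass the outcome to record() and emit the returned ECT type, if any.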
Coordinated Circuit Breaking
When multiple agents share a downstream dependency, each maintains its own circuit breaker independently. However, agents SHOULD publish circuit breaker state via their ECT stream so peers can observe the signal.
If an orchestrator observes N circuit breakers opening for the same downstream agent within a short window, it SHOULD initiate a HITL escalation rather than allowing N parallel recovery probes.
Circuit Breaker Policy Configuration
Circuit breaker thresholds can be configured as ACP-DAG-HITL node constraints:
{
"constraints": {
"atd.circuit_threshold": 0.5,
"atd.circuit_window_s": 60
}
}
{: #fig-circuit-policy title="Circuit Breaker Policy"}
Rollback Protocol
Basic Rollback
A rollback is initiated by emitting a rollback request ECT and sending an HTTP POST to the target agent's rollback endpoint:
POST /.well-known/atd/rollback HTTP/1.1
Content-Type: application/json
Execution-Context: <rollback-request-ect>
- exec_act: "atd:rollback_request"
- par: the checkpoint ECT to roll back to
{
"exec_act": "atd:rollback_request",
"par": ["ckpt-uuid"],
"ext": {
"atd.reason": "Upstream action caused cascading failure",
"atd.cascade": true
}
}
{: #fig-rollback-req title="Rollback Request ECT"}
When atd.cascade is true, the receiving agent MUST also
initiate rollback of any downstream checkpoints created as a
consequence of the checkpointed action.
The agent MUST respond with a rollback result ECT:
{
"exec_act": "atd:rollback_result",
"par": ["rollback-request-uuid"],
"out_hash": "sha256-of-restored-state",
"ext": {
"atd.status": "completed",
"atd.checkpoint_id": "ckpt-uuid",
"atd.cascaded": [
{"agent": "spiffe://example.com/agent/c", "status": "completed"},
{"agent": "spiffe://example.com/agent/d", "status": "escalated"}
]
}
}
{: #fig-rollback-result title="Rollback Result ECT"}
Status values: completed, partial, escalated, failed.
escalated means the action was irreversible and a human operator
has been notified per ACP-DAG-HITL unreachable_human policy.
Partial Rollback and Blast Radius Containment
When a failure occurs in the middle of a DAG, it is often undesirable to roll back the entire workflow. ATD defines partial rollback as rolling back the failed subgraph while preserving completed sibling branches.
Partial rollback MUST only proceed if:
- The checkpoints to be rolled back are in the same workflow (atd.wf_id).
- No completed sibling task depends on the output of the failed task (verified by walking the DAG forward from the checkpoint).
The blast radius is the set of agents holding checkpoints that are descendants of the failed node. Orchestrators SHOULD compute blast radius before initiating cascade rollback to avoid unnecessary disruption.
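A blast radius computation along these lines could look like the following sketch. It assumes ECTs are available as dicts and that the holder of a checkpoint is identified by an iss claim; both the layout and the function name blast_radius are illustrative assumptions.

```python
from collections import deque


def blast_radius(failed_jti, ects):
    """Return the agents holding checkpoints that descend from failed_jti."""
    # Index children by parent so the DAG can be walked forward.
    children = {}
    for ect in ects:
        for parent in ect.get("par", []):
            children.setdefault(parent, []).append(ect)

    agents, seen, queue = set(), {failed_jti}, deque([failed_jti])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child["jti"] in seen:
                continue
            seen.add(child["jti"])
            queue.append(child["jti"])
            if child["exec_act"] == "atd:checkpoint":
                agents.add(child["iss"])  # assumed identity claim
    return agents
```

An orchestrator could compare the size of this set against a deployment-specific limit before sending a rollback request with atd.cascade set to true.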
Rollback Timeout and Escalation
Rollback requests MUST be bounded by a timeout, implicitly derived
from the original checkpoint's atd.ttl. If rollback is not
completed within atd.ttl / 2 seconds, the agent MUST:
- Emit an atd:error with atd.error_type: "timeout" and an atd.description noting the rollback timeout.
- Escalate to HITL per {{hitl-escalation}}.
Agents MUST implement idempotent rollback: receiving the same
rollback request ECT jti twice MUST return the same result.
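Idempotency and the timeout bound can be illustrated with a small handler that caches results by request jti. The class RollbackHandler, the callback do_rollback, and the in-memory cache are assumptions of this sketch; a real agent would likely persist results in durable storage.

```python
def rollback_deadline_s(ttl_s):
    """Rollback must complete within atd.ttl / 2 seconds."""
    return ttl_s / 2


class RollbackHandler:
    def __init__(self, do_rollback):
        self._do_rollback = do_rollback  # hypothetical callback doing the work
        self._results = {}               # request jti -> rollback result

    def handle(self, request_ect):
        """Execute a rollback once per jti; replay the stored result after."""
        jti = request_ect["jti"]
        if jti not in self._results:
            self._results[jti] = self._do_rollback(request_ect)
        return self._results[jti]
```

With the example checkpoint's atd.ttl of 86400, rollback_deadline_s returns 43200.0 seconds.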
Rollback Authorization
Only agents within the same workflow (wid) with checkpoint
lineage in the DAG SHOULD be authorized to request rollback.
Rollback requests from outside the originating workflow MUST be
rejected with HTTP 403.
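A possible authorization check is sketched below. It assumes each ECT carries wid and iss claims, interprets "checkpoint lineage" as the requester having issued at least one ancestor of the checkpoint, and returns 403 for a lineage failure as well. The last point goes beyond the text, which only requires 403 for requests from outside the workflow.

```python
def ancestors(jti, ects_by_jti):
    """Collect all ancestor jti values of an ECT by following par claims."""
    seen, stack = set(), list(ects_by_jti[jti].get("par", []))
    while stack:
        cur = stack.pop()
        if cur in seen or cur not in ects_by_jti:
            continue
        seen.add(cur)
        stack.extend(ects_by_jti[cur].get("par", []))
    return seen


def rollback_status(request, checkpoint_jti, ects_by_jti):
    """Return the HTTP status for a rollback request (200 or 403)."""
    checkpoint = ects_by_jti[checkpoint_jti]
    if request["wid"] != checkpoint["wid"]:
        return 403  # outside the originating workflow: MUST be rejected
    issuers = {ects_by_jti[j]["iss"]
               for j in ancestors(checkpoint_jti, ects_by_jti)}
    # Assumption of this sketch: missing lineage is also answered with 403.
    return 200 if request["iss"] in issuers else 403
```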
Interaction with HITL
ATD escalates to HITL in the following scenarios:
- Irreversible action failure: An error ECT whose referenced checkpoint has atd.reversible: false MUST trigger HITL Level 2 (approval required) per the companion HITL specification.
- Rollback failure: A rollback result with atd.status: "failed" MUST trigger HITL Level 3 (STOP) on the workflow.
- Cascaded rollback of critical nodes: When an atd.cascade: true rollback propagates to a node with atd.severity: critical, HITL SHOULD be triggered at Level 1 (PAUSE) to allow human review before proceeding.
- Circuit breaker permanent open: If a circuit breaker re-opens after 3 successive failed HALF-OPEN probes, HITL Level 2 escalation SHOULD be triggered.
ATD-to-HITL escalation is recorded as an ECT linked to both the triggering error ECT and the HITL override ECT, preserving the causal chain in the audit DAG.
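The four scenarios can be condensed into a lookup, shown here as a hedged sketch. The event dict and its keys (error, reversible, rollback_status, cascade, node_severity, failed_probes) are hypothetical, and returning the highest applicable level when several scenarios match is a choice of this example.

```python
def hitl_level(event):
    """Map an ATD event summary to a HITL level, or None if no escalation."""
    if event.get("rollback_status") == "failed":
        return 3  # STOP the workflow
    if event.get("error") and event.get("reversible") is False:
        return 2  # approval required: irreversible action failed
    if event.get("failed_probes", 0) >= 3:
        return 2  # breaker keeps re-opening after HALF-OPEN probes
    if event.get("cascade") and event.get("node_severity") == "critical":
        return 1  # PAUSE for human review
    return None
```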
Resource Hints
Resource Claim Format
Agents MAY declare resource requirements as ACP-DAG-HITL node constraints:
{
"constraints": {
"atd.resource_cpu": "2",
"atd.resource_memory_mb": 4096,
"atd.resource_timeout_s": 300,
"atd.resource_priority": "high",
"atd.resource_gpu": "0",
"atd.resource_network_mbps": 100
}
}
{: #fig-resources title="Resource Hints as Node Constraints"}
Priority Levels
The atd.resource_priority field MUST be one of: critical,
high, normal, low. Orchestrators SHOULD map these to
scheduling priority classes (e.g., Kubernetes QoS classes:
critical → Guaranteed, high/normal → Burstable, low
→ BestEffort).
Fair-Share Scheduling
When multiple agents compete for a shared resource pool, orchestrators SHOULD implement fair-share scheduling:
- Each active workflow receives an equal base allocation.
- Unused allocation from low-priority agents is redistributed to high/critical agents within the same scheduling cycle.
- Starvation prevention: low-priority agents MUST eventually be scheduled within a configurable maximum wait (default: 300s).
Unsatisfiable Resource Hints
Resource hints are advisory; agents MUST NOT depend on them for correctness. When resource hints cannot be satisfied:
- If atd.resource_priority is critical: orchestrator SHOULD pre-empt lower-priority tasks.
- If critical tasks still cannot be scheduled within 60s: emit atd:error with atd.error_type: "resource_exhausted" and escalate to HITL.
- All other priorities: proceed with degraded resources; log a warning via atd:error with severity warning.
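The fallback rules above can be expressed as a small decision function. The function name, the waited_s parameter, and the shape of the returned dict are assumptions made for illustration; only the priorities, the 60 second limit, and the resulting actions come from the text.

```python
PRIORITIES = {"critical", "high", "normal", "low"}


def on_unsatisfiable(priority, waited_s, critical_deadline_s=60):
    """Decide what an orchestrator does when resource hints cannot be met."""
    if priority not in PRIORITIES:
        raise ValueError(f"unknown atd.resource_priority: {priority!r}")
    if priority == "critical":
        if waited_s >= critical_deadline_s:
            # Still unschedulable after the deadline: error plus HITL.
            return {"action": "escalate",
                    "atd.error_type": "resource_exhausted"}
        return {"action": "preempt_lower_priority"}
    # All other priorities run degraded and log a warning-level atd:error.
    return {"action": "proceed_degraded", "atd.severity": "warning"}
```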
Optional Declarative Workflow Format
To support pre-run planning and tooling, ATD defines an optional declarative workflow descriptor. This is a planning artifact only; at runtime it is realized as ECTs per this specification.
{
"wf_id": "bgp-failover-v2",
"description": "BGP peer failover with validation",
"nodes": [
{
"id": "n1",
"label": "validate-config",
"reversible": true,
"hitl_required": false,
"resource_hints": {
"priority": "normal",
"timeout_s": 30
}
},
{
"id": "n2",
"label": "update-bgp-peer",
"reversible": true,
"hitl_required": true,
"resource_hints": {
"priority": "critical",
"timeout_s": 120
}
},
{
"id": "n3",
"label": "verify-session",
"reversible": false,
"hitl_required": false,
"resource_hints": {
"priority": "high",
"timeout_s": 60
}
}
],
"edges": [
{"from": "n1", "to": "n2"},
{"from": "n2", "to": "n3"}
]
}
{: #fig-workflow title="Declarative Workflow Descriptor"}
The workflow descriptor media type is
application/atd-workflow+json. Orchestrators MAY store and
version workflow descriptors independently of their ECT runtime
realization.
The hitl_required field is a hint to the HITL system that this
node MUST have an approval gate as defined in the companion HITL
specification.
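Tooling that consumes the descriptor might validate it and derive a planning order before any ECT is emitted. The following is a minimal sketch assuming the field names shown in the example descriptor; the function plan_order and its error handling are not defined by this document.

```python
from graphlib import CycleError, TopologicalSorter


def plan_order(descriptor):
    """Validate nodes and edges, then return node ids in dependency order."""
    ids = [node["id"] for node in descriptor["nodes"]]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate node id in descriptor")
    predecessors = {node_id: set() for node_id in ids}
    for edge in descriptor["edges"]:
        if edge["from"] not in predecessors or edge["to"] not in predecessors:
            raise ValueError(f"edge references unknown node: {edge}")
        predecessors[edge["to"]].add(edge["from"])
    try:
        # static_order() raises CycleError while iterating a cyclic graph.
        return list(TopologicalSorter(predecessors).static_order())
    except CycleError as exc:
        raise ValueError("descriptor contains a cycle") from exc
```

For the three-node example above, the edges n1 to n2 and n2 to n3 give the order n1, n2, n3.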
Security Considerations
Rollback Authorization
Rollback requests are high-privilege operations. Agents MUST authenticate rollback requests using the ECT identity binding (L2/L3). The rollback endpoint MUST require mutual TLS or a signed JWT from an agent within the same workflow DAG.
Only agents that are ancestors in the ECT DAG of the checkpoint being rolled back SHOULD be authorized to request that rollback.
Checkpoint Confidentiality
Checkpoint data may contain sensitive system state (API keys, session tokens, configuration). Agents:
- MUST encrypt stored checkpoints at rest.
- MUST reference checkpoint state in ECTs via out_hash only.
- MUST NOT include checkpoint contents in error ECTs.
False Error Injection
A malicious agent could send false atd:error ECTs to trigger
unnecessary rollbacks and disrupt workflows. Mitigation:
- Agents SHOULD verify that error ECTs reference valid par values within their own workflow DAG (wid claim).
- Rollback MUST require authentication (see {{rollback-authz}}).
- L2/L3 ECT signing prevents unauthenticated error injection.
Checkpoint Flooding
An adversary could exhaust checkpoint storage by triggering many checkpoints. Mitigation:
- Agents SHOULD enforce a maximum checkpoint count per workflow.
- Expired checkpoints (past atd.ttl) MUST be purged.
- Checkpoint creation rate SHOULD be rate-limited per calling workflow.
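Two of these mitigations, a per-workflow cap and purging of expired checkpoints, are sketched below. The class CheckpointStore, the default limit of 100, and the explicit now parameter are assumptions of this example; rate limiting and durable, encrypted storage are left out.

```python
class CheckpointStore:
    def __init__(self, max_per_workflow=100):
        self.max_per_workflow = max_per_workflow  # assumed default limit
        self._by_wf = {}  # wf_id -> {jti: (expires_at, checkpoint)}

    def purge(self, now):
        """Delete checkpoints whose atd.ttl has elapsed."""
        for bucket in self._by_wf.values():
            expired = [jti for jti, (exp, _) in bucket.items() if exp <= now]
            for jti in expired:
                del bucket[jti]

    def add(self, wf_id, checkpoint, now):
        """Store a checkpoint, enforcing the per-workflow maximum."""
        self.purge(now)
        bucket = self._by_wf.setdefault(wf_id, {})
        if len(bucket) >= self.max_per_workflow:
            raise RuntimeError("checkpoint limit reached for workflow")
        expires_at = now + checkpoint["ext"]["atd.ttl"]
        bucket[checkpoint["jti"]] = (expires_at, checkpoint)

    def count(self, wf_id):
        return len(self._by_wf.get(wf_id, {}))
```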
Circuit Breaker State Leakage
The atd:circuit_open ECT reveals system health topology. The
audit ledger SHOULD enforce access controls: only agents within
the same workflow or authorized operators SHOULD be able to query
circuit breaker history.
IANA Considerations
This document requests registration of the following values in the AEM Ecosystem Extension Registry established by draft-aem-agent-ecosystem-model:
exec_act Values
| Value | Description | Reference |
|---|---|---|
| atd:checkpoint | State snapshot before consequential action | This document |
| atd:error | Error signal with severity and type | This document |
| atd:circuit_open | Circuit breaker opened to downstream agent | This document |
| atd:circuit_close | Circuit breaker returned to CLOSED state | This document |
| atd:rollback_request | Initiate rollback to named checkpoint | This document |
| atd:rollback_result | Result of rollback attempt | This document |
| atd:workflow_start | Workflow began execution | This document |
| atd:workflow_complete | Workflow reached terminal state | This document |
{: #fig-iana-actions title="ATD exec_act Registrations"}
Well-Known URI
This document requests registration of atd/rollback as a
well-known URI suffix per {{RFC8615}}.
Media Type
This document requests registration of
application/atd-workflow+json for the declarative workflow
descriptor format defined in {{workflow-format}}.
--- back
Acknowledgments
{:numbered="false"}
ATD builds on ECT {{I-D.nennemann-wimse-ect}} for execution evidence and ACP-DAG-HITL {{I-D.nennemann-agent-dag-hitl-safety}} for delegation policy. The circuit breaker pattern is adapted from microservice architecture best practices. The declarative workflow format is inspired by workflow description languages (BPEL, BPMN) adapted for lightweight agent coordination.