---
title: "Agent Task DAG (ATD): Execution Model, Checkpoints, and Recovery"
abbrev: "ATD"
category: std

docname: draft-atd-agent-task-dag-00
submissiontype: IETF
number:
date:
v: 3
area: "OPS"
workgroup: "NMOP"
keyword:
 - agent DAG
 - checkpoint
 - rollback
 - error recovery
 - circuit breaker

author:
 -
    fullname: TBD
    organization: Independent
    email: placeholder@example.com

normative:
  RFC2119:
  RFC8174:
  RFC8446:
  I-D.nennemann-wimse-ect:
    title: "Execution Context Tokens for Distributed Agentic Workflows"
    target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/
  I-D.nennemann-agent-dag-hitl-safety:
    title: "Agent Context Policy Token: DAG Delegation with Human Override"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/

informative:

--- abstract

This document defines the Agent Task DAG (ATD) specification:
execution semantics, checkpoints, error signaling, circuit breakers,
and rollback for agent workflows.

ATD does not define a new DAG or token format. It defines when agents
MUST emit ECT nodes, what those nodes mean, and how to recover when
things go wrong. Checkpoints, errors, and rollback results are ECT
nodes with specific `exec_act` values and `ext` claims. Rollback walks
the ECT DAG backwards. Circuit breakers contain cascading failures.
Resource hints enable scheduling.

The protocol is transport-agnostic and builds on ECT for evidence and
ACP-DAG-HITL for policy.

--- middle

# Introduction

Autonomous agents increasingly make unsupervised decisions, yet no
standard exists for how agents checkpoint state, signal errors to
peers, contain cascading failures, or roll back decisions gone wrong.

ATD borrows proven patterns from distributed systems: checkpoints from
database transactions, circuit breakers from microservice
architectures, and rollback from version control. It adapts these to
agent workflows where actions may be partially reversible and where
the agent that caused the error may not be the best one to fix it.

ATD does not define a new DAG format. The ECT DAG
{{I-D.nennemann-wimse-ect}} IS the execution graph. ATD defines the
semantics of specific node types within that graph.

Design principles:

1. Agents that take consequential actions MUST be able to undo them,
   or MUST declare them irreversible upfront.
2. Failure containment takes priority over failure diagnosis.
3. The protocol adds minimal overhead to the happy path.

# Conventions and Definitions

{::boilerplate bcp14-tagged}

Checkpoint:
: An ECT node recording agent state before a consequential action,
  sufficient to restore the system to that state.

Circuit Breaker:
: A mechanism that stops an agent from propagating requests to a
  failing downstream agent, preventing cascading failures.

Rollback:
: The process of reverting an agent's actions and state to a
  previously recorded checkpoint.

Blast Radius:
: The set of agents and systems affected by a single failure.

# Node States {#node-states}

Each task node in the ECT DAG has an implicit state derived from
subsequent ECT nodes:

- **pending**: A delegation node exists in ACP-DAG-HITL but no
  corresponding ECT has been emitted.
- **running**: An ECT with `exec_act` matching the task type has been
  emitted but no completion or error ECT follows.
- **done**: A completion ECT (or the next `par`-linked ECT) exists.
- **failed**: An `atd:error` ECT references this node.
- **rolled_back**: An `atd:rollback_result` ECT references this node's
  checkpoint.

# Checkpoint Mechanism {#checkpoints}

An ATD-compliant agent MUST create a checkpoint before any action it
classifies as consequential. An action is consequential if it modifies
external state (network config, database records, API calls with side
effects).

A checkpoint is an ECT with:

- `exec_act`: `"atd:checkpoint"`
- `par`: the ECT of the action being checkpointed

~~~json
{
  "jti": "ckpt-uuid",
  "exec_act": "atd:checkpoint",
  "par": ["action-ect-uuid"],
  "out_hash": "sha256-of-agent-state-snapshot",
  "ext": {
    "atd.reversible": true,
    "atd.rollback_uri": "https://agent-b.example.com/atd/rollback",
    "atd.target": "router-07.example.com",
    "atd.description": "Update BGP peer config",
    "atd.ttl": 86400
  }
}
~~~
{: #fig-checkpoint title="Checkpoint ECT"}

The `atd.reversible` field MUST be present. If `false`, the agent
declares that this action cannot be automatically undone and rollback
requests MUST be escalated per the ACP-DAG-HITL `unreachable_human`
policy.

The `out_hash` provides integrity verification: the agent hashes its
state at checkpoint time so that rollback can verify it is restoring
to an authentic prior state.

Checkpoints MUST be stored for at least `atd.ttl` seconds. Agents
SHOULD store checkpoints in durable storage that survives restarts.

## Hierarchical Checkpoints

Agents MAY create hierarchical checkpoints where a parent groups
multiple child checkpoints from a multi-step operation. Rolling back
the parent rolls back all children. The parent checkpoint's `par`
array references all child checkpoint `jti` values.

# Error Signaling {#errors}

When an agent detects an error, it MUST emit an error ECT:

- `exec_act`: `"atd:error"`
- `par`: the ECT of the failed action

~~~json
{
  "jti": "error-uuid",
  "exec_act": "atd:error",
  "par": ["failed-action-ect-uuid"],
  "ext": {
    "atd.severity": "critical",
    "atd.error_type": "action_failed",
    "atd.description": "BGP session did not establish",
    "atd.checkpoint_id": "ckpt-uuid",
    "atd.upstream_errors": []
  }
}
~~~
{: #fig-error title="Error ECT"}

Severity levels: `info`, `warning`, `error`, `critical`.

Error types: `action_failed`, `timeout`, `constraint_violation`,
`resource_exhausted`, `upstream_cascade`, `unknown`.

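
As a non-normative illustration, the error ECT claims above could be
assembled and checked against the listed severity levels and error
types as follows. This is a minimal sketch: the helper name
`build_error_ect`, the use of a random UUID for `jti`, and the
assumption that `atd.upstream_errors` carries the `jti` values of
earlier error ECTs are not defined by this document. Signing and
transport of the ECT are out of scope for the sketch.

```python
import uuid

# Values copied from the severity and error type lists in this section.
SEVERITIES = ("info", "warning", "error", "critical")
ERROR_TYPES = (
    "action_failed", "timeout", "constraint_violation",
    "resource_exhausted", "upstream_cascade", "unknown",
)


def build_error_ect(failed_jti, severity, error_type, description,
                    checkpoint_id=None, upstream_errors=()):
    """Return the claims of an atd:error ECT as an unsigned dict."""
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity!r}")
    if error_type not in ERROR_TYPES:
        raise ValueError(f"unknown error type: {error_type!r}")
    ext = {
        "atd.severity": severity,
        "atd.error_type": error_type,
        "atd.description": description,
        # Assumption: entries are jti values of upstream error ECTs.
        "atd.upstream_errors": list(upstream_errors),
    }
    if checkpoint_id is not None:
        ext["atd.checkpoint_id"] = checkpoint_id
    return {
        "jti": str(uuid.uuid4()),   # hypothetical choice of identifier
        "exec_act": "atd:error",
        "par": [failed_jti],        # the ECT of the failed action
        "ext": ext,
    }
```

Rejecting unknown severities and error types at construction time keeps
malformed error ECTs out of the DAG; a deployment that prefers leniency
could map unrecognized types to `unknown` instead.
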
When an agent receives an error signal caused by an action it
initiated, it MUST either:

(a) Attempt automatic rollback of its checkpoint, or

(b) Escalate per ACP-DAG-HITL HITL rules if the action was
irreversible.

The `atd.upstream_errors` array allows agents to chain error context,
building a causal trace from symptom to root cause.

## HITL Escalation on Error

Error ECTs MAY trigger ACP-DAG-HITL rules. A deployment can define
HITL rules such as:

~~~json
{
  "id": "r-critical-error",
  "trigger": {
    "kind": "keyword_match",
    "op": "eq",
    "value": "critical",
    "input_ref": "atd.severity"
  },
  "required_role": "operator:oncall",
  "action": "escalate",
  "allow_override": true,
  "override_action": "continue"
}
~~~
{: #fig-error-hitl title="HITL Rule for Critical Errors"}

# Circuit Breaker Pattern {#circuit-breaker}

Each agent MUST implement a circuit breaker for every downstream agent
it communicates with. The circuit breaker has three states:

CLOSED (normal):
: Requests flow through. The agent tracks the error rate over a
  sliding window (default: 60 seconds).

OPEN (failure detected):
: When the error rate exceeds a threshold (default: 50%), the breaker
  opens. All requests are immediately rejected. The agent MUST emit a
  circuit breaker ECT:

~~~json
{
  "exec_act": "atd:circuit_open",
  "ext": {
    "atd.downstream_agent": "spiffe://example.com/agent/b",
    "atd.error_rate": 0.75,
    "atd.window_s": 60
  }
}
~~~
{: #fig-circuit title="Circuit Breaker ECT"}

HALF-OPEN (recovery probe):
: After a cooldown period (default: 30s), the breaker allows one probe
  request. If it succeeds, the breaker returns to CLOSED. If it fails,
  it returns to OPEN with doubled cooldown (exponential backoff, max
  300s).

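
The three states above can be sketched as a small state machine. The
following is an illustrative, non-normative sketch using the defaults
stated in this section (60 s window, 50% threshold, 30 s cooldown,
300 s maximum). The class and method names, the injectable `clock`
parameter, and the choice to reset the cooldown after a successful
probe are assumptions made for the example. Emission of the
`atd:circuit_open` ECT is not shown.

```python
import time
from collections import deque


class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "CLOSED", "OPEN", "HALF-OPEN"

    def __init__(self, threshold=0.5, window_s=60.0, cooldown_s=30.0,
                 max_cooldown_s=300.0, clock=time.monotonic):
        self.threshold = threshold
        self.window_s = window_s
        self.base_cooldown_s = cooldown_s
        self.cooldown_s = cooldown_s
        self.max_cooldown_s = max_cooldown_s
        self.clock = clock
        self.state = self.CLOSED
        self.opened_at = None
        self.samples = deque()  # (timestamp, ok) pairs in the window

    def allow_request(self):
        """Return True if a request may be sent downstream."""
        if self.state == self.OPEN:
            if self.clock() - self.opened_at >= self.cooldown_s:
                self.state = self.HALF_OPEN  # admit exactly one probe
                return True
            return False
        # In HALF-OPEN a probe is already in flight; reject the rest.
        return self.state == self.CLOSED

    def record(self, ok):
        """Record the outcome of a request that was allowed through."""
        now = self.clock()
        if self.state == self.HALF_OPEN:
            if ok:
                self.state = self.CLOSED
                # Assumption: a successful probe resets the backoff.
                self.cooldown_s = self.base_cooldown_s
                self.samples.clear()
            else:
                self.cooldown_s = min(self.cooldown_s * 2,
                                      self.max_cooldown_s)
                self._open(now)
            return
        self.samples.append((now, ok))
        while self.samples and now - self.samples[0][0] > self.window_s:
            self.samples.popleft()
        if self.state == self.CLOSED and self.error_rate() > self.threshold:
            self._open(now)

    def error_rate(self):
        if not self.samples:
            return 0.0
        failures = sum(1 for _, ok in self.samples if not ok)
        return failures / len(self.samples)

    def _open(self, now):
        self.state = self.OPEN
        self.opened_at = now
```

Taken literally, the threshold rule lets a single failed request open
the breaker when it is the only sample in the window. An
implementation would likely add a minimum sample count before
evaluating the error rate, which this document does not specify.
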
Circuit breaker thresholds can be configured as ACP-DAG-HITL node
constraints:

~~~json
{
  "constraints": {
    "atd.circuit_threshold": 0.5,
    "atd.circuit_window_s": 60
  }
}
~~~
{: #fig-circuit-policy title="Circuit Breaker Policy"}

# Rollback Protocol {#rollback}

A rollback is initiated by emitting a rollback request ECT and sending
an HTTP POST to the target agent's rollback endpoint:

~~~
POST /atd/rollback HTTP/1.1
Content-Type: application/json
Execution-Context:
~~~

- `exec_act`: `"atd:rollback_request"`
- `par`: the checkpoint ECT to roll back to

~~~json
{
  "exec_act": "atd:rollback_request",
  "par": ["ckpt-uuid"],
  "ext": {
    "atd.reason": "Upstream action caused cascading failure",
    "atd.cascade": true
  }
}
~~~
{: #fig-rollback-req title="Rollback Request ECT"}

When `atd.cascade` is `true`, the receiving agent MUST also initiate
rollback of any downstream checkpoints created as a consequence of the
checkpointed action.

The agent MUST respond with a rollback result ECT:

- `exec_act`: `"atd:rollback_result"`
- `par`: the rollback request ECT

~~~json
{
  "exec_act": "atd:rollback_result",
  "par": ["rollback-request-uuid"],
  "out_hash": "sha256-of-restored-state",
  "ext": {
    "atd.status": "completed",
    "atd.checkpoint_id": "ckpt-uuid",
    "atd.cascaded": [
      {"agent": "spiffe://example.com/agent/c", "status": "completed"},
      {"agent": "spiffe://example.com/agent/d", "status": "escalated"}
    ]
  }
}
~~~
{: #fig-rollback-result title="Rollback Result ECT"}

Status values: `completed`, `partial`, `escalated`, `failed`.

`escalated` means the action was irreversible and a human operator has
been notified per ACP-DAG-HITL `unreachable_human` policy.

Agents MUST implement idempotent rollback: receiving the same rollback
request ECT `jti` twice MUST return the same result.

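
One way to meet the idempotency requirement is to cache the rollback
result under the request's `jti` and replay it on any repeat. The
sketch below is illustrative only. The `RollbackHandler` class, the
in-memory stores, the `restore` callback, and the encoding of
`out_hash` as a hex SHA-256 over key-sorted JSON are assumptions made
for the example. Cascading rollback, authentication, and the human
notification behind `escalated` are omitted.

```python
import hashlib
import json
import uuid


class RollbackHandler:
    def __init__(self, checkpoints, restore):
        # checkpoints: dict mapping checkpoint jti -> checkpoint claims
        # restore: callable(checkpoint) -> restored state (JSON-serializable)
        self.checkpoints = checkpoints
        self.restore = restore
        self.results = {}  # rollback request jti -> rollback result ECT

    def handle(self, request):
        """Process a rollback request ECT and return the result ECT."""
        req_jti = request["jti"]
        # Idempotency: a repeated jti returns the stored result and
        # never re-executes the rollback.
        if req_jti in self.results:
            return self.results[req_jti]

        ckpt_id = request["par"][0]
        checkpoint = self.checkpoints.get(ckpt_id)
        result = {
            "jti": str(uuid.uuid4()),
            "exec_act": "atd:rollback_result",
            "par": [req_jti],
            "ext": {"atd.checkpoint_id": ckpt_id},
        }
        if checkpoint is None:
            status = "failed"        # unknown or expired checkpoint
        elif not checkpoint["ext"]["atd.reversible"]:
            status = "escalated"     # human hand-off not shown here
        else:
            state = self.restore(checkpoint)
            canonical = json.dumps(state, sort_keys=True).encode()
            # Assumption: out_hash is the hex SHA-256 of the state.
            result["out_hash"] = hashlib.sha256(canonical).hexdigest()
            status = "completed"
        result["ext"]["atd.status"] = status
        self.results[req_jti] = result
        return result
```

A production agent would keep the result cache in the same durable
storage as its checkpoints so that a retry arriving after a restart
still receives the original result.
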
# Resource Hints {#resources}

Agents MAY declare resource requirements as ECT extension claims or
ACP-DAG-HITL node constraints:

~~~json
{
  "constraints": {
    "atd.resource_cpu": "2",
    "atd.resource_memory_mb": 4096,
    "atd.resource_timeout_s": 300,
    "atd.resource_priority": "high"
  }
}
~~~
{: #fig-resources title="Resource Hints as Node Constraints"}

Orchestrators (e.g., Kubernetes schedulers, agent gateways) MAY use
these hints for scheduling and quota enforcement. Resource hints are
advisory; agents MUST NOT depend on them for correctness.

# Security Considerations

Rollback requests are sensitive operations. Agents MUST authenticate
rollback requests using the ECT identity binding (L2/L3). Only agents
in the same workflow (`wid`) with checkpoint lineage in the DAG SHOULD
be authorized to request rollback.

Checkpoint data may contain sensitive system state. Agents MUST
encrypt stored checkpoints at rest and MUST NOT include checkpoint
contents in error ECTs.

Circuit breaker state reveals system health topology. The
`atd:circuit_open` ECT is part of the audit trail; access to the audit
ledger SHOULD be controlled.

Malicious agents could send false error ECTs to trigger unnecessary
rollbacks. Agents SHOULD verify that error ECTs reference valid `par`
values within their own workflow DAG.

# IANA Considerations

This document requests registration of the following `exec_act` values
in a future ECT action type registry:

- `atd:checkpoint`
- `atd:error`
- `atd:circuit_open`
- `atd:rollback_request`
- `atd:rollback_result`

--- back

# Acknowledgments
{:numbered="false"}

ATD builds on ECT {{I-D.nennemann-wimse-ect}} for execution evidence
and ACP-DAG-HITL {{I-D.nennemann-agent-dag-hitl-safety}} for
delegation policy. The circuit breaker pattern is adapted from
microservice architecture best practices.