---
title: "Agent Task DAG (ATD): Execution Model, Checkpoints, and Recovery"
abbrev: "ATD"
category: std
docname: draft-atd-agent-task-dag-00
submissiontype: IETF
number:
date:
v: 3
area: "OPS"
workgroup: "NMOP"
keyword:
  - agent DAG
  - checkpoint
  - rollback
  - error recovery
  - circuit breaker

author:
  -
    fullname: TBD
    organization: Independent
    email: placeholder@example.com

normative:
  RFC2119:
  RFC8174:
  RFC8446:
  I-D.nennemann-wimse-ect:
    title: "Execution Context Tokens for Distributed Agentic Workflows"
    target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/
  I-D.nennemann-agent-dag-hitl-safety:
    title: "Agent Context Policy Token: DAG Delegation with Human Override"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/

informative:

--- abstract

This document defines the Agent Task DAG (ATD) specification:
execution semantics, checkpoints, error signaling, circuit
breakers, and rollback for agent workflows.  ATD does not define a
new DAG or token format.  It defines when agents MUST emit ECT
nodes, what those nodes mean, and how to recover when things go
wrong.  Checkpoints, errors, and rollback results are ECT nodes
with specific `exec_act` values and `ext` claims.  Rollback walks
the ECT DAG backwards.  Circuit breakers contain cascading
failures.  Resource hints enable scheduling.  The protocol is
transport-agnostic and builds on ECT for evidence and ACP-DAG-HITL
for policy.

--- middle

# Introduction

Autonomous agents increasingly make unsupervised decisions, yet no
standard exists for how agents checkpoint state, signal errors to
peers, contain cascading failures, or roll back decisions gone
wrong.

ATD borrows proven patterns from distributed systems: checkpoints
from database transactions, circuit breakers from microservice
architectures, and rollback from version control.  It adapts these
to agent workflows where actions may be partially reversible and
where the agent that caused the error may not be the best one to
fix it.

ATD does not define a new DAG format.  The ECT DAG
{{I-D.nennemann-wimse-ect}} IS the execution graph.  ATD defines
the semantics of specific node types within that graph.

Design principles:

1. Agents that take consequential actions MUST be able to undo
   them, or MUST declare them irreversible upfront.
2. Failure containment takes priority over failure diagnosis.
3. The protocol adds minimal overhead to the happy path.

# Conventions and Definitions

{::boilerplate bcp14-tagged}

Checkpoint:
: An ECT node recording agent state before a consequential action,
  sufficient to restore the system to that state.

Circuit Breaker:
: A mechanism that stops an agent from propagating requests to a
  failing downstream agent, preventing cascading failures.

Rollback:
: The process of reverting an agent's actions and state to a
  previously recorded checkpoint.

Blast Radius:
: The set of agents and systems affected by a single failure.

# Node States {#node-states}

Each task node in the ECT DAG has an implicit state derived from
subsequent ECT nodes:

- **pending**: A delegation node exists in ACP-DAG-HITL
  {{I-D.nennemann-agent-dag-hitl-safety}} but no corresponding ECT
  has been emitted.
- **running**: An ECT with `exec_act` matching the task type has
  been emitted but no completion or error ECT follows.
- **done**: A completion ECT, or a subsequent ECT whose `par`
  references this node, exists.
- **failed**: An `atd:error` ECT references this node.
- **rolled_back**: An `atd:rollback_result` ECT references this
  node's checkpoint.
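The derivation above can be sketched as a small function.  This is a minimal illustration, not a normative algorithm.  It assumes ECT payloads are available as plain dictionaries, that any non-`atd:` ECT whose `par` references the task counts as completion, and that states are checked in the precedence rolled_back, failed, done, running.  The function name `derive_state` and the precedence order are assumptions of this sketch.

```python
def derive_state(task_jti, ects):
    """Derive the implicit ATD state of one task node from a list of ECT dicts."""
    # No ECT emitted for the task yet: only the delegation exists.
    if not any(e.get("jti") == task_jti for e in ects):
        return "pending"

    # ECTs that name this task in their `par` array.
    children = [e for e in ects if task_jti in e.get("par", [])]
    checkpoints = {
        e["jti"] for e in children if e.get("exec_act") == "atd:checkpoint"
    }

    # rolled_back: a rollback result names one of this node's checkpoints.
    for e in ects:
        if (e.get("exec_act") == "atd:rollback_result"
                and e.get("ext", {}).get("atd.checkpoint_id") in checkpoints):
            return "rolled_back"

    # failed: an error ECT references this node.
    if any(e.get("exec_act") == "atd:error" for e in children):
        return "failed"

    # done (assumption): any non-ATD successor counts as completion.
    if any(not e.get("exec_act", "").startswith("atd:") for e in children):
        return "done"

    return "running"
```

A deployment that emits explicit completion ECTs would replace the "done" test with a check for that specific `exec_act` value.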

# Checkpoint Mechanism {#checkpoints}

An ATD-compliant agent MUST create a checkpoint before any action
it classifies as consequential.  An action is consequential if it
modifies external state (network config, database records, API
calls with side effects).

A checkpoint is an ECT with:

- `exec_act`: `"atd:checkpoint"`
- `par`: the ECT of the action being checkpointed

~~~json
{
  "jti": "ckpt-uuid",
  "exec_act": "atd:checkpoint",
  "par": ["action-ect-uuid"],
  "out_hash": "sha256-of-agent-state-snapshot",
  "ext": {
    "atd.reversible": true,
    "atd.rollback_uri": "https://agent-b.example.com/atd/rollback",
    "atd.target": "router-07.example.com",
    "atd.description": "Update BGP peer config",
    "atd.ttl": 86400
  }
}
~~~
{: #fig-checkpoint title="Checkpoint ECT"}

The `atd.reversible` field MUST be present.  If it is `false`, the
agent declares that the action cannot be automatically undone, and
any rollback request for it MUST be escalated per the ACP-DAG-HITL
`unreachable_human` policy.

The `out_hash` provides integrity verification: the agent hashes
its state at checkpoint time so that rollback can verify it is
restoring to an authentic prior state.
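One way to compute and later verify `out_hash` is sketched below.  The sketch assumes the agent state can be serialized as JSON, that a sorted-key, whitespace-free serialization is an acceptable canonical form, and that the hash is carried as a hex string.  None of these choices is defined by this document, and the helper names `snapshot_hash`, `make_checkpoint`, and `verify_restore` are hypothetical.

```python
import hashlib
import json
import uuid


def snapshot_hash(state):
    """SHA-256 over a canonical JSON form of the state (assumed encoding)."""
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_checkpoint(action_jti, state, reversible, ttl_s, rollback_uri=None):
    """Build a checkpoint ECT payload as a plain dict (unsigned sketch)."""
    ext = {"atd.reversible": reversible, "atd.ttl": ttl_s}
    if rollback_uri is not None:
        ext["atd.rollback_uri"] = rollback_uri
    return {
        "jti": str(uuid.uuid4()),
        "exec_act": "atd:checkpoint",
        "par": [action_jti],
        "out_hash": snapshot_hash(state),
        "ext": ext,
    }


def verify_restore(checkpoint, restored_state):
    """Return True if the restored state hashes to the checkpointed value."""
    return checkpoint["out_hash"] == snapshot_hash(restored_state)
```

Because keys are sorted before hashing, two snapshots with the same content but different key order produce the same `out_hash`.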

Checkpoints MUST be stored for at least `atd.ttl` seconds.  Agents
SHOULD store checkpoints in durable storage that survives restarts.

## Hierarchical Checkpoints

Agents MAY create hierarchical checkpoints where a parent groups
multiple child checkpoints from a multi-step operation.  Rolling
back the parent rolls back all children.  The parent checkpoint's
`par` array references all child checkpoint `jti` values.
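The parent-to-children expansion can be sketched as a recursive walk.  The sketch assumes checkpoints are held in a dictionary keyed by `jti`, that a `par` entry naming another checkpoint marks a child, and that children are rolled back in reverse order of listing.  The reverse ordering and the function name `expand_checkpoint` are assumptions; this document does not specify a rollback order.

```python
def expand_checkpoint(jti, checkpoints):
    """Return the leaf checkpoint jtis to roll back for `jti`, newest first."""
    ckpt = checkpoints[jti]
    # A `par` entry that is itself a checkpoint is treated as a child.
    children = [p for p in ckpt.get("par", []) if p in checkpoints]
    if not children:
        return [jti]  # leaf checkpoint: roll back directly
    order = []
    for child in reversed(children):
        order.extend(expand_checkpoint(child, checkpoints))
    return order
```

For a parent whose `par` is `["c1", "c2"]`, where both are leaf checkpoints, the function returns `["c2", "c1"]`.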

# Error Signaling {#errors}

When an agent detects an error, it MUST emit an error ECT:

- `exec_act`: `"atd:error"`
- `par`: the ECT of the failed action

~~~json
{
  "jti": "error-uuid",
  "exec_act": "atd:error",
  "par": ["failed-action-ect-uuid"],
  "ext": {
    "atd.severity": "critical",
    "atd.error_type": "action_failed",
    "atd.description": "BGP session did not establish",
    "atd.checkpoint_id": "ckpt-uuid",
    "atd.upstream_errors": []
  }
}
~~~
{: #fig-error title="Error ECT"}

Severity levels: `info`, `warning`, `error`, `critical`.

Error types: `action_failed`, `timeout`, `constraint_violation`,
`resource_exhausted`, `upstream_cascade`, `unknown`.

When an agent receives an error signal caused by an action it
initiated, it MUST either:

(a) Attempt automatic rollback of its checkpoint, or
(b) Escalate per the ACP-DAG-HITL rules if the action was
    irreversible.

The `atd.upstream_errors` array allows agents to chain error
context, building a causal trace from symptom to root cause.
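A causal trace can be reconstructed by following `atd.upstream_errors` from the observed symptom.  The sketch below assumes that each array entry is the `jti` of another error ECT, which this document does not state explicitly, and that error ECTs are available in a dictionary keyed by `jti`.  The function name `causal_trace` is hypothetical.

```python
def causal_trace(error_jti, errors):
    """Walk upstream error references from symptom toward root cause."""
    trace, stack, seen = [], [error_jti], set()
    while stack:
        jti = stack.pop()
        if jti in seen or jti not in errors:
            continue  # skip repeats and references we cannot resolve
        seen.add(jti)
        trace.append(jti)
        upstream = errors[jti].get("ext", {}).get("atd.upstream_errors", [])
        # Reverse so the first listed upstream error is visited first.
        stack.extend(reversed(upstream))
    return trace
```

Entries in the returned list whose `atd.upstream_errors` array is empty are the candidate root causes.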

## HITL Escalation on Error

Error ECTs MAY trigger ACP-DAG-HITL rules.  A deployment can
define HITL rules such as:

~~~json
{
  "id": "r-critical-error",
  "trigger": {
    "kind": "keyword_match",
    "op": "eq",
    "value": "critical",
    "input_ref": "atd.severity"
  },
  "required_role": "operator:oncall",
  "action": "escalate",
  "allow_override": true,
  "override_action": "continue"
}
~~~
{: #fig-error-hitl title="HITL Rule for Critical Errors"}

# Circuit Breaker Pattern {#circuit-breaker}

Each agent MUST implement a circuit breaker for every downstream
agent it communicates with.  The circuit breaker has three states:

CLOSED (normal):
: Requests flow through.  The agent tracks the error rate over a
  sliding window (default: 60 seconds).

OPEN (failure detected):
: When the error rate exceeds a threshold (default: 50%), the
  breaker opens.  All requests are immediately rejected.  The
  agent MUST emit a circuit breaker ECT:

~~~json
{
  "exec_act": "atd:circuit_open",
  "ext": {
    "atd.downstream_agent": "spiffe://example.com/agent/b",
    "atd.error_rate": 0.75,
    "atd.window_s": 60
  }
}
~~~
{: #fig-circuit title="Circuit Breaker ECT"}

HALF-OPEN (recovery probe):
: After a cooldown period (default: 30 seconds), the breaker allows
  one probe request.  If it succeeds, the breaker returns to CLOSED.
  If it fails, it returns to OPEN with doubled cooldown
  (exponential backoff, maximum: 300 seconds).

Circuit breaker thresholds can be configured as ACP-DAG-HITL
node constraints:

~~~json
{
  "constraints": {
    "atd.circuit_threshold": 0.5,
    "atd.circuit_window_s": 60
  }
}
~~~
{: #fig-circuit-policy title="Circuit Breaker Policy"}
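The three-state behavior described in this section can be sketched as follows.  The class is illustrative only: the name `CircuitBreaker`, the explicit `now` argument (used instead of a wall clock to keep the example deterministic), and the reset of the cooldown after a successful probe are assumptions of this sketch.  The defaults mirror the values given above.

```python
from collections import deque


class CircuitBreaker:
    """Per-downstream breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED/OPEN."""

    def __init__(self, threshold=0.5, window_s=60, cooldown_s=30,
                 max_cooldown_s=300):
        self.threshold = threshold
        self.window_s = window_s
        self.base_cooldown_s = cooldown_s
        self.cooldown_s = cooldown_s
        self.max_cooldown_s = max_cooldown_s
        self.state = "CLOSED"
        self.opened_at = 0
        self.samples = deque()  # (timestamp, ok) pairs inside the window

    def error_rate(self):
        if not self.samples:
            return 0.0
        failures = sum(1 for _, ok in self.samples if not ok)
        return failures / len(self.samples)

    def allow(self, now):
        """Return True if a request may be sent to the downstream agent."""
        if self.state == "OPEN" and now - self.opened_at >= self.cooldown_s:
            self.state = "HALF_OPEN"
            return True  # exactly one probe request
        return self.state == "CLOSED"

    def record(self, ok, now):
        """Record the outcome of a request that was allowed through."""
        if self.state == "HALF_OPEN":
            if ok:
                self.state = "CLOSED"
                self.cooldown_s = self.base_cooldown_s  # assumed reset
                self.samples.clear()
            else:
                # Failed probe: double the cooldown, capped at the maximum.
                self.cooldown_s = min(self.cooldown_s * 2, self.max_cooldown_s)
                self._open(now)
            return
        self.samples.append((now, ok))
        while self.samples and now - self.samples[0][0] > self.window_s:
            self.samples.popleft()
        if self.state == "CLOSED" and self.error_rate() > self.threshold:
            self._open(now)

    def _open(self, now):
        # A real agent would also emit the atd:circuit_open ECT here.
        self.state = "OPEN"
        self.opened_at = now
```

This sketch opens on a single failure if that failure is the only sample in the window.  A deployment would likely add a minimum sample count, which this document does not define.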

# Rollback Protocol {#rollback}

A rollback is initiated by emitting a rollback request ECT and
sending an HTTP POST to the target agent's rollback endpoint:

~~~
POST /atd/rollback HTTP/1.1
Content-Type: application/json
Execution-Context: <rollback-request-ect>
~~~

The rollback request ECT has:

- `exec_act`: `"atd:rollback_request"`
- `par`: the checkpoint ECT to roll back to

~~~json
{
  "jti": "rollback-request-uuid",
  "exec_act": "atd:rollback_request",
  "par": ["ckpt-uuid"],
  "ext": {
    "atd.reason": "Upstream action caused cascading failure",
    "atd.cascade": true
  }
}
~~~
{: #fig-rollback-req title="Rollback Request ECT"}

When `atd.cascade` is `true`, the receiving agent MUST also
initiate rollback of any downstream checkpoints created as a
consequence of the checkpointed action.

The agent MUST respond with a rollback result ECT:

- `exec_act`: `"atd:rollback_result"`
- `par`: the rollback request ECT

~~~json
{
  "exec_act": "atd:rollback_result",
  "par": ["rollback-request-uuid"],
  "out_hash": "sha256-of-restored-state",
  "ext": {
    "atd.status": "completed",
    "atd.checkpoint_id": "ckpt-uuid",
    "atd.cascaded": [
      {"agent": "spiffe://example.com/agent/c", "status": "completed"},
      {"agent": "spiffe://example.com/agent/d", "status": "escalated"}
    ]
  }
}
~~~
{: #fig-rollback-result title="Rollback Result ECT"}

Status values: `completed`, `partial`, `escalated`, `failed`.

`escalated` means the action was irreversible and a human operator
has been notified per ACP-DAG-HITL `unreachable_human` policy.

Agents MUST implement idempotent rollback: receiving the same
rollback request ECT `jti` twice MUST return the same result.
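Idempotent handling can be sketched by caching the result ECT under the request `jti`.  The sketch assumes an in-memory cache, a caller-supplied `restore` callable that performs the actual state restoration, and unsigned dictionary payloads.  Cascading to downstream agents is omitted.  The class name `RollbackHandler` and its interface are hypothetical.

```python
import uuid


class RollbackHandler:
    """Handle rollback request ECTs so that repeats return the same result."""

    def __init__(self, checkpoints, restore):
        self.checkpoints = checkpoints  # checkpoint jti -> checkpoint ECT
        self.restore = restore          # callable(checkpoint) -> None
        self.results = {}               # request jti -> result ECT

    def handle(self, request):
        req_jti = request["jti"]
        # Same request jti seen before: return the stored result unchanged.
        if req_jti in self.results:
            return self.results[req_jti]

        ckpt_id = request["par"][0]
        ckpt = self.checkpoints.get(ckpt_id)
        if ckpt is None:
            status = "failed"      # unknown or expired checkpoint
        elif not ckpt.get("ext", {}).get("atd.reversible", False):
            status = "escalated"   # irreversible: hand off to a human
        else:
            self.restore(ckpt)
            status = "completed"

        result = {
            "jti": str(uuid.uuid4()),
            "exec_act": "atd:rollback_result",
            "par": [req_jti],
            "ext": {"atd.status": status, "atd.checkpoint_id": ckpt_id},
        }
        self.results[req_jti] = result
        return result
```

A production agent would keep the result cache in durable storage so that idempotency survives restarts, and would retain entries at least as long as the checkpoint itself.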

# Resource Hints {#resources}

Agents MAY declare resource requirements as ECT extension claims
or ACP-DAG-HITL node constraints:

~~~json
{
  "constraints": {
    "atd.resource_cpu": "2",
    "atd.resource_memory_mb": 4096,
    "atd.resource_timeout_s": 300,
    "atd.resource_priority": "high"
  }
}
~~~
{: #fig-resources title="Resource Hints as Node Constraints"}

Orchestrators (e.g., Kubernetes schedulers, agent gateways) MAY
use these hints for scheduling and quota enforcement.  Resource
hints are advisory; agents MUST NOT depend on them for
correctness.

# Security Considerations

Rollback requests are sensitive operations.  Agents MUST
authenticate rollback requests using the ECT identity binding
(L2/L3).  Agents SHOULD authorize rollback requests only from
agents in the same workflow (`wid`) that have checkpoint lineage in
the DAG.

Checkpoint data may contain sensitive system state.  Agents MUST
encrypt stored checkpoints at rest and MUST NOT include checkpoint
contents in error ECTs.

Circuit breaker state reveals system health topology.  The
`atd:circuit_open` ECT is part of the audit trail; access to the
audit ledger SHOULD be controlled.

Malicious agents could send false error ECTs to trigger
unnecessary rollbacks.  Agents SHOULD verify that error ECTs
reference valid `par` values within their own workflow DAG.
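The `par` check can be sketched as below.  It assumes signature and identity verification have already succeeded, that the local view of the workflow DAG is a dictionary keyed by `jti`, and that each ECT carries its workflow identifier in a `wid` claim.  The function name `error_ect_is_plausible` is hypothetical.

```python
def error_ect_is_plausible(error, dag, wid):
    """Check that an error ECT points at known nodes of the local workflow."""
    if error.get("exec_act") != "atd:error":
        return False
    if error.get("wid") != wid:
        return False
    parents = error.get("par", [])
    # Every parent must exist in the local DAG and belong to the workflow.
    return bool(parents) and all(
        p in dag and dag[p].get("wid") == wid for p in parents
    )
```

An agent would run this check before acting on the error, and would ignore or log error ECTs for which it returns `False`.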

# IANA Considerations

This document requests registration of the following `exec_act`
values in a future ECT action type registry:

- `atd:checkpoint`
- `atd:error`
- `atd:circuit_open`
- `atd:rollback_request`
- `atd:rollback_result`

--- back

# Acknowledgments
{:numbered="false"}

ATD builds on ECT {{I-D.nennemann-wimse-ect}} for execution
evidence and ACP-DAG-HITL {{I-D.nennemann-agent-dag-hitl-safety}}
for delegation policy.  The circuit breaker pattern is adapted
from microservice architecture best practices.