---
title: "Agent Task DAG (ATD): Execution Model, Checkpoints, and Recovery"
abbrev: "ATD"
category: std
docname: draft-atd-agent-task-dag-01
submissiontype: IETF
number:
date:
v: 3
area: "OPS"
workgroup: "NMOP"
keyword:
  - agent DAG
  - checkpoint
  - rollback
  - error recovery
  - circuit breaker

author:
  -
    fullname: TBD
    organization: Independent
    email: placeholder@example.com

normative:
  RFC2119:
  RFC8174:
  RFC8446:
  RFC9110:
  RFC8615:
  I-D.nennemann-wimse-ect:
    title: "Execution Context Tokens for Distributed Agentic Workflows"
    target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/
  I-D.nennemann-agent-dag-hitl-safety:
    title: "Agent Context Policy Token: DAG Delegation with Human Override"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/

informative:

--- abstract

This document defines the Agent Task DAG (ATD) specification:
execution semantics, checkpoints, error signaling, circuit
breakers, and rollback for agent workflows.  ATD does not define a
new DAG or token format.  It defines when agents MUST emit ECT
nodes, what those nodes mean, and how to recover when things go
wrong.  Checkpoints, errors, and rollback results are ECT nodes
with specific `exec_act` values and `ext` claims.  Rollback walks
the ECT DAG backwards.  Circuit breakers contain cascading
failures.  Resource hints enable scheduling.  The protocol is
transport-agnostic and builds on ECT for evidence and ACP-DAG-HITL
for policy.

--- middle

# Introduction

Autonomous agents increasingly make unsupervised decisions, yet no
standard exists for how agents checkpoint state, signal errors to
peers, contain cascading failures, or roll back decisions gone
wrong.

ATD borrows proven patterns from distributed systems: checkpoints
from database transactions, circuit breakers from microservice
architectures, and rollback from version control.  It adapts these
to agent workflows where actions may be partially reversible and
where the agent that caused the error may not be the best one to
fix it.

ATD does not define a new DAG format.  The ECT DAG
{{I-D.nennemann-wimse-ect}} IS the execution graph.  ATD defines
the semantics of specific node types within that graph.

Design principles:

1. Agents that take consequential actions MUST be able to undo
   them, or MUST declare them irreversible upfront.
2. Failure containment takes priority over failure diagnosis.
3. The protocol adds minimal overhead to the happy path.

# Conventions and Definitions

{::boilerplate bcp14-tagged}

Checkpoint:
: An ECT node recording agent state before a consequential action,
  sufficient to restore the system to that state.

Circuit Breaker:
: A mechanism that stops an agent from propagating requests to a
  failing downstream agent, preventing cascading failures.

Rollback:
: The process of reverting an agent's actions and state to a
  previously recorded checkpoint.

Blast Radius:
: The set of agents and systems affected by a single failure.

Consequential Action:
: An action that modifies external state (network configuration,
  database records, API calls with side effects) such that
  reversal requires explicit effort.

# Execution Semantics {#execution}

## Topological Order

Tasks in the ECT DAG MUST execute in topological order: a task
MUST NOT begin execution until all tasks referenced by its ECT
`par` claims are in state `done`.

Two tasks MAY execute concurrently when neither is an ancestor of
the other in the DAG (no `par` path leads from one to the other).
Orchestrators SHOULD exploit this parallelism for performance.

Circular dependencies are prohibited.  Agents MUST reject
ACP-DAG-HITL delegation DAGs containing cycles.
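
The ordering and cycle rules above can be illustrated with Kahn's
algorithm.  The following non-normative sketch assumes a hypothetical
in-memory `parents` mapping (task id to the ids in its `par` claim);
the function name `execution_waves` is likewise an assumption of this
example, not part of ECT or ACP-DAG-HITL.  Tasks placed in the same
wave do not depend on each other.

~~~python
def execution_waves(parents: dict[str, list[str]]) -> list[list[str]]:
    """Group tasks into waves that respect `par` ordering.

    `parents` maps each task id to the ids in its `par` claim
    (a hypothetical in-memory view of the DAG, not an ECT API).
    Raises ValueError if the graph contains a cycle.
    """
    remaining = {task: set(pars) for task, pars in parents.items()}
    children: dict[str, list[str]] = {task: [] for task in parents}
    for task, pars in remaining.items():
        for par in pars:
            if par not in children:
                raise ValueError(
                    f"task {task!r} references unknown parent {par!r}")
            children[par].append(task)

    waves: list[list[str]] = []
    # Tasks with no parents are ready immediately.
    ready = sorted(task for task, pars in remaining.items() if not pars)
    scheduled = 0
    while ready:
        waves.append(ready)
        scheduled += len(ready)
        unblocked = []
        for task in ready:
            for child in children[task]:
                remaining[child].discard(task)
                if not remaining[child]:  # all parents are now done
                    unblocked.append(child)
        ready = sorted(unblocked)
    if scheduled != len(parents):
        # Some tasks never became ready, so a cycle exists.
        raise ValueError("cycle detected: the delegation DAG must be rejected")
    return waves
~~~

Waves are a conservative grouping: an orchestrator could start a task
as soon as its own parents are `done` rather than waiting for the
whole preceding wave.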

## Workflow Boundary ECTs

When a workflow begins, the initiating agent MUST emit:

~~~json
{
  "exec_act": "atd:workflow_start",
  "ext": {
    "atd.wf_id": "wf-uuid",
    "atd.description": "BGP failover workflow",
    "atd.node_count": 5
  }
}
~~~
{: #fig-wf-start title="Workflow Start ECT"}

When the workflow reaches a terminal state (all leaf nodes
complete or any node failed with no rollback path), the
orchestrator MUST emit:

~~~json
{
  "exec_act": "atd:workflow_complete",
  "par": ["wf-start-ect-uuid"],
  "ext": {
    "atd.wf_id": "wf-uuid",
    "atd.terminal_status": "success",
    "atd.elapsed_s": 42
  }
}
~~~
{: #fig-wf-complete title="Workflow Complete ECT"}

Terminal status values: `success`, `partial`, `failed`,
`rolled_back`, `escalated`.

# Node States {#node-states}

Each task node in the ECT DAG has an implicit state derived from
subsequent ECT nodes:

- **pending**: A delegation node exists in ACP-DAG-HITL but no
  corresponding ECT has been emitted.
- **running**: An ECT matching the task type has been emitted
  but no completion or error ECT follows.
- **done**: A completion ECT (or the next `par`-linked ECT) exists.
- **failed**: An `atd:error` ECT references this node.
- **rolled_back**: An `atd:rollback_result` ECT references this
  node's checkpoint.
- **escalated**: The task failed and a human has been notified
  per HITL escalation rules.
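
As a non-normative illustration, the state derivation above might be
sketched as follows.  The sketch assumes each ECT is available as a
dictionary with `jti`, `exec_act`, `par`, and `ext` keys, treats any
`par`-linked ECT other than a checkpoint as evidence of completion,
and gives `rolled_back` precedence over `failed`; these choices are
assumptions of the example.  The `escalated` state is omitted because
it depends on HITL records outside the ECT list.

~~~python
def node_state(task_jti: str, ects: list[dict]) -> str:
    """Derive the implicit state of a task node from a list of ECTs."""
    if all(e["jti"] != task_jti for e in ects):
        return "pending"  # no ECT has been emitted for the task yet

    # ECTs that name this task in their `par` claim.
    children = [e for e in ects if task_jti in e.get("par", [])]
    checkpoints = {
        e["jti"] for e in children if e["exec_act"] == "atd:checkpoint"
    }

    for e in ects:
        rolled_back_id = e.get("ext", {}).get("atd.checkpoint_id")
        if e["exec_act"] == "atd:rollback_result" and rolled_back_id in checkpoints:
            return "rolled_back"
    if any(e["exec_act"] == "atd:error" for e in children):
        return "failed"
    if any(e["exec_act"] != "atd:checkpoint" for e in children):
        return "done"  # a later par-linked ECT exists
    return "running"
~~~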

# Checkpoint Mechanism {#checkpoints}

## Checkpoint Placement Policy

An ATD-compliant agent MUST create a checkpoint before any action
it classifies as consequential.  The following actions are always
consequential and MUST be checkpointed:

1. Any modification to network device configuration.
2. Any write to a shared database or external data store.
3. Any API call with side effects (for example, HTTP methods that
   are not safe as defined in {{RFC9110}}).
4. Any delegation to another agent that will itself take
   consequential actions.

The following SHOULD be checkpointed:

1. Long-running computations (> `atd.resource_timeout_s`).
2. Actions that cannot be verified without external state.

The following are exempt from checkpoint requirements:

1. Read-only queries.
2. Sending notifications with no side effects.
3. Internal state computations with no external observable effect.

## Checkpoint ECT Format

A checkpoint is an ECT with:

- `exec_act`: `"atd:checkpoint"`
- `par`: the ECT of the action being checkpointed

~~~json
{
  "jti": "ckpt-uuid",
  "exec_act": "atd:checkpoint",
  "par": ["action-ect-uuid"],
  "out_hash": "sha256-of-agent-state-snapshot",
  "ext": {
    "atd.reversible": true,
    "atd.rollback_uri": "https://agent-b.example.com/.well-known/atd/rollback",
    "atd.target": "router-07.example.com",
    "atd.description": "Update BGP peer config",
    "atd.ttl": 86400
  }
}
~~~
{: #fig-checkpoint title="Checkpoint ECT"}

The `atd.reversible` field MUST be present.  If `false`, the agent
declares that this action cannot be automatically undone and
rollback requests MUST be escalated per the ACP-DAG-HITL
`unreachable_human` policy.

The `out_hash` provides integrity verification: the agent hashes
its state at checkpoint time so that rollback can verify it is
restoring to an authentic prior state.

Checkpoints MUST be stored for at least `atd.ttl` seconds.  Agents
SHOULD store checkpoints in durable storage that survives restarts.

The rollback URI MUST be a well-known URI per {{RFC8615}} at the
path `/.well-known/atd/rollback`.
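
The following non-normative sketch shows one way an agent might
assemble a checkpoint ECT payload and later verify a restored state
against `out_hash`.  The canonical JSON serialization, the hex
encoding of the digest, and the helper names `build_checkpoint` and
`state_matches` are assumptions of this example; the ECT
specification may prescribe a different hash encoding.

~~~python
import hashlib
import json
import uuid


def _state_digest(state: dict) -> str:
    # Assumption: state is hashed over a canonical JSON form
    # (sorted keys, no whitespace) so equal states hash equally.
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_checkpoint(action_jti: str, state: dict, rollback_host: str,
                     ttl_s: int, reversible: bool = True) -> dict:
    """Return checkpoint ECT claims for the given agent state."""
    return {
        "jti": str(uuid.uuid4()),
        "exec_act": "atd:checkpoint",
        "par": [action_jti],
        "out_hash": _state_digest(state),
        "ext": {
            "atd.reversible": reversible,
            "atd.rollback_uri":
                f"https://{rollback_host}/.well-known/atd/rollback",
            "atd.ttl": ttl_s,
        },
    }


def state_matches(checkpoint: dict, restored_state: dict) -> bool:
    """Check that a restored state hashes to the recorded `out_hash`."""
    return checkpoint["out_hash"] == _state_digest(restored_state)
~~~

Only the digest appears in the ECT; the state snapshot itself stays
in the agent's own checkpoint store.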

## Hierarchical Checkpoints

Agents MAY create hierarchical checkpoints where a parent groups
multiple child checkpoints from a multi-step operation.  Rolling
back the parent rolls back all children.  The parent checkpoint's
`par` array references all child checkpoint `jti` values.

## ATD `exec_act` Summary Table

| `exec_act` value | When emitted | Required `ext` fields |
|-----------------|-------------|----------------------|
| `atd:checkpoint` | Before consequential action | `atd.reversible`, `atd.rollback_uri`, `atd.ttl` |
| `atd:error` | On failure detection | `atd.severity`, `atd.error_type`, `atd.checkpoint_id` |
| `atd:circuit_open` | When error rate exceeds threshold | `atd.downstream_agent`, `atd.error_rate`, `atd.window_s` |
| `atd:circuit_close` | When probe succeeds in HALF-OPEN | `atd.downstream_agent`, `atd.cooldown_s` |
| `atd:rollback_request` | To initiate rollback | `atd.reason`, `atd.cascade` |
| `atd:rollback_result` | Rollback complete or failed | `atd.status`, `atd.checkpoint_id`, `atd.cascaded` |
| `atd:workflow_start` | Workflow begins | `atd.wf_id`, `atd.description` |
| `atd:workflow_complete` | Workflow terminal | `atd.wf_id`, `atd.terminal_status` |
{: #fig-actions title="ATD exec_act Values"}

# Error Signaling {#errors}

When an agent detects an error, it MUST emit an error ECT:

- `exec_act`: `"atd:error"`
- `par`: the ECT of the failed action

~~~json
{
  "jti": "error-uuid",
  "exec_act": "atd:error",
  "par": ["failed-action-ect-uuid"],
  "ext": {
    "atd.severity": "critical",
    "atd.error_type": "action_failed",
    "atd.description": "BGP session did not establish",
    "atd.checkpoint_id": "ckpt-uuid",
    "atd.upstream_errors": []
  }
}
~~~
{: #fig-error title="Error ECT"}

Severity levels (in increasing order): `info`, `warning`,
`error`, `critical`.

Error types: `action_failed`, `timeout`, `constraint_violation`,
`resource_exhausted`, `upstream_cascade`, `unknown`.

When an agent receives an error signal caused by an action it
initiated, it MUST do one of the following:

1. Attempt automatic rollback to its checkpoint.
2. Escalate per the ACP-DAG-HITL rules if the action was
   irreversible.

The `atd.upstream_errors` array allows agents to chain error
context, building a causal trace from symptom to root cause.
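
A non-normative sketch of how such a trace might be followed is given
below.  This document does not fix the element type of
`atd.upstream_errors`; the sketch assumes each entry is the `jti` of
an earlier error ECT, and the helper names `build_error` and
`root_causes` are hypothetical.

~~~python
SEVERITIES = ["info", "warning", "error", "critical"]  # increasing order


def build_error(jti: str, failed_jti: str, severity: str, error_type: str,
                checkpoint_id: str, upstream: tuple = ()) -> dict:
    """Return error ECT claims, chaining any upstream error ids."""
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity {severity!r}")
    return {
        "jti": jti,
        "exec_act": "atd:error",
        "par": [failed_jti],
        "ext": {
            "atd.severity": severity,
            "atd.error_type": error_type,
            "atd.checkpoint_id": checkpoint_id,
            # Assumption: entries are jti values of earlier error ECTs.
            "atd.upstream_errors": list(upstream),
        },
    }


def root_causes(error_jti: str, errors_by_jti: dict[str, dict]) -> list[str]:
    """Walk `atd.upstream_errors` back to errors with no upstream."""
    roots, stack, seen = [], [error_jti], set()
    while stack:
        jti = stack.pop()
        if jti in seen:
            continue
        seen.add(jti)
        upstream = errors_by_jti[jti]["ext"].get("atd.upstream_errors", [])
        if upstream:
            stack.extend(upstream)
        else:
            roots.append(jti)
    return sorted(roots)
~~~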

## HITL Escalation on Error

Error ECTs with severity `critical` SHOULD trigger HITL
escalation.  Deployments SHOULD define ACP-DAG-HITL rules such
as:

~~~json
{
  "id": "r-critical-error",
  "trigger": {
    "kind": "keyword_match",
    "op": "eq",
    "value": "critical",
    "input_ref": "atd.severity"
  },
  "required_role": "operator:oncall",
  "action": "escalate",
  "allow_override": true,
  "override_action": "continue"
}
~~~
{: #fig-error-hitl title="HITL Rule for Critical Errors"}

# Circuit Breaker Pattern {#circuit-breaker}

Each agent MUST implement a circuit breaker for every downstream
agent it communicates with.  The circuit breaker has three states:

CLOSED (normal):
: Requests flow through.  The agent tracks the error rate over a
  sliding window (default: 60 seconds).

OPEN (failure detected):
: When the error rate exceeds a threshold (default: 50%), the
  breaker opens.  All requests are immediately rejected.  The
  agent MUST emit a circuit breaker open ECT:

~~~json
{
  "exec_act": "atd:circuit_open",
  "ext": {
    "atd.downstream_agent": "spiffe://example.com/agent/b",
    "atd.error_rate": 0.75,
    "atd.window_s": 60
  }
}
~~~
{: #fig-circuit-open title="Circuit Breaker Open ECT"}

HALF-OPEN (recovery probe):
: After a cooldown period (default: 30s), the breaker allows one
  probe request.  If it succeeds, the breaker returns to CLOSED
  and MUST emit:

~~~json
{
  "exec_act": "atd:circuit_close",
  "ext": {
    "atd.downstream_agent": "spiffe://example.com/agent/b",
    "atd.cooldown_s": 30
  }
}
~~~
{: #fig-circuit-close title="Circuit Breaker Close ECT"}

  If the probe fails, the breaker returns to OPEN with doubled
  cooldown (exponential backoff, max 300s).

## Circuit Breaker State Machine

~~~
         error_rate > threshold
CLOSED ─────────────────────────► OPEN
  ▲                                  │
  │ probe success                    │ cooldown expires
  │                                  ▼
  └────────────────────────── HALF-OPEN
         probe failure ──► OPEN (cooldown * 2)
~~~
{: #fig-fsm title="Circuit Breaker State Machine"}
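
The state machine can be sketched in code as follows.  This is a
non-normative illustration using the defaults stated in this section
(50% threshold, 60 second window, 30 second cooldown, 300 second
maximum).  The class and method names, the caller-supplied `now`
timestamps, and the `emitted` list standing in for the agent's ECT
stream are assumptions of the example.

~~~python
class CircuitBreaker:
    """Per-downstream breaker with CLOSED, OPEN, and HALF_OPEN states."""

    def __init__(self, downstream: str, threshold: float = 0.5,
                 window_s: float = 60, cooldown_s: float = 30,
                 max_cooldown_s: float = 300):
        self.downstream = downstream
        self.threshold = threshold
        self.window_s = window_s
        self.base_cooldown_s = cooldown_s
        self.cooldown_s = cooldown_s
        self.max_cooldown_s = max_cooldown_s
        self.state = "CLOSED"
        self.opened_at = 0.0
        self.probe_in_flight = False
        self.outcomes: list[tuple[float, bool]] = []  # (time, succeeded)
        self.emitted: list[dict] = []  # stand-in for the ECT stream

    def allow(self, now: float) -> bool:
        """Return True if a request may be sent at time `now`."""
        if self.state == "OPEN" and now - self.opened_at >= self.cooldown_s:
            self.state = "HALF_OPEN"
            self.probe_in_flight = False
        if self.state == "HALF_OPEN":
            if self.probe_in_flight:
                return False  # only one probe is allowed
            self.probe_in_flight = True
            return True
        return self.state == "CLOSED"

    def record(self, succeeded: bool, now: float) -> None:
        """Record the outcome of a request that was allowed."""
        if self.state == "HALF_OPEN":
            if succeeded:
                self._emit("atd:circuit_close",
                           {"atd.cooldown_s": self.cooldown_s})
                self.state = "CLOSED"
                self.cooldown_s = self.base_cooldown_s
                self.outcomes.clear()
            else:
                # Probe failed: reopen with doubled, capped cooldown.
                self.state = "OPEN"
                self.opened_at = now
                self.cooldown_s = min(self.cooldown_s * 2,
                                      self.max_cooldown_s)
            return
        if self.state == "OPEN":
            return  # late result for a request sent before opening
        self.outcomes.append((now, succeeded))
        self.outcomes = [(t, ok) for t, ok in self.outcomes
                         if now - t <= self.window_s]
        failures = sum(1 for _, ok in self.outcomes if not ok)
        error_rate = failures / len(self.outcomes)
        if error_rate > self.threshold:
            self.state = "OPEN"
            self.opened_at = now
            self._emit("atd:circuit_open",
                       {"atd.error_rate": error_rate,
                        "atd.window_s": self.window_s})

    def _emit(self, exec_act: str, ext: dict) -> None:
        ext = {"atd.downstream_agent": self.downstream, **ext}
        self.emitted.append({"exec_act": exec_act, "ext": ext})
~~~

The sketch opens on the first window whose error rate exceeds the
threshold, however few requests it contains; a deployment might also
require a minimum sample size, which this document does not specify.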

## Coordinated Circuit Breaking

When multiple agents share a downstream dependency, each maintains
its own circuit breaker independently.  However, agents SHOULD
publish circuit breaker state via their ECT stream so peers can
observe the signal.

If an orchestrator observes N circuit breakers opening for the
same downstream agent within a short window, it SHOULD initiate
a HITL escalation rather than allowing N parallel recovery probes.

## Circuit Breaker Policy Configuration

Circuit breaker thresholds can be configured as ACP-DAG-HITL
node constraints:

~~~json
{
  "constraints": {
    "atd.circuit_threshold": 0.5,
    "atd.circuit_window_s": 60
  }
}
~~~
{: #fig-circuit-policy title="Circuit Breaker Policy"}

# Rollback Protocol {#rollback}

## Basic Rollback

A rollback is initiated by emitting a rollback request ECT and
sending an HTTP POST to the target agent's rollback endpoint:

~~~
POST /.well-known/atd/rollback HTTP/1.1
Content-Type: application/json
Execution-Context: <rollback-request-ect>
~~~

- `exec_act`: `"atd:rollback_request"`
- `par`: the checkpoint ECT to roll back to

~~~json
{
  "exec_act": "atd:rollback_request",
  "par": ["ckpt-uuid"],
  "ext": {
    "atd.reason": "Upstream action caused cascading failure",
    "atd.cascade": true
  }
}
~~~
{: #fig-rollback-req title="Rollback Request ECT"}

When `atd.cascade` is `true`, the receiving agent MUST also
initiate rollback of any downstream checkpoints created as a
consequence of the checkpointed action.

The agent MUST respond with a rollback result ECT:

~~~json
{
  "exec_act": "atd:rollback_result",
  "par": ["rollback-request-uuid"],
  "out_hash": "sha256-of-restored-state",
  "ext": {
    "atd.status": "completed",
    "atd.checkpoint_id": "ckpt-uuid",
    "atd.cascaded": [
      {"agent": "spiffe://example.com/agent/c", "status": "completed"},
      {"agent": "spiffe://example.com/agent/d", "status": "escalated"}
    ]
  }
}
~~~
{: #fig-rollback-result title="Rollback Result ECT"}

Status values: `completed`, `partial`, `escalated`, `failed`.

`escalated` means the action was irreversible and a human operator
has been notified per ACP-DAG-HITL `unreachable_human` policy.

## Partial Rollback and Blast Radius Containment

When a failure occurs in the middle of a DAG, it is often
undesirable to roll back the entire workflow.  ATD defines
partial rollback as rolling back the failed subgraph while
preserving completed sibling branches.

Partial rollback MUST only proceed if:

1. The checkpoints to be rolled back are in the same workflow
   (`atd.wf_id`).
2. No completed sibling task depends on the output of the
   failed task (verified by walking the DAG forward from the
   checkpoint).

The blast radius is the set of agents holding checkpoints that
are descendants of the failed node.  Orchestrators SHOULD
compute blast radius before initiating cascade rollback to
avoid unnecessary disruption.
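
A non-normative sketch of the blast radius computation follows.  It
assumes ECTs are available as dictionaries and that each carries a
hypothetical `agent` key identifying the emitting agent; how agent
identity is actually conveyed is defined by the ECT specification,
not here.

~~~python
def blast_radius(failed_jti: str, ects: list[dict]) -> set[str]:
    """Return agents holding checkpoints that descend from a failed node."""
    # Index ECTs by the parents they name in `par`.
    children: dict[str, list[dict]] = {}
    for ect in ects:
        for parent in ect.get("par", []):
            children.setdefault(parent, []).append(ect)

    agents: set[str] = set()
    seen = {failed_jti}
    stack = [failed_jti]
    while stack:
        for child in children.get(stack.pop(), []):
            if child["jti"] in seen:
                continue
            seen.add(child["jti"])
            stack.append(child["jti"])
            if child["exec_act"] == "atd:checkpoint":
                # `agent` is an assumed field for this example.
                agents.add(child["agent"])
    return agents
~~~

Sibling branches that do not descend from the failed node are never
visited, so their checkpoints stay outside the computed set.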

## Rollback Timeout and Escalation

Every rollback request has an implicit timeout derived from the
original checkpoint's `atd.ttl`.  If rollback has not completed
within `atd.ttl / 2` seconds, the agent MUST:

1. Emit an `atd:error` ECT with `atd.error_type: "timeout"` and
   an `atd.description` noting the rollback timeout.
2. Escalate to HITL per {{hitl-escalation}}.

Agents MUST implement idempotent rollback: receiving the same
rollback request ECT `jti` twice MUST return the same result.
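
Idempotency and the timeout rule might be sketched as follows.  This
non-normative example assumes a hypothetical `restore` callable that
performs the actual state restoration and returns a status string,
and it keeps results in memory; a real agent would need durable
storage so that replays after a restart also return the same result.

~~~python
from typing import Callable


class RollbackHandler:
    """Replays the stored result when a request `jti` is seen again."""

    def __init__(self, restore: Callable[[str], str]):
        # `restore` is a hypothetical hook: checkpoint id -> status.
        self._restore = restore
        self._results: dict[str, dict] = {}

    def handle(self, request: dict) -> dict:
        jti = request["jti"]
        if jti in self._results:
            return self._results[jti]  # duplicate: do not restore twice
        checkpoint_id = request["par"][0]
        status = self._restore(checkpoint_id)
        result = {
            "exec_act": "atd:rollback_result",
            "par": [jti],
            "ext": {
                "atd.status": status,
                "atd.checkpoint_id": checkpoint_id,
                "atd.cascaded": [],
            },
        }
        self._results[jti] = result
        return result


def rollback_deadline_s(checkpoint: dict) -> float:
    """Seconds allowed for rollback: half of the checkpoint's atd.ttl."""
    return checkpoint["ext"]["atd.ttl"] / 2
~~~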

## Rollback Authorization {#rollback-authz}

Agents SHOULD authorize a rollback request only if the requester
belongs to the same workflow (`wid`) and has lineage to the
checkpoint in the DAG.  Rollback requests from outside the
originating workflow MUST be rejected with HTTP 403.

# Interaction with HITL {#hitl-escalation}

ATD escalates to HITL in the following scenarios:

1. **Irreversible action failure**: An error ECT whose referenced
   checkpoint has `atd.reversible: false` MUST trigger
   HITL Level 2 (approval required) per the companion HITL
   specification.

2. **Rollback failure**: A rollback result with `atd.status:
   "failed"` MUST trigger HITL Level 3 (STOP) on the workflow.

3. **Cascaded rollback of critical nodes**: When `atd.cascade:
   true` rollback propagates to a node with `atd.severity:
   critical`, HITL SHOULD be triggered at Level 1 (PAUSE)
   to allow human review before proceeding.

4. **Circuit breaker permanently open**: If a circuit breaker
   returns to OPEN after 3 successive failed HALF-OPEN probes,
   HITL Level 2 escalation SHOULD be triggered.

ATD-to-HITL escalation is recorded as an ECT linked to both
the triggering error ECT and the HITL override ECT, preserving
the causal chain in the audit DAG.

# Resource Hints {#resources}

## Resource Claim Format

Agents MAY declare resource requirements as ACP-DAG-HITL node
constraints:

~~~json
{
  "constraints": {
    "atd.resource_cpu": "2",
    "atd.resource_memory_mb": 4096,
    "atd.resource_timeout_s": 300,
    "atd.resource_priority": "high",
    "atd.resource_gpu": "0",
    "atd.resource_network_mbps": 100
  }
}
~~~
{: #fig-resources title="Resource Hints as Node Constraints"}

## Priority Levels

The `atd.resource_priority` field MUST be one of: `critical`,
`high`, `normal`, `low`.  Orchestrators SHOULD map these to
scheduling priority classes (e.g., Kubernetes QoS classes:
`critical` → Guaranteed, `high`/`normal` → Burstable, `low`
→ BestEffort).

## Fair-Share Scheduling

When multiple agents compete for a shared resource pool,
orchestrators SHOULD implement fair-share scheduling:

1. Each active workflow receives an equal base allocation.
2. Unused allocation from `low` priority agents is redistributed
   to `high`/`critical` agents within the same scheduling cycle.
3. Starvation prevention: `low` priority agents MUST eventually
   be scheduled within a configurable maximum wait (default: 300s).
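
The first two rules might be sketched as follows.  This non-normative
example assumes a single divisible resource, one priority and one
demand figure per workflow, and that spare capacity is offered to
`critical` workflows before `high` ones; none of these details are
fixed by this document.  Starvation prevention is a timing concern
and is not modeled.

~~~python
PRIORITY_RANK = {"critical": 0, "high": 1, "normal": 2, "low": 3}


def fair_share(total: float,
               demands: dict[str, tuple[str, float]]) -> dict[str, float]:
    """Allocate `total` units across workflows.

    `demands` maps a workflow id to (priority, requested units).
    """
    if not demands:
        return {}
    base = total / len(demands)  # rule 1: equal base allocation
    alloc = {wf: min(base, demand) for wf, (_, demand) in demands.items()}

    # Rule 2: capacity left unused by `low` workflows is redistributed.
    spare = sum(base - alloc[wf]
                for wf, (prio, _) in demands.items() if prio == "low")
    hungry = sorted(
        (wf for wf, (prio, demand) in demands.items()
         if prio in ("critical", "high") and demand > alloc[wf]),
        key=lambda wf: (PRIORITY_RANK[demands[wf][0]], wf),
    )
    for wf in hungry:
        extra = min(spare, demands[wf][1] - alloc[wf])
        alloc[wf] += extra
        spare -= extra
    return alloc
~~~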

## Unsatisfiable Resource Hints

Resource hints are advisory; agents MUST NOT depend on them for
correctness.  When resource hints cannot be satisfied:

- If `atd.resource_priority` is `critical`: the orchestrator
  SHOULD pre-empt lower-priority tasks.
- If `critical` tasks still cannot be scheduled within 60s:
  emit an `atd:error` ECT with `atd.error_type:
  "resource_exhausted"` and escalate to HITL.
- All other priorities: proceed with degraded resources and log
  a warning via an `atd:error` ECT with `atd.severity: "warning"`.

# Optional Declarative Workflow Format {#workflow-format}

To support pre-run planning and tooling, ATD defines an optional
declarative workflow descriptor.  This is a planning artifact
only; at runtime it is realized as ECTs per this specification.

~~~json
{
  "wf_id": "bgp-failover-v2",
  "description": "BGP peer failover with validation",
  "nodes": [
    {
      "id": "n1",
      "label": "validate-config",
      "reversible": true,
      "hitl_required": false,
      "resource_hints": {
        "priority": "normal",
        "timeout_s": 30
      }
    },
    {
      "id": "n2",
      "label": "update-bgp-peer",
      "reversible": true,
      "hitl_required": true,
      "resource_hints": {
        "priority": "critical",
        "timeout_s": 120
      }
    },
    {
      "id": "n3",
      "label": "verify-session",
      "reversible": false,
      "hitl_required": false,
      "resource_hints": {
        "priority": "high",
        "timeout_s": 60
      }
    }
  ],
  "edges": [
    {"from": "n1", "to": "n2"},
    {"from": "n2", "to": "n3"}
  ]
}
~~~
{: #fig-workflow title="Declarative Workflow Descriptor"}

The workflow descriptor media type is
`application/atd-workflow+json`.  Orchestrators MAY store and
version workflow descriptors independently of their ECT runtime
realization.

The `hitl_required` field signals to the HITL system that this
node requires an approval gate as defined in the companion HITL
specification.
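
Tooling that consumes a descriptor might first validate it and derive
the parent relationships that the runtime `par` claims are expected
to mirror.  The following non-normative sketch assumes the descriptor
has already been parsed into a dictionary; the function name
`descriptor_parents` and the specific checks are assumptions of this
example.

~~~python
ALLOWED_PRIORITIES = {"critical", "high", "normal", "low"}


def descriptor_parents(descriptor: dict) -> dict[str, list[str]]:
    """Validate a workflow descriptor and map each node to its parents."""
    ids = [node["id"] for node in descriptor["nodes"]]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate node id in descriptor")

    for node in descriptor["nodes"]:
        priority = node.get("resource_hints", {}).get("priority")
        # A missing priority is accepted; a present one must be known.
        if priority is not None and priority not in ALLOWED_PRIORITIES:
            raise ValueError(
                f"node {node['id']!r} has unknown priority {priority!r}")

    parents: dict[str, list[str]] = {node_id: [] for node_id in ids}
    for edge in descriptor["edges"]:
        if edge["from"] not in parents or edge["to"] not in parents:
            raise ValueError(f"edge references unknown node: {edge}")
        parents[edge["to"]].append(edge["from"])
    return parents
~~~

For the descriptor in {{fig-workflow}}, this yields `n1` with no
parents, `n2` depending on `n1`, and `n3` depending on `n2`.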

# Security Considerations

## Rollback Authorization

Rollback requests are high-privilege operations.  Agents MUST
authenticate rollback requests using the ECT identity binding
(L2/L3).  The rollback endpoint MUST require mutual TLS or a
signed JWT from an agent within the same workflow DAG.

Only agents that emitted an ECT that is an ancestor of the
checkpoint being rolled back SHOULD be authorized to request that
rollback.

## Checkpoint Confidentiality

Checkpoint data may contain sensitive system state (API keys,
session tokens, configuration).  Agents:

- MUST encrypt stored checkpoints at rest.
- MUST reference checkpoint state in ECTs only via `out_hash`.
- MUST NOT include checkpoint contents in error ECTs.

## False Error Injection

A malicious agent could send false `atd:error` ECTs to trigger
unnecessary rollbacks and disrupt workflows.  Mitigation:

- Agents SHOULD verify that error ECTs reference valid `par`
  values within their own workflow DAG (`wid` claim).
- Rollback MUST require authentication (see {{rollback-authz}}).
- L2/L3 ECT signing prevents unauthenticated error injection.

## Checkpoint Flooding

An adversary could exhaust checkpoint storage by triggering
many checkpoints.  Mitigation:

- Agents SHOULD enforce a maximum checkpoint count per workflow.
- Expired checkpoints (past `atd.ttl`) MUST be purged.
- Checkpoint creation rate SHOULD be rate-limited per calling
  workflow.

## Circuit Breaker State Leakage

The `atd:circuit_open` ECT reveals system health topology.  The
audit ledger SHOULD enforce access controls: only agents within
the same workflow or authorized operators SHOULD be able to query
circuit breaker history.

# IANA Considerations

This document requests registration of the following values in
the AEM Ecosystem Extension Registry established by
draft-aem-agent-ecosystem-model:

## `exec_act` Values

| Value | Description | Reference |
|-------|-------------|-----------|
| `atd:checkpoint` | State snapshot before consequential action | This document |
| `atd:error` | Error signal with severity and type | This document |
| `atd:circuit_open` | Circuit breaker opened to downstream agent | This document |
| `atd:circuit_close` | Circuit breaker returned to CLOSED state | This document |
| `atd:rollback_request` | Initiate rollback to named checkpoint | This document |
| `atd:rollback_result` | Result of rollback attempt | This document |
| `atd:workflow_start` | Workflow began execution | This document |
| `atd:workflow_complete` | Workflow reached terminal state | This document |
{: #fig-iana-actions title="ATD exec_act Registrations"}

## Well-Known URI

This document requests registration of `atd/rollback` as a
well-known URI suffix per {{RFC8615}}.

## Media Type

This document requests registration of
`application/atd-workflow+json` for the declarative workflow
descriptor format defined in {{workflow-format}}.

--- back

# Acknowledgments
{:numbered="false"}

ATD builds on ECT {{I-D.nennemann-wimse-ect}} for execution
evidence and ACP-DAG-HITL {{I-D.nennemann-agent-dag-hitl-safety}}
for delegation policy.  The circuit breaker pattern is adapted
from microservice architecture best practices.  The declarative
workflow format is inspired by workflow description languages
(BPEL, BPMN) adapted for lightweight agent coordination.