ietf-draft-analyzer/workspace/drafts/new-drafts/draft-b-atd-agent-task-dag-01.md
Christian Nennemann 2506b6325a
feat: add draft data, gap analysis report, and workspace config
2026-04-06 18:47:15 +02:00


fullname: TBD
organization: Independent
email: placeholder@example.com

normative:
  RFC2119:
  RFC8174:
  RFC8446:
  RFC9110:
  RFC8615:
  I-D.nennemann-wimse-ect:
    title: "Execution Context Tokens for Distributed Agentic Workflows"
    target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/
  I-D.nennemann-agent-dag-hitl-safety:
    title: "Agent Context Policy Token: DAG Delegation with Human Override"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/

informative:

--- abstract

This document defines the Agent Task DAG (ATD) specification: execution semantics, checkpoints, error signaling, circuit breakers, and rollback for agent workflows. ATD does not define a new DAG or token format. It defines when agents MUST emit ECT nodes, what those nodes mean, and how to recover when things go wrong. Checkpoints, errors, and rollback results are ECT nodes with specific exec_act values and ext claims. Rollback walks the ECT DAG backwards. Circuit breakers contain cascading failures. Resource hints enable scheduling. The protocol is transport-agnostic and builds on ECT for evidence and ACP-DAG-HITL for policy.

--- middle

Introduction

Autonomous agents increasingly make unsupervised decisions, yet no standard exists for how agents checkpoint state, signal errors to peers, contain cascading failures, or roll back decisions gone wrong.

ATD borrows proven patterns from distributed systems: checkpoints from database transactions, circuit breakers from microservice architectures, and rollback from version control. It adapts these to agent workflows where actions may be partially reversible and where the agent that caused the error may not be the best one to fix it.

ATD does not define a new DAG format. The ECT DAG {{I-D.nennemann-wimse-ect}} IS the execution graph. ATD defines the semantics of specific node types within that graph.

Design principles:

  1. Agents that take consequential actions MUST be able to undo them, or MUST declare them irreversible upfront.
  2. Failure containment takes priority over failure diagnosis.
  3. The protocol adds minimal overhead to the happy path.

Conventions and Definitions

{::boilerplate bcp14-tagged}

Checkpoint:
An ECT node recording agent state before a consequential action, sufficient to restore the system to that state.
Circuit Breaker:
A mechanism that stops an agent from propagating requests to a failing downstream agent, preventing cascading failures.
Rollback:
The process of reverting an agent's actions and state to a previously recorded checkpoint.
Blast Radius:
The set of agents and systems affected by a single failure.
Consequential Action:
An action that modifies external state (network configuration, database records, API calls with side effects) such that reversal requires explicit effort.

Execution Semantics

Topological Order

Tasks in the ECT DAG MUST execute in topological order: a task MUST NOT begin execution until all tasks referenced by its ECT par claims are in state done.

Two tasks where neither is an ancestor of the other in the DAG (no par path connects them) MAY execute concurrently. Orchestrators SHOULD exploit this parallelism for performance.

Circular dependencies are prohibited. Agents MUST reject ACP-DAG-HITL delegation DAGs containing cycles.
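
The ordering and cycle rules above can be sketched with Python's standard graphlib module. This is a minimal illustration, not a normative algorithm: the function name execution_waves and the par_by_task mapping (task id to the ids named in its par claim) are assumptions made for the example.

```python
from graphlib import CycleError, TopologicalSorter


def execution_waves(par_by_task: dict[str, list[str]]) -> list[list[str]]:
    """Group tasks into waves; a task appears only after all its par tasks."""
    sorter = TopologicalSorter(par_by_task)
    try:
        sorter.prepare()  # raises CycleError if the graph contains a cycle
    except CycleError as exc:
        raise ValueError(f"DAG rejected, cycle found: {exc.args[1]}") from exc
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose parents are all done
        waves.append(ready)
        sorter.done(*ready)  # marking them done unlocks their children
    return waves


# Hypothetical four-task workflow: n2 and n3 both depend only on n1.
dag = {"n1": [], "n2": ["n1"], "n3": ["n1"], "n4": ["n2", "n3"]}
print(execution_waves(dag))  # [['n1'], ['n2', 'n3'], ['n4']]
```

Tasks that land in the same wave have no par path between them, so an orchestrator could run them concurrently.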

Workflow Boundary ECTs

When a workflow begins, the initiating agent MUST emit:

{
  "exec_act": "atd:workflow_start",
  "ext": {
    "atd.wf_id": "wf-uuid",
    "atd.description": "BGP failover workflow",
    "atd.node_count": 5
  }
}

{: #fig-wf-start title="Workflow Start ECT"}

When the workflow reaches a terminal state (all leaf nodes complete or any node failed with no rollback path), the orchestrator MUST emit:

{
  "exec_act": "atd:workflow_complete",
  "par": ["wf-start-ect-uuid"],
  "ext": {
    "atd.wf_id": "wf-uuid",
    "atd.terminal_status": "success",
    "atd.elapsed_s": 42
  }
}

{: #fig-wf-complete title="Workflow Complete ECT"}

Terminal status values: success, partial, failed, rolled_back, escalated.

Node States

Each task node in the ECT DAG has an implicit state derived from subsequent ECT nodes:

  • pending: A delegation node exists in ACP-DAG-HITL but no corresponding ECT has been emitted.
  • running: An ECT matching the task type has been emitted but no completion or error ECT follows.
  • done: A completion ECT (or the next par-linked ECT) exists.
  • failed: An atd:error ECT references this node.
  • rolled_back: An atd:rollback_result ECT references this node's checkpoint.
  • escalated: The task failed and a human has been notified per HITL escalation rules.

Checkpoint Mechanism

Checkpoint Placement Policy

An ATD-compliant agent MUST create a checkpoint before any action it classifies as consequential. The following actions are always consequential and MUST be checkpointed:

  1. Any modification to network device configuration.
  2. Any write to a shared database or external data store.
  3. Any API call with side effects (non-idempotent HTTP methods).
  4. Any delegation to another agent that will itself take consequential actions.

The following SHOULD be checkpointed:

  1. Long-running computations (those expected to run longer than atd.resource_timeout_s).
  2. Actions that cannot be verified without external state.

The following are exempt from checkpoint requirements:

  1. Read-only queries.
  2. Sending notifications with no side effects.
  3. Internal state computations with no external observable effect.

Checkpoint ECT Format

A checkpoint is an ECT with:

  • exec_act: "atd:checkpoint"
  • par: the ECT of the action being checkpointed

{
  "jti": "ckpt-uuid",
  "exec_act": "atd:checkpoint",
  "par": ["action-ect-uuid"],
  "out_hash": "sha256-of-agent-state-snapshot",
  "ext": {
    "atd.reversible": true,
    "atd.rollback_uri": "https://agent-b.example.com/.well-known/atd/rollback",
    "atd.target": "router-07.example.com",
    "atd.description": "Update BGP peer config",
    "atd.ttl": 86400
  }
}

{: #fig-checkpoint title="Checkpoint ECT"}

The atd.reversible field MUST be present. If false, the agent declares that this action cannot be automatically undone and rollback requests MUST be escalated per the ACP-DAG-HITL unreachable_human policy.

The out_hash provides integrity verification: the agent hashes its state at checkpoint time so that rollback can verify it is restoring to an authentic prior state.

Checkpoints MUST be stored for at least atd.ttl seconds. Agents SHOULD store checkpoints in durable storage that survives restarts.

The rollback URI MUST be a well-known URI per {{RFC8615}} at the path /.well-known/atd/rollback.
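
As a rough sketch of how an agent might assemble a checkpoint and later verify a restored state against out_hash: the helper names, the canonical JSON serialization (sorted keys, compact separators), and the hex encoding of the digest are all assumptions for illustration; this document does not define how the state snapshot is serialized or how the hash is encoded.

```python
import hashlib
import json
import uuid


def state_hash(state: dict) -> str:
    # Assumption: hash a canonical JSON form so equal states hash equally.
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_checkpoint(action_jti: str, state: dict, rollback_host: str,
                    reversible: bool, ttl_s: int) -> dict:
    """Build the claims of an atd:checkpoint ECT (signing is out of scope)."""
    return {
        "jti": f"ckpt-{uuid.uuid4()}",
        "exec_act": "atd:checkpoint",
        "par": [action_jti],
        "out_hash": state_hash(state),
        "ext": {
            "atd.reversible": reversible,
            "atd.rollback_uri": f"https://{rollback_host}/.well-known/atd/rollback",
            "atd.ttl": ttl_s,
        },
    }


def verify_restored_state(checkpoint: dict, restored: dict) -> bool:
    """True if the restored state matches the hash recorded at checkpoint time."""
    return state_hash(restored) == checkpoint["out_hash"]


ckpt = make_checkpoint("action-ect-uuid", {"peer": "192.0.2.1", "asn": 64500},
                       "agent-b.example.com", reversible=True, ttl_s=86400)
print(verify_restored_state(ckpt, {"asn": 64500, "peer": "192.0.2.1"}))  # True
```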

Hierarchical Checkpoints

Agents MAY create hierarchical checkpoints where a parent groups multiple child checkpoints from a multi-step operation. Rolling back the parent rolls back all children. The parent checkpoint's par array references all child checkpoint jti values.

Checkpoint exec_act Table

| exec_act value | When emitted | Required ext fields |
|---|---|---|
| atd:checkpoint | Before consequential action | atd.reversible, atd.rollback_uri, atd.ttl |
| atd:error | On failure detection | atd.severity, atd.error_type, atd.checkpoint_id |
| atd:circuit_open | When error rate exceeds threshold | atd.downstream_agent, atd.error_rate, atd.window_s |
| atd:circuit_close | When probe succeeds in HALF-OPEN | atd.downstream_agent, atd.cooldown_s |
| atd:rollback_request | To initiate rollback | atd.reason, atd.cascade |
| atd:rollback_result | Rollback complete or failed | atd.status, atd.checkpoint_id, atd.cascaded |
| atd:workflow_start | Workflow begins | atd.wf_id, atd.description |
| atd:workflow_complete | Workflow terminal | atd.wf_id, atd.terminal_status |
{: #fig-actions title="ATD exec_act Values"}

Error Signaling

When an agent detects an error, it MUST emit an error ECT:

  • exec_act: "atd:error"
  • par: the ECT of the failed action

{
  "jti": "error-uuid",
  "exec_act": "atd:error",
  "par": ["failed-action-ect-uuid"],
  "ext": {
    "atd.severity": "critical",
    "atd.error_type": "action_failed",
    "atd.description": "BGP session did not establish",
    "atd.checkpoint_id": "ckpt-uuid",
    "atd.upstream_errors": []
  }
}

{: #fig-error title="Error ECT"}

Severity levels (in increasing order): info, warning, error, critical.

Error types: action_failed, timeout, constraint_violation, resource_exhausted, upstream_cascade, unknown.

When an agent receives an error signal caused by an action it initiated, it MUST either:

(a) attempt automatic rollback of its checkpoint, or

(b) escalate per the ACP-DAG-HITL escalation rules if the action was irreversible.

The atd.upstream_errors array allows agents to chain error context, building a causal trace from symptom to root cause.
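
One way a consumer might walk that causal trace is sketched below. It assumes that atd.upstream_errors holds the jti values of upstream error ECTs; this document only shows an empty array, so the element type is an assumption, as are the function name and the errors_by_jti index.

```python
def root_causes(error_jti: str, errors_by_jti: dict[str, dict]) -> list[str]:
    """Follow atd.upstream_errors links; return errors with no upstream error."""
    roots, seen, stack = [], set(), [error_jti]
    while stack:
        jti = stack.pop()
        if jti in seen:
            continue  # already visited via another path
        seen.add(jti)
        # Assumption: each entry is the jti of an upstream error ECT.
        upstream = errors_by_jti[jti]["ext"].get("atd.upstream_errors", [])
        if upstream:
            stack.extend(upstream)
        else:
            roots.append(jti)  # nothing upstream: candidate root cause
    return sorted(roots)


errors = {
    "err-3": {"ext": {"atd.error_type": "upstream_cascade", "atd.upstream_errors": ["err-2"]}},
    "err-2": {"ext": {"atd.error_type": "upstream_cascade", "atd.upstream_errors": ["err-1"]}},
    "err-1": {"ext": {"atd.error_type": "action_failed", "atd.upstream_errors": []}},
}
print(root_causes("err-3", errors))  # ['err-1']
```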

HITL Escalation on Error

Error ECTs with severity critical SHOULD trigger HITL escalation. Deployments SHOULD define ACP-DAG-HITL rules such as:

{
  "id": "r-critical-error",
  "trigger": {
    "kind": "keyword_match",
    "op": "eq",
    "value": "critical",
    "input_ref": "atd.severity"
  },
  "required_role": "operator:oncall",
  "action": "escalate",
  "allow_override": true,
  "override_action": "continue"
}

{: #fig-error-hitl title="HITL Rule for Critical Errors"}

Circuit Breaker Pattern

Each agent MUST implement a circuit breaker for every downstream agent it communicates with. The circuit breaker has three states:

CLOSED (normal):
Requests flow through. The agent tracks the error rate over a sliding window (default: 60 seconds).
OPEN (failure detected):
When the error rate exceeds a threshold (default: 50%), the breaker opens. All requests are immediately rejected. The agent MUST emit a circuit breaker open ECT:
{
  "exec_act": "atd:circuit_open",
  "ext": {
    "atd.downstream_agent": "spiffe://example.com/agent/b",
    "atd.error_rate": 0.75,
    "atd.window_s": 60
  }
}

{: #fig-circuit-open title="Circuit Breaker Open ECT"}

HALF-OPEN (recovery probe):
After a cooldown period (default: 30s), the breaker allows one probe request. If it succeeds, the breaker returns to CLOSED and MUST emit:
{
  "exec_act": "atd:circuit_close",
  "ext": {
    "atd.downstream_agent": "spiffe://example.com/agent/b",
    "atd.cooldown_s": 30
  }
}

{: #fig-circuit-close title="Circuit Breaker Close ECT"}

If the probe fails, the breaker returns to OPEN with doubled cooldown (exponential backoff, max 300s).

Circuit Breaker State Machine

         error_rate > threshold
CLOSED ─────────────────────────► OPEN
  ▲                                  │
  │ probe success                    │ cooldown expires
  │                                  ▼
  └────────────────────────── HALF-OPEN
         probe failure ──► OPEN (cooldown * 2)

{: #fig-fsm title="Circuit Breaker State Machine"}
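
The state machine can be sketched as a small class. This is an illustrative sketch under stated assumptions, not a reference implementation: the class and method names are hypothetical, the clock is injectable for testing, the cooldown resets to its base value after a successful probe (this document does not say whether it should), and no minimum sample count is applied, so a single failure in an otherwise empty window opens the breaker. ECT emission is left to the caller.

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, threshold=0.5, window_s=60.0, cooldown_s=30.0,
                 max_cooldown_s=300.0, clock=time.monotonic):
        self.threshold, self.window_s = threshold, window_s
        self.base_cooldown_s = self.cooldown_s = cooldown_s
        self.max_cooldown_s, self.clock = max_cooldown_s, clock
        self.state, self.opened_at = "CLOSED", 0.0
        self.samples = deque()  # (timestamp, ok) pairs in the sliding window

    def allow_request(self) -> bool:
        if self.state == "OPEN" and self.clock() - self.opened_at >= self.cooldown_s:
            self.state = "HALF-OPEN"  # cooldown expired: admit exactly one probe
            return True
        return self.state == "CLOSED"

    def record(self, ok: bool) -> None:
        now = self.clock()
        if self.state == "OPEN":
            return  # late responses while OPEN are ignored
        if self.state == "HALF-OPEN":
            if ok:  # probe succeeded: caller emits atd:circuit_close
                self.state, self.cooldown_s = "CLOSED", self.base_cooldown_s
                self.samples.clear()
            else:   # probe failed: back to OPEN with doubled cooldown
                self._open(now, min(self.cooldown_s * 2, self.max_cooldown_s))
            return
        self.samples.append((now, ok))
        while self.samples and now - self.samples[0][0] > self.window_s:
            self.samples.popleft()  # drop samples older than the window
        failures = sum(1 for _, good in self.samples if not good)
        if failures / len(self.samples) > self.threshold:
            self._open(now, self.cooldown_s)  # caller emits atd:circuit_open

    def _open(self, now: float, cooldown_s: float) -> None:
        self.state, self.opened_at, self.cooldown_s = "OPEN", now, cooldown_s
```

A caller would check allow_request() before contacting the downstream agent and report each outcome through record().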

Coordinated Circuit Breaking

When multiple agents share a downstream dependency, each maintains its own circuit breaker independently. However, agents SHOULD publish circuit breaker state via their ECT stream so peers can observe the signal.

If an orchestrator observes multiple circuit breakers (a deployment-defined count N) opening for the same downstream agent within a deployment-defined time window, it SHOULD initiate a HITL escalation rather than allowing N parallel recovery probes.

Circuit Breaker Policy Configuration

Circuit breaker thresholds can be configured as ACP-DAG-HITL node constraints:

{
  "constraints": {
    "atd.circuit_threshold": 0.5,
    "atd.circuit_window_s": 60
  }
}

{: #fig-circuit-policy title="Circuit Breaker Policy"}

Rollback Protocol

Basic Rollback

A rollback is initiated by emitting a rollback request ECT and sending an HTTP POST to the target agent's rollback endpoint:

POST /.well-known/atd/rollback HTTP/1.1
Content-Type: application/json
Execution-Context: <rollback-request-ect>

The rollback request is an ECT with:

  • exec_act: "atd:rollback_request"
  • par: the checkpoint ECT to roll back to

{
  "exec_act": "atd:rollback_request",
  "par": ["ckpt-uuid"],
  "ext": {
    "atd.reason": "Upstream action caused cascading failure",
    "atd.cascade": true
  }
}

{: #fig-rollback-req title="Rollback Request ECT"}

When atd.cascade is true, the receiving agent MUST also initiate rollback of any downstream checkpoints created as a consequence of the checkpointed action.

The agent MUST respond with a rollback result ECT:

{
  "exec_act": "atd:rollback_result",
  "par": ["rollback-request-uuid"],
  "out_hash": "sha256-of-restored-state",
  "ext": {
    "atd.status": "completed",
    "atd.checkpoint_id": "ckpt-uuid",
    "atd.cascaded": [
      {"agent": "spiffe://example.com/agent/c", "status": "completed"},
      {"agent": "spiffe://example.com/agent/d", "status": "escalated"}
    ]
  }
}

{: #fig-rollback-result title="Rollback Result ECT"}

Status values: completed, partial, escalated, failed.

escalated means the action was irreversible and a human operator has been notified per ACP-DAG-HITL unreachable_human policy.

Partial Rollback and Blast Radius Containment

When a failure occurs in the middle of a DAG, it is often undesirable to roll back the entire workflow. ATD defines partial rollback as rolling back the failed subgraph while preserving completed sibling branches.

Partial rollback MUST only proceed if:

  1. The checkpoints to be rolled back are in the same workflow (atd.wf_id).
  2. No completed sibling task depends on the output of the failed task (verified by walking the DAG forward from the checkpoint).

The blast radius is the set of agents holding checkpoints that are descendants of the failed node. Orchestrators SHOULD compute blast radius before initiating cascade rollback to avoid unnecessary disruption.
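
A minimal sketch of the forward walk and the blast radius computation follows. The children map (node id to the ids that name it in par), the checkpoint_owner map (checkpointed node id to agent identity), and the state map are hypothetical inputs an orchestrator would derive from the ECT DAG; condition 1 (same atd.wf_id) is assumed to have been checked already.

```python
from collections import deque


def descendants(node: str, children: dict[str, list[str]]) -> set[str]:
    """All nodes reachable by walking the DAG forward from node."""
    seen, queue = set(), deque([node])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def blast_radius(failed: str, children: dict[str, list[str]],
                 checkpoint_owner: dict[str, str]) -> set[str]:
    """Agents holding checkpoints on descendants of the failed node."""
    return {checkpoint_owner[n] for n in descendants(failed, children)
            if n in checkpoint_owner}


def partial_rollback_allowed(failed: str, children: dict[str, list[str]],
                             state: dict[str, str]) -> bool:
    """Condition 2: no completed task depends on the failed task's output."""
    return all(state.get(n) != "done" for n in descendants(failed, children))


children = {"n1": ["n2", "n3"], "n2": ["n4"], "n3": [], "n4": []}
owners = {"n2": "spiffe://example.com/agent/b",
          "n3": "spiffe://example.com/agent/d",
          "n4": "spiffe://example.com/agent/c"}
print(blast_radius("n2", children, owners))  # {'spiffe://example.com/agent/c'}
```

In this example the completed sibling n3 is not a descendant of n2, so it is preserved and stays outside the blast radius.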

Rollback Timeout and Escalation

A rollback request carries an implicit timeout derived from the original checkpoint's atd.ttl. If rollback has not completed within atd.ttl / 2 seconds, the agent MUST:

  1. Emit an atd:error with atd.error_type: "timeout" and an atd.description noting the rollback timeout.
  2. Escalate to HITL per {{hitl-escalation}}.

Agents MUST implement idempotent rollback: receiving the same rollback request ECT jti twice MUST return the same result.
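
Idempotency can be approximated by caching results keyed on the request jti, as in the sketch below. The class name, the restore callable (checkpoint jti in, status string out), and the in-memory dictionary are assumptions for illustration; a real agent would likely need durable storage so replays survive restarts.

```python
from typing import Callable


def rollback_deadline_s(checkpoint_ttl_s: float) -> float:
    """Implicit rollback timeout: half of the checkpoint's atd.ttl."""
    return checkpoint_ttl_s / 2


class RollbackHandler:
    def __init__(self, restore: Callable[[str], str]):
        self._restore = restore  # hypothetical: performs the actual restore
        self._results: dict[str, dict] = {}  # request jti -> result ECT claims

    def handle(self, request: dict) -> dict:
        jti = request["jti"]
        if jti in self._results:
            return self._results[jti]  # replay: same result, no second restore
        checkpoint_id = request["par"][0]
        result = {
            "exec_act": "atd:rollback_result",
            "par": [jti],
            "ext": {
                "atd.status": self._restore(checkpoint_id),
                "atd.checkpoint_id": checkpoint_id,
                "atd.cascaded": [],
            },
        }
        self._results[jti] = result
        return result


calls = []
handler = RollbackHandler(lambda ckpt: calls.append(ckpt) or "completed")
req = {"jti": "rb-1", "exec_act": "atd:rollback_request", "par": ["ckpt-uuid"]}
first, second = handler.handle(req), handler.handle(req)
print(first == second, len(calls))  # True 1
```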

Rollback Authorization

Agents SHOULD authorize a rollback request only if the requester belongs to the same workflow (wid) and has checkpoint lineage in the DAG. Rollback requests from outside the originating workflow MUST be rejected with HTTP 403.

Interaction with HITL

ATD escalates to HITL in the following scenarios:

  1. Irreversible action failure: An error ECT whose referenced checkpoint (atd.checkpoint_id) has atd.reversible: false MUST trigger HITL Level 2 (approval required) per the companion HITL specification.

  2. Rollback failure: A rollback result with atd.status: "failed" MUST trigger HITL Level 3 (STOP) on the workflow.

  3. Cascaded rollback of critical nodes: When atd.cascade: true rollback propagates to a node with atd.severity: critical, HITL SHOULD be triggered at Level 1 (PAUSE) to allow human review before proceeding.

  4. Circuit breaker permanent open: If a circuit breaker re-opens after 3 successive failed HALF-OPEN probes, HITL Level 2 escalation SHOULD be triggered.

ATD-to-HITL escalation is recorded as an ECT linked to both the triggering error ECT and the HITL override ECT, preserving the causal chain in the audit DAG.

Resource Hints

Resource Claim Format

Agents MAY declare resource requirements as ACP-DAG-HITL node constraints:

{
  "constraints": {
    "atd.resource_cpu": "2",
    "atd.resource_memory_mb": 4096,
    "atd.resource_timeout_s": 300,
    "atd.resource_priority": "high",
    "atd.resource_gpu": "0",
    "atd.resource_network_mbps": 100
  }
}

{: #fig-resources title="Resource Hints as Node Constraints"}

Priority Levels

The atd.resource_priority field MUST be one of: critical, high, normal, low. Orchestrators SHOULD map these to scheduling priority classes (e.g., Kubernetes QoS classes: critical → Guaranteed, high/normal → Burstable, low → BestEffort).

Fair-Share Scheduling

When multiple agents compete for a shared resource pool, orchestrators SHOULD implement fair-share scheduling:

  1. Each active workflow receives an equal base allocation.
  2. Unused allocation from low priority agents is redistributed to high/critical agents within the same scheduling cycle.
  3. Starvation prevention: low priority agents MUST eventually be scheduled within a configurable maximum wait (default: 300s).

Unsatisfiable Resource Hints

Resource hints are advisory; agents MUST NOT depend on them for correctness. When resource hints cannot be satisfied:

  • If atd.resource_priority is critical: the orchestrator SHOULD pre-empt lower-priority tasks.
  • If critical tasks still cannot be scheduled within 60s: emit atd:error with atd.error_type: "resource_exhausted" and escalate to HITL.
  • All other priorities: proceed with degraded resources and log a warning via atd:error with atd.severity: "warning".

Optional Declarative Workflow Format

To support pre-run planning and tooling, ATD defines an optional declarative workflow descriptor. This is a planning artifact only; at runtime it is realized as ECTs per this specification.

{
  "wf_id": "bgp-failover-v2",
  "description": "BGP peer failover with validation",
  "nodes": [
    {
      "id": "n1",
      "label": "validate-config",
      "reversible": true,
      "hitl_required": false,
      "resource_hints": {
        "priority": "normal",
        "timeout_s": 30
      }
    },
    {
      "id": "n2",
      "label": "update-bgp-peer",
      "reversible": true,
      "hitl_required": true,
      "resource_hints": {
        "priority": "critical",
        "timeout_s": 120
      }
    },
    {
      "id": "n3",
      "label": "verify-session",
      "reversible": false,
      "hitl_required": false,
      "resource_hints": {
        "priority": "high",
        "timeout_s": 60
      }
    }
  ],
  "edges": [
    {"from": "n1", "to": "n2"},
    {"from": "n2", "to": "n3"}
  ]
}

{: #fig-workflow title="Declarative Workflow Descriptor"}

The workflow descriptor media type is application/atd-workflow+json. Orchestrators MAY store and version workflow descriptors independently of their ECT runtime realization.

The hitl_required field signals to the HITL system that this node requires an approval gate as defined in the companion HITL specification.
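
Tooling that consumes the descriptor might validate it before planning. The sketch below is one possible check, not part of the format definition: the function name is hypothetical, and treating a missing priority as normal is an assumption.

```python
from graphlib import CycleError, TopologicalSorter

PRIORITIES = {"critical", "high", "normal", "low"}


def validate_descriptor(descriptor: dict) -> list[str]:
    """Return node ids in a valid execution order, or raise ValueError."""
    ids = [node["id"] for node in descriptor["nodes"]]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate node id")
    for node in descriptor["nodes"]:
        # Assumption: a node without a priority hint is treated as "normal".
        priority = node.get("resource_hints", {}).get("priority", "normal")
        if priority not in PRIORITIES:
            raise ValueError(f"unknown priority {priority!r} on {node['id']}")
    parents = {node_id: set() for node_id in ids}
    for edge in descriptor["edges"]:
        if edge["from"] not in parents or edge["to"] not in parents:
            raise ValueError(f"edge references unknown node: {edge}")
        parents[edge["to"]].add(edge["from"])
    try:
        return list(TopologicalSorter(parents).static_order())
    except CycleError as exc:
        raise ValueError("descriptor contains a cycle") from exc


descriptor = {
    "wf_id": "bgp-failover-v2",
    "nodes": [
        {"id": "n1", "resource_hints": {"priority": "normal"}},
        {"id": "n2", "resource_hints": {"priority": "critical"}},
        {"id": "n3", "resource_hints": {"priority": "high"}},
    ],
    "edges": [{"from": "n1", "to": "n2"}, {"from": "n2", "to": "n3"}],
}
print(validate_descriptor(descriptor))  # ['n1', 'n2', 'n3']
```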

Security Considerations

Rollback Authorization

Rollback requests are high-privilege operations. Agents MUST authenticate rollback requests using the ECT identity binding (L2/L3). The rollback endpoint MUST require mutual TLS or a signed JWT from an agent within the same workflow DAG.

Only agents that are ancestors in the ECT DAG of the checkpoint being rolled back SHOULD be authorized to request that rollback.
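
The ancestor rule could be checked by walking par links backwards from the checkpoint, as sketched here. The ects index (jti to claims), the use of an iss claim as the agent identity, and the function names are assumptions for illustration; authentication of the request itself (mutual TLS or a signed JWT) is assumed to have happened beforehand.

```python
def ancestor_agents(checkpoint_jti: str, ects: dict[str, dict]) -> set[str]:
    """Issuers of every ECT reachable from the checkpoint by following par."""
    agents, seen = set(), set()
    stack = list(ects[checkpoint_jti].get("par", []))
    while stack:
        jti = stack.pop()
        if jti in seen or jti not in ects:
            continue  # skip repeats and ECTs missing from the local view
        seen.add(jti)
        agents.add(ects[jti]["iss"])  # assumption: iss identifies the agent
        stack.extend(ects[jti].get("par", []))
    return agents


def may_request_rollback(requester: str, request_wid: str,
                         checkpoint_jti: str, ects: dict[str, dict]) -> bool:
    """Same workflow (wid) and requester is an ancestor of the checkpoint."""
    checkpoint = ects[checkpoint_jti]
    return (checkpoint.get("wid") == request_wid
            and requester in ancestor_agents(checkpoint_jti, ects))


ects = {
    "t1": {"iss": "spiffe://example.com/agent/a", "wid": "wf-1", "par": []},
    "t2": {"iss": "spiffe://example.com/agent/b", "wid": "wf-1", "par": ["t1"]},
    "ckpt": {"iss": "spiffe://example.com/agent/b", "wid": "wf-1", "par": ["t2"]},
    "t3": {"iss": "spiffe://example.com/agent/c", "wid": "wf-1", "par": ["ckpt"]},
}
print(may_request_rollback("spiffe://example.com/agent/a", "wf-1", "ckpt", ects))  # True
print(may_request_rollback("spiffe://example.com/agent/c", "wf-1", "ckpt", ects))  # False
```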

Checkpoint Confidentiality

Checkpoint data may contain sensitive system state (API keys, session tokens, configuration). Agents:

  • MUST encrypt stored checkpoints at rest.
  • MUST reference checkpoint state in ECTs only via out_hash.
  • MUST NOT include checkpoint contents in error ECTs.

False Error Injection

A malicious agent could send false atd:error ECTs to trigger unnecessary rollbacks and disrupt workflows. Mitigation:

  • Agents SHOULD verify that error ECTs reference valid par values within their own workflow DAG (wid claim).
  • Rollback MUST require authentication (see {{rollback-authz}}).
  • L2/L3 ECT signing prevents unauthenticated error injection.

Checkpoint Flooding

An adversary could exhaust checkpoint storage by triggering many checkpoints. Mitigation:

  • Agents SHOULD enforce a maximum checkpoint count per workflow.
  • Expired checkpoints (past atd.ttl) MUST be purged.
  • Checkpoint creation rate SHOULD be rate-limited per calling workflow.
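
The first two mitigations might look like the following sketch. The class name, the default cap of 100 checkpoints per workflow, and the in-memory bookkeeping are hypothetical; rate limiting is not shown.

```python
import time


class CheckpointStore:
    def __init__(self, max_per_workflow: int = 100, clock=time.time):
        self.max_per_workflow = max_per_workflow  # hypothetical default cap
        self.clock = clock
        self._by_wf: dict[str, dict[str, float]] = {}  # wf_id -> {jti: expiry}

    def purge_expired(self) -> None:
        """Drop every checkpoint whose atd.ttl has elapsed."""
        now = self.clock()
        for checkpoints in self._by_wf.values():
            for jti in [j for j, expiry in checkpoints.items() if expiry <= now]:
                del checkpoints[jti]

    def add(self, wf_id: str, jti: str, ttl_s: float) -> bool:
        """Record a checkpoint; refuse it if the workflow is at its cap."""
        self.purge_expired()
        checkpoints = self._by_wf.setdefault(wf_id, {})
        if len(checkpoints) >= self.max_per_workflow:
            return False
        checkpoints[jti] = self.clock() + ttl_s
        return True

    def count(self, wf_id: str) -> int:
        return len(self._by_wf.get(wf_id, {}))
```

Calling add() returns False once a workflow reaches its cap, which gives the agent a point at which to reject or escalate instead of storing more state.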

Circuit Breaker State Leakage

The atd:circuit_open ECT reveals system health topology. The audit ledger SHOULD enforce access controls: only agents within the same workflow or authorized operators SHOULD be able to query circuit breaker history.

IANA Considerations

This document requests registration of the following values in the AEM Ecosystem Extension Registry established by draft-aem-agent-ecosystem-model:

exec_act Values

| Value | Description | Reference |
|---|---|---|
| atd:checkpoint | State snapshot before consequential action | This document |
| atd:error | Error signal with severity and type | This document |
| atd:circuit_open | Circuit breaker opened to downstream agent | This document |
| atd:circuit_close | Circuit breaker returned to CLOSED state | This document |
| atd:rollback_request | Initiate rollback to named checkpoint | This document |
| atd:rollback_result | Result of rollback attempt | This document |
| atd:workflow_start | Workflow began execution | This document |
| atd:workflow_complete | Workflow reached terminal state | This document |
{: #fig-iana-actions title="ATD exec_act Registrations"}

Well-Known URI

This document requests registration of atd/rollback as a well-known URI suffix per {{RFC8615}}.

Media Type

This document requests registration of application/atd-workflow+json for the declarative workflow descriptor format defined in {{workflow-format}}.

--- back

Acknowledgments

{:numbered="false"}

ATD builds on ECT {{I-D.nennemann-wimse-ect}} for execution evidence and ACP-DAG-HITL {{I-D.nennemann-agent-dag-hitl-safety}} for delegation policy. The circuit breaker pattern is adapted from microservice architecture best practices. The declarative workflow format is inspired by workflow description languages (BPEL, BPMN) adapted for lightweight agent coordination.