ietf-draft-analyzer/workspace/drafts/new-drafts/draft-b-atd-agent-task-dag-00.md
Christian Nennemann 2506b6325a
feat: add draft data, gap analysis report, and workspace config
2026-04-06 18:47:15 +02:00


fullname: TBD
organization: Independent
email: placeholder@example.com

normative:
  RFC2119:
  RFC8174:
  RFC8446:
  I-D.nennemann-wimse-ect:
    title: "Execution Context Tokens for Distributed Agentic Workflows"
    target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/
  I-D.nennemann-agent-dag-hitl-safety:
    title: "Agent Context Policy Token: DAG Delegation with Human Override"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/

informative:

--- abstract

This document defines the Agent Task DAG (ATD) specification: execution semantics, checkpoints, error signaling, circuit breakers, and rollback for agent workflows. ATD does not define a new DAG or token format. It defines when agents MUST emit ECT nodes, what those nodes mean, and how to recover when things go wrong. Checkpoints, errors, and rollback results are ECT nodes with specific exec_act values and ext claims. Rollback walks the ECT DAG backwards. Circuit breakers contain cascading failures. Resource hints enable scheduling. The protocol is transport-agnostic and builds on ECT for evidence and ACP-DAG-HITL for policy.

--- middle

Introduction

Autonomous agents increasingly make unsupervised decisions, yet no standard exists for how agents checkpoint state, signal errors to peers, contain cascading failures, or roll back decisions gone wrong.

ATD borrows proven patterns from distributed systems: checkpoints from database transactions, circuit breakers from microservice architectures, and rollback from version control. It adapts these to agent workflows where actions may be partially reversible and where the agent that caused the error may not be the best one to fix it.

ATD does not define a new DAG format. The ECT DAG {{I-D.nennemann-wimse-ect}} IS the execution graph. ATD defines the semantics of specific node types within that graph.

Design principles:

  1. Agents that take consequential actions MUST be able to undo them, or MUST declare them irreversible upfront.
  2. Failure containment takes priority over failure diagnosis.
  3. The protocol adds minimal overhead to the happy path.

Conventions and Definitions

{::boilerplate bcp14-tagged}

Checkpoint:
An ECT node recording agent state before a consequential action, sufficient to restore the system to that state.
Circuit Breaker:
A mechanism that stops an agent from propagating requests to a failing downstream agent, preventing cascading failures.
Rollback:
The process of reverting an agent's actions and state to a previously recorded checkpoint.
Blast Radius:
The set of agents and systems affected by a single failure.

Node States

Each task node in the ECT DAG has an implicit state derived from subsequent ECT nodes:

  • pending: A delegation node exists in ACP-DAG-HITL but no corresponding ECT has been emitted.
  • running: An ECT with exec_act matching the task type has been emitted but no completion or error ECT follows.
  • done: A completion ECT (or the next par-linked ECT) exists.
  • failed: An atd:error ECT references this node.
  • rolled_back: An atd:rollback_result ECT references this node's checkpoint.
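
The derivation above can be sketched as a function over the ECT nodes observed so far. This is an illustration only, not part of the specification: the `derive_state` helper, the list-of-dicts representation of decoded ECT claims, and the precedence between states (rolled_back over failed over done over running) are assumptions, as is treating any par-linked successor whose exec_act does not start with `atd:` as a completion.

```python
def derive_state(task_jti, ects):
    """Return the implicit ATD state of the task node `task_jti`.

    `ects` is a list of decoded ECT claim sets (dicts). The caller is
    assumed to have confirmed that a delegation node exists for the task.
    """
    if not any(e["jti"] == task_jti for e in ects):
        return "pending"  # delegated, but no ECT emitted yet

    # ECTs that name this task in their par array.
    children = [e for e in ects if task_jti in e.get("par", [])]
    checkpoints = {
        e["jti"] for e in children if e.get("exec_act") == "atd:checkpoint"
    }

    # rolled_back: a rollback result names one of this node's checkpoints.
    for e in ects:
        if (e.get("exec_act") == "atd:rollback_result"
                and e.get("ext", {}).get("atd.checkpoint_id") in checkpoints):
            return "rolled_back"

    # failed: an atd:error ECT references this node.
    if any(e.get("exec_act") == "atd:error" for e in children):
        return "failed"

    # done: a par-linked successor that is not ATD bookkeeping (assumption).
    if any(not e.get("exec_act", "").startswith("atd:") for e in children):
        return "done"

    return "running"
```

Because the state is computed rather than stored, two observers holding the same set of ECTs reach the same answer without exchanging any additional status messages.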

Checkpoint Mechanism

An ATD-compliant agent MUST create a checkpoint before any action it classifies as consequential. An action is consequential if it modifies external state, for example by changing network configuration, writing database records, or making API calls with side effects.

A checkpoint is an ECT with:

  • exec_act: "atd:checkpoint"
  • par: the ECT of the action being checkpointed
{
  "jti": "ckpt-uuid",
  "exec_act": "atd:checkpoint",
  "par": ["action-ect-uuid"],
  "out_hash": "sha256-of-agent-state-snapshot",
  "ext": {
    "atd.reversible": true,
    "atd.rollback_uri": "https://agent-b.example.com/atd/rollback",
    "atd.target": "router-07.example.com",
    "atd.description": "Update BGP peer config",
    "atd.ttl": 86400
  }
}

{: #fig-checkpoint title="Checkpoint ECT"}

The atd.reversible field MUST be present. If it is false, the agent is declaring that the action cannot be undone automatically, and any rollback request for it MUST be escalated per the ACP-DAG-HITL unreachable_human policy.

The out_hash provides integrity verification: the agent hashes its state at checkpoint time so that rollback can verify it is restoring to an authentic prior state.

Checkpoints MUST be stored for at least atd.ttl seconds. Agents SHOULD store checkpoints in durable storage that survives restarts.
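
As a non-normative sketch, the following shows one way an agent might assemble the checkpoint claims and later check a restored state against out_hash. The helper names (`state_hash`, `make_checkpoint`, `verify_restored`), the use of sorted-key compact JSON as the canonical serialization, and the hex encoding of the digest are all assumptions; ATD does not prescribe how agent state is serialized, and signing the ECT is out of scope here.

```python
import hashlib
import json
import uuid


def state_hash(state):
    """SHA-256 hex digest over a canonical JSON encoding of the snapshot.

    Assumption: sorted-key, compact JSON is the canonical form.
    """
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_checkpoint(action_jti, state, *, reversible, rollback_uri,
                    target, description, ttl=86400):
    """Build the claims of an atd:checkpoint ECT (unsigned)."""
    return {
        "jti": str(uuid.uuid4()),
        "exec_act": "atd:checkpoint",
        "par": [action_jti],
        "out_hash": state_hash(state),
        "ext": {
            "atd.reversible": reversible,  # MUST be present
            "atd.rollback_uri": rollback_uri,
            "atd.target": target,
            "atd.description": description,
            "atd.ttl": ttl,  # seconds the checkpoint must be retained
        },
    }


def verify_restored(restored_state, checkpoint):
    """True if the restored state hashes to the checkpoint's out_hash."""
    return state_hash(restored_state) == checkpoint["out_hash"]
```

Hashing a canonical encoding rather than the raw in-memory object means that key ordering differences between the snapshot and the restored state do not cause a spurious mismatch.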

Hierarchical Checkpoints

Agents MAY create hierarchical checkpoints where a parent groups multiple child checkpoints from a multi-step operation. Rolling back the parent rolls back all children. The parent checkpoint's par array references all child checkpoint jti values.
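
A minimal sketch of how a parent rollback could be expanded into its leaf checkpoints is shown below. The `rollback_order` helper and the dict keyed by jti are hypothetical, and rolling back children in reverse par order (most recent step first) is an assumption; this document does not specify an ordering.

```python
def rollback_order(checkpoint_jti, ects_by_jti):
    """Return the leaf checkpoint jtis to roll back for `checkpoint_jti`.

    A parent checkpoint lists child checkpoints in par; a leaf checkpoint
    lists the action ECT it protects. Children are visited in reverse
    order (assumption), and nested parents are expanded recursively.
    """
    node = ects_by_jti[checkpoint_jti]
    children = [
        p for p in node.get("par", [])
        if ects_by_jti.get(p, {}).get("exec_act") == "atd:checkpoint"
    ]
    if not children:
        return [checkpoint_jti]  # leaf: par points at an action ECT

    order = []
    for child in reversed(children):
        order.extend(rollback_order(child, ects_by_jti))
    return order
```

For a parent whose par array is `["c1", "c2"]`, this yields `c2` before `c1`, which undoes a multi-step operation in the opposite order to which it was applied.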

Error Signaling

When an agent detects an error, it MUST emit an error ECT:

  • exec_act: "atd:error"
  • par: the ECT of the failed action
{
  "jti": "error-uuid",
  "exec_act": "atd:error",
  "par": ["failed-action-ect-uuid"],
  "ext": {
    "atd.severity": "critical",
    "atd.error_type": "action_failed",
    "atd.description": "BGP session did not establish",
    "atd.checkpoint_id": "ckpt-uuid",
    "atd.upstream_errors": []
  }
}

{: #fig-error title="Error ECT"}

Severity levels: info, warning, error, critical.

Error types: action_failed, timeout, constraint_violation, resource_exhausted, upstream_cascade, unknown.

When an agent receives an error signal caused by an action it initiated, it MUST either:

  (a) attempt automatic rollback of its checkpoint, or
  (b) escalate per the HITL rules defined in ACP-DAG-HITL if the action was irreversible.

The atd.upstream_errors array allows agents to chain error context, building a causal trace from symptom to root cause.
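
The causal trace can be followed mechanically. The sketch below assumes that each entry in atd.upstream_errors is the jti of an upstream error ECT; this document does not define the element format, so that choice, along with the `root_causes` helper itself, is hypothetical.

```python
def root_causes(error_jti, errors_by_jti):
    """Follow atd.upstream_errors from a symptom to its root error(s).

    `errors_by_jti` maps jti -> decoded atd:error claims. An error with no
    upstream entries is a root. References that cannot be resolved are
    also reported as roots, since the trace cannot be followed further.
    """
    roots, stack, seen = [], [error_jti], set()
    while stack:
        jti = stack.pop()
        if jti in seen:
            continue  # guard against malformed, cyclic chains
        seen.add(jti)
        err = errors_by_jti.get(jti, {})
        upstream = err.get("ext", {}).get("atd.upstream_errors", [])
        if upstream:
            stack.extend(upstream)
        else:
            roots.append(jti)
    return roots
```

An agent that emits an upstream_cascade error would place the jti of the error it received into its own atd.upstream_errors array, so that a later reader can walk from the last symptom back to the original failure.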

HITL Escalation on Error

Error ECTs MAY trigger ACP-DAG-HITL rules. A deployment can define HITL rules such as:

{
  "id": "r-critical-error",
  "trigger": {
    "kind": "keyword_match",
    "op": "eq",
    "value": "critical",
    "input_ref": "atd.severity"
  },
  "required_role": "operator:oncall",
  "action": "escalate",
  "allow_override": true,
  "override_action": "continue"
}

{: #fig-error-hitl title="HITL Rule for Critical Errors"}

Circuit Breaker Pattern

Each agent MUST implement a circuit breaker for every downstream agent it communicates with. The circuit breaker has three states:

CLOSED (normal):
Requests flow through. The agent tracks the error rate over a sliding window (default: 60 seconds).
OPEN (failure detected):
When the error rate exceeds a threshold (default: 50%), the breaker opens. All requests are immediately rejected. The agent MUST emit a circuit breaker ECT:
{
  "exec_act": "atd:circuit_open",
  "ext": {
    "atd.downstream_agent": "spiffe://example.com/agent/b",
    "atd.error_rate": 0.75,
    "atd.window_s": 60
  }
}

{: #fig-circuit title="Circuit Breaker ECT"}

HALF-OPEN (recovery probe):
After a cooldown period (default: 30s), the breaker allows one probe request. If it succeeds, the breaker returns to CLOSED. If it fails, it returns to OPEN with doubled cooldown (exponential backoff, max 300s).
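
The three states can be sketched as a small state machine. This is an illustrative, non-normative sketch: the `CircuitBreaker` class, its method names, and the injectable `clock` are hypothetical. It also assumes that the cooldown resets to its default after a successful probe, that the atd:circuit_open ECT is produced only on the CLOSED to OPEN transition, and that no minimum sample count is required before the error rate is evaluated; none of these points is specified above.

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, downstream, threshold=0.5, window_s=60,
                 cooldown_s=30, max_cooldown_s=300, clock=time.monotonic):
        self.downstream = downstream
        self.threshold = threshold
        self.window_s = window_s
        self.base_cooldown_s = cooldown_s
        self.cooldown_s = cooldown_s
        self.max_cooldown_s = max_cooldown_s
        self.clock = clock
        self.state = "CLOSED"
        self.opened_at = None
        self.samples = deque()  # (timestamp, ok) pairs inside the window

    def error_rate(self):
        now = self.clock()
        while self.samples and now - self.samples[0][0] > self.window_s:
            self.samples.popleft()  # drop outcomes older than the window
        if not self.samples:
            return 0.0
        failures = sum(1 for _, ok in self.samples if not ok)
        return failures / len(self.samples)

    def allow_request(self):
        if (self.state == "OPEN"
                and self.clock() - self.opened_at >= self.cooldown_s):
            self.state = "HALF_OPEN"  # let exactly one probe through
            return True
        return self.state == "CLOSED"

    def _open(self, now):
        self.state = "OPEN"
        self.opened_at = now

    def record(self, ok):
        """Record an outcome; return atd:circuit_open claims when opening."""
        now = self.clock()
        if self.state == "HALF_OPEN":
            if ok:
                self.state = "CLOSED"
                self.cooldown_s = self.base_cooldown_s
                self.samples.clear()
            else:
                # Probe failed: back off exponentially, capped at the max.
                self.cooldown_s = min(self.cooldown_s * 2, self.max_cooldown_s)
                self._open(now)
            return None
        self.samples.append((now, ok))
        rate = self.error_rate()
        if self.state == "CLOSED" and rate > self.threshold:
            self._open(now)
            return {
                "exec_act": "atd:circuit_open",
                "ext": {
                    "atd.downstream_agent": self.downstream,
                    "atd.error_rate": rate,
                    "atd.window_s": self.window_s,
                },
            }
        return None
```

A caller would check `allow_request()` before each downstream call and pass the outcome to `record()`. Injecting the clock keeps the cooldown and window logic testable without real delays.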

Circuit breaker thresholds can be configured as ACP-DAG-HITL node constraints:

{
  "constraints": {
    "atd.circuit_threshold": 0.5,
    "atd.circuit_window_s": 60
  }
}

{: #fig-circuit-policy title="Circuit Breaker Policy"}

Rollback Protocol

A rollback is initiated by emitting a rollback request ECT and sending an HTTP POST to the target agent's rollback endpoint:

POST /atd/rollback HTTP/1.1
Content-Type: application/json
Execution-Context: <rollback-request-ect>

The rollback request is an ECT with:

  • exec_act: "atd:rollback_request"
  • par: the checkpoint ECT to roll back to

{
  "exec_act": "atd:rollback_request",
  "par": ["ckpt-uuid"],
  "ext": {
    "atd.reason": "Upstream action caused cascading failure",
    "atd.cascade": true
  }
}

{: #fig-rollback-req title="Rollback Request ECT"}

When atd.cascade is true, the receiving agent MUST also initiate rollback of any downstream checkpoints created as a consequence of the checkpointed action.
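
One possible reading of "created as a consequence of the checkpointed action" is: every atd:checkpoint ECT that descends, through par links, from the action the original checkpoint protects. The sketch below implements that interpretation; both the interpretation and the `downstream_checkpoints` helper are assumptions rather than requirements of this document.

```python
from collections import deque


def downstream_checkpoints(checkpoint_jti, ects):
    """List checkpoints that descend from the checkpointed action.

    `ects` is a list of decoded ECT claim sets (dicts). The search starts
    at the action(s) named in the checkpoint's par array and walks par
    edges forward, breadth first.
    """
    by_jti = {e["jti"]: e for e in ects}
    children = {}  # parent jti -> jtis of ECTs that list it in par
    for e in ects:
        for p in e.get("par", []):
            children.setdefault(p, []).append(e["jti"])

    queue = deque(by_jti[checkpoint_jti].get("par", []))
    seen = set(queue)
    found = []
    while queue:
        jti = queue.popleft()
        for child in children.get(jti, []):
            if child in seen or child == checkpoint_jti:
                continue  # skip revisits and the originating checkpoint
            seen.add(child)
            queue.append(child)
            if by_jti[child].get("exec_act") == "atd:checkpoint":
                found.append(child)
    return found
```

The receiving agent would then issue a rollback request for each returned checkpoint and report the per-agent outcomes in its own rollback result.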

The agent MUST respond with a rollback result ECT:

  • exec_act: "atd:rollback_result"
  • par: the rollback request ECT
{
  "exec_act": "atd:rollback_result",
  "par": ["rollback-request-uuid"],
  "out_hash": "sha256-of-restored-state",
  "ext": {
    "atd.status": "completed",
    "atd.checkpoint_id": "ckpt-uuid",
    "atd.cascaded": [
      {"agent": "spiffe://example.com/agent/c", "status": "completed"},
      {"agent": "spiffe://example.com/agent/d", "status": "escalated"}
    ]
  }
}

{: #fig-rollback-result title="Rollback Result ECT"}

Status values: completed, partial, escalated, failed.

escalated means the action was irreversible and a human operator has been notified per ACP-DAG-HITL unreachable_human policy.

Agents MUST implement idempotent rollback: an agent that receives a rollback request ECT with a jti it has already processed MUST return the same result as before and MUST NOT perform the rollback again.
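
Idempotency keyed on jti can be sketched as follows. The `RollbackHandler` class and the `restore` callback are hypothetical, and the in-memory dictionary stands in for whatever durable store a real agent would need so that replays are recognized after a restart.

```python
class RollbackHandler:
    def __init__(self, restore):
        # restore: callable(checkpoint_jti) -> one of the status values
        # "completed", "partial", "escalated", "failed".
        self._restore = restore
        self._results = {}  # request jti -> rollback result claims

    def handle(self, request):
        """Process a rollback request ECT and return the result claims."""
        jti = request["jti"]
        if jti in self._results:
            return self._results[jti]  # replay: do not restore again

        checkpoint_jti = request["par"][0]
        status = self._restore(checkpoint_jti)
        result = {
            "exec_act": "atd:rollback_result",
            "par": [jti],
            "ext": {
                "atd.status": status,
                "atd.checkpoint_id": checkpoint_jti,
            },
        }
        self._results[jti] = result
        return result
```

Caching the full result, rather than only a "seen" flag, is what lets a retried request receive the same answer as the first attempt.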

Resource Hints

Agents MAY declare resource requirements as ECT extension claims or ACP-DAG-HITL node constraints:

{
  "constraints": {
    "atd.resource_cpu": "2",
    "atd.resource_memory_mb": 4096,
    "atd.resource_timeout_s": 300,
    "atd.resource_priority": "high"
  }
}

{: #fig-resources title="Resource Hints as Node Constraints"}

Orchestrators (e.g., Kubernetes schedulers, agent gateways) MAY use these hints for scheduling and quota enforcement. Resource hints are advisory; agents MUST NOT depend on them for correctness.

Security Considerations

Rollback requests are sensitive operations. Agents MUST authenticate rollback requests using the ECT identity binding (L2/L3). Agents SHOULD authorize a rollback request only if the requester belongs to the same workflow (wid) and has checkpoint lineage in the DAG.

Checkpoint data may contain sensitive system state. Agents MUST encrypt stored checkpoints at rest and MUST NOT include checkpoint contents in error ECTs.

Circuit breaker state reveals system health topology. The atd:circuit_open ECT is part of the audit trail; access to the audit ledger SHOULD be controlled.

Malicious agents could send false error ECTs to trigger unnecessary rollbacks. Agents SHOULD verify that error ECTs reference valid par values within their own workflow DAG.
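
The par check described above might look like the following sketch. It covers only the structural check; signature and identity verification of the ECT are assumed to have happened already. The `error_ect_is_valid` helper is hypothetical, and rejecting an error ECT with an empty par array is an assumption.

```python
def error_ect_is_valid(error_ect, known_ects, wid):
    """Check that an atd:error ECT points into the local workflow DAG.

    `known_ects` is the list of decoded ECT claim sets the agent has
    already accepted; `wid` is the agent's own workflow identifier.
    """
    if error_ect.get("exec_act") != "atd:error":
        return False
    if error_ect.get("wid") != wid:
        return False  # error claims to belong to a different workflow

    known = {e["jti"] for e in known_ects if e.get("wid") == wid}
    par = error_ect.get("par", [])
    # Every referenced parent must be a node this agent already knows.
    return bool(par) and all(p in known for p in par)
```

An agent would run this check before considering any rollback, and would discard or log error ECTs that fail it rather than acting on them.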

IANA Considerations

This document requests registration of the following exec_act values in a future ECT action type registry:

  • atd:checkpoint
  • atd:error
  • atd:circuit_open
  • atd:rollback_request
  • atd:rollback_result

--- back

Acknowledgments

{:numbered="false"}

ATD builds on ECT {{I-D.nennemann-wimse-ect}} for execution evidence and ACP-DAG-HITL {{I-D.nennemann-agent-dag-hitl-safety}} for delegation policy. The circuit breaker pattern is adapted from microservice architecture best practices.