---
title: "Agent Task DAG (ATD): Execution Model, Checkpoints, and Recovery"
abbrev: "ATD"
category: std
docname: draft-atd-agent-task-dag-00
submissiontype: IETF
number:
date:
v: 3
area: "OPS"
workgroup: "NMOP"
keyword:
  - agent DAG
  - checkpoint
  - rollback
  - error recovery
  - circuit breaker

author:
  -
    fullname: TBD
    organization: Independent
    email: placeholder@example.com

normative:
  RFC2119:
  RFC8174:
  RFC8446:
  I-D.nennemann-wimse-ect:
    title: "Execution Context Tokens for Distributed Agentic Workflows"
    target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/
  I-D.nennemann-agent-dag-hitl-safety:
    title: "Agent Context Policy Token: DAG Delegation with Human Override"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/

informative:

--- abstract

This document defines the Agent Task DAG (ATD) specification:
execution semantics, checkpoints, error signaling, circuit
breakers, and rollback for agent workflows. ATD does not define a
new DAG or token format. It defines when agents MUST emit ECT
nodes, what those nodes mean, and how to recover when things go
wrong. Checkpoints, errors, and rollback results are ECT nodes
with specific `exec_act` values and `ext` claims. Rollback walks
the ECT DAG backwards. Circuit breakers contain cascading
failures. Resource hints enable scheduling. The protocol is
transport-agnostic and builds on ECT for evidence and ACP-DAG-HITL
for policy.

--- middle

# Introduction

Autonomous agents increasingly make unsupervised decisions, yet no
standard exists for how agents checkpoint state, signal errors to
peers, contain cascading failures, or roll back decisions gone
wrong.

ATD borrows proven patterns from distributed systems: checkpoints
from database transactions, circuit breakers from microservice
architectures, and rollback from version control. It adapts these
to agent workflows where actions may be partially reversible and
where the agent that caused the error may not be the best one to
fix it.

ATD does not define a new DAG format. The ECT DAG
{{I-D.nennemann-wimse-ect}} IS the execution graph. ATD defines
the semantics of specific node types within that graph.

Design principles:

1. Agents that take consequential actions MUST be able to undo
   them, or MUST declare them irreversible upfront.
2. Failure containment takes priority over failure diagnosis.
3. The protocol adds minimal overhead to the happy path.

# Conventions and Definitions

{::boilerplate bcp14-tagged}

Checkpoint:
: An ECT node recording agent state before a consequential action,
  sufficient to restore the system to that state.

Circuit Breaker:
: A mechanism that stops an agent from propagating requests to a
  failing downstream agent, preventing cascading failures.

Rollback:
: The process of reverting an agent's actions and state to a
  previously recorded checkpoint.

Blast Radius:
: The set of agents and systems affected by a single failure.

# Node States {#node-states}

Each task node in the ECT DAG has an implicit state derived from
subsequent ECT nodes:

- **pending**: A delegation node exists in ACP-DAG-HITL but no
  corresponding ECT has been emitted.
- **running**: An ECT with `exec_act` matching the task type has
  been emitted but no completion or error ECT follows.
- **done**: A completion ECT (or the next `par`-linked ECT) exists.
- **failed**: An `atd:error` ECT references this node.
- **rolled_back**: An `atd:rollback_result` ECT references this
  node's checkpoint.

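As a non-normative illustration, the state derivation above can be
sketched as a pure function over the ECTs observed so far. The
sketch rests on several assumptions that this document does not
fix: ECTs are handled as plain dictionaries, `derive_state` is a
hypothetical helper name, any `par`-linked ECT outside the `atd:`
namespace counts as completion, and the precedence
rolled_back > failed > done > running is one possible choice.

```python
from typing import Iterable


def derive_state(task_jti: str, ects: Iterable[dict]) -> str:
    """Derive the implicit state of task node `task_jti` from emitted ECTs."""
    ects = list(ects)
    # No ECT for the task yet. Assumption: the caller already knows that a
    # delegation node for this task exists in ACP-DAG-HITL.
    if not any(e.get("jti") == task_jti for e in ects):
        return "pending"

    # ECTs that name the task node as a parent.
    children = [e for e in ects if task_jti in e.get("par", [])]
    checkpoint_ids = {
        e.get("jti") for e in children if e.get("exec_act") == "atd:checkpoint"
    }

    # A rollback result that names one of this node's checkpoints.
    if any(
        e.get("exec_act") == "atd:rollback_result"
        and e.get("ext", {}).get("atd.checkpoint_id") in checkpoint_ids
        for e in ects
    ):
        return "rolled_back"

    if any(e.get("exec_act") == "atd:error" for e in children):
        return "failed"

    # Assumption: a par-linked ECT outside the atd: namespace marks completion.
    if any(not str(e.get("exec_act", "")).startswith("atd:") for e in children):
        return "done"

    return "running"
```

Because the state is computed rather than stored, two observers that
hold the same set of ECTs arrive at the same answer.
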
# Checkpoint Mechanism {#checkpoints}

An ATD-compliant agent MUST create a checkpoint before any action
it classifies as consequential. An action is consequential if it
modifies external state (network config, database records, API
calls with side effects).

A checkpoint is an ECT with:

- `exec_act`: `"atd:checkpoint"`
- `par`: the ECT of the action being checkpointed

~~~json
{
  "jti": "ckpt-uuid",
  "exec_act": "atd:checkpoint",
  "par": ["action-ect-uuid"],
  "out_hash": "sha256-of-agent-state-snapshot",
  "ext": {
    "atd.reversible": true,
    "atd.rollback_uri": "https://agent-b.example.com/atd/rollback",
    "atd.target": "router-07.example.com",
    "atd.description": "Update BGP peer config",
    "atd.ttl": 86400
  }
}
~~~
{: #fig-checkpoint title="Checkpoint ECT"}

The `atd.reversible` field MUST be present. If `false`, the agent
declares that this action cannot be automatically undone and
rollback requests MUST be escalated per the ACP-DAG-HITL
`unreachable_human` policy.

The `out_hash` provides integrity verification: the agent hashes
its state at checkpoint time so that rollback can verify it is
restoring to an authentic prior state.

Checkpoints MUST be stored for at least `atd.ttl` seconds. Agents
SHOULD store checkpoints in durable storage that survives restarts.

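The following non-normative sketch shows one way an agent might
assemble a checkpoint ECT and later verify a snapshot against its
`out_hash`. It assumes that the state snapshot is JSON-serializable
and that `out_hash` carries a hex-encoded SHA-256 digest of a
canonical (sorted-key) JSON encoding; this document does not fix
the encoding. The helper names `state_hash`, `make_checkpoint`, and
`verify_snapshot` are hypothetical.

```python
import hashlib
import json
import uuid


def state_hash(state: dict) -> str:
    """Hash a state snapshot. Assumption: canonical JSON, hex SHA-256."""
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_checkpoint(action_jti: str, state: dict, *, reversible: bool,
                    rollback_uri: str, target: str, description: str,
                    ttl: int = 86400) -> dict:
    """Build the claims of an atd:checkpoint ECT (signing is out of scope)."""
    return {
        "jti": str(uuid.uuid4()),
        "exec_act": "atd:checkpoint",
        "par": [action_jti],
        "out_hash": state_hash(state),
        "ext": {
            "atd.reversible": reversible,  # MUST always be present
            "atd.rollback_uri": rollback_uri,
            "atd.target": target,
            "atd.description": description,
            "atd.ttl": ttl,
        },
    }


def verify_snapshot(checkpoint: dict, state: dict) -> bool:
    """Check that `state` is the snapshot the checkpoint was taken over."""
    return checkpoint["out_hash"] == state_hash(state)
```

Sorting keys before hashing makes the digest independent of
dictionary ordering, so a snapshot reloaded from durable storage
hashes to the same value as the one taken at checkpoint time.
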
## Hierarchical Checkpoints

Agents MAY create hierarchical checkpoints where a parent groups
multiple child checkpoints from a multi-step operation. Rolling
back the parent rolls back all children. The parent checkpoint's
`par` array references all child checkpoint `jti` values.

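A non-normative sketch of expanding a parent checkpoint into the
leaf checkpoints that must be rolled back is shown below. Undoing
children in reverse `par` order (last step first) is an assumption
of this sketch, not a requirement stated in this document, and
`rollback_order` is a hypothetical helper name.

```python
from typing import Dict, List


def rollback_order(ckpt_jti: str, by_jti: Dict[str, dict]) -> List[str]:
    """Return leaf checkpoint jti values to roll back, last step first."""
    node = by_jti[ckpt_jti]
    # Only par entries that are themselves checkpoints count as children;
    # a leaf checkpoint's par points at an ordinary action ECT instead.
    children = [
        p for p in node.get("par", [])
        if by_jti.get(p, {}).get("exec_act") == "atd:checkpoint"
    ]
    if not children:
        return [ckpt_jti]
    order: List[str] = []
    for child in reversed(children):
        order.extend(rollback_order(child, by_jti))
    return order
```

The recursion also handles nested parents, since each child is
expanded with the same rule before its leaves are appended.
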
# Error Signaling {#errors}

When an agent detects an error, it MUST emit an error ECT:

- `exec_act`: `"atd:error"`
- `par`: the ECT of the failed action

~~~json
{
  "jti": "error-uuid",
  "exec_act": "atd:error",
  "par": ["failed-action-ect-uuid"],
  "ext": {
    "atd.severity": "critical",
    "atd.error_type": "action_failed",
    "atd.description": "BGP session did not establish",
    "atd.checkpoint_id": "ckpt-uuid",
    "atd.upstream_errors": []
  }
}
~~~
{: #fig-error title="Error ECT"}

Severity levels: `info`, `warning`, `error`, `critical`.

Error types: `action_failed`, `timeout`, `constraint_violation`,
`resource_exhausted`, `upstream_cascade`, `unknown`.

When an agent receives an error signal caused by an action it
initiated, it MUST either:

(a) Attempt automatic rollback of its checkpoint, or
(b) Escalate per the ACP-DAG-HITL rules if the action was
irreversible.

The `atd.upstream_errors` array allows agents to chain error
context, building a causal trace from symptom to root cause.

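The two behaviors described above, choosing between rollback and
escalation and following `atd.upstream_errors` back to a root
cause, can be sketched as follows. This is a non-normative sketch
under stated assumptions: `atd.upstream_errors` is taken to hold the
`jti` values of earlier error ECTs (this document does not define
the element type), an unknown checkpoint is treated as grounds for
escalation, and `respond_to_error` and `causal_trace` are
hypothetical helper names.

```python
from typing import Dict, List, Optional, Tuple


def respond_to_error(error_ect: dict,
                     checkpoints: Dict[str, dict]) -> Tuple[str, Optional[str]]:
    """Pick (a) rollback or (b) escalation for an error on our own action."""
    ckpt_id = error_ect.get("ext", {}).get("atd.checkpoint_id")
    ckpt = checkpoints.get(ckpt_id)
    # Assumption: with no usable checkpoint, the only safe option is a human.
    if ckpt is None or not ckpt.get("ext", {}).get("atd.reversible", False):
        return ("escalate", ckpt_id)
    return ("rollback", ckpt_id)


def causal_trace(error_jti: str, errors: Dict[str, dict]) -> List[str]:
    """Walk atd.upstream_errors depth-first, from symptom toward root cause."""
    trace: List[str] = []
    seen = set()
    stack = [error_jti]
    while stack:
        jti = stack.pop()
        if jti in seen or jti not in errors:
            continue  # skip repeats and errors we have not observed
        seen.add(jti)
        trace.append(jti)
        upstream = errors[jti].get("ext", {}).get("atd.upstream_errors", [])
        stack.extend(reversed(upstream))
    return trace
```

The `seen` set keeps the walk finite even if a faulty or malicious
agent emits error ECTs whose upstream references form a loop.
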
## HITL Escalation on Error

Error ECTs MAY trigger ACP-DAG-HITL rules. A deployment can
define HITL rules such as:

~~~json
{
  "id": "r-critical-error",
  "trigger": {
    "kind": "keyword_match",
    "op": "eq",
    "value": "critical",
    "input_ref": "atd.severity"
  },
  "required_role": "operator:oncall",
  "action": "escalate",
  "allow_override": true,
  "override_action": "continue"
}
~~~
{: #fig-error-hitl title="HITL Rule for Critical Errors"}

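How such a rule might be matched against an error ECT is sketched
below, non-normatively. Rule evaluation is defined by ACP-DAG-HITL,
not by this document; the sketch assumes that `input_ref` names a
key in the ECT's `ext` claims and handles only the `eq` operator
shown in the example. `rule_matches` is a hypothetical helper name.

```python
def rule_matches(rule: dict, ect: dict) -> bool:
    """Evaluate the trigger of a HITL rule against one ECT (eq only)."""
    trigger = rule["trigger"]
    if trigger.get("op") != "eq":
        # Other operators belong to ACP-DAG-HITL and are not guessed here.
        raise NotImplementedError("unsupported op: %r" % trigger.get("op"))
    # Assumption: input_ref is looked up in the ECT's ext claims.
    return ect.get("ext", {}).get(trigger["input_ref"]) == trigger["value"]
```
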
# Circuit Breaker Pattern {#circuit-breaker}

Each agent MUST implement a circuit breaker for every downstream
agent it communicates with. The circuit breaker has three states:

CLOSED (normal):
: Requests flow through. The agent tracks the error rate over a
  sliding window (default: 60 seconds).

OPEN (failure detected):
: When the error rate exceeds a threshold (default: 50%), the
  breaker opens. All requests are immediately rejected. The
  agent MUST emit a circuit breaker ECT:

~~~json
{
  "exec_act": "atd:circuit_open",
  "ext": {
    "atd.downstream_agent": "spiffe://example.com/agent/b",
    "atd.error_rate": 0.75,
    "atd.window_s": 60
  }
}
~~~
{: #fig-circuit title="Circuit Breaker ECT"}

HALF-OPEN (recovery probe):
: After a cooldown period (default: 30s), the breaker allows one
  probe request. If it succeeds, the breaker returns to CLOSED.
  If it fails, it returns to OPEN with doubled cooldown
  (exponential backoff, max 300s).

Circuit breaker thresholds can be configured as ACP-DAG-HITL
node constraints:

~~~json
{
  "constraints": {
    "atd.circuit_threshold": 0.5,
    "atd.circuit_window_s": 60
  }
}
~~~
{: #fig-circuit-policy title="Circuit Breaker Policy"}

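The three-state machine can be sketched, non-normatively, as a
small class that uses the defaults given above (60-second window,
50% threshold, 30-second cooldown, 300-second cap). The class and
method names are hypothetical. Two choices are assumptions of this
sketch rather than requirements of this document: the cooldown
resets to its base value after a successful probe, and the clock is
injectable so the logic can be exercised without waiting.

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, threshold=0.5, window_s=60.0, cooldown_s=30.0,
                 max_cooldown_s=300.0, clock=time.monotonic):
        self.threshold = threshold
        self.window_s = window_s
        self.base_cooldown_s = cooldown_s
        self.cooldown_s = cooldown_s
        self.max_cooldown_s = max_cooldown_s
        self.clock = clock
        self.state = "CLOSED"
        self.opened_at = 0.0
        self.samples = deque()  # (timestamp, ok) pairs inside the window

    def allow_request(self) -> bool:
        """True if a request may be sent to the downstream agent now."""
        if (self.state == "OPEN"
                and self.clock() - self.opened_at >= self.cooldown_s):
            self.state = "HALF_OPEN"
            return True  # the single recovery probe
        return self.state == "CLOSED"

    def record(self, ok: bool) -> None:
        """Record the outcome of a request that was allowed through."""
        now = self.clock()
        if self.state == "HALF_OPEN":
            if ok:
                self.state = "CLOSED"
                self.cooldown_s = self.base_cooldown_s  # assumption: reset
                self.samples.clear()
            else:
                # Failed probe: back to OPEN with doubled, capped cooldown.
                self.cooldown_s = min(self.cooldown_s * 2, self.max_cooldown_s)
                self._open(now)
            return
        self.samples.append((now, ok))
        while self.samples and now - self.samples[0][0] > self.window_s:
            self.samples.popleft()  # drop samples older than the window
        if self.error_rate() > self.threshold:
            self._open(now)

    def error_rate(self) -> float:
        if not self.samples:
            return 0.0
        return sum(1 for _, ok in self.samples if not ok) / len(self.samples)

    def _open(self, now: float) -> None:
        self.state = "OPEN"
        self.opened_at = now

    def circuit_open_ect(self, downstream_agent: str) -> dict:
        """Claims for the atd:circuit_open ECT emitted when the breaker opens."""
        return {
            "exec_act": "atd:circuit_open",
            "ext": {
                "atd.downstream_agent": downstream_agent,
                "atd.error_rate": self.error_rate(),
                "atd.window_s": int(self.window_s),
            },
        }
```

With a pure rate threshold, a single failure in an otherwise empty
window opens the breaker. Deployments that find this too sensitive
commonly add a minimum sample count, which this document does not
define.
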
# Rollback Protocol {#rollback}

A rollback is initiated by emitting a rollback request ECT and
sending an HTTP POST to the target agent's rollback endpoint:

~~~
POST /atd/rollback HTTP/1.1
Content-Type: application/json
Execution-Context: <rollback-request-ect>
~~~

The rollback request is an ECT with:

- `exec_act`: `"atd:rollback_request"`
- `par`: the checkpoint ECT to roll back to

~~~json
{
  "exec_act": "atd:rollback_request",
  "par": ["ckpt-uuid"],
  "ext": {
    "atd.reason": "Upstream action caused cascading failure",
    "atd.cascade": true
  }
}
~~~
{: #fig-rollback-req title="Rollback Request ECT"}

When `atd.cascade` is `true`, the receiving agent MUST also
initiate rollback of any downstream checkpoints created as a
consequence of the checkpointed action.

The agent MUST respond with a rollback result ECT:

- `exec_act`: `"atd:rollback_result"`
- `par`: the rollback request ECT

~~~json
{
  "exec_act": "atd:rollback_result",
  "par": ["rollback-request-uuid"],
  "out_hash": "sha256-of-restored-state",
  "ext": {
    "atd.status": "completed",
    "atd.checkpoint_id": "ckpt-uuid",
    "atd.cascaded": [
      {"agent": "spiffe://example.com/agent/c", "status": "completed"},
      {"agent": "spiffe://example.com/agent/d", "status": "escalated"}
    ]
  }
}
~~~
{: #fig-rollback-result title="Rollback Result ECT"}

Status values: `completed`, `partial`, `escalated`, `failed`.

`escalated` means the action was irreversible and a human operator
has been notified per ACP-DAG-HITL `unreachable_human` policy.

Agents MUST implement idempotent rollback: receiving the same
rollback request ECT `jti` twice MUST return the same result.

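One way to meet the idempotency requirement is to remember the
result produced for each rollback request `jti` and replay it on
repeats, as in the non-normative sketch below. The sketch assumes
an in-memory result cache (a real agent would persist it), a
caller-supplied `restore` callable that returns the hash of the
restored state, and an unknown checkpoint reported as `failed`.
Cascading rollback and HTTP transport are omitted. `RollbackHandler`
is a hypothetical name.

```python
import uuid
from typing import Callable, Dict


class RollbackHandler:
    def __init__(self, checkpoints: Dict[str, dict],
                 restore: Callable[[dict], str]):
        self.checkpoints = checkpoints  # checkpoint jti -> checkpoint ECT
        self.restore = restore          # hypothetical: returns restored out_hash
        self.results: Dict[str, dict] = {}  # request jti -> result ECT

    def handle(self, request: dict) -> dict:
        """Process an atd:rollback_request ECT and return the result ECT."""
        req_jti = request["jti"]
        if req_jti in self.results:
            return self.results[req_jti]  # replay: same jti, same result

        ckpt_id = request["par"][0]
        ckpt = self.checkpoints.get(ckpt_id)
        ext = {"atd.checkpoint_id": ckpt_id}
        result = {
            "jti": str(uuid.uuid4()),
            "exec_act": "atd:rollback_result",
            "par": [req_jti],
            "ext": ext,
        }
        if ckpt is None:
            ext["atd.status"] = "failed"  # assumption: unknown checkpoint
        elif not ckpt["ext"]["atd.reversible"]:
            ext["atd.status"] = "escalated"  # human is notified out of band
        else:
            result["out_hash"] = self.restore(ckpt)
            ext["atd.status"] = "completed"

        self.results[req_jti] = result
        return result
```

Caching the whole result ECT, rather than only a "done" flag, means
a retried request never triggers a second restore and the caller
sees an identical response.
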
# Resource Hints {#resources}

Agents MAY declare resource requirements as ECT extension claims
or ACP-DAG-HITL node constraints:

~~~json
{
  "constraints": {
    "atd.resource_cpu": "2",
    "atd.resource_memory_mb": 4096,
    "atd.resource_timeout_s": 300,
    "atd.resource_priority": "high"
  }
}
~~~
{: #fig-resources title="Resource Hints as Node Constraints"}

Orchestrators (e.g., Kubernetes schedulers, agent gateways) MAY
use these hints for scheduling and quota enforcement. Resource
hints are advisory; agents MUST NOT depend on them for
correctness.

# Security Considerations

Rollback requests are sensitive operations. Agents MUST
authenticate rollback requests using the ECT identity binding
(L2/L3). Only agents in the same workflow (`wid`) with
checkpoint lineage in the DAG SHOULD be authorized to request
rollback.

Checkpoint data may contain sensitive system state. Agents MUST
encrypt stored checkpoints at rest and MUST NOT include checkpoint
contents in error ECTs.

Circuit breaker state reveals system health topology. The
`atd:circuit_open` ECT is part of the audit trail; access to the
audit ledger SHOULD be controlled.

Malicious agents could send false error ECTs to trigger
unnecessary rollbacks. Agents SHOULD verify that error ECTs
reference valid `par` values within their own workflow DAG.

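The `par` check in the last paragraph can be sketched,
non-normatively, as a filter applied before an error ECT is allowed
to trigger rollback. The sketch assumes that each ECT carries its
workflow identifier in a top-level `wid` claim and that the agent
keeps a local index of the ECTs it has seen. Signature and identity
verification, which ECT itself provides, are out of scope here.
`error_references_known_nodes` is a hypothetical helper name.

```python
from typing import Dict


def error_references_known_nodes(error_ect: dict, dag: Dict[str, dict],
                                 wid: str) -> bool:
    """Accept an error ECT only if every par is a known node of workflow `wid`."""
    parents = error_ect.get("par", [])
    # An error from another workflow, or one with no parents, is rejected.
    if error_ect.get("wid") != wid or not parents:
        return False
    return all(p in dag and dag[p].get("wid") == wid for p in parents)
```
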
# IANA Considerations

This document requests registration of the following `exec_act`
values in a future ECT action type registry:

- `atd:checkpoint`
- `atd:error`
- `atd:circuit_open`
- `atd:rollback_request`
- `atd:rollback_result`

--- back

# Acknowledgments
{:numbered="false"}

ATD builds on ECT {{I-D.nennemann-wimse-ect}} for execution
evidence and ACP-DAG-HITL {{I-D.nennemann-agent-dag-hitl-safety}}
for delegation policy. The circuit breaker pattern is adapted
from microservice architecture best practices.