feat: add draft data, gap analysis report, and workspace config
386
workspace/drafts/new-drafts/draft-b-atd-agent-task-dag-00.md
Normal file
@@ -0,0 +1,386 @@
---
title: "Agent Task DAG (ATD): Execution Model, Checkpoints, and Recovery"
abbrev: "ATD"
category: std
docname: draft-atd-agent-task-dag-00
submissiontype: IETF
number:
date:
v: 3
area: "OPS"
workgroup: "NMOP"
keyword:
  - agent DAG
  - checkpoint
  - rollback
  - error recovery
  - circuit breaker

author:
  -
    fullname: TBD
    organization: Independent
    email: placeholder@example.com

normative:
  RFC2119:
  RFC8174:
  RFC8446:
  I-D.nennemann-wimse-ect:
    title: "Execution Context Tokens for Distributed Agentic Workflows"
    target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/
  I-D.nennemann-agent-dag-hitl-safety:
    title: "Agent Context Policy Token: DAG Delegation with Human Override"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/

informative:

--- abstract

This document defines the Agent Task DAG (ATD) specification:
execution semantics, checkpoints, error signaling, circuit
breakers, and rollback for agent workflows. ATD does not define a
new DAG or token format. It defines when agents MUST emit ECT
nodes, what those nodes mean, and how to recover from failures.
Checkpoints, errors, and rollback results are ECT nodes
with specific `exec_act` values and `ext` claims. Rollback walks
the ECT DAG backwards. Circuit breakers contain cascading
failures. Resource hints enable scheduling. The protocol is
transport-agnostic and builds on ECT for evidence and ACP-DAG-HITL
for policy.

--- middle

# Introduction

Autonomous agents increasingly make unsupervised decisions, yet no
standard exists for how agents checkpoint state, signal errors to
peers, contain cascading failures, or roll back decisions gone
wrong.

ATD borrows proven patterns from distributed systems: checkpoints
from database transactions, circuit breakers from microservice
architectures, and rollback from version control. It adapts these
to agent workflows where actions may be only partially reversible
and where the agent that caused the error may not be the best one
to fix it.

ATD does not define a new DAG format. The ECT DAG
{{I-D.nennemann-wimse-ect}} is the execution graph. ATD defines
the semantics of specific node types within that graph.

Design principles:

1. Agents that take consequential actions MUST be able to undo
   them, or MUST declare them irreversible upfront.
2. Failure containment takes priority over failure diagnosis.
3. The protocol adds minimal overhead to the happy path.

# Conventions and Definitions

{::boilerplate bcp14-tagged}

Checkpoint:
: An ECT node recording agent state before a consequential action,
  sufficient to restore the system to that state.

Circuit Breaker:
: A mechanism that stops an agent from propagating requests to a
  failing downstream agent, preventing cascading failures.

Rollback:
: The process of reverting an agent's actions and state to a
  previously recorded checkpoint.

Blast Radius:
: The set of agents and systems affected by a single failure.

# Node States {#node-states}

Each task node in the ECT DAG has an implicit state derived from
subsequent ECT nodes:

- **pending**: A delegation node exists in ACP-DAG-HITL
  {{I-D.nennemann-agent-dag-hitl-safety}} but no corresponding ECT
  has been emitted.
- **running**: An ECT with `exec_act` matching the task type has
  been emitted but no completion or error ECT follows.
- **done**: A completion ECT (or the next `par`-linked ECT) exists.
- **failed**: An `atd:error` ECT references this node.
- **rolled_back**: An `atd:rollback_result` ECT references this
  node's checkpoint.
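
The state derivation above can be sketched as a small function. This is a minimal illustration under stated assumptions, not part of the specification: ECTs are plain dictionaries with `jti`, `exec_act`, `par`, and `ext` keys; a task with no ECT at all is treated as `pending`; any `par`-linked ECT whose `exec_act` lacks the `atd:` prefix counts as evidence of completion; and the precedence order (`rolled_back`, `failed`, `done`, `running`) is a choice the text does not mandate. The name `node_state` is hypothetical.

```python
from typing import Any, Iterable, Mapping

ECT = Mapping[str, Any]


def node_state(task_jti: str, ects: Iterable[ECT]) -> str:
    """Derive the implicit ATD state of one task node from known ECTs."""
    ects = list(ects)
    if not any(e["jti"] == task_jti for e in ects):
        return "pending"  # delegated, but no ECT emitted for the task yet

    # ECTs that name the task node as a parent.
    children = [e for e in ects if task_jti in e.get("par", [])]
    checkpoints = {
        e["jti"] for e in children if e["exec_act"] == "atd:checkpoint"
    }

    # A rollback result that names one of this node's checkpoints wins.
    for e in ects:
        if (e["exec_act"] == "atd:rollback_result"
                and e.get("ext", {}).get("atd.checkpoint_id") in checkpoints):
            return "rolled_back"
    if any(e["exec_act"] == "atd:error" for e in children):
        return "failed"
    # Assumption: a later par-linked, non-ATD ECT implies completion.
    if any(not e["exec_act"].startswith("atd:") for e in children):
        return "done"
    return "running"
```

A checkpoint alone does not change the state: a task with only an `atd:checkpoint` child is still `running` in this sketch.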

# Checkpoint Mechanism {#checkpoints}

An ATD-compliant agent MUST create a checkpoint before any action
it classifies as consequential. An action is consequential if it
modifies external state (network config, database records, API
calls with side effects).

A checkpoint is an ECT with:

- `exec_act`: `"atd:checkpoint"`
- `par`: the ECT of the action being checkpointed

~~~json
{
  "jti": "ckpt-uuid",
  "exec_act": "atd:checkpoint",
  "par": ["action-ect-uuid"],
  "out_hash": "sha256-of-agent-state-snapshot",
  "ext": {
    "atd.reversible": true,
    "atd.rollback_uri": "https://agent-b.example.com/atd/rollback",
    "atd.target": "router-07.example.com",
    "atd.description": "Update BGP peer config",
    "atd.ttl": 86400
  }
}
~~~
{: #fig-checkpoint title="Checkpoint ECT"}

The `atd.reversible` field MUST be present. If `false`, the agent
declares that this action cannot be automatically undone and
rollback requests MUST be escalated per the ACP-DAG-HITL
`unreachable_human` policy.

The `out_hash` provides integrity verification: the agent hashes
its state at checkpoint time so that rollback can verify it is
restoring to an authentic prior state.

Checkpoints MUST be stored for at least `atd.ttl` seconds. Agents
SHOULD store checkpoints in durable storage that survives restarts.
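
As a rough illustration of the checkpoint fields and the `out_hash` check, the sketch below builds the claims of a checkpoint ECT and verifies a stored snapshot before restoring it. Several details are assumptions rather than requirements of this document: the state snapshot is a JSON-serializable dictionary, it is canonicalized with sorted keys, and the hash is hex-encoded SHA-256 (the ECT specification may define a different encoding). The helper names `state_hash`, `make_checkpoint`, and `snapshot_is_authentic` are hypothetical, and token signing is out of scope.

```python
import hashlib
import hmac
import json
import uuid


def state_hash(state: dict) -> str:
    """Hash a snapshot (assumed: canonical JSON, hex-encoded SHA-256)."""
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_checkpoint(action_jti: str, state: dict, *, reversible: bool,
                    rollback_uri: str, target: str, description: str,
                    ttl: int = 86400) -> dict:
    """Build the claims of an atd:checkpoint ECT (unsigned)."""
    return {
        "jti": str(uuid.uuid4()),
        "exec_act": "atd:checkpoint",
        "par": [action_jti],
        "out_hash": state_hash(state),
        "ext": {
            "atd.reversible": reversible,  # MUST always be present
            "atd.rollback_uri": rollback_uri,
            "atd.target": target,
            "atd.description": description,
            "atd.ttl": ttl,  # minimum retention in seconds
        },
    }


def snapshot_is_authentic(checkpoint: dict, stored_state: dict) -> bool:
    """Before restoring, confirm the stored snapshot matches out_hash."""
    return hmac.compare_digest(checkpoint["out_hash"],
                               state_hash(stored_state))
```

Canonicalizing before hashing keeps the digest independent of key order, so a snapshot that was re-serialized by a storage layer still verifies.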

## Hierarchical Checkpoints

Agents MAY create hierarchical checkpoints where a parent groups
multiple child checkpoints from a multi-step operation. Rolling
back the parent rolls back all children. The parent checkpoint's
`par` array references all child checkpoint `jti` values.
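
One way to expand a parent checkpoint into the individual checkpoints that need restoring is sketched below. It is illustrative only: it assumes ECTs are held in a dictionary keyed by `jti`, that a checkpoint is a parent whenever its `par` array names other `atd:checkpoint` ECTs, and that children are undone in reverse listing order (newest first), which this document does not specify. The name `leaf_checkpoints` is hypothetical.

```python
def leaf_checkpoints(parent_jti: str, ects_by_jti: dict) -> list:
    """Expand a (possibly nested) parent checkpoint into leaf checkpoints.

    Returns leaf checkpoint jti values in the order they would be rolled
    back under the newest-first assumption.
    """
    node = ects_by_jti[parent_jti]
    # Only par entries that are themselves checkpoints count as children.
    children = [
        p for p in node.get("par", [])
        if ects_by_jti.get(p, {}).get("exec_act") == "atd:checkpoint"
    ]
    if not children:
        return [parent_jti]  # an ordinary checkpoint: restore it directly

    leaves = []
    for child in reversed(children):  # undo the latest step first
        leaves.extend(leaf_checkpoints(child, ects_by_jti))
    return leaves
```

For a parent whose `par` lists checkpoints `c1` and a nested group of `c2` and `c3`, the sketch yields `c3`, `c2`, `c1`.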

# Error Signaling {#errors}

When an agent detects an error, it MUST emit an error ECT:

- `exec_act`: `"atd:error"`
- `par`: the ECT of the failed action

~~~json
{
  "jti": "error-uuid",
  "exec_act": "atd:error",
  "par": ["failed-action-ect-uuid"],
  "ext": {
    "atd.severity": "critical",
    "atd.error_type": "action_failed",
    "atd.description": "BGP session did not establish",
    "atd.checkpoint_id": "ckpt-uuid",
    "atd.upstream_errors": []
  }
}
~~~
{: #fig-error title="Error ECT"}

Severity levels: `info`, `warning`, `error`, `critical`.

Error types: `action_failed`, `timeout`, `constraint_violation`,
`resource_exhausted`, `upstream_cascade`, `unknown`.

When an agent receives an error signal caused by an action it
initiated, it MUST either:

(a) Attempt automatic rollback of its checkpoint, or
(b) Escalate per the ACP-DAG-HITL rules if the action was
    irreversible.

The `atd.upstream_errors` array allows agents to chain error
context, building a causal trace from symptom to root cause.
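
The causal trace can be reconstructed by following `atd.upstream_errors` from the observed error back to errors that have no upstream entries. The sketch below assumes that the array holds the `jti` values of upstream error ECTs (the example above only shows an empty array, so the element format is an assumption) and that error ECTs are available in a dictionary keyed by `jti`. The function names are hypothetical.

```python
def causal_trace(error_jti: str, errors_by_jti: dict) -> list:
    """Walk upstream error links depth-first, from symptom to root cause."""
    trace, seen, stack = [], set(), [error_jti]
    while stack:
        jti = stack.pop()
        if jti in seen or jti not in errors_by_jti:
            continue  # skip repeats (cycles) and unknown references
        seen.add(jti)
        trace.append(jti)
        ext = errors_by_jti[jti].get("ext", {})
        upstream = ext.get("atd.upstream_errors", [])
        stack.extend(reversed(upstream))  # keep the listed order
    return trace


def root_causes(error_jti: str, errors_by_jti: dict) -> list:
    """Errors in the trace that cite no upstream error of their own."""
    return [
        jti for jti in causal_trace(error_jti, errors_by_jti)
        if not errors_by_jti[jti].get("ext", {}).get("atd.upstream_errors")
    ]
```

The `seen` set makes the walk terminate even if a misbehaving agent emits errors that reference each other.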

## HITL Escalation on Error

Error ECTs MAY trigger ACP-DAG-HITL rules. A deployment can
define HITL rules such as:

~~~json
{
  "id": "r-critical-error",
  "trigger": {
    "kind": "keyword_match",
    "op": "eq",
    "value": "critical",
    "input_ref": "atd.severity"
  },
  "required_role": "operator:oncall",
  "action": "escalate",
  "allow_override": true,
  "override_action": "continue"
}
~~~
{: #fig-error-hitl title="HITL Rule for Critical Errors"}
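
To make the intent of the rule above concrete, the following sketch checks whether an error ECT would trigger it. Rule evaluation is defined by ACP-DAG-HITL, not by this document, so everything here is an assumption for illustration: `input_ref` is resolved against the ECT's `ext` claims, and only the `eq` operator is handled. The name `rule_matches` is hypothetical.

```python
def rule_matches(rule: dict, ect: dict) -> bool:
    """Return True if the rule's trigger matches the given ECT.

    Assumption: input_ref names a key inside the ECT's ext claims.
    """
    trigger = rule["trigger"]
    if trigger.get("op") != "eq":
        raise NotImplementedError("only the 'eq' operator is sketched here")
    actual = ect.get("ext", {}).get(trigger["input_ref"])
    return actual == trigger["value"]
```

With the rule from the figure, an error ECT whose `atd.severity` is `critical` matches, while one with severity `warning` does not.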

# Circuit Breaker Pattern {#circuit-breaker}

Each agent MUST implement a circuit breaker for every downstream
agent it communicates with. The circuit breaker has three states:

CLOSED (normal):
: Requests flow through. The agent tracks the error rate over a
  sliding window (default: 60 seconds).

OPEN (failure detected):
: When the error rate exceeds a threshold (default: 50%), the
  breaker opens. All requests are immediately rejected. The
  agent MUST emit a circuit breaker ECT:

~~~json
{
  "exec_act": "atd:circuit_open",
  "ext": {
    "atd.downstream_agent": "spiffe://example.com/agent/b",
    "atd.error_rate": 0.75,
    "atd.window_s": 60
  }
}
~~~
{: #fig-circuit title="Circuit Breaker ECT"}

HALF-OPEN (recovery probe):
: After a cooldown period (default: 30s), the breaker allows one
  probe request. If it succeeds, the breaker returns to CLOSED.
  If it fails, it returns to OPEN with doubled cooldown
  (exponential backoff, max 300s).
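
The three states and their transitions can be sketched as a small class. This is one possible reading of the text, using the stated defaults (60 second window, 50% threshold, 30 second cooldown, 300 second cap). It makes some simplifying assumptions that the document does not state: the breaker opens as soon as the windowed error rate exceeds the threshold with no minimum sample count, the cooldown resets to its base value after a successful probe, and the injectable `clock` parameter exists only to make the example testable. The class and method names are hypothetical.

```python
import time
from collections import deque


class CircuitBreaker:
    """Three-state breaker guarding calls to one downstream agent."""

    def __init__(self, threshold=0.5, window_s=60.0, cooldown_s=30.0,
                 max_cooldown_s=300.0, clock=time.monotonic):
        self.threshold = threshold
        self.window_s = window_s
        self.max_cooldown_s = max_cooldown_s
        self.clock = clock
        self.state = "CLOSED"
        self.error_rate = 0.0
        self._base_cooldown = cooldown_s
        self._cooldown = cooldown_s
        self._opened_at = 0.0
        self._events = deque()  # (timestamp, succeeded) pairs in the window

    def allow(self) -> bool:
        """Return True if a request may be sent right now."""
        if (self.state == "OPEN"
                and self.clock() - self._opened_at >= self._cooldown):
            self.state = "HALF_OPEN"  # cooldown elapsed: admit one probe
            return True
        return self.state == "CLOSED"

    def record(self, succeeded: bool) -> None:
        """Report the outcome of a request that allow() admitted."""
        now = self.clock()
        if self.state == "HALF_OPEN":
            if succeeded:
                self.state = "CLOSED"
                self._cooldown = self._base_cooldown
                self._events.clear()
            else:
                # Failed probe: reopen with a doubled, capped cooldown.
                self._cooldown = min(self._cooldown * 2, self.max_cooldown_s)
                self.state, self._opened_at = "OPEN", now
            return
        self._events.append((now, succeeded))
        # Drop outcomes that have slid out of the window.
        while now - self._events[0][0] > self.window_s:
            self._events.popleft()
        failures = sum(1 for _, ok in self._events if not ok)
        self.error_rate = failures / len(self._events)
        if self.error_rate > self.threshold:
            self.state, self._opened_at = "OPEN", now

    def circuit_open_ect(self, downstream_agent: str) -> dict:
        """Claims for the atd:circuit_open ECT emitted on opening."""
        return {
            "exec_act": "atd:circuit_open",
            "ext": {
                "atd.downstream_agent": downstream_agent,
                "atd.error_rate": self.error_rate,
                "atd.window_s": int(self.window_s),
            },
        }
```

A caller would check `allow()` before each request, call `record()` with the outcome, and emit the ECT from `circuit_open_ect()` when `state` changes to `OPEN`. A deployment would likely also want a minimum request count before opening, so that a single early failure does not trip the breaker.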

Circuit breaker thresholds can be configured as ACP-DAG-HITL
node constraints:

~~~json
{
  "constraints": {
    "atd.circuit_threshold": 0.5,
    "atd.circuit_window_s": 60
  }
}
~~~
{: #fig-circuit-policy title="Circuit Breaker Policy"}

# Rollback Protocol {#rollback}

A rollback is initiated by emitting a rollback request ECT and
sending an HTTP POST to the target agent's rollback endpoint:

~~~
POST /atd/rollback HTTP/1.1
Content-Type: application/json
Execution-Context: <rollback-request-ect>
~~~

The rollback request ECT has:

- `exec_act`: `"atd:rollback_request"`
- `par`: the checkpoint ECT to roll back to

~~~json
{
  "jti": "rollback-request-uuid",
  "exec_act": "atd:rollback_request",
  "par": ["ckpt-uuid"],
  "ext": {
    "atd.reason": "Upstream action caused cascading failure",
    "atd.cascade": true
  }
}
~~~
{: #fig-rollback-req title="Rollback Request ECT"}

When `atd.cascade` is `true`, the receiving agent MUST also
initiate rollback of any downstream checkpoints created as a
consequence of the checkpointed action.
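
Finding the downstream checkpoints for a cascading rollback amounts to walking the ECT DAG forward from the checkpointed action and collecting every checkpoint among its descendants. The sketch below is one way to do that. It assumes ECTs are plain dictionaries, that "created as a consequence of" means "reachable through `par` links from the checkpointed action", and that deeper checkpoints should be rolled back first; none of these is stated normatively here. The name `downstream_checkpoints` is hypothetical.

```python
from collections import defaultdict, deque


def downstream_checkpoints(checkpoint_jti: str, ects: list) -> list:
    """List checkpoints descending from the checkpointed action.

    The result is ordered with the most recently discovered (deepest)
    checkpoints first, so the caller can undo them before their causes.
    """
    by_jti = {e["jti"]: e for e in ects}
    children = defaultdict(list)  # parent jti -> child jti values
    for e in ects:
        for parent in e.get("par", []):
            children[parent].append(e["jti"])

    # Start from the action(s) that the checkpoint protects.
    roots = by_jti[checkpoint_jti].get("par", [])
    queue, seen, found = deque(roots), set(roots), []
    while queue:
        jti = queue.popleft()
        for child in children[jti]:
            if child in seen:
                continue
            seen.add(child)
            queue.append(child)
            is_checkpoint = by_jti[child]["exec_act"] == "atd:checkpoint"
            if is_checkpoint and child != checkpoint_jti:
                found.append(child)
    return list(reversed(found))
```

Checkpoints that belong to unrelated branches of the workflow are never visited, which keeps the cascade within the blast radius of the original action.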

The agent MUST respond with a rollback result ECT:

- `exec_act`: `"atd:rollback_result"`
- `par`: the rollback request ECT

~~~json
{
  "exec_act": "atd:rollback_result",
  "par": ["rollback-request-uuid"],
  "out_hash": "sha256-of-restored-state",
  "ext": {
    "atd.status": "completed",
    "atd.checkpoint_id": "ckpt-uuid",
    "atd.cascaded": [
      {"agent": "spiffe://example.com/agent/c", "status": "completed"},
      {"agent": "spiffe://example.com/agent/d", "status": "escalated"}
    ]
  }
}
~~~
{: #fig-rollback-result title="Rollback Result ECT"}

Status values: `completed`, `partial`, `escalated`, `failed`.

`escalated` means the action was irreversible and a human operator
has been notified per ACP-DAG-HITL `unreachable_human` policy.

Agents MUST implement idempotent rollback: receiving the same
rollback request ECT `jti` twice MUST return the same result.
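
A common way to meet the idempotency requirement is to record the result under the request's `jti` and replay it on a repeat. The sketch below shows the idea with an in-memory dictionary. It is an illustration under assumptions: the request carries a `jti` claim, `restore` is a caller-supplied function that performs the actual restoration and returns a status string, and a real agent would likely keep results in durable storage alongside its checkpoints. The class name `RollbackHandler` is hypothetical.

```python
import uuid


class RollbackHandler:
    """Handle rollback requests so that repeats return the same result."""

    def __init__(self, restore):
        # restore: callable taking a checkpoint jti, returning a status
        # such as "completed", "partial", "escalated", or "failed".
        self._restore = restore
        self._results = {}  # request jti -> rollback result claims

    def handle(self, request: dict) -> dict:
        request_jti = request["jti"]
        if request_jti in self._results:
            # Duplicate delivery: replay the stored result, do not
            # restore a second time.
            return self._results[request_jti]

        checkpoint_jti = request["par"][0]
        status = self._restore(checkpoint_jti)
        result = {
            "jti": str(uuid.uuid4()),
            "exec_act": "atd:rollback_result",
            "par": [request_jti],
            "ext": {
                "atd.status": status,
                "atd.checkpoint_id": checkpoint_jti,
            },
        }
        self._results[request_jti] = result
        return result
```

Because the stored result is returned verbatim, a retried POST receives the same result ECT, including the same result `jti`.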

# Resource Hints {#resources}

Agents MAY declare resource requirements as ECT extension claims
or ACP-DAG-HITL node constraints:

~~~json
{
  "constraints": {
    "atd.resource_cpu": "2",
    "atd.resource_memory_mb": 4096,
    "atd.resource_timeout_s": 300,
    "atd.resource_priority": "high"
  }
}
~~~
{: #fig-resources title="Resource Hints as Node Constraints"}

Orchestrators (e.g., Kubernetes schedulers, agent gateways) MAY
use these hints for scheduling and quota enforcement. Resource
hints are advisory; agents MUST NOT depend on them for
correctness.

# Security Considerations

Rollback requests are sensitive operations. Agents MUST
authenticate rollback requests using the ECT identity binding
(L2/L3). Agents SHOULD authorize rollback requests only from
agents in the same workflow (`wid`) that have checkpoint lineage
in the DAG.
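
An authorization check along these lines could look like the sketch below. It covers only the workflow and lineage test, not signature verification, and it rests on an interpretation that this document does not spell out: "checkpoint lineage" is taken to mean that the requester issued the checkpoint or one of its ancestors in the DAG. It also assumes each ECT carries an `iss` claim identifying its issuer in addition to `wid`. The name `may_request_rollback` is hypothetical.

```python
def may_request_rollback(request: dict, ects_by_jti: dict) -> bool:
    """Check workflow membership and checkpoint lineage for a requester."""
    targets = request.get("par") or []
    checkpoint = ects_by_jti.get(targets[0]) if targets else None
    if checkpoint is None or checkpoint.get("exec_act") != "atd:checkpoint":
        return False
    # Same workflow: both sides must carry the same, non-empty wid.
    if not request.get("wid") or request["wid"] != checkpoint.get("wid"):
        return False
    requester = request.get("iss")
    if not requester:
        return False

    # Walk from the checkpoint back through its ancestors.
    stack, seen = [checkpoint["jti"]], set()
    while stack:
        jti = stack.pop()
        if jti in seen or jti not in ects_by_jti:
            continue
        seen.add(jti)
        node = ects_by_jti[jti]
        if node.get("iss") == requester:
            return True  # requester appears in the checkpoint's lineage
        stack.extend(node.get("par", []))
    return False
```

An agent that never contributed to the chain of actions leading to the checkpoint is rejected even if it belongs to the same workflow.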

Checkpoint data may contain sensitive system state. Agents MUST
encrypt stored checkpoints at rest and MUST NOT include checkpoint
contents in error ECTs.

Circuit breaker state reveals system health topology. The
`atd:circuit_open` ECT is part of the audit trail; access to the
audit ledger SHOULD be controlled.

Malicious agents could send false error ECTs to trigger
unnecessary rollbacks. Agents SHOULD verify that error ECTs
reference valid `par` values within their own workflow DAG.
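
The check on incoming error ECTs can be as simple as confirming that every referenced parent is known locally and belongs to the same workflow. The sketch below shows that minimal form. It assumes ECTs carry a `wid` claim and are indexed by `jti`, and it additionally checks that `atd.checkpoint_id`, when present, names a known ECT; that last check is an extra assumption, not a requirement of this document. The name `error_is_plausible` is hypothetical.

```python
def error_is_plausible(error: dict, ects_by_jti: dict) -> bool:
    """Reject error ECTs that do not point into the local workflow DAG."""
    if error.get("exec_act") != "atd:error" or not error.get("par"):
        return False
    wid = error.get("wid")
    if not wid:
        return False
    for parent_jti in error["par"]:
        parent = ects_by_jti.get(parent_jti)
        # Unknown parent, or a parent from another workflow: reject.
        if parent is None or parent.get("wid") != wid:
            return False
    checkpoint_id = error.get("ext", {}).get("atd.checkpoint_id")
    return checkpoint_id is None or checkpoint_id in ects_by_jti
```

Passing this check does not prove the error is genuine; it only filters out errors that cannot belong to the receiving agent's workflow.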

# IANA Considerations

This document requests registration of the following `exec_act`
values in a future ECT action type registry:

- `atd:checkpoint`
- `atd:error`
- `atd:circuit_open`
- `atd:rollback_request`
- `atd:rollback_result`

--- back

# Acknowledgments
{:numbered="false"}

ATD builds on ECT {{I-D.nennemann-wimse-ect}} for execution
evidence and ACP-DAG-HITL {{I-D.nennemann-agent-dag-hitl-safety}}
for delegation policy. The circuit breaker pattern is adapted
from microservice architecture best practices.