feat: add draft data, gap analysis report, and workspace config
725
workspace/drafts/new-drafts/draft-b-atd-agent-task-dag-01.md
Normal file
@@ -0,0 +1,725 @@
---
title: "Agent Task DAG (ATD): Execution Model, Checkpoints, and Recovery"
abbrev: "ATD"
category: std
docname: draft-atd-agent-task-dag-01
submissiontype: IETF
number:
date:
v: 3
area: "OPS"
workgroup: "NMOP"
keyword:
 - agent DAG
 - checkpoint
 - rollback
 - error recovery
 - circuit breaker

author:
 -
    fullname: TBD
    organization: Independent
    email: placeholder@example.com

normative:
  RFC2119:
  RFC8174:
  RFC8446:
  RFC9110:
  RFC8615:
  I-D.nennemann-wimse-ect:
    title: "Execution Context Tokens for Distributed Agentic Workflows"
    target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/
  I-D.nennemann-agent-dag-hitl-safety:
    title: "Agent Context Policy Token: DAG Delegation with Human Override"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/

informative:

--- abstract

This document defines the Agent Task DAG (ATD) specification:
execution semantics, checkpoints, error signaling, circuit
breakers, and rollback for agent workflows. ATD does not define a
new DAG or token format. It defines when agents MUST emit ECT
nodes, what those nodes mean, and how to recover when things go
wrong. Checkpoints, errors, and rollback results are ECT nodes
with specific `exec_act` values and `ext` claims. Rollback walks
the ECT DAG backwards. Circuit breakers contain cascading
failures. Resource hints enable scheduling. The protocol is
transport-agnostic and builds on ECT for evidence and ACP-DAG-HITL
for policy.

--- middle

# Introduction

Autonomous agents increasingly make unsupervised decisions, yet no
standard exists for how agents checkpoint state, signal errors to
peers, contain cascading failures, or roll back decisions gone
wrong.

ATD borrows proven patterns from distributed systems: checkpoints
from database transactions, circuit breakers from microservice
architectures, and rollback from version control. It adapts these
to agent workflows where actions may be partially reversible and
where the agent that caused the error may not be the best one to
fix it.

ATD does not define a new DAG format. The ECT DAG
{{I-D.nennemann-wimse-ect}} IS the execution graph. ATD defines
the semantics of specific node types within that graph.

Design principles:

1. Agents that take consequential actions MUST be able to undo
   them, or MUST declare them irreversible upfront.
2. Failure containment takes priority over failure diagnosis.
3. The protocol adds minimal overhead to the happy path.

# Conventions and Definitions

{::boilerplate bcp14-tagged}

Checkpoint:
: An ECT node recording agent state before a consequential action,
  sufficient to restore the system to that state.

Circuit Breaker:
: A mechanism that stops an agent from propagating requests to a
  failing downstream agent, preventing cascading failures.

Rollback:
: The process of reverting an agent's actions and state to a
  previously recorded checkpoint.

Blast Radius:
: The set of agents and systems affected by a single failure.

Consequential Action:
: An action that modifies external state (network configuration,
  database records, API calls with side effects) such that
  reversal requires explicit effort.

# Execution Semantics {#execution}

## Topological Order

Tasks in the ECT DAG MUST execute in topological order: a task
MUST NOT begin execution until all tasks referenced by its ECT
`par` claims are in state `done`.

Two tasks MAY execute concurrently when neither is an ancestor of
the other in the DAG (no `par` path connects them). Orchestrators
SHOULD exploit this parallelism for performance.

Circular dependencies are prohibited. Agents MUST reject
ACP-DAG-HITL delegation DAGs containing cycles.
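
The ordering and cycle rules above can be sketched with Python's standard `graphlib` module. This is an illustrative sketch only, not part of the specification: the `par_map` structure (task identifier mapped to the identifiers in its `par` claim) and the function names are assumptions made for the example.

```python
from graphlib import CycleError, TopologicalSorter


def execution_waves(par_map):
    """Group tasks into waves that respect topological order.

    par_map is a hypothetical structure: task id -> list of parent ids
    (the contents of the task's `par` claim). Tasks in one wave have no
    `par` path between them, so they may run concurrently.
    """
    sorter = TopologicalSorter(par_map)
    sorter.prepare()  # raises CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all parents are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


def validate_and_plan(par_map):
    """Reject cyclic delegation DAGs, otherwise return the wave plan."""
    try:
        return execution_waves(par_map)
    except CycleError as exc:
        # exc.args[1] holds the nodes that form the detected cycle.
        raise ValueError(f"delegation DAG contains a cycle: {exc.args[1]}") from exc
```

For a diamond-shaped DAG where `b` and `c` both depend on `a`, and `d` depends on both, the plan has three waves, with `b` and `c` eligible to run in parallel.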

## Workflow Boundary ECTs

When a workflow begins, the initiating agent MUST emit:

~~~json
{
  "exec_act": "atd:workflow_start",
  "ext": {
    "atd.wf_id": "wf-uuid",
    "atd.description": "BGP failover workflow",
    "atd.node_count": 5
  }
}
~~~
{: #fig-wf-start title="Workflow Start ECT"}

When the workflow reaches a terminal state (all leaf nodes
complete or any node failed with no rollback path), the
orchestrator MUST emit:

~~~json
{
  "exec_act": "atd:workflow_complete",
  "par": ["wf-start-ect-uuid"],
  "ext": {
    "atd.wf_id": "wf-uuid",
    "atd.terminal_status": "success",
    "atd.elapsed_s": 42
  }
}
~~~
{: #fig-wf-complete title="Workflow Complete ECT"}

Terminal status values: `success`, `partial`, `failed`,
`rolled_back`, `escalated`.

# Node States {#node-states}

Each task node in the ECT DAG has an implicit state derived from
subsequent ECT nodes:

- **pending**: A delegation node exists in ACP-DAG-HITL but no
  corresponding ECT has been emitted.
- **running**: An ECT matching the task type has been emitted
  but no completion or error ECT follows.
- **done**: A completion ECT (or the next `par`-linked ECT) exists.
- **failed**: An `atd:error` ECT references this node.
- **rolled_back**: An `atd:rollback_result` ECT references this
  node's checkpoint.
- **escalated**: The task failed and a human has been notified
  per HITL escalation rules.
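
One way to derive these implicit states from a list of ECT claim sets is sketched below. Several points are assumptions made for the example rather than rules stated in this document: the precedence between states (`rolled_back` over `failed` over `done`), the `escalated` argument standing in for a lookup against the HITL system, and treating a task with no ECT as `pending` without consulting ACP-DAG-HITL.

```python
def node_state(task_jti, ects, escalated=frozenset()):
    """Derive the implicit state of one task from the ECTs seen so far.

    ects: list of ECT claim dicts with "jti", "exec_act", optional "par"/"ext".
    escalated: hypothetical set of task jti values for which a human has
    been notified (that information lives in the HITL system).
    """
    if not any(e["jti"] == task_jti for e in ects):
        return "pending"

    # ECTs whose `par` claim references this task.
    children = [e for e in ects if task_jti in e.get("par", [])]
    checkpoint_ids = {
        e["jti"] for e in children if e["exec_act"] == "atd:checkpoint"
    }

    # A rollback result that names one of this task's checkpoints.
    rolled_back = any(
        e["exec_act"] == "atd:rollback_result"
        and e.get("ext", {}).get("atd.checkpoint_id") in checkpoint_ids
        for e in ects
    )
    if rolled_back:
        return "rolled_back"

    if any(e["exec_act"] == "atd:error" for e in children):
        return "escalated" if task_jti in escalated else "failed"

    # Any other par-linked ECT counts as the "next" ECT, so the task is done.
    if any(e["exec_act"] not in ("atd:checkpoint", "atd:error") for e in children):
        return "done"
    return "running"
```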

# Checkpoint Mechanism {#checkpoints}

## Checkpoint Placement Policy

An ATD-compliant agent MUST create a checkpoint before any action
it classifies as consequential. The following actions are always
consequential and MUST be checkpointed:

1. Any modification to network device configuration.
2. Any write to a shared database or external data store.
3. Any API call with side effects (non-idempotent HTTP methods).
4. Any delegation to another agent that will itself take
   consequential actions.

The following SHOULD be checkpointed:

1. Long-running computations (those expected to run longer than
   `atd.resource_timeout_s`).
2. Actions that cannot be verified without external state.

The following are exempt from checkpoint requirements:

1. Read-only queries.
2. Sending notifications with no side effects.
3. Internal state computations with no external observable effect.

## Checkpoint ECT Format

A checkpoint is an ECT with:

- `exec_act`: `"atd:checkpoint"`
- `par`: the ECT of the action being checkpointed

~~~json
{
  "jti": "ckpt-uuid",
  "exec_act": "atd:checkpoint",
  "par": ["action-ect-uuid"],
  "out_hash": "sha256-of-agent-state-snapshot",
  "ext": {
    "atd.reversible": true,
    "atd.rollback_uri": "https://agent-b.example.com/.well-known/atd/rollback",
    "atd.target": "router-07.example.com",
    "atd.description": "Update BGP peer config",
    "atd.ttl": 86400
  }
}
~~~
{: #fig-checkpoint title="Checkpoint ECT"}

The `atd.reversible` field MUST be present. If `false`, the agent
declares that this action cannot be automatically undone and
rollback requests MUST be escalated per the ACP-DAG-HITL
`unreachable_human` policy.

The `out_hash` provides integrity verification: the agent hashes
its state at checkpoint time so that rollback can verify it is
restoring to an authentic prior state.

Checkpoints MUST be stored for at least `atd.ttl` seconds. Agents
SHOULD store checkpoints in durable storage that survives restarts.

The rollback URI MUST be a well-known URI per {{RFC8615}} at the
path `/.well-known/atd/rollback`.
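
As a rough illustration of the format above, the following sketch builds the claim set for a checkpoint ECT and checks a restored state against `out_hash`. The canonical JSON serialization and the hex encoding of the SHA-256 digest are assumptions made for the example; the actual `out_hash` encoding is governed by the ECT specification. The function names and the `agent_origin` parameter are likewise hypothetical.

```python
import hashlib
import hmac
import json
import uuid

ROLLBACK_PATH = "/.well-known/atd/rollback"


def state_hash(state):
    """SHA-256 over a canonical JSON form of the state (assumed encoding)."""
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_checkpoint(action_jti, state, agent_origin, reversible, ttl_s=86400):
    """Return the claims of an atd:checkpoint ECT (signing is out of scope)."""
    return {
        "jti": str(uuid.uuid4()),
        "exec_act": "atd:checkpoint",
        "par": [action_jti],
        "out_hash": state_hash(state),
        "ext": {
            # The three ext fields required for atd:checkpoint.
            "atd.reversible": reversible,
            "atd.rollback_uri": agent_origin.rstrip("/") + ROLLBACK_PATH,
            "atd.ttl": ttl_s,
        },
    }


def is_authentic_restore(checkpoint, restored_state):
    """True if the restored state hashes to the checkpoint's out_hash."""
    return hmac.compare_digest(checkpoint["out_hash"], state_hash(restored_state))
```

Only the hash appears in the ECT; the state snapshot itself stays in the agent's own checkpoint store.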

## Hierarchical Checkpoints

Agents MAY create hierarchical checkpoints where a parent groups
multiple child checkpoints from a multi-step operation. Rolling
back the parent rolls back all children. The parent checkpoint's
`par` array references all child checkpoint `jti` values.

## Checkpoint `exec_act` Table

| `exec_act` value | When emitted | Required `ext` fields |
|-----------------|-------------|----------------------|
| `atd:checkpoint` | Before consequential action | `atd.reversible`, `atd.rollback_uri`, `atd.ttl` |
| `atd:error` | On failure detection | `atd.severity`, `atd.error_type`, `atd.checkpoint_id` |
| `atd:circuit_open` | When error rate exceeds threshold | `atd.downstream_agent`, `atd.error_rate`, `atd.window_s` |
| `atd:circuit_close` | When probe succeeds in HALF-OPEN | `atd.downstream_agent`, `atd.cooldown_s` |
| `atd:rollback_request` | To initiate rollback | `atd.reason`, `atd.cascade` |
| `atd:rollback_result` | Rollback complete or failed | `atd.status`, `atd.checkpoint_id`, `atd.cascaded` |
| `atd:workflow_start` | Workflow begins | `atd.wf_id`, `atd.description` |
| `atd:workflow_complete` | Workflow terminal | `atd.wf_id`, `atd.terminal_status` |
{: #fig-actions title="ATD exec_act Values"}

# Error Signaling {#errors}

When an agent detects an error, it MUST emit an error ECT:

- `exec_act`: `"atd:error"`
- `par`: the ECT of the failed action

~~~json
{
  "jti": "error-uuid",
  "exec_act": "atd:error",
  "par": ["failed-action-ect-uuid"],
  "ext": {
    "atd.severity": "critical",
    "atd.error_type": "action_failed",
    "atd.description": "BGP session did not establish",
    "atd.checkpoint_id": "ckpt-uuid",
    "atd.upstream_errors": []
  }
}
~~~
{: #fig-error title="Error ECT"}

Severity levels (in increasing order): `info`, `warning`,
`error`, `critical`.

Error types: `action_failed`, `timeout`, `constraint_violation`,
`resource_exhausted`, `upstream_cascade`, `unknown`.

When an agent receives an error signal caused by an action it
initiated, it MUST either:

(a) Attempt automatic rollback of its checkpoint, or
(b) Escalate per ACP-DAG-HITL rules if the action was
    irreversible.

The `atd.upstream_errors` array allows agents to chain error
context, building a causal trace from symptom to root cause.
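
The causal trace can be reconstructed by following `atd.upstream_errors` from the observed symptom back to errors that have no upstream cause. The sketch below assumes that each array entry is the `jti` of another `atd:error` ECT; this document does not fix the element format, so treat that as an illustrative assumption. The `errors_by_jti` index is also hypothetical.

```python
def causal_trace(symptom_jti, errors_by_jti):
    """Return error jti values from symptom to root cause(s), depth-first.

    errors_by_jti: hypothetical index of atd:error claim dicts keyed by jti.
    Unknown references and repeated entries are skipped, so a malformed
    chain cannot cause an infinite loop.
    """
    trace, seen, stack = [], set(), [symptom_jti]
    while stack:
        jti = stack.pop()
        if jti in seen or jti not in errors_by_jti:
            continue
        seen.add(jti)
        trace.append(jti)
        upstream = errors_by_jti[jti].get("ext", {}).get("atd.upstream_errors", [])
        # Reverse so the first listed upstream error is visited first.
        stack.extend(reversed(upstream))
    return trace


def root_causes(symptom_jti, errors_by_jti):
    """Errors in the trace that list no upstream errors of their own."""
    return [
        jti for jti in causal_trace(symptom_jti, errors_by_jti)
        if not errors_by_jti[jti].get("ext", {}).get("atd.upstream_errors")
    ]
```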

## HITL Escalation on Error

Error ECTs with severity `critical` SHOULD trigger HITL
escalation. Deployments SHOULD define ACP-DAG-HITL rules such
as:

~~~json
{
  "id": "r-critical-error",
  "trigger": {
    "kind": "keyword_match",
    "op": "eq",
    "value": "critical",
    "input_ref": "atd.severity"
  },
  "required_role": "operator:oncall",
  "action": "escalate",
  "allow_override": true,
  "override_action": "continue"
}
~~~
{: #fig-error-hitl title="HITL Rule for Critical Errors"}

# Circuit Breaker Pattern {#circuit-breaker}

Each agent MUST implement a circuit breaker for every downstream
agent it communicates with. The circuit breaker has three states:

CLOSED (normal):
: Requests flow through. The agent tracks the error rate over a
  sliding window (default: 60 seconds).

OPEN (failure detected):
: When the error rate exceeds a threshold (default: 50%), the
  breaker opens. All requests are immediately rejected. The
  agent MUST emit a circuit breaker open ECT:

~~~json
{
  "exec_act": "atd:circuit_open",
  "ext": {
    "atd.downstream_agent": "spiffe://example.com/agent/b",
    "atd.error_rate": 0.75,
    "atd.window_s": 60
  }
}
~~~
{: #fig-circuit-open title="Circuit Breaker Open ECT"}

HALF-OPEN (recovery probe):
: After a cooldown period (default: 30s), the breaker allows one
  probe request. If it succeeds, the breaker returns to CLOSED
  and MUST emit:

~~~json
{
  "exec_act": "atd:circuit_close",
  "ext": {
    "atd.downstream_agent": "spiffe://example.com/agent/b",
    "atd.cooldown_s": 30
  }
}
~~~
{: #fig-circuit-close title="Circuit Breaker Close ECT"}

If the probe fails, the breaker returns to OPEN with doubled
cooldown (exponential backoff, max 300s).

## Circuit Breaker State Machine

~~~
          error_rate > threshold
CLOSED ─────────────────────────► OPEN
  ▲                                 │
  │ probe success                   │ cooldown expires
  │                                 ▼
  └────────────────────────── HALF-OPEN
                          probe failure ──► OPEN (cooldown * 2)
~~~
{: #fig-fsm title="Circuit Breaker State Machine"}
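
The state machine can be sketched as a small Python class using the defaults given in this section (50% threshold, 60 second window, 30 second cooldown, 300 second cap). It is a minimal illustration under stated assumptions: the class and method names are hypothetical, ECTs are collected as plain claim dicts in `emitted` rather than signed and published, and no ECT is emitted when a failed probe re-opens the breaker, since this document does not specify one for that transition.

```python
import time
from collections import deque


class CircuitBreaker:
    """Per-downstream breaker: CLOSED -> OPEN -> HALF-OPEN -> CLOSED or OPEN."""

    def __init__(self, downstream_agent, threshold=0.5, window_s=60.0,
                 cooldown_s=30.0, max_cooldown_s=300.0, clock=time.monotonic):
        self.downstream_agent = downstream_agent
        self.threshold = threshold
        self.window_s = window_s
        self.base_cooldown_s = cooldown_s
        self.cooldown_s = cooldown_s
        self.max_cooldown_s = max_cooldown_s
        self.clock = clock
        self.state = "CLOSED"
        self.emitted = []        # ECT claim dicts this breaker would emit
        self._events = deque()   # (timestamp, succeeded) inside the window
        self._opened_at = 0.0
        self._probe_in_flight = False

    def allow_request(self):
        """Return True if a request may be sent downstream right now."""
        if self.state == "OPEN" and self.clock() - self._opened_at >= self.cooldown_s:
            self.state = "HALF-OPEN"
            self._probe_in_flight = False
        if self.state == "HALF-OPEN":
            if self._probe_in_flight:
                return False     # only one probe request is allowed
            self._probe_in_flight = True
            return True
        return self.state == "CLOSED"

    def record(self, succeeded):
        """Record the outcome of a request admitted by allow_request()."""
        now = self.clock()
        if self.state == "HALF-OPEN":
            self._probe_in_flight = False
            if succeeded:
                self._close()
            else:
                # Probe failed: back to OPEN with doubled, capped cooldown.
                self.cooldown_s = min(self.cooldown_s * 2, self.max_cooldown_s)
                self.state, self._opened_at = "OPEN", now
            return
        if self.state == "OPEN":
            return               # late results while OPEN are ignored
        self._events.append((now, succeeded))
        while self._events and now - self._events[0][0] > self.window_s:
            self._events.popleft()
        failures = sum(1 for _, ok in self._events if not ok)
        error_rate = failures / len(self._events)
        if error_rate > self.threshold:
            self.state, self._opened_at = "OPEN", now
            self.emitted.append({
                "exec_act": "atd:circuit_open",
                "ext": {"atd.downstream_agent": self.downstream_agent,
                        "atd.error_rate": error_rate,
                        "atd.window_s": self.window_s},
            })

    def _close(self):
        self.emitted.append({
            "exec_act": "atd:circuit_close",
            "ext": {"atd.downstream_agent": self.downstream_agent,
                    "atd.cooldown_s": self.cooldown_s},
        })
        self.state = "CLOSED"
        self.cooldown_s = self.base_cooldown_s
        self._events.clear()
```

Injecting `clock` keeps the class testable without sleeping. A production breaker would likely also require a minimum number of samples before opening, which this document leaves unspecified.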

## Coordinated Circuit Breaking

When multiple agents share a downstream dependency, each maintains
its own circuit breaker independently. However, agents SHOULD
publish circuit breaker state via their ECT stream so peers can
observe the signal.

If an orchestrator observes N circuit breakers opening for the
same downstream agent within a short window, it SHOULD initiate
a HITL escalation rather than allowing N parallel recovery probes.

## Circuit Breaker Policy Configuration

Circuit breaker thresholds can be configured as ACP-DAG-HITL
node constraints:

~~~json
{
  "constraints": {
    "atd.circuit_threshold": 0.5,
    "atd.circuit_window_s": 60
  }
}
~~~
{: #fig-circuit-policy title="Circuit Breaker Policy"}

# Rollback Protocol {#rollback}

## Basic Rollback

A rollback is initiated by emitting a rollback request ECT and
sending an HTTP POST to the target agent's rollback endpoint:

~~~
POST /.well-known/atd/rollback HTTP/1.1
Content-Type: application/json
Execution-Context: <rollback-request-ect>
~~~

The rollback request is an ECT with:

- `exec_act`: `"atd:rollback_request"`
- `par`: the checkpoint ECT to roll back to

~~~json
{
  "exec_act": "atd:rollback_request",
  "par": ["ckpt-uuid"],
  "ext": {
    "atd.reason": "Upstream action caused cascading failure",
    "atd.cascade": true
  }
}
~~~
{: #fig-rollback-req title="Rollback Request ECT"}

When `atd.cascade` is `true`, the receiving agent MUST also
initiate rollback of any downstream checkpoints created as a
consequence of the checkpointed action.

The agent MUST respond with a rollback result ECT:

~~~json
{
  "exec_act": "atd:rollback_result",
  "par": ["rollback-request-uuid"],
  "out_hash": "sha256-of-restored-state",
  "ext": {
    "atd.status": "completed",
    "atd.checkpoint_id": "ckpt-uuid",
    "atd.cascaded": [
      {"agent": "spiffe://example.com/agent/c", "status": "completed"},
      {"agent": "spiffe://example.com/agent/d", "status": "escalated"}
    ]
  }
}
~~~
{: #fig-rollback-result title="Rollback Result ECT"}

Status values: `completed`, `partial`, `escalated`, `failed`.

`escalated` means the action was irreversible and a human operator
has been notified per ACP-DAG-HITL `unreachable_human` policy.

## Partial Rollback and Blast Radius Containment

When a failure occurs in the middle of a DAG, it is often
undesirable to roll back the entire workflow. ATD defines
partial rollback as rolling back the failed subgraph while
preserving completed sibling branches.

Partial rollback MUST only proceed if:

1. The checkpoints to be rolled back are in the same workflow
   (`atd.wf_id`).
2. No completed sibling task depends on the output of the
   failed task (verified by walking the DAG forward from the
   checkpoint).

The blast radius is the set of agents holding checkpoints that
are descendants of the failed node. Orchestrators SHOULD
compute blast radius before initiating cascade rollback to
avoid unnecessary disruption.
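
Both the forward walk and the blast radius computation reduce to finding descendants over `par` links. The sketch below is illustrative only: it assumes the agent holding a checkpoint is identified by the `iss` claim of the checkpoint ECT, and the `completed` set of finished task identifiers is a hypothetical input supplied by the orchestrator.

```python
from collections import deque


def descendants(node_jti, ects):
    """All ECT jti values reachable from node_jti by walking `par` forward."""
    children = {}
    for ect in ects:
        for parent in ect.get("par", []):
            children.setdefault(parent, []).append(ect["jti"])
    seen, queue = set(), deque([node_jti])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def blast_radius(failed_jti, ects):
    """Agents holding checkpoints that descend from the failed node."""
    below = descendants(failed_jti, ects)
    return {
        ect["iss"]  # assumption: `iss` identifies the checkpointing agent
        for ect in ects
        if ect["jti"] in below and ect["exec_act"] == "atd:checkpoint"
    }


def partial_rollback_allowed(failed_jti, ects, completed):
    """Condition 2: no completed task depends on the failed task's output."""
    return descendants(failed_jti, ects).isdisjoint(completed)
```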

## Rollback Timeout and Escalation

Rollback requests MUST include a timeout implicitly derived from
the original checkpoint's `atd.ttl`. If rollback is not
completed within `atd.ttl / 2` seconds, the agent MUST:

1. Emit an `atd:error` with `atd.error_type: "timeout"` and
   `atd.description` noting rollback timeout.
2. Escalate to HITL per {{hitl-escalation}}.

Agents MUST implement idempotent rollback: receiving the same
rollback request ECT `jti` twice MUST return the same result.
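
Idempotency can be achieved by caching the result ECT under the request `jti`, as in the sketch below. This is a minimal illustration: `RollbackEndpoint`, the `restore` callable, and the in-memory cache are hypothetical, and a real agent would persist results, enforce the deadline while the restore is running, and perform the escalation itself.

```python
import uuid


def rollback_deadline_s(checkpoint):
    """Time budget for a rollback: half of the checkpoint's atd.ttl."""
    return checkpoint["ext"]["atd.ttl"] / 2


class RollbackEndpoint:
    def __init__(self, checkpoints, restore):
        self.checkpoints = checkpoints  # checkpoint jti -> checkpoint claims
        self.restore = restore          # hypothetical: checkpoint -> bool
        self._results = {}              # request jti -> result claims

    def handle(self, request):
        """Process an atd:rollback_request; replays return the cached result."""
        cached = self._results.get(request["jti"])
        if cached is not None:
            return cached

        checkpoint_id = request["par"][0]
        checkpoint = self.checkpoints[checkpoint_id]
        if not checkpoint["ext"]["atd.reversible"]:
            status = "escalated"        # irreversible: hand over to a human
        elif self.restore(checkpoint):
            status = "completed"
        else:
            status = "failed"

        result = {
            "jti": str(uuid.uuid4()),
            "exec_act": "atd:rollback_result",
            "par": [request["jti"]],
            "ext": {
                "atd.status": status,
                "atd.checkpoint_id": checkpoint_id,
                "atd.cascaded": [],     # cascade handling omitted here
            },
        }
        self._results[request["jti"]] = result
        return result
```

Because the cache is consulted before any work is done, a retried request never triggers a second restore.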

## Rollback Authorization {#rollback-authz}

Only agents within the same workflow (`wid`) with checkpoint
lineage in the DAG SHOULD be authorized to request rollback.
Rollback requests from outside the originating workflow MUST be
rejected with HTTP 403.

# Interaction with HITL {#hitl-escalation}

ATD escalates to HITL in the following scenarios:

1. **Irreversible action failure**: An error ECT with
   `atd.reversible: false` on the checkpoint MUST trigger
   HITL Level 2 (approval required) per the companion HITL
   specification.

2. **Rollback failure**: A rollback result with `atd.status:
   "failed"` MUST trigger HITL Level 3 (STOP) on the workflow.

3. **Cascaded rollback of critical nodes**: When `atd.cascade:
   true` rollback propagates to a node with `atd.severity:
   critical`, HITL SHOULD be triggered at Level 1 (PAUSE)
   to allow human review before proceeding.

4. **Circuit breaker permanent open**: If a circuit breaker
   re-opens after 3 successive failed HALF-OPEN probes, HITL
   Level 2 escalation SHOULD be triggered.

ATD-to-HITL escalation is recorded as an ECT linked to both
the triggering error ECT and the HITL override ECT, preserving
the causal chain in the audit DAG.

# Resource Hints {#resources}

## Resource Claim Format

Agents MAY declare resource requirements as ACP-DAG-HITL node
constraints:

~~~json
{
  "constraints": {
    "atd.resource_cpu": "2",
    "atd.resource_memory_mb": 4096,
    "atd.resource_timeout_s": 300,
    "atd.resource_priority": "high",
    "atd.resource_gpu": "0",
    "atd.resource_network_mbps": 100
  }
}
~~~
{: #fig-resources title="Resource Hints as Node Constraints"}

## Priority Levels

The `atd.resource_priority` field MUST be one of: `critical`,
`high`, `normal`, `low`. Orchestrators SHOULD map these to
scheduling priority classes (e.g., Kubernetes QoS classes:
`critical` → Guaranteed, `high`/`normal` → Burstable, `low`
→ BestEffort).

## Fair-Share Scheduling

When multiple agents compete for a shared resource pool,
orchestrators SHOULD implement fair-share scheduling:

1. Each active workflow receives an equal base allocation.
2. Unused allocation from `low` priority agents is redistributed
   to `high`/`critical` agents within the same scheduling cycle.
3. Starvation prevention: `low` priority agents MUST eventually
   be scheduled within a configurable maximum wait (default: 300s).
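
Rules 1 and 2 can be sketched for a single scheduling cycle as follows. This is one possible reading, not a normative algorithm: the `workflows` structure (identifier mapped to priority and demand in abstract resource units) is hypothetical, spare capacity is split by serving the smallest shortfall first, and starvation prevention (rule 3) is left out because it needs state across cycles.

```python
def fair_share(total, workflows):
    """Allocate `total` units across workflows for one scheduling cycle.

    workflows: hypothetical dict of wf_id -> (priority, demand).
    Returns wf_id -> allocated units.
    """
    base = total / len(workflows)  # rule 1: equal base allocation
    alloc = {wf: min(base, demand) for wf, (_, demand) in workflows.items()}

    # Rule 2: capacity that `low` priority workflows leave unused.
    spare = sum(
        base - alloc[wf]
        for wf, (priority, _) in workflows.items()
        if priority == "low"
    )

    # High/critical workflows that want more than their base share.
    hungry = [
        wf for wf, (priority, demand) in workflows.items()
        if priority in ("high", "critical") and demand > alloc[wf]
    ]
    # Smallest shortfall first, so leftovers flow on to larger requests.
    hungry.sort(key=lambda wf: workflows[wf][1] - alloc[wf])

    remaining = len(hungry)
    for wf in hungry:
        shortfall = workflows[wf][1] - alloc[wf]
        extra = min(spare / remaining, shortfall)
        alloc[wf] += extra
        spare -= extra
        remaining -= 1
    return alloc
```

With 12 units and three workflows, an idle `low` workflow frees its base share of 4 units, which is then divided between the `high` and `critical` workflows.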

## Unsatisfiable Resource Hints

Resource hints are advisory; agents MUST NOT depend on them for
correctness. When resource hints cannot be satisfied:

- If `atd.resource_priority` is `critical`: orchestrator SHOULD
  pre-empt lower-priority tasks.
- If `critical` tasks still cannot be scheduled within 60s:
  emit `atd:error` with `atd.error_type: "resource_exhausted"` and
  escalate to HITL.
- All other priorities: proceed with degraded resources; log
  a warning via `atd:error` with severity `warning`.

# Optional Declarative Workflow Format {#workflow-format}

To support pre-run planning and tooling, ATD defines an optional
declarative workflow descriptor. This is a planning artifact
only; at runtime it is realized as ECTs per this specification.

~~~json
{
  "wf_id": "bgp-failover-v2",
  "description": "BGP peer failover with validation",
  "nodes": [
    {
      "id": "n1",
      "label": "validate-config",
      "reversible": true,
      "hitl_required": false,
      "resource_hints": {
        "priority": "normal",
        "timeout_s": 30
      }
    },
    {
      "id": "n2",
      "label": "update-bgp-peer",
      "reversible": true,
      "hitl_required": true,
      "resource_hints": {
        "priority": "critical",
        "timeout_s": 120
      }
    },
    {
      "id": "n3",
      "label": "verify-session",
      "reversible": false,
      "hitl_required": false,
      "resource_hints": {
        "priority": "high",
        "timeout_s": 60
      }
    }
  ],
  "edges": [
    {"from": "n1", "to": "n2"},
    {"from": "n2", "to": "n3"}
  ]
}
~~~
{: #fig-workflow title="Declarative Workflow Descriptor"}

The workflow descriptor media type is
`application/atd-workflow+json`. Orchestrators MAY store and
version workflow descriptors independently of their ECT runtime
realization.

The `hitl_required` field signals to the HITL system that this
node requires an approval gate as defined in the companion HITL
specification.
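
A descriptor can be checked before a run by turning its `edges` into per-node parent lists and verifying that the result is acyclic. The sketch below assumes that an edge `{"from": x, "to": y}` means node `y` will list node `x` in its `par` claim at runtime; that mapping is an interpretation for illustration, and the function names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def par_map_from_descriptor(descriptor):
    """Return node id -> list of parent node ids, validating references."""
    ids = [node["id"] for node in descriptor["nodes"]]
    if len(set(ids)) != len(ids):
        raise ValueError("duplicate node id in workflow descriptor")
    par_map = {node_id: [] for node_id in ids}
    for edge in descriptor["edges"]:
        if edge["from"] not in par_map or edge["to"] not in par_map:
            raise ValueError(f"edge references unknown node: {edge}")
        par_map[edge["to"]].append(edge["from"])
    return par_map


def planned_order(descriptor):
    """One valid execution order, or ValueError if the descriptor is cyclic."""
    par_map = par_map_from_descriptor(descriptor)
    try:
        return list(TopologicalSorter(par_map).static_order())
    except CycleError as exc:
        raise ValueError("workflow descriptor contains a cycle") from exc
```

Applied to the BGP failover descriptor shown in this section, the only valid order is `n1`, `n2`, `n3`.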

# Security Considerations

## Rollback Authorization

Rollback requests are high-privilege operations. Agents MUST
authenticate rollback requests using the ECT identity binding
(L2/L3). The rollback endpoint MUST require mutual TLS or a
signed JWT from an agent within the same workflow DAG.

Only agents that are ancestors in the ECT DAG of the checkpoint
being rolled back SHOULD be authorized to request that rollback.

## Checkpoint Confidentiality

Checkpoint data may contain sensitive system state (API keys,
session tokens, configuration). Agents:

- MUST encrypt stored checkpoints at rest.
- MUST reference checkpoint state in ECTs only via `out_hash`.
- MUST NOT include checkpoint contents in error ECTs.

## False Error Injection

A malicious agent could send false `atd:error` ECTs to trigger
unnecessary rollbacks and disrupt workflows. Mitigation:

- Agents SHOULD verify that error ECTs reference valid `par`
  values within their own workflow DAG (`wid` claim).
- Rollback MUST require authentication (see {{rollback-authz}}).
- L2/L3 ECT signing prevents unauthenticated error injection.

## Checkpoint Flooding

An adversary could exhaust checkpoint storage by triggering
many checkpoints. Mitigation:

- Agents SHOULD enforce a maximum checkpoint count per workflow.
- Expired checkpoints (past `atd.ttl`) MUST be purged.
- Checkpoint creation rate SHOULD be rate-limited per calling
  workflow.

## Circuit Breaker State Leakage

The `atd:circuit_open` ECT reveals system health topology. The
audit ledger SHOULD enforce access controls: only agents within
the same workflow or authorized operators SHOULD be able to query
circuit breaker history.

# IANA Considerations

This document requests registration of the following values in
the AEM Ecosystem Extension Registry established by
draft-aem-agent-ecosystem-model:

## `exec_act` Values

| Value | Description | Reference |
|-------|-------------|-----------|
| `atd:checkpoint` | State snapshot before consequential action | This document |
| `atd:error` | Error signal with severity and type | This document |
| `atd:circuit_open` | Circuit breaker opened to downstream agent | This document |
| `atd:circuit_close` | Circuit breaker returned to CLOSED state | This document |
| `atd:rollback_request` | Initiate rollback to named checkpoint | This document |
| `atd:rollback_result` | Result of rollback attempt | This document |
| `atd:workflow_start` | Workflow began execution | This document |
| `atd:workflow_complete` | Workflow reached terminal state | This document |
{: #fig-iana-actions title="ATD exec_act Registrations"}

## Well-Known URI

This document requests registration of `atd/rollback` as a
well-known URI suffix per {{RFC8615}}.

## Media Type

This document requests registration of
`application/atd-workflow+json` for the declarative workflow
descriptor format defined in {{workflow-format}}.

--- back

# Acknowledgments
{:numbered="false"}

ATD builds on ECT {{I-D.nennemann-wimse-ect}} for execution
evidence and ACP-DAG-HITL {{I-D.nennemann-agent-dag-hitl-safety}}
for delegation policy. The circuit breaker pattern is adapted
from microservice architecture best practices. The declarative
workflow format is inspired by workflow description languages
(BPEL, BPMN) adapted for lightweight agent coordination.