---
title: "Agent Task DAG (ATD): Execution Model, Checkpoints, and Recovery"
abbrev: "ATD"
category: std
docname: draft-atd-agent-task-dag-01
submissiontype: IETF
number:
date:
v: 3
area: "OPS"
workgroup: "NMOP"
keyword:
  - agent DAG
  - checkpoint
  - rollback
  - error recovery
  - circuit breaker

author:
  -
    fullname: TBD
    organization: Independent
    email: placeholder@example.com

normative:
  RFC2119:
  RFC8174:
  RFC8446:
  RFC9110:
  RFC8615:
  I-D.nennemann-wimse-ect:
    title: "Execution Context Tokens for Distributed Agentic Workflows"
    target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/
  I-D.nennemann-agent-dag-hitl-safety:
    title: "Agent Context Policy Token: DAG Delegation with Human Override"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/

informative:

--- abstract

This document defines the Agent Task DAG (ATD) specification:
execution semantics, checkpoints, error signaling, circuit
breakers, and rollback for agent workflows. ATD does not define a
new DAG or token format. It defines when agents MUST emit
Execution Context Token (ECT) nodes, what those nodes mean, and how
to recover when things go wrong. Checkpoints, errors, and rollback
results are ECT nodes with specific `exec_act` values and `ext`
claims. Rollback walks the ECT DAG backwards. Circuit breakers
contain cascading failures. Resource hints enable scheduling. The
protocol is transport-agnostic and builds on ECT for evidence and
ACP-DAG-HITL for policy.

--- middle

# Introduction

Autonomous agents increasingly make unsupervised decisions, yet no
standard exists for how agents checkpoint state, signal errors to
peers, contain cascading failures, or roll back decisions gone
wrong.

ATD borrows proven patterns from distributed systems: checkpoints
from database transactions, circuit breakers from microservice
architectures, and rollback from version control. It adapts these
to agent workflows where actions may be only partially reversible
and where the agent that caused the error may not be the best one
to fix it.

ATD does not define a new DAG format. The ECT DAG
{{I-D.nennemann-wimse-ect}} is the execution graph. ATD defines
the semantics of specific node types within that graph.

Design principles:

1. Agents that take consequential actions MUST be able to undo
   them, or MUST declare them irreversible upfront.
2. Failure containment takes priority over failure diagnosis.
3. The protocol adds minimal overhead to the happy path.

# Conventions and Definitions

{::boilerplate bcp14-tagged}

Checkpoint:
: An ECT node recording agent state before a consequential action,
  sufficient to restore the system to that state.

Circuit Breaker:
: A mechanism that stops an agent from propagating requests to a
  failing downstream agent, preventing cascading failures.

Rollback:
: The process of reverting an agent's actions and state to a
  previously recorded checkpoint.

Blast Radius:
: The set of agents and systems affected by a single failure.

Consequential Action:
: An action that modifies external state (network configuration,
  database records, API calls with side effects) such that
  reversal requires explicit effort.

# Execution Semantics {#execution}

## Topological Order

Tasks in the ECT DAG MUST execute in topological order: a task
MUST NOT begin execution until all tasks referenced by its ECT
`par` claims are in state `done`.

Two tasks MAY execute concurrently when neither is an ancestor of
the other in the DAG (no `par` path connects them). Orchestrators
SHOULD exploit this parallelism for performance.

Circular dependencies are prohibited. Agents MUST reject
ACP-DAG-HITL {{I-D.nennemann-agent-dag-hitl-safety}} delegation
DAGs containing cycles.

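The following non-normative sketch illustrates the ordering rule above. It assumes the `par` relationships have already been extracted into a plain mapping from task identifier to parent identifiers; the function name `execution_waves` and the task identifiers are hypothetical. It relies on Python's standard `graphlib` module, which raises `CycleError` for circular dependencies.

```python
from graphlib import CycleError, TopologicalSorter


def execution_waves(par: dict[str, list[str]]) -> list[list[str]]:
    """Group tasks into waves; every task in a wave has all parents done."""
    sorter = TopologicalSorter(par)  # maps task id -> ids listed in its `par`
    sorter.prepare()                 # raises CycleError on a circular dependency
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # these tasks may run concurrently
        waves.append(ready)
        sorter.done(*ready)                 # mark the wave as `done`
    return waves


# n2 and n3 both depend on n1 but not on each other.
par = {"n1": [], "n2": ["n1"], "n3": ["n1"], "n4": ["n2", "n3"]}
waves = execution_waves(par)

try:
    execution_waves({"a": ["b"], "b": ["a"]})
    cycle_rejected = False
except CycleError:
    cycle_rejected = True
```

In this example `n2` and `n3` land in the same wave because neither is an ancestor of the other, even though they share the ancestor `n1`.
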
## Workflow Boundary ECTs

When a workflow begins, the initiating agent MUST emit:

~~~json
{
  "exec_act": "atd:workflow_start",
  "ext": {
    "atd.wf_id": "wf-uuid",
    "atd.description": "BGP failover workflow",
    "atd.node_count": 5
  }
}
~~~
{: #fig-wf-start title="Workflow Start ECT"}

When the workflow reaches a terminal state (all leaf nodes
complete or any node failed with no rollback path), the
orchestrator MUST emit:

~~~json
{
  "exec_act": "atd:workflow_complete",
  "par": ["wf-start-ect-uuid"],
  "ext": {
    "atd.wf_id": "wf-uuid",
    "atd.terminal_status": "success",
    "atd.elapsed_s": 42
  }
}
~~~
{: #fig-wf-complete title="Workflow Complete ECT"}

Terminal status values: `success`, `partial`, `failed`,
`rolled_back`, `escalated`.

# Node States {#node-states}

Each task node in the ECT DAG has an implicit state derived from
subsequent ECT nodes:

- **pending**: A delegation node exists in ACP-DAG-HITL but no
  corresponding ECT has been emitted.
- **running**: An ECT matching the task type has been emitted
  but no completion or error ECT follows.
- **done**: A completion ECT (or the next `par`-linked ECT) exists.
- **failed**: An `atd:error` ECT references this node.
- **rolled_back**: An `atd:rollback_result` ECT references this
  node's checkpoint.
- **escalated**: The task failed and a human has been notified
  per HITL escalation rules.

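As a non-normative illustration, the state derivation above could be sketched as follows. Several points are assumptions rather than rules from this document: ECTs are represented as decoded dictionaries, `rolled_back` takes precedence over `failed`, any `par`-linked ECT whose `exec_act` does not start with `atd:` counts as completion, and the `escalated` state is left out because it depends on HITL records. The function name and the non-ATD `exec_act` value are hypothetical.

```python
def node_state(jti: str, ects: list[dict]) -> str:
    """Derive the implicit state of task `jti` from the ECTs seen so far."""
    if all(e["jti"] != jti for e in ects):
        return "pending"  # delegated, but no ECT emitted yet
    children = [e for e in ects if jti in e.get("par", [])]
    checkpoints = {e["jti"] for e in children
                   if e["exec_act"] == "atd:checkpoint"}
    for e in ects:
        rolled = e.get("ext", {}).get("atd.checkpoint_id") in checkpoints
        if e["exec_act"] == "atd:rollback_result" and rolled:
            return "rolled_back"
    if any(e["exec_act"] == "atd:error" for e in children):
        return "failed"
    # Assumption: a par-linked ECT that is not an ATD node marks completion.
    if any(not e["exec_act"].startswith("atd:") for e in children):
        return "done"
    return "running"


ects = [
    {"jti": "a1", "exec_act": "update_bgp_peer", "par": []},
    {"jti": "c1", "exec_act": "atd:checkpoint", "par": ["a1"]},
    {"jti": "e1", "exec_act": "atd:error", "par": ["a1"],
     "ext": {"atd.checkpoint_id": "c1"}},
    {"jti": "r1", "exec_act": "atd:rollback_result", "par": ["rq1"],
     "ext": {"atd.checkpoint_id": "c1"}},
]
```

Evaluating `node_state("a1", ects[:2])` sees only the action and its checkpoint, so the task is `running`; adding the error ECT makes it `failed`, and adding the rollback result makes it `rolled_back`.
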
# Checkpoint Mechanism {#checkpoints}

## Checkpoint Placement Policy

An ATD-compliant agent MUST create a checkpoint before any action
it classifies as consequential. The following actions are always
consequential and MUST be checkpointed:

1. Any modification to network device configuration.
2. Any write to a shared database or external data store.
3. Any API call with side effects (for example, HTTP methods that
   are not safe as defined in {{RFC9110}}).
4. Any delegation to another agent that will itself take
   consequential actions.

The following SHOULD be checkpointed:

1. Long-running computations (those expected to run longer than
   `atd.resource_timeout_s`).
2. Actions that cannot be verified without external state.

The following are exempt from checkpoint requirements:

1. Read-only queries.
2. Sending notifications with no side effects.
3. Internal state computations with no external observable effect.

## Checkpoint ECT Format

A checkpoint is an ECT with:

- `exec_act`: `"atd:checkpoint"`
- `par`: the ECT of the action being checkpointed

~~~json
{
  "jti": "ckpt-uuid",
  "exec_act": "atd:checkpoint",
  "par": ["action-ect-uuid"],
  "out_hash": "sha256-of-agent-state-snapshot",
  "ext": {
    "atd.reversible": true,
    "atd.rollback_uri": "https://agent-b.example.com/.well-known/atd/rollback",
    "atd.target": "router-07.example.com",
    "atd.description": "Update BGP peer config",
    "atd.ttl": 86400
  }
}
~~~
{: #fig-checkpoint title="Checkpoint ECT"}

The `atd.reversible` field MUST be present. If `false`, the agent
declares that this action cannot be automatically undone and
rollback requests MUST be escalated per the ACP-DAG-HITL
`unreachable_human` policy.

The `out_hash` provides integrity verification: the agent hashes
its state at checkpoint time so that rollback can verify it is
restoring to an authentic prior state.

Checkpoints MUST be stored for at least `atd.ttl` seconds. Agents
SHOULD store checkpoints in durable storage that survives restarts.

The rollback URI MUST be a well-known URI per {{RFC8615}} at the
path `/.well-known/atd/rollback`.

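A non-normative sketch of checkpoint construction and of the `out_hash` integrity check follows. It assumes the agent state is JSON-serializable, that a canonical serialization with sorted keys is hashed, and that the digest is hex-encoded; the actual encoding of `out_hash` is governed by the ECT specification, not by this sketch. The helper names `state_hash`, `make_checkpoint`, and `verify_restored` are hypothetical.

```python
import hashlib
import json
import uuid


def state_hash(state: dict) -> str:
    """SHA-256 over a canonical JSON form, so equal states hash equally."""
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_checkpoint(action_jti: str, state: dict, host: str,
                    reversible: bool, ttl_s: int = 86400) -> dict:
    """Build the claims of a checkpoint ECT (signing is out of scope here)."""
    return {
        "jti": str(uuid.uuid4()),
        "exec_act": "atd:checkpoint",
        "par": [action_jti],
        "out_hash": state_hash(state),
        "ext": {
            "atd.reversible": reversible,
            "atd.rollback_uri": f"https://{host}/.well-known/atd/rollback",
            "atd.ttl": ttl_s,
        },
    }


def verify_restored(checkpoint: dict, restored_state: dict) -> bool:
    """Check that a restored state matches what was checkpointed."""
    return state_hash(restored_state) == checkpoint["out_hash"]


ckpt = make_checkpoint("action-ect-uuid", {"peer": "192.0.2.1", "asn": 64512},
                       "agent-b.example.com", reversible=True)
```

Only the hash appears in the ECT; the snapshot itself stays in the agent's own checkpoint storage.
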
## Hierarchical Checkpoints

Agents MAY create hierarchical checkpoints where a parent groups
multiple child checkpoints from a multi-step operation. Rolling
back the parent rolls back all children. The parent checkpoint's
`par` array references all child checkpoint `jti` values.

## Checkpoint `exec_act` Table

| `exec_act` value | When emitted | Required `ext` fields |
|-----------------|-------------|----------------------|
| `atd:checkpoint` | Before consequential action | `atd.reversible`, `atd.rollback_uri`, `atd.ttl` |
| `atd:error` | On failure detection | `atd.severity`, `atd.error_type`, `atd.checkpoint_id` |
| `atd:circuit_open` | When error rate exceeds threshold | `atd.downstream_agent`, `atd.error_rate`, `atd.window_s` |
| `atd:circuit_close` | When probe succeeds in HALF-OPEN | `atd.downstream_agent`, `atd.cooldown_s` |
| `atd:rollback_request` | To initiate rollback | `atd.reason`, `atd.cascade` |
| `atd:rollback_result` | Rollback complete or failed | `atd.status`, `atd.checkpoint_id`, `atd.cascaded` |
| `atd:workflow_start` | Workflow begins | `atd.wf_id`, `atd.description` |
| `atd:workflow_complete` | Workflow terminal | `atd.wf_id`, `atd.terminal_status` |
{: #fig-actions title="ATD exec_act Values"}

# Error Signaling {#errors}

When an agent detects an error, it MUST emit an error ECT:

- `exec_act`: `"atd:error"`
- `par`: the ECT of the failed action

~~~json
{
  "jti": "error-uuid",
  "exec_act": "atd:error",
  "par": ["failed-action-ect-uuid"],
  "ext": {
    "atd.severity": "critical",
    "atd.error_type": "action_failed",
    "atd.description": "BGP session did not establish",
    "atd.checkpoint_id": "ckpt-uuid",
    "atd.upstream_errors": []
  }
}
~~~
{: #fig-error title="Error ECT"}

Severity levels (in increasing order): `info`, `warning`,
`error`, `critical`.

Error types: `action_failed`, `timeout`, `constraint_violation`,
`resource_exhausted`, `upstream_cascade`, `unknown`.

When an agent receives an error signal caused by an action it
initiated, it MUST either:

(a) Attempt automatic rollback of its checkpoint, or
(b) Escalate per ACP-DAG-HITL rules if the action was
    irreversible.

The `atd.upstream_errors` array allows agents to chain error
context, building a causal trace from symptom to root cause.

## HITL Escalation on Error

Error ECTs with severity `critical` SHOULD trigger HITL
escalation. Deployments SHOULD define ACP-DAG-HITL rules such
as:

~~~json
{
  "id": "r-critical-error",
  "trigger": {
    "kind": "keyword_match",
    "op": "eq",
    "value": "critical",
    "input_ref": "atd.severity"
  },
  "required_role": "operator:oncall",
  "action": "escalate",
  "allow_override": true,
  "override_action": "continue"
}
~~~
{: #fig-error-hitl title="HITL Rule for Critical Errors"}

# Circuit Breaker Pattern {#circuit-breaker}

Each agent MUST implement a circuit breaker for every downstream
agent it communicates with. The circuit breaker has three states:

CLOSED (normal):
: Requests flow through. The agent tracks the error rate over a
  sliding window (default: 60 seconds).

OPEN (failure detected):
: When the error rate exceeds a threshold (default: 50%), the
  breaker opens. All requests are immediately rejected. The
  agent MUST emit a circuit breaker open ECT:

~~~json
{
  "exec_act": "atd:circuit_open",
  "ext": {
    "atd.downstream_agent": "spiffe://example.com/agent/b",
    "atd.error_rate": 0.75,
    "atd.window_s": 60
  }
}
~~~
{: #fig-circuit-open title="Circuit Breaker Open ECT"}

HALF-OPEN (recovery probe):
: After a cooldown period (default: 30 seconds), the breaker allows
  one probe request. If it succeeds, the breaker returns to CLOSED
  and the agent MUST emit a circuit breaker close ECT:

~~~json
{
  "exec_act": "atd:circuit_close",
  "ext": {
    "atd.downstream_agent": "spiffe://example.com/agent/b",
    "atd.cooldown_s": 30
  }
}
~~~
{: #fig-circuit-close title="Circuit Breaker Close ECT"}

If the probe fails, the breaker returns to OPEN with the cooldown
doubled (exponential backoff, capped at 300 seconds).

## Circuit Breaker State Machine

~~~
         error_rate > threshold
CLOSED ─────────────────────────► OPEN
  ▲                                 │
  │ probe success                   │ cooldown expires
  │                                 ▼
  └────────────────────────── HALF-OPEN
        probe failure ──► OPEN (cooldown * 2)
~~~
{: #fig-fsm title="Circuit Breaker State Machine"}

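The state machine can be sketched, non-normatively, as a small class. The defaults mirror the values stated in this document (50% threshold, 60 second window, 30 second cooldown, 300 second cap). Everything else is an assumption of the sketch: the class and method names, the injectable `clock`, resetting the cooldown to its base value on close, and the absence of a minimum sample count before the error rate is evaluated. Emission of the `atd:circuit_open` and `atd:circuit_close` ECTs is indicated only by comments.

```python
import time


class CircuitBreaker:
    def __init__(self, threshold=0.5, window_s=60.0, cooldown_s=30.0,
                 max_cooldown_s=300.0, clock=time.monotonic):
        self.threshold, self.window_s = threshold, window_s
        self.base_cooldown_s = self.cooldown_s = cooldown_s
        self.max_cooldown_s, self.clock = max_cooldown_s, clock
        self.state, self.opened_at = "CLOSED", 0.0
        self.events = []  # (timestamp, ok) pairs inside the sliding window

    def allow(self) -> bool:
        """Return True if a request may be sent downstream right now."""
        expired = self.clock() - self.opened_at >= self.cooldown_s
        if self.state == "OPEN" and expired:
            self.state = "HALF-OPEN"  # cooldown expired: admit one probe
            return True
        return self.state == "CLOSED"

    def record(self, ok: bool) -> None:
        """Record the outcome of a request that was allowed through."""
        now = self.clock()
        if self.state == "HALF-OPEN":
            if ok:  # probe succeeded: emit atd:circuit_close here
                self.state, self.cooldown_s = "CLOSED", self.base_cooldown_s
                self.events.clear()
            else:   # probe failed: back to OPEN with doubled cooldown
                self.cooldown_s = min(self.cooldown_s * 2, self.max_cooldown_s)
                self.state, self.opened_at = "OPEN", now
            return
        self.events = [(t, r) for t, r in self.events
                       if now - t <= self.window_s]
        self.events.append((now, ok))
        failures = sum(1 for _, r in self.events if not r)
        if failures / len(self.events) > self.threshold:
            self.state, self.opened_at = "OPEN", now  # emit atd:circuit_open
```

A caller would check `allow()` before each downstream request and call `record()` with the outcome. Passing a fake `clock` makes the transitions easy to exercise in tests without waiting for real cooldowns.
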
## Coordinated Circuit Breaking

When multiple agents share a downstream dependency, each maintains
its own circuit breaker independently. However, agents SHOULD
publish circuit breaker state via their ECT stream so peers can
observe the signal.

If an orchestrator observes N circuit breakers opening for the
same downstream agent within a short window, it SHOULD initiate
a HITL escalation rather than allowing N parallel recovery probes.

## Circuit Breaker Policy Configuration

Circuit breaker thresholds can be configured as ACP-DAG-HITL
node constraints:

~~~json
{
  "constraints": {
    "atd.circuit_threshold": 0.5,
    "atd.circuit_window_s": 60
  }
}
~~~
{: #fig-circuit-policy title="Circuit Breaker Policy"}

# Rollback Protocol {#rollback}

## Basic Rollback

A rollback is initiated by emitting a rollback request ECT and
sending an HTTP POST to the target agent's rollback endpoint:

~~~
POST /.well-known/atd/rollback HTTP/1.1
Host: agent-b.example.com
Content-Type: application/json
Execution-Context: <rollback-request-ect>
~~~

The rollback request ECT carries:

- `exec_act`: `"atd:rollback_request"`
- `par`: the checkpoint ECT to roll back to

~~~json
{
  "exec_act": "atd:rollback_request",
  "par": ["ckpt-uuid"],
  "ext": {
    "atd.reason": "Upstream action caused cascading failure",
    "atd.cascade": true
  }
}
~~~
{: #fig-rollback-req title="Rollback Request ECT"}

When `atd.cascade` is `true`, the receiving agent MUST also
initiate rollback of any downstream checkpoints created as a
consequence of the checkpointed action.

The agent MUST respond with a rollback result ECT:

~~~json
{
  "exec_act": "atd:rollback_result",
  "par": ["rollback-request-uuid"],
  "out_hash": "sha256-of-restored-state",
  "ext": {
    "atd.status": "completed",
    "atd.checkpoint_id": "ckpt-uuid",
    "atd.cascaded": [
      {"agent": "spiffe://example.com/agent/c", "status": "completed"},
      {"agent": "spiffe://example.com/agent/d", "status": "escalated"}
    ]
  }
}
~~~
{: #fig-rollback-result title="Rollback Result ECT"}

Status values: `completed`, `partial`, `escalated`, `failed`.

`escalated` means the action was irreversible and a human operator
has been notified per ACP-DAG-HITL `unreachable_human` policy.

## Partial Rollback and Blast Radius Containment

When a failure occurs in the middle of a DAG, it is often
undesirable to roll back the entire workflow. ATD defines
partial rollback as rolling back the failed subgraph while
preserving completed sibling branches.

Partial rollback MUST only proceed if:

1. The checkpoints to be rolled back are in the same workflow
   (`atd.wf_id`).
2. No completed sibling task depends on the output of the
   failed task (verified by walking the DAG forward from the
   checkpoint).

The blast radius is the set of agents holding checkpoints that
are descendants of the failed node. Orchestrators SHOULD
compute blast radius before initiating cascade rollback to
avoid unnecessary disruption.

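A non-normative sketch of the blast radius computation follows: a breadth-first walk forward along `par` edges from the failed node, collecting the agents that emitted descendant checkpoints. It assumes decoded ECTs are available as dictionaries and that the `iss` claim identifies the emitting agent; the function name and the example identifiers are hypothetical.

```python
from collections import deque


def blast_radius(failed_jti: str, ects: list[dict]) -> set[str]:
    """Agents holding checkpoints that descend from the failed node."""
    children: dict[str, list[dict]] = {}
    for ect in ects:
        for parent in ect.get("par", []):
            children.setdefault(parent, []).append(ect)
    agents, seen, queue = set(), {failed_jti}, deque([failed_jti])
    while queue:  # walk forward from the failed node
        for child in children.get(queue.popleft(), []):
            if child["jti"] in seen:
                continue
            seen.add(child["jti"])
            queue.append(child["jti"])
            if child["exec_act"] == "atd:checkpoint":
                agents.add(child["iss"])  # assumption: `iss` names the agent
    return agents


A, B, C, D = (f"spiffe://example.com/agent/{x}" for x in "abcd")
ects = [
    {"jti": "a1", "iss": A, "exec_act": "update_bgp_peer", "par": []},
    {"jti": "c1", "iss": B, "exec_act": "atd:checkpoint", "par": ["a1"]},
    {"jti": "a2", "iss": B, "exec_act": "push_config", "par": ["c1"]},
    {"jti": "c2", "iss": C, "exec_act": "atd:checkpoint", "par": ["a2"]},
    {"jti": "s1", "iss": D, "exec_act": "atd:checkpoint", "par": ["root"]},
]
radius = blast_radius("a1", ects)
```

Agent `d` holds a checkpoint on a sibling branch that does not descend from `a1`, so it stays outside the computed radius and is preserved by a partial rollback.
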
## Rollback Timeout and Escalation

Every rollback request has an implicit timeout of `atd.ttl / 2`
seconds, derived from the original checkpoint's `atd.ttl`. If the
rollback has not completed within that time, the agent MUST:

1. Emit an `atd:error` with `atd.error_type: "timeout"` and an
   `atd.description` noting the rollback timeout.
2. Escalate to HITL per {{hitl-escalation}}.

Agents MUST implement idempotent rollback: receiving the same
rollback request ECT `jti` twice MUST return the same result.

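One way to meet the idempotency requirement is to record each result under the request `jti` and replay it on duplicates. The sketch below is non-normative and makes several assumptions: requests arrive as decoded dictionaries, results are kept in an in-memory dictionary (a real agent would need storage that survives restarts), and `restore` is a hypothetical callable that performs the rollback and returns a status string.

```python
class RollbackHandler:
    def __init__(self, restore):
        self._restore = restore  # hypothetical: checkpoint id -> status string
        self._results: dict[str, dict] = {}

    def handle(self, request: dict) -> dict:
        """Process a rollback request ECT, replaying results for duplicates."""
        jti = request["jti"]
        if jti in self._results:  # duplicate: return the recorded result
            return self._results[jti]
        checkpoint_id = request["par"][0]
        result = {
            "exec_act": "atd:rollback_result",
            "par": [jti],
            "ext": {
                "atd.status": self._restore(checkpoint_id),
                "atd.checkpoint_id": checkpoint_id,
                "atd.cascaded": [],
            },
        }
        self._results[jti] = result
        return result


calls = []


def fake_restore(checkpoint_id: str) -> str:
    calls.append(checkpoint_id)  # count how often the rollback really runs
    return "completed"


handler = RollbackHandler(fake_restore)
request = {"jti": "rb-1", "exec_act": "atd:rollback_request",
           "par": ["ckpt-uuid"]}
first = handler.handle(request)
second = handler.handle(request)
```

The second call returns the stored result without invoking `restore` again, so a retried request cannot roll the same checkpoint back twice.
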
## Rollback Authorization {#rollback-authz}

Agents SHOULD authorize a rollback request only when the requester
belongs to the same workflow (`wid`) and has checkpoint lineage in
the DAG. Rollback requests from outside the originating workflow
MUST be rejected with HTTP 403.

# Interaction with HITL {#hitl-escalation}

ATD escalates to HITL in the following scenarios:

1. **Irreversible action failure**: An error ECT whose referenced
   checkpoint has `atd.reversible: false` MUST trigger
   HITL Level 2 (approval required) per the companion HITL
   specification.

2. **Rollback failure**: A rollback result with `atd.status:
   "failed"` MUST trigger HITL Level 3 (STOP) on the workflow.

3. **Cascaded rollback of critical nodes**: When an `atd.cascade:
   true` rollback propagates to a node with `atd.severity:
   critical`, HITL SHOULD be triggered at Level 1 (PAUSE)
   to allow human review before proceeding.

4. **Circuit breaker permanent open**: If a circuit breaker
   re-opens after 3 successive failed HALF-OPEN probes, HITL
   Level 2 escalation SHOULD be triggered.

ATD-to-HITL escalation is recorded as an ECT linked to both
the triggering error ECT and the HITL override ECT, preserving
the causal chain in the audit DAG.

# Resource Hints {#resources}

## Resource Claim Format

Agents MAY declare resource requirements as ACP-DAG-HITL node
constraints:

~~~json
{
  "constraints": {
    "atd.resource_cpu": "2",
    "atd.resource_memory_mb": 4096,
    "atd.resource_timeout_s": 300,
    "atd.resource_priority": "high",
    "atd.resource_gpu": "0",
    "atd.resource_network_mbps": 100
  }
}
~~~
{: #fig-resources title="Resource Hints as Node Constraints"}

## Priority Levels

The `atd.resource_priority` field MUST be one of: `critical`,
`high`, `normal`, `low`. Orchestrators SHOULD map these to
scheduling priority classes (e.g., Kubernetes QoS classes:
`critical` → Guaranteed, `high`/`normal` → Burstable, `low`
→ BestEffort).

## Fair-Share Scheduling

When multiple agents compete for a shared resource pool,
orchestrators SHOULD implement fair-share scheduling:

1. Each active workflow receives an equal base allocation.
2. Unused allocation from `low` priority agents is redistributed
   to `high`/`critical` agents within the same scheduling cycle.
3. Starvation prevention: `low` priority agents MUST be scheduled
   within a configurable maximum wait (default: 300s).

## Unsatisfiable Resource Hints

Resource hints are advisory; agents MUST NOT depend on them for
correctness. When resource hints cannot be satisfied:

- If `atd.resource_priority` is `critical`: the orchestrator SHOULD
  pre-empt lower-priority tasks.
- If `critical` tasks still cannot be scheduled within 60s:
  emit `atd:error` with `atd.error_type: "resource_exhausted"` and
  escalate to HITL.
- All other priorities: proceed with degraded resources; log
  a warning via `atd:error` with `atd.severity: "warning"`.

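The three cases above can be read as a small decision function. The following non-normative sketch assumes the orchestrator tracks how long a task has waited; the function name, the `action` labels, the `error` severity in the escalation case, and the `resource_exhausted` type in the degraded case are choices of this sketch, not requirements of this document.

```python
def on_unsatisfied(priority: str, waited_s: float,
                   critical_grace_s: float = 60.0) -> dict:
    """Decide what to do when a task's resource hints cannot be met."""
    if priority == "critical":
        if waited_s < critical_grace_s:
            # First try to free capacity by pre-empting lower priorities.
            return {"action": "preempt_lower_priority"}
        return {
            "action": "escalate_hitl",
            "ect": {"exec_act": "atd:error",
                    "ext": {"atd.severity": "error",  # assumed severity
                            "atd.error_type": "resource_exhausted"}},
        }
    # All other priorities continue with whatever resources are available.
    return {
        "action": "proceed_degraded",
        "ect": {"exec_act": "atd:error",
                "ext": {"atd.severity": "warning",
                        "atd.error_type": "resource_exhausted"}},  # assumed
    }


decision = on_unsatisfied("critical", waited_s=75.0)
```

Because the hints are advisory, the degraded path still runs the task; only the `critical` path can end in escalation.
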
# Optional Declarative Workflow Format {#workflow-format}

To support pre-run planning and tooling, ATD defines an optional
declarative workflow descriptor. This is a planning artifact
only; at runtime it is realized as ECTs per this specification.

~~~json
{
  "wf_id": "bgp-failover-v2",
  "description": "BGP peer failover with validation",
  "nodes": [
    {
      "id": "n1",
      "label": "validate-config",
      "reversible": true,
      "hitl_required": false,
      "resource_hints": {
        "priority": "normal",
        "timeout_s": 30
      }
    },
    {
      "id": "n2",
      "label": "update-bgp-peer",
      "reversible": true,
      "hitl_required": true,
      "resource_hints": {
        "priority": "critical",
        "timeout_s": 120
      }
    },
    {
      "id": "n3",
      "label": "verify-session",
      "reversible": false,
      "hitl_required": false,
      "resource_hints": {
        "priority": "high",
        "timeout_s": 60
      }
    }
  ],
  "edges": [
    {"from": "n1", "to": "n2"},
    {"from": "n2", "to": "n3"}
  ]
}
~~~
{: #fig-workflow title="Declarative Workflow Descriptor"}

The workflow descriptor media type is
`application/atd-workflow+json`. Orchestrators MAY store and
version workflow descriptors independently of their ECT runtime
realization.

The `hitl_required` field signals to the HITL system that this
node requires an approval gate as defined in the companion HITL
specification.

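Tooling that consumes the descriptor will typically want to validate it before a run. The following non-normative sketch checks that every edge references a declared node, rejects cycles, and returns one valid execution order. The function name `plan` is hypothetical, and only the `nodes` and `edges` members shown in the example above are assumed.

```python
from graphlib import CycleError, TopologicalSorter


def plan(descriptor: dict) -> list[str]:
    """Validate a workflow descriptor and return a topological order."""
    ids = {node["id"] for node in descriptor["nodes"]}
    preds: dict[str, set[str]] = {node_id: set() for node_id in ids}
    for edge in descriptor["edges"]:
        if edge["from"] not in ids or edge["to"] not in ids:
            raise ValueError(f"edge references unknown node: {edge}")
        preds[edge["to"]].add(edge["from"])  # `to` waits for `from`
    try:
        return list(TopologicalSorter(preds).static_order())
    except CycleError as exc:
        raise ValueError("workflow descriptor contains a cycle") from exc


descriptor = {
    "wf_id": "bgp-failover-v2",
    "nodes": [{"id": "n1"}, {"id": "n2"}, {"id": "n3"}],
    "edges": [{"from": "n1", "to": "n2"}, {"from": "n2", "to": "n3"}],
}
order = plan(descriptor)
```

For the linear example the order is unique; for descriptors with parallel branches `static_order()` returns one of several valid orders.
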
# Security Considerations

## Rollback Authorization

Rollback requests are high-privilege operations. Agents MUST
authenticate rollback requests using the ECT identity binding
(L2/L3). The rollback endpoint MUST require mutual TLS or a
signed JWT from an agent within the same workflow DAG.

Agents SHOULD authorize a rollback request only when the requester
is an ancestor, in the ECT DAG, of the checkpoint being rolled
back.

## Checkpoint Confidentiality

Checkpoint data may contain sensitive system state (API keys,
session tokens, configuration). Agents:

- MUST encrypt stored checkpoints at rest.
- MUST reference checkpoint state in ECTs only via `out_hash`.
- MUST NOT include checkpoint contents in error ECTs.

## False Error Injection

A malicious agent could send false `atd:error` ECTs to trigger
unnecessary rollbacks and disrupt workflows. Mitigation:

- Agents SHOULD verify that error ECTs reference valid `par`
  values within their own workflow DAG (`wid` claim).
- Rollback MUST require authentication (see {{rollback-authz}}).
- L2/L3 ECT signing prevents unauthenticated error injection.

## Checkpoint Flooding

An adversary could exhaust checkpoint storage by triggering
many checkpoints. Mitigation:

- Agents SHOULD enforce a maximum checkpoint count per workflow.
- Expired checkpoints (past `atd.ttl`) MUST be purged.
- Checkpoint creation rate SHOULD be rate-limited per calling
  workflow.

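The first two mitigations can be sketched, non-normatively, as a bounded store that purges expired entries before admitting new ones. The class name, the default limit of 100, and the caller-supplied `now` timestamps are assumptions of this sketch; rate limiting is not shown, and only checkpoint metadata is tracked here.

```python
class CheckpointStore:
    def __init__(self, max_per_workflow: int = 100):
        self.max_per_workflow = max_per_workflow  # hypothetical limit
        # Maps wf_id -> {checkpoint jti: expiry timestamp in seconds}.
        self._items: dict[str, dict[str, float]] = {}

    def purge(self, now: float) -> None:
        """Drop every checkpoint whose `atd.ttl` has elapsed."""
        for wf_id in list(self._items):
            live = {jti: exp for jti, exp in self._items[wf_id].items()
                    if exp > now}
            if live:
                self._items[wf_id] = live
            else:
                del self._items[wf_id]

    def add(self, wf_id: str, jti: str, ttl_s: float, now: float) -> bool:
        """Admit a checkpoint unless the workflow is at its limit."""
        self.purge(now)
        bucket = self._items.get(wf_id, {})
        if len(bucket) >= self.max_per_workflow:
            return False  # caller should refuse the consequential action
        bucket[jti] = now + ttl_s
        self._items[wf_id] = bucket
        return True

    def count(self, wf_id: str) -> int:
        return len(self._items.get(wf_id, {}))


store = CheckpointStore(max_per_workflow=2)
accepted = [store.add("wf-1", f"ckpt-{i}", ttl_s=10, now=0.0)
            for i in range(3)]
```

Purging on every `add` keeps the count accurate without a background task; a deployment with many workflows might prefer a periodic sweep instead.
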
## Circuit Breaker State Leakage

The `atd:circuit_open` ECT reveals system health topology. The
audit ledger SHOULD enforce access controls: only agents within
the same workflow or authorized operators SHOULD be able to query
circuit breaker history.

# IANA Considerations

This document requests registration of the following values in
the AEM Ecosystem Extension Registry established by
draft-aem-agent-ecosystem-model:

## `exec_act` Values

| Value | Description | Reference |
|-------|-------------|-----------|
| `atd:checkpoint` | State snapshot before consequential action | This document |
| `atd:error` | Error signal with severity and type | This document |
| `atd:circuit_open` | Circuit breaker opened to downstream agent | This document |
| `atd:circuit_close` | Circuit breaker returned to CLOSED state | This document |
| `atd:rollback_request` | Initiate rollback to named checkpoint | This document |
| `atd:rollback_result` | Result of rollback attempt | This document |
| `atd:workflow_start` | Workflow began execution | This document |
| `atd:workflow_complete` | Workflow reached terminal state | This document |
{: #fig-iana-actions title="ATD exec_act Registrations"}

## Well-Known URI

This document requests registration of `atd/rollback` as a
well-known URI suffix per {{RFC8615}}.

## Media Type

This document requests registration of
`application/atd-workflow+json` for the declarative workflow
descriptor format defined in {{workflow-format}}.

--- back

# Acknowledgments
{:numbered="false"}

ATD builds on ECT {{I-D.nennemann-wimse-ect}} for execution
evidence and ACP-DAG-HITL {{I-D.nennemann-agent-dag-hitl-safety}}
for delegation policy. The circuit breaker pattern is adapted
from microservice architecture best practices. The declarative
workflow format is inspired by workflow description languages
(BPEL, BPMN) adapted for lightweight agent coordination.