feat: add draft data, gap analysis report, and workspace config
@@ -0,0 +1,289 @@
|
||||
---
|
||||
title: "Agent Ecosystem Model (AEM): Architecture and Terminology"
|
||||
abbrev: "AEM"
|
||||
category: info
|
||||
docname: draft-aem-agent-ecosystem-model-00
|
||||
submissiontype: IETF
|
||||
number:
|
||||
date:
|
||||
v: 3
|
||||
area: "OPS"
|
||||
workgroup: "NMOP"
|
||||
keyword:
|
||||
- agent ecosystem
|
||||
- DAG
|
||||
- HITL
|
||||
- agentic workflows
|
||||
|
||||
author:
|
||||
-
|
||||
fullname: TBD
|
||||
organization: Independent
|
||||
email: placeholder@example.com
|
||||
|
||||
normative:
|
||||
RFC2119:
|
||||
RFC8174:
|
||||
|
||||
informative:
|
||||
I-D.nennemann-wimse-ect:
|
||||
title: "Execution Context Tokens for Distributed Agentic Workflows"
|
||||
target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/
|
||||
I-D.nennemann-agent-dag-hitl-safety:
|
||||
title: "Agent Context Policy Token: DAG Delegation with Human Override"
|
||||
target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/
|
||||
|
||||
--- abstract
|
||||
|
||||
This document defines the Agent Ecosystem Model (AEM), a shared
|
||||
architecture and terminology for building interoperable agent
|
||||
systems that incorporate DAG-based execution, human-in-the-loop
|
||||
safety, and graduated assurance levels. AEM is not a protocol.
|
||||
It is a reference model that establishes common vocabulary and
|
||||
architectural concepts so that companion specifications (ATD,
|
||||
HITL, AEPB, APAE) and implementors share a consistent frame of
|
||||
reference. The model builds on Execution Context Tokens (ECT)
|
||||
for execution evidence and ACP-DAG-HITL for delegation policy.
|
||||
|
||||
--- middle
|
||||
|
||||
# Introduction
|
||||
|
||||
The IETF AI/agent landscape includes over 260 drafts proposing
|
||||
protocols for agent communication, identity, safety, and
|
||||
operations. These drafts share many implicit concepts — tasks,
|
||||
delegation, workflows, safety checks — but use inconsistent
|
||||
terminology and incompatible models.
|
||||
|
||||
AEM provides a single reference architecture so that:
|
||||
|
||||
- Companion drafts (ATD, HITL, AEPB, APAE) share vocabulary.
|
||||
- Implementors understand how the pieces compose.
|
||||
- New proposals can position themselves within an existing model
|
||||
rather than inventing another one.
|
||||
|
||||
AEM is deliberately not a protocol. It defines no wire formats,
|
||||
no endpoints, and no new token types. It is the map; the
|
||||
companion drafts are the territory.
|
||||
|
||||
## Design Principles
|
||||
|
||||
1. **ECT is the execution backbone.** All significant agent
|
||||
actions produce Execution Context Tokens
|
||||
{{I-D.nennemann-wimse-ect}}. The ecosystem does not define a
|
||||
second DAG or audit format.
|
||||
|
||||
2. **ACP-DAG-HITL is the policy backbone.**
|
||||
{{I-D.nennemann-agent-dag-hitl-safety}} defines delegation
|
||||
DAGs and HITL rules. The ecosystem extends these with
|
||||
operational semantics, not replacement structures.
|
||||
|
||||
3. **Same model, different assurance.** The architecture works
|
||||
identically from a relaxed K8s dev cluster (ECT L1) to a
|
||||
regulated healthcare environment (ECT L3 with audit ledger).
|
||||
|
||||
4. **Protocol-agnostic.** The ecosystem sits above any A2A
|
||||
protocol. Agents may speak different protocols and still
|
||||
participate through translation.
|
||||
|
||||
# Conventions and Definitions
|
||||
|
||||
{::boilerplate bcp14-tagged}
|
||||
|
||||
# Terminology {#terminology}
|
||||
|
||||
Agent:
|
||||
: An autonomous software entity that performs tasks, makes
|
||||
decisions, and communicates with other agents or humans.
|
||||
|
||||
Task:
|
||||
: A discrete unit of work performed by an agent, recorded as a
|
||||
single ECT node.
|
||||
|
||||
Workflow:
|
||||
: A set of tasks linked by dependencies, forming a DAG.
|
||||
Identified by the ECT `wid` claim.
|
||||
|
||||
DAG (Directed Acyclic Graph):
|
||||
: The execution graph formed by ECT parent references (`par`
|
||||
claims). Also used in ACP-DAG-HITL for delegation structure.
|
||||
|
||||
Checkpoint:
|
||||
: An ECT node recording agent state before a consequential
|
||||
action, enabling rollback.
|
||||
|
||||
HITL Point:
|
||||
: A position in the workflow where human intervention is
|
||||
required or available, governed by ACP-DAG-HITL rules.
|
||||
|
||||
Override:
|
||||
: A human-initiated command that alters an agent's autonomous
|
||||
operation, taking precedence over the agent's own decisions.
|
||||
|
||||
Trust Score:
|
||||
: A floating-point value in \[0.0, 1.0\] representing one
|
||||
agent's assessed reliability of another.
|
||||
|
||||
Protocol Binding:
|
||||
: The mapping between ecosystem semantics and a specific A2A
|
||||
communication protocol.
|
||||
|
||||
Assurance Level:
|
||||
: The degree of cryptographic and audit protection applied to
|
||||
ECTs: L1 (unsigned JSON), L2 (signed JWT), L3 (signed +
|
||||
audit ledger). Defined by {{I-D.nennemann-wimse-ect}}.
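
The Task, Workflow, and DAG entries above can be made concrete with a short sketch. The Python below is illustrative only. It assumes each ECT is available as a decoded dictionary, that a node identifier is carried in a `jti` claim, and that `par` is a list of parent identifiers. Only `wid` and `par` are named in this document; the rest is an assumption of the sketch, and {{I-D.nennemann-wimse-ect}} remains the authority on claim shapes.

```python
from collections import defaultdict


def validate_workflow_dag(ects):
    """Check that ECT records sharing one `wid` form a DAG via `par`.

    `ects` is a list of decoded ECT payloads (dicts). The `jti` node
    identifier is an assumption of this sketch.
    Returns the node ids in a valid execution (topological) order.
    """
    nodes = {e["jti"]: e for e in ects}
    if len(nodes) != len(ects):
        raise ValueError("duplicate node identifier")
    if len({e["wid"] for e in ects}) > 1:
        raise ValueError("records belong to more than one workflow")

    children = defaultdict(list)
    pending = {}
    for node_id, ect in nodes.items():
        parents = ect.get("par", [])
        for parent in parents:
            if parent not in nodes:
                raise ValueError(f"unknown parent {parent!r}")
            children[parent].append(node_id)
        pending[node_id] = len(parents)

    # Kahn's algorithm: emit a node once all of its parents are emitted.
    ready = sorted(n for n, count in pending.items() if count == 0)
    order = []
    while ready:
        current = ready.pop(0)
        order.append(current)
        for child in children[current]:
            pending[child] -= 1
            if pending[child] == 0:
                ready.append(child)
    if len(order) != len(nodes):
        raise ValueError("cycle detected in `par` references")
    return order
```

A node with an empty `par` list is treated as a workflow root. Any cycle leaves nodes unemitted, which is how the sketch detects that the graph is not acyclic.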
|
||||
|
||||
# Architectural Model {#architecture}
|
||||
|
||||
The ecosystem is organized in four layers:
|
||||
|
||||
~~~
┌─────────────────────────────────────────────────────┐
│ Policy Layer                                        │
│  ACP-DAG-HITL: delegation DAG, HITL rules,          │
│                node constraints, trust thresholds   │
├─────────────────────────────────────────────────────┤
│ Semantics Layer                                     │
│  ATD: execution order, checkpoints, rollback,       │
│       circuit breakers, resource hints              │
│  HITL: override levels, approval gates, escalation  │
│  AEPB: capability ads, negotiation, translation     │
│  APAE: trust scoring, behavior verification,        │
│        provenance, assurance profiles               │
├─────────────────────────────────────────────────────┤
│ Evidence Layer                                      │
│  ECT: signed DAG of execution records (L1/L2/L3)    │
│       inp_hash/out_hash, ext claims, audit ledger   │
├─────────────────────────────────────────────────────┤
│ Identity Layer                                      │
│  WIMSE / X.509 / OAuth / JWK: agent identity        │
└─────────────────────────────────────────────────────┘
~~~
{: #fig-stack title="Ecosystem Layer Stack"}
|
||||
|
||||
Identity Layer:
|
||||
: Answers "who is this agent?" AEM does not define identity
|
||||
mechanisms; it assumes WIMSE, X.509, OAuth, or equivalent.
|
||||
|
||||
Evidence Layer:
|
||||
: Answers "what did this agent do?" ECT provides per-task
records, signed at L2 and above, linked into a DAG, with three
assurance levels.
|
||||
|
||||
Semantics Layer:
|
||||
: Answers "what does it mean and what to do about it?" The
|
||||
four companion drafts define operational semantics on top of
|
||||
ECT:
|
||||
|
||||
- **ATD** (Agent Task DAG): execution order, checkpoints,
|
||||
rollback, circuit breakers, resource hints.
|
||||
- **HITL** (Human-in-the-Loop): override levels, approval
|
||||
gates, escalation paths, explainability.
|
||||
- **AEPB** (Agent Ecosystem Protocol Binding): capability
|
||||
advertisement, protocol negotiation, translation gateways,
|
||||
agent lifecycle.
|
||||
- **APAE** (Assurance Profiles): dynamic trust scoring,
|
||||
behavior verification, data provenance, assurance profiles.
|
||||
|
||||
Policy Layer:
|
||||
: Answers "what's allowed?" ACP-DAG-HITL defines delegation
|
||||
constraints and HITL trigger rules. Companion drafts extend
|
||||
`constraints` with protocol-specific fields (trust thresholds,
|
||||
checkpoint policies, protocol restrictions).
|
||||
|
||||
## How ECT Extensions Work
|
||||
|
||||
Each companion draft defines `ext` claim namespaces on ECT:
|
||||
|
||||
| Draft | `ext` prefix | Example claims |
|-------|-------------|----------------|
| ATD | `atd.*` | `atd.reversible`, `atd.severity`, `atd.circuit_state` |
| HITL | `hitl.*` | `hitl.level`, `hitl.operator_id`, `hitl.prior_state` |
| AEPB | `aepb.*` | `aepb.source_protocol`, `aepb.dest_protocol` |
| APAE | `apae.*` | `apae.trust_score`, `apae.confidence`, `apae.hops` |
{: #fig-ext title="ECT Extension Namespaces"}
|
||||
|
||||
## How Policy Extensions Work
|
||||
|
||||
Each companion draft defines `constraints` fields on
|
||||
ACP-DAG-HITL DAG nodes:
|
||||
|
||||
| Draft | Constraint fields |
|-------|------------------|
| ATD | `atd.checkpoint_policy`, `atd.circuit_threshold` |
| HITL | (uses HITL rules directly) |
| AEPB | `aepb.allowed_protocols`, `aepb.max_translation_hops` |
| APAE | `apae.min_trust`, `apae.min_confidence`, `apae.assurance_profile` |
{: #fig-constraints title="ACP-DAG-HITL Node Constraint Extensions"}
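
As a rough illustration of how a delegating agent might consume these fields, the sketch below checks a proposed delegation against three of the constraint names from the table. The shape of the `constraints` object, the value types, and the `candidate` dictionary are assumptions made for this example; the companion drafts and {{I-D.nennemann-agent-dag-hitl-safety}} define the real semantics.

```python
def check_delegation(constraints, candidate):
    """Return the names of violated constraints (empty list means allowed).

    `constraints` is the `constraints` object of a policy node and
    `candidate` describes the proposed delegation. Both shapes are
    assumptions of this sketch.
    """
    violations = []

    min_trust = constraints.get("apae.min_trust")
    if min_trust is not None and candidate["trust_score"] < min_trust:
        violations.append("apae.min_trust")

    allowed = constraints.get("aepb.allowed_protocols")
    if allowed is not None and candidate["protocol"] not in allowed:
        violations.append("aepb.allowed_protocols")

    max_hops = constraints.get("aepb.max_translation_hops")
    if max_hops is not None and candidate["translation_hops"] > max_hops:
        violations.append("aepb.max_translation_hops")

    return violations
```

An absent field is treated as "no restriction" here. A stricter deployment could instead reject nodes that omit a field it considers mandatory.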
|
||||
|
||||
# Assurance as an Orthogonal Axis {#assurance}
|
||||
|
||||
The entire semantics layer operates identically at all ECT
|
||||
assurance levels. The DAG structure, HITL processing, trust
|
||||
scoring, and protocol translation are the same whether the ECT
|
||||
is unsigned JSON (L1) or a ledger-committed signed JWT (L3).
|
||||
|
||||
What changes across levels is the security envelope:
|
||||
|
||||
| Property | L1 | L2 | L3 |
|----------|----|----|-----|
| Structured execution records | Yes | Yes | Yes |
| DAG validation | Yes | Yes | Yes |
| Non-repudiation | No | Yes | Yes |
| Tamper detection | Transport only | Signature | Signature + ledger |
| Regulatory audit trail | No | No | Yes |
{: #fig-assurance title="Assurance Level Properties"}
|
||||
|
||||
A deployment MAY use different levels for different workflows.
|
||||
Internal dev pipelines might use L1; cross-org integrations L2;
|
||||
regulated clinical workflows L3.
|
||||
|
||||
# Protocol Agnosticism {#agnosticism}
|
||||
|
||||
The ecosystem layer sits above any A2A communication protocol.
|
||||
Agents communicate via their native protocol (A2A, MCP, SLIM,
|
||||
uACP, etc.) while the Execution-Context HTTP header
|
||||
{{I-D.nennemann-wimse-ect}} carries ECTs alongside protocol
|
||||
messages.
|
||||
|
||||
When two agents speak different protocols, a translation gateway
|
||||
(defined by AEPB) converts between protocols while preserving
|
||||
ECT DAG continuity. The translation hop is itself an ECT node,
|
||||
so the cross-protocol path is one auditable DAG.
|
||||
|
||||
# Companion Draft Summary {#companions}
|
||||
|
||||
| Draft | Abbrev | Concern | Gaps Addressed |
|-------|--------|---------|----------------|
| Agent Task DAG | ATD | Execution, checkpoints, rollback | #1 Resource Mgmt, #3 Error Recovery |
| Human-in-the-Loop | HITL | Override, approval, escalation | #7 Human Override, #11 Explainability |
| Protocol Binding | AEPB | Interop, translation, lifecycle | #4 Cross-Protocol, #5 Lifecycle |
| Assurance Profiles | APAE | Trust, verification, provenance | #2 Behavior Verification, #8 Cross-Domain, #9 Dynamic Trust, #12 Provenance |
{: #fig-companions title="Companion Draft Family"}
|
||||
|
||||
Together with ECT (evidence) and ACP-DAG-HITL (policy), these
|
||||
six documents cover all 3 critical and 6 high-severity gaps
|
||||
identified in the IETF AI/agent draft landscape.
|
||||
|
||||
# Security Considerations
|
||||
|
||||
AEM defines no protocol mechanisms and therefore introduces no
|
||||
direct security considerations. Security properties are
|
||||
inherited from the evidence layer (ECT assurance levels) and
|
||||
the policy layer (ACP-DAG-HITL validation).
|
||||
|
||||
Implementors MUST ensure that all layers are consistently
|
||||
configured: an L3 ECT deployment provides no additional
|
||||
assurance if the policy layer accepts unsigned tokens.
|
||||
|
||||
# IANA Considerations
|
||||
|
||||
This document has no IANA actions.
|
||||
|
||||
--- back
|
||||
|
||||
# Acknowledgments
|
||||
{:numbered="false"}
|
||||
|
||||
This architecture builds on the Execution Context Token
|
||||
specification {{I-D.nennemann-wimse-ect}} and the Agent Context
|
||||
Policy Token {{I-D.nennemann-agent-dag-hitl-safety}}.
|
||||
@@ -0,0 +1,461 @@
|
||||
---
|
||||
title: "Agent Ecosystem Model (AEM): Architecture and Terminology"
|
||||
abbrev: "AEM"
|
||||
category: info
|
||||
docname: draft-aem-agent-ecosystem-model-01
|
||||
submissiontype: IETF
|
||||
number:
|
||||
date:
|
||||
v: 3
|
||||
area: "OPS"
|
||||
workgroup: "NMOP"
|
||||
keyword:
|
||||
- agent ecosystem
|
||||
- DAG
|
||||
- HITL
|
||||
- agentic workflows
|
||||
|
||||
author:
|
||||
-
|
||||
fullname: TBD
|
||||
organization: Independent
|
||||
email: placeholder@example.com
|
||||
|
||||
normative:
|
||||
RFC2119:
|
||||
RFC8174:
|
||||
I-D.nennemann-wimse-ect:
|
||||
title: "Execution Context Tokens for Distributed Agentic Workflows"
|
||||
target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/
|
||||
I-D.nennemann-agent-dag-hitl-safety:
|
||||
title: "Agent Context Policy Token: DAG Delegation with Human Override"
|
||||
target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/
|
||||
|
||||
informative:
|
||||
RFC9334:
|
||||
RFC7519:
|
||||
RFC8615:
|
||||
|
||||
--- abstract
|
||||
|
||||
This document defines the Agent Ecosystem Model (AEM), a shared
|
||||
architecture and terminology for building interoperable agent
|
||||
systems that incorporate DAG-based execution, human-in-the-loop
|
||||
safety, and graduated assurance levels. AEM is not a protocol.
|
||||
It is a reference model that establishes common vocabulary and
|
||||
architectural concepts so that companion specifications (ATD,
|
||||
HITL, AEPB, APAE) and implementors share a consistent frame of
|
||||
reference. The model builds on Execution Context Tokens (ECT)
|
||||
for execution evidence and ACP-DAG-HITL for delegation policy.
|
||||
|
||||
--- middle
|
||||
|
||||
# Introduction
|
||||
|
||||
The IETF AI/agent landscape includes over 260 drafts proposing
|
||||
protocols for agent communication, identity, safety, and
|
||||
operations. These drafts share many implicit concepts — tasks,
|
||||
delegation, workflows, safety checks — but use inconsistent
|
||||
terminology and incompatible models.
|
||||
|
||||
AEM provides a single reference architecture so that:
|
||||
|
||||
- Companion drafts (ATD, HITL, AEPB, APAE) share vocabulary.
|
||||
- Implementors understand how the pieces compose.
|
||||
- New proposals can position themselves within an existing model
|
||||
rather than inventing another one.
|
||||
|
||||
AEM is deliberately not a protocol. It defines no wire formats,
|
||||
no endpoints, and no new token types. It is the map; the
|
||||
companion drafts are the territory.
|
||||
|
||||
## Design Principles
|
||||
|
||||
1. **ECT is the execution backbone.** All significant agent
|
||||
actions produce Execution Context Tokens
|
||||
{{I-D.nennemann-wimse-ect}}. The ecosystem does not define a
|
||||
second DAG or audit format.
|
||||
|
||||
2. **ACP-DAG-HITL is the policy backbone.**
|
||||
{{I-D.nennemann-agent-dag-hitl-safety}} defines delegation
|
||||
DAGs and HITL rules. The ecosystem extends these with
|
||||
operational semantics, not replacement structures.
|
||||
|
||||
3. **Same model, different assurance.** The architecture works
|
||||
identically from a relaxed K8s dev cluster (ECT L1) to a
|
||||
regulated healthcare environment (ECT L3 with audit ledger).
|
||||
|
||||
4. **Protocol-agnostic.** The ecosystem sits above any A2A
|
||||
protocol. Agents may speak different protocols and still
|
||||
participate through translation.
|
||||
|
||||
# Conventions and Definitions
|
||||
|
||||
{::boilerplate bcp14-tagged}
|
||||
|
||||
# Terminology {#terminology}
|
||||
|
||||
Agent:
|
||||
: An autonomous software entity that performs tasks, makes
|
||||
decisions, and communicates with other agents or humans.
|
||||
|
||||
Task:
|
||||
: A discrete unit of work performed by an agent, recorded as a
|
||||
single ECT node.
|
||||
|
||||
Workflow:
|
||||
: A set of tasks linked by dependencies, forming a DAG.
|
||||
Identified by the ECT `wid` claim {{I-D.nennemann-wimse-ect}}.
|
||||
|
||||
DAG (Directed Acyclic Graph):
|
||||
: The execution graph formed by ECT parent references (`par`
|
||||
claims). Also used in ACP-DAG-HITL for delegation structure.
|
||||
|
||||
Checkpoint:
|
||||
: An ECT node recording agent state before a consequential
|
||||
action, enabling rollback. Fully specified in ATD.
|
||||
|
||||
HITL Point:
|
||||
: A position in the workflow where human intervention is
|
||||
required or available, governed by ACP-DAG-HITL rules.
|
||||
|
||||
Override:
|
||||
: A human-initiated command that alters an agent's autonomous
|
||||
operation, taking precedence over the agent's own decisions.
|
||||
Fully specified in HITL.
|
||||
|
||||
Trust Score:
|
||||
: A floating-point value in \[0.0, 1.0\] representing one
|
||||
agent's assessed reliability of another. Updated using an
additive-increase/multiplicative-decrease (AIMD) model; fully
specified in APAE.
|
||||
|
||||
Protocol Binding:
|
||||
: The mapping between ecosystem semantics and a specific A2A
|
||||
communication protocol. Fully specified in AEPB.
|
||||
|
||||
Assurance Level:
|
||||
: The degree of cryptographic and audit protection applied to
|
||||
ECTs, defined in {{I-D.nennemann-wimse-ect}}:
|
||||
|
||||
| Level | ECT Format | Non-repudiation | Tamper detection | Audit ledger |
|-------|-----------|----------------|-----------------|-------------|
| L1 | Unsigned JSON | No | Transport only | No |
| L2 | Signed JWT | Yes | Signature | No |
| L3 | Signed JWT | Yes | Signature | Yes (ledger-committed) |
{: #fig-levels title="ECT Assurance Levels"}
|
||||
|
||||
Assurance Profile:
|
||||
: A named configuration (Relaxed, Standard, Regulated) selecting
|
||||
which mechanisms are required at a given deployment. Fully
|
||||
specified in APAE.
|
||||
|
||||
Blast Radius:
|
||||
: The set of agents and systems affected by a single failure.
|
||||
|
||||
Translation Gateway:
|
||||
: A service converting messages between two agent protocols,
|
||||
recording each hop as an ECT DAG node. Fully specified in AEPB.
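
The Trust Score entry above mentions an AIMD update without giving parameters, which are left to APAE. The sketch below reads AIMD in its usual sense of additive increase and multiplicative decrease. The step size `0.05` and the factor `0.5` are placeholder values chosen for illustration and are not taken from any companion draft.

```python
def update_trust(score, success, increase=0.05, decrease_factor=0.5):
    """Apply one AIMD step and clamp the result to [0.0, 1.0].

    `increase` and `decrease_factor` are placeholder values chosen for
    illustration; APAE is the authority for the real parameters.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("trust score must lie in [0.0, 1.0]")
    if success:
        score = score + increase          # additive increase
    else:
        score = score * decrease_factor   # multiplicative decrease
    return min(1.0, max(0.0, score))
```

With these placeholder values trust is regained slowly and lost quickly: a score halved by one failure needs several successful interactions to recover.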
|
||||
|
||||
# Architectural Model {#architecture}
|
||||
|
||||
The ecosystem is organized in four layers:
|
||||
|
||||
~~~
┌─────────────────────────────────────────────────────┐
│ Policy Layer                                        │
│  ACP-DAG-HITL: delegation DAG, HITL rules,          │
│                node constraints, trust thresholds   │
├─────────────────────────────────────────────────────┤
│ Semantics Layer                                     │
│  ATD: execution order, checkpoints, rollback,       │
│       circuit breakers, resource hints              │
│  HITL: override levels, approval gates, escalation  │
│  AEPB: capability ads, negotiation, translation     │
│  APAE: trust scoring, behavior verification,        │
│        provenance, assurance profiles               │
├─────────────────────────────────────────────────────┤
│ Evidence Layer                                      │
│  ECT: signed DAG of execution records (L1/L2/L3)    │
│       inp_hash/out_hash, ext claims, audit ledger   │
├─────────────────────────────────────────────────────┤
│ Identity Layer                                      │
│  WIMSE / X.509 / OAuth / JWK: agent identity        │
└─────────────────────────────────────────────────────┘
~~~
{: #fig-stack title="Ecosystem Layer Stack"}
|
||||
|
||||
Identity Layer:
|
||||
: Answers "who is this agent?" AEM does not define identity
|
||||
mechanisms; it assumes WIMSE, X.509, OAuth, or equivalent.
|
||||
|
||||
Evidence Layer:
|
||||
: Answers "what did this agent do?" ECT provides per-task
records, signed at L2 and above, linked into a DAG, with three
assurance levels.
|
||||
|
||||
Semantics Layer:
|
||||
: Answers "what does it mean and what to do about it?" The
|
||||
four companion drafts define operational semantics on top of
|
||||
ECT:
|
||||
|
||||
- **ATD** (Agent Task DAG): execution order, checkpoints,
|
||||
rollback, circuit breakers, resource hints.
|
||||
- **HITL** (Human-in-the-Loop): override levels, approval
|
||||
gates, escalation paths, explainability.
|
||||
- **AEPB** (Agent Ecosystem Protocol Binding): capability
|
||||
advertisement, protocol negotiation, translation gateways,
|
||||
agent lifecycle.
|
||||
- **APAE** (Assurance Profiles): dynamic trust scoring,
|
||||
behavior verification, data provenance, assurance profiles.
|
||||
|
||||
Policy Layer:
|
||||
: Answers "what's allowed?" ACP-DAG-HITL defines delegation
|
||||
constraints and HITL trigger rules. Companion drafts extend
|
||||
`constraints` with protocol-specific fields (trust thresholds,
|
||||
checkpoint policies, protocol restrictions).
|
||||
|
||||
## How ECT Extensions Work {#ect-ext}
|
||||
|
||||
Each companion draft defines `ext` claim namespaces on ECT:
|
||||
|
||||
| Draft | `ext` prefix | Example claims |
|-------|-------------|----------------|
| ATD | `atd.*` | `atd.reversible`, `atd.severity`, `atd.circuit_state` |
| HITL | `hitl.*` | `hitl.level`, `hitl.operator_id`, `hitl.prior_state` |
| AEPB | `aepb.*` | `aepb.source_protocol`, `aepb.dest_protocol` |
| APAE | `apae.*` | `apae.trust_score`, `apae.confidence`, `apae.hops` |
{: #fig-ext title="ECT Extension Namespaces"}
|
||||
|
||||
A draft MUST NOT use another draft's `ext` namespace without a
|
||||
normative reference to that draft.
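
The namespace rule can be checked mechanically. The sketch below assumes that `ext` is a flat object keyed by dotted claim names such as `atd.reversible`, and that a reviewer knows which drafts the emitting specification references normatively. The helper name and the decision to reject unknown prefixes are choices made for this example, not requirements of this document.

```python
NAMESPACE_OWNERS = {"atd": "ATD", "hitl": "HITL", "aepb": "AEPB", "apae": "APAE"}


def foreign_ext_claims(ext, emitting_draft, normative_refs=()):
    """List `ext` claims that `emitting_draft` is not entitled to set.

    A claim is acceptable when its prefix belongs to the emitting draft
    or to a draft listed in `normative_refs`. Unknown prefixes are
    rejected, which is a choice made for this sketch.
    """
    permitted = {emitting_draft, *normative_refs}
    rejected = []
    for claim in sorted(ext):
        prefix = claim.split(".", 1)[0]
        owner = NAMESPACE_OWNERS.get(prefix)
        if owner is None or owner not in permitted:
            rejected.append(claim)
    return rejected
```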
|
||||
|
||||
## How Policy Extensions Work {#policy-ext}
|
||||
|
||||
Each companion draft defines `constraints` fields on
|
||||
ACP-DAG-HITL DAG nodes:
|
||||
|
||||
| Draft | Constraint fields |
|-------|------------------|
| ATD | `atd.checkpoint_policy`, `atd.circuit_threshold` |
| HITL | (uses ACP-DAG-HITL HITL rule fields directly) |
| AEPB | `aepb.allowed_protocols`, `aepb.max_translation_hops` |
| APAE | `apae.min_trust`, `apae.min_confidence`, `apae.assurance_profile` |
{: #fig-constraints title="ACP-DAG-HITL Node Constraint Extensions"}
|
||||
|
||||
# Assurance as an Orthogonal Axis {#assurance}
|
||||
|
||||
The entire semantics layer operates identically at all ECT
|
||||
assurance levels. The DAG structure, HITL processing, trust
|
||||
scoring, and protocol translation are the same whether the ECT
|
||||
is unsigned JSON (L1) or a ledger-committed signed JWT (L3).
|
||||
|
||||
What changes across levels is the security envelope (see
|
||||
{{fig-levels}}). A deployment MAY use different levels for
|
||||
different workflows. Internal dev pipelines might use L1;
|
||||
cross-org integrations L2; regulated clinical workflows L3.
|
||||
|
||||
Implementations MUST ensure consistency across layers: an L3
|
||||
evidence configuration provides no additional assurance if the
|
||||
policy layer accepts unsigned tokens.
|
||||
|
||||
# Protocol Agnosticism {#agnosticism}
|
||||
|
||||
The ecosystem layer sits above any A2A communication protocol.
|
||||
Agents communicate via their native protocol (A2A, MCP, SLIM,
|
||||
uACP, etc.) while the `Execution-Context` HTTP header
|
||||
{{I-D.nennemann-wimse-ect}} carries ECTs alongside protocol
|
||||
messages.
|
||||
|
||||
When two agents speak different protocols, a translation gateway
|
||||
(defined by AEPB) converts between protocols while preserving
|
||||
ECT DAG continuity. The translation hop is itself an ECT node,
|
||||
so the cross-protocol path is one auditable DAG.
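
To show how a translation hop could stay inside the same DAG, the sketch below builds an unsigned (L1-style) payload for the hop. The claim names `wid`, `par`, `inp_hash`, `out_hash`, `exec_act`, and the `aepb.*` extension claims appear in this document. The `jti` identifier, the `translate` action label, and the hex-encoded SHA-256 hash format are assumptions of the example.

```python
import hashlib
import uuid


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest (the encoding is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def translation_ect(parent, inbound: bytes, outbound: bytes, src: str, dst: str):
    """Build the payload for one translation hop as a child of `parent`."""
    return {
        "jti": str(uuid.uuid4()),        # assumed node identifier claim
        "wid": parent["wid"],            # same workflow as the parent
        "par": [parent["jti"]],          # links the hop into the DAG
        "exec_act": "translate",         # hypothetical action label
        "inp_hash": sha256_hex(inbound),
        "out_hash": sha256_hex(outbound),
        "ext": {
            "aepb.source_protocol": src,
            "aepb.dest_protocol": dst,
        },
    }
```

Because the hop copies the parent's `wid` and names the parent in `par`, an auditor walking the DAG sees the gateway as one more node rather than a break in the chain.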
|
||||
|
||||
# Relationship to Existing Standards {#standards}
|
||||
|
||||
The ecosystem builds on existing IETF and industry standards.
|
||||
It does not replace any of them.
|
||||
|
||||
| Standard | Scope | Relationship to AEM |
|----------|-------|---------------------|
| WIMSE (draft-ietf-wimse-arch) | Workload identity and security context propagation | Identity Layer; AEM assumes WIMSE for agent credentials and context propagation. |
| ECT (I-D.nennemann-wimse-ect) | JWT-based execution evidence; DAG linkage via `par` | Evidence Layer; every significant action in the ecosystem produces an ECT. |
| ACP-DAG-HITL (I-D.nennemann-agent-dag-hitl-safety) | Delegation DAG policy; HITL trigger rules | Policy Layer; ATD/HITL/AEPB/APAE extend `constraints` fields rather than replacing the policy language. |
| OAuth 2.0 / RAR (RFC9396) | Authorization for API access | Identity Layer; operators and agents authenticate to HITL endpoints and capability documents via OAuth. |
| RATS (RFC9334) | Remote attestation for verifying evidence freshness | Informative to APAE Regulated profile; behavior verification attestations are RATS-compatible. |
| SPIFFE/SPIRE | Workload identity URI scheme (`spiffe://`) | Identity Layer; agent identities in ECT `sub` and ACP-DAG-HITL node `agent` fields use SPIFFE URIs by convention. |
{: #fig-standards title="Relationship to Existing Standards"}
|
||||
|
||||
## Working Group Targets
|
||||
|
||||
| Companion Draft | Suggested WG | Rationale |
|----------------|-------------|-----------|
| AEM (this document) | NMOP | Informational reference model for network operations automation. |
| ATD | NMOP | Execution semantics and error recovery for network agent workflows. |
| HITL | NMOP or OPS | Human override for autonomous network management. |
| AEPB | DISPATCH or ART | Protocol binding and interoperability layer; dispatch to appropriate WG. |
| APAE | RATS or Security Dispatch | Attestation-based trust and assurance profiles for agents. |
{: #fig-wgs title="Suggested Working Group Targets"}
|
||||
|
||||
# Companion Draft Summary {#companions}
|
||||
|
||||
| Draft | Abbrev | Concern | Gaps Addressed | Normative/Informative |
|-------|--------|---------|----------------|----------------------|
| Agent Task DAG | ATD | Execution, checkpoints, rollback, circuit breakers | #1 Resource Mgmt, #3 Error Recovery | Normative |
| Human-in-the-Loop | HITL | Override, approval, escalation, explainability | #7 Human Override, #11 Explainability | Normative |
| Protocol Binding | AEPB | Interop, translation, lifecycle | #4 Cross-Protocol, #5 Lifecycle | Normative |
| Assurance Profiles | APAE | Trust, verification, provenance, dual-regime | #2 Behavior Verification, #8 Cross-Domain, #9 Dynamic Trust, #12 Provenance | Informative/Normative |
{: #fig-companions title="Companion Draft Family"}
|
||||
|
||||
Together with ECT (evidence) and ACP-DAG-HITL (policy), these
|
||||
six documents cover all 3 critical and 6 high-severity gaps
|
||||
identified in the IETF AI/agent draft landscape analysis.
|
||||
|
||||
# Implementation Guidance {#implementation}
|
||||
|
||||
## Choosing an Assurance Level
|
||||
|
||||
Operators select the assurance level based on deployment context:
|
||||
|
||||
Relaxed (L1):
|
||||
: Appropriate for internal development, testing, and
|
||||
observability pipelines. No cryptographic overhead.
|
||||
Operators SHOULD NOT use L1 where ECT records could be
|
||||
relied upon as evidence in disputes.
|
||||
|
||||
Standard (L2):
|
||||
: Appropriate for production cross-organization deployments.
|
||||
Signed ECTs provide non-repudiation. RECOMMENDED as the
|
||||
default for any deployment where agents cross trust domains.
|
||||
|
||||
Regulated (L3):
|
||||
: Required for deployments subject to regulatory audit
|
||||
requirements (healthcare, finance, critical infrastructure).
|
||||
ECTs are committed to an append-only audit ledger.
|
||||
Operators MUST use L3 when a regulatory framework mandates
|
||||
tamper-evident audit trails.
|
||||
|
||||
## Minimum Viable Implementation
|
||||
|
||||
An implementation is AEM-compliant if it satisfies all of the following:
|
||||
|
||||
1. **Evidence**: Emits ECTs for all consequential actions.
|
||||
MAY use L1 initially.
|
||||
|
||||
2. **Policy**: Evaluates ACP-DAG-HITL node constraints before
|
||||
delegating tasks.
|
||||
|
||||
3. **Checkpoints**: Implements ATD §4 (checkpoints before
|
||||
consequential actions). MUST declare `atd.reversible`.
|
||||
|
||||
4. **HITL endpoint**: Implements HITL `/.well-known/hitl/override`
|
||||
and responds within 1 second.
|
||||
|
||||
5. **Capability document**: Serves AEPB `/.well-known/aepb` so
|
||||
peers can discover protocol support.
|
||||
|
||||
The following are OPTIONAL at L1 but REQUIRED at L2+:
|
||||
|
||||
- Cryptographic signing of ECTs.
|
||||
- APAE trust scoring.
|
||||
- Behavior verification.
|
||||
|
||||
The following are REQUIRED only at L3 (Regulated profile):
|
||||
|
||||
- Audit ledger commitment.
|
||||
- Continuous behavior verification.
|
||||
- Provenance claims on data-transforming ECT nodes.
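
The tiered requirements above can be expressed as a small lookup, which may help when writing a conformance checklist. The mechanism labels below are hypothetical shorthand for the items listed in this section, not identifiers defined by any draft.

```python
# Hypothetical labels for the five baseline items and the per-level additions.
BASELINE = {"ect_emission", "policy_evaluation", "checkpoints",
            "hitl_endpoint", "capability_document"}
L2_EXTRA = {"ect_signing", "trust_scoring", "behavior_verification"}
L3_EXTRA = {"audit_ledger", "continuous_behavior_verification",
            "provenance_claims"}


def required_mechanisms(level):
    """Return the set of mechanisms required at assurance level 1, 2, or 3."""
    if level not in (1, 2, 3):
        raise ValueError("assurance level must be 1, 2, or 3")
    required = set(BASELINE)
    if level >= 2:
        required |= L2_EXTRA
    if level == 3:
        required |= L3_EXTRA
    return required


def missing_mechanisms(level, implemented):
    """List required mechanisms that `implemented` does not cover."""
    return sorted(required_mechanisms(level) - set(implemented))
```

Calling `missing_mechanisms(2, BASELINE)` on an L1-only implementation returns the three items that become mandatory at L2.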
|
||||
|
||||
## Upgrade Path
|
||||
|
||||
Upgrading from L1 to L2:
|
||||
: Add a signing key (WIMSE WIT or X.509). Update ECT emission
|
||||
to sign tokens. Update all agents to verify signatures.
|
||||
No protocol changes needed; ECT format is compatible.
|
||||
|
||||
Upgrading from L2 to L3:
|
||||
: Configure an audit ledger endpoint. Update ECT emission to
|
||||
commit each ECT. Enable APAE continuous behavior
|
||||
verification. Enable provenance claims.
|
||||
|
||||
Operators MUST NOT downgrade assurance level during an active
|
||||
workflow.
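
One way to enforce the no-downgrade rule is to track the level per workflow and refuse any reduction while the workflow is active. The class below is a minimal sketch under that assumption. Its name, its methods, and the use of integers 1 to 3 for L1 to L3 are invented for illustration.

```python
class WorkflowAssurance:
    """Tracks the ECT assurance level (1, 2, or 3) of one workflow."""

    def __init__(self, wid, level):
        if level not in (1, 2, 3):
            raise ValueError("assurance level must be 1, 2, or 3")
        self.wid = wid
        self.level = level
        self.active = True

    def set_level(self, new_level):
        """Raise the level at any time; lower it only after completion."""
        if new_level not in (1, 2, 3):
            raise ValueError("assurance level must be 1, 2, or 3")
        if self.active and new_level < self.level:
            raise PermissionError(
                f"workflow {self.wid} is active; downgrade from "
                f"L{self.level} to L{new_level} refused"
            )
        self.level = new_level

    def complete(self):
        """Mark the workflow finished so later runs may choose a new level."""
        self.active = False
```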
|
||||
|
||||
# Security Considerations
|
||||
|
||||
## Threat Model
|
||||
|
||||
The AEM threat model covers the following adversary classes:
|
||||
|
||||
**Compromised Agent**: An agent that emits false ECTs, fabricates
|
||||
errors, or attempts unauthorized rollbacks. Mitigated by ECT
|
||||
signature verification (L2+) and ACP-DAG-HITL policy validation.
|
||||
|
||||
**Rogue Operator**: A human who issues unauthorized overrides.
|
||||
Mitigated by HITL authentication requirements (signed JWTs,
|
||||
mutual TLS) and multi-operator approval for Level 4 TAKEOVER.
|
||||
|
||||
**Translation Gateway Attack**: A malicious or compromised
|
||||
gateway that alters message content in transit. Mitigated by
|
||||
ECT `inp_hash`/`out_hash` integrity checks; receivers MUST
|
||||
detect hash mismatches.
|
||||
|
||||
**Trust Score Manipulation**: An agent accumulates high trust
|
||||
through benign behavior, then executes a malicious action.
|
||||
Mitigated by APAE double-penalty for `policy_violation` events
|
||||
and anomaly detection.
|
||||
|
||||
**Downgrade Attack**: An attacker forces use of L1 ECTs where
|
||||
L2+ is required. Mitigated by explicit assurance level checks
|
||||
in ACP-DAG-HITL constraints (`apae.assurance_profile` field).
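
The Translation Gateway Attack mitigation relies on receivers comparing the content they received against the hash recorded in the ECT. A minimal receiver-side check might look like the following. It assumes `out_hash` is a hex-encoded SHA-256 digest of the exact bytes delivered, which is an assumption of this sketch; the actual algorithm and encoding are defined by {{I-D.nennemann-wimse-ect}}.

```python
import hashlib
import hmac


def output_matches(ect, received: bytes) -> bool:
    """Return True when `received` matches the `out_hash` claim of `ect`.

    Assumes a hex-encoded SHA-256 digest; the real ECT hash format may
    differ.
    """
    expected = ect["out_hash"]
    actual = hashlib.sha256(received).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, actual)
```

A receiver that gets `False` here has evidence that the message changed after the hop was recorded and can treat the hop as untrusted.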
|
||||
|
||||
## Layer Consistency Requirement
|
||||
|
||||
Implementations MUST configure the semantics, evidence, and
|
||||
policy layers consistently. Specifically:
|
||||
|
||||
- An L3 evidence deployment MUST NOT accept L1 ECTs as proof
|
||||
of action in audit or policy decisions.
|
||||
- A Regulated assurance profile MUST be paired with L3 ECTs.
|
||||
- Operator actions at HITL Level 2+ (approval required) MUST be authenticated.
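
The three rules above lend themselves to a configuration check at startup. The function below is a sketch of such a check. Its parameter names and the representation of levels as integers are assumptions; only the rules themselves come from this section.

```python
def consistency_errors(profile, deployment_level, ect_level,
                       hitl_level=0, hitl_authenticated=False):
    """Return human-readable violations of the layer consistency rules.

    `profile` is an assurance profile name, `deployment_level` and
    `ect_level` are 1, 2, or 3, and `hitl_level` is the HITL override
    level in use. All parameter shapes are assumptions of this sketch.
    """
    errors = []
    if deployment_level == 3 and ect_level == 1:
        errors.append("L3 deployment must not accept an L1 ECT as proof")
    if profile == "Regulated" and ect_level != 3:
        errors.append("Regulated profile requires L3 ECTs")
    if hitl_level >= 2 and not hitl_authenticated:
        errors.append("HITL Level 2+ requires authentication")
    return errors
```

An empty list means none of the three rules is violated. It does not by itself establish that a deployment is secure.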
|
||||
|
||||
## Translation Gateway Supply Chain
|
||||
|
||||
Translation gateways are privileged intermediaries: they have
|
||||
access to plaintext message content and can inject ECT nodes.
|
||||
Operators MUST:
|
||||
|
||||
- Authenticate gateways using the same identity mechanisms as
|
||||
agents (WIMSE/SPIFFE).
|
||||
- Audit gateway ECT nodes at L2+ for tamper detection.
|
||||
- Limit `aepb.max_translation_hops` to prevent unbounded
|
||||
delegation chains through untrusted gateways.
|
||||
|
||||
# IANA Considerations
|
||||
|
||||
## AEM Ecosystem Extension Registry
|
||||
|
||||
This document requests the creation of the "AEM Ecosystem
|
||||
Extension Registry" under IANA. This registry collects:
|
||||
|
||||
1. **ECT Extension Namespaces**: Companion draft `ext` claim
|
||||
prefixes (see {{fig-ext}}).
|
||||
2. **ACP-DAG-HITL Constraint Field Namespaces**: Companion draft
|
||||
`constraints` field prefixes (see {{fig-constraints}}).
|
||||
3. **ECT `exec_act` Values**: All `exec_act` strings registered
|
||||
by companion drafts (see each companion's IANA section).
|
||||
|
||||
Registration policy: Specification Required.
|
||||
|
||||
Initial entries: as defined in {{fig-ext}}, {{fig-constraints}},
|
||||
and the companion draft `exec_act` registrations.
|
||||
|
||||
--- back
|
||||
|
||||
# Acknowledgments
|
||||
{:numbered="false"}
|
||||
|
||||
This architecture builds on the Execution Context Token
|
||||
specification {{I-D.nennemann-wimse-ect}} and the Agent Context
|
||||
Policy Token {{I-D.nennemann-agent-dag-hitl-safety}}. The
|
||||
working group targets in {{fig-wgs}} reflect the current IETF
|
||||
AI/agent draft landscape analysis.
|
||||
@@ -0,0 +1,397 @@
|
||||
---
title: "Agent Error Recovery and Rollback (AERR)"
abbrev: "AERR"
category: std
docname: draft-aerr-agent-error-recovery-rollback-00
submissiontype: IETF
number:
date:
v: 3
area: "OPS"
workgroup: "NMOP"
keyword:
  - error recovery
  - rollback
  - circuit breaker
  - agentic workflows
  - execution context

author:
  -
    fullname: Generated by IETF Draft Analyzer
    organization: Independent
    email: placeholder@example.com

normative:
  RFC7519:
  RFC7515:
  RFC9110:
  I-D.nennemann-wimse-ect:
    title: "Execution Context Tokens for Distributed Agentic Workflows"
    target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/

informative:
  I-D.nennemann-agent-dag-hitl-safety:
    title: "Agent Context Policy Token: DAG Delegation with Human Override"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/
|
||||
|
||||
--- abstract
|
||||
|
||||
This document defines the Agent Error Recovery and Rollback (AERR)
|
||||
protocol, a standard for handling errors, cascading failures, and
|
||||
rollback in multi-agent systems. AERR defines three mechanisms:
|
||||
state checkpoints recorded as Execution Context Token (ECT) DAG
|
||||
nodes, a circuit breaker pattern to contain cascading failures,
|
||||
and a rollback protocol that walks the ECT DAG backwards to revert
|
||||
agent actions to a known-good state. By building on ECT, AERR
|
||||
inherits cryptographic audit trails, assurance levels, and DAG
|
||||
validation without inventing parallel infrastructure.
|
||||
|
||||
--- middle
|
||||
|
||||
# Introduction
|
||||
|
||||
The IETF AI/agent landscape includes 60 drafts on autonomous
|
||||
network operations but none that standardize error recovery. When
|
||||
an autonomous agent misconfigures a router, allocates resources
|
||||
incorrectly, or triggers a cascade of failures across a multi-agent
|
||||
system, there is no standard mechanism for detecting the failure,
|
||||
containing its blast radius, or reverting to a safe state.
|
||||
|
||||
AERR borrows proven patterns from distributed systems -- checkpoints
|
||||
from database transactions, circuit breakers from microservice
|
||||
architectures, rollback from version control -- and adapts them for
|
||||
AI agent workflows. Rather than inventing its own audit and
|
||||
tracing layer, AERR records all checkpoints, errors, and rollbacks
|
||||
as ECT DAG nodes {{I-D.nennemann-wimse-ect}}, giving every
|
||||
recovery action a cryptographic proof chain.
|
||||
|
||||
Design principles:
|
||||
|
||||
1. Agents that take consequential actions MUST be able to undo
|
||||
them, or MUST declare them irreversible upfront.
|
||||
2. Failure containment takes priority over failure diagnosis.
|
||||
3. The protocol adds minimal overhead to the happy path.
|
||||
|
||||
# Conventions and Definitions
|
||||
|
||||
{::boilerplate bcp14-tagged}
|
||||
|
||||
Checkpoint:
|
||||
: An ECT recording an agent's state hash before a consequential
|
||||
action, providing a restore point for rollback.
|
||||
|
||||
Circuit Breaker:
|
||||
: A mechanism that stops an agent from propagating requests to a
|
||||
failing downstream agent, preventing cascading failures.
|
||||
|
||||
Rollback:
|
||||
: The process of reverting an agent's actions and state to a
|
||||
previously recorded checkpoint, walking the ECT DAG backwards.
|
||||
|
||||
Blast Radius:
|
||||
: The set of agents and systems affected by a single agent's
|
||||
failure, determinable by traversing the ECT DAG forward from the
|
||||
failing node.
|
||||
|
||||
# Problem Statement
|
||||
|
||||
Consider a network operations scenario: Agent A instructs Agent B
|
||||
to update firewall rules, which causes Agent C's traffic monitoring
|
||||
to fail, which causes Agent D to misclassify traffic. Today each
|
||||
agent handles errors independently. There is no standard way for
|
||||
Agent D to signal that the root cause is upstream, for the cascade
|
||||
to be halted, or for the chain of actions to be rolled back.
|
||||
|
||||
The ECT DAG {{I-D.nennemann-wimse-ect}} already records causal
|
||||
ordering of agent actions via `par` references. AERR adds
|
||||
checkpoint semantics, error propagation, and rollback operations
|
||||
on top of this existing structure.
|
||||
|
||||
# Checkpoint Mechanism {#checkpoints}
|
||||
|
||||
An AERR-compliant agent MUST create a checkpoint ECT before any
|
||||
action it classifies as consequential. An action is consequential
|
||||
if it modifies external state (e.g., network config, database
|
||||
records, API calls with side effects).
|
||||
|
||||
## Checkpoint as ECT
|
||||
|
||||
A checkpoint is an ECT with:
|
||||
|
||||
- `exec_act`: `"aerr:checkpoint"`
|
||||
- `par`: the `jti` of the preceding task ECT in the workflow
|
||||
- `out_hash`: SHA-256 hash of the agent's state snapshot at
|
||||
checkpoint time (for rollback integrity verification)
|
||||
|
||||
The `ext` claim carries AERR-specific metadata:
|
||||
|
||||
~~~json
{
  "ext": {
    "aerr.action_type": "config_update",
    "aerr.target": "router-07.example.com",
    "aerr.reversible": true,
    "aerr.rollback_uri": "https://agent-b.example.com/aerr/rollback",
    "aerr.ttl": 86400
  }
}
~~~
{: #fig-checkpoint title="Checkpoint ECT Extension Claims"}
|
||||
|
||||
The `aerr.reversible` field MUST be present. If `false`, the
|
||||
agent declares that this action cannot be automatically undone
|
||||
and rollback requests MUST be escalated to a human operator via
|
||||
the HITL mechanism {{I-D.nennemann-agent-dag-hitl-safety}}.
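
As an illustration of the checkpoint claims described above, the following sketch assembles the unsigned claim set. Several details are assumptions of this example and not requirements of AERR: the helper name `build_checkpoint_claims`, canonicalizing the state snapshot as sorted-key JSON before hashing, hex encoding of the digest, and representing `par` as a list. Signing and the remaining ECT claims are out of scope here.

```python
import hashlib
import json
import uuid


def build_checkpoint_claims(parent_jti, state, *, action_type, target,
                            reversible, rollback_uri, ttl):
    """Build the AERR-relevant claims of a checkpoint ECT (unsigned sketch)."""
    # aerr.reversible is mandatory, so reject anything that is not a bool.
    if not isinstance(reversible, bool):
        raise ValueError("aerr.reversible MUST be present and boolean")
    # Assumption: the snapshot is serialized deterministically before hashing.
    snapshot = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return {
        "jti": str(uuid.uuid4()),
        "exec_act": "aerr:checkpoint",
        "par": [parent_jti],  # jti of the preceding task ECT
        "out_hash": hashlib.sha256(snapshot.encode("utf-8")).hexdigest(),
        "ext": {
            "aerr.action_type": action_type,
            "aerr.target": target,
            "aerr.reversible": reversible,
            "aerr.rollback_uri": rollback_uri,
            "aerr.ttl": ttl,
        },
    }
```

Because the snapshot is serialized with sorted keys, two agents that hold the same state produce the same `out_hash`, which is what a later rollback integrity check relies on.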
|
||||
|
||||
Agents MAY create hierarchical checkpoints using the ECT DAG: a
parent checkpoint ECT lists the `jti` values of its child checkpoint
ECTs in its `par` claim. Rolling back the parent rolls back all
children.
|
||||
|
||||
## Checkpoint Storage
|
||||
|
||||
Checkpoint ECTs MUST be stored for at least the duration specified
|
||||
by `aerr.ttl`. At L3 {{I-D.nennemann-wimse-ect}}, checkpoints
|
||||
are automatically preserved in the audit ledger. At L1 and L2,
|
||||
agents MUST store checkpoints in durable local storage that
|
||||
survives agent restarts.
|
||||
|
||||
# Error Signaling {#error-signals}
|
||||
|
||||
When an agent detects an error, it MUST produce an error ECT and
|
||||
propagate it to affected agents in the DAG.
|
||||
|
||||
## Error ECT
|
||||
|
||||
An error signal is an ECT with:
|
||||
|
||||
- `exec_act`: `"aerr:error"`
|
||||
- `par`: the `jti` of the checkpoint ECT associated with the
|
||||
failing action
|
||||
|
||||
The `ext` claim carries error details:
|
||||
|
||||
~~~json
{
  "ext": {
    "aerr.severity": "critical",
    "aerr.error_type": "action_failed",
    "aerr.description": "BGP session did not establish",
    "aerr.checkpoint_id": "550e8400-e29b-41d4-a716-446655440001",
    "aerr.upstream_errors": []
  }
}
~~~
{: #fig-error title="Error ECT Extension Claims"}
|
||||
|
||||
Severity levels: `info`, `warning`, `error`, `critical`.
|
||||
|
||||
Error types: `action_failed`, `timeout`, `constraint_violation`,
`resource_exhausted`, `upstream_cascade`, `circuit_open`, `unknown`.
|
||||
|
||||
## Error Propagation via DAG
|
||||
|
||||
When an agent receives an error ECT caused by an action it
|
||||
initiated, it MUST either:
|
||||
|
||||
(a) Attempt automatic rollback of its checkpoint ({{rollback}}), or
|
||||
|
||||
(b) Escalate to its operator if the action was irreversible.
|
||||
|
||||
The `aerr.upstream_errors` array allows agents to chain error
|
||||
context by referencing `jti` values of predecessor error ECTs,
|
||||
building a causal trace from symptom to root cause through the
|
||||
DAG.
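
The causal trace built from `aerr.upstream_errors` can be followed programmatically. The sketch below is illustrative only: `errors_by_jti` is a hypothetical in-memory index from an error ECT's `jti` to its claims, and treating an error whose upstream references are all unknown as a root cause is an assumption of this example.

```python
def root_causes(error_jti, errors_by_jti):
    """Follow aerr.upstream_errors links from a symptom to its root cause(s)."""
    roots, seen, stack = [], set(), [error_jti]
    while stack:
        jti = stack.pop()
        if jti in seen:
            continue  # guard against malformed (cyclic) references
        seen.add(jti)
        upstream = errors_by_jti[jti]["ext"].get("aerr.upstream_errors", [])
        # Only follow references that are present in the local index.
        known = [u for u in upstream if u in errors_by_jti]
        if known:
            stack.extend(known)
        else:
            roots.append(jti)  # nothing further upstream is known
    return sorted(roots)
```

In the Problem Statement scenario, Agent D's error would reference Agent C's error, which in turn references Agent B's, so the walk ends at the error recorded for the firewall update.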
|
||||
|
||||
## HITL Escalation
|
||||
|
||||
When an error requires human intervention, the error ECT SHOULD
|
||||
trigger a HITL rule per {{I-D.nennemann-agent-dag-hitl-safety}}.
|
||||
Example policy:
|
||||
|
||||
~~~json
{
  "hitl": {
    "rules": [{
      "id": "r-critical-error",
      "trigger": {
        "kind": "keyword_match",
        "op": "eq",
        "value": "critical",
        "input_ref": "ext.aerr.severity"
      },
      "required_role": "operator:oncall",
      "action": "escalate",
      "allow_override": true,
      "override_action": "continue"
    }]
  }
}
~~~
{: #fig-hitl-error title="HITL Policy for Critical Errors"}
|
||||
|
||||
# Circuit Breaker Pattern {#circuit-breaker}
|
||||
|
||||
Each agent MUST implement a circuit breaker for every downstream
|
||||
agent it communicates with.
|
||||
|
||||
## States
|
||||
|
||||
CLOSED (normal):
|
||||
: Requests flow through. The agent tracks the error rate over a
|
||||
sliding window (default: 60 seconds).
|
||||
|
||||
OPEN (failure detected):
|
||||
: When the error rate exceeds a threshold (default: 50% over the
|
||||
window), the breaker opens. All requests to the downstream
|
||||
agent are immediately rejected with `aerr.error_type`:
|
||||
`circuit_open`. The agent MUST produce an error ECT and emit
|
||||
it to upstream peers.
|
||||
|
||||
HALF-OPEN (recovery probe):
|
||||
: After a cooldown period (default: 30 seconds), the breaker
|
||||
allows a single probe request. If it succeeds, the breaker
|
||||
returns to CLOSED. If it fails, it returns to OPEN with doubled
|
||||
cooldown (exponential backoff, max 300 seconds).
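
The three states can be sketched as a small state machine. This is a non-normative illustration using the defaults stated above. The class and method names, the injectable `clock`, resetting the cooldown to its base value after a successful probe, and the absence of a minimum request volume before the breaker may open are all assumptions of this example. Emitting the error ECT and state change ECTs is omitted.

```python
import time
from collections import deque


class CircuitBreaker:
    """Three-state breaker for one downstream agent (illustrative sketch)."""

    def __init__(self, window_s=60.0, threshold=0.5, cooldown_s=30.0,
                 max_cooldown_s=300.0, clock=time.monotonic):
        self.window_s, self.threshold = window_s, threshold
        self.base_cooldown_s, self.max_cooldown_s = cooldown_s, max_cooldown_s
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.state = "closed"
        self.opened_at = None
        self.results = deque()  # (timestamp, succeeded) pairs in the window

    def allow_request(self):
        """Return True if a request may be sent to the downstream agent."""
        if self.state == "open" and self.clock() - self.opened_at >= self.cooldown_s:
            self.state = "half_open"  # cooldown elapsed: allow a single probe
            return True
        return self.state == "closed"

    def record(self, succeeded):
        """Record the outcome of a request that was allowed through."""
        now = self.clock()
        if self.state == "half_open":
            if succeeded:
                self.state, self.cooldown_s = "closed", self.base_cooldown_s
                self.results.clear()
            else:
                # Failed probe: reopen with a doubled cooldown, capped at the max.
                self.cooldown_s = min(self.cooldown_s * 2, self.max_cooldown_s)
                self.state, self.opened_at = "open", now
            return
        self.results.append((now, succeeded))
        # Drop outcomes that have fallen out of the sliding window.
        while self.results and now - self.results[0][0] > self.window_s:
            self.results.popleft()
        failures = sum(1 for _, ok in self.results if not ok)
        if failures / len(self.results) > self.threshold:
            self.state, self.opened_at = "open", now
```

Without a minimum request volume, a single failed request in an otherwise empty window yields a 100% error rate and opens the breaker. A deployment might add such a floor; the text above does not specify one.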
|
||||
|
||||
## State Change ECTs
|
||||
|
||||
Each circuit breaker state change MUST produce an ECT:
|
||||
|
||||
- `exec_act`: `"aerr:circuit_open"`, `"aerr:circuit_half_open"`,
|
||||
or `"aerr:circuit_closed"`
|
||||
- `par`: the `jti` of the error ECT that triggered the transition
|
||||
|
||||
This records the health topology of the agent network in the ECT
|
||||
DAG, queryable from the audit ledger at L3.
|
||||
|
||||
## Observability
|
||||
|
||||
Agents MUST expose circuit breaker state at:
|
||||
|
||||
~~~
|
||||
GET /aerr/circuits
|
||||
~~~
|
||||
|
||||
Response:
|
||||
|
||||
~~~json
{
  "circuits": [{
    "downstream_agent": "spiffe://example.com/agent/router-mgr",
    "state": "open",
    "error_rate": 0.75,
    "last_failure_ect": "550e8400-e29b-41d4-a716-446655440099",
    "cooldown_remaining_s": 22
  }]
}
~~~
{: #fig-circuits title="Circuit Breaker Status"}
|
||||
|
||||
# Rollback Protocol {#rollback}
|
||||
|
||||
## Rollback Request
|
||||
|
||||
A rollback is initiated by sending an HTTP POST to the target
|
||||
agent's rollback endpoint:
|
||||
|
||||
~~~
POST /aerr/rollback HTTP/1.1
Content-Type: application/json
Execution-Context: <rollback-request-ECT>

{
  "rollback_id": "urn:uuid:...",
  "checkpoint_id": "550e8400-e29b-41d4-a716-446655440001",
  "reason": "Upstream action caused cascading failure",
  "cascade": true
}
~~~
{: #fig-rollback-req title="Rollback Request"}
|
||||
|
||||
The request MUST include an ECT in the Execution-Context header
|
||||
with `exec_act`: `"aerr:rollback_request"` and `par` referencing
|
||||
the error ECT that motivated the rollback.
|
||||
|
||||
When `cascade` is `true`, the receiving agent MUST also initiate
|
||||
rollback of any downstream checkpoints created as a consequence
|
||||
of the checkpointed action. The ECT DAG's `par` chain identifies
|
||||
these downstream actions.
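
One way to find the downstream checkpoints for a cascading rollback is to walk the `par` links forward from the checkpoint being rolled back. The sketch below assumes the agent can see the workflow's ECT claims as a list of dictionaries and that `par` is a list of `jti` values. The function name and the return order (reverse discovery order, so that checkpoints farther from the starting node come first) are choices of this example, not requirements of AERR.

```python
from collections import defaultdict, deque


def downstream_checkpoints(checkpoint_jti, ects):
    """List checkpoint ECTs that descend from checkpoint_jti via `par` links."""
    children = defaultdict(list)  # parent jti -> jtis of ECTs that name it in par
    kind = {}                     # jti -> exec_act
    for ect in ects:
        kind[ect["jti"]] = ect["exec_act"]
        for parent in ect.get("par", []):
            children[parent].append(ect["jti"])
    found, seen, queue = [], {checkpoint_jti}, deque([checkpoint_jti])
    while queue:  # breadth-first walk forward through the DAG
        for child in children[queue.popleft()]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
                if kind[child] == "aerr:checkpoint":
                    found.append(child)
    return found[::-1]  # most distant checkpoints first
```

Reverting the most distant checkpoints first undoes later actions before the actions that caused them, which is one plausible ordering for unwinding a chain.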
|
||||
|
||||
## Rollback Response
|
||||
|
||||
The agent produces a rollback result ECT with:
|
||||
|
||||
- `exec_act`: `"aerr:rollback_complete"` (or `"aerr:rollback_escalated"`)
|
||||
- `par`: the `jti` of the rollback request ECT
|
||||
- `out_hash`: SHA-256 hash of the agent's state after rollback
|
||||
|
||||
~~~json
{
  "ext": {
    "aerr.rollback_id": "urn:uuid:...",
    "aerr.status": "completed",
    "aerr.state_hash_before": "sha256:...",
    "aerr.state_hash_after": "sha256:...",
    "aerr.cascaded": [
      {"agent": "spiffe://example.com/agent/monitor", "status": "completed"},
      {"agent": "spiffe://example.com/agent/classify", "status": "escalated"}
    ]
  }
}
~~~
{: #fig-rollback-resp title="Rollback Result ECT"}
|
||||
|
||||
Status values: `completed`, `partial`, `escalated`, `failed`.
|
||||
|
||||
`escalated` means the action was irreversible and a human operator
|
||||
has been notified via HITL. `partial` means some but not all
|
||||
downstream rollbacks succeeded.
|
||||
|
||||
## Idempotency
|
||||
|
||||
Agents MUST implement idempotent rollback: receiving the same
|
||||
`rollback_id` twice MUST return the same result without
|
||||
re-executing the rollback.
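
A simple way to meet the idempotency requirement is to cache each result under its `rollback_id`. The sketch below is illustrative: `RollbackHandler` and the `execute_rollback` callable are hypothetical names, and the in-memory dictionary stands in for whatever durable store an agent would actually need so that results survive restarts. Handling of concurrent duplicate requests is not shown.

```python
class RollbackHandler:
    """Caches results by rollback_id so repeated requests do not re-execute."""

    def __init__(self, execute_rollback):
        # execute_rollback is a hypothetical callable: request dict -> result dict
        self._execute = execute_rollback
        self._results = {}  # rollback_id -> previously returned result

    def handle(self, request):
        rollback_id = request["rollback_id"]
        if rollback_id not in self._results:
            # First time this rollback_id is seen: perform the rollback once.
            self._results[rollback_id] = self._execute(request)
        # Repeated requests receive the stored result unchanged.
        return self._results[rollback_id]
```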
|
||||
|
||||
# Security Considerations
|
||||
|
||||
Rollback requests are sensitive operations. Agents MUST
authenticate rollback requests via the ECT signature chain. Agents
SHOULD authorize rollback requests only from agents whose ECTs
appear in the same workflow DAG (identified by `wid`).
|
||||
|
||||
Checkpoint ECTs contain `out_hash` of agent state but not the
|
||||
state itself. Agents MUST encrypt stored state snapshots at rest.
|
||||
|
||||
Circuit breaker status exposes system health topology. The
|
||||
`/aerr/circuits` endpoint SHOULD be access-controlled.
|
||||
|
||||
Malicious agents could emit false error ECTs to trigger rollbacks.
|
||||
Agents SHOULD verify that error ECTs reference valid checkpoint
|
||||
`jti` values from their own workflow DAG before initiating
|
||||
rollback. At L2 and L3, ECT signatures prevent forgery.
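
The check on incoming error ECTs can be sketched as follows. This example assumes the agent keeps a hypothetical map, `own_checkpoints`, from the `jti` of each checkpoint it created to that checkpoint's `wid`, and that the error ECT carries `wid` and `aerr.checkpoint_id` as shown earlier. Signature verification is assumed to have happened before this function is called and is not shown.

```python
def may_trigger_rollback(error_ect, own_checkpoints):
    """Return True only if the error references one of our own checkpoints
    in the same workflow (own_checkpoints maps checkpoint jti -> wid)."""
    checkpoint_id = error_ect.get("ext", {}).get("aerr.checkpoint_id")
    return (checkpoint_id in own_checkpoints
            and own_checkpoints[checkpoint_id] == error_ect.get("wid"))
```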
|
||||
|
||||
# IANA Considerations
|
||||
|
||||
This document requests the following IANA registrations:
|
||||
|
||||
1. An "AERR Error Type" registry under Specification Required
|
||||
policy. Initial entries: `action_failed`, `timeout`,
|
||||
`constraint_violation`, `resource_exhausted`,
|
||||
`upstream_cascade`, `circuit_open`, `unknown`.
|
||||
|
||||
2. Registration of `exec_act` values `aerr:checkpoint`,
`aerr:error`, `aerr:rollback_request`, `aerr:rollback_complete`,
`aerr:rollback_escalated`, `aerr:circuit_open`,
`aerr:circuit_half_open`, `aerr:circuit_closed` in a future ECT
action type registry.
|
||||
|
||||
--- back
|
||||
|
||||
# Acknowledgments
|
||||
{:numbered="false"}
|
||||
|
||||
This document builds on the Execution Context Token specification
|
||||
{{I-D.nennemann-wimse-ect}} for DAG-based audit trails and the
|
||||
Agent Context Policy Token {{I-D.nennemann-agent-dag-hitl-safety}}
|
||||
for HITL escalation of irreversible actions.
|
||||
@@ -0,0 +1,309 @@
|
||||
Internet-Draft AI/Agent WG
|
||||
Intended status: Standards Track March 2026
|
||||
Expires: September 15, 2026
|
||||
|
||||
|
||||
Agent Error Recovery and Rollback (AERR)
|
||||
draft-aerr-agent-error-recovery-rollback-00
|
||||
|
||||
Abstract
|
||||
|
||||
This document defines the Agent Error Recovery and Rollback
|
||||
(AERR) protocol, a lightweight standard for handling errors,
|
||||
cascading failures, and rollback in multi-agent systems.
|
||||
Autonomous AI agents increasingly make unsupervised decisions,
|
||||
yet no standard exists for how agents checkpoint state, signal
|
||||
errors to peers, contain cascading failures, or roll back
|
||||
autonomous decisions gone wrong. AERR defines three mechanisms:
|
||||
state checkpoints that agents create before consequential
|
||||
actions, a circuit breaker pattern to contain cascading failures
|
||||
across agent networks, and a rollback protocol for reverting
|
||||
agent actions to a known-good state. The protocol is transport-
|
||||
agnostic and builds on JSON and standard HTTP semantics.
|
||||
|
||||
Status of This Memo
|
||||
|
||||
This Internet-Draft is submitted in full conformance with the
|
||||
provisions of BCP 78 and BCP 79.
|
||||
|
||||
This document is intended to have Standards Track status.
|
||||
Distribution of this memo is unlimited.
|
||||
|
||||
Table of Contents
|
||||
|
||||
1. Introduction
|
||||
2. Terminology
|
||||
3. Problem Statement
|
||||
4. Checkpoint Mechanism
|
||||
5. Error Signaling
|
||||
6. Circuit Breaker Pattern
|
||||
7. Rollback Protocol
|
||||
8. Security Considerations
|
||||
9. IANA Considerations
|
||||
|
||||
1. Introduction
|
||||
|
||||
The IETF AI/agent landscape includes 60 drafts on autonomous
|
||||
network operations but none that standardize error recovery.
|
||||
When an autonomous agent misconfigures a router, allocates
|
||||
resources incorrectly, or triggers an unintended cascade of
|
||||
actions across a multi-agent system, there is currently no
|
||||
standard mechanism for detecting the failure, containing its
|
||||
blast radius, or reverting to a safe state.
|
||||
|
||||
AERR borrows proven patterns from distributed systems:
|
||||
checkpoints from database transactions, circuit breakers from
|
||||
microservice architectures, and rollback from version control.
|
||||
It adapts these patterns to the specific needs of AI agents,
|
||||
where actions may be partially reversible and where the agent
|
||||
that caused the error may not be the best one to fix it.
|
||||
|
||||
Design principles:
|
||||
1. Agents that take consequential actions MUST be able to undo
|
||||
them, or MUST declare them irreversible upfront.
|
||||
2. Failure containment takes priority over failure diagnosis.
|
||||
3. The protocol adds minimal overhead to the happy path.
|
||||
|
||||
2. Terminology
|
||||
|
||||
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED",
"MAY", and "OPTIONAL" in this document are to be interpreted as
described in BCP 14 [RFC2119] [RFC8174] when, and only when, they
appear in all capitals, as shown here.
|
||||
|
||||
Checkpoint: A snapshot of an agent's state and the external
|
||||
effects of its actions at a point in time, sufficient to
|
||||
restore the system to that state.
|
||||
|
||||
Circuit Breaker: A mechanism that stops an agent from
|
||||
propagating requests to a failing downstream agent, preventing
|
||||
cascading failures.
|
||||
|
||||
Rollback: The process of reverting an agent's actions and state
|
||||
to a previously recorded checkpoint.
|
||||
|
||||
Blast Radius: The set of agents and systems affected by a
|
||||
single agent's failure.
|
||||
|
||||
3. Problem Statement
|
||||
|
||||
Consider a network operations scenario: Agent A instructs
|
||||
Agent B to update firewall rules, which causes Agent C's
|
||||
traffic monitoring to fail, which causes Agent D to
|
||||
misclassify traffic patterns. Today each agent handles errors
|
||||
independently with no coordination. There is no standard way
|
||||
for Agent D to signal that the root cause is upstream, for the
|
||||
cascade to be halted, or for the chain of actions to be rolled
|
||||
back.
|
||||
|
||||
The only existing draft that partially addresses this space
|
||||
(draft-yue-anima-agent-recovery-networks) focuses on mobile
|
||||
network fault recovery and does not provide general-purpose
|
||||
error recovery primitives usable across agent types.
|
||||
|
||||
4. Checkpoint Mechanism
|
||||
|
||||
An AERR-compliant agent MUST create a checkpoint before any
|
||||
action it classifies as "consequential." An action is
|
||||
consequential if it modifies external state (e.g., network
|
||||
config, database records, API calls with side effects).
|
||||
|
||||
A checkpoint is a JSON object:
|
||||
|
||||
{
|
||||
"checkpoint_id": "urn:uuid:...",
|
||||
"agent_id": "urn:uuid:...",
|
||||
"timestamp": "2026-03-01T12:00:00Z",
|
||||
"action": {
|
||||
"type": "config_update",
|
||||
"target": "router-07.example.com",
|
||||
"description": "Update BGP peer config"
|
||||
},
|
||||
"reversible": true,
|
||||
"rollback_procedure": {
|
||||
"method": "POST",
|
||||
"uri": "https://agent-b.example.com/aerr/rollback",
|
||||
"payload_ref": "urn:uuid:...prior-config-snapshot"
|
||||
},
|
||||
"state_hash": "sha256:abcdef...",
|
||||
"ttl": 86400
|
||||
}
|
||||
|
||||
The "reversible" field MUST be present. If false, the agent
|
||||
declares that this action cannot be automatically undone and
|
||||
rollback requests for this checkpoint MUST be escalated to a
|
||||
human operator.
|
||||
|
||||
The "state_hash" provides integrity verification: the agent
|
||||
hashes its relevant state at checkpoint time so that rollback
|
||||
can verify it is restoring to an authentic prior state.
|
||||
|
||||
Checkpoints MUST be stored for at least the duration specified
|
||||
by "ttl" (seconds). Agents SHOULD store checkpoints in durable
|
||||
storage that survives agent restarts.
|
||||
|
||||
Agents MAY create hierarchical checkpoints where a parent
|
||||
checkpoint groups multiple child checkpoints from a multi-step
|
||||
operation. Rolling back the parent rolls back all children.
|
||||
|
||||
5. Error Signaling
|
||||
|
||||
When an agent detects an error, it MUST emit an AERR error
|
||||
signal to all agents in the current action chain. The error
|
||||
signal is an HTTP POST to each peer's AERR endpoint:
|
||||
|
||||
POST /aerr/error HTTP/1.1
|
||||
Content-Type: application/json
|
||||
|
||||
{
|
||||
"error_id": "urn:uuid:...",
|
||||
"source_agent": "urn:uuid:...",
|
||||
"severity": "critical",
|
||||
"checkpoint_id": "urn:uuid:...",
|
||||
"error_type": "action_failed",
|
||||
"description": "BGP session did not establish after config update",
|
||||
"timestamp": "2026-03-01T12:05:00Z",
|
||||
"upstream_errors": []
|
||||
}
|
||||
|
||||
Severity levels: "info", "warning", "error", "critical".
|
||||
|
||||
Error types: "action_failed", "timeout", "constraint_violation",
"resource_exhausted", "upstream_cascade", "circuit_open", "unknown".
|
||||
|
||||
When an agent receives an error signal caused by an action it
|
||||
initiated, it MUST either:
|
||||
(a) Attempt automatic rollback of its checkpoint, or
|
||||
(b) Escalate to its operator if the action was irreversible.
|
||||
|
||||
The "upstream_errors" array allows agents to chain error
|
||||
context, building a causal trace from the symptom back to the
|
||||
root cause.
|
||||
|
||||
6. Circuit Breaker Pattern
|
||||
|
||||
Each agent MUST implement a circuit breaker for every downstream
|
||||
agent it communicates with. The circuit breaker has three
|
||||
states:
|
||||
|
||||
CLOSED (normal operation): Requests flow through. The agent
|
||||
tracks the error rate over a sliding window (default: 60s).
|
||||
|
||||
OPEN (failure detected): When the error rate exceeds a
|
||||
threshold (default: 50% over the window), the circuit breaker
|
||||
opens. All requests to the downstream agent are immediately
|
||||
rejected with error_type "circuit_open". The agent MUST emit
|
||||
an error signal to upstream peers.
|
||||
|
||||
HALF-OPEN (recovery probe): After a cooldown period (default:
|
||||
30s), the circuit breaker allows a single probe request. If it
|
||||
succeeds, the breaker returns to CLOSED. If it fails, it
|
||||
returns to OPEN with a doubled cooldown (exponential backoff,
|
||||
max 300s).
|
||||
|
||||
Agents MUST expose circuit breaker state at:
|
||||
|
||||
GET /aerr/circuits
|
||||
|
||||
Response:
|
||||
{
|
||||
"circuits": [
|
||||
{
|
||||
"downstream_agent": "urn:uuid:...",
|
||||
"state": "open",
|
||||
"error_rate": 0.75,
|
||||
"last_failure": "2026-03-01T12:05:00Z",
|
||||
"cooldown_remaining_s": 22
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
This enables monitoring systems and upstream agents to
|
||||
understand the health topology of the agent network.
|
||||
|
||||
7. Rollback Protocol
|
||||
|
||||
A rollback is initiated by sending an HTTP POST to the target
|
||||
agent's rollback endpoint:
|
||||
|
||||
POST /aerr/rollback HTTP/1.1
|
||||
Content-Type: application/json
|
||||
|
||||
{
|
||||
"rollback_id": "urn:uuid:...",
|
||||
"checkpoint_id": "urn:uuid:...",
|
||||
"reason": "Upstream action caused cascading failure",
|
||||
"initiator": "urn:uuid:...",
|
||||
"cascade": true
|
||||
}
|
||||
|
||||
When "cascade" is true, the receiving agent MUST also initiate
|
||||
rollback of any downstream checkpoints that were created as a
|
||||
consequence of the checkpointed action. This enables a single
|
||||
rollback request to unwind an entire chain of agent actions.
|
||||
|
||||
The agent MUST respond with a rollback result:
|
||||
|
||||
{
|
||||
"rollback_id": "urn:uuid:...",
|
||||
"status": "completed",
|
||||
"checkpoint_id": "urn:uuid:...",
|
||||
"state_hash_before": "sha256:...",
|
||||
"state_hash_after": "sha256:...",
|
||||
"cascaded_rollbacks": [
|
||||
{"agent_id": "urn:uuid:...", "status": "completed"},
|
||||
{"agent_id": "urn:uuid:...", "status": "escalated"}
|
||||
]
|
||||
}
|
||||
|
||||
Rollback status values: "completed", "partial", "escalated",
|
||||
"failed".
|
||||
|
||||
"escalated" means the action was irreversible and a human
|
||||
operator has been notified. "partial" means some but not all
|
||||
downstream rollbacks succeeded.
|
||||
|
||||
Agents MUST implement idempotent rollback: receiving the same
|
||||
rollback_id twice MUST return the same result without re-
|
||||
executing the rollback.
|
||||
|
||||
8. Security Considerations
|
||||
|
||||
Rollback requests are sensitive operations. Agents MUST
|
||||
authenticate rollback requests using mutual TLS or signed JWTs.
|
||||
Only agents in the same action chain (identified by checkpoint
|
||||
lineage) SHOULD be authorized to request rollback.
|
||||
|
||||
Checkpoint data may contain sensitive system state. Agents
|
||||
MUST encrypt stored checkpoints at rest and MUST NOT include
|
||||
checkpoint contents in error signals.
|
||||
|
||||
Circuit breaker state is observable information about system
|
||||
health. The /aerr/circuits endpoint SHOULD be access-
|
||||
controlled to prevent adversaries from mapping system topology.
|
||||
|
||||
Malicious agents could send false error signals to trigger
|
||||
unnecessary rollbacks. Agents SHOULD verify that error signals
|
||||
reference valid checkpoint IDs from their own action chains
|
||||
before initiating rollback.
|
||||
|
||||
9. IANA Considerations
|
||||
|
||||
This document requests IANA establish the following:
|
||||
|
||||
1. An "AERR Error Type" registry under Specification Required
policy. Initial entries: "action_failed", "timeout",
"constraint_violation", "resource_exhausted",
"upstream_cascade", "circuit_open", "unknown".
|
||||
|
||||
2. An "AERR Severity Level" registry under Specification
|
||||
Required policy. Initial entries: "info", "warning",
|
||||
"error", "critical".
|
||||
|
||||
3. Well-known URI registrations for "aerr/error",
|
||||
"aerr/rollback", and "aerr/circuits" per RFC 8615.
|
||||
|
||||
Author's Address
|
||||
|
||||
Generated by IETF Draft Analyzer
|
||||
2026-03-01
|
||||
386
workspace/drafts/new-drafts/draft-b-atd-agent-task-dag-00.md
Normal file
@@ -0,0 +1,386 @@
|
||||
---
title: "Agent Task DAG (ATD): Execution Model, Checkpoints, and Recovery"
abbrev: "ATD"
category: std
docname: draft-atd-agent-task-dag-00
submissiontype: IETF
number:
date:
v: 3
area: "OPS"
workgroup: "NMOP"
keyword:
  - agent DAG
  - checkpoint
  - rollback
  - error recovery
  - circuit breaker

author:
  -
    fullname: TBD
    organization: Independent
    email: placeholder@example.com

normative:
  RFC2119:
  RFC8174:
  RFC8446:
  I-D.nennemann-wimse-ect:
    title: "Execution Context Tokens for Distributed Agentic Workflows"
    target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/
  I-D.nennemann-agent-dag-hitl-safety:
    title: "Agent Context Policy Token: DAG Delegation with Human Override"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/

informative:
|
||||
|
||||
--- abstract
|
||||
|
||||
This document defines the Agent Task DAG (ATD) specification:
|
||||
execution semantics, checkpoints, error signaling, circuit
|
||||
breakers, and rollback for agent workflows. ATD does not define a
|
||||
new DAG or token format. It defines when agents MUST emit ECT
|
||||
nodes, what those nodes mean, and how to recover when things go
|
||||
wrong. Checkpoints, errors, and rollback results are ECT nodes
|
||||
with specific `exec_act` values and `ext` claims. Rollback walks
|
||||
the ECT DAG backwards. Circuit breakers contain cascading
|
||||
failures. Resource hints enable scheduling. The protocol is
|
||||
transport-agnostic and builds on ECT for evidence and ACP-DAG-HITL
|
||||
for policy.
|
||||
|
||||
--- middle
|
||||
|
||||
# Introduction
|
||||
|
||||
Autonomous agents increasingly make unsupervised decisions, yet no
|
||||
standard exists for how agents checkpoint state, signal errors to
|
||||
peers, contain cascading failures, or roll back decisions gone
|
||||
wrong.
|
||||
|
||||
ATD borrows proven patterns from distributed systems: checkpoints
|
||||
from database transactions, circuit breakers from microservice
|
||||
architectures, and rollback from version control. It adapts these
|
||||
to agent workflows where actions may be partially reversible and
|
||||
where the agent that caused the error may not be the best one to
|
||||
fix it.
|
||||
|
||||
ATD does not define a new DAG format. The ECT DAG
|
||||
{{I-D.nennemann-wimse-ect}} IS the execution graph. ATD defines
|
||||
the semantics of specific node types within that graph.
|
||||
|
||||
Design principles:
|
||||
|
||||
1. Agents that take consequential actions MUST be able to undo
|
||||
them, or MUST declare them irreversible upfront.
|
||||
2. Failure containment takes priority over failure diagnosis.
|
||||
3. The protocol adds minimal overhead to the happy path.
|
||||
|
||||
# Conventions and Definitions
|
||||
|
||||
{::boilerplate bcp14-tagged}
|
||||
|
||||
Checkpoint:
|
||||
: An ECT node recording agent state before a consequential action,
|
||||
sufficient to restore the system to that state.
|
||||
|
||||
Circuit Breaker:
|
||||
: A mechanism that stops an agent from propagating requests to a
|
||||
failing downstream agent, preventing cascading failures.
|
||||
|
||||
Rollback:
|
||||
: The process of reverting an agent's actions and state to a
|
||||
previously recorded checkpoint.
|
||||
|
||||
Blast Radius:
|
||||
: The set of agents and systems affected by a single failure.
|
||||
|
||||
# Node States {#node-states}
|
||||
|
||||
Each task node in the ECT DAG has an implicit state derived from
|
||||
subsequent ECT nodes:
|
||||
|
||||
- **pending**: A delegation node exists in ACP-DAG-HITL but no
|
||||
corresponding ECT has been emitted.
|
||||
- **running**: An ECT with `exec_act` matching the task type has
|
||||
been emitted but no completion or error ECT follows.
|
||||
- **done**: A completion ECT (or the next `par`-linked ECT) exists.
|
||||
- **failed**: An `atd:error` ECT references this node.
|
||||
- **rolled_back**: An `atd:rollback_result` ECT references this
|
||||
node's checkpoint.
|
||||
|
||||
# Checkpoint Mechanism {#checkpoints}
|
||||
|
||||
An ATD-compliant agent MUST create a checkpoint before any action
|
||||
it classifies as consequential. An action is consequential if it
|
||||
modifies external state (network config, database records, API
|
||||
calls with side effects).
|
||||
|
||||
A checkpoint is an ECT with:
|
||||
|
||||
- `exec_act`: `"atd:checkpoint"`
|
||||
- `par`: the ECT of the action being checkpointed
|
||||
|
||||
~~~json
|
||||
{
|
||||
"jti": "ckpt-uuid",
|
||||
"exec_act": "atd:checkpoint",
|
||||
"par": ["action-ect-uuid"],
|
||||
"out_hash": "sha256-of-agent-state-snapshot",
|
||||
"ext": {
|
||||
"atd.reversible": true,
|
||||
"atd.rollback_uri": "https://agent-b.example.com/atd/rollback",
|
||||
"atd.target": "router-07.example.com",
|
||||
"atd.description": "Update BGP peer config",
|
||||
"atd.ttl": 86400
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-checkpoint title="Checkpoint ECT"}
|
||||
|
||||
The `atd.reversible` field MUST be present. If `false`, the agent
|
||||
declares that this action cannot be automatically undone and
|
||||
rollback requests MUST be escalated per the ACP-DAG-HITL
|
||||
`unreachable_human` policy.
|
||||
|
||||
The `out_hash` provides integrity verification: the agent hashes
|
||||
its state at checkpoint time so that rollback can verify it is
|
||||
restoring to an authentic prior state.
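
As an illustration of the integrity check described above, the following minimal sketch hashes a state snapshot and later verifies it. The draft does not define how agent state is serialized or how the digest is encoded; the sorted-key JSON serialization, the hex encoding, and the function names `state_hash` and `verify_checkpoint` are assumptions made for this example only.

```python
import hashlib
import hmac
import json


def state_hash(snapshot: dict) -> str:
    """Return a SHA-256 hex digest of a state snapshot.

    Assumption: the snapshot is JSON-serializable and sorted-key JSON
    is an acceptable canonical form. ATD does not mandate this.
    """
    # sort_keys and fixed separators make equal snapshots serialize identically.
    canonical = json.dumps(snapshot, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_checkpoint(snapshot: dict, out_hash: str) -> bool:
    """Check a stored snapshot against the `out_hash` recorded in the checkpoint ECT."""
    # compare_digest avoids leaking the position of the first mismatch.
    return hmac.compare_digest(state_hash(snapshot), out_hash)
```

An agent would call `state_hash` when it emits the checkpoint ECT and `verify_checkpoint` before restoring state during rollback, refusing to restore when the check fails.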
|
||||
|
||||
Checkpoints MUST be stored for at least `atd.ttl` seconds. Agents
|
||||
SHOULD store checkpoints in durable storage that survives restarts.
|
||||
|
||||
## Hierarchical Checkpoints
|
||||
|
||||
Agents MAY create hierarchical checkpoints where a parent groups
|
||||
multiple child checkpoints from a multi-step operation. Rolling
|
||||
back the parent rolls back all children. The parent checkpoint's
|
||||
`par` array references all child checkpoint `jti` values.
|
||||
|
||||
# Error Signaling {#errors}
|
||||
|
||||
When an agent detects an error, it MUST emit an error ECT:
|
||||
|
||||
- `exec_act`: `"atd:error"`
|
||||
- `par`: the ECT of the failed action
|
||||
|
||||
~~~json
|
||||
{
|
||||
"jti": "error-uuid",
|
||||
"exec_act": "atd:error",
|
||||
"par": ["failed-action-ect-uuid"],
|
||||
"ext": {
|
||||
"atd.severity": "critical",
|
||||
"atd.error_type": "action_failed",
|
||||
"atd.description": "BGP session did not establish",
|
||||
"atd.checkpoint_id": "ckpt-uuid",
|
||||
"atd.upstream_errors": []
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-error title="Error ECT"}
|
||||
|
||||
Severity levels: `info`, `warning`, `error`, `critical`.
|
||||
|
||||
Error types: `action_failed`, `timeout`, `constraint_violation`,
|
||||
`resource_exhausted`, `upstream_cascade`, `unknown`.
|
||||
|
||||
When an agent receives an error signal caused by an action it
|
||||
initiated, it MUST either:
|
||||
|
||||
(a) Attempt automatic rollback of its checkpoint, or
|
||||
(b) Escalate per ACP-DAG-HITL rules if the action was
|
||||
irreversible.
|
||||
|
||||
The `atd.upstream_errors` array allows agents to chain error
|
||||
context, building a causal trace from symptom to root cause.
|
||||
|
||||
## HITL Escalation on Error
|
||||
|
||||
Error ECTs MAY trigger ACP-DAG-HITL rules. A deployment can
|
||||
define HITL rules such as:
|
||||
|
||||
~~~json
|
||||
{
|
||||
"id": "r-critical-error",
|
||||
"trigger": {
|
||||
"kind": "keyword_match",
|
||||
"op": "eq",
|
||||
"value": "critical",
|
||||
"input_ref": "atd.severity"
|
||||
},
|
||||
"required_role": "operator:oncall",
|
||||
"action": "escalate",
|
||||
"allow_override": true,
|
||||
"override_action": "continue"
|
||||
}
|
||||
~~~
|
||||
{: #fig-error-hitl title="HITL Rule for Critical Errors"}
|
||||
|
||||
# Circuit Breaker Pattern {#circuit-breaker}
|
||||
|
||||
Each agent MUST implement a circuit breaker for every downstream
|
||||
agent it communicates with. The circuit breaker has three states:
|
||||
|
||||
CLOSED (normal):
|
||||
: Requests flow through. The agent tracks the error rate over a
|
||||
sliding window (default: 60 seconds).
|
||||
|
||||
OPEN (failure detected):
|
||||
: When the error rate exceeds a threshold (default: 50%), the
|
||||
breaker opens. All requests are immediately rejected. The
|
||||
agent MUST emit a circuit breaker ECT:
|
||||
|
||||
~~~json
|
||||
{
|
||||
"exec_act": "atd:circuit_open",
|
||||
"ext": {
|
||||
"atd.downstream_agent": "spiffe://example.com/agent/b",
|
||||
"atd.error_rate": 0.75,
|
||||
"atd.window_s": 60
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-circuit title="Circuit Breaker ECT"}
|
||||
|
||||
HALF-OPEN (recovery probe):
|
||||
: After a cooldown period (default: 30s), the breaker allows one
|
||||
probe request. If it succeeds, the breaker returns to CLOSED.
|
||||
If it fails, it returns to OPEN with doubled cooldown
|
||||
(exponential backoff, max 300s).
|
||||
|
||||
Circuit breaker thresholds can be configured as ACP-DAG-HITL
|
||||
node constraints:
|
||||
|
||||
~~~json
|
||||
{
|
||||
"constraints": {
|
||||
"atd.circuit_threshold": 0.5,
|
||||
"atd.circuit_window_s": 60
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-circuit-policy title="Circuit Breaker Policy"}
|
||||
|
||||
# Rollback Protocol {#rollback}
|
||||
|
||||
A rollback is initiated by emitting a rollback request ECT and
|
||||
sending an HTTP POST to the target agent's rollback endpoint:
|
||||
|
||||
~~~
|
||||
POST /atd/rollback HTTP/1.1
|
||||
Content-Type: application/json
|
||||
Execution-Context: <rollback-request-ect>
|
||||
~~~
|
||||
|
||||
- `exec_act`: `"atd:rollback_request"`
|
||||
- `par`: the checkpoint ECT to roll back to
|
||||
|
||||
~~~json
|
||||
{
|
||||
"exec_act": "atd:rollback_request",
|
||||
"par": ["ckpt-uuid"],
|
||||
"ext": {
|
||||
"atd.reason": "Upstream action caused cascading failure",
|
||||
"atd.cascade": true
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-rollback-req title="Rollback Request ECT"}
|
||||
|
||||
When `atd.cascade` is `true`, the receiving agent MUST also
|
||||
initiate rollback of any downstream checkpoints created as a
|
||||
consequence of the checkpointed action.
|
||||
|
||||
The agent MUST respond with a rollback result ECT:
|
||||
|
||||
- `exec_act`: `"atd:rollback_result"`
|
||||
- `par`: the rollback request ECT
|
||||
|
||||
~~~json
|
||||
{
|
||||
"exec_act": "atd:rollback_result",
|
||||
"par": ["rollback-request-uuid"],
|
||||
"out_hash": "sha256-of-restored-state",
|
||||
"ext": {
|
||||
"atd.status": "completed",
|
||||
"atd.checkpoint_id": "ckpt-uuid",
|
||||
"atd.cascaded": [
|
||||
{"agent": "spiffe://example.com/agent/c", "status": "completed"},
|
||||
{"agent": "spiffe://example.com/agent/d", "status": "escalated"}
|
||||
]
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-rollback-result title="Rollback Result ECT"}
|
||||
|
||||
Status values: `completed`, `partial`, `escalated`, `failed`.
|
||||
|
||||
`escalated` means the action was irreversible and a human operator
|
||||
has been notified per ACP-DAG-HITL `unreachable_human` policy.
|
||||
|
||||
Agents MUST implement idempotent rollback: receiving the same
|
||||
rollback request ECT `jti` twice MUST return the same result.
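
One way to meet the idempotency requirement is to cache the rollback result under the request `jti`, so that a retried request replays the stored result instead of restoring state a second time. The sketch below assumes a hypothetical `restore` callable that maps a checkpoint id to a status string, and an in-memory cache; a real agent would likely persist results for at least the checkpoint's `atd.ttl`.

```python
from typing import Callable, Dict


class RollbackHandler:
    """Caches rollback results by request `jti` so retries are idempotent."""

    def __init__(self, restore: Callable[[str], str]) -> None:
        # Hypothetical hook: takes a checkpoint id, returns an atd.status value.
        self._restore = restore
        self._results: Dict[str, dict] = {}

    def handle(self, request: dict) -> dict:
        jti = request["jti"]
        if jti in self._results:
            # Duplicate request: replay the stored result, do not restore again.
            return self._results[jti]
        checkpoint_id = request["par"][0]
        status = self._restore(checkpoint_id)
        result = {
            "exec_act": "atd:rollback_result",
            "par": [jti],
            "ext": {"atd.status": status, "atd.checkpoint_id": checkpoint_id},
        }
        self._results[jti] = result
        return result
```

Keying the cache on `jti` rather than on the checkpoint id keeps two distinct requests for the same checkpoint separately auditable.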
|
||||
|
||||
# Resource Hints {#resources}
|
||||
|
||||
Agents MAY declare resource requirements as ECT extension claims
|
||||
or ACP-DAG-HITL node constraints:
|
||||
|
||||
~~~json
|
||||
{
|
||||
"constraints": {
|
||||
"atd.resource_cpu": "2",
|
||||
"atd.resource_memory_mb": 4096,
|
||||
"atd.resource_timeout_s": 300,
|
||||
"atd.resource_priority": "high"
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-resources title="Resource Hints as Node Constraints"}
|
||||
|
||||
Orchestrators (e.g., Kubernetes schedulers, agent gateways) MAY
|
||||
use these hints for scheduling and quota enforcement. Resource
|
||||
hints are advisory; agents MUST NOT depend on them for
|
||||
correctness.
|
||||
|
||||
# Security Considerations
|
||||
|
||||
Rollback requests are sensitive operations. Agents MUST
|
||||
authenticate rollback requests using the ECT identity binding
|
||||
(L2/L3). Only agents in the same workflow (`wid`) with
|
||||
checkpoint lineage in the DAG SHOULD be authorized to request
|
||||
rollback.
|
||||
|
||||
Checkpoint data may contain sensitive system state. Agents MUST
|
||||
encrypt stored checkpoints at rest and MUST NOT include checkpoint
|
||||
contents in error ECTs.
|
||||
|
||||
Circuit breaker state reveals system health topology. The
|
||||
`atd:circuit_open` ECT is part of the audit trail; access to the
|
||||
audit ledger SHOULD be controlled.
|
||||
|
||||
Malicious agents could send false error ECTs to trigger
|
||||
unnecessary rollbacks. Agents SHOULD verify that error ECTs
|
||||
reference valid `par` values within their own workflow DAG.
|
||||
|
||||
# IANA Considerations
|
||||
|
||||
This document requests registration of the following `exec_act`
|
||||
values in a future ECT action type registry:
|
||||
|
||||
- `atd:checkpoint`
|
||||
- `atd:error`
|
||||
- `atd:circuit_open`
|
||||
- `atd:rollback_request`
|
||||
- `atd:rollback_result`
|
||||
|
||||
--- back
|
||||
|
||||
# Acknowledgments
|
||||
{:numbered="false"}
|
||||
|
||||
ATD builds on ECT {{I-D.nennemann-wimse-ect}} for execution
|
||||
evidence and ACP-DAG-HITL {{I-D.nennemann-agent-dag-hitl-safety}}
|
||||
for delegation policy. The circuit breaker pattern is adapted
|
||||
from microservice architecture best practices.
|
||||
725
workspace/drafts/new-drafts/draft-b-atd-agent-task-dag-01.md
Normal file
@@ -0,0 +1,725 @@
|
||||
---
|
||||
title: "Agent Task DAG (ATD): Execution Model, Checkpoints, and Recovery"
|
||||
abbrev: "ATD"
|
||||
category: std
|
||||
docname: draft-atd-agent-task-dag-01
|
||||
submissiontype: IETF
|
||||
number:
|
||||
date:
|
||||
v: 3
|
||||
area: "OPS"
|
||||
workgroup: "NMOP"
|
||||
keyword:
|
||||
- agent DAG
|
||||
- checkpoint
|
||||
- rollback
|
||||
- error recovery
|
||||
- circuit breaker
|
||||
|
||||
author:
|
||||
-
|
||||
fullname: TBD
|
||||
organization: Independent
|
||||
email: placeholder@example.com
|
||||
|
||||
normative:
|
||||
RFC2119:
|
||||
RFC8174:
|
||||
RFC8446:
|
||||
RFC9110:
|
||||
RFC8615:
|
||||
I-D.nennemann-wimse-ect:
|
||||
title: "Execution Context Tokens for Distributed Agentic Workflows"
|
||||
target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/
|
||||
I-D.nennemann-agent-dag-hitl-safety:
|
||||
title: "Agent Context Policy Token: DAG Delegation with Human Override"
|
||||
target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/
|
||||
|
||||
informative:
|
||||
|
||||
--- abstract
|
||||
|
||||
This document defines the Agent Task DAG (ATD) specification:
|
||||
execution semantics, checkpoints, error signaling, circuit
|
||||
breakers, and rollback for agent workflows. ATD does not define a
|
||||
new DAG or token format. It defines when agents MUST emit ECT
|
||||
nodes, what those nodes mean, and how to recover when things go
|
||||
wrong. Checkpoints, errors, and rollback results are ECT nodes
|
||||
with specific `exec_act` values and `ext` claims. Rollback walks
|
||||
the ECT DAG backwards. Circuit breakers contain cascading
|
||||
failures. Resource hints enable scheduling. The protocol is
|
||||
transport-agnostic and builds on ECT for evidence and ACP-DAG-HITL
|
||||
for policy.
|
||||
|
||||
--- middle
|
||||
|
||||
# Introduction
|
||||
|
||||
Autonomous agents increasingly make unsupervised decisions, yet no
|
||||
standard exists for how agents checkpoint state, signal errors to
|
||||
peers, contain cascading failures, or roll back decisions gone
|
||||
wrong.
|
||||
|
||||
ATD borrows proven patterns from distributed systems: checkpoints
|
||||
from database transactions, circuit breakers from microservice
|
||||
architectures, and rollback from version control. It adapts these
|
||||
to agent workflows where actions may be partially reversible and
|
||||
where the agent that caused the error may not be the best one to
|
||||
fix it.
|
||||
|
||||
ATD does not define a new DAG format. The ECT DAG
|
||||
{{I-D.nennemann-wimse-ect}} IS the execution graph. ATD defines
|
||||
the semantics of specific node types within that graph.
|
||||
|
||||
Design principles:
|
||||
|
||||
1. Agents that take consequential actions MUST be able to undo
|
||||
them, or MUST declare them irreversible upfront.
|
||||
2. Failure containment takes priority over failure diagnosis.
|
||||
3. The protocol adds minimal overhead to the happy path.
|
||||
|
||||
# Conventions and Definitions
|
||||
|
||||
{::boilerplate bcp14-tagged}
|
||||
|
||||
Checkpoint:
|
||||
: An ECT node recording agent state before a consequential action,
|
||||
sufficient to restore the system to that state.
|
||||
|
||||
Circuit Breaker:
|
||||
: A mechanism that stops an agent from propagating requests to a
|
||||
failing downstream agent, preventing cascading failures.
|
||||
|
||||
Rollback:
|
||||
: The process of reverting an agent's actions and state to a
|
||||
previously recorded checkpoint.
|
||||
|
||||
Blast Radius:
|
||||
: The set of agents and systems affected by a single failure.
|
||||
|
||||
Consequential Action:
|
||||
: An action that modifies external state (network configuration,
|
||||
database records, API calls with side effects) such that
|
||||
reversal requires explicit effort.
|
||||
|
||||
# Execution Semantics {#execution}
|
||||
|
||||
## Topological Order
|
||||
|
||||
Tasks in the ECT DAG MUST execute in topological order: a task
|
||||
MUST NOT begin execution until all tasks referenced by its ECT
|
||||
`par` claims are in state `done`.
|
||||
|
||||
Two tasks where neither is an ancestor of the other (no `par`
|
||||
path connects them) MAY execute concurrently. Orchestrators SHOULD
|
||||
exploit this parallelism for performance.
|
||||
|
||||
Circular dependencies are prohibited. Agents MUST reject
|
||||
ACP-DAG-HITL delegation DAGs containing cycles.
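
The ordering and cycle rules above can be sketched with Python's standard `graphlib` module. The `parents` mapping (task id to the ids listed in its `par` claim) and the function name `execution_order` are illustrative assumptions, not part of ATD.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List


def execution_order(parents: Dict[str, List[str]]) -> List[str]:
    """Return one valid execution order, or raise ValueError on a cycle.

    `parents` maps each task id to the task ids it depends on,
    mirroring the `par` claim of the task's ECT.
    """
    try:
        # static_order() yields every task after all of its predecessors.
        return list(TopologicalSorter(parents).static_order())
    except CycleError as exc:
        # Per graphlib, args[1] holds the list of nodes forming the cycle.
        raise ValueError(f"delegation DAG contains a cycle: {exc.args[1]}") from exc
```

An orchestrator that wants to exploit parallelism could use the same class incrementally (`prepare()`, `get_ready()`, `done()`) to obtain every task whose parents have all completed.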
|
||||
|
||||
## Workflow Boundary ECTs
|
||||
|
||||
When a workflow begins, the initiating agent MUST emit:
|
||||
|
||||
~~~json
|
||||
{
|
||||
"exec_act": "atd:workflow_start",
|
||||
"ext": {
|
||||
"atd.wf_id": "wf-uuid",
|
||||
"atd.description": "BGP failover workflow",
|
||||
"atd.node_count": 5
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-wf-start title="Workflow Start ECT"}
|
||||
|
||||
When the workflow reaches a terminal state (all leaf nodes
|
||||
complete or any node failed with no rollback path), the
|
||||
orchestrator MUST emit:
|
||||
|
||||
~~~json
|
||||
{
|
||||
"exec_act": "atd:workflow_complete",
|
||||
"par": ["wf-start-ect-uuid"],
|
||||
"ext": {
|
||||
"atd.wf_id": "wf-uuid",
|
||||
"atd.terminal_status": "success",
|
||||
"atd.elapsed_s": 42
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-wf-complete title="Workflow Complete ECT"}
|
||||
|
||||
Terminal status values: `success`, `partial`, `failed`,
|
||||
`rolled_back`, `escalated`.
|
||||
|
||||
# Node States {#node-states}
|
||||
|
||||
Each task node in the ECT DAG has an implicit state derived from
|
||||
subsequent ECT nodes:
|
||||
|
||||
- **pending**: A delegation node exists in ACP-DAG-HITL but no
|
||||
corresponding ECT has been emitted.
|
||||
- **running**: An ECT matching the task type has been emitted
|
||||
but no completion or error ECT follows.
|
||||
- **done**: A completion ECT (or the next `par`-linked ECT) exists.
|
||||
- **failed**: An `atd:error` ECT references this node.
|
||||
- **rolled_back**: An `atd:rollback_result` ECT references this
|
||||
node's checkpoint.
|
||||
- **escalated**: The task failed and a human has been notified
|
||||
per HITL escalation rules.
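
The derivation of node state from the ECT stream can be sketched as follows. The precedence among states (`rolled_back` over `failed` over `done` over `running`), the exclusion of the node's own `atd:checkpoint` ECT from the "next `par`-linked ECT" test, and the function signature are assumptions of this example. The `escalated` state is omitted because this document does not define a specific ECT that marks it.

```python
from typing import Iterable, Optional

# Assumed precedence: a later, stronger signal overrides a weaker one.
_RANK = {"running": 0, "done": 1, "failed": 2, "rolled_back": 3}


def node_state(task_jti: Optional[str],
               checkpoint_jti: Optional[str],
               ects: Iterable[dict]) -> str:
    """Derive a task node's implicit state from the workflow's ECTs."""
    if task_jti is None:
        return "pending"  # delegated, but no ECT emitted for the task yet
    state = "running"
    for ect in ects:
        act = ect.get("exec_act")
        ext = ect.get("ext", {})
        if act == "atd:rollback_result":
            # Rollback results point at the checkpoint, not at the task ECT.
            if checkpoint_jti is None or ext.get("atd.checkpoint_id") != checkpoint_jti:
                continue
            candidate = "rolled_back"
        elif task_jti not in ect.get("par", []):
            continue  # this ECT does not reference the task
        elif act == "atd:error":
            candidate = "failed"
        elif act == "atd:checkpoint":
            continue  # the node's own checkpoint is not a completion signal
        else:
            candidate = "done"  # completion ECT or next par-linked ECT
        if _RANK[candidate] > _RANK[state]:
            state = candidate
    return state
```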
|
||||
|
||||
# Checkpoint Mechanism {#checkpoints}
|
||||
|
||||
## Checkpoint Placement Policy
|
||||
|
||||
An ATD-compliant agent MUST create a checkpoint before any action
|
||||
it classifies as consequential. The following actions are always
|
||||
consequential and MUST be checkpointed:
|
||||
|
||||
1. Any modification to network device configuration.
|
||||
2. Any write to a shared database or external data store.
|
||||
3. Any API call with side effects (HTTP methods that are not safe per {{RFC9110}}).
|
||||
4. Any delegation to another agent that will itself take
|
||||
consequential actions.
|
||||
|
||||
The following SHOULD be checkpointed:
|
||||
|
||||
1. Long-running computations (> `atd.resource_timeout_s`).
|
||||
2. Actions that cannot be verified without external state.
|
||||
|
||||
The following are exempt from checkpoint requirements:
|
||||
|
||||
1. Read-only queries.
|
||||
2. Sending notifications with no side effects.
|
||||
3. Internal state computations with no external observable effect.
|
||||
|
||||
## Checkpoint ECT Format
|
||||
|
||||
A checkpoint is an ECT with:
|
||||
|
||||
- `exec_act`: `"atd:checkpoint"`
|
||||
- `par`: the ECT of the action being checkpointed
|
||||
|
||||
~~~json
|
||||
{
|
||||
"jti": "ckpt-uuid",
|
||||
"exec_act": "atd:checkpoint",
|
||||
"par": ["action-ect-uuid"],
|
||||
"out_hash": "sha256-of-agent-state-snapshot",
|
||||
"ext": {
|
||||
"atd.reversible": true,
|
||||
"atd.rollback_uri": "https://agent-b.example.com/.well-known/atd/rollback",
|
||||
"atd.target": "router-07.example.com",
|
||||
"atd.description": "Update BGP peer config",
|
||||
"atd.ttl": 86400
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-checkpoint title="Checkpoint ECT"}
|
||||
|
||||
The `atd.reversible` field MUST be present. If `false`, the agent
|
||||
declares that this action cannot be automatically undone and
|
||||
rollback requests MUST be escalated per the ACP-DAG-HITL
|
||||
`unreachable_human` policy.
|
||||
|
||||
The `out_hash` provides integrity verification: the agent hashes
|
||||
its state at checkpoint time so that rollback can verify it is
|
||||
restoring to an authentic prior state.
|
||||
|
||||
Checkpoints MUST be stored for at least `atd.ttl` seconds. Agents
|
||||
SHOULD store checkpoints in durable storage that survives restarts.
|
||||
|
||||
The rollback URI MUST be a well-known URI per {{RFC8615}} at the
|
||||
path `/.well-known/atd/rollback`.
|
||||
|
||||
## Hierarchical Checkpoints
|
||||
|
||||
Agents MAY create hierarchical checkpoints where a parent groups
|
||||
multiple child checkpoints from a multi-step operation. Rolling
|
||||
back the parent rolls back all children. The parent checkpoint's
|
||||
`par` array references all child checkpoint `jti` values.
|
||||
|
||||
## ATD `exec_act` Values
|
||||
|
||||
| `exec_act` value | When emitted | Required `ext` fields |
|
||||
|-----------------|-------------|----------------------|
|
||||
| `atd:checkpoint` | Before consequential action | `atd.reversible`, `atd.rollback_uri`, `atd.ttl` |
|
||||
| `atd:error` | On failure detection | `atd.severity`, `atd.error_type`, `atd.checkpoint_id` |
|
||||
| `atd:circuit_open` | When error rate exceeds threshold | `atd.downstream_agent`, `atd.error_rate`, `atd.window_s` |
|
||||
| `atd:circuit_close` | When probe succeeds in HALF-OPEN | `atd.downstream_agent`, `atd.cooldown_s` |
|
||||
| `atd:rollback_request` | To initiate rollback | `atd.reason`, `atd.cascade` |
|
||||
| `atd:rollback_result` | Rollback complete or failed | `atd.status`, `atd.checkpoint_id`, `atd.cascaded` |
|
||||
| `atd:workflow_start` | Workflow begins | `atd.wf_id`, `atd.description` |
|
||||
| `atd:workflow_complete` | Workflow terminal | `atd.wf_id`, `atd.terminal_status` |
|
||||
{: #fig-actions title="ATD exec_act Values"}
|
||||
|
||||
# Error Signaling {#errors}
|
||||
|
||||
When an agent detects an error, it MUST emit an error ECT:
|
||||
|
||||
- `exec_act`: `"atd:error"`
|
||||
- `par`: the ECT of the failed action
|
||||
|
||||
~~~json
|
||||
{
|
||||
"jti": "error-uuid",
|
||||
"exec_act": "atd:error",
|
||||
"par": ["failed-action-ect-uuid"],
|
||||
"ext": {
|
||||
"atd.severity": "critical",
|
||||
"atd.error_type": "action_failed",
|
||||
"atd.description": "BGP session did not establish",
|
||||
"atd.checkpoint_id": "ckpt-uuid",
|
||||
"atd.upstream_errors": []
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-error title="Error ECT"}
|
||||
|
||||
Severity levels (in increasing order): `info`, `warning`,
|
||||
`error`, `critical`.
|
||||
|
||||
Error types: `action_failed`, `timeout`, `constraint_violation`,
|
||||
`resource_exhausted`, `upstream_cascade`, `unknown`.
|
||||
|
||||
When an agent receives an error signal caused by an action it
|
||||
initiated, it MUST either:
|
||||
|
||||
(a) Attempt automatic rollback of its checkpoint, or
|
||||
(b) Escalate per ACP-DAG-HITL rules if the action was
|
||||
irreversible.
|
||||
|
||||
The `atd.upstream_errors` array allows agents to chain error
|
||||
context, building a causal trace from symptom to root cause.
|
||||
|
||||
## HITL Escalation on Error
|
||||
|
||||
Error ECTs with severity `critical` SHOULD trigger HITL
|
||||
escalation. Deployments SHOULD define ACP-DAG-HITL rules such
|
||||
as:
|
||||
|
||||
~~~json
|
||||
{
|
||||
"id": "r-critical-error",
|
||||
"trigger": {
|
||||
"kind": "keyword_match",
|
||||
"op": "eq",
|
||||
"value": "critical",
|
||||
"input_ref": "atd.severity"
|
||||
},
|
||||
"required_role": "operator:oncall",
|
||||
"action": "escalate",
|
||||
"allow_override": true,
|
||||
"override_action": "continue"
|
||||
}
|
||||
~~~
|
||||
{: #fig-error-hitl title="HITL Rule for Critical Errors"}
|
||||
|
||||
# Circuit Breaker Pattern {#circuit-breaker}
|
||||
|
||||
Each agent MUST implement a circuit breaker for every downstream
|
||||
agent it communicates with. The circuit breaker has three states:
|
||||
|
||||
CLOSED (normal):
|
||||
: Requests flow through. The agent tracks the error rate over a
|
||||
sliding window (default: 60 seconds).
|
||||
|
||||
OPEN (failure detected):
|
||||
: When the error rate exceeds a threshold (default: 50%), the
|
||||
breaker opens. All requests are immediately rejected. The
|
||||
agent MUST emit a circuit breaker open ECT:
|
||||
|
||||
~~~json
|
||||
{
|
||||
"exec_act": "atd:circuit_open",
|
||||
"ext": {
|
||||
"atd.downstream_agent": "spiffe://example.com/agent/b",
|
||||
"atd.error_rate": 0.75,
|
||||
"atd.window_s": 60
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-circuit-open title="Circuit Breaker Open ECT"}
|
||||
|
||||
HALF-OPEN (recovery probe):
|
||||
: After a cooldown period (default: 30s), the breaker allows one
|
||||
probe request. If it succeeds, the breaker returns to CLOSED
|
||||
and MUST emit:
|
||||
|
||||
~~~json
|
||||
{
|
||||
"exec_act": "atd:circuit_close",
|
||||
"ext": {
|
||||
"atd.downstream_agent": "spiffe://example.com/agent/b",
|
||||
"atd.cooldown_s": 30
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-circuit-close title="Circuit Breaker Close ECT"}
|
||||
|
||||
If the probe fails, the breaker returns to OPEN with doubled
|
||||
cooldown (exponential backoff, max 300s).
|
||||
|
||||
## Circuit Breaker State Machine
|
||||
|
||||
~~~
|
||||
error_rate > threshold
|
||||
CLOSED ─────────────────────────► OPEN
|
||||
▲ │
|
||||
│ probe success │ cooldown expires
|
||||
│ ▼
|
||||
└────────────────────────── HALF-OPEN
|
||||
probe failure ──► OPEN (cooldown * 2)
|
||||
~~~
|
||||
{: #fig-fsm title="Circuit Breaker State Machine"}
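
The state machine can be sketched in a few lines of Python. The defaults (50% threshold, 60 s window, 30 s cooldown, 300 s maximum) come from the text above. Resetting the cooldown to its base value when the breaker closes, ignoring late responses while OPEN, and the class and method names are assumptions of this example. Emission of the `atd:circuit_open` and `atd:circuit_close` ECTs is left out for brevity.

```python
from collections import deque


class CircuitBreaker:
    """CLOSED -> OPEN -> HALF_OPEN breaker with exponential cooldown backoff."""

    def __init__(self, threshold=0.5, window_s=60.0,
                 cooldown_s=30.0, max_cooldown_s=300.0):
        self.threshold = threshold
        self.window_s = window_s
        self.base_cooldown_s = cooldown_s
        self.cooldown_s = cooldown_s
        self.max_cooldown_s = max_cooldown_s
        self.state = "CLOSED"
        self._events = deque()  # (timestamp, is_error) pairs inside the window
        self._opened_at = 0.0

    def allow(self, now):
        """Return True if a request may be sent at time `now` (seconds)."""
        if self.state == "OPEN" and now - self._opened_at >= self.cooldown_s:
            self.state = "HALF_OPEN"  # cooldown expired: admit exactly one probe
            return True
        return self.state == "CLOSED"

    def record(self, now, ok):
        """Record the outcome of a request that `allow` admitted."""
        if self.state == "HALF_OPEN":
            if ok:
                self.state = "CLOSED"
                self.cooldown_s = self.base_cooldown_s  # assumption: reset backoff
                self._events.clear()
            else:
                self.cooldown_s = min(self.cooldown_s * 2, self.max_cooldown_s)
                self._open(now)
            return
        if self.state == "OPEN":
            return  # assumption: late responses do not extend the cooldown
        self._events.append((now, not ok))
        # Drop outcomes that have slid out of the window.
        while now - self._events[0][0] > self.window_s:
            self._events.popleft()
        errors = sum(1 for _, is_error in self._events if is_error)
        if errors / len(self._events) > self.threshold:
            self._open(now)

    def _open(self, now):
        self.state = "OPEN"
        self._opened_at = now
```

With these defaults a single early failure opens the breaker, because one error out of one request exceeds 50%. A deployment may want a minimum sample count before evaluating the rate; this document does not specify one.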
|
||||
|
||||
## Coordinated Circuit Breaking
|
||||
|
||||
When multiple agents share a downstream dependency, each maintains
|
||||
its own circuit breaker independently. However, agents SHOULD
|
||||
publish circuit breaker state via their ECT stream so peers can
|
||||
observe the signal.
|
||||
|
||||
If an orchestrator observes N circuit breakers (N being deployment-defined) opening for the
|
||||
same downstream agent within a short window, it SHOULD initiate
|
||||
a HITL escalation rather than allowing N parallel recovery probes.
|
||||
|
||||
## Circuit Breaker Policy Configuration
|
||||
|
||||
Circuit breaker thresholds can be configured as ACP-DAG-HITL
|
||||
node constraints:
|
||||
|
||||
~~~json
|
||||
{
|
||||
"constraints": {
|
||||
"atd.circuit_threshold": 0.5,
|
||||
"atd.circuit_window_s": 60
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-circuit-policy title="Circuit Breaker Policy"}
|
||||
|
||||
# Rollback Protocol {#rollback}
|
||||
|
||||
## Basic Rollback
|
||||
|
||||
A rollback is initiated by emitting a rollback request ECT and
|
||||
sending an HTTP POST to the target agent's rollback endpoint:
|
||||
|
||||
~~~
|
||||
POST /.well-known/atd/rollback HTTP/1.1
|
||||
Content-Type: application/json
|
||||
Execution-Context: <rollback-request-ect>
|
||||
~~~
|
||||
|
||||
- `exec_act`: `"atd:rollback_request"`
|
||||
- `par`: the checkpoint ECT to roll back to
|
||||
|
||||
~~~json
|
||||
{
|
||||
"exec_act": "atd:rollback_request",
|
||||
"par": ["ckpt-uuid"],
|
||||
"ext": {
|
||||
"atd.reason": "Upstream action caused cascading failure",
|
||||
"atd.cascade": true
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-rollback-req title="Rollback Request ECT"}
|
||||
|
||||
When `atd.cascade` is `true`, the receiving agent MUST also
|
||||
initiate rollback of any downstream checkpoints created as a
|
||||
consequence of the checkpointed action.
|
||||
|
||||
The agent MUST respond with a rollback result ECT:
|
||||
|
||||
~~~json
|
||||
{
|
||||
"exec_act": "atd:rollback_result",
|
||||
"par": ["rollback-request-uuid"],
|
||||
"out_hash": "sha256-of-restored-state",
|
||||
"ext": {
|
||||
"atd.status": "completed",
|
||||
"atd.checkpoint_id": "ckpt-uuid",
|
||||
"atd.cascaded": [
|
||||
{"agent": "spiffe://example.com/agent/c", "status": "completed"},
|
||||
{"agent": "spiffe://example.com/agent/d", "status": "escalated"}
|
||||
]
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-rollback-result title="Rollback Result ECT"}
|
||||
|
||||
Status values: `completed`, `partial`, `escalated`, `failed`.
|
||||
|
||||
`escalated` means the action was irreversible and a human operator
|
||||
has been notified per ACP-DAG-HITL `unreachable_human` policy.
|
||||
|
||||
## Partial Rollback and Blast Radius Containment
|
||||
|
||||
When a failure occurs in the middle of a DAG, it is often
|
||||
undesirable to roll back the entire workflow. ATD defines
|
||||
partial rollback as rolling back the failed subgraph while
|
||||
preserving completed sibling branches.
|
||||
|
||||
Partial rollback MUST only proceed if:
|
||||
|
||||
1. The checkpoints to be rolled back are in the same workflow
|
||||
(`atd.wf_id`).
|
||||
2. No completed sibling task depends on the output of the
|
||||
failed task (verified by walking the DAG forward from the
|
||||
checkpoint).
|
||||
|
||||
The blast radius is the set of agents holding checkpoints that
|
||||
are descendants of the failed node. Orchestrators SHOULD
|
||||
compute blast radius before initiating cascade rollback to
|
||||
avoid unnecessary disruption.
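
Computing the blast radius amounts to a forward traversal of the DAG from the failed node. In the sketch below, each ECT is a dictionary with `jti`, `par`, and `exec_act`, and the agent is identified by an `iss` claim. Using `iss` as the agent identifier is an assumption of this example, as is the function name.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def blast_radius(failed_jti: str, ects: Iterable[dict]) -> Set[str]:
    """Return the agents holding checkpoints that descend from the failed node."""
    # Index child ECTs by parent so the DAG can be walked forward.
    children: Dict[str, List[dict]] = {}
    for ect in ects:
        for parent in ect.get("par", []):
            children.setdefault(parent, []).append(ect)

    agents: Set[str] = set()
    seen = {failed_jti}
    queue = deque([failed_jti])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child["jti"] in seen:
                continue  # a node may be reachable through several parents
            seen.add(child["jti"])
            queue.append(child["jti"])
            if child.get("exec_act") == "atd:checkpoint" and "iss" in child:
                agents.add(child["iss"])
    return agents
```

Sibling branches that do not descend from the failed node never enter the traversal, which is what allows them to be preserved during partial rollback.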
|
||||
|
||||
## Rollback Timeout and Escalation
|
||||
|
||||
Each rollback request carries an implicit timeout derived from
|
||||
the original checkpoint's `atd.ttl`. If rollback has not
|
||||
completed within `atd.ttl / 2` seconds, the agent MUST:
|
||||
|
||||
1. Emit an `atd:error` ECT with `atd.error_type: "timeout"` and
|
||||
`atd.description` noting rollback timeout.
|
||||
2. Escalate to HITL per {{hitl-escalation}}.
|
||||
|
||||
Agents MUST implement idempotent rollback: receiving the same
|
||||
rollback request ECT `jti` twice MUST return the same result.
|
||||
|
||||
## Rollback Authorization {#rollback-authz}
|
||||
|
||||
Only agents within the same workflow (`wid`) with checkpoint
|
||||
lineage in the DAG SHOULD be authorized to request rollback.
|
||||
Rollback requests from outside the originating workflow MUST be
|
||||
rejected with HTTP 403.
|
||||
|
||||
# Interaction with HITL {#hitl-escalation}
|
||||
|
||||
ATD escalates to HITL in the following scenarios:
|
||||
|
||||
1. **Irreversible action failure**: An error ECT whose checkpoint
|
||||
(`atd.checkpoint_id`) has `atd.reversible: false` MUST trigger
|
||||
HITL Level 2 (approval required) per the companion HITL
|
||||
specification.
|
||||
|
||||
2. **Rollback failure**: A rollback result with `atd.status:
|
||||
"failed"` MUST trigger HITL Level 3 (STOP) on the workflow.
|
||||
|
||||
3. **Cascaded rollback of critical nodes**: When `atd.cascade:
|
||||
true` rollback propagates to a node with `atd.severity:
|
||||
critical`, HITL SHOULD be triggered at Level 1 (PAUSE)
|
||||
to allow human review before proceeding.
|
||||
|
||||
4. **Circuit breaker permanent open**: If a circuit breaker
|
||||
re-opens after 3 successive failed HALF-OPEN probes, HITL Level 2
|
||||
escalation SHOULD be triggered.
|
||||
|
||||
ATD-to-HITL escalation is recorded as an ECT linked to both
|
||||
the triggering error ECT and the HITL override ECT, preserving
|
||||
the causal chain in the audit DAG.
|
||||
|
||||
# Resource Hints {#resources}
|
||||
|
||||
## Resource Claim Format
|
||||
|
||||
Agents MAY declare resource requirements as ACP-DAG-HITL node
|
||||
constraints:
|
||||
|
||||
~~~json
|
||||
{
|
||||
"constraints": {
|
||||
"atd.resource_cpu": "2",
|
||||
"atd.resource_memory_mb": 4096,
|
||||
"atd.resource_timeout_s": 300,
|
||||
"atd.resource_priority": "high",
|
||||
"atd.resource_gpu": "0",
|
||||
"atd.resource_network_mbps": 100
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-resources title="Resource Hints as Node Constraints"}
|
||||
|
||||
## Priority Levels
|
||||
|
||||
The `atd.resource_priority` field MUST be one of: `critical`,
|
||||
`high`, `normal`, `low`. Orchestrators SHOULD map these to
|
||||
scheduling priority classes (e.g., Kubernetes QoS classes:
|
||||
`critical` → Guaranteed, `high`/`normal` → Burstable, `low`
|
||||
→ BestEffort).
|
||||
|
||||
## Fair-Share Scheduling
|
||||
|
||||
When multiple agents compete for a shared resource pool,
|
||||
orchestrators SHOULD implement fair-share scheduling:
|
||||
|
||||
1. Each active workflow receives an equal base allocation.
|
||||
2. Unused allocation from `low` priority agents is redistributed
|
||||
to `high`/`critical` agents within the same scheduling cycle.
|
||||
3. Starvation prevention: `low` priority agents MUST eventually
|
||||
be scheduled within a configurable maximum wait (default: 300s).
|
||||
|
||||
## Unsatisfiable Resource Hints
|
||||
|
||||
Resource hints are advisory; agents MUST NOT depend on them for
|
||||
correctness. When resource hints cannot be satisfied:
|
||||
|
||||
- If `atd.resource_priority` is `critical`: orchestrator SHOULD
|
||||
pre-empt lower-priority tasks.
|
||||
- If `critical` tasks still cannot be scheduled within 60s:
|
||||
emit `atd:error` with `atd.error_type: "resource_exhausted"` and
|
||||
escalate to HITL.
|
||||
- All other priorities: proceed with degraded resources; log
|
||||
a warning via `atd:error` with `atd.severity: "warning"`.
|
||||
|
||||
# Optional Declarative Workflow Format {#workflow-format}
|
||||
|
||||
To support pre-run planning and tooling, ATD defines an optional
|
||||
declarative workflow descriptor. This is a planning artifact
|
||||
only; at runtime it is realized as ECTs per this specification.
|
||||
|
||||
~~~json
|
||||
{
|
||||
"wf_id": "bgp-failover-v2",
|
||||
"description": "BGP peer failover with validation",
|
||||
"nodes": [
|
||||
{
|
||||
"id": "n1",
|
||||
"label": "validate-config",
|
||||
"reversible": true,
|
||||
"hitl_required": false,
|
||||
"resource_hints": {
|
||||
"priority": "normal",
|
||||
"timeout_s": 30
|
||||
}
|
||||
},
|
||||
{
|
||||
"id": "n2",
|
||||
"label": "update-bgp-peer",
|
||||
"reversible": true,
|
||||
"hitl_required": true,
|
||||
"resource_hints": {
|
||||
"priority": "critical",
|
||||
"timeout_s": 120
|
||||
}
|
||||
},
|
||||
{
|
||||
"id": "n3",
|
||||
"label": "verify-session",
|
||||
"reversible": false,
|
||||
"hitl_required": false,
|
||||
"resource_hints": {
|
||||
"priority": "high",
|
||||
"timeout_s": 60
|
||||
}
|
||||
}
|
||||
],
|
||||
"edges": [
|
||||
{"from": "n1", "to": "n2"},
|
||||
{"from": "n2", "to": "n3"}
|
||||
]
|
||||
}
|
||||
~~~
|
||||
{: #fig-workflow title="Declarative Workflow Descriptor"}
|
||||
|
||||
The workflow descriptor media type is
|
||||
`application/atd-workflow+json`. Orchestrators MAY store and
|
||||
version workflow descriptors independently of their ECT runtime
|
||||
realization.
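
Before a descriptor like the one in the figure above is realized as ECTs, tooling may want to confirm that it really describes a DAG. The following sketch is one way to do that with the Python standard library (`graphlib`, Python 3.9 or later). The function name and the error messages are assumptions; only the `nodes`, `edges`, `id`, `from`, and `to` fields come from the descriptor format.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(descriptor):
    """Validate a workflow descriptor and return node ids in a runnable order."""
    ids = [node["id"] for node in descriptor["nodes"]]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate node id")
    # Map each node to the set of nodes that must finish before it.
    predecessors = {node_id: set() for node_id in ids}
    for edge in descriptor["edges"]:
        if edge["from"] not in predecessors or edge["to"] not in predecessors:
            raise ValueError(f"edge references unknown node: {edge}")
        predecessors[edge["to"]].add(edge["from"])
    try:
        return list(TopologicalSorter(predecessors).static_order())
    except CycleError as exc:
        raise ValueError("workflow graph contains a cycle") from exc
```

For the three-node example above, this yields `n1`, `n2`, `n3`. A descriptor whose edges form a loop is rejected rather than scheduled.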
|
||||
|
||||
The `hitl_required` field tells the HITL system that this node
MUST have an approval gate as defined in the companion HITL
specification.
|
||||
|
||||
# Security Considerations
|
||||
|
||||
## Rollback Authorization {#rollback-authz}
|
||||
|
||||
Rollback requests are high-privilege operations. Agents MUST
|
||||
authenticate rollback requests using the ECT identity binding
|
||||
(L2/L3). The rollback endpoint MUST require mutual TLS or a
|
||||
signed JWT from an agent within the same workflow DAG.
|
||||
|
||||
Rollback of a checkpoint SHOULD be authorized only for agents that
are ancestors of that checkpoint in the ECT DAG.
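
The ancestor rule can be checked by walking `par` links upward from the checkpoint. The sketch below assumes an in-memory index of ECTs keyed by id, where each entry records its issuing agent under `iss` and its parents under `par`. That index, the `iss` field name, and the function name are illustrative assumptions rather than anything defined by ATD.

```python
def may_request_rollback(ects, requester, checkpoint_id):
    """Return True if `requester` issued any ancestor ECT of the checkpoint.

    `ects` maps ECT id -> {"iss": issuing agent, "par": [parent ECT ids]}.
    The checkpoint ECT itself is not counted as its own ancestor.
    """
    seen = set()
    stack = list(ects[checkpoint_id]["par"])
    while stack:
        ect_id = stack.pop()
        if ect_id in seen or ect_id not in ects:
            continue  # already visited, or parent unknown to this index
        seen.add(ect_id)
        if ects[ect_id]["iss"] == requester:
            return True
        stack.extend(ects[ect_id]["par"])
    return False
```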
|
||||
|
||||
## Checkpoint Confidentiality
|
||||
|
||||
Checkpoint data may contain sensitive system state (API keys,
|
||||
session tokens, configuration). Agents MUST:
|
||||
|
||||
- Encrypt stored checkpoints at rest.
|
||||
- Reference checkpoint state via `out_hash` only in ECTs.
|
||||
- Exclude checkpoint contents from error ECTs.
|
||||
|
||||
## False Error Injection
|
||||
|
||||
A malicious agent could send false `atd:error` ECTs to trigger
|
||||
unnecessary rollbacks and disrupt workflows. Mitigation:
|
||||
|
||||
- Agents SHOULD verify that error ECTs reference valid `par`
|
||||
values within their own workflow DAG (`wid` claim).
|
||||
- Rollback MUST require authentication (see {{rollback-authz}}).
|
||||
- L2/L3 ECT signing prevents unauthenticated error injection.
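
The first mitigation can be approximated with a lookup against the ECTs an agent has already seen. In this sketch, `known_ects` is a hypothetical local map from ECT id to `wid`. The check is a plausibility filter only and does not replace signature verification.

```python
def error_ect_plausible(error_ect, known_ects):
    """Accept an error ECT only if every parent belongs to the same workflow.

    `known_ects` maps ECT id -> wid for ECTs this agent has observed.
    """
    parents = error_ect.get("par", [])
    wid = error_ect.get("wid")
    if not parents or wid is None:
        return False
    return all(known_ects.get(parent) == wid for parent in parents)
```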
|
||||
|
||||
## Checkpoint Flooding
|
||||
|
||||
An adversary could exhaust checkpoint storage by triggering
|
||||
many checkpoints. Mitigation:
|
||||
|
||||
- Agents SHOULD enforce a maximum checkpoint count per workflow.
|
||||
- Expired checkpoints (past `atd.ttl`) MUST be purged.
|
||||
- Checkpoint creation rate SHOULD be rate-limited per calling
|
||||
workflow.
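
A minimal in-memory sketch of the three mitigations follows. The class name, the default limits, and the one-minute rate window are assumptions chosen for illustration; a real agent would pick limits to suit its storage and would persist this state.

```python
import time
from collections import defaultdict, deque


class CheckpointStore:
    def __init__(self, max_per_workflow=100, max_per_minute=10,
                 clock=time.monotonic):
        self.max_per_workflow = max_per_workflow
        self.max_per_minute = max_per_minute
        self.clock = clock
        self.checkpoints = defaultdict(dict)  # wid -> {checkpoint id: expiry}
        self.recent = defaultdict(deque)      # wid -> creation timestamps

    def purge_expired(self, wid):
        """Drop checkpoints whose TTL has passed."""
        now = self.clock()
        self.checkpoints[wid] = {
            cid: expiry
            for cid, expiry in self.checkpoints[wid].items()
            if expiry > now
        }

    def create(self, wid, checkpoint_id, ttl_s):
        """Record a checkpoint; return False if a limit refuses it."""
        now = self.clock()
        self.purge_expired(wid)
        window = self.recent[wid]
        while window and now - window[0] >= 60:
            window.popleft()  # forget creations older than one minute
        if len(window) >= self.max_per_minute:
            return False      # rate limit per calling workflow
        if len(self.checkpoints[wid]) >= self.max_per_workflow:
            return False      # maximum checkpoint count per workflow
        self.checkpoints[wid][checkpoint_id] = now + ttl_s
        window.append(now)
        return True
```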
|
||||
|
||||
## Circuit Breaker State Leakage
|
||||
|
||||
The `atd:circuit_open` ECT reveals system health topology. The
|
||||
audit ledger SHOULD enforce access controls: only agents within
|
||||
the same workflow or authorized operators SHOULD be able to query
|
||||
circuit breaker history.
|
||||
|
||||
# IANA Considerations
|
||||
|
||||
This document requests registration of the following values in
|
||||
the AEM Ecosystem Extension Registry established by
|
||||
draft-aem-agent-ecosystem-model:
|
||||
|
||||
## `exec_act` Values
|
||||
|
||||
| Value | Description | Reference |
|
||||
|-------|-------------|-----------|
|
||||
| `atd:checkpoint` | State snapshot before consequential action | This document |
|
||||
| `atd:error` | Error signal with severity and type | This document |
|
||||
| `atd:circuit_open` | Circuit breaker opened to downstream agent | This document |
|
||||
| `atd:circuit_close` | Circuit breaker returned to CLOSED state | This document |
|
||||
| `atd:rollback_request` | Initiate rollback to named checkpoint | This document |
|
||||
| `atd:rollback_result` | Result of rollback attempt | This document |
|
||||
| `atd:workflow_start` | Workflow began execution | This document |
|
||||
| `atd:workflow_complete` | Workflow reached terminal state | This document |
|
||||
{: #fig-iana-actions title="ATD exec_act Registrations"}
|
||||
|
||||
## Well-Known URI
|
||||
|
||||
This document requests registration of `atd/rollback` as a
|
||||
well-known URI suffix per {{RFC8615}}.
|
||||
|
||||
## Media Type
|
||||
|
||||
This document requests registration of
|
||||
`application/atd-workflow+json` for the declarative workflow
|
||||
descriptor format defined in {{workflow-format}}.
|
||||
|
||||
--- back
|
||||
|
||||
# Acknowledgments
|
||||
{:numbered="false"}
|
||||
|
||||
ATD builds on ECT {{I-D.nennemann-wimse-ect}} for execution
|
||||
evidence and ACP-DAG-HITL {{I-D.nennemann-agent-dag-hitl-safety}}
|
||||
for delegation policy. The circuit breaker pattern is adapted
|
||||
from microservice architecture best practices. The declarative
|
||||
workflow format is inspired by workflow description languages
|
||||
(BPEL, BPMN) adapted for lightweight agent coordination.
|
||||
368
workspace/drafts/new-drafts/draft-c-hitl-human-in-the-loop-00.md
Normal file
@@ -0,0 +1,368 @@
|
||||
---
|
||||
title: "Human-in-the-Loop (HITL) Primitives for Agent Ecosystems"
|
||||
abbrev: "HITL"
|
||||
category: std
|
||||
docname: draft-hitl-human-in-the-loop-00
|
||||
submissiontype: IETF
|
||||
number:
|
||||
date:
|
||||
v: 3
|
||||
area: "OPS"
|
||||
workgroup: "NMOP"
|
||||
keyword:
|
||||
- human override
|
||||
- HITL
|
||||
- emergency stop
|
||||
- agentic safety
|
||||
|
||||
author:
|
||||
-
|
||||
fullname: TBD
|
||||
organization: Independent
|
||||
email: placeholder@example.com
|
||||
|
||||
normative:
|
||||
RFC2119:
|
||||
RFC8174:
|
||||
RFC7519:
|
||||
RFC8446:
|
||||
RFC8615:
|
||||
I-D.nennemann-wimse-ect:
|
||||
title: "Execution Context Tokens for Distributed Agentic Workflows"
|
||||
target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/
|
||||
I-D.nennemann-agent-dag-hitl-safety:
|
||||
title: "Agent Context Policy Token: DAG Delegation with Human Override"
|
||||
target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/
|
||||
|
||||
informative:
|
||||
|
||||
--- abstract
|
||||
|
||||
This document defines runtime HITL (Human-in-the-Loop) primitives
|
||||
for agent ecosystems: four escalating override levels, approval
|
||||
gates, escalation paths, and explainability hooks. ACP-DAG-HITL
|
||||
defines WHEN humans must intervene (policy rules and triggers).
|
||||
This specification defines HOW the intervention actually happens at
|
||||
the protocol level: the HTTP endpoints, override semantics, agent
|
||||
compliance requirements, and acknowledgment flows. All overrides
|
||||
and decisions produce ECT nodes, making human interventions part of
|
||||
the same auditable DAG as agent actions.
|
||||
|
||||
--- middle
|
||||
|
||||
# Introduction
|
||||
|
||||
The current ratio of autonomous capability drafts to human
|
||||
oversight drafts in the IETF is roughly 7:1. Agents can act but
|
||||
humans cannot reliably stop them.
|
||||
|
||||
ACP-DAG-HITL {{I-D.nennemann-agent-dag-hitl-safety}} defines the
|
||||
policy: trigger conditions, required roles, and actions (`pause`,
|
||||
`escalate`, `abort`). But it deliberately defers the runtime
|
||||
protocol — how does an operator actually send a stop command? How
|
||||
does the agent acknowledge it? What happens if the operator is
|
||||
unreachable?
|
||||
|
||||
This specification fills that gap. It is the runtime enforcement
|
||||
companion to ACP-DAG-HITL, inspired by industrial safety systems:
|
||||
the e-stop button on factory equipment, the circuit breaker in
|
||||
electrical systems, and the kill switch in robotics.
|
||||
|
||||
HITL is deliberately not a governance framework, policy language,
|
||||
or accountability protocol. It is a panic button with a
|
||||
well-defined interface.
|
||||
|
||||
# Conventions and Definitions
|
||||
|
||||
{::boilerplate bcp14-tagged}
|
||||
|
||||
Override:
|
||||
: A human-initiated command that alters an agent's autonomous
|
||||
operation, taking precedence over the agent's own decisions.
|
||||
|
||||
Operator:
|
||||
: A human user authorized to issue override commands.
|
||||
|
||||
Approval Gate:
|
||||
: A DAG node that blocks workflow progression until a human
|
||||
approves or rejects continuation.
|
||||
|
||||
# Relationship to ACP-DAG-HITL {#mapping}
|
||||
|
||||
ACP-DAG-HITL defines three HITL actions. This specification maps
each of them to a runtime override level and adds a fourth level,
CONSTRAIN (partial restriction), which has no ACP-DAG-HITL equivalent:
|
||||
|
||||
| ACP-DAG-HITL action | HITL Override Level | Behavior |
|
||||
|---------------------|---------------------|----------|
|
||||
| `pause` | Level 1: PAUSE | Suspend autonomous actions, hold state |
|
||||
| (no equivalent) | Level 2: CONSTRAIN | Restrict to an allowlist of actions |
|
||||
| `abort` | Level 3: STOP | Cease all actions, enter inert state |
|
||||
| `escalate` | Level 4: TAKEOVER | Transfer control to human operator |
|
||||
{: #fig-mapping title="ACP-DAG-HITL to HITL Level Mapping"}
|
||||
|
||||
When ACP-DAG-HITL rules trigger, the runtime system uses the
|
||||
corresponding HITL level to enforce the action.
|
||||
|
||||
# Override Levels {#levels}
|
||||
|
||||
## Level 1: PAUSE
|
||||
|
||||
The agent MUST suspend all autonomous actions and hold current
|
||||
state. It MUST NOT initiate new actions but MAY complete
|
||||
in-progress actions if stopping mid-execution would cause harm
|
||||
(e.g., an in-flight database transaction). The agent resumes
|
||||
when a RESUME command is received.
|
||||
|
||||
## Level 2: CONSTRAIN
|
||||
|
||||
The agent MUST restrict its actions to a specified subset. The
|
||||
override command includes an allowlist of permitted action types.
|
||||
The agent MUST reject any action not on the allowlist.
|
||||
|
||||
## Level 3: STOP
|
||||
|
||||
The agent MUST immediately cease all autonomous actions and enter
|
||||
an inert state. It MUST NOT take any autonomous actions until
|
||||
explicitly restarted. This is the e-stop.
|
||||
|
||||
## Level 4: TAKEOVER
|
||||
|
||||
The agent MUST transfer operational control to the human operator.
|
||||
It enters a pass-through mode where it executes only explicit
|
||||
operator commands. The agent's sensors and outputs remain
|
||||
available to the operator as tools.
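
The four levels differ mainly in which new actions an agent may start. The sketch below captures that as a single predicate. The `Level` enum and the `may_start` function are illustrative names, and the predicate covers only the start of new actions; it does not model the PAUSE allowance for finishing in-progress work.

```python
from enum import IntEnum


class Level(IntEnum):
    PAUSE = 1
    CONSTRAIN = 2
    STOP = 3
    TAKEOVER = 4


def may_start(action, level=None, allowlist=(), from_operator=False):
    """Decide whether an agent may start `action` under the active override."""
    if level is None:
        return True                 # no override active
    level = Level(level)
    if level is Level.CONSTRAIN:
        return action in allowlist  # only allowlisted action types
    if level is Level.TAKEOVER:
        return from_operator        # pass-through: explicit operator commands only
    return False                    # PAUSE and STOP: no new actions
```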
|
||||
|
||||
# Override Protocol {#protocol}
|
||||
|
||||
## Override Command
|
||||
|
||||
Override commands are sent as HTTP POST to the agent's well-known
|
||||
endpoint:
|
||||
|
||||
~~~
|
||||
POST /.well-known/hitl/override HTTP/1.1
|
||||
Content-Type: application/json
|
||||
Authorization: Bearer <operator-jwt>
|
||||
Execution-Context: <override-ect>
|
||||
~~~
|
||||
|
||||
The override ECT MUST contain:
|
||||
|
||||
- `exec_act`: `"hitl:override"`
|
||||
- `par`: the most recent ECT from the agent being overridden
|
||||
(linking the override into the workflow DAG)
|
||||
|
||||
~~~json
|
||||
{
|
||||
"exec_act": "hitl:override",
|
||||
"par": ["agent-last-action-ect"],
|
||||
"ext": {
|
||||
"hitl.level": 3,
|
||||
"hitl.reason": "Agent blocking legitimate traffic",
|
||||
"hitl.operator_id": "user:alice",
|
||||
"hitl.scope": "*",
|
||||
"hitl.constraints": null,
|
||||
"hitl.ttl": null
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-override title="Override ECT"}
|
||||
|
||||
Field definitions:
|
||||
|
||||
- `hitl.level`: Integer 1-4. MUST be present.
|
||||
- `hitl.reason`: Human-readable text. MUST be logged.
|
||||
- `hitl.scope`: `"*"` for all functions, or an array of function
|
||||
IDs for partial override.
|
||||
- `hitl.constraints`: For Level 2 only. Array of permitted action
|
||||
types.
|
||||
- `hitl.ttl`: Duration in seconds. If set, override auto-expires.
|
||||
If null, persists until explicitly lifted.
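
An agent might check these fields before acting on an override. The following sketch validates a decoded override ECT payload against the field definitions above. It is not normative: the function name and error strings are assumptions, it treats an absent `hitl.scope` as acceptable because only `hitl.level` is stated as mandatory, and it does not verify signatures.

```python
def validate_override(ect):
    """Return a list of problems found in a decoded override ECT payload."""
    errors = []
    ext = ect.get("ext", {})
    if ect.get("exec_act") != "hitl:override":
        errors.append("exec_act must be hitl:override")
    if not ect.get("par"):
        errors.append("par must reference the agent's most recent ECT")

    level = ext.get("hitl.level")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(level, bool) or not isinstance(level, int) or not 1 <= level <= 4:
        errors.append("hitl.level must be an integer 1-4")

    scope = ext.get("hitl.scope")
    if scope is not None and scope != "*" and not isinstance(scope, list):
        errors.append("hitl.scope must be '*' or an array of function IDs")

    constraints = ext.get("hitl.constraints")
    if level == 2 and not isinstance(constraints, list):
        errors.append("Level 2 requires hitl.constraints as an array")
    if level != 2 and constraints is not None:
        errors.append("hitl.constraints applies to Level 2 only")

    ttl = ext.get("hitl.ttl")
    if ttl is not None and (
        isinstance(ttl, bool) or not isinstance(ttl, (int, float)) or ttl <= 0
    ):
        errors.append("hitl.ttl must be null or a positive number of seconds")
    return errors
```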
|
||||
|
||||
## Acknowledgment
|
||||
|
||||
The agent MUST respond with an acknowledgment ECT:
|
||||
|
||||
- `exec_act`: `"hitl:ack"`
|
||||
- `par`: the override ECT
|
||||
|
||||
~~~json
|
||||
{
|
||||
"exec_act": "hitl:ack",
|
||||
"par": ["override-ect-uuid"],
|
||||
"ext": {
|
||||
"hitl.status": "accepted",
|
||||
"hitl.prior_state": "autonomous",
|
||||
"hitl.current_state": "stopped",
|
||||
"hitl.effective_at": "2026-03-01T12:00:00.123Z"
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-ack title="Acknowledgment ECT"}
|
||||
|
||||
The override/ack ECT pair serves as the Decision Record defined
|
||||
in ACP-DAG-HITL Section 6.5. No separate audit mechanism is
|
||||
needed.
|
||||
|
||||
## Resume and Lift
|
||||
|
||||
To resume from PAUSE:
|
||||
|
||||
~~~
|
||||
POST /.well-known/hitl/resume HTTP/1.1
|
||||
Execution-Context: <resume-ect with exec_act="hitl:resume">
|
||||
~~~
|
||||
|
||||
To lift any override:
|
||||
|
||||
~~~
|
||||
POST /.well-known/hitl/lift HTTP/1.1
|
||||
Execution-Context: <lift-ect with exec_act="hitl:lift">
|
||||
~~~
|
||||
|
||||
Both produce ECTs linked to the original override ECT via `par`.
|
||||
|
||||
# Agent Compliance Requirements {#compliance}
|
||||
|
||||
Every HITL-compliant agent MUST:
|
||||
|
||||
1. Implement the `/.well-known/hitl/override` endpoint.
|
||||
|
||||
2. Process override commands within 1 second of receipt. The
|
||||
override path MUST be independent of the agent's main
|
||||
processing loop.
|
||||
|
||||
3. Acknowledge every override with an ECT response.
|
||||
|
||||
4. Never respond with status `rejected`; overrides are mandatory.
If the agent cannot fully comply, it MUST respond with status
`partial` and describe what it could not do.
|
||||
|
||||
5. Expose current override status at:
|
||||
|
||||
~~~
|
||||
GET /.well-known/hitl/status
|
||||
~~~
|
||||
|
||||
~~~json
|
||||
{
|
||||
"agent_id": "spiffe://example.com/agent/firewall",
|
||||
"override_active": true,
|
||||
"current_level": 3,
|
||||
"override_ect": "override-ect-uuid",
|
||||
"since": "2026-03-01T12:00:00Z",
|
||||
"operator_id": "user:alice"
|
||||
}
|
||||
~~~
|
||||
{: #fig-status title="Override Status Response"}
|
||||
|
||||
# Approval Gates {#approval-gates}
|
||||
|
||||
An approval gate is a DAG node that blocks workflow progression
|
||||
until a human approves. Unlike overrides (which interrupt running
|
||||
agents), approval gates are planned checkpoints in the workflow.
|
||||
|
||||
Approval gates are defined as ACP-DAG-HITL nodes with HITL rules:
|
||||
|
||||
~~~json
|
||||
{
|
||||
"dag": {
|
||||
"nodes": [
|
||||
{
|
||||
"id": "n-approve",
|
||||
"type": "hitl:approval_gate",
|
||||
"agent": "system:hitl-gateway",
|
||||
"constraints": {
|
||||
"hitl.required_role": "clinician:oncall",
|
||||
"hitl.timeout_s": 300,
|
||||
"hitl.timeout_action": "safe_pause"
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-gate title="Approval Gate as DAG Node"}
|
||||
|
||||
When the workflow reaches an approval gate, the system:
|
||||
|
||||
1. Emits an ECT with `exec_act: "hitl:approval_request"`
|
||||
2. Notifies the required human role
|
||||
3. Waits for approval (ECT: `"hitl:approval_granted"`) or
|
||||
rejection (ECT: `"hitl:approval_denied"`)
|
||||
4. On timeout, applies `hitl.timeout_action`
|
||||
|
||||
# Broadcast Override {#broadcast}
|
||||
|
||||
For environments with many agents, an operator MAY send a
|
||||
broadcast override to a management endpoint:
|
||||
|
||||
~~~
|
||||
POST /hitl/broadcast HTTP/1.1
|
||||
Execution-Context: <broadcast-override-ect>
|
||||
|
||||
{
|
||||
"targets": ["spiffe://example.com/agent/a",
|
||||
"spiffe://example.com/agent/b"],
|
||||
"level": 3,
|
||||
"reason": "Coordinated emergency stop"
|
||||
}
|
||||
~~~
|
||||
|
||||
The broadcast endpoint fans out individual override ECTs to each
|
||||
target and returns per-agent results.
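
The fan-out step could look like the sketch below. The transport is injected as a `send` callable so the example stays self-contained. That callable, the `last_ect` lookup used to fill `par`, and the `unreachable` status value for failed deliveries are all assumptions for illustration and are not defined by this specification.

```python
def broadcast_override(targets, level, reason, last_ect, send):
    """Send one override ECT per target and collect per-agent results.

    `last_ect` maps target -> id of that agent's most recent ECT.
    `send(target, override)` delivers the override and returns the ack.
    """
    results = {}
    for target in targets:
        override = {
            "exec_act": "hitl:override",
            "par": [last_ect[target]],
            "ext": {
                "hitl.level": level,
                "hitl.reason": reason,
                "hitl.scope": "*",
            },
        }
        try:
            results[target] = send(target, override)
        except Exception as exc:
            # One unreachable agent must not block the remaining targets.
            results[target] = {"hitl.status": "unreachable", "error": str(exc)}
    return results
```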
|
||||
|
||||
# Dead Man's Switch {#dead-man}
|
||||
|
||||
For maximum reliability, agents SHOULD implement a heartbeat
|
||||
mechanism: the agent periodically pings an operator heartbeat
|
||||
endpoint. If the heartbeat is missed for a configurable duration,
|
||||
the agent automatically enters Level 1 (PAUSE).
|
||||
|
||||
This provides a safety net when network connectivity to the
|
||||
operator is lost. The `unreachable_human` policy from
|
||||
ACP-DAG-HITL governs behavior when the dead man's switch
|
||||
activates: either `abort` or `safe_pause`.
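
A heartbeat watchdog of this kind needs little state. In the sketch below, the class and method names are hypothetical, and what the agent does once the switch trips is left to the caller, since the text above delegates that choice to policy.

```python
import time


class DeadMansSwitch:
    def __init__(self, max_silence_s, clock=time.monotonic):
        self.max_silence_s = max_silence_s
        self.clock = clock
        self.last_ok = clock()

    def heartbeat_succeeded(self):
        """Call after each successful ping to the operator heartbeat endpoint."""
        self.last_ok = self.clock()

    def tripped(self):
        """True once no heartbeat has succeeded for longer than the limit."""
        return self.clock() - self.last_ok > self.max_silence_s
```

An agent would poll `tripped()` from a path that does not depend on its main processing loop, so that a stalled workload cannot also stall the safety check.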
|
||||
|
||||
# Security Considerations
|
||||
|
||||
Override commands are high-privilege operations. All override
|
||||
endpoints MUST require authentication via mutual TLS or signed
|
||||
JWTs.
|
||||
|
||||
Override ECTs MUST be signed at L2 or L3. Agents MUST verify
|
||||
signatures before processing.
|
||||
|
||||
To prevent replay attacks, agents MUST reject override ECTs with
|
||||
`iat` more than 30 seconds in the past. The `jti` MUST be unique;
|
||||
agents MUST reject duplicate `jti` values.
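
The two replay checks can be combined in one small guard. This sketch assumes the ECT signature has already been verified and that `iat` is a numeric timestamp in seconds, as in JWT. The class name and in-memory `jti` cache are illustrative, and handling of `iat` values in the future is out of scope here.

```python
import time


class ReplayGuard:
    def __init__(self, max_age_s=30, clock=time.time):
        self.max_age_s = max_age_s
        self.clock = clock
        self.seen = {}  # jti -> iat of accepted override ECTs

    def accept(self, claims):
        """Return True only for a fresh override ECT with an unseen jti."""
        now = self.clock()
        # Entries older than the window can be dropped: a replay of them
        # would fail the freshness check below anyway.
        self.seen = {
            jti: iat for jti, iat in self.seen.items()
            if now - iat <= self.max_age_s
        }
        iat, jti = claims["iat"], claims["jti"]
        if now - iat > self.max_age_s:
            return False  # issued too long ago
        if jti in self.seen:
            return False  # duplicate jti
        self.seen[jti] = iat
        return True
```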
|
||||
|
||||
Deployments SHOULD implement multi-operator approval for Level 4
|
||||
(TAKEOVER), requiring two independent operator identities.
|
||||
|
||||
The override endpoint SHOULD be served on a separate port or
|
||||
network interface from the agent's main API to ensure availability
|
||||
during overload.
|
||||
|
||||
# IANA Considerations
|
||||
|
||||
This document requests the following registrations:
|
||||
|
||||
1. Well-known URI registrations for `hitl/override`,
|
||||
`hitl/resume`, `hitl/lift`, and `hitl/status` per {{RFC8615}}.
|
||||
|
||||
2. Registration of `exec_act` values: `hitl:override`,
|
||||
`hitl:ack`, `hitl:resume`, `hitl:lift`,
|
||||
`hitl:approval_request`, `hitl:approval_granted`,
|
||||
`hitl:approval_denied` in a future ECT action type registry.
|
||||
|
||||
--- back
|
||||
|
||||
# Acknowledgments
|
||||
{:numbered="false"}
|
||||
|
||||
This specification is the runtime enforcement companion to
|
||||
ACP-DAG-HITL {{I-D.nennemann-agent-dag-hitl-safety}}. Override
|
||||
design is inspired by industrial safety systems (IEC 62061,
|
||||
ISO 13849).
|
||||
612
workspace/drafts/new-drafts/draft-c-hitl-human-in-the-loop-01.md
Normal file
@@ -0,0 +1,612 @@
|
||||
---
|
||||
title: "Human-in-the-Loop (HITL) Primitives for Agent Ecosystems"
|
||||
abbrev: "HITL"
|
||||
category: std
|
||||
docname: draft-hitl-human-in-the-loop-01
|
||||
submissiontype: IETF
|
||||
number:
|
||||
date:
|
||||
v: 3
|
||||
area: "OPS"
|
||||
workgroup: "NMOP"
|
||||
keyword:
|
||||
- human override
|
||||
- HITL
|
||||
- emergency stop
|
||||
- agentic safety
|
||||
- explainability
|
||||
|
||||
author:
|
||||
-
|
||||
fullname: TBD
|
||||
organization: Independent
|
||||
email: placeholder@example.com
|
||||
|
||||
normative:
|
||||
RFC2119:
|
||||
RFC8174:
|
||||
RFC7519:
|
||||
RFC8446:
|
||||
RFC8615:
|
||||
RFC9110:
|
||||
I-D.nennemann-wimse-ect:
|
||||
title: "Execution Context Tokens for Distributed Agentic Workflows"
|
||||
target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/
|
||||
I-D.nennemann-agent-dag-hitl-safety:
|
||||
title: "Agent Context Policy Token: DAG Delegation with Human Override"
|
||||
target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/
|
||||
|
||||
informative:
|
||||
|
||||
--- abstract
|
||||
|
||||
This document defines runtime HITL (Human-in-the-Loop) primitives
|
||||
for agent ecosystems: four escalating override levels, approval
|
||||
gates, timeout and fallback policies, and explainability hooks.
|
||||
ACP-DAG-HITL defines WHEN humans must intervene (policy rules and
|
||||
triggers). This specification defines HOW the intervention
|
||||
actually happens at the protocol level: the HTTP endpoints,
|
||||
override semantics, agent compliance requirements,
|
||||
acknowledgment flows, and explainability tokens that allow
|
||||
operators to make informed decisions. All overrides and decisions
|
||||
produce ECT nodes, making human interventions part of the same
|
||||
auditable DAG as agent actions.
|
||||
|
||||
--- middle
|
||||
|
||||
# Introduction
|
||||
|
||||
The current ratio of autonomous capability drafts to human
|
||||
oversight drafts in the IETF is roughly 7:1. Agents can act but
|
||||
humans cannot reliably stop them.
|
||||
|
||||
ACP-DAG-HITL {{I-D.nennemann-agent-dag-hitl-safety}} defines the
|
||||
policy: trigger conditions, required roles, and actions (`pause`,
|
||||
`escalate`, `abort`). But it deliberately defers the runtime
|
||||
protocol — how does an operator actually send a stop command? How
|
||||
does the agent acknowledge it? What happens if the operator is
|
||||
unreachable?
|
||||
|
||||
This specification fills that gap. It is the runtime enforcement
|
||||
companion to ACP-DAG-HITL, inspired by industrial safety systems:
|
||||
the e-stop button on factory equipment, the circuit breaker in
|
||||
electrical systems, and the kill switch in robotics.
|
||||
|
||||
HITL is deliberately not a governance framework, policy language,
|
||||
or accountability protocol. It is a panic button with a
|
||||
well-defined interface.
|
||||
|
||||
# Conventions and Definitions
|
||||
|
||||
{::boilerplate bcp14-tagged}
|
||||
|
||||
Override:
|
||||
: A human-initiated command that alters an agent's autonomous
|
||||
operation, taking precedence over the agent's own decisions.
|
||||
|
||||
Operator:
|
||||
: A human user authorized to issue override commands.
|
||||
|
||||
Approval Gate:
|
||||
: A DAG node that blocks workflow progression until a human
|
||||
approves or rejects continuation.
|
||||
|
||||
HITL Intensity Level:
|
||||
: A deployment-wide configuration of how actively human oversight
|
||||
is required. Distinct from override levels (which are runtime
|
||||
commands).
|
||||
|
||||
# HITL Intensity Levels {#intensity}
|
||||
|
||||
A deployment configures a HITL intensity level that determines
|
||||
the baseline human oversight requirement. This is orthogonal to
|
||||
the four runtime override levels ({{levels}}): intensity levels
|
||||
govern planning; override levels govern runtime intervention.
|
||||
|
||||
| Intensity | Label | Human requirement | When to use |
|
||||
|-----------|-------|-------------------|-------------|
|
||||
| I0 | Autonomous | No HITL required by default | Dev/test; fully trusted agents |
|
||||
| I1 | Advisory | Notifications; no blocking | Monitoring-only production deployments |
|
||||
| I2 | Selective | Approval required on critical paths only | Standard production cross-org deployments |
|
||||
| I3 | Mandatory | Approval required on every consequential action | Regulated environments; EU AI Act critical systems |
|
||||
{: #fig-intensity title="HITL Intensity Levels"}
|
||||
|
||||
Intensity levels are declared in ACP-DAG-HITL workflow policy and
|
||||
map to AEM assurance levels (see {{assurance-binding}}):
|
||||
|
||||
| HITL Intensity | Minimum AEM Assurance Level |
|
||||
|---------------|----------------------------|
|
||||
| I0 | L1 |
|
||||
| I1 | L1 |
|
||||
| I2 | L2 |
|
||||
| I3 | L3 |
|
||||
{: #fig-intensity-assurance title="Intensity to Assurance Level Mapping"}
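
A deployment check for the mapping above can be a single table lookup. The names below are illustrative, and assurance levels are represented as the integers 1 to 3 purely for comparison.

```python
# Minimum AEM assurance level (as an integer) for each HITL intensity.
MIN_ASSURANCE = {"I0": 1, "I1": 1, "I2": 2, "I3": 3}


def assurance_sufficient(intensity, assurance_level):
    """True if the deployment's assurance level meets the intensity's minimum."""
    return assurance_level >= MIN_ASSURANCE[intensity]
```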
|
||||
|
||||
# Relationship to ACP-DAG-HITL {#mapping}
|
||||
|
||||
ACP-DAG-HITL defines three HITL actions. This specification maps
each of them to a runtime override level and adds a fourth level,
CONSTRAIN (partial restriction), which has no ACP-DAG-HITL equivalent:
|
||||
|
||||
| ACP-DAG-HITL action | HITL Override Level | Behavior |
|
||||
|---------------------|---------------------|----------|
|
||||
| `pause` | Level 1: PAUSE | Suspend autonomous actions, hold state |
|
||||
| (no equivalent) | Level 2: CONSTRAIN | Restrict to an allowlist of actions |
|
||||
| `abort` | Level 3: STOP | Cease all actions, enter inert state |
|
||||
| `escalate` | Level 4: TAKEOVER | Transfer control to human operator |
|
||||
{: #fig-mapping title="ACP-DAG-HITL to HITL Level Mapping"}
|
||||
|
||||
When ACP-DAG-HITL rules trigger, the runtime system uses the
|
||||
corresponding HITL level to enforce the action.
|
||||
|
||||
# Override Levels {#levels}
|
||||
|
||||
## Level 1: PAUSE
|
||||
|
||||
The agent MUST suspend all autonomous actions and hold current
|
||||
state. It MUST NOT initiate new actions but MAY complete
|
||||
in-progress actions if stopping mid-execution would cause harm
|
||||
(e.g., an in-flight database transaction). The agent resumes
|
||||
when a RESUME command is received.
|
||||
|
||||
## Level 2: CONSTRAIN
|
||||
|
||||
The agent MUST restrict its actions to a specified subset. The
|
||||
override command includes an allowlist of permitted action types.
|
||||
The agent MUST reject any action not on the allowlist, responding
|
||||
with HTTP 403 and an ECT noting the constraint violation.
|
||||
|
||||
## Level 3: STOP
|
||||
|
||||
The agent MUST immediately cease all autonomous actions and enter
|
||||
an inert state. It MUST NOT take any autonomous actions until
|
||||
explicitly restarted. This is the e-stop. Any in-progress
|
||||
consequential actions MUST be abandoned; if abandonment would
|
||||
leave external state inconsistent, the agent MUST emit an
|
||||
`atd:error` ECT and the ATD rollback protocol applies.
|
||||
|
||||
## Level 4: TAKEOVER
|
||||
|
||||
The agent MUST transfer operational control to the human operator.
|
||||
It enters a pass-through mode where it executes only explicit
|
||||
operator commands. The agent's sensors and outputs remain
|
||||
available to the operator as tools. Deployments SHOULD require
|
||||
two-operator authorization for TAKEOVER (see {{security}}).
|
||||
|
||||
# Override Protocol {#protocol}
|
||||
|
||||
## Override Command
|
||||
|
||||
Override commands are sent as HTTP POST to the agent's well-known
|
||||
endpoint:
|
||||
|
||||
~~~
|
||||
POST /.well-known/hitl/override HTTP/1.1
|
||||
Content-Type: application/json
|
||||
Authorization: Bearer <operator-jwt>
|
||||
Execution-Context: <override-ect>
|
||||
~~~
|
||||
|
||||
The override ECT MUST contain:
|
||||
|
||||
- `exec_act`: `"hitl:override"`
|
||||
- `par`: the most recent ECT from the agent being overridden
|
||||
(linking the override into the workflow DAG)
|
||||
|
||||
~~~json
|
||||
{
|
||||
"exec_act": "hitl:override",
|
||||
"par": ["agent-last-action-ect"],
|
||||
"ext": {
|
||||
"hitl.level": 3,
|
||||
"hitl.reason": "Agent blocking legitimate traffic",
|
||||
"hitl.operator_id": "user:alice",
|
||||
"hitl.scope": "*",
|
||||
"hitl.constraints": null,
|
||||
"hitl.ttl": null,
|
||||
"hitl.nonce": "a3f8b2c1"
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-override title="Override ECT"}
|
||||
|
||||
Field definitions:
|
||||
|
||||
- `hitl.level`: Integer 1-4. MUST be present.
|
||||
- `hitl.reason`: Human-readable text. MUST be logged.
|
||||
- `hitl.scope`: `"*"` for all functions, or an array of function
|
||||
IDs for partial override.
|
||||
- `hitl.constraints`: For Level 2 only. Array of permitted action
|
||||
types.
|
||||
- `hitl.ttl`: Duration in seconds. If set, override auto-expires.
|
||||
If null, persists until explicitly lifted.
|
||||
- `hitl.nonce`: REQUIRED. A random value to prevent replay attacks.
|
||||
|
||||
## Acknowledgment
|
||||
|
||||
The agent MUST respond within 1 second with an acknowledgment ECT:
|
||||
|
||||
- `exec_act`: `"hitl:ack"`
|
||||
- `par`: the override ECT
|
||||
|
||||
~~~json
|
||||
{
|
||||
"exec_act": "hitl:ack",
|
||||
"par": ["override-ect-uuid"],
|
||||
"ext": {
|
||||
"hitl.status": "accepted",
|
||||
"hitl.prior_state": "autonomous",
|
||||
"hitl.current_state": "stopped",
|
||||
"hitl.effective_at": "2026-03-01T12:00:00.123Z"
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-ack title="Acknowledgment ECT"}
|
||||
|
||||
The override/ack ECT pair serves as the Decision Record defined
|
||||
in ACP-DAG-HITL Section 6.5. No separate audit mechanism is
|
||||
needed.
|
||||
|
||||
## Resume and Lift
|
||||
|
||||
To resume from PAUSE:
|
||||
|
||||
~~~
|
||||
POST /.well-known/hitl/resume HTTP/1.1
|
||||
Execution-Context: <resume-ect with exec_act="hitl:resume">
|
||||
~~~
|
||||
|
||||
To lift any override:
|
||||
|
||||
~~~
|
||||
POST /.well-known/hitl/lift HTTP/1.1
|
||||
Execution-Context: <lift-ect with exec_act="hitl:lift">
|
||||
~~~
|
||||
|
||||
Both produce ECTs linked to the original override ECT via `par`.
|
||||
|
||||
# Agent Compliance Requirements {#compliance}
|
||||
|
||||
Every HITL-compliant agent MUST:
|
||||
|
||||
1. Implement the `/.well-known/hitl/override` endpoint per
|
||||
{{RFC8615}}.
|
||||
|
||||
2. Process override commands within 1 second of receipt. The
|
||||
override path MUST be independent of the agent's main
|
||||
processing loop and MUST NOT be blocked by ongoing tasks.
|
||||
|
||||
3. Acknowledge every override with an ECT response.
|
||||
|
||||
4. Never respond with status `rejected`; overrides are mandatory.
If the agent cannot fully comply, it MUST respond with status
`partial` and describe what it could not do.
|
||||
|
||||
5. Expose current override status at:
|
||||
|
||||
~~~
|
||||
GET /.well-known/hitl/status
|
||||
~~~
|
||||
|
||||
~~~json
|
||||
{
|
||||
"agent_id": "spiffe://example.com/agent/firewall",
|
||||
"override_active": true,
|
||||
"current_level": 3,
|
||||
"override_ect": "override-ect-uuid",
|
||||
"since": "2026-03-01T12:00:00Z",
|
||||
"operator_id": "user:alice"
|
||||
}
|
||||
~~~
|
||||
{: #fig-status title="Override Status Response"}
|
||||
|
||||
In addition, the override endpoint SHOULD be served on a separate
port or network interface from the agent's main API to ensure
availability under load.
|
||||
|
||||
# Approval Gates {#approval-gates}
|
||||
|
||||
An approval gate is a DAG node that blocks workflow progression
|
||||
until a human approves. Unlike overrides (which interrupt running
|
||||
agents), approval gates are planned checkpoints in the workflow.
|
||||
|
||||
Approval gates are defined as ACP-DAG-HITL nodes with HITL rules:
|
||||
|
||||
~~~json
|
||||
{
|
||||
"dag": {
|
||||
"nodes": [
|
||||
{
|
||||
"id": "n-approve",
|
||||
"type": "hitl:approval_gate",
|
||||
"agent": "system:hitl-gateway",
|
||||
"constraints": {
|
||||
"hitl.required_role": "clinician:oncall",
|
||||
"hitl.timeout_s": 300,
|
||||
"hitl.timeout_action": "safe_pause"
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-gate title="Approval Gate as DAG Node"}
|
||||
|
||||
When the workflow reaches an approval gate, the system:
|
||||
|
||||
1. Emits an ECT with `exec_act: "hitl:approval_request"`.
|
||||
2. Notifies the required human role with an explainability
|
||||
token (see {{explainability}}).
|
||||
3. Waits for approval (ECT: `"hitl:approval_granted"`) or
|
||||
rejection (ECT: `"hitl:approval_denied"`).
|
||||
4. On timeout, applies `hitl.timeout_action` per {{timeout}}.
|
||||
|
||||
## Approval Request and Response ECTs
|
||||
|
||||
~~~json
|
||||
{
|
||||
"exec_act": "hitl:approval_request",
|
||||
"par": ["pre-gate-ect-uuid"],
|
||||
"ext": {
|
||||
"hitl.required_role": "clinician:oncall",
|
||||
"hitl.context": "Medication dosage adjustment for patient P-1042",
|
||||
"hitl.timeout_s": 300,
|
||||
"hitl.explainability_ref": "expl-ect-uuid"
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-approval-req title="Approval Request ECT"}
|
||||
|
||||
~~~json
|
||||
{
|
||||
"exec_act": "hitl:approval_granted",
|
||||
"par": ["approval-request-ect-uuid"],
|
||||
"ext": {
|
||||
"hitl.operator_id": "user:dr-jones",
|
||||
"hitl.scope": "medication:adjust",
|
||||
"hitl.expires": "2026-03-01T13:00:00Z"
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-approval-grant title="Approval Granted ECT"}
|
||||
|
||||
~~~json
|
||||
{
|
||||
"exec_act": "hitl:approval_denied",
|
||||
"par": ["approval-request-ect-uuid"],
|
||||
"ext": {
|
||||
"hitl.operator_id": "user:dr-jones",
|
||||
"hitl.reason": "Dosage exceeds safe maximum for patient weight",
|
||||
"hitl.alternative": "Use standard protocol dosage"
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-approval-deny title="Approval Denied ECT"}
|
||||
|
||||
# Timeout and Fallback Policy {#timeout}
|
||||
|
||||
When a human does not respond within `hitl.timeout_s`, the
|
||||
agent applies `hitl.timeout_action`. Three policies are
|
||||
supported:
|
||||
|
||||
fail-closed:
|
||||
: Abort the workflow. The agent emits `atd:error` with
|
||||
`error_type: "timeout"` and the ATD rollback protocol
|
||||
applies. Use when safety requires no action over wrong action.
|
||||
|
||||
fail-open:
|
||||
: Continue as if approved, recording an audit ECT that no human
|
||||
approved. Use only when workflow continuity is more important
|
||||
than human review (I0/I1 intensity deployments).
|
||||
|
||||
escalate:
|
||||
: Move the approval request to the next operator in the
|
||||
escalation chain (defined in ACP-DAG-HITL policy). If the
|
||||
escalation chain is exhausted, fall back to `fail-closed`.
|
||||
|
||||
The timeout policy is set in ACP-DAG-HITL node constraints:
|
||||
|
||||
~~~json
|
||||
{
|
||||
"constraints": {
|
||||
"hitl.timeout_s": 300,
|
||||
"hitl.timeout_action": "escalate"
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-timeout title="Timeout Policy as Node Constraint"}
|
||||
|
||||
Timeout policy MUST be `fail-closed` at HITL intensity I3.
|
||||
Timeout policy MUST NOT be `fail-open` when assurance level is L3.
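
The two constraints above, together with the escalation fallback, could be checked as in the following sketch. The function names, the string encoding of intensity (`"I3"`) and assurance (`"L3"`), and the list representation of the escalation chain are assumptions made for illustration only.

```python
VALID_ACTIONS = {"fail-closed", "fail-open", "escalate"}


def check_timeout_policy(action, intensity, assurance):
    """Return a list of violations for a node's hitl.timeout_action."""
    problems = []
    if action not in VALID_ACTIONS:
        problems.append("unknown timeout action: " + action)
    # I3 only permits fail-closed.
    if intensity == "I3" and action != "fail-closed":
        problems.append("I3 requires fail-closed")
    # L3 never permits fail-open.
    if assurance == "L3" and action == "fail-open":
        problems.append("L3 forbids fail-open")
    return problems


def on_timeout(action, chain, position):
    """Return (effective_action, next_operator) after a timeout.

    `chain` is the escalation chain and `position` the index of the
    operator who just timed out (both hypothetical representations).
    """
    if action == "escalate":
        if position + 1 < len(chain):
            return ("escalate", chain[position + 1])
        # Chain exhausted: fall back to fail-closed.
        return ("fail-closed", None)
    return (action, None)
```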
|
||||
|
||||
# Explainability {#explainability}
|
||||
|
||||
When a HITL point is triggered, the agent SHOULD provide an
|
||||
explainability token that allows the operator to make an informed
|
||||
decision. At AEM assurance L2+, explainability is REQUIRED for
|
||||
approval gate requests.
|
||||
|
||||
An explainability token is an ECT whose `exec_act` is `"hitl:explanation"`:
|
||||
|
||||
~~~json
|
||||
{
|
||||
"exec_act": "hitl:explanation",
|
||||
"par": ["last-agent-action-ect"],
|
||||
"ext": {
|
||||
"hitl.summary": "Agent proposes to reroute BGP traffic from AS64496 to AS64497 due to packet loss exceeding 15% threshold over 5-minute window.",
|
||||
"hitl.proposed_action": "update-bgp-peer router-07 neighbor 198.51.100.1 remove-private-as",
|
||||
"hitl.evidence_ects": [
|
||||
"snmp-poll-1-ect-uuid",
|
||||
"snmp-poll-2-ect-uuid",
|
||||
"loss-calc-ect-uuid"
|
||||
],
|
||||
"hitl.confidence": 0.91,
|
||||
"hitl.risk_level": "medium",
|
||||
"hitl.reversible": true
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-explanation title="Explainability Token ECT"}
|
||||
|
||||
Field definitions:
|
||||
|
||||
- `hitl.summary`: Human-readable description of what the agent
|
||||
was doing and why HITL was reached. REQUIRED.
|
||||
- `hitl.proposed_action`: What the agent proposes to do.
|
||||
REQUIRED.
|
||||
- `hitl.evidence_ects`: Array of `jti` values from prior ECTs
|
||||
that support the proposal. SHOULD be present.
|
||||
- `hitl.confidence`: Float 0.0-1.0; agent's self-assessed
|
||||
confidence in the proposed action. SHOULD be present.
|
||||
- `hitl.risk_level`: One of `low`, `medium`, `high`, `critical`.
|
||||
SHOULD be present.
|
||||
- `hitl.reversible`: Whether the proposed action can be rolled
|
||||
back. REQUIRED.
|
||||
|
||||
The `hitl.explainability_ref` field in the approval request ECT
|
||||
({{fig-approval-req}}) references the `jti` of this ECT.
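
A receiver might check the field definitions above before showing a token to an operator. The sketch below is illustrative: `validate_explanation` and its error strings are hypothetical, missing REQUIRED fields are reported as errors, and missing SHOULD fields are reported as warnings.

```python
REQUIRED = ("hitl.summary", "hitl.proposed_action", "hitl.reversible")
RECOMMENDED = ("hitl.evidence_ects", "hitl.confidence", "hitl.risk_level")
RISK_LEVELS = {"low", "medium", "high", "critical"}


def validate_explanation(ect):
    """Return (errors, warnings) for an explainability token ECT."""
    ext = ect.get("ext", {})
    errors, warnings = [], []
    if ect.get("exec_act") != "hitl:explanation":
        errors.append("exec_act must be hitl:explanation")
    for name in REQUIRED:
        if name not in ext:
            errors.append(name + " is required")
    for name in RECOMMENDED:
        if name not in ext:
            warnings.append(name + " should be present")
    conf = ext.get("hitl.confidence")
    if conf is not None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        numeric = isinstance(conf, (int, float)) and not isinstance(conf, bool)
        if not numeric or not 0.0 <= conf <= 1.0:
            errors.append("hitl.confidence must be a number in [0.0, 1.0]")
    risk = ext.get("hitl.risk_level")
    if risk is not None and risk not in RISK_LEVELS:
        errors.append("hitl.risk_level is not a known value")
    return errors, warnings
```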
|
||||
|
||||
# Binding to AEM Assurance Levels {#assurance-binding}
|
||||
|
||||
HITL requirements vary by AEM assurance level. The following
|
||||
table is normative:
|
||||
|
||||
| AEM Level | Required HITL Intensity | Override signing | Explainability |
|
||||
|-----------|------------------------|-----------------|----------------|
|
||||
| L1 | I0 (optional) | Optional | Optional |
|
||||
| L2 | I2 or higher | REQUIRED (signed JWT) | REQUIRED for I2+ |
|
||||
| L3 | I3 | REQUIRED (signed JWT, L3 ECT) | REQUIRED |
|
||||
{: #fig-assurance-hitl title="HITL Requirements by Assurance Level"}
|
||||
|
||||
At L3, approval gate responses (`hitl:approval_granted`) MUST be
committed to the audit ledger.
|
||||
|
||||
# Broadcast Override {#broadcast}
|
||||
|
||||
For environments with many agents, an operator MAY send a
|
||||
broadcast override to a management endpoint:
|
||||
|
||||
~~~
|
||||
POST /hitl/broadcast HTTP/1.1
|
||||
Execution-Context: <broadcast-override-ect>
|
||||
|
||||
{
|
||||
"targets": ["spiffe://example.com/agent/a",
|
||||
"spiffe://example.com/agent/b"],
|
||||
"level": 3,
|
||||
"reason": "Coordinated emergency stop"
|
||||
}
|
||||
~~~
|
||||
|
||||
The broadcast endpoint fans out individual override ECTs to each
|
||||
target and returns per-agent results. Each fan-out is itself an
|
||||
ECT linked to the broadcast override ECT.
|
||||
|
||||
Broadcast overrides MUST be authenticated at AEM assurance level L2 or higher.
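
The fan-out behavior can be sketched as follows. This is an assumption-laden illustration: `fan_out`, the `send` callback, and the `hitl.level`, `hitl.reason`, and `hitl.target` extension names are hypothetical; only `exec_act`, `par`, and `jti` are taken from the text above.

```python
import uuid


def fan_out(broadcast_ect, body, send):
    """Send one override ECT per target and collect per-agent results."""
    results = {}
    for target in body["targets"]:
        child = {
            "jti": str(uuid.uuid4()),
            "exec_act": "hitl:override",
            # Link each fan-out ECT back to the broadcast override ECT.
            "par": [broadcast_ect["jti"]],
            "ext": {
                "hitl.level": body["level"],    # hypothetical field name
                "hitl.reason": body["reason"],  # hypothetical field name
                "hitl.target": target,          # hypothetical field name
            },
        }
        try:
            results[target] = send(target, child)
        except Exception as exc:
            # One unreachable agent must not block the remaining targets.
            results[target] = "error: " + str(exc)
    return results
```

Collecting a result per target, rather than failing on the first error, matches the requirement that the endpoint return per-agent results.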
|
||||
|
||||
# Dead Man's Switch {#dead-man}
|
||||
|
||||
For maximum reliability, agents SHOULD implement a heartbeat
mechanism: the agent periodically pings an operator heartbeat
endpoint. If the heartbeat is missed for a configurable duration,
the agent automatically enters Level 1 (PAUSE), unless the
`unreachable_human` policy described below selects `abort`.
|
||||
|
||||
The heartbeat interval SHOULD be 30 seconds. The trigger
|
||||
threshold SHOULD be 3 missed heartbeats.
|
||||
|
||||
This provides a safety net when network connectivity to the
|
||||
operator is lost. The `unreachable_human` policy from
|
||||
ACP-DAG-HITL governs behavior when the dead man's switch
|
||||
activates: either `abort` (→ Level 3) or `safe_pause` (→ Level 1).
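
A minimal sketch of the trigger logic follows. It assumes that "3 missed heartbeats" means no successful heartbeat for three full intervals; the class name, its method names, and the use of caller-supplied timestamps are hypothetical.

```python
class DeadMansSwitch:
    """Track heartbeat successes and decide when to self-override."""

    def __init__(self, interval_s=30.0, missed_threshold=3,
                 unreachable_human="safe_pause"):
        self.interval_s = interval_s
        self.missed_threshold = missed_threshold
        self.unreachable_human = unreachable_human
        self.last_ok = 0.0

    def heartbeat_ok(self, now):
        """Record a successful ping of the operator heartbeat endpoint."""
        self.last_ok = now

    def check(self, now):
        """Return the override level to enter, or None if still healthy."""
        silent_for = now - self.last_ok
        if silent_for < self.interval_s * self.missed_threshold:
            return None
        # abort maps to Level 3, safe_pause maps to Level 1.
        return 3 if self.unreachable_human == "abort" else 1
```

Passing `now` explicitly (for example from `time.monotonic()`) keeps the logic independent of wall-clock adjustments and easy to test.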
|
||||
|
||||
# Security Considerations {#security}
|
||||
|
||||
## Authentication of Override Commands
|
||||
|
||||
All override endpoints MUST require authentication via mutual
|
||||
TLS ({{RFC8446}}) or signed JWTs ({{RFC7519}}). The JWT MUST
|
||||
contain the operator's identity and be signed by a trusted key
|
||||
(per ACP-DAG-HITL operator role configuration).
|
||||
|
||||
## Replay Prevention
|
||||
|
||||
To prevent replay attacks, agents MUST:
|
||||
|
||||
1. Reject override ECTs with `iat` more than 30 seconds in the
|
||||
past.
|
||||
2. Reject duplicate `jti` values (require a nonce per override).
|
||||
3. Require the `hitl.nonce` field in override ECTs.
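
The three checks can be combined in a small guard object, sketched below. It assumes that `iat` is a Unix timestamp in seconds and that `hitl.nonce` is carried inside `ext` like the other `hitl.*` fields; `ReplayGuard` and `accept` are hypothetical names.

```python
class ReplayGuard:
    """Reject stale, duplicate, or nonce-less override ECTs."""

    def __init__(self, max_age_s=30):
        self.max_age_s = max_age_s
        self.seen_jti = set()

    def accept(self, ect, now):
        # Check 3: the nonce field must be present.
        if "hitl.nonce" not in ect.get("ext", {}):
            return False
        # Check 1: iat must not be more than max_age_s in the past.
        if now - ect["iat"] > self.max_age_s:
            return False
        # Check 2: each jti may be accepted only once.
        if ect["jti"] in self.seen_jti:
            return False
        self.seen_jti.add(ect["jti"])
        return True
```

Because any ECT older than the 30-second window is rejected on age alone, a production implementation could evict `jti` entries after that window instead of keeping them indefinitely.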
|
||||
|
||||
## Impersonation
|
||||
|
||||
Override commands carry high privilege. Agents MUST verify:
|
||||
|
||||
- The operator JWT is signed by a trusted key in the ACP-DAG-HITL
|
||||
operator registry.
|
||||
- The operator role matches the `required_role` in the triggering
|
||||
HITL rule.
|
||||
|
||||
## Two-Operator Approval for TAKEOVER
|
||||
|
||||
Deployments SHOULD implement multi-operator approval for Level 4
|
||||
(TAKEOVER), requiring two independent operator identities. The
|
||||
two approval ECTs MUST both appear as `par` in the TAKEOVER
|
||||
override ECT.
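
One way an agent might verify this rule is sketched below. It assumes the agent has a lookup from `jti` to already-verified ECTs; `takeover_authorized` and `ects_by_jti` are hypothetical names, and signature verification is out of scope here.

```python
def takeover_authorized(takeover_ect, ects_by_jti):
    """True if `par` references approvals from two distinct operators."""
    operators = set()
    for jti in takeover_ect.get("par", []):
        parent = ects_by_jti.get(jti)
        if parent and parent.get("exec_act") == "hitl:approval_granted":
            operators.add(parent["ext"]["hitl.operator_id"])
    # Two approvals from the same operator do not count as independent.
    return len(operators) >= 2
```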
|
||||
|
||||
## HITL Bypass Prevention
|
||||
|
||||
Agents that claim a HITL gate was satisfied MUST provide the
|
||||
`jti` of the corresponding `hitl:approval_granted` ECT in the
|
||||
ECT that follows the gate. Agents MUST NOT proceed past an
|
||||
approval gate without a valid signed approval ECT.
|
||||
|
||||
## Escalation Chain Integrity
|
||||
|
||||
The escalation chain in ACP-DAG-HITL policy defines which roles
|
||||
receive escalations. This chain MUST be signed as part of the
|
||||
policy token to prevent tampering. Agents MUST NOT follow
|
||||
escalation chains from unsigned or unverified policy tokens.
|
||||
|
||||
# IANA Considerations
|
||||
|
||||
## Well-Known URI Registrations
|
||||
|
||||
This document requests the following registrations per {{RFC8615}}:
|
||||
|
||||
| URI Suffix | Purpose |
|
||||
|------------|---------|
|
||||
| `hitl/override` | Override command endpoint |
|
||||
| `hitl/resume` | Resume from PAUSE |
|
||||
| `hitl/lift` | Lift any active override |
|
||||
| `hitl/status` | Override status query |
|
||||
{: #fig-wellknown title="Well-Known URI Registrations"}
|
||||
|
||||
## `exec_act` Values
|
||||
|
||||
This document requests registration in the AEM Ecosystem
|
||||
Extension Registry:
|
||||
|
||||
| Value | Description | Reference |
|
||||
|-------|-------------|-----------|
|
||||
| `hitl:override` | Human override command | This document |
|
||||
| `hitl:ack` | Agent acknowledgment of override | This document |
|
||||
| `hitl:resume` | Resume from PAUSE state | This document |
|
||||
| `hitl:lift` | Lift any active override | This document |
|
||||
| `hitl:approval_request` | Workflow blocked at approval gate | This document |
|
||||
| `hitl:approval_granted` | Human approved continuation | This document |
|
||||
| `hitl:approval_denied` | Human denied continuation | This document |
|
||||
| `hitl:explanation` | Explainability token for HITL decision | This document |
|
||||
{: #fig-iana-actions title="HITL exec_act Registrations"}
|
||||
|
||||
--- back
|
||||
|
||||
# Acknowledgments
|
||||
{:numbered="false"}
|
||||
|
||||
This specification is the runtime enforcement companion to
|
||||
ACP-DAG-HITL {{I-D.nennemann-agent-dag-hitl-safety}}. Override
|
||||
design is inspired by industrial safety systems (IEC 62061,
|
||||
ISO 13849). The explainability token design is informed by
|
||||
EU AI Act Article 13 transparency requirements.
|
||||
@@ -0,0 +1,354 @@
|
||||
---
|
||||
title: "Cross-Protocol Agent Translation (CPAT)"
|
||||
abbrev: "CPAT"
|
||||
category: std
|
||||
docname: draft-cpat-cross-protocol-agent-translation-00
|
||||
submissiontype: IETF
|
||||
number:
|
||||
date:
|
||||
v: 3
|
||||
area: "ART"
|
||||
workgroup: "DISPATCH"
|
||||
keyword:
|
||||
- agent interoperability
|
||||
- protocol translation
|
||||
- agentic workflows
|
||||
- execution context
|
||||
|
||||
author:
|
||||
-
|
||||
fullname: Generated by IETF Draft Analyzer
|
||||
organization: Independent
|
||||
email: placeholder@example.com
|
||||
|
||||
normative:
|
||||
RFC7519:
|
||||
RFC7515:
|
||||
RFC9110:
|
||||
RFC8615:
|
||||
I-D.nennemann-wimse-ect:
|
||||
title: "Execution Context Tokens for Distributed Agentic Workflows"
|
||||
target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/
|
||||
|
||||
informative:
|
||||
I-D.nennemann-agent-dag-hitl-safety:
|
||||
title: "Agent Context Policy Token: DAG Delegation with Human Override"
|
||||
target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/
|
||||
|
||||
--- abstract
|
||||
|
||||
This document defines the Cross-Protocol Agent Translation (CPAT)
|
||||
framework, a mechanism enabling AI agents using different
|
||||
communication protocols to interoperate. With over 90 competing
|
||||
agent-to-agent protocol drafts and no interoperability standard,
|
||||
protocol fragmentation is the primary barrier to multi-vendor agent
|
||||
ecosystems. CPAT defines capability advertisement, protocol
|
||||
negotiation, and translation gateways. Translation hops are
|
||||
recorded as Execution Context Token (ECT) DAG nodes, giving every
|
||||
cross-protocol interaction a cryptographic audit trail without
|
||||
inventing a parallel tracing mechanism.
|
||||
|
||||
--- middle
|
||||
|
||||
# Introduction
|
||||
|
||||
The IETF AI/agent landscape includes over 90 drafts proposing
|
||||
agent-to-agent communication protocols, yet no standard exists
|
||||
for agents using different protocols to exchange messages.
|
||||
|
||||
CPAT takes a pragmatic approach: rather than mandating a single
|
||||
protocol, it defines the minimum machinery for agents to discover
|
||||
each other's protocol support, agree on a common format, and fall
|
||||
back to translation gateways when no common protocol exists.
|
||||
|
||||
CPAT builds on Execution Context Tokens
|
||||
{{I-D.nennemann-wimse-ect}} as its audit and tracing backbone.
|
||||
Every translation hop produces an ECT, linking into the workflow
|
||||
DAG alongside the source and destination agents. This eliminates
|
||||
the need for a separate tracing or provenance mechanism -- the ECT
|
||||
DAG already provides it.
|
||||
|
||||
Design principles:
|
||||
|
||||
1. Reuse existing standards (HTTP, JSON, TLS, ECT) wherever
|
||||
possible.
|
||||
2. Keep the core mechanism small enough to implement in a day.
|
||||
3. Do not require agents to support any protocol beyond their own
|
||||
plus CPAT negotiation.
|
||||
|
||||
# Conventions and Definitions
|
||||
|
||||
{::boilerplate bcp14-tagged}
|
||||
|
||||
The following terms are used in this document:
|
||||
|
||||
Agent Protocol:
|
||||
: A communication protocol used by an AI agent for peer-to-peer
|
||||
message exchange (e.g., A2A, MCP, SLIM, uACP).
|
||||
|
||||
Capability Document:
|
||||
: A JSON object describing the protocols an agent supports, served
|
||||
at a well-known URI.
|
||||
|
||||
Translation Gateway:
|
||||
: A service that converts messages between two agent protocols,
|
||||
recording each translation as an ECT DAG node.
|
||||
|
||||
# Problem Statement
|
||||
|
||||
Consider three agents: Agent A speaks Protocol X, Agent B speaks
|
||||
Protocol Y, and Agent C speaks both X and Z. Today there is no
|
||||
standard way for A to discover that B uses a different protocol,
|
||||
negotiate a common format, or route through a translator.
|
||||
|
||||
Existing work on Agent Name Service (ANS) and agent discovery
|
||||
addresses finding agents but not protocol compatibility. CPAT
|
||||
fills the gap between discovery and communication.
|
||||
|
||||
# Protocol Capability Advertisement {#capability-ad}
|
||||
|
||||
Each CPAT-compliant agent MUST serve a capability document at the
|
||||
well-known URI `/.well-known/cpat` {{RFC8615}}. The document is a
|
||||
JSON object:
|
||||
|
||||
~~~json
|
||||
{
|
||||
"cpat_version": "1.0",
|
||||
"agent_id": "spiffe://example.com/agent/pricing",
|
||||
"protocols": [
|
||||
{
|
||||
"id": "a2a-v1",
|
||||
"version": "1.0",
|
||||
"endpoint": "https://agent.example.com/a2a",
|
||||
"priority": 10
|
||||
},
|
||||
{
|
||||
"id": "mcp-v1",
|
||||
"version": "2025-03-26",
|
||||
"endpoint": "https://agent.example.com/mcp",
|
||||
"priority": 20
|
||||
}
|
||||
],
|
||||
"translation_gateways": [
|
||||
"https://gateway.example.com/cpat/translate"
|
||||
],
|
||||
"ect_assurance_level": "L2"
|
||||
}
|
||||
~~~
|
||||
{: #fig-capability title="Capability Document Example"}
|
||||
|
||||
The `protocols` array MUST contain at least one entry. Each entry
|
||||
MUST include `id` (a registered protocol identifier), `version`,
|
||||
and `endpoint`. The `priority` field is OPTIONAL; lower values
|
||||
indicate higher preference.
|
||||
|
||||
The `ect_assurance_level` field declares the minimum ECT assurance
|
||||
level the agent requires for interactions. This enables gateways
|
||||
to produce ECTs at the correct level.
|
||||
|
||||
Agents SHOULD also advertise their capability document URI in DNS
|
||||
SVCB records. The owner-name prefix `_cpat._tcp` SHOULD be used.
|
||||
|
||||
# Negotiation Handshake {#negotiation}
|
||||
|
||||
When Agent A wants to communicate with Agent B:
|
||||
|
||||
Step 1:
|
||||
: Agent A fetches Agent B's capability document from B's
|
||||
well-known CPAT URI over HTTPS.
|
||||
|
||||
Step 2:
|
||||
: Agent A computes the intersection of its own protocol list with
|
||||
Agent B's. If the intersection is non-empty, the protocol with
|
||||
the lowest combined priority score is selected. Communication
|
||||
proceeds directly using that protocol.
|
||||
|
||||
Step 3:
|
||||
: If no common protocol exists, Agent A checks whether any
|
||||
translation gateway listed by either agent supports both
|
||||
protocols. Agent A queries the gateway:
|
||||
|
||||
~~~
|
||||
GET /.well-known/cpat/gateway?from=a2a-v1&to=slim-v1
|
||||
~~~
|
||||
|
||||
The gateway responds with 200 OK if it supports the pair, or
|
||||
404 if not.
|
||||
|
||||
Step 4:
|
||||
: If a suitable gateway is found, Agent A sends its message to the
|
||||
gateway, which translates and forwards it to Agent B. The
|
||||
gateway records the translation as an ECT (see {{ect-integration}}).
|
||||
|
||||
Step 5:
|
||||
: If no gateway supports the required pair, Agent A returns an
|
||||
error to its caller with error code `no_translation_path`.
|
||||
|
||||
The entire negotiation is stateless and cacheable. Agents SHOULD
|
||||
cache capability documents for the duration indicated by HTTP
|
||||
Cache-Control headers, defaulting to 3600 seconds when no such header is present.
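
Steps 2 through 5 can be sketched as a pure function over two capability documents. Several points are assumptions, because the text does not define them: the "combined priority score" is taken to be the sum of both agents' `priority` values, a missing `priority` defaults to 100, protocols are matched on `id` only, and `gateway_supports` is a hypothetical callback standing in for the gateway query of Step 3.

```python
DEFAULT_PRIORITY = 100  # assumption: used when the OPTIONAL priority is absent


def _priorities(doc):
    """Map protocol id to priority for one capability document."""
    return {p["id"]: p.get("priority", DEFAULT_PRIORITY) for p in doc["protocols"]}


def negotiate(mine, theirs, gateway_supports):
    """Return a direct protocol, a gateway route, or an error tuple."""
    my, their = _priorities(mine), _priorities(theirs)
    common = sorted(my.keys() & their.keys())
    if common:
        # Step 2: lowest summed priority wins; sorting makes ties deterministic.
        return ("direct", min(common, key=lambda pid: my[pid] + their[pid]))
    # Step 3: try gateways listed by either agent, preferred protocols first.
    gateways = mine.get("translation_gateways", []) + theirs.get("translation_gateways", [])
    for src in sorted(my, key=my.get):
        for dst in sorted(their, key=their.get):
            for gw in gateways:
                if gateway_supports(gw, src, dst):
                    return ("gateway", gw, src, dst)  # Step 4
    # Step 5: no translation path exists.
    return ("error", "no_translation_path")
```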
|
||||
|
||||
# ECT Integration {#ect-integration}
|
||||
|
||||
Every translation hop produces an ECT {{I-D.nennemann-wimse-ect}}
|
||||
that links into the workflow DAG. This provides cryptographic
|
||||
proof of protocol translation without a separate tracing mechanism.
|
||||
|
||||
## Translation ECT Claims
|
||||
|
||||
A gateway producing a translation ECT MUST set:
|
||||
|
||||
- `exec_act`: `"cpat:translate"`
|
||||
- `par`: array containing the `jti` of the source agent's ECT
|
||||
- `wid`: the workflow identifier from the source ECT (preserving
|
||||
workflow continuity across protocol boundaries)
|
||||
|
||||
The `ext` claim carries CPAT-specific metadata:
|
||||
|
||||
~~~json
|
||||
{
|
||||
"ext": {
|
||||
"cpat.source_protocol": "a2a-v1",
|
||||
"cpat.dest_protocol": "slim-v1",
|
||||
"cpat.gateway_id": "spiffe://gw.example.com/cpat",
|
||||
"cpat.translation_warnings": []
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-translation-ect title="Translation ECT Extension Claims"}
|
||||
|
||||
The `inp_hash` claim MUST contain the SHA-256 hash of the source
|
||||
protocol message. The `out_hash` claim MUST contain the SHA-256
|
||||
hash of the translated message. This allows verifiers to confirm
|
||||
that a specific input produced a specific output without accessing
|
||||
the message content.
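
A gateway could assemble the claims described in this section as sketched below. The hex encoding of the hash values is an assumption (the ECT specification may mandate a different encoding), and `translation_claims` is a hypothetical helper; signing and the remaining ECT claims are omitted.

```python
import hashlib


def translation_claims(source_ect, source_msg, translated_msg,
                       gateway_id, src_proto, dst_proto, warnings=()):
    """Build the CPAT-specific claims for one translation hop."""
    return {
        "exec_act": "cpat:translate",
        "par": [source_ect["jti"]],   # link to the source agent's ECT
        "wid": source_ect["wid"],     # preserve workflow continuity
        # Hashes are computed over the raw message bytes (hex is an assumption).
        "inp_hash": hashlib.sha256(source_msg).hexdigest(),
        "out_hash": hashlib.sha256(translated_msg).hexdigest(),
        "ext": {
            "cpat.source_protocol": src_proto,
            "cpat.dest_protocol": dst_proto,
            "cpat.gateway_id": gateway_id,
            "cpat.translation_warnings": list(warnings),
        },
    }
```

A verifier holding the original and translated messages can recompute both hashes and compare them with the claims, without the ECT itself carrying any message content.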
|
||||
|
||||
## Assurance Level Inheritance
|
||||
|
||||
The gateway MUST produce ECTs at the higher of:
|
||||
|
||||
- The source agent's declared `ect_assurance_level`
|
||||
- The destination agent's declared `ect_assurance_level`
|
||||
|
||||
At L3, the translation ECT MUST be recorded in the audit ledger
|
||||
before the translated message is forwarded to the destination agent.
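
The level selection and the L3 ordering requirement can be illustrated as follows. The function names and the `ledger_append` and `forward` callbacks are hypothetical; the sketch only shows that the ledger write happens before forwarding.

```python
LEVEL_RANK = {"L1": 1, "L2": 2, "L3": 3}


def gateway_assurance_level(source_level, dest_level):
    """Return the higher of the two declared assurance levels."""
    return max(source_level, dest_level, key=LEVEL_RANK.__getitem__)


def record_and_forward(ect, level, ledger_append, forward):
    """At L3, commit the ECT to the audit ledger before forwarding."""
    if level == "L3":
        ledger_append(ect)
    forward(ect)
```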
|
||||
|
||||
## DAG Continuity
|
||||
|
||||
The translation creates a three-node subgraph in the workflow DAG:
|
||||
|
||||
~~~
|
||||
Source Agent ECT (exec_act: "send_task")
|
||||
|
|
||||
v [par reference]
|
||||
Gateway ECT (exec_act: "cpat:translate")
|
||||
|
|
||||
v [par reference]
|
||||
Dest Agent ECT (exec_act: "receive_task")
|
||||
~~~
|
||||
{: #fig-dag-continuity title="Translation DAG Subgraph"}
|
||||
|
||||
The Execution-Context HTTP header {{I-D.nennemann-wimse-ect}}
|
||||
survives protocol translation: the gateway includes the
|
||||
translation ECT in the Execution-Context header of the forwarded
|
||||
request to the destination agent.
|
||||
|
||||
# Translation Gateway Requirements {#gateway-reqs}
|
||||
|
||||
A CPAT translation gateway MUST:
|
||||
|
||||
1. Serve a capability document listing all supported protocol
|
||||
pairs at `/.well-known/cpat/gateway`.
|
||||
|
||||
2. Accept messages via HTTP POST at its translate endpoint.
|
||||
|
||||
3. Produce an ECT for every translation per {{ect-integration}}.
|
||||
|
||||
4. Preserve message semantics: the intent, core payload content,
|
||||
and metadata MUST survive translation. Fields with no
|
||||
equivalent in the destination protocol SHOULD be carried in a
|
||||
protocol-specific extension field or dropped with a warning
|
||||
recorded in `cpat.translation_warnings`.
|
||||
|
||||
5. Return the translated message in the response body, or forward
|
||||
it directly to the destination agent.
|
||||
|
||||
A gateway MUST NOT modify payload semantics during translation.
|
||||
|
||||
Gateways MUST require TLS 1.3 for all connections and SHOULD
|
||||
implement rate limiting per source agent.
|
||||
|
||||
# Policy Integration {#policy-integration}
|
||||
|
||||
When used with the Agent Context Policy Token
|
||||
{{I-D.nennemann-agent-dag-hitl-safety}}, CPAT-related policies
|
||||
can be expressed as DAG node constraints:
|
||||
|
||||
~~~json
|
||||
{
|
||||
"dag": {
|
||||
"nodes": [
|
||||
{
|
||||
"id": "n-translate",
|
||||
"type": "cpat:translate",
|
||||
"agent": "spiffe://gw.example.com/cpat",
|
||||
"constraints": {
|
||||
"allowed_source_protocols": ["a2a-v1", "mcp-v1"],
|
||||
"allowed_dest_protocols": ["slim-v1"],
|
||||
"max_translation_hops": 2
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-policy title="CPAT Policy as DAG Node Constraints"}
|
||||
|
||||
The `max_translation_hops` constraint prevents messages from being
|
||||
translated through an excessive number of gateways. Agents
|
||||
receiving a message SHOULD reject it if the ECT DAG contains more
|
||||
translation hops than allowed by policy.
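
A receiving agent might count translation hops by walking `par` references, as in this sketch. It assumes the agent can resolve ancestor ECTs by `jti` through a hypothetical `ects_by_jti` mapping, and it counts every `cpat:translate` ECT in the ancestry, which may overcount when the DAG has parallel branches.

```python
def count_translation_hops(jti, ects_by_jti):
    """Count distinct cpat:translate ECTs reachable via par from `jti`."""
    seen, stack, hops = set(), [jti], 0
    while stack:
        current = stack.pop()
        # Skip ECTs already visited and ancestors that cannot be resolved.
        if current in seen or current not in ects_by_jti:
            continue
        seen.add(current)
        ect = ects_by_jti[current]
        if ect.get("exec_act") == "cpat:translate":
            hops += 1
        stack.extend(ect.get("par", []))
    return hops


def within_hop_limit(jti, ects_by_jti, max_translation_hops):
    """True if the message's ancestry respects the policy limit."""
    return count_translation_hops(jti, ects_by_jti) <= max_translation_hops
```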
|
||||
|
||||
# Security Considerations
|
||||
|
||||
Capability documents are served over HTTPS, ensuring transport
|
||||
security. Agents SHOULD verify TLS certificates before trusting
|
||||
capability documents.
|
||||
|
||||
Gateways are trusted intermediaries with access to message content
|
||||
during translation. For end-to-end confidentiality, agents MAY
|
||||
encrypt the message payload using a shared key established out of
|
||||
band; the gateway translates only the protocol framing, not the
|
||||
encrypted content.
|
||||
|
||||
The ECT audit trail ({{ect-integration}}) enables detection of:
|
||||
|
||||
- Unauthorized gateways (unexpected `cpat.gateway_id` in the DAG)
|
||||
- Content tampering (mismatched `inp_hash`/`out_hash` relative to
|
||||
message content)
|
||||
- Routing loops (repeated gateway IDs in the DAG ancestry)
|
||||
|
||||
At L3, the audit ledger provides tamper-evident proof of all
|
||||
translations for regulatory compliance.
|
||||
|
||||
# IANA Considerations
|
||||
|
||||
This document requests the following IANA registrations:
|
||||
|
||||
1. A "CPAT Protocol Identifier" registry under Expert Review
|
||||
policy. Initial entries: "a2a-v1", "mcp-v1", "slim-v1",
|
||||
"uacp-v1", "ainp-v1".
|
||||
|
||||
2. A well-known URI registration for "cpat" per {{RFC8615}}.
|
||||
|
||||
3. Registration of the `exec_act` value "cpat:translate" in a
|
||||
future ECT action type registry.
|
||||
|
||||
--- back
|
||||
|
||||
# Acknowledgments
|
||||
{:numbered="false"}
|
||||
|
||||
This document builds on the Execution Context Token specification
|
||||
{{I-D.nennemann-wimse-ect}} and the Agent Context Policy Token
|
||||
{{I-D.nennemann-agent-dag-hitl-safety}}.
|
||||
@@ -0,0 +1,281 @@
|
||||
Internet-Draft AI/Agent WG
|
||||
Intended status: Standards Track March 2026
|
||||
Expires: September 15, 2026
|
||||
|
||||
|
||||
Cross-Protocol Agent Translation (CPAT)
|
||||
draft-cpat-cross-protocol-agent-translation-00
|
||||
|
||||
Abstract
|
||||
|
||||
This document defines the Cross-Protocol Agent Translation (CPAT)
|
||||
framework, a lightweight mechanism enabling AI agents using
|
||||
different communication protocols to interoperate through
|
||||
capability advertisement and message translation. With over 90
|
||||
competing agent-to-agent (A2A) protocol drafts and no
|
||||
interoperability standard, protocol fragmentation is the primary
|
||||
barrier to multi-vendor agent ecosystems. CPAT defines three
|
||||
components: a capability advertisement format for agents to
|
||||
declare supported protocols, a negotiation handshake to select a
|
||||
common protocol or translation path, and a canonical envelope
|
||||
format that enables translation gateways to convert messages
|
||||
between incompatible protocols. CPAT reuses existing HTTP
|
||||
content negotiation patterns and builds on JSON for simplicity.
|
||||
|
||||
Status of This Memo
|
||||
|
||||
This Internet-Draft is submitted in full conformance with the
|
||||
provisions of BCP 78 and BCP 79.
|
||||
|
||||
This document is intended to have Standards Track status.
|
||||
Distribution of this memo is unlimited.
|
||||
|
||||
Table of Contents
|
||||
|
||||
1. Introduction
|
||||
2. Terminology
|
||||
3. Problem Statement
|
||||
4. Protocol Capability Advertisement
|
||||
5. Negotiation Handshake
|
||||
6. Canonical Envelope Format
|
||||
7. Translation Gateway Requirements
|
||||
8. Security Considerations
|
||||
9. IANA Considerations
|
||||
|
||||
1. Introduction
|
||||
|
||||
The IETF AI/agent landscape includes over 90 drafts proposing
|
||||
agent-to-agent communication protocols, yet no standard exists
|
||||
for agents using different protocols to exchange messages. This
|
||||
fragmentation mirrors the early days of instant messaging, where
|
||||
users on different networks could not communicate until gateway
|
||||
and federation standards emerged.
|
||||
|
||||
CPAT takes a pragmatic approach: rather than mandating a single
|
||||
protocol, it defines the minimum machinery for agents to
|
||||
discover each other's protocol support, agree on a common
|
||||
format, and fall back to translation gateways when no common
|
||||
protocol exists. The design follows three principles:
|
||||
|
||||
1. Reuse existing standards (HTTP, JSON, TLS) wherever possible.
|
||||
2. Keep the core mechanism small enough to implement in a day.
|
||||
3. Do not require agents to support any protocol beyond their own
|
||||
plus CPAT negotiation.
|
||||
|
||||
2. Terminology
|
||||
|
||||
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
|
||||
NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
|
||||
"OPTIONAL" in this document are to be interpreted as described
|
||||
in RFC 2119 [RFC2119].
|
||||
|
||||
Agent Protocol: A communication protocol used by an AI agent for
|
||||
peer-to-peer message exchange (e.g., A2A, MCP, SLIM, uACP).
|
||||
|
||||
Capability Document: A JSON object describing the protocols an
|
||||
agent supports, served at a well-known URI.
|
||||
|
||||
Translation Gateway: A service that converts messages between
|
||||
two agent protocols using the CPAT canonical envelope as an
|
||||
intermediate representation.
|
||||
|
||||
3. Problem Statement
|
||||
|
||||
Consider three agents: Agent A speaks Protocol X, Agent B speaks
|
||||
Protocol Y, and Agent C speaks both X and Z. Today there is no
|
||||
standard way for A to discover that B uses a different protocol,
|
||||
negotiate a common format, or route through a translator. Each
|
||||
protocol defines its own discovery and messaging layer, creating
|
||||
isolated silos.
|
||||
|
||||
Existing work on Agent Name Service (ANS) and agent discovery
|
||||
addresses finding agents but not protocol compatibility. The
|
||||
ADOL draft addresses token efficiency within a single protocol
|
||||
but not cross-protocol translation. CPAT fills the gap between
|
||||
discovery and communication.
|
||||
|
||||
4. Protocol Capability Advertisement
|
||||
|
||||
Each CPAT-compliant agent MUST serve a capability document at
|
||||
the well-known URI /.well-known/cpat. The document is a JSON
|
||||
object with the following structure:
|
||||
|
||||
{
|
||||
"cpat_version": "1.0",
|
||||
"agent_id": "urn:uuid:550e8400-e29b-41d4-a716-446655440000",
|
||||
"protocols": [
|
||||
{
|
||||
"id": "a2a-v1",
|
||||
"version": "1.0",
|
||||
"endpoint": "https://agent.example.com/a2a",
|
||||
"priority": 10
|
||||
},
|
||||
{
|
||||
"id": "mcp-v1",
|
||||
"version": "2025-03-26",
|
||||
"endpoint": "https://agent.example.com/mcp",
|
||||
"priority": 20
|
||||
}
|
||||
],
|
||||
"translation_gateways": [
|
||||
"https://gateway.example.com/cpat/translate"
|
||||
],
|
||||
"envelope_formats": ["cpat-envelope-v1"]
|
||||
}
|
||||
|
||||
The "protocols" array MUST contain at least one entry. Each
|
||||
entry MUST include "id" (a registered protocol identifier),
|
||||
"version", and "endpoint". The "priority" field is OPTIONAL;
|
||||
lower values indicate higher preference.
|
||||
|
||||
Agents SHOULD also advertise their capability document URI in
|
||||
DNS SRV or SVCB records for automated discovery. The owner-name
prefix "_cpat._tcp" SHOULD be used.
|
||||
|
||||
5. Negotiation Handshake
|
||||
|
||||
When Agent A wants to communicate with Agent B, the following
|
||||
negotiation procedure applies:
|
||||
|
||||
Step 1: Agent A fetches Agent B's capability document from
|
||||
B's well-known CPAT URI over HTTPS.
|
||||
|
||||
Step 2: Agent A computes the intersection of its own protocol
|
||||
list with Agent B's. If the intersection is non-empty, the
|
||||
protocol with the lowest combined priority score is selected.
|
||||
Communication proceeds directly using that protocol.
|
||||
|
||||
Step 3: If no common protocol exists, Agent A checks whether
|
||||
any translation gateway listed by either agent supports both
|
||||
protocols. Agent A queries the gateway's capability endpoint
|
||||
at /.well-known/cpat/gateway:
|
||||
|
||||
GET /.well-known/cpat/gateway?from=a2a-v1&to=slim-v1
|
||||
|
||||
The gateway responds with 200 OK and a translation descriptor
|
||||
if it supports the pair, or 404 if not.
|
||||
|
||||
Step 4: If a suitable gateway is found, Agent A sends its
|
||||
message wrapped in a CPAT envelope (Section 6) to the gateway,
|
||||
which translates and forwards it to Agent B.
|
||||
|
||||
Step 5: If no gateway supports the required pair, Agent A
|
||||
SHOULD return an error to its caller indicating protocol
|
||||
incompatibility, using the CPAT error code "no_translation_path".
|
||||
|
||||
The entire negotiation is stateless and cacheable. Agents
|
||||
SHOULD cache capability documents for the duration indicated by
|
||||
HTTP Cache-Control headers, defaulting to 3600 seconds when no such header is present.
|
||||
|
||||
6. Canonical Envelope Format
|
||||
|
||||
The CPAT envelope wraps a protocol-specific message in a
|
||||
standard container for gateway translation. The envelope is a
|
||||
JSON object:
|
||||
|
||||
{
|
||||
"cpat_version": "1.0",
|
||||
"message_id": "urn:uuid:6ba7b810-9dad-11d1-80b4-00c04fd430c8",
|
||||
"timestamp": "2026-03-01T12:00:00Z",
|
||||
"source": {
|
||||
"agent_id": "urn:uuid:...",
|
||||
"protocol": "a2a-v1"
|
||||
},
|
||||
"destination": {
|
||||
"agent_id": "urn:uuid:...",
|
||||
"protocol": "slim-v1"
|
||||
},
|
||||
"intent": "task_request",
|
||||
"payload": {
|
||||
"content_type": "application/json",
|
||||
"body": "...base64-encoded protocol-specific message..."
|
||||
},
|
||||
"trace": ["urn:uuid:...source", "urn:uuid:...gateway"]
|
||||
}
|
||||
|
||||
The "intent" field MUST be one of: "task_request",
|
||||
"task_response", "notification", "error", "capability_query".
|
||||
This allows gateways to perform semantic translation even when
|
||||
protocol message structures differ significantly.
|
||||
|
||||
The "trace" array provides a simple provenance chain of all
|
||||
agents and gateways that have handled the message. Each
|
||||
intermediary MUST append its own identifier.
|
||||
|
||||
The "payload.body" field contains the original protocol message,
|
||||
base64-encoded. Gateways translate by decoding the source
|
||||
protocol message, mapping it to the CPAT semantic model (intent
|
||||
+ standard fields), and re-encoding in the destination protocol.
|
||||
|
||||
7. Translation Gateway Requirements
|
||||
|
||||
A CPAT translation gateway MUST:
|
||||
|
||||
1. Serve a capability document listing all supported protocol
|
||||
pairs at /.well-known/cpat/gateway.
|
||||
|
||||
2. Accept CPAT envelopes via HTTP POST at its translate endpoint.
|
||||
|
||||
3. Validate envelope integrity before translation.
|
||||
|
||||
4. Preserve message semantics: the intent, core payload content,
|
||||
and metadata MUST survive translation. Fields with no
|
||||
equivalent in the destination protocol SHOULD be carried in
|
||||
a protocol-specific extension field or dropped with a warning.
|
||||
|
||||
5. Return the translated envelope in the response body, or
|
||||
forward it directly to the destination agent.
|
||||
|
||||
6. Log all translations with source, destination, and timestamp
|
||||
for audit purposes.
|
||||
|
||||
A gateway MUST NOT modify the payload semantics during
|
||||
translation. If exact translation is not possible, the gateway
|
||||
MUST include a "translation_warnings" array in the envelope
|
||||
listing fields that were approximated or dropped.
|
||||
|
||||
Gateways SHOULD implement rate limiting per source agent and
|
||||
MUST require TLS 1.3 [RFC8446] for all connections.
|
||||
|
||||
8. Security Considerations
|
||||
|
||||
Capability documents are served over HTTPS, ensuring transport
|
||||
security. Agents SHOULD verify the TLS certificate of peers
|
||||
before trusting their capability documents.
|
||||
|
||||
CPAT envelopes in transit through gateways are visible to the
|
||||
gateway operator. For end-to-end confidentiality, agents MAY
|
||||
encrypt the payload.body field using a shared key established
|
||||
out of band. The envelope metadata (intent, agent IDs,
|
||||
timestamps) remains visible to enable routing.
|
||||
|
||||
Gateways are trusted intermediaries. Deployments SHOULD use
|
||||
gateways operated by mutually trusted parties or verified
|
||||
through attestation mechanisms such as those in
|
||||
draft-aylward-daap-v2.
|
||||
|
||||
The trace array enables detection of routing loops and
|
||||
unauthorized intermediaries. Agents SHOULD reject messages
|
||||
with unexpected entries in the trace.
|
||||
|
||||
Denial-of-service attacks against gateways are mitigated by
|
||||
rate limiting (Section 7) and standard HTTP-layer protections.
|
||||
|
||||
9. IANA Considerations
|
||||
|
||||
This document requests IANA establish the following:
|
||||
|
||||
1. A "CPAT Protocol Identifier" registry under Expert Review
|
||||
policy. Initial entries: "a2a-v1", "mcp-v1", "slim-v1",
|
||||
"uacp-v1", "ainp-v1".
|
||||
|
||||
2. A "CPAT Intent Type" registry under Specification Required
|
||||
policy. Initial entries: "task_request", "task_response",
|
||||
"notification", "error", "capability_query".
|
||||
|
||||
3. A well-known URI registration for "cpat" per RFC 8615.
|
||||
|
||||
Author's Address
|
||||
|
||||
Generated by IETF Draft Analyzer
|
||||
2026-03-01
|
||||
315
workspace/drafts/new-drafts/draft-d-aepb-protocol-binding-00.md
Normal file
@@ -0,0 +1,315 @@
|
||||
---
|
||||
title: "Agent Ecosystem Protocol Binding (AEPB): Interop and Lifecycle"
|
||||
abbrev: "AEPB"
|
||||
category: std
|
||||
docname: draft-aepb-agent-ecosystem-protocol-binding-00
|
||||
submissiontype: IETF
|
||||
number:
|
||||
date:
|
||||
v: 3
|
||||
area: "ART"
|
||||
workgroup: "DISPATCH"
|
||||
keyword:
|
||||
- agent interoperability
|
||||
- protocol translation
|
||||
- lifecycle
|
||||
- agentic workflows
|
||||
|
||||
author:
|
||||
-
|
||||
fullname: TBD
|
||||
organization: Independent
|
||||
email: placeholder@example.com
|
||||
|
||||
normative:
|
||||
RFC2119:
|
||||
RFC8174:
|
||||
RFC8446:
|
||||
RFC8615:
|
||||
RFC9110:
|
||||
I-D.nennemann-wimse-ect:
|
||||
title: "Execution Context Tokens for Distributed Agentic Workflows"
|
||||
target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/
|
||||
I-D.nennemann-agent-dag-hitl-safety:
|
||||
title: "Agent Context Policy Token: DAG Delegation with Human Override"
|
||||
target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/
|
||||
|
||||
informative:
|
||||
|
||||
--- abstract
|
||||
|
||||
This document defines the Agent Ecosystem Protocol Binding (AEPB),
|
||||
the interoperability and lifecycle layer of the agent ecosystem.
|
||||
With over 90 competing A2A protocol drafts and no interoperability
|
||||
standard, AEPB defines capability advertisement, protocol
|
||||
negotiation, translation gateways, and agent lifecycle management
|
||||
(versioning, graceful shutdown, retirement). Translation hops
|
||||
produce ECT nodes, preserving DAG continuity across protocol
|
||||
boundaries. Protocol constraints are expressed as ACP-DAG-HITL
|
||||
node constraints.
|
||||
|
||||
--- middle
|
||||
|
||||
# Introduction
|
||||
|
||||
The IETF AI/agent landscape includes over 90 drafts proposing
|
||||
agent-to-agent communication protocols. No standard exists for
|
||||
agents using different protocols to exchange messages, and no
|
||||
standard exists for how agents evolve, get replaced, or retire
|
||||
without disrupting dependent services.
|
||||
|
||||
AEPB addresses both gaps with a pragmatic approach: rather than
|
||||
mandating a single protocol, it defines the minimum machinery for
|
||||
agents to discover each other's protocol support, agree on a
|
||||
common format, fall back to translation gateways, and manage their
|
||||
lifecycle.
|
||||
|
||||
AEPB builds on ECT {{I-D.nennemann-wimse-ect}} for audit (every
|
||||
translation hop is a DAG node) and ACP-DAG-HITL
|
||||
{{I-D.nennemann-agent-dag-hitl-safety}} for policy (protocol
|
||||
constraints as node constraints).
|
||||
|
||||
# Conventions and Definitions
|
||||
|
||||
{::boilerplate bcp14-tagged}
|
||||
|
||||
Agent Protocol:
|
||||
: A communication protocol used by an AI agent for peer-to-peer
|
||||
message exchange (e.g., A2A, MCP, SLIM, uACP).
|
||||
|
||||
Capability Document:
|
||||
: A JSON object describing the protocols an agent supports.
|
||||
|
||||
Translation Gateway:
|
||||
: A service that converts messages between two agent protocols,
|
||||
recording each translation as an ECT DAG node.
|
||||
|
||||
# Capability Advertisement {#capability}
|
||||
|
||||
Each AEPB-compliant agent MUST serve a capability document at
|
||||
`/.well-known/aepb` {{RFC8615}}:
|
||||
|
||||
~~~json
{
  "aepb_version": "1.0",
  "agent_id": "spiffe://example.com/agent/pricing",
  "protocols": [
    {
      "id": "a2a-v1",
      "version": "1.0",
      "endpoint": "https://agent.example.com/a2a",
      "priority": 10
    },
    {
      "id": "mcp-v1",
      "version": "2025-03-26",
      "endpoint": "https://agent.example.com/mcp",
      "priority": 20
    }
  ],
  "translation_gateways": [
    "https://gateway.example.com/aepb/translate"
  ],
  "ect_assurance_level": "L2",
  "lifecycle": {
    "status": "active",
    "version": "2.1.0",
    "deprecated_at": null,
    "sunset_at": null,
    "successor": null
  }
}
~~~
|
||||
{: #fig-capability title="Capability Document"}
|
||||
|
||||
The `protocols` array MUST contain at least one entry. `priority`
|
||||
is OPTIONAL; lower values indicate higher preference.
|
||||
|
||||
The `lifecycle` object (see {{lifecycle}}) provides versioning and
|
||||
deprecation metadata.
|
||||
|
||||
Agents SHOULD advertise via DNS SVCB records (`_aepb._tcp`).
|
||||
|
||||
# Protocol Negotiation {#negotiation}
|
||||
|
||||
When Agent A wants to communicate with Agent B:
|
||||
|
||||
1. Agent A fetches B's capability document over HTTPS.
|
||||
|
||||
2. Agent A computes the intersection of the two protocol lists.
   If it is non-empty, the common protocol with the lowest combined
   `priority` value (that is, the most preferred one) is selected,
   and communication proceeds directly.
|
||||
|
||||
3. If no common protocol exists, Agent A checks translation
|
||||
gateways listed by either agent:
|
||||
|
||||
~~~
|
||||
GET /.well-known/aepb/gateway?from=a2a-v1&to=slim-v1
|
||||
~~~
|
||||
|
||||
The gateway responds 200 if it supports the pair, 404 if not.
|
||||
|
||||
4. If a suitable gateway is found, Agent A sends its message to
|
||||
the gateway, which translates and forwards.
|
||||
|
||||
5. If no gateway supports the pair, Agent A returns error
|
||||
`no_translation_path`.
|
||||
|
||||
Negotiation is stateless and cacheable (Cache-Control, default
|
||||
3600s).
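
Step 2 of the procedure above can be sketched as below. The text does not define how priorities are combined or what value applies when `priority` is omitted, so this example assumes that "combined" means the sum of both agents' values and uses a hypothetical default of 100; `select_protocol` is an illustrative name.

```python
from typing import Optional

def select_protocol(a_doc: dict, b_doc: dict, default_priority: int = 100) -> Optional[str]:
    """Return the shared protocol id with the lowest combined priority, or None."""
    a = {p["id"]: p.get("priority", default_priority) for p in a_doc["protocols"]}
    b = {p["id"]: p.get("priority", default_priority) for p in b_doc["protocols"]}
    common = a.keys() & b.keys()
    if not common:
        return None  # caller falls back to translation gateways (step 3)
    # Lower values mean higher preference; ties are broken by id for determinism.
    return min(common, key=lambda pid: (a[pid] + b[pid], pid))
```

Returning `None` rather than raising keeps the gateway fallback in the caller, which matches the order of the numbered steps.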
|
||||
|
||||
# Translation as ECT DAG Nodes {#translation-ect}
|
||||
|
||||
Every translation hop produces an ECT:
|
||||
|
||||
- `exec_act`: `"aepb:translate"`
|
||||
- `par`: the source agent's ECT
|
||||
- `inp_hash`: SHA-256 of source protocol message
|
||||
- `out_hash`: SHA-256 of translated message
|
||||
|
||||
~~~json
{
  "exec_act": "aepb:translate",
  "par": ["source-agent-ect-uuid"],
  "inp_hash": "sha256-of-source-message",
  "out_hash": "sha256-of-translated-message",
  "ext": {
    "aepb.source_protocol": "a2a-v1",
    "aepb.dest_protocol": "slim-v1",
    "aepb.gateway_id": "spiffe://gw.example.com/aepb",
    "aepb.translation_warnings": []
  }
}
~~~
|
||||
{: #fig-translate-ect title="Translation ECT"}
|
||||
|
||||
This creates a three-node subgraph:
|
||||
|
||||
~~~
|
||||
Source ECT → Gateway ECT (aepb:translate) → Dest ECT
|
||||
~~~
|
||||
|
||||
The Execution-Context HTTP header survives protocol translation:
|
||||
the gateway includes the translation ECT in the header of the
|
||||
forwarded request.
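
As an illustration of how a gateway might assemble the claims listed above, consider the following sketch. It produces only the unsigned claim set. Hex encoding of the digests and the function name `build_translation_ect` are assumptions; the ECT specification governs the actual encoding and signing.

```python
import hashlib
from typing import List, Optional

def build_translation_ect(source_ect_id: str, source_msg: bytes, translated_msg: bytes,
                          source_protocol: str, dest_protocol: str, gateway_id: str,
                          warnings: Optional[List[str]] = None) -> dict:
    """Assemble the claims of an aepb:translate ECT (unsigned sketch)."""
    return {
        "exec_act": "aepb:translate",
        "par": [source_ect_id],
        # Hex encoding is an assumption; the text only says "SHA-256 of ...".
        "inp_hash": hashlib.sha256(source_msg).hexdigest(),
        "out_hash": hashlib.sha256(translated_msg).hexdigest(),
        "ext": {
            "aepb.source_protocol": source_protocol,
            "aepb.dest_protocol": dest_protocol,
            "aepb.gateway_id": gateway_id,
            "aepb.translation_warnings": list(warnings or []),
        },
    }
```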
|
||||
|
||||
## Translation Policy
|
||||
|
||||
Protocol constraints are ACP-DAG-HITL node constraints:
|
||||
|
||||
~~~json
{
  "constraints": {
    "aepb.allowed_source_protocols": ["a2a-v1", "mcp-v1"],
    "aepb.allowed_dest_protocols": ["slim-v1"],
    "aepb.max_translation_hops": 2
  }
}
~~~
|
||||
{: #fig-policy title="Translation Policy"}
|
||||
|
||||
Agents receiving a message SHOULD reject it if the ECT DAG
|
||||
contains more translation hops than `aepb.max_translation_hops`.
|
||||
|
||||
# Translation Gateway Requirements {#gateway}
|
||||
|
||||
A gateway MUST:
|
||||
|
||||
1. Serve a capability document at `/.well-known/aepb/gateway`.
|
||||
2. Accept messages via HTTP POST at its translate endpoint.
|
||||
3. Produce an ECT per {{translation-ect}} for every translation.
|
||||
4. Preserve message semantics. Fields without a destination
|
||||
equivalent are carried in an extension field or dropped with
|
||||
a warning in `aepb.translation_warnings`.
|
||||
5. Require TLS 1.3 {{RFC8446}} for all connections.
|
||||
6. Implement rate limiting per source agent.
|
||||
|
||||
A gateway MUST NOT modify payload semantics.
|
||||
|
||||
# Agent Lifecycle Management {#lifecycle}
|
||||
|
||||
## Lifecycle States
|
||||
|
||||
An agent's `lifecycle.status` MUST be one of:
|
||||
|
||||
- `active`: Normal operation. Default state.
|
||||
- `deprecated`: Agent is functional but will be retired.
|
||||
`deprecated_at` MUST be set. Clients SHOULD migrate to
|
||||
`successor` if provided.
|
||||
- `draining`: Agent is rejecting new workflows but completing
  in-progress ones. New delegation requests return HTTP 503
  with a `Retry-After` header; the response also identifies
  `successor` when one is set.
|
||||
- `retired`: Agent is offline. Capability document returns
|
||||
HTTP 410 Gone with `successor` for redirect.
|
||||
|
||||
## Versioning
|
||||
|
||||
The `lifecycle.version` field uses semantic versioning. Agents
|
||||
MUST increment the major version when breaking changes occur
|
||||
(incompatible protocol or behavior changes).
|
||||
|
||||
Capability documents MUST include the version. Agents SHOULD
|
||||
include version in ECT `ext` claims (`aepb.agent_version`) so
|
||||
the audit trail records which version performed each action.
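
A small sketch of the two versioning rules above: detecting a major-version increment and stamping the agent version into an ECT's `ext` claims. The helper names are hypothetical, and the version strings are assumed to be plain `MAJOR.MINOR.PATCH` with no pre-release suffix.

```python
def is_breaking_change(old_version: str, new_version: str) -> bool:
    """True when the semantic-versioning major component increased."""
    old_major = int(old_version.split(".")[0])
    new_major = int(new_version.split(".")[0])
    return new_major > old_major

def stamp_agent_version(ect: dict, version: str) -> dict:
    """Record the acting agent's version in the ECT `ext` claims."""
    ect.setdefault("ext", {})["aepb.agent_version"] = version
    return ect
```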
|
||||
|
||||
## Graceful Shutdown
|
||||
|
||||
When an agent transitions to `draining`:
|
||||
|
||||
1. Update capability document: `status: "draining"`,
|
||||
set `sunset_at` timestamp.
|
||||
2. Reject new workflow delegations with HTTP 503.
|
||||
3. Complete all in-progress workflows.
|
||||
4. Emit a final ECT: `exec_act: "aepb:shutdown"`.
|
||||
5. Transition to `retired`.
|
||||
|
||||
Agents SHOULD provide at least 24 hours between `deprecated`
|
||||
and `draining` to allow clients to discover the change via
|
||||
cached capability documents.
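
The recommended 24-hour notice period can be checked mechanically. The sketch below assumes that `deprecated_at` is an ISO 8601 timestamp with an explicit UTC offset; the document does not fix a timestamp format, and `may_start_draining` is an illustrative name.

```python
from datetime import datetime, timedelta, timezone

MIN_NOTICE = timedelta(hours=24)  # recommended gap between deprecated and draining

def may_start_draining(deprecated_at: str, now: datetime) -> bool:
    """True once at least 24 hours have passed since `deprecated_at`."""
    started = datetime.fromisoformat(deprecated_at)
    return now - started >= MIN_NOTICE
```

An operator tool could call this before step 1 of the shutdown sequence and refuse, or merely warn, when it returns `False`, since the 24-hour interval is a SHOULD rather than a MUST.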
|
||||
|
||||
## Successor Discovery
|
||||
|
||||
When `successor` is set, it MUST be the URI of the replacement
|
||||
agent's capability document. Clients SHOULD transparently
|
||||
redirect to the successor after verifying its capability
|
||||
document.
|
||||
|
||||
# Security Considerations
|
||||
|
||||
Capability documents are served over HTTPS. Agents SHOULD verify
|
||||
TLS certificates before trusting capability documents.
|
||||
|
||||
Gateways are trusted intermediaries with access to message content.
|
||||
For end-to-end confidentiality, agents MAY encrypt message payloads
|
||||
with a shared key established out of band.
|
||||
|
||||
The ECT audit trail enables detection of unauthorized gateways,
|
||||
content tampering (mismatched `inp_hash`/`out_hash`), and routing
|
||||
loops (repeated gateway IDs in DAG ancestry).
|
||||
|
||||
Lifecycle transitions (especially `draining` and `retired`) can be
|
||||
exploited for denial of service. Only the agent operator (verified
|
||||
via identity binding) SHOULD be able to update lifecycle status.
|
||||
|
||||
# IANA Considerations
|
||||
|
||||
This document requests:
|
||||
|
||||
1. An "AEPB Protocol Identifier" registry under Expert Review.
|
||||
Initial entries: `a2a-v1`, `mcp-v1`, `slim-v1`, `uacp-v1`,
|
||||
`ainp-v1`.
|
||||
|
||||
2. Well-known URI registrations for `aepb` and `aepb/gateway`
|
||||
per {{RFC8615}}.
|
||||
|
||||
3. Registration of `exec_act` values: `aepb:translate`,
|
||||
`aepb:shutdown` in a future ECT action type registry.
|
||||
|
||||
--- back
|
||||
|
||||
# Acknowledgments
|
||||
{:numbered="false"}
|
||||
|
||||
AEPB builds on ECT {{I-D.nennemann-wimse-ect}} for translation
|
||||
audit trails and ACP-DAG-HITL
|
||||
{{I-D.nennemann-agent-dag-hitl-safety}} for protocol policy.
|
||||
577
workspace/drafts/new-drafts/draft-d-aepb-protocol-binding-01.md
Normal file
@@ -0,0 +1,577 @@
|
||||
---
|
||||
title: "Agent Ecosystem Protocol Binding (AEPB): Interop and Lifecycle"
|
||||
abbrev: "AEPB"
|
||||
category: std
|
||||
docname: draft-aepb-agent-ecosystem-protocol-binding-01
|
||||
submissiontype: IETF
|
||||
number:
|
||||
date:
|
||||
v: 3
|
||||
area: "ART"
|
||||
workgroup: "DISPATCH"
|
||||
keyword:
|
||||
- agent interoperability
|
||||
- protocol translation
|
||||
- lifecycle
|
||||
- agentic workflows
|
||||
|
||||
author:
|
||||
-
|
||||
fullname: TBD
|
||||
organization: Independent
|
||||
email: placeholder@example.com
|
||||
|
||||
normative:
|
||||
RFC2119:
|
||||
RFC8174:
|
||||
RFC8446:
|
||||
RFC8615:
|
||||
RFC9110:
|
||||
RFC8594:
|
||||
I-D.nennemann-wimse-ect:
|
||||
title: "Execution Context Tokens for Distributed Agentic Workflows"
|
||||
target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/
|
||||
I-D.nennemann-agent-dag-hitl-safety:
|
||||
title: "Agent Context Policy Token: DAG Delegation with Human Override"
|
||||
target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/
|
||||
|
||||
informative:
|
||||
|
||||
--- abstract
|
||||
|
||||
This document defines the Agent Ecosystem Protocol Binding (AEPB),
|
||||
the interoperability and lifecycle layer of the agent ecosystem.
|
||||
With over 90 competing A2A protocol drafts and no interoperability
|
||||
standard, AEPB defines capability advertisement, protocol
|
||||
negotiation, formal binding requirements, translation gateway
|
||||
architecture, and agent lifecycle management (versioning, graceful
|
||||
shutdown, retirement). Translation hops produce ECT nodes,
|
||||
preserving DAG continuity across protocol boundaries. Protocol
|
||||
constraints are expressed as ACP-DAG-HITL node constraints.
|
||||
|
||||
--- middle
|
||||
|
||||
# Introduction
|
||||
|
||||
The IETF AI/agent landscape includes over 90 drafts proposing
|
||||
agent-to-agent communication protocols. No standard exists for
|
||||
agents using different protocols to exchange messages, and no
|
||||
standard exists for how agents evolve, get replaced, or retire
|
||||
without disrupting dependent services.
|
||||
|
||||
AEPB addresses both gaps with a pragmatic approach: rather than
|
||||
mandating a single protocol, it defines the minimum machinery for
|
||||
agents to discover each other's protocol support, agree on a
|
||||
common format, fall back to translation gateways, and manage their
|
||||
lifecycle.
|
||||
|
||||
AEPB builds on ECT {{I-D.nennemann-wimse-ect}} for audit (every
|
||||
translation hop is a DAG node) and ACP-DAG-HITL
|
||||
{{I-D.nennemann-agent-dag-hitl-safety}} for policy (protocol
|
||||
constraints as node constraints).
|
||||
|
||||
# Conventions and Definitions
|
||||
|
||||
{::boilerplate bcp14-tagged}
|
||||
|
||||
Agent Protocol:
|
||||
: A communication protocol used by an AI agent for peer-to-peer
|
||||
message exchange (e.g., A2A, MCP, SLIM, uACP).
|
||||
|
||||
Capability Document:
|
||||
: A JSON object describing the protocols an agent supports,
|
||||
lifecycle status, and ECT assurance level.
|
||||
|
||||
Translation Gateway:
|
||||
: A service that converts messages between two agent protocols,
|
||||
recording each translation as an ECT DAG node.
|
||||
|
||||
Protocol Binding:
|
||||
: The mapping between the AEPB ecosystem semantics and a specific
|
||||
agent protocol. Each binding has a stable identifier string.
|
||||
|
||||
Binding Identifier:
|
||||
: A short string identifying a specific protocol binding
|
||||
version (e.g., `a2a-v1`, `mcp-v1`).
|
||||
|
||||
# Capability Advertisement {#capability}
|
||||
|
||||
## Capability Document Format
|
||||
|
||||
Each AEPB-compliant agent MUST serve a capability document at
|
||||
`/.well-known/aepb` per {{RFC8615}}:
|
||||
|
||||
~~~json
{
  "aepb_version": "1.0",
  "agent_id": "spiffe://example.com/agent/pricing",
  "protocols": [
    {
      "id": "a2a-v1",
      "version": "1.0",
      "endpoint": "https://agent.example.com/a2a",
      "priority": 10
    },
    {
      "id": "mcp-v1",
      "version": "2025-03-26",
      "endpoint": "https://agent.example.com/mcp",
      "priority": 20
    }
  ],
  "translation_gateways": [
    "https://gateway.example.com/aepb/translate"
  ],
  "ect_assurance_level": "L2",
  "ect_namespaces": ["atd", "hitl", "apae"],
  "lifecycle": {
    "status": "active",
    "version": "2.1.0",
    "deprecated_at": null,
    "sunset_at": null,
    "successor": null
  }
}
~~~
|
||||
{: #fig-capability title="Capability Document"}
|
||||
|
||||
The `protocols` array MUST contain at least one entry. `priority`
|
||||
is OPTIONAL; lower values indicate higher preference.
|
||||
|
||||
The `ect_namespaces` field MUST list all ECT extension namespaces
|
||||
(ATD, HITL, APAE) that this agent emits and can process. Peers
|
||||
use this to determine whether ecosystem semantics are compatible.
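
One way a peer might use `ect_namespaces` is shown below. The document does not define the compatibility test itself, so this sketch assumes that two agents are compatible when every namespace the local agent emits is also listed by the peer; `missing_namespaces` is a hypothetical helper.

```python
def missing_namespaces(local_doc: dict, peer_doc: dict) -> set:
    """Namespaces the local agent emits that the peer does not list."""
    local = set(local_doc.get("ect_namespaces", []))
    peer = set(peer_doc.get("ect_namespaces", []))
    return local - peer
```

An empty result would indicate compatibility under this assumed rule; a non-empty one names the extensions the peer cannot process.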
|
||||
|
||||
The `lifecycle` object (see {{lifecycle}}) provides versioning and
|
||||
deprecation metadata.
|
||||
|
||||
## DNS-SD Advertisement
|
||||
|
||||
Agents SHOULD advertise via DNS SVCB records (`_aepb._tcp`) as
|
||||
an alternative to well-known URI discovery. The SVCB record
|
||||
MUST include a `hint` parameter pointing to the well-known URI.
|
||||
|
||||
## Capability Document Caching
|
||||
|
||||
Capability documents MAY be cached according to HTTP caching
semantics (RFC 9111, which builds on {{RFC9110}}). The default
max-age is 3600 seconds.
|
||||
Agents MUST set `Expires` or `Cache-Control: max-age` on
|
||||
capability document responses.
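
A client-side sketch of the caching rule above: derive how long a fetched capability document may be reused from its `Cache-Control` value, falling back to the 3600-second default. This is deliberately not a complete HTTP cache (it ignores `Expires`, `Age`, and most directives), and `freshness_lifetime` is an illustrative name.

```python
import re
from typing import Optional

DEFAULT_MAX_AGE = 3600  # seconds, the default stated above

def freshness_lifetime(cache_control: Optional[str]) -> int:
    """Seconds a capability document may be reused, from a Cache-Control value."""
    if cache_control:
        if re.search(r"\bno-store\b", cache_control, re.IGNORECASE):
            return 0  # do not reuse at all
        match = re.search(r"\bmax-age\s*=\s*(\d+)", cache_control, re.IGNORECASE)
        if match:
            return int(match.group(1))
    return DEFAULT_MAX_AGE
```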
|
||||
|
||||
# Protocol Negotiation {#negotiation}
|
||||
|
||||
When Agent A wants to communicate with Agent B:
|
||||
|
||||
1. Agent A fetches B's capability document over HTTPS.
|
||||
|
||||
2. Agent A computes the intersection of the two protocol lists.
   If it is non-empty, the common protocol with the lowest combined
   `priority` value (that is, the most preferred one) is selected,
   and communication proceeds directly.
|
||||
|
||||
3. If no common protocol exists, Agent A checks translation
|
||||
gateways listed by either agent:
|
||||
|
||||
~~~
|
||||
GET /.well-known/aepb/gateway?from=a2a-v1&to=slim-v1 HTTP/1.1
|
||||
~~~
|
||||
|
||||
The gateway responds 200 if it supports the pair, 404 if not.
|
||||
|
||||
4. If a suitable gateway is found, Agent A sends its message to
|
||||
the gateway, which translates and forwards.
|
||||
|
||||
5. If no gateway supports the pair, Agent A MUST return error
|
||||
`no_translation_path` and MUST NOT proceed.
|
||||
|
||||
Negotiation is stateless. Its result can be reused for as long as
the underlying capability documents remain fresh (`Cache-Control`,
default 3600 seconds).
|
||||
|
||||
## Protocol Downgrade Prevention
|
||||
|
||||
Protocol negotiation MUST NOT result in selection of a binding
|
||||
below the minimum configured in ACP-DAG-HITL node constraints:
|
||||
|
||||
~~~json
{
  "constraints": {
    "aepb.min_protocol_security": "tls-1.3"
  }
}
~~~
|
||||
|
||||
Agents MUST reject capability documents that advertise only
|
||||
protocols below their configured minimum security requirement.
|
||||
Specifically, all protocols MUST use TLS 1.3 {{RFC8446}}; no
|
||||
plaintext bindings are permitted in production deployments.
|
||||
|
||||
# Conforming Protocol Binding Requirements {#binding-requirements}
|
||||
|
||||
A protocol binding MUST satisfy the following requirements to be
|
||||
registered in the AEPB Protocol Binding Registry.
|
||||
|
||||
## ECT Carriage
|
||||
|
||||
A conforming binding MUST provide a mechanism to carry ECTs
|
||||
alongside protocol messages. For HTTP-based protocols, this
|
||||
MUST be the `Execution-Context` header as defined in
|
||||
{{I-D.nennemann-wimse-ect}}. For non-HTTP protocols, the
|
||||
binding specification MUST define an equivalent envelope field.
|
||||
|
||||
## Task Invocation with Parent Reference
|
||||
|
||||
A conforming binding MUST support task invocation messages that
|
||||
include a reference to the parent ECT `jti`. This allows the
|
||||
receiving agent to link the new task into the ECT DAG.
|
||||
|
||||
## Checkpoint and Rollback Signal Carriage
|
||||
|
||||
A conforming binding MUST support conveying ATD rollback requests
|
||||
and results. For HTTP-based bindings, the `/.well-known/atd/rollback`
|
||||
endpoint MUST be accessible independent of the main protocol
|
||||
endpoint.
|
||||
|
||||
## HITL Callback Registration
|
||||
|
||||
A conforming binding MUST support HITL approval callback
|
||||
registration. When a task involves a planned approval gate, the
|
||||
initiating agent MUST be able to register a callback URI that
|
||||
receives the `hitl:approval_granted` or `hitl:approval_denied`
|
||||
ECT when the human responds. For HTTP bindings, this is a
|
||||
standard webhook registration.
|
||||
|
||||
## Summary Table
|
||||
|
||||
| Requirement | Minimum | Rationale |
|-------------|---------|-----------|
| ECT carriage | `Execution-Context` header or equivalent | DAG continuity |
| Parent ECT reference | In task invocation | DAG linkage |
| Rollback signal | `/.well-known/atd/rollback` accessible | Error recovery |
| HITL callback | Webhook or equivalent | Async approval |
| Transport security | TLS 1.3 | Integrity and confidentiality |
|
||||
{: #fig-requirements title="Protocol Binding Conformance Requirements"}
|
||||
|
||||
# Translation Gateway Architecture {#translation}
|
||||
|
||||
## Gateway as DAG Node
|
||||
|
||||
Every translation hop produces an ECT:
|
||||
|
||||
- `exec_act`: `"aepb:translate"`
|
||||
- `par`: the source agent's ECT
|
||||
|
||||
~~~json
{
  "exec_act": "aepb:translate",
  "par": ["source-agent-ect-uuid"],
  "inp_hash": "sha256-of-source-message",
  "out_hash": "sha256-of-translated-message",
  "ext": {
    "aepb.source_protocol": "a2a-v1",
    "aepb.dest_protocol": "slim-v1",
    "aepb.gateway_id": "spiffe://gw.example.com/aepb",
    "aepb.translation_warnings": []
  }
}
~~~
|
||||
{: #fig-translate-ect title="Translation ECT"}
|
||||
|
||||
This creates a three-node subgraph in the ECT DAG:
|
||||
|
||||
~~~
|
||||
Source ECT → Gateway ECT (aepb:translate) → Dest ECT
|
||||
~~~
|
||||
|
||||
The `Execution-Context` HTTP header survives protocol translation:
|
||||
the gateway includes the translation ECT in the header of the
|
||||
forwarded request, maintaining DAG continuity.
|
||||
|
||||
## Multi-Hop Translation
|
||||
|
||||
When a single gateway cannot handle a translation pair, messages
|
||||
may traverse multiple gateways. Each hop produces an
|
||||
`aepb:translate` ECT, all linked in the same DAG:
|
||||
|
||||
~~~
Agent-A ECT
    │
    ▼
Gateway-1 ECT (a2a-v1 → mcp-v1)
    │
    ▼
Gateway-2 ECT (mcp-v1 → slim-v1)
    │
    ▼
Agent-B ECT
~~~
|
||||
{: #fig-multihop title="Multi-Hop Translation DAG"}
|
||||
|
||||
The maximum number of translation hops is configured as a
|
||||
node constraint:
|
||||
|
||||
~~~json
{
  "constraints": {
    "aepb.max_translation_hops": 2
  }
}
~~~
|
||||
|
||||
Agents receiving a message MUST count `aepb:translate` ECTs in
|
||||
the `par` ancestry and MUST reject messages exceeding
|
||||
`aepb.max_translation_hops`. The default maximum is 3.
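
The hop-count check above could be implemented along these lines. The sketch assumes the receiver already holds the relevant ECT claims in a dictionary keyed by ECT identifier (called `dag` here); how those ECTs are retrieved and verified is outside the example, and both function names are illustrative.

```python
def count_translation_hops(ect_id: str, dag: dict) -> int:
    """Count aepb:translate ECTs among `ect_id` and its ancestors (via `par`)."""
    seen, stack, hops = set(), [ect_id], 0
    while stack:
        current = stack.pop()
        if current in seen or current not in dag:
            continue  # already visited, or ancestor not available locally
        seen.add(current)
        node = dag[current]
        if node.get("exec_act") == "aepb:translate":
            hops += 1
        stack.extend(node.get("par", []))
    return hops

def within_hop_limit(ect_id: str, dag: dict, max_hops: int = 3) -> bool:
    """Apply the limit; 3 is the default maximum stated above."""
    return count_translation_hops(ect_id, dag) <= max_hops
```

The `seen` set keeps the traversal finite even if a malformed ancestry contains a cycle, and it avoids double-counting a gateway ECT that is reachable through more than one parent.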
|
||||
|
||||
## Gateway Requirements
|
||||
|
||||
A gateway MUST:
|
||||
|
||||
1. Serve a capability document at `/.well-known/aepb/gateway`
|
||||
listing supported translation pairs.
|
||||
2. Accept messages via HTTP POST at its translate endpoint.
|
||||
3. Produce an `aepb:translate` ECT per {{translation}} for
|
||||
every translation.
|
||||
4. Preserve message semantics. Fields without a destination
|
||||
equivalent are carried in an extension field or dropped with
|
||||
a warning in `aepb.translation_warnings`.
|
||||
5. Require TLS 1.3 {{RFC8446}} for all connections.
|
||||
6. Implement per-source-agent rate limiting.
|
||||
7. Sign the ECTs it produces at assurance level L2 or higher
   (signed JWT minimum).
|
||||
|
||||
A gateway MUST NOT modify payload semantics beyond what is
|
||||
required for protocol translation.
|
||||
|
||||
## Translation Failure Handling
|
||||
|
||||
When a gateway fails to translate a message, it MUST emit an
|
||||
error ECT:
|
||||
|
||||
~~~json
{
  "exec_act": "aepb:translate_error",
  "par": ["source-agent-ect-uuid"],
  "ext": {
    "aepb.source_protocol": "a2a-v1",
    "aepb.dest_protocol": "slim-v1",
    "aepb.error": "semantic_loss",
    "aepb.description": "Source message contains field 'action.stream' with no slim-v1 equivalent"
  }
}
~~~
|
||||
{: #fig-translate-error title="Translation Error ECT"}
|
||||
|
||||
Error values: `semantic_loss` (untranslatable field), `timeout`,
|
||||
`policy_violation` (exceeds hop limit), `internal_error`.
|
||||
|
||||
On translation failure:
|
||||
- The ATD circuit breaker for the gateway agent SHOULD be
|
||||
updated.
|
||||
- If `atd.cascade: false`, the calling agent returns
|
||||
`no_translation_path` to its upstream caller.
|
||||
- If `atd.cascade: true`, the ATD rollback protocol applies
|
||||
to the entire workflow subgraph.
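
The branching on `atd.cascade` described above can be sketched as a small decision function. The shape of the returned dictionary and the name `on_translation_failure` are hypothetical; the actual rollback procedure is defined by ATD and is not reproduced here.

```python
def on_translation_failure(error_ect: dict, cascade: bool) -> dict:
    """Decide the caller's next step after an aepb:translate_error ECT."""
    if error_ect.get("exec_act") != "aepb:translate_error":
        raise ValueError("not a translation error ECT")
    reason = error_ect.get("ext", {}).get("aepb.error", "internal_error")
    if cascade:
        # atd.cascade: true -> ATD rollback applies to the workflow subgraph.
        return {"action": "atd_rollback", "scope": "workflow_subgraph", "reason": reason}
    # atd.cascade: false -> report no_translation_path to the upstream caller.
    return {"action": "return_error", "error": "no_translation_path", "reason": reason}
```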
|
||||
|
||||
## Translation Policy
|
||||
|
||||
Protocol constraints are ACP-DAG-HITL node constraints:
|
||||
|
||||
~~~json
{
  "constraints": {
    "aepb.allowed_source_protocols": ["a2a-v1", "mcp-v1"],
    "aepb.allowed_dest_protocols": ["slim-v1"],
    "aepb.max_translation_hops": 2
  }
}
~~~
|
||||
{: #fig-policy title="Translation Policy"}
|
||||
|
||||
# Agent Lifecycle Management {#lifecycle}
|
||||
|
||||
## Lifecycle States
|
||||
|
||||
An agent's `lifecycle.status` MUST be one of:
|
||||
|
||||
active:
|
||||
: Normal operation. Default state.
|
||||
|
||||
deprecated:
|
||||
: Agent is functional but will be retired.
|
||||
  `deprecated_at` MUST be set. The agent MUST include a
  `Deprecation` header (RFC 9745) in all responses and can
  announce the planned retirement time with the `Sunset`
  header of {{RFC8594}}.
|
||||
Clients SHOULD migrate to `successor` if provided.
|
||||
|
||||
draining:
|
||||
: Agent is rejecting new workflows but completing in-progress
|
||||
ones. New delegation requests MUST return HTTP 503 with
|
||||
`Retry-After` header and, if set, `Location` pointing to
|
||||
`successor`.
|
||||
|
||||
retired:
|
||||
: Agent is offline. Capability document MUST return HTTP 410
|
||||
Gone with `Link: <successor>; rel="successor-version"`.
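
The two state-dependent HTTP responses defined above (`draining` and `retired`) might be produced by a helper like the following. The `request_kind` parameter, the default `Retry-After` value, and the function name are assumptions for illustration; the header for the `deprecated` state is omitted.

```python
from typing import Dict, Optional, Tuple

def lifecycle_response(lifecycle: dict, request_kind: str,
                       retry_after_s: int = 3600) -> Optional[Tuple[int, Dict[str, str]]]:
    """Return (status, headers) when the lifecycle state dictates the response.

    `request_kind` is "delegation" or "capability" (hypothetical labels).
    Returns None when normal processing should continue.
    """
    status = lifecycle["status"]
    successor = lifecycle.get("successor")
    if status == "draining" and request_kind == "delegation":
        headers = {"Retry-After": str(retry_after_s)}
        if successor:
            headers["Location"] = successor
        return 503, headers
    if status == "retired" and request_kind == "capability":
        headers = {}
        if successor:
            headers["Link"] = f'<{successor}>; rel="successor-version"'
        return 410, headers
    return None
```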
|
||||
|
||||
## Lifecycle State Transitions
|
||||
|
||||
~~~
        deprecate               drain
active ──────────► deprecated ────────► draining ──► retired
  ▲                    │                    │
  │                    │  immediate drain   │
  └────────────────────┴────────────────────┘
             (operator discretion)
~~~
|
||||
{: #fig-lifecycle-fsm title="Lifecycle State Machine"}
|
||||
|
||||
All transitions MUST be recorded as ECTs:
|
||||
|
||||
~~~json
{
  "exec_act": "aepb:lifecycle_change",
  "ext": {
    "aepb.agent_id": "spiffe://example.com/agent/pricing",
    "aepb.from_state": "active",
    "aepb.to_state": "deprecated",
    "aepb.reason": "Replaced by pricing-v3"
  }
}
~~~
|
||||
{: #fig-lifecycle-ect title="Lifecycle Change ECT"}
|
||||
|
||||
## Versioning
|
||||
|
||||
The `lifecycle.version` field uses semantic versioning. Agents
|
||||
MUST increment the major version when breaking changes occur
|
||||
(incompatible protocol or behavior changes).
|
||||
|
||||
Capability documents MUST include the version. Agents SHOULD
|
||||
include version in ECT `ext` claims (`aepb.agent_version`) so
|
||||
the audit trail records which version performed each action.
|
||||
|
||||
## Graceful Shutdown
|
||||
|
||||
When an agent transitions to `draining`:
|
||||
|
||||
1. Update capability document: `status: "draining"`,
|
||||
set `sunset_at` timestamp.
|
||||
2. Reject new workflow delegations with HTTP 503.
|
||||
3. Complete all in-progress workflows.
|
||||
4. Emit a final ECT: `exec_act: "aepb:shutdown"`.
|
||||
5. Transition to `retired`.
|
||||
|
||||
Agents SHOULD provide at least 24 hours between `deprecated`
|
||||
and `draining` to allow clients to discover the change via
|
||||
cached capability documents.
|
||||
|
||||
## Successor Discovery
|
||||
|
||||
When `successor` is set, it MUST be the URI of the replacement
|
||||
agent's capability document. Clients SHOULD transparently
|
||||
redirect to the successor after verifying its capability
|
||||
document. Clients MUST verify that the successor's assurance
|
||||
level is equal to or greater than the predecessor's.
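
The assurance-level check on successors can be expressed in a few lines. This sketch assumes that levels are written as `L` followed by an integer (as in the `"L2"` example value) and that a higher number means stronger assurance; `successor_acceptable` is an illustrative name.

```python
def assurance_rank(doc: dict) -> int:
    """Numeric rank of a capability document's `ect_assurance_level` (e.g. "L2" -> 2)."""
    level = doc["ect_assurance_level"]
    if not (level.startswith("L") and level[1:].isdigit()):
        raise ValueError(f"unrecognized assurance level: {level!r}")
    return int(level[1:])

def successor_acceptable(predecessor_doc: dict, successor_doc: dict) -> bool:
    """True when the successor's assurance level is at least the predecessor's."""
    return assurance_rank(successor_doc) >= assurance_rank(predecessor_doc)
```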
|
||||
|
||||
# Security Considerations
|
||||
|
||||
## Capability Document Integrity
|
||||
|
||||
Capability documents are served over HTTPS with TLS 1.3.
|
||||
Agents SHOULD verify TLS certificates before trusting capability
|
||||
documents. For high-assurance deployments, capability documents
|
||||
SHOULD be signed as JWTs ({{RFC7519}}) so their integrity can
|
||||
be verified independently of transport security.
|
||||
|
||||
## Gateway Trust
|
||||
|
||||
Gateways are trusted intermediaries with access to message
|
||||
content. For end-to-end confidentiality, agents MAY encrypt
|
||||
message payloads with a shared key established out of band.
|
||||
|
||||
The ECT audit trail enables detection of:
|
||||
- Unauthorized gateways (unknown `aepb.gateway_id`).
|
||||
- Content tampering (`inp_hash`/`out_hash` mismatch).
|
||||
- Routing loops (repeated gateway IDs in DAG ancestry).
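
The three checks listed above could be run over a chain of translation ECTs as sketched below. Two points are assumptions rather than statements from this document: the chain is supplied in source-to-destination order, and a hash mismatch is taken to mean that one hop's `inp_hash` differs from the previous hop's `out_hash`. The set of trusted gateway identifiers is deployment configuration.

```python
from typing import List, Set, Tuple

def audit_translation_chain(ects: List[dict], trusted_gateways: Set[str]) -> List[Tuple[str, str]]:
    """Return (finding, gateway_id) pairs for a chain of aepb:translate ECTs."""
    findings, seen, prev_out = [], set(), None
    for ect in ects:
        gw = ect["ext"]["aepb.gateway_id"]
        if gw not in trusted_gateways:
            findings.append(("unauthorized_gateway", gw))
        if gw in seen:
            findings.append(("routing_loop", gw))
        seen.add(gw)
        # Assumed rule: each hop must consume exactly what the previous hop emitted.
        if prev_out is not None and ect["inp_hash"] != prev_out:
            findings.append(("hash_mismatch", gw))
        prev_out = ect["out_hash"]
    return findings
```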
|
||||
|
||||
Gateways MUST authenticate using WIMSE/SPIFFE identities at
|
||||
ECT assurance L2+.
|
||||
|
||||
## Protocol Downgrade Attacks
|
||||
|
||||
An attacker may attempt to force negotiation to a weaker
|
||||
protocol. Mitigation:
|
||||
|
||||
- Agents MUST enforce `aepb.min_protocol_security` constraint.
|
||||
- TLS 1.3 is the minimum transport; lower versions MUST be
|
||||
rejected.
|
||||
- Protocol negotiation results MUST be logged as part of the
|
||||
workflow ECT DAG.
|
||||
|
||||
## Translation Amplification
|
||||
|
||||
A single cross-protocol request could trigger a chain of N
|
||||
translations, each consuming resources. Mitigation:
|
||||
|
||||
- `aepb.max_translation_hops` (default 3) prevents unbounded
|
||||
chains.
|
||||
- Per-source rate limiting at each gateway prevents a single
|
||||
agent from flooding the translation infrastructure.
|
||||
|
||||
## Lifecycle Denial of Service
|
||||
|
||||
Transitioning an agent to `draining` or `retired` disrupts
|
||||
its callers. Only the agent operator (verified via ACP-DAG-HITL
|
||||
identity binding) SHOULD be able to trigger lifecycle
|
||||
transitions. Lifecycle-change ECTs MUST be signed at L2+.
|
||||
|
||||
# IANA Considerations
|
||||
|
||||
## AEPB Protocol Binding Registry
|
||||
|
||||
This document requests the creation of the "AEPB Protocol Binding
|
||||
Registry" under IANA. Registration policy: Specification Required.
|
||||
|
||||
Required fields: Binding Identifier, Protocol Name, Specification
|
||||
Reference, Minimum ECT Assurance Level, HITL Callback Support.
|
||||
|
||||
Initial entries:
|
||||
|
||||
| Identifier | Protocol | Spec Reference | Min Assurance | HITL Callback |
|
||||
|------------|----------|---------------|--------------|---------------|
|
||||
| `a2a-v1` | A2A | (TBD) | L1 | Webhook |
|
||||
| `mcp-v1` | Model Context Protocol | (TBD) | L1 | Webhook |
|
||||
| `slim-v1` | SLIM | (TBD) | L1 | Webhook |
|
||||
| `uacp-v1` | uACP | (TBD) | L1 | Webhook |
|
||||
| `ainp-v1` | AINP | (TBD) | L1 | Webhook |
|
||||
{: #fig-registry title="Initial Protocol Binding Registry Entries"}
|
||||
|
||||
## Well-Known URIs
|
||||
|
||||
This document requests registration per {{RFC8615}}:
|
||||
|
||||
| URI Suffix | Purpose |
|
||||
|------------|---------|
|
||||
| `aepb` | Agent capability document |
|
||||
| `aepb/gateway` | Translation gateway capability |
|
||||
{: #fig-wellknown title="Well-Known URI Registrations"}
|
||||
|
||||
## `exec_act` Values
|
||||
|
||||
This document requests registration in the AEM Ecosystem
|
||||
Extension Registry:
|
||||
|
||||
| Value | Description | Reference |
|
||||
|-------|-------------|-----------|
|
||||
| `aepb:translate` | Protocol translation hop | This document |
|
||||
| `aepb:translate_error` | Translation failure | This document |
|
||||
| `aepb:shutdown` | Agent graceful shutdown complete | This document |
|
||||
| `aepb:lifecycle_change` | Lifecycle state transition | This document |
|
||||
{: #fig-iana-actions title="AEPB exec_act Registrations"}
|
||||
|
||||
--- back
|
||||
|
||||
# Acknowledgments
|
||||
{:numbered="false"}
|
||||
|
||||
AEPB builds on ECT {{I-D.nennemann-wimse-ect}} for translation
|
||||
audit trails and ACP-DAG-HITL
|
||||
{{I-D.nennemann-agent-dag-hitl-safety}} for protocol policy.
|
||||
The lifecycle model is inspired by Kubernetes graceful shutdown
|
||||
semantics and the `Sunset` header {{RFC8594}}.
|
||||
@@ -0,0 +1,360 @@
|
||||
---
|
||||
title: "Dynamic Agent Trust Scoring (DATS)"
|
||||
abbrev: "DATS"
|
||||
category: std
|
||||
docname: draft-dats-dynamic-agent-trust-scoring-00
|
||||
submissiontype: IETF
|
||||
number:
|
||||
date:
|
||||
v: 3
|
||||
area: "SEC"
|
||||
workgroup: "Security Dispatch"
|
||||
keyword:
|
||||
- dynamic trust
|
||||
- reputation
|
||||
- agentic workflows
|
||||
- execution context
|
||||
|
||||
author:
|
||||
-
|
||||
fullname: Generated by IETF Draft Analyzer
|
||||
organization: Independent
|
||||
email: placeholder@example.com
|
||||
|
||||
normative:
|
||||
RFC7519:
|
||||
RFC7515:
|
||||
RFC7518:
|
||||
RFC9110:
|
||||
I-D.nennemann-wimse-ect:
|
||||
title: "Execution Context Tokens for Distributed Agentic Workflows"
|
||||
target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/
|
||||
|
||||
informative:
|
||||
I-D.nennemann-agent-dag-hitl-safety:
|
||||
title: "Agent Context Policy Token: DAG Delegation with Human Override"
|
||||
target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/
|
||||
|
||||
--- abstract
|
||||
|
||||
This document defines the Dynamic Agent Trust Scoring (DATS)
|
||||
protocol, a mechanism for AI agents to build, assess, and revoke
|
||||
trust relationships based on observed behavior over time. Static
|
||||
authentication verifies identity but says nothing about reliability.
|
||||
DATS augments identity-based auth with a numeric trust score that
|
||||
adjusts dynamically based on interaction outcomes recorded in the
|
||||
ECT DAG. Trust events are derived from ECT action outcomes rather
|
||||
than agent-local tracking, making trust computation auditable and
|
||||
tamper-evident. Trust assertions are ECTs themselves, and trust
|
||||
thresholds integrate with ACP-DAG-HITL node constraints as
|
||||
enforceable policy.
|
||||
|
||||
--- middle
|
||||
|
||||
# Introduction
|
||||
|
||||
The IETF has 98 drafts addressing agent identity and
|
||||
authentication, providing strong mechanisms for verifying who an
|
||||
agent is. But identity alone is insufficient for long-running
|
||||
autonomous systems. A properly authenticated agent may still
|
||||
produce bad results, violate expectations, or degrade over time.
|
||||
|
||||
DATS adds a behavioral dimension to trust. It answers: "I know
|
||||
who you are, but should I rely on you?" The model is deliberately
|
||||
simple -- a single floating-point score between 0.0 and 1.0 per
|
||||
agent relationship -- because complex reputation systems tend to
|
||||
be gamed or ignored.
|
||||
|
||||
By building on ECT {{I-D.nennemann-wimse-ect}}, DATS derives trust
|
||||
from the cryptographically signed record of actual interactions
|
||||
rather than agent-local counters that can be manipulated. At L3,
|
||||
the audit ledger provides an immutable interaction history.
|
||||
|
||||
The protocol is inspired by:
|
||||
|
||||
- TCP congestion control: trust increases slowly (additive) and
|
||||
decreases quickly (multiplicative) on failure.
|
||||
- TLS certificate transparency: trust assertions are logged.
|
||||
- Web of trust (PGP): trust propagates through intermediaries
|
||||
with attenuation.
|
||||
|
||||
# Conventions and Definitions
|
||||
|
||||
{::boilerplate bcp14-tagged}
|
||||
|
||||
Trust Score:
|
||||
: A floating-point value in \[0.0, 1.0\] representing one agent's
|
||||
assessment of another's reliability, based on observed ECT outcomes.
|
||||
|
||||
Trust Event:
|
||||
: An observable interaction outcome that causes a trust score
|
||||
adjustment. Derived from ECTs in the workflow DAG.
|
||||
|
||||
Trust Decay:
|
||||
: Automatic reduction of trust scores during inactivity, reflecting
|
||||
the principle that trust requires ongoing evidence.
|
||||
|
||||
Trust Assertion:
|
||||
: An ECT recording one agent's trust score for another,
|
||||
transportable as a signed token.
|
||||
|
||||
# Problem Statement
|
||||
|
||||
Agent A delegates a task to Agent B. After 100 successful
|
||||
interactions, Agent B starts returning incorrect results (model
|
||||
drift, adversarial manipulation, or degradation). Agent A has no
|
||||
standard way to:
|
||||
|
||||
1. Track B's reliability over time.
|
||||
2. Reduce B's privileges based on degraded performance.
|
||||
3. Share its experience with Agent C.
|
||||
4. Automatically revoke B's access when trust drops below
|
||||
acceptable levels.
|
||||
|
||||
Existing attestation drafts (STAMP, DAAP) provide cryptographic
|
||||
proof of specific actions but not ongoing behavioral assessment.
|
||||
The ECT DAG records what happened; DATS adds evaluation of
|
||||
whether what happened was good.
|
||||
|
||||
# Trust Score Model {#trust-model}
|
||||
|
||||
Each agent maintains a trust table: a mapping from peer agent IDs
|
||||
to trust scores.
|
||||
|
||||
~~~json
|
||||
{
|
||||
"spiffe://example.com/agent/b": {
|
||||
"score": 0.82,
|
||||
"interactions": 147,
|
||||
"last_updated": "2026-03-01T11:30:00Z",
|
||||
"last_event_ect": "550e8400-e29b-41d4-a716-446655440099"
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-trust-table title="Trust Table Entry"}
|
||||
|
||||
Initial trust for an unknown agent is deployment-configured. A
|
||||
value of 0.5 is RECOMMENDED as a neutral starting point.
|
||||
Zero-trust deployments MAY use 0.1.
|
||||
|
||||
Trust scores are updated using additive-increase,
|
||||
multiplicative-decrease (AIMD):
|
||||
|
||||
On positive event:
|
||||
: `score = min(1.0, score + alpha)`
|
||||
|
||||
On negative event:
|
||||
: `score = max(0.0, score * beta)`
|
||||
|
||||
Default parameters: `alpha = 0.01`, `beta = 0.8`.
|
||||
|
||||
This means trust builds slowly (50 successes from 0.5 to 1.0)
|
||||
but drops quickly (a single failure takes 0.82 to 0.66). This
|
||||
asymmetry is intentional: in autonomous systems, the cost of
|
||||
trusting a bad agent exceeds the cost of slow trust building.
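As an illustration, a trust table entry and its AIMD update might look like the sketch below. The field names follow the trust table figure; the `TrustEntry` class and its `record` method are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class TrustEntry:
    """One row of the trust table, keyed elsewhere by peer agent ID."""
    score: float = 0.5          # RECOMMENDED neutral starting point
    interactions: int = 0
    last_updated: str = ""
    last_event_ect: str = ""

    def record(self, positive: bool, ect_id: str,
               alpha: float = 0.01, beta: float = 0.8, now=None) -> None:
        """Apply one AIMD step and refresh the bookkeeping fields."""
        if positive:
            self.score = min(1.0, self.score + alpha)   # additive increase
        else:
            self.score = max(0.0, self.score * beta)    # multiplicative decrease
        self.interactions += 1
        now = now or datetime.now(timezone.utc)
        self.last_updated = now.strftime("%Y-%m-%dT%H:%M:%SZ")
        self.last_event_ect = ect_id
```

For example, calling `record(False, ...)` on an entry with score 0.82 leaves it at roughly 0.66, matching the single-failure case described above.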
|
||||
|
||||
# Trust Events from ECT {#trust-events}
|
||||
|
||||
Trust events are derived from ECTs in the workflow DAG rather than
|
||||
agent-local tracking. This makes trust computation auditable.
|
||||
|
||||
## Standard Trust Events
|
||||
|
||||
| ECT condition | Event | Adjustment |
|
||||
|--------------|-------|------------|
|
||||
| `exec_act` completed, no error ECT follows | `task_success` | +1x alpha |
|
||||
| `exec_act` completed, partial result | `task_partial` | +0.5x alpha |
|
||||
| `aerr:error` ECT with `par` referencing agent | `task_failure` | 1x beta |
|
||||
| Timeout (no response ECT within threshold) | `task_timeout` | 1x beta |
|
||||
| `aerr:error` with `constraint_violation` | `policy_violation` | beta^2 |
|
||||
| ECT signature verification fails | `attestation_invalid` | beta^2 |
|
||||
| `aerr:rollback_request` targeting agent | `rollback_triggered` | 1x beta |
|
||||
{: #fig-events title="Trust Events Derived from ECT"}
|
||||
|
||||
`beta^2` means the multiplicative decrease is applied twice
|
||||
(`score * beta * beta`), reflecting the severity of policy
|
||||
violations versus simple failures.
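The table can be read as a small dispatch structure. The sketch below assumes the event has already been classified from the ECT DAG; the `EVENT_RULES` name and its `(kind, weight)` encoding are choices made for this example only.

```python
# (kind, weight): "add" scales alpha, "mul" is the number of times beta applies.
EVENT_RULES = {
    "task_success":        ("add", 1.0),
    "task_partial":        ("add", 0.5),
    "task_failure":        ("mul", 1),
    "task_timeout":        ("mul", 1),
    "policy_violation":    ("mul", 2),   # beta^2
    "attestation_invalid": ("mul", 2),   # beta^2
    "rollback_triggered":  ("mul", 1),
}


def apply_event(score: float, event: str,
                alpha: float = 0.01, beta: float = 0.8) -> float:
    """Return the trust score after one standard trust event."""
    kind, weight = EVENT_RULES[event]
    if kind == "add":
        return min(1.0, score + weight * alpha)
    return max(0.0, score * beta ** weight)
```

With the default parameters, a `policy_violation` multiplies the score by 0.64, while a plain `task_failure` multiplies it by 0.8.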
|
||||
|
||||
## Trust Decay
|
||||
|
||||
If no interaction (no ECT involving the peer) occurs for a
|
||||
configurable period (default: 7 days), the trust score decays:
|
||||
|
||||
`score = max(initial_default, score - decay_rate)`
|
||||
|
||||
Default `decay_rate`: 0.01 per day.
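One possible reading of the decay rule is sketched below. Two points are assumptions rather than text from this document: decay is charged once per day of inactivity beyond the configured period, and scores already at or below `initial_default` are left unchanged (the formula as written would return `initial_default` for them, which would raise a penalized score).

```python
def decayed_score(score: float, idle_days: float,
                  initial_default: float = 0.5,
                  decay_rate: float = 0.01,
                  grace_days: float = 7) -> float:
    """Return the score after `idle_days` without any ECT involving the peer."""
    if idle_days <= grace_days:
        return score                      # still within the inactivity period
    if score <= initial_default:
        return score                      # assumption: decay never raises a score
    overdue = idle_days - grace_days
    return max(initial_default, score - decay_rate * overdue)
```

Under these assumptions, a score of 0.82 that has been idle for 17 days decays to about 0.72 and never falls below the deployment default through decay alone.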
|
||||
|
||||
Agents MUST record all trust events in a local audit log. At L3,
|
||||
the trust events are derivable from the audit ledger, providing
|
||||
independent verifiability.
|
||||
|
||||
# Trust Assertions as ECT {#trust-assertions}
|
||||
|
||||
Agent A shares its trust assessment of Agent B with Agent C via a
|
||||
trust assertion ECT:
|
||||
|
||||
- `exec_act`: `"dats:assertion"`
|
||||
- `par`: empty (trust assertions are standalone) or referencing
|
||||
the most recent interaction ECT
|
||||
|
||||
~~~json
|
||||
{
|
||||
"iss": "spiffe://example.com/agent/a",
|
||||
"ext": {
|
||||
"dats.subject": "spiffe://example.com/agent/b",
|
||||
"dats.score": 0.82,
|
||||
"dats.interactions": 147,
|
||||
"dats.confidence": "high",
|
||||
"dats.hops": 0
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-assertion title="Trust Assertion ECT"}
|
||||
|
||||
`dats.confidence` is based on interaction count: `low` (<10),
|
||||
`medium` (10-99), `high` (100+).
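The confidence buckets and the `ext` members of the assertion can be produced with a few lines. This sketch covers only the `ext` object from the figure; signing and the remaining ECT claims are out of scope, and the function names are hypothetical.

```python
def confidence_label(interactions: int) -> str:
    """Map an interaction count to the confidence buckets defined above."""
    if interactions < 10:
        return "low"
    if interactions < 100:
        return "medium"
    return "high"


def assertion_ext(subject: str, score: float,
                  interactions: int, hops: int = 0) -> dict:
    """Build the `ext` members shown in the trust assertion figure."""
    return {
        "dats.subject": subject,
        "dats.score": score,
        "dats.interactions": interactions,
        "dats.confidence": confidence_label(interactions),
        "dats.hops": hops,
    }
```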
|
||||
|
||||
## Trust Propagation with Attenuation
|
||||
|
||||
When Agent C receives a trust assertion from Agent A about Agent B,
|
||||
it MAY incorporate it:
|
||||
|
||||
~~~
|
||||
c_score_for_b = max(c_score_for_b,
|
||||
a_score_for_b * trust_of_a * attenuation)
|
||||
~~~
|
||||
|
||||
Where:
|
||||
|
||||
- `a_score_for_b` = A's reported score for B (0.82)
|
||||
- `trust_of_a` = C's own trust score for A
|
||||
- `attenuation` = constant (default: 0.5)
|
||||
|
||||
Trust assertions are advisory. An agent's own direct observations
|
||||
always take precedence over propagated trust.
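A worked instance of the propagation formula, as a sketch: the function name is hypothetical, and the value 0.9 for C's trust in A is an assumed input chosen for the example.

```python
def incorporate(c_score_for_b: float, a_score_for_b: float,
                trust_of_a: float, attenuation: float = 0.5) -> float:
    """Apply the attenuated propagation formula from the text.

    Because of max(), a propagated assertion can raise C's score for B
    but never lower it.
    """
    return max(c_score_for_b, a_score_for_b * trust_of_a * attenuation)


# C trusts A at 0.9 and receives A's score of 0.82 for B.
print(incorporate(0.10, 0.82, 0.9))  # propagated value 0.82 * 0.9 * 0.5 wins
print(incorporate(0.50, 0.82, 0.9))  # C's existing score is already higher
```

Even a highly trusted asserter contributes at most half of its reported score with the default attenuation, so a propagated value alone cannot reach the higher action thresholds.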
|
||||
|
||||
## Anti-Gaming Measures
|
||||
|
||||
To prevent trust laundering (colluding agents inflating each
|
||||
other's scores):
|
||||
|
||||
- Agents SHOULD limit propagation depth to 1 hop by default
|
||||
- The `dats.hops` field tracks depth; agents MUST NOT propagate
|
||||
assertions where `dats.hops` exceeds their configured maximum
|
||||
- At L3, trust assertions are recorded in the audit ledger,
|
||||
making collusion patterns detectable through graph analysis
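The hop limit can be enforced at the point where an agent decides whether to pass an assertion on. The sketch assumes that `dats.hops` is incremented by one on each forward and that the limit is checked against the outgoing value; the text does not spell out either detail, and the function name is hypothetical.

```python
def forwardable_ext(received_ext: dict, max_hops: int = 1):
    """Return the `ext` for a forwarded assertion, or None if it must not
    be propagated under the configured hop limit."""
    next_hops = int(received_ext.get("dats.hops", 0)) + 1
    if next_hops > max_hops:
        return None
    forwarded = dict(received_ext)   # leave the received object untouched
    forwarded["dats.hops"] = next_hops
    return forwarded
```

With the default limit of 1, an assertion received directly from its issuer (`dats.hops` of 0) can be forwarded once, and the forwarded copy cannot travel further.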
|
||||
|
||||
# Trust Thresholds as Policy {#trust-policy}
|
||||
|
||||
## Threshold-Based Access
|
||||
|
||||
Agents SHOULD define trust thresholds per action type:
|
||||
|
||||
~~~json
|
||||
{
|
||||
"thresholds": {
|
||||
"read_data": 0.3,
|
||||
"execute_task": 0.5,
|
||||
"modify_config": 0.7,
|
||||
"delegate_auth": 0.9
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-thresholds title="Trust Thresholds"}
|
||||
|
||||
When a request arrives, the agent checks the requester's trust
|
||||
score against the threshold. If below threshold, the request is
|
||||
denied with HTTP 403 and error `trust_insufficient`.
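A minimal sketch of the threshold check follows. Two behaviors are assumptions: a score exactly equal to the threshold is accepted, and an action with no configured threshold is denied. The `authorize` function and its `(status, body)` return shape are illustrative only.

```python
THRESHOLDS = {
    "read_data": 0.3,
    "execute_task": 0.5,
    "modify_config": 0.7,
    "delegate_auth": 0.9,
}


def authorize(action: str, peer_score: float, thresholds: dict = THRESHOLDS):
    """Return (HTTP status, response body) for a request from a peer."""
    required = thresholds.get(action)
    # Assumption: unknown actions fail closed rather than default to allow.
    if required is None or peer_score < required:
        return 403, {"error": "trust_insufficient"}
    return 200, {}
```

The error body deliberately omits the peer's current score, so a requester cannot probe its own standing by issuing requests.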
|
||||
|
||||
## Integration with ACP-DAG-HITL
|
||||
|
||||
Trust thresholds can be expressed as DAG node constraints
|
||||
{{I-D.nennemann-agent-dag-hitl-safety}}:
|
||||
|
||||
~~~json
|
||||
{
|
||||
"dag": {
|
||||
"nodes": [{
|
||||
"id": "n-critical-action",
|
||||
"type": "modify_config",
|
||||
"agent": "spiffe://example.com/agent/b",
|
||||
"constraints": {
|
||||
"dats.min_trust": 0.7,
|
||||
"dats.min_confidence": "medium"
|
||||
}
|
||||
}]
|
||||
},
|
||||
"hitl": {
|
||||
"rules": [{
|
||||
"id": "r-low-trust",
|
||||
"trigger": {
|
||||
"kind": "confidence_below",
|
||||
"op": "lt",
|
||||
"value": 0.5,
|
||||
"input_ref": "dats.peer_trust_score"
|
||||
},
|
||||
"required_role": "operator:security",
|
||||
"action": "escalate",
|
||||
"allow_override": true,
|
||||
"override_action": "continue"
|
||||
}]
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-policy title="Trust Policy as DAG Constraints + HITL"}
|
||||
|
||||
This means: if the delegated agent's trust score drops below 0.5,
|
||||
escalate to a human security operator before proceeding.
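How an enforcement point might evaluate the HITL trigger is sketched below. Only the `lt` operator appears in the figure; the other operators, the `triggered_rules` name, and the choice to treat a missing input as triggering the rule are assumptions for the example.

```python
import operator

# Only "lt" is used in the figure; the remaining entries are assumed.
OPS = {"lt": operator.lt, "le": operator.le,
       "gt": operator.gt, "ge": operator.ge}


def triggered_rules(policy: dict, inputs: dict) -> list:
    """Return (rule id, action, required role) for every HITL rule that fires."""
    hits = []
    for rule in policy.get("hitl", {}).get("rules", []):
        trig = rule["trigger"]
        value = inputs.get(trig["input_ref"])
        # Assumption: a missing input escalates instead of being skipped.
        if value is None or OPS[trig["op"]](value, trig["value"]):
            hits.append((rule["id"], rule["action"], rule["required_role"]))
    return hits
```

Given the policy in the figure and an input of `{"dats.peer_trust_score": 0.42}`, the rule `r-low-trust` fires and requests escalation to `operator:security`.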
|
||||
|
||||
## Automatic Revocation
|
||||
|
||||
When an agent's trust score drops below a configured floor
|
||||
(default: 0.2), the trusting agent SHOULD:
|
||||
|
||||
1. Revoke all outstanding delegations to that agent
|
||||
2. Produce a revocation ECT (`exec_act`: `"dats:revoke"`)
|
||||
3. Emit an error ECT per AERR if the agent was part of an
|
||||
active workflow
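The revocation steps could be driven by a helper like the one below. It only builds the ECT payloads that would be emitted; signing and the actual withdrawal of delegations are left out. The `ext` member names `dats.revoked_delegations` and `dats.reason` are hypothetical and not defined by this document.

```python
def revocation_ects(issuer: str, peer: str, score: float,
                    delegation_ids: list, in_active_workflow: bool,
                    floor: float = 0.2) -> list:
    """Return the unsigned ECT payloads to emit when `score` is below `floor`."""
    if score >= floor:
        return []
    ects = [{
        "iss": issuer,
        "exec_act": "dats:revoke",
        "ext": {
            "dats.subject": peer,
            "dats.revoked_delegations": list(delegation_ids),  # hypothetical
        },
    }]
    if in_active_workflow:
        ects.append({
            "iss": issuer,
            "exec_act": "aerr:error",
            "ext": {"dats.subject": peer,
                    "dats.reason": "trust_below_floor"},       # hypothetical
        })
    return ects
```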
|
||||
|
||||
# Security Considerations
|
||||
|
||||
Trust scores are sensitive metadata. Agents MUST NOT expose
|
||||
their full trust tables to peers. Only pairwise trust assertions
|
||||
should be shared, and only intentionally.
|
||||
|
||||
Trust assertion ECTs MUST be signed at L2 or L3. Agents MUST
|
||||
verify signatures before processing.
|
||||
|
||||
Score manipulation: a malicious agent could behave well to build
|
||||
trust, then exploit it. Mitigation: `policy_violation` events
|
||||
apply double penalties, and deployments SHOULD set high thresholds
|
||||
for critical actions.
|
||||
|
||||
Sybil attacks: an attacker creates many agents for fake positive
|
||||
assertions. Mitigation: attenuation ({{trust-assertions}}),
|
||||
hop limits, and requiring agents to be registered in a trusted
|
||||
directory before accepting assertions.
|
||||
|
||||
All trust-related communications MUST use TLS 1.3.
|
||||
|
||||
# IANA Considerations
|
||||
|
||||
This document requests the following IANA registrations:
|
||||
|
||||
1. Registration of `exec_act` values `dats:assertion` and
|
||||
`dats:revoke` in a future ECT action type registry.
|
||||
|
||||
2. A "DATS Trust Event Type" registry under Specification Required
|
||||
policy. Initial entries: `task_success`, `task_partial`,
|
||||
`task_failure`, `task_timeout`, `policy_violation`,
|
||||
`attestation_invalid`, `rollback_triggered`.
|
||||
|
||||
--- back
|
||||
|
||||
# Acknowledgments
|
||||
{:numbered="false"}
|
||||
|
||||
This document builds on the Execution Context Token specification
|
||||
{{I-D.nennemann-wimse-ect}} for interaction evidence and the
|
||||
Agent Context Policy Token {{I-D.nennemann-agent-dag-hitl-safety}}
|
||||
for trust threshold policy enforcement.
|
||||
@@ -0,0 +1,298 @@
|
||||
Internet-Draft AI/Agent WG
|
||||
Intended status: Standards Track March 2026
|
||||
Expires: September 15, 2026
|
||||
|
||||
|
||||
Dynamic Agent Trust Scoring (DATS)
|
||||
draft-dats-dynamic-agent-trust-scoring-00
|
||||
|
||||
Abstract
|
||||
|
||||
This document defines the Dynamic Agent Trust Scoring (DATS)
|
||||
protocol, a mechanism for AI agents to build, assess, and
|
||||
revoke trust relationships based on observed behavior over
|
||||
time. Static authentication (certificates, API keys) verifies
|
||||
identity but says nothing about whether an agent is reliable,
|
||||
accurate, or well-behaved. DATS augments identity-based auth
|
||||
with a numeric trust score that adjusts dynamically based on
|
||||
interaction outcomes. The protocol defines trust score
|
||||
computation, propagation between agents, decay over inactivity,
|
||||
and threshold-based access policies. DATS is intentionally
|
||||
simple: a single score per agent-pair, standard adjustment
|
||||
events, and a JWT-based transport for trust assertions.
|
||||
|
||||
Status of This Memo
|
||||
|
||||
This Internet-Draft is submitted in full conformance with the
|
||||
provisions of BCP 78 and BCP 79.
|
||||
|
||||
This document is intended to have Standards Track status.
|
||||
Distribution of this memo is unlimited.
|
||||
|
||||
Table of Contents
|
||||
|
||||
1. Introduction
|
||||
2. Terminology
|
||||
3. Problem Statement
|
||||
4. Trust Score Model
|
||||
5. Trust Events and Adjustments
|
||||
6. Trust Propagation
|
||||
7. Threshold-Based Access Policies
|
||||
8. Security Considerations
|
||||
9. IANA Considerations
|
||||
|
||||
1. Introduction
|
||||
|
||||
The IETF has 98 drafts addressing agent identity and
|
||||
authentication, providing strong mechanisms for verifying who
|
||||
an agent is. But identity alone is insufficient for long-
|
||||
running autonomous systems. A properly authenticated agent
|
||||
may still produce bad results, violate expectations, or
|
||||
degrade over time. Static certificates cannot capture this.
|
||||
|
||||
DATS adds a behavioral dimension to agent trust. It answers
|
||||
the question: "I know who you are, but should I rely on you?"
|
||||
The model is deliberately simple — a single floating-point
|
||||
score between 0.0 and 1.0 per agent relationship — because
|
||||
complex reputation systems tend to be gamed or ignored.
|
||||
|
||||
The protocol is inspired by:
|
||||
- TCP congestion control: trust increases slowly (additive)
|
||||
and decreases quickly (multiplicative) on failure.
|
||||
- TLS certificate transparency: trust assertions are logged
|
||||
for auditability.
|
||||
- Web of trust (PGP): trust can propagate through
|
||||
intermediaries, with attenuation.
|
||||
|
||||
2. Terminology
|
||||
|
||||
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
|
||||
NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
|
||||
"OPTIONAL" in this document are to be interpreted as described
|
||||
in RFC 2119 [RFC2119].
|
||||
|
||||
Trust Score: A floating-point value in [0.0, 1.0] representing
|
||||
one agent's assessment of another's reliability, based on observed
|
||||
interaction outcomes.
|
||||
|
||||
Trust Event: An observable interaction outcome that causes a
|
||||
trust score adjustment. Events are either positive (task
|
||||
completed successfully) or negative (task failed, timeout,
|
||||
policy violation).
|
||||
|
||||
Trust Decay: The automatic reduction of trust scores over
|
||||
periods of inactivity, reflecting the principle that trust
|
||||
requires ongoing evidence.
|
||||
|
||||
Trust Assertion: A signed statement by one agent about another
|
||||
agent's trust score, transportable as a JWT claim.
|
||||
|
||||
3. Problem Statement
|
||||
|
||||
Agent A delegates a task to Agent B. Agent B completes it
|
||||
correctly. Agent A delegates again. After 100 successful
|
||||
interactions, Agent B starts returning subtly incorrect results
|
||||
(model drift, adversarial manipulation, or simple degradation).
|
||||
Agent A has no standard way to:
|
||||
|
||||
1. Track B's reliability over time.
|
||||
2. Reduce B's privileges based on degraded performance.
|
||||
3. Share its experience with Agent C, who is considering
|
||||
delegating to Agent B.
|
||||
4. Automatically revoke B's access when trust drops below
|
||||
acceptable levels.
|
||||
|
||||
Existing attestation drafts (STAMP, DAAP) provide
|
||||
cryptographic proof of specific actions but not ongoing
|
||||
behavioral assessment. DATS fills this gap.
|
||||
|
||||
4. Trust Score Model
|
||||
|
||||
Each agent maintains a trust table: a mapping from peer agent
|
||||
IDs to trust scores.
|
||||
|
||||
{
|
||||
"urn:uuid:agent-b": {
|
||||
"score": 0.82,
|
||||
"interactions": 147,
|
||||
"last_updated": "2026-03-01T11:30:00Z",
|
||||
"last_event": "task_success"
|
||||
}
|
||||
}
|
||||
|
||||
Initial trust for an unknown agent is a deployment-configured
|
||||
default. A value of 0.5 is RECOMMENDED as a neutral starting
|
||||
point, but deployments MAY use lower values (e.g., 0.1) for
|
||||
zero-trust environments.
|
||||
|
||||
Trust scores are updated using an additive-increase,
|
||||
multiplicative-decrease (AIMD) algorithm:
|
||||
|
||||
On positive event:
|
||||
score = min(1.0, score + alpha)
|
||||
|
||||
On negative event:
|
||||
score = max(0.0, score * beta)
|
||||
|
||||
Default parameters: alpha = 0.01, beta = 0.8.
|
||||
|
||||
This means trust builds slowly (50 successes to go from 0.5
|
||||
to 1.0) but drops quickly (a single failure reduces a 0.82
|
||||
score to 0.66). This asymmetry is intentional: in autonomous
|
||||
systems, the cost of trusting a bad agent exceeds the cost of
|
||||
being slow to trust a good one.
|
||||
|
||||
Agents MAY tune alpha and beta per relationship or per action
|
||||
type, but MUST use the AIMD structure.
|
||||
|
||||
5. Trust Events and Adjustments
|
||||
|
||||
The following standard trust events are defined:
|
||||
|
||||
| Event | Direction | Default Weight |
|
||||
|----------------------|-----------|----------------|
|
||||
| task_success | positive | 1x alpha |
|
||||
| task_partial_success | positive | 0.5x alpha |
|
||||
| task_failure | negative | 1x beta |
|
||||
| task_timeout | negative | 1x beta |
|
||||
| policy_violation | negative | applied twice |
|
||||
| attestation_invalid | negative | applied twice |
|
||||
| rollback_triggered | negative | 1x beta |
|
||||
|
||||
"applied twice" means the multiplicative decrease is applied
|
||||
two times in succession (score * beta * beta), reflecting the
|
||||
severity of policy violations versus simple failures.
|
||||
|
||||
Trust decay: if no interaction occurs for a configurable
|
||||
period (default: 7 days), the trust score decays:
|
||||
|
||||
score = max(initial_default, score - decay_rate)
|
||||
|
||||
Default decay_rate: 0.01 per day. This ensures that stale
|
||||
trust relationships gradually return to the default level
|
||||
rather than persisting indefinitely.
|
||||
|
||||
Agents MUST record all trust events in a local audit log.
|
||||
|
||||
6. Trust Propagation
|
||||
|
||||
Agent A may share its trust assessment of Agent B with Agent C
|
||||
through a signed trust assertion. The assertion is a JWT
|
||||
(RFC 7519) with the following claims:
|
||||
|
||||
{
|
||||
"iss": "urn:uuid:agent-a",
|
||||
"sub": "urn:uuid:agent-b",
|
||||
"iat": 1709294400,
|
||||
"exp": 1709380800,
|
||||
"dats_score": 0.82,
|
||||
"dats_interactions": 147,
|
||||
"dats_confidence": "high"
|
||||
}
|
||||
|
||||
"dats_confidence" is based on interaction count: "low" (<10),
|
||||
"medium" (10-99), "high" (100+).
|
||||
|
||||
When Agent C receives this assertion, it MAY incorporate it
|
||||
into its own trust score for Agent B using attenuation:
|
||||
|
||||
c_score_for_b = max(c_score_for_b,
|
||||
a_score_for_b * trust_of_a * attenuation)
|
||||
|
||||
Where:
|
||||
- a_score_for_b is Agent A's reported score for B (0.82)
|
||||
- trust_of_a is Agent C's trust score for Agent A
|
||||
- attenuation is a constant (default: 0.5) preventing
|
||||
unbounded trust propagation
|
||||
|
||||
Trust assertions are advisory. Agents MUST NOT blindly adopt
|
||||
propagated scores. An agent's own direct observations always
|
||||
take precedence over propagated trust.
|
||||
|
||||
To prevent trust laundering (colluding agents inflating each
|
||||
other's scores), agents SHOULD limit propagation depth to 1
|
||||
hop by default. The "dats_hops" claim tracks propagation
|
||||
depth; agents MUST NOT propagate assertions where dats_hops
|
||||
exceeds their configured maximum.
|
||||
|
||||
7. Threshold-Based Access Policies
|
||||
|
||||
Agents SHOULD define trust thresholds for different action
|
||||
categories:
|
||||
|
||||
{
|
||||
"thresholds": {
|
||||
"read_data": 0.3,
|
||||
"execute_task": 0.5,
|
||||
"modify_config": 0.7,
|
||||
"delegate_auth": 0.9
|
||||
}
|
||||
}
|
||||
|
||||
When an agent requests an action, the serving agent checks the
|
||||
requester's trust score against the threshold for that action
|
||||
type. If the score is below the threshold, the request is
|
||||
denied with a 403 response including a DATS-specific error:
|
||||
|
||||
{
|
||||
"error": "trust_insufficient",
|
||||
"required_score": 0.7,
|
||||
"current_score": 0.54,
|
||||
"action": "modify_config"
|
||||
}
|
||||
|
||||
The response SHOULD NOT reveal the exact current score in
|
||||
production deployments to prevent score probing. Instead, it
|
||||
MAY return only the "trust_insufficient" error.
|
||||
|
||||
Automatic revocation: when an agent's trust score drops below
|
||||
a configured floor (default: 0.2), the trusting agent SHOULD
|
||||
revoke all outstanding delegations and emit a trust revocation
|
||||
event. This provides automatic containment of agents that
|
||||
have become unreliable.
|
||||
|
||||
8. Security Considerations
|
||||
|
||||
Trust scores are sensitive metadata. Agents MUST NOT expose
|
||||
their full trust tables to peers. Only pairwise trust
|
||||
assertions (Section 6) should be shared, and only
|
||||
intentionally.
|
||||
|
||||
Trust assertion JWTs MUST be signed using algorithms from
|
||||
RFC 7518 (e.g., ES256, EdDSA). Agents MUST verify signatures
|
||||
before processing trust assertions.
|
||||
|
||||
Score manipulation attacks: a malicious agent could
|
||||
intentionally behave well for many interactions to build trust,
|
||||
then exploit high trust for a damaging action. Mitigation:
|
||||
policy_violation events apply double penalties, and
|
||||
deployments SHOULD set trust thresholds high for critical
|
||||
actions regardless of accumulated trust.
|
||||
|
||||
Sybil attacks: an attacker could create many agents to
|
||||
generate fake positive trust assertions. Mitigation: agents
|
||||
SHOULD weight propagated trust by their own direct trust in
|
||||
the asserting agent (Section 6 attenuation) and SHOULD
|
||||
require agents to be registered in a trusted directory (e.g.,
|
||||
ANS) before accepting trust assertions.
|
||||
|
||||
All trust-related communications MUST use TLS 1.3 [RFC8446].
|
||||
|
||||
9. IANA Considerations
|
||||
|
||||
This document requests IANA establish the following:
|
||||
|
||||
1. Registration of JWT claims "dats_score",
|
||||
"dats_interactions", "dats_confidence", and "dats_hops"
|
||||
in the JSON Web Token Claims registry per RFC 7519.
|
||||
|
||||
2. A "DATS Trust Event Type" registry under Specification
|
||||
Required policy. Initial entries: "task_success",
|
||||
"task_partial_success", "task_failure", "task_timeout",
|
||||
"policy_violation", "attestation_invalid",
|
||||
"rollback_triggered".
|
||||
|
||||
Author's Address
|
||||
|
||||
Generated by IETF Draft Analyzer
|
||||
2026-03-01
|
||||
@@ -0,0 +1,384 @@
|
||||
---
|
||||
title: "Assurance Profiles for Agent Ecosystems (APAE)"
|
||||
abbrev: "APAE"
|
||||
category: info
|
||||
docname: draft-apae-assurance-profiles-00
|
||||
submissiontype: IETF
|
||||
number:
|
||||
date:
|
||||
v: 3
|
||||
area: "SEC"
|
||||
workgroup: "Security Dispatch"
|
||||
keyword:
|
||||
- dynamic trust
|
||||
- assurance
|
||||
- behavior verification
|
||||
- data provenance
|
||||
|
||||
author:
|
||||
-
|
||||
fullname: TBD
|
||||
organization: Independent
|
||||
email: placeholder@example.com
|
||||
|
||||
normative:
|
||||
RFC2119:
|
||||
RFC8174:
|
||||
RFC7519:
|
||||
RFC7518:
|
||||
I-D.nennemann-wimse-ect:
|
||||
title: "Execution Context Tokens for Distributed Agentic Workflows"
|
||||
target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/
|
||||
I-D.nennemann-agent-dag-hitl-safety:
|
||||
title: "Agent Context Policy Token: DAG Delegation with Human Override"
|
||||
target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/
|
||||
|
||||
informative:
|
||||
|
||||
--- abstract
|
||||
|
||||
This document defines Assurance Profiles for Agent Ecosystems
|
||||
(APAE): dynamic trust scoring, behavior verification, data
|
||||
provenance, and graduated assurance profiles that allow the same
|
||||
agent ecosystem to operate in relaxed (dev/K8s) and regulated
|
||||
(healthcare, finance) environments. Trust events are derived from
|
||||
ECT outcomes. Trust assertions are ECTs. Behavior verification
|
||||
references ECT claims. Provenance chains are implicit in the ECT
|
||||
DAG. Assurance profiles select which combination of these
|
||||
mechanisms is required for a given deployment, mapping to ECT
|
||||
assurance levels L1/L2/L3.
|
||||
|
||||
--- middle
|
||||
|
||||
# Introduction
|
||||
|
||||
Identity verifies who an agent is. ECT records what an agent did.
|
||||
But neither answers: should I rely on this agent? Is it doing what
|
||||
it promised? Can I trace where this data came from?
|
||||
|
||||
APAE adds three capabilities to the ecosystem:
|
||||
|
||||
1. **Dynamic trust scoring** — behavioral reputation that adjusts
|
||||
based on interaction outcomes (AIMD model).
|
||||
2. **Behavior verification** — checking agent actions against
|
||||
declared specifications.
|
||||
3. **Data provenance** — tracing data lineage through the DAG.
|
||||
|
||||
These three capabilities are bundled into assurance profiles
|
||||
(relaxed, standard, regulated) that map to ECT assurance levels,
|
||||
so the same ecosystem works from a dev cluster to a hospital.
|
||||
|
||||
# Conventions and Definitions
|
||||
|
||||
{::boilerplate bcp14-tagged}
|
||||
|
||||
Trust Score:
|
||||
: A floating-point value in \[0.0, 1.0\] representing one agent's
|
||||
assessment of another's reliability.
|
||||
|
||||
Trust Event:
|
||||
: An interaction outcome that causes a trust score adjustment.
|
||||
Derived from ECTs.
|
||||
|
||||
Behavior Specification:
|
||||
: A machine-readable declaration of permitted agent actions and
|
||||
constraints.
|
||||
|
||||
Provenance Chain:
|
||||
: The sequence of ECT nodes recording how a piece of data was
|
||||
produced, transformed, and consumed.
|
||||
|
||||
Assurance Profile:
|
||||
: A named configuration selecting which trust, verification, and
|
||||
provenance mechanisms are required.
|
||||
|
||||
# Dynamic Trust Scoring {#trust}
|
||||
|
||||
## Trust Score Model
|
||||
|
||||
Each agent maintains a trust table: peer agent IDs mapped to
|
||||
trust scores. Initial trust for unknown agents is deployment-
|
||||
configured (RECOMMENDED: 0.5; zero-trust: 0.1).
|
||||
|
||||
Scores update using additive-increase, multiplicative-decrease
|
||||
(AIMD):
|
||||
|
||||
- Positive event: `score = min(1.0, score + alpha)`
|
||||
- Negative event: `score = max(0.0, score * beta)`
|
||||
|
||||
Defaults: `alpha = 0.01`, `beta = 0.8`.
|
||||
|
||||
Trust builds slowly (50 successes: 0.5 → 1.0) and drops fast
|
||||
(one failure: 0.82 → 0.66).
|
||||
|
||||
## Trust Events from ECT {#trust-events}
|
||||
|
||||
Trust events are derived from ECTs rather than agent-local
|
||||
counters, making trust computation auditable:
|
||||
|
||||
| ECT condition | Event | Adjustment |
|
||||
|--------------|-------|------------|
|
||||
| Completed, no error follows | `task_success` | +1x alpha |
|
||||
| Completed, partial result | `task_partial` | +0.5x alpha |
|
||||
| `atd:error` referencing agent | `task_failure` | 1x beta |
|
||||
| No response within threshold | `task_timeout` | 1x beta |
|
||||
| `atd:error` with `constraint_violation` | `policy_violation` | beta^2 |
|
||||
| ECT signature verification fails | `attestation_invalid` | beta^2 |
|
||||
| `atd:rollback_request` targeting agent | `rollback_triggered` | 1x beta |
|
||||
{: #fig-events title="Trust Events from ECT"}
|
||||
|
||||
## Trust Decay
|
||||
|
||||
If no interaction occurs for a configurable period (default:
|
||||
7 days): `score = max(initial_default, score - 0.01/day)`.
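One possible reading of the decay rule is sketched below. It assumes that decay starts only after the idle period has elapsed, that it accrues per idle day beyond that period, and that scores at or below `initial_default` are left unchanged rather than raised to it. All three points are interpretations, and the function and parameter names are hypothetical.

```python
def decayed_score(score, idle_days, initial_default=0.5,
                  idle_period_days=7.0, rate_per_day=0.01):
    """Return the score after `idle_days` without any interaction."""
    if idle_days <= idle_period_days or score <= initial_default:
        # No decay yet, or nothing to decay toward (assumption: low scores
        # are not lifted back up to the initial default by inactivity).
        return score
    decay_days = idle_days - idle_period_days
    return max(initial_default, score - rate_per_day * decay_days)
```

Under these assumptions a peer at 0.9 that has been idle for 17 days sits at about 0.8, and no amount of inactivity takes it below the initial default.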
|
||||
|
||||
## Trust Assertions as ECT {#trust-assertions}
|
||||
|
||||
Agent A shares its trust assessment via a trust assertion ECT:
|
||||
|
||||
- `exec_act`: `"apae:trust_assertion"`
|
||||
|
||||
~~~json
|
||||
{
|
||||
"exec_act": "apae:trust_assertion",
|
||||
"ext": {
|
||||
"apae.subject": "spiffe://example.com/agent/b",
|
||||
"apae.trust_score": 0.82,
|
||||
"apae.interactions": 147,
|
||||
"apae.confidence": "high",
|
||||
"apae.hops": 0
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-assertion title="Trust Assertion ECT"}
|
||||
|
||||
Confidence: `low` (<10 interactions), `medium` (10-99),
|
||||
`high` (100+).
|
||||
|
||||
## Trust Propagation
|
||||
|
||||
When Agent C receives A's assertion about B:
|
||||
|
||||
~~~
|
||||
c_score_for_b = max(c_score_for_b,
|
||||
a_score * trust_of_a * attenuation)
|
||||
~~~
|
||||
|
||||
Default `attenuation`: 0.5. Direct observations always take
|
||||
precedence. `apae.hops` tracks propagation depth; agents MUST NOT
|
||||
propagate beyond their configured maximum (default: 1).
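The propagation rule can be sketched as follows. The sketch assumes that "direct observations take precedence" means a received assertion is ignored once C has interacted with B itself, and that an assertion is ignored when its `apae.hops` value has already reached the configured maximum. Both are interpretations, and the function name is hypothetical.

```python
def score_after_assertion(c_score_for_b, c_has_direct_observations,
                          a_score, trust_of_a, hops,
                          attenuation=0.5, max_hops=1):
    """Return C's score for B after receiving A's assertion about B."""
    if c_has_direct_observations or hops >= max_hops:
        # C's own evidence wins, and over-propagated assertions are dropped.
        return c_score_for_b
    return max(c_score_for_b, a_score * trust_of_a * attenuation)


# C starts B at the zero-trust default (0.1) and trusts A at 0.9.
# A asserts 0.82 for B from direct observation (hops = 0).
derived = round(score_after_assertion(0.1, False, 0.82, 0.9, hops=0), 3)
```

With the default attenuation of 0.5, a second-hand assertion can never lift a score above 0.5, so it cannot by itself satisfy a threshold such as 0.7.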
|
||||
|
||||
## Trust Thresholds as Policy
|
||||
|
||||
Trust thresholds are ACP-DAG-HITL node constraints:
|
||||
|
||||
~~~json
|
||||
{
|
||||
"constraints": {
|
||||
"apae.min_trust": 0.7,
|
||||
"apae.min_confidence": "medium"
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-threshold title="Trust Threshold as Node Constraint"}
|
||||
|
||||
Requests from agents below threshold are denied with HTTP 403.
|
||||
|
||||
Low trust can trigger HITL escalation:
|
||||
|
||||
~~~json
|
||||
{
|
||||
"id": "r-low-trust",
|
||||
"trigger": {
|
||||
"kind": "confidence_below",
|
||||
"op": "lt",
|
||||
"value": 0.5,
|
||||
"input_ref": "apae.peer_trust_score"
|
||||
},
|
||||
"required_role": "operator:security",
|
||||
"action": "escalate",
|
||||
"allow_override": true,
|
||||
"override_action": "continue"
|
||||
}
|
||||
~~~
|
||||
{: #fig-trust-hitl title="HITL Rule for Low Trust"}
|
||||
|
||||
## Automatic Revocation
|
||||
|
||||
When trust drops below a floor (default: 0.2), the trusting agent
|
||||
SHOULD revoke delegations and emit:
|
||||
`exec_act: "apae:trust_revoke"`.
|
||||
|
||||
# Behavior Verification {#behavior}
|
||||
|
||||
## Behavior Specifications
|
||||
|
||||
A behavior specification declares what an agent is permitted to do.
|
||||
Specifications are JSON documents referencing ECT claims:
|
||||
|
||||
~~~json
|
||||
{
|
||||
"spec_version": "1.0",
|
||||
"agent_id": "spiffe://example.com/agent/firewall",
|
||||
"allowed_actions": ["update_rules", "read_config", "report"],
|
||||
"constraints": {
|
||||
"max_actions_per_minute": 60,
|
||||
"forbidden_targets": ["core-router-*"],
|
||||
"require_checkpoint_before": ["update_rules"]
|
||||
},
|
||||
"verification_frequency": "continuous"
|
||||
}
|
||||
~~~
|
||||
{: #fig-spec title="Behavior Specification"}
|
||||
|
||||
## Verification Against ECT Stream
|
||||
|
||||
A verifier monitors the agent's ECT stream and checks:
|
||||
|
||||
1. `exec_act` values are in `allowed_actions`.
|
||||
2. Action rate does not exceed `max_actions_per_minute` (computed
|
||||
from `iat` timestamps).
|
||||
3. `atd:checkpoint` ECTs precede `update_rules` ECTs (from
|
||||
`require_checkpoint_before`).
|
||||
4. Targets in `ext` claims do not match `forbidden_targets`.
|
||||
|
||||
Verification results are ECTs:
|
||||
|
||||
- `exec_act`: `"apae:compliance_check"`
|
||||
|
||||
~~~json
|
||||
{
|
||||
"exec_act": "apae:compliance_check",
|
||||
"par": ["latest-agent-ect-uuid"],
|
||||
"ext": {
|
||||
"apae.compliance_status": "passing",
|
||||
"apae.violations": [],
|
||||
"apae.spec_version": "1.0",
|
||||
"apae.window": "2026-03-01T12:00:00Z/PT1H"
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-compliance title="Compliance Check ECT"}
|
||||
|
||||
Violations trigger trust score decreases (`policy_violation` event)
|
||||
and MAY trigger HITL escalation.
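The four checks can be sketched as a single pass over an agent's ECTs. Several details are assumptions made for illustration: ECTs are plain dictionaries identified by `jti`, the target is carried in a hypothetical `ext["target"]` claim, `atd:checkpoint` ECTs are exempt from the checks, and a checkpoint counts as "preceding" when it is at most `checkpoint_window_s` seconds older than the guarded action.

```python
from collections import deque
from fnmatch import fnmatchcase


def check_stream(spec, ects, checkpoint_window_s=10):
    """Return one violation record per broken rule, in `iat` order."""
    limits = spec["constraints"]
    violations = []
    recent = deque()        # iat values of actions in the last 60 seconds
    last_checkpoint = None  # iat of the most recent atd:checkpoint

    for ect in sorted(ects, key=lambda e: e["iat"]):
        act, iat = ect["exec_act"], ect["iat"]
        if act == "atd:checkpoint":
            last_checkpoint = iat
            continue
        broken = []
        # Check 1: the action must be declared in the specification.
        if act not in spec["allowed_actions"]:
            broken.append("allowed_actions")
        # Check 2: sliding 60-second window over iat timestamps.
        recent.append(iat)
        while recent[0] <= iat - 60:
            recent.popleft()
        if len(recent) > limits["max_actions_per_minute"]:
            broken.append("max_actions_per_minute")
        # Check 3: guarded actions need a recent checkpoint.
        if act in limits["require_checkpoint_before"] and (
                last_checkpoint is None
                or iat - last_checkpoint > checkpoint_window_s):
            broken.append("require_checkpoint_before")
        # Check 4: glob match against forbidden targets.
        target = ect.get("ext", {}).get("target", "")
        if any(fnmatchcase(target, pat) for pat in limits["forbidden_targets"]):
            broken.append("forbidden_targets")
        violations += [{"rule": rule, "action": act, "ect": ect["jti"]}
                       for rule in broken]
    return violations
```

An empty result would map to `apae.compliance_status: "passing"`; a non-empty list would populate `apae.violations`.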
|
||||
|
||||
# Data Provenance {#provenance}
|
||||
|
||||
## DAG as Provenance Chain
|
||||
|
||||
The ECT DAG already encodes data provenance: each ECT's `par`
|
||||
references show which prior tasks produced its inputs. The
|
||||
`inp_hash` and `out_hash` claims prove what was processed without
|
||||
revealing the data.
|
||||
|
||||
For deployments requiring explicit provenance metadata, agents
|
||||
MAY include:
|
||||
|
||||
~~~json
|
||||
{
|
||||
"ext": {
|
||||
"apae.data_source": "database:patients",
|
||||
"apae.data_classification": "pii",
|
||||
"apae.retention_days": 365,
|
||||
"apae.transformations": ["anonymize", "aggregate"]
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-provenance title="Provenance Extension Claims"}
|
||||
|
||||
## Provenance Queries
|
||||
|
||||
At L3, the audit ledger enables provenance queries:
|
||||
|
||||
- "Which agents touched this data?" → walk `par` chain from
|
||||
final ECT to roots.
|
||||
- "Was this data transformed?" → check `apae.transformations`
|
||||
along the chain.
|
||||
- "Is provenance complete?" → verify all `par` references
|
||||
resolve to ledger entries.
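The three queries reduce to a traversal of `par` references. The sketch below assumes the ledger can be read as a dictionary from ECT identifier to decoded claims and that the `iss` claim identifies the emitting agent. The ledger interface and all function names are hypothetical.

```python
def walk(ledger, final_id):
    """Yield every ECT reachable from `final_id` via `par` references."""
    seen, stack = set(), [final_id]
    while stack:
        ect_id = stack.pop()
        if ect_id in seen or ect_id not in ledger:
            continue  # already visited, or reference not in the ledger
        seen.add(ect_id)
        ect = ledger[ect_id]
        yield ect
        stack.extend(ect.get("par", []))


def agents_touching(ledger, final_id):
    """Which agents touched this data?"""
    return {ect["iss"] for ect in walk(ledger, final_id)}


def transformations(ledger, final_id):
    """Was this data transformed, and how?"""
    steps = set()
    for ect in walk(ledger, final_id):
        steps.update(ect.get("ext", {}).get("apae.transformations", []))
    return steps


def is_complete(ledger, final_id):
    """Is provenance complete, i.e. does every `par` reference resolve?"""
    if final_id not in ledger:
        return False
    return all(parent in ledger
               for ect in walk(ledger, final_id)
               for parent in ect.get("par", []))
```

Because the ECT graph is a DAG, the `seen` set is what keeps shared ancestors from being visited more than once.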
|
||||
|
||||
# Assurance Profiles {#profiles}
|
||||
|
||||
An assurance profile is a named configuration that selects which
|
||||
mechanisms are required:
|
||||
|
||||
| | Relaxed | Standard | Regulated |
|
||||
|---|---------|----------|-----------|
|
||||
| **ECT level** | L1 | L2 | L3 |
|
||||
| **Trust scoring** | Optional | RECOMMENDED | REQUIRED |
|
||||
| **Trust threshold enforcement** | Optional | RECOMMENDED | REQUIRED |
|
||||
| **Behavior verification** | Off | Periodic | Continuous |
|
||||
| **HITL approval gates** | Optional | Critical paths | Mandatory |
|
||||
| **Data provenance** | Off | Optional | REQUIRED |
|
||||
| **Checkpoint before consequential** | RECOMMENDED | REQUIRED | REQUIRED |
|
||||
| **Audit ledger** | Optional | Optional | REQUIRED |
|
||||
{: #fig-profiles title="Assurance Profiles"}
|
||||
|
||||
Relaxed:
|
||||
: Internal dev/staging. L1 ECTs. Trust and verification
|
||||
optional. Useful for debugging and observability without
|
||||
cryptographic overhead.
|
||||
|
||||
Standard:
|
||||
: Production cross-org. L2 ECTs. Trust scoring and thresholds
|
||||
recommended. Periodic behavior verification. HITL on critical
|
||||
paths.
|
||||
|
||||
Regulated:
|
||||
: Healthcare, finance, EU AI Act. L3 ECTs with audit ledger.
|
||||
Continuous behavior verification. All trust mechanisms
|
||||
required. Full provenance chain. Mandatory HITL gates.
|
||||
|
||||
Profiles are declared in ACP-DAG-HITL node constraints:
|
||||
|
||||
~~~json
|
||||
{
|
||||
"constraints": {
|
||||
"apae.assurance_profile": "regulated"
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-profile-policy title="Profile as Node Constraint"}
|
||||
|
||||
A single deployment MAY use different profiles for different
|
||||
workflows.
|
||||
|
||||
# Security Considerations
|
||||
|
||||
Trust scores are sensitive metadata. Agents MUST NOT expose full
|
||||
trust tables. Only pairwise assertions SHOULD be shared.
|
||||
|
||||
Trust assertion ECTs MUST be signed at L2/L3.
|
||||
|
||||
Score manipulation (building trust then exploiting it): mitigated
|
||||
by double penalties for `policy_violation` and high thresholds for
|
||||
critical actions.
|
||||
|
||||
Sybil attacks (fake agents inflating trust): mitigated by
|
||||
attenuation ({{trust-assertions}}), hop limits, and requiring
|
||||
agents to be registered in a trusted directory.
|
||||
|
||||
Behavior specifications could be tampered with. Specifications
|
||||
SHOULD be signed and versioned. Changes MUST be recorded as ECTs.
|
||||
|
||||
All trust and verification communications MUST use TLS 1.3.
|
||||
|
||||
# IANA Considerations
|
||||
|
||||
This document requests registration of `exec_act` values:
|
||||
|
||||
- `apae:trust_assertion`
|
||||
- `apae:trust_revoke`
|
||||
- `apae:compliance_check`
|
||||
|
||||
--- back
|
||||
|
||||
# Acknowledgments
|
||||
{:numbered="false"}
|
||||
|
||||
APAE builds on ECT {{I-D.nennemann-wimse-ect}} for interaction
|
||||
evidence and audit, and ACP-DAG-HITL
|
||||
{{I-D.nennemann-agent-dag-hitl-safety}} for trust threshold and
|
||||
assurance profile policy enforcement. The AIMD trust model is
|
||||
adapted from TCP congestion control.
|
||||
@@ -0,0 +1,695 @@
|
||||
---
|
||||
title: "Assurance Profiles for Agent Ecosystems (APAE)"
|
||||
abbrev: "APAE"
|
||||
category: info
|
||||
docname: draft-apae-assurance-profiles-01
|
||||
submissiontype: IETF
|
||||
number:
|
||||
date:
|
||||
v: 3
|
||||
area: "SEC"
|
||||
workgroup: "Security Dispatch"
|
||||
keyword:
|
||||
- dynamic trust
|
||||
- assurance
|
||||
- behavior verification
|
||||
- data provenance
|
||||
- quarantine
|
||||
|
||||
author:
|
||||
-
|
||||
fullname: TBD
|
||||
organization: Independent
|
||||
email: placeholder@example.com
|
||||
|
||||
normative:
|
||||
RFC2119:
|
||||
RFC8174:
|
||||
RFC7519:
|
||||
RFC7518:
|
||||
RFC9110:
|
||||
I-D.nennemann-wimse-ect:
|
||||
title: "Execution Context Tokens for Distributed Agentic Workflows"
|
||||
target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/
|
||||
I-D.nennemann-agent-dag-hitl-safety:
|
||||
title: "Agent Context Policy Token: DAG Delegation with Human Override"
|
||||
target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/
|
||||
|
||||
informative:
|
||||
RFC9334:
|
||||
|
||||
--- abstract
|
||||
|
||||
This document defines Assurance Profiles for Agent Ecosystems
|
||||
(APAE): dynamic trust scoring, behavior verification, data
|
||||
provenance, cross-domain trust, and graduated assurance profiles
|
||||
that allow the same agent ecosystem to operate in relaxed
|
||||
(dev/K8s) and regulated (healthcare, finance) environments.
|
||||
Trust events are derived from ECT outcomes. Trust assertions are
|
||||
ECTs. Behavior verification references ECT claims. Provenance
|
||||
chains are implicit in the ECT DAG. Assurance profiles select
|
||||
which combination of these mechanisms is required for a given
|
||||
deployment, mapping to ECT assurance levels L1/L2/L3. Agents
|
||||
whose trust falls below a floor are quarantined via a protocol
|
||||
defined here.
|
||||
|
||||
--- middle
|
||||
|
||||
# Introduction
|
||||
|
||||
Identity verifies who an agent is. ECT records what an agent did.
|
||||
But neither answers: should I rely on this agent? Is it doing what
|
||||
it promised? Can I trace where this data came from?
|
||||
|
||||
APAE adds four capabilities to the ecosystem:
|
||||
|
||||
1. **Dynamic trust scoring** — behavioral reputation that adjusts
|
||||
based on interaction outcomes (AIMD model).
|
||||
2. **Behavior verification** — checking agent actions against
|
||||
declared specifications.
|
||||
3. **Data provenance** — tracing data lineage through the DAG.
|
||||
4. **Cross-domain trust** — federating trust across administrative
|
||||
domains.
|
||||
|
||||
These capabilities are bundled into assurance profiles
|
||||
(Relaxed, Standard, Regulated) that map to ECT assurance levels,
|
||||
so the same ecosystem works from a dev cluster to a hospital.
|
||||
|
||||
# Conventions and Definitions
|
||||
|
||||
{::boilerplate bcp14-tagged}
|
||||
|
||||
Trust Score:
|
||||
: A floating-point value in \[0.0, 1.0\] representing one agent's
|
||||
assessment of another agent's reliability.
|
||||
|
||||
Trust Event:
|
||||
: An interaction outcome that causes a trust score adjustment.
|
||||
Derived from ECTs.
|
||||
|
||||
Trust Domain:
|
||||
: An administrative boundary within which a single trust anchor
|
||||
(CA or JWK set) governs agent identity.
|
||||
|
||||
Behavior Specification:
|
||||
: A machine-readable declaration of permitted agent actions and
|
||||
constraints.
|
||||
|
||||
Provenance Chain:
|
||||
: The sequence of ECT nodes recording how a piece of data was
|
||||
produced, transformed, and consumed.
|
||||
|
||||
Assurance Profile:
|
||||
: A named configuration selecting which trust, verification, and
|
||||
provenance mechanisms are required.
|
||||
|
||||
Quarantine:
|
||||
: A state in which an agent's trust score has dropped below a
|
||||
configured floor; the agent is prohibited from accepting new
|
||||
delegations.
|
||||
|
||||
# Dynamic Trust Scoring {#trust}
|
||||
|
||||
## Trust Score Model
|
||||
|
||||
Each agent maintains a trust table: peer agent IDs mapped to
|
||||
trust scores. Initial trust for unknown agents is deployment-
|
||||
configured (RECOMMENDED: 0.5; zero-trust deployments: 0.1).
|
||||
|
||||
Scores update using additive-increase, multiplicative-decrease
|
||||
(AIMD):
|
||||
|
||||
- Positive event: `score = min(1.0, score + alpha)`
|
||||
- Negative event: `score = max(0.0, score * beta)`
|
||||
|
||||
Defaults: `alpha = 0.01`, `beta = 0.8`.
|
||||
|
||||
Trust builds slowly (50 successes: 0.5 → 1.0) and drops fast
|
||||
(one failure: 0.82 → 0.66).
|
||||
|
||||
## Trust Events from ECT {#trust-events}
|
||||
|
||||
Trust events are derived from ECTs rather than agent-local
|
||||
counters, making trust computation auditable:
|
||||
|
||||
| ECT condition | Event | Adjustment |
|
||||
|--------------|-------|------------|
|
||||
| Completed, no error follows | `task_success` | +1x alpha |
|
||||
| Completed, partial result | `task_partial` | +0.5x alpha |
|
||||
| `atd:error` referencing agent | `task_failure` | 1x beta |
|
||||
| No response within threshold | `task_timeout` | 1x beta |
|
||||
| `atd:error` with `constraint_violation` | `policy_violation` | beta^2 |
|
||||
| ECT signature verification fails | `attestation_invalid` | beta^2 |
|
||||
| `atd:rollback_request` targeting agent | `rollback_triggered` | 1x beta |
|
||||
{: #fig-events title="Trust Events from ECT"}
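The table can be read as a lookup from event name to score update. The sketch below assumes that "+1x alpha" and "+0.5x alpha" add that multiple of `alpha`, that "1x beta" multiplies the score by `beta`, and that "beta^2" multiplies it by `beta` squared. The data layout and function name are hypothetical.

```python
ALPHA, BETA = 0.01, 0.8  # defaults from the trust score model

# event name -> (kind, magnitude): "add" adds magnitude * alpha,
# "mul" multiplies the score by beta ** magnitude.
ADJUSTMENTS = {
    "task_success": ("add", 1.0),
    "task_partial": ("add", 0.5),
    "task_failure": ("mul", 1),
    "task_timeout": ("mul", 1),
    "policy_violation": ("mul", 2),
    "attestation_invalid": ("mul", 2),
    "rollback_triggered": ("mul", 1),
}


def apply_event(score, event, alpha=ALPHA, beta=BETA):
    """Return the score after one trust event, clamped to [0.0, 1.0]."""
    kind, magnitude = ADJUSTMENTS[event]
    if kind == "add":
        return min(1.0, score + magnitude * alpha)
    return max(0.0, score * beta ** magnitude)
```

Under this reading, a `policy_violation` at a score of 0.5 leaves 0.32, where an ordinary `task_failure` would leave 0.4.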
|
||||
|
||||
## Trust Decay
|
||||
|
||||
If no interaction occurs for a configurable period (default:
|
||||
7 days): `score = max(initial_default, score - 0.01/day)`.
|
||||
|
||||
## Trust Assertions as ECT {#trust-assertions}
|
||||
|
||||
Agent A shares its trust assessment via a trust assertion ECT:
|
||||
|
||||
- `exec_act`: `"apae:trust_assertion"`
|
||||
|
||||
~~~json
|
||||
{
|
||||
"exec_act": "apae:trust_assertion",
|
||||
"ext": {
|
||||
"apae.subject": "spiffe://example.com/agent/b",
|
||||
"apae.trust_score": 0.82,
|
||||
"apae.interactions": 147,
|
||||
"apae.confidence": "high",
|
||||
"apae.hops": 0
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-assertion title="Trust Assertion ECT"}
|
||||
|
||||
Confidence: `low` (<10 interactions), `medium` (10-99),
|
||||
`high` (100+).
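A minimal sketch of how an agent might derive the confidence label and assemble the claims shown in the figure. The helper names are hypothetical, and signing (required at L2/L3) is out of scope here.

```python
def confidence_label(interactions):
    """Map an interaction count to the confidence buckets defined above."""
    if interactions < 10:
        return "low"
    if interactions < 100:
        return "medium"
    return "high"


def trust_assertion(subject, score, interactions, hops=0):
    """Build the unsigned claim set of a trust assertion ECT."""
    return {
        "exec_act": "apae:trust_assertion",
        "ext": {
            "apae.subject": subject,
            "apae.trust_score": score,
            "apae.interactions": interactions,
            "apae.confidence": confidence_label(interactions),
            "apae.hops": hops,
        },
    }
```

Deriving `apae.confidence` from `apae.interactions` at construction time keeps the two claims from drifting apart.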
|
||||
|
||||
Trust assertion ECTs MUST be signed at L2/L3.
|
||||
|
||||
## Trust Propagation
|
||||
|
||||
When Agent C receives A's assertion about B:
|
||||
|
||||
~~~
|
||||
c_score_for_b = max(c_score_for_b,
|
||||
a_score * trust_of_a * attenuation)
|
||||
~~~
|
||||
|
||||
Default `attenuation`: 0.5. Direct observations always take
|
||||
precedence. `apae.hops` tracks propagation depth; agents MUST NOT
|
||||
propagate beyond their configured maximum (default: 1).
|
||||
|
||||
## Trust Thresholds as Policy
|
||||
|
||||
Trust thresholds are ACP-DAG-HITL node constraints:
|
||||
|
||||
~~~json
|
||||
{
|
||||
"constraints": {
|
||||
"apae.min_trust": 0.7,
|
||||
"apae.min_confidence": "medium"
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-threshold title="Trust Threshold as Node Constraint"}
|
||||
|
||||
Requests from agents below threshold MUST be denied with HTTP 403.
|
||||
The `apae.peer_trust_score` is a runtime context value derived
|
||||
from the trusting agent's trust table for the requesting peer;
|
||||
it is not an ECT claim itself.
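Threshold enforcement at a node might look like the sketch below. It assumes the confidence labels are ordered `low` < `medium` < `high`, that a missing constraint imposes no requirement, and that the caller supplies the peer's score and confidence from its own trust table. The function name and the use of 200 for "proceed" are illustrative.

```python
CONFIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}


def admission_status(constraints, peer_trust_score, peer_confidence):
    """Return the HTTP status for a request: 200 to proceed, 403 to deny."""
    if peer_trust_score < constraints.get("apae.min_trust", 0.0):
        return 403
    required = constraints.get("apae.min_confidence", "low")
    if CONFIDENCE_RANK[peer_confidence] < CONFIDENCE_RANK[required]:
        return 403
    return 200
```

Both constraints have to hold: a peer with a high score built on only a handful of interactions is still denied when `apae.min_confidence` is `medium`.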
|
||||
|
||||
Low trust can trigger HITL escalation:
|
||||
|
||||
~~~json
|
||||
{
|
||||
"id": "r-low-trust",
|
||||
"trigger": {
|
||||
"kind": "confidence_below",
|
||||
"op": "lt",
|
||||
"value": 0.5,
|
||||
"input_ref": "apae.peer_trust_score"
|
||||
},
|
||||
"required_role": "operator:security",
|
||||
"action": "escalate",
|
||||
"allow_override": true,
|
||||
"override_action": "continue"
|
||||
}
|
||||
~~~
|
||||
{: #fig-trust-hitl title="HITL Rule for Low Trust"}
|
||||
|
||||
## Automatic Revocation
|
||||
|
||||
When trust drops below a floor (default: 0.2), the trusting agent
|
||||
SHOULD revoke delegations and emit:
|
||||
`exec_act: "apae:trust_revoke"`.
|
||||
|
||||
# Quarantine Protocol {#quarantine}
|
||||
|
||||
When a trust score drops below the configured quarantine floor
|
||||
(default: 0.15), the agent enters quarantine.
|
||||
|
||||
## Quarantine Entry
|
||||
|
||||
The detecting agent MUST emit a quarantine ECT:
|
||||
|
||||
~~~json
|
||||
{
|
||||
"exec_act": "apae:quarantine",
|
||||
"ext": {
|
||||
"apae.subject": "spiffe://example.com/agent/b",
|
||||
"apae.score": 0.12,
|
||||
"apae.threshold": 0.15,
|
||||
"apae.quarantine_until": "2026-03-02T12:00:00Z",
|
||||
"apae.reason": "Repeated policy_violation events (3 in 1 hour)"
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-quarantine title="Quarantine ECT"}
|
||||
|
||||
The quarantine ECT MUST be broadcast to all agents that have
|
||||
received trust assertions about the quarantined agent
|
||||
(via `apae:trust_assertion` with matching `apae.subject`).
|
||||
|
||||
## Quarantined Agent Behavior
|
||||
|
||||
While quarantined, a subject agent:
|
||||
|
||||
- MUST NOT accept new delegations. New delegation requests MUST
|
||||
return HTTP 503 with `Retry-After` derived from `apae.quarantine_until` (as an HTTP-date or delay-seconds value, per {{RFC9110}}).
|
||||
- MUST complete in-progress workflows (drain behavior per AEPB).
|
||||
- MAY accept direct operator commands (HITL Level 4 is unaffected).
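HTTP's `Retry-After` field carries either an HTTP-date or a number of seconds, so the timestamp in `apae.quarantine_until` needs converting before it goes into the 503 response. The sketch below shows both forms; the function name is hypothetical, and it assumes the timestamp uses the `Z`-suffixed form shown in the quarantine ECT.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def retry_after_values(quarantine_until, now):
    """Return both valid Retry-After forms for a quarantined agent."""
    # fromisoformat() in older Python versions does not accept a "Z" suffix.
    until = datetime.fromisoformat(quarantine_until.replace("Z", "+00:00"))
    seconds = max(0, int((until - now).total_seconds()))
    return {
        "delay-seconds": str(seconds),
        "http-date": format_datetime(until.astimezone(timezone.utc),
                                     usegmt=True),
    }
```

For example, one hour before a quarantine ending at `2026-03-02T12:00:00Z`, the two forms are `3600` and `Mon, 02 Mar 2026 12:00:00 GMT`. An agent would send only one of them.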
|
||||
|
||||
Agents receiving the quarantine notification MUST update their
|
||||
trust table and MUST NOT delegate new tasks to the quarantined
|
||||
agent until the quarantine expires or is lifted.
|
||||
|
||||
## Quarantine Duration
|
||||
|
||||
The default quarantine duration is 1 hour, doubling on each
|
||||
successive quarantine entry:
|
||||
|
||||
| Quarantine count | Duration |
|
||||
|-----------------|---------|
|
||||
| 1 | 1 hour |
|
||||
| 2 | 2 hours |
|
||||
| 3 | 4 hours |
|
||||
| n | 2^(n-1) hours (max 168 hours / 7 days) |
|
||||
{: #fig-quarantine-duration title="Quarantine Duration Escalation"}
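The escalation schedule is a capped exponential backoff. A minimal sketch, with hypothetical function names, that also derives the `apae.quarantine_until` timestamp:

```python
from datetime import datetime, timedelta, timezone

MAX_HOURS = 168  # 7-day cap from the table above


def quarantine_hours(count):
    """Return the duration in hours of the `count`-th quarantine."""
    if count < 1:
        raise ValueError("quarantine count starts at 1")
    return min(2 ** (count - 1), MAX_HOURS)


def quarantine_until(entered_at, count):
    """Return the end of the quarantine as a UTC timestamp string."""
    end = entered_at + timedelta(hours=quarantine_hours(count))
    return end.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```

The cap takes effect at the ninth quarantine: the eighth lasts 128 hours, and the ninth would be 256 hours uncapped but is limited to 168.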
|
||||
|
||||
## Quarantine Expiry and Recovery
|
||||
|
||||
When the quarantine period expires:
|
||||
|
||||
1. The agent's trust score is reset to the initial default
|
||||
(deployment-configured; RECOMMENDED: 0.5 for recovery).
|
||||
2. The agent transitions back to active status per AEPB lifecycle.
|
||||
3. A recovery ECT MAY be emitted: `exec_act: "apae:quarantine"` with
|
||||
`apae.to_state: "active"`.
|
||||
|
||||
An operator MAY lift a quarantine early by issuing a HITL override
|
||||
(Level 1 or higher) with scope `apae:quarantine_lift` for the
|
||||
subject agent.
|
||||
|
||||
# Behavior Verification {#behavior}
|
||||
|
||||
## Behavior Specifications
|
||||
|
||||
A behavior specification declares what an agent is permitted to do.
|
||||
Specifications are JSON documents referencing ECT claims:
|
||||
|
||||
~~~json
|
||||
{
|
||||
"spec_version": "1.0",
|
||||
"agent_id": "spiffe://example.com/agent/firewall",
|
||||
"allowed_actions": ["update_rules", "read_config", "report"],
|
||||
"constraints": {
|
||||
"max_actions_per_minute": 60,
|
||||
"forbidden_targets": ["core-router-*"],
|
||||
"require_checkpoint_before": ["update_rules"]
|
||||
},
|
||||
"verification_frequency": "continuous"
|
||||
}
|
||||
~~~
|
||||
{: #fig-spec title="Behavior Specification"}
|
||||
|
||||
Behavior specifications SHOULD be signed and versioned. Changes
|
||||
MUST be recorded as ECTs.
|
||||
|
||||
## Verification Against ECT Stream
|
||||
|
||||
A verifier monitors the agent's ECT stream and checks:
|
||||
|
||||
1. `exec_act` values are in `allowed_actions`.
|
||||
2. Action rate does not exceed `max_actions_per_minute` (computed
|
||||
from `iat` timestamps).
|
||||
3. `atd:checkpoint` ECTs precede `update_rules` ECTs (from
|
||||
`require_checkpoint_before`).
|
||||
4. Targets in `ext` claims do not match `forbidden_targets`.
|
||||
|
||||
Verification results are ECTs:
|
||||
|
||||
- `exec_act`: `"apae:compliance_check"`
|
||||
|
||||
~~~json
|
||||
{
|
||||
"exec_act": "apae:compliance_check",
|
||||
"par": ["latest-agent-ect-uuid"],
|
||||
"ext": {
|
||||
"apae.compliance_status": "passing",
|
||||
"apae.violations": [],
|
||||
"apae.spec_version": "1.0",
|
||||
"apae.window": "2026-03-01T12:00:00Z/PT1H"
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-compliance title="Compliance Check ECT"}
|
||||
|
||||
Violations trigger trust score decreases (`policy_violation` event)
|
||||
and MAY trigger HITL escalation.
|
||||
|
||||
A violation compliance check ECT looks like:
|
||||
|
||||
~~~json
|
||||
{
|
||||
"exec_act": "apae:compliance_check",
|
||||
"par": ["offending-ect-uuid"],
|
||||
"ext": {
|
||||
"apae.compliance_status": "failing",
|
||||
"apae.violations": [
|
||||
{
|
||||
"rule": "require_checkpoint_before",
|
||||
"action": "update_rules",
|
||||
"ect": "offending-ect-uuid",
|
||||
"description": "update_rules at 12:03:15 has no preceding atd:checkpoint within 10s"
|
||||
}
|
||||
],
|
||||
"apae.spec_version": "1.0",
|
||||
"apae.window": "2026-03-01T12:00:00Z/PT1H"
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-violation title="Compliance Violation ECT"}
|
||||
|
||||
# Data Provenance {#provenance}
|
||||
|
||||
## DAG as Provenance Chain
|
||||
|
||||
The ECT DAG already encodes data provenance: each ECT's `par`
|
||||
references show which prior tasks produced its inputs. The
|
||||
`inp_hash` and `out_hash` claims prove what was processed without
|
||||
revealing the data.
|
||||
|
||||
For deployments requiring explicit provenance metadata, agents
|
||||
MAY include:
|
||||
|
||||
~~~json
|
||||
{
|
||||
"ext": {
|
||||
"apae.data_source": "database:patients",
|
||||
"apae.data_classification": "pii",
|
||||
"apae.retention_days": 365,
|
||||
"apae.transformations": ["anonymize", "aggregate"]
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-provenance title="Provenance Extension Claims"}
|
||||
|
||||
At the Regulated assurance level, all data-transforming ECT nodes
|
||||
MUST include provenance claims.
|
||||
|
||||
## Provenance Queries
|
||||
|
||||
At L3, the audit ledger enables provenance queries:
|
||||
|
||||
- "Which agents touched this data?" → walk `par` chain from
|
||||
final ECT to roots.
|
||||
- "Was this data transformed?" → check `apae.transformations`
|
||||
along the chain.
|
||||
- "Is provenance complete?" → verify all `par` references
|
||||
resolve to ledger entries.
|
||||
|
||||
# Cross-Domain Trust {#cross-domain}
|
||||
|
||||
## Trust Domain Basics
|
||||
|
||||
A trust domain is an administrative boundary within which a
|
||||
single trust anchor (CA certificate or JWK set) governs agent
|
||||
identity. Trust scores are local to a trust domain by default.
|
||||
|
||||
## Trust Domain Registration
|
||||
|
||||
Each trust domain MUST publish a trust anchor at a well-known URI:
|
||||
|
||||
~~~
|
||||
GET /.well-known/apae/trust-anchor HTTP/1.1
|
||||
~~~
|
||||
|
||||
The response MUST be a JSON object containing:
|
||||
|
||||
~~~json
|
||||
{
|
||||
"domain": "example.com",
|
||||
"trust_anchor_type": "jwks",
|
||||
"trust_anchor_uri": "https://example.com/.well-known/jwks.json",
|
||||
"contact": "trust-admin@example.com"
|
||||
}
|
||||
~~~
|
||||
{: #fig-trust-anchor title="Trust Anchor Document"}
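A relying agent would typically validate the document before using it. The sketch below assumes that all four members shown in the figure are required, that `domain` has to match the domain the document was fetched from, and that the anchor URI has to be HTTPS. None of these checks is spelled out in the text above, and the function name is hypothetical.

```python
import json

REQUIRED_FIELDS = ("domain", "trust_anchor_type", "trust_anchor_uri", "contact")


def parse_trust_anchor(body, expected_domain):
    """Parse and sanity-check a trust anchor document."""
    doc = json.loads(body)
    if not isinstance(doc, dict):
        raise ValueError("trust anchor document must be a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in doc]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    if doc["domain"] != expected_domain:
        raise ValueError("document is for a different domain")
    if not doc["trust_anchor_uri"].startswith("https://"):
        raise ValueError("trust anchor URI must use https")
    return doc
```

Fetching the document and retrieving the keys behind `trust_anchor_uri` are left out; both would go over TLS in practice.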
|
||||
|
||||
## Cross-Domain Delegation
|
||||
|
||||
When Agent A (domain X) delegates to Agent B (domain Y):
|
||||
|
||||
1. A MUST verify that its ACP-DAG-HITL policy permits cross-domain
|
||||
delegation to domain Y (bilateral trust agreement).
|
||||
2. A fetches B's trust anchor document to verify B's identity.
|
||||
3. A creates an `apae:cross_domain_assertion` ECT linking the
|
||||
two domains.
|
||||
4. Both A and B include their domain in ECT `iss` claims.
|
||||
|
||||
~~~json
|
||||
{
|
||||
"exec_act": "apae:cross_domain_assertion",
|
||||
"ext": {
|
||||
"apae.source_domain": "example.com",
|
||||
"apae.dest_domain": "hospital.example",
|
||||
"apae.bilateral_agreement_ref": "agreement-id-2026-001",
|
||||
"apae.min_assurance": "L2"
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-cross-domain title="Cross-Domain Assertion ECT"}
|
||||
|
||||
The ASCII diagram below illustrates a cross-domain delegation:
|
||||
|
||||
~~~
|
||||
Domain: example.com Domain: hospital.example
|
||||
┌──────────────────┐ ┌──────────────────────┐
|
||||
│ Agent A │ AEPB │ Agent B │
|
||||
│ (orchestrator) ├───────►│ (treatment planner) │
|
||||
│ ECT: L2 │ │ ECT: L3 │
|
||||
└──────────────────┘ └──────────────────────┘
|
||||
│ │
|
||||
└─── cross_domain_assertion ECT ──┘
|
||||
(bilateral agreement verified)
|
||||
~~~
|
||||
{: #fig-cross-domain-diag title="Cross-Domain Delegation"}
|
||||
|
||||
## Cross-Domain Trust Scores
|
||||
|
||||
Trust scores do not transfer across domain boundaries by default.
|
||||
When Agent A in domain X has no prior interactions with Agent B
|
||||
in domain Y:
|
||||
|
||||
- If a bilateral trust agreement exists: initial trust is set to
|
||||
the agreement's `default_trust` value (negotiated out of band).
|
||||
- If no agreement exists: delegation MUST be rejected (zero-trust
|
||||
default).
|
||||
|
||||
Cross-domain trust scores are isolated from intra-domain scores
|
||||
and are stored separately in the trust table.
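The initial-trust rule for an unknown cross-domain peer can be sketched as a lookup keyed by the unordered pair of domains. How agreements are stored is not defined here, so the `agreements` mapping, the exception type, and the function name are all hypothetical.

```python
class DelegationRejected(Exception):
    """Raised when no bilateral agreement covers the peer's domain."""


def initial_cross_domain_trust(agreements, local_domain, peer_domain):
    """Return the starting trust score for a peer in another domain."""
    agreement = agreements.get(frozenset((local_domain, peer_domain)))
    if agreement is None:
        # Zero-trust default: no agreement means no delegation at all.
        raise DelegationRejected("no trust agreement with " + peer_domain)
    return agreement["default_trust"]
```

Using a `frozenset` as the key makes the lookup symmetric, matching the bilateral nature of the agreement.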
|
||||
|
||||
# Assurance Profiles {#profiles}
|
||||
|
||||
## Profile Definitions
|
||||
|
||||
An assurance profile is a named configuration that selects which
|
||||
mechanisms are required. Profiles MUST be declared in ACP-DAG-HITL
|
||||
workflow policy and announced in the AEPB capability document.
|
||||
|
||||
| Mechanism | Relaxed | Standard | Regulated |
|
||||
|-----------|---------|----------|-----------|
|
||||
| **ECT level** | L1 | L2 | L3 |
|
||||
| **Trust scoring** | Optional | RECOMMENDED | REQUIRED |
|
||||
| **Trust threshold enforcement** | Optional | RECOMMENDED | REQUIRED |
|
||||
| **Behavior verification** | Off | Periodic | Continuous |
|
||||
| **HITL approval gates** | Optional | Critical paths | Mandatory |
|
||||
| **Data provenance claims** | Off | Optional | REQUIRED |
|
||||
| **Checkpoint before consequential** | RECOMMENDED | REQUIRED | REQUIRED |
|
||||
| **Audit ledger** | Optional | Optional | REQUIRED |
|
||||
| **Quarantine protocol** | Optional | RECOMMENDED | REQUIRED |
|
||||
| **Cross-domain trust agreements** | Optional | Required if cross-domain | Required if cross-domain |
|
||||
{: #fig-profiles title="Assurance Profile Requirements"}
|
||||
|
||||
Relaxed:
|
||||
: Internal dev/staging. L1 ECTs. Trust and verification
|
||||
optional. Useful for debugging and observability without
|
||||
cryptographic overhead.
|
||||
|
||||
Standard:
|
||||
: Production cross-org. L2 ECTs. Trust scoring and thresholds
|
||||
recommended. Periodic behavior verification. HITL on critical
|
||||
paths.
|
||||
|
||||
Regulated:
|
||||
: Healthcare, finance, EU AI Act. L3 ECTs with audit ledger.
|
||||
Continuous behavior verification. All trust mechanisms
|
||||
required. Full provenance chain. Mandatory HITL gates.
|
||||
|
||||
Profiles are declared in ACP-DAG-HITL node constraints:
|
||||
|
||||
~~~json
|
||||
{
|
||||
"constraints": {
|
||||
"apae.assurance_profile": "regulated"
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-profile-policy title="Profile as Node Constraint"}
|
||||
|
||||
A single deployment MAY use different profiles for different
|
||||
workflows.
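For tooling, the REQUIRED cells of the table can be encoded as data and checked against a deployment's enabled mechanisms. The short mechanism keys below (`checkpoint`, `audit_ledger`, and so on) are hypothetical labels for the table rows, not registered identifiers. RECOMMENDED and Optional cells are deliberately left out of the check.

```python
ECT_LEVEL = {"relaxed": "L1", "standard": "L2", "regulated": "L3"}

# Mechanisms marked REQUIRED in the table, per profile.
REQUIRED = {
    "relaxed": set(),
    "standard": {"checkpoint"},
    "regulated": {
        "trust_scoring",
        "trust_thresholds",
        "provenance",
        "checkpoint",
        "audit_ledger",
        "quarantine",
    },
}


def missing_mechanisms(profile, enabled):
    """Return the REQUIRED mechanisms that a deployment has not enabled."""
    return REQUIRED[profile] - set(enabled)
```

A deployment that enables only trust scoring and checkpoints satisfies Standard under this check but would be missing four mechanisms for Regulated.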
|
||||
|
||||
## Profile Selection Guidance
|
||||
|
||||
Operators SHOULD select profiles using the following decision
|
||||
table:
|
||||
|
||||
| Deployment context | Recommended profile |
|
||||
|-------------------|--------------------|
|
||||
| Unit tests, local development | Relaxed |
|
||||
| Internal production (single org) | Standard |
|
||||
| Cross-organization production | Standard (with trust agreements) |
|
||||
| Financial services, EU AI Act critical | Regulated |
|
||||
| Healthcare (HIPAA, clinical trials) | Regulated |
|
||||
| Critical infrastructure (NIS2) | Regulated |
|
||||
{: #fig-profile-selection title="Profile Selection Guidance"}
|
||||
|
||||
## Upgrade Path Between Profiles
|
||||
|
||||
Operators MUST NOT downgrade assurance profile during an active
|
||||
workflow.
|
||||
|
||||
Relaxed → Standard:
|
||||
: (1) Add ECT signing keys (WIMSE WIT or X.509). (2) Update ECT
|
||||
emission to sign tokens. (3) Configure trust scoring
|
||||
(alpha/beta, initial trust, thresholds). (4) Define behavior
|
||||
specifications for critical agents. (5) Add HITL approval gates
|
||||
on critical DAG paths.
|
||||
|
||||
Standard → Regulated:
|
||||
: (1) Configure audit ledger endpoint. (2) Update ECT emission
|
||||
to commit each ECT to ledger. (3) Enable continuous behavior
|
||||
verification (change `verification_frequency` from `periodic`
|
||||
to `continuous`). (4) Enable provenance claims on all
|
||||
data-transforming ECTs. (5) Add mandatory HITL gates on all
|
||||
consequential actions. (6) Enable quarantine protocol.
|
||||
|
||||
# Security Considerations
|
||||
|
||||
## Trust Score Sensitivity
|
||||
|
||||
Trust scores are sensitive metadata. Agents MUST NOT expose
|
||||
full trust tables. Only pairwise assertions SHOULD be shared,
|
||||
and only in response to explicit authenticated requests.
|
||||
|
||||
## Score Inflation (Adversarial Trust Building)
|
||||
|
||||
An adversary performs many small successful interactions to
|
||||
inflate trust, then executes a malicious action. Mitigation:
|
||||
|
||||
- Apply double penalty (`beta^2`) for `policy_violation` events.
|
||||
- Enforce high trust thresholds for high-risk actions.
|
||||
- Rate-limit trust score increases: an agent MUST NOT increase
|
||||
trust by more than 0.1 per day toward any single peer.
|
||||
- Use behavior verification at Standard and above (periodic at Standard, continuous at Regulated).
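The daily cap on trust gains can be sketched as a small per-peer ledger. The sketch assumes the cap applies to the sum of increases within one calendar day and that decreases do not restore headroom; both are interpretations. The class and method names are hypothetical.

```python
class CappedGain:
    """Limit how much trust any single peer can gain per day."""

    def __init__(self, daily_cap=0.1):
        self.daily_cap = daily_cap
        self.gained = {}  # (peer, day) -> total increase applied that day

    def increase(self, score, peer, day, alpha=0.01):
        """Apply one positive event, honoring the cap and the 1.0 ceiling."""
        used = self.gained.get((peer, day), 0.0)
        headroom = max(0.0, self.daily_cap - used)
        step = min(alpha, headroom, 1.0 - score)
        self.gained[(peer, day)] = used + step
        return score + step
```

With the default `alpha`, only the first ten successes on a given day move the score, so an adversary needs at least two days of clean behavior to climb from 0.5 to a 0.7 threshold.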
|
||||
|
||||
## Attestation Freshness
|
||||
|
||||
Stale compliance check ECTs MUST be rejected. The verifier MUST
|
||||
check that `apae:compliance_check` ECTs have `iat` within the
|
||||
configured verification window (default: 1 hour for Standard,
|
||||
5 minutes for Regulated).
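The freshness check is a comparison of `iat` against the profile's window. The sketch below additionally rejects tokens whose `iat` lies in the future, which is an assumption; a real verifier would likely allow a small clock-skew tolerance. The names are hypothetical.

```python
# Default verification windows in seconds, from the text above.
FRESHNESS_WINDOW = {"standard": 3600, "regulated": 300}


def is_fresh(iat, now, profile):
    """Return True if a compliance check ECT is recent enough to accept."""
    age = now - iat
    return 0 <= age <= FRESHNESS_WINDOW[profile]
```

A compliance check issued ten minutes ago is therefore acceptable at Standard but stale at Regulated.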
|
||||
|
||||
## Provenance Chain Forgery
|
||||
|
||||
Each provenance hop MUST be signed (L2+) to prevent injection
|
||||
of false provenance records. Agents MUST verify the signature
|
||||
on all `par`-linked ECTs before accepting provenance claims.
|
||||
|
||||
## Sybil Attack on Trust
|
||||
|
||||
Fake agents inflate trust for each other to gain influence.
|
||||
Mitigation:
|
||||
|
||||
- Trust propagation attenuation (default 0.5) limits the impact
|
||||
of second-hand assertions.
|
||||
- Maximum hop count of 1 for trust propagation.
|
||||
- Require agents to be registered in a trusted directory before
|
||||
initial trust is assigned above the floor value.
|
||||
|
||||
## Cross-Domain Trust Downgrade
|
||||
|
||||
An attacker forces delegation through an untrusted domain by
|
||||
presenting a forged bilateral agreement. Mitigation:
|
||||
|
||||
- Bilateral trust agreements MUST be signed by operators of
|
||||
both domains.
|
||||
- Agents MUST verify the agreement signature before accepting
|
||||
cross-domain delegations.
|
||||
- Cross-domain ECTs MUST use L2+ assurance.
|
||||
|
||||
## Quarantine Evasion
|
||||
|
||||
An agent subject to quarantine re-registers under a different
|
||||
identity to escape the quarantine. Mitigation:
|
||||
|
||||
- Quarantine ECTs are broadcast; receiving agents record the
|
||||
quarantine by both agent ID and by behavioral fingerprint.
|
||||
- Agents SHOULD require re-onboarding with operator approval
|
||||
before accepting new identities from known-quarantined domains.
|
||||
|
||||
# IANA Considerations
|
||||
|
||||
## Assurance Profile Registry
|
||||
|
||||
This document requests the creation of the "APAE Assurance Profile
|
||||
Registry" under IANA. Registration policy: Specification Required.
|
||||
|
||||
Initial entries:
|
||||
|
||||
| Profile Name | Profile URI | Description | Reference |
|
||||
|-------------|------------|-------------|-----------|
|
||||
| Relaxed | `urn:ietf:params:apae:profile:relaxed` | Dev/test, L1 ECTs | This document |
|
||||
| Standard | `urn:ietf:params:apae:profile:standard` | Production, L2 ECTs | This document |
|
||||
| Regulated | `urn:ietf:params:apae:profile:regulated` | Regulated, L3 ECTs | This document |
|
||||
{: #fig-profile-registry title="Assurance Profile Registry"}
|
||||
|
||||
## `exec_act` Values
|
||||
|
||||
This document requests registration in the AEM Ecosystem
|
||||
Extension Registry:
|
||||
|
||||
| Value | Description | Reference |
|
||||
|-------|-------------|-----------|
|
||||
| `apae:trust_assertion` | Sharing trust score for a peer | This document |
|
||||
| `apae:trust_revoke` | Revoking delegations due to low trust | This document |
|
||||
| `apae:compliance_check` | Behavior verification result | This document |
|
||||
| `apae:quarantine` | Agent quarantine entry or exit | This document |
|
||||
| `apae:cross_domain_assertion` | Cross-domain delegation evidence | This document |
|
||||
{: #fig-iana-actions title="APAE exec_act Registrations"}
|
||||
|
||||
## Well-Known URI
|
||||
|
||||
This document requests registration of `apae/trust-anchor` as a
|
||||
well-known URI suffix per RFC 8615 for trust domain anchor
|
||||
publication.
|
||||
|
||||
--- back
|
||||
|
||||
# Acknowledgments
|
||||
{:numbered="false"}
|
||||
|
||||
APAE builds on ECT {{I-D.nennemann-wimse-ect}} for interaction
|
||||
evidence and audit, and ACP-DAG-HITL
|
||||
{{I-D.nennemann-agent-dag-hitl-safety}} for trust threshold and
|
||||
assurance profile policy enforcement. The AIMD trust model is
|
||||
adapted from TCP congestion control (RFC 5681). Behavior
|
||||
verification is informed by RATS architecture {{RFC9334}}.
|
||||
@@ -0,0 +1,372 @@
|
||||
---
|
||||
title: "Human Emergency Override Protocol (HEOP)"
|
||||
abbrev: "HEOP"
|
||||
category: std
|
||||
docname: draft-heop-human-emergency-override-00
|
||||
submissiontype: IETF
|
||||
number:
|
||||
date:
|
||||
v: 3
|
||||
area: "SEC"
|
||||
workgroup: "Security Dispatch"
|
||||
keyword:
|
||||
- human override
|
||||
- emergency stop
|
||||
- agentic workflows
|
||||
- HITL
|
||||
- execution context
|
||||
|
||||
author:
|
||||
-
|
||||
fullname: Generated by IETF Draft Analyzer
|
||||
organization: Independent
|
||||
email: placeholder@example.com
|
||||
|
||||
normative:
|
||||
RFC7519:
|
||||
RFC7515:
|
||||
RFC9110:
|
||||
RFC8615:
|
||||
I-D.nennemann-wimse-ect:
|
||||
title: "Execution Context Tokens for Distributed Agentic Workflows"
|
||||
target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/
|
||||
I-D.nennemann-agent-dag-hitl-safety:
|
||||
title: "Agent Context Policy Token: DAG Delegation with Human Override"
|
||||
target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/
|
||||
|
||||
informative:
|
||||
|
||||
--- abstract
|
||||
|
||||
This document defines the Human Emergency Override Protocol (HEOP),
|
||||
the runtime enforcement mechanism for human intervention in
|
||||
autonomous AI agent operations. HEOP is the "how" to ACP-DAG-HITL's
|
||||
"when": where the Agent Context Policy Token defines conditions
|
||||
that require human decision, HEOP defines the wire protocol for
|
||||
override commands, agent compliance, and acknowledgment. HEOP
specifies four override levels (pause, constrain, stop, takeover)
and a mandatory agent compliance endpoint, and it records every
override as an ECT DAG node for tamper-evident audit. Three of the
override levels map directly to ACP-DAG-HITL actions; CONSTRAIN
extends beyond that action vocabulary.
|
||||
|
||||
--- middle
|
||||
|
||||
# Introduction
|
||||
|
||||
As AI agents gain autonomy in critical infrastructure, the ability
|
||||
for humans to intervene quickly and reliably becomes essential.
|
||||
The current ratio of autonomous capability drafts to human
|
||||
oversight drafts in the IETF is roughly 7:1.
|
||||
|
||||
The Agent Context Policy Token
|
||||
{{I-D.nennemann-agent-dag-hitl-safety}} defines a policy language
|
||||
for human-in-the-loop safety: trigger conditions, required roles,
|
||||
and permitted actions (`pause`, `escalate`, `abort`). But it does
|
||||
not define the runtime protocol for how overrides are transmitted to
|
||||
agents, how agents acknowledge them, or how the intervention is
|
||||
recorded. HEOP fills this gap.
|
||||
|
||||
HEOP draws from industrial safety: the emergency stop button on
|
||||
factory equipment, the circuit breaker in electrical systems, the
|
||||
kill switch in robotics. The override mechanism must be simpler
|
||||
and more reliable than the system it controls.
|
||||
|
||||
Every override command and acknowledgment is recorded as an ECT
|
||||
{{I-D.nennemann-wimse-ect}}, linking into the workflow DAG. At
|
||||
L3, this provides the tamper-evident audit trail that regulated
|
||||
environments (FDA, MiFID II, EU AI Act) require for human
|
||||
intervention records.
|
||||
|
||||
# Conventions and Definitions
|
||||
|
||||
{::boilerplate bcp14-tagged}
|
||||
|
||||
Override:
|
||||
: A human-initiated command that alters an agent's autonomous
|
||||
operation, taking precedence over the agent's own decision-making.
|
||||
|
||||
Operator:
|
||||
: A human user authorized to issue override commands, corresponding
|
||||
to a `required_role` in ACP-DAG-HITL policy.
|
||||
|
||||
Override Level:
|
||||
: One of four escalating intervention types, each with
|
||||
deterministic agent behavior requirements.
|
||||
|
||||
# Mapping to ACP-DAG-HITL Actions {#mapping}
|
||||
|
||||
HEOP override levels are the runtime realization of ACP-DAG-HITL
|
||||
actions:
|
||||
|
||||
| ACP-DAG-HITL action | HEOP Level | Behavior |
|
||||
|---------------------|------------|----------|
|
||||
| `pause` | 1 (PAUSE) | Suspend autonomous actions, hold state |
|
||||
| (no equivalent) | 2 (CONSTRAIN) | Restrict to allowed action subset |
|
||||
| `abort` | 3 (STOP) | Cease all actions, enter inert state |
|
||||
| `escalate` | 4 (TAKEOVER) | Transfer control to human operator |
|
||||
{: #fig-mapping title="ACP-DAG-HITL to HEOP Mapping"}
|
||||
|
||||
Level 2 (CONSTRAIN) extends beyond ACP-DAG-HITL's current action
|
||||
vocabulary. When a HITL rule triggers with `action: "pause"` and
|
||||
`override_action: "continue"`, the operator MAY continue with
|
||||
HEOP Level 2 constraints rather than full resumption.
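The mapping table can be expressed as a small lookup. The sketch below is illustrative; the `HeopLevel` and `level_for_action` names are assumptions, while the numeric levels and action strings come from the table.

```python
from enum import IntEnum


class HeopLevel(IntEnum):
    PAUSE = 1
    CONSTRAIN = 2
    STOP = 3
    TAKEOVER = 4


# Taken from the mapping table. CONSTRAIN is absent on purpose:
# it has no ACP-DAG-HITL equivalent and is chosen by the operator.
ACTION_TO_LEVEL = {
    "pause": HeopLevel.PAUSE,
    "abort": HeopLevel.STOP,
    "escalate": HeopLevel.TAKEOVER,
}


def level_for_action(action: str) -> HeopLevel:
    """Return the HEOP level that realizes an ACP-DAG-HITL action."""
    try:
        return ACTION_TO_LEVEL[action]
    except KeyError:
        raise ValueError(f"no HEOP level for action {action!r}") from None
```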
|
||||
|
||||
# Override Levels {#levels}
|
||||
|
||||
## Level 1 -- PAUSE
|
||||
|
||||
The agent MUST suspend all autonomous actions and hold its current
|
||||
state. It MUST NOT initiate new actions but MAY complete
|
||||
in-progress actions if stopping mid-execution would cause harm.
|
||||
The agent resumes when a RESUME command is received.
|
||||
|
||||
## Level 2 -- CONSTRAIN
|
||||
|
||||
The agent MUST restrict its actions to a specified subset defined
|
||||
in the override command. The agent MUST reject any action not on
|
||||
the allowlist.
|
||||
|
||||
## Level 3 -- STOP
|
||||
|
||||
The agent MUST immediately cease all autonomous actions, abandon
|
||||
in-progress actions where safe, and enter an inert state. It
|
||||
MUST NOT act until explicitly restarted. This is the e-stop.
|
||||
|
||||
## Level 4 -- TAKEOVER
|
||||
|
||||
The agent MUST transfer operational control to the human operator,
|
||||
entering pass-through mode where it executes only explicit operator
|
||||
commands. The agent's sensors and outputs remain available to the
|
||||
operator as tools.
|
||||
|
||||
# Override Command Format {#command-format}
|
||||
|
||||
Override commands are HTTP POST requests to the agent's well-known
|
||||
endpoint, carrying an ECT in the Execution-Context header:
|
||||
|
||||
~~~
|
||||
POST /.well-known/heop/override HTTP/1.1
|
||||
Content-Type: application/json
|
||||
Authorization: Bearer <operator-jwt>
|
||||
Execution-Context: <override-ECT>
|
||||
|
||||
{
|
||||
"override_id": "urn:uuid:...",
|
||||
"level": 3,
|
||||
"reason": "Agent blocking legitimate traffic",
|
||||
"operator_id": "spiffe://example.com/human/alice",
|
||||
"scope": "*",
|
||||
"constraints": null,
|
||||
"ttl": null
|
||||
}
|
||||
~~~
|
||||
{: #fig-override title="Override Command"}
|
||||
|
||||
Field definitions:
|
||||
|
||||
`level`:
|
||||
: Integer 1-4. MUST be present.
|
||||
|
||||
`reason`:
|
||||
: Human-readable text. MUST be present and logged.
|
||||
|
||||
`scope`:
|
||||
: Which agent functions to override. `"*"` means all. MAY be a
|
||||
list of function identifiers for partial overrides.
|
||||
|
||||
`constraints`:
|
||||
: For Level 2 only. JSON array of permitted action types, e.g.,
|
||||
`["read", "monitor", "report"]`.
|
||||
|
||||
`ttl`:
|
||||
: Optional duration in seconds. If set, the override expires
|
||||
automatically and the agent resumes its prior mode.
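A receiving agent could check these fields before acting on a command. The sketch below is one possible validator, not part of the protocol. The function name and the error strings are hypothetical, and several checks are assumptions: that `scope` is required, that a Level 2 command needs a non-empty `constraints` array, and that `ttl` must be positive.

```python
def validate_override(cmd: dict) -> list:
    """Return a list of problems; an empty list means the command is usable."""
    problems = []

    level = cmd.get("level")
    # bool is a subclass of int in Python, so check the exact type.
    if type(level) is not int or not 1 <= level <= 4:
        problems.append("level MUST be an integer 1-4")

    reason = cmd.get("reason")
    if not isinstance(reason, str) or not reason.strip():
        problems.append("reason MUST be present")

    scope = cmd.get("scope")
    is_list_of_ids = isinstance(scope, list) and all(
        isinstance(item, str) for item in scope)
    if scope != "*" and not is_list_of_ids:
        problems.append('scope must be "*" or a list of function identifiers')

    constraints = cmd.get("constraints")
    if level == 2:
        if not isinstance(constraints, list) or not constraints:
            problems.append("Level 2 needs a non-empty constraints array")
    elif constraints is not None:
        problems.append("constraints applies to Level 2 only")

    ttl = cmd.get("ttl")
    if ttl is not None and (type(ttl) not in (int, float) or ttl <= 0):
        problems.append("ttl must be a positive number of seconds or null")

    return problems
```

Because an agent MUST NOT reject an override outright, a deployment would need to decide how validation failures are reported; that decision is outside this sketch.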
|
||||
|
||||
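## Resume and Lift

To resume from Level 1 (PAUSE), the operator posts to the resume
endpoint. To lift any override, the operator posts to the lift
endpoint:

~~~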
|
||||
POST /.well-known/heop/resume HTTP/1.1
Authorization: Bearer <operator-jwt>

{"override_id": "urn:uuid:...", "operator_id": "..."}

POST /.well-known/heop/lift HTTP/1.1
Authorization: Bearer <operator-jwt>

{"override_id": "urn:uuid:...", "operator_id": "..."}
|
||||
~~~
|
||||
{: #fig-resume title="Resume and Lift Commands"}
|
||||
|
||||
# ECT Integration {#ect-integration}
|
||||
|
||||
## Override ECT
|
||||
|
||||
The operator (or operator's tooling) MUST produce an ECT for
|
||||
every override command:
|
||||
|
||||
- `exec_act`: `"heop:override"`
|
||||
- `par`: the `jti` of the HITL trigger ECT (if the override was
|
||||
triggered by ACP-DAG-HITL policy) or empty (if manually
|
||||
initiated)
|
||||
|
||||
~~~json
|
||||
{
|
||||
"ext": {
|
||||
"heop.level": 3,
|
||||
"heop.reason": "Agent blocking legitimate traffic",
|
||||
"heop.operator_id": "spiffe://example.com/human/alice",
|
||||
"heop.scope": "*"
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-override-ect title="Override ECT Extension Claims"}
|
||||
|
||||
## Acknowledgment ECT
|
||||
|
||||
The agent MUST produce an acknowledgment ECT:
|
||||
|
||||
- `exec_act`: `"heop:ack"`
|
||||
- `par`: the `jti` of the override ECT
|
||||
|
||||
~~~json
|
||||
{
|
||||
"ext": {
|
||||
"heop.status": "accepted",
|
||||
"heop.prior_state": "autonomous",
|
||||
"heop.current_state": "stopped",
|
||||
"heop.effective_at": "2026-03-01T12:00:00.123Z"
|
||||
}
|
||||
}
|
||||
~~~
|
||||
{: #fig-ack-ect title="Acknowledgment ECT Extension Claims"}
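The override and acknowledgment claims can be assembled as shown in the sketch below. It builds unsigned claim sets only; signing and serialization as an ECT are omitted. The function names are hypothetical, and representing `par` as a list (empty when the override is manually initiated) is an assumption about the ECT format.

```python
import time
import uuid
from typing import Optional


def override_ect_claims(level: int, reason: str, operator_id: str,
                        scope="*", trigger_jti: Optional[str] = None,
                        now: Optional[float] = None) -> dict:
    """Claims for an override ECT; `par` links to the HITL trigger if any."""
    return {
        "jti": str(uuid.uuid4()),
        "iat": int(time.time() if now is None else now),
        "exec_act": "heop:override",
        "par": [trigger_jti] if trigger_jti else [],
        "ext": {
            "heop.level": level,
            "heop.reason": reason,
            "heop.operator_id": operator_id,
            "heop.scope": scope,
        },
    }


def ack_ect_claims(override_claims: dict, status: str, prior_state: str,
                   current_state: str, effective_at: str,
                   now: Optional[float] = None) -> dict:
    """Claims for the agent's acknowledgment ECT, parented on the override."""
    return {
        "jti": str(uuid.uuid4()),
        "iat": int(time.time() if now is None else now),
        "exec_act": "heop:ack",
        "par": [override_claims["jti"]],
        "ext": {
            "heop.status": status,
            "heop.prior_state": prior_state,
            "heop.current_state": current_state,
            "heop.effective_at": effective_at,
        },
    }
```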
|
||||
|
||||
## Decision Record Alignment
|
||||
|
||||
The override/ack ECT pair serves as the ACP-DAG-HITL Decision
|
||||
Record {{I-D.nennemann-agent-dag-hitl-safety}}. The required
|
||||
Decision Record fields map as follows:
|
||||
|
||||
| Decision Record field | ECT source |
|
||||
|----------------------|------------|
|
||||
| `decision_id` | Override ECT `jti` |
|
||||
| `token_jti` | HITL trigger ECT `jti` (from `par`) |
|
||||
| `rule_ids` | From HITL trigger context |
|
||||
| `human_id` | `heop.operator_id` |
|
||||
| `human_role` | From operator JWT claims |
|
||||
| `decision` | Derived from `heop.level` |
|
||||
| `time` | Override ECT `iat` |
|
||||
{: #fig-decision-record title="Decision Record Mapping"}
|
||||
|
||||
At L3, both ECTs are recorded in the audit ledger, providing a
|
||||
tamper-evident record of every human intervention.
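The field mapping above can be sketched as a projection from the override ECT. Several details are assumptions and are marked in the code: the `role` claim name in the operator JWT, the list form of `par`, and the labels used to derive `decision` from `heop.level` (the text does not specify the derivation).

```python
# Assumed labels: the text only says `decision` is derived from `heop.level`.
LEVEL_TO_DECISION = {1: "pause", 2: "constrain", 3: "abort", 4: "escalate"}


def decision_record(override_ect: dict, operator_jwt: dict,
                    rule_ids: list) -> dict:
    """Build a Decision Record from an override ECT and its context."""
    ext = override_ect["ext"]
    parents = override_ect.get("par") or []
    return {
        "decision_id": override_ect["jti"],
        # None when the override was manually initiated (no HITL trigger).
        "token_jti": parents[0] if parents else None,
        "rule_ids": list(rule_ids),
        "human_id": ext["heop.operator_id"],
        "human_role": operator_jwt.get("role"),  # claim name is hypothetical
        "decision": LEVEL_TO_DECISION[ext["heop.level"]],
        "time": override_ect["iat"],
    }
```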
|
||||
|
||||
# Agent Compliance Requirements {#compliance}
|
||||
|
||||
Every HEOP-compliant agent MUST:
|
||||
|
||||
1. Implement the `/.well-known/heop/override` endpoint.
|
||||
|
||||
2. Process override commands within 1 second of receipt. The
|
||||
override path MUST be independent of the agent's main
|
||||
processing loop.
|
||||
|
||||
3. Produce an acknowledgment ECT for every override.
|
||||
|
||||
4. If the agent cannot fully comply (e.g., hardware limitation),
|
||||
it MUST respond with `heop.status`: `"partial"` and a
|
||||
description. An agent MUST NOT respond with `"rejected"`.
|
||||
|
||||
5. Expose current override status at:
|
||||
|
||||
~~~
|
||||
GET /.well-known/heop/status
|
||||
~~~
|
||||
|
||||
Response:
|
||||
|
||||
~~~json
|
||||
{
|
||||
"agent_id": "spiffe://example.com/agent/firewall-mgr",
|
||||
"override_active": true,
|
||||
"current_level": 3,
|
||||
"override_ect_jti": "550e8400-e29b-41d4-a716-446655440055",
|
||||
"since": "2026-03-01T12:00:00Z",
|
||||
"operator_id": "spiffe://example.com/human/alice"
|
||||
}
|
||||
~~~
|
||||
{: #fig-status title="Override Status"}
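The acknowledgment and status requirements can be illustrated with a small state holder. This is a sketch under stated assumptions: the class name, the state labels other than `autonomous` and `stopped`, and the `heop.detail` claim used for the partial-compliance description are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional

# Only "autonomous" and "stopped" appear in the text; the rest are assumed.
STATE_BY_LEVEL = {1: "paused", 2: "constrained", 3: "stopped", 4: "takeover"}


class OverrideState:
    def __init__(self, agent_id: str):
        self.agent_id = agent_id
        self.state = "autonomous"
        self.active = None  # details of the active override, if any

    def apply(self, override_jti: str, level: int, operator_id: str,
              fully_applied: bool = True,
              detail: Optional[str] = None) -> dict:
        """Apply an override and return acknowledgment extension claims."""
        prior = self.state
        self.state = STATE_BY_LEVEL[level]
        since = datetime.now(timezone.utc).isoformat()
        self.active = {
            "current_level": level,
            "override_ect_jti": override_jti,
            "since": since,
            "operator_id": operator_id,
        }
        # Never "rejected": partial compliance is reported instead.
        ack = {
            "heop.status": "accepted" if fully_applied else "partial",
            "heop.prior_state": prior,
            "heop.current_state": self.state,
            "heop.effective_at": since,
        }
        if not fully_applied:
            ack["heop.detail"] = detail or "unspecified"  # hypothetical claim
        return ack

    def status(self) -> dict:
        """Body for GET /.well-known/heop/status."""
        body = {"agent_id": self.agent_id,
                "override_active": self.active is not None}
        if self.active:
            body.update(self.active)
        return body
```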
|
||||
|
||||
# Broadcast Overrides {#broadcast}
|
||||
|
||||
For environments with many agents, HEOP supports broadcast. An
|
||||
operator sends a single command to a management endpoint:
|
||||
|
||||
~~~
|
||||
POST /heop/broadcast HTTP/1.1

{
|
||||
"override_id": "urn:uuid:...",
|
||||
"level": 3,
|
||||
"reason": "Coordinated emergency stop",
|
||||
"targets": ["spiffe://example.com/agent/a1", "spiffe://example.com/agent/a2"]
|
||||
}
|
||||
~~~
|
||||
{: #fig-broadcast title="Broadcast Override"}
|
||||
|
||||
The broadcast endpoint produces a parent ECT with
|
||||
`exec_act`: `"heop:broadcast"`, and each per-agent override ECT
|
||||
references it via `par`.
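The parent and per-agent relationship can be sketched as follows. The function name is hypothetical, the claim sets are unsigned, and using the JWT `aud` claim to carry the target agent is an assumption made for illustration.

```python
import uuid


def fan_out(broadcast: dict) -> tuple:
    """Return (parent_claims, per_agent_claims) for a broadcast command."""
    ext = {"heop.level": broadcast["level"],
           "heop.reason": broadcast["reason"]}
    parent = {
        "jti": str(uuid.uuid4()),
        "exec_act": "heop:broadcast",
        "par": [],
        "ext": dict(ext),
    }
    children = [
        {
            "jti": str(uuid.uuid4()),
            "exec_act": "heop:override",
            "par": [parent["jti"]],  # each override references the broadcast
            "aud": target,           # assumed way to name the target agent
            "ext": dict(ext),
        }
        for target in broadcast["targets"]
    ]
    return parent, children
```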
|
||||
|
||||
# Dead Man's Switch {#dead-mans-switch}
|
||||
|
||||
Agents SHOULD support a heartbeat-based safety net: the agent
|
||||
periodically pings an operator heartbeat endpoint. If the
|
||||
heartbeat is missed for a configurable duration, the agent
|
||||
automatically enters Level 1 (PAUSE) and produces a
|
||||
self-override ECT with `exec_act`: `"heop:dead_mans_switch"`.
|
||||
|
||||
This provides safety when network connectivity to the operator
|
||||
is lost.
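A heartbeat watchdog of this kind might be structured as in the sketch below. The class name, the injectable clock, and the claim layout of the self-override are assumptions. Recovery after the switch trips is not modeled, since the text does not describe it.

```python
import time


class DeadMansSwitch:
    def __init__(self, timeout_seconds: float, clock=time.monotonic):
        self.timeout = timeout_seconds
        self.clock = clock
        self.last_ok = clock()
        self.tripped = False

    def heartbeat_ok(self) -> None:
        """Record a successful ping of the operator heartbeat endpoint."""
        self.last_ok = self.clock()

    def check(self):
        """Return self-override claims the first time the timeout elapses.

        Returns None while the heartbeat is healthy or after tripping.
        """
        if self.tripped or self.clock() - self.last_ok < self.timeout:
            return None
        self.tripped = True
        return {
            "exec_act": "heop:dead_mans_switch",
            "par": [],
            "ext": {"heop.level": 1,  # Level 1 (PAUSE)
                    "heop.reason": "operator heartbeat missed"},
        }
```

An agent would call `check()` from a timer that runs independently of its main processing loop, so that the pause still triggers when the main loop is stalled.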
|
||||
|
||||
# Security Considerations
|
||||
|
||||
Override commands are high-privilege operations. All override
|
||||
endpoints MUST require authentication via signed JWTs with the
|
||||
`heop_override` scope. The JWT MUST include the operator's
identity and a timestamp, and MUST be signed using an asymmetric
algorithm.
|
||||
|
||||
Override commands MUST be transmitted over TLS 1.3.
|
||||
|
||||
To prevent replay, agents MUST reject overrides with timestamps
|
||||
more than 30 seconds in the past. The `override_id` MUST be
|
||||
unique; agents MUST reject duplicates.
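The two replay checks can be combined in one guard, sketched below. Which timestamp is compared (for example the override ECT's `iat`) is an assumption, as are the class and method names.

```python
import time
from typing import Optional

MAX_AGE_SECONDS = 30  # from the text


class ReplayGuard:
    def __init__(self):
        self._seen = set()  # override_ids already accepted

    def accept(self, override_id: str, issued_at: float,
               now: Optional[float] = None) -> bool:
        """Return True only for a fresh, never-seen override."""
        now = time.time() if now is None else now
        if now - issued_at > MAX_AGE_SECONDS:
            return False  # older than 30 seconds
        if override_id in self._seen:
            return False  # duplicate override_id
        self._seen.add(override_id)
        return True
```

This sketch keeps every accepted `override_id` in memory; how long identifiers are retained is a deployment decision the text leaves open.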
|
||||
|
||||
Deployments SHOULD implement multi-operator approval for Level 4
|
||||
(TAKEOVER), requiring two independent operator JWTs.
|
||||
|
||||
The override endpoint SHOULD be served on a separate port or
|
||||
network interface from the agent's main API to ensure availability
|
||||
during overload.
|
||||
|
||||
The ECT DAG provides tamper-evident audit of all overrides. At
|
||||
L3, the audit ledger prevents override records from being deleted
|
||||
or modified after the fact.
|
||||
|
||||
# IANA Considerations
|
||||
|
||||
This document requests the following IANA registrations:
|
||||
|
||||
1. Well-known URI registrations for `heop/override`,
|
||||
`heop/resume`, `heop/lift`, and `heop/status` per {{RFC8615}}.
|
||||
|
||||
2. Registration of `exec_act` values `heop:override`, `heop:ack`,
|
||||
`heop:broadcast`, `heop:dead_mans_switch` in a future ECT
|
||||
action type registry.
|
||||
|
||||
3. Registration of the `heop_override` OAuth scope.
|
||||
|
||||
--- back
|
||||
|
||||
# Acknowledgments
|
||||
{:numbered="false"}
|
||||
|
||||
This document is the runtime enforcement companion to the Agent
|
||||
Context Policy Token {{I-D.nennemann-agent-dag-hitl-safety}},
|
||||
which defines the HITL policy language, and builds on the
|
||||
Execution Context Token {{I-D.nennemann-wimse-ect}} for
|
||||
audit and tracing.
|
||||
@@ -0,0 +1,307 @@
|
||||
Internet-Draft AI/Agent WG
|
||||
Intended status: Standards Track March 2026
|
||||
Expires: September 15, 2026
|
||||
|
||||
|
||||
Human Emergency Override Protocol (HEOP)
|
||||
draft-heop-human-emergency-override-00
|
||||
|
||||
Abstract
|
||||
|
||||
This document defines the Human Emergency Override Protocol
|
||||
(HEOP), a standard mechanism for human operators to intervene
|
||||
in autonomous AI agent operations during critical situations.
|
||||
Current IETF drafts include 60 autonomous operations proposals
|
||||
but only 22 addressing human-agent interaction, with none
|
||||
defining emergency override procedures. HEOP specifies four
|
||||
escalating override levels (pause, constrain, stop, takeover),
|
||||
a mandatory agent compliance interface, and acknowledgment
|
||||
semantics that ensure overrides are received and acted upon.
|
||||
The protocol is intentionally minimal: a single HTTP endpoint
|
||||
per agent, four command types, and deterministic agent
|
||||
behavior for each.
|
||||
|
||||
Status of This Memo
|
||||
|
||||
This Internet-Draft is submitted in full conformance with the
|
||||
provisions of BCP 78 and BCP 79.
|
||||
|
||||
This document is intended to have Standards Track status.
|
||||
Distribution of this memo is unlimited.
|
||||
|
||||
Table of Contents
|
||||
|
||||
1. Introduction
|
||||
2. Terminology
|
||||
3. Problem Statement
|
||||
4. Override Levels
|
||||
5. Override Command Format
|
||||
6. Agent Compliance Requirements
|
||||
7. Override Management Interface
|
||||
8. Security Considerations
|
||||
9. IANA Considerations
|
||||
|
||||
1. Introduction
|
||||
|
||||
As AI agents gain autonomy in critical infrastructure, the
|
||||
ability for humans to intervene quickly and reliably becomes
|
||||
essential. The current ratio of autonomous capability drafts
|
||||
to human oversight drafts in the IETF is roughly 7:1, creating
|
||||
an asymmetry where agents can act but humans cannot reliably
|
||||
stop them.
|
||||
|
||||
HEOP draws inspiration from industrial safety systems: the
|
||||
emergency stop (e-stop) button on factory equipment, the
|
||||
circuit breaker in electrical systems, and the kill switch in
|
||||
robotics. These systems share a design philosophy: the
|
||||
override mechanism must be simpler and more reliable than the
|
||||
system it controls.
|
||||
|
||||
HEOP is deliberately not a governance framework, policy
|
||||
language, or accountability protocol. It is a panic button
|
||||
with a well-defined interface.
|
||||
|
||||
2. Terminology
|
||||
|
||||
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
|
||||
NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
|
||||
"OPTIONAL" in this document are to be interpreted as described
|
||||
in RFC 2119 [RFC2119].
|
||||
|
||||
Override: A human-initiated command that alters an agent's
|
||||
autonomous operation, taking precedence over the agent's own
|
||||
decision-making.
|
||||
|
||||
Operator: A human user authorized to issue override commands
|
||||
to one or more agents.
|
||||
|
||||
Override Level: One of four escalating intervention types,
|
||||
each with deterministic agent behavior requirements.
|
||||
|
||||
3. Problem Statement
|
||||
|
||||
An autonomous network management agent detects what it believes
|
||||
is a DDoS attack and begins blocking traffic. It is wrong —
|
||||
the traffic spike is legitimate (a product launch). The
|
||||
operator sees revenue dropping and needs to stop the agent
|
||||
immediately. Today, the operator must:
|
||||
|
||||
1. Figure out which agent is responsible.
|
||||
2. Find that agent's proprietary management interface.
|
||||
3. Understand its specific stop mechanism (if one exists).
|
||||
4. Hope the agent actually stops.
|
||||
|
||||
There is no standard for any of these steps. HEOP addresses
|
||||
steps 2-4 by defining a universal override interface that all
|
||||
agents MUST implement.
|
||||
|
||||
4. Override Levels
|
||||
|
||||
HEOP defines four override levels, each more restrictive than
|
||||
the last:
|
||||
|
||||
Level 1 — PAUSE
|
||||
The agent MUST suspend all autonomous actions and hold its
|
||||
current state. It MUST NOT initiate new actions but MAY
|
||||
complete actions already in progress if stopping them mid-
|
||||
execution would cause more harm (e.g., an in-flight database
|
||||
transaction). The agent MUST resume normal operation when a
|
||||
RESUME command is received.
|
||||
|
||||
Level 2 — CONSTRAIN
|
||||
The agent MUST restrict its actions to a specified subset.
|
||||
The override command includes an allowlist of permitted action
|
||||
types. The agent MUST reject any action not on the allowlist.
|
||||
This enables operators to let the agent continue operating in
|
||||
a limited, safe capacity.
|
||||
|
||||
Level 3 — STOP
|
||||
The agent MUST immediately cease all autonomous actions,
|
||||
abandon in-progress actions where safe to do so, and enter an
|
||||
inert state. It MUST NOT take any autonomous actions until
|
||||
explicitly restarted by an operator. This is the equivalent
|
||||
of an e-stop.
|
||||
|
||||
Level 4 — TAKEOVER
|
||||
The agent MUST transfer operational control to the human
|
||||
operator. It enters a pass-through mode where it executes
|
||||
only explicit operator commands and takes no autonomous
|
||||
actions. The agent's sensors and outputs remain available to
|
||||
the operator as tools.
|
||||
|
||||
5. Override Command Format
|
||||
|
||||
Override commands are sent as HTTP POST requests to the agent's
|
||||
well-known override endpoint:
|
||||
|
||||
POST /.well-known/heop/override HTTP/1.1
|
||||
Content-Type: application/json
|
||||
Authorization: Bearer <operator-jwt>
|
||||
|
||||
{
|
||||
"override_id": "urn:uuid:...",
|
||||
"level": 3,
|
||||
"reason": "Agent blocking legitimate traffic",
|
||||
"operator_id": "urn:uuid:...",
|
||||
"timestamp": "2026-03-01T12:00:00Z",
|
||||
"scope": "*",
|
||||
"constraints": null,
|
||||
"ttl": null
|
||||
}
|
||||
|
||||
Field definitions:
|
||||
|
||||
"level": Integer 1-4, corresponding to the override levels in
|
||||
Section 4. MUST be present.
|
||||
|
||||
"reason": Human-readable text. MUST be present and MUST be
|
||||
logged by the agent.
|
||||
|
||||
"scope": Which of the agent's functions to override. "*" means
|
||||
all functions. MAY be a list of function identifiers for
|
||||
partial overrides.
|
||||
|
||||
"constraints": For Level 2 only. A JSON array of permitted
|
||||
action types, e.g., ["read", "monitor", "report"].
|
||||
|
||||
"ttl": Optional duration in seconds. If set, the override
|
||||
automatically expires after this duration and the agent
|
||||
resumes its prior operating mode. If null, the override
|
||||
persists until explicitly lifted.
|
||||
|
||||
To resume from Level 1 (PAUSE):
|
||||
|
||||
POST /.well-known/heop/resume HTTP/1.1
|
||||
Authorization: Bearer <operator-jwt>
|
||||
|
||||
{"override_id": "urn:uuid:...", "operator_id": "urn:uuid:..."}
|
||||
|
||||
To lift any override:
|
||||
|
||||
POST /.well-known/heop/lift HTTP/1.1
|
||||
Authorization: Bearer <operator-jwt>
|
||||
|
||||
{"override_id": "urn:uuid:...", "operator_id": "urn:uuid:..."}
|
||||
|
||||
6. Agent Compliance Requirements
|
||||
|
||||
Every HEOP-compliant agent MUST:
|
||||
|
||||
1. Implement the /.well-known/heop/override endpoint.
|
||||
|
||||
2. Process override commands within 1 second of receipt.
|
||||
The override path MUST be independent of the agent's main
|
||||
processing loop to ensure responsiveness even when the
|
||||
agent is under heavy load or in a failure state.
|
||||
|
||||
3. Acknowledge every override with an HTTP response:
|
||||
|
||||
200 OK:
|
||||
{
|
||||
"override_id": "urn:uuid:...",
|
||||
"status": "accepted",
|
||||
"effective_at": "2026-03-01T12:00:00.123Z",
|
||||
"prior_state": "autonomous",
|
||||
"current_state": "stopped"
|
||||
}
|
||||
|
||||
4. Log all overrides, including the full command, timestamp,
|
||||
operator identity, and agent state before and after.
|
||||
|
||||
5. If the agent cannot fully comply (e.g., hardware limitation), it
|
||||
MUST respond with status "partial" and a description of
|
||||
what it could and could not do. An agent MUST NOT respond
|
||||
with "rejected" — overrides are mandatory.
|
||||
|
||||
6. Expose current override status at:
|
||||
|
||||
GET /.well-known/heop/status
|
||||
|
||||
{
|
||||
"agent_id": "urn:uuid:...",
|
||||
"override_active": true,
|
||||
"current_level": 3,
|
||||
"override_id": "urn:uuid:...",
|
||||
"since": "2026-03-01T12:00:00Z",
|
||||
"operator_id": "urn:uuid:..."
|
||||
}
|
||||
|
||||
7. Override Management Interface
|
||||
|
||||
For environments with many agents, HEOP supports broadcast
|
||||
overrides. An operator MAY send a single override command to
|
||||
a management endpoint that fans out to multiple agents:
|
||||
|
||||
POST /heop/broadcast HTTP/1.1
|
||||
|
||||
{
|
||||
"override_id": "urn:uuid:...",
|
||||
"level": 3,
|
||||
"reason": "Coordinated emergency stop",
|
||||
"targets": ["urn:uuid:agent-1", "urn:uuid:agent-2"],
|
||||
"operator_id": "urn:uuid:..."
|
||||
}
|
||||
|
||||
The broadcast endpoint MUST return per-agent results:
|
||||
|
||||
{
|
||||
"results": [
|
||||
{"agent_id": "urn:uuid:agent-1", "status": "accepted"},
|
||||
{"agent_id": "urn:uuid:agent-2", "status": "accepted"}
|
||||
],
|
||||
"failed": []
|
||||
}
|
||||
|
||||
For maximum reliability, operators SHOULD also implement a
|
||||
dead man's switch: agents periodically ping an operator
|
||||
heartbeat endpoint, and if the heartbeat is missed for a
|
||||
configurable duration, the agent automatically enters Level 1
|
||||
(PAUSE). This provides a safety net when network connectivity
|
||||
to the operator is lost.
|
||||
|
||||
8. Security Considerations
|
||||
|
||||
Override commands are high-privilege operations. All override
|
||||
endpoints MUST require authentication via mutual TLS or signed
|
||||
JWTs issued by a trusted operator identity provider.
|
||||
|
||||
The JWT MUST include the operator's identity, a timestamp, and
|
||||
the "heop_override" scope. Agents MUST verify JWT signatures
|
||||
and reject expired tokens.
|
||||
|
||||
Override commands MUST be transmitted over TLS 1.3 [RFC8446].
|
||||
|
||||
To prevent override replay attacks, agents MUST reject
|
||||
override commands with timestamps more than 30 seconds in the
|
||||
past. The override_id MUST be unique; agents MUST reject
|
||||
duplicate override_ids.
|
||||
|
||||
Rogue operators are mitigated through the operator identity
|
||||
framework. Deployments SHOULD implement multi-operator
|
||||
approval for Level 4 (TAKEOVER) overrides, requiring two
|
||||
independent operator JWTs.
|
||||
|
||||
The override mechanism itself MUST be resistant to denial of
|
||||
service. The override endpoint SHOULD be served on a
|
||||
separate port or network interface from the agent's main
|
||||
API to ensure availability during agent overload conditions.
|
||||
|
||||
9. IANA Considerations
|
||||
|
||||
This document requests that IANA establish the following:
|
||||
|
||||
1. A well-known URI registration for "heop/override",
|
||||
"heop/resume", "heop/lift", and "heop/status" per
|
||||
RFC 8615.
|
||||
|
||||
2. A "HEOP Override Level" registry under Standards Action
|
||||
policy. Initial entries: 1 (PAUSE), 2 (CONSTRAIN),
|
||||
3 (STOP), 4 (TAKEOVER).
|
||||
|
||||
3. Registration of the "heop_override" OAuth scope in the
|
||||
OAuth Parameters registry.
|
||||
|
||||
Author's Address
|
||||
|
||||
Generated by IETF Draft Analyzer
|
||||
2026-03-01
|
||||
598
workspace/drafts/new-drafts/generated-draft.txt
Normal file
@@ -0,0 +1,598 @@
|
||||
Internet-Draft AI/Agent WG
|
||||
Intended status: Standards Track March 2026
|
||||
Expires: September 02, 2026
|
||||
|
||||
|
||||
Agent Behavior Verification Protocol (ABVP) for Runtime Compliance Validation
|
||||
draft-ai-agent-behavior-verification-protocol-00
|
||||
|
||||
Abstract
|
||||
|
||||
This document defines the Agent Behavior Verification Protocol
|
||||
(ABVP), a standardized framework for continuously validating that
|
||||
deployed AI agents operate according to their declared policies
|
||||
and specifications. As autonomous agents become increasingly
|
||||
prevalent in critical systems, there is a growing gap between
stated agent capabilities and verified runtime behavior.
ABVP provides mechanisms for real-time behavior
|
||||
monitoring, policy compliance validation, and cryptographic
|
||||
attestation of agent actions against predefined behavioral
|
||||
specifications. The protocol defines a verification architecture
|
||||
that includes behavior witnesses, compliance checkers, and
|
||||
attestation chains to ensure agents maintain fidelity to their
|
||||
declared operational parameters. ABVP integrates with existing
|
||||
agent accountability frameworks while providing specific
|
||||
mechanisms for runtime verification, behavioral drift detection,
|
||||
and compliance reporting. This specification addresses the
|
||||
critical need for trustworthy agent deployment by enabling
|
||||
operators to continuously verify agent behavior matches stated
|
||||
policies throughout the agent lifecycle.
|
||||
|
||||
Status of This Memo
|
||||
|
||||
This Internet-Draft is submitted in full conformance with the
|
||||
provisions of BCP 78 and BCP 79.
|
||||
|
||||
This document is intended to have Standards Track status.
|
||||
Distribution of this memo is unlimited.
|
||||
|
||||
Table of Contents
|
||||
|
||||
1. Introduction
2. Terminology
3. Problem Statement
4. Agent Behavior Verification Architecture
5. Behavior Specification Format
6. Runtime Verification Protocol
7. Compliance Reporting and Attestation
8. Security Considerations
9. IANA Considerations
|
||||
|
||||
1. Introduction

The proliferation of autonomous AI agents in critical
infrastructure, financial systems, and decision-making processes
has created an urgent need for continuous verification that these
agents operate according to their declared policies and behavioral
specifications. Traditional approaches to agent deployment rely on
pre-deployment testing and static policy validation, which fail to
address the dynamic nature of agent behavior in production
environments. As agents adapt, learn, and respond to changing
conditions, their actual runtime behavior may diverge
significantly from their original specifications, creating
security vulnerabilities, compliance violations, and operational
risks that remain undetected until system failures occur.

Existing agent accountability frameworks primarily focus on post-
hoc analysis and audit trails, providing limited capability for
real-time behavior verification and immediate detection of policy
violations. This reactive approach is insufficient for autonomous
systems that make critical decisions with limited human oversight,
where behavioral drift or policy violations can have immediate and
severe consequences. Current verification methodologies also lack
standardized protocols for expressing behavioral constraints in
machine-verifiable formats, making it difficult to establish
consistent compliance validation across diverse agent
implementations and deployment environments.

The gap between an agent's declared capabilities and its verified
runtime behavior represents a fundamental trust problem in
autonomous systems deployment. Organizations deploying AI agents
face significant challenges in ensuring that agents continue to
operate within specified parameters throughout their operational
lifecycle, particularly as agents encounter novel situations not
covered in initial testing scenarios. This verification gap
undermines confidence in agent reliability and limits the adoption
of autonomous systems in high-stakes environments where
conformance to declared behavior is critical for safety, security,
and regulatory compliance.

The Agent Behavior Verification Protocol (ABVP) addresses these
challenges by providing a standardized framework for continuous
runtime verification of agent behavior against declared
specifications. ABVP enables real-time monitoring of agent
actions, automated compliance checking against behavioral
policies, and cryptographic attestation of verification results to
establish trust chains for agent operation validation. The
protocol is designed to integrate with existing agent
architectures while providing mechanisms for detecting behavioral
drift, validating policy adherence, and generating verifiable
evidence of agent compliance throughout the operational lifecycle.
This specification defines the core protocol mechanisms, message
formats, and verification procedures necessary to implement
trustworthy agent behavior validation in production deployments.
|
||||
|
||||
2. Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL"
in this document are to be interpreted as described in RFC 2119
[RFC2119].

Agent: An autonomous software entity that performs actions or
makes decisions according to defined policies and specifications.
In the context of ABVP, an agent is a system whose runtime
behavior requires continuous verification against its declared
operational parameters and behavioral constraints.

Behavior Specification: A formally defined set of policies,
constraints, and operational parameters that describe the expected
and permitted actions of an agent. A behavior specification MUST
be machine-readable and verifiable, containing sufficient detail
to enable automated compliance checking during agent runtime
operation.

Behavior Witness: A system component or external entity that
observes and records agent actions for verification purposes. A
behavior witness MUST provide cryptographically signed
attestations of observed agent behavior and MAY operate
independently of the agent being monitored to ensure verification
integrity.

Compliance Validation: The process of evaluating agent runtime
behavior against its declared behavior specification to determine
conformance. Compliance validation encompasses real-time
monitoring, policy checking, and the generation of verification
results that attest to agent adherence to specified behavioral
constraints.

Verification Attestation: A cryptographically signed statement
that asserts the compliance status of an agent's behavior relative
to its specification during a defined time period. Verification
attestations MUST include sufficient detail to enable third-party
validation and SHOULD reference the specific behavior
specification version and verification criteria used in the
assessment.

Behavioral Drift: The phenomenon where an agent's actual runtime
behavior gradually diverges from its declared specification over
time, either due to learning adaptations, environmental changes,
or system degradation. ABVP mechanisms MUST be capable of
detecting behavioral drift and reporting deviations from
established behavioral baselines.
|
||||
|
||||
3. Problem Statement

The proliferation of autonomous AI agents in critical
infrastructure, financial systems, and safety-critical
applications has created an urgent need for continuous
verification that deployed agents operate within their declared
behavioral boundaries. Current agent deployment practices rely
primarily on pre-deployment testing and static policy
declarations, creating a significant verification gap between an
agent's stated capabilities and constraints and its actual runtime
behavior. This gap becomes particularly problematic as agents
adapt their behavior through learning mechanisms, interact with
dynamic environments, or experience gradual behavioral drift due
to model degradation or adversarial influences.

Traditional software verification approaches are insufficient for
autonomous agents because agent behavior is often non-
deterministic, context-dependent, and may evolve over time through
machine learning processes. Unlike conventional software systems
where behavior can be predicted from code analysis, agent systems
exhibit emergent behaviors that arise from complex interactions
between training data, environmental inputs, and decision-making
algorithms. The absence of standardized mechanisms for expressing
machine-verifiable behavioral specifications further complicates
runtime verification, as operators lack a common framework for
defining what constitutes compliant agent behavior and how
compliance can be automatically validated.

The security and trust implications of unverified agent behavior
are substantial, particularly in scenarios where agents operate
with elevated privileges or make decisions affecting human safety
or economic systems. Behavioral drift, where an agent's actions
gradually deviate from intended policies, may go undetected for
extended periods without continuous verification mechanisms.
Similarly, adversarial attacks that subtly modify agent behavior
to achieve malicious objectives could remain unnoticed in systems
that lack real-time compliance monitoring. The inability to
provide cryptographic attestations of agent behavior compliance
also prevents the establishment of trust chains necessary for
multi-agent systems or cross-organizational agent interactions.

Current accountability frameworks for AI systems focus primarily
on explainability and audit trails but do not provide mechanisms
for real-time verification of behavioral compliance against
formally specified policies. This creates operational risks where
agents may violate their declared constraints without immediate
detection, potentially causing system failures, security breaches,
or regulatory violations. The lack of standardized verification
protocols also prevents interoperability between different agent
verification systems and limits the ability to establish industry-
wide trust frameworks for autonomous agent deployment.
|
||||
|
||||
4. Agent Behavior Verification Architecture

The ABVP architecture consists of four primary components that
work together to provide continuous runtime verification of agent
behavior: Agent Runtime Environments (AREs), Behavior Verification
Nodes (BVNs), Attestation Authorities (AAs), and Verification
Clients (VCs). Agent Runtime Environments host the deployed agents
and MUST implement behavior monitoring capabilities that capture
relevant behavioral data and forward it to designated Behavior
Verification Nodes. These environments MUST provide secure
isolation between the agent execution context and the monitoring
subsystem to prevent agents from interfering with their own
verification processes. The ARE MUST also implement a trusted
communication channel to BVNs using protocols such as TLS 1.3
[RFC8446] or QUIC [RFC9000] to ensure behavior data integrity
during transmission.

Behavior Verification Nodes serve as the core verification engines
within the ABVP architecture and MUST implement the runtime
verification protocol defined in Section 6. Each BVN maintains a
repository of behavior specifications for agents under its
verification authority and continuously processes behavioral
evidence received from AREs. BVNs MUST validate incoming behavior
data against the appropriate specifications and generate
compliance assessments in real time. Multiple BVNs MAY collaborate
in a distributed verification network to provide redundancy and
prevent single points of failure. When operating in a distributed
configuration, BVNs MUST implement consensus mechanisms to ensure
consistent verification results across the network. BVNs MUST also
implement rate limiting and resource management to handle high-
volume verification requests without compromising verification
quality.

Attestation Authorities provide cryptographic attestation services
for verified behavior compliance and MUST maintain secure key
management infrastructure capable of generating unforgeable
attestations. AAs receive compliance reports from BVNs and MUST
verify the authenticity and integrity of these reports before
issuing attestations. The AA MUST implement a hierarchical trust
model where attestations can be validated through a chain of trust
extending to a root certificate authority. AAs SHOULD use
hardware security modules (HSMs) or equivalent trusted execution
environments to protect attestation signing keys from compromise.
Multiple AAs MAY participate in cross-attestation relationships to
provide attestation redundancy and prevent single points of trust
failure.

Verification Clients represent entities that consume ABVP
attestations to make trust decisions about agent behavior and MAY
include system operators, regulatory bodies, or other automated
systems. VCs MUST implement attestation verification capabilities
including certificate chain validation and revocation checking as
specified in Section 7. The architecture MUST support both real-
time verification queries and batch verification processes to
accommodate different operational requirements. VCs SHOULD
implement local attestation caching with appropriate cache
invalidation mechanisms to reduce verification latency while
maintaining attestation freshness. The ABVP architecture MUST
provide clear separation of duties between verification components
to prevent conflicts of interest and ensure independent
verification processes.
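The local attestation caching recommended for VCs can be pictured
with a small time-bounded cache. The following Python is a minimal
sketch only: the class name AttestationCache, the per-agent keying,
and the fixed time-to-live are assumptions made for illustration
and are not defined by ABVP.

```python
import time

class AttestationCache:
    """Hypothetical VC-side cache that expires entries after ttl_s seconds."""

    def __init__(self, ttl_s, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock      # injectable so expiry can be tested
        self._entries = {}      # agent id -> (stored_at, attestation)

    def put(self, agent_id, attestation):
        self._entries[agent_id] = (self.clock(), attestation)

    def get(self, agent_id):
        entry = self._entries.get(agent_id)
        if entry is None:
            return None
        stored_at, attestation = entry
        if self.clock() - stored_at > self.ttl_s:
            del self._entries[agent_id]  # stale: caller must fetch a fresh one
            return None
        return attestation

    def invalidate(self, agent_id):
        """Drop an entry early, for example after a revocation notice."""
        self._entries.pop(agent_id, None)
```

A shorter time-to-live favors freshness over latency. A cached
attestation would still need its signature and revocation status
checked as described in Section 7.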
|
||||
|
||||
The communication between architectural components MUST follow the
protocol specifications defined in Section 6, with all inter-
component communications authenticated and encrypted. The
architecture MUST support both synchronous and asynchronous
verification modes to accommodate different agent deployment
scenarios and performance requirements. Components MUST implement
appropriate logging and audit trail capabilities to support
forensic analysis and compliance reporting. The overall
architecture SHOULD be designed for horizontal scalability to
support large-scale agent deployments while maintaining
verification performance and reliability.
|
||||
|
||||
5. Behavior Specification Format

This section defines the standardized format for expressing agent
behavioral policies and constraints within the ABVP framework. The
behavior specification format enables machine-readable policy
declarations that can be automatically verified during agent
runtime. All behavior specifications MUST be expressed in a
structured format that supports both human readability and
automated processing by verification systems.

The core behavior specification is structured as a JSON document
conforming to the ABVP Behavior Schema. Each specification MUST
contain a policy declaration section, verification parameters, and
compliance thresholds. The policy declaration section includes
behavioral constraints expressed as logical predicates, allowed
action sets, and resource utilization bounds. Verification
parameters specify the monitoring frequency, sampling rates, and
attestation requirements for each declared behavior. Compliance
thresholds define the acceptable deviation ranges and tolerance
levels for measured behaviors compared to declared specifications.
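As a minimal sketch of how such a document might be laid out, the
Python below builds a specification with the three required
sections and checks that none is missing. The key names
(policy_declaration, verification_parameters,
compliance_thresholds) and every value are assumptions made for
illustration; the ABVP Behavior Schema itself is not reproduced in
this section.

```python
import json

# Hypothetical top-level keys; the actual ABVP Behavior Schema may differ.
REQUIRED_SECTIONS = (
    "policy_declaration",
    "verification_parameters",
    "compliance_thresholds",
)

def missing_sections(spec: dict) -> list:
    """Return the required top-level sections absent from a specification."""
    return [name for name in REQUIRED_SECTIONS if name not in spec]

# Illustrative content only: none of these fields or values is normative.
example_spec = {
    "policy_declaration": {
        "constraints": [
            {"property": "response_time_bound", "operator": "less_than", "value": 500}
        ],
        "allowed_actions": ["read_record", "send_report"],
        "resource_bounds": {"max_memory_mb": 512},
    },
    "verification_parameters": {
        "monitoring_interval_s": 10,
        "sampling_rate": 0.25,
        "attestation_required": True,
    },
    "compliance_thresholds": {"max_violation_ratio": 0.01},
}

# The specification is a JSON document, so it should survive a round trip.
encoded = json.dumps(example_spec, sort_keys=True)
decoded = json.loads(encoded)
```

A structural check of this kind only confirms that the sections
are present; validating their contents would require the schema.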
|
||||
|
||||
Behavioral constraints within the specification are expressed
using a formal constraint language based on temporal logic
predicates. Each constraint MUST specify a behavioral property
(such as "response_time_bound" or "resource_utilization_limit"),
an operator (such as "less_than", "equals", or "within_range"),
and target values or ranges. Complex behavioral policies MAY be
constructed using logical operators (AND, OR, NOT) to combine
multiple constraints. The specification format supports
hierarchical constraint groupings to represent different
operational modes or contextual behavior variations.
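The property, operator, and target structure described above,
combined with AND, OR, and NOT, can be sketched as a small
recursive evaluator. The dictionary encoding (the keys "and",
"or", "not", "property", "operator", "value") and the exact
semantics given to each operator are assumptions for illustration.
The sketch evaluates a single observation and does not model the
temporal aspects of the constraint language.

```python
# Operator names come from the text; their exact semantics here are assumed.
OPERATORS = {
    "less_than": lambda observed, target: observed < target,
    "equals": lambda observed, target: observed == target,
    "within_range": lambda observed, target: target[0] <= observed <= target[1],
}

def evaluate(constraint: dict, observation: dict) -> bool:
    """Evaluate one constraint tree against a single observation.

    Composite nodes use the hypothetical keys "and", "or", and "not";
    leaf nodes carry "property", "operator", and "value".
    """
    if "and" in constraint:
        return all(evaluate(c, observation) for c in constraint["and"])
    if "or" in constraint:
        return any(evaluate(c, observation) for c in constraint["or"])
    if "not" in constraint:
        return not evaluate(constraint["not"], observation)
    check = OPERATORS[constraint["operator"]]
    return check(observation[constraint["property"]], constraint["value"])

# Example policy: fast responses AND utilization NOT in the 90-100% band.
policy = {
    "and": [
        {"property": "response_time_bound", "operator": "less_than", "value": 500},
        {"not": {"property": "resource_utilization_limit",
                 "operator": "within_range", "value": [0.9, 1.0]}},
    ]
}
```

Nesting composite nodes inside one another gives one possible
reading of the hierarchical constraint groupings mentioned above.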
|
||||
|
||||
The behavior specification includes a verification requirements
section that defines how each behavioral constraint should be
monitored and validated. This section MUST specify the required
verification frequency, acceptable measurement methods, and
cryptographic attestation parameters for each constraint.
Verification requirements MAY include sampling strategies for
performance-sensitive constraints and continuous monitoring
directives for safety-critical behaviors. The specification format
also supports conditional verification rules that adjust
monitoring parameters based on agent operational context or
detected behavioral patterns.

Each behavior specification MUST include metadata sections
containing versioning information, validity periods, and
specification dependencies. The metadata enables proper
specification lifecycle management and ensures compatibility
between agent deployments and verification infrastructure.
Specifications SHOULD include digital signatures from authorized
policy authors to ensure specification integrity and authenticity.
The format supports specification inheritance and composition,
allowing complex agent policies to be built from validated
behavioral specification components while maintaining verification
traceability throughout the composition hierarchy.
|
||||
|
||||
6. Runtime Verification Protocol

The Runtime Verification Protocol defines the message exchange
patterns and procedures that enable continuous monitoring and
validation of agent behavior against declared specifications. The
protocol operates on a request-response model where Verification
Requesters initiate compliance checks, Behavior Monitors observe
agent actions, and Compliance Checkers evaluate adherence to
behavioral specifications. All protocol participants MUST
implement the core verification message set defined in this
section, and MAY implement optional extensions for specialized
verification scenarios. The protocol is designed to operate over
existing transport mechanisms including HTTP/2 [RFC7540],
WebSocket [RFC6455], or dedicated secure channels established
through TLS 1.3 [RFC8446].

Verification sessions are initiated through a VERIFICATION_REQUEST
message that specifies the agent identifier, behavioral
specification reference, verification scope, and temporal
parameters for the compliance check. The requesting entity MUST
include a cryptographically secure session identifier, timestamp
bounds for the verification window, and references to the specific
behavioral constraints to be validated. Behavior Monitors respond
with MONITORING_DATA messages containing timestamped observations
of agent actions, decision traces, and relevant contextual
information captured during the specified verification window.
These messages MUST include integrity protection through digital
signatures and SHOULD include privacy-preserving mechanisms when
agent actions contain sensitive information.
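The required contents of a VERIFICATION_REQUEST can be sketched as
follows. This section does not give a wire format, so the JSON
encoding, all field names, and the example identifiers are
assumptions for illustration. The session identifier is drawn from
the operating system's cryptographically secure random source via
the Python standard library.

```python
import json
import secrets
import time

def build_verification_request(agent_id, spec_ref, constraint_ids, window_s, now=None):
    """Assemble a VERIFICATION_REQUEST as a plain dict (field names are assumed)."""
    end = int(time.time()) if now is None else int(now)
    return {
        "type": "VERIFICATION_REQUEST",
        "session_id": secrets.token_hex(16),  # 128 random bits, hex encoded
        "agent_id": agent_id,
        "spec_ref": spec_ref,
        "constraints": list(constraint_ids),
        # Timestamp bounds of the verification window, in Unix seconds.
        "window": {"not_before": end - window_s, "not_after": end},
    }

# Hypothetical identifiers, used only to show the call.
request = build_verification_request(
    "agent-42", "example-spec-1", ["response_time_bound"],
    window_s=300, now=1_700_000_000,
)
wire = json.dumps(request)
```

Signing the request and the MONITORING_DATA replies is left out of
this sketch.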
|
||||
|
||||
Compliance evaluation proceeds through COMPLIANCE_CHECK messages
exchanged between Verification Requesters and designated
Compliance Checkers. Each compliance check message MUST reference
the behavioral specification being evaluated, include the
monitoring data to be assessed, and specify the verification
algorithms or rules to be applied. Compliance Checkers process the
monitoring data against the behavioral constraints and generate
COMPLIANCE_RESULT messages indicating whether the observed
behavior satisfies the specified requirements. Results MUST
include binary compliance indicators, detailed violation reports
when non-compliance is detected, and confidence metrics indicating
the reliability of the compliance assessment.

The protocol includes mechanisms for handling streaming
verification scenarios where agent behavior must be validated
continuously rather than in discrete sessions. Streaming
verification employs persistent connections where MONITORING_DATA
messages are transmitted in near real time as agent actions occur,
enabling immediate detection of behavioral deviations. Compliance
Checkers maintain running assessments of behavioral compliance and
generate COMPLIANCE_ALERT messages when violations are detected or
when behavioral patterns indicate potential drift from specified
policies. All streaming verification sessions MUST implement flow
control mechanisms to prevent resource exhaustion and SHOULD
include adaptive sampling techniques to manage verification
overhead in high-throughput scenarios.
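The running assessment kept by a Compliance Checker in streaming
mode can be illustrated with a sliding window over recent results.
The sketch below is one possible approach under stated
assumptions: the class name, the window size, the violation-ratio
threshold, and the alert fields are all hypothetical, and flow
control and adaptive sampling are not modeled.

```python
from collections import deque

class StreamingChecker:
    """Track recent per-observation compliance and flag likely drift."""

    def __init__(self, window_size=5, max_violation_ratio=0.4):
        # A bounded deque discards the oldest result automatically.
        self.recent = deque(maxlen=window_size)
        self.max_violation_ratio = max_violation_ratio

    def observe(self, compliant: bool):
        """Record one result; return a COMPLIANCE_ALERT dict or None."""
        self.recent.append(compliant)
        ratio = self.recent.count(False) / len(self.recent)
        if not compliant:
            # An individual violation is always reported.
            return {"type": "COMPLIANCE_ALERT", "reason": "violation", "ratio": ratio}
        if ratio > self.max_violation_ratio:
            # Compliant now, but the recent pattern suggests drift.
            return {"type": "COMPLIANCE_ALERT", "reason": "drift", "ratio": ratio}
        return None
```

Reporting a drift alert on a compliant observation is a design
choice of this sketch: it lets a pattern of recent violations stay
visible until enough compliant results push it out of the window.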
|
||||
|
||||
Attestation generation occurs through ATTESTATION_REQUEST messages
that trigger the creation of cryptographic proofs of compliance
assessment results. These requests MUST specify the compliance
results to be attested, the cryptographic algorithms to be used
for attestation generation, and any additional claims or
assertions to be included in the attestation. The resulting
ATTESTATION_RESPONSE messages contain digitally signed
attestations that bind compliance results to specific agents, time
periods, and behavioral specifications through tamper-evident
cryptographic structures. Attestations MUST include sufficient
information to enable independent verification of compliance
claims and SHOULD reference the complete verification audit trail
to support forensic analysis when behavioral violations occur.
|
||||
|
||||
7. Compliance Reporting and Attestation

Compliance reporting in ABVP provides a standardized mechanism for
documenting and cryptographically attesting to agent behavior
verification results. A compliance report MUST contain the agent
identifier, verification period, evaluated behavior
specifications, compliance status for each specification, and
supporting evidence including behavioral observations and
verification computations. Reports MUST be generated at
configurable intervals or upon detection of compliance violations,
with emergency reports triggered immediately when critical policy
violations occur. The reporting format MUST support both human-
readable summaries and machine-processable structured data to
enable automated compliance monitoring and audit trail generation.

Cryptographic attestation ensures the integrity and non-
repudiation of compliance reports through digital signatures and
hash chain mechanisms. Each compliance report MUST be digitally
signed by the generating Compliance Checker using keys certified
within the ABVP trust framework. Attestations MUST include a
timestamp from a trusted time source, the hash of the previous
attestation to form a verification chain, and sufficient
cryptographic binding to prevent tampering or replay attacks. The
attestation format SHOULD follow established standards such as RFC
8392 (CWT) or RFC 7519 (JWT) to ensure interoperability with
existing security infrastructures.
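The hash chain described above, in which each attestation carries
the hash of its predecessor, can be sketched with the Python
standard library. This is an illustration under loud assumptions:
the field names and the all-zero genesis value are hypothetical,
and HMAC-SHA-256 stands in for the digital signature only to keep
the example self-contained. HMAC is symmetric and does not provide
non-repudiation, so a real deployment would sign with an
asymmetric key and would more likely carry the result in a CWT or
JWT.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first entry

def _canonical(body: dict) -> bytes:
    # Deterministic encoding so the same body always hashes the same way.
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode()

def make_attestation(report: dict, prev_hash: str, timestamp: int, key: bytes) -> dict:
    """Bind a report to its predecessor and authenticate the result.

    HMAC is a stand-in for a real signature and gives no non-repudiation.
    """
    body = {"report": report, "prev_hash": prev_hash, "timestamp": timestamp}
    tag = hmac.new(key, _canonical(body), hashlib.sha256).hexdigest()
    return {**body, "tag": tag}

def attestation_hash(att: dict) -> str:
    """Hash of a complete attestation, used as the next entry's prev_hash."""
    return hashlib.sha256(_canonical(att)).hexdigest()

def verify_chain(chain: list, key: bytes) -> bool:
    """Check every tag and every link back to GENESIS."""
    prev = GENESIS
    for att in chain:
        body = {k: att[k] for k in ("report", "prev_hash", "timestamp")}
        expected = hmac.new(key, _canonical(body), hashlib.sha256).hexdigest()
        if att["prev_hash"] != prev or not hmac.compare_digest(att["tag"], expected):
            return False
        prev = attestation_hash(att)
    return True
```

Because each entry commits to the one before it, altering or
removing an earlier attestation invalidates every later link.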
|
||||
|
||||
Trust chain establishment requires a hierarchical certification
authority structure where Compliance Checkers obtain certificates
from trusted ABVP Certificate Authorities. Root certificates for
ABVP trust anchors MUST be distributed through secure channels and
updated using standard certificate management practices as defined
in RFC 5280. Verification entities MUST validate the complete
certificate chain from the signing Compliance Checker to a trusted
root before accepting attestations. Certificate revocation MUST be
supported through standard mechanisms such as Certificate
Revocation Lists (CRLs) or Online Certificate Status Protocol
(OCSP) as specified in RFC 5280 and RFC 6960 respectively.

The compliance reporting protocol defines specific message formats
for distributing attestations to interested parties including
agent operators, regulatory authorities, and other verification
systems. Compliance reports MAY be distributed through push
mechanisms to subscribed entities or pulled on-demand through
standardized query interfaces. Report distribution MUST preserve
attestation integrity while allowing for appropriate access
control based on the sensitivity of the reported agent behaviors.
Long-term storage and archival of compliance reports SHOULD
implement tamper-evident logging mechanisms to support forensic
analysis and regulatory compliance requirements.
|
||||
|
||||
8. Security Considerations

The ABVP verification infrastructure introduces several security
considerations that must be addressed to ensure the integrity and
trustworthiness of agent behavior verification. The protocol's
reliance on continuous monitoring and attestation creates
potential attack vectors that could compromise the verification
process itself. Attackers may attempt to subvert verification
mechanisms to mask non-compliant agent behavior or to falsely
indicate compliance violations where none exist. The verification
system MUST be designed with the assumption that both the
monitored agents and the verification infrastructure may be
targets of sophisticated adversaries seeking to undermine
behavioral compliance validation.

Attestation integrity represents a critical security requirement
for ABVP implementations. Verification attestations MUST be
cryptographically signed using mechanisms that provide non-
repudiation and tamper detection capabilities. The attestation
chain MUST be anchored in a trusted root of trust, such as
hardware security modules or trusted platform modules, to prevent
forgery of compliance attestations. Implementations SHOULD employ
time-stamping mechanisms to prevent replay attacks where old
attestations are reused to mask current non-compliance. The
cryptographic algorithms used for attestation signing MUST conform
to current best practices for digital signatures and SHOULD
support algorithm agility to enable updates as cryptographic
standards evolve. Key management for attestation signing MUST
follow established security practices, including regular key
rotation and secure key storage.

The distributed nature of ABVP verification creates additional
security challenges related to verification node compromise and
Byzantine behavior among verification participants. Verification
nodes may be compromised by attackers seeking to manipulate
compliance reporting or inject false verification results.
Implementations MUST employ consensus mechanisms or threshold-
based verification approaches to detect and mitigate the impact of
compromised verification nodes. The protocol SHOULD include
mechanisms for verification node authentication and authorization
to prevent unauthorized participants from joining verification
networks. Network communications between verification components
MUST be encrypted and authenticated to prevent eavesdropping and
man-in-the-middle attacks. Implementations SHOULD implement rate
limiting and anomaly detection to identify potential denial-of-
service attacks against verification infrastructure.
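A threshold-based approach of the kind mentioned above can be
sketched as a quorum rule over the verdicts reported by individual
verification nodes. The rule used here, at least 3f + 1 nodes and
2f + 1 matching verdicts to tolerate f faulty nodes, is the
classical Byzantine fault tolerance bound. It is offered as one
common choice, not as a rule mandated by ABVP, and the function
name and input shape are assumptions.

```python
from collections import Counter

def quorum_verdict(verdicts: dict, max_faulty: int):
    """Combine per-node verdicts (node id -> bool) under a 2f + 1 quorum.

    Returns the agreed verdict, or None when no value reaches the quorum.
    """
    n = len(verdicts)
    if n < 3 * max_faulty + 1:
        raise ValueError("need at least 3f + 1 nodes to tolerate f faulty ones")
    value, count = Counter(verdicts.values()).most_common(1)[0]
    return value if count >= 2 * max_faulty + 1 else None
```

With four nodes and one tolerated fault, three matching verdicts
decide the outcome, and a two-to-two split yields no verdict. This
sketch does not authenticate the nodes; each verdict would need to
arrive over an authenticated channel for the count to mean
anything.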
|
||||
|
||||
Behavioral specification tampering and specification substitution
attacks pose significant threats to the ABVP framework's
effectiveness. Attackers may attempt to modify behavioral
specifications to make non-compliant behavior appear compliant or
to introduce specifications that are impossible to verify
accurately. Behavioral specifications MUST be cryptographically
protected through digital signatures and integrity checking
mechanisms. The protocol MUST include versioning and change
tracking for behavioral specifications to detect unauthorized
modifications. Verification systems SHOULD implement specification
validation to detect specifications that contain logical
inconsistencies or verification bypasses. Access controls for
specification modification MUST follow the principle of least
privilege and include audit logging of all specification changes.

The ABVP verification process may inadvertently expose sensitive
information about agent operations, internal state, or the systems
being monitored. Verification data collection MUST be designed to
minimize information disclosure while maintaining verification
effectiveness. Implementations SHOULD employ privacy-preserving
techniques such as zero-knowledge proofs or selective disclosure
mechanisms where appropriate to limit exposure of sensitive
operational details. Verification logs and attestations MUST be
protected against unauthorized access and SHOULD include data
retention policies that balance verification auditability with
privacy requirements. The protocol MUST consider the implications
of cross-border data flows when verification infrastructure spans
multiple jurisdictions with different privacy regulations.

Side-channel attacks and covert channels represent additional
security considerations for ABVP implementations. The verification
process itself may create observable patterns that could be
exploited by attackers to infer information about agent behavior
or verification outcomes. Timing-based side channels in
verification operations may reveal information about the
complexity or results of compliance checking. Implementations
SHOULD consider countermeasures such as constant-time operations
and traffic analysis resistance where appropriate. The protocol
design MUST consider how verification metadata and communication
patterns might be used to build profiles of agent behavior that
could compromise operational security or reveal sensitive system
characteristics.
|
||||
|
||||
9. IANA Considerations

This document requires the registration of several new namespaces
and protocol parameters with the Internet Assigned Numbers
Authority (IANA). These registrations are necessary to ensure
global uniqueness and interoperability of ABVP implementations
across different vendors and deployment environments.

IANA SHALL establish a new registry titled "Agent Behavior
Verification Protocol (ABVP) Parameters" under the "Structured
Syntax Suffixes" registry group. This registry SHALL contain three
sub-registries: "Behavior Specification Schema Types",
"Verification Message Types", and "Attestation Format
Identifiers". The registration policy for all ABVP parameter sub-
registries SHALL follow the "Specification Required" policy as
defined in RFC 8126, with the additional requirement that all
registrations include a reference to a publicly available
specification document and demonstrate interoperability with at
least one existing ABVP implementation.

The "Behavior Specification Schema Types" sub-registry SHALL
maintain identifiers for standardized behavior specification
formats as defined in Section 5. Each registration MUST include a
unique identifier string, a human-readable description, a
reference specification, and version information. Initial
registrations SHALL include "abvp-policy-v1" for the base policy
specification format and "abvp-constraints-v1" for behavioral
constraint specifications. The "Verification Message Types" sub-
registry SHALL contain identifiers for protocol messages defined
in Section 6, including verification requests, compliance reports,
and attestation messages. Registration entries MUST specify the
message identifier, purpose, required parameters, and applicable
verification contexts.

The "Attestation Format Identifiers" sub-registry SHALL maintain
identifiers for cryptographic attestation formats used in
compliance reporting as specified in Section 7. Each registration
MUST include the attestation format identifier, cryptographic
algorithm requirements, trust model specifications, and
interoperability considerations. IANA SHALL reserve the identifier
prefix "abvp-" for protocol-specific attestation formats and MAY
delegate sub-namespace management to recognized standards bodies
for domain-specific attestation requirements. All registry entries
MUST include contact information for the registrant and SHALL be
subject to periodic review to ensure continued relevance and
security adequacy.
|
||||
|
||||
Author's Address

Generated by IETF Draft Analyzer
2026-03-01