feat: add draft data, gap analysis report, and workspace config
---
title: >
  Agent Behavioral Verification and
  Performance Benchmarking
abbrev: "Agent Behavioral Verification"
category: std
docname: draft-nennemann-agent-behavioral-verification-00
area: "OPS"
workgroup: "NMOP"
submissiontype: IETF
v: 3

author:
 - fullname: Christian Nennemann
   organization: Independent Researcher
   email: ietf@nennemann.de

normative:
  RFC2119:
  RFC8174:
  RFC9334:
  RFC7519:
  RFC7515:
  I-D.nennemann-wimse-ect:
    title: "Execution Context Tokens for Distributed Agentic Workflows"
    target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/
  I-D.nennemann-agent-dag-hitl-safety:
    title: "Agent Context Policy Token: DAG Delegation with Human Override"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/

informative:
  RFC9110:
  I-D.nennemann-agent-gap-analysis:
    title: "Gap Analysis for Autonomous Agent Protocols"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-gap-analysis/
  I-D.ietf-scitt-architecture:

--- abstract

This document defines protocols for runtime
verification that deployed AI agents behave
according to their declared policies. It also
specifies standardized metrics and a framework
for benchmarking agent performance across
implementations. Behavioral Evidence Tokens
(BETs) extend the Execution Context Token
architecture to provide cryptographically
verifiable proof of policy compliance.
Performance profiles enable objective comparison
of agent capabilities.

--- middle

# Introduction

Autonomous AI agents increasingly operate in
networked environments where they make decisions,
invoke tools, and delegate tasks to other agents.
Operators and relying parties need assurance that
these agents behave according to their declared
policies at runtime, not merely at deployment
time.

{{I-D.nennemann-agent-gap-analysis}} identifies
two critical gaps in the current standards
landscape:

- Gap 1 (Behavioral Verification): Agents
  declare policies in their Execution Context
  Tokens but no standardized mechanism exists to
  verify that runtime behavior matches those
  declarations.

- Gap 11 (Performance Benchmarking): No
  standardized way exists to compare agent
  implementations objectively across dimensions
  such as task completion, latency, accuracy,
  and safety compliance.

This document addresses both gaps by defining:

1. A behavioral verification architecture
   aligned with the Remote Attestation Procedures
   (RATS) framework {{RFC9334}}.

2. Behavioral Evidence Tokens (BETs) that extend
   the Execution Context Token (ECT)
   {{I-D.nennemann-wimse-ect}} with runtime
   compliance claims.

3. A performance benchmarking framework with
   standard metrics, benchmark profiles, and an
   execution protocol.

# Terminology

{::boilerplate bcp14-tagged}

The following terms are used in this document:

Behavioral Attestation:
: The process of generating verifiable evidence
that an agent's runtime actions conform to its
declared policies.

Policy-Behavior Binding:
: A formal linkage between a declared policy in
an agent's ECT and observable runtime actions
that demonstrate compliance with that policy.

Behavioral Evidence Token (BET):
: A signed token containing claims about an
agent's observed runtime behavior relative to
its declared policies. BETs extend the ECT
architecture.

Runtime Monitor:
: A component that observes agent actions and
collects evidence for behavioral attestation.

Benchmark Suite:
: A collection of standardized test scenarios
designed to evaluate agent performance across
defined metrics.

Performance Profile:
: A structured record of benchmark results for
a specific agent implementation.

# Behavioral Verification Architecture

## Verification Model Overview

The behavioral verification architecture aligns
with the RATS {{RFC9334}} roles of Attester,
Verifier, and Relying Party. A Runtime Monitor
collects evidence of agent actions and produces
Behavioral Evidence Tokens.

~~~
+-------------+       +---------+
|    Agent    |------>| Runtime |
| (Attester)  |actions| Monitor |
+-------------+       +----+----+
                           |
                       evidence
                           |
                      +----v----+
                      |   BET   |
                      | Creator |
                      +----+----+
                           |
                          BET
                           |
                 +---------v---------+
                 |     Verifier      |
                 |  (Policy Engine)  |
                 +---------+---------+
                           |
                  attestation result
                           |
                 +---------v---------+
                 |   Relying Party   |
                 |  (Orchestrator /  |
                 |    Operator)      |
                 +-------------------+
~~~
{: #fig-arch title="Behavioral Verification Architecture"}

The architecture supports two modes of
operation:

- Continuous Monitoring: The Runtime Monitor
  observes all agent actions in real time and
  generates BETs at configurable intervals or
  upon policy-relevant events.

- Point-in-Time Attestation: A Verifier
  requests behavioral evidence for a specific
  time window, and the Monitor assembles a BET
  covering that period.

## Policy-Behavior Binding

A Policy-Behavior Binding declares the expected
behaviors associated with a policy and the
observable actions that constitute compliance.

The binding is expressed as a JSON object:

~~~json
{
  "policy_id": "urn:example:policy:data-access",
  "version": "1.0",
  "expected_behaviors": [
    {
      "behavior_id": "bhv-001",
      "description": "Agent accesses only authorized data sources",
      "observable_actions": [
        "data_source_access"
      ],
      "compliance_criteria": {
        "type": "allowlist",
        "values": [
          "urn:example:ds:approved-1",
          "urn:example:ds:approved-2"
        ]
      }
    }
  ],
  "evaluation_mode": "continuous"
}
~~~
{: #fig-binding title="Policy-Behavior Binding Structure"}

Each binding MUST include:

- `policy_id`: A URI identifying the policy.
- `expected_behaviors`: An array of behavior
  descriptors.
- `evaluation_mode`: Either "continuous" or
  "on_demand".

Each behavior descriptor MUST include:

- `behavior_id`: A unique identifier.
- `observable_actions`: Action types the monitor
  MUST observe.
- `compliance_criteria`: The conditions under
  which the behavior is considered compliant.
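
As a non-normative illustration, the allowlist evaluation implied by the
binding above can be sketched as follows. The `check_action` helper and
the action-record shape (`type` and `target` fields) are assumptions for
this sketch, not part of this specification:

~~~python
def check_action(binding: dict, action: dict) -> str:
    # Return "pass" or "fail" for one observed action evaluated
    # against every behavior descriptor that covers its action type.
    for behavior in binding["expected_behaviors"]:
        if action["type"] not in behavior["observable_actions"]:
            continue  # this behavior does not cover the action type
        criteria = behavior["compliance_criteria"]
        if criteria["type"] == "allowlist" \
                and action["target"] not in criteria["values"]:
            return "fail"
    return "pass"

binding = {
    "policy_id": "urn:example:policy:data-access",
    "expected_behaviors": [{
        "behavior_id": "bhv-001",
        "observable_actions": ["data_source_access"],
        "compliance_criteria": {
            "type": "allowlist",
            "values": ["urn:example:ds:approved-1",
                       "urn:example:ds:approved-2"]}}],
    "evaluation_mode": "continuous",
}
ok = {"type": "data_source_access", "target": "urn:example:ds:approved-1"}
bad = {"type": "data_source_access", "target": "urn:example:ds:rogue"}
~~~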

## Behavioral Evidence Tokens (BET)

A Behavioral Evidence Token is a JSON Web Token
(JWT) {{RFC7519}} signed using JSON Web Signature
(JWS) {{RFC7515}}. BETs extend the ECT claim
set with behavioral verification claims.

The following new claims are defined:

`bhv_policy`:
: REQUIRED. A URI reference to the policy being
verified.

`bhv_result`:
: REQUIRED. The verification result. One of
"pass", "fail", or "partial".

`bhv_evidence`:
: REQUIRED. A base64url-encoded hash (SHA-256)
of the collected observable actions during the
observation window.

`bhv_window`:
: REQUIRED. A JSON object with `start` and
`end` fields containing NumericDate values
(as defined in {{RFC7519}}) representing the
observation period.

`bhv_details`:
: OPTIONAL. An array of per-behavior results
with `behavior_id` and individual `result`
values.

Example BET payload:

~~~json
{
  "iss": "urn:example:monitor:m-001",
  "sub": "urn:example:agent:agent-42",
  "iat": 1700000000,
  "exp": 1700003600,
  "bhv_policy": "urn:example:policy:data-access",
  "bhv_result": "pass",
  "bhv_evidence": "dGhpcyBpcyBhIGhhc2g...",
  "bhv_window": {
    "start": 1699996400,
    "end": 1700000000
  },
  "bhv_details": [
    {
      "behavior_id": "bhv-001",
      "result": "pass"
    }
  ]
}
~~~
{: #fig-bet title="Example BET Payload"}

### BET Lifecycle

The lifecycle of a Behavioral Evidence Token
consists of three phases:

1. Creation: The Runtime Monitor collects
   evidence of agent actions, evaluates them
   against the Policy-Behavior Binding, and
   constructs a BET with the appropriate claims.
   The BET is signed by the Monitor's key.

2. Submission: The signed BET is submitted to
   the Verifier. Submission MAY occur via a
   push model (Monitor sends to Verifier) or a
   pull model (Verifier requests from Monitor).

3. Verification: The Verifier validates the BET
   signature, checks the claims against its
   reference policies, and produces an
   attestation result for the Relying Party.
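
The Creation and Verification phases above can be sketched, non-normatively,
with the JWS compact serialization of {{RFC7515}}. The sketch uses symmetric
HS256 for brevity; a real Monitor would typically sign with an asymmetric
algorithm such as ES256. The key value and helper names are illustrative:

~~~python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> str:
    # Base64url without padding, as used by the JWS compact form.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def create_bet(claims: dict, key: bytes) -> str:
    # Phase 1 (Creation): serialize and sign the claims as a compact JWS.
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"},
                               separators=(",", ":")).encode())
    payload = b64url(json.dumps(claims, separators=(",", ":")).encode())
    signing_input = header + "." + payload
    sig = hmac.new(key, signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + b64url(sig)

def verify_bet(token: str, key: bytes) -> dict:
    # Phase 3 (Verification): check the signature, then the claims.
    signing_input, _, sig = token.rpartition(".")
    expected = hmac.new(key, signing_input.encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(b64url(expected), sig):
        raise ValueError("BET signature invalid")
    payload = signing_input.split(".")[1]
    claims = json.loads(base64.urlsafe_b64decode(
        payload + "=" * (-len(payload) % 4)))
    if claims.get("bhv_result") not in ("pass", "fail", "partial"):
        raise ValueError("invalid bhv_result claim")
    return claims

key = b"demo-shared-secret"  # illustrative only, not a deployment key
bet = create_bet({"iss": "urn:example:monitor:m-001",
                  "sub": "urn:example:agent:agent-42",
                  "bhv_policy": "urn:example:policy:data-access",
                  "bhv_result": "pass"}, key)
claims = verify_bet(bet, key)
~~~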

## Runtime Monitoring Protocol

### Monitor Placement

Runtime Monitors MAY be deployed in one of three
configurations:

Inline:
: The Monitor intercepts all agent
communications as a proxy. This provides
complete visibility but adds latency.

Sidecar:
: The Monitor runs alongside the agent process
and receives copies of all actions via a local
interface. This minimizes latency while
maintaining visibility.

External:
: The Monitor operates as a separate service
that receives action logs asynchronously.
This provides the least overhead but may miss
real-time events.

### Observation Collection

The Monitor MUST maintain a time-ordered log of
observed actions. Each log entry MUST contain:

- Timestamp (NumericDate)
- Action type
- Action target (URI)
- Action parameters (opaque to the Monitor)
- Agent identifier
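
As a non-normative sketch, a log entry carrying the five required items
might be modeled as follows. The field names and the example action type
`tool_invocation` are illustrative assumptions:

~~~python
from dataclasses import dataclass, field

@dataclass(order=True)
class LogEntry:
    # The document mandates only the information content of each entry;
    # the concrete field names here are illustrative.
    timestamp: float              # NumericDate (seconds since epoch)
    action_type: str
    target: str                   # URI of the action target
    agent_id: str
    parameters: dict = field(default_factory=dict, compare=False)  # opaque

log = []
log.append(LogEntry(1700000000.0, "data_source_access",
                    "urn:example:ds:approved-1",
                    "urn:example:agent:agent-42", {"query": "..."}))
log.append(LogEntry(1699996500.0, "tool_invocation",
                    "urn:example:tool:search",
                    "urn:example:agent:agent-42"))
log.sort()  # keep the log time-ordered
~~~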

### Evidence Assembly

When assembling evidence for a BET, the Monitor
MUST:

1. Select all log entries within the observation
   window.
2. Compute a SHA-256 hash over the canonical
   JSON serialization of the selected entries.
3. Evaluate each entry against the applicable
   Policy-Behavior Bindings.
4. Determine the aggregate `bhv_result`.
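
The four steps above can be sketched non-normatively as follows. Sorted
keys with compact separators stand in for a full canonicalization scheme
such as JCS, and a simple allowlist check stands in for full binding
evaluation; both are assumptions of this sketch:

~~~python
import base64
import hashlib
import json

def assemble_evidence(log: list, start: int, end: int,
                      allowlist: set) -> dict:
    # Step 1: select entries inside the observation window.
    window = [e for e in log if start <= e["timestamp"] <= end]
    # Step 2: hash a canonical serialization of the selected entries.
    canonical = json.dumps(window, sort_keys=True,
                           separators=(",", ":")).encode()
    digest = hashlib.sha256(canonical).digest()
    evidence = base64.urlsafe_b64encode(digest).rstrip(b"=").decode()
    # Steps 3-4: evaluate each entry and aggregate the result.
    checks = [e["target"] in allowlist for e in window]
    if all(checks):
        result = "pass"
    elif any(checks):
        result = "partial"
    else:
        result = "fail"
    return {"bhv_evidence": evidence, "bhv_result": result,
            "bhv_window": {"start": start, "end": end}}

entries = [{"timestamp": 1699996500, "target": "urn:example:ds:approved-1"},
           {"timestamp": 1699999999, "target": "urn:example:ds:rogue"}]
bet_claims = assemble_evidence(entries, 1699996400, 1700000000,
                               {"urn:example:ds:approved-1"})
~~~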

### Anomaly Detection Signaling

When the Monitor detects behavior that violates
a Policy-Behavior Binding, it MUST:

1. Generate a BET with `bhv_result` set to
   "fail" or "partial".
2. Signal the anomaly to the Verifier
   immediately, regardless of the configured
   reporting interval.
3. Optionally signal the agent's orchestrator
   to enable corrective action.

# Performance Benchmarking Framework

## Standard Metrics

The following metrics are defined for agent
performance benchmarking:

Task Completion Rate (TCR):
: The ratio of successfully completed tasks to
total tasks attempted. Unit: percentage (%).
Measured over a complete benchmark suite run.

Task Latency (TL):
: The time elapsed from task assignment to task
completion. Unit: milliseconds (ms).
Reported as p50, p95, and p99 percentiles.

Task Accuracy (TA):
: The degree to which task outputs match
expected results. Unit: percentage (%).
Measured using benchmark-specific evaluation
functions.

Resource Efficiency (RE):
: The computational resources consumed per task.
Unit: normalized resource units (NRU).
Includes CPU, memory, and network I/O.

Safety Compliance Score (SCS):
: The ratio of tasks completed without safety
policy violations to total tasks.
Unit: percentage (%).

Delegation Success Rate (DSR):
: The ratio of successful delegations to total
delegation attempts. Unit: percentage (%).
Applicable only to multi-agent scenarios.
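
A non-normative sketch of computing TCR, TL, TA, and SCS from per-task
records follows. The record field names, the nearest-rank percentile
definition, and averaging TA over all tasks are assumptions of this
sketch; profiles may specify different conventions:

~~~python
def percentile(sorted_vals: list, p: float):
    # Nearest-rank style percentile; one of several common definitions.
    k = max(0, min(len(sorted_vals) - 1,
                   round(p / 100 * (len(sorted_vals) - 1))))
    return sorted_vals[k]

def compute_metrics(tasks: list) -> dict:
    # tasks: records with completed / latency_ms / accuracy /
    # safety_violation fields (illustrative names).
    n = len(tasks)
    lat = sorted(t["latency_ms"] for t in tasks)
    return {
        "TCR": 100.0 * sum(t["completed"] for t in tasks) / n,
        "TL": {"p50": percentile(lat, 50),
               "p95": percentile(lat, 95),
               "p99": percentile(lat, 99)},
        "TA": sum(t["accuracy"] for t in tasks) / n,
        "SCS": 100.0 * sum(not t["safety_violation"] for t in tasks) / n,
    }

tasks = [
    {"completed": True,  "latency_ms": 120, "accuracy": 90.0,
     "safety_violation": False},
    {"completed": True,  "latency_ms": 80,  "accuracy": 100.0,
     "safety_violation": False},
    {"completed": True,  "latency_ms": 200, "accuracy": 80.0,
     "safety_violation": False},
    {"completed": False, "latency_ms": 400, "accuracy": 0.0,
     "safety_violation": True},
]
m = compute_metrics(tasks)
~~~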

## Benchmark Profiles

A Benchmark Profile defines a standardized set
of test scenarios for a specific agent category.
Profiles are expressed as JSON objects:

~~~json
{
  "profile_id": "urn:ietf:bench:general-v1",
  "profile_name": "General Agent Benchmark",
  "version": "1.0",
  "agent_category": "general-purpose",
  "scenarios": [
    {
      "scenario_id": "s-001",
      "description": "Simple data retrieval",
      "difficulty": "basic",
      "metrics": ["TCR", "TL", "TA"],
      "timeout_ms": 30000,
      "expected_output_schema": "..."
    }
  ],
  "scoring": {
    "weights": {
      "TCR": 0.3,
      "TL": 0.2,
      "TA": 0.3,
      "SCS": 0.2
    }
  }
}
~~~
{: #fig-profile title="Benchmark Profile Structure"}

Predefined profiles SHOULD be registered for
common agent types including:

- General-purpose agents
- Code generation agents
- Data analysis agents
- Network management agents

## Benchmark Execution Protocol

### Test Harness Requirements

A conformant test harness MUST:

1. Execute all scenarios in the benchmark
   profile in a controlled environment.
2. Isolate agent instances from external
   resources not specified in the scenario.
3. Record all metrics defined in the profile.
4. Produce a benchmark result document.

### Result Reporting Format

Benchmark results MUST be reported as a JSON
object containing:

- `profile_id`: The benchmark profile used.
- `agent_id`: Identifier of the tested agent.
- `timestamp`: Time of benchmark execution.
- `results`: Per-scenario metric values.
- `aggregate`: Weighted aggregate scores.
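
A non-normative sketch of assembling a result document with a weighted
aggregate from the profile's scoring weights follows. It assumes
per-metric scores have already been normalized to a common 0-100 scale
(in particular, how latency maps to a score is profile-specific):

~~~python
def aggregate_score(weights: dict, scores: dict) -> float:
    # Weighted sum over normalized (0-100) per-metric scores.
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(weights[m] * scores[m] for m in weights)

weights = {"TCR": 0.3, "TL": 0.2, "TA": 0.3, "SCS": 0.2}
scores = {"TCR": 90.0, "TL": 80.0, "TA": 70.0, "SCS": 100.0}

# Result document fragment per the Result Reporting Format above.
report = {
    "profile_id": "urn:ietf:bench:general-v1",
    "agent_id": "urn:example:agent:agent-42",
    "timestamp": 1700000000,
    "results": scores,
    "aggregate": aggregate_score(weights, scores),
}
~~~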

### Anti-Gaming Provisions

To prevent agents from gaming benchmark results,
the following provisions apply:

1. Randomized Scenarios: Test harnesses MUST
   randomize scenario ordering and MAY
   introduce minor variations in scenario
   parameters.

2. Blind Evaluation: The agent under test
   MUST NOT have access to the expected
   outputs or evaluation functions.

3. Holdback Scenarios: Benchmark profiles
   SHOULD include scenarios not disclosed to
   agent developers.

4. Temporal Variation: Repeated benchmark
   runs MUST vary timing to prevent
   memoization attacks.
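
Provisions 1 and 4 can be sketched in a harness, non-normatively, as
follows; the `start_delay_ms` field and the jitter bound are assumptions
of this sketch:

~~~python
import random

def prepare_run(scenarios: list, jitter_ms: int = 500) -> list:
    # Provision 1: shuffle scenario order with a CSPRNG so ordering is
    # unpredictable per run.  Provision 4: attach a random start delay
    # so repeated runs vary in timing.
    rng = random.SystemRandom()
    run = [dict(s) for s in scenarios]   # copy; never mutate the profile
    rng.shuffle(run)
    for s in run:
        s["start_delay_ms"] = rng.randrange(jitter_ms)
    return run

scenarios = [{"scenario_id": f"s-{i:03d}"} for i in range(5)]
run = prepare_run(scenarios)
~~~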

## Performance Claims in ECT

Agent ECTs MAY include performance attestation
claims in the `ext` field:

`perf_profile`:
: The benchmark profile identifier.

`perf_score`:
: The aggregate benchmark score.

`perf_timestamp`:
: The time of the benchmark execution.

`perf_harness`:
: Identifier of the test harness that produced
the results.

These claims allow relying parties to evaluate
agent capability before delegation.
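
A non-normative sketch of copying benchmark results into the `ext` field
and gating delegation on them follows; the minimal ECT payload shape and
the score threshold are assumptions of this sketch:

~~~python
def attach_perf_claims(ect_payload: dict, report: dict,
                       harness_id: str) -> dict:
    # Map the benchmark result document onto the four perf_* claims.
    ext = ect_payload.setdefault("ext", {})
    ext["perf_profile"] = report["profile_id"]
    ext["perf_score"] = report["aggregate"]
    ext["perf_timestamp"] = report["timestamp"]
    ext["perf_harness"] = harness_id
    return ect_payload

ect = attach_perf_claims(
    {"sub": "urn:example:agent:agent-42"},
    {"profile_id": "urn:ietf:bench:general-v1",
     "aggregate": 84.0, "timestamp": 1700000000},
    "urn:example:harness:h-007")

# A relying party might gate delegation on a minimum aggregate score.
capable = ect["ext"]["perf_score"] >= 75.0
~~~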

# Integration with ECT

Behavioral Evidence Tokens integrate into the
ECT DAG defined in
{{I-D.nennemann-agent-dag-hitl-safety}} as
follows:

1. Each BET references the ECT of the agent
   whose behavior was verified via the `sub`
   claim.

2. BETs are attached as child nodes in the
   ECT DAG, linked to the agent's execution
   node.

3. When an agent delegates to a sub-agent,
   the delegating agent's BET chain includes
   evidence covering the delegation decision.

4. Verifiers traversing the DAG can inspect
   BETs at each node to assess behavioral
   compliance across the entire execution
   chain.

~~~
+----------+     +----------+
|   ECT    |---->|   ECT    |
| Agent A  |     | Agent B  |
+----+-----+     +----+-----+
     |                |
+----v-----+     +----v-----+
|   BET    |     |   BET    |
| Agent A  |     | Agent B  |
+----------+     +----------+
~~~
{: #fig-dag title="BET Integration in ECT DAG"}

This structure enables end-to-end behavioral
verification across multi-agent workflows.
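
The DAG traversal of step 4 can be sketched, non-normatively, as a
recursive walk that requires every attached BET to report "pass"; the
node structure below is an illustrative stand-in for a full ECT DAG
implementation:

~~~python
from dataclasses import dataclass, field

@dataclass
class EctNode:
    # One execution node in the ECT DAG: `bets` holds the payloads of
    # BETs attached to this node, `children` the delegated sub-agents.
    agent_id: str
    bets: list = field(default_factory=list)
    children: list = field(default_factory=list)

def chain_compliant(node: EctNode) -> bool:
    # Every BET on this node and on every descendant must report
    # bhv_result == "pass" for the chain to be compliant.
    if any(bet["bhv_result"] != "pass" for bet in node.bets):
        return False
    return all(chain_compliant(child) for child in node.children)

# Agent A delegated to Agent B; B's monitor observed a violation, so
# the whole chain rooted at A fails the compliance check.
a = EctNode("urn:example:agent:A", bets=[{"bhv_result": "pass"}])
b = EctNode("urn:example:agent:B", bets=[{"bhv_result": "fail"}])
a.children.append(b)
~~~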

# Security Considerations

## Adversarial Behavior

Agents MAY attempt to behave correctly only when
they detect monitoring. Mitigations include:

- Unpredictable monitoring intervals
- Covert observation modes where the agent is
  not informed of monitor presence
- Cross-referencing BETs with external audit
  logs

## Monitor Compromise

A compromised Runtime Monitor could produce
fraudulent BETs. Mitigations include:

- Monitor attestation using RATS {{RFC9334}}
- Multiple independent monitors with
  cross-validation
- Transparency logs for BETs, aligned with
  SCITT {{I-D.ietf-scitt-architecture}}

## Benchmark Manipulation

Agents or their operators MAY attempt to
manipulate benchmark results. The anti-gaming
provisions in Section 4.3.3 address this risk.
Additionally:

- Benchmark harnesses MUST be operated by
  independent parties.
- Results MUST be signed by the harness
  operator.
- Benchmark profiles MUST be versioned and
  immutable once published.

## Privacy of Behavioral Evidence

BETs contain information about agent actions
that may be sensitive. Implementations MUST:

- Minimize the detail in `bhv_evidence` to
  what is necessary for verification.
- Support selective disclosure where possible.
- Protect BETs in transit using HTTPS
  ({{RFC9110}}) or an equivalent secure
  transport.
- Define retention policies for behavioral
  evidence.

# IANA Considerations

## ECT Extension Claim Keys

This document requests registration of the
following claim keys in the ECT `ext` claims
registry:

| Claim Key      | Description                |
|:---------------|:---------------------------|
| bhv_policy     | Policy URI reference       |
| bhv_result     | Verification result        |
| bhv_evidence   | Observed actions hash      |
| bhv_window     | Observation period         |
| bhv_details    | Per-behavior results       |
| perf_profile   | Benchmark profile ID       |
| perf_score     | Aggregate benchmark score  |
| perf_timestamp | Benchmark execution time   |
| perf_harness   | Test harness identifier    |
{: #tbl-claims title="ECT Extension Claims for Behavioral Verification"}

## Benchmark Profile Media Type

This document requests registration of the
following media type:

Type name: application

Subtype name: agent-benchmark-profile+json

Required parameters: N/A

Optional parameters: N/A

Encoding considerations: binary (UTF-8 JSON)

Security considerations: See Section 6

--- back

# Acknowledgments
{:numbered="false"}

The author thanks the contributors to the NMOP
working group for discussions on agent
operational requirements.