---
title: >
  Agent Behavioral Verification and
  Performance Benchmarking
abbrev: "Agent Behavioral Verification"
category: std
docname: draft-nennemann-agent-behavioral-verification-00
area: "OPS"
workgroup: "NMOP"
submissiontype: IETF
v: 3

author:
 - fullname: Christian Nennemann
   organization: Independent Researcher
   email: ietf@nennemann.de

normative:
  RFC2119:
  RFC8174:
  RFC9334:
  RFC7519:
  RFC7515:
  I-D.nennemann-wimse-ect:
    title: "Execution Context Tokens for Distributed Agentic Workflows"
    target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/
  I-D.nennemann-agent-dag-hitl-safety:
    title: "Agent Context Policy Token: DAG Delegation with Human Override"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/

informative:
  RFC9110:
  I-D.nennemann-agent-gap-analysis:
    title: "Gap Analysis for Autonomous Agent Protocols"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-gap-analysis/
  I-D.ietf-scitt-architecture:
--- abstract
This document defines protocols for runtime
verification that deployed AI agents behave
according to their declared policies. It also
specifies standardized metrics and a framework
for benchmarking agent performance across
implementations. Behavioral Evidence Tokens
(BETs) extend the Execution Context Token
architecture to provide cryptographically
verifiable proof of policy compliance.
Performance profiles enable objective comparison
of agent capabilities.
--- middle
# Introduction
Autonomous AI agents increasingly operate in
networked environments where they make decisions,
invoke tools, and delegate tasks to other agents.
Operators and relying parties need assurance that
these agents behave according to their declared
policies at runtime, not merely at deployment
time.
{{I-D.nennemann-agent-gap-analysis}} identifies
two critical gaps in the current standards
landscape:
- Gap 1 (Behavioral Verification): Agents
declare policies in their Execution Context
Tokens but no standardized mechanism exists to
verify that runtime behavior matches those
declarations.
- Gap 11 (Performance Benchmarking): No
standardized way exists to compare agent
implementations objectively across dimensions
such as task completion, latency, accuracy,
and safety compliance.
This document addresses both gaps by defining:
1. A behavioral verification architecture
aligned with the Remote Attestation Procedures
(RATS) framework {{RFC9334}}.
2. Behavioral Evidence Tokens (BETs) that extend
the Execution Context Token (ECT)
{{I-D.nennemann-wimse-ect}} with runtime
compliance claims.
3. A performance benchmarking framework with
standard metrics, benchmark profiles, and an
execution protocol.
# Terminology
{::boilerplate bcp14-tagged}
The following terms are used in this document:
Behavioral Attestation:
: The process of generating verifiable evidence
that an agent's runtime actions conform to its
declared policies.
Policy-Behavior Binding:
: A formal linkage between a declared policy in
an agent's ECT and observable runtime actions
that demonstrate compliance with that policy.
Behavioral Evidence Token (BET):
: A signed token containing claims about an
agent's observed runtime behavior relative to
its declared policies. BETs extend the ECT
architecture.
Runtime Monitor:
: A component that observes agent actions and
collects evidence for behavioral attestation.
Benchmark Suite:
: A collection of standardized test scenarios
designed to evaluate agent performance across
defined metrics.
Performance Profile:
: A structured record of benchmark results for
a specific agent implementation.
# Behavioral Verification Architecture
## Verification Model Overview
The behavioral verification architecture aligns
with the RATS {{RFC9334}} roles of Attester,
Verifier, and Relying Party. A Runtime Monitor
collects evidence of agent actions and produces
Behavioral Evidence Tokens.
~~~
+-------------+          +---------+
|    Agent    |--------->| Runtime |
| (Attester)  | actions  | Monitor |
+-------------+          +----+----+
                              |
                          evidence
                              |
                         +----v----+
                         |   BET   |
                         | Creator |
                         +----+----+
                              |
                             BET
                              |
                    +---------v---------+
                    |     Verifier      |
                    |  (Policy Engine)  |
                    +---------+---------+
                              |
                     attestation result
                              |
                    +---------v---------+
                    |   Relying Party   |
                    |  (Orchestrator /  |
                    |     Operator)     |
                    +-------------------+
~~~
{: #fig-arch title="Behavioral Verification Architecture"}
The architecture supports two modes of
operation:
- Continuous Monitoring: The Runtime Monitor
observes all agent actions in real time and
generates BETs at configurable intervals or
upon policy-relevant events.
- Point-in-Time Attestation: A Verifier
requests behavioral evidence for a specific
time window, and the Monitor assembles a BET
covering that period.
## Policy-Behavior Binding
A Policy-Behavior Binding declares the expected
behaviors associated with a policy and the
observable actions that constitute compliance.
The binding is expressed as a JSON object:
~~~json
{
  "policy_id": "urn:example:policy:data-access",
  "version": "1.0",
  "expected_behaviors": [
    {
      "behavior_id": "bhv-001",
      "description": "Agent accesses only authorized data sources",
      "observable_actions": [
        "data_source_access"
      ],
      "compliance_criteria": {
        "type": "allowlist",
        "values": [
          "urn:example:ds:approved-1",
          "urn:example:ds:approved-2"
        ]
      }
    }
  ],
  "evaluation_mode": "continuous"
}
~~~
{: #fig-binding title="Policy-Behavior Binding Structure"}
Each binding MUST include:
- `policy_id`: A URI identifying the policy.
- `expected_behaviors`: An array of behavior
descriptors.
- `evaluation_mode`: Either "continuous" or
"on_demand".
Each behavior descriptor MUST include:
- `behavior_id`: A unique identifier.
- `observable_actions`: Action types the monitor
MUST observe.
- `compliance_criteria`: The conditions under
which the behavior is considered compliant.
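As a non-normative illustration, the "allowlist" criteria type shown in {{fig-binding}} could be evaluated as follows; the helper name and the observed-action record shape are illustrative, not defined by this document:

```python
def evaluate_allowlist(criteria, observed_actions):
    """Evaluate 'allowlist' compliance criteria: the behavior is
    compliant only if every observed action's target URI appears
    in the allowlist.  (Illustrative helper, not normative.)"""
    allowed = set(criteria["values"])
    return all(a["target"] in allowed for a in observed_actions)

criteria = {
    "type": "allowlist",
    "values": ["urn:example:ds:approved-1",
               "urn:example:ds:approved-2"],
}
ok = evaluate_allowlist(
    criteria, [{"target": "urn:example:ds:approved-1"}])
bad = evaluate_allowlist(
    criteria, [{"target": "urn:example:ds:rogue"}])
```

Other criteria types (denylists, rate limits) would follow the same pattern: a pure predicate over the observed actions.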
## Behavioral Evidence Tokens (BET)
A Behavioral Evidence Token is a JSON Web Token
(JWT) {{RFC7519}} signed using JSON Web Signature
(JWS) {{RFC7515}}. BETs extend the ECT claim
set with behavioral verification claims.
The following new claims are defined:
`bhv_policy`:
: REQUIRED. A URI reference to the policy being
verified.
`bhv_result`:
: REQUIRED. The verification result. One of
"pass", "fail", or "partial".
`bhv_evidence`:
: REQUIRED. A base64url-encoded hash (SHA-256)
of the collected observable actions during the
observation window.
`bhv_window`:
: REQUIRED. A JSON object with `start` and
`end` fields containing NumericDate values
(as defined in {{RFC7519}}) representing the
observation period.
`bhv_details`:
: OPTIONAL. An array of per-behavior results
with `behavior_id` and individual `result`
values.
Example BET payload:
~~~json
{
  "iss": "urn:example:monitor:m-001",
  "sub": "urn:example:agent:agent-42",
  "iat": 1700000000,
  "exp": 1700003600,
  "bhv_policy": "urn:example:policy:data-access",
  "bhv_result": "pass",
  "bhv_evidence": "dGhpcyBpcyBhIGhhc2g...",
  "bhv_window": {
    "start": 1699996400,
    "end": 1700000000
  },
  "bhv_details": [
    {
      "behavior_id": "bhv-001",
      "result": "pass"
    }
  ]
}
~~~
{: #fig-bet title="Example BET Payload"}
### BET Lifecycle
The lifecycle of a Behavioral Evidence Token
consists of three phases:
1. Creation: The Runtime Monitor collects
evidence of agent actions, evaluates them
against the Policy-Behavior Binding, and
constructs a BET with the appropriate claims.
The BET is signed by the Monitor's key.
2. Submission: The signed BET is submitted to
the Verifier. Submission MAY occur via a
push model (Monitor sends to Verifier) or a
pull model (Verifier requests from Monitor).
3. Verification: The Verifier validates the BET
signature, checks the claims against its
reference policies, and produces an
attestation result for the Relying Party.
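The creation phase can be sketched as a compact JWS over the BET claims. This non-normative sketch uses HS256 with a shared key purely to stay self-contained; a real Monitor would typically sign with an asymmetric key as implied by "the Monitor's key" above:

```python
import base64
import hashlib
import hmac
import json
import time


def b64url(data: bytes) -> str:
    """base64url without padding, per JWS compact serialization."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def sign_bet(payload: dict, key: bytes) -> str:
    """Build a compact JWS (header.payload.signature) over a BET
    payload.  HS256 is an illustrative choice for this sketch."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, payload)
    )
    sig = hmac.new(key, signing_input.encode(),
                   hashlib.sha256).digest()
    return signing_input + "." + b64url(sig)


now = int(time.time())
bet = sign_bet(
    {
        "iss": "urn:example:monitor:m-001",
        "sub": "urn:example:agent:agent-42",
        "iat": now,
        "exp": now + 3600,
        "bhv_policy": "urn:example:policy:data-access",
        "bhv_result": "pass",
    },
    key=b"demo-shared-secret",
)
```

The Verifier reverses the process: it re-derives the signing input, checks the signature against the Monitor's key, and only then evaluates the claims.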
## Runtime Monitoring Protocol
### Monitor Placement
Runtime Monitors MAY be deployed in one of three
configurations:
Inline:
: The Monitor intercepts all agent
communications as a proxy. This provides
complete visibility but adds latency.
Sidecar:
: The Monitor runs alongside the agent process
and receives copies of all actions via a local
interface. This minimizes latency while
maintaining visibility.
External:
: The Monitor operates as a separate service
that receives action logs asynchronously.
This provides the least overhead but may miss
real-time events.
### Observation Collection
The Monitor MUST maintain a time-ordered log of
observed actions. Each log entry MUST contain:
- Timestamp (NumericDate)
- Action type
- Action target (URI)
- Action parameters (opaque to the Monitor)
- Agent identifier
### Evidence Assembly
When assembling evidence for a BET, the Monitor
MUST:
1. Select all log entries within the observation
window.
2. Compute a SHA-256 hash over the canonical
JSON serialization of the selected entries.
3. Evaluate each entry against the applicable
Policy-Behavior Bindings.
4. Determine the aggregate `bhv_result`.
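Steps 1 and 2 above can be sketched as follows. Since this document does not pin a canonicalization scheme, sorted-key, whitespace-free JSON stands in for a real one (e.g., JCS); the log-entry field names are illustrative:

```python
import base64
import hashlib
import json


def assemble_evidence(log, start, end):
    """Select log entries inside the observation window and hash
    their canonical JSON serialization (sorted keys, no whitespace
    -- an assumption of this sketch).  Returns the base64url digest
    suitable for the bhv_evidence claim."""
    selected = [e for e in log if start <= e["ts"] <= end]
    canonical = json.dumps(selected, sort_keys=True,
                           separators=(",", ":")).encode()
    digest = hashlib.sha256(canonical).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode()


log = [
    {"ts": 1699996500, "action": "data_source_access",
     "target": "urn:example:ds:approved-1"},
    {"ts": 1700000100, "action": "data_source_access",
     "target": "urn:example:ds:approved-2"},  # outside the window
]
evidence = assemble_evidence(log, 1699996400, 1700000000)
```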
### Anomaly Detection Signaling
When the Monitor detects behavior that violates
a Policy-Behavior Binding, it MUST:
1. Generate a BET with `bhv_result` set to
"fail" or "partial".
2. Signal the anomaly to the Verifier
immediately, regardless of the configured
reporting interval.
The Monitor MAY additionally signal the agent's
orchestrator to enable corrective action.
# Performance Benchmarking Framework
## Standard Metrics
The following metrics are defined for agent
performance benchmarking:
Task Completion Rate (TCR):
: The ratio of successfully completed tasks to
total tasks attempted. Unit: percentage (%).
Measured over a complete benchmark suite run.
Task Latency (TL):
: The time elapsed from task assignment to task
completion. Unit: milliseconds (ms).
Reported as p50, p95, and p99 percentiles.
Task Accuracy (TA):
: The degree to which task outputs match
expected results. Unit: percentage (%).
Measured using benchmark-specific evaluation
functions.
Resource Efficiency (RE):
: The computational resources consumed per task.
Unit: normalized resource units (NRU).
Includes CPU, memory, and network I/O.
Safety Compliance Score (SCS):
: The ratio of tasks completed without safety
policy violations to total tasks.
Unit: percentage (%).
Delegation Success Rate (DSR):
: The ratio of successful delegations to total
delegation attempts. Unit: percentage (%).
Applicable only to multi-agent scenarios.
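A non-normative sketch of computing TCR, SCS, and the TL percentiles from per-task records follows; the record shape and the nearest-rank percentile method are assumptions of this sketch, not requirements:

```python
import math


def percentile(sorted_vals, p):
    """Nearest-rank percentile over a pre-sorted list."""
    k = max(0, math.ceil(p / 100 * len(sorted_vals)) - 1)
    return sorted_vals[k]


def summarize(tasks):
    """Compute TCR, SCS, and TL (p50/p95/p99) from per-task
    records; field names are illustrative."""
    n = len(tasks)
    tcr = 100.0 * sum(t["completed"] for t in tasks) / n
    scs = 100.0 * sum(t["completed"] and not t["violation"]
                      for t in tasks) / n
    lat = sorted(t["latency_ms"] for t in tasks)
    return {
        "TCR": tcr,
        "SCS": scs,
        "TL": {"p50": percentile(lat, 50),
               "p95": percentile(lat, 95),
               "p99": percentile(lat, 99)},
    }


tasks = [
    {"completed": True,  "violation": False, "latency_ms": 120},
    {"completed": True,  "violation": True,  "latency_ms": 340},
    {"completed": False, "violation": False, "latency_ms": 900},
    {"completed": True,  "violation": False, "latency_ms": 150},
]
report = summarize(tasks)
```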
## Benchmark Profiles
A Benchmark Profile defines a standardized set
of test scenarios for a specific agent category.
Profiles are expressed as JSON objects:
~~~json
{
  "profile_id": "urn:ietf:bench:general-v1",
  "profile_name": "General Agent Benchmark",
  "version": "1.0",
  "agent_category": "general-purpose",
  "scenarios": [
    {
      "scenario_id": "s-001",
      "description": "Simple data retrieval",
      "difficulty": "basic",
      "metrics": ["TCR", "TL", "TA"],
      "timeout_ms": 30000,
      "expected_output_schema": "..."
    }
  ],
  "scoring": {
    "weights": {
      "TCR": 0.3,
      "TL": 0.2,
      "TA": 0.3,
      "SCS": 0.2
    }
  }
}
~~~
{: #fig-profile title="Benchmark Profile Structure"}
Predefined profiles SHOULD be registered for
common agent types including:
- General-purpose agents
- Code generation agents
- Data analysis agents
- Network management agents
## Benchmark Execution Protocol
### Test Harness Requirements
A conformant test harness MUST:
1. Execute all scenarios in the benchmark
profile in a controlled environment.
2. Isolate agent instances from external
resources not specified in the scenario.
3. Record all metrics defined in the profile.
4. Produce a benchmark result document.
### Result Reporting Format
Benchmark results MUST be reported as a JSON
object containing:
- `profile_id`: The benchmark profile used.
- `agent_id`: Identifier of the tested agent.
- `timestamp`: Time of benchmark execution.
- `results`: Per-scenario metric values.
- `aggregate`: Weighted aggregate scores.
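Assembling such a result document with the weighted aggregate from the profile's `scoring` object could look like this; how each raw metric is normalized to a 0-100 score is profile-specific, and the values here are illustrative:

```python
def aggregate_score(per_metric, weights):
    """Weighted sum over normalized (0-100) metric scores, using
    the weights from the profile's 'scoring' object."""
    return sum(per_metric[m] * w for m, w in weights.items())


weights = {"TCR": 0.3, "TL": 0.2, "TA": 0.3, "SCS": 0.2}
per_metric = {"TCR": 92.0, "TL": 75.0, "TA": 88.0, "SCS": 100.0}

result = {
    "profile_id": "urn:ietf:bench:general-v1",
    "agent_id": "urn:example:agent:agent-42",
    "timestamp": 1700000000,
    "results": per_metric,
    "aggregate": round(aggregate_score(per_metric, weights), 2),
}
```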
### Anti-Gaming Provisions
To prevent agents from gaming benchmark results,
the following provisions apply:
1. Randomized Scenarios: Test harnesses MUST
randomize scenario ordering and MAY
introduce minor variations in scenario
parameters.
2. Blind Evaluation: The agent under test
MUST NOT have access to the expected
outputs or evaluation functions.
3. Holdback Scenarios: Benchmark profiles
SHOULD include scenarios not disclosed to
agent developers.
4. Temporal Variation: Repeated benchmark
runs MUST vary timing to prevent
memoization attacks.
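Provision 1 can be sketched as a per-run shuffle; keeping the seed fresh and unpublished per run makes the ordering unpredictable to the agent under test (the function name and seed handling are illustrative):

```python
import random


def randomized_run_order(scenarios, seed=None):
    """Return a shuffled copy of the scenario list for one
    benchmark run.  A fresh, unpublished seed per run keeps the
    ordering unpredictable to the agent under test."""
    rng = random.Random(seed)
    order = list(scenarios)
    rng.shuffle(order)
    return order


scenarios = ["s-001", "s-002", "s-003", "s-004"]
run = randomized_run_order(scenarios, seed=7)
```

Parameter variation (Provision 1's second clause) would similarly draw minor perturbations from the same per-run RNG so that runs remain reproducible by the harness operator but not by the agent.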
## Performance Claims in ECT
Agent ECTs MAY include performance attestation
claims in the `ext` field:
`perf_profile`:
: The benchmark profile identifier.
`perf_score`:
: The aggregate benchmark score.
`perf_timestamp`:
: The time of the benchmark execution.
`perf_harness`:
: Identifier of the test harness that produced
the results.
These claims allow relying parties to evaluate
agent capability before delegation.
# Integration with ECT
Behavioral Evidence Tokens integrate into the
ECT DAG defined in
{{I-D.nennemann-agent-dag-hitl-safety}} as
follows:
1. Each BET references the ECT of the agent
whose behavior was verified via the `sub`
claim.
2. BETs are attached as child nodes in the
ECT DAG, linked to the agent's execution
node.
3. When an agent delegates to a sub-agent,
the delegating agent's BET chain includes
evidence covering the delegation decision.
4. Verifiers traversing the DAG can inspect
BETs at each node to assess behavioral
compliance across the entire execution
chain.
~~~
+----------+     +----------+
|   ECT    |---->|   ECT    |
| Agent A  |     | Agent B  |
+----+-----+     +----+-----+
     |                |
+----v-----+     +----v-----+
|   BET    |     |   BET    |
| Agent A  |     | Agent B  |
+----------+     +----------+
~~~
{: #fig-dag title="BET Integration in ECT DAG"}
This structure enables end-to-end behavioral
verification across multi-agent workflows.
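A Verifier's DAG traversal (Step 4 above) reduces to a graph walk that collects any non-passing BET along the way. This is a non-normative sketch; the dict-based DAG encoding is illustrative, as the actual encoding is defined in the referenced drafts:

```python
def verify_chain(dag, root):
    """Walk the ECT DAG from the root execution node and collect
    (node, result) pairs for every attached BET whose bhv_result
    is not 'pass'."""
    failures = []
    stack = [root]
    while stack:
        node = stack.pop()
        bet = dag[node].get("bet")
        if bet and bet["bhv_result"] != "pass":
            failures.append((node, bet["bhv_result"]))
        stack.extend(dag[node].get("children", []))
    return failures


dag = {
    "agent-a": {"bet": {"bhv_result": "pass"},
                "children": ["agent-b"]},
    "agent-b": {"bet": {"bhv_result": "partial"},
                "children": []},
}
failures = verify_chain(dag, "agent-a")
```

An empty return value means every node in the execution chain carried a passing BET.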
# Security Considerations
## Adversarial Behavior
Agents may attempt to behave correctly only when
they detect monitoring. Mitigations include:
- Unpredictable monitoring intervals
- Covert observation modes where the agent is
not informed of monitor presence
- Cross-referencing BETs with external audit
logs
## Monitor Compromise
A compromised Runtime Monitor could produce
fraudulent BETs. Mitigations include:
- Monitor attestation using RATS {{RFC9334}}
- Multiple independent monitors with
cross-validation
- Transparency logs for BETs, aligned with
SCITT {{I-D.ietf-scitt-architecture}}
## Benchmark Manipulation
Agents or their operators may attempt to
manipulate benchmark results. The anti-gaming
provisions in Section 4.3.3 address this risk.
Additionally:
- Benchmark harnesses MUST be operated by
independent parties.
- Results MUST be signed by the harness
operator.
- Benchmark profiles MUST be versioned and
immutable once published.
## Privacy of Behavioral Evidence
BETs contain information about agent actions
that may be sensitive. Implementations MUST:
- Minimize the detail in `bhv_evidence` to
what is necessary for verification.
- Support selective disclosure where possible.
- Protect BETs in transit using a secure
transport (e.g., HTTPS {{RFC9110}}).
- Define retention policies for behavioral
evidence.
# IANA Considerations
## ECT Extension Claim Keys
This document requests registration of the
following claim keys in the ECT `ext` claims
registry:
| Claim Key | Description |
|:---------------|:---------------------------|
| bhv_policy | Policy URI reference |
| bhv_result | Verification result |
| bhv_evidence | Observed actions hash |
| bhv_window | Observation period |
| bhv_details | Per-behavior results |
| perf_profile | Benchmark profile ID |
| perf_score | Aggregate benchmark score |
| perf_timestamp | Benchmark execution time |
| perf_harness | Test harness identifier |
{: #tbl-claims title="ECT Extension Claims for Behavioral Verification"}
## Benchmark Profile Media Type
This document requests registration of the
following media type:
Type name: application
Subtype name: agent-benchmark-profile+json
Required parameters: N/A
Optional parameters: N/A
Encoding considerations: binary (UTF-8 JSON)
Security considerations: See Section 6
--- back
# Acknowledgments
{:numbered="false"}
The author thanks the contributors to the NMOP
working group for discussions on agent
operational requirements.