---
title: >
  Agent Behavioral Verification and
  Performance Benchmarking
abbrev: "Agent Behavioral Verification"
category: std
docname: draft-nennemann-agent-behavioral-verification-00
area: "OPS"
workgroup: "NMOP"
submissiontype: IETF
v: 3

author:
 - fullname: Christian Nennemann
   organization: Independent Researcher
   email: ietf@nennemann.de

normative:
  RFC2119:
  RFC8174:
  RFC9334:
  RFC7519:
  RFC7515:
  I-D.nennemann-wimse-ect:
    title: "Execution Context Tokens for Distributed Agentic Workflows"
    target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/
  I-D.nennemann-agent-dag-hitl-safety:
    title: "Agent Context Policy Token: DAG Delegation with Human Override"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/

informative:
  RFC9110:
  I-D.nennemann-agent-gap-analysis:
    title: "Gap Analysis for Autonomous Agent Protocols"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-gap-analysis/
  I-D.ietf-scitt-architecture:

--- abstract

This document defines protocols for runtime verification that deployed AI agents behave according to their declared policies. It also specifies standardized metrics and a framework for benchmarking agent performance across implementations. Behavioral Evidence Tokens (BETs) extend the Execution Context Token architecture to provide cryptographically verifiable proof of policy compliance. Performance profiles enable objective comparison of agent capabilities.

--- middle

# Introduction

Autonomous AI agents increasingly operate in networked environments where they make decisions, invoke tools, and delegate tasks to other agents. Operators and relying parties need assurance that these agents behave according to their declared policies at runtime, not merely at deployment time.

{{I-D.nennemann-agent-gap-analysis}} identifies two critical gaps in the current standards landscape:

- Gap 1 (Behavioral Verification): Agents declare policies in their Execution Context Tokens, but no standardized mechanism exists to verify that runtime behavior matches those declarations.

- Gap 11 (Performance Benchmarking): No standardized way exists to compare agent implementations objectively across dimensions such as task completion, latency, accuracy, and safety compliance.

This document addresses both gaps by defining:

1. A behavioral verification architecture aligned with the Remote Attestation Procedures (RATS) framework {{RFC9334}}.

2. Behavioral Evidence Tokens (BETs) that extend the Execution Context Token (ECT) {{I-D.nennemann-wimse-ect}} with runtime compliance claims.

3. A performance benchmarking framework with standard metrics, benchmark profiles, and an execution protocol.

# Terminology

{::boilerplate bcp14-tagged}

The following terms are used in this document:

Behavioral Attestation:
: The process of generating verifiable evidence that an agent's runtime actions conform to its declared policies.

Policy-Behavior Binding:
: A formal linkage between a declared policy in an agent's ECT and the observable runtime actions that demonstrate compliance with that policy.

Behavioral Evidence Token (BET):
: A signed token containing claims about an agent's observed runtime behavior relative to its declared policies. BETs extend the ECT architecture.

Runtime Monitor:
: A component that observes agent actions and collects evidence for behavioral attestation.

Benchmark Suite:
: A collection of standardized test scenarios designed to evaluate agent performance across defined metrics.

Performance Profile:
: A structured record of benchmark results for a specific agent implementation.

# Behavioral Verification Architecture

## Verification Model Overview

The behavioral verification architecture aligns with the RATS {{RFC9334}} roles of Attester, Verifier, and Relying Party. A Runtime Monitor collects evidence of agent actions and produces Behavioral Evidence Tokens.

~~~
+-------------+          +---------+
|    Agent    |--------->| Runtime |
|  (Attester) | actions  | Monitor |
+-------------+          +----+----+
                              |
                           evidence
                              |
                         +----v----+
                         |   BET   |
                         | Creator |
                         +----+----+
                              |
                             BET
                              |
                    +---------v---------+
                    |     Verifier      |
                    |  (Policy Engine)  |
                    +---------+---------+
                              |
                     attestation result
                              |
                    +---------v---------+
                    |   Relying Party   |
                    |  (Orchestrator /  |
                    |    Operator)      |
                    +-------------------+
~~~
{: #fig-arch title="Behavioral Verification Architecture"}

The architecture supports two modes of operation:

- Continuous Monitoring: The Runtime Monitor observes all agent actions in real time and generates BETs at configurable intervals or upon policy-relevant events.

- Point-in-Time Attestation: A Verifier requests behavioral evidence for a specific time window, and the Monitor assembles a BET covering that period.

## Policy-Behavior Binding

A Policy-Behavior Binding declares the expected behaviors associated with a policy and the observable actions that constitute compliance.

The binding is expressed as a JSON object:

~~~json
{
  "policy_id": "urn:example:policy:data-access",
  "version": "1.0",
  "expected_behaviors": [
    {
      "behavior_id": "bhv-001",
      "description": "Agent accesses only authorized data sources",
      "observable_actions": [
        "data_source_access"
      ],
      "compliance_criteria": {
        "type": "allowlist",
        "values": [
          "urn:example:ds:approved-1",
          "urn:example:ds:approved-2"
        ]
      }
    }
  ],
  "evaluation_mode": "continuous"
}
~~~
{: #fig-binding title="Policy-Behavior Binding Structure"}

Each binding MUST include:

- `policy_id`: A URI identifying the policy.
- `expected_behaviors`: An array of behavior descriptors.
- `evaluation_mode`: Either "continuous" or "on_demand".

Each behavior descriptor MUST include:

- `behavior_id`: A unique identifier.
- `observable_actions`: The action types the Monitor MUST observe.
- `compliance_criteria`: The conditions under which the behavior is considered compliant.

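As a non-normative sketch, evaluating an observed action against the allowlist criteria above reduces to a membership check. The function name `is_compliant` and the way the binding is loaded are illustrative, not part of this specification:

~~~python
import json

# Illustrative evaluator for one behavior descriptor's compliance
# criteria; only the "allowlist" type from the example above is handled.
def is_compliant(action_target: str, criteria: dict) -> bool:
    if criteria["type"] == "allowlist":
        return action_target in criteria["values"]
    raise ValueError("unknown criteria type: " + criteria["type"])

binding = json.loads("""
{
  "policy_id": "urn:example:policy:data-access",
  "expected_behaviors": [{
    "behavior_id": "bhv-001",
    "observable_actions": ["data_source_access"],
    "compliance_criteria": {
      "type": "allowlist",
      "values": ["urn:example:ds:approved-1",
                 "urn:example:ds:approved-2"]
    }
  }],
  "evaluation_mode": "continuous"
}
""")

criteria = binding["expected_behaviors"][0]["compliance_criteria"]
print(is_compliant("urn:example:ds:approved-1", criteria))  # True
print(is_compliant("urn:example:ds:rogue", criteria))       # False
~~~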
## Behavioral Evidence Tokens (BET)

A Behavioral Evidence Token is a JSON Web Token (JWT) {{RFC7519}} signed using JSON Web Signature (JWS) {{RFC7515}}. BETs extend the ECT claim set with behavioral verification claims.

The following new claims are defined:

`bhv_policy`:
: REQUIRED. A URI reference to the policy being verified.

`bhv_result`:
: REQUIRED. The verification result. One of "pass", "fail", or "partial".

`bhv_evidence`:
: REQUIRED. A base64url-encoded SHA-256 hash of the observable actions collected during the observation window.

`bhv_window`:
: REQUIRED. A JSON object with `start` and `end` fields containing NumericDate values (as defined in {{RFC7519}}) that delimit the observation period.

`bhv_details`:
: OPTIONAL. An array of per-behavior results with `behavior_id` and individual `result` values.

Example BET payload:

~~~json
{
  "iss": "urn:example:monitor:m-001",
  "sub": "urn:example:agent:agent-42",
  "iat": 1700000000,
  "exp": 1700003600,
  "bhv_policy": "urn:example:policy:data-access",
  "bhv_result": "pass",
  "bhv_evidence": "dGhpcyBpcyBhIGhhc2g...",
  "bhv_window": {
    "start": 1699996400,
    "end": 1700000000
  },
  "bhv_details": [
    {
      "behavior_id": "bhv-001",
      "result": "pass"
    }
  ]
}
~~~
{: #fig-bet title="Example BET Payload"}

### BET Lifecycle

The lifecycle of a Behavioral Evidence Token consists of three phases:

1. Creation: The Runtime Monitor collects evidence of agent actions, evaluates it against the Policy-Behavior Binding, and constructs a BET with the appropriate claims. The BET is signed with the Monitor's key.

2. Submission: The signed BET is submitted to the Verifier. Submission MAY occur via a push model (the Monitor sends to the Verifier) or a pull model (the Verifier requests from the Monitor).

3. Verification: The Verifier validates the BET signature, checks the claims against its reference policies, and produces an attestation result for the Relying Party.

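The Creation and Verification phases can be sketched with standard-library primitives. This minimal compact-JWS example uses HS256 with a shared demo key purely for brevity; a production Monitor would sign with an asymmetric JWS algorithm (e.g., ES256) and carry the full claim set:

~~~python
import base64, hashlib, hmac, json

def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

# Phase 1 (Creation): sign a BET payload as a compact JWS.
def create_bet(payload: dict, key: bytes) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = b64url(json.dumps(header).encode()) + "." + \
                    b64url(json.dumps(payload).encode())
    sig = hmac.new(key, signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + b64url(sig)

# Phase 3 (Verification): check the signature, then decode the claims.
def verify_bet(token: str, key: bytes) -> dict:
    signing_input, _, sig = token.rpartition(".")
    expected = hmac.new(key, signing_input.encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(b64url(expected), sig):
        raise ValueError("invalid BET signature")
    payload_b64 = signing_input.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

key = b"monitor-demo-key"
bet = create_bet({"sub": "urn:example:agent:agent-42",
                  "bhv_policy": "urn:example:policy:data-access",
                  "bhv_result": "pass",
                  "bhv_window": {"start": 1699996400,
                                 "end": 1700000000}}, key)
claims = verify_bet(bet, key)
print(claims["bhv_result"])  # "pass"
~~~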
## Runtime Monitoring Protocol

### Monitor Placement

Runtime Monitors MAY be deployed in one of three configurations:

Inline:
: The Monitor intercepts all agent communications as a proxy. This provides complete visibility but adds latency.

Sidecar:
: The Monitor runs alongside the agent process and receives copies of all actions via a local interface. This minimizes latency while maintaining visibility.

External:
: The Monitor operates as a separate service that receives action logs asynchronously. This imposes the least overhead but may miss real-time events.

### Observation Collection

The Monitor MUST maintain a time-ordered log of observed actions. Each log entry MUST contain:

- Timestamp (NumericDate)
- Action type
- Action target (URI)
- Action parameters (opaque to the Monitor)
- Agent identifier

### Evidence Assembly

When assembling evidence for a BET, the Monitor MUST:

1. Select all log entries within the observation window.
2. Compute a SHA-256 hash over the canonical JSON serialization of the selected entries.
3. Evaluate each entry against the applicable Policy-Behavior Bindings.
4. Determine the aggregate `bhv_result`.

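The four steps above can be sketched as follows, under stated assumptions: this document does not name a canonicalization scheme, so sorted keys with compact separators stand in for "canonical JSON" (a real profile might instead mandate JCS, RFC 8785), and the `ALLOWED` set and log-entry layout are illustrative:

~~~python
import base64
import hashlib
import json

# Illustrative allowlist; in practice this comes from the
# Policy-Behavior Binding's compliance_criteria.
ALLOWED = {"urn:example:ds:approved-1", "urn:example:ds:approved-2"}

def assemble(log: list, start: int, end: int) -> dict:
    # Step 1: select log entries within the observation window.
    window = [e for e in log if start <= e["ts"] <= end]
    # Step 2: SHA-256 over the canonical serialization,
    # base64url-encoded to match the bhv_evidence claim format.
    canonical = json.dumps(window, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode()).digest()
    evidence = base64.urlsafe_b64encode(digest).rstrip(b"=").decode()
    # Steps 3-4: evaluate each entry and derive the aggregate result.
    verdicts = [e["target"] in ALLOWED for e in window
                if e["type"] == "data_source_access"]
    result = ("pass" if all(verdicts)
              else "partial" if any(verdicts) else "fail")
    return {"bhv_result": result, "bhv_evidence": evidence,
            "bhv_window": {"start": start, "end": end}}

log = [
    {"ts": 1699996500, "type": "data_source_access",
     "target": "urn:example:ds:approved-1", "agent": "agent-42"},
    {"ts": 1700005000, "type": "data_source_access",
     "target": "urn:example:ds:rogue", "agent": "agent-42"},
]
claims = assemble(log, 1699996400, 1700000000)
print(claims["bhv_result"])  # "pass": the rogue access is outside the window
~~~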
### Anomaly Detection Signaling

When the Monitor detects behavior that violates a Policy-Behavior Binding, it MUST:

1. Generate a BET with `bhv_result` set to "fail" or "partial".
2. Signal the anomaly to the Verifier immediately, regardless of the configured reporting interval.
3. Optionally signal the agent's orchestrator to enable corrective action.

# Performance Benchmarking Framework

## Standard Metrics

The following metrics are defined for agent performance benchmarking:

Task Completion Rate (TCR):
: The ratio of successfully completed tasks to total tasks attempted. Unit: percentage (%). Measured over a complete benchmark suite run.

Task Latency (TL):
: The time elapsed from task assignment to task completion. Unit: milliseconds (ms). Reported as p50, p95, and p99 percentiles.

Task Accuracy (TA):
: The degree to which task outputs match expected results. Unit: percentage (%). Measured using benchmark-specific evaluation functions.

Resource Efficiency (RE):
: The computational resources consumed per task. Unit: normalized resource units (NRU). Includes CPU, memory, and network I/O.

Safety Compliance Score (SCS):
: The ratio of tasks completed without safety policy violations to total tasks. Unit: percentage (%).

Delegation Success Rate (DSR):
: The ratio of successful delegations to total delegation attempts. Unit: percentage (%). Applicable only to multi-agent scenarios.

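As a non-normative sketch, TCR and the TL percentiles can be derived from raw per-task records as follows. The record layout is illustrative, the nearest-rank percentile method is an assumption (this document does not mandate one), and excluding failed tasks from latency statistics is likewise a harness design choice:

~~~python
import math

# Nearest-rank percentile: the smallest sample such that at least
# q percent of samples are less than or equal to it.
def percentile(samples: list, q: float) -> float:
    ordered = sorted(samples)
    rank = math.ceil(q / 100 * len(ordered))
    return ordered[rank - 1]

tasks = [
    {"completed": True,  "latency_ms": 120},
    {"completed": True,  "latency_ms": 340},
    {"completed": False, "latency_ms": 30000},  # timed out
    {"completed": True,  "latency_ms": 95},
]

tcr = 100 * sum(t["completed"] for t in tasks) / len(tasks)
latencies = [t["latency_ms"] for t in tasks if t["completed"]]
report = {"TCR": tcr,
          "TL": {"p50": percentile(latencies, 50),
                 "p95": percentile(latencies, 95),
                 "p99": percentile(latencies, 99)}}
print(report)  # TCR 75.0; p50 120, p95 340, p99 340
~~~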
## Benchmark Profiles

A Benchmark Profile defines a standardized set of test scenarios for a specific agent category. Profiles are expressed as JSON objects:

~~~json
{
  "profile_id": "urn:ietf:bench:general-v1",
  "profile_name": "General Agent Benchmark",
  "version": "1.0",
  "agent_category": "general-purpose",
  "scenarios": [
    {
      "scenario_id": "s-001",
      "description": "Simple data retrieval",
      "difficulty": "basic",
      "metrics": ["TCR", "TL", "TA"],
      "timeout_ms": 30000,
      "expected_output_schema": "..."
    }
  ],
  "scoring": {
    "weights": {
      "TCR": 0.3,
      "TL": 0.2,
      "TA": 0.3,
      "SCS": 0.2
    }
  }
}
~~~
{: #fig-profile title="Benchmark Profile Structure"}

Predefined profiles SHOULD be registered for common agent types, including:

- General-purpose agents
- Code generation agents
- Data analysis agents
- Network management agents

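A sketch of the weighted aggregate for the profile above. How TL is mapped onto the same 0-100 scale as the percentage metrics is left open by the profile format; normalizing p95 latency against the scenario timeout, as shown here, is one plausible choice and purely an assumption:

~~~python
# Weights copied from the example profile above.
WEIGHTS = {"TCR": 0.3, "TL": 0.2, "TA": 0.3, "SCS": 0.2}
TIMEOUT_MS = 30000  # timeout_ms from scenario s-001

# Assumed normalization: 0 ms -> 100, timeout -> 0, clamped below.
def normalize_latency(p95_ms: float) -> float:
    return max(0.0, 100.0 * (1 - p95_ms / TIMEOUT_MS))

metrics = {"TCR": 92.0, "TA": 88.0, "SCS": 100.0,
           "TL": normalize_latency(1500)}  # p95 of 1.5 s -> 95.0

aggregate = sum(WEIGHTS[m] * metrics[m] for m in WEIGHTS)
print(round(aggregate, 2))  # 93.0
~~~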
## Benchmark Execution Protocol

### Test Harness Requirements

A conformant test harness MUST:

1. Execute all scenarios in the benchmark profile in a controlled environment.
2. Isolate agent instances from external resources not specified in the scenario.
3. Record all metrics defined in the profile.
4. Produce a benchmark result document.

### Result Reporting Format

Benchmark results MUST be reported as a JSON object containing:

- `profile_id`: The benchmark profile used.
- `agent_id`: The identifier of the tested agent.
- `timestamp`: The time of benchmark execution.
- `results`: Per-scenario metric values.
- `aggregate`: Weighted aggregate scores.

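A minimal result document following these fields might look as follows; all values and the inner field layout are illustrative, not normative:

~~~json
{
  "profile_id": "urn:ietf:bench:general-v1",
  "agent_id": "urn:example:agent:agent-42",
  "timestamp": 1700000000,
  "results": [
    {
      "scenario_id": "s-001",
      "TCR": 100.0,
      "TL": {"p50": 120, "p95": 340, "p99": 340},
      "TA": 96.5
    }
  ],
  "aggregate": {"score": 93.0}
}
~~~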
### Anti-Gaming Provisions

To prevent agents from gaming benchmark results, the following provisions apply:

1. Randomized Scenarios: Test harnesses MUST randomize scenario ordering and MAY introduce minor variations in scenario parameters.

2. Blind Evaluation: The agent under test MUST NOT have access to the expected outputs or evaluation functions.

3. Holdback Scenarios: Benchmark profiles SHOULD include scenarios not disclosed to agent developers.

4. Temporal Variation: Repeated benchmark runs MUST vary timing to prevent memoization attacks.

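An illustrative harness-side implementation of provision 1: shuffle scenario order and jitter a numeric parameter by up to +/-10%. The `payload_size` parameter name and the jitter range are hypothetical, and a real harness would log the per-run seed for auditability:

~~~python
import random

def randomize(scenarios: list, seed: int) -> list:
    rng = random.Random(seed)  # per-run seed, recorded for audit
    # Copy scenarios so the published profile stays immutable.
    varied = [dict(s, params=dict(s["params"])) for s in scenarios]
    rng.shuffle(varied)  # provision 1: randomized ordering
    for s in varied:
        # Provision 1: minor parameter variation (+/-10%).
        jitter = 1 + rng.uniform(-0.1, 0.1)
        s["params"]["payload_size"] = int(s["params"]["payload_size"] * jitter)
    return varied

suite = [{"scenario_id": f"s-{i:03d}", "params": {"payload_size": 1000}}
         for i in range(1, 6)]
run = randomize(suite, seed=2024)
print([s["scenario_id"] for s in run])
~~~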
## Performance Claims in ECT

Agent ECTs MAY include performance attestation claims in the `ext` field:

`perf_profile`:
: The benchmark profile identifier.

`perf_score`:
: The aggregate benchmark score.

`perf_timestamp`:
: The time of the benchmark execution.

`perf_harness`:
: The identifier of the test harness that produced the results.

These claims allow relying parties to evaluate agent capability before delegation.

# Integration with ECT

Behavioral Evidence Tokens integrate into the ECT DAG defined in {{I-D.nennemann-agent-dag-hitl-safety}} as follows:

1. Each BET references the ECT of the agent whose behavior was verified via the `sub` claim.

2. BETs are attached as child nodes in the ECT DAG, linked to the agent's execution node.

3. When an agent delegates to a sub-agent, the delegating agent's BET chain includes evidence covering the delegation decision.

4. Verifiers traversing the DAG can inspect BETs at each node to assess behavioral compliance across the entire execution chain.

~~~
+----------+      +----------+
|   ECT    |----->|   ECT    |
| Agent A  |      | Agent B  |
+----+-----+      +----+-----+
     |                 |
+----v-----+      +----v-----+
|   BET    |      |   BET    |
| Agent A  |      | Agent B  |
+----------+      +----------+
~~~
{: #fig-dag title="BET Integration in ECT DAG"}

This structure enables end-to-end behavioral verification across multi-agent workflows.

# Security Considerations

## Adversarial Behavior

Agents may attempt to behave correctly only when they detect that they are being monitored. Mitigations include:

- Unpredictable monitoring intervals
- Covert observation modes in which the agent is not informed of the Monitor's presence
- Cross-referencing BETs with external audit logs

## Monitor Compromise

A compromised Runtime Monitor could produce fraudulent BETs. Mitigations include:

- Monitor attestation using RATS {{RFC9334}}
- Multiple independent Monitors with cross-validation
- Transparency logs for BETs, aligned with SCITT {{I-D.ietf-scitt-architecture}}

## Benchmark Manipulation

Agents or their operators may attempt to manipulate benchmark results. The anti-gaming provisions in Section 4.3.3 address this risk. Additionally:

- Benchmark harnesses MUST be operated by independent parties.
- Results MUST be signed by the harness operator.
- Benchmark profiles MUST be versioned and immutable once published.

## Privacy of Behavioral Evidence

BETs contain information about agent actions that may be sensitive. Implementations MUST:

- Minimize the detail in `bhv_evidence` to what is necessary for verification.
- Support selective disclosure where possible.
- Protect BETs in transit using TLS ({{RFC9110}}).
- Define retention policies for behavioral evidence.

# IANA Considerations

## ECT Extension Claim Keys

This document requests registration of the following claim keys in the ECT `ext` claims registry:

| Claim Key      | Description               |
|:---------------|:--------------------------|
| bhv_policy     | Policy URI reference      |
| bhv_result     | Verification result       |
| bhv_evidence   | Observed-actions hash     |
| bhv_window     | Observation period        |
| bhv_details    | Per-behavior results      |
| perf_profile   | Benchmark profile ID      |
| perf_score     | Aggregate benchmark score |
| perf_timestamp | Benchmark execution time  |
| perf_harness   | Test harness identifier   |
{: #tbl-claims title="ECT Extension Claims for Behavioral Verification"}

## Benchmark Profile Media Type

This document requests registration of the following media type:

Type name: application

Subtype name: agent-benchmark-profile+json

Required parameters: N/A

Optional parameters: N/A

Encoding considerations: binary (UTF-8 JSON)

Security considerations: See Section 6

--- back

# Acknowledgments
{:numbered="false"}

The author thanks the contributors to the NMOP working group for discussions on agent operational requirements.