---
title: >
  Agent Behavioral Verification and
  Performance Benchmarking
abbrev: "Agent Behavioral Verification"
category: std
docname: draft-nennemann-agent-behavioral-verification-00
area: "OPS"
workgroup: "NMOP"
submissiontype: IETF
v: 3

author:
 - fullname: Christian Nennemann
   organization: Independent Researcher
   email: ietf@nennemann.de

normative:
  RFC2119:
  RFC8174:
  RFC9334:
  RFC7519:
  RFC7515:
  I-D.nennemann-wimse-ect:
    title: "Execution Context Tokens for Distributed Agentic Workflows"
    target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/
  I-D.nennemann-agent-dag-hitl-safety:
    title: "Agent Context Policy Token: DAG Delegation with Human Override"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/

informative:
  RFC9110:
  I-D.nennemann-agent-gap-analysis:
    title: "Gap Analysis for Autonomous Agent Protocols"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-gap-analysis/
  I-D.ietf-scitt-architecture:
--- abstract
This document defines protocols for runtime
verification that deployed AI agents behave
according to their declared policies. It also
specifies standardized metrics and a framework
for benchmarking agent performance across
implementations. Behavioral Evidence Tokens
(BETs) extend the Execution Context Token
architecture to provide cryptographically
verifiable proof of policy compliance.
Performance profiles enable objective comparison
of agent capabilities.
--- middle
# Introduction
Autonomous AI agents increasingly operate in
networked environments where they make decisions,
invoke tools, and delegate tasks to other agents.
Operators and relying parties need assurance that
these agents behave according to their declared
policies at runtime, not merely at deployment
time.
{{I-D.nennemann-agent-gap-analysis}} identifies
two critical gaps in the current standards
landscape:
- Gap 1 (Behavioral Verification): Agents
declare policies in their Execution Context
Tokens but no standardized mechanism exists to
verify that runtime behavior matches those
declarations.
- Gap 11 (Performance Benchmarking): No
standardized way exists to compare agent
implementations objectively across dimensions
such as task completion, latency, accuracy,
and safety compliance.
This document addresses both gaps by defining:
1. A behavioral verification architecture
aligned with the Remote Attestation Procedures
(RATS) framework {{RFC9334}}.
2. Behavioral Evidence Tokens (BETs) that extend
the Execution Context Token (ECT)
{{I-D.nennemann-wimse-ect}} with runtime
compliance claims.
3. A performance benchmarking framework with
standard metrics, benchmark profiles, and an
execution protocol.
# Terminology
{::boilerplate bcp14-tagged}
The following terms are used in this document:
Behavioral Attestation:
: The process of generating verifiable evidence
that an agent's runtime actions conform to its
declared policies.
Policy-Behavior Binding:
: A formal linkage between a declared policy in
an agent's ECT and observable runtime actions
that demonstrate compliance with that policy.
Behavioral Evidence Token (BET):
: A signed token containing claims about an
agent's observed runtime behavior relative to
its declared policies. BETs extend the ECT
architecture.
Runtime Monitor:
: A component that observes agent actions and
collects evidence for behavioral attestation.
Benchmark Suite:
: A collection of standardized test scenarios
designed to evaluate agent performance across
defined metrics.
Performance Profile:
: A structured record of benchmark results for
a specific agent implementation.
# Behavioral Verification Architecture
## Verification Model Overview
The behavioral verification architecture aligns
with the RATS {{RFC9334}} roles of Attester,
Verifier, and Relying Party. A Runtime Monitor
collects evidence of agent actions and produces
Behavioral Evidence Tokens.
~~~
+-------------+          +---------+
|    Agent    |--------->| Runtime |
| (Attester)  | actions  | Monitor |
+-------------+          +----+----+
                              |
                          evidence
                              |
                         +----v----+
                         |   BET   |
                         | Creator |
                         +----+----+
                              |
                             BET
                              |
                    +---------v---------+
                    |     Verifier      |
                    |  (Policy Engine)  |
                    +---------+---------+
                              |
                     attestation result
                              |
                    +---------v---------+
                    |   Relying Party   |
                    |  (Orchestrator /  |
                    |     Operator)     |
                    +-------------------+
~~~
{: #fig-arch title="Behavioral Verification Architecture"}
The architecture supports two modes of
operation:
- Continuous Monitoring: The Runtime Monitor
observes all agent actions in real time and
generates BETs at configurable intervals or
upon policy-relevant events.
- Point-in-Time Attestation: A Verifier
requests behavioral evidence for a specific
time window, and the Monitor assembles a BET
covering that period.
## Policy-Behavior Binding
A Policy-Behavior Binding declares the expected
behaviors associated with a policy and the
observable actions that constitute compliance.
The binding is expressed as a JSON object:
~~~json
{
  "policy_id": "urn:example:policy:data-access",
  "version": "1.0",
  "expected_behaviors": [
    {
      "behavior_id": "bhv-001",
      "description": "Agent accesses only authorized data sources",
      "observable_actions": [
        "data_source_access"
      ],
      "compliance_criteria": {
        "type": "allowlist",
        "values": [
          "urn:example:ds:approved-1",
          "urn:example:ds:approved-2"
        ]
      }
    }
  ],
  "evaluation_mode": "continuous"
}
~~~
{: #fig-binding title="Policy-Behavior Binding Structure"}
Each binding MUST include:
- `policy_id`: A URI identifying the policy.
- `expected_behaviors`: An array of behavior
descriptors.
- `evaluation_mode`: Either "continuous" or
"on_demand".
Each behavior descriptor MUST include:
- `behavior_id`: A unique identifier.
- `observable_actions`: Action types the monitor
MUST observe.
- `compliance_criteria`: The conditions under
which the behavior is considered compliant.
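As a non-normative illustration, the "allowlist" criteria type shown in {{fig-binding}} could be evaluated as follows; the helper name and the observed-action record shape are illustrative, not defined by this document:

```python
def evaluate_allowlist(criteria, observed_actions):
    """Evaluate 'allowlist' compliance criteria: the behavior is
    compliant only if every observed action's target URI appears
    in the allowlist.  (Illustrative helper, not normative.)"""
    allowed = set(criteria["values"])
    return all(a["target"] in allowed for a in observed_actions)

criteria = {
    "type": "allowlist",
    "values": ["urn:example:ds:approved-1",
               "urn:example:ds:approved-2"],
}
ok = evaluate_allowlist(
    criteria, [{"target": "urn:example:ds:approved-1"}])
bad = evaluate_allowlist(
    criteria, [{"target": "urn:example:ds:rogue"}])
```

Other criteria types (denylists, rate limits) would follow the same pattern: a pure predicate over the observed actions.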
## Behavioral Evidence Tokens (BET)
A Behavioral Evidence Token is a JSON Web Token
(JWT) {{RFC7519}} signed using JSON Web Signature
(JWS) {{RFC7515}}. BETs extend the ECT claim
set with behavioral verification claims.
The following new claims are defined:
`bhv_policy`:
: REQUIRED. A URI reference to the policy being
verified.
`bhv_result`:
: REQUIRED. The verification result. One of
"pass", "fail", or "partial".
`bhv_evidence`:
: REQUIRED. A base64url-encoded hash (SHA-256)
of the collected observable actions during the
observation window.
`bhv_window`:
: REQUIRED. A JSON object with `start` and
`end` fields containing NumericDate values
(as defined in {{RFC7519}}) representing the
observation period.
`bhv_details`:
: OPTIONAL. An array of per-behavior results
with `behavior_id` and individual `result`
values.
Example BET payload:
~~~json
{
  "iss": "urn:example:monitor:m-001",
  "sub": "urn:example:agent:agent-42",
  "iat": 1700000000,
  "exp": 1700003600,
  "bhv_policy": "urn:example:policy:data-access",
  "bhv_result": "pass",
  "bhv_evidence": "dGhpcyBpcyBhIGhhc2g...",
  "bhv_window": {
    "start": 1699996400,
    "end": 1700000000
  },
  "bhv_details": [
    {
      "behavior_id": "bhv-001",
      "result": "pass"
    }
  ]
}
~~~
{: #fig-bet title="Example BET Payload"}
### BET Lifecycle
The lifecycle of a Behavioral Evidence Token
consists of three phases:
1. Creation: The Runtime Monitor collects
evidence of agent actions, evaluates them
against the Policy-Behavior Binding, and
constructs a BET with the appropriate claims.
The BET is signed by the Monitor's key.
2. Submission: The signed BET is submitted to
the Verifier. Submission MAY occur via a
push model (Monitor sends to Verifier) or a
pull model (Verifier requests from Monitor).
3. Verification: The Verifier validates the BET
signature, checks the claims against its
reference policies, and produces an
attestation result for the Relying Party.
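The creation phase can be sketched as a compact JWS over the BET claims. This non-normative sketch uses HS256 with a shared key purely to stay self-contained; a real Monitor would typically sign with an asymmetric key as implied by "the Monitor's key" above:

```python
import base64
import hashlib
import hmac
import json
import time


def b64url(data: bytes) -> str:
    """base64url without padding, per JWS compact serialization."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def sign_bet(payload: dict, key: bytes) -> str:
    """Build a compact JWS (header.payload.signature) over a BET
    payload.  HS256 is an illustrative choice for this sketch."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, payload)
    )
    sig = hmac.new(key, signing_input.encode(),
                   hashlib.sha256).digest()
    return signing_input + "." + b64url(sig)


now = int(time.time())
bet = sign_bet(
    {
        "iss": "urn:example:monitor:m-001",
        "sub": "urn:example:agent:agent-42",
        "iat": now,
        "exp": now + 3600,
        "bhv_policy": "urn:example:policy:data-access",
        "bhv_result": "pass",
    },
    key=b"demo-shared-secret",
)
```

The Verifier reverses the process: it re-derives the signing input, checks the signature against the Monitor's key, and only then evaluates the claims.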
## Runtime Monitoring Protocol
### Monitor Placement
Runtime Monitors MAY be deployed in one of three
configurations:
Inline:
: The Monitor intercepts all agent
communications as a proxy. This provides
complete visibility but adds latency.
Sidecar:
: The Monitor runs alongside the agent process
and receives copies of all actions via a local
interface. This minimizes latency while
maintaining visibility.
External:
: The Monitor operates as a separate service
that receives action logs asynchronously.
This provides the least overhead but may miss
real-time events.
### Observation Collection
The Monitor MUST maintain a time-ordered log of
observed actions. Each log entry MUST contain:
- Timestamp (NumericDate)
- Action type
- Action target (URI)
- Action parameters (opaque to the Monitor)
- Agent identifier
### Evidence Assembly
When assembling evidence for a BET, the Monitor
MUST:
1. Select all log entries within the observation
window.
2. Compute a SHA-256 hash over the canonical
JSON serialization of the selected entries.
3. Evaluate each entry against the applicable
Policy-Behavior Bindings.
4. Determine the aggregate `bhv_result`.
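Steps 1 and 2 above can be sketched as follows. Since this document does not pin a canonicalization scheme, sorted-key, whitespace-free JSON stands in for a real one (e.g., JCS); the log-entry field names are illustrative:

```python
import base64
import hashlib
import json


def assemble_evidence(log, start, end):
    """Select log entries inside the observation window and hash
    their canonical JSON serialization (sorted keys, no whitespace
    -- an assumption of this sketch).  Returns the base64url digest
    suitable for the bhv_evidence claim."""
    selected = [e for e in log if start <= e["ts"] <= end]
    canonical = json.dumps(selected, sort_keys=True,
                           separators=(",", ":")).encode()
    digest = hashlib.sha256(canonical).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode()


log = [
    {"ts": 1699996500, "action": "data_source_access",
     "target": "urn:example:ds:approved-1"},
    {"ts": 1700000100, "action": "data_source_access",
     "target": "urn:example:ds:approved-2"},  # outside the window
]
evidence = assemble_evidence(log, 1699996400, 1700000000)
```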
### Anomaly Detection Signaling
When the Monitor detects behavior that violates
a Policy-Behavior Binding, it MUST:
1. Generate a BET with `bhv_result` set to
"fail" or "partial".
2. Signal the anomaly to the Verifier
immediately, regardless of the configured
reporting interval.
The Monitor MAY additionally signal the agent's
orchestrator to enable corrective action.
# Performance Benchmarking Framework
## Standard Metrics
The following metrics are defined for agent
performance benchmarking:
Task Completion Rate (TCR):
: The ratio of successfully completed tasks to
total tasks attempted. Unit: percentage (%).
Measured over a complete benchmark suite run.
Task Latency (TL):
: The time elapsed from task assignment to task
completion. Unit: milliseconds (ms).
Reported as p50, p95, and p99 percentiles.
Task Accuracy (TA):
: The degree to which task outputs match
expected results. Unit: percentage (%).
Measured using benchmark-specific evaluation
functions.
Resource Efficiency (RE):
: The computational resources consumed per task.
Unit: normalized resource units (NRU).
Includes CPU, memory, and network I/O.
Safety Compliance Score (SCS):
: The ratio of tasks completed without safety
policy violations to total tasks.
Unit: percentage (%).
Delegation Success Rate (DSR):
: The ratio of successful delegations to total
delegation attempts. Unit: percentage (%).
Applicable only to multi-agent scenarios.
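A non-normative sketch of computing TCR, SCS, and the TL percentiles from per-task records follows; the record shape and the nearest-rank percentile method are assumptions of this sketch, not requirements:

```python
import math


def percentile(sorted_vals, p):
    """Nearest-rank percentile over a pre-sorted list."""
    k = max(0, math.ceil(p / 100 * len(sorted_vals)) - 1)
    return sorted_vals[k]


def summarize(tasks):
    """Compute TCR, SCS, and TL (p50/p95/p99) from per-task
    records; field names are illustrative."""
    n = len(tasks)
    tcr = 100.0 * sum(t["completed"] for t in tasks) / n
    scs = 100.0 * sum(t["completed"] and not t["violation"]
                      for t in tasks) / n
    lat = sorted(t["latency_ms"] for t in tasks)
    return {
        "TCR": tcr,
        "SCS": scs,
        "TL": {"p50": percentile(lat, 50),
               "p95": percentile(lat, 95),
               "p99": percentile(lat, 99)},
    }


tasks = [
    {"completed": True,  "violation": False, "latency_ms": 120},
    {"completed": True,  "violation": True,  "latency_ms": 340},
    {"completed": False, "violation": False, "latency_ms": 900},
    {"completed": True,  "violation": False, "latency_ms": 150},
]
report = summarize(tasks)
```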
## Benchmark Profiles
A Benchmark Profile defines a standardized set
of test scenarios for a specific agent category.
Profiles are expressed as JSON objects:
~~~json
{
  "profile_id": "urn:ietf:bench:general-v1",
  "profile_name": "General Agent Benchmark",
  "version": "1.0",
  "agent_category": "general-purpose",
  "scenarios": [
    {
      "scenario_id": "s-001",
      "description": "Simple data retrieval",
      "difficulty": "basic",
      "metrics": ["TCR", "TL", "TA"],
      "timeout_ms": 30000,
      "expected_output_schema": "..."
    }
  ],
  "scoring": {
    "weights": {
      "TCR": 0.3,
      "TL": 0.2,
      "TA": 0.3,
      "SCS": 0.2
    }
  }
}
~~~
{: #fig-profile title="Benchmark Profile Structure"}
Predefined profiles SHOULD be registered for
common agent types including:
- General-purpose agents
- Code generation agents
- Data analysis agents
- Network management agents
## Benchmark Execution Protocol
### Test Harness Requirements
A conformant test harness MUST:
1. Execute all scenarios in the benchmark
profile in a controlled environment.
2. Isolate agent instances from external
resources not specified in the scenario.
3. Record all metrics defined in the profile.
4. Produce a benchmark result document.
### Result Reporting Format
Benchmark results MUST be reported as a JSON
object containing:
- `profile_id`: The benchmark profile used.
- `agent_id`: Identifier of the tested agent.
- `timestamp`: Time of benchmark execution.
- `results`: Per-scenario metric values.
- `aggregate`: Weighted aggregate scores.
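Assembling such a result document with the weighted aggregate from the profile's `scoring` object could look like this; how each raw metric is normalized to a 0-100 score is profile-specific, and the values here are illustrative:

```python
def aggregate_score(per_metric, weights):
    """Weighted sum over normalized (0-100) metric scores, using
    the weights from the profile's 'scoring' object."""
    return sum(per_metric[m] * w for m, w in weights.items())


weights = {"TCR": 0.3, "TL": 0.2, "TA": 0.3, "SCS": 0.2}
per_metric = {"TCR": 92.0, "TL": 75.0, "TA": 88.0, "SCS": 100.0}

result = {
    "profile_id": "urn:ietf:bench:general-v1",
    "agent_id": "urn:example:agent:agent-42",
    "timestamp": 1700000000,
    "results": per_metric,
    "aggregate": round(aggregate_score(per_metric, weights), 2),
}
```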
### Anti-Gaming Provisions
To prevent agents from gaming benchmark results,
the following provisions apply:
1. Randomized Scenarios: Test harnesses MUST
randomize scenario ordering and MAY
introduce minor variations in scenario
parameters.
2. Blind Evaluation: The agent under test
MUST NOT have access to the expected
outputs or evaluation functions.
3. Holdback Scenarios: Benchmark profiles
SHOULD include scenarios not disclosed to
agent developers.
4. Temporal Variation: Repeated benchmark
runs MUST vary timing to prevent
memoization attacks.
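Provision 1 can be sketched as a per-run shuffle; keeping the seed fresh and unpublished per run makes the ordering unpredictable to the agent under test (the function name and seed handling are illustrative):

```python
import random


def randomized_run_order(scenarios, seed=None):
    """Return a shuffled copy of the scenario list for one
    benchmark run.  A fresh, unpublished seed per run keeps the
    ordering unpredictable to the agent under test."""
    rng = random.Random(seed)
    order = list(scenarios)
    rng.shuffle(order)
    return order


scenarios = ["s-001", "s-002", "s-003", "s-004"]
run = randomized_run_order(scenarios, seed=7)
```

Parameter variation (Provision 1's second clause) would similarly draw minor perturbations from the same per-run RNG so that runs remain reproducible by the harness operator but not by the agent.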
## Performance Claims in ECT
Agent ECTs MAY include performance attestation
claims in the `ext` field:
`perf_profile`:
: The benchmark profile identifier.
`perf_score`:
: The aggregate benchmark score.
`perf_timestamp`:
: The time of the benchmark execution.
`perf_harness`:
: Identifier of the test harness that produced
the results.
These claims allow relying parties to evaluate
agent capability before delegation.
# Integration with ECT
Behavioral Evidence Tokens integrate into the
ECT DAG defined in
{{I-D.nennemann-agent-dag-hitl-safety}} as
follows:
1. Each BET references the ECT of the agent
whose behavior was verified via the `sub`
claim.
2. BETs are attached as child nodes in the
ECT DAG, linked to the agent's execution
node.
3. When an agent delegates to a sub-agent,
the delegating agent's BET chain includes
evidence covering the delegation decision.
4. Verifiers traversing the DAG can inspect
BETs at each node to assess behavioral
compliance across the entire execution
chain.
~~~
+----------+     +----------+
|   ECT    |---->|   ECT    |
| Agent A  |     | Agent B  |
+----+-----+     +----+-----+
     |                |
+----v-----+     +----v-----+
|   BET    |     |   BET    |
| Agent A  |     | Agent B  |
+----------+     +----------+
~~~
{: #fig-dag title="BET Integration in ECT DAG"}
This structure enables end-to-end behavioral
verification across multi-agent workflows.
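A Verifier's DAG traversal (Step 4 above) reduces to a graph walk that collects any non-passing BET along the way. This is a non-normative sketch; the dict-based DAG encoding is illustrative, as the actual encoding is defined in the referenced drafts:

```python
def verify_chain(dag, root):
    """Walk the ECT DAG from the root execution node and collect
    (node, result) pairs for every attached BET whose bhv_result
    is not 'pass'."""
    failures = []
    stack = [root]
    while stack:
        node = stack.pop()
        bet = dag[node].get("bet")
        if bet and bet["bhv_result"] != "pass":
            failures.append((node, bet["bhv_result"]))
        stack.extend(dag[node].get("children", []))
    return failures


dag = {
    "agent-a": {"bet": {"bhv_result": "pass"},
                "children": ["agent-b"]},
    "agent-b": {"bet": {"bhv_result": "partial"},
                "children": []},
}
failures = verify_chain(dag, "agent-a")
```

An empty return value means every node in the execution chain carried a passing BET.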
# Security Considerations
## Adversarial Behavior
Agents may attempt to behave correctly only when
they detect monitoring. Mitigations include:
- Unpredictable monitoring intervals
- Covert observation modes where the agent is
not informed of monitor presence
- Cross-referencing BETs with external audit
logs
## Monitor Compromise
A compromised Runtime Monitor could produce
fraudulent BETs. Mitigations include:
- Monitor attestation using RATS {{RFC9334}}
- Multiple independent monitors with
cross-validation
- Transparency logs for BETs, aligned with
SCITT {{I-D.ietf-scitt-architecture}}
## Benchmark Manipulation
Agents or their operators may attempt to
manipulate benchmark results. The anti-gaming
provisions in Section 4.3.3 address this risk.
Additionally:
- Benchmark harnesses MUST be operated by
independent parties.
- Results MUST be signed by the harness
operator.
- Benchmark profiles MUST be versioned and
immutable once published.
## Privacy of Behavioral Evidence
BETs contain information about agent actions
that may be sensitive. Implementations MUST:
- Minimize the detail in `bhv_evidence` to
what is necessary for verification.
- Support selective disclosure where possible.
- Protect BETs in transit using a secure
transport (e.g., HTTPS {{RFC9110}}).
- Define retention policies for behavioral
evidence.
# IANA Considerations
## ECT Extension Claim Keys
This document requests registration of the
following claim keys in the ECT `ext` claims
registry:
| Claim Key | Description |
|:---------------|:---------------------------|
| bhv_policy | Policy URI reference |
| bhv_result | Verification result |
| bhv_evidence | Observed actions hash |
| bhv_window | Observation period |
| bhv_details | Per-behavior results |
| perf_profile | Benchmark profile ID |
| perf_score | Aggregate benchmark score |
| perf_timestamp | Benchmark execution time |
| perf_harness | Test harness identifier |
{: #tbl-claims title="ECT Extension Claims for Behavioral Verification"}
## Benchmark Profile Media Type
This document requests registration of the
following media type:
Type name: application
Subtype name: agent-benchmark-profile+json
Required parameters: N/A
Optional parameters: N/A
Encoding considerations: binary (UTF-8 JSON)
Security considerations: See Section 6
--- back
# Acknowledgments
{:numbered="false"}
The author thanks the contributors to the NMOP
working group for discussions on agent
operational requirements.