Agent Behavioral Verification and Performance Benchmarking

Internet-Draft	Agent Behavioral Verification	March 2026
Nennemann	Expires 7 September 2026	[Page]

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶

This Internet-Draft will expire on 7 September 2026.¶

Copyright Notice

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶

1. Introduction

Autonomous AI agents increasingly operate in networked environments where they make decisions, invoke tools, and delegate tasks to other agents. Operators and relying parties need assurance that these agents behave according to their declared policies at runtime, not merely at deployment time.¶

[I-D.nennemann-agent-gap-analysis] identifies two critical gaps in the current standards landscape:¶

Gap 1 (Behavioral Verification): Agents declare policies in their Execution Context Tokens but no standardized mechanism exists to verify that runtime behavior matches those declarations.¶
Gap 11 (Performance Benchmarking): No standardized way exists to compare agent implementations objectively across dimensions such as task completion, latency, accuracy, and safety compliance.¶

This document addresses both gaps by defining:¶

A behavioral verification architecture aligned with the Remote Attestation Procedures (RATS) framework [RFC9334].¶
Behavioral Evidence Tokens (BETs) that extend the Execution Context Token (ECT) [I-D.nennemann-wimse-ect] with runtime compliance claims.¶
A performance benchmarking framework with standard metrics, benchmark profiles, and an execution protocol.¶

2. Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶

The following terms are used in this document:¶

Behavioral Attestation:: The process of generating verifiable evidence that an agent's runtime actions conform to its declared policies.¶
Policy-Behavior Binding:: A formal linkage between a declared policy in an agent's ECT and observable runtime actions that demonstrate compliance with that policy.¶
Behavioral Evidence Token (BET):: A signed token containing claims about an agent's observed runtime behavior relative to its declared policies. BETs extend the ECT architecture.¶
Runtime Monitor:: A component that observes agent actions and collects evidence for behavioral attestation.¶
Benchmark Suite:: A collection of standardized test scenarios designed to evaluate agent performance across defined metrics.¶
Performance Profile:: A structured record of benchmark results for a specific agent implementation.¶

3. Behavioral Verification Architecture

3.1. Verification Model Overview

The behavioral verification architecture aligns with the RATS [RFC9334] roles of Attester, Verifier, and Relying Party. A Runtime Monitor collects evidence of agent actions and produces Behavioral Evidence Tokens.¶

+-------------+       +---------+
|   Agent     |------>| Runtime |
| (Attester)  |actions| Monitor |
+-------------+       +----+----+
                            |
                       evidence
                            |
                       +----v----+
                       |   BET   |
                       | Creator |
                       +----+----+
                            |
                          BET
                            |
                  +---------v---------+
                  |     Verifier      |
                  | (Policy Engine)   |
                  +---------+---------+
                            |
                    attestation result
                            |
                  +---------v---------+
                  |   Relying Party   |
                  | (Orchestrator /   |
                  |  Operator)        |
                  +-------------------+

Figure 1: Behavioral Verification Architecture

The architecture supports two modes of operation:¶

Continuous Monitoring: The Runtime Monitor observes all agent actions in real time and generates BETs at configurable intervals or upon policy-relevant events.¶
Point-in-Time Attestation: A Verifier requests behavioral evidence for a specific time window, and the Monitor assembles a BET covering that period.¶

3.2. Policy-Behavior Binding

A Policy-Behavior Binding declares the expected behaviors associated with a policy and the observable actions that constitute compliance.¶

The binding is expressed as a JSON object:¶

{
  "policy_id": "urn:example:policy:data-access",
  "version": "1.0",
  "expected_behaviors": [
    {
      "behavior_id": "bhv-001",
      "description": "Agent accesses only
        authorized data sources",
      "observable_actions": [
        "data_source_access"
      ],
      "compliance_criteria": {
        "type": "allowlist",
        "values": [
          "urn:example:ds:approved-1",
          "urn:example:ds:approved-2"
        ]
      }
    }
  ],
  "evaluation_mode": "continuous"
}

Figure 2: Policy-Behavior Binding Structure

Each binding MUST include:¶

policy_id: A URI identifying the policy.¶
expected_behaviors: An array of behavior descriptors.¶
evaluation_mode: Either "continuous" or "on_demand".¶

Each behavior descriptor MUST include:¶

behavior_id: A unique identifier.¶
observable_actions: Action types the monitor MUST observe.¶
compliance_criteria: The conditions under which the behavior is considered compliant.¶

3.3. Behavioral Evidence Tokens (BET)

A Behavioral Evidence Token is a JSON Web Token (JWT) [RFC7519] signed using JSON Web Signature (JWS) [RFC7515]. BETs extend the ECT claim set with behavioral verification claims.¶

The following new claims are defined:¶

bhv_policy:: REQUIRED. A URI reference to the policy being verified.¶
bhv_result:: REQUIRED. The verification result. One of "pass", "fail", or "partial".¶
bhv_evidence:: REQUIRED. A base64url-encoded hash (SHA-256) of the collected observable actions during the observation window.¶
bhv_window:: REQUIRED. A JSON object with start and end fields containing NumericDate values (as defined in [RFC7519]) representing the observation period.¶
bhv_details:: OPTIONAL. An array of per-behavior results with behavior_id and individual result values.¶

Example BET payload:¶

{
  "iss": "urn:example:monitor:m-001",
  "sub": "urn:example:agent:agent-42",
  "iat": 1700000000,
  "exp": 1700003600,
  "bhv_policy": "urn:example:policy:data-access",
  "bhv_result": "pass",
  "bhv_evidence": "dGhpcyBpcyBhIGhhc2g...",
  "bhv_window": {
    "start": 1699996400,
    "end": 1700000000
  },
  "bhv_details": [
    {
      "behavior_id": "bhv-001",
      "result": "pass"
    }
  ]
}

Figure 3: Example BET Payload

3.3.1. BET Lifecycle

The lifecycle of a Behavioral Evidence Token consists of three phases:¶

Creation: The Runtime Monitor collects evidence of agent actions, evaluates them against the Policy-Behavior Binding, and constructs a BET with the appropriate claims. The BET is signed by the Monitor's key.¶
Submission: The signed BET is submitted to the Verifier. Submission MAY occur via a push model (Monitor sends to Verifier) or a pull model (Verifier requests from Monitor).¶
Verification: The Verifier validates the BET signature, checks the claims against its reference policies, and produces an attestation result for the Relying Party.¶

3.4. Runtime Monitoring Protocol

3.4.1. Monitor Placement

Runtime Monitors MAY be deployed in one of three configurations:¶

Inline:: The Monitor intercepts all agent communications as a proxy. This provides complete visibility but adds latency.¶
Sidecar:: The Monitor runs alongside the agent process and receives copies of all actions via a local interface. This minimizes latency while maintaining visibility.¶
External:: The Monitor operates as a separate service that receives action logs asynchronously. This provides the least overhead but may miss real-time events.¶

3.4.2. Observation Collection

The Monitor MUST maintain a time-ordered log of observed actions. Each log entry MUST contain:¶

Timestamp (NumericDate)¶
Action type¶
Action target (URI)¶
Action parameters (opaque to the Monitor)¶
Agent identifier¶

3.4.3. Evidence Assembly

When assembling evidence for a BET, the Monitor MUST:¶

Select all log entries within the observation window.¶
Compute a SHA-256 hash over the canonical JSON serialization of the selected entries.¶
Evaluate each entry against the applicable Policy-Behavior Bindings.¶
Determine the aggregate bhv_result.¶

3.4.4. Anomaly Detection Signaling

When the Monitor detects behavior that violates a Policy-Behavior Binding, it MUST:¶

Generate a BET with bhv_result set to "fail" or "partial".¶
Signal the anomaly to the Verifier immediately, regardless of the configured reporting interval.¶
Optionally signal the agent's orchestrator to enable corrective action.¶

4. Performance Benchmarking Framework

4.1. Standard Metrics

The following metrics are defined for agent performance benchmarking:¶

Task Completion Rate (TCR):: The ratio of successfully completed tasks to total tasks attempted. Unit: percentage (%). Measured over a complete benchmark suite run.¶
Task Latency (TL):: The time elapsed from task assignment to task completion. Unit: milliseconds (ms). Reported as p50, p95, and p99 percentiles.¶
Task Accuracy (TA):: The degree to which task outputs match expected results. Unit: percentage (%). Measured using benchmark-specific evaluation functions.¶
Resource Efficiency (RE):: The computational resources consumed per task. Unit: normalized resource units (NRU). Includes CPU, memory, and network I/O.¶
Safety Compliance Score (SCS):: The ratio of tasks completed without safety policy violations to total tasks. Unit: percentage (%).¶
Delegation Success Rate (DSR):: The ratio of successful delegations to total delegation attempts. Unit: percentage (%). Applicable only to multi-agent scenarios.¶

4.2. Benchmark Profiles

A Benchmark Profile defines a standardized set of test scenarios for a specific agent category. Profiles are expressed as JSON objects:¶

{
  "profile_id": "urn:ietf:bench:general-v1",
  "profile_name": "General Agent Benchmark",
  "version": "1.0",
  "agent_category": "general-purpose",
  "scenarios": [
    {
      "scenario_id": "s-001",
      "description": "Simple data retrieval",
      "difficulty": "basic",
      "metrics": ["TCR", "TL", "TA"],
      "timeout_ms": 30000,
      "expected_output_schema": "..."
    }
  ],
  "scoring": {
    "weights": {
      "TCR": 0.3,
      "TL": 0.2,
      "TA": 0.3,
      "SCS": 0.2
    }
  }
}

Figure 4: Benchmark Profile Structure

Predefined profiles SHOULD be registered for common agent types including:¶

General-purpose agents¶
Code generation agents¶
Data analysis agents¶
Network management agents¶

4.3. Benchmark Execution Protocol

4.3.1. Test Harness Requirements

A conformant test harness MUST:¶

Execute all scenarios in the benchmark profile in a controlled environment.¶
Isolate agent instances from external resources not specified in the scenario.¶
Record all metrics defined in the profile.¶
Produce a benchmark result document.¶

4.3.2. Result Reporting Format

Benchmark results MUST be reported as a JSON object containing:¶

profile_id: The benchmark profile used.¶
agent_id: Identifier of the tested agent.¶
timestamp: Time of benchmark execution.¶
results: Per-scenario metric values.¶
aggregate: Weighted aggregate scores.¶

4.3.3. Anti-Gaming Provisions

To prevent agents from gaming benchmark results, the following provisions apply:¶

Randomized Scenarios: Test harnesses MUST randomize scenario ordering and MAY introduce minor variations in scenario parameters.¶
Blind Evaluation: The agent under test MUST NOT have access to the expected outputs or evaluation functions.¶
Holdback Scenarios: Benchmark profiles SHOULD include scenarios not disclosed to agent developers.¶
Temporal Variation: Repeated benchmark runs MUST vary timing to prevent memoization attacks.¶

4.4. Performance Claims in ECT

Agent ECTs MAY include performance attestation claims in the ext field:¶

perf_profile:: The benchmark profile identifier.¶
perf_score:: The aggregate benchmark score.¶
perf_timestamp:: The time of the benchmark execution.¶
perf_harness:: Identifier of the test harness that produced the results.¶

These claims allow relying parties to evaluate agent capability before delegation.¶

5. Integration with ECT

Behavioral Evidence Tokens integrate into the ECT DAG defined in [I-D.nennemann-agent-dag-hitl-safety] as follows:¶

Each BET references the ECT of the agent whose behavior was verified via the sub claim.¶
BETs are attached as child nodes in the ECT DAG, linked to the agent's execution node.¶
When an agent delegates to a sub-agent, the delegating agent's BET chain includes evidence covering the delegation decision.¶
Verifiers traversing the DAG can inspect BETs at each node to assess behavioral compliance across the entire execution chain.¶

+----------+     +----------+
|  ECT     |---->|  ECT     |
| Agent A  |     | Agent B  |
+----+-----+     +----+-----+
     |                |
+----v-----+     +----v-----+
| BET      |     | BET      |
| Agent A  |     | Agent B  |
+----------+     +----------+

Figure 5: BET Integration in ECT DAG

This structure enables end-to-end behavioral verification across multi-agent workflows.¶

6. Security Considerations

6.1. Adversarial Behavior

Agents MAY attempt to behave correctly only when they detect monitoring. Mitigations include:¶

Unpredictable monitoring intervals¶
Covert observation modes where the agent is not informed of monitor presence¶
Cross-referencing BETs with external audit logs¶

6.2. Monitor Compromise

A compromised Runtime Monitor could produce fraudulent BETs. Mitigations include:¶

Monitor attestation using RATS [RFC9334]¶
Multiple independent monitors with cross-validation¶
Transparency logs for BETs, aligned with SCITT [I-D.ietf-scitt-architecture]¶

6.3. Benchmark Manipulation

Agents or their operators MAY attempt to manipulate benchmark results. The anti-gaming provisions in Section 4.3.3 address this risk. Additionally:¶

Benchmark harnesses MUST be operated by independent parties.¶
Results MUST be signed by the harness operator.¶
Benchmark profiles MUST be versioned and immutable once published.¶

6.4. Privacy of Behavioral Evidence

BETs contain information about agent actions that may be sensitive. Implementations MUST:¶

Minimize the detail in bhv_evidence to what is necessary for verification.¶
Support selective disclosure where possible.¶
Protect BETs in transit using TLS ([RFC9110]).¶
Define retention policies for behavioral evidence.¶

7. IANA Considerations

7.1. ECT Extension Claim Keys

This document requests registration of the following claim keys in the ECT ext claims registry:¶

Table 1: ECT Extension Claims for Behavioral Verification
Claim Key	Description
bhv_policy	Policy URI reference
bhv_result	Verification result
bhv_evidence	Observed actions hash
bhv_window	Observation period
bhv_details	Per-behavior results
perf_profile	Benchmark profile ID
perf_score	Aggregate benchmark score
perf_timestamp	Benchmark execution time
perf_harness	Test harness identifier

7.2. Benchmark Profile Media Type

This document requests registration of the following media type:¶

Type name: application¶

Subtype name: agent-benchmark-profile+json¶

Required parameters: N/A¶

Optional parameters: N/A¶

Encoding considerations: binary (UTF-8 JSON)¶

Security considerations: See Section 6¶

8. References

8.1. Normative References

[I-D.nennemann-agent-dag-hitl-safety]: "Agent Context Policy Token: DAG Delegation with Human Override", n.d., <https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/>.
[I-D.nennemann-wimse-ect]: "Execution Context Tokens for Distributed Agentic Workflows", n.d., <https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/>.
[RFC2119]: Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/rfc/rfc2119>.
[RFC7515]: Jones, M., Bradley, J., and N. Sakimura, "JSON Web Signature (JWS)", RFC 7515, DOI 10.17487/RFC7515, May 2015, <https://www.rfc-editor.org/rfc/rfc7515>.
[RFC7519]: Jones, M., Bradley, J., and N. Sakimura, "JSON Web Token (JWT)", RFC 7519, DOI 10.17487/RFC7519, May 2015, <https://www.rfc-editor.org/rfc/rfc7519>.
[RFC8174]: Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017, <https://www.rfc-editor.org/rfc/rfc8174>.
[RFC9334]: Birkholz, H., Thaler, D., Richardson, M., Smith, N., and W. Pan, "Remote ATtestation procedureS (RATS) Architecture", RFC 9334, DOI 10.17487/RFC9334, January 2023, <https://www.rfc-editor.org/rfc/rfc9334>.

8.2. Informative References

[I-D.ietf-scitt-architecture]: Birkholz, H., Delignat-Lavaud, A., Fournet, C., Deshpande, Y., and S. Lasker, "An Architecture for Trustworthy and Transparent Digital Supply Chains", Work in Progress, Internet-Draft, draft-ietf-scitt-architecture-22, 10 October 2025, <https://datatracker.ietf.org/doc/html/draft-ietf-scitt-architecture-22>.
[I-D.nennemann-agent-gap-analysis]: "Gap Analysis for Autonomous Agent Protocols", n.d., <https://datatracker.ietf.org/doc/draft-nennemann-agent-gap-analysis/>.
[RFC9110]: Fielding, R., Ed., Nottingham, M., Ed., and J. Reschke, Ed., "HTTP Semantics", STD 97, RFC 9110, DOI 10.17487/RFC9110, June 2022, <https://www.rfc-editor.org/rfc/rfc9110>.

Acknowledgments

The author thanks the contributors to the NMOP working group for discussions on agent operational requirements.¶

Author's Address

Christian Nennemann

Independent Researcher

Email: ietf@nennemann.de