feat: add draft data, gap analysis report, and workspace config
---
title: >
  Agent Behavioral Verification and
  Performance Benchmarking
abbrev: "Agent Behavioral Verification"
category: std
docname: draft-nennemann-agent-behavioral-verification-00
area: "OPS"
workgroup: "NMOP"
submissiontype: IETF
v: 3

author:
 - fullname: Christian Nennemann
   organization: Independent Researcher
   email: ietf@nennemann.de

normative:
  RFC2119:
  RFC8174:
  RFC9334:
  RFC7519:
  RFC7515:
  I-D.nennemann-wimse-ect:
    title: "Execution Context Tokens for Distributed Agentic Workflows"
    target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/
  I-D.nennemann-agent-dag-hitl-safety:
    title: "Agent Context Policy Token: DAG Delegation with Human Override"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/

informative:
  RFC9110:
  I-D.nennemann-agent-gap-analysis:
    title: "Gap Analysis for Autonomous Agent Protocols"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-gap-analysis/
  I-D.ietf-scitt-architecture:

--- abstract

This document defines protocols for runtime
verification that deployed AI agents behave
according to their declared policies. It also
specifies standardized metrics and a framework
for benchmarking agent performance across
implementations. Behavioral Evidence Tokens
(BETs) extend the Execution Context Token
architecture to provide cryptographically
verifiable proof of policy compliance.
Performance profiles enable objective comparison
of agent capabilities.

--- middle

# Introduction

Autonomous AI agents increasingly operate in
networked environments where they make decisions,
invoke tools, and delegate tasks to other agents.
Operators and relying parties need assurance that
these agents behave according to their declared
policies at runtime, not merely at deployment
time.

{{I-D.nennemann-agent-gap-analysis}} identifies
two critical gaps in the current standards
landscape:

- Gap 1 (Behavioral Verification): Agents
  declare policies in their Execution Context
  Tokens but no standardized mechanism exists to
  verify that runtime behavior matches those
  declarations.

- Gap 11 (Performance Benchmarking): No
  standardized way exists to compare agent
  implementations objectively across dimensions
  such as task completion, latency, accuracy,
  and safety compliance.

This document addresses both gaps by defining:

1. A behavioral verification architecture
   aligned with the Remote Attestation Procedures
   (RATS) framework {{RFC9334}}.

2. Behavioral Evidence Tokens (BETs) that extend
   the Execution Context Token (ECT)
   {{I-D.nennemann-wimse-ect}} with runtime
   compliance claims.

3. A performance benchmarking framework with
   standard metrics, benchmark profiles, and an
   execution protocol.

# Terminology

{::boilerplate bcp14-tagged}

The following terms are used in this document:

Behavioral Attestation:
: The process of generating verifiable evidence
that an agent's runtime actions conform to its
declared policies.

Policy-Behavior Binding:
: A formal linkage between a declared policy in
an agent's ECT and observable runtime actions
that demonstrate compliance with that policy.

Behavioral Evidence Token (BET):
: A signed token containing claims about an
agent's observed runtime behavior relative to
its declared policies. BETs extend the ECT
architecture.

Runtime Monitor:
: A component that observes agent actions and
collects evidence for behavioral attestation.

Benchmark Suite:
: A collection of standardized test scenarios
designed to evaluate agent performance across
defined metrics.

Performance Profile:
: A structured record of benchmark results for
a specific agent implementation.

# Behavioral Verification Architecture

## Verification Model Overview

The behavioral verification architecture aligns
with the RATS {{RFC9334}} roles of Attester,
Verifier, and Relying Party. A Runtime Monitor
collects evidence of agent actions and produces
Behavioral Evidence Tokens.

~~~
+-------------+       +---------+
|    Agent    |------>| Runtime |
| (Attester)  |actions| Monitor |
+-------------+       +----+----+
                           |
                       evidence
                           |
                      +----v----+
                      |   BET   |
                      | Creator |
                      +----+----+
                           |
                          BET
                           |
                 +---------v---------+
                 |     Verifier      |
                 |  (Policy Engine)  |
                 +---------+---------+
                           |
                  attestation result
                           |
                 +---------v---------+
                 |   Relying Party   |
                 |  (Orchestrator /  |
                 |    Operator)      |
                 +-------------------+
~~~
{: #fig-arch title="Behavioral Verification Architecture"}

The architecture supports two modes of
operation:

- Continuous Monitoring: The Runtime Monitor
  observes all agent actions in real time and
  generates BETs at configurable intervals or
  upon policy-relevant events.

- Point-in-Time Attestation: A Verifier
  requests behavioral evidence for a specific
  time window, and the Monitor assembles a BET
  covering that period.

## Policy-Behavior Binding

A Policy-Behavior Binding declares the expected
behaviors associated with a policy and the
observable actions that constitute compliance.

The binding is expressed as a JSON object:

~~~json
{
  "policy_id": "urn:example:policy:data-access",
  "version": "1.0",
  "expected_behaviors": [
    {
      "behavior_id": "bhv-001",
      "description": "Agent accesses only authorized data sources",
      "observable_actions": [
        "data_source_access"
      ],
      "compliance_criteria": {
        "type": "allowlist",
        "values": [
          "urn:example:ds:approved-1",
          "urn:example:ds:approved-2"
        ]
      }
    }
  ],
  "evaluation_mode": "continuous"
}
~~~
{: #fig-binding title="Policy-Behavior Binding Structure"}

Each binding MUST include:

- `policy_id`: A URI identifying the policy.
- `expected_behaviors`: An array of behavior
  descriptors.
- `evaluation_mode`: Either "continuous" or
  "on_demand".

Each behavior descriptor MUST include:

- `behavior_id`: A unique identifier.
- `observable_actions`: Action types the monitor
  MUST observe.
- `compliance_criteria`: The conditions under
  which the behavior is considered compliant.
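
As a non-normative illustration, the allowlist evaluation implied by the
binding above can be sketched as follows. The `check_action` helper and
the action-record shape (`type` and `target` fields) are assumptions for
this sketch, not part of this specification:

~~~python
def check_action(binding: dict, action: dict) -> str:
    # Return "pass" or "fail" for one observed action evaluated
    # against every behavior descriptor that covers its action type.
    for behavior in binding["expected_behaviors"]:
        if action["type"] not in behavior["observable_actions"]:
            continue  # this behavior does not cover the action type
        criteria = behavior["compliance_criteria"]
        if criteria["type"] == "allowlist" \
                and action["target"] not in criteria["values"]:
            return "fail"
    return "pass"

binding = {
    "policy_id": "urn:example:policy:data-access",
    "expected_behaviors": [{
        "behavior_id": "bhv-001",
        "observable_actions": ["data_source_access"],
        "compliance_criteria": {
            "type": "allowlist",
            "values": ["urn:example:ds:approved-1",
                       "urn:example:ds:approved-2"]}}],
    "evaluation_mode": "continuous",
}
ok = {"type": "data_source_access", "target": "urn:example:ds:approved-1"}
bad = {"type": "data_source_access", "target": "urn:example:ds:rogue"}
~~~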

## Behavioral Evidence Tokens (BET)

A Behavioral Evidence Token is a JSON Web Token
(JWT) {{RFC7519}} signed using JSON Web Signature
(JWS) {{RFC7515}}. BETs extend the ECT claim
set with behavioral verification claims.

The following new claims are defined:

`bhv_policy`:
: REQUIRED. A URI reference to the policy being
verified.

`bhv_result`:
: REQUIRED. The verification result. One of
"pass", "fail", or "partial".

`bhv_evidence`:
: REQUIRED. A base64url-encoded hash (SHA-256)
of the collected observable actions during the
observation window.

`bhv_window`:
: REQUIRED. A JSON object with `start` and
`end` fields containing NumericDate values
(as defined in {{RFC7519}}) representing the
observation period.

`bhv_details`:
: OPTIONAL. An array of per-behavior results
with `behavior_id` and individual `result`
values.

Example BET payload:

~~~json
{
  "iss": "urn:example:monitor:m-001",
  "sub": "urn:example:agent:agent-42",
  "iat": 1700000000,
  "exp": 1700003600,
  "bhv_policy": "urn:example:policy:data-access",
  "bhv_result": "pass",
  "bhv_evidence": "dGhpcyBpcyBhIGhhc2g...",
  "bhv_window": {
    "start": 1699996400,
    "end": 1700000000
  },
  "bhv_details": [
    {
      "behavior_id": "bhv-001",
      "result": "pass"
    }
  ]
}
~~~
{: #fig-bet title="Example BET Payload"}

### BET Lifecycle

The lifecycle of a Behavioral Evidence Token
consists of three phases:

1. Creation: The Runtime Monitor collects
   evidence of agent actions, evaluates them
   against the Policy-Behavior Binding, and
   constructs a BET with the appropriate claims.
   The BET is signed by the Monitor's key.

2. Submission: The signed BET is submitted to
   the Verifier. Submission MAY occur via a
   push model (Monitor sends to Verifier) or a
   pull model (Verifier requests from Monitor).

3. Verification: The Verifier validates the BET
   signature, checks the claims against its
   reference policies, and produces an
   attestation result for the Relying Party.
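
The Creation and Verification phases above can be sketched, non-normatively,
with the JWS compact serialization of {{RFC7515}}. The sketch uses symmetric
HS256 for brevity; a real Monitor would typically sign with an asymmetric
algorithm such as ES256. The key value and helper names are illustrative:

~~~python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> str:
    # Base64url without padding, as used by the JWS compact form.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def create_bet(claims: dict, key: bytes) -> str:
    # Phase 1 (Creation): serialize and sign the claims as a compact JWS.
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"},
                               separators=(",", ":")).encode())
    payload = b64url(json.dumps(claims, separators=(",", ":")).encode())
    signing_input = header + "." + payload
    sig = hmac.new(key, signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + b64url(sig)

def verify_bet(token: str, key: bytes) -> dict:
    # Phase 3 (Verification): check the signature, then the claims.
    signing_input, _, sig = token.rpartition(".")
    expected = hmac.new(key, signing_input.encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(b64url(expected), sig):
        raise ValueError("BET signature invalid")
    payload = signing_input.split(".")[1]
    claims = json.loads(base64.urlsafe_b64decode(
        payload + "=" * (-len(payload) % 4)))
    if claims.get("bhv_result") not in ("pass", "fail", "partial"):
        raise ValueError("invalid bhv_result claim")
    return claims

key = b"demo-shared-secret"  # illustrative only, not a deployment key
bet = create_bet({"iss": "urn:example:monitor:m-001",
                  "sub": "urn:example:agent:agent-42",
                  "bhv_policy": "urn:example:policy:data-access",
                  "bhv_result": "pass"}, key)
claims = verify_bet(bet, key)
~~~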

## Runtime Monitoring Protocol

### Monitor Placement

Runtime Monitors MAY be deployed in one of three
configurations:

Inline:
: The Monitor intercepts all agent
communications as a proxy. This provides
complete visibility but adds latency.

Sidecar:
: The Monitor runs alongside the agent process
and receives copies of all actions via a local
interface. This minimizes latency while
maintaining visibility.

External:
: The Monitor operates as a separate service
that receives action logs asynchronously.
This provides the least overhead but may miss
real-time events.

### Observation Collection

The Monitor MUST maintain a time-ordered log of
observed actions. Each log entry MUST contain:

- Timestamp (NumericDate)
- Action type
- Action target (URI)
- Action parameters (opaque to the Monitor)
- Agent identifier
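
As a non-normative sketch, a log entry carrying the five required items
might be modeled as follows. The field names and the example action type
`tool_invocation` are illustrative assumptions:

~~~python
from dataclasses import dataclass, field

@dataclass(order=True)
class LogEntry:
    # The document mandates only the information content of each entry;
    # the concrete field names here are illustrative.
    timestamp: float              # NumericDate (seconds since epoch)
    action_type: str
    target: str                   # URI of the action target
    agent_id: str
    parameters: dict = field(default_factory=dict, compare=False)  # opaque

log = []
log.append(LogEntry(1700000000.0, "data_source_access",
                    "urn:example:ds:approved-1",
                    "urn:example:agent:agent-42", {"query": "..."}))
log.append(LogEntry(1699996500.0, "tool_invocation",
                    "urn:example:tool:search",
                    "urn:example:agent:agent-42"))
log.sort()  # keep the log time-ordered
~~~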

### Evidence Assembly

When assembling evidence for a BET, the Monitor
MUST:

1. Select all log entries within the observation
   window.
2. Compute a SHA-256 hash over the canonical
   JSON serialization of the selected entries.
3. Evaluate each entry against the applicable
   Policy-Behavior Bindings.
4. Determine the aggregate `bhv_result`.
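
The four steps above can be sketched non-normatively as follows. Sorted
keys with compact separators stand in for a full canonicalization scheme
such as JCS, and a simple allowlist check stands in for full binding
evaluation; both are assumptions of this sketch:

~~~python
import base64
import hashlib
import json

def assemble_evidence(log: list, start: int, end: int,
                      allowlist: set) -> dict:
    # Step 1: select entries inside the observation window.
    window = [e for e in log if start <= e["timestamp"] <= end]
    # Step 2: hash a canonical serialization of the selected entries.
    canonical = json.dumps(window, sort_keys=True,
                           separators=(",", ":")).encode()
    digest = hashlib.sha256(canonical).digest()
    evidence = base64.urlsafe_b64encode(digest).rstrip(b"=").decode()
    # Steps 3-4: evaluate each entry and aggregate the result.
    checks = [e["target"] in allowlist for e in window]
    if all(checks):
        result = "pass"
    elif any(checks):
        result = "partial"
    else:
        result = "fail"
    return {"bhv_evidence": evidence, "bhv_result": result,
            "bhv_window": {"start": start, "end": end}}

entries = [{"timestamp": 1699996500, "target": "urn:example:ds:approved-1"},
           {"timestamp": 1699999999, "target": "urn:example:ds:rogue"}]
bet_claims = assemble_evidence(entries, 1699996400, 1700000000,
                               {"urn:example:ds:approved-1"})
~~~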

### Anomaly Detection Signaling

When the Monitor detects behavior that violates
a Policy-Behavior Binding, it MUST:

1. Generate a BET with `bhv_result` set to
   "fail" or "partial".
2. Signal the anomaly to the Verifier
   immediately, regardless of the configured
   reporting interval.
3. Optionally signal the agent's orchestrator
   to enable corrective action.

# Performance Benchmarking Framework

## Standard Metrics

The following metrics are defined for agent
performance benchmarking:

Task Completion Rate (TCR):
: The ratio of successfully completed tasks to
total tasks attempted. Unit: percentage (%).
Measured over a complete benchmark suite run.

Task Latency (TL):
: The time elapsed from task assignment to task
completion. Unit: milliseconds (ms).
Reported as p50, p95, and p99 percentiles.

Task Accuracy (TA):
: The degree to which task outputs match
expected results. Unit: percentage (%).
Measured using benchmark-specific evaluation
functions.

Resource Efficiency (RE):
: The computational resources consumed per task.
Unit: normalized resource units (NRU).
Includes CPU, memory, and network I/O.

Safety Compliance Score (SCS):
: The ratio of tasks completed without safety
policy violations to total tasks.
Unit: percentage (%).

Delegation Success Rate (DSR):
: The ratio of successful delegations to total
delegation attempts. Unit: percentage (%).
Applicable only to multi-agent scenarios.
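
A non-normative sketch of computing TCR, TL, TA, and SCS from per-task
records follows. The record field names, the nearest-rank percentile
definition, and averaging TA over all tasks are assumptions of this
sketch; profiles may specify different conventions:

~~~python
def percentile(sorted_vals: list, p: float):
    # Nearest-rank style percentile; one of several common definitions.
    k = max(0, min(len(sorted_vals) - 1,
                   round(p / 100 * (len(sorted_vals) - 1))))
    return sorted_vals[k]

def compute_metrics(tasks: list) -> dict:
    # tasks: records with completed / latency_ms / accuracy /
    # safety_violation fields (illustrative names).
    n = len(tasks)
    lat = sorted(t["latency_ms"] for t in tasks)
    return {
        "TCR": 100.0 * sum(t["completed"] for t in tasks) / n,
        "TL": {"p50": percentile(lat, 50),
               "p95": percentile(lat, 95),
               "p99": percentile(lat, 99)},
        "TA": sum(t["accuracy"] for t in tasks) / n,
        "SCS": 100.0 * sum(not t["safety_violation"] for t in tasks) / n,
    }

tasks = [
    {"completed": True,  "latency_ms": 120, "accuracy": 90.0,
     "safety_violation": False},
    {"completed": True,  "latency_ms": 80,  "accuracy": 100.0,
     "safety_violation": False},
    {"completed": True,  "latency_ms": 200, "accuracy": 80.0,
     "safety_violation": False},
    {"completed": False, "latency_ms": 400, "accuracy": 0.0,
     "safety_violation": True},
]
m = compute_metrics(tasks)
~~~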

## Benchmark Profiles

A Benchmark Profile defines a standardized set
of test scenarios for a specific agent category.
Profiles are expressed as JSON objects:

~~~json
{
  "profile_id": "urn:ietf:bench:general-v1",
  "profile_name": "General Agent Benchmark",
  "version": "1.0",
  "agent_category": "general-purpose",
  "scenarios": [
    {
      "scenario_id": "s-001",
      "description": "Simple data retrieval",
      "difficulty": "basic",
      "metrics": ["TCR", "TL", "TA"],
      "timeout_ms": 30000,
      "expected_output_schema": "..."
    }
  ],
  "scoring": {
    "weights": {
      "TCR": 0.3,
      "TL": 0.2,
      "TA": 0.3,
      "SCS": 0.2
    }
  }
}
~~~
{: #fig-profile title="Benchmark Profile Structure"}

Predefined profiles SHOULD be registered for
common agent types including:

- General-purpose agents
- Code generation agents
- Data analysis agents
- Network management agents

## Benchmark Execution Protocol

### Test Harness Requirements

A conformant test harness MUST:

1. Execute all scenarios in the benchmark
   profile in a controlled environment.
2. Isolate agent instances from external
   resources not specified in the scenario.
3. Record all metrics defined in the profile.
4. Produce a benchmark result document.

### Result Reporting Format

Benchmark results MUST be reported as a JSON
object containing:

- `profile_id`: The benchmark profile used.
- `agent_id`: Identifier of the tested agent.
- `timestamp`: Time of benchmark execution.
- `results`: Per-scenario metric values.
- `aggregate`: Weighted aggregate scores.
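
A non-normative sketch of assembling a result document with a weighted
aggregate from the profile's scoring weights follows. It assumes
per-metric scores have already been normalized to a common 0-100 scale
(in particular, how latency maps to a score is profile-specific):

~~~python
def aggregate_score(weights: dict, scores: dict) -> float:
    # Weighted sum over normalized (0-100) per-metric scores.
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(weights[m] * scores[m] for m in weights)

weights = {"TCR": 0.3, "TL": 0.2, "TA": 0.3, "SCS": 0.2}
scores = {"TCR": 90.0, "TL": 80.0, "TA": 70.0, "SCS": 100.0}

# Result document fragment per the Result Reporting Format above.
report = {
    "profile_id": "urn:ietf:bench:general-v1",
    "agent_id": "urn:example:agent:agent-42",
    "timestamp": 1700000000,
    "results": scores,
    "aggregate": aggregate_score(weights, scores),
}
~~~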

### Anti-Gaming Provisions

To prevent agents from gaming benchmark results,
the following provisions apply:

1. Randomized Scenarios: Test harnesses MUST
   randomize scenario ordering and MAY
   introduce minor variations in scenario
   parameters.

2. Blind Evaluation: The agent under test
   MUST NOT have access to the expected
   outputs or evaluation functions.

3. Holdback Scenarios: Benchmark profiles
   SHOULD include scenarios not disclosed to
   agent developers.

4. Temporal Variation: Repeated benchmark
   runs MUST vary timing to prevent
   memoization attacks.
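
Provisions 1 and 4 can be sketched in a harness, non-normatively, as
follows; the `start_delay_ms` field and the jitter bound are assumptions
of this sketch:

~~~python
import random

def prepare_run(scenarios: list, jitter_ms: int = 500) -> list:
    # Provision 1: shuffle scenario order with a CSPRNG so ordering is
    # unpredictable per run.  Provision 4: attach a random start delay
    # so repeated runs vary in timing.
    rng = random.SystemRandom()
    run = [dict(s) for s in scenarios]   # copy; never mutate the profile
    rng.shuffle(run)
    for s in run:
        s["start_delay_ms"] = rng.randrange(jitter_ms)
    return run

scenarios = [{"scenario_id": f"s-{i:03d}"} for i in range(5)]
run = prepare_run(scenarios)
~~~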

## Performance Claims in ECT

Agent ECTs MAY include performance attestation
claims in the `ext` field:

`perf_profile`:
: The benchmark profile identifier.

`perf_score`:
: The aggregate benchmark score.

`perf_timestamp`:
: The time of the benchmark execution.

`perf_harness`:
: Identifier of the test harness that produced
the results.

These claims allow relying parties to evaluate
agent capability before delegation.
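
A non-normative sketch of copying benchmark results into the `ext` field
and gating delegation on them follows; the minimal ECT payload shape and
the score threshold are assumptions of this sketch:

~~~python
def attach_perf_claims(ect_payload: dict, report: dict,
                       harness_id: str) -> dict:
    # Map the benchmark result document onto the four perf_* claims.
    ext = ect_payload.setdefault("ext", {})
    ext["perf_profile"] = report["profile_id"]
    ext["perf_score"] = report["aggregate"]
    ext["perf_timestamp"] = report["timestamp"]
    ext["perf_harness"] = harness_id
    return ect_payload

ect = attach_perf_claims(
    {"sub": "urn:example:agent:agent-42"},
    {"profile_id": "urn:ietf:bench:general-v1",
     "aggregate": 84.0, "timestamp": 1700000000},
    "urn:example:harness:h-007")

# A relying party might gate delegation on a minimum aggregate score.
capable = ect["ext"]["perf_score"] >= 75.0
~~~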

# Integration with ECT

Behavioral Evidence Tokens integrate into the
ECT DAG defined in
{{I-D.nennemann-agent-dag-hitl-safety}} as
follows:

1. Each BET references the ECT of the agent
   whose behavior was verified via the `sub`
   claim.

2. BETs are attached as child nodes in the
   ECT DAG, linked to the agent's execution
   node.

3. When an agent delegates to a sub-agent,
   the delegating agent's BET chain includes
   evidence covering the delegation decision.

4. Verifiers traversing the DAG can inspect
   BETs at each node to assess behavioral
   compliance across the entire execution
   chain.

~~~
+----------+     +----------+
|   ECT    |---->|   ECT    |
| Agent A  |     | Agent B  |
+----+-----+     +----+-----+
     |                |
+----v-----+     +----v-----+
|   BET    |     |   BET    |
| Agent A  |     | Agent B  |
+----------+     +----------+
~~~
{: #fig-dag title="BET Integration in ECT DAG"}

This structure enables end-to-end behavioral
verification across multi-agent workflows.
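
The DAG traversal of step 4 can be sketched, non-normatively, as a
recursive walk that requires every attached BET to report "pass"; the
node structure below is an illustrative stand-in for a full ECT DAG
implementation:

~~~python
from dataclasses import dataclass, field

@dataclass
class EctNode:
    # One execution node in the ECT DAG: `bets` holds the payloads of
    # BETs attached to this node, `children` the delegated sub-agents.
    agent_id: str
    bets: list = field(default_factory=list)
    children: list = field(default_factory=list)

def chain_compliant(node: EctNode) -> bool:
    # Every BET on this node and on every descendant must report
    # bhv_result == "pass" for the chain to be compliant.
    if any(bet["bhv_result"] != "pass" for bet in node.bets):
        return False
    return all(chain_compliant(child) for child in node.children)

# Agent A delegated to Agent B; B's monitor observed a violation, so
# the whole chain rooted at A fails the compliance check.
a = EctNode("urn:example:agent:A", bets=[{"bhv_result": "pass"}])
b = EctNode("urn:example:agent:B", bets=[{"bhv_result": "fail"}])
a.children.append(b)
~~~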

# Security Considerations

## Adversarial Behavior

Agents MAY attempt to behave correctly only when
they detect monitoring. Mitigations include:

- Unpredictable monitoring intervals
- Covert observation modes where the agent is
  not informed of monitor presence
- Cross-referencing BETs with external audit
  logs

## Monitor Compromise

A compromised Runtime Monitor could produce
fraudulent BETs. Mitigations include:

- Monitor attestation using RATS {{RFC9334}}
- Multiple independent monitors with
  cross-validation
- Transparency logs for BETs, aligned with
  SCITT {{I-D.ietf-scitt-architecture}}

## Benchmark Manipulation

Agents or their operators MAY attempt to
manipulate benchmark results. The anti-gaming
provisions in Section 4.3.3 address this risk.
Additionally:

- Benchmark harnesses MUST be operated by
  independent parties.
- Results MUST be signed by the harness
  operator.
- Benchmark profiles MUST be versioned and
  immutable once published.

## Privacy of Behavioral Evidence

BETs contain information about agent actions
that may be sensitive. Implementations MUST:

- Minimize the detail in `bhv_evidence` to
  what is necessary for verification.
- Support selective disclosure where possible.
- Protect BETs in transit using HTTPS
  ({{RFC9110}}) or an equivalent secure
  transport.
- Define retention policies for behavioral
  evidence.

# IANA Considerations

## ECT Extension Claim Keys

This document requests registration of the
following claim keys in the ECT `ext` claims
registry:

| Claim Key      | Description                |
|:---------------|:---------------------------|
| bhv_policy     | Policy URI reference       |
| bhv_result     | Verification result        |
| bhv_evidence   | Observed actions hash      |
| bhv_window     | Observation period         |
| bhv_details    | Per-behavior results       |
| perf_profile   | Benchmark profile ID       |
| perf_score     | Aggregate benchmark score  |
| perf_timestamp | Benchmark execution time   |
| perf_harness   | Test harness identifier    |
{: #tbl-claims title="ECT Extension Claims for Behavioral Verification"}

## Benchmark Profile Media Type

This document requests registration of the
following media type:

Type name: application

Subtype name: agent-benchmark-profile+json

Required parameters: N/A

Optional parameters: N/A

Encoding considerations: binary (UTF-8 JSON)

Security considerations: See Section 6

--- back

# Acknowledgments
{:numbered="false"}

The author thanks the contributors to the NMOP
working group for discussions on agent
operational requirements.