Fix security, data integrity, and accuracy issues from 4-perspective review
Security fixes:
- Fix SQL injection in db.py:update_generation_run (column name whitelist)
- Flask SECRET_KEY from env var instead of hardcoded
- Add LLM rating bounds validation (_clamp_rating, 1-10)
- Fix JSON extraction trailing whitespace handling

Data integrity:
- Normalize 21 legacy category names to 11 canonical short forms
- Add false_positive column, flag 73 non-AI drafts (361 relevant remain)
- Document verified counts: 434 total/361 relevant drafts, 557 authors, 419 ideas, 11 gaps

Code quality:
- Fix version string 0.1.0 → 0.2.0
- Add close()/context manager to Embedder class
- Dynamic matrix size instead of hardcoded "260x260"

Blog accuracy:
- Fix EU AI Act timeline (enforcement Aug 2026, not "18 months")
- Distinguish OAuth consent from GDPR Einwilligung
- Add EU AI Act Annex III context to hospital scenario
- Add FIPA, eIDAS 2.0 references where relevant

Methodology:
- Add methodology.md documenting pipeline, limitations, rating rubric
- Add LLM-as-judge caveats to analyzer.py
- Document clustering threshold rationale

Reviews from: legal (German/EU law), statistics, development, science perspectives.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
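The whitelist fix is named but not shown in the diff below (which covers only the blog post); a minimal sketch of the pattern, assuming a sqlite3-backed `db.py` and keyword-style column updates -- the table and column names here are illustrative, not the project's actual schema:

```python
import sqlite3

# Illustrative whitelist: only vetted column names may be interpolated into the
# UPDATE statement's text. Values still travel through bound parameters.
ALLOWED_COLUMNS = {"status", "finished_at", "draft_count", "error_message"}

def update_generation_run(conn: sqlite3.Connection, run_id: int, **updates) -> None:
    unknown = set(updates) - ALLOWED_COLUMNS
    if unknown:
        raise ValueError(f"refusing to update unknown column(s): {sorted(unknown)}")
    assignments = ", ".join(f"{col} = ?" for col in updates)  # names come from the whitelist only
    conn.execute(
        f"UPDATE generation_runs SET {assignments} WHERE id = ?",
        (*updates.values(), run_id),
    )
    conn.commit()
```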
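The rating-bounds and SECRET_KEY fixes might look like the following sketch; the environment variable name and the fallback for unparseable model output are assumptions:

```python
import os

from flask import Flask

def _clamp_rating(value, lo: int = 1, hi: int = 10) -> int:
    """Coerce an LLM-provided rating into the documented 1-10 scale."""
    try:
        rating = int(round(float(value)))
    except (TypeError, ValueError):
        return lo  # assumed fallback when the model returns non-numeric output
    return max(lo, min(hi, rating))

app = Flask(__name__)
# Read the secret from the environment instead of hardcoding it; failing fast
# when the variable is unset beats silently running with a published key.
app.config["SECRET_KEY"] = os.environ["FLASK_SECRET_KEY"]
```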
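And the Embedder lifecycle change, sketched under the assumption that the class holds a model handle worth releasing explicitly; `load_model` and `release` are hypothetical stand-ins for whatever the real class wraps:

```python
def load_model(name: str):
    """Hypothetical stand-in for the project's real embedding-model loader."""
    return object()

def release(model) -> None:
    """Hypothetical stand-in for freeing the model's resources."""

class Embedder:
    """Sketch of the close()/context-manager addition."""

    def __init__(self, model_name: str):
        self._model = load_model(model_name)

    def close(self) -> None:
        if self._model is not None:
            release(self._model)
            self._model = None

    def __enter__(self) -> "Embedder":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.close()
        return False  # do not swallow exceptions

# Usage: resources are freed automatically on exit, even on error:
# with Embedder("all-MiniLM-L6-v2") as embedder:
#     ...
```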
@@ -1,14 +1,16 @@
 # What Nobody's Building (And Why It Matters)
 
-*The 12 gaps in the IETF's AI agent landscape -- and the real-world disasters they invite.*
+*The 11 gaps in the IETF's AI agent landscape -- and the real-world disasters they invite.*
 
 ---
 
 Imagine an AI agent managing a hospital's drug-dispensing system. It receives instructions from a prescribing agent, coordinates with a pharmacy agent, and issues delivery commands to a robotic dispensing agent. On Tuesday morning, the prescribing agent hallucinates a dosage. The pharmacy agent fills it. The dispensing agent delivers it. No human saw it happen. No system flagged it. No protocol exists to roll back the dispensed medication.
 
-This is not a hypothetical failure mode. It is the predictable consequence of the IETF's three most critical standardization gaps.
+To be clear: this scenario is already regulated. Under the EU AI Act (Regulation 2024/1689), a drug-dispensing AI agent is a high-risk AI system under Annex III, requiring conformity assessment, risk management, and human oversight before deployment. The Medical Devices Regulation (MDR 2017/745) imposes additional obligations. The gap is not one of legal accountability -- it is one of technical implementation. The standards that would let developers *comply* with these regulations in multi-agent architectures do not yet exist.
 
-We analyzed **361 Internet-Drafts**, extracted their technical components, and compared the result against what real-world agent deployments actually require. We found **12 gaps** -- areas where standardization work is missing or inadequate. Three of them are critical. And the critical ones all share a defining characteristic: they address what happens when autonomous agents fail or misbehave.
+This is the predictable consequence of the IETF's most critical standardization gaps.
 
+We analyzed **434 Internet-Drafts**, extracted their technical components, and compared the result against what real-world agent deployments actually require. We found **11 gaps** -- areas where standardization work is missing or inadequate. Two of them are critical. And the critical ones share a defining characteristic: they address what happens when autonomous agents fail or misbehave.
+
 Nobody is building the safety net.
@@ -16,60 +18,51 @@ Nobody is building the safety net.
 
 Our gap analysis sorted findings by severity based on the breadth of the shortfall and the consequences of leaving it unfilled:
 
-| # | Gap | Severity | Ideas Addressing It |
-|---|-----|----------|--------------------:|
-| 1 | Agent Behavior Verification | CRITICAL | 52 |
-| 2 | Agent Resource Management | CRITICAL | 117 |
-| 3 | Agent Error Recovery and Rollback | CRITICAL | 6 |
-| 4 | Cross-Protocol Translation | HIGH | 0 |
-| 5 | Agent Lifecycle Management | HIGH | 90 |
-| 6 | Multi-Agent Consensus | HIGH | 5 |
-| 7 | Human Override and Intervention | HIGH | 4 |
-| 8 | Cross-Domain Security Boundaries | HIGH | 10 |
-| 9 | Dynamic Trust and Reputation | HIGH | 5 |
-| 10 | Agent Performance Monitoring | MEDIUM | 26 |
-| 11 | Agent Explainability | MEDIUM | 5 |
-| 12 | Agent Data Provenance | MEDIUM | 79 |
+| # | Gap | Severity |
+|---|-----|----------|
+| 1 | Agent Behavioral Verification | CRITICAL |
+| 2 | Agent Failure Cascade Prevention | CRITICAL |
+| 3 | Real-Time Agent Rollback Mechanisms | HIGH |
+| 4 | Multi-Agent Consensus Protocols | HIGH |
+| 5 | Human Override Standardization | HIGH |
+| 6 | Cross-Domain Agent Audit Trails | HIGH |
+| 7 | Federated Agent Learning Privacy | HIGH |
+| 8 | Cross-Protocol Agent Migration | MEDIUM |
+| 9 | Agent Resource Accounting and Billing | MEDIUM |
+| 10 | Agent Capability Negotiation | MEDIUM |
+| 11 | Agent Performance Benchmarking | MEDIUM |
 
-Two numbers in that table should alarm you: the **6 ideas** addressing error recovery (all from a single draft), and the **0 ideas** addressing cross-protocol translation. Across 361 drafts, these gaps are not underserved. They are unserved.
+The gap names above match the automated gap analysis output. The two critical gaps -- behavioral verification and failure cascade prevention -- address what happens when autonomous agents deviate from declared behavior or trigger cascading failures across interconnected systems. Several high-severity gaps (rollback mechanisms, human override, consensus protocols) address the same theme: what happens when things go wrong, and nobody has built the safety net.
 
+A notable omission from this gap list: **GDPR-mandated capabilities**. The gap analysis focuses on technical desiderata but does not engage with the EU's legally binding data protection framework. Specific GDPR requirements that have no corresponding IETF draft work include: Data Protection Impact Assessment (DPIA) tooling for high-risk agent processing (Art. 35 GDPR), right-to-erasure propagation across multi-agent chains (Art. 17), data portability for agent-generated personal data (Art. 20), and purpose limitation enforcement when agents are authorized for specific tasks but may repurpose data (Art. 5(1)(b)). These are not optional features for EU-deployed agent systems -- they are legal requirements.
+
 ## Critical Gap 1: Agent Behavior Verification
 
 **The problem**: No mechanism exists to verify that a deployed AI agent actually behaves according to its declared policies or specifications.
 
-**The numbers**: Only **44 of 361 drafts** address AI safety and alignment. The 4:1 ratio of capability to safety work means the community is building agents four times faster than it is building the tools to keep them honest.
+**The numbers**: Only **47 of 434 drafts** address AI safety and alignment. The capability-to-safety ratio is roughly 4:1 in aggregate -- though it varies significantly by month, from as low as 1.5:1 to as high as 21:1. The trend is clear: the community is building agents faster than it is building the tools to keep them honest.
 
-**What 52 ideas partially address**: Some exist on the periphery. [draft-aylward-daap-v2](https://datatracker.ietf.org/doc/draft-aylward-daap-v2/) (score 4.8 -- the highest-rated draft in the corpus) defines a behavioral monitoring framework and cryptographic identity verification. [draft-birkholz-verifiable-agent-conversations](https://datatracker.ietf.org/doc/draft-birkholz-verifiable-agent-conversations/) (score 4.5) proposes verifiable conversation records using COSE signing. [draft-berlinai-vera](https://datatracker.ietf.org/doc/draft-berlinai-vera/) (score 3.9) introduces a zero-trust architecture with five enforcement pillars.
+**What partially addresses this**: Some work exists on the periphery. [draft-aylward-daap-v2](https://datatracker.ietf.org/doc/draft-aylward-daap-v2/) (score 4.75 -- the highest-rated draft in the corpus) defines a behavioral monitoring framework and cryptographic identity verification. [draft-birkholz-verifiable-agent-conversations](https://datatracker.ietf.org/doc/draft-birkholz-verifiable-agent-conversations/) (score 4.5) proposes verifiable conversation records using COSE signing. [draft-berlinai-vera](https://datatracker.ietf.org/doc/draft-berlinai-vera/) (score 3.9) introduces a zero-trust architecture with five enforcement pillars.
 
 **What is still missing**: Runtime verification. These drafts define what agents *should* do and how to *record* what they did. None provides a real-time mechanism to detect that an agent is deviating from its declared behavior *while it is operating*. The gap is between policy declaration and policy enforcement -- the difference between a speed limit sign and a speed camera.
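To make the declaration-versus-enforcement distinction concrete, here is a deliberately tiny sketch (every name is invented for illustration; no existing draft defines such an interface) of what a runtime verifier does that a policy document cannot: evaluate each concrete action against the declared envelope while the agent runs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DeclaredPolicy:
    """Toy stand-in for a machine-readable behavior declaration."""
    max_order_size: float
    allowed_instruments: frozenset

def verify_action(policy: DeclaredPolicy, instrument: str, size: float) -> bool:
    """Runtime check: does this concrete action fall inside the declared envelope?"""
    return instrument in policy.allowed_instruments and size <= policy.max_order_size

policy = DeclaredPolicy(max_order_size=1_000_000.0,
                        allowed_instruments=frozenset({"EURUSD"}))
assert verify_action(policy, "EURUSD", 500_000.0)        # within bounds: allowed
assert not verify_action(policy, "EURUSD", 5_000_000.0)  # deviation caught live, not in audit
```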
 
 **The scenario**: A financial trading agent is authorized to execute trades within specified parameters. It begins operating within bounds but, after a model update, starts exceeding risk limits. Without runtime behavior verification, the deviation is only discovered in post-hoc audit -- potentially days later, after significant damage.
 
-## Critical Gap 2: Agent Resource Management
+## Critical Gap 2: Agent Failure Cascade Prevention
 
-**The problem**: No framework exists for managing computational resources, memory, and processing power across distributed AI agents.
+**The problem**: No protocols exist to prevent agent failures from cascading across interconnected autonomous systems. As agent interdependencies increase in production deployments, a failure in one agent can ripple outward.
 
-**The numbers**: **93 drafts** focus on autonomous network operations, and **117 ideas** touch on resource-adjacent topics. But those ideas address how agents communicate about tasks -- not how they compete for and share limited resources.
+**The numbers**: Only **47 drafts** out of 434 address AI safety, and the high interconnectivity implied by 155 A2A protocol drafts and 114 autonomous netops drafts creates the conditions for cascade failures.
 
-**What is missing**: Scheduling, quotas, fair allocation, and priority mechanisms for multi-agent environments. When ten agents compete for the same GPU cluster, which gets priority? When an agent's computation exceeds its allocation, what happens? When a high-priority emergency response agent needs resources currently held by a routine monitoring agent, how does preemption work?
+**What is missing**: Circuit breakers for cascading failures. Checkpoint and rollback protocols. Blast radius containment. Graceful degradation. All concepts well-established in distributed systems engineering, but absent from the agent standards landscape.
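Of these, the circuit breaker is the most directly portable from distributed-systems practice; a minimal sketch of the pattern applied to inter-agent calls, with illustrative thresholds and names not drawn from any draft:

```python
import time

class CircuitBreaker:
    """Trip after repeated downstream failures instead of amplifying them."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, downstream_agent, request):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: not forwarding to failing agent")
            self.opened_at = None  # half-open: allow one trial request
            self.failures = 0
        try:
            response = downstream_agent(request)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # contain the blast radius
            raise
        self.failures = 0
        return response
```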
 
-**The scenario**: A telecom operator deploys 50 AI agents for network monitoring, troubleshooting, and optimization. During a major outage, all 50 agents simultaneously request inference resources to diagnose the problem. With no resource management framework, agents compete chaotically. The most aggressive agents get resources; the most important diagnostic tasks may not. The outage extends because the agents that could fix it are starved by the agents that are observing it.
+**The scenario**: A telecom operator deploys 50 AI agents for network monitoring, troubleshooting, and optimization. During a major outage, all 50 agents simultaneously request inference resources to diagnose the problem. With no failure cascade prevention, agents compete chaotically. The most aggressive agents get resources; the most important diagnostic tasks may not. The outage extends because the agents that could fix it are starved by the agents that are observing it.
 
-## Critical Gap 3: Agent Error Recovery and Rollback
+## High-Severity Gap: Real-Time Agent Rollback Mechanisms
 
-**The problem**: No standards exist for how agents handle errors, cascading failures, or the rollback of autonomous decisions.
+**The problem**: No standards exist for how to quickly roll back incorrect decisions made by autonomous agents across distributed systems.
 
-**The numbers**: This is the starkest gap in the corpus. Only **6 extracted ideas** address it, and all come from a single draft: [draft-yue-anima-agent-recovery-networks](https://datatracker.ietf.org/doc/draft-yue-anima-agent-recovery-networks/) (score 4.1). One team, out of 557 authors, is working on this.
-
-**The 6 ideas from that draft**:
-- Task-Oriented Multi-Agent Recovery Framework
-- Inter-Agent Communication Protocol Requirements
-- State Consistency Management
-- Error and Success Reporting Framework (from a separate draft)
-- Generic Agent Response Framework
-- Mandatory restrictive failure behavior
-
-That is the entire body of work the IETF has produced on agent error recovery. For context, "Multi-Agent Communication Protocol" -- defining how agents *talk* -- appears in 8 drafts. The community has invested 8 times more effort in the plumbing than in the fire escape.
+**The numbers**: 114 autonomous netops drafts exist, but no rollback mechanisms for production network safety. [draft-yue-anima-agent-recovery-networks](https://datatracker.ietf.org/doc/draft-yue-anima-agent-recovery-networks/) (score 4.1) is among the few drafts that partially address this, with its Task-Oriented Multi-Agent Recovery Framework and State Consistency Management. For context, "Multi-Agent Communication Protocol" -- defining how agents *talk* -- appears in 8 drafts. The community has invested far more effort in the plumbing than in the fire escape.
 
 **What is missing**: Circuit breakers for cascading failures. Checkpoint and rollback protocols. Blast radius containment. Graceful degradation. All concepts well-established in distributed systems engineering, but absent from the agent standards landscape.
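The checkpoint-and-rollback item on that list, reduced to a single-node sketch (illustrative only; a real multi-agent version also needs the distributed state consistency that draft-yue-anima-agent-recovery-networks begins to address):

```python
import copy

class CheckpointedState:
    """Keep restorable snapshots so an autonomous decision can be undone."""

    def __init__(self, state: dict):
        self.state = state
        self._snapshots: list[dict] = []

    def checkpoint(self) -> None:
        self._snapshots.append(copy.deepcopy(self.state))

    def rollback(self) -> None:
        if not self._snapshots:
            raise RuntimeError("no checkpoint to roll back to")
        self.state = self._snapshots.pop()

routes = CheckpointedState({"backbone": "path-A"})
routes.checkpoint()                  # before the agent acts
routes.state["backbone"] = "path-B"  # autonomous decision
routes.rollback()                    # undo when verification flags it
assert routes.state["backbone"] == "path-A"
```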
@@ -77,35 +70,29 @@ That is the entire body of work the IETF has produced on agent error recovery. F
 
 ## The High-Priority Gaps
 
-Six additional gaps scored HIGH severity. Each represents a missing piece that working deployments will hit:
+Several additional gaps scored HIGH severity. Each represents a missing piece that working deployments will hit:
 
-### Cross-Protocol Translation (0 ideas)
+### Human Override Standardization
 
-With **120 competing A2A protocols** and no translation layer, agents speaking different protocols simply cannot interoperate. This gap is entirely unaddressed -- zero technical ideas in the corpus. It is the only gap with literally no coverage.
-
-The parallel is the early web: HTTP won not because it was the best protocol but because it was the one protocol everyone could speak. The agent ecosystem has no HTTP equivalent. If the IETF does not build a translation layer, the market will -- and the result will be vendor-locked ecosystems rather than open interoperability.
-
-### Human Override and Intervention (4 ideas)
-
-Only **30 human-agent interaction drafts** exist versus **93 autonomous operations** and **120 A2A protocol** drafts. Agents are being designed to talk to each other at a 4:1 ratio over being designed to talk to humans. Emergency override protocols -- the "big red button" -- are almost entirely absent.
+Only **34 human-agent interaction drafts** exist versus **114 autonomous operations** and **155 A2A protocol** drafts. Agents are being designed to talk to each other at a roughly 4:1 ratio over being designed to talk to humans. Emergency override protocols -- the "big red button" -- are almost entirely absent. This is not merely an engineering preference. For high-risk AI systems deployed in the EU, the AI Act (Art. 14) mandates human oversight -- making this gap a compliance blocker, not just a design omission.
 
 [draft-rosenberg-aiproto-cheq](https://datatracker.ietf.org/doc/draft-rosenberg-aiproto-cheq/) (score 3.9) is a rare exception: it defines a protocol for human confirmation of agent decisions before execution. But CHEQ is opt-in and pre-execution. No draft defines what happens when a human needs to stop a running agent, constrain its behavior, or take over its task mid-execution.
 
-### Multi-Agent Consensus (5 ideas)
+### Multi-Agent Consensus Protocols
 
-When a group of agents disagree -- the diagnosis agent says the router is down, the monitoring agent says it is up, the optimization agent is rerouting traffic around it -- who arbitrates? No framework exists for agents to resolve conflicting assessments without human intervention.
+When a group of agents disagree -- the diagnosis agent says the router is down, the monitoring agent says it is up, the optimization agent is rerouting traffic around it -- who arbitrates? No framework exists for agents to resolve conflicting assessments without human intervention. This is not a new problem: FIPA (Foundation for Intelligent Physical Agents) defined agent communication languages and interaction protocols for multi-agent coordination as early as 1997. The IETF landscape has largely not engaged with this prior art.
 
-### Dynamic Trust and Reputation (5 ideas)
+### Cross-Domain Agent Audit Trails
 
-Static certificates authenticate identity but cannot express "this agent has been reliable for 6 months" or "this agent's accuracy degraded last week." Long-running agent ecosystems need trust that is earned, tracked, and revocable. The current landscape relies entirely on binary trust: either an agent has a valid certificate or it does not.
+An agent operating across multiple domains or organizations needs to maintain audit trails that satisfy different regulatory requirements simultaneously. Identity management exists -- the 152 identity/auth drafts cover authentication. What does not exist is cross-domain audit standardization: the format and semantics for recording agent actions across jurisdictions with varying compliance requirements. The EU's eIDAS 2.0 regulation (Regulation 2024/1183) and its European Digital Identity Wallet framework provide a mature trust model that the IETF drafts have not yet connected to.
 
-### Cross-Domain Security Boundaries (10 ideas)
+### Federated Agent Learning Privacy
 
-An agent authenticated in Company A's domain needs to perform a task in Company B's domain. Identity management exists -- the 108 identity/auth drafts cover this. What does not exist is trust *isolation*: preventing an agent authenticated for a narrow task from escalating privileges across domain boundaries.
+While federated architectures exist, there is insufficient specification for privacy-preserving agent learning that prevents data leakage between federated participants during model updates.
 
-### Agent Lifecycle Management (90 ideas)
+### Cross-Protocol Agent Migration
 
-Registration is covered. What happens after registration is not: versioning when an agent is updated, graceful retirement when an agent is decommissioned, migration when an agent moves between hosts, and dependency management when other agents rely on it.
+Agents need to migrate between different network protocols, domains, or infrastructure providers while maintaining state and identity. Current drafts focus on registration but not migration continuity.
 
 ## The Structural Problem
 
@@ -119,7 +106,9 @@ Now look back at the team bloc analysis from Post 2. The 18 team blocs are *isla
 
 This is the structural explanation for the safety deficit. It is not that people do not care about safety. It is that safety standards require coordination across boundaries that the current authorship structure cannot bridge. Capability standards can be built within a single team. Safety standards cannot.
 
-Our category co-occurrence analysis provides the concrete proof. Safety drafts are not entirely isolated -- they co-occur with 8 of 10 categories, coupling most strongly with policy and governance (**60% of safety drafts**, lift 2.3x) and identity/auth (**58%**, lift 1.7x). But the pattern is revealing: safety pairs with *governance* categories, not *implementation* categories. Of the 136 drafts tagged as A2A protocols, only **12 (8.8%) also address safety**. Safety has **zero co-occurrence** with agent discovery/registration and **zero co-occurrence** with model serving/inference. Its weakest links are to the categories where agents actually *do* things: A2A protocols (12), ML traffic management (3), and autonomous network operations (4). Safety is being discussed in governance papers. It is completely absent from discovery infrastructure and inference pipelines. It is barely present in the protocols that need it most. The traffic lights are not just behind the highways -- they are on a different road entirely.
+Our category co-occurrence analysis provides the concrete proof. Safety drafts are not entirely isolated -- they co-occur with several categories, coupling most strongly with policy and governance and identity/auth. But the pattern is revealing: safety pairs with *governance* categories, not *implementation* categories. Of the 155 drafts tagged as A2A protocols, very few also address safety. Safety has minimal co-occurrence with agent discovery/registration and model serving/inference. Its weakest links are to the categories where agents actually *do* things. Safety is being discussed in governance papers. It is barely present in the protocols that need it most. The traffic lights are not just behind the highways -- they are on a different road entirely.
 
+IEEE P3394 (Standard for Trustworthy AI Agents), a concurrent standardization effort, is attempting to address some of these safety and trust dimensions from a different angle. The IETF landscape should be compared against these parallel efforts to understand which gaps are being addressed elsewhere and which remain truly unserved.
+
 ## The 4:1 Ratio, Revisited
 
@@ -127,11 +116,11 @@ The safety deficit is not just a number. It is a structural property of how the
 
 | Category | Drafts | Team Blocs Active |
 |----------|-------:|------------------:|
-| A2A protocols | 120 | Many (distributed across blocs) |
-| Autonomous operations | 93 | Primarily Huawei, Chinese telecom |
-| Agent identity/auth | 108 | Ericsson, Nokia, ATHENA, multiple |
-| **AI safety/alignment** | **44** | **Few; mostly independents/startups** |
-| **Human-agent interaction** | **30** | **Rosenberg/White (2-person team)** |
+| A2A protocols | 155 | Many (distributed across blocs) |
+| Autonomous operations | 114 | Primarily Huawei, Chinese telecom |
+| Agent identity/auth | 152 | Ericsson, Nokia, ATHENA, multiple |
+| **AI safety/alignment** | **47** | **Few; mostly independents/startups** |
+| **Human-agent interaction** | **34** | **Rosenberg/White (2-person team)** |
 
 The capability categories have organized teams behind them. The safety categories rely on individual contributors and small, unconnected teams. The best safety draft in the corpus (DAAP, score 4.75) comes from an independent author (Aylward). The best human-agent drafts come from a two-person Five9/Bitwave team. There is no 13-person safety bloc with 94% cohesion.