Drawing the Big Picture: What the Agent Ecosystem Actually Needs
361 drafts, 628 cross-org idea overlaps, 12 gaps -- and the architectural vision that connects them all.
We have spent five posts documenting a paradox: the IETF's AI agent landscape has extraordinary breadth (361 drafts), deep fragmentation at every level (96% of ideas appear in only one draft, 120 competing A2A protocols, 14 OAuth proposals), concentrated authorship (18 team blocs, one company writing 18% of all drafts), and critical gaps (behavior verification, error recovery, human override) that nobody is filling.
The landscape has quantity. It lacks architecture.
This post is about what the architecture looks like -- not in theory, but derived from the data. The 12 gaps are not random absences; they are structurally related. The convergent ideas contain the components; they need a blueprint. And the blueprint already has a foundation: existing IETF work on workload identity (SPIFFE/WIMSE) and execution evidence (Execution Context Tokens) provides the lower layers. What is missing is what goes on top.
What the Ecosystem Needs: Four Pillars
Our analysis -- synthesizing the gaps, the ideas, and the existing proposals -- points to four missing pillars:
Pillar 1: DAG-Based Execution
The gap it fills: Error Recovery and Rollback (Critical), Resource Management (Critical)
Every multi-agent workflow is a directed acyclic graph: tasks with dependencies, checkpoints, and decision points. But no draft in the corpus defines "agent task graph" as a first-class construct. Without it, there is no way to:
- Know which tasks depend on which
- Place checkpoints for rollback
- Calculate the blast radius of a failure
- Schedule resources based on the graph structure
The Execution Context Token (ECT) from draft-nennemann-wimse-ect provides the evidence layer: each task produces a signed token linked to its predecessors via parent references, forming a verifiable DAG. What is missing is the orchestration semantics: when to checkpoint, how to roll back, how to contain cascading failures.
The data supports this: the 6 ideas addressing error recovery (all from draft-yue-anima-agent-recovery-networks) include "Task-Oriented Multi-Agent Recovery Framework" and "State Consistency Management" -- DAG concepts by another name. The 117 ideas touching resource management need a graph-aware scheduler. The answer is the same structure: a DAG execution model.
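As a sketch of what a first-class task graph buys you, consider a toy model in Python. All names and fields here are illustrative assumptions, not taken from any draft: a node carries ECT-style parent references plus a checkpoint flag, and the blast radius of a failure falls out of a simple traversal.

```python
from dataclasses import dataclass, field

@dataclass
class TaskNode:
    """One node in a hypothetical agent task DAG (illustrative, not from a draft)."""
    task_id: str
    parents: list = field(default_factory=list)  # ECT-style parent references
    checkpoint: bool = False                     # eligible rollback point

def blast_radius(dag: dict, failed: str) -> set:
    """Every task that transitively depends on the failed one."""
    affected, frontier = set(), [failed]
    while frontier:
        current = frontier.pop()
        for node in dag.values():
            if current in node.parents and node.task_id not in affected:
                affected.add(node.task_id)
                frontier.append(node.task_id)
    return affected

# Example workflow: fetch -> analyze -> (report, notify); analyze is a checkpoint
dag = {
    "fetch":   TaskNode("fetch"),
    "analyze": TaskNode("analyze", parents=["fetch"], checkpoint=True),
    "report":  TaskNode("report", parents=["analyze"]),
    "notify":  TaskNode("notify", parents=["analyze"]),
}
print(sorted(blast_radius(dag, "analyze")))  # ['notify', 'report']
```

With the graph explicit, rollback placement is equally mechanical: walk from the failed node back toward the root and stop at the nearest `checkpoint=True` ancestor.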
Pillar 2: Human-in-the-Loop as First Class
The gap it fills: Human Override and Intervention (High), Agent Explainability (Medium)
Only 30 human-agent interaction drafts exist against 120 A2A protocols and 93 autonomous operations drafts. Agents are being designed to talk to each other, not to humans. The CHEQ protocol (draft-rosenberg-aiproto-cheq) is a rare exception -- it defines human confirmation before agent execution. But nobody has standardized what happens during execution: how a human pauses a running workflow, constrains an agent's scope, takes over a task, or issues an emergency stop.
Human-in-the-loop must be a node type in the execution DAG, not an afterthought. The architecture needs:
- Approval gates: DAG nodes that block until a human approves
- Override commands: Standardized signals to pause, constrain, stop, or take over
- Escalation paths: What happens when an override times out
- Explainability tokens: How an agent communicates its reasoning at a HITL point
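To make "approval gate as a DAG node type" concrete, here is a minimal sketch. The function names and the escalate-on-timeout policy are assumptions for illustration, not taken from CHEQ or any draft: the gate blocks execution, polls for a human decision, and escalates when the override window expires.

```python
import enum
import time

class GateResult(enum.Enum):
    APPROVED = "approved"
    REJECTED = "rejected"
    ESCALATED = "escalated"  # no decision before the deadline; escalate per policy

def approval_gate(request_approval, timeout_s: float, poll_s: float = 0.5) -> GateResult:
    """Block DAG execution until a human decides, or escalate on timeout.

    `request_approval` is a caller-supplied function returning True (approve),
    False (reject), or None (no decision yet).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        decision = request_approval()
        if decision is True:
            return GateResult.APPROVED
        if decision is False:
            return GateResult.REJECTED
        time.sleep(poll_s)
    return GateResult.ESCALATED
```

The point of the sketch is the third branch: a standard would have to say what `ESCALATED` means -- pause the subtree, fail it, or hand off to another approver -- which is exactly the escalation-path gap noted above.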
The irony: every production deployment will require these primitives. The standards community is building autonomous capabilities while the deployment community is adding human oversight ad hoc.
Pillar 3: Protocol-Agnostic Interoperability
The gap it fills: Cross-Protocol Translation (High, zero ideas), Agent Lifecycle Management (High)
The 120 A2A protocol drafts will never converge to a single winner. MCP, A2A Protocol, SLIM, and dozens of others will coexist, each with different strengths. The answer is not to pick one; it is to build a translation layer that lets agents using different protocols interoperate through gateways.
This gap has zero ideas in the current corpus -- the starkest absence across 361 drafts. No team is working on it. Yet it is perhaps the most important architectural piece: without protocol interoperability, the agent ecosystem fragments into vendor-locked silos.
The protocol binding layer would define:
- How agents advertise which ecosystem features they support
- How gateways translate between protocols while preserving execution semantics (the DAG, the HITL points)
- How agents version and retire gracefully without breaking dependents
- The minimal semantic contract: intent, result, error -- expressible in any protocol
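A rough illustration of that minimal semantic contract in Python -- the wire format below is a generic JSON-RPC-shaped stand-in, not actual MCP or A2A syntax, and the function names are invented for this sketch:

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class AgentMessage:
    """Minimal semantic contract: intent, result, error -- protocol-neutral."""
    intent: str
    result: Any = None
    error: Optional[str] = None

def to_jsonrpc(msg: AgentMessage) -> dict:
    """Render the contract in one illustrative wire format (JSON-RPC-shaped)."""
    if msg.error:
        return {"jsonrpc": "2.0", "error": {"message": msg.error}}
    return {"jsonrpc": "2.0", "method": msg.intent, "result": msg.result}

def from_jsonrpc(payload: dict) -> AgentMessage:
    """Recover the contract from the wire format; a gateway pairs this with
    a `to_*` function for a second protocol to translate between them."""
    if "error" in payload:
        return AgentMessage(intent="", error=payload["error"]["message"])
    return AgentMessage(intent=payload.get("method", ""), result=payload.get("result"))
```

A gateway is then just `to_protocol_b(from_protocol_a(payload))`: as long as both sides can express intent, result, and error, the translation preserves the execution semantics even though the framing differs.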
Pillar 4: Assurance Profiles (Dual Regime)
The gap it fills: Behavior Verification (Critical), Cross-Domain Security (High), Dynamic Trust (High), Data Provenance (Medium)
The same agent ecosystem must work in two regimes:
Relaxed (development, internal tools, low-risk): Best-effort, optional audit, minimal proof overhead. Think Kubernetes-deployed internal agents.
Regulated (finance, healthcare, critical infrastructure): Cryptographic attestation per task, provenance chains, behavior verification against declared specifications, mandatory audit ledger. Think medical or financial agents.
The architecture achieves this with assurance profiles -- named configurations that dial up or down the proof requirements. The same DAG, same HITL points, same protocol bindings. Different levels of evidence:
| Level | Evidence | Use Case |
|---|---|---|
| L0 | None (best-effort) | Development, testing |
| L1 | Unsigned audit trail | Internal production |
| L2 | Signed ECTs (JWT) | Cross-org, standard compliance |
| L3 | Signed ECTs + external audit ledger | Regulated industries |
This dual-regime approach resolves the tension between "move fast" deployments and "prove everything" regulated environments. The 52 ideas touching behavior verification and the 79 ideas touching data provenance become implementable at higher assurance levels without imposing their cost on every deployment.
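As a sketch, the L0-L3 levels can be modeled as named configurations that a runtime consults per task. The field names are illustrative assumptions, and the "signature" below is a placeholder string, not real JWT signing:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AssuranceProfile:
    """One named evidence configuration (fields are illustrative)."""
    name: str
    sign_ects: bool        # produce signed execution context tokens
    audit_trail: bool      # keep a local audit trail
    external_ledger: bool  # mirror evidence to an external audit ledger

PROFILES = {
    "L0": AssuranceProfile("L0", sign_ects=False, audit_trail=False, external_ledger=False),
    "L1": AssuranceProfile("L1", sign_ects=False, audit_trail=True,  external_ledger=False),
    "L2": AssuranceProfile("L2", sign_ects=True,  audit_trail=True,  external_ledger=False),
    "L3": AssuranceProfile("L3", sign_ects=True,  audit_trail=True,  external_ledger=True),
}

def evidence_for(level: str, task_id: str) -> dict:
    """Emit the evidence record a task should produce at a given level."""
    p = PROFILES[level]
    record = {"task": task_id}
    if p.audit_trail:
        record["audited"] = True
    if p.sign_ects:
        record["signature"] = f"<jwt-for-{task_id}>"  # placeholder, not real signing
    if p.external_ledger:
        record["ledger"] = "pending-append"
    return record
```

The same task code runs at every level; only the evidence it emits changes. That is the whole point of the dial: turning it up adds proof obligations without touching the DAG, the HITL points, or the protocol bindings.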
How It Builds on What Exists
A critical point: this architecture does not compete with existing work. It layers on top of it. Our cross-reference analysis confirms the foundations are strong: TLS 1.3 (RFC 8446, cited by 42 drafts), OAuth 2.0 (RFC 6749, 36 drafts), HTTP Semantics (RFC 9110, 34 drafts), JWT (RFC 7519, 22 drafts), and COSE (RFC 9052, 20 drafts) form the bedrock.
But the bedrock is not uniform. Our RFC foundation analysis (Post 3) revealed that the Chinese and Western blocs build on fundamentally different technology stacks: YANG/NETCONF for network management on one side, COSE/CBOR/CoAP for IoT security on the other. The only shared foundation is OAuth 2.0. This means the architecture layer above must be genuinely protocol-agnostic -- it cannot assume either stack as the default. The four pillars are designed with this constraint: the DAG model, HITL primitives, and assurance profiles are expressed in terms of abstract semantics, not specific wire formats. The protocol binding layer (Pillar 3) exists precisely because the underlying plumbing diverges.
The architecture adds connective tissue above this layer, not below it:
| Layer | Existing Work | What We Add |
|---|---|---|
| Identity | SPIFFE (workload identifier), WIMSE (security context propagation) | Nothing -- use existing identity |
| Evidence | ECT (execution context tokens, DAG linking) | Orchestration semantics, checkpoint/rollback, HITL nodes |
| Auth | OAuth 2.0, SCIM, DAAP, STAMP, Agentic JWT | Protocol binding so any auth approach works |
| Communication | MCP, A2A, SLIM, 120 other protocols | Translation layer and capability advertisement |
| Safety | DAAP (accountability), verifiable conversations, VERA (zero-trust) | Assurance profiles connecting these into deployable configurations |
The proposed five-draft ecosystem:
- Agent Ecosystem Model (AEM) -- Architecture and terminology. The shared vocabulary so everyone speaks the same language.
- Agent Task DAG (ATD) -- Execution semantics, checkpoints, rollback. How the DAG works.
- Human-in-the-Loop (HITL) Primitives -- Approval gates, overrides, escalation. How humans participate.
- Agent Ecosystem Protocol Binding (AEPB) -- Protocol translation, capability discovery, lifecycle management. How interoperability works.
- Assurance Profiles (APAE) -- Behavior verification, dynamic trust, provenance. How you prove it all works.
Each draft addresses specific gaps. Together, they provide the connective tissue the landscape lacks.
Traction vs. Aspiration
A reality check: of the 361 drafts, only 36 (10%) have been adopted by IETF working groups. The rest are individual submissions -- proposals without institutional backing. The WG-adopted drafts score higher on average (3.54 vs. 3.31), particularly on maturity (+1.28) and momentum (+0.98), but lower on novelty (-0.45). (Note: scores are LLM-generated relative rankings from abstracts; see Methodology.) The WGs that have adopted the most agent-relevant drafts are security-focused: lamps (6 drafts), lake (5), tls (3), emu (3). Agent-specific WGs like aipref have adopted only 2 drafts.
This reveals a structural insight: the IETF is not building agent standards from scratch. It is retrofitting security standards for agents. The agent architecture we propose above would need to work within this reality -- building on the security WGs' infrastructure rather than competing with it.
Predictions
Based on the data trajectories and current momentum:
Within 6 months: The OAuth-for-agents fragmentation will partially resolve. Working groups will adopt 2-3 canonical approaches (likely DAAP/STAMP for accountability and one of the RAR extensions for basic auth). The other 10 proposals will fade or merge.
Within 12 months: The DMSC side meeting's gateway work will produce a specification, likely gateway-centric with Agent Gateways as the primary interoperability mechanism. This is not the protocol-agnostic translation layer the ecosystem needs, but it will be the first concrete interop proposal.
Within 18 months: The safety deficit will begin to close -- not from IETF drafts but from regulatory pressure. The EU AI Act's requirements for high-risk AI systems will drive demand for behavior verification, human override, and audit standards. The IETF will respond reactively.
The risk: If the architecture work does not happen in the next 12 months, the agent ecosystem will calcify around vendor-specific protocol stacks (OpenAI's, Google's, Anthropic's, Huawei's). Each will have its own auth, discovery, and communication layer. The interoperability window will close, and the IETF's work will be standards for islands rather than standards for the internet.
Two Equilibria
By 2028, the landscape will have resolved into one of two stable states.
In the first equilibrium, it looks like today's microservices ecosystem: a chaotic but functional collection of protocols, libraries, and frameworks, held together by platform-specific integrations and de facto standards from the largest cloud providers. The IETF's work exists but is incomplete. The real interoperability happens at higher layers -- agent frameworks like LangChain, Semantic Kernel, or their successors. Safety is bolted on after deployment.
In the second equilibrium, it looks more like the web: a layered architecture where identity (like TLS), communication (like HTTP), and semantics (like HTML) are cleanly separated, with standardized interfaces between them. Agents identify via WIMSE, execute via ECT-based DAGs, communicate via protocol-agnostic bindings, and operate under assurance profiles that scale from development to regulated production. Safety is built in, not bolted on.
The 4:1 ratio is the leading indicator. If it narrows -- if safety and oversight work accelerates to match capability work -- the second equilibrium becomes achievable. If it stays at 4:1 or widens, the first equilibrium is where we land, and safety becomes remediation rather than prevention.
What Builders Should Do Today
If you are building agent systems and cannot wait for standards to mature:
1. Watch these drafts: ECT (execution evidence), DAAP (accountability), CHEQ (human confirmation), ADL (agent description), ANS (agent discovery). These have the highest combination of quality, novelty, and adoption potential.
2. Design for the DAG: Structure your multi-agent workflows as directed acyclic graphs with explicit dependencies and checkpoints. Even without a standard, the pattern will be compatible with whatever emerges.
3. Build HITL from the start: Every production agent deployment needs human override capability. Do not add it later. Design approval gates, emergency stops, and escalation paths into your architecture now.
4. Implement assurance as a dial: Make your proof/audit level configurable. Start at L0 for development, L1 for production, and be ready to turn up to L2/L3 when regulation arrives.
5. Avoid protocol lock-in: If you build on MCP today, architect for the possibility of supporting A2A or SLIM tomorrow. The protocol war is not over, and the winner may be "all of them via translation."
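One way to hedge against that lock-in, sketched in Python: hide the protocol behind a narrow transport interface so backends are swappable. The classes below are stubs for illustration, not real MCP or A2A clients.

```python
from typing import Protocol

class AgentTransport(Protocol):
    """Whatever protocol wins, application code talks only to this interface."""
    def send(self, intent: str, payload: dict) -> dict: ...

class McpTransport:
    """Illustrative stub; a real implementation would wrap an MCP client."""
    def send(self, intent: str, payload: dict) -> dict:
        return {"via": "mcp", "intent": intent, **payload}

class A2aTransport:
    """Illustrative stub; a real implementation would wrap an A2A client."""
    def send(self, intent: str, payload: dict) -> dict:
        return {"via": "a2a", "intent": intent, **payload}

def run_task(transport: AgentTransport, intent: str, payload: dict) -> dict:
    """Application logic depends on the interface, never on a protocol."""
    return transport.send(intent, payload)
```

Swapping protocols is then a one-line change at the composition root, and supporting "all of them via translation" means adding transports, not rewriting workflows.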
The Thesis
Across six posts, we have built to one argument:
The IETF's AI agent standardization effort is the largest, fastest-growing, and most consequential standards race in a decade. But it is building the highways before the traffic lights. The data shows explosive growth (from 0.5% to 9.3% of all IETF submissions in 15 months), deep fragmentation (120 competing A2A protocols), concerning concentration (one company writes 18% of all drafts), and a structural safety deficit (4:1 capability to guardrails). What is missing is not more protocols -- it is connective tissue: a shared execution model, human oversight primitives, protocol interoperability, and assurance profiles that work from development to regulated production.
The 75 convergent ideas -- and the broader set of 628 cross-org overlaps -- contain the components for this architecture. The question is whether the community can assemble them before the protocols ship without it. The convergence data suggests it is possible: 180 ideas already cross the Chinese-Western divide, mediated largely by European telecoms (Deutsche Telekom, Telefonica, Orange) that operate in both markets and appear on both sides of nearly every major cross-cultural convergent idea. The bridge-builders exist. They need an architecture to bridge to.
The IETF has built the internet's infrastructure before. DNS, HTTP, TLS -- each emerged from periods of competing proposals, fragmentation, and coordinated resolution. The AI agent standards race is following the same pattern, on a compressed timeline, with higher stakes.
The traffic lights need to catch up to the highways. The data says they can -- if someone draws the big picture.
Key Takeaways
- Four missing pillars: DAG-based execution, human-in-the-loop primitives, protocol-agnostic interoperability, and assurance profiles for dual-regime deployment
- The architecture builds on existing work: SPIFFE for identity, WIMSE for security context, ECT for execution evidence -- the foundation exists
- Five proposed drafts (AEM, ATD, HITL, AEPB, APAE) would fill the 12 gaps by providing connective tissue between existing protocol proposals
- The interoperability window is closing: vendor-specific agent stacks are forming; the next 12 months are critical for open standards
- For builders today: design for DAGs, build HITL from the start, make assurance configurable, avoid protocol lock-in
Next in this series: How We Built This -- the methodology behind analyzing 361 IETF drafts with Claude, Ollama, and Python.
Synthesis based on the full IETF Draft Analyzer dataset: 361 drafts, 557 authors, 75 cross-draft convergent ideas (628 via fuzzy matching), 12 gaps, 18 team blocs, 42 overlap clusters. Data current as of March 2026.