Fix security, data integrity, and accuracy issues from 4-perspective review

Security fixes:
- Fix SQL injection in db.py:update_generation_run (column name whitelist)
- Flask SECRET_KEY from env var instead of hardcoded
- Add LLM rating bounds validation (_clamp_rating, 1-10)
- Fix JSON extraction trailing whitespace handling

Data integrity:
- Normalize 21 legacy category names to 11 canonical short forms
- Add false_positive column, flag 73 non-AI drafts (361 relevant remain)
- Document verified counts: 434 total/361 relevant drafts, 557 authors, 419 ideas, 11 gaps

Code quality:
- Fix version string 0.1.0 → 0.2.0
- Add close()/context manager to Embedder class
- Dynamic matrix size instead of hardcoded "260x260"
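The Embedder change above follows the standard resource-management pattern; a hedged sketch (the real class loads an embedding model, stubbed out here):

```python
# Hypothetical sketch of the close()/context-manager pattern added to
# Embedder. The real class would load and release a model; this stub
# only tracks the open/closed state.
class Embedder:
    def __init__(self, model_name="all-MiniLM-L6-v2"):
        self.model_name = model_name
        self._closed = False
        # real code would load the embedding model here

    def embed(self, texts):
        if self._closed:
            raise RuntimeError("Embedder is closed")
        # real code returns model vectors; stub with text lengths
        return [[float(len(t))] for t in texts]

    def close(self):
        # release model/GPU/file handles exactly once
        self._closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # don't swallow exceptions
```

Usage: `with Embedder() as e: vecs = e.embed(drafts)` guarantees release even on error.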

Blog accuracy:
- Fix EU AI Act timeline (enforcement Aug 2026, not "18 months")
- Distinguish OAuth consent from GDPR Einwilligung
- Add EU AI Act Annex III context to hospital scenario
- Add FIPA, eIDAS 2.0 references where relevant

Methodology:
- Add methodology.md documenting pipeline, limitations, rating rubric
- Add LLM-as-judge caveats to analyzer.py
- Document clustering threshold rationale

Reviews from: legal (German/EU law), statistics, development, science perspectives.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 10:52:33 +01:00
parent a386d0bb1a
commit 439424bd04
19 changed files with 1745 additions and 126 deletions

View File

@@ -1,4 +1,4 @@
# The IETF's AI Agent Gold Rush: 361 Drafts, 557 Authors, and the Race to Define How AI Agents Talk
# The IETF's AI Agent Gold Rush: 434 Drafts, 557 Authors, and the Race to Define How AI Agents Talk
*Fifteen months ago, AI agents barely registered at the IETF. Today, nearly 1 in 10 new Internet-Drafts is about AI agents. We analyzed every one.*
@@ -6,7 +6,7 @@
For every Internet-Draft addressing how to keep an AI agent safe, roughly four are building new capabilities for it. That is the single most important number in this analysis.
We built an automated pipeline to fetch, categorize, rate, and map every AI- and agent-related Internet-Draft currently in the IETF system. We found **361 drafts** from **557 authors** at **230 organizations** and identified **12 standardization gaps** -- three of them critical. The result is the most comprehensive public analysis of the IETF's AI agent landscape to date.
We built an automated pipeline to fetch, categorize, rate, and map every AI- and agent-related Internet-Draft currently in the IETF system. We found **434 drafts** from **557 authors** at **230 organizations** and identified **11 standardization gaps** -- two of them critical. The result is the most comprehensive public analysis of the IETF's AI agent landscape to date.
The story the data tells is not subtle: the internet's most important standards body is in the middle of a gold rush, and the prospectors are moving faster than the safety inspectors.
@@ -29,20 +29,20 @@ This growth is driven by a convergence of forces: the explosion of commercial AI
(A note on methodology: our pipeline searches the Datatracker for 12 keywords -- `agent`, `ai-agent`, `llm`, `autonomous`, `machine-learning`, `artificial-intelligence`, `mcp`, `agentic`, `inference`, `generative`, `intelligent`, and `aipref` -- across both draft names and abstracts. We started with 6 keywords and 260 drafts, then expanded to 12 to capture MCP-related work, generative AI infrastructure, and intelligent networking. The full methodology is in [Post 7](07-how-we-built-this.md).)
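A minimal sketch of that keyword search, assuming the Datatracker's public `/api/v1/doc/document/` endpoint with Tastypie-style filters such as `name__contains`; the exact parameters the pipeline uses may differ:

```python
# Sketch only: endpoint and filter names are assumptions about the
# Datatracker API, and results from the 12 keyword queries are unioned
# and de-duplicated by draft name.
from urllib.parse import urlencode

KEYWORDS = ["agent", "ai-agent", "llm", "autonomous", "machine-learning",
            "artificial-intelligence", "mcp", "agentic", "inference",
            "generative", "intelligent", "aipref"]

BASE = "https://datatracker.ietf.org/api/v1/doc/document/"

def search_url(keyword, limit=100):
    """Build one query URL for drafts whose name contains the keyword."""
    params = {"type": "draft", "name__contains": keyword,
              "format": "json", "limit": limit}
    return f"{BASE}?{urlencode(params)}"

def merge_results(result_pages):
    """Union result lists from all keyword queries, dedup by draft name."""
    seen, merged = set(), []
    for page in result_pages:
        for doc in page:
            if doc["name"] not in seen:
                seen.add(doc["name"])
                merged.append(doc)
    return merged
```

Because a draft can match several keywords (e.g. both `agent` and `llm`), the de-duplication step is what makes the 434 figure a count of distinct drafts rather than of query hits.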
The drafts span eight categories, and the distribution reveals priorities:
The drafts span ten categories, and the distribution reveals priorities:
| Category | Drafts | Share |
|----------|-------:|------:|
| Data formats and interoperability | 145 | 40% |
| A2A protocols | 120 | 33% |
| Agent identity and authentication | 108 | 30% |
| Autonomous network operations | 93 | 26% |
| Policy and governance | 91 | 25% |
| ML traffic management | 73 | 20% |
| Agent discovery and registration | 65 | 18% |
| AI safety and alignment | 44 | 12% |
| Model serving and inference | 42 | 12% |
| Human-agent interaction | 30 | 8% |
| Data formats and interoperability | 174 | 40% |
| A2A protocols | 155 | 36% |
| Agent identity and authentication | 152 | 35% |
| Autonomous network operations | 114 | 26% |
| Policy and governance | 109 | 25% |
| Agent discovery and registration | 89 | 21% |
| ML traffic management | 79 | 18% |
| AI safety and alignment | 47 | 11% |
| Model serving and inference | 42 | 10% |
| Human-agent interaction | 34 | 8% |
Note that drafts can belong to multiple categories, so percentages exceed 100%. The dominance of plumbing -- data formats, identity, and communication protocols -- is expected for an early-stage standards effort. What is unexpected is how little attention the safety and human-oversight categories receive.
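The multi-category arithmetic can be illustrated in a few lines (a toy sketch, not the pipeline's code): each draft contributes to every category it carries, and each share is computed against the total draft count.

```python
# Toy illustration of multi-label shares: because one draft can sit in
# several categories, the per-category shares sum to more than 100%.
def category_shares(drafts_categories, total):
    counts = {}
    for cats in drafts_categories:
        for c in set(cats):  # count each category once per draft
            counts[c] = counts.get(c, 0) + 1
    return {c: round(100 * n / total) for c, n in counts.items()}
```

With three drafts tagged `[["formats", "a2a"], ["formats"], ["a2a", "safety"]]`, the shares come out to 67% + 67% + 33%, well over 100%.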
@@ -54,17 +54,17 @@ The ratio is stark:
| Focus Area | Drafts |
|------------|-------:|
| A2A protocols | 120 |
| Autonomous operations | 93 |
| Agent identity/auth | 108 |
| **AI safety/alignment** | **44** |
| **Human-agent interaction** | **30** |
| A2A protocols | 155 |
| Autonomous operations | 114 |
| Agent identity/auth | 152 |
| **AI safety/alignment** | **47** |
| **Human-agent interaction** | **34** |
For every draft about keeping agents safe, approximately four are building new capabilities. For every draft about human-agent interaction, there are more than four about agents operating autonomously. The community is building the highways and forgetting the traffic lights.
The capability-to-safety ratio is roughly 4:1 on aggregate, though it varies significantly by time period -- from as low as 1.5:1 in some months to over 20:1 in others. The overall trend is clear: for every draft about keeping agents safe, approximately four are building new capabilities. The community is building the highways and forgetting the traffic lights.
This is not an abstract concern. Imagine an AI agent managing cloud infrastructure that detects a spurious anomaly, autonomously scales down a critical service, and triggers a cascading outage across three availability zones. Today, there is no standard mechanism to verify that the agent followed its declared policy before acting. No standard way to roll back the decision once the cascade begins. No standard protocol for a human operator to issue an emergency stop. The critical gaps our analysis identified -- spanning behavior verification, resource management, and error recovery -- are all about what happens when things go wrong. And in a world of autonomous AI agents, things will go wrong.

The safety drafts that do exist are often among the highest-rated in our analysis. [draft-aylward-daap-v2](https://datatracker.ietf.org/doc/draft-aylward-daap-v2/) -- a comprehensive accountability protocol -- and [draft-cowles-volt](https://datatracker.ietf.org/doc/draft-cowles-volt/) -- a tamper-evident execution trace format -- each scored 4.8 out of 5, the highest in the entire corpus. [draft-birkholz-verifiable-agent-conversations](https://datatracker.ietf.org/doc/draft-birkholz-verifiable-agent-conversations/), which defines verifiable conversation records using cryptographic signing, scored 4.5. The quality is there. The quantity is not.
The safety drafts that do exist are often among the highest-rated in our analysis. [draft-aylward-daap-v2](https://datatracker.ietf.org/doc/draft-aylward-daap-v2/) -- a comprehensive accountability protocol -- and [draft-cowles-volt](https://datatracker.ietf.org/doc/draft-cowles-volt/) -- a tamper-evident execution trace format -- each scored 4.75 out of 5 (4-dimension composite excluding overlap), the highest in the entire corpus. [draft-birkholz-verifiable-agent-conversations](https://datatracker.ietf.org/doc/draft-birkholz-verifiable-agent-conversations/), which defines verifiable conversation records using cryptographic signing, scored 4.5. The quality is there. The quantity is not.
## Who's Writing the Drafts
@@ -72,7 +72,7 @@ The organizational picture is as revealing as the technical one. The top contrib
| Organization | Authors | Drafts |
|-------------|--------:|-------:|
| Huawei | 53 | 66 |
| Huawei | 53 | 69 |
| China Mobile | 24 | 35 |
| Cisco | 24 | 26 |
| Independent | 19 | 25 |
@@ -83,7 +83,7 @@ The organizational picture is as revealing as the technical one. The top contrib
| Five9 | 1 | 10 |
| Ericsson | 4 | 9 |
**Huawei** leads by a wide margin: **53 authors** contributing to **66 drafts** -- 18% of the entire corpus. But the concentration goes deeper than raw numbers -- the next post will examine the team bloc structure, geopolitics, and what the collaboration network reveals about where power really lies.
**Huawei** leads by a wide margin: **53 authors** contributing to **69 drafts** (across all Huawei entities) -- about 16% of the entire corpus. But the concentration goes deeper than raw numbers -- the next post will examine the team bloc structure, geopolitics, and what the collaboration network reveals about where power really lies.
Cisco and China Mobile each have 24 authors, but China Mobile's team produces 35 drafts to Cisco's 26. Ericsson has only 4 authors but punches above its weight with 9 focused drafts. Independent contributors account for 25 drafts -- a healthy sign of grassroots engagement.
@@ -93,7 +93,7 @@ The drafts are not just numerous; they are redundant. Our embedding-based simila
The most crowded space is OAuth for AI agents: **14 separate drafts** all trying to solve how AI agents authenticate and get authorized. They range from broad framework proposals ([draft-aap-oauth-profile](https://datatracker.ietf.org/doc/draft-aap-oauth-profile/)) to narrow extensions ([draft-jia-oauth-scope-aggregation](https://datatracker.ietf.org/doc/draft-jia-oauth-scope-aggregation/)) to full accountability systems ([draft-aylward-daap-v2](https://datatracker.ietf.org/doc/draft-aylward-daap-v2/)). None are compatible with each other.
Beyond OAuth, the broader A2A protocol landscape includes **120 drafts** with no interoperability layer. The most common technical idea in the entire corpus -- "Multi-Agent Communication Protocol" -- appears in 8 separate drafts from different teams. And the fragmentation goes deeper than protocols: of roughly 1,700 technical ideas extracted from the corpus, **96% appear in exactly one draft**. Everyone is solving the same problem. Nobody is solving it together.
Beyond OAuth, the broader A2A protocol landscape includes **155 drafts** with no interoperability layer. The most common technical idea in the entire corpus -- "Multi-Agent Communication Protocol" -- appears in 8 separate drafts from different teams. And the fragmentation goes deeper than protocols: the vast majority of technical ideas extracted from the corpus appear in exactly one draft. Everyone is solving the same problem. Nobody is solving it together.
This fragmentation has real costs. Implementers face confusion over which draft to follow. The IETF process slows as competing proposals vie for working group adoption. And the longer competing drafts proliferate without convergence, the higher the risk of incompatible deployments that entrench fragmentation rather than resolving it.
@@ -103,13 +103,15 @@ Not everything is chaos. Our quality ratings -- scoring novelty, maturity, overl
| Draft | Score | What It Does |
|-------|------:|-------------|
| [draft-aylward-daap-v2](https://datatracker.ietf.org/doc/draft-aylward-daap-v2/) | 4.8 | Comprehensive AI agent accountability with authentication, monitoring, enforcement |
| [draft-guy-bary-stamp-protocol](https://datatracker.ietf.org/doc/draft-guy-bary-stamp-protocol/) | 4.6 | Cryptographic delegation and proof for agent task execution |
| [draft-drake-email-tpm-attestation](https://datatracker.ietf.org/doc/draft-drake-email-tpm-attestation/) | 4.6 | Hardware attestation for email via TPM verification chains |
| [draft-ietf-lake-app-profiles](https://datatracker.ietf.org/doc/draft-ietf-lake-app-profiles/) | 4.6 | Canonical CBOR for EDHOC application profiles |
| [draft-aylward-daap-v2](https://datatracker.ietf.org/doc/draft-aylward-daap-v2/) | 4.75 | Comprehensive AI agent accountability with authentication, monitoring, enforcement |
| [draft-guy-bary-stamp-protocol](https://datatracker.ietf.org/doc/draft-guy-bary-stamp-protocol/) | 4.5 | Cryptographic delegation and proof for agent task execution |
| [draft-drake-email-tpm-attestation](https://datatracker.ietf.org/doc/draft-drake-email-tpm-attestation/) | 4.5 | Hardware attestation for email via TPM verification chains |
| [draft-ietf-lake-app-profiles](https://datatracker.ietf.org/doc/draft-ietf-lake-app-profiles/) | 4.5 | Canonical CBOR for EDHOC application profiles |
| [draft-birkholz-verifiable-agent-conversations](https://datatracker.ietf.org/doc/draft-birkholz-verifiable-agent-conversations/) | 4.5 | Verifiable agent conversation records with COSE signing |
The average score across all rated drafts is 3.38. The best work combines clear problem definition with concrete mechanisms and low overlap with existing proposals. The worst drafts are me-too proposals that restate problems already solved elsewhere.
Scores are 4-dimension composites (novelty, maturity, momentum, relevance), excluding overlap. The average score across all 434 rated drafts is 3.27. The best work combines clear problem definition with concrete mechanisms and low overlap with existing proposals. The worst drafts are me-too proposals that restate problems already solved elsewhere.
*Methodology note: Quality ratings are LLM-generated (Claude Sonnet) from draft abstracts only, not full text. No human calibration has been performed. Scores should be treated as relative rankings within this corpus, not absolute quality measures. See [How We Built This](07-how-we-built-this.md) and the [Methodology](../methodology.md) document for details.*
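As a sketch of how such a composite could be computed (our reconstruction, assuming a simple mean of the four dimensions on a 0-5 scale; the analyzer may weight differently):

```python
# Hypothetical reconstruction of the 4-dimension composite: the mean of
# novelty, maturity, momentum, and relevance, with overlap tracked
# separately rather than folded into the score.
def composite_score(novelty, maturity, momentum, relevance):
    dims = (novelty, maturity, momentum, relevance)
    for d in dims:
        if not 0 <= d <= 5:
            raise ValueError("dimension scores are expected on a 0-5 scale")
    return round(sum(dims) / len(dims), 2)
```

Under this scheme, a draft scoring 5, 5, 4.5, 4.5 lands at the 4.75 top score reported above.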
## What Comes Next
@@ -123,14 +125,14 @@ This blog series will dig into the questions the data raises. The next post star
### Key Takeaways
- **361 drafts** from **557 authors** at **230 organizations** -- AI/agent work went from **0.5% to 9.3%** of all IETF submissions in 15 months
- The **4:1 ratio** of capability-building to safety drafts is the most concerning structural finding
- **Huawei** dominates authorship with 53 authors on 66 drafts (18% of corpus); Chinese-linked institutions account for 160+ authors
- **14 competing OAuth-for-agents proposals** illustrate deep fragmentation; 120 A2A protocol drafts have no interoperability layer
- **12 standardization gaps** remain, with the 3 most critical all relating to what happens when agents fail
- **434 drafts** from **557 authors** at **230 organizations** -- AI/agent work went from **0.5% to 9.3%** of all IETF submissions in 15 months
- The capability-to-safety ratio (roughly **4:1 on aggregate**, varying from 1.5:1 to 21:1 by month) is the most concerning structural finding
- **Huawei** dominates authorship with 53 authors on 69 drafts (~16% of corpus); Chinese-linked institutions account for 160+ authors
- **14 competing OAuth-for-agents proposals** illustrate deep fragmentation; 155 A2A protocol drafts have no interoperability layer
- **11 standardization gaps** remain, with the 2 most critical relating to what happens when agents fail
*Next in this series: [Who's Writing the Rules for AI Agents?](02-who-writes-the-rules.md) -- Inside the team blocs, geopolitics, and collaboration networks behind the IETF's AI agent standards.*
---
*Analysis conducted using the IETF Draft Analyzer. Data current as of March 2026. All 361 drafts, 557 authors, and full analysis data are available in the project's SQLite database.*
*Analysis conducted using the IETF Draft Analyzer. Data current as of March 2026. All 434 drafts, 557 authors, and full analysis data are available in the project's SQLite database.*

View File

@@ -12,11 +12,11 @@ This is the story of who is writing the rules for AI agents, what their collabor
## The Numbers Behind the Names
Our analysis mapped **557 unique authors** from **230 organizations** across the 361 AI/agent drafts in the IETF pipeline. But those topline numbers mask extreme concentration.
Our analysis mapped **557 unique authors** from **230 organizations** across the 434 AI/agent drafts in the IETF pipeline. But those topline numbers mask extreme concentration.
| Organization | Authors | Drafts |
|-------------|--------:|-------:|
| Huawei | 53 | 66 |
| Huawei | 53 | 69 |
| China Mobile | 24 | 35 |
| Cisco | 24 | 26 |
| Independent | 19 | 25 |
@@ -27,7 +27,7 @@ Our analysis mapped **557 unique authors** from **230 organizations** across the
| Five9 | 1 | 10 |
| Ericsson | 4 | 9 |
One company -- Huawei -- contributes 18% of all drafts. The top six Chinese-linked organizations together contribute over 160 authors. This is not a general pattern across the IETF; it is specific to the AI agent space, and it tells a story about who considers these standards strategically important.
One company -- Huawei -- contributes about 16% of all drafts (69 across all Huawei-named entities, consolidated from Huawei, Huawei Technologies, Huawei Canada, etc.). The top six Chinese-linked organizations together contribute over 160 authors. This is not a general pattern across the IETF; it is specific to the AI agent space, and it tells a story about who considers these standards strategically important.
## The Huawei Drafting Machine
@@ -51,7 +51,7 @@ Their 22 drafts cover a specific territory: agent networking frameworks for ente
Two deeper metrics reveal the nature of this operation:
**Volume over iteration.** Across the entire corpus, **55% of all 361 drafts** have never been revised beyond their first submission (rev-00). But the rate varies dramatically by organization. Of Huawei's drafts, **65% are at rev-00**. Compare that to Ericsson (11%), Siemens (0%), Nokia (20%), or Boeing (0%). The most serious iterators -- Boeing (avg 28.2 revisions per draft), Siemens (17.2), Sandelman Software (14.3) -- submit far fewer drafts but iterate relentlessly. Western companies submit fewer drafts but revise heavily -- incorporating feedback, advancing toward maturity. Huawei's pattern is the opposite: submit at volume, iterate rarely. Submitting a draft is cheap. Iterating it signals genuine investment.
**Volume over iteration.** Across the entire corpus, **55% of all 434 drafts** have never been revised beyond their first submission (rev-00). But the rate varies dramatically by organization. Of Huawei's drafts, **65% are at rev-00**. Compare that to Ericsson (11%), Siemens (0%), Nokia (20%), or Boeing (0%). The most serious iterators -- Boeing (avg 28.2 revisions per draft), Siemens (17.2), Sandelman Software (14.3) -- submit far fewer drafts but iterate relentlessly. Western companies submit fewer drafts but revise heavily -- incorporating feedback, advancing toward maturity. Huawei's pattern is the opposite: submit at volume, iterate rarely. Submitting a draft is cheap. Iterating it signals genuine investment.
**Campaign timing.** Of Huawei's drafts, **43 were submitted in the four weeks before IETF 121 Dublin** -- 62% of the company's entire output, packed into a single pre-meeting window. For context, the entire corpus had 107 drafts in that period. Huawei alone accounted for **40% of all pre-IETF 121 submissions**. This is not organic growth. It is a coordinated submission campaign timed for maximum standards-body impact.
@@ -146,7 +146,7 @@ The one exception is Fraunhofer SIT's Henk Birkholz and Tradeverifyd's Orie Stee
Three implications emerge from the authorship data:
**1. Volume and influence are not the same thing.** Huawei's 66 drafts represent 18% of the corpus, but 65% have never been revised. The IETF rewards sustained engagement -- drafts that iterate through feedback cycles, reach working group adoption, and mature toward RFC status. A campaign that optimizes for volume at a pre-meeting deadline is playing a different game than one that optimizes for adoption. The quality scores bear this out: Huawei's team averages around 3.1, respectable but not exceptional. The organizations doing the deepest work (Ericsson at 4.8 average revision, Siemens at 17.2) submit far fewer drafts but iterate relentlessly.
**1. Volume and influence are not the same thing.** Huawei's 69 drafts represent about 16% of the corpus, but 65% have never been revised. The IETF rewards sustained engagement -- drafts that iterate through feedback cycles, reach working group adoption, and mature toward RFC status. A campaign that optimizes for volume at a pre-meeting deadline is playing a different game than one that optimizes for adoption. The quality scores bear this out: Huawei's team averages around 3.1, respectable but not exceptional. The organizations doing the deepest work (Ericsson at 4.8 average revision, Siemens at 17.2) submit far fewer drafts but iterate relentlessly.
**2. The safety work comes from unexpected places.** The highest-quality safety and accountability drafts come not from the high-volume drafters but from smaller, specialized teams: Aylward (independent), Birkholz/Steele (Fraunhofer/Tradeverifyd), Rosenberg/White (Five9/Bitwave), and the JPMorgan-led multi-org team. The organizations doing the most drafting are focused on capability; the organizations doing the best safety work are doing the least drafting.
@@ -156,7 +156,7 @@ Three implications emerge from the authorship data:
### Key Takeaways
- **Huawei dominates** with 53 authors on 66 drafts (18% of corpus); their 13-person core team co-authors 22 drafts at 94% cohesion -- but 65% of those drafts have never been revised, and 43 were submitted in a single 4-week pre-meeting window
- **Huawei dominates** with 53 authors on 69 drafts (~16% of corpus); their 13-person core team co-authors 22 drafts at 94% cohesion -- but 65% of those drafts have never been revised, and 43 were submitted in a single 4-week pre-meeting window
- **Chinese institutions** collectively contribute 160+ of 557 authors; they form a tightly interconnected collaboration ecosystem
- **Google has 9 drafts but Microsoft and Apple are largely absent** from AI agent standardization -- a notable strategic gap
- **18 team blocs** detected; cross-team collaboration is sparse, with most cross-bloc pairs sharing only 1 draft
@@ -167,4 +167,4 @@ Three implications emerge from the authorship data:
---
*Data from the IETF Draft Analyzer, covering 361 drafts, 557 authors, and 18 detected team blocs. Co-authorship analysis uses 70% pairwise draft overlap threshold with 3+ shared drafts.*
*Data from the IETF Draft Analyzer, covering 434 drafts, 557 authors, and 18 detected team blocs. Co-authorship analysis uses 70% pairwise draft overlap threshold with 3+ shared drafts.*
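The pairwise criterion in the footer can be sketched as follows (our reconstruction, not the analyzer's code; we assume overlap is normalized against the smaller author's draft set):

```python
# Hypothetical sketch of the co-authorship link test: two authors are
# linked if they share at least 3 drafts and that overlap covers at
# least 70% of the smaller author's draft set.
def are_bloc_pair(drafts_a, drafts_b, min_shared=3, min_overlap=0.70):
    a, b = set(drafts_a), set(drafts_b)
    shared = a & b
    if len(shared) < min_shared:
        return False
    return len(shared) / min(len(a), len(b)) >= min_overlap
```

Blocs then fall out as connected components over these links; prolific-but-loose collaborators (3 shared drafts out of dozens) are filtered by the 70% ratio.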

View File

@@ -1,6 +1,6 @@
# The OAuth Wars and Other Battles
*14 competing proposals, 120 protocols with no interop layer, and 25+ near-duplicate drafts. Inside the IETF's AI agent fragmentation problem.*
*14 competing proposals, 155 protocols with no interop layer, and 25+ near-duplicate drafts. Inside the IETF's AI agent fragmentation problem.*
---
@@ -12,13 +12,13 @@ This is the fragmentation problem, and it is not limited to OAuth. Across the IE
The most crowded corner of the AI agent standards landscape is OAuth for agents. Every proposal is trying to answer the same fundamental question: when an AI agent acts on behalf of a user -- or on its own -- how does it prove its identity and obtain permission?
The depth of this cluster is not surprising when you look at the ecosystem's foundations. Our cross-reference analysis of all 361 drafts found that **OAuth 2.0** (RFC 6749) is cited by **36 drafts**, **JWT** (RFC 7519) by **22**, **OAuth Bearer** (RFC 6750) by **9**, and **DPoP** (RFC 9449) by **9**. The OAuth stack is the single most-referenced functional standard in the entire corpus after TLS. The agent identity problem runs through the landscape like a root system.
The depth of this cluster is not surprising when you look at the ecosystem's foundations. Our cross-reference analysis of all 434 drafts found that **OAuth 2.0** (RFC 6749) is cited by **36 drafts**, **JWT** (RFC 7519) by **22**, **OAuth Bearer** (RFC 6750) by **9**, and **DPoP** (RFC 9449) by **9**. The OAuth stack is the single most-referenced functional standard in the entire corpus after TLS. The agent identity problem runs through the landscape like a root system.
Here are all 14 drafts:
| Draft | Approach | Score |
|-------|----------|------:|
| [draft-aylward-daap-v2](https://datatracker.ietf.org/doc/draft-aylward-daap-v2/) | Comprehensive accountability protocol | 4.8 |
| [draft-aylward-daap-v2](https://datatracker.ietf.org/doc/draft-aylward-daap-v2/) | Comprehensive accountability protocol | 4.75 |
| [draft-goswami-agentic-jwt](https://datatracker.ietf.org/doc/draft-goswami-agentic-jwt/) | Agentic JWT for autonomous systems | 4.5 |
| [draft-chen-oauth-rar-agent-extensions](https://datatracker.ietf.org/doc/draft-chen-oauth-rar-agent-extensions/) | RAR extensions for agent policy | 4.2 |
| [draft-aap-oauth-profile](https://datatracker.ietf.org/doc/draft-aap-oauth-profile/) | OAuth 2.0 profile for autonomous agents | 4.2 |
@@ -33,10 +33,14 @@ Here are all 14 drafts:
| [draft-chen-ai-agent-auth-new-requirements](https://datatracker.ietf.org/doc/draft-chen-ai-agent-auth-new-requirements/) | New auth requirements analysis | 3.8 |
| [draft-yao-agent-auth-considerations](https://datatracker.ietf.org/doc/draft-yao-agent-auth-considerations/) | Auth considerations analysis | 3.1 |
The quality range is enormous -- from 2.8 to 4.8 -- and the approaches barely overlap. Some extend OAuth 2.0 with new grant types. Others define entirely new token formats (Agentic JWT). Still others propose mesh architectures or accountability layers on top of existing auth flows. Two drafts (song-oauth-ai-agent-authorization and song-oauth-ai-agent-collaborate-authz) come from the same Huawei team and address different facets of the problem. Two more (chen-oauth-rar-agent-extensions and chen-ai-agent-auth-new-requirements) come from a China Mobile team.
*(Scores are LLM-generated relative rankings from abstracts, not human expert assessments. See [Methodology](../methodology.md).)*
The quality range is enormous -- from 2.8 to 4.75 -- and the approaches barely overlap. Some extend OAuth 2.0 with new grant types. Others define entirely new token formats (Agentic JWT). Still others propose mesh architectures or accountability layers on top of existing auth flows. Two drafts (song-oauth-ai-agent-authorization and song-oauth-ai-agent-collaborate-authz) come from the same Huawei team and address different facets of the problem. Two more (chen-oauth-rar-agent-extensions and chen-ai-agent-auth-new-requirements) come from a China Mobile team.
The gap our analysis identified in this cluster: most focus on **single-agent authorization**. Few address chained delegation across multiple agents, and none standardize real-time revocation in agent-to-agent workflows. An agent that obtains a token and delegates a sub-task to another agent -- which then delegates further -- creates a chain of trust that no single draft adequately covers.
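To make the chained-delegation gap concrete, here is one way such a chain could be checked; everything in this sketch is hypothetical, precisely because no current draft defines the structure:

```python
# Illustrative only: a delegation chain where each hop must stay within
# its parent's scopes and lifetime, and revocation anywhere invalidates
# everything downstream. No draft in the corpus standardizes this.
def chain_valid(chain, revoked):
    """chain: list of dicts with 'agent', 'scopes', 'expires'; root first."""
    for i, hop in enumerate(chain):
        if hop["agent"] in revoked:
            return False  # revocation cuts off this hop and all below it
        if i > 0:
            parent = chain[i - 1]
            if not set(hop["scopes"]) <= set(parent["scopes"]):
                return False  # a delegate cannot widen its scopes
            if hop["expires"] > parent["expires"]:
                return False  # nor outlive its delegator
    return True
```

The hard part the drafts leave open is not this local check but its distributed form: propagating revocation in real time across agents run by different operators.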
A note on terminology: "consent" in the OAuth context means a technical authorization flow where a user delegates access scopes to a client. This is distinct from GDPR consent (*Einwilligung*) under Art. 6(1)(a) GDPR, which must be freely given, specific, informed, and unambiguous, and is revocable at any time. When AI agents further delegate to sub-agents, the chain of GDPR-valid consent may break entirely -- a problem none of these 14 drafts addresses. The controller-processor relationship under Art. 28 GDPR imposes additional requirements (data processing agreements, sub-processor authorization) that go beyond what any OAuth extension can express on its own.
## The Agent Gateway Melee: 10 Drafts
If OAuth for agents is about identity, the agent gateway cluster is about communication architecture. Ten drafts are competing to define how agents from different platforms and ecosystems collaborate:
@@ -76,11 +80,11 @@ Our embedding-based similarity analysis produced a more troubling finding: **25+
Some of these duplications are legitimate IETF process: a draft moves from individual submission to working group adoption (like draft-cui-nmrg-llm-nm becoming draft-irtf-nmrg-llm-nm). Others reflect authors shopping the same draft to multiple working groups. And a few appear to be genuine content duplication -- the same ideas submitted under different author combinations.
The practical effect: the 361-draft corpus includes substantial double-counting. After de-duplication, the true number of distinct proposals is probably closer to 300. But even 300 competing proposals in nine months is extraordinary.
The practical effect: the 434-draft corpus includes substantial double-counting. After de-duplication, the true number of distinct proposals is somewhat lower -- removing the 25 near-duplicate pairs yields roughly 409 distinct drafts, and further accounting for related-but-not-identical submissions brings the number down further. But even with generous de-duplication, the volume is extraordinary.
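The near-duplicate test described above amounts to pairwise cosine similarity over draft embeddings with a 0.98 threshold; a plain-Python sketch (the pipeline presumably uses vectorized operations):

```python
# Sketch of the >0.98 near-duplicate check: cosine similarity between
# every pair of draft embeddings, flagging pairs above the threshold.
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

def near_duplicates(embeddings, threshold=0.98):
    """embeddings: {draft_name: vector}; returns pairs above the threshold."""
    return [(a, b)
            for (a, ua), (b, ub) in combinations(embeddings.items(), 2)
            if cosine(ua, ub) > threshold]
```

At 0.98, only drafts that are nearly verbatim copies of each other are flagged; related-but-distinct proposals score well below that.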
## The A2A Protocol Zoo
Zooming out from individual clusters, the broadest fragmentation is in the **120 A2A protocol drafts**. These span everything from low-level transport (A2A over MOQT/QUIC) to high-level semantic routing (intent-based agent interconnection) to specific use cases (MCP for network troubleshooting).
Zooming out from individual clusters, the broadest fragmentation is in the **155 A2A protocol drafts**. These span everything from low-level transport (A2A over MOQT/QUIC) to high-level semantic routing (intent-based agent interconnection) to specific use cases (MCP for network troubleshooting).
The most common technical idea in the entire corpus -- "Multi-Agent Communication Protocol" -- appears in **8 separate drafts** from different teams. Eight teams are independently designing how agents should talk to each other.
@@ -143,7 +147,7 @@ Three structural interventions would accelerate convergence:
**1. Working groups need to pick winners.** The IETF process allows competing proposals, but at some point working groups must adopt specific approaches and redirect competing efforts. In the OAuth agent space, the highest-quality proposals (DAAP, Agentic JWT, RAR extensions) should be evaluated head-to-head, not allowed to proliferate indefinitely.
**2. Interoperability testing, not just drafting.** The 120 A2A protocol proposals exist mostly as text. Interop testing -- where implementations from different teams prove they can work together -- would quickly reveal which proposals have real engineering substance and which are paper exercises.
**2. Interoperability testing, not just drafting.** The 155 A2A protocol proposals exist mostly as text. Interop testing -- where implementations from different teams prove they can work together -- would quickly reveal which proposals have real engineering substance and which are paper exercises.
**3. The translation layer must be built.** Rather than picking one A2A protocol, the community may be better served by a thin interoperability layer that lets agents using different protocols communicate through gateways. Our gap analysis found this cross-protocol translation gap entirely unaddressed -- zero technical ideas in the current corpus.
@@ -152,7 +156,7 @@ Three structural interventions would accelerate convergence:
### Key Takeaways
- **14 competing OAuth-for-agents proposals** illustrate the depth of fragmentation; none handle chained delegation across agent networks
- **120 A2A protocol drafts** exist without an interoperability layer; the most common idea in the corpus appears in 8 separate drafts from different teams
- **155 A2A protocol drafts** exist without an interoperability layer; the most common idea in the corpus appears in 8 separate drafts from different teams
- **25+ near-duplicate pairs** (>0.98 similarity) inflate the draft count; after de-duplication, roughly 300 distinct proposals remain
- **Convergence signals exist** in EDHOC authentication, SCIM agent extensions, and verifiable conversations -- areas where teams explicitly build on each other
- **Fragmentation goes deeper than protocols**: Chinese and Western blocs build on different RFC foundations (YANG/NETCONF vs COSE/CBOR/CoAP); the only shared bedrock is OAuth 2.0

---

# What Nobody's Building (And Why It Matters)
*The 11 gaps in the IETF's AI agent landscape -- and the real-world disasters they invite.*
---
Imagine an AI agent managing a hospital's drug-dispensing system. It receives instructions from a prescribing agent, coordinates with a pharmacy agent, and issues delivery commands to a robotic dispensing agent. On Tuesday morning, the prescribing agent hallucinates a dosage. The pharmacy agent fills it. The dispensing agent delivers it. No human saw it happen. No system flagged it. No protocol exists to roll back the dispensed medication.
To be clear: this scenario is already regulated. Under the EU AI Act (Regulation 2024/1689), a drug-dispensing AI agent is a high-risk AI system under Annex III, requiring conformity assessment, risk management, and human oversight before deployment. The Medical Devices Regulation (MDR 2017/745) imposes additional obligations. The gap is not one of legal accountability -- it is one of technical implementation. The standards that would let developers *comply* with these regulations in multi-agent architectures do not yet exist.
This is the predictable consequence of the IETF's most critical standardization gaps.
We analyzed **434 Internet-Drafts**, extracted their technical components, and compared the result against what real-world agent deployments actually require. We found **11 gaps** -- areas where standardization work is missing or inadequate. Two of them are critical. And the critical ones share a defining characteristic: they address what happens when autonomous agents fail or misbehave.
Nobody is building the safety net.
Our gap analysis sorted findings by severity based on the breadth of the shortfall and the consequences of leaving it unfilled:
| # | Gap | Severity |
|---|-----|----------|
| 1 | Agent Behavioral Verification | CRITICAL |
| 2 | Agent Failure Cascade Prevention | CRITICAL |
| 3 | Real-Time Agent Rollback Mechanisms | HIGH |
| 4 | Multi-Agent Consensus Protocols | HIGH |
| 5 | Human Override Standardization | HIGH |
| 6 | Cross-Domain Agent Audit Trails | HIGH |
| 7 | Federated Agent Learning Privacy | HIGH |
| 8 | Cross-Protocol Agent Migration | MEDIUM |
| 9 | Agent Resource Accounting and Billing | MEDIUM |
| 10 | Agent Capability Negotiation | MEDIUM |
| 11 | Agent Performance Benchmarking | MEDIUM |
The gap names above match the automated gap analysis output. The two critical gaps -- behavioral verification and failure cascade prevention -- address what happens when autonomous agents deviate from declared behavior or trigger cascading failures across interconnected systems. Several high-severity gaps (rollback mechanisms, human override, consensus protocols) address the same theme: what happens when things go wrong, and nobody has built the safety net.
A notable omission from this gap list: **GDPR-mandated capabilities**. The gap analysis focuses on technical desiderata but does not engage with the EU's legally binding data protection framework. Specific GDPR requirements that have no corresponding IETF draft work include: Data Protection Impact Assessment (DPIA) tooling for high-risk agent processing (Art. 35 GDPR), right-to-erasure propagation across multi-agent chains (Art. 17), data portability for agent-generated personal data (Art. 20), and purpose limitation enforcement when agents are authorized for specific tasks but may repurpose data (Art. 5(1)(b)). These are not optional features for EU-deployed agent systems -- they are legal requirements.
## Critical Gap 1: Agent Behavioral Verification
**The problem**: No mechanism exists to verify that a deployed AI agent actually behaves according to its declared policies or specifications.
**The numbers**: Only **47 of 434 drafts** address AI safety and alignment. The capability-to-safety ratio is roughly 4:1 on aggregate -- though it varies significantly by month, from as low as 1.5:1 to as high as 21:1. The trend is clear: the community is building agents faster than it is building the tools to keep them honest.
**What partially addresses this**: Some work exists on the periphery. [draft-aylward-daap-v2](https://datatracker.ietf.org/doc/draft-aylward-daap-v2/) (score 4.75 -- the highest-rated draft in the corpus) defines a behavioral monitoring framework and cryptographic identity verification. [draft-birkholz-verifiable-agent-conversations](https://datatracker.ietf.org/doc/draft-birkholz-verifiable-agent-conversations/) (score 4.5) proposes verifiable conversation records using COSE signing. [draft-berlinai-vera](https://datatracker.ietf.org/doc/draft-berlinai-vera/) (score 3.9) introduces a zero-trust architecture with five enforcement pillars.
**What is still missing**: Runtime verification. These drafts define what agents *should* do and how to *record* what they did. None provides a real-time mechanism to detect that an agent is deviating from its declared behavior *while it is operating*. The gap is between policy declaration and policy enforcement -- the difference between a speed limit sign and a speed camera.
**The scenario**: A financial trading agent is authorized to execute trades within specified parameters. It begins operating within bounds but, after a model update, starts exceeding risk limits. Without runtime behavior verification, the deviation is only discovered in post-hoc audit -- potentially days later, after significant damage.
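What runtime verification could look like, reduced to its simplest form: a monitor that checks every proposed action against the agent's declared policy while it operates. This is a sketch under our own assumptions -- the class names, fields, and thresholds are ours, not any draft's mechanism:

```python
from dataclasses import dataclass

@dataclass
class DeclaredPolicy:
    """The agent's declared operating bounds -- the 'speed limit sign'."""
    max_order_value: float
    max_daily_exposure: float

class RuntimeMonitor:
    """The 'speed camera': checks each action against the declared policy
    at execution time, not in a post-hoc audit."""
    def __init__(self, policy: DeclaredPolicy):
        self.policy = policy
        self.daily_exposure = 0.0

    def check(self, order_value: float) -> bool:
        """Return True only if the action stays within declared bounds."""
        if order_value > self.policy.max_order_value:
            return False  # single-action limit exceeded: flag deviation now
        if self.daily_exposure + order_value > self.policy.max_daily_exposure:
            return False  # cumulative limit exceeded: flag deviation now
        self.daily_exposure += order_value
        return True

monitor = RuntimeMonitor(DeclaredPolicy(max_order_value=10_000,
                                        max_daily_exposure=50_000))
assert monitor.check(8_000)       # within bounds
assert not monitor.check(12_000)  # caught at runtime, not days later
```

The point is not the ten lines of arithmetic; it is that no draft standardizes where this check runs, who operates it, or how a detected deviation is reported.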
## Critical Gap 2: Agent Resource Management
## Critical Gap 2: Agent Failure Cascade Prevention
**The problem**: No protocols exist to prevent agent failures from cascading across interconnected autonomous systems. As agent interdependencies increase in production deployments, a failure in one agent can ripple outward.
**The numbers**: Only **47 drafts** address AI safety despite 434 total drafts, and the high interconnectivity implied by 155 A2A protocols and 114 autonomous netops drafts creates the conditions for cascade failures.
**What is missing**: Circuit breakers for cascading failures. Checkpoint and rollback protocols. Blast radius containment. Graceful degradation. All concepts well-established in distributed systems engineering, but absent from the agent standards landscape.
**The scenario**: A telecom operator deploys 50 AI agents for network monitoring, troubleshooting, and optimization. During a major outage, all 50 agents simultaneously request inference resources to diagnose the problem. With no failure cascade prevention, agents compete chaotically. The most aggressive agents get resources; the most important diagnostic tasks may not. The outage extends because the agents that could fix it are starved by the agents that are observing it.
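One of the missing primitives, sketched: a circuit breaker that quarantines a failing downstream agent so its failures stop propagating upstream. The pattern is standard distributed-systems engineering; nothing here comes from an IETF draft, and the thresholds are illustrative:

```python
import time

class CircuitBreaker:
    """Wraps calls to a downstream agent. After repeated failures the
    circuit 'opens' and calls are refused until a cooldown expires,
    containing the blast radius instead of letting failures cascade."""
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means closed: traffic flows

    def call(self, agent_fn, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: downstream agent quarantined")
            self.opened_at = None  # half-open: allow one probe
            self.failures = 0
        try:
            result = agent_fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # open the circuit
            raise
        self.failures = 0  # success resets the failure count
        return result
```

A standardized version would have to specify how breaker state is signaled *between* agents from different vendors -- which is exactly the part no draft addresses.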
## Critical Gap 3: Agent Error Recovery and Rollback
## High Gap: Real-Time Agent Rollback Mechanisms
**The problem**: No standards exist for how agents handle errors, cascading failures, or the rollback of autonomous decisions.
**The problem**: No standards exist for how to quickly roll back incorrect decisions made by autonomous agents across distributed systems.
**The numbers**: 114 autonomous netops drafts exist, but no rollback mechanisms for production network safety. [draft-yue-anima-agent-recovery-networks](https://datatracker.ietf.org/doc/draft-yue-anima-agent-recovery-networks/) (score 4.1) is among the few drafts that partially addresses this, with its Task-Oriented Multi-Agent Recovery Framework and State Consistency Management. For context, "Multi-Agent Communication Protocol" -- defining how agents *talk* -- appears in 8 drafts. The community has invested far more effort in the plumbing than in the fire escape.
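The core rollback primitive is well understood outside the agent world. A minimal single-process sketch, using the hospital scenario from the introduction (a real multi-agent version would additionally need a coordination protocol so that *distributed* state rolls back consistently -- which is the unstandardized part):

```python
import copy

class CheckpointedAgentState:
    """Checkpoint/rollback for an agent's working state -- the
    distributed-systems primitive the drafts largely lack."""
    def __init__(self, state: dict):
        self.state = state
        self._checkpoints: list[dict] = []

    def checkpoint(self) -> int:
        """Snapshot current state; return an id to roll back to."""
        self._checkpoints.append(copy.deepcopy(self.state))
        return len(self._checkpoints) - 1

    def rollback(self, checkpoint_id: int) -> None:
        """Restore state as of the given checkpoint."""
        self.state = copy.deepcopy(self._checkpoints[checkpoint_id])

agent = CheckpointedAgentState({"dispensed": []})
cp = agent.checkpoint()
agent.state["dispensed"].append("40mg")   # the hallucinated dosage
agent.rollback(cp)                        # undo the autonomous decision
assert agent.state["dispensed"] == []
```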
## The High-Priority Gaps
Several additional gaps scored HIGH severity. Each represents a missing piece that working deployments will hit:
### Human Override Standardization
Only **34 human-agent interaction drafts** exist versus **114 autonomous operations** and **155 A2A protocol** drafts. Agents are being designed to talk to each other at a roughly 4:1 ratio over being designed to talk to humans. Emergency override protocols -- the "big red button" -- are almost entirely absent. This is not merely an engineering preference. For high-risk AI systems deployed in the EU, the AI Act (Art. 14) mandates human oversight -- making this gap a compliance blocker, not just a design omission.
[draft-rosenberg-aiproto-cheq](https://datatracker.ietf.org/doc/draft-rosenberg-aiproto-cheq/) (score 3.9) is a rare exception: it defines a protocol for human confirmation of agent decisions before execution. But CHEQ is opt-in and pre-execution. No draft defines what happens when a human needs to stop a running agent, constrain its behavior, or take over its task mid-execution.
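What a minimal mid-execution override could look like: a human-settable halt flag the agent must consult between actions. This is a sketch under our own assumptions -- no draft specifies it, and a real standard would also have to define transport, authorization, and interruption guarantees for non-cooperative agents:

```python
import threading

class OverrideController:
    """A cooperative 'big red button': the operator sets a flag, and the
    agent checks it before every action. Cooperative only -- an agent
    that skips the check is not stopped, which is why this needs to be
    a protocol obligation, not a library convention."""
    def __init__(self):
        self._halt = threading.Event()

    def halt(self) -> None:
        """Called by the human operator (or a supervisory system)."""
        self._halt.set()

    def may_proceed(self) -> bool:
        """Called by the agent before each action."""
        return not self._halt.is_set()

controller = OverrideController()
executed = []
for step in ["plan", "trade", "report"]:
    if not controller.may_proceed():
        break                 # agent stops mid-task
    executed.append(step)
    if step == "plan":
        controller.halt()     # operator intervenes mid-run
assert executed == ["plan"]   # "trade" and "report" never ran
```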
### Multi-Agent Consensus Protocols
When a group of agents disagree -- the diagnosis agent says the router is down, the monitoring agent says it is up, the optimization agent is rerouting traffic around it -- who arbitrates? No framework exists for agents to resolve conflicting assessments without human intervention. This is not a new problem: FIPA (Foundation for Intelligent Physical Agents) defined agent communication languages and interaction protocols for multi-agent coordination as early as 1997. The IETF landscape has largely not engaged with this prior art.
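Even a toy arbitration rule illustrates what is missing: majority vote over conflicting assessments, with escalation to a human when no view reaches quorum. Illustrative only -- FIPA's interaction protocols are far richer, and none of this is specified in any current draft:

```python
from collections import Counter
from typing import Optional

def resolve_assessments(assessments: dict[str, str],
                        quorum: float = 0.5) -> Optional[str]:
    """Majority vote over agent assessments; return None (escalate to
    a human) when no single view exceeds the quorum fraction."""
    counts = Counter(assessments.values())
    view, votes = counts.most_common(1)[0]
    if votes / len(assessments) > quorum:
        return view
    return None  # no consensus -> human arbitration

votes = {"diagnosis": "down", "monitoring": "up", "optimization": "down"}
assert resolve_assessments(votes) == "down"               # 2 of 3 agree
assert resolve_assessments({"a": "up", "b": "down"}) is None  # tie: escalate
```

Real deployments need more than counting votes (weighting by agent reliability, tie-breaking, liveness under partition), but the absence of even a baseline interaction protocol means every vendor invents this ad hoc.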
### Cross-Domain Agent Audit Trails
An agent operating across multiple domains or organizations needs to maintain audit trails that satisfy different regulatory requirements simultaneously. Identity management exists -- the 152 identity/auth drafts cover authentication. What does not exist is cross-domain audit standardization: the format and semantics for recording agent actions across jurisdictions with varying compliance requirements. The EU's eIDAS 2.0 regulation (Regulation 2024/1183) and its European Digital Identity Wallet framework provide a mature trust model that the IETF drafts have not yet connected to.
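A sketch of the missing primitive: a hash-chained audit record, where each entry commits to the previous entry's digest, so an auditor in any domain can detect tampering or deletion without trusting the recording domain. Field names here are ours, not from any draft or regulation:

```python
import hashlib
import json
import time

def append_audit_record(chain: list[dict], actor: str,
                        action: str, domain: str) -> dict:
    """Append a tamper-evident record. Each entry's hash covers its
    content plus the previous entry's hash, forming a verifiable chain."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    body = {"actor": actor, "action": action, "domain": domain,
            "ts": time.time(), "prev": prev_hash}
    body["hash"] = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()).hexdigest()
    chain.append(body)
    return body

chain: list[dict] = []
append_audit_record(chain, "pharmacy-agent", "fill-prescription",
                    "hospital-a.example")
append_audit_record(chain, "dispenser-agent", "dispense",
                    "hospital-b.example")
assert chain[1]["prev"] == chain[0]["hash"]  # the chain links verify
```

The hashing is trivial; the unstandardized part is everything around it -- which fields are mandatory, how chains from different domains cross-reference each other, and which jurisdiction's retention rules apply.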
### Federated Agent Learning Privacy
While federated architectures exist, there is insufficient specification for privacy-preserving agent learning that prevents data leakage between federated participants during model updates.
### Cross-Protocol Agent Migration
Agents need to migrate between different network protocols, domains, or infrastructure providers while maintaining state and identity. Current drafts focus on registration but not migration continuity.
## The Structural Problem
This is the structural explanation for the safety deficit. It is not that people do not care about safety. It is that safety standards require coordination across boundaries that the current authorship structure cannot bridge. Capability standards can be built within a single team. Safety standards cannot.
Our category co-occurrence analysis provides the concrete proof. Safety drafts are not entirely isolated -- they co-occur with several categories, coupling most strongly with policy and governance and identity/auth. But the pattern is revealing: safety pairs with *governance* categories, not *implementation* categories. Of the 155 drafts tagged as A2A protocols, very few also address safety. Safety has minimal co-occurrence with agent discovery/registration and model serving/inference. Its weakest links are to the categories where agents actually *do* things. Safety is being discussed in governance papers. It is barely present in the protocols that need it most. The traffic lights are not just behind the highways -- they are on a different road entirely.
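The coupling claim rests on a lift statistic: how much more often two category tags co-occur than independence would predict. With illustrative counts (not the exact corpus numbers):

```python
def lift(n_both: int, n_a: int, n_b: int, n_total: int) -> float:
    """Co-occurrence lift: observed joint tag frequency divided by the
    frequency expected if the two tags were assigned independently.
    Lift > 1 means the categories travel together; < 1 means they avoid
    each other."""
    p_both = n_both / n_total
    expected = (n_a / n_total) * (n_b / n_total)
    return p_both / expected

# Hypothetical example: 20 drafts tagged both safety and governance,
# out of 47 safety drafts, 80 governance drafts, 434 drafts total.
print(round(lift(20, 47, 80, 434), 2))  # -> 2.31: strong coupling
```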
IEEE P3394 (Standard for Trustworthy AI Agents), a concurrent standardization effort, is attempting to address some of these safety and trust dimensions from a different angle. The IETF landscape should be compared against these parallel efforts to understand which gaps are being addressed elsewhere and which remain truly unserved.
## The 4:1 Ratio, Revisited
The safety deficit is not just a number. It is a structural property of how the work is organized:
| Category | Drafts | Team Blocs Active |
|----------|-------:|------------------:|
| A2A protocols | 155 | Many (distributed across blocs) |
| Autonomous operations | 114 | Primarily Huawei, Chinese telecom |
| Agent identity/auth | 152 | Ericsson, Nokia, ATHENA, multiple |
| **AI safety/alignment** | **47** | **Few; mostly independents/startups** |
| **Human-agent interaction** | **34** | **Rosenberg/White (2-person team)** |
The capability categories have organized teams behind them. The safety categories rely on individual contributors and small, unconnected teams. The best safety draft in the corpus (DAAP, score 4.75) comes from an independent author (Aylward). The best human-agent drafts come from a two-person Five9/Bitwave team. There is no 13-person safety bloc with 94% cohesion.

---

## Traction vs. Aspiration
A reality check: of the 361 drafts, only **36 (10%)** have been adopted by IETF working groups. The rest are individual submissions -- proposals without institutional backing. The WG-adopted drafts score higher on average (**3.54 vs. 3.31**), particularly on maturity (+1.28) and momentum (+0.98), but lower on novelty (-0.45). *(Note: scores are LLM-generated relative rankings from abstracts; see [Methodology](../methodology.md).)* The WGs that have adopted the most agent-relevant drafts are security-focused: **lamps** (6 drafts), **lake** (5), **tls** (3), **emu** (3). Agent-specific WGs like `aipref` have adopted only 2 drafts.
This reveals a structural insight: the IETF is not building agent standards from scratch. It is **retrofitting security standards for agents**. The agent architecture we propose above would need to work within this reality -- building on the security WGs' infrastructure rather than competing with it.


---
## Limitations
This analysis is exploratory, not peer-reviewed research. Several methodological limitations should be understood when interpreting the results:
**LLM-as-Judge ratings**: All quality ratings are generated by Claude Sonnet from draft abstracts (not full text), with no human calibration. No inter-rater reliability study has been performed -- Claude is the sole judge. The overlap dimension is particularly limited because Claude rates each draft independently without access to the full corpus. Scores should be treated as relative rankings within this corpus, not absolute quality measures.
**Keyword-based corpus selection**: The 12 search keywords cast a wide net but introduce both false positives (drafts about "user agents" or "autonomous systems" unrelated to AI) and false negatives (relevant drafts using terminology we did not search for). We estimate 30-50 false positives remain in the corpus. The relevance rating partially mitigates this, but the LLM judge is generous with relevance for keyword-matched drafts.
**Clustering thresholds**: The 0.85 cosine similarity threshold for topical clusters, 0.90 for near-duplicates, and 0.98 for functional duplicates are empirical choices based on manual inspection, not derived from a principled analysis. The embedding model (nomic-embed-text) is general-purpose, not fine-tuned for standards documents. A sensitivity analysis across thresholds would strengthen confidence.
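A cheap sensitivity check is to count how many draft pairs clear each cutoff. With illustrative pairwise similarities (not the actual corpus values):

```python
def pairs_above(pairwise_sims: list[float], threshold: float) -> int:
    """How many draft pairs would count as duplicates at a given cutoff."""
    return sum(s >= threshold for s in pairwise_sims)

# Hypothetical pairwise cosine similarities between draft embeddings:
sims = [0.99, 0.985, 0.97, 0.94, 0.91, 0.89, 0.86, 0.84, 0.72]
for t in (0.85, 0.90, 0.98):  # the thresholds used in this analysis
    print(t, pairs_above(sims, t))
# -> 0.85: 7 pairs, 0.90: 5 pairs, 0.98: 2 pairs
```

If small shifts in the threshold produce large swings in these counts, the reported cluster and duplicate figures are fragile; if the counts plateau, the chosen cutoffs are defensible.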
**Gap analysis**: The gap identification is a single-shot LLM analysis based on compressed landscape statistics, not a systematic comparison against a reference architecture. Gap severity is assigned by Claude without defined thresholds. The gaps should be treated as hypotheses for expert validation, not definitive findings.
**Idea extraction quality**: Batch extraction (Haiku, abstract-only at 800 chars) produces different results than individual extraction (Sonnet, abstract + full text). No precision/recall measurement has been performed. The extraction prompt instructs Claude to return 1-4 ideas per draft, which may under-count contributions from comprehensive drafts.
**Abstract-only analysis**: Ratings are based on abstracts truncated to 2000 characters. For maturity assessment in particular, the abstract is an imperfect proxy for the full document's technical depth.
For full methodology documentation, see `data/reports/methodology.md` in the project repository.
---
### Key Takeaways
- **The full analysis cost ~$9** -- LLM-powered document analysis at scale is practical and cheap with proper caching and model selection