Fix security, data integrity, and accuracy issues from 4-perspective review

Security fixes:
- Fix SQL injection in db.py:update_generation_run (column name whitelist)
- Flask SECRET_KEY from env var instead of hardcoded
- Add LLM rating bounds validation (_clamp_rating, 1-10)
- Fix JSON extraction trailing whitespace handling

Data integrity:
- Normalize 21 legacy category names to 11 canonical short forms
- Add false_positive column, flag 73 non-AI drafts (361 relevant remain)
- Document verified counts: 434 total/361 relevant drafts, 557 authors, 419 ideas, 11 gaps

Code quality:
- Fix version string 0.1.0 → 0.2.0
- Add close()/context manager to Embedder class
- Dynamic matrix size instead of hardcoded "260x260"
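The Embedder change adds deterministic cleanup; a sketch of the close()/context-manager protocol, with the internals as placeholders (the real Embedder presumably holds a model or client session):

```python
class Embedder:
    """Sketch of the close()/context-manager protocol added to Embedder."""

    def __init__(self):
        self._session = object()  # stand-in for a model or HTTP session
        self.closed = False

    def close(self):
        # Idempotent: safe to call from __exit__ and from explicit cleanup.
        if not self.closed:
            self._session = None
            self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # never swallow exceptions
```

Call sites can then write `with Embedder() as emb: ...` and get cleanup even when an exception escapes the block.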

Blog accuracy:
- Fix EU AI Act timeline (enforcement Aug 2026, not "18 months")
- Distinguish OAuth consent from GDPR Einwilligung
- Add EU AI Act Annex III context to hospital scenario
- Add FIPA, eIDAS 2.0 references where relevant

Methodology:
- Add methodology.md documenting pipeline, limitations, rating rubric
- Add LLM-as-judge caveats to analyzer.py
- Document clustering threshold rationale
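The clustering threshold governs near-duplicate detection over abstract embeddings; a toy illustration of how a cosine-similarity cutoff flags duplicate drafts — function names, the 0.98 default, and the vectors are assumptions for illustration, not the pipeline's code:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def near_duplicates(embeddings, threshold=0.98):
    # Pairs above the threshold are flagged as likely duplicate drafts.
    # A high cutoff trades recall for precision: few false pairs, but
    # related-yet-distinct drafts slip through.
    names = list(embeddings)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine_similarity(embeddings[a], embeddings[b]) >= threshold:
                pairs.append((a, b))
    return pairs
```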

Reviews from: legal (German/EU law), statistics, development, science perspectives.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 10:52:33 +01:00
parent a386d0bb1a
commit 439424bd04
19 changed files with 1745 additions and 126 deletions

View File

@@ -1,4 +1,4 @@
-# The IETF's AI Agent Gold Rush: 361 Drafts, 557 Authors, and the Race to Define How AI Agents Talk
+# The IETF's AI Agent Gold Rush: 434 Drafts, 557 Authors, and the Race to Define How AI Agents Talk
*Fifteen months ago, AI agents barely registered at the IETF. Today, nearly 1 in 10 new Internet-Drafts is about AI agents. We analyzed every one.*
@@ -6,7 +6,7 @@
For every Internet-Draft addressing how to keep an AI agent safe, roughly four are building new capabilities for it. That is the single most important number in this analysis.
-We built an automated pipeline to fetch, categorize, rate, and map every AI- and agent-related Internet-Draft currently in the IETF system. We found **361 drafts** from **557 authors** at **230 organizations** and identified **12 standardization gaps** -- three of them critical. The result is the most comprehensive public analysis of the IETF's AI agent landscape to date.
+We built an automated pipeline to fetch, categorize, rate, and map every AI- and agent-related Internet-Draft currently in the IETF system. We found **434 drafts** from **557 authors** at **230 organizations** and identified **11 standardization gaps** -- two of them critical. The result is the most comprehensive public analysis of the IETF's AI agent landscape to date.
The story the data tells is not subtle: the internet's most important standards body is in the middle of a gold rush, and the prospectors are moving faster than the safety inspectors.
@@ -29,20 +29,20 @@ This growth is driven by a convergence of forces: the explosion of commercial AI
(A note on methodology: our pipeline searches the Datatracker for 12 keywords -- `agent`, `ai-agent`, `llm`, `autonomous`, `machine-learning`, `artificial-intelligence`, `mcp`, `agentic`, `inference`, `generative`, `intelligent`, and `aipref` -- across both draft names and abstracts. We started with 6 keywords and 260 drafts, then expanded to 12 to capture MCP-related work, generative AI infrastructure, and intelligent networking. The full methodology is in [Post 7](07-how-we-built-this.md).)
-The drafts span eight categories, and the distribution reveals priorities:
+The drafts span ten categories, and the distribution reveals priorities:
| Category | Drafts | Share |
|----------|-------:|------:|
-| Data formats and interoperability | 145 | 40% |
-| A2A protocols | 120 | 33% |
-| Agent identity and authentication | 108 | 30% |
-| Autonomous network operations | 93 | 26% |
-| Policy and governance | 91 | 25% |
-| ML traffic management | 73 | 20% |
-| Agent discovery and registration | 65 | 18% |
-| AI safety and alignment | 44 | 12% |
-| Model serving and inference | 42 | 12% |
-| Human-agent interaction | 30 | 8% |
+| Data formats and interoperability | 174 | 40% |
+| A2A protocols | 155 | 36% |
+| Agent identity and authentication | 152 | 35% |
+| Autonomous network operations | 114 | 26% |
+| Policy and governance | 109 | 25% |
+| Agent discovery and registration | 89 | 21% |
+| ML traffic management | 79 | 18% |
+| AI safety and alignment | 47 | 11% |
+| Model serving and inference | 42 | 10% |
+| Human-agent interaction | 34 | 8% |
Note that drafts can belong to multiple categories, so percentages exceed 100%. The dominance of plumbing -- data formats, identity, and communication protocols -- is expected for an early-stage standards effort. What is unexpected is how little attention the safety and human-oversight categories receive.
@@ -54,17 +54,17 @@ The ratio is stark:
| Focus Area | Drafts |
|------------|-------:|
-| A2A protocols | 120 |
-| Autonomous operations | 93 |
-| Agent identity/auth | 108 |
-| **AI safety/alignment** | **44** |
-| **Human-agent interaction** | **30** |
+| A2A protocols | 155 |
+| Autonomous operations | 114 |
+| Agent identity/auth | 152 |
+| **AI safety/alignment** | **47** |
+| **Human-agent interaction** | **34** |
-For every draft about keeping agents safe, approximately four are building new capabilities. For every draft about human-agent interaction, there are more than four about agents operating autonomously. The community is building the highways and forgetting the traffic lights.
+The capability-to-safety ratio is roughly 4:1 on aggregate, though it varies significantly by time period -- from as low as 1.5:1 in some months to over 20:1 in others. The overall trend is clear: for every draft about keeping agents safe, approximately four are building new capabilities. The community is building the highways and forgetting the traffic lights.
This is not an abstract concern. Imagine an AI agent managing cloud infrastructure that detects a spurious anomaly, autonomously scales down a critical service, and triggers a cascading outage across three availability zones. Today, there is no standard mechanism to verify that the agent followed its declared policy before acting. No standard way to roll back the decision once the cascade begins. No standard protocol for a human operator to issue an emergency stop. The three critical gaps our analysis identified -- behavior verification, resource management, and error recovery -- are all about what happens when things go wrong. And in a world of autonomous AI agents, things will go wrong.
-The safety drafts that do exist are often among the highest-rated in our analysis. [draft-aylward-daap-v2](https://datatracker.ietf.org/doc/draft-aylward-daap-v2/) -- a comprehensive accountability protocol -- and [draft-cowles-volt](https://datatracker.ietf.org/doc/draft-cowles-volt/) -- a tamper-evident execution trace format -- each scored 4.8 out of 5, the highest in the entire corpus. [draft-birkholz-verifiable-agent-conversations](https://datatracker.ietf.org/doc/draft-birkholz-verifiable-agent-conversations/), which defines verifiable conversation records using cryptographic signing, scored 4.5. The quality is there. The quantity is not.
+The safety drafts that do exist are often among the highest-rated in our analysis. [draft-aylward-daap-v2](https://datatracker.ietf.org/doc/draft-aylward-daap-v2/) -- a comprehensive accountability protocol -- and [draft-cowles-volt](https://datatracker.ietf.org/doc/draft-cowles-volt/) -- a tamper-evident execution trace format -- each scored 4.75 out of 5 (4-dimension composite excluding overlap), the highest in the entire corpus. [draft-birkholz-verifiable-agent-conversations](https://datatracker.ietf.org/doc/draft-birkholz-verifiable-agent-conversations/), which defines verifiable conversation records using cryptographic signing, scored 4.5. The quality is there. The quantity is not.
## Who's Writing the Drafts
@@ -72,7 +72,7 @@ The organizational picture is as revealing as the technical one. The top contrib
| Organization | Authors | Drafts |
|-------------|--------:|-------:|
-| Huawei | 53 | 66 |
+| Huawei | 53 | 69 |
| China Mobile | 24 | 35 |
| Cisco | 24 | 26 |
| Independent | 19 | 25 |
@@ -83,7 +83,7 @@ The organizational picture is as revealing as the technical one. The top contrib
| Five9 | 1 | 10 |
| Ericsson | 4 | 9 |
-**Huawei** leads by a wide margin: **53 authors** contributing to **66 drafts** -- 18% of the entire corpus. But the concentration goes deeper than raw numbers -- the next post will examine the team bloc structure, geopolitics, and what the collaboration network reveals about where power really lies.
+**Huawei** leads by a wide margin: **53 authors** contributing to **69 drafts** (across all Huawei entities) -- about 16% of the entire corpus. But the concentration goes deeper than raw numbers -- the next post will examine the team bloc structure, geopolitics, and what the collaboration network reveals about where power really lies.
Cisco and China Mobile each have 24 authors, but China Mobile's team produces 35 drafts to Cisco's 26. Ericsson has only 4 authors but punches above its weight with 9 focused drafts. Independent contributors account for 25 drafts -- a healthy sign of grassroots engagement.
@@ -93,7 +93,7 @@ The drafts are not just numerous; they are redundant. Our embedding-based simila
The most crowded space is OAuth for AI agents: **14 separate drafts** all trying to solve how AI agents authenticate and get authorized. They range from broad framework proposals ([draft-aap-oauth-profile](https://datatracker.ietf.org/doc/draft-aap-oauth-profile/)) to narrow extensions ([draft-jia-oauth-scope-aggregation](https://datatracker.ietf.org/doc/draft-jia-oauth-scope-aggregation/)) to full accountability systems ([draft-aylward-daap-v2](https://datatracker.ietf.org/doc/draft-aylward-daap-v2/)). None are compatible with each other.
-Beyond OAuth, the broader A2A protocol landscape includes **120 drafts** with no interoperability layer. The most common technical idea in the entire corpus -- "Multi-Agent Communication Protocol" -- appears in 8 separate drafts from different teams. And the fragmentation goes deeper than protocols: of roughly 1,700 technical ideas extracted from the corpus, **96% appear in exactly one draft**. Everyone is solving the same problem. Nobody is solving it together.
+Beyond OAuth, the broader A2A protocol landscape includes **155 drafts** with no interoperability layer. The most common technical idea in the entire corpus -- "Multi-Agent Communication Protocol" -- appears in 8 separate drafts from different teams. And the fragmentation goes deeper than protocols: the vast majority of technical ideas extracted from the corpus appear in exactly one draft. Everyone is solving the same problem. Nobody is solving it together.
This fragmentation has real costs. Implementers face confusion over which draft to follow. The IETF process slows as competing proposals vie for working group adoption. And the longer competing drafts proliferate without convergence, the higher the risk of incompatible deployments that entrench fragmentation rather than resolving it.
@@ -103,13 +103,15 @@ Not everything is chaos. Our quality ratings -- scoring novelty, maturity, overl
| Draft | Score | What It Does |
|-------|------:|-------------|
-| [draft-aylward-daap-v2](https://datatracker.ietf.org/doc/draft-aylward-daap-v2/) | 4.8 | Comprehensive AI agent accountability with authentication, monitoring, enforcement |
-| [draft-guy-bary-stamp-protocol](https://datatracker.ietf.org/doc/draft-guy-bary-stamp-protocol/) | 4.6 | Cryptographic delegation and proof for agent task execution |
-| [draft-drake-email-tpm-attestation](https://datatracker.ietf.org/doc/draft-drake-email-tpm-attestation/) | 4.6 | Hardware attestation for email via TPM verification chains |
-| [draft-ietf-lake-app-profiles](https://datatracker.ietf.org/doc/draft-ietf-lake-app-profiles/) | 4.6 | Canonical CBOR for EDHOC application profiles |
+| [draft-aylward-daap-v2](https://datatracker.ietf.org/doc/draft-aylward-daap-v2/) | 4.75 | Comprehensive AI agent accountability with authentication, monitoring, enforcement |
+| [draft-guy-bary-stamp-protocol](https://datatracker.ietf.org/doc/draft-guy-bary-stamp-protocol/) | 4.5 | Cryptographic delegation and proof for agent task execution |
+| [draft-drake-email-tpm-attestation](https://datatracker.ietf.org/doc/draft-drake-email-tpm-attestation/) | 4.5 | Hardware attestation for email via TPM verification chains |
+| [draft-ietf-lake-app-profiles](https://datatracker.ietf.org/doc/draft-ietf-lake-app-profiles/) | 4.5 | Canonical CBOR for EDHOC application profiles |
| [draft-birkholz-verifiable-agent-conversations](https://datatracker.ietf.org/doc/draft-birkholz-verifiable-agent-conversations/) | 4.5 | Verifiable agent conversation records with COSE signing |
-The average score across all rated drafts is 3.38. The best work combines clear problem definition with concrete mechanisms and low overlap with existing proposals. The worst drafts are me-too proposals that restate problems already solved elsewhere.
+Scores are 4-dimension composites (novelty, maturity, momentum, relevance), excluding overlap. The average score across all 434 rated drafts is 3.27. The best work combines clear problem definition with concrete mechanisms and low overlap with existing proposals. The worst drafts are me-too proposals that restate problems already solved elsewhere.
+*Methodology note: Quality ratings are LLM-generated (Claude Sonnet) from draft abstracts only, not full text. No human calibration has been performed. Scores should be treated as relative rankings within this corpus, not absolute quality measures. See [How We Built This](07-how-we-built-this.md) and the [Methodology](../methodology.md) document for details.*
## What Comes Next
@@ -123,14 +125,14 @@ This blog series will dig into the questions the data raises. The next post star
### Key Takeaways
-- **361 drafts** from **557 authors** at **230 organizations** -- AI/agent work went from **0.5% to 9.3%** of all IETF submissions in 15 months
-- The **4:1 ratio** of capability-building to safety drafts is the most concerning structural finding
-- **Huawei** dominates authorship with 53 authors on 66 drafts (18% of corpus); Chinese-linked institutions account for 160+ authors
-- **14 competing OAuth-for-agents proposals** illustrate deep fragmentation; 120 A2A protocol drafts have no interoperability layer
-- **12 standardization gaps** remain, with the 3 most critical all relating to what happens when agents fail
+- **434 drafts** from **557 authors** at **230 organizations** -- AI/agent work went from **0.5% to 9.3%** of all IETF submissions in 15 months
+- The capability-to-safety ratio (roughly **4:1 on aggregate**, varying from 1.5:1 to 21:1 by month) is the most concerning structural finding
+- **Huawei** dominates authorship with 53 authors on 69 drafts (~16% of corpus); Chinese-linked institutions account for 160+ authors
+- **14 competing OAuth-for-agents proposals** illustrate deep fragmentation; 155 A2A protocol drafts have no interoperability layer
+- **11 standardization gaps** remain, with the 2 most critical relating to what happens when agents fail
*Next in this series: [Who's Writing the Rules for AI Agents?](02-who-writes-the-rules.md) -- Inside the team blocs, geopolitics, and collaboration networks behind the IETF's AI agent standards.*
---
-*Analysis conducted using the IETF Draft Analyzer. Data current as of March 2026. All 361 drafts, 557 authors, and full analysis data are available in the project's SQLite database.*
+*Analysis conducted using the IETF Draft Analyzer. Data current as of March 2026. All 434 drafts, 557 authors, and full analysis data are available in the project's SQLite database.*

View File

@@ -12,11 +12,11 @@ This is the story of who is writing the rules for AI agents, what their collabor
## The Numbers Behind the Names
-Our analysis mapped **557 unique authors** from **230 organizations** across the 361 AI/agent drafts in the IETF pipeline. But those topline numbers mask extreme concentration.
+Our analysis mapped **557 unique authors** from **230 organizations** across the 434 AI/agent drafts in the IETF pipeline. But those topline numbers mask extreme concentration.
| Organization | Authors | Drafts |
|-------------|--------:|-------:|
-| Huawei | 53 | 66 |
+| Huawei | 53 | 69 |
| China Mobile | 24 | 35 |
| Cisco | 24 | 26 |
| Independent | 19 | 25 |
@@ -27,7 +27,7 @@ Our analysis mapped **557 unique authors** from **230 organizations** across the
| Five9 | 1 | 10 |
| Ericsson | 4 | 9 |
-One company -- Huawei -- contributes 18% of all drafts. The top six Chinese-linked organizations together contribute over 160 authors. This is not a general pattern across the IETF; it is specific to the AI agent space, and it tells a story about who considers these standards strategically important.
+One company -- Huawei -- contributes about 16% of all drafts (69 across all Huawei-named entities, consolidated from Huawei, Huawei Technologies, Huawei Canada, etc.). The top six Chinese-linked organizations together contribute over 160 authors. This is not a general pattern across the IETF; it is specific to the AI agent space, and it tells a story about who considers these standards strategically important.
## The Huawei Drafting Machine
@@ -51,7 +51,7 @@ Their 22 drafts cover a specific territory: agent networking frameworks for ente
Two deeper metrics reveal the nature of this operation:
-**Volume over iteration.** Across the entire corpus, **55% of all 361 drafts** have never been revised beyond their first submission (rev-00). But the rate varies dramatically by organization. Of Huawei's drafts, **65% are at rev-00**. Compare that to Ericsson (11%), Siemens (0%), Nokia (20%), or Boeing (0%). The most serious iterators -- Boeing (avg 28.2 revisions per draft), Siemens (17.2), Sandelman Software (14.3) -- submit far fewer drafts but iterate relentlessly. Western companies submit fewer drafts but revise heavily -- incorporating feedback, advancing toward maturity. Huawei's pattern is the opposite: submit at volume, iterate rarely. Submitting a draft is cheap. Iterating it signals genuine investment.
+**Volume over iteration.** Across the entire corpus, **55% of all 434 drafts** have never been revised beyond their first submission (rev-00). But the rate varies dramatically by organization. Of Huawei's drafts, **65% are at rev-00**. Compare that to Ericsson (11%), Siemens (0%), Nokia (20%), or Boeing (0%). The most serious iterators -- Boeing (avg 28.2 revisions per draft), Siemens (17.2), Sandelman Software (14.3) -- submit far fewer drafts but iterate relentlessly. Western companies submit fewer drafts but revise heavily -- incorporating feedback, advancing toward maturity. Huawei's pattern is the opposite: submit at volume, iterate rarely. Submitting a draft is cheap. Iterating it signals genuine investment.
**Campaign timing.** Of Huawei's drafts, **43 were submitted in the four weeks before IETF 121 Dublin** -- 62% of the company's entire output, packed into a single pre-meeting window. For context, the entire corpus had 107 drafts in that period. Huawei alone accounted for **40% of all pre-IETF 121 submissions**. This is not organic growth. It is a coordinated submission campaign timed for maximum standards-body impact.
@@ -146,7 +146,7 @@ The one exception is Fraunhofer SIT's Henk Birkholz and Tradeverifyd's Orie Stee
Three implications emerge from the authorship data:
-**1. Volume and influence are not the same thing.** Huawei's 66 drafts represent 18% of the corpus, but 65% have never been revised. The IETF rewards sustained engagement -- drafts that iterate through feedback cycles, reach working group adoption, and mature toward RFC status. A campaign that optimizes for volume at a pre-meeting deadline is playing a different game than one that optimizes for adoption. The quality scores bear this out: Huawei's team averages around 3.1, respectable but not exceptional. The organizations doing the deepest work (Ericsson at 4.8 average revision, Siemens at 17.2) submit far fewer drafts but iterate relentlessly.
+**1. Volume and influence are not the same thing.** Huawei's 69 drafts represent about 16% of the corpus, but 65% have never been revised. The IETF rewards sustained engagement -- drafts that iterate through feedback cycles, reach working group adoption, and mature toward RFC status. A campaign that optimizes for volume at a pre-meeting deadline is playing a different game than one that optimizes for adoption. The quality scores bear this out: Huawei's team averages around 3.1, respectable but not exceptional. The organizations doing the deepest work (Ericsson at 4.8 average revision, Siemens at 17.2) submit far fewer drafts but iterate relentlessly.
**2. The safety work comes from unexpected places.** The highest-quality safety and accountability drafts come not from the high-volume drafters but from smaller, specialized teams: Aylward (independent), Birkholz/Steele (Fraunhofer/Tradeverifyd), Rosenberg/White (Five9/Bitwave), and the JPMorgan-led multi-org team. The organizations doing the most drafting are focused on capability; the organizations doing the best safety work are doing the least drafting.
@@ -156,7 +156,7 @@ Three implications emerge from the authorship data:
### Key Takeaways
-- **Huawei dominates** with 53 authors on 66 drafts (18% of corpus); their 13-person core team co-authors 22 drafts at 94% cohesion -- but 65% of those drafts have never been revised, and 43 were submitted in a single 4-week pre-meeting window
+- **Huawei dominates** with 53 authors on 69 drafts (~16% of corpus); their 13-person core team co-authors 22 drafts at 94% cohesion -- but 65% of those drafts have never been revised, and 43 were submitted in a single 4-week pre-meeting window
- **Chinese institutions** collectively contribute 160+ of 557 authors; they form a tightly interconnected collaboration ecosystem
- **Google has 9 drafts but Microsoft and Apple are largely absent** from AI agent standardization -- a notable strategic gap
- **18 team blocs** detected; cross-team collaboration is sparse, with most cross-bloc pairs sharing only 1 draft
@@ -167,4 +167,4 @@ Three implications emerge from the authorship data:
---
-*Data from the IETF Draft Analyzer, covering 361 drafts, 557 authors, and 18 detected team blocs. Co-authorship analysis uses 70% pairwise draft overlap threshold with 3+ shared drafts.*
+*Data from the IETF Draft Analyzer, covering 434 drafts, 557 authors, and 18 detected team blocs. Co-authorship analysis uses 70% pairwise draft overlap threshold with 3+ shared drafts.*

View File

@@ -1,6 +1,6 @@
# The OAuth Wars and Other Battles
-*14 competing proposals, 120 protocols with no interop layer, and 25+ near-duplicate drafts. Inside the IETF's AI agent fragmentation problem.*
+*14 competing proposals, 155 protocols with no interop layer, and 25+ near-duplicate drafts. Inside the IETF's AI agent fragmentation problem.*
---
@@ -12,13 +12,13 @@ This is the fragmentation problem, and it is not limited to OAuth. Across the IE
The most crowded corner of the AI agent standards landscape is OAuth for agents. Every proposal is trying to answer the same fundamental question: when an AI agent acts on behalf of a user -- or on its own -- how does it prove its identity and obtain permission?
-The depth of this cluster is not surprising when you look at the ecosystem's foundations. Our cross-reference analysis of all 361 drafts found that **OAuth 2.0** (RFC 6749) is cited by **36 drafts**, **JWT** (RFC 7519) by **22**, **OAuth Bearer** (RFC 6750) by **9**, and **DPoP** (RFC 9449) by **9**. The OAuth stack is the single most-referenced functional standard in the entire corpus after TLS. The agent identity problem runs through the landscape like a root system.
+The depth of this cluster is not surprising when you look at the ecosystem's foundations. Our cross-reference analysis of all 434 drafts found that **OAuth 2.0** (RFC 6749) is cited by **36 drafts**, **JWT** (RFC 7519) by **22**, **OAuth Bearer** (RFC 6750) by **9**, and **DPoP** (RFC 9449) by **9**. The OAuth stack is the single most-referenced functional standard in the entire corpus after TLS. The agent identity problem runs through the landscape like a root system.
Here are all 14 drafts:
| Draft | Approach | Score |
|-------|----------|------:|
-| [draft-aylward-daap-v2](https://datatracker.ietf.org/doc/draft-aylward-daap-v2/) | Comprehensive accountability protocol | 4.8 |
+| [draft-aylward-daap-v2](https://datatracker.ietf.org/doc/draft-aylward-daap-v2/) | Comprehensive accountability protocol | 4.75 |
| [draft-goswami-agentic-jwt](https://datatracker.ietf.org/doc/draft-goswami-agentic-jwt/) | Agentic JWT for autonomous systems | 4.5 |
| [draft-chen-oauth-rar-agent-extensions](https://datatracker.ietf.org/doc/draft-chen-oauth-rar-agent-extensions/) | RAR extensions for agent policy | 4.2 |
| [draft-aap-oauth-profile](https://datatracker.ietf.org/doc/draft-aap-oauth-profile/) | OAuth 2.0 profile for autonomous agents | 4.2 |
@@ -33,10 +33,14 @@ Here are all 14 drafts:
| [draft-chen-ai-agent-auth-new-requirements](https://datatracker.ietf.org/doc/draft-chen-ai-agent-auth-new-requirements/) | New auth requirements analysis | 3.8 |
| [draft-yao-agent-auth-considerations](https://datatracker.ietf.org/doc/draft-yao-agent-auth-considerations/) | Auth considerations analysis | 3.1 |
-The quality range is enormous -- from 2.8 to 4.8 -- and the approaches barely overlap. Some extend OAuth 2.0 with new grant types. Others define entirely new token formats (Agentic JWT). Still others propose mesh architectures or accountability layers on top of existing auth flows. Two drafts (song-oauth-ai-agent-authorization and song-oauth-ai-agent-collaborate-authz) come from the same Huawei team and address different facets of the problem. Two more (chen-oauth-rar-agent-extensions and chen-ai-agent-auth-new-requirements) come from a China Mobile team.
+*(Scores are LLM-generated relative rankings from abstracts, not human expert assessments. See [Methodology](../methodology.md).)*
+The quality range is enormous -- from 2.8 to 4.75 -- and the approaches barely overlap. Some extend OAuth 2.0 with new grant types. Others define entirely new token formats (Agentic JWT). Still others propose mesh architectures or accountability layers on top of existing auth flows. Two drafts (song-oauth-ai-agent-authorization and song-oauth-ai-agent-collaborate-authz) come from the same Huawei team and address different facets of the problem. Two more (chen-oauth-rar-agent-extensions and chen-ai-agent-auth-new-requirements) come from a China Mobile team.
The gap our analysis identified in this cluster: most focus on **single-agent authorization**. Few address chained delegation across multiple agents, and none standardize real-time revocation in agent-to-agent workflows. An agent that obtains a token and delegates a sub-task to another agent -- which then delegates further -- creates a chain of trust that no single draft adequately covers.
+A note on terminology: "consent" in the OAuth context means a technical authorization flow where a user delegates access scopes to a client. This is distinct from GDPR consent (*Einwilligung*) under Art. 6(1)(a) GDPR, which must be freely given, specific, informed, and unambiguous, and is revocable at any time. When AI agents further delegate to sub-agents, the chain of GDPR-valid consent may break entirely -- a problem none of these 14 drafts addresses. The controller-processor relationship under Art. 28 GDPR imposes additional requirements (data processing agreements, sub-processor authorization) that go beyond what any OAuth extension can express on its own.
## The Agent Gateway Melee: 10 Drafts
If OAuth for agents is about identity, the agent gateway cluster is about communication architecture. Ten drafts are competing to define how agents from different platforms and ecosystems collaborate:
@@ -76,11 +80,11 @@ Our embedding-based similarity analysis produced a more troubling finding: **25+
Some of these duplications are legitimate IETF process: a draft moves from individual submission to working group adoption (like draft-cui-nmrg-llm-nm becoming draft-irtf-nmrg-llm-nm). Others reflect authors shopping the same draft to multiple working groups. And a few appear to be genuine content duplication -- the same ideas submitted under different author combinations.
-The practical effect: the 361-draft corpus includes substantial double-counting. After de-duplication, the true number of distinct proposals is probably closer to 300. But even 300 competing proposals in nine months is extraordinary.
+The practical effect: the 434-draft corpus includes substantial double-counting. After de-duplication, the true number of distinct proposals is somewhat lower -- removing the 25 near-duplicate pairs yields roughly 409 distinct drafts, and further accounting for related-but-not-identical submissions brings the number down further. But even with generous de-duplication, the volume is extraordinary.
## The A2A Protocol Zoo
-Zooming out from individual clusters, the broadest fragmentation is in the **120 A2A protocol drafts**. These span everything from low-level transport (A2A over MOQT/QUIC) to high-level semantic routing (intent-based agent interconnection) to specific use cases (MCP for network troubleshooting).
+Zooming out from individual clusters, the broadest fragmentation is in the **155 A2A protocol drafts**. These span everything from low-level transport (A2A over MOQT/QUIC) to high-level semantic routing (intent-based agent interconnection) to specific use cases (MCP for network troubleshooting).
The most common technical idea in the entire corpus -- "Multi-Agent Communication Protocol" -- appears in **8 separate drafts** from different teams. Eight teams are independently designing how agents should talk to each other.
@@ -143,7 +147,7 @@ Three structural interventions would accelerate convergence:
**1. Working groups need to pick winners.** The IETF process allows competing proposals, but at some point working groups must adopt specific approaches and redirect competing efforts. In the OAuth agent space, the highest-quality proposals (DAAP, Agentic JWT, RAR extensions) should be evaluated head-to-head, not allowed to proliferate indefinitely.
-**2. Interoperability testing, not just drafting.** The 120 A2A protocol proposals exist mostly as text. Interop testing -- where implementations from different teams prove they can work together -- would quickly reveal which proposals have real engineering substance and which are paper exercises.
+**2. Interoperability testing, not just drafting.** The 155 A2A protocol proposals exist mostly as text. Interop testing -- where implementations from different teams prove they can work together -- would quickly reveal which proposals have real engineering substance and which are paper exercises.
**3. The translation layer must be built.** Rather than picking one A2A protocol, the community may be better served by a thin interoperability layer that lets agents using different protocols communicate through gateways. Our gap analysis found this cross-protocol translation gap entirely unaddressed -- zero technical ideas in the current corpus.
@@ -152,7 +156,7 @@ Three structural interventions would accelerate convergence:
### Key Takeaways
- **14 competing OAuth-for-agents proposals** illustrate the depth of fragmentation; none handle chained delegation across agent networks
-- **120 A2A protocol drafts** exist without an interoperability layer; the most common idea in the corpus appears in 8 separate drafts from different teams
+- **155 A2A protocol drafts** exist without an interoperability layer; the most common idea in the corpus appears in 8 separate drafts from different teams
- **25+ near-duplicate pairs** (>0.98 similarity) inflate the draft count; after de-duplication, roughly 300 distinct proposals remain
- **Convergence signals exist** in EDHOC authentication, SCIM agent extensions, and verifiable conversations -- areas where teams explicitly build on each other
- **Fragmentation goes deeper than protocols**: Chinese and Western blocs build on different RFC foundations (YANG/NETCONF vs COSE/CBOR/CoAP); the only shared bedrock is OAuth 2.0


@@ -1,14 +1,16 @@
# What Nobody's Building (And Why It Matters)
*The 11 gaps in the IETF's AI agent landscape -- and the real-world disasters they invite.*
---
Imagine an AI agent managing a hospital's drug-dispensing system. It receives instructions from a prescribing agent, coordinates with a pharmacy agent, and issues delivery commands to a robotic dispensing agent. On Tuesday morning, the prescribing agent hallucinates a dosage. The pharmacy agent fills it. The dispensing agent delivers it. No human saw it happen. No system flagged it. No protocol exists to roll back the dispensed medication.
To be clear: this scenario is already regulated. Under the EU AI Act (Regulation 2024/1689), a drug-dispensing AI agent is a high-risk AI system under Annex III, requiring conformity assessment, risk management, and human oversight before deployment. The Medical Devices Regulation (MDR 2017/745) imposes additional obligations. The gap is not one of legal accountability -- it is one of technical implementation. The standards that would let developers *comply* with these regulations in multi-agent architectures do not yet exist.
This is the predictable consequence of the IETF's most critical standardization gaps.
We analyzed **434 Internet-Drafts**, extracted their technical components, and compared the result against what real-world agent deployments actually require. We found **11 gaps** -- areas where standardization work is missing or inadequate. Two of them are critical. And the critical ones share a defining characteristic: they address what happens when autonomous agents fail or misbehave.
Nobody is building the safety net.
@@ -16,60 +18,51 @@ Nobody is building the safety net.
Our gap analysis sorted findings by severity based on the breadth of the shortfall and the consequences of leaving it unfilled:
| # | Gap | Severity |
|---|-----|----------|
| 1 | Agent Behavioral Verification | CRITICAL |
| 2 | Agent Failure Cascade Prevention | CRITICAL |
| 3 | Real-Time Agent Rollback Mechanisms | HIGH |
| 4 | Multi-Agent Consensus Protocols | HIGH |
| 5 | Human Override Standardization | HIGH |
| 6 | Cross-Domain Agent Audit Trails | HIGH |
| 7 | Federated Agent Learning Privacy | HIGH |
| 8 | Cross-Protocol Agent Migration | MEDIUM |
| 9 | Agent Resource Accounting and Billing | MEDIUM |
| 10 | Agent Capability Negotiation | MEDIUM |
| 11 | Agent Performance Benchmarking | MEDIUM |
The gap names above match the automated gap analysis output. The two critical gaps -- behavioral verification and failure cascade prevention -- address what happens when autonomous agents deviate from declared behavior or trigger cascading failures across interconnected systems. Several high-severity gaps (rollback mechanisms, human override, consensus protocols) address the same theme: what happens when things go wrong, and nobody has built the safety net.
A notable omission from this gap list: **GDPR-mandated capabilities**. The gap analysis focuses on technical desiderata but does not engage with the EU's legally binding data protection framework. Specific GDPR requirements that have no corresponding IETF draft work include: Data Protection Impact Assessment (DPIA) tooling for high-risk agent processing (Art. 35 GDPR), right-to-erasure propagation across multi-agent chains (Art. 17), data portability for agent-generated personal data (Art. 20), and purpose limitation enforcement when agents are authorized for specific tasks but may repurpose data (Art. 5(1)(b)). These are not optional features for EU-deployed agent systems -- they are legal requirements.
## Critical Gap 1: Agent Behavioral Verification
**The problem**: No mechanism exists to verify that a deployed AI agent actually behaves according to its declared policies or specifications.
**The numbers**: Only **47 of 434 drafts** address AI safety and alignment. The capability-to-safety ratio is roughly 4:1 on aggregate -- though it varies significantly by month, from as low as 1.5:1 to as high as 21:1. The trend is clear: the community is building agents faster than it is building the tools to keep them honest.
**What partially addresses this**: Some work exists on the periphery. [draft-aylward-daap-v2](https://datatracker.ietf.org/doc/draft-aylward-daap-v2/) (score 4.75 -- the highest-rated draft in the corpus) defines a behavioral monitoring framework and cryptographic identity verification. [draft-birkholz-verifiable-agent-conversations](https://datatracker.ietf.org/doc/draft-birkholz-verifiable-agent-conversations/) (score 4.5) proposes verifiable conversation records using COSE signing. [draft-berlinai-vera](https://datatracker.ietf.org/doc/draft-berlinai-vera/) (score 3.9) introduces a zero-trust architecture with five enforcement pillars.
**What is still missing**: Runtime verification. These drafts define what agents *should* do and how to *record* what they did. None provides a real-time mechanism to detect that an agent is deviating from its declared behavior *while it is operating*. The gap is between policy declaration and policy enforcement -- the difference between a speed limit sign and a speed camera.
**The scenario**: A financial trading agent is authorized to execute trades within specified parameters. It begins operating within bounds but, after a model update, starts exceeding risk limits. Without runtime behavior verification, the deviation is only discovered in post-hoc audit -- potentially days later, after significant damage.
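Nothing in the corpus defines such a mechanism, but its shape is easy to sketch. A minimal illustration in Python, with every name invented for the trading-agent scenario above: the agent declares per-trade and daily exposure bounds, and each action is checked at submission time rather than in a post-hoc audit.

```python
# Hypothetical sketch -- no IETF draft defines this mechanism; all names are invented.
from dataclasses import dataclass

@dataclass(frozen=True)
class DeclaredPolicy:
    max_order_size: float      # largest single trade the agent may place
    max_daily_exposure: float  # cumulative exposure allowed per day

class RuntimeVerifier:
    """The 'speed camera': checks each action against the declared policy as it happens."""

    def __init__(self, policy: DeclaredPolicy):
        self.policy = policy
        self.daily_exposure = 0.0
        self.violations: list[str] = []

    def check(self, order_size: float) -> bool:
        """Return True if the order is within declared bounds; record any deviation."""
        if order_size > self.policy.max_order_size:
            self.violations.append(f"order {order_size} exceeds per-trade limit")
            return False
        if self.daily_exposure + order_size > self.policy.max_daily_exposure:
            self.violations.append(f"order {order_size} exceeds daily exposure limit")
            return False
        self.daily_exposure += order_size
        return True
```

When a model update starts emitting oversized orders, a verifier like this blocks them at submission time -- the deviation surfaces immediately instead of in an audit days later.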
## Critical Gap 2: Agent Failure Cascade Prevention
**The problem**: No protocols exist to prevent agent failures from cascading across interconnected autonomous systems. As agent interdependencies increase in production deployments, a failure in one agent can ripple outward.
**The numbers**: Only **47 of 434 drafts** address AI safety, and the high interconnectivity implied by 155 A2A protocol drafts and 114 autonomous netops drafts creates the conditions for cascade failures.
**What is missing**: Circuit breakers for cascading failures. Checkpoint and rollback protocols. Blast radius containment. Graceful degradation. All concepts well-established in distributed systems engineering, but absent from the agent standards landscape.
**The scenario**: A telecom operator deploys 50 AI agents for network monitoring, troubleshooting, and optimization. During a major outage, all 50 agents simultaneously request inference resources to diagnose the problem. With no failure cascade prevention, agents compete chaotically. The most aggressive agents get resources; the most important diagnostic tasks may not. The outage extends because the agents that could fix it are starved by the agents that are observing it.
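The circuit breaker named above is a standard distributed-systems pattern, not something any draft specifies. A minimal sketch (invented names) of how one could stop failing calls to an overloaded inference backend from piling on more load:

```python
# Hypothetical sketch -- circuit breakers are standard distributed-systems practice,
# not defined in any agent draft. After repeated failures the breaker opens and
# callers fail fast instead of adding load to a struggling dependency.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
            self.failures = 0
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

In the outage scenario, breakers like this would let diagnostic agents fail fast and back off instead of starving the agents that could actually fix the problem.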
## High-Severity Gap: Real-Time Agent Rollback Mechanisms
**The problem**: No standards exist for how to quickly roll back incorrect decisions made by autonomous agents across distributed systems.
**The numbers**: 114 autonomous netops drafts exist, but none standardizes rollback mechanisms for production network safety. [draft-yue-anima-agent-recovery-networks](https://datatracker.ietf.org/doc/draft-yue-anima-agent-recovery-networks/) (score 4.1) is among the few drafts that partially address this, with its Task-Oriented Multi-Agent Recovery Framework and State Consistency Management. For context, "Multi-Agent Communication Protocol" -- defining how agents *talk* -- appears in 8 drafts. The community has invested far more effort in the plumbing than in the fire escape.
**What is missing**: Standardized checkpointing and rollback coordination for decisions made across distributed agents -- including compensation for actions whose effects have already propagated downstream. These are well-established concepts in distributed systems engineering, but absent from the agent standards landscape.
@@ -77,35 +70,29 @@ That is the entire body of work the IETF has produced on agent error recovery. F
## The High-Priority Gaps
Several additional gaps scored HIGH severity. Each represents a missing piece that working deployments will hit:
### Human Override Standardization
Only **34 human-agent interaction drafts** exist versus **114 autonomous operations** and **155 A2A protocol** drafts. Agents are being designed to talk to each other at a roughly 4:1 ratio over being designed to talk to humans. Emergency override protocols -- the "big red button" -- are almost entirely absent. This is not merely an engineering preference. For high-risk AI systems deployed in the EU, the AI Act (Art. 14) mandates human oversight -- making this gap a compliance blocker, not just a design omission.
[draft-rosenberg-aiproto-cheq](https://datatracker.ietf.org/doc/draft-rosenberg-aiproto-cheq/) (score 3.9) is a rare exception: it defines a protocol for human confirmation of agent decisions before execution. But CHEQ is opt-in and pre-execution. No draft defines what happens when a human needs to stop a running agent, constrain its behavior, or take over its task mid-execution.
### Multi-Agent Consensus Protocols
When a group of agents disagree -- the diagnosis agent says the router is down, the monitoring agent says it is up, the optimization agent is rerouting traffic around it -- who arbitrates? No framework exists for agents to resolve conflicting assessments without human intervention. This is not a new problem: FIPA (Foundation for Intelligent Physical Agents) defined agent communication languages and interaction protocols for multi-agent coordination as early as 1997. The IETF landscape has largely not engaged with this prior art.
### Cross-Domain Agent Audit Trails
An agent operating across multiple domains or organizations needs to maintain audit trails that satisfy different regulatory requirements simultaneously. Identity management exists -- the 152 identity/auth drafts cover authentication. What does not exist is cross-domain audit standardization: the format and semantics for recording agent actions across jurisdictions with varying compliance requirements. The EU's eIDAS 2.0 regulation (Regulation 2024/1183) and its European Digital Identity Wallet framework provide a mature trust model that the IETF drafts have not yet connected to.
### Federated Agent Learning Privacy
While federated architectures exist, there is insufficient specification for privacy-preserving agent learning that prevents data leakage between federated participants during model updates.
### Cross-Protocol Agent Migration
Agents need to migrate between different network protocols, domains, or infrastructure providers while maintaining state and identity. Current drafts focus on registration but not migration continuity.
## The Structural Problem
@@ -119,7 +106,9 @@ Now look back at the team bloc analysis from Post 2. The 18 team blocs are *isla
This is the structural explanation for the safety deficit. It is not that people do not care about safety. It is that safety standards require coordination across boundaries that the current authorship structure cannot bridge. Capability standards can be built within a single team. Safety standards cannot.
Our category co-occurrence analysis provides the concrete proof. Safety drafts are not entirely isolated -- they co-occur with several categories, coupling most strongly with policy and governance and identity/auth. But the pattern is revealing: safety pairs with *governance* categories, not *implementation* categories. Of the 155 drafts tagged as A2A protocols, very few also address safety. Safety has minimal co-occurrence with agent discovery/registration and model serving/inference. Its weakest links are to the categories where agents actually *do* things. Safety is being discussed in governance papers. It is barely present in the protocols that need it most. The traffic lights are not just behind the highways -- they are on a different road entirely.
IEEE P3394 (Standard for Trustworthy AI Agents), a concurrent standardization effort, is attempting to address some of these safety and trust dimensions from a different angle. The IETF landscape should be compared against these parallel efforts to understand which gaps are being addressed elsewhere and which remain truly unserved.
## The 4:1 Ratio, Revisited
@@ -127,11 +116,11 @@ The safety deficit is not just a number. It is a structural property of how the
| Category | Drafts | Team Blocs Active |
|----------|-------:|------------------:|
| A2A protocols | 155 | Many (distributed across blocs) |
| Autonomous operations | 114 | Primarily Huawei, Chinese telecom |
| Agent identity/auth | 152 | Ericsson, Nokia, ATHENA, multiple |
| **AI safety/alignment** | **47** | **Few; mostly independents/startups** |
| **Human-agent interaction** | **34** | **Rosenberg/White (2-person team)** |
The capability categories have organized teams behind them. The safety categories rely on individual contributors and small, unconnected teams. The best safety draft in the corpus (DAAP, score 4.75) comes from an independent author (Aylward). The best human-agent drafts come from a two-person Five9/Bitwave team. There is no 13-person safety bloc with 94% cohesion.


@@ -105,7 +105,7 @@ Each draft addresses specific gaps. Together, they provide the connective tissue
## Traction vs. Aspiration
A reality check: of the 361 drafts, only **36 (10%)** have been adopted by IETF working groups. The rest are individual submissions -- proposals without institutional backing. The WG-adopted drafts score higher on average (**3.54 vs. 3.31**), particularly on maturity (+1.28) and momentum (+0.98), but lower on novelty (-0.45). *(Note: scores are LLM-generated relative rankings from abstracts; see [Methodology](../methodology.md).)* The WGs that have adopted the most agent-relevant drafts are security-focused: **lamps** (6 drafts), **lake** (5), **tls** (3), **emu** (3). Agent-specific WGs like `aipref` have adopted only 2 drafts.
This reveals a structural insight: the IETF is not building agent standards from scratch. It is **retrofitting security standards for agents**. The agent architecture we propose above would need to work within this reality -- building on the security WGs' infrastructure rather than competing with it.


@@ -227,6 +227,26 @@ For context: analyzing 361 IETF drafts -- fetching full text, rating quality on
---
## Limitations
This analysis is exploratory, not peer-reviewed research. Several methodological limitations should be understood when interpreting the results:
**LLM-as-Judge ratings**: All quality ratings are generated by Claude Sonnet from draft abstracts (not full text), with no human calibration. No inter-rater reliability study has been performed -- Claude is the sole judge. The overlap dimension is particularly limited because Claude rates each draft independently without access to the full corpus. Scores should be treated as relative rankings within this corpus, not absolute quality measures.
**Keyword-based corpus selection**: The 12 search keywords cast a wide net but introduce both false positives (drafts about "user agents" or "autonomous systems" unrelated to AI) and false negatives (relevant drafts using terminology we did not search for). We flagged 73 drafts as false positives (38 with relevance <= 2, 35 by manual review), leaving 361 relevant drafts; some residual misclassification likely remains. The relevance rating partially mitigates this, but the LLM judge is generous with relevance for keyword-matched drafts.
**Clustering thresholds**: The 0.85 cosine similarity threshold for topical clusters, 0.90 for near-duplicates, and 0.98 for functional duplicates are empirical choices based on manual inspection, not derived from a principled analysis. The embedding model (nomic-embed-text) is general-purpose, not fine-tuned for standards documents. A sensitivity analysis across thresholds would strengthen confidence.
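As an illustration of what such a sensitivity analysis looks like, here is a toy sketch in pure Python (invented two-dimensional vectors; the real pipeline uses nomic-embed-text embeddings): sweep the threshold and count the clusters a greedy single-link pass produces.

```python
# Illustrative only -- not the pipeline's actual clustering code.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def greedy_clusters(vectors, threshold):
    clusters = []  # each cluster is a list of vector indices
    for i, v in enumerate(vectors):
        for cluster in clusters:
            # single-link: join the first cluster with any member above threshold
            if any(cosine(v, vectors[j]) >= threshold for j in cluster):
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Toy embeddings: two tight groups plus one in-between outlier.
vecs = [(1.0, 0.0), (0.98, 0.2), (0.0, 1.0), (0.1, 0.99), (0.7, 0.7)]
counts = {t: len(greedy_clusters(vecs, t)) for t in (0.85, 0.90, 0.98)}
```

On this toy data the 0.85 and 0.90 thresholds agree while 0.98 splits a pair apart; on real embeddings the counts shift more across thresholds, which is exactly why the empirical choice deserves documentation.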
**Gap analysis**: The gap identification is a single-shot LLM analysis based on compressed landscape statistics, not a systematic comparison against a reference architecture. Gap severity is assigned by Claude without defined thresholds. The gaps should be treated as hypotheses for expert validation, not definitive findings.
**Idea extraction quality**: Batch extraction (Haiku, abstract-only at 800 chars) produces different results than individual extraction (Sonnet, abstract + full text). No precision/recall measurement has been performed. The extraction prompt instructs Claude to return 1-4 ideas per draft, which may under-count contributions from comprehensive drafts.
**Abstract-only analysis**: Ratings are based on abstracts truncated to 2000 characters. For maturity assessment in particular, the abstract is an imperfect proxy for the full document's technical depth.
For full methodology documentation, see `data/reports/methodology.md` in the project repository.
---
### Key Takeaways
- **The full analysis cost ~$9** -- LLM-powered document analysis at scale is practical and cheap with proper caching and model selection


@@ -4,6 +4,98 @@
---
### 2026-03-08 CODER — Data Integrity Fixes from Statistical & Scientific Reviews
**What**: Fixed data integrity issues identified in `review-statistics.md` and `review-science.md`:
1. **Category normalization**: Updated 21 ratings rows with legacy long-form category names (e.g., "Agent-to-agent communication protocols") to canonical short forms (e.g., "A2A protocols"). All 11 categories now consistent in the database.
2. **False positive flagging**: Added `false_positive` column to ratings table. Flagged 73 drafts as false positives (38 with relevance <= 2, 35 manually reviewed at relevance 3+ that are clearly not AI-agent related — e.g., HPKE, cookies, BGP, EDHOC). Notable: excluding false positives yields exactly 361 relevant drafts.
3. **Schema update**: Updated `db.py` schema definition and migration code to include `false_positive` column.
4. **Verified counts document**: Created `data/reports/reviews/verified-counts.md` as single source of truth — documents all actual counts (434 drafts, 419 ideas, 11 gaps, 557 authors) with explanations for discrepancies.
5. **Gap count confirmed**: 11 gaps in DB, not 12. Blog posts use an editorially rewritten gap list with different names and an extra gap.
6. **Ideas count explained**: DB has 419 (post-dedup, 89% of drafts have exactly 1 idea). The 1,780 figure was pre-dedup. The 1,262 figure was from a smaller corpus.
**Why**: Reviews identified critical data integrity issues that would undermine credibility if published — inconsistent category names affecting counts by 5-15%, no mechanism to exclude false positives, and conflicting counts across all reports.
**Result**: Database now has clean categories, false positive flags, and a verified-counts reference document. The coincidence that 434 - 73 false positives = 361 exactly matches the original blog series count.
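A hedged reconstruction of that migration (the real table and column names in `db.py` may differ, and only one of the 21 category mappings is shown):

```python
# Hypothetical reconstruction -- actual schema and mapping table are assumptions.
import sqlite3

CATEGORY_MAP = {
    "Agent-to-agent communication protocols": "A2A protocols",
    # one entry per legacy long-form name (21 rows were affected in total)
}

def migrate(conn: sqlite3.Connection) -> None:
    # Idempotent: only add the column if it is not already present.
    cols = {row[1] for row in conn.execute("PRAGMA table_info(ratings)")}
    if "false_positive" not in cols:
        conn.execute(
            "ALTER TABLE ratings ADD COLUMN false_positive INTEGER NOT NULL DEFAULT 0"
        )
    # Normalize legacy long-form category names to canonical short forms.
    for legacy, canonical in CATEGORY_MAP.items():
        conn.execute(
            "UPDATE ratings SET category = ? WHERE category = ?", (canonical, legacy)
        )
    # Auto-flag clear false positives; manually reviewed drafts are flagged separately.
    conn.execute("UPDATE ratings SET false_positive = 1 WHERE relevance <= 2")
    conn.commit()
```

Running the migration twice is safe: the column check makes the ALTER conditional, and the UPDATEs are no-ops on already-normalized rows.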
---
### 2026-03-08 CODER — Fix Security & Code Quality Issues from Dev Review
**What**: Applied 7 targeted fixes from `data/reports/reviews/review-dev.md`:
1. SQL injection in `db.py:update_generation_run` — added column name whitelist validation
2. Flask SECRET_KEY — changed from hardcoded string to `os.environ.get('FLASK_SECRET_KEY', os.urandom(24).hex())`
3. Version string — updated CLI from "0.1.0" to "0.2.0"
4. JSON extraction — `_extract_json` now handles trailing whitespace after code fences via `.rstrip()`
5. Ollama client lifecycle — added `close()`, `__enter__`, `__exit__` to `Embedder` class
6. LLM rating bounds — added `_clamp_rating()` method clamping all rating fields to 1-10 integers in `_parse_rating`
7. Hardcoded matrix size — replaced "260x260" with dynamic `{n_drafts}x{n_drafts}` from actual DB count
**Why**: Dev reviewer flagged these as critical (SQL injection), high (SECRET_KEY), and medium priority issues
**Result**: All 7 fixes applied with minimal targeted edits. No refactoring beyond what was needed.
---
### 2026-03-08 CODER — Methodology Documentation and Scientific Rigor Fixes
**What**: Addressed methodology and scientific rigor issues raised by the science and statistics reviews. Five deliverables:
1. Added 35-line methodology comment block to `analyzer.py` documenting LLM-as-judge limitations (abstract-only, no calibration, no consistency check, overlap score limitation, batch effects, relevance inflation). Updated the rating prompt (`RATE_PROMPT_COMPACT`) with an explicit rubric defining what each score level means for each dimension.
2. Created `data/reports/methodology.md` — comprehensive methodology document covering data collection (keywords, API, selection bias), analysis pipeline (all 6 stages), rating rubric with scale interpretation, clustering method and threshold justification, gap analysis limitations, embedding model properties, known limitations table, and related work references.
3. Added 20-line docstring to `find_clusters()` in `embeddings.py` documenting the 0.85 threshold as an empirical choice with manual inspection rationale, noting that sensitivity analysis would strengthen confidence.
4. Added 22-line comment block above `GAP_ANALYSIS_PROMPT` in `analyzer.py` documenting it as single-shot LLM analysis, noting the absence of reference architecture grounding, and listing strengthening options.
5. Added methodology caveat notes to blog posts 01 (gold-rush), 03 (oauth-wars), 06 (big-picture), and 07 (how-we-built-this, full Limitations section added). Each note explains ratings are LLM-generated from abstracts without human calibration.
6. Added related work section to methodology.md covering FIPA, IEEE P3394, W3C WoT, academic MAS research (AAMAS/JAIR/JAAMAS), and other standards bodies (OASIS, ITU-T, ETSI).
**Why**: Scientific and statistical reviews identified LLM-as-judge limitations, unjustified thresholds, missing related work, and ungrounded gap analysis as the top methodological weaknesses. These caveats are needed before publication.
**Result**: 6 files modified (`analyzer.py`, `embeddings.py`, 4 blog posts), 1 file created (`methodology.md`). All changes are documentation/caveats — no pipeline restructuring.
---
### 2026-03-08 STATISTICS REVIEWER — Full Statistical Audit of Blog Series
**What**: Audited all 10 blog posts, 9 data packages, master stats, and key reports against the actual database (`data/drafts.db`) using sqlite3 queries. Produced comprehensive statistical review at `data/reports/reviews/review-statistics.md`.
**Why**: The blog series makes extensive quantitative claims (361 drafts, 1,780 ideas, 12 gaps, 4:1 safety ratio, 36x growth, etc.) that needed cross-checking against the ground truth database before publication.
**Result**: Found 3 critical issues, 4 important issues, and 4 minor issues. Most serious: the ideas table has 419 rows (not 1,780 as claimed), the database now has 434 drafts (not 361), gaps are 11 (not 12), and composite scores are inflated by 0.05-0.10 through rounding. The 4:1 safety ratio varies from 1.5:1 to 21:1 by month. The "36x growth" figure cherry-picks endpoints. Qualitative patterns (Huawei dominance, safety deficit, fragmentation) hold directionally. RFC cross-refs (4,231), author count (557), and draft-author links (1,057) match exactly.
**Surprise**: The ideas count mismatch (419 vs 1,780) is the most serious finding -- Post 5's entire thesis about "96% of ideas in one draft" and "628 cross-org convergent ideas" is not reproducible from the current database. The pipeline may have been re-run with different parameters, overwriting the original idea extraction.
---
### 2026-03-08 LEGAL REVIEWER — Full Legal Review of Blog Series and Reports
**What**: Reviewed all 10 blog series files (Posts 00-08 plus state-of-ecosystem) and key reports (gaps.md, overview.md) through a German/EU internet law lens. Produced comprehensive legal review covering GDPR, EU AI Act, eIDAS 2.0, NIS2, CRA, product liability, and IETF IPR policy.
**Why**: The series makes claims about safety gaps, identity/auth protocols, and regulatory predictions without adequately engaging the EU regulatory framework -- which is not future speculation but current law with imminent enforcement deadlines (AI Act fully applicable August 2026).
**Result**: Review written to `data/reports/reviews/review-legal.md`. Found 3 critical issues (consent terminology conflation, hospital scenario understating regulatory reality, GDPR omission from gap analysis), 5 regulatory gaps (AI Act needs structural treatment not just a prediction, eIDAS 2.0 missing from identity discussion, NIS2/CRA unaddressed, German TKG context absent), 5 improvement suggestions, and per-post notes for all 10 files. Top priority: Post 6's AI Act enforcement timeline is wrong (says "18 months" but enforcement begins in 5 months).
**Surprise**: The series' best architectural proposal -- assurance profiles L0-L3 -- maps remarkably well to the AI Act's risk-based approach, but the connection is never made explicit. Making it explicit would strengthen both the regulatory argument and the technical proposal.
---
### 2026-03-08 REVIEWER-DEV — Full Codebase Engineering Review
**What**: Comprehensive code review of all core modules (`db.py`, `analyzer.py`, `cli.py`, `fetcher.py`, `embeddings.py`, `authors.py`, `models.py`, `config.py`, `draftgen.py`, `search.py`, `readiness.py`), web UI (`app.py`, `data.py`, `auth.py`), and scripts. Reviewed ~5000 lines of application code and ~2000 lines of web data layer.
**Why**: Pre-deployment quality gate. The tool has grown from a simple CLI to a full web dashboard with API endpoints, and the security/quality bar needs to rise accordingly.
**Result**: Review written to `data/reports/reviews/review-dev.md`. Found 1 critical issue (SQL injection in `update_generation_run`), 1 high issue (hardcoded Flask SECRET_KEY), 5 bugs, 6 performance concerns, and 14 improvement suggestions. Overall grade: B+ -- solid architecture, needs hardening. Key positives: clean separation of concerns, effective LLM caching, good auth design, proper FTS5 sync triggers.
**Surprise**: The `cli.py` file has grown to 2995 lines with ~40 repetitions of the same config/db boilerplate pattern. Also, zero test coverage for the analysis pipeline (`analyzer.py`, `embeddings.py`, `fetcher.py`) despite it being the core of the tool.
---
### 2026-03-08 REVIEWER (Science) — Full Scientific Review of Methodology and Outputs
**What**: Conducted comprehensive scientific review of the entire analysis pipeline, database integrity, reports, and blog posts. Reviewed analyzer.py (rating/idea/gap prompts), embeddings.py (clustering), fetcher.py (data collection), config.py, and all reports/blog posts. Queried database directly for integrity checks.
**Why**: The analysis makes strong claims (4:1 safety deficit, 12 gaps, 1262 ideas, 9.3% of IETF submissions) that need to withstand scrutiny from IETF participants, academic reviewers, and standards experts. Several methodological weaknesses and data inconsistencies were found that could undermine credibility if not addressed.
**Result**: Wrote detailed review to `data/reports/reviews/review-science.md` with 8 sections covering methodology, unsupported claims, missing context, data integrity, improvements, taxonomy, and post-by-post notes. Key findings:
- **CRITICAL**: Ideas database has 419 entries but blog posts reference 1,262-1,780. Major data inconsistency.
- **CRITICAL**: LLM ratings have no human calibration. No inter-rater reliability measurement.
- **HIGH**: 55 non-canonical category names in ratings table (normalization not applied to stored data).
- **HIGH**: ~30-50 false positive drafts in corpus (e.g., HPKE, PIE bufferbloat rated relevance 5 and 3).
- **HIGH**: Missing related work context (FIPA, IEEE P3394, academic MAS research).
- **MEDIUM**: Greedy single-linkage clustering at an unjustified 0.85 threshold.
- Database grew from 361 to 434 drafts but all reports/blogs still cite 361.
- 10 prioritized recommendations provided, from calibration study to reference architecture.
**Surprise**: The ideas count discrepancy (419 vs 1,780) is dramatic -- either mass dedup removed 75%+ of ideas, or the database was regenerated. Either way, Post 05 ("1,262 Ideas") needs a full rewrite. Also, `draft-ietf-hpke-hpke` (generic public key encryption, nothing to do with AI agents) is rated relevance=5, showing the LLM judge is too generous with keyword-matched drafts.
**Cost**: Zero API cost (review only, no pipeline runs). Approximately 90 minutes of analysis time.
### 2026-03-07 CODER C — Citation Graph, Readiness Scoring, Annotations, Data Surfacing
**What**: Implemented four features in a single session:

data/reports/methodology.md Normal file

@@ -0,0 +1,232 @@
# Methodology — IETF Draft Analyzer
*This document describes the data collection, analysis pipeline, and known limitations of the IETF Draft Analyzer project. It is intended to provide transparency for anyone evaluating the findings in the blog series or reports.*
---
## 1. Data Collection
### Source
All data is fetched from the IETF Datatracker API (`https://datatracker.ietf.org/api/v1/doc/document/`). Full draft text is retrieved from `https://www.ietf.org/archive/id/{name}-{rev}.txt`. Author and affiliation data comes from the `/api/v1/doc/documentauthor/` and `/api/v1/person/person/` endpoints.
### Keyword Selection
The corpus is built by searching for drafts matching 12 keywords across both `name__contains` and `abstract__contains` fields:
`agent`, `ai-agent`, `llm`, `autonomous`, `machine-learning`, `artificial-intelligence`, `mcp`, `agentic`, `inference`, `generative`, `intelligent`, `aipref`
Only drafts with `type__slug=draft` and submission date >= 2024-01-01 are included.
### Selection Bias Acknowledgment
Keyword-based selection introduces both false positives and false negatives:
- **False positives**: Keywords like "agent" match "user agent" in HTTP contexts, "autonomous" matches "autonomous systems" (AS) in routing, and "intelligent" matches "intelligent networking" unrelated to AI. We estimate 30-50 false positives remain in the corpus despite relevance filtering. Drafts with relevance score <= 2 are the most obvious, but some false positives receive relevance scores of 3-4 from the LLM judge.
- **False negatives**: Relevant drafts using terminology not in our keyword list (e.g., "cognitive," "self-driving network," or domain-specific terms) are missed entirely.
- **Temporal bias**: The `fetch_since` cutoff of 2024-01-01 excludes earlier foundational work that may inform the current landscape.
### Organization Normalization
Author affiliations are normalized using a hand-curated alias table of 40+ mappings (e.g., "Huawei Technologies Co., Ltd." -> "Huawei") plus automatic suffix stripping for common patterns (", Inc.", " LLC", " AB", etc.). This normalization is essential for cross-org analysis but introduces judgment calls about organizational boundaries.
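A minimal sketch of this normalization, with an illustrative two-entry alias table and suffix list (the real table has 40+ entries; these names and patterns are examples, not the actual data):

```python
import re

# Illustrative subset of the hand-curated alias table (real table has 40+ entries).
ALIASES = {
    "huawei technologies co., ltd.": "Huawei",
    "cisco systems, inc.": "Cisco",
}

# Common corporate suffixes stripped automatically (illustrative pattern).
SUFFIX_RE = re.compile(r",?\s+(Inc\.?|LLC|Ltd\.?|AB|GmbH|Co\.)$", re.IGNORECASE)

def normalize_org(raw: str) -> str:
    """Map a raw affiliation string to a canonical organization name."""
    cleaned = raw.strip()
    canonical = ALIASES.get(cleaned.lower())
    if canonical:
        return canonical
    # Fall back to suffix stripping for organizations not in the alias table.
    return SUFFIX_RE.sub("", cleaned)
```

The alias table handles known multi-form names; the suffix fallback catches the long tail, at the cost of the judgment calls noted above.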
---
## 2. Analysis Pipeline
The pipeline runs in six stages, each building on the previous:
```
fetch --> analyze --> embed --> ideas --> gaps --> report
  |          |          |         |        |         |
  v          v          v         v        v         v
Datatracker Claude    Ollama    Claude   Claude   Markdown
API         Sonnet    nomic-embed Haiku  Sonnet   + rich
```
### Stage 1: Fetch
Retrieves draft metadata, full text, and author information from the Datatracker API with a 0.5-second polite delay between requests.
### Stage 2: Analyze (Rating)
Each draft is rated by Claude Sonnet on five dimensions using a compact structured prompt that includes the draft's name, title, date, page count, and abstract (truncated to 2000 characters). See "Rating Rubric" below.
### Stage 3: Embed
Vector embeddings are generated using Ollama with the `nomic-embed-text` model. The input combines the draft's title, abstract, and first 4000 characters of full text. Embeddings are 768-dimensional vectors stored as binary blobs in SQLite. See "Embedding Model" below.
### Stage 4: Ideas
Technical ideas are extracted by Claude. Individual extraction uses Sonnet with abstract + first 3000 characters of full text. Batch extraction uses Haiku with abstract only (truncated to 800 characters). The prompt requests 1-4 top-level novel contributions per draft.
### Stage 5: Gaps
A single Claude Sonnet call receives compressed landscape statistics (category counts, top ideas, overlap summary) and identifies 8-15 standardization gaps. See "Gap Analysis" below.
### Stage 6: Report
Markdown reports are generated from database queries. No LLM is involved in report generation.
### Caching
All Claude API calls are cached in an `llm_cache` table keyed by SHA-256 hash of the full prompt. Re-runs return cached results, making the pipeline idempotent. This also means that intra-rater consistency cannot be measured from cached results.
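The caching scheme can be sketched as follows (the `llm_cache` table and SHA-256 keying match the description above; the exact column names are an assumption):

```python
import hashlib
import sqlite3

def cached_call(conn: sqlite3.Connection, prompt: str, llm_fn) -> str:
    """Return a cached LLM response, calling llm_fn only on a cache miss."""
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    row = conn.execute(
        "SELECT response FROM llm_cache WHERE prompt_hash = ?", (key,)
    ).fetchone()
    if row:
        return row[0]  # cache hit: re-runs are idempotent and cost nothing
    response = llm_fn(prompt)
    conn.execute(
        "INSERT INTO llm_cache (prompt_hash, response) VALUES (?, ?)",
        (key, response),
    )
    return response
```

Because the key covers the full prompt, any prompt change invalidates the cache entry; identical prompts always return the first recorded answer, which is why intra-rater consistency cannot be measured from cached results.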
---
## 3. Rating Rubric
Each draft is scored on five dimensions, each on a 1-5 integer scale:
| Dimension | 1 | 2 | 3 | 4 | 5 |
|-----------|---|---|---|---|---|
| **Novelty** | Trivial/obvious extension | Incremental | Useful contribution | Notable originality | Genuinely novel approach |
| **Maturity** | Problem statement only | Early sketch | Defined protocol/mechanism | Detailed spec with examples | Implementation-ready with test vectors |
| **Overlap** | Unique approach | Minor similarities | Shares concepts with 1-2 drafts | Significant overlap | Near-duplicate of existing work |
| **Momentum** | Inactive/abandoned | Single revision | Active development | WG interest/adoption | Strong community momentum |
| **Relevance** | Not about AI/agents (false positive) | Tangentially related | Partially relevant | Directly relevant | Core AI agent topic |
### Composite Score
The composite score used in reports is a 4-dimension average of novelty, maturity, momentum, and relevance (excluding overlap, since overlap measures redundancy rather than quality). Exact decimal values are used; rounding is avoided.
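A minimal sketch of that computation (the dict field names are assumptions about the rating record shape):

```python
def composite_score(rating: dict) -> float:
    """Average of novelty, maturity, momentum, relevance; overlap excluded
    because it measures redundancy rather than quality."""
    dims = ("novelty", "maturity", "momentum", "relevance")
    return sum(rating[d] for d in dims) / len(dims)
```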
### Scale Interpretation
Scores should be treated as **relative rankings within this corpus**, not absolute quality measures. Key limitations:
- **Abstract-only input**: Ratings are based on the draft's abstract (truncated to 2000 characters), not the full text. Maturity and overlap scores are particularly affected, since the abstract may not convey the full technical depth or specificity of the draft.
- **Single LLM judge**: Claude Sonnet is the sole rater. No human calibration study has been performed. No second-model comparison has been conducted. Even a small calibration set (20-30 drafts rated by domain experts) would substantially strengthen confidence.
- **No consistency measurement**: Each draft is rated once. The caching mechanism prevents re-rating, so Claude's self-consistency on these drafts is untested.
- **Overlap score limitations**: Claude rates each draft independently without access to the full corpus. The overlap dimension reflects Claude's general knowledge of the field, not corpus-specific similarity analysis. For corpus-level overlap, use the embedding-based similarity analysis instead.
- **Relevance inflation**: Keyword-matched drafts tend to score high on relevance by construction. The distribution is right-skewed (most drafts at 4-5).
- **Batch effects**: When rated in batches of 5 (using `BATCH_PROMPT`), position effects and comparison effects between drafts in the same batch are uncontrolled. Abstracts are truncated more aggressively (1500 chars) in batch mode.
---
## 4. Clustering
### Method
Greedy single-linkage clustering on the pairwise cosine similarity matrix of draft embeddings.
**Algorithm**: For each unvisited draft (seed), find all unvisited drafts with cosine similarity >= threshold to the seed. Add them to the seed's cluster and mark them visited. This produces disjoint clusters where every member is similar to the seed, but members are not guaranteed to be similar to each other (single-linkage property).
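The pass described above can be sketched as (assuming a precomputed pairwise similarity matrix):

```python
def greedy_clusters(sim: list[list[float]], threshold: float = 0.85) -> list[list[int]]:
    """Greedy single-linkage pass: each unvisited draft seeds a cluster of
    all unvisited drafts whose similarity to the seed meets the threshold.
    Members are similar to the seed, not necessarily to each other."""
    n = len(sim)
    visited = [False] * n
    clusters = []
    for seed in range(n):
        if visited[seed]:
            continue
        visited[seed] = True
        cluster = [seed]
        for other in range(n):
            if not visited[other] and sim[seed][other] >= threshold:
                visited[other] = True
                cluster.append(other)
        clusters.append(cluster)
    return clusters
```

Note that cluster membership depends on iteration order: a draft similar to two different seeds is claimed by whichever seed is visited first.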
### Thresholds
| Threshold | Label | Justification |
|-----------|-------|---------------|
| 0.85 | Topically overlapping | Empirical: 0.80 produced too many false groupings; 0.90 missed obvious clusters; 0.85 yielded groups that looked reasonable on manual spot-checking |
| 0.90 | Near-duplicates | Empirical: pairs above 0.90 consistently covered the same topic with similar approaches |
| 0.98 | Functionally identical | Empirical: pairs above 0.98 were essentially the same document under different names |
**None of these thresholds are derived from a principled analysis.** A sensitivity analysis (running clustering at 0.80, 0.85, 0.90 and comparing results) would strengthen confidence. Different embedding models would produce different similarity distributions, potentially requiring different thresholds.
### Limitations
- **Single-linkage chaining**: A chain of pairwise-similar drafts can produce clusters containing semantically distant drafts connected through intermediaries.
- **No comparison to alternatives**: The clustering has not been compared against k-means, DBSCAN, hierarchical agglomerative clustering, or other standard methods.
- **General-purpose embeddings**: The `nomic-embed-text` model was not trained specifically for technical standards document similarity. Domain-specific or fine-tuned embeddings might produce significantly different cluster structures.
- **Inconsistent embedding input**: Drafts with full text available are embedded from title + abstract + 4000 chars of body. Drafts without full text are embedded from title + abstract only. This creates systematic quality differences in embeddings.
---
## 5. Gap Analysis
The gap analysis sends Claude Sonnet a compressed landscape summary containing:
- Category distribution (category name and draft count)
- Top 20 most frequently occurring idea titles
- Overlap summary (top 5 categories by count, labeled "high internal overlap")
Claude is instructed to identify 8-15 gaps with topic, description, category, severity (critical/high/medium/low), and evidence.
### Limitations
- **Single-shot generation**: Gaps are identified in one LLM call, not through systematic comparison against a reference taxonomy.
- **No reference architecture**: A rigorous gap analysis would compare the corpus against an explicit agent ecosystem reference model (e.g., NIST AI RMF, FIPA agent platform model). The current approach relies on Claude's general knowledge.
- **Circular overlap summary**: The overlap information fed to Claude is just category-level counts, not specific technical areas of overlap within categories.
- **Variable evidence quality**: Some gap evidence cites specific data ("only N drafts address X"), while other evidence is based on Claude's inference about what is missing.
- **Ungrounded severity**: The distinction between critical, high, medium, and low severity is assigned by Claude without defined thresholds.
### Strengthening Options
- Ground against a reference architecture (FIPA, NIST AI RMF, or a custom agent ecosystem model)
- Run multiple independent gap analyses and intersect results
- Have domain experts validate and rank gaps
- Cite specific drafts that partially address each gap
---
## 6. Embedding Model
**Model**: `nomic-embed-text` (Nomic AI), run locally via Ollama.
**Properties**:
- 768-dimensional embeddings
- Context window: ~8192 tokens
- General-purpose text embedding model trained on diverse English text
- Not fine-tuned for technical/standards document similarity
**Input**: Title + abstract + first 4000 characters of full text (when available), concatenated with double newlines. Input is truncated to ~32,000 characters before embedding.
**Similarity metric**: Cosine similarity, computed as dot product divided by product of L2 norms.
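In code, the metric as defined is:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```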
**Limitations**: As a general-purpose model, nomic-embed-text may not capture domain-specific semantic relationships in standards documents as well as a model fine-tuned on technical/legal/standards text. The embeddings have not been evaluated against a gold-standard similarity judgment set for IETF drafts.
---
## 7. Known Limitations Summary
| Limitation | Impact | Mitigation |
|------------|--------|------------|
| Abstract-only rating | Maturity/overlap scores may be unreliable | Could re-rate with full text for a validation sample |
| No human calibration | Rating validity is unknown | Calibration study with 5 experts on 25 drafts |
| Keyword selection bias | ~30-50 false positives, unknown false negatives | Relevance filtering, manual review of low-scoring drafts |
| Empirical clustering thresholds | Cluster boundaries may be arbitrary | Sensitivity analysis at multiple thresholds |
| Single-shot gap analysis | Gaps may be incomplete or misprioritized | Ground against reference architecture |
| General-purpose embeddings | Domain-specific similarity may be missed | Evaluate against expert similarity judgments |
| Batch vs. individual extraction quality | Idea counts and quality may vary by extraction method | Compare batch (Haiku) vs. individual (Sonnet) on sample |
| Organization normalization | Cross-org analysis depends on alias accuracy | Publish and review normalization table |
---
## 8. Related Work
The IETF's AI agent standardization effort exists within a broader ecosystem of agent-related standards and research. This analysis would benefit from comparison against:
### FIPA (Foundation for Intelligent Physical Agents)
The original agent communication standards body (1996-2005). FIPA's Agent Communication Language (ACL), Agent Management Specification, and Agent Platform specifications are the direct ancestors of modern A2A protocols. Key specifications include:
- **FIPA ACL** (SC00061): Message structure and performatives for agent communication
- **FIPA Agent Management** (SC00023): Agent platform architecture with Agent Management System (AMS), Directory Facilitator (DF), and Message Transport Service (MTS)
- **FIPA Interaction Protocols** (SC00026-SC00036): Request, query, contract net, brokering, and other standard interaction patterns
FIPA's work is relevant because many of the "novel" A2A protocol proposals in the IETF corpus address problems FIPA solved (or attempted to solve) 20+ years ago. The absence of FIPA references in most current drafts suggests a lack of awareness of prior art.
### IEEE P3394 — Standard for Trustworthy Autonomous and Semi-Autonomous Systems
An active IEEE working group developing standards for trustworthy AI agents, addressing trust, transparency, and accountability. Relevant to the IETF's AI safety/alignment and policy/governance categories. The IETF's safety deficit (4:1 capability-to-safety ratio) should be evaluated in the context of IEEE's complementary safety-focused standardization.
### W3C Web of Things (WoT)
The W3C WoT Architecture and Thing Description specifications address agent/device discovery and interoperability in IoT contexts:
- **WoT Architecture** (W3C REC): Defines servients, protocol bindings, and discovery mechanisms applicable to agent systems
- **WoT Thing Description** (W3C REC): Machine-readable metadata for capabilities, interfaces, and security -- analogous to agent capability description proposals in the IETF corpus
Several IETF drafts build on or compete with WoT concepts for agent discovery and description.
### Academic Multi-Agent Systems (MAS) Research
The multi-agent systems research community (AAMAS, JAIR, JAAMAS) has decades of work on problems the IETF drafts are now addressing at the protocol level:
- **Agent coordination**: Consensus, negotiation, auction mechanisms (relevant to IETF gap "Multi-Agent Consensus Protocols")
- **Trust and reputation**: Computational trust models, reputation systems (relevant to agent identity/auth drafts)
- **Agent verification**: Model checking, runtime verification of agent behavior (relevant to IETF gap "Agent Behavioral Verification")
- **MAS security**: Secure agent platforms, malicious agent detection (relevant to AI safety/alignment drafts)
Key survey references:
- Wooldridge, M. (2009). "An Introduction to MultiAgent Systems" -- Standard textbook covering agent architectures, communication, coordination
- Dorri, A., Kanhere, S.S., Jurdak, R. (2018). "Multi-Agent Systems: A Survey" -- IEEE Access, covering modern MAS challenges
- The AAMAS conference proceedings (annual since 2002) -- primary venue for MAS research
### Other Relevant Standards Bodies
- **OASIS**: TOSCA (Topology and Orchestration Specification for Cloud Applications) and prior work on service-oriented agent architectures
- **ITU-T**: Y.3170 series on machine learning in future networks, relevant to the autonomous netops and ML traffic management categories
- **ETSI**: ENI (Experiential Networked Intelligence) and ZSM (Zero-touch network and Service Management), addressing autonomous network management
---
*This document was created 2026-03-08 in response to scientific and statistical review findings. It should be updated as the analysis pipeline evolves.*


@@ -0,0 +1,186 @@
# Development & Engineering Review
**Reviewer**: Development & Engineering Reviewer (Opus 4.6)
**Date**: 2026-03-08
**Scope**: Full codebase review — `src/ietf_analyzer/`, `src/webui/`, `scripts/`, `data/reports/`
---
## Summary Verdict
The codebase is well-structured for a research/analysis tool. Architecture is clean: Click CLI, SQLite with FTS5, Claude for analysis, Ollama for embeddings, Flask web UI. Code is readable and follows consistent patterns. However, there are several security issues (one critical), a few bugs, and significant testing gaps that should be addressed before any public deployment.
**Overall grade: B+** -- solid for a personal research tool, needs hardening for production.
---
## Critical Issues
### 1. SQL Injection in `db.py:update_generation_run` (CRITICAL)
```python
def update_generation_run(self, run_id: int, **kwargs) -> None:
sets = []
for k, v in kwargs.items():
sets.append(f"{k} = ?") # <-- column name from **kwargs, unvalidated
```
The column names come directly from `**kwargs` and are interpolated into the SQL string without validation. While this is only called internally today, any future caller passing user-controlled keyword arguments creates a SQL injection vector. **Fix**: Whitelist allowed column names against the table schema.
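A sketch of the whitelist fix (the allowed-column set here is illustrative, not the actual `generation_runs` schema):

```python
# Illustrative whitelist; the real set should mirror the generation_runs schema.
ALLOWED_COLUMNS = {"status", "finished_at", "output_path", "error"}

def build_update_sql(run_id: int, **kwargs):
    """Build a parameterized UPDATE, rejecting unknown column names."""
    bad = set(kwargs) - ALLOWED_COLUMNS
    if bad:
        raise ValueError(f"disallowed column(s): {sorted(bad)}")
    sets = ", ".join(f"{k} = ?" for k in kwargs)
    params = [*kwargs.values(), run_id]
    return f"UPDATE generation_runs SET {sets} WHERE id = ?", params
```

Values remain bound as parameters; only the column names, now validated against a fixed set, are interpolated.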
### 2. Hardcoded Flask SECRET_KEY (HIGH)
```python
app.config["SECRET_KEY"] = "ietf-dashboard-dev"
```
In `src/webui/app.py:61`. This is a static, publicly visible secret. While the app currently uses no sessions that depend on signing, Flask's session cookie is signed with this key. If any session-based feature is added (and there's already an auth module), cookies can be forged. **Fix**: Generate from environment variable or `secrets.token_hex()` at startup.
### 3. No Rate Limiting on API Endpoints (MEDIUM)
The `/api/ask/synthesize` and `/api/compare` POST endpoints trigger Claude API calls that cost real money. Even with `@admin_required`, in dev mode (`--dev`), any client can trigger unlimited API calls. **Fix**: Add per-IP or per-session rate limiting, at minimum on the Claude-calling endpoints.
---
## Code Issues
### Bugs
1. **`_extract_json` mishandles closing fences** (`analyzer.py:196-201`): If Claude returns code fences with a language tag (e.g., ` ```json\n{...}\n``` `), the first `split("\n", 1)` correctly strips the opening line, but the check for the closing `` ``` `` uses `endswith`, which fails if there is trailing whitespace after the fence. Minor, but it can cause silent JSON parse failures.
2. **Version string mismatch** (`cli.py:24`): `@click.version_option(version="0.1.0")` but the project is at v0.2.0 per `CLAUDE.md` and memory. Should be kept in sync, ideally from a single source (`__init__.py` or `pyproject.toml`).
3. **`embed_all_missing` never closes Ollama client**: The `Embedder` class creates an `ollama.Client` but has no `close()` method, unlike `Fetcher` and `AuthorNetwork` which properly close their `httpx.Client`. Not a major issue since Ollama connections are typically local, but inconsistent.
4. **`similarity_matrix` is O(n^2) with no caching**: `embeddings.py:102-113` computes the full pairwise matrix every time. For 361 drafts this is ~65K comparisons per call, and this is called by `find_clusters` and the web UI. The web data layer adds a 5-minute TTL cache, but the CLI path has none.
5. **`overlap_matrix` report hardcodes "260x260"**: `cli.py:603` prints `"Computing 260x260 similarity matrix..."` but the actual corpus is 361 drafts. Cosmetic but suggests stale code.
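The fence-handling fix for bug 1 above can be sketched as follows (this is an assumed standalone helper, not the actual `_extract_json` body):

```python
def strip_code_fences(text: str) -> str:
    """Remove a leading ```lang line and a trailing ``` fence,
    tolerating trailing whitespace after the closing fence."""
    cleaned = text.strip()
    if cleaned.startswith("```"):
        # Drop the opening fence line (with or without a language tag).
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        if cleaned.rstrip().endswith("```"):
            cleaned = cleaned.rstrip()[:-3]
    return cleaned.strip()
```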
### Security
6. **`read_generated_draft` path traversal check is good** (`data.py:369-371`): The `resolve()` + `startswith()` guard against directory traversal is correctly implemented. Well done.
7. **FTS5 query injection** (`search.py:97-109`): FTS5 MATCH queries can fail on special characters. The fallback wrapping words in quotes is a reasonable mitigation, but untrusted input containing double quotes could still cause issues. Consider sanitizing with `re.sub(r'[^\w\s]', '', query)` before passing to FTS5.
8. **`draft_detail` route uses `<path:name>` converter** (`app.py:137`): This allows slashes in the draft name URL segment. While the DB lookup parameterizes correctly, the `<path:>` converter should be `<string:>` since draft names don't contain slashes. Using `<path:>` is unnecessarily permissive.
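The sanitization suggested in item 7 is a one-liner (whether to additionally keep the quote-wrapping fallback is a design choice):

```python
import re

def sanitize_fts_query(query: str) -> str:
    """Strip characters with FTS5 operator meaning; keep word chars and spaces."""
    return re.sub(r"[^\w\s]", "", query).strip()
```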
### Performance
9. **`get_drafts_page` loads all rated drafts every time** (`data.py:104`): `db.drafts_with_ratings(limit=1000)` fetches up to 1000 draft+rating pairs into memory, then filters in Python. For 361 drafts this is fine, but the pattern won't scale. More importantly, `compute_readiness` is called per draft on the page (line 161), and each call makes 3-4 separate DB queries. For a page of 50 drafts, that's ~200 DB queries per page load.
10. **`all_embeddings()` loads all vectors into memory** (`db.py:455-460`): For 361 drafts with 768-dim embeddings, this is ~1.1MB -- acceptable. But `_embedding_search` in `search.py:135` calls this on every search query. Should be cached or use a vector similarity index.
11. **`_compute_author_network_full` calls `db.get_draft(dn)` in a loop** (`data.py:621`): For cluster draft lookup, each call is a separate DB query. With 15 drafts per cluster across multiple clusters, this is an N+1 query pattern. Should batch-fetch.
### Code Quality
12. **Excessive boilerplate in CLI commands**: The pattern of `cfg = _get_config(); db = Database(cfg); try: ... finally: db.close()` is repeated ~40 times across the 2995-line `cli.py`. This should be a context manager or Click callback. Example:
```python
from contextlib import contextmanager

@contextmanager
def get_db_context():
    cfg = _get_config()
    db = Database(cfg)
    try:
        yield cfg, db
    finally:
        db.close()
```
13. **`Database` class should be a context manager**: Adding `__enter__`/`__exit__` would eliminate all the `try/finally/close` blocks.
14. **No type hints on `Database` return types for dicts**: Methods like `all_gaps()`, `all_ideas()`, `wg_summary()` return `list[dict]` but the dict structure is undocumented. TypedDict or dataclasses would improve maintainability.
15. **`data.py` imports inside functions**: `compute_readiness` is imported inside `get_drafts_page` and `get_draft_detail` (lines 158, 245). This works but is unusual for a data access layer.
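The context-manager support from item 13 is a three-line addition (sketched here on a stub class, not the real `Database`):

```python
class Database:
    """Sketch: minimal context-manager support for the existing Database class."""

    def close(self) -> None:
        ...  # existing cleanup (close SQLite connection, etc.)

    def __enter__(self) -> "Database":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.close()  # runs even if the with-body raised
```

With this, `with Database(cfg) as db: ...` replaces every `try/finally/close` block.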
---
## Missing Developer Value
1. **No test coverage for the analysis pipeline**: Tests exist for `db.py`, `models.py`, `web_data.py`, and `obsidian_export.py`, but none for `analyzer.py`, `embeddings.py`, `fetcher.py`, `search.py`, `readiness.py`, or `draftgen.py`. The analysis pipeline is the core of the tool and is completely untested.
2. **No CI/CD configuration**: No GitHub Actions, no `Makefile`, no `tox.ini`. For a tool that generates research outputs, reproducibility matters.
3. **No `pyproject.toml` or `setup.py` visible**: The `ietf` CLI entry point is referenced but the packaging config isn't in the reviewed files. The install path is unclear.
4. **No data validation on LLM outputs**: Rating values from Claude are cast with `int(data.get("n", 3))` but never bounds-checked (except in `score_idea_novelty` where `max(1, min(5, int(v)))` is used). Claude could return 0, 6, or -1 and it would be stored.
5. **No error recovery for partial pipeline runs**: If `rate_all_unrated` fails halfway through, there's no way to resume from where it stopped without re-processing already-rated drafts (the cache helps, but isn't guaranteed to hit if prompts change).
---
## Improvement Suggestions
### High Priority
1. **Validate LLM output bounds**: Add `max(1, min(5, ...))` clamping in `_parse_rating` for all rating fields, not just in `score_idea_novelty`.
2. **Whitelist columns in `update_generation_run`**: Replace dynamic column interpolation with an allowed-columns set.
3. **Generate Flask SECRET_KEY at startup**: `app.config["SECRET_KEY"] = os.environ.get("FLASK_SECRET_KEY", secrets.token_hex(32))`.
4. **Add Database context manager**: `def __enter__(self): return self` / `def __exit__(...): self.close()`.
5. **Add tests for analyzer.py**: Mock the Anthropic client and test JSON parsing, rating bounds, cache hit/miss, batch processing.
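The clamping from suggestion 1 mirrors the existing `score_idea_novelty` pattern, with a fallback for non-numeric LLM output (the `default` of 3 is an assumption matching the current `int(data.get("n", 3))` behavior):

```python
def clamp_rating(value, lo: int = 1, hi: int = 5, default: int = 3) -> int:
    """Coerce an LLM-returned rating into [lo, hi]; fall back on junk input."""
    try:
        return max(lo, min(hi, int(value)))
    except (TypeError, ValueError):
        return default
```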
### Medium Priority
6. **Deduplicate CLI boilerplate**: Use a Click group callback or context manager to handle config/db lifecycle.
7. **Add rate limiting**: Use `flask-limiter` or a simple token bucket for Claude-calling endpoints.
8. **Batch readiness computation**: Instead of N+1 queries per page, compute readiness factors in bulk SQL queries.
9. **Cache similarity matrix**: Store precomputed matrix in DB or pickle file, invalidate when embeddings change.
10. **Fix version string**: Single source of truth for version number.
### Low Priority
11. **Add TypedDict for common dict shapes**: `IdeaDict`, `GapDict`, `RatingDict` etc.
12. **Add `--dry-run` to more CLI commands**: Currently only `ideas dedup` supports it.
13. **Add OpenAPI/Swagger docs**: The API endpoints are well-structured and would benefit from auto-generated docs.
14. **Consider async for web UI**: The t-SNE and clustering computations block the Flask request thread. Consider `flask[async]` or background tasks.
---
## Architecture Notes
### What Works Well
- **Separation of concerns**: CLI, DB, analysis, embedding, reporting, and web UI are cleanly separated into modules.
- **LLM caching**: The `llm_cache` table with SHA256 prompt hashing is well-designed and saves significant API costs.
- **Graceful degradation**: The search system falls back from semantic+keyword to keyword-only when Ollama is unavailable.
- **Auth design**: The dev/production mode split is simple and effective. Admin routes return 404 in production (not 403), which is security-correct.
- **FTS5 triggers**: The auto-sync triggers for the full-text search index are correctly implemented and handle INSERT/UPDATE/DELETE.
- **UPSERT patterns**: Consistent use of `INSERT ... ON CONFLICT DO UPDATE` throughout the DB layer.
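The UPSERT pattern praised above, as a self-contained sketch (the table shape is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE drafts (name TEXT PRIMARY KEY, rev INTEGER, title TEXT)")

def upsert_draft(name, rev, title):
    # Insert a new row, or refresh rev/title if the draft already exists.
    conn.execute(
        """INSERT INTO drafts (name, rev, title) VALUES (?, ?, ?)
           ON CONFLICT(name) DO UPDATE SET rev = excluded.rev, title = excluded.title""",
        (name, rev, title),
    )

upsert_draft("draft-example-agents", 0, "Agents v0")
upsert_draft("draft-example-agents", 1, "Agents v1")  # updates in place
```

One statement handles both the first fetch and every re-fetch of a draft, which is why the pattern keeps the DB layer simple.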
### What Could Be Better
- **3000-line `cli.py`**: This single file has grown large. Consider splitting into `cli/fetch.py`, `cli/analyze.py`, `cli/report.py`, etc.
- **Web data layer fetches everything**: Most endpoints call `db.drafts_with_ratings(limit=1000)` and filter in Python rather than using SQL WHERE clauses. This is a pattern that won't scale.
- **No migration system**: Schema changes rely on additive `ALTER TABLE` in `_migrate_schema`. This works for column additions but can't handle schema changes, index additions, or data migrations. A lightweight migration framework (even just numbered SQL files) would be more robust.
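A numbered-migration framework can be very small. A sketch, with the migrations inlined as tuples so the example is self-contained (in practice each entry would be a numbered file such as `001_init.sql`, `002_add_fp.sql`):

```python
import sqlite3

# Inlined here for self-containment; normally loaded from numbered .sql files.
MIGRATIONS = [
    (1, "CREATE TABLE drafts (name TEXT PRIMARY KEY)"),
    (2, "ALTER TABLE drafts ADD COLUMN false_positive INTEGER DEFAULT 0"),
]

def migrate(conn):
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)")
    current = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()[0] or 0
    for version, sql in MIGRATIONS:
        if version > current:
            conn.executescript(sql)
            conn.execute("INSERT INTO schema_version VALUES (?)", (version,))

conn = sqlite3.connect(":memory:")
migrate(conn)
migrate(conn)  # idempotent: already-applied migrations are skipped
```

Unlike additive `ALTER TABLE` in `_migrate_schema`, this handles index additions and data migrations too, since each step is arbitrary SQL.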
---
## Post-by-Post Notes
*(Blog posts in `data/reports/blog-series/` were not the primary focus of this review. Quick technical accuracy check:)*
No blog posts appear to be written yet based on the git status and project memory. The blog series infrastructure is in place but content generation has not started.
---
## Methodology Assessment
The analysis methodology is defensible for exploratory research:
- **Rating**: Using Claude to rate drafts on 5 dimensions is reasonable for landscape analysis. The compact prompt design saves tokens while capturing key attributes. However, the ratings should be presented as "AI-assessed" with appropriate caveats, since a single LLM pass on abstracts may not capture implementation quality.
- **Embeddings**: Using nomic-embed-text for similarity is appropriate. The 0.85 threshold for clustering seems reasonable. The greedy clustering algorithm in `embeddings.py` is simple but may miss transitive similarities (draft A similar to B, B similar to C, but A not directly similar to C).
- **Gap analysis**: The gap identification prompt uses category distributions and idea frequencies as evidence, which is sound. However, the prompt feeds the LLM its own previous outputs (categories, ideas), creating a feedback loop that could amplify biases.
- **Readiness scoring**: The 6-factor composite score in `readiness.py` is well-designed with reasonable weights. The normalization (rev/5, cited/5, etc.) is transparent and defensible.
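The single-linkage chaining concern is easy to demonstrate with a toy version of greedy threshold clustering (a sketch of the general technique, not the project's `find_clusters` code):

```python
def greedy_clusters(items, sim, threshold=0.85):
    """Greedy single-linkage: an item joins a cluster if it is similar to
    ANY current member, so clusters can chain A-B-C even when
    sim(A, C) is far below the threshold."""
    clusters = []
    for item in items:
        placed = False
        for cluster in clusters:
            if any(sim(item, member) >= threshold for member in cluster):
                cluster.append(item)
                placed = True
                break
        if not placed:
            clusters.append([item])
    return clusters

# Toy similarity: A~B and B~C are high, but A~C is low.
PAIRS = {frozenset("AB"): 0.9, frozenset("BC"): 0.9, frozenset("AC"): 0.3}
sim = lambda a, b: PAIRS.get(frozenset((a, b)), 0.0)

clusters = greedy_clusters(["A", "B", "C"], sim)
# A, B, and C land in one cluster despite sim(A, C) = 0.3.
```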

# Legal Review -- German/EU Internet Law Perspective
*Reviewer: Legal Reviewer Agent | Date: 2026-03-08*
*Scope: Blog posts 00-08 in `data/reports/blog-series/`, key reports in `data/reports/`*
---
## Critical Issues
### 1. "Consent" terminology conflation (Posts 3, 6)
The series uses "consent" interchangeably across OAuth authorization flows, GDPR consent (Art. 6(1)(a) GDPR), and human-in-the-loop approval. These are legally distinct concepts:
- **OAuth consent** is a technical authorization flow where a user delegates access scopes to a client.
- **GDPR consent** (Einwilligung) is a legal basis for data processing that must be freely given, specific, informed, and unambiguous (Art. 4(11) GDPR) and is revocable at any time (Art. 7(3) GDPR).
- **HITL approval gates** (as proposed in Post 6) are operational control mechanisms, not consent under any legal framework.
Post 3 discusses 14 OAuth-for-agents proposals without noting that delegated agent authorization raises fundamental GDPR consent validity questions. Under CJEU case law (Planet49, C-673/17), consent requires a clear affirmative act by the data subject. When an AI agent further delegates to sub-agents, the chain of consent may break entirely. None of the blog posts flag this.
**Recommendation**: Add a clarifying footnote in Post 3 that distinguishes OAuth "consent" from GDPR consent, and note that chained delegation in multi-agent systems raises unresolved consent propagation questions under EU data protection law.
### 2. The hospital scenario in Post 4 understates regulatory reality
The opening scenario -- an AI agent managing a hospital drug-dispensing system where a hallucinated dosage cascades without oversight -- is presented as a gap-analysis illustration. Under EU law, this is not merely an engineering gap; it is a regulatory compliance failure in multiple dimensions:
- **EU AI Act (Regulation 2024/1689)**: A drug-dispensing AI agent is a **high-risk AI system**. As a safety component of a product covered by the Medical Devices Regulation (MDR 2017/745), it falls under Art. 6(1) in conjunction with Annex I; deployments as safety components of critical infrastructure are separately covered by Annex III, point 2 (management and operation of critical digital infrastructure, road traffic, and the supply of water, gas, heating and electricity). High-risk systems require conformity assessment, risk management systems, data governance, and human oversight (Arts. 9-14 AI Act).
- **Product Liability Directive (2024/2853)**: The revised PLD explicitly covers software and AI systems. The cascading failure scenario would trigger strict product liability for the AI system provider.
- **German Patientenrechtegesetz / §§ 630a ff. BGB**: The treatment contract (Behandlungsvertrag) places the duty of care on the healthcare provider. Automated dispensing without adequate safeguards violates the standard of care.
The blog post frames this as "what goes wrong if this is never addressed" at the standards level. Legally, it is already addressed at the regulatory level -- the gap is in technical implementation, not in the existence of liability. This distinction matters because readers might infer that absent IETF standards mean absent accountability, which is incorrect under EU law.
**Recommendation**: Add a sentence acknowledging that the EU AI Act already classifies such systems as high-risk and imposes mandatory requirements. The IETF gap is in providing the technical mechanisms to *implement* what the regulation *requires*.
### 3. The gap analysis omits GDPR-mandated requirements entirely
The 12 gaps identified across the series and the `gaps.md` report include "Agent Privacy Preservation" (HIGH severity in the report, mentioned as "privacy-preserving discovery" in Post 5) but do not engage with GDPR as a legally binding framework. The gaps are framed as technical desiderata, not regulatory compliance requirements.
Specific GDPR-mandated capabilities that should appear in the gap analysis but do not:
- **Data Protection Impact Assessment (DPIA) support** (Art. 35 GDPR): High-risk agent processing requires DPIAs. No draft or gap addresses machine-readable DPIA tooling.
- **Right to erasure** (Art. 17 GDPR): When agents process personal data across multi-agent chains, the right to erasure must propagate. The ECT-based DAG model proposed in Post 6 records execution evidence but does not address how to *delete* that evidence when legally required.
- **Data portability** (Art. 20 GDPR): Agent-generated data about individuals must be portable. No gap addresses this.
- **Purpose limitation** (Art. 5(1)(b) GDPR): Agents authorized for one purpose must not repurpose data. The "scope aggregation" OAuth proposals (Post 3) could facilitate purpose creep if not constrained.
**Recommendation**: Add GDPR compliance as a cross-cutting regulatory dimension in Post 6 or the gap analysis. The ECT/DAG model is architecturally promising but needs to account for data deletion, purpose limitation, and DPIA requirements.
---
## Regulatory Gaps
### 1. EU AI Act is mentioned once, in a prediction -- it deserves structural treatment
Post 6 predicts that "within 18 months, the safety deficit will begin to close -- not from IETF drafts but from regulatory pressure. The EU AI Act's requirements for high-risk AI systems will drive demand for behavior verification, human override, and audit standards." This is the only substantive mention of the EU AI Act across 8 blog posts and all reports.
The EU AI Act (Regulation 2024/1689) entered into force on 1 August 2024 and applies in general from 2 August 2026; obligations for high-risk systems embedded in products regulated under Annex I (such as medical devices) phase in until 2 August 2027. It is not a future event; it is current law with imminent enforcement deadlines. Its requirements map directly to several of the series' key findings:
| AI Act Requirement | Corresponding IETF Gap | Blog Post |
|---|---|---|
| Art. 9: Risk management system | Behavior Verification (Critical) | Post 4 |
| Art. 14: Human oversight | Human Override (High) | Posts 4, 6 |
| Art. 12: Record-keeping / logging | Error Recovery, Data Provenance | Posts 4, 5 |
| Art. 13: Transparency | Explainability (Medium) | Post 5 |
| Art. 15: Accuracy, robustness, cybersecurity | Agent Capability Degradation | Report |
| Art. 17: Quality management system | Lifecycle Management (High) | Post 5 |
The series would be significantly strengthened by treating the AI Act not as a future prediction but as a current regulatory driver that makes several of the identified gaps not just technically desirable but legally mandatory.
### 2. eIDAS 2.0 and the European Digital Identity Wallet
The series discusses agent identity extensively (108 drafts, 14 OAuth proposals) but does not mention eIDAS 2.0 (Regulation 2024/1183). The revised eIDAS framework introduces the European Digital Identity Wallet (EUDI Wallet), which will become available to all EU citizens by 2026-2027.
Implications for agent identity standards:
- eIDAS 2.0 defines **electronic attestations of attributes** that could extend to agent attributes and capabilities.
- The **trust framework** in eIDAS 2.0 (qualified trust services, qualified electronic signatures) provides a mature model for the "dynamic trust and reputation" gap identified in Post 4.
- The legal effect of electronic identification under eIDAS (mutual recognition across EU member states) is relevant to "cross-domain security boundaries" -- a problem the IETF drafts approach from a purely technical angle.
The Ericsson/EDHOC work mentioned in Posts 2 and 3 is architecturally adjacent to eIDAS requirements but is never connected to it.
### 3. NIS2 Directive and critical infrastructure
The NIS2 Directive (Directive 2022/2555), applicable from 18 October 2024, imposes cybersecurity risk-management measures and incident reporting obligations on entities in critical sectors. The series discusses autonomous network operations (93 drafts) and telecom agent deployments without mentioning that telecom operators deploying AI agents are NIS2-obligated entities.
The gap analysis scenario of AI agents managing telecom infrastructure during a major outage (Post 4) directly involves NIS2-covered operations. Incident reporting timelines under NIS2 (24-hour early warning, 72-hour notification) interact with the error recovery gap -- if an agent causes or extends an outage, the NIS2 clock starts.
### 4. Cyber Resilience Act (CRA)
The CRA (Regulation 2024/2847) imposes cybersecurity requirements on products with digital elements, including software. Agent protocols and their implementations will fall under CRA obligations regarding vulnerability handling, security updates, and conformity assessment. The series' discussion of "Agent Firmware/Model Update Security" (HIGH gap) maps to CRA requirements but is not framed as a regulatory obligation.
### 5. German Telecom Law (TKG) and AI in network management
The series highlights that Chinese telecom organizations focus heavily on autonomous network operations. For German/EU telecom operators deploying such agent-based network management, § 165 TKG (technical protective measures) and § 168 TKG (incident reporting) impose domestic obligations beyond NIS2. The Bundesnetzagentur has authority to require specific security measures. This is relevant context for the "European telecoms as bridge-builders" narrative in Posts 2 and 5.
---
## Improvement Suggestions
### 1. Add a regulatory context paragraph to Post 1 or Post 4
The series positions the safety deficit as its signature finding. A brief paragraph contextualizing this within the EU regulatory landscape (AI Act, NIS2, CRA, product liability) would make the analysis more actionable for EU-based readers and more legally accurate. The 4:1 safety ratio is not just a community choice; for EU-deployed systems, it is a compliance risk.
### 2. Distinguish IETF IPR policy from open standards
Post 7 describes the tool as "open source" and the database as available. The series discusses IETF drafts without noting the IETF's IPR policy (BCP 79, RFC 8179). IETF participants are required to disclose known IPR claims. For a series advising builders to "watch these drafts" and "design for the DAG," a note about IPR and FRAND licensing would be prudent. Some of the proposed protocols may carry patent claims that affect implementation freedom.
### 3. Frame the geopolitics discussion with care
Post 2 discusses Chinese institutional dominance and "Western absence" in terms that could be read as geopolitical advocacy rather than data-driven observation. Statements like "the standards that will govern how AI agents identify, authenticate, and communicate on the internet are being written by a remarkably narrow group" carry implications.
From a German/EU legal perspective, EU competition law and the EU Foreign Subsidies Regulation (Regulation 2022/2560) provide frameworks for assessing foreign influence in standard-setting. The series would benefit from a brief note that the IETF process is open, consensus-based, and has mechanisms (rough consensus, running code) to mitigate undue influence -- even if the authorship concentration data is concerning.
### 4. Address GDPR implications of agent discovery
Post 5 notes the absence of "privacy-preserving agent discovery" -- that querying for "a medical diagnosis agent" reveals sensitive information. Under GDPR, the query itself could constitute processing of special category data (Art. 9 GDPR, health data). This is not just a gap; it is a legal obstacle to deployment in the EU without privacy-by-design measures. Strengthening this point with a GDPR reference would make it more compelling.
### 5. The "assurance profiles" model should reference EU conformity assessment
Post 6's proposed assurance profiles (L0 through L3) closely parallel the EU AI Act's risk-based approach. Explicitly connecting L2/L3 to EU high-risk AI system requirements would make the architectural proposal more concrete for European audiences and demonstrate that the technical design accounts for regulatory reality.
---
## Post-by-Post Notes
### Post 00 (Series Overview)
- No legal issues. Internal document.
### Post 01 (Gold Rush)
- The claim "AI agents communicating over the internet without agreed-upon identity, security, and interoperability standards is a problem that gets worse every month" is stated as a technical observation. Under EU law, it is also a regulatory compliance problem (AI Act Art. 15, NIS2). Adding this dimension strengthens the claim.
- The 4:1 safety ratio should note that for EU-deployed high-risk systems, this ratio represents potential non-compliance, not merely a community preference.
### Post 02 (Who Writes the Rules)
- The Huawei analysis is data-driven and factual. No legal issues with the presentation.
- The "volume over iteration" section (65% rev-00, pre-meeting submission campaigns) is a legitimate observation about IETF process dynamics. It avoids making claims about intent, which is the correct editorial approach.
- The "Chinese institutional ecosystem" framing is factual but should not be read as implying coordination in the competition-law sense. The IETF is an open forum; coordinated standards participation by companies within a country is normal and lawful.
### Post 03 (OAuth Wars)
- The OAuth cluster analysis is the post most in need of GDPR context. The 14 proposals all address agent authorization, but none addresses the GDPR-specific question: when an agent processes personal data on behalf of a user, what is the legal basis? OAuth delegation is not automatically GDPR-compliant delegation. The controller-processor relationship (Art. 28 GDPR) requires a data processing agreement. None of the drafts described appear to address this.
- The "chained delegation" gap is a GDPR problem as well as a technical one: sub-processors under Art. 28(2)/(4) GDPR require specific or general written authorization from the controller.
### Post 04 (What Nobody Builds)
- The strongest post from a regulatory perspective. The three critical gaps (behavior verification, resource management, error recovery) all map to EU AI Act requirements for high-risk systems.
- The hospital scenario should note that the Medical Devices Regulation (MDR) and the AI Act both apply, and that CE marking for the AI system would require addressing these gaps *before* deployment, not after standards emerge.
- The "4:1 ratio revisited" structural analysis is legally significant: it suggests that the current standards development process may not produce the technical mechanisms needed for EU regulatory compliance within the enforcement timeline (August 2026).
### Post 05 (1262 Ideas / Convergence)
- "Privacy-preserving agent discovery" is identified as absent. This should reference Art. 25 GDPR (data protection by design and by default) as a legal requirement, not just a nice-to-have.
- "Agent cost and billing" -- absent from the corpus -- has implications under the EU's Payment Services Directive (PSD2) and the upcoming PSD3 if agents handle financial transactions.
### Post 06 (Big Picture)
- The "dual regime" (relaxed vs. regulated) framing is excellent and maps well to the AI Act's risk-based approach. The post should make this mapping explicit rather than leaving it implicit.
- The "assurance profiles" proposal (L0-L3) should note that L2/L3 may not be optional for EU deployments -- the AI Act mandates specific technical documentation, logging, and human oversight for high-risk systems. "Dial up" is the wrong metaphor if the law requires maximum assurance.
- The prediction "within 18 months, the safety deficit will begin to close -- not from IETF drafts but from regulatory pressure" should be updated to reflect that the AI Act is already in force and enforcement begins August 2026 -- this is not 18 months away; it is 5 months away at publication time.
- The EU AI Act is not merely "regulatory pressure"; it imposes specific technical requirements with significant penalties (up to 35 million EUR or 7% of global annual turnover under Art. 99).
### Post 07 (How We Built This)
- The description of the analysis pipeline (Claude for analysis, Ollama for embeddings) raises no legal issues, but should note that sending full draft texts to the Claude API involves transmitting potentially IPR-encumbered content to a third-party processor. Under GDPR, this is likely non-personal-data processing and not regulated, but IETF IPR policies (Note Well) could be relevant.
- The "open source" claim for the tool should be paired with a license reference. Under German law (UrhG), software is protected by copyright. Without a stated license, the default is "all rights reserved."
### Post 08 (Agents Building the Analysis)
- The meta-irony section mapping the team's coordination needs to IETF gaps is clever and legally unproblematic.
- The "silent failure" anecdote (Writer's revisions not persisting) is a useful illustration. In a regulated context, this would constitute a failure of the AI Act's Art. 12 logging requirement -- the system reported success while the output was wrong. This parallel could be made explicit.
### State of Ecosystem (Vision Document)
- The three 2027 scenarios and two 2028 equilibria are well-constructed. Scenario A ("fragmentation wins") would be particularly problematic under EU law, as fragmented standards make conformity assessment more expensive and less reliable.
- The "what builders should do today" section advises building human oversight "now, not later." Under the AI Act, this is a legal requirement for high-risk systems, not just engineering advice. Framing it as such would strengthen the recommendation.
---
## Summary of Priority Actions
1. **Post 3**: Add GDPR-aware footnote on OAuth "consent" vs. GDPR consent; note controller-processor implications of chained agent delegation.
2. **Post 4**: Acknowledge that the hospital scenario is already regulated under the AI Act and MDR; the gap is technical implementation, not legal accountability.
3. **Post 6**: Make the AI Act mapping explicit (assurance profiles to conformity assessment); correct the timeline (enforcement begins August 2026, not "18 months").
4. **Cross-series**: Add a brief regulatory context paragraph (1-2 sentences) to Post 1 establishing that the safety deficit has legal implications under EU law, not just engineering ones.
5. **Post 7**: Add open-source license reference; note IETF IPR context for the "watch these drafts" advice.

# Scientific Review -- IETF Draft Analyzer
*Reviewed 2026-03-08 by Scientific Reviewer agent*
---
## Executive Summary
The IETF Draft Analyzer is an ambitious and largely well-executed landscape analysis. The core findings -- the 4:1 capability-to-safety ratio, the fragmentation across 120+ agent-to-agent (A2A) protocol drafts, the dominance of Chinese technology companies -- are supported by the data and would withstand scrutiny from IETF participants. However, the methodology has several significant weaknesses that should be disclosed transparently, and several claims in the blog posts overstate what the data can actually support.
**Overall assessment**: Publishable with revisions. The research is directionally sound but needs (a) clearer methodological caveats, (b) correction of data inconsistencies, and (c) hedging of several definitive claims.
---
## 1. Methodological Issues
### 1.1 LLM-as-Judge Without Calibration (CRITICAL)
The entire rating system relies on Claude (Sonnet) as the sole judge for five dimensions (novelty, maturity, overlap, momentum, relevance) on a 1-5 scale. This is the central methodological weakness.
**Problems:**
- **No inter-rater reliability**: There is no comparison against human expert ratings. Even a small calibration set (20-30 drafts rated by an IETF participant) would substantially strengthen the methodology.
- **No intra-rater consistency check**: The same draft is never rated twice to measure Claude's self-consistency. Prompt hash caching means re-runs return cached results, so actual consistency is untested.
- **Rating prompt is minimal**: The `RATE_PROMPT_COMPACT` gives Claude a draft's abstract (truncated to 2000 chars), name, date, and page count -- but no access to the full text for rating purposes. This means ratings are abstract-based, not content-based. For maturity and overlap scores in particular, the abstract is insufficient.
- **Batch effects**: Batch rating (`BATCH_PROMPT`) processes 5 drafts together. Position effects (first vs. last in batch) and comparison effects (a mediocre draft looks better next to weak ones) are uncontrolled. Abstracts are also truncated more aggressively (1500 chars vs. 2000) in batch mode.
- **Relevance inflation**: The relevance distribution is heavily right-skewed (196 drafts at 4, 98 at 5, only 38 at 1-2). This suggests Claude is generous with relevance for keyword-matched drafts, making the metric less discriminating than it should be. Only 38 of 434 drafts are rated relevance <= 2, despite clear false positives in the corpus (see Section 3.1).
**Recommendation:** Add a "Limitations" section to the methodology post (Post 7) that explicitly states: ratings are LLM-generated from abstracts only, without human calibration. Consider running a calibration study with 5 domain experts rating 25 drafts each.
### 1.2 Idea Extraction Quality is Unknown
The pipeline extracts "1-4 ideas" per draft via LLM, but there is no precision/recall measurement.
**Current state of the data:**
- The database now contains only **419 ideas** across **377 drafts** (1.1 ideas/draft average), with 337 drafts having exactly 1 idea, 38 having 2, and 2 having 3.
- The blog posts reference "1,262 ideas" and "1,780 ideas" -- these numbers are stale and do not match the current database (419).
- The near-uniform "1 idea per draft" distribution (337 of 377 drafts, roughly 89%) suggests the extraction prompt may be over-aggressive in merging, or that the dedup step removed too many.
**Problems:**
- **Recall**: Many substantial drafts probably define more than one novel contribution. A 1-idea-per-draft average is suspiciously low.
- **Precision**: Without ground truth, we cannot know how many extracted "ideas" are restatements of the abstract vs. genuine technical contributions.
- **Batch vs. individual quality**: Batch extraction (using Haiku, abstract-only at 800 chars) produces different results than individual extraction (Sonnet, abstract + 3000 chars of full text). The quality difference is unquantified.
- **Data staleness**: Blog post 5 ("Where 361 Drafts Converge") cites 1,692 unique ideas. The current database has 419. Either the ideas were mass-deleted (via dedup) or regenerated. This needs reconciliation.
**Recommendation:** Run individual extraction on a sample of 30 drafts and compare to batch results. Establish expected ideas-per-draft range by manually analyzing 10 drafts.
### 1.3 Gap Analysis is Single-Shot LLM Generation
The gap analysis is generated by a single Claude call (`GAP_ANALYSIS_PROMPT`) that receives compressed statistics about the landscape (category counts, top ideas, overlap summary). This is essentially asking Claude to brainstorm gaps based on metadata.
**Problems:**
- **No systematic coverage analysis**: A rigorous gap analysis would compare the corpus against a reference taxonomy of what a complete AI agent ecosystem requires. The current approach relies on Claude's general knowledge rather than a structured framework.
- **Overlap summary is circular**: The "overlap_summary" fed to the gap prompt is just the top-5 categories by count with a generic "high internal overlap" label. This does not tell Claude which specific technical areas overlap -- it just restates what the categories are.
- **Evidence quality varies**: Some gap evidence is specific ("only 44 safety/alignment drafts") while others are vague ("lack agent-specific resource protection mechanisms"). The evidence field should cite specific drafts that partially address each gap.
- **Blog post gap list diverges from database**: The gaps.md report lists 12 gaps (from the database), but blog post 04 lists a different set of 12 gaps with different names and severities. It is unclear which gap analysis is canonical.
**Recommendation:** Ground the gap analysis in a reference architecture (e.g., NIST AI RMF, or an explicit agent ecosystem reference model). Cite specific drafts that partially address each gap rather than category-level statistics.
### 1.4 Clustering Methodology is Naive
The `find_clusters` method uses greedy single-linkage clustering at a fixed 0.85 cosine similarity threshold.
**Problems:**
- **Single-linkage effect**: Once a draft joins a cluster, all drafts similar to it (but not necessarily to the seed) join too. This can create "chaining" where semantically distant drafts end up in the same cluster.
- **Threshold not justified**: The 0.85 threshold for "topically overlapping" and 0.90 for "near-duplicates" are not empirically validated. Different embedding models and text representations would produce different similarity distributions.
- **No comparison to baselines**: How does the 42-cluster result at 0.85 compare to, say, k-means or DBSCAN? The absence of comparison makes it impossible to assess whether 42 is "right."
- **Embedding model limitations**: nomic-embed-text is a competent general-purpose embedding model, but it was not trained specifically for technical/standards document similarity. Domain-specific models or fine-tuned embeddings might produce quite different clusters.
**Recommendation:** Report the similarity score distribution (histogram) and explain why 0.85 was chosen. Consider running DBSCAN as a comparison method.
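Reporting the similarity distribution requires no new dependencies. A stdlib-only sketch (toy 2-D vectors stand in for the real embeddings):

```python
from itertools import combinations
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def similarity_histogram(vectors, ndigits=1):
    """Bucket all pairwise cosine similarities (rounded to `ndigits`) so a
    cutoff like 0.85 can be judged against the actual distribution
    instead of being picked blindly."""
    hist = {}
    for a, b in combinations(vectors, 2):
        bucket = round(cosine(a, b), ndigits)
        hist[bucket] = hist.get(bucket, 0) + 1
    return dict(sorted(hist.items()))

hist = similarity_histogram([(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)])
```

If the real distribution shows no natural valley near 0.85, that alone is evidence the threshold needs justification.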
### 1.5 Embedding Input is Inconsistent
`embed_draft` combines title + abstract + first 4000 chars of full text. But 57 drafts in the database have no ideas extracted, and it is unclear whether all drafts have full text downloaded. Drafts embedded with vs. without full text will have systematically different embedding quality, which affects similarity comparisons.
---
## 2. Unsupported Claims
### 2.1 Blog Post 01 ("Gold Rush")
- **"Nearly 1 in 10 new Internet-Drafts is about AI agents"**: The 9.3% figure for Q1 2026 needs a denominator source. Where does the "1,748 total IETF drafts in Q1 2026" come from? This is not from the analyzer's data; it appears to be external. If the figure is correct it is a strong finding, but the source must be cited.
- **"4,231 cross-references"**: This citation analysis is mentioned but the methodology for extracting citations is not described anywhere in the codebase. How were references parsed? Was this a separate analysis?
- **"The acceleration is not gradual. It is a step function that began in mid-2025"**: This is a strong mathematical claim. A step function implies discontinuity. The data shown (9 drafts in 2024, 190 in 2025) is more consistent with exponential growth than a step function. The framing should be: "rapid acceleration" not "step function."
### 2.2 Blog Post 04 ("What Nobody's Building")
- **The hospital drug-dispensing scenario**: This is vivid but ungrounded. No IETF draft addresses medical device agent systems, and the scenario implies current standards failures that have not occurred. The framing should clarify this is a thought experiment about future risks, not a description of current failures.
- **"0 ideas addressing cross-protocol translation"**: This claim depends entirely on the idea extraction quality. If extraction produces only 1 idea per draft (as current data suggests), many relevant technical contributions may simply not be captured.
### 2.3 Blog Post 05 ("1,262 Ideas")
- **The entire post's numbers are stale**: It references 1,692 unique ideas and 1,780 total. The database now has 419. The convergence analysis ("96% appear in exactly one draft") and cross-org analysis ("628 ideas with cross-org validation") need to be re-verified against the current database.
- **"SequenceMatcher at 0.75 threshold"**: This fuzzy matching methodology is mentioned in the blog post but does not appear in the codebase. Where was this analysis performed? If it was a one-off script, it is not reproducible.
### 2.4 Category Counts Are Inflated by Multi-Assignment
The blog post reports "Data formats and interoperability: 145 drafts (40%)" and "A2A protocols: 120 drafts (33%)." Since drafts average 2.37 categories each, many drafts appear in multiple categories. The post does disclose this ("percentages exceed 100%") but the visual effect of listing 10 categories that sum to >> 100% can mislead. The actual number of truly unique-to-category drafts is not reported.
---
## 3. Missing Context
### 3.1 False Positives in the Corpus
The keyword-based search strategy produces false positives that inflate the corpus. Examples confirmed in the database:
- `draft-pan-tsvwg-pie` (PIE bufferbloat algorithm) -- rated relevance 3, which is too high
- `draft-ietf-hpke-hpke` (Hybrid Public Key Encryption) -- rated relevance 5, clearly wrong for an AI/agent analysis
- `draft-ietf-suit-firmware-encryption` (SUIT manifests) -- rated relevance 4
- `draft-eggert-mailmaint-uaautoconf` (email autoconfiguration) -- rated relevance 4
These drafts match keywords like "agent" (in "user agent"), "autonomous," or "intelligent" in ways unrelated to AI agents. The corpus likely contains 30-50 such false positives (the 38 drafts rated relevance <= 2 are the obvious ones, but many false positives are rated 3-4 by the generous LLM judge).
**Impact:** A ~10% false positive rate in the corpus affects all derived statistics. The "361 drafts" (or now 434) figure should be qualified.
**Recommendation:** Implement a relevance filter. Exclude drafts with relevance <= 2 from all analyses. Better yet, manually review the 50 lowest-scored drafts and create an exclusion list.
### 3.2 Missing Literature Context
The analysis would benefit from referencing:
- **FIPA (Foundation for Intelligent Physical Agents)**: The original agent communication standards body. Their ACL (Agent Communication Language) and Agent Platform specifications from 1997-2004 are the direct ancestors of modern A2A protocols. The absence of FIPA from the analysis is a significant gap -- an IETF participant familiar with agent standards history would notice immediately.
- **W3C Web of Things (WoT)**: The WoT Architecture and Thing Description specifications address agent discovery and interoperability in IoT contexts. Several IETF drafts build on or compete with WoT concepts.
- **IEEE P2048 (Standard for VR/AR Agent Interoperability)** and **IEEE P3394 (Standard for Trustworthy AI Agents)**: These are concurrent standardization efforts that the IETF landscape should be compared against.
- **OASIS TOSCA, Open Agent Architecture (OAA)**: Prior art in agent orchestration and service composition.
- **Academic MAS research**: The multi-agent systems community (AAMAS, JAIR, JAAMAS) has decades of work on agent coordination, trust, and verification. The analysis should at minimum reference survey papers on MAS challenges.
### 3.3 Temporal Analysis Gaps
The growth rate claims in Post 01 would be stronger with:
- Comparison to other fast-growing IETF topics (e.g., QUIC, post-quantum crypto)
- Month-by-month submission data rather than annual/quarterly aggregates
- Distinction between individual drafts and WG-adopted drafts (which indicate greater organizational commitment)
### 3.4 Geographic and Organizational Bias
The author analysis reveals Chinese companies (Huawei: 66 drafts, China Mobile: 35, China Telecom: 24, China Unicom: 21) collectively account for ~34% of all drafts. This concentration is noted but its implications are underexplored:
- Is this ratio typical for the IETF, or unusual for this topic area?
- Does this concentration affect which problems get standardized?
- Are there language/translation barriers affecting the quality assessment?
---
## 4. Data Integrity Issues
### 4.1 Category Normalization Incomplete
The database contains both canonical short names and legacy long names for the same categories:
- "A2A protocols" (139 drafts) vs. "Agent-to-agent communication protocols" (16 drafts) -- these should be the same
- "Agent discovery/reg" (75 drafts) vs. "Agent discovery / registration" (14 drafts)
- "Agent identity/auth" (139 drafts) vs. "Identity / authentication for AI agents" (13 drafts)
The `normalize_category` function exists in the code and is applied on read in many places, but the raw database values were never migrated. This means raw SQL queries (like those in reports) may produce incorrect category counts unless normalization is applied.
**Impact:** Category counts cited in reports and blog posts may be inaccurate by 5-15% depending on which code path generated them.
**Recommendation:** Run a one-time migration to normalize all category values in the `ratings` table.
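A minimal migration sketch follows. The mapping shown is a small assumed subset; the full table lives in `normalize_category` in analyzer.py. This also assumes a single `category` text column per rating row; if categories are stored as a JSON array, each array element needs the same mapping instead.

```python
import sqlite3

# Assumed subset of the legacy -> canonical mapping (see normalize_category).
LEGACY_TO_SHORT = {
    "Agent-to-agent communication protocols": "A2A protocols",
    "Agent discovery / registration": "Agent discovery/reg",
    "Identity / authentication for AI agents": "Agent identity/auth",
}

def migrate_categories(conn):
    """One-time migration: rewrite legacy long-form category values in place."""
    with conn:  # single transaction; rolls back if any UPDATE fails
        for legacy, short in LEGACY_TO_SHORT.items():
            conn.execute(
                "UPDATE ratings SET category = ? WHERE category = ?",
                (short, legacy),
            )

# In-memory demo showing the split counts collapsing into one category.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (draft_name TEXT, category TEXT)")
conn.executemany("INSERT INTO ratings VALUES (?, ?)", [
    ("draft-a", "A2A protocols"),
    ("draft-b", "Agent-to-agent communication protocols"),
])
migrate_categories(conn)
print(conn.execute("SELECT DISTINCT category FROM ratings").fetchall())
```

After the migration, raw SQL counts and the normalize-on-read code paths produce the same numbers, eliminating the 5-15% drift.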
### 4.2 Ideas Count Discrepancy
The database has 419 ideas (as of this review). Reports reference 1,262 or 1,780. Either:
- Ideas were mass-deleted via dedup (the `dedup_ideas` function exists with 0.85 threshold)
- The database was regenerated with different parameters
- Multiple idea extraction runs produced different results
This needs to be resolved. If the current 419 ideas are correct (post-dedup), then all blog post statistics about idea counts, convergence, and fragmentation must be updated.
### 4.3 57 Drafts Have No Ideas
57 of 434 drafts have no extracted ideas. If these are legitimately off-topic (false positives that should return empty arrays), this is correct. If they are processing failures, they represent missing data.
### 4.4 Database Grew from 361 to 434
The reports and blog posts reference 361 drafts. The database now contains 434. All published statistics are stale. This is not a methodology issue per se, but any publication should use consistent numbers.
---
## 5. Improvement Suggestions
### 5.1 Add a Calibration Study (HIGH PRIORITY)
Select 25 representative drafts spanning all categories. Have 3-5 domain experts rate them on the same 5 dimensions. Compare against Claude's ratings. Report Spearman correlation, Cohen's kappa, or similar inter-rater metrics. This single addition would transform the methodology from "interesting exploratory analysis" to "validated automated assessment."
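The proposed inter-rater comparison needs nothing beyond rank correlation. A self-contained Spearman implementation (with average ranks for ties) is sketched below; the LLM/expert scores shown are hypothetical calibration data, not drawn from the database.

```python
def spearman(xs, ys):
    """Spearman rank correlation with average ranks assigned to ties."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average 1-based rank for the tie group
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Hypothetical calibration data: LLM vs. expert novelty ratings for 8 drafts.
llm    = [4, 3, 5, 2, 4, 3, 1, 5]
expert = [4, 2, 5, 2, 3, 3, 1, 4]
print(round(spearman(llm, expert), 3))
```

In practice `scipy.stats.spearmanr` would be used instead; the point is that the entire calibration study reduces to collecting the expert scores and running one correlation per dimension.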
### 5.2 Define a Reference Architecture
Create an explicit "ideal agent ecosystem" reference model (identity, discovery, communication, authorization, monitoring, safety, governance, lifecycle). Map every draft and gap against this model. This makes the gap analysis systematic rather than ad hoc.
### 5.3 Report Confidence Intervals
For key statistics (category counts, idea counts, similarity thresholds), report sensitivity analyses. What happens to the gap analysis if the similarity threshold is 0.80 or 0.90 instead of 0.85? What if relevance < 3 drafts are excluded?
### 5.4 Version the Analysis
Timestamp all statistics. When the corpus grows from 361 to 434, make it clear which numbers apply to which version of the analysis. Consider a "snapshot" system: v1 = 260 drafts (Feb 2026), v2 = 361 drafts (Mar 2026), v3 = 434 drafts (current).
### 5.5 Publish the Methodology as Reproducible
The blog posts describe methodology in prose but do not provide enough detail for replication. Consider publishing the prompts, thresholds, and pipeline configuration as a supplementary appendix.
### 5.6 Address Ethical Dimensions
The analysis identifies gaps in safety and governance but does not engage with the ethical dimensions of autonomous agent standardization. Questions worth addressing:
- Should the IETF standardize capabilities before safety mechanisms exist?
- What are the risks of the 4:1 capability-to-safety ratio becoming embedded in standards?
- How does geographic concentration in standards development affect global equity?
---
## 6. Taxonomy & Categorization Assessment
### 6.1 Category Scheme
The 11 categories (`CATEGORIES_SHORT` in analyzer.py) are reasonable but have issues:
- **"Other AI/agent"** is a catch-all that weakens analysis. 34 drafts in this category deserve better classification.
- **"Data formats/interop"** is too broad. At 171 drafts (after normalization), it is the largest category but encompasses everything from YANG models to JSON schemas to COSE signing. Sub-categorization would be more informative.
- **Multi-assignment without weighting**: Drafts receive 2.37 categories on average. A primary/secondary distinction would improve precision.
- **No negative categories**: The system cannot mark a draft as "not about AI agents" -- it can only assign categories from the fixed list. A "false positive / tangentially related" category would help.
### 6.2 Gap Classification
The 4-level severity scale (critical, high, medium, low) is reasonable but the threshold between levels is not defined. What makes a gap "critical" vs. "high"? The current distinction appears to be: critical = safety-related, high = functionality-related, medium = optimization-related. This should be stated explicitly.
---
## 7. Post-by-Post Notes
### Post 00 (Series Overview)
Not reviewed (meta-navigation page).
### Post 01 (Gold Rush)
- Strongest post. Claims are mostly well-supported by data.
- Growth rate table needs source citation for total IETF draft counts.
- "Step function" language is too strong; use "rapid acceleration."
- The 4:1 safety deficit framing is the most compelling finding.
### Post 02 (Who Writes the Rules)
Not reviewed in detail.
### Post 03 (OAuth Wars)
Not reviewed in detail.
### Post 04 (What Nobody's Building)
- Hypothetical scenarios are effective but should be explicitly labeled as projections, not current failures.
- Gap list should match the database gap list. Currently there are discrepancies.
- The "0 ideas addressing cross-protocol translation" claim depends on extraction quality now in question.
### Post 05 (1,262 Ideas)
- **Needs full rewrite with current data.** The idea counts (1,262/1,692/1,780 referenced at various points) do not match the database (419). All convergence and fragmentation statistics derived from idea data are unreliable until reconciled.
- The fuzzy matching methodology (SequenceMatcher at 0.75) is not in the codebase and cannot be verified.
### Post 06 (Big Picture)
Not reviewed in detail.
### Post 07 (How We Built This)
- Should contain the "Limitations" section that currently does not exist anywhere.
- Should document all thresholds and their justifications.
### Post 08 (Meta Post)
Not reviewed in detail.
---
## 8. Summary of Recommendations by Priority
| Priority | Issue | Action |
|----------|-------|--------|
| CRITICAL | Ideas data inconsistency (419 vs 1,262+) | Reconcile database and blog post numbers |
| CRITICAL | No LLM rating calibration | Add calibration study or prominent caveat |
| HIGH | Category normalization incomplete in DB | Run migration script |
| HIGH | False positives in corpus (~30-50 drafts) | Implement relevance filter, manual review |
| HIGH | Missing FIPA/W3C/IEEE context | Add related work section |
| MEDIUM | Clustering methodology naive | Report similarity distribution, compare methods |
| MEDIUM | Gap analysis not grounded in reference arch | Define explicit reference model |
| MEDIUM | Stale numbers (361 vs 434 drafts) | Version all statistics |
| LOW | Ethical dimensions unaddressed | Add section in final post |
| LOW | Batch vs individual extraction quality | Run comparison study |
---
*This review was generated by reading all source code (analyzer.py, embeddings.py, fetcher.py, config.py, db.py, models.py), querying the database directly, and reviewing all reports and blog posts. The goal is to strengthen the analysis for publication, not to diminish the substantial work already done.*

# Statistical Review
Reviewed: 2026-03-08
Reviewer: Statistics & Data Analysis Agent (Claude Opus 4.6)
Scope: All blog posts (00-08), data packages (00-06), master stats, and key reports -- cross-checked against `data/drafts.db` via sqlite3 queries.
---
## Data Integrity Issues
### CRITICAL: Database Has Grown Beyond Blog Series Claims
The blog series consistently claims **361 drafts, 557 authors, 1,780 ideas, and 12 gaps**. The current database contains:
| Metric | Claimed | Actual (DB) | Delta |
|--------|---------|-------------|-------|
| Total drafts | 361 | **434** | +73 (20% more) |
| Total authors | 557 | **557** | Match |
| Total ideas | 1,780 | **419** | **-1,361 (76% fewer)** |
| Total gaps | 12 | **11** | -1 |
| Total ratings | 361 | **434** | +73 |
| Total embeddings | 361 | **434** | +73 |
| Draft-author links | 1,057 | **1,057** | Match |
| LLM cache entries | 703 | **1,397** | +694 |
**Root cause**: The database was updated on 2026-03-07 with a new fetch of 431 drafts, bringing the total to 434. The blog series was written against a snapshot taken around 2026-03-03. The master stats file (`00-master-stats.md`) is dated 2026-03-03 and reflects the 361-draft corpus. However, the blog posts do not carry a "data frozen as of" disclaimer -- they state numbers as absolute facts.
**Recommendation**: Add a clear data freeze date to each blog post header (e.g., "Data current as of 2026-03-03, reflecting 361 of 434 drafts now in the database"). Alternatively, update all posts to reflect the 434-draft corpus.
### CRITICAL: Ideas Count Mismatch (1,780 Claimed vs 419 in DB)
The most serious discrepancy. The `ideas` table contains only **419 rows**, not 1,780. The idea type distribution also diverges sharply:
| Type | Claimed | Actual (DB) |
|------|---------|-------------|
| mechanism | 663 | 68 |
| architecture | 280 | 95 |
| pattern | 251 | 35 |
| protocol | 228 | 96 |
| requirement | 171 | 42 |
| extension | 168 | 79 |
| framework | 9 | 3 |
| format | -- | 1 |
| other | 10 | -- |
Only 377 of 434 drafts have any ideas extracted. The 1,780 figure may come from a prior pipeline run whose results were overwritten, or from an in-memory analysis that was not persisted. Either way, the blog series' core claims about "1,780 ideas," "96% appear in only one draft," "628 cross-org convergent ideas (43% of 1,467 clusters)," and the entire idea taxonomy are **not reproducible from the current database**.
**Recommendation**: Re-run idea extraction to populate the database, or clearly note that the 1,780 figure comes from a specific pipeline run that is no longer reflected in the DB. This is the single most important data integrity issue -- Post 5's entire thesis rests on these numbers.
### HIGH: Gap Count and Topics Differ
The DB has **11 gaps**, not 12. The gap topics in the database are:
1. Multi-Agent Consensus Protocols
2. Agent Behavioral Verification
3. Cross-Protocol Agent Migration
4. Real-Time Agent Rollback Mechanisms
5. Agent Resource Accounting and Billing
6. Federated Agent Learning Privacy
7. Agent Capability Negotiation
8. Cross-Domain Agent Audit Trails
9. Agent Failure Cascade Prevention
10. Human Override Standardization
11. Agent Performance Benchmarking
The blog series lists different gap topics (e.g., "Agent Resource Exhaustion Protection" vs DB's "Agent Resource Accounting and Billing"; "Agent Error Recovery and Rollback" vs "Real-Time Agent Rollback Mechanisms"). Post 4's gap list appears to be a curated/rewritten version. The blog's 12-gap list includes "Cross-Protocol Translation" and "Agent Data Provenance" which do not appear as named gaps in the DB.
**Recommendation**: Reconcile the gap list. Either the DB was re-run and lost a gap, or the blog presents an edited version. If the latter, this should be acknowledged as editorial synthesis rather than raw pipeline output.
### HIGH: Composite Rating Calculations Inconsistent
Multiple scoring methodologies are used without disclosure:
| Draft | Blog Score | 5-dim Composite (DB) | 4-dim (excl overlap) |
|-------|-----------|----------------------|----------------------|
| draft-aylward-daap-v2 | 4.8 (Post 1) | 4.0 | 4.75 |
| draft-cowles-volt | 4.8 (Post 1) | 4.0 | 4.75 |
| draft-guy-bary-stamp-protocol | 4.6 (Post 1) | 3.8 | 4.5 |
| draft-drake-email-tpm-attestation | 4.6 (Post 1) | 3.8 | 4.5 |
Post 1 claims DAAP and VOLT scored "4.8" -- this matches neither the 5-dimension composite (4.0) nor the 4-dimension composite excluding overlap (4.75). The master stats correctly uses 4.75 for the same drafts. Post 1 appears to round up (4.75 -> 4.8, 4.5 -> 4.6), which inflates perceived quality.
The "average score" also varies: Post 1 says "3.38/5.0", the master stats say "3.32" (novelty average), the DB 5-dim average is 3.13, and the 4-dim average is 3.27.
**Recommendation**: Pick one composite calculation, document it, and use it consistently. The 4-dim composite (excluding overlap, since overlap measures redundancy rather than quality) is defensible, but the rounding from 4.75 to 4.8 is not. Use exact values.
### MEDIUM: Monthly Draft Counts Differ Between Sources
The master stats growth curve and the actual DB monthly counts diverge:
| Month | Master Stats | Actual DB |
|-------|-------------|-----------|
| 2024-01 | 3 | **7** |
| 2024-02 | 1 | **3** |
| 2024-04 | 1 | **6** |
| 2024-09 | 2 | **11** |
| 2025-10 | 67 | **61** |
| 2025-11 | 61 | **53** |
| 2026-01 | 54 | **51** |
| 2026-02 | 86 | **85** |
| 2026-03 | 22 | **56** |
The master stats show a total of 361 across all months; the DB shows 434. Some of this is explained by the 73 new drafts fetched after the data freeze, but the per-month figures for 2024 are also significantly different (suggesting earlier months got new drafts from the keyword expansion that are counted differently).
The "43x acceleration" claim (from ~2/mo to 86/mo) uses the lowest trough and highest peak, which is cherry-picking. A more honest measure would compare rolling averages.
### MEDIUM: Huawei Draft Counts Vary Across Posts
| Source | Huawei Drafts | Huawei Authors |
|--------|-------------|----------------|
| Post 1 | 66 | 53 |
| Post 2 | 66 | 53 |
| Data Package 02 | "~60+ unique" | "~40+ unique" |
| Master Stats | 57+ | 28+ |
| Actual DB (all Huawei entities) | **69 unique drafts** | multiple entities |
| DB "Huawei" entity only | 39 | 32 |
The consolidation of Huawei sub-entities (Huawei, Huawei Technologies, Huawei Canada, Huawei Singapore, etc.) is done informally. The blog confidently states "53 authors, 66 drafts" but the data package says "~60+ unique drafts, ~40+ unique authors (some overlap)." The actual DB shows 69 unique drafts across all Huawei-named affiliations. The author count depends entirely on deduplication, which is described as "hand-curated" with "40+ mappings."
**Recommendation**: Document the exact normalization rules used to arrive at "53 authors, 66 drafts" and make them reproducible.
---
## Methodological Concerns
### Sampling Bias
The dataset is keyword-filtered (12 keywords across draft names and abstracts). Multiple posts draw sweeping conclusions about "the IETF's AI agent landscape" without sufficient caveats about what this filter captures and misses.
Specific concerns:
- Post 1 claims "nearly 1 in 10 new Internet-Drafts is about AI agents" (9.3%). This figure depends on the denominator (total IETF drafts per year) which is stated but not sourced. Where do the numbers 1,651 (2024) and 2,696 (2025) come from? Are they verifiable?
- The keyword "intelligent" likely captures many non-agent-related drafts about intelligent networking, QoS, etc. The keyword "autonomous" captures autonomous systems (AS) networking drafts. No false-positive analysis is presented.
- Post 7 mentions "~90% accuracy" from spot-checking 50 drafts but provides no breakdown of error types, no inter-rater reliability, and no details on the spot-check methodology.
### Rating Methodology (LLM-as-Judge)
The 1-5 rating scale scored by Claude is presented with minimal caveats in the blog posts. Key issues:
1. **No inter-rater reliability**: The same LLM rated all drafts. No human baseline or second-model comparison is provided.
2. **Abstract-only analysis**: Post 7 acknowledges switching from full-text to abstract-only analysis for ratings, claiming "equivalent ratings." No evidence is presented for this equivalence claim.
3. **Overlap dimension ambiguity**: The "overlap" dimension measures redundancy with other drafts, but since the LLM rates each draft independently, it cannot know the full corpus. The overlap score likely reflects the LLM's general knowledge of the field, not corpus-specific similarity.
4. **Score compression**: All ratings are on a 1-5 scale with integer values only. The max composite (5-dim) is 4.2 and the min is 1.8. The effective range is narrow, making distinctions between drafts less meaningful than the precise decimal composites suggest.
### Clustering and Similarity
- The 0.85 and 0.90 cosine similarity thresholds for overlap clusters are stated but not justified. What threshold sensitivity analysis was performed?
- The "25+ near-duplicate pairs at 0.98" claim is used to argue for deduplication to "roughly 300 distinct proposals" -- but 25 duplicate pairs would reduce the count by at most 25, not 61.
- The SequenceMatcher threshold (0.75) for fuzzy idea matching is stated but not validated. How many false positives does this produce?
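The false-positive risk at a 0.75 character-level threshold is easy to demonstrate: two generically titled ideas describing different mechanisms clear the bar comfortably. (The titles below are invented for illustration.)

```python
from difflib import SequenceMatcher

a = "Agent Communication Framework"
b = "Agent Coordination Framework"

# Character-level similarity, as SequenceMatcher computes it.
ratio = SequenceMatcher(None, a, b).ratio()
print(f"{ratio:.2f}")  # well above 0.75, despite naming different concepts
```

A validation pass would sample matched pairs near the threshold and have a human judge whether the underlying ideas are actually the same.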
### Cross-Org Convergence (628 Ideas)
The 628 cross-org convergent ideas figure is the blog series' lead metric for Post 5. However:
- The methodology (SequenceMatcher at 0.75 threshold across organizational boundaries) is described but the underlying data is not in the DB (only 419 ideas exist).
- No precision/recall analysis is presented. At a 0.75 sequence match threshold, generic titles like "Agent Communication Framework" will match across many drafts regardless of actual technical similarity.
- The claim "43% of unique idea clusters have cross-org validation" depends on the denominator (1,467 unique clusters), which itself depends on the 1,780 raw count that is not reproducible from the DB.
---
## Misleading Claims
### 1. "4:1 Safety Deficit" Ratio
This ratio is presented as the series' signature metric, but its calculation shifts:
- Master stats says "~8:1 capability-to-safety" (after keyword expansion)
- Data package 01 says the safety ratio "improved from 4:1 due to keyword expansion"
- Posts 1-6 consistently use "4:1" as the headline
- Data package 06 says "45 safety drafts vs 316 capability drafts = 7:1"
- The deep analysis shows monthly ratios from 1.5:1 to 21:1
The blog presents "4:1" as a stable finding when the data shows it varies from 1.5:1 to 21:1 depending on the time period and from 4:1 to 8:1 depending on whether keyword-expansion drafts are included. The ratio also depends on multi-labeling: a draft tagged as both "A2A protocols" and "AI safety" counts as both capability and safety.
**Recommendation**: Present the ratio with ranges and context, not as a single stable number. The monthly trend data (Task #24) is more informative than any single ratio.
### 2. "36x Growth"
Post 1 claims "36x growth: 2 drafts/month (Jun 2025) to 72 drafts/month (Feb 2026)." The series overview says the same. But:
- Jun 2025 actually had 5 drafts (per DB), not 2
- Feb 2026 had 85 (per DB), not 72 or 86
- Picking the lowest month and highest month inflates the multiplier
- A rolling 3-month average would show more modest but still impressive growth
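The rolling-average alternative is a one-liner to compute. The monthly counts below are illustrative, not the verified DB series; the point is how much the multiplier shrinks once endpoints stop being cherry-picked.

```python
def rolling_mean(series, window=3):
    """Trailing rolling average over a monthly count series."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]

# Illustrative monthly draft counts (not the verified DB series).
monthly = [5, 9, 14, 22, 61, 53, 40, 51, 85]

smoothed = rolling_mean(monthly)
growth = smoothed[-1] / smoothed[0]
print(f"{growth:.1f}x")  # -> 6.3x; raw endpoints (85/5) would claim 17x
```

Still impressive growth, but an honest multiplier rather than a trough-to-peak artifact.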
### 3. "96% of Ideas Appear in Exactly One Draft"
This is presented as evidence of extreme fragmentation. However:
- The idea extraction pipeline produces ~5 ideas per draft by design
- Many extracted "ideas" are draft-specific component descriptions, not standalone proposals
- Post 5 acknowledges this ("most are draft-specific component descriptions") but still leads with the 96% figure as a shock stat
- The true fragmentation question is whether the *problems being solved* are unique, not whether the *component labels* are unique
### 4. "120 A2A Protocol Drafts"
The category count depends on how "A2A protocols" is defined. The master stats say 136 A2A protocol drafts, but the blog posts use 120. Some posts say "136" (Post 4's gap data package), while others say "120" (Posts 1, 3, 4 text). The inconsistency appears to stem from the category count changing between pipeline runs.
### 5. Causal Language
Several claims use causal framing where only correlation exists:
- "The safety deficit is structural, not attitudinal" (Post 4) -- this is an interpretation, not a finding
- "Gap severity correlates with coordination difficulty" (Post 4) -- stated as found, but the correlation is between two human-assigned ordinal variables (severity levels assigned by Claude, coordination difficulty assessed by the Architect) with N=12 data points
- "The organizations doing the most drafting are focused on capability; the organizations doing the best safety work are doing the least drafting" (Post 2) -- the causal implication is that volume and safety focus are inversely related, but this could simply reflect different organizational missions
---
## Improvement Suggestions
### 1. Add a Data Provenance Section
Each blog post should include a brief provenance note: data freeze date, pipeline version, exact query or command used to generate each key number. This would make claims verifiable.
### 2. Standardize the Composite Score
Choose one formula (recommend: 4-dimension excluding overlap, or 5-dimension with clear labeling) and use exact values (not rounded). Document the formula in Post 7 and use it consistently.
### 3. Validate Idea Extraction
Re-run idea extraction to ensure the DB reflects the claimed 1,780 ideas. If the pipeline was run differently (e.g., with a different prompt or batching strategy), document the exact parameters.
### 4. Add Confidence Intervals
For claims like "4:1 ratio," show the range across different time periods and calculation methods. For trend claims, show the underlying monthly data rather than cherry-picked endpoints.
### 5. Acknowledge LLM-as-Judge Limitations Prominently
Post 7 mentions LLM validation briefly. The rating methodology should include:
- A caveat in every post that uses ratings
- A note that overlap scores are based on LLM general knowledge, not corpus comparison
- Acknowledgment that abstract-only analysis may miss important content
### 6. De-duplicate Before Counting
The "361 drafts" count includes known near-duplicates. The blog acknowledges "probably closer to 300 distinct proposals" (Post 3) but continues using 361 everywhere. Either de-duplicate and use the lower number, or present both with context.
---
## Post-by-Post Notes
### Post 00 (Series Overview)
- Internal architecture document; numbers are consistent with master stats (361/557/628/12). No issues as an internal document.
### Post 01 (Gold Rush)
- **Score inflation**: DAAP cited as 4.8, actual 4-dim composite is 4.75, 5-dim is 4.0. STAMP cited as 4.6, actual is 4.5/3.8. VOLT cited as 4.8, actual is 4.75/4.0.
- **Category table inconsistency**: The post lists "Data formats and interoperability: 145" as the top category, but the master stats show "A2A protocols: 136" as the top. The post appears to use a different category set than the master stats.
- **Growth figure**: "36x growth" -- cherry-picked from lowest to highest month.
- **"0.5% to 9.3%"**: The denominator (total IETF drafts) is stated but unsourced. The 9.3% figure assumes 1,748 total drafts in Q1 2026 -- where does this come from?
- Average score stated as "3.38" -- does not match any DB calculation (5-dim avg: 3.13; 4-dim avg: 3.27; novelty avg: 3.27).
- **"~1,700 technical ideas"**: Post says "roughly 1,700" in one place; DB has 419.
### Post 02 (Who Writes the Rules)
- Huawei "53 authors, 66 drafts" is stated with confidence but data package says "~60+" with caveats about entity dedup. DB shows 69 unique drafts across Huawei entities.
- "65% are at rev-00" for Huawei -- this figure is for "Huawei" entity only (57 drafts), not the combined 66/69. The denominator matters.
- "43 were submitted in the four weeks before IETF 121" -- data package says "43 of 69 across all entities." The blog says "43" out of Huawei's "66" implying 65%, vs data package's "62% of 69."
- "115 (23%) co-author with people from both Chinese and Western organizations" -- not verifiable from current DB without running the centrality analysis.
- Ericsson's "4.8 average revision" claim (line 149) matches the data package's Ericsson figure of 4.8 -- this one appears correct.
### Post 03 (OAuth Wars)
- The 14-draft OAuth list is well-documented with individual scores.
- Score for DAAP is listed as 4.8 but 4-dim composite is 4.75. Other scores in the table appear to be individual dimension values or different calculations (e.g., STAMP at 4.6 vs 4.5 4-dim).
- The data package actually lists 15 OAuth-related drafts (including draft-mw-spice-actor-chain and draft-gaikwad-south-authorization), but the blog says 14. The blog's list of 14 differs slightly from the data package's 15.
- "25+ near-duplicate pairs" leading to "roughly 300 distinct proposals" is a logical leap. 25 duplicate pairs reduce the count by 25 (one from each pair), yielding 336, not "roughly 300."
### Post 04 (What Nobody Builds)
- Gap count: 12 in blog vs 11 in DB. Gap names differ from DB.
- "Ideas Addressing It" column (52, 117, 6, 0, 90, 5, 4, 10, 5, 26, 5, 79) -- these numbers cannot be verified because the ideas table has only 419 rows, not 1,780. With 419 ideas, these per-gap counts are implausible (they sum to 399, nearly the entire ideas table).
- "Only 6 extracted ideas address [error recovery], and all come from a single draft" -- this is a strong claim. With only 419 ideas in the DB, 6 ideas from one draft is plausible, but the DB has no gap-to-idea mapping table to verify.
- "12 (8.8%) of 136 A2A drafts also address safety" -- this requires the categories JSON field in the drafts table. Not independently verified but plausible.
- "Safety has zero co-occurrence with agent discovery/registration and zero co-occurrence with model serving/inference" -- sourced from deep analysis task #27, which is plausible but not verifiable from current DB without re-running the co-occurrence analysis.
### Post 05 (1,262 Ideas / Where Drafts Converge)
- Title references "1262" in the filename but post content uses 1,780, 1,692, and 628. The filename appears to be from the pre-expansion dataset.
- "1,692 unique technical ideas" -- the DB has 419 ideas. This is the largest disconnect in the entire series.
- "Only 75 show up in two or more drafts" -- not verifiable from current DB.
- "628 ideas where different organizations are working on recognizably similar problems" -- the central claim of the post, not verifiable from current DB.
- The idea taxonomy table (mechanism: 663, architecture: 280, etc.) does not match DB (mechanism: 68, architecture: 95, etc.). Both the counts and the rank order differ.
- The convergence table (A2A Communication Paradigm: 8 orgs, etc.) is not verifiable.
### Post 06 (Big Picture)
- Synthesis post; numbers are drawn from prior posts. Inherits all issues.
- "36 (10%) have been adopted by IETF working groups" -- based on naming convention (`draft-ietf-*`). This could be verified with a query but depends on the 361-draft corpus.
- "WG-adopted drafts score higher on average (3.54 vs. 3.31)" -- this uses 4-dim composite, which is consistent with the rest of the 4-dim usage but not labeled as such.
- "75 cross-draft convergent ideas (628 via fuzzy matching)" -- the parenthetical mixing of two very different numbers is confusing. 75 is exact-title matches; 628 is fuzzy cross-org. These are different metrics measuring different things.
### Post 07 (How We Built This)
- **Database table sizes**: Claims 361 drafts, 1,780 ideas, 557 authors, 1,057 draft_authors, 4,231 draft_refs, 12 gaps, 703 llm_cache. DB now shows 434/419/557/1,057/4,231/11/1,397. Only authors, draft_authors, and draft_refs match.
- **"43 CLI commands"**: Not verified but seems high. The source code would need to be checked.
- **Cost figures**: "$3.16 for 260 drafts" and total "~$9" are stated without supporting evidence (no token count logs in the DB). Not falsifiable but also not verifiable.
- **"15 report types"**: Not verified.
- Describes rating as "1-5 scale" which matches the DB (max 5, not 10 as the reviewer checklist suggests).
### Post 08 (Agents Building the Analysis)
- Meta post about the process. Numbers reference those from other posts, inheriting their issues.
- "20+ SQL queries" and "7 data packages" -- plausible but not independently verifiable.
- "30 dev-journal entries" -- could be verified by reading dev-journal.md.
- The cost table sums to "~$9" but the individual line items sum to ~$9.00 (2.50+5.50+0.80+0.20 = 9.00). Consistent.
### State of Ecosystem (Vision Document)
- "36x increase" -- same cherry-picking issue as Post 1.
- Uses "72" drafts/month for Feb 2026 (differs from other sources: 86 in master stats, 85 in DB).
- Otherwise consistent with other posts.
### Master Stats (00-master-stats.md)
- **Gap count**: Lists 12 gaps with different names than DB's 11.
- **Idea count**: 1,780 -- does not match DB's 419.
- **Draft count**: 361 -- does not match DB's 434 (but was correct at data freeze date).
- **Composite scores**: Uses 4-dim composite and gets 4.75 for top drafts -- correct for 4-dim, but unlabeled as such.
- **Category distribution**: Uses different category names/counts than the blog posts in some cases (e.g., master stats: "A2A protocols: 136" vs Post 1: "A2A protocols: 120").
---
## Summary of Findings
**Most Serious Issues** (would undermine credibility if published):
1. Ideas count (1,780 claimed, 419 in DB) -- the foundation for Post 5's thesis is not reproducible
2. Composite score inflation (4.75 rounded to 4.8) and inconsistent calculation methods
3. Gap count (12 vs 11) and topic naming mismatches
**Important Issues** (should be fixed before publication):
4. Draft count stale (361 vs 434)
5. "4:1 ratio" is not stable -- varies 1.5:1 to 21:1 by month
6. "36x growth" cherry-picks endpoints
7. Category counts inconsistent between posts and master stats
**Minor Issues** (polish):
8. Huawei entity deduplication is informal
9. LLM-as-judge caveats are insufficient
10. No false-positive analysis for keyword filtering
11. The "25 duplicate pairs -> roughly 300" arithmetic does not work
**What Holds Up Well**:
- RFC cross-reference counts (4,231) match exactly
- Draft-author link count (1,057) matches exactly
- Author count (557) matches exactly
- The qualitative patterns (Huawei dominance, safety deficit, fragmentation) are directionally sound even if specific numbers vary
- The geopolitical analysis and team bloc detection methodology are well-described
- The cost analysis (~$9 total) is internally consistent

# Verified Database Counts
**Source**: `data/drafts.db` -- queried 2026-03-08
**Purpose**: Single source of truth for all counts, replacing inconsistent numbers across blog posts and reports.
---
## Core Tables
| Table | Count | Notes |
|-------|-------|-------|
| drafts | 434 | Up from 361 after 2026-03-07 fetch |
| ratings | 434 | 1:1 with drafts |
| authors | 557 | Unique persons from Datatracker |
| ideas | 419 | See "Ideas Count History" below |
| gaps | 11 | Not 12 -- see gap list below |
| embeddings | 434 | 1:1 with drafts |
| draft_authors | 1,057 | Draft-author links |
| llm_cache | 1,397 | Cached API calls |
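The table above can be reproduced with a straightforward row-count query per table. A minimal sketch, using an in-memory database with the same table names for illustration; the real run would open `data/drafts.db` instead:

```python
import sqlite3

def table_counts(conn, tables):
    """Return {table: row_count} for the given tables."""
    counts = {}
    for t in tables:
        # Table names come from a fixed list, never from user input,
        # so interpolating them into the query is safe here.
        counts[t] = conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
    return counts

# Demo against an in-memory DB with two of the tables listed above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE drafts (name TEXT)")
conn.execute("CREATE TABLE ratings (draft TEXT, relevance INTEGER)")
conn.executemany("INSERT INTO drafts VALUES (?)", [("draft-a",), ("draft-b",)])
conn.execute("INSERT INTO ratings VALUES ('draft-a', 4)")
print(table_counts(conn, ["drafts", "ratings"]))  # {'drafts': 2, 'ratings': 1}
```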
## False Positive Analysis
73 drafts flagged as `false_positive = 1` in ratings table (new column added 2026-03-08).
| Criteria | Count |
|----------|-------|
| Relevance <= 2 (auto-flagged) | 38 |
| Relevance 3+ but clearly not AI-agent (manually reviewed) | 35 |
| **Total false positives** | **73** |
| **Drafts excluding false positives** | **361** |
### Relevance Score Distribution (all 434 drafts)
| Relevance | Count |
|-----------|-------|
| 1 | 2 |
| 2 | 36 |
| 3 | 102 |
| 4 | 196 |
| 5 | 98 |
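The auto-flag half of the rule is mechanical and can be checked directly against the distribution above (the 35 manual flags at relevance 3+ are judgment calls and not reproducible in code). A sketch; the helper name `auto_flag` is illustrative, not the pipeline's actual function:

```python
def auto_flag(relevance, manually_flagged=False):
    """A draft is a false positive if relevance <= 2 (auto rule)
    or it was flagged during manual review at relevance 3+."""
    return relevance <= 2 or manually_flagged

# Applied to the distribution above: 2 drafts at relevance 1
# and 36 at relevance 2 are auto-flagged.
distribution = [(1, 2), (2, 36), (3, 102), (4, 196), (5, 98)]
auto = sum(count for rel, count in distribution if rel <= 2)
print(auto)  # 38
```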
## Category Counts (excluding false positives)
All categories normalized to short-form names (21 legacy long-form entries migrated 2026-03-08).
| Category | Count |
|----------|-------|
| Data formats/interop | 146 |
| A2A protocols | 146 |
| Agent identity/auth | 127 |
| Autonomous netops | 103 |
| Policy/governance | 97 |
| Agent discovery/reg | 82 |
| ML traffic mgmt | 77 |
| AI safety/alignment | 44 |
| Model serving/inference | 42 |
| Human-agent interaction | 33 |
| Other AI/agent | 18 |
Note: These counts sum to 915 category assignments, i.e. drafts average ~2.5 categories each, so the column total exceeds the 361 relevant drafts.
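As a quick consistency check on the table, the multi-label total and the per-draft average fall out of the listed counts directly:

```python
# Category counts copied from the table above (excluding false positives).
category_counts = {
    "Data formats/interop": 146, "A2A protocols": 146,
    "Agent identity/auth": 127, "Autonomous netops": 103,
    "Policy/governance": 97, "Agent discovery/reg": 82,
    "ML traffic mgmt": 77, "AI safety/alignment": 44,
    "Model serving/inference": 42, "Human-agent interaction": 33,
    "Other AI/agent": 18,
}
total_assignments = sum(category_counts.values())
print(total_assignments)                  # 915
print(round(total_assignments / 361, 2))  # 2.53 categories per draft
```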
## Gap List (11 gaps, not 12)
| ID | Topic | Severity | Category |
|----|-------|----------|----------|
| 37 | Multi-Agent Consensus Protocols | high | A2A protocols |
| 38 | Agent Behavioral Verification | critical | AI safety/alignment |
| 39 | Cross-Protocol Agent Migration | medium | Agent discovery/reg |
| 40 | Real-Time Agent Rollback Mechanisms | high | Autonomous netops |
| 41 | Agent Resource Accounting and Billing | medium | new |
| 42 | Federated Agent Learning Privacy | high | Policy/governance |
| 43 | Agent Capability Negotiation | medium | A2A protocols |
| 44 | Cross-Domain Agent Audit Trails | high | Agent identity/auth |
| 45 | Agent Failure Cascade Prevention | critical | AI safety/alignment |
| 46 | Human Override Standardization | high | Human-agent interaction |
| 47 | Agent Performance Benchmarking | medium | new |
Blog posts reference 12 gaps with different names (e.g., "Agent Resource Exhaustion Protection" vs DB's "Agent Resource Accounting and Billing"). The blog list appears to be an editorial rewrite, not raw pipeline output. The missing 12th gap may be "Cross-Protocol Translation" or "Agent Data Provenance" which appear in blog posts but not in the database.
## Ideas Count History
The database currently contains **419 ideas** across **377 drafts**. This is the third different count encountered:
| Source | Count | Date | Likely Explanation |
|--------|-------|------|-------------------|
| Blog post 5 filename | 1,262 | ~2026-03-03 | Pre-expansion dataset (260 drafts), before dedup |
| Blog post 5 text / master stats | 1,780 | ~2026-03-05 | Post-expansion (361 drafts), before dedup |
| Current database | 419 | 2026-03-08 | After `dedup_ideas` run (0.85 threshold) or re-extraction with different params |
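The drop from 1,780 to 419 is consistent with a similarity-based dedup pass. The actual `dedup_ideas` implementation was not inspected; the following is a plausible greedy sketch of what a 0.85-cosine-threshold dedup over idea embeddings looks like, with made-up vectors for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def dedup_ideas(ideas, threshold=0.85):
    """Greedy dedup: keep an idea only if its embedding stays below
    `threshold` cosine similarity to every idea already kept."""
    kept = []
    for text, vec in ideas:
        if all(cosine(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((text, vec))
    return [text for text, _ in kept]

ideas = [
    ("agent capability negotiation", [1.0, 0.0, 0.1]),
    ("negotiating agent capabilities", [0.98, 0.05, 0.12]),  # near-duplicate
    ("audit trails for agents", [0.0, 1.0, 0.0]),
]
print(dedup_ideas(ideas))  # the near-duplicate is dropped
```

A greedy pass like this is order-dependent, which would also explain why re-runs with different extraction order could land on slightly different final counts.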
### Ideas by Type (current DB)
| Type | Count |
|------|-------|
| protocol | 96 |
| architecture | 95 |
| extension | 79 |
| mechanism | 68 |
| requirement | 42 |
| pattern | 35 |
| framework | 3 |
| format | 1 |
### Ideas per Draft Distribution
| Ideas/Draft | Drafts |
|-------------|--------|
| 1 | 337 |
| 2 | 38 |
| 3 | 2 |
| 0 (no ideas) | 57 |
The near-uniform 1-idea-per-draft (89% of drafts with ideas) suggests either aggressive dedup or a re-extraction with constrained output. The original pipeline extracted 1-4 ideas per draft, so the 1,780 figure likely reflects pre-dedup counts.
Excluding false positives: 365 ideas across 326 drafts.
## Actions Taken (2026-03-08)
1. **Category normalization**: Updated 21 ratings rows from legacy long-form category names to canonical short forms. All 11 categories now consistent.
2. **False positive flagging**: Added `false_positive` column to ratings table. Flagged 73 drafts (38 with relevance <= 2, 35 manually reviewed at relevance 3+).
3. **Schema migration**: Updated `db.py` schema and migration code to include `false_positive` column.
4. **This document**: Created as single source of truth for counts.
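Items 2 and 3 together amount to an idempotent SQLite migration. A minimal sketch of that shape, assuming the column and table names documented above (the actual `db.py` migration code may differ):

```python
import sqlite3

def migrate_add_false_positive(conn):
    """Idempotent migration: add the false_positive column to ratings
    if missing, then auto-flag rows with relevance <= 2."""
    cols = [row[1] for row in conn.execute("PRAGMA table_info(ratings)")]
    if "false_positive" not in cols:
        conn.execute(
            "ALTER TABLE ratings ADD COLUMN false_positive INTEGER NOT NULL DEFAULT 0"
        )
        conn.execute("UPDATE ratings SET false_positive = 1 WHERE relevance <= 2")
        conn.commit()

# Demo against an in-memory DB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (draft TEXT, relevance INTEGER)")
conn.executemany("INSERT INTO ratings VALUES (?, ?)",
                 [("a", 1), ("b", 2), ("c", 4)])
migrate_add_false_positive(conn)
migrate_add_false_positive(conn)  # safe to run twice
flagged = conn.execute(
    "SELECT COUNT(*) FROM ratings WHERE false_positive = 1"
).fetchone()[0]
print(flagged)  # 2
```

Guarding on `PRAGMA table_info` keeps the migration safe to re-run; note that only the auto-flag rule is expressible here, since the 35 manually reviewed flags have no SQL-derivable criterion.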