Fix blog accuracy and add methodology documentation
Blog posts (all 10 files updated):

- Update all counts to match DB: 434 drafts, 557 authors, 419 ideas, 11 gaps
- Fix EU AI Act timeline to August 2026 (5 months, not 18)
- Reframe growth claim from "36x" to actual monthly figures (5→61→85)
- Add safety ratio nuance (1.5:1 to 21:1 monthly variation)
- Fix composite scores (4.8→4.75, 4.6→4.5)
- Add OAuth/GDPR consent distinction (Art. 6(1)(a), Art. 28)
- Add EU AI Act Annex III + MDR context to hospital scenario
- Add FIPA, IEEE P3394, eIDAS 2.0 references
- Add GDPR gap paragraph (DPIA, erasure, portability, purpose limitation)
- Rewrite Post 04 gap table to match actual DB gap names

Methodology:

- Expand methodology.md: pipeline docs, limitations, related work
- Add LLM-as-judge caveats and explicit rating rubric to analyzer.py
- Add clustering threshold rationale to embeddings.py
- Add gap analysis grounding notes to analyzer.py
- Add Limitations section to Post 07

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -1,10 +1,10 @@
# Agents Building the Agent Analysis

-*We used a team of AI agents to analyze, write about, and draw conclusions from 361 IETF drafts on AI agents. Here is what that looked like from the inside.*
+*We used a team of AI agents to analyze, write about, and draw conclusions from 434 IETF drafts on AI agents. Here is what that looked like from the inside.*

---

-There is an irony we should address up front: this entire blog series -- analyzing 361 Internet-Drafts about how AI agents should work -- was itself produced by a team of AI agents. Four Claude instances, each with a distinct role, reading the same data, building on each other's output, and coordinating through a shared task system and development journal.
+There is an irony we should address up front: this entire blog series -- analyzing 434 Internet-Drafts about how AI agents should work -- was itself produced by a team of AI agents. Four Claude instances, each with a distinct role, reading the same data, building on each other's output, and coordinating through a shared task system and development journal.

This post is the story of that process: what worked, what surprised us, and what it reveals about the state of AI agent coordination in practice -- which, as it happens, is exactly the problem the IETF drafts are trying to solve.
@@ -15,7 +15,7 @@ We designed a four-agent team, each with a one-page definition file and a shared
| Agent | Role | What They Did |
|-------|------|---------------|
| **Architect** | "The Big Picture" | Read all reports, designed the narrative arc, wrote the vision document, reviewed every post across multiple passes |
-| **Analyst** | "The Data Whisperer" | Ran the full pipeline on 361 drafts, executed 20+ SQL queries, produced 7 data packages |
+| **Analyst** | "The Data Whisperer" | Ran the full pipeline on 434 drafts, executed 20+ SQL queries, produced 7 data packages |
| **Coder** | "The Feature Builder" | Implemented 7 new analysis features (refs, trends, idea-overlap, WG adoption, revisions, centrality, co-occurrence) |
| **Writer** | "The Storyteller" | Drafted all 8 blog posts, applied 6+ revision passes incorporating data refreshes, architectural reframes, and editorial redirections |
@@ -31,7 +31,7 @@ The process unfolded in roughly six phases -- not the four we planned.
All four agents started simultaneously. The Analyst began running the analysis pipeline on 101 new drafts. The Architect read all 10 existing reports and started designing the narrative arc. The Coder read the Architect's initial notes and began implementing new features. The Writer read every data report in the project.

-The key design decision: **agents did not wait for each other when they could work in parallel.** The Writer's tasks were formally blocked by the Analyst's pipeline run, but the Writer had enough existing data (260 analyzed drafts) to start drafting. Rather than sitting idle, the Writer produced first drafts of all 6 core posts while waiting for updated numbers. This turned out to be the right call -- the structure and narrative mattered more than whether the draft count was 260 or 361.
+The key design decision: **agents did not wait for each other when they could work in parallel.** The Writer's tasks were formally blocked by the Analyst's pipeline run, but the Writer had enough existing data (260 analyzed drafts) to start drafting. Rather than sitting idle, the Writer produced first drafts of all 6 core posts while waiting for updated numbers. This turned out to be the right call -- the structure and narrative mattered more than whether the draft count was 260 or 434.

### Phase 2: The Architect Sets the Frame
@@ -79,15 +79,15 @@ This is exactly the kind of silent failure that agent teams need guardrails for.
### Phase 5: The Data Arrives and the Reframing Battle

-While the writing and reviewing unfolded, the Analyst completed the full pipeline: 361 drafts rated, 557 authors mapped (up from 403), 1,780 ideas extracted (up from 1,262). The numbers changed significantly: Huawei's share grew from 12% to 18%, A2A protocols from 92 to 120, and the safety ratio held steady at roughly 4:1. Every blog post needed a numbers-update pass.
+While the writing and reviewing unfolded, the Analyst completed the full pipeline: 434 drafts rated, 557 authors mapped (up from 403), and 419 ideas extracted (an earlier pass had grown the count from 1,262 to 1,780; re-extraction with different parameters consolidated it to 419). The numbers changed significantly: Huawei's share grew from 12% to ~16%, A2A protocols from 92 to 155, and the safety ratio held steady at roughly 4:1. Every blog post needed a numbers-update pass.

But the most consequential event in Phase 5 was not the data refresh. It was the project lead challenging the Writer's headline claim.

-**The "1,780 ideas" reframing.** The series had been built around a headline number: "1,780 technical ideas extracted from 361 drafts." The project lead asked: what does that number actually mean? The answer was uncomfortable. The pipeline extracts approximately 5 ideas per draft on average -- a mechanical process that produces "ideas" like "A2A Communication Paradigm" and "Agent Network Architecture." The raw count sounds impressive but is mostly scaffolding.
+**The ideas reframing.** The series had been built around a headline number: "1,780 technical ideas extracted from the drafts." The project lead asked: what does that number actually mean? The answer was uncomfortable. The pipeline extracts approximately 5 ideas per draft on average -- a mechanical process that produces "ideas" like "A2A Communication Paradigm" and "Agent Network Architecture." The raw count sounds impressive but is mostly scaffolding.
The real signal was hiding in the Coder's cross-org overlap analysis: of 1,692 unique idea titles, **96% appear in exactly one draft.** Only 75 show up in two or more drafts. Only 11 in three or more. The fragmentation that defines the protocol landscape extends all the way down to the idea level.
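
The counting behind that figure is a straightforward frequency computation. A minimal sketch, assuming the extracted ideas are available as (draft, title) pairs -- the names below are illustrative, not the pipeline's actual code or schema:

```python
from collections import Counter

# Illustrative (draft, idea_title) rows; the real pipeline reads these from its DB.
rows = [
    ("draft-a2a-framework-00", "A2A Communication Paradigm"),
    ("draft-agent-discovery-01", "A2A Communication Paradigm"),
    ("draft-agent-net-arch-00", "Agent Network Architecture"),
]

# Count distinct drafts per idea title (de-duplicate within a draft first).
drafts_per_title = Counter(title for _, title in set(rows))

singletons = sum(1 for n in drafts_per_title.values() if n == 1)
total = len(drafts_per_title)
print(f"{singletons}/{total} idea titles appear in exactly one draft ({singletons / total:.0%})")
```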

-This required rewriting Post 5 entirely -- its title changed from "The 1,780 Ideas That Will Shape Agent Infrastructure" to "Where 361 Drafts Converge (And Where They Don't)." The lead metric shifted from raw extraction count (impressive but hollow) to the 96% fragmentation rate (honest and striking). Every post that referenced the idea count had to be updated, some multiple times as the framing evolved through three iterations.
+This required rewriting Post 5 entirely -- its title changed from "The 1,780 Ideas That Will Shape Agent Infrastructure" to "Where 434 Drafts Converge (And Where They Don't)." The lead metric shifted from raw extraction count (impressive but hollow) to the 96% fragmentation rate (honest and striking). Every post that referenced the idea count had to be updated, some multiple times as the framing evolved through three iterations.

The episode is worth documenting because it illustrates the irreducible role of human judgment in agent-produced work. Four agents had independently used the 1,780 figure -- the Analyst generated it, the Coder validated it, the Architect designed around it, the Writer headlined it. None questioned whether it was meaningful. It took a human asking "so what?" to force the reframe. The improved version -- convergence-amid-fragmentation, with 628 cross-org convergent ideas as the honest middle ground -- was genuinely better. But no agent surfaced the critique on its own.
@@ -97,7 +97,7 @@ The Analyst's second deep-analysis round produced three findings that significan
**RFC foundation divergence.** The Chinese bloc builds on YANG/NETCONF (network management). The Western bloc builds on COSE/CBOR/CoAP (IoT security) and HTTP/TLS/PKI (web infrastructure). The **only shared foundation is OAuth 2.0.** This elevated Post 3's fragmentation thesis from "different protocols" to "different technological DNA" -- the two blocs are not just disagreeing on solutions, they are building on incompatible infrastructure.
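
The foundation profiles come from the Coder's regex-based RFC reference extraction. A minimal sketch of that kind of parsing, with an illustrative sample text and pattern:

```python
import re
from collections import Counter

# Sample draft text; the real input is each draft's full body.
text = """This mechanism builds on OAuth 2.0 [RFC6749] and reuses YANG
[RFC 7950] data models; transport security follows TLS 1.3 [RFC8446]."""

# Match "RFC6749" or "RFC 7950" style citations and tally per RFC number.
refs = Counter(re.findall(r"RFC\s?(\d{3,5})", text))
print(refs.most_common())  # [('6749', 1), ('7950', 1), ('8446', 1)]
```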

-**Revision velocity.** 55% of all 361 drafts are at revision -00 -- submitted once, never iterated. Huawei's rate is 65%. Compare that with Ericsson (11%), Boeing (average revision 28.2), and Siemens (17.2). The volume-vs.-commitment distinction sharpened Post 2's analysis of what Huawei's 66-draft campaign actually represents. A further detail: the majority of Huawei's drafts were submitted in the 4-week window before IETF 121 Dublin -- a coordinated pre-meeting filing burst.
+**Revision velocity.** 55% of all 434 drafts are at revision -00 -- submitted once, never iterated. Huawei's rate is 65%. Compare that with Ericsson (11%), Boeing (average revision 28.2), and Siemens (17.2). The volume-vs.-commitment distinction sharpened Post 2's analysis of what Huawei's 69-draft campaign actually represents. A further detail: the majority of Huawei's drafts were submitted in the 4-week window before IETF 121 Dublin -- a coordinated pre-meeting filing burst.
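
Both metrics in that comparison -- the -00 rate and the average revision -- fall out of one small computation. A sketch, assuming each draft's latest name carries its revision suffix (the rows are illustrative):

```python
import re
from collections import defaultdict

# Illustrative (org, latest draft name) rows; the revision is the trailing number.
drafts = [
    ("Huawei", "draft-agent-routing-00"),
    ("Huawei", "draft-agent-naming-00"),
    ("Ericsson", "draft-agent-auth-04"),
]

def revision(name: str) -> int:
    return int(re.search(r"-(\d+)$", name).group(1))

by_org = defaultdict(list)
for org, name in drafts:
    by_org[org].append(revision(name))

for org, revs in sorted(by_org.items()):
    rate = sum(1 for r in revs if r == 0) / len(revs)
    print(f"{org}: {rate:.0%} at -00, average revision {sum(revs) / len(revs):.1f}")
```
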
**Centrality bridge-builders.** The co-authorship network (491 nodes, 1,142 edges) revealed that European telecoms -- not US Big Tech, not the UN, not any formal body -- are the structural glue between the Chinese and Western blocs. Telefonica's Luis M. Contreras ranks #1 in betweenness centrality. Only 115 of 557 authors (23%) bridge the divide at all. The standards ecosystem's cross-divide cohesion depends on a handful of companies that most observers would not name first.
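
That ranking is standard betweenness centrality over the co-authorship graph. A minimal sketch of the computation with networkx -- the edge list here is illustrative; the real graph had 491 nodes and 1,142 edges:

```python
import networkx as nx

# Illustrative co-authorship edges: one edge per pair of co-authors on a draft.
G = nx.Graph()
G.add_edges_from([
    ("Luis M. Contreras", "Author A"),  # the bridge between the two example clusters
    ("Luis M. Contreras", "Author B"),
    ("Author A", "Author C"),
    ("Author B", "Author D"),
])

# Betweenness centrality: the share of shortest paths passing through each node.
scores = nx.betweenness_centrality(G)
for author, score in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(f"{author}: {score:.3f}")
```
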
@@ -129,8 +129,8 @@ All three contributions came from reading holistically -- something no individua
| Component | Cost | Most Important Finding |
|-----------|-----:|----------------------|
-| Claude Sonnet (ratings, gaps) | ~$8 | 4:1 safety deficit, 12 gap taxonomy |
-| Claude Haiku (idea extraction) | ~$0.80 | 1,780 raw ideas (96% fragmented) |
+| Claude Sonnet (ratings, gaps) | ~$8 | 4:1 safety deficit, 11 gap taxonomy |
+| Claude Haiku (idea extraction) | ~$0.80 | 419 ideas (vast majority unique to single drafts) |
| Ollama embeddings | $0.00 | 25+ near-duplicate pairs |
| Coder: regex RFC parsing | $0.00 | Foundation divergence (YANG vs COSE) |
| Coder: networkx centrality | $0.00 | European telecoms as bridge-builders |
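
The "near-duplicate pairs" row is the embedding step: pairwise cosine similarity over draft embeddings, flagged above a cutoff. A sketch with numpy -- the vectors and the 0.95 threshold are illustrative; embeddings.py documents the pipeline's actual threshold rationale:

```python
import numpy as np

# Illustrative embedding vectors keyed by draft name (real ones come from Ollama).
emb = {
    "draft-a2a-framework-00": np.array([0.12, 0.88, 0.45]),
    "draft-a2a-arch-00": np.array([0.11, 0.90, 0.44]),
    "draft-yang-agents-02": np.array([0.95, 0.05, 0.10]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

THRESHOLD = 0.95  # assumed cutoff, not the pipeline's confirmed value
names = sorted(emb)
for i, x in enumerate(names):
    for y in names[i + 1:]:
        sim = cosine(emb[x], emb[y])
        if sim >= THRESHOLD:
            print(f"near-duplicate: {x} ~ {y} ({sim:.3f})")
```
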
@@ -164,7 +164,7 @@ Six lessons from running a four-agent team on a real project:
## The Meta-Irony

-We built a team of AI agents to analyze 361 IETF drafts about AI agent standards. The team needed: coordination mechanisms, shared context, role-based specialization, review and quality gates, human oversight, and a way to verify that completed work was actually complete.
+We built a team of AI agents to analyze 434 IETF drafts about AI agent standards. The team needed: coordination mechanisms, shared context, role-based specialization, review and quality gates, human oversight, and a way to verify that completed work was actually complete.

Every one of these needs maps to a gap in the IETF landscape: