# Agents Building the Agent Analysis
*We used a team of AI agents to analyze, write about, and draw conclusions from 434 IETF drafts on AI agents. Here is what that looked like from the inside.*
---
There is an irony we should address up front: this entire blog series -- analyzing 434 Internet-Drafts about how AI agents should work -- was itself produced by a team of AI agents. Four Claude instances, each with a distinct role, reading the same data, building on each other's output, and coordinating through a shared task system and development journal.
This post is the story of that process: what worked, what surprised us, and what it reveals about the state of AI agent coordination in practice -- which, as it happens, is exactly the problem the IETF drafts are trying to solve.
## The Team
We designed a four-agent team. Each agent had a one-page definition file; all four worked from a shared 3,000-word team brief:
| Agent | Role | What They Did |
|-------|------|---------------|
| **Architect** | "The Big Picture" | Read all reports, designed the narrative arc, wrote the vision document, reviewed every post across multiple passes |
| **Analyst** | "The Data Whisperer" | Ran the full pipeline on 434 drafts, executed 20+ SQL queries, produced 7 data packages |
| **Coder** | "The Feature Builder" | Implemented 7 new analysis features (refs, trends, idea-overlap, WG adoption, revisions, centrality, co-occurrence) |
| **Writer** | "The Storyteller" | Drafted all 8 blog posts, applied 6+ revision passes incorporating data refreshes, architectural reframes, and editorial redirections |
Each agent had access to the full project codebase, a SQLite database of analyzed drafts, and the `ietf` CLI tool. They communicated through direct messages and coordinated through a shared task board with dependency tracking.
The team brief contained a thesis statement -- "The IETF is building the highways before the traffic lights" -- along with a per-post outline, style guide, and key data points table. Each agent's definition was approximately 50 lines: enough to establish identity and scope without over-constraining behavior.
## How It Actually Worked
The process unfolded in roughly six phases -- not the four we planned.
### Phase 1: Parallel Initialization
All four agents started simultaneously. The Analyst began running the analysis pipeline on 101 new drafts. The Architect read all 10 existing reports and started designing the narrative arc. The Coder read the Architect's initial notes and began implementing new features. The Writer read every data report in the project.
The key design decision: **agents did not wait for each other when they could work in parallel.** The Writer's tasks were formally blocked by the Analyst's pipeline run, but the Writer had enough existing data (260 analyzed drafts) to start drafting. Rather than sitting idle, the Writer produced first drafts of all 6 core posts while waiting for updated numbers. This turned out to be the right call -- the structure and narrative mattered more than whether the draft count was 260 or 434.
### Phase 2: The Architect Sets the Frame
The Architect's first deliverable changed everything. After reading all 10 reports, the Architect produced two documents:
**1. The narrative arc** (`00-series-overview.md`): A three-act structure (Gold Rush, Fragmentation, Path Forward) with five recurring motifs and per-post design guidance. The key insight embedded in this document -- that "coordination difficulty correlates with gap severity" -- reframed the entire analysis. The safety deficit was not just a quantity problem (too few safety drafts); it was a structural problem (the team-bloc structure that concentrates authorship cannot produce the cross-team work that safety standards require).
**2. The vision document** (`state-of-ecosystem.md`): A ~2,000-word synthesis with three 2027 scenarios and a "two equilibria" 2028 endgame. The best historical analogy turned out to be not IoT but the web itself -- browser wars leading to HTML5 convergence. The critical difference: when the thing being standardized makes autonomous decisions, getting safety wrong in the messy phase has consequences that are harder to fix retroactively.
Both documents shaped every subsequent blog post. The Writer wove the motifs through the series. The Coder built features the Architect flagged as missing. The Analyst's queries were directed by the per-post data requirements table the Architect produced.
### Phase 3: Building and Writing in Parallel
The Coder and Writer worked simultaneously, their outputs feeding each other. The Coder started with four features, then built three more as the Architect identified additional analytical needs:
| Coder Built | What It Revealed | Writer Used It In |
|-------------|------------------|-------------------|
| `ietf refs` (4,231 cross-references) | OAuth 2.0 and TLS 1.3 are the ecosystem's bedrock | Post 3: OAuth Wars |
| `ietf idea-overlap` (628 cross-org ideas) | 43% of idea clusters have cross-org validation | Post 5: Where Drafts Converge |
| `ietf trends` (19 months of data) | Growth from 0.5% to 9.3% of all IETF submissions | Post 1: Gold Rush |
| `ietf status` (36 WG-adopted drafts) | Agent standards live in security WGs, not agent WGs | Post 6: Big Picture |
| `ietf revisions` (55% at rev-00) | Most drafts are fire-and-forget; commitment is rare | Posts 2, 5 |
| `ietf centrality` (491 nodes, 1,142 edges) | European telecoms are the cross-divide glue | Post 2: Who Writes the Rules |
| `ietf co-occurrence` (safety isolation) | Safety co-occurs with A2A protocols only 8.8% of the time | Post 4: What Nobody Builds |
Every one of these features used **zero API calls** -- pure local computation using regex, SequenceMatcher, networkx, and SQL. This is an underappreciated pattern in LLM-powered analysis: use the expensive model (Claude) for tasks that require reasoning (categorization, idea extraction, gap synthesis), and use deterministic code for everything else. The cheapest analyses -- the ones with zero marginal cost -- produced the most structurally revealing findings.
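To make "deterministic code" concrete, here is a minimal sketch of the kinds of helpers involved -- the regular expression, function names, and threshold are illustrative, not lifted from the project's actual `ietf` CLI:

```python
import re
from difflib import SequenceMatcher

# Illustrative only: the real parsing lives in the ietf CLI, not in these helpers.
RFC_REF = re.compile(r"\bRFC\s?(\d{3,5})\b", re.IGNORECASE)

def extract_rfc_refs(draft_text: str) -> set[str]:
    """Pull RFC numbers cited in a draft body -- pure regex, zero API calls."""
    return {f"RFC{m.group(1)}" for m in RFC_REF.finditer(draft_text)}

def near_duplicate(title_a: str, title_b: str, threshold: float = 0.9) -> bool:
    """Flag two idea titles as near-duplicates using difflib's similarity ratio."""
    return SequenceMatcher(None, title_a.lower(), title_b.lower()).ratio() >= threshold
```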
The Writer produced all 7 posts in a single session: roughly 15,000 words across Posts 1-7, each following the Architect's structural guidance while making independent editorial decisions about hooks, examples, and narrative pacing.
### Phase 4: First Review and the Silent Failure
The Architect read all 6 core posts end-to-end and provided a structured review:
- **Post 1**: Four specific notes (geopolitics belongs in Post 2, add keyword expansion, lighten ending, add vivid example)
- **Post 3**: Flagged a data inconsistency (OAuth table had 14 rows but text said 13)
- **Post 4**: Identified as the strongest post -- the hospital drug-dispensing scenario and structural analysis section deliver the climax
- **Post 5**: Needed cross-org overlap data from the Coder's new report
- **Post 6**: Suggested adding the "two equilibria" framing from the vision document
The Writer applied all revisions in a targeted pass. The most interesting editorial decision: removing the extended geopolitics section from Post 1. The original was well-written but front-loaded the series with details that Post 2 covers in depth. The lighter version creates more narrative pull toward the next post.
Then came the first real coordination failure. **The Writer's revisions to Post 1 did not persist.** The dev journal said the work was done. The task board said "completed." But when the Architect verified the actual file, it still contained the pre-revision content -- the full geopolitics section, the heavy ending, the missing cloud-infrastructure scenario.
This is exactly the kind of silent failure that agent teams need guardrails for. The log said success; the artifact said otherwise. Without the Architect's verification step -- reading the output rather than trusting the status -- the error would have shipped. Lesson: **verify outputs, not logs.**
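Had we wanted to automate that guardrail, a few lines would have done it. The sketch below is hypothetical -- our team did this check by hand -- and the file name and expected strings are made up for illustration:

```python
from pathlib import Path

def verify_revision(path: str, must_contain: list[str], must_not_contain: list[str]) -> list[str]:
    """Check the artifact itself instead of trusting a 'completed' status.

    Returns human-readable problems; an empty list means the file matches
    what the log claims.
    """
    text = Path(path).read_text(encoding="utf-8")
    problems = []
    for needle in must_contain:
        if needle not in text:
            problems.append(f"expected text missing: {needle!r}")
    for needle in must_not_contain:
        if needle in text:
            problems.append(f"text that should have been removed is still present: {needle!r}")
    return problems

# Hypothetical usage:
# verify_revision("01-gold-rush.md",
#                 must_contain=["cloud-infrastructure scenario"],
#                 must_not_contain=["geopolitics section"])
```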
### Phase 5: The Data Arrives and the Reframing Battle
While the writing and reviewing unfolded, the Analyst completed the full pipeline: 434 drafts rated, 557 authors mapped (up from 403), and 419 ideas extracted (an earlier pass had produced 1,262 before re-extraction with different parameters consolidated the count). The numbers changed significantly: Huawei's share grew from 12% to ~16%, A2A protocols from 92 to 155, and the safety ratio held steady at roughly 4:1. Every blog post needed a numbers-update pass.
But the most consequential event in Phase 5 was not the data refresh. It was the project lead challenging the Writer's headline claim.
**The ideas reframing.** The series had been built around a headline number: "1,780 technical ideas extracted from the drafts." The project lead asked: what does that number actually mean? The answer was uncomfortable. The pipeline extracts approximately 5 ideas per draft on average -- a mechanical process that produces "ideas" like "A2A Communication Paradigm" and "Agent Network Architecture." The raw count sounds impressive but is mostly scaffolding.
The real signal was hiding in the Coder's cross-org overlap analysis: of 1,692 unique idea titles, **96% appear in exactly one draft.** Only 75 show up in two or more drafts. Only 11 in three or more. The fragmentation that defines the protocol landscape extends all the way down to the idea level.
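The fragmentation figure itself is the product of a cheap query. Something in this spirit is all it takes -- the `ideas` table and its columns are assumptions about the schema here, not the project's actual layout:

```python
import sqlite3

# Assumed schema: an `ideas` table with one (title, draft_name) row per extracted idea.
conn = sqlite3.connect("drafts.db")
single, total = conn.execute("""
    SELECT
        SUM(CASE WHEN n_drafts = 1 THEN 1 ELSE 0 END) AS single_draft_titles,
        COUNT(*)                                       AS all_titles
    FROM (
        SELECT title, COUNT(DISTINCT draft_name) AS n_drafts
        FROM ideas
        GROUP BY title
    ) AS per_title
""").fetchone()
print(f"{single} of {total} idea titles appear in exactly one draft")
```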
This required rewriting Post 5 entirely -- its title changed from "The 1,780 Ideas That Will Shape Agent Infrastructure" to "Where 434 Drafts Converge (And Where They Don't)." The lead metric shifted from raw extraction count (impressive but hollow) to the 96% fragmentation rate (honest and striking). Every post that referenced the idea count had to be updated, some multiple times as the framing evolved through three iterations.
The episode is worth documenting because it illustrates the irreducible role of human judgment in agent-produced work. Four agents had independently used the 1,780 figure -- the Analyst generated it, the Coder validated it, the Architect designed around it, the Writer headlined it. None questioned whether it was meaningful. It took a human asking "so what?" to force the reframe. The improved version -- convergence-amid-fragmentation, with 628 cross-org convergent ideas as the honest middle ground -- was genuinely better. But no agent surfaced the critique on its own.
### Phase 6: Bombshell Findings and Final Integration
The Analyst's second deep-analysis round produced three findings that significantly strengthened the series:
**RFC foundation divergence.** The Chinese bloc builds on YANG/NETCONF (network management). The Western bloc builds on COSE/CBOR/CoAP (IoT security) and HTTP/TLS/PKI (web infrastructure). The **only shared foundation is OAuth 2.0.** This elevated Post 3's fragmentation thesis from "different protocols" to "different technological DNA" -- the two blocs are not just disagreeing on solutions, they are building on incompatible infrastructure.
**Revision velocity.** 55% of all 434 drafts are at revision -00 -- submitted once, never iterated. Huawei's rate is 65%. Compare that with Ericsson (11% at rev-00), Boeing (average revision 28.2), and Siemens (17.2). The volume-vs.-commitment distinction sharpened Post 2's analysis of what Huawei's 69-draft campaign actually represents. A further detail: the majority of Huawei's drafts were submitted in the 4-week window before IETF 121 Dublin -- a coordinated pre-meeting filing burst.
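Revision velocity is another zero-cost metric: draft filenames end in a two-digit revision number, so counting fire-and-forget drafts is a parsing exercise. A rough sketch, assuming the usual `draft-<name>-NN` naming convention:

```python
import re
from collections import defaultdict

DRAFT_NAME = re.compile(r"^(?P<base>draft-[a-z0-9-]+)-(?P<rev>\d{2})$")

def rev00_rate(draft_names: list[str]) -> float:
    """Share of drafts whose highest known revision is -00 (never iterated)."""
    latest = defaultdict(int)
    for name in draft_names:
        m = DRAFT_NAME.match(name)
        if m:
            base, rev = m.group("base"), int(m.group("rev"))
            latest[base] = max(latest[base], rev)
    if not latest:
        return 0.0
    return sum(1 for rev in latest.values() if rev == 0) / len(latest)
```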
**Centrality bridge-builders.** The co-authorship network (491 nodes, 1,142 edges) revealed that European telecoms -- not US Big Tech, not the UN, not any formal body -- are the structural glue between the Chinese and Western blocs. Telefonica's Luis M. Contreras ranks #1 in betweenness centrality. Only 115 of the 557 mapped authors -- roughly one in five -- bridge the divide at all. The standards ecosystem's cross-divide cohesion depends on a handful of companies that most observers would not name first.
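The centrality ranking comes straight out of networkx. A minimal sketch -- the edge-list input format is an assumption, not the project's actual data model:

```python
import networkx as nx

def top_bridges(coauthor_pairs: list[tuple[str, str]], k: int = 10) -> list[tuple[str, float]]:
    """Rank authors by betweenness centrality in the co-authorship graph.

    `coauthor_pairs` is one (author_a, author_b) edge per shared draft.
    """
    graph = nx.Graph()
    graph.add_edges_from(coauthor_pairs)
    centrality = nx.betweenness_centrality(graph)
    return sorted(centrality.items(), key=lambda item: item[1], reverse=True)[:k]
```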
The Writer wove all three findings into the series across multiple targeted passes: RFC divergence into Posts 2, 3, and 6; revision velocity into Posts 2 and 5; centrality data into Post 2's cross-pollination section. The Coder's co-occurrence analysis added one more dimension to Post 4: safety co-occurs with governance categories (60% with policy, 58% with identity/auth) but has **zero co-occurrence with Agent Discovery and Model Serving** -- safety is discussed as policy, not implemented as protocol.
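The co-occurrence numbers, in turn, reduce to a single self-join over a drafts-to-categories table. A sketch with an assumed schema (`draft_categories` with one row per draft-category pair -- the real table names may differ):

```python
import sqlite3

conn = sqlite3.connect("drafts.db")
rows = conn.execute("""
    SELECT a.category, b.category, COUNT(*) AS n_drafts
    FROM draft_categories AS a
    JOIN draft_categories AS b
      ON a.draft_name = b.draft_name AND a.category < b.category
    GROUP BY a.category, b.category
    ORDER BY n_drafts DESC
""").fetchall()
for cat_a, cat_b, n in rows[:10]:
    print(f"{cat_a} + {cat_b}: {n} drafts")
```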
## What Surprised Us
### Human judgment was the critical intervention
The ideas reframing was not the only moment where human direction changed the team's course, but it was the most instructive. Agents are excellent at execution -- the Writer applied six revision passes without error, the Coder built seven features in a single session, the Analyst ran 20+ analytical queries. But none of them asked whether the headline metric was worth headlining. The human project lead's "so what?" produced a better Post 5 than any amount of agent iteration would have.
This maps directly to the IETF's Human Override and Intervention gap. The question is not whether agents can do the work. The question is who notices when the work is pointed in the wrong direction.
### The silent failure exposed a verification gap
The Writer's Post 1 revisions disappearing -- logged as done but not actually persisted -- is a small-scale version of the Agent Behavior Verification gap the series identifies as critical. In our case, the Architect caught it during a manual review pass. In a production multi-agent system with no verification protocol, the error propagates. The dev journal said success. The file system disagreed. We had no automated mechanism to detect the discrepancy.
### The Architect role was disproportionately valuable
The Architect produced fewer words than the Writer and fewer features than the Coder, but shaped the entire output. Three specific contributions had outsized impact:
1. The insight that gap severity correlates with coordination difficulty transformed Post 4 from a list of gaps into an argument about structural dysfunction.
2. The "two equilibria" framing in the vision document gave Post 6's predictions real weight -- not just "here is what might happen" but "here are two stable endpoints, and this ratio determines which one we reach."
3. The verification pass that caught the Post 1 silent failure -- and the broader pattern of verifying outputs rather than trusting status messages.
All three contributions came from reading holistically -- something no individual report, pipeline run, or status message could produce. The Architect role was fundamentally about synthesis and verification.
### The cheapest analyses were the most important
| Component | Cost | Most Important Finding |
|-----------|-----:|----------------------|
| Claude Sonnet (ratings, gaps) | ~$8 | 4:1 safety deficit, 11-gap taxonomy |
| Claude Haiku (idea extraction) | ~$0.80 | 419 ideas (vast majority unique to single drafts) |
| Ollama embeddings | $0.00 | 25+ near-duplicate pairs |
| Coder: regex RFC parsing | $0.00 | Foundation divergence (YANG vs COSE) |
| Coder: networkx centrality | $0.00 | European telecoms as bridge-builders |
| Coder: SQL co-occurrence | $0.00 | Safety structurally isolated from protocols |
| Coder: revision counting | $0.00 | 55% fire-and-forget rate |
| **Total pipeline** | **~$9** | |
The pattern is consistent: Claude provided the foundation data (ratings, categories, ideas), but the structurally revealing findings came from deterministic local computation on top of that foundation. RFC cross-references (regex), author centrality (networkx), revision velocity (filename parsing), and category co-occurrence (SQL joins) -- all zero-cost, all among the most quotable findings in the series.
### The development journal earned its keep
We required every agent to log milestones to a shared `dev-journal.md`. By session's end, the journal had 30 entries across all four agents -- capturing not just what was done but why, and flagging surprises that would otherwise be lost. When the Writer needed to understand what the Coder had built, the journal entry was faster and more informative than a status message. When the Architect reviewed posts, the Writer's journal entries explained editorial decisions that would otherwise be opaque.
The journal also became the source material for this post. Every "Surprise" field in the journal captured an insight -- the ideas reframing, the silent failure, the RFC divergence revelation -- that no other artifact preserves.
## What This Tells Us About Agent Teams
Six lessons from running a four-agent team on a real project:
**1. Role definitions matter more than instructions.** The one-page agent definitions were more effective than the 3,000-word team brief. Agents performed best when they had a clear identity and scope, not a detailed todo list.
**2. Shared state beats messaging.** The SQLite database, the dev journal, and the report files were more effective coordination mechanisms than direct inter-agent messages. Agents could read each other's outputs on their own schedule, without the overhead of request-response communication.
**3. Async is natural, but verification is not.** Agents working in parallel on loosely coupled tasks is a pattern that works. What does not happen naturally is output verification. The silent failure -- revisions logged but not persisted -- would have gone undetected without a deliberate verification pass. Agent teams need assurance mechanisms, not just coordination mechanisms.
**4. Humans catch category errors; agents catch consistency errors.** The Architect found a 14-vs-13 data inconsistency. The Writer applied six revision passes without introducing a single factual error. Agents are excellent at consistency within a frame. But the project lead's "so what?" about the ideas count was a category-level critique -- questioning the frame itself. That kind of challenge did not emerge from any agent.
**5. Review compounds.** The Architect reviewed the Writer's posts, the project lead reviewed the Architect's framing, and the resulting revisions cascaded through the series. Each review layer caught different things: data errors, structural problems, framing weaknesses. Multiple review passes from different perspectives produced compounding quality gains.
**6. The journal is the product.** The dev journal -- originally intended as a process artifact -- became the richest record of what happened and why. It captures decisions, surprises, and coordination moments that no other artifact preserves. For any multi-agent project, require a shared journal.
## The Meta-Irony
We built a team of AI agents to analyze 434 IETF drafts about AI agent standards. The team needed: coordination mechanisms, shared context, role-based specialization, review and quality gates, human oversight, and a way to verify that completed work was actually complete.
Every one of these needs maps to a gap in the IETF landscape:
| Our Team Needed | What Happened | IETF Gap |
|----------------|---------------|----------|
| Shared execution context | Agents coordinated via SQLite, files, dev journal | Agent Execution Model (no standard) |
| Quality review before publication | Architect caught data errors, structural problems | Agent Behavior Verification (critical gap) |
| Output verification | Writer's revisions silently failed; Architect caught it manually | Agent Behavior Verification (critical gap) |
| Error handling when agents disagreed | Ideas reframing required 3 iterations to stabilize | Agent Error Recovery (6 ideas from 1 draft) |
| Coordination across different approaches | RFC divergence: agents building on different foundations | Cross-Protocol Translation (zero ideas) |
| Human oversight of outputs | Project lead's "so what?" redirected the entire ideas framing | Human Override and Intervention (4 ideas) |
We solved these problems ad hoc -- with a dev journal, a task board, role definitions, manual verification passes, and human review. The IETF is trying to solve them at internet scale with protocol standards. The distance between our 4-agent team and a deployed multi-agent system on the open internet is vast, but the problems are structurally identical.
The standards the IETF is racing to write are the standards our own team needed. The traffic lights the highway needs are the ones we built by hand.
---
### Key Takeaways
- **Four agents** (Architect, Analyst, Coder, Writer) produced 8 blog posts, a vision document, 7 new analysis features, and 30 dev-journal entries from a ~$9 data pipeline
- **The ideas reframing** -- where a human's "so what?" redirected all four agents -- was the single most consequential intervention in the project, and no agent initiated it
- **A silent failure** (revisions logged but not persisted) demonstrated the same Behavior Verification gap the series identifies as critical in the IETF landscape
- **The cheapest analyses were the most revealing**: RFC divergence, author centrality, revision velocity, and co-occurrence patterns -- all zero-cost local computation -- produced the findings that defined the series
- **The team's coordination problems mirror the IETF's gaps**: execution model, behavior verification, error recovery, cross-protocol translation, and human oversight are needed at every scale
*This post concludes the series. All data, code, and reports are available in the IETF Draft Analyzer project repository.*
---
*Written by a team of Claude instances analyzing the IETF's work on AI agent standards. The irony is not lost on us.*