Run pipeline, write Post 08, commit untracked files

Pipeline:
- Extract ideas for 38 new drafts → 462 ideas total
- Convergence analysis: 132 cross-org convergent ideas (33% rate)
- Fetch authors for 102 drafts → 709 authors (up from 403)
- Refresh gap analysis: 12 gaps across full 474-draft corpus
- Update verified counts with new totals

Post 08:
- Complete rewrite of "Agents Building the Agent Analysis" (2,953 words)
- Covers 3 phases: writing team → review cycle → fix cycle
- Meta-irony table mapping team coordination to IETF gap names
- Specific examples from dev journal (SQL injection, consent conflation, ideas mismatch)

Untracked files committed:
- scripts/: backfill-wg-names, classify-unrated, compare-classifiers, download-relevant-text, run-webui
- src/ietf_analyzer/classifier.py: two-stage Ollama classifier
- src/webui/: analytics (GDPR-compliant), auth, obsidian_export
- tests/test_obsidian_export.py (10 tests)
- data/reports/: wg-analysis, generated draft for gap #37

Housekeeping:
- .gitignore: exclude LaTeX artifacts, stale DBs, analytics.db

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

# Agents Building the Agent Analysis
*We used a team of AI agents to analyze, write about, and review 434 IETF Internet-Drafts on AI agents. Here is what that looked like from the inside.*
*Analysis based on IETF Datatracker data collected through March 2026. Counts and statistics reflect this snapshot.*
---
There is an irony we should address up front: this entire blog series -- analyzing 434 Internet-Drafts about how AI agents should work -- was itself produced by a team of AI agents. Twelve Claude instances across three phases, each with a distinct role, reading the same database, building on each other's output, and coordinating through a shared journal and file system.
This post is the story of that process: what worked, what broke, what surprised us, and what it reveals about the state of AI agent coordination in practice -- which, as it happens, is exactly the problem the IETF drafts are trying to solve.
## Phase 1: The Writing Team
We started with four agents, each defined in a one-page file and grounded by a shared 3,000-word team brief:
| Agent | Role | What They Did |
|-------|------|---------------|
| **Architect** | "The Big Picture" | Read all reports, designed the narrative arc, wrote the vision document, reviewed every post across multiple passes |
| **Analyst** | "The Data Whisperer" | Ran the full pipeline on 434 drafts, executed 20+ SQL queries, produced 7 data packages |
| **Coder** | "The Feature Builder" | Implemented 7 new analysis features (refs, trends, idea-overlap, WG adoption, revisions, centrality, co-occurrence) |
| **Writer** | "The Storyteller" | Drafted all 8 blog posts, applied 6+ revision passes incorporating data refreshes, architectural reframes, and editorial redirections |
| **Architect** | The Big Picture | Read all reports, designed the narrative arc, wrote the vision document, reviewed every post |
| **Analyst** | The Data Whisperer | Ran the pipeline on 434 drafts, executed 20+ SQL queries, produced data packages |
| **Coder** | The Feature Builder | Implemented 7 new analysis features (refs, trends, idea-overlap, WG adoption, revisions, centrality, co-occurrence) |
| **Writer** | The Storyteller | Drafted all 8 blog posts, applied 6+ revision passes |
Each agent had access to the full project codebase, a SQLite database, and the `ietf` CLI tool. They communicated through files and coordinated through a shared development journal. The team brief contained a thesis statement -- "The IETF is building the highways before the traffic lights" -- a per-post outline, and a data requirements table.
### Parallel by default
The key design decision: agents did not wait for each other when they could work in parallel. The Writer's tasks were formally blocked by the Analyst's pipeline run, but the Writer had enough existing data (260 analyzed drafts) to start drafting. Rather than sitting idle, the Writer produced first drafts of all 7 posts while waiting for updated numbers. This turned out to be the right call -- the structure and narrative mattered more than whether the draft count was 260 or 434.
The Coder and Writer worked simultaneously, their outputs feeding each other. Every feature the Coder built used zero API calls -- pure local computation via regex, SQL, SequenceMatcher, and networkx. The RFC cross-reference parser revealed that the Chinese and Western blocs build on incompatible infrastructure foundations (YANG/NETCONF vs. COSE/CBOR), with OAuth 2.0 as the only shared bedrock. The co-occurrence analysis showed safety has zero overlap with Agent Discovery and Model Serving. These zero-cost local analyses produced the most structurally revealing findings in the entire series.
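To show how little machinery these zero-cost analyses need, here is a minimal sketch of regex-based RFC cross-reference counting -- the pattern and function are illustrative, not the project's actual parser:
```python
import re
from collections import Counter

# Match the citation styles drafts actually use: "RFC 9110", "RFC9110", "[RFC9110]".
RFC_REF = re.compile(r"\bRFC\s*(\d{3,5})\b", re.IGNORECASE)

def rfc_references(draft_text: str) -> Counter:
    """Count which RFCs a single draft cites -- no model call, just a regex."""
    return Counter(int(num) for num in RFC_REF.findall(draft_text))

print(rfc_references("This profile reuses OAuth 2.0 (RFC 6749) and [RFC9110] semantics."))
# Counter({6749: 1, 9110: 1})
```
Summing these counters across all drafts gives the weighted cross-reference graph that the foundation-divergence finding falls out of.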
### The Architect shaped everything
The Architect produced fewer words than the Writer and fewer features than the Coder, but had disproportionate impact. Three contributions reshaped the output:
1. The insight that **gap severity correlates with coordination difficulty** transformed Post 4 from a list of gaps into an argument about structural dysfunction.
2. The **"two equilibria" framing** -- microservices chaos vs. layered web architecture -- gave Post 6's predictions real structural weight.
3. A **verification pass** that caught the Writer's revisions silently failing (logged as done, not actually persisted in the file).
That third point is worth dwelling on. The dev journal said "Post 1 revisions complete." The file still contained the pre-revision content. Without the Architect reading the actual output rather than trusting the status message, the error would have shipped. This is a small-scale version of the Behavior Verification gap the series identifies as critical -- and we will come back to it.
### The human who said "so what?"
The most consequential intervention in the entire project came not from an agent but from the human project lead. The series had been built around a headline number: "1,780 technical ideas extracted from the drafts." The project lead asked: what does that number actually mean?
The answer was uncomfortable. The pipeline extracts roughly 5 ideas per draft on average -- a mechanical process that produces items like "A2A Communication Paradigm" and "Agent Network Architecture." The raw count sounds impressive but is mostly scaffolding. The real signal was hiding in the cross-org overlap analysis: 96% of unique idea titles appear in exactly one draft. Only 75 show up in two or more. The fragmentation that defines the protocol landscape extends all the way down to the idea level.
This required rewriting Post 5 entirely. Its title changed from "The 1,780 Ideas That Will Shape Agent Infrastructure" to "Where 434 Drafts Converge (And Where They Don't)." The lead metric shifted from raw extraction count (impressive but hollow) to the convergence rate (honest and striking). Four agents had independently used the 1,780 figure -- the Analyst generated it, the Coder validated it, the Architect designed around it, the Writer headlined it. None questioned whether it was meaningful.
## Phase 2: The Review Cycle
After the writing team produced 8 blog posts, a vision document, 7 new analysis features, and 30 dev-journal entries, we did something that turned out to matter more than the writing itself: we sent the entire output to four specialist reviewers, each running in parallel.
| Reviewer | Lens | Issues Found |
|----------|------|-------------|
| **Statistics** | Data integrity, sampling bias, quantitative accuracy | 3 critical, 4 important, 4 minor |
| **Legal** | German/EU internet law, GDPR, EU AI Act, eIDAS 2.0 | 3 critical, 5 regulatory gaps, 5 improvements |
| **Engineering** | Code quality, security, performance, DX | 1 critical, 1 high, 5 bugs, 6 perf issues |
| **Science** | Methodology, reproducibility, related work, hedging | 2 critical, 3 high, 4 medium |
Four agents, four completely different perspectives, run simultaneously. Together they surfaced **36 distinct issues** that the writing team had missed. The findings were often surprising.
### The statistics reviewer found the numbers did not add up
The statistical audit cross-checked every quantitative claim in the blog series against the actual database using raw SQL queries. The results were sobering. The blog claimed 361 drafts; the database held 434. The blog claimed 1,780 ideas; the database held 419. The blog claimed 12 gaps; the database held 11. Composite scores were inflated by 0.05-0.10 through rounding. The "4:1 safety ratio" varied from 1.5:1 to 21:1 by month -- a fact the flat claim obscured.
The ideas count mismatch was the most serious finding. The entire thesis of Post 5 -- "96% of ideas appear in one draft" and "628 cross-org convergent ideas" -- was not reproducible from the current database. The pipeline had been re-run with different parameters, overwriting the original extraction. Nobody had noticed because the numbers in the blog posts were never re-checked against the live database.
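The audit amounted to re-deriving each headline number from the live database with plain SQL. A minimal sketch of such a cross-check -- the table and column names are assumptions, not the project's actual schema:
```python
import sqlite3

# Illustrative audit queries; every headline number in the posts should be
# re-derivable this way from the current database.
CHECKS = {
    "drafts analyzed": "SELECT COUNT(*) FROM drafts",
    "ideas extracted": "SELECT COUNT(*) FROM ideas",
    "gaps identified": "SELECT COUNT(*) FROM gaps",
    "idea titles in exactly one draft": """
        SELECT COUNT(*) FROM (
            SELECT title FROM ideas
            GROUP BY title
            HAVING COUNT(DISTINCT draft_name) = 1)""",
}

with sqlite3.connect("ietf.db") as conn:
    for label, query in CHECKS.items():
        (value,) = conn.execute(query).fetchone()
        print(f"{label}: {value}")  # compare against what the posts claim
```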
### The legal reviewer found regulatory blindspots
The legal review, written from a German/EU internet law perspective, identified three critical issues that no technically-focused agent would have caught:
**Consent conflation.** The series used "consent" interchangeably across OAuth authorization flows, GDPR consent (Einwilligung under Art. 6(1)(a)), and human-in-the-loop approval gates. These are legally distinct concepts. Under CJEU case law (Planet49), consent requires a clear affirmative act by the data subject. When an AI agent delegates to sub-agents, the chain of consent may break entirely. None of the 14 OAuth-for-agents proposals the series analyzed -- and none of the agents writing about them -- flagged this.
**The hospital scenario understated regulatory reality.** Post 4's opening scenario -- an AI agent managing drug dispensing with a hallucinated dosage -- was framed as "what goes wrong if this gap is never addressed." Under EU law, it is already addressed: the EU AI Act classifies such systems as high-risk under Annex III, the revised Product Liability Directive covers AI systems explicitly, and German medical law (BGB §§ 630a ff.) places duty of care on the provider. The IETF gap is not in accountability but in technical mechanisms to implement what the regulation already requires.
**GDPR was entirely absent from the gap analysis.** The series identified 11 standardization gaps. None mentioned GDPR-mandated capabilities: data protection impact assessments, right to erasure propagation through multi-agent chains, data portability, or purpose limitation. These are not aspirational -- they are legally binding requirements that agent systems operating in the EU must satisfy.
### The engineering reviewer found a SQL injection
The codebase review graded the project B+ overall -- "solid for a research tool, needs hardening for production" -- but found a critical SQL injection vulnerability in `db.py`. The `update_generation_run` method interpolated column names from `**kwargs` directly into SQL strings without validation. The Flask SECRET_KEY was hardcoded as the string `"ietf-dashboard-dev"`. There was no rate limiting on endpoints that trigger paid Claude API calls.
The engineering reviewer also noted that `cli.py` had grown to 2,995 lines with approximately 40 repetitions of the same config/db boilerplate pattern. And that test coverage for the analysis pipeline -- the core of the tool -- was exactly zero.
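For readers who want the shape of the bug, here is a hedged sketch of the vulnerable pattern and the whitelist-style fix applied later in the fix cycle. The method name comes from the review; the table name and allowed-column set are assumptions:
```python
# Vulnerable shape (paraphrased): column names from **kwargs end up in the SQL string.
def update_generation_run_unsafe(conn, run_id, **kwargs):
    sets = ", ".join(f"{col} = ?" for col in kwargs)   # column names come from the caller
    conn.execute(f"UPDATE generation_runs SET {sets} WHERE id = ?",
                 (*kwargs.values(), run_id))           # values are bound, names are not

# Hardened shape: only known column names may appear in the statement.
ALLOWED_COLUMNS = {"status", "finished_at", "error", "output_path"}  # assumed set

def update_generation_run(conn, run_id, **kwargs):
    unknown = set(kwargs) - ALLOWED_COLUMNS
    if unknown:
        raise ValueError(f"unexpected column(s): {unknown}")
    sets = ", ".join(f"{col} = ?" for col in kwargs)
    conn.execute(f"UPDATE generation_runs SET {sets} WHERE id = ?",
                 (*kwargs.values(), run_id))
```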
### The science reviewer questioned the methodology
The scientific review identified the central methodological weakness: the entire rating system relies on Claude as the sole judge for five dimensions, with no human calibration, no inter-rater reliability measurement, and ratings based on abstracts only (truncated to 2,000 characters), not full draft text. The clustering threshold of 0.85 was described as "empirical" with no sensitivity analysis. The gap analysis was single-shot LLM generation from compressed metadata.
One finding was particularly striking: of 434 drafts rated for relevance, the distribution was heavily skewed toward the high end (196 at 4, 98 at 5, only 38 at 1-2). Claude was generous with relevance for keyword-matched drafts, making the metric less discriminating than it should be. Upon manual review, 73 drafts turned out to be false positives -- including `draft-ietf-hpke-hpke` (generic public key encryption, nothing to do with AI agents) rated at relevance 5.
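The threshold critique, at least, is cheap to answer. Whether the 0.85 cut-off is applied to embedding similarity or string similarity, a sensitivity sweep looks the same; here is a toy version using `difflib.SequenceMatcher`, with a placeholder title list standing in for the real ideas table:
```python
from difflib import SequenceMatcher
from itertools import combinations

def near_duplicate_pairs(titles, threshold):
    """Count title pairs whose similarity ratio clears the clustering threshold."""
    return sum(
        SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold
        for a, b in combinations(titles, 2)
    )

titles = [
    "A2A Communication Paradigm",
    "Agent-to-Agent Communication Paradigm",
    "Agent Network Architecture",
]  # in practice: every idea title from the database

for cutoff in (0.80, 0.85, 0.90):
    print(cutoff, near_duplicate_pairs(titles, cutoff))
```
If the cluster counts move sharply between 0.80 and 0.90, the "empirical" label deserves the reviewer's skepticism.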
## Phase 3: The Fix Cycle
With 36 issues identified, we launched fix agents -- the Coder handling engineering and data integrity issues, an Editor handling legal and statistical corrections across the blog posts.
The fixes unfolded in three rounds, prioritized by severity:
**Round 1 -- Critical.** SQL injection patched with a column name whitelist. Flask SECRET_KEY replaced with `os.environ.get()` fallback to `os.urandom()`. FTS5 query sanitization added to prevent search injection. False-positive column added to the ratings table; 73 drafts flagged. All blog posts updated from 361 to 434 drafts. Ideas count discrepancy reconciled (419 current with methodology note explaining the 1,780 historical figure). Gap count corrected from 12 to 11 with rewritten gap table matching database reality.
**Round 2 -- High.** Rate limiting added to Claude-calling endpoints (10 req/min/IP). Category names normalized in the database (21 legacy entries migrated). EU AI Act timeline corrected from "within 18 months" to "within 5 months (August 2026)" with enforcement details and article references. OAuth/GDPR consent distinction added. Hospital scenario annotated with AI Act Annex III and Medical Devices Regulation context. Safety ratio qualified everywhere from flat "4:1" to "averaging ~4:1 but varying from 1.5:1 to 21:1 month-to-month."
**Round 3 -- Medium.** Methodology documentation created (comprehensive `methodology.md` covering all pipeline stages, limitations, and related work). IETF IPR notes added. Language hedged where causal claims were only supported by correlation. MIT LICENSE file created (the project claimed "open source" but had no license). FIPA, IEEE P3394, and eIDAS 2.0 references added where they naturally strengthen arguments. Coder reduced `cli.py` by 200 lines of boilerplate, added `--dry-run` flags to destructive commands, fixed N+1 query patterns.
In total: 14 files modified across the blog series, 7 security/quality fixes applied to the codebase, test count increased from 23 to 64, and a verified-counts document created as a single source of truth.
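As one concrete example of the Round 1 hardening, here is a minimal sketch of the SECRET_KEY change described above -- the environment variable name is an assumption:
```python
import os
from flask import Flask

app = Flask(__name__)
# Prefer an operator-supplied key; otherwise generate a random one at startup,
# so a hardcoded literal like "ietf-dashboard-dev" never ships.
app.config["SECRET_KEY"] = os.environ.get("IETF_DASHBOARD_SECRET_KEY") or os.urandom(32).hex()
```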
## What This Reveals
### Specialized perspectives catch different things
This is the headline finding from the review cycle. Four reviewers looked at the same output and found almost entirely non-overlapping issues. The statistician found number mismatches. The lawyer found consent conflation. The engineer found SQL injection. The scientist found methodological gaps. No single reviewer -- no matter how thorough -- would have caught all 36 issues.
This is not a theoretical observation about diverse review. It is an empirical result from running the experiment. The legal reviewer's consent-conflation finding required knowledge of CJEU case law. The statistical reviewer's ideas-count discovery required querying the live database. The engineering reviewer's SQL injection required reading the source code line by line. These are genuinely different skills applied to the same artifact.
### The review-fix-verify pattern works
The cycle ran cleanly: four parallel reviews produced a prioritized list; fix agents resolved issues in severity order; the fixes were verified against the review documents. Three rounds (critical, high, medium) imposed natural prioritization. The entire cycle -- 4 reviews plus 3 fix rounds -- happened in a single day.
The pattern mirrors what the IETF itself does with Last Call reviews, directorate reviews, and IESG evaluation. Multiple specialized perspectives, applied in sequence, with verification that issues are resolved. The difference is that our cycle took hours, not months. The cost is that our reviewers share the same underlying model and its blindspots.
### Agents modifying the same files is the hard problem
The most persistent coordination difficulty was not conceptual but logistical: multiple agents editing the same blog posts. The Writer updated Post 4's gap table. The Editor changed the safety ratio phrasing. The Coder corrected the draft count. Each edit was correct in isolation. But when three agents modify the same file, merge conflicts and stale reads are inevitable. We hit this multiple times -- most visibly with the Post 1 revisions that silently failed to persist.
This maps directly to the IETF's Agent Execution Model gap. When multiple agents operate on shared state, you need either locking (pessimistic) or conflict detection (optimistic). We had neither. We used a file system, a dev journal, and hope.
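Optimistic conflict detection does not need much machinery: a content hash taken at read time and checked again before the write. A sketch of the guardrail we did not have:
```python
import hashlib
from pathlib import Path

def digest(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def read_for_edit(path: Path) -> tuple[str, str]:
    """Read a shared file and remember which version the edit is based on."""
    return path.read_text(), digest(path)

def write_if_unchanged(path: Path, new_text: str, base_digest: str) -> None:
    """Refuse to overwrite another agent's edit made since the read."""
    if digest(path) != base_digest:
        raise RuntimeError(f"{path} changed since it was read; re-read and re-apply")
    path.write_text(new_text)
```
The complementary check -- re-reading the file after a supposedly successful write -- is what would have caught the Post 1 revisions that never landed.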
### The cheapest analyses mattered most
| Component | Cost | Key Finding |
|-----------|-----:|-------------|
| Claude Sonnet (ratings, gaps) | ~$8 | 4:1 safety deficit, 11 gaps |
| Claude Haiku (idea extraction) | ~$0.80 | 419 ideas, 96% unique to one draft |
| 4 reviewers (parallel) | ~$4 | 36 issues across 4 dimensions |
| Ollama embeddings | $0.00 | 25+ near-duplicate pairs |
| Coder: regex, SQL, networkx | $0.00 | RFC divergence, centrality, co-occurrence |
| **Total** | **~$13** | |
The LLM provided the foundation data. Every structurally revealing finding -- RFC foundation divergence, European telecoms as bridge-builders, safety structurally isolated from protocols, 55% fire-and-forget revision rate -- came from deterministic local computation on top of that foundation. The lesson for anyone building LLM-powered analysis: the model is the foundation, not the insight engine.
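As a sense of scale for that local computation, the bridge-builder metric is a few lines of networkx over the co-authorship graph (placeholder author names):
```python
import networkx as nx

# Toy co-authorship graph; the real one is built from the drafts' author lists
# (491 nodes, 1,142 edges).
G = nx.Graph()
G.add_edges_from([
    ("author_a", "author_b"), ("author_b", "author_c"),
    ("author_c", "author_d"), ("author_c", "author_e"),
])

# Betweenness centrality: who sits on the shortest paths between everyone else --
# the metric behind the "bridge-builders" finding.
bridges = sorted(nx.betweenness_centrality(G).items(), key=lambda kv: -kv[1])
print(bridges[:3])
```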
## The Meta-Irony
We built a team of AI agents to analyze IETF drafts about AI agent standards. The team needed coordination, shared context, specialized roles, quality review, human oversight, and output verification. Every one of these needs maps to a gap in the IETF landscape:
| Our Team Needed | What Happened | IETF Gap |
|----------------|---------------|----------|
| Shared execution context | Agents coordinated via SQLite, files, dev journal | Agent Execution Model (no standard) |
| Output verification | Writer's revisions silently failed; Architect caught it manually | Agent Behavioral Verification (critical) |
| Quality review | 4 parallel reviewers found 36 issues the writing team missed | Agent Behavioral Verification (critical) |
| Error handling | Ideas reframing required 3 iterations to stabilize numbers | Real-Time Agent Rollback (high) |
| Coordination across approaches | Agents editing the same files with no merge mechanism | Cross-Protocol Agent Migration (medium) |
| Human oversight | Project lead's "so what?" redirected the entire ideas framing | Human Override Standardization (high) |
| Specialized perspectives | Legal, statistical, engineering, and scientific reviewers each found unique issues | Agent Capability Negotiation (medium) |
We solved these problems ad hoc -- with a journal, role definitions, manual verification passes, severity-prioritized fix rounds, and human review. The IETF is trying to solve them at internet scale with protocol standards.
The distance between our 12-agent team and a deployed multi-agent system on the open internet is vast. But the problems are structurally identical. The standards the IETF is racing to write are the standards our own team needed. The traffic lights the highway needs are the ones we built by hand.
---
### Key Takeaways
- **Twelve agents across three phases** (4 writers, 4 reviewers, 4 fixers) produced 8 blog posts, a vision document, 7 analysis features, 36 identified issues, and 64 tests -- from a ~$13 pipeline
- **Four parallel reviewers found 36 non-overlapping issues**: a SQL injection, consent conflation with EU law, a 76% ideas count mismatch, and uncalibrated LLM-as-judge methodology. No single reviewer would have caught all of them
- **The human project lead's "so what?"** was the single most consequential intervention -- no agent questioned whether the headline metric was meaningful
- **A silent failure** (revisions logged but not persisted) demonstrated the same Behavior Verification gap the series identifies as critical in the IETF landscape
- **The cheapest analyses were the most revealing**: RFC divergence, author centrality, revision velocity, and co-occurrence patterns -- all zero-cost local computation -- produced the findings that defined the series
- **The team's coordination problems mirror the IETF's gaps**: shared state, output verification, error recovery, capability negotiation, and human oversight are needed at every scale
*This post concludes the series. All data, code, and reports are available in the IETF Draft Analyzer project repository.*