# Agents Building the Agent Analysis

*We used a team of AI agents to analyze, write about, and review 434 IETF Internet-Drafts on AI agents. Here is what that looked like from the inside.*

*Analysis based on IETF Datatracker data collected through March 2026. Counts and statistics reflect this snapshot.*

---

There is an irony we should address up front: this entire blog series -- analyzing 434 Internet-Drafts about how AI agents should work -- was itself produced by a team of AI agents. Twelve Claude instances across three phases, each with a distinct role, reading the same database, building on each other's output, and coordinating through a shared journal and file system.

This post is the story of that process: what worked, what broke, what surprised us, and what it reveals about the state of AI agent coordination in practice -- which, as it happens, is exactly the problem the IETF drafts are trying to solve.

## Phase 1: The Writing Team

We started with four agents, each defined in a one-page file and grounded by a shared 3,000-word team brief:

| Agent | Role | What They Did |
|-------|------|---------------|
| **Architect** | The Big Picture | Read all reports, designed the narrative arc, wrote the vision document, reviewed every post |
| **Analyst** | The Data Whisperer | Ran the pipeline on 434 drafts, executed 20+ SQL queries, produced data packages |
| **Coder** | The Feature Builder | Implemented 7 new analysis features (refs, trends, idea-overlap, WG adoption, revisions, centrality, co-occurrence) |
| **Writer** | The Storyteller | Drafted all 8 blog posts, applied 6+ revision passes |

Each agent had access to the full project codebase, a SQLite database, and the `ietf` CLI tool. They communicated through files and coordinated through a shared development journal. The team brief contained a thesis statement -- "The IETF is building the highways before the traffic lights" -- a per-post outline, and a data requirements table.

### Parallel by default

The key design decision: agents did not wait for each other when they could work in parallel. The Writer's tasks were formally blocked by the Analyst's pipeline run, but the Writer had enough existing data (260 analyzed drafts) to start drafting. Rather than sitting idle, the Writer produced first drafts of seven of the eight posts while waiting for updated numbers. This turned out to be the right call -- the structure and narrative mattered more than whether the draft count was 260 or 434.

The Coder and Writer worked simultaneously, their outputs feeding each other. Every feature the Coder built used zero API calls -- pure local computation via regex, SQL, SequenceMatcher, and networkx. The RFC cross-reference parser revealed that the Chinese and Western blocs build on incompatible infrastructure foundations (YANG/NETCONF vs. COSE/CBOR), with OAuth 2.0 as the only shared bedrock. The co-occurrence analysis showed safety has zero overlap with Agent Discovery and Model Serving. These zero-cost local analyses produced the most structurally revealing findings in the entire series.
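To give a flavor of what "zero-cost" means here, below is a minimal sketch of one such pass: a near-duplicate scan over extracted idea titles using `SequenceMatcher`. The database filename and the `ideas(title)` schema are illustrative assumptions, not the project's actual code; the real implementations live in the project repository.

```python
import sqlite3
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical schema for illustration: ideas(draft_id TEXT, title TEXT).
conn = sqlite3.connect("ietf.db")
titles = [row[0] for row in conn.execute("SELECT DISTINCT title FROM ideas")]

# Compare every pair of idea titles and flag near-duplicates, using a 0.85
# cutoff to echo the pipeline's clustering threshold. Quadratic, but cheap
# at this scale -- and zero API calls.
near_dupes = []
for a, b in combinations(titles, 2):
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    if ratio >= 0.85:
        near_dupes.append((ratio, a, b))

for ratio, a, b in sorted(near_dupes, reverse=True)[:25]:
    print(f"{ratio:.2f}  {a!r} ~ {b!r}")
```

The same pattern -- SQL in, deterministic comparison out -- is what drove the cross-reference, centrality, and co-occurrence features.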
### The Architect shaped everything

The Architect produced fewer words than the Writer and fewer features than the Coder, but had disproportionate impact. Three contributions reshaped the output:

1. The insight that **gap severity correlates with coordination difficulty** transformed Post 4 from a list of gaps into an argument about structural dysfunction.
2. The **"two equilibria" framing** -- microservices chaos vs. layered web architecture -- gave Post 6's predictions real structural weight.
3. A **verification pass** that caught the Writer's revisions silently failing (logged as done, but never actually persisted to the file).

That third point is worth dwelling on. The dev journal said "Post 1 revisions complete." The file still contained the pre-revision content. Had the Architect trusted the status message instead of reading the actual output, the error would have shipped. This is a small-scale version of the Behavior Verification gap the series identifies as critical -- and we will come back to it.

### The human who said "so what?"

The most consequential intervention in the entire project came not from an agent but from the human project lead. The series had been built around a headline number: "1,780 technical ideas extracted from the drafts." The project lead asked: what does that number actually mean?

The answer was uncomfortable. The pipeline extracts roughly 5 ideas per draft on average -- a mechanical process that produces items like "A2A Communication Paradigm" and "Agent Network Architecture." The raw count sounds impressive but is mostly scaffolding. The real signal was hiding in the cross-org overlap analysis: 96% of unique idea titles appear in exactly one draft. Only 75 show up in two or more. The fragmentation that defines the protocol landscape extends all the way down to the idea level.

This required rewriting Post 5 entirely. Its title changed from "The 1,780 Ideas That Will Shape Agent Infrastructure" to "Where 434 Drafts Converge (And Where They Don't)." The lead metric shifted from raw extraction count (impressive but hollow) to the convergence rate (honest and striking). Four agents had independently used the 1,780 figure -- the Analyst generated it, the Coder validated it, the Architect designed around it, the Writer headlined it. None questioned whether it was meaningful.

## Phase 2: The Review Cycle

After the writing team produced 8 blog posts, a vision document, 7 new analysis features, and 30 dev-journal entries, we did something that turned out to matter more than the writing itself: we sent the entire output to four specialist reviewers, each running in parallel.

| Reviewer | Lens | Issues Found |
|----------|------|-------------|
| **Statistics** | Data integrity, sampling bias, quantitative accuracy | 3 critical, 4 important, 4 minor |
| **Legal** | German/EU internet law, GDPR, EU AI Act, eIDAS 2.0 | 3 critical, 5 regulatory gaps, 5 improvements |
| **Engineering** | Code quality, security, performance, DX | 1 critical, 1 high, 5 bugs, 6 perf issues |
| **Science** | Methodology, reproducibility, related work, hedging | 2 critical, 3 high, 4 medium |

Four agents, four completely different perspectives, run simultaneously. Together they surfaced **36 distinct issues** that the writing team had missed. The findings were often surprising.

### The statistics reviewer found the numbers did not add up

The statistical audit cross-checked every quantitative claim in the blog series against the actual database using raw SQL queries. The results were sobering. The blog claimed 361 drafts; the database held 434. The blog claimed 1,780 ideas; the database held 419. The blog claimed 12 gaps; the database held 11. Composite scores were inflated by 0.05-0.10 through rounding. The "4:1 safety ratio" varied from 1.5:1 to 21:1 by month -- a fact the flat claim obscured.

The ideas count mismatch was the most serious finding. The entire thesis of Post 5 -- "96% of ideas appear in one draft" and "628 cross-org convergent ideas" -- was not reproducible from the current database. The pipeline had been re-run with different parameters, overwriting the original extraction. Nobody noticed, because the numbers in the blog posts were never re-checked against the live database.
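The audit's mechanic is worth making concrete. A minimal sketch, assuming illustrative table names (`drafts`, `ideas`, `gaps`) that may differ from the project's actual schema:

```python
import sqlite3

# Published claims vs. the live database. The claimed figures are the ones
# the blog series shipped; table names are illustrative assumptions.
CLAIMS = {"drafts": 361, "ideas": 1780, "gaps": 12}

conn = sqlite3.connect("ietf.db")
for table, claimed in CLAIMS.items():
    # Table names come from the fixed dict above, never from user input --
    # interpolating anything else into SQL is exactly the db.py bug below.
    (actual,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    flag = "OK" if actual == claimed else "MISMATCH"
    print(f"{table}: blog says {claimed}, database says {actual} [{flag}]")
```

Run against the live database, all three checks fail: 434, 419, and 11, respectively.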
### The legal reviewer found regulatory blindspots

The legal review, written from a German/EU internet law perspective, identified three critical issues that no technically focused agent would have caught:

**Consent conflation.** The series used "consent" interchangeably across OAuth authorization flows, GDPR consent (Einwilligung under Art. 6(1)(a)), and human-in-the-loop approval gates. These are legally distinct concepts. Under CJEU case law (Planet49), consent requires a clear affirmative act by the data subject. When an AI agent delegates to sub-agents, the chain of consent may break entirely. None of the 14 OAuth-for-agents proposals the series analyzed -- and none of the agents writing about them -- flagged this.

**The hospital scenario understated regulatory reality.** Post 4's opening scenario -- an AI agent managing drug dispensing with a hallucinated dosage -- was framed as "what goes wrong if this gap is never addressed." Under EU law, it is already addressed: the EU AI Act classifies such systems as high-risk under Annex III, the revised Product Liability Directive covers AI systems explicitly, and German medical law (BGB §§ 630a ff.) places the duty of care on the provider. The IETF gap is not in accountability but in the technical mechanisms to implement what the regulation already requires.

**GDPR was entirely absent from the gap analysis.** The series identified 11 standardization gaps. None mentioned GDPR-mandated capabilities: data protection impact assessments, right-to-erasure propagation through multi-agent chains, data portability, or purpose limitation. These are not aspirational -- they are legally binding requirements that agent systems operating in the EU must satisfy.

### The engineering reviewer found a SQL injection

The codebase review graded the project B+ overall -- "solid for a research tool, needs hardening for production" -- but found a critical SQL injection vulnerability in `db.py`. The `update_generation_run` method interpolated column names from `**kwargs` directly into SQL strings without validation. The Flask SECRET_KEY was hardcoded as the string `"ietf-dashboard-dev"`. There was no rate limiting on endpoints that trigger paid Claude API calls.

The engineering reviewer also noted that `cli.py` had grown to 2,995 lines with approximately 40 repetitions of the same config/db boilerplate pattern -- and that test coverage for the analysis pipeline, the core of the tool, was exactly zero.
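To make the injection concrete: below is the vulnerable shape the review described, next to the whitelist fix that Round 1 later applied. This is a reconstruction from the review's description, not the project's exact code, and the column names in the whitelist are hypothetical.

```python
import sqlite3

# Vulnerable shape: column names arrive via **kwargs and are interpolated
# straight into the SQL string. A hostile key can rewrite the statement.
def update_generation_run_unsafe(conn: sqlite3.Connection, run_id: int, **kwargs):
    assignments = ", ".join(f"{col} = ?" for col in kwargs)  # col is unchecked
    conn.execute(f"UPDATE generation_runs SET {assignments} WHERE id = ?",
                 (*kwargs.values(), run_id))

# Fixed shape: only whitelisted column names ever reach the SQL string;
# values still go through bound parameters. (Column names are hypothetical.)
ALLOWED_COLUMNS = frozenset({"status", "finished_at", "error", "cost_usd"})

def update_generation_run(conn: sqlite3.Connection, run_id: int, **kwargs):
    unknown = set(kwargs) - ALLOWED_COLUMNS
    if unknown:
        raise ValueError(f"disallowed columns: {sorted(unknown)}")
    assignments = ", ".join(f"{col} = ?" for col in kwargs)
    conn.execute(f"UPDATE generation_runs SET {assignments} WHERE id = ?",
                 (*kwargs.values(), run_id))
```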
### The science reviewer questioned the methodology

The scientific review identified the central methodological weakness: the entire rating system relies on Claude as the sole judge across five dimensions, with no human calibration, no inter-rater reliability measurement, and ratings based on abstracts only (truncated to 2,000 characters), not full draft text. The clustering threshold of 0.85 was described as "empirical" with no sensitivity analysis. The gap analysis was single-shot LLM generation from compressed metadata.

One finding was particularly striking: of 434 drafts rated for relevance, the distribution was heavily skewed toward the top of the scale (196 at 4, 98 at 5, only 38 at 1-2). Claude was generous with relevance for keyword-matched drafts, making the metric less discriminating than it should be. Upon manual review, 73 drafts turned out to be false positives -- including `draft-ietf-hpke-hpke` (hybrid public key encryption, nothing to do with AI agents) rated at relevance 5.

## Phase 3: The Fix Cycle

With 36 issues identified, we launched fix agents -- the Coder handling engineering and data-integrity issues, an Editor handling legal and statistical corrections across the blog posts. The fixes unfolded in three rounds, prioritized by severity:

**Round 1 -- Critical.** SQL injection patched with a column-name whitelist. Flask SECRET_KEY replaced with an `os.environ.get()` lookup falling back to `os.urandom()`. FTS5 query sanitization added to prevent search injection. A false-positive column added to the ratings table; 73 drafts flagged. All blog posts updated from 361 to 434 drafts. The ideas-count discrepancy reconciled (419 current, with a methodology note explaining the historical 1,780 figure). Gap count corrected from 12 to 11, with the gap table rewritten to match database reality.

**Round 2 -- High.** Rate limiting added to Claude-calling endpoints (10 req/min/IP). Category names normalized in the database (21 legacy entries migrated). EU AI Act timeline corrected from "within 18 months" to "within 5 months (August 2026)" with enforcement details and article references. The OAuth/GDPR consent distinction added. The hospital scenario annotated with AI Act Annex III and Medical Devices Regulation context. The safety ratio qualified everywhere, from a flat "4:1" to "averaging ~4:1 but varying from 1.5:1 to 21:1 month-to-month."

**Round 3 -- Medium.** Methodology documentation created (a comprehensive `methodology.md` covering all pipeline stages, limitations, and related work). IETF IPR notes added. Language hedged where causal claims were supported only by correlation. An MIT LICENSE file created (the project claimed "open source" but had no license). FIPA, IEEE P3394, and eIDAS 2.0 references added where they naturally strengthen arguments. The Coder trimmed 200 lines of boilerplate from `cli.py`, added `--dry-run` flags to destructive commands, and fixed N+1 query patterns.

In total: 14 files modified across the blog series, 7 security/quality fixes applied to the codebase, test count increased from 23 to 64, and a verified-counts document created as a single source of truth. The Round 1-2 server hardening is sketched below.
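In schematic form, the secret-key fix and the rate limit look like the following. This is a hedged sketch: the endpoint name is illustrative, and the actual fix may use a rate-limiting library rather than this hand-rolled sliding window.

```python
import os
import time
from collections import defaultdict, deque
from functools import wraps

from flask import Flask, abort, request

app = Flask(__name__)
# Round 1: read the secret from the environment; fall back to random bytes
# instead of a hardcoded dev string like "ietf-dashboard-dev".
app.config["SECRET_KEY"] = os.environ.get("SECRET_KEY") or os.urandom(32)

# Round 2: naive per-IP sliding window, 10 requests per minute, guarding
# endpoints that trigger paid Claude API calls.
_hits: defaultdict[str, deque] = defaultdict(deque)

def rate_limit(max_hits: int = 10, window_s: float = 60.0):
    def decorator(view):
        @wraps(view)
        def wrapped(*args, **kwargs):
            now = time.monotonic()
            hits = _hits[request.remote_addr or "unknown"]
            while hits and now - hits[0] > window_s:
                hits.popleft()
            if len(hits) >= max_hits:
                abort(429)  # Too Many Requests
            hits.append(now)
            return view(*args, **kwargs)
        return wrapped
    return decorator

@app.route("/api/generate", methods=["POST"])  # illustrative endpoint
@rate_limit()
def generate():
    ...  # the expensive model call lives here
```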
## What This Reveals

### Specialized perspectives catch different things

This is the headline finding from the review cycle. Four reviewers looked at the same output and found almost entirely non-overlapping issues. The statistician found number mismatches. The lawyer found consent conflation. The engineer found SQL injection. The scientist found methodological gaps. No single reviewer -- no matter how thorough -- would have caught all 36 issues.

This is not a theoretical observation about diverse review. It is an empirical result from running the experiment. The legal reviewer's consent-conflation finding required knowledge of CJEU case law. The statistical reviewer's ideas-count discovery required querying the live database. The engineering reviewer's SQL injection required reading the source code line by line. These are genuinely different skills applied to the same artifact.

### The review-fix-verify pattern works

The cycle ran cleanly: four parallel reviews produced a prioritized list; fix agents resolved issues in severity order; the fixes were verified against the review documents. Three rounds (critical, high, medium) imposed natural prioritization. The entire cycle -- 4 reviews plus 3 fix rounds -- happened in a single day.

The pattern mirrors what the IETF itself does with Last Call reviews, directorate reviews, and IESG evaluation: multiple specialized perspectives, applied in sequence, with verification that issues are resolved. The difference is that our cycle took hours, not months. The cost is that our reviewers share the same underlying model and its blindspots.

### Agents modifying the same files is the hard problem

The most persistent coordination difficulty was not conceptual but logistical: multiple agents editing the same blog posts. The Writer updated Post 4's gap table. The Editor changed the safety-ratio phrasing. The Coder corrected the draft count. Each edit was correct in isolation. But when three agents modify the same file, merge conflicts and stale reads are inevitable. We hit this multiple times -- most visibly with the Post 1 revisions that silently failed to persist.

This maps directly to the IETF's Agent Execution Model gap. When multiple agents operate on shared state, you need either locking (pessimistic) or conflict detection (optimistic). We had neither. We used a file system, a dev journal, and hope. Even a minimal compare-and-swap convention, sketched below, would have caught the stale writes.
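For the record, here is roughly what the optimistic variant would have looked like -- a sketch of what we could have run, not what we did run:

```python
import hashlib
from pathlib import Path

def read_versioned(path: str) -> tuple[str, str]:
    """Return the file's text plus a hash identifying the version we read."""
    text = Path(path).read_text(encoding="utf-8")
    return text, hashlib.sha256(text.encode()).hexdigest()

def write_if_unchanged(path: str, new_text: str, read_version: str) -> bool:
    """Write only if the file still matches what we read (optimistic CAS).

    Note: without an OS-level lock there is still a small window between
    the check and the write; this narrows the race, it does not close it.
    """
    current = hashlib.sha256(
        Path(path).read_text(encoding="utf-8").encode()
    ).hexdigest()
    if current != read_version:
        return False  # stale read: re-read, re-apply the edit, retry
    Path(path).write_text(new_text, encoding="utf-8")
    return True

# An agent fixing the draft count would detect, rather than silently
# clobber, a concurrent edit by another agent:
text, version = read_versioned("posts/post-1.md")
if not write_if_unchanged("posts/post-1.md", text.replace("361", "434"), version):
    print("conflict: another agent wrote first; retrying from a fresh read")
```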
### The cheapest analyses mattered most

| Component | Cost | Key Finding |
|-----------|-----:|-------------|
| Claude Sonnet (ratings, gaps) | ~$8 | 4:1 safety deficit, 11 gaps |
| Claude Haiku (idea extraction) | ~$0.80 | 419 ideas, 96% unique to one draft |
| 4 reviewers (parallel) | ~$4 | 36 issues across 4 dimensions |
| Ollama embeddings | $0.00 | 25+ near-duplicate pairs |
| Coder: regex, SQL, networkx | $0.00 | RFC divergence, centrality, co-occurrence |
| **Total** | **~$13** | |

The LLM provided the foundation data. Every structurally revealing finding -- the RFC foundation divergence, European telecoms as bridge-builders, safety structurally isolated from protocols, the 55% fire-and-forget revision rate -- came from deterministic local computation on top of that foundation. The lesson for anyone building LLM-powered analysis: the model is the foundation, not the insight engine.

## The Meta-Irony

We built a team of AI agents to analyze IETF drafts about AI agent standards. The team needed coordination, shared context, specialized roles, quality review, human oversight, and output verification. Every one of these needs maps to a gap in the IETF landscape:

| Our Team Needed | What Happened | IETF Gap |
|----------------|---------------|----------|
| Shared execution context | Agents coordinated via SQLite, files, dev journal | Agent Execution Model (no standard) |
| Output verification | Writer's revisions silently failed; Architect caught it manually | Agent Behavioral Verification (critical) |
| Quality review | 4 parallel reviewers found 36 issues the writing team missed | Agent Behavioral Verification (critical) |
| Error handling | Ideas reframing required 3 iterations to stabilize numbers | Real-Time Agent Rollback (high) |
| Coordination across approaches | Agents edited the same files with no merge mechanism | Cross-Protocol Agent Migration (medium) |
| Human oversight | Project lead's "so what?" redirected the entire ideas framing | Human Override Standardization (high) |
| Specialized perspectives | Legal, statistical, engineering, and scientific reviewers each found unique issues | Agent Capability Negotiation (medium) |

We solved these problems ad hoc -- with a journal, role definitions, manual verification passes, severity-prioritized fix rounds, and human review. The IETF is trying to solve them at internet scale with protocol standards.

The distance between our 12-agent team and a deployed multi-agent system on the open internet is vast. But the problems are structurally identical. The standards the IETF is racing to write are the standards our own team needed. The traffic lights the highway needs are the ones we built by hand.

---

### Key Takeaways

- **Twelve agents across three phases** (4 writers, 4 reviewers, 4 fixers) produced 8 blog posts, a vision document, 7 analysis features, 36 identified issues, and 64 tests -- from a ~$13 pipeline
- **Four parallel reviewers found 36 non-overlapping issues**: a SQL injection, consent conflation with EU law, a 76% ideas-count mismatch, and an uncalibrated LLM-as-judge methodology. No single reviewer would have caught all of them
- **The human project lead's "so what?"** was the single most consequential intervention -- no agent questioned whether the headline metric was meaningful
- **A silent failure** (revisions logged but never persisted) demonstrated the same Behavior Verification gap the series identifies as critical in the IETF landscape
- **The team's coordination problems mirror the IETF's gaps**: shared state, output verification, error recovery, capability negotiation, and human oversight are needed at every scale

*This post concludes the series. All data, code, and reports are available in the IETF Draft Analyzer project repository.*

---

*Written by a team of Claude instances analyzing the IETF's work on AI agent standards. The irony is not lost on us.*