Fix blog accuracy and add methodology documentation

Blog posts (all 10 files updated):
- Update all counts to match DB: 434 drafts, 557 authors, 419 ideas, 11 gaps
- Fix EU AI Act timeline to August 2026 (5 months, not 18)
- Reframe growth claim from "36x" to actual monthly figures (5→61→85)
- Add safety ratio nuance (1.5:1 to 21:1 monthly variation)
- Fix composite scores (4.8→4.75, 4.6→4.5)
- Add OAuth/GDPR consent distinction (Art. 6(1)(a), Art. 28)
- Add EU AI Act Annex III + MDR context to hospital scenario
- Add FIPA, IEEE P3394, eIDAS 2.0 references
- Add GDPR gap paragraph (DPIA, erasure, portability, purpose limitation)
- Rewrite Post 04 gap table to match actual DB gap names

Methodology:
- Expand methodology.md: pipeline docs, limitations, related work
- Add LLM-as-judge caveats and explicit rating rubric to analyzer.py
- Add clustering threshold rationale to embeddings.py
- Add gap analysis grounding notes to analyzer.py
- Add Limitations section to Post 07

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 11:04:40 +01:00
parent 439424bd04
commit f1a0b0264c
11 changed files with 169 additions and 144 deletions


@@ -1,10 +1,10 @@
# How We Built This: Analyzing 361 IETF Drafts with Claude and Ollama
# How We Built This: Analyzing 434 IETF Drafts with Claude and Ollama
*The engineering behind the analysis -- a Python CLI, two LLMs, one SQLite database, and ~$9.*
---
Every claim in this series -- the 4:1 safety ratio, the 14 competing OAuth proposals, the 18 team blocs, the 12 gaps, the 180 ideas crossing the Chinese-Western divide -- comes from an automated analysis pipeline we built in Python. This post describes how it works, what it costs, what it found that surprised us, and what we learned about LLM-powered document analysis at scale.
Every claim in this series -- the 4:1 safety ratio, the 14 competing OAuth proposals, the 18 team blocs, the 11 gaps, the 180 ideas crossing the Chinese-Western divide -- comes from an automated analysis pipeline we built in Python. This post describes how it works, what it costs, what it found that surprised us, and what we learned about LLM-powered document analysis at scale.
The tool is open source. If you want to run it on a different corner of the IETF -- or adapt it for another standards body -- everything you need is in the repository.
@@ -40,7 +40,7 @@ We search for drafts matching 12 keywords: `agent`, `ai-agent`, `llm`, `autonomo
**Gotchas learned the hard way**: The Datatracker API uses `type__slug=draft` (not `type=draft`) to filter to drafts. Pagination requires tracking `meta.next` through the response chain. Affiliation data comes from the `documentauthor` record, not the `person` record. We add a 0.5-second polite delay between requests.
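A minimal sketch of that fetch loop -- the `type__slug` filter, `meta.next` pagination, and the polite delay (the helper name and keyword handling are illustrative, not the tool's exact code):

```python
import time
import requests

BASE = "https://datatracker.ietf.org"

def fetch_drafts(keyword: str) -> list[dict]:
    """Fetch all drafts matching a title keyword, following meta.next pages."""
    drafts = []
    # Note the filter is type__slug=draft, not type=draft.
    url = f"/api/v1/doc/document/?type__slug=draft&title__icontains={keyword}&format=json"
    while url:
        resp = requests.get(BASE + url, timeout=30)
        resp.raise_for_status()
        page = resp.json()
        drafts.extend(page["objects"])
        url = page["meta"]["next"]  # None on the last page
        time.sleep(0.5)  # polite delay between requests
    return drafts
```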
The result: **361 drafts** fetched, with full metadata and text stored in SQLite.
The result: **434 drafts** fetched, with full metadata and text stored in SQLite.
### Stage 2: Analyze
@@ -59,7 +59,7 @@ We initially sent full draft text to Claude, but switched to abstract-only analy
For similarity analysis, we generate vector embeddings using Ollama running locally with the `nomic-embed-text` model. Each draft's abstract is embedded into a 768-dimensional vector, stored as raw bytes in the database.
**Why not Claude for embeddings?** Cost and speed. Ollama runs locally, is free, and processes all 361 drafts in under a minute. The embeddings are used for approximate similarity (cosine distance), overlap detection, and t-SNE visualization -- tasks where a small local model is perfectly adequate.
**Why not Claude for embeddings?** Cost and speed. Ollama runs locally, is free, and processes all 434 drafts in under a minute. The embeddings are used for approximate similarity (cosine distance), overlap detection, and t-SNE visualization -- tasks where a small local model is perfectly adequate.
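Generating and storing a vector looks roughly like this (the Ollama endpoint and model are as described above; the table and column names are assumptions for illustration):

```python
import sqlite3
import struct
import requests

def embed(text: str) -> list[float]:
    """Embed an abstract with the local nomic-embed-text model (768 dims)."""
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

def store_embedding(db: sqlite3.Connection, name: str, vec: list[float]) -> None:
    """Pack the vector into raw little-endian float32 bytes, stored as a blob."""
    blob = struct.pack(f"<{len(vec)}f", *vec)
    db.execute(
        "INSERT OR REPLACE INTO embeddings (draft_name, vector) VALUES (?, ?)",
        (name, blob),
    )
```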
The embeddings enable:
- **Overlap clusters**: Draft pairs with >0.85 cosine similarity grouped together
@@ -72,7 +72,7 @@ The most expensive stage. Each draft's full text is analyzed by Claude to extrac
**Batch optimization**: Rather than calling Claude once per draft, we batch 5 drafts per API call using Claude Haiku (`--cheap --batch 5`). This cuts the number of API calls by 5x and uses the cheaper model. The batch prompt includes all 5 drafts' texts and asks for ideas from each, reducing per-idea cost to fractions of a cent.
**Result**: **1,780 technical components** extracted from 361 drafts (averaging ~5 per draft). Of 1,692 unique titles, **96% appear in exactly one draft** -- most are draft-specific component descriptions ("Agent Gateway," "Transport Configuration System"), not standalone innovations. Only **75 ideas** show genuine cross-draft convergence (appearing in 2+ drafts), and only **11** appear in 3+ drafts. The real signal comes from the cross-org overlap analysis (idea-overlap feature), which uses fuzzy matching to identify **628 ideas** where 2+ organizations work on recognizably similar problems -- 43% of all unique idea clusters.
**Result**: The current database contains **419 ideas** across 377 drafts. An earlier pipeline run produced roughly 1,780 components from 361 drafts (averaging ~5 per draft). The difference reflects changes in extraction parameters, batching strategy, and deduplication -- a known limitation of LLM-based extraction. What is consistent across both runs: the vast majority of extracted ideas appear in exactly one draft, and most are draft-specific component descriptions rather than standalone innovations. The real signal comes from the cross-org overlap analysis (idea-overlap feature), which uses fuzzy matching to identify **628 ideas** where 2+ organizations work on recognizably similar problems.
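The batching itself is simple prompt assembly -- roughly the sketch below (the model alias, prompt wording, and field names are illustrative; the real prompt lives in the repository):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
HAIKU = "claude-3-5-haiku-latest"  # substitute whichever Haiku alias you use

def extract_ideas_batch(drafts: list[dict]) -> str:
    """One API call covering five drafts instead of five separate calls."""
    corpus = "\n\n".join(
        f"### DRAFT {i}: {d['name']}\n{d['text']}"
        for i, d in enumerate(drafts, start=1)
    )
    prompt = ("For each draft below, list its distinct technical components as "
              "JSON objects with fields: draft, title, type, description.\n\n")
    message = client.messages.create(
        model=HAIKU,
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt + corpus}],
    )
    return message.content[0].text  # parsed and cached downstream
```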
### Stage 5: Gaps
@@ -80,7 +80,7 @@ The gap analysis is a synthesis step. We send Claude Sonnet the full landscape c
This is the one stage where the LLM is doing genuine reasoning, not just extraction. The prompt provides the data; Claude identifies the structural gaps. We validate its findings against the raw data (e.g., confirming that only 6 ideas address error recovery, or that cross-protocol translation has zero ideas).
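The validation is mostly spot-check SQL against the ideas table, along these lines (the database path and column names are illustrative, not the exact schema):

```python
import sqlite3

db = sqlite3.connect("ietf.db")

# Does the raw data back the claimed gap? e.g. how many ideas touch error recovery.
(count,) = db.execute(
    "SELECT COUNT(*) FROM ideas "
    "WHERE title LIKE '%error recovery%' OR description LIKE '%error recovery%'"
).fetchone()
print(f"ideas mentioning error recovery: {count}")
```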
**Result**: **12 gaps** identified (3 critical, 6 high, 3 medium), each cross-referenced with related drafts and ideas.
**Result**: **11 gaps** identified (2 critical, 5 high, 4 medium), each cross-referenced with related drafts and ideas.
### Stage 6: Report
@@ -92,15 +92,15 @@ The SQLite database is the real product. At **28 MB**, it contains everything ne
| Table | Rows | Purpose |
|-------|-----:|---------|
| drafts | 361 | Full metadata + text for every draft |
| ratings | 361 | 5-dimension quality scores + summaries |
| embeddings | 361 | 768-dim vectors as binary blobs |
| ideas | 1,780 | Extracted technical components with types |
| drafts | 434 | Full metadata + text for every draft |
| ratings | 434 | 5-dimension quality scores + summaries |
| embeddings | 434 | 768-dim vectors as binary blobs |
| ideas | 419 | Extracted technical components with types |
| authors | 557 | Person records from Datatracker |
| draft_authors | 1,057 | Author-to-draft linkage with affiliation |
| draft_refs | 4,231 | RFC/draft/BCP cross-references |
| gaps | 12 | Identified standardization gaps |
| llm_cache | 703 | Cached Claude API responses |
| gaps | 11 | Identified standardization gaps |
| llm_cache | 1,397 | Cached Claude API responses |
FTS5 full-text search is enabled on drafts, supporting queries like `ietf search "agent authentication"` that return ranked results in milliseconds. Indexes on `draft_refs(ref_type, ref_id)` and `ideas(draft_name)` keep query performance fast even for cross-table joins.
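Under the hood, `ietf search` is a plain FTS5 match, conceptually like this (the index and column names are assumptions):

```python
import sqlite3

db = sqlite3.connect("ietf.db")  # path is illustrative

# Assumes a one-time index build such as:
#   CREATE VIRTUAL TABLE drafts_fts USING fts5(name, title, text);
#   INSERT INTO drafts_fts SELECT name, title, text FROM drafts;
rows = db.execute(
    "SELECT name, title FROM drafts_fts WHERE drafts_fts MATCH ? "
    "ORDER BY rank LIMIT 10",
    ('"agent authentication"',),
).fetchall()
```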
@@ -132,7 +132,7 @@ Four features were added during the analysis session, each unlocking a deeper an
### RFC Cross-References (`ietf refs`)
**What it does**: Parses all 361 drafts for RFC references using regex (`RFC\s*\d{4,}`, `\[RFC\d+\]`, `BCP\s*\d+`, `draft-[\w-]+`). Stores results in a `draft_refs` table for querying.
**What it does**: Parses all 434 drafts for RFC references using regex (`RFC\s*\d{4,}`, `\[RFC\d+\]`, `BCP\s*\d+`, `draft-[\w-]+`). Stores results in a `draft_refs` table for querying.
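The parsing is a handful of compiled patterns, roughly as follows (the exact pattern set in the tool may differ slightly):

```python
import re

RFC_RE = re.compile(r"\[?RFC\s*(\d{4,})\]?")
BCP_RE = re.compile(r"BCP\s*(\d+)")
DRAFT_RE = re.compile(r"draft-[\w-]+")

def extract_refs(text: str) -> dict[str, set[str]]:
    """Pull RFC, BCP, and draft references out of a draft's full text."""
    return {
        "rfc": {f"RFC{n}" for n in RFC_RE.findall(text)},
        "bcp": {f"BCP{n}" for n in BCP_RE.findall(text)},
        "draft": set(DRAFT_RE.findall(text)),
    }
```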
**What it found**: **4,231 cross-references** (2,443 RFC, 698 draft, 1,090 BCP) across 360 drafts with text. The most-referenced standards reveal what the agent ecosystem builds on:
@@ -160,15 +160,15 @@ Four features were added during the analysis session, each unlocking a deeper an
**What it does**: Groups similar ideas using `SequenceMatcher` (threshold 0.75), then checks which ideas span drafts from multiple organizations. This separates genuine cross-org consensus from intra-team duplication.
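A sketch of the grouping step as greedy clustering over title similarity (the real implementation may normalize titles differently):

```python
from difflib import SequenceMatcher

def similar(a: str, b: str, threshold: float = 0.75) -> bool:
    """The fuzzy-title test: similarity ratio at or above the threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def cluster_titles(titles: list[str]) -> list[list[str]]:
    """Greedy single-pass clustering: each title joins the first matching cluster."""
    clusters: list[list[str]] = []
    for title in titles:
        for cluster in clusters:
            if similar(title, cluster[0]):
                cluster.append(title)
                break
        else:
            clusters.append([title])
    return clusters
```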
**What it found**: By exact title, only 75 of 1,692 unique ideas appear in 2+ drafts -- 96% are islands. But fuzzy matching reveals **628 ideas** where 2+ organizations work on recognizably similar problems (43% of unique clusters). The top convergence signal -- "A2A Communication Paradigm" -- spans **8 organizations from 4 countries**. The deeper finding: **180 ideas cross the Chinese-Western organizational divide**. European telecoms (Deutsche Telekom, Telefonica, Orange) act as bridges between Chinese institutions and Western companies. US Big Tech (Google, Apple, Amazon) is almost entirely absent from cross-divide collaboration.
**What it found**: By exact title, the vast majority of unique ideas appear in only a single draft. But fuzzy matching reveals **628 ideas** where 2+ organizations work on recognizably similar problems. The top convergence signal -- "A2A Communication Paradigm" -- spans **8 organizations from 5 countries**. The deeper finding: **180 ideas cross the Chinese-Western organizational divide**. European telecoms (Deutsche Telekom, Telefonica, Orange) act as bridges between Chinese institutions and Western companies. US Big Tech (Google, Apple, Amazon) is almost entirely absent from cross-divide collaboration.
### WG Adoption Status (`ietf status`)
**What it does**: Determines which drafts have been formally adopted by IETF Working Groups based on the `draft-ietf-{wg}-*` naming convention. Compares scores, categories, and gap coverage between WG-adopted and individual drafts.
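The adoption check is just the naming convention, as below (the draft names in the comments are illustrative):

```python
import re

WG_RE = re.compile(r"^draft-ietf-([a-z0-9]+)-")

def working_group(draft_name: str) -> str | None:
    """Return the WG slug for adopted drafts, None for individual submissions."""
    m = WG_RE.match(draft_name)
    return m.group(1) if m else None

# working_group("draft-ietf-tls-example-00") -> "tls"
# working_group("draft-doe-agent-discovery-00") -> None  (individual submission)
```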
**What it found**: Only **36 of 361 drafts (10%)** are WG-adopted. The remaining 90% are individual submissions -- ideas seeking institutional backing. WG-adopted drafts score slightly higher on average (**3.54 vs 3.31**), validating our rating methodology.
**What it found**: **52 of 434 drafts (12%)** are WG-adopted. The remaining 88% are individual submissions -- ideas seeking institutional backing. WG-adopted drafts score slightly higher on average (**3.61 vs 3.23**), validating our rating methodology.
The most revealing finding: **19 of 36 WG-adopted drafts are in security Working Groups** (lamps, lake, tls, emu, ace). The agent-focused `aipref` WG has only 2 adopted drafts. The IETF is not building agent standards in agent-focused groups -- it is retrofitting its existing security infrastructure for agent use cases. The standards that will actually govern AI agents on the internet are being written by the same people who write TLS and OAuth, not by new agent-specific working groups.
The most revealing finding: **a majority of WG-adopted drafts are in security Working Groups** (lamps, lake, tls, emu, ace). The agent-focused `aipref` WG has only 2 adopted drafts. The IETF is not building agent standards in agent-focused groups -- it is retrofitting its existing security infrastructure for agent use cases. The standards that will actually govern AI agents on the internet are being written by the same people who write TLS and OAuth, not by new agent-specific working groups.
## What We Learned
@@ -202,16 +202,16 @@ The most valuable output is not any single report -- it is the SQLite database.
|-------|-------|-------:|-----:|
| Analyze | Claude Sonnet | 260 | ~$2.50 |
| Analyze | Claude Sonnet | 101 | ~$5.50 |
| Ideas | Claude Haiku (batch 5) | 361 | ~$0.80 |
| Ideas | Claude Haiku (batch 5) | 434 | ~$0.80 |
| Gaps | Claude Sonnet | 1 call | ~$0.20 |
| Embed | Ollama (local) | 361 | $0.00 |
| Refs | Regex (local) | 361 | $0.00 |
| Trends | SQL (local) | 361 | $0.00 |
| Idea-overlap | SequenceMatcher (local) | 1,780 ideas | $0.00 |
| WG Status | Naming convention | 361 | $0.00 |
| Embed | Ollama (local) | 434 | $0.00 |
| Refs | Regex (local) | 434 | $0.00 |
| Trends | SQL (local) | 434 | $0.00 |
| Idea-overlap | SequenceMatcher (local) | 419 ideas | $0.00 |
| WG Status | Naming convention | 434 | $0.00 |
| **Total** | | | **~$9** |
For context: analyzing 361 IETF drafts -- fetching full text, rating quality on 5 dimensions, extracting ~1,700 technical components, detecting 12 gaps, mapping 557 authors, parsing 4,231 cross-references, and identifying 18 team blocs -- cost less than two large coffees.
For context: analyzing 434 IETF drafts -- fetching full text, rating quality on 5 dimensions, extracting 419 technical ideas, detecting 11 gaps, mapping 557 authors, parsing 4,231 cross-references, and identifying 18 team blocs -- cost less than two large coffees.
## The Tech Stack