How We Built This: Analyzing 361 IETF Drafts with Claude and Ollama

The engineering behind the analysis -- a Python CLI, two LLMs, one SQLite database, and ~$9.


Every claim in this series -- the 4:1 safety ratio, the 14 competing OAuth proposals, the 18 team blocs, the 12 gaps, the 180 ideas crossing the Chinese-Western divide -- comes from an automated analysis pipeline we built in Python. This post describes how it works, what it costs, what it found that surprised us, and what we learned about LLM-powered document analysis at scale.

The tool is open source. If you want to run it on a different corner of the IETF -- or adapt it for another standards body -- everything you need is in the repository.

The Pipeline

The analysis runs in six core stages. Each builds on the previous, and every stage caches its work so re-runs are fast and cheap.

fetch --> analyze --> embed --> ideas --> gaps --> report
  |         |          |        |        |         |
  v         v          v        v        v         v
Datatracker  Claude     Ollama   Claude   Claude   Markdown
  API       Sonnet   nomic-embed  Haiku   Sonnet   + rich

Four additional analysis passes run on top of the core pipeline:

refs --> trends --> idea-overlap --> status
  |        |           |              |
  v        v           v              v
Regex    SQL query   SequenceMatcher  Naming convention
(local)  (local)     (local)          (local)

These secondary passes cost nothing -- they operate entirely on data already in the database.

Stage 1: Fetch

The Datatracker API (https://datatracker.ietf.org/api/v1/doc/document/) provides structured metadata for every Internet-Draft: name, title, abstract, authors, revision, submission date, working group, and current status. Full text is available at https://www.ietf.org/archive/id/{name}-{rev}.txt.

We search for drafts matching 12 keywords: agent, ai-agent, llm, autonomous, machine-learning, artificial-intelligence, mcp, agentic, inference, generative, intelligent, aipref. Both name__contains and abstract__contains filters are used to cast a wide net. We started with 6 keywords and 260 drafts; adding 6 more captured 101 new drafts in categories we were missing -- MCP-related work, generative AI infrastructure, intelligent networking, and the nascent aipref working group.

Gotchas learned the hard way: The Datatracker API uses type__slug=draft (not type=draft) to filter to drafts. Pagination requires tracking meta.next through the response chain. Affiliation data comes from the documentauthor record, not the person record. We add a 0.5-second polite delay between requests.

The result: 361 drafts fetched, with full metadata and text stored in SQLite.

Stage 2: Analyze

Each draft is sent to Claude Sonnet with a compact structured prompt that includes the draft name, title, date, page count, and abstract. The prompt asks for:

  • Category classification (one or more of 11 categories: A2A protocols, agent identity/auth, autonomous netops, data formats/interop, agent discovery/reg, human-agent interaction, AI safety/alignment, ML traffic management, policy/governance, model serving/inference, other)
  • Quality rating on five dimensions (novelty, maturity, overlap, momentum, relevance) each scored 1-5
  • Brief summary of what the draft does and why it matters
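
For concreteness, a parsed response might look roughly like this (the field names are illustrative, not the project's exact schema):

```python
# Hypothetical shape of one parsed analysis result -- field names are
# illustrative; the actual prompt and schema live in the repository.
analysis = {
    "categories": ["agent identity/auth", "A2A protocols"],
    "ratings": {  # each dimension scored 1-5
        "novelty": 4,
        "maturity": 2,
        "overlap": 3,
        "momentum": 4,
        "relevance": 5,
    },
    "summary": "Defines a token-exchange profile so agents can act "
               "on a user's behalf with scoped, auditable credentials.",
}
```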

The key optimization: caching. Every Claude API call is stored in an llm_cache table keyed by the SHA-256 hash of the full prompt. If the same draft is analyzed twice, the second call is free and instant. This makes the pipeline idempotent -- you can re-run any stage without wasting money.
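
A minimal sketch of that pattern, assuming a sqlite3 connection, an anthropic.Anthropic() client, and an llm_cache table with key/response columns (the names are illustrative, not the project's exact code):

```python
import hashlib
import sqlite3

def cached_complete(conn: sqlite3.Connection, client, model: str, prompt: str) -> str:
    """Return the LLM response for a prompt, hitting the API only on a cache miss."""
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    row = conn.execute("SELECT response FROM llm_cache WHERE key = ?", (key,)).fetchone()
    if row:
        return row[0]  # cache hit: free and instant
    msg = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    text = msg.content[0].text
    conn.execute("INSERT INTO llm_cache (key, response) VALUES (?, ?)", (key, text))
    conn.commit()
    return text
```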

We initially sent full draft text to Claude, but switched to abstract-only analysis after testing showed that abstracts produce equivalent ratings at roughly 10x lower token cost. Full text is still used for idea extraction (Stage 4), where granular detail matters.

Cost: About $3.16 for the initial 260 drafts on Claude Sonnet (376K input tokens, 200K output tokens). With the --cheap flag, analysis uses Claude Haiku instead, cutting costs roughly 10x.

Stage 3: Embed

For similarity analysis, we generate vector embeddings using Ollama running locally with the nomic-embed-text model. Each draft's abstract is embedded into a 768-dimensional vector, stored as raw bytes in the database.

Why not Claude for embeddings? Cost and speed. Ollama runs locally, is free, and processes all 361 drafts in under a minute. The embeddings are used for approximate similarity (cosine distance), overlap detection, and t-SNE visualization -- tasks where a small local model is perfectly adequate.
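
A rough sketch of the embedding step and the similarity math, assuming the ollama Python client (the helper names are ours, not the project's):

```python
import numpy as np
import ollama

def embed_abstract(text: str) -> np.ndarray:
    """Embed one abstract into a 768-dim vector via local nomic-embed-text."""
    resp = ollama.embeddings(model="nomic-embed-text", prompt=text)
    return np.array(resp["embedding"], dtype=np.float32)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Vectors round-trip through SQLite as raw bytes:
blob = embed_abstract("This document specifies ...").tobytes()
restored = np.frombuffer(blob, dtype=np.float32)  # 768 floats again
```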

The embeddings enable:

  • Overlap clusters: Draft pairs with >0.85 cosine similarity grouped together
  • Duplicate detection: 25+ pairs with >0.98 similarity flagged as functional duplicates
  • Interactive t-SNE landscape: 2D visualization of the entire draft space, color-coded by category

Stage 4: Ideas

The most expensive stage. Each draft's full text is analyzed by Claude to extract discrete technical ideas -- mechanisms, architectures, protocols, patterns, extensions, and requirements.

Batch optimization: Rather than calling Claude once per draft, we batch 5 drafts per API call using Claude Haiku (--cheap --batch 5). This cuts the number of API calls by 5x and uses the cheaper model. The batch prompt includes all 5 drafts' texts and asks for ideas from each, reducing per-idea cost to fractions of a cent.
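
The batching itself is trivial -- a sketch (the prompt wording and field names are hypothetical):

```python
def batch_prompts(drafts: list[dict], batch_size: int = 5):
    """Yield (group, combined_prompt) pairs, one API call per group."""
    for i in range(0, len(drafts), batch_size):
        group = drafts[i : i + batch_size]
        parts = [f"### Draft {n + 1}: {d['name']}\n{d['text']}"
                 for n, d in enumerate(group)]
        yield group, (
            "For each draft below, extract its discrete technical ideas "
            "(mechanisms, architectures, protocols, patterns):\n\n"
            + "\n\n".join(parts)
        )
```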

Result: 1,780 technical components extracted from 361 drafts (averaging ~5 per draft). Of 1,692 unique titles, 96% appear in exactly one draft -- most are draft-specific component descriptions ("Agent Gateway," "Transport Configuration System"), not standalone innovations. Only 75 ideas show genuine cross-draft convergence (appearing in 2+ drafts), and only 11 appear in 3+ drafts. The real signal comes from the cross-org overlap analysis (idea-overlap feature), which uses fuzzy matching to identify 628 ideas where 2+ organizations work on recognizably similar problems -- 43% of all unique idea clusters.

Stage 5: Gaps

The gap analysis is a synthesis step. We send Claude Sonnet the full landscape context -- category distributions, idea taxonomy, safety ratio, overlap patterns -- and ask it to identify areas where standardization work is missing or inadequate.

This is the one stage where the LLM is doing genuine reasoning, not just extraction. The prompt provides the data; Claude identifies the structural gaps. We validate its findings against the raw data (e.g., confirming that only 6 ideas address error recovery, or that cross-protocol translation has zero ideas).

Result: 12 gaps identified (3 critical, 6 high, 3 medium), each cross-referenced with related drafts and ideas.

Stage 6: Report

Reports are generated in Markdown with embedded data tables. Fifteen report types are available, including overview, landscape, digest, timeline, overlap-matrix, overlap-clusters, authors, ideas, gaps, refs, trends, idea-overlap, and status. The rich library provides formatted terminal output for CLI commands.

The Database

The SQLite database is the real product. At 28 MB, it contains everything needed to reproduce any finding in this series.

| Table | Rows | Purpose |
|-------|------|---------|
| drafts | 361 | Full metadata + text for every draft |
| ratings | 361 | 5-dimension quality scores + summaries |
| embeddings | 361 | 768-dim vectors as binary blobs |
| ideas | 1,780 | Extracted technical components with types |
| authors | 557 | Person records from Datatracker |
| draft_authors | 1,057 | Author-to-draft linkage with affiliation |
| draft_refs | 4,231 | RFC/draft/BCP cross-references |
| gaps | 12 | Identified standardization gaps |
| llm_cache | 703 | Cached Claude API responses |

FTS5 full-text search is enabled on drafts, supporting queries like ietf search "agent authentication" that return ranked results in milliseconds. Indexes on draft_refs(ref_type, ref_id) and ideas(draft_name) keep query performance fast even for cross-table joins.
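
Through Python's sqlite3, such a search looks roughly like this (the FTS table name is illustrative):

```python
import sqlite3

conn = sqlite3.connect("ietf.db")
rows = conn.execute(
    """
    SELECT name, title, bm25(drafts_fts) AS rank
    FROM drafts_fts
    WHERE drafts_fts MATCH ?
    ORDER BY rank
    LIMIT 10
    """,
    ('"agent authentication"',),  # quoted = FTS5 phrase query
).fetchall()
# bm25() returns more-negative scores for better matches,
# so ascending order puts the best hits first.
```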

The database design follows a principle: store raw data, compute derived data. The drafts table stores full text; the ratings, ideas, and refs tables store analysis results. Any analysis can be re-run without re-fetching from the Datatracker API.

The Author Network

The author analysis deserves special mention because it revealed the team bloc pattern -- one of the most important findings in the series.

The IETF Datatracker provides author information via two API endpoints:

  • /api/v1/doc/documentauthor/?document__name=X -- returns author links per draft
  • /api/v1/person/person/{id}/ -- returns person details (name, affiliation)

We fetch all authors for all drafts, build a co-authorship graph, and detect team blocs: groups where every pair of members shares at least 70% of their drafts. This threshold was chosen empirically -- lower thresholds produce too many loose groups; higher thresholds miss real teams.

The detection algorithm:

  1. For each pair of authors, calculate pairwise overlap = |shared drafts| / min(|A's drafts|, |B's drafts|)
  2. Build a graph where edges represent pairs with >= 70% overlap and >= 2 shared drafts
  3. Find connected components in this graph
  4. Each component is a team bloc
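
A compact sketch of the whole procedure, given a mapping from each author to the set of drafts they wrote (function and variable names are ours):

```python
from itertools import combinations

def team_blocs(drafts_by: dict[str, set[str]],
               min_overlap: float = 0.70, min_shared: int = 2) -> list[set[str]]:
    """Connected components of the >=70%-overlap co-authorship graph."""
    adj: dict[str, set[str]] = {a: set() for a in drafts_by}
    for a, b in combinations(drafts_by, 2):
        shared = drafts_by[a] & drafts_by[b]
        denom = min(len(drafts_by[a]), len(drafts_by[b]))
        if len(shared) >= min_shared and len(shared) / denom >= min_overlap:
            adj[a].add(b)
            adj[b].add(a)
    blocs, seen = [], set()
    for start in adj:
        if start in seen or not adj[start]:
            continue  # skip authors with no qualifying edges
        comp, stack = set(), [start]
        while stack:  # iterative DFS collects one component
            node = stack.pop()
            if node not in comp:
                comp.add(node)
                stack.extend(adj[node] - comp)
        seen |= comp
        blocs.append(comp)
    return blocs
```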

Organization normalization turned out to be essential. "Huawei Technologies", "Huawei Technologies Co., Ltd.", and "Huawei Canada" all need to resolve to "Huawei". We maintain a hand-curated alias table of 40+ mappings plus automatic suffix stripping for common patterns (", Inc.", " LLC", " AB", etc.). Without this, cross-org analysis would fragment the same company into multiple entities.
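
A condensed sketch of that normalization (the aliases shown are the examples from the text; the real table has 40+ entries):

```python
import re

ALIASES = {
    "Huawei Technologies": "Huawei",
    "Huawei Technologies Co., Ltd.": "Huawei",
    "Huawei Canada": "Huawei",
    # ... 40+ hand-curated mappings in the real table
}
SUFFIXES = re.compile(r",?\s+(Inc\.?|LLC|Ltd\.?|AB|Co\.?)$")

def normalize_org(raw: str) -> str:
    name = raw.strip()
    if name in ALIASES:
        return ALIASES[name]
    prev = None
    while prev != name:  # strip stacked suffixes like "Co., Ltd."
        prev, name = name, SUFFIXES.sub("", name).rstrip(",. ")
    return ALIASES.get(name, name)
```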

Result: 18 team blocs detected among 557 authors. The largest: a 13-person Huawei team with 22 shared drafts and 94% average cohesion.

The New Features

Four features were added during the analysis session, each unlocking a deeper analytical layer. All four run locally with zero API cost.

RFC Cross-References (ietf refs)

What it does: Parses all 361 drafts for RFC references using regex (RFC\s*\d{4,}, \[RFC\d+\], BCP\s*\d+, draft-[\w-]+). Stores results in a draft_refs table for querying.
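
The parsing is a few standard-library regexes (a sketch using the patterns above):

```python
import re

RFC_RE   = re.compile(r"RFC\s*\d{4,}")
RFC_BRK  = re.compile(r"\[RFC\d+\]")
BCP_RE   = re.compile(r"BCP\s*\d+")
DRAFT_RE = re.compile(r"draft-[\w-]+")

def extract_refs(text: str) -> dict[str, set[str]]:
    """Collect unique RFC/BCP/draft reference strings from one draft's text."""
    norm = lambda m: re.sub(r"\s+", "", m).strip("[]")  # "RFC 2119" -> "RFC2119"
    return {
        "rfc": {norm(m) for m in RFC_RE.findall(text) + RFC_BRK.findall(text)},
        "bcp": {norm(m) for m in BCP_RE.findall(text)},
        "draft": set(DRAFT_RE.findall(text)),
    }
```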

What it found: 4,231 cross-references (2,443 RFC, 698 draft, 1,090 BCP) across 360 drafts with text. The most-referenced standards reveal what the agent ecosystem builds on:

| RFC | References | What It Is |
|-----|------------|------------|
| RFC 2119 | 285 | MUST/SHALL/MAY conventions |
| RFC 8174 | 237 | Key words update |
| RFC 8446 | 42 | TLS 1.3 |
| RFC 6749 | 36 | OAuth 2.0 |
| RFC 9110 | 34 | HTTP Semantics |
| RFC 8259 | 26 | JSON |
| RFC 5280 | 22 | X.509 Certificates |
| RFC 7519 | 22 | JWT |
| RFC 9052 | 20 | COSE |

The insight: Strip away RFC 2119/8174 (boilerplate conventions that every IETF draft references) and the picture is clear: the agent ecosystem is built on OAuth + TLS + HTTP + JWT. It is a security and identity infrastructure, not a networking infrastructure. The IETF's agent standards are being constructed on the same foundation as the web itself. This reframes the entire landscape: agent standards are not something new. They are the next layer on top of the web's existing security architecture.

Submission Trends (ietf trends)

What it does: Monthly breakdown of new drafts per category with growth rates, comparing recent periods to earlier ones.

What it found: The growth curve is a step function. Monthly submissions went from 2 (Jun 2025) to 67 (Oct 2025) to 86 (Feb 2026). A2A protocols are still accelerating (26 in Oct/Nov 2025, 36 in Feb 2026). Safety/alignment is growing but slower (5 in Oct 2025, 12 in Feb 2026). The 4:1 ratio is narrowing, but not fast enough.

Cross-Org Idea Overlap (ietf idea-overlap)

What it does: Groups similar ideas using SequenceMatcher (threshold 0.75), then checks which ideas span drafts from multiple organizations. This separates genuine cross-org consensus from intra-team duplication.
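
A sketch of the fuzzy grouping using only the standard library (a greedy single-pass clustering; the actual implementation may differ):

```python
from difflib import SequenceMatcher

def group_ideas(titles: list[str], threshold: float = 0.75) -> list[list[str]]:
    """Greedily cluster idea titles whose similarity ratio meets the threshold."""
    clusters: list[list[str]] = []
    for title in titles:
        for cluster in clusters:
            if SequenceMatcher(None, title.lower(),
                               cluster[0].lower()).ratio() >= threshold:
                cluster.append(title)
                break
        else:  # no existing cluster matched: start a new one
            clusters.append([title])
    return clusters
```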

What it found: By exact title, only 75 of 1,692 unique ideas appear in 2+ drafts -- 96% are islands. But fuzzy matching reveals 628 ideas where 2+ organizations work on recognizably similar problems (43% of unique clusters). The top convergence signal -- "A2A Communication Paradigm" -- spans 8 organizations from 4 countries. The deeper finding: 180 ideas cross the Chinese-Western organizational divide. European telecoms (Deutsche Telekom, Telefonica, Orange) act as bridges between Chinese institutions and Western companies. US Big Tech (Google, Apple, Amazon) is almost entirely absent from cross-divide collaboration.

WG Adoption Status (ietf status)

What it does: Determines which drafts have been formally adopted by IETF Working Groups based on the draft-ietf-{wg}-* naming convention. Compares scores, categories, and gap coverage between WG-adopted and individual drafts.
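
The adoption check itself is a one-liner on the draft name (a sketch):

```python
def adopting_wg(name: str) -> str | None:
    """Return the WG for draft-ietf-{wg}-* names, else None (individual draft)."""
    parts = name.split("-")
    if len(parts) >= 3 and parts[:2] == ["draft", "ietf"]:
        return parts[2]  # e.g. "draft-ietf-oauth-..." -> "oauth"
    return None
```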

What it found: Only 36 of 361 drafts (10%) are WG-adopted. The remaining 90% are individual submissions -- ideas seeking institutional backing. WG-adopted drafts score slightly higher on average (3.54 vs 3.31), validating our rating methodology.

The most revealing finding: 19 of 36 WG-adopted drafts are in security Working Groups (lamps, lake, tls, emu, ace). The agent-focused aipref WG has only 2 adopted drafts. The IETF is not building agent standards in agent-focused groups -- it is retrofitting its existing security infrastructure for agent use cases. The standards that will actually govern AI agents on the internet are being written by the same people who write TLS and OAuth, not by new agent-specific working groups.

What We Learned

LLMs are good at structured extraction

Claude's strength in this pipeline is turning unstructured technical documents into structured data: categories, ratings, ideas, gaps. The extraction quality is high -- we spot-checked 50 drafts and found categorization and idea extraction accurate in ~90% of cases. The errors tend to be over-categorization (assigning too many categories) rather than miscategorization.

LLMs need validation for synthesis

The gap analysis (Stage 5) required the most human oversight. Claude correctly identified the gaps, but the severity rankings and the "zero ideas" claims needed manual verification against the raw data. LLMs can synthesize, but the synthesis should be treated as a hypothesis, not a conclusion.

Caching changes the economics

The llm_cache table transforms the cost model. The first run costs ~$3. Every subsequent run -- adding new drafts, re-running with different prompts, regenerating reports -- incurs costs only for the new work. Over the project's life, we estimate caching saved $30+ in redundant API calls. The cache key is a SHA-256 hash of the full prompt, so identical prompts always hit the cache and collisions are a practical non-issue.

Hybrid models work

Using Claude Sonnet for reasoning-heavy tasks (analysis, gap synthesis) and Claude Haiku for extraction-heavy tasks (idea extraction, batch processing) cut costs by 5-10x without meaningful quality loss. Using Ollama for embeddings made similarity analysis free and fast. The principle: match the model's capability to the task's difficulty.

The free analyses are the most revealing

The four features that cost zero API dollars -- regex-based RFC parsing, SQL-based trend analysis, SequenceMatcher-based idea dedup, and naming-convention-based WG detection -- produced some of the most narratively important findings in the entire series. The OAuth-stack-as-foundation insight from RFC cross-references. The 180 cross-divide ideas. The 10% WG adoption rate. The security-WG-not-agent-WG finding. None of these required an LLM. They required a well-structured database and the right questions.

The database is the product

The most valuable output is not any single report -- it is the SQLite database. With all drafts analyzed, ideas extracted, authors mapped, refs parsed, and embeddings stored, the database supports ad-hoc queries that no pre-built report can anticipate. The blog series was written primarily by querying the database, not by re-running the pipeline.

Cost Summary

| Stage | Model | Drafts | Cost |
|-------|-------|--------|------|
| Analyze | Claude Sonnet | 260 | ~$2.50 |
| Analyze | Claude Sonnet | 101 | ~$5.50 |
| Ideas | Claude Haiku (batch 5) | 361 | ~$0.80 |
| Gaps | Claude Sonnet | 1 call | ~$0.20 |
| Embed | Ollama (local) | 361 | $0.00 |
| Refs | Regex (local) | 361 | $0.00 |
| Trends | SQL (local) | 361 | $0.00 |
| Idea-overlap | SequenceMatcher (local) | 1,780 ideas | $0.00 |
| WG Status | Naming convention (local) | 361 | $0.00 |
| Total | | | ~$9 |

For context: analyzing 361 IETF drafts -- fetching full text, rating quality on 5 dimensions, extracting 1,780 technical components, detecting 12 gaps, mapping 557 authors, parsing 4,231 cross-references, and identifying 18 team blocs -- cost less than two large coffees.

The Tech Stack

  • Python 3.11+ with Click for the CLI
  • SQLite with FTS5 for full-text search
  • httpx for HTTP requests (Datatracker API)
  • anthropic SDK for Claude API
  • ollama for local embeddings
  • rich for terminal formatting
  • numpy for cosine similarity and matrix operations

43 CLI commands, 13+ interactive visualizations (HTML/PNG), 15 report types. Total codebase: approximately 6,100 lines of Python across 12 modules.


Limitations

This analysis is exploratory, not peer-reviewed research. Several methodological limitations should be understood when interpreting the results:

LLM-as-Judge ratings: All quality ratings are generated by Claude Sonnet from draft abstracts (not full text), with no human calibration. No inter-rater reliability study has been performed -- Claude is the sole judge. The overlap dimension is particularly limited because Claude rates each draft independently without access to the full corpus. Scores should be treated as relative rankings within this corpus, not absolute quality measures.

Keyword-based corpus selection: The 12 search keywords cast a wide net but introduce both false positives (drafts about "user agents" or "autonomous systems" unrelated to AI) and false negatives (relevant drafts using terminology we did not search for). We estimate 30-50 false positives remain in the corpus. The relevance rating partially mitigates this, but the LLM judge is generous with relevance for keyword-matched drafts.

Clustering thresholds: The 0.85 cosine similarity threshold for topical clusters, 0.90 for near-duplicates, and 0.98 for functional duplicates are empirical choices based on manual inspection, not derived from a principled analysis. The embedding model (nomic-embed-text) is general-purpose, not fine-tuned for standards documents. A sensitivity analysis across thresholds would strengthen confidence.

Gap analysis: The gap identification is a single-shot LLM analysis based on compressed landscape statistics, not a systematic comparison against a reference architecture. Gap severity is assigned by Claude without defined thresholds. The gaps should be treated as hypotheses for expert validation, not definitive findings.

Idea extraction quality: Batch extraction (Haiku, abstract-only at 800 chars) produces different results than individual extraction (Sonnet, abstract + full text). No precision/recall measurement has been performed. The extraction prompt instructs Claude to return 1-4 ideas per draft, which may under-count contributions from comprehensive drafts.

Abstract-only analysis: Ratings are based on abstracts truncated to 2000 characters. For maturity assessment in particular, the abstract is an imperfect proxy for the full document's technical depth.

For full methodology documentation, see data/reports/methodology.md in the project repository.


Key Takeaways

  • The full analysis cost ~$9 -- LLM-powered document analysis at scale is practical and cheap with proper caching and model selection
  • Caching is essential: SHA-256 hashed prompt caching makes the pipeline idempotent and dramatically reduces costs on re-runs
  • Hybrid LLM strategy: Claude Sonnet for reasoning, Claude Haiku for extraction (10x cheaper), Ollama for embeddings (free) -- match model capability to task difficulty
  • The zero-cost analyses were the most revealing: RFC cross-references, idea overlap, WG adoption, and trend analysis all run locally and produced the series' most important structural findings
  • The database is the product: a well-structured SQLite DB supports queries no pre-built report anticipates; the blog series was written by querying, not re-running

Next in this series: Agents Building the Agent Analysis -- we used a team of AI agents to produce this series. The irony is the point.


The IETF Draft Analyzer is open source. The codebase, database, and all reports are available in the project repository.