<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>How We Built This: Analyzing 434 IETF Drafts with Claude and Ollama — IETF AI Agent Analysis</title>
<link rel="stylesheet" href="/blog/css/style.css">
</head>
<body>
<div class="container">
<nav><a href="/blog/" class="site-title">IETF AI Agent Analysis</a>
<a href="/blog/posts/01-gold-rush.html">Rush</a>
<a href="/blog/posts/02-who-writes-the-rules.html">Rules</a>
<a href="/blog/posts/03-oauth-wars.html">Wars</a>
<a href="/blog/posts/04-what-nobody-builds.html">Builds</a>
<a href="/blog/posts/05-1262-ideas.html">Converge</a>
<a href="/blog/posts/06-big-picture.html">Picture</a>
<strong>This</strong>
<a href="/blog/posts/08-agents-building-the-analysis.html">Analysis</a></nav>
<h1 id="how-we-built-this-analyzing-434-ietf-drafts-with-claude-and-ollama">How We Built This: Analyzing 434 IETF Drafts with Claude and Ollama</h1>
<p><em>The engineering behind the analysis -- a Python CLI, two LLMs, one SQLite database, and ~$9.</em></p>
<p><em>Analysis based on IETF Datatracker data collected through March 2026. Counts and statistics reflect this snapshot.</em></p>
<hr />
<p>Every claim in this series -- the safety ratio (averaging ~4:1 but varying from 1.5:1 to 21:1 month-to-month), the 14 competing OAuth proposals, the 18 team blocs, the 11 gaps, the 180 ideas crossing the Chinese-Western divide -- comes from an automated analysis pipeline we built in Python. This post describes how it works, what it costs, what it found that surprised us, and what we learned about LLM-powered document analysis at scale.</p>
<p>The tool is open source. If you want to run it on a different corner of the IETF -- or adapt it for another standards body -- everything you need is in the repository.</p>
<h2 id="the-pipeline">The Pipeline</h2>
<p>The analysis runs in six core stages. Each builds on the previous, and every stage caches its work so re-runs are fast and cheap.</p>
<pre><code>fetch --&gt; analyze --&gt; embed --&gt; ideas --&gt; gaps --&gt; report
  |          |          |         |        |         |
  v          v          v         v        v         v
Datatracker Claude    Ollama    Claude   Claude   Markdown
API         Sonnet    nomic-embed Haiku  Sonnet   + rich
</code></pre>
<p>Three additional analysis passes run on top of the core pipeline:</p>
<pre><code>refs --&gt; trends --&gt; idea-overlap --&gt; status
  |        |             |             |
  v        v             v             v
Regex  SQL query  SequenceMatcher  Naming convention
(local) (local)       (local)            (local)
</code></pre>
<p>These secondary passes cost nothing -- they operate entirely on data already in the database.</p>
<h3 id="stage-1-fetch">Stage 1: Fetch</h3>
<p>The Datatracker API (<code>https://datatracker.ietf.org/api/v1/doc/document/</code>) provides structured metadata for every Internet-Draft: name, title, abstract, authors, revision, submission date, working group, and current status. Full text is available at <code>https://www.ietf.org/archive/id/{name}-{rev}.txt</code>.</p>
<p>We search for drafts matching 12 keywords: <code>agent</code>, <code>ai-agent</code>, <code>llm</code>, <code>autonomous</code>, <code>machine-learning</code>, <code>artificial-intelligence</code>, <code>mcp</code>, <code>agentic</code>, <code>inference</code>, <code>generative</code>, <code>intelligent</code>, <code>aipref</code>. Both <code>name__contains</code> and <code>abstract__contains</code> filters are used to cast a wide net. We started with 6 keywords and 260 drafts; adding 6 more captured 101 new drafts in categories we were missing -- MCP-related work, generative AI infrastructure, intelligent networking, and the nascent <code>aipref</code> working group.</p>
<p><strong>Gotchas learned the hard way</strong>: The Datatracker API uses <code>type__slug=draft</code> (not <code>type=draft</code>) to filter to drafts. Pagination requires tracking <code>meta.next</code> through the response chain. Affiliation data comes from the <code>documentauthor</code> record, not the <code>person</code> record. We add a 0.5-second polite delay between requests.</p>
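<p>A minimal sketch of the fetch loop under those constraints -- querying a single <code>name__contains</code> filter for brevity; the helper name and response handling are ours, not copied from the repository:</p>
<pre><code>import time

import httpx

BASE = "https://datatracker.ietf.org"

def fetch_drafts(keyword: str) -&gt; list[dict]:
    """Page through every draft whose name contains `keyword`."""
    results: list[dict] = []
    url = (f"/api/v1/doc/document/?type__slug=draft"
           f"&amp;name__contains={keyword}&amp;limit=100")
    with httpx.Client(base_url=BASE, timeout=30) as client:
        while url:
            data = client.get(url).json()
            results.extend(data["objects"])
            url = data["meta"]["next"]  # None on the last page
            time.sleep(0.5)             # polite delay between requests
    return results
</code></pre>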
<p>The result: <strong>434 drafts</strong> fetched, with full metadata and text stored in SQLite.</p>
<h3 id="stage-2-analyze">Stage 2: Analyze</h3>
<p>Each draft is sent to Claude Sonnet with a compact structured prompt that includes the draft name, title, date, page count, and abstract. The prompt asks for:</p>
<ul>
<li><strong>Category classification</strong> (one or more of 11 categories: A2A protocols, agent identity/auth, autonomous netops, data formats/interop, agent discovery/reg, human-agent interaction, AI safety/alignment, ML traffic management, policy/governance, model serving/inference, other)</li>
<li><strong>Quality rating</strong> on five dimensions (novelty, maturity, overlap, momentum, relevance), each scored 1-5</li>
<li><strong>Brief summary</strong> of what the draft does and why it matters</li>
</ul>
<p>The key optimization: <strong>caching</strong>. Every Claude API call is stored in an <code>llm_cache</code> table keyed by the SHA-256 hash of the full prompt. If the same draft is analyzed twice, the second call is free and instant. This makes the pipeline idempotent -- you can re-run any stage without wasting money.</p>
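<p>A sketch of the cache lookup, assuming an illustrative two-column schema (the repository's actual column names may differ):</p>
<pre><code>import hashlib
import sqlite3

def cached_call(db: sqlite3.Connection, prompt: str, call_llm) -&gt; str:
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    row = db.execute(
        "SELECT response FROM llm_cache WHERE prompt_hash = ?", (key,)
    ).fetchone()
    if row:                      # cache hit: free and instant
        return row[0]
    response = call_llm(prompt)  # cache miss: one paid API call
    db.execute(
        "INSERT INTO llm_cache (prompt_hash, response) VALUES (?, ?)",
        (key, response),
    )
    db.commit()
    return response
</code></pre>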
<p>We initially sent full draft text to Claude, but switched to abstract-only analysis after testing showed that abstracts produce equivalent ratings at roughly 10x lower token cost. Full text is still used for idea extraction (Stage 4), where granular detail matters.</p>
<p><strong>Cost</strong>: About $3.16 for the initial 260 drafts on Claude Sonnet (376K input tokens, 200K output tokens). With the <code>--cheap</code> flag, analysis uses Claude Haiku instead, cutting costs roughly 10x.</p>
<h3 id="stage-3-embed">Stage 3: Embed</h3>
<p>For similarity analysis, we generate vector embeddings using Ollama running locally with the <code>nomic-embed-text</code> model. Each draft's abstract is embedded into a 768-dimensional vector, stored as raw bytes in the database.</p>
<p><strong>Why not Claude for embeddings?</strong> Cost and speed. Ollama runs locally, is free, and processes all 434 drafts in under a minute. The embeddings are used for approximate similarity (cosine distance), overlap detection, and t-SNE visualization -- tasks where a small local model is perfectly adequate.</p>
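<p>A sketch of the embed-and-compare step; packing the vector as raw float32 bytes is our illustrative choice:</p>
<pre><code>import numpy as np
import ollama

def embed(text: str) -&gt; bytes:
    """768-dim nomic-embed-text vector, stored as raw float32 bytes."""
    vec = ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]
    return np.asarray(vec, dtype=np.float32).tobytes()

def cosine(a: bytes, b: bytes) -&gt; float:
    va = np.frombuffer(a, dtype=np.float32)
    vb = np.frombuffer(b, dtype=np.float32)
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
</code></pre>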
<p>The embeddings enable:</p>
<ul>
<li><strong>Overlap clusters</strong>: Draft pairs with &gt;0.85 cosine similarity grouped together</li>
<li><strong>Near-duplicate detection</strong>: 25+ pairs with &gt;0.98 similarity flagged as potential duplicates</li>
<li><strong>Interactive t-SNE landscape</strong>: 2D visualization of the entire draft space, color-coded by category</li>
</ul>
<h3 id="stage-4-ideas">Stage 4: Ideas</h3>
<p>The most expensive stage. Each draft's full text is analyzed by Claude to extract discrete technical ideas -- mechanisms, architectures, protocols, patterns, extensions, and requirements.</p>
<p><strong>Batch optimization</strong>: Rather than calling Claude once per draft, we batch 5 drafts per API call using Claude Haiku (<code>--cheap --batch 5</code>). This cuts the number of API calls by 5x and uses the cheaper model. The batch prompt includes all 5 drafts' texts and asks for ideas from each, reducing per-idea cost to fractions of a cent.</p>
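<p>A sketch of the batching step; the prompt wording and section markers are illustrative:</p>
<pre><code>def batch_prompt(drafts: list[dict]) -&gt; str:
    """Pack several drafts into a single extraction prompt."""
    sections = [f"### {d['name']}\n{d['text']}" for d in drafts]
    return ("Extract the discrete technical ideas from each draft below, "
            "grouped by draft name.\n\n" + "\n\n".join(sections))

def batches(items: list, size: int = 5):
    for i in range(0, len(items), size):
        yield items[i : i + size]
</code></pre>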
<p><strong>Result</strong>: The current database contains <strong>419 ideas</strong> across 377 drafts. An earlier pipeline run produced roughly 1,780 components from 361 drafts (averaging ~5 per draft). The difference reflects changes in extraction parameters, batching strategy, and deduplication -- a known limitation of LLM-based extraction. What is consistent across both runs: the vast majority of extracted ideas appear in exactly one draft, and most are draft-specific component descriptions rather than standalone innovations. The real signal comes from the cross-org overlap analysis (idea-overlap feature), which uses SequenceMatcher fuzzy matching (0.75 threshold) to identify <strong>130 cross-org convergent ideas</strong> where 2+ organizations work on recognizably similar problems (an earlier run with ~1,780 ideas yielded 628; the convergence rate of ~36% is consistent across both).</p>
<h3 id="stage-5-gaps">Stage 5: Gaps</h3>
<p>The gap analysis is a synthesis step. We send Claude Sonnet the full landscape context -- category distributions, idea taxonomy, safety ratio, overlap patterns -- and ask it to identify areas where standardization work is missing or inadequate.</p>
<p>This is the one stage where the LLM is doing genuine reasoning, not just extraction. The prompt provides the data; Claude identifies the structural gaps. We validate its findings against the raw data (e.g., confirming that only 6 ideas address error recovery, or that cross-protocol translation has zero ideas).</p>
<p><strong>Result</strong>: <strong>11 gaps</strong> identified (2 critical, 5 high, 4 medium), each cross-referenced with related drafts and ideas.</p>
<h3 id="stage-6-report">Stage 6: Report</h3>
<p>Reports are generated in Markdown with embedded data tables. Fifteen report types are available, including overview, landscape, digest, timeline, overlap-matrix, overlap-clusters, authors, ideas, gaps, refs, trends, idea-overlap, and status. The <code>rich</code> library provides formatted terminal output for CLI commands.</p>
<h2 id="the-database">The Database</h2>
<p>The SQLite database is the real product. At <strong>28 MB</strong>, it contains everything needed to reproduce any finding in this series.</p>
<table>
<thead>
<tr>
<th>Table</th>
<th style="text-align: right;">Rows</th>
<th>Purpose</th>
</tr>
</thead>
<tbody>
<tr>
<td>drafts</td>
<td style="text-align: right;">434</td>
<td>Full metadata + text for every draft</td>
</tr>
<tr>
<td>ratings</td>
<td style="text-align: right;">434</td>
<td>5-dimension quality scores + summaries</td>
</tr>
<tr>
<td>embeddings</td>
<td style="text-align: right;">434</td>
<td>768-dim vectors as binary blobs</td>
</tr>
<tr>
<td>ideas</td>
<td style="text-align: right;">419</td>
<td>Extracted technical components with types</td>
</tr>
<tr>
<td>authors</td>
<td style="text-align: right;">557</td>
<td>Person records from Datatracker</td>
</tr>
<tr>
<td>draft_authors</td>
<td style="text-align: right;">1,057</td>
<td>Author-to-draft linkage with affiliation</td>
</tr>
<tr>
<td>draft_refs</td>
<td style="text-align: right;">4,231</td>
<td>RFC/draft/BCP cross-references</td>
</tr>
<tr>
<td>gaps</td>
<td style="text-align: right;">11</td>
<td>Identified standardization gaps</td>
</tr>
<tr>
<td>llm_cache</td>
<td style="text-align: right;">1,397</td>
<td>Cached Claude API responses</td>
</tr>
</tbody>
</table>
<p>FTS5 full-text search is enabled on drafts, supporting queries like <code>ietf search "agent authentication"</code> that return ranked results in milliseconds. Indexes on <code>draft_refs(ref_type, ref_id)</code> and <code>ideas(draft_name)</code> keep query performance fast even for cross-table joins.</p>
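<p>A query sketch, assuming an FTS5 table named <code>drafts_fts</code> (the repository's table name may differ):</p>
<pre><code>import sqlite3

db = sqlite3.connect("ietf.db")
rows = db.execute(
    """SELECT name, title
       FROM drafts_fts
       WHERE drafts_fts MATCH ?
       ORDER BY rank LIMIT 10""",
    ('"agent authentication"',),
).fetchall()
</code></pre>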
<p>The database design follows a principle: <strong>store raw data, compute derived data</strong>. The drafts table stores full text; the ratings, ideas, and refs tables store analysis results. Any analysis can be re-run without re-fetching from the Datatracker API.</p>
<h2 id="the-author-network">The Author Network</h2>
<p>The author analysis deserves special mention because it revealed the team bloc pattern -- one of the most important findings in the series.</p>
<p>The IETF Datatracker provides author information via two API endpoints:</p>
<ul>
<li><code>/api/v1/doc/documentauthor/?document__name=X</code> -- returns author links per draft</li>
<li><code>/api/v1/person/person/{id}/</code> -- returns person details (name, affiliation)</li>
</ul>
<p>We fetch all authors for all drafts, build a co-authorship graph, and detect team blocs: groups of authors connected by pairwise overlap of at least 70% of their drafts. This threshold was chosen empirically -- lower thresholds produce too many loose groups; higher thresholds miss real teams.</p>
<p>The detection algorithm (a code sketch follows the list):</p>
<ol>
<li>For each pair of authors, calculate pairwise overlap = |shared drafts| / min(|A's drafts|, |B's drafts|)</li>
<li>Build a graph where edges represent pairs with &gt;= 70% overlap and &gt;= 2 shared drafts</li>
<li>Find connected components in this graph</li>
<li>Each component is a team bloc</li>
</ol>
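<p>A pure-stdlib sketch of this procedure (function and variable names are ours):</p>
<pre><code>from itertools import combinations

def team_blocs(drafts_by_author: dict[str, set[str]]) -&gt; list[set[str]]:
    """Connected components of the &gt;=70%-overlap co-authorship graph."""
    adj: dict[str, set[str]] = {a: set() for a in drafts_by_author}
    for a, b in combinations(drafts_by_author, 2):
        shared = drafts_by_author[a] &amp; drafts_by_author[b]
        smaller = min(len(drafts_by_author[a]), len(drafts_by_author[b]))
        if len(shared) &gt;= 2 and len(shared) / smaller &gt;= 0.70:
            adj[a].add(b)
            adj[b].add(a)
    blocs: list[set[str]] = []
    seen: set[str] = set()
    for start in adj:                  # DFS per connected component
        if start in seen or not adj[start]:
            continue
        stack, bloc = [start], set()
        while stack:
            node = stack.pop()
            if node in bloc:
                continue
            bloc.add(node)
            stack.extend(adj[node] - bloc)
        seen |= bloc
        blocs.append(bloc)
    return blocs
</code></pre>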
<p><strong>Organization normalization</strong> turned out to be essential. "Huawei Technologies", "Huawei Technologies Co., Ltd.", and "Huawei Canada" all need to resolve to "Huawei". We maintain a hand-curated alias table of 40+ mappings plus automatic suffix stripping for common patterns (", Inc.", " LLC", " AB", etc.). Without this, cross-org analysis would fragment the same company into multiple entities.</p>
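<p>A normalization sketch; the alias entries and suffix list shown are a small illustrative subset of the curated table:</p>
<pre><code>ALIASES = {
    "Huawei Technologies": "Huawei",
    "Huawei Technologies Co., Ltd.": "Huawei",
    "Huawei Canada": "Huawei",
    # ... 40+ curated mappings in the real table
}
SUFFIXES = (", Inc.", " LLC", " AB")

def normalize_org(raw: str) -&gt; str:
    name = raw.strip()
    if name in ALIASES:
        return ALIASES[name]
    for suffix in SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)].strip()
    return name
</code></pre>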
<p><strong>Result</strong>: <strong>18 team blocs</strong> detected among 557 authors. The largest: a 13-person Huawei team with 22 shared drafts and 94% average cohesion.</p>
<h2 id="the-new-features">The New Features</h2>
<p>Four features were added during the analysis session, each unlocking a deeper analytical layer. All four run locally with zero API cost.</p>
<h3 id="rfc-cross-references-ietf-refs">RFC Cross-References (<code>ietf refs</code>)</h3>
<p><strong>What it does</strong>: Parses all 434 drafts for RFC references using regex (<code>RFC\s*\d{4,}</code>, <code>\[RFC\d+\]</code>, <code>BCP\s*\d+</code>, <code>draft-[\w-]+</code>). Stores results in a <code>draft_refs</code> table for querying.</p>
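<p>An extraction sketch built from those patterns (the grouping into reference types is ours):</p>
<pre><code>import re

REF_PATTERNS = {
    "rfc": re.compile(r"RFC\s*\d{4,}|\[RFC\d+\]"),
    "bcp": re.compile(r"BCP\s*\d+"),
    "draft": re.compile(r"draft-[\w-]+"),
}

def extract_refs(text: str) -&gt; dict[str, list[str]]:
    return {kind: pat.findall(text) for kind, pat in REF_PATTERNS.items()}
</code></pre>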
<p><strong>What it found</strong>: <strong>4,231 cross-references</strong> (2,443 RFC, 698 draft, 1,090 BCP) across 360 drafts with text. The most-referenced standards reveal what the agent ecosystem builds on:</p>
<table>
<thead>
<tr>
<th>RFC</th>
<th style="text-align: right;">References</th>
<th>What It Is</th>
</tr>
</thead>
<tbody>
<tr>
<td>RFC 2119</td>
<td style="text-align: right;">285</td>
<td>MUST/SHALL/MAY conventions</td>
</tr>
<tr>
<td>RFC 8174</td>
<td style="text-align: right;">237</td>
<td>Key words update</td>
</tr>
<tr>
<td>RFC 8446</td>
<td style="text-align: right;">42</td>
<td>TLS 1.3</td>
</tr>
<tr>
<td>RFC 6749</td>
<td style="text-align: right;">36</td>
<td>OAuth 2.0</td>
</tr>
<tr>
<td>RFC 9110</td>
<td style="text-align: right;">34</td>
<td>HTTP Semantics</td>
</tr>
<tr>
<td>RFC 8259</td>
<td style="text-align: right;">26</td>
<td>JSON</td>
</tr>
<tr>
<td>RFC 5280</td>
<td style="text-align: right;">22</td>
<td>X.509 Certificates</td>
</tr>
<tr>
<td>RFC 7519</td>
<td style="text-align: right;">22</td>
<td>JWT</td>
</tr>
<tr>
<td>RFC 9052</td>
<td style="text-align: right;">20</td>
<td>COSE</td>
</tr>
</tbody>
</table>
<p><strong>The insight</strong>: Strip away RFC 2119/8174 (boilerplate conventions that every IETF draft references) and the picture is clear: the agent ecosystem is built on <strong>OAuth + TLS + HTTP + JWT</strong>. It is a security and identity infrastructure, not a networking infrastructure. The IETF's agent standards are being constructed on the same foundation as the web itself. This reframes the entire landscape: agent standards are not something new. They are the next layer on top of the web's existing security architecture.</p>
<h3 id="category-trends-ietf-trends">Category Trends (<code>ietf trends</code>)</h3>
<p><strong>What it does</strong>: Monthly breakdown of new drafts per category with growth rates, comparing recent periods to earlier ones.</p>
<p><strong>What it found</strong>: The growth curve is a step function. Monthly submissions went from 2 (Jun 2025) to 67 (Oct 2025) to 86 (Feb 2026). A2A protocols are still accelerating (26 in Oct/Nov 2025, 36 in Feb 2026). Safety/alignment is growing but slower (5 in Oct 2025, 12 in Feb 2026). The aggregate ~4:1 ratio (which varies from 1.5:1 to 21:1 month-to-month) is narrowing, but not fast enough.</p>
<h3 id="cross-org-idea-overlap-ietf-idea-overlap">Cross-Org Idea Overlap (<code>ietf idea-overlap</code>)</h3>
<p><strong>What it does</strong>: Groups similar ideas using <code>SequenceMatcher</code> (threshold 0.75), then checks which ideas span drafts from multiple organizations. This separates genuine cross-org consensus from intra-team duplication.</p>
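<p>A greedy clustering sketch at the 0.75 threshold; the repository's grouping strategy may differ in detail:</p>
<pre><code>from difflib import SequenceMatcher

def similar(a: str, b: str) -&gt; bool:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() &gt;= 0.75

def cluster_ideas(titles: list[str]) -&gt; list[list[str]]:
    clusters: list[list[str]] = []
    for title in titles:
        for cluster in clusters:
            if similar(title, cluster[0]):  # compare against cluster seed
                cluster.append(title)
                break
        else:
            clusters.append([title])
    return clusters
</code></pre>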
<p><strong>What it found</strong>: By exact title, the vast majority of unique ideas appear in only a single draft. But fuzzy matching reveals <strong>130 cross-org convergent ideas</strong> (36% of unique clusters) where 2+ organizations work on recognizably similar problems. The top convergence signal -- "A2A Communication Paradigm" -- spans <strong>8 organizations from 5 countries</strong>. The deeper finding: <strong>180 ideas cross the Chinese-Western organizational divide</strong>. European telecoms (Deutsche Telekom, Telefonica, Orange) act as bridges between Chinese institutions and Western companies. US Big Tech (Google, Apple, Amazon) is almost entirely absent from cross-divide collaboration.</p>
<h3 id="wg-adoption-status-ietf-status">WG Adoption Status (<code>ietf status</code>)</h3>
<p><strong>What it does</strong>: Determines which drafts have been formally adopted by IETF Working Groups based on the <code>draft-ietf-{wg}-*</code> naming convention. Compares scores, categories, and gap coverage between WG-adopted and individual drafts.</p>
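<p>The check itself is a few lines; a sketch:</p>
<pre><code>def wg_of(draft_name: str) -&gt; str | None:
    """Working group for draft-ietf-{wg}-* names, else None."""
    parts = draft_name.split("-")
    if len(parts) &gt;= 4 and parts[:2] == ["draft", "ietf"]:
        return parts[2]
    return None
</code></pre>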
<p><strong>What it found</strong>: <strong>52 of 434 drafts (12%)</strong> are WG-adopted. The remaining 88% are individual submissions -- ideas seeking institutional backing. WG-adopted drafts score slightly higher on average (<strong>3.61 vs 3.23</strong>), validating our rating methodology.</p>
<p>The most revealing finding: <strong>a majority of WG-adopted drafts are in security Working Groups</strong> (lamps, lake, tls, emu, ace). The agent-focused <code>aipref</code> WG has only 2 adopted drafts. The IETF is not building agent standards in agent-focused groups -- it is retrofitting its existing security infrastructure for agent use cases. The standards that will actually govern AI agents on the internet are being written by the same people who write TLS and OAuth, not by new agent-specific working groups.</p>
<h2 id="what-we-learned">What We Learned</h2>
<h3 id="llms-are-good-at-structured-extraction">LLMs are good at structured extraction</h3>
<p>Claude's strength in this pipeline is turning unstructured technical documents into structured data: categories, ratings, ideas, gaps. The extraction quality is high -- we spot-checked 50 drafts and found categorization and idea extraction accurate in ~90% of cases. The errors tend to be over-categorization (assigning too many categories) rather than miscategorization.</p>
<h3 id="llms-need-validation-for-synthesis">LLMs need validation for synthesis</h3>
<p>The gap analysis (Stage 5) required the most human oversight. Claude correctly identified the gaps, but the severity rankings and the "zero ideas" claims needed manual verification against the raw data. LLMs can synthesize, but the synthesis should be treated as a hypothesis, not a conclusion.</p>
<h3 id="caching-changes-the-economics">Caching changes the economics</h3>
<p>The <code>llm_cache</code> table transforms the cost model. The first run costs ~$3. Every subsequent run -- adding new drafts, re-running with different prompts, regenerating reports -- costs only for new work. Over the project's life, we estimate caching saved $30+ in redundant API calls. The cache key is a SHA-256 hash of the full prompt, making it trivially collision-resistant.</p>
<h3 id="hybrid-models-work">Hybrid models work</h3>
<p>Using Claude Sonnet for reasoning-heavy tasks (analysis, gap synthesis) and Claude Haiku for extraction-heavy tasks (idea extraction, batch processing) cut costs by 5-10x without meaningful quality loss. Using Ollama for embeddings made similarity analysis free and fast. The principle: match the model's capability to the task's difficulty.</p>
<h3 id="the-free-analyses-are-the-most-revealing">The free analyses are the most revealing</h3>
<p>The four features that cost zero API dollars -- regex-based RFC parsing, SQL-based trend analysis, SequenceMatcher-based idea dedup, and naming-convention-based WG detection -- produced some of the most narratively important findings in the entire series. The OAuth-stack-as-foundation insight from RFC cross-references. The 180 cross-divide ideas. The 12% WG adoption rate. The security-WG-not-agent-WG finding. None of these required an LLM. They required a well-structured database and the right questions.</p>
<h3 id="the-database-is-the-product">The database is the product</h3>
<p>The most valuable output is not any single report -- it is the SQLite database. With all drafts analyzed, ideas extracted, authors mapped, refs parsed, and embeddings stored, the database supports ad-hoc queries that no pre-built report can anticipate. The blog series was written primarily by querying the database, not by re-running the pipeline.</p>
<h2 id="cost-summary">Cost Summary</h2>
<table>
<thead>
<tr>
<th>Stage</th>
<th>Model</th>
<th style="text-align: right;">Drafts</th>
<th style="text-align: right;">Cost</th>
</tr>
</thead>
<tbody>
<tr>
<td>Analyze</td>
<td>Claude Sonnet</td>
<td style="text-align: right;">260</td>
<td style="text-align: right;">~$2.50</td>
</tr>
<tr>
<td>Analyze</td>
<td>Claude Sonnet</td>
<td style="text-align: right;">101</td>
<td style="text-align: right;">~$5.50</td>
</tr>
<tr>
<td>Ideas</td>
<td>Claude Haiku (batch 5)</td>
<td style="text-align: right;">434</td>
<td style="text-align: right;">~$0.80</td>
</tr>
<tr>
<td>Gaps</td>
<td>Claude Sonnet</td>
<td style="text-align: right;">1 call</td>
<td style="text-align: right;">~$0.20</td>
</tr>
<tr>
<td>Embed</td>
<td>Ollama (local)</td>
<td style="text-align: right;">434</td>
<td style="text-align: right;">$0.00</td>
</tr>
<tr>
<td>Refs</td>
<td>Regex (local)</td>
<td style="text-align: right;">434</td>
<td style="text-align: right;">$0.00</td>
</tr>
<tr>
<td>Trends</td>
<td>SQL (local)</td>
<td style="text-align: right;">434</td>
<td style="text-align: right;">$0.00</td>
</tr>
<tr>
<td>Idea-overlap</td>
<td>SequenceMatcher (local)</td>
<td style="text-align: right;">419 ideas</td>
<td style="text-align: right;">$0.00</td>
</tr>
<tr>
<td>WG Status</td>
<td>Naming convention</td>
<td style="text-align: right;">434</td>
<td style="text-align: right;">$0.00</td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td></td>
<td style="text-align: right;"></td>
<td style="text-align: right;"><strong>~$9</strong></td>
</tr>
</tbody>
</table>
<p>For context: analyzing 434 IETF drafts -- fetching full text, rating quality on 5 dimensions, extracting 419 technical ideas, detecting 11 gaps, mapping 557 authors, parsing 4,231 cross-references, and identifying 18 team blocs -- cost less than two large coffees.</p>
<h2 id="the-tech-stack">The Tech Stack</h2>
<ul>
<li><strong>Python 3.11+</strong> with <strong>Click</strong> for the CLI</li>
<li><strong>SQLite</strong> with <strong>FTS5</strong> for full-text search</li>
<li><strong>httpx</strong> for HTTP requests (Datatracker API)</li>
<li><strong>anthropic</strong> SDK for Claude API</li>
<li><strong>ollama</strong> for local embeddings</li>
<li><strong>rich</strong> for terminal formatting</li>
<li><strong>numpy</strong> for cosine similarity and matrix operations</li>
</ul>
<p>43 CLI commands, 13+ interactive visualizations (HTML/PNG), 15 report types. Total codebase: approximately 6,100 lines of Python across 12 modules.</p>
<hr />
<h2 id="limitations">Limitations</h2>
<p><strong>A note on IETF IPR policy</strong>: Internet-Drafts may be subject to intellectual property rights (IPR) claims. Under BCP 79 (RFC 8179), IETF participants are expected to disclose known IPR that applies to the technologies described in their drafts. Implementers considering building on any of the drafts discussed in this series should check the <a href="https://datatracker.ietf.org/ipr/">IETF IPR disclosure database</a> before proceeding.</p>
<p>This analysis is exploratory, not peer-reviewed research. Several methodological limitations should be understood when interpreting the results:</p>
<p><strong>LLM-as-Judge ratings</strong>: All quality ratings are generated by Claude Sonnet from draft abstracts (not full text), with no human calibration. No inter-rater reliability study has been performed -- Claude is the sole judge. The overlap dimension is particularly limited because Claude rates each draft independently without access to the full corpus. Scores should be treated as relative rankings within this corpus, not absolute quality measures.</p>
<p><strong>Keyword-based corpus selection</strong>: The 12 search keywords cast a wide net but introduce both false positives (drafts about "user agents" or "autonomous systems" unrelated to AI) and false negatives (relevant drafts using terminology we did not search for). We estimate 30-50 false positives remain in the corpus. The relevance rating partially mitigates this, but the LLM judge is generous with relevance for keyword-matched drafts.</p>
<p><strong>Clustering thresholds</strong>: The 0.85 cosine similarity threshold for topical clusters, 0.90 for near-duplicates, and 0.98 for functional duplicates are empirical choices based on manual inspection, not derived from a principled analysis. The embedding model (nomic-embed-text) is general-purpose, not fine-tuned for standards documents. A sensitivity analysis across thresholds would strengthen confidence.</p>
<p><strong>Gap analysis</strong>: The gap identification is a single-shot LLM analysis based on compressed landscape statistics, not a systematic comparison against a reference architecture. Gap severity is assigned by Claude without defined thresholds. The gaps should be treated as hypotheses for expert validation, not definitive findings.</p>
<p><strong>Idea extraction quality</strong>: Batch extraction (Haiku, abstract-only at 800 chars) produces different results than individual extraction (Sonnet, abstract + full text). No precision/recall measurement has been performed. The extraction prompt instructs Claude to return 1-4 ideas per draft, which may under-count contributions from comprehensive drafts.</p>
<p><strong>Abstract-only analysis</strong>: Ratings are based on abstracts truncated to 2000 characters. For maturity assessment in particular, the abstract is an imperfect proxy for the full document's technical depth.</p>
<p>For full methodology documentation, see <code>data/reports/methodology.md</code> in the project repository.</p>
<hr />
<h3 id="key-takeaways">Key Takeaways</h3>
<ul>
<li><strong>The full analysis cost ~$9</strong> -- LLM-powered document analysis at scale is practical and cheap with proper caching and model selection</li>
<li><strong>Caching is essential</strong>: SHA-256 hashed prompt caching makes the pipeline idempotent and dramatically reduces costs on re-runs</li>
<li><strong>Hybrid LLM strategy</strong>: Claude Sonnet for reasoning, Claude Haiku for extraction (10x cheaper), Ollama for embeddings (free) -- match model capability to task difficulty</li>
<li><strong>The zero-cost analyses were the most revealing</strong>: RFC cross-references, idea overlap, WG adoption, and trend analysis all run locally and produced the series' most important structural findings</li>
<li><strong>The database is the product</strong>: a well-structured SQLite DB supports queries no pre-built report anticipates; the blog series was written by querying, not re-running</li>
</ul>
<p><em>Next in this series: <a href="08-agents-building-the-analysis.html">Agents Building the Agent Analysis</a> -- we used a team of AI agents to produce this series. The irony is the point.</em></p>
<hr />
<p><em>The IETF Draft Analyzer is open source. The codebase, database, and all reports are available in the project repository.</em></p>
<div class="post-nav"><a href="/blog/posts/06-big-picture.html">&larr; The Big Picture</a><a href="/blog/posts/08-agents-building-the-analysis.html">Agents Building the Agent Analysis &rarr;</a></div>
<footer>
<p>IETF Draft Analyzer &mdash; Data collected through March 2026.
<a href="https://github.com/cnennemann/ietf-draft-analyzer">Source on GitHub</a></p>
</footer>
</div>
</body>
</html>