Fix remaining critical, high, and medium issues from 4-perspective review
Critical fixes:
- Fix rating clamp range 1-10 → 1-5 (actual scale)
- Add `ietf ideas convergence` command (SequenceMatcher at 0.75 threshold)
- Fix "628 cross-org ideas" → 130 (verified from current DB) across 8 files
Security fixes:
- Sanitize FTS5 query input (strip special chars + boolean operators)
- Add rate limiting (10 req/min/IP) on Claude-calling endpoints
- Change <path:name> → <string:name> on draft routes
Codebase fixes:
- Add Database context manager (__enter__/__exit__)
- Wire false_positive filtering into queries (exclude by default in web UI)
- Fix Post 3 arithmetic ("~300" → "~409" distinct proposals)
Content & licensing:
- Add MIT LICENSE file
- Add IPR/FRAND notes (BCP 79, RFC 8179) to Posts 03 and 07
- Qualify "4:1 safety ratio" with monthly variation in 6 remaining files
- Add "Data as of March 2026" freeze-date headers to all 10 blog posts
- Hedge causal language in Post 04
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -2,9 +2,11 @@
*The engineering behind the analysis -- a Python CLI, two LLMs, one SQLite database, and ~$9.*
*Analysis based on IETF Datatracker data collected through March 2026. Counts and statistics reflect this snapshot.*
---
-Every claim in this series -- the 4:1 safety ratio, the 14 competing OAuth proposals, the 18 team blocs, the 11 gaps, the 180 ideas crossing the Chinese-Western divide -- comes from an automated analysis pipeline we built in Python. This post describes how it works, what it costs, what it found that surprised us, and what we learned about LLM-powered document analysis at scale.
+Every claim in this series -- the ~4:1 safety ratio (a monthly average that varies from 1.5:1 to 21:1), the 14 competing OAuth proposals, the 18 team blocs, the 11 gaps, the 180 ideas crossing the Chinese-Western divide -- comes from an automated analysis pipeline we built in Python. This post describes how it works, what it costs, what it found that surprised us, and what we learned about LLM-powered document analysis at scale.
The tool is open source. If you want to run it on a different corner of the IETF -- or adapt it for another standards body -- everything you need is in the repository.
@@ -72,7 +74,7 @@ The most expensive stage. Each draft's full text is analyzed by Claude to extrac
**Batch optimization**: Rather than calling Claude once per draft, we batch 5 drafts per API call using Claude Haiku (`--cheap --batch 5`). This cuts the number of API calls by 5x and uses the cheaper model. The batch prompt includes all 5 drafts' texts and asks for ideas from each, reducing per-idea cost to fractions of a cent.
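A minimal sketch of the batching step, assuming hypothetical helper names (the CLI's internals aren't shown in the post): chunk the drafts into groups of 5 and build one combined extraction prompt per group, so each Claude Haiku call covers a whole batch.

```python
def chunk(items, size=5):
    """Split a list into consecutive groups of at most `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def build_batch_prompt(drafts):
    """Combine several drafts into a single extraction prompt.

    One API call then covers the whole batch, cutting call count
    by the batch size (5x for --batch 5).
    """
    header = ("Extract the distinct technical ideas from each draft below.\n"
              "Return one list of ideas per draft, keyed by draft number.\n\n")
    parts = [f"### Draft {i + 1}: {d['name']}\n{d['text']}"
             for i, d in enumerate(drafts)]
    return header + "\n\n".join(parts)

drafts = [{"name": f"draft-{n}", "text": "..."} for n in range(12)]
batches = chunk(drafts, 5)  # 12 drafts -> 3 API calls instead of 12
```

Each batch's prompt would then be sent in a single `messages` request; the response is parsed back into per-draft idea lists.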
-**Result**: The current database contains **419 ideas** across 377 drafts. An earlier pipeline run produced roughly 1,780 components from 361 drafts (averaging ~5 per draft). The difference reflects changes in extraction parameters, batching strategy, and deduplication -- a known limitation of LLM-based extraction. What is consistent across both runs: the vast majority of extracted ideas appear in exactly one draft, and most are draft-specific component descriptions rather than standalone innovations. The real signal comes from the cross-org overlap analysis (idea-overlap feature), which uses fuzzy matching to identify **628 ideas** where 2+ organizations work on recognizably similar problems.
+**Result**: The current database contains **419 ideas** across 377 drafts. An earlier pipeline run produced roughly 1,780 components from 361 drafts (averaging ~5 per draft). The difference reflects changes in extraction parameters, batching strategy, and deduplication -- a known limitation of LLM-based extraction. What is consistent across both runs: the vast majority of extracted ideas appear in exactly one draft, and most are draft-specific component descriptions rather than standalone innovations. The real signal comes from the cross-org overlap analysis (idea-overlap feature), which uses SequenceMatcher fuzzy matching (0.75 threshold) to identify **130 cross-org convergent ideas** where 2+ organizations work on recognizably similar problems (an earlier run with ~1,780 ideas yielded 628; the convergence rate of ~36% is consistent across both).
### Stage 5: Gaps
@@ -154,13 +156,13 @@ Four features were added during the analysis session, each unlocking a deeper an
**What it does**: Monthly breakdown of new drafts per category with growth rates, comparing recent periods to earlier ones.
-**What it found**: The growth curve is a step function. Monthly submissions went from 2 (Jun 2025) to 67 (Oct 2025) to 86 (Feb 2026). A2A protocols are still accelerating (26 in Oct/Nov 2025, 36 in Feb 2026). Safety/alignment is growing but slower (5 in Oct 2025, 12 in Feb 2026). The 4:1 ratio is narrowing, but not fast enough.
+**What it found**: The growth curve is a step function. Monthly submissions went from 2 (Jun 2025) to 67 (Oct 2025) to 86 (Feb 2026). A2A protocols are still accelerating (26 in Oct/Nov 2025, 36 in Feb 2026). Safety/alignment is growing but slower (5 in Oct 2025, 12 in Feb 2026). The aggregate ~4:1 ratio (which varies from 1.5:1 to 21:1 month-to-month) is narrowing, but not fast enough.
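The monthly breakdown is a straightforward aggregation over the SQLite database. A sketch, assuming a hypothetical `drafts(name, category, submitted)` schema (the actual schema isn't shown in the post):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE drafts (name TEXT, category TEXT, submitted TEXT);
    INSERT INTO drafts VALUES
        ('draft-a', 'a2a-protocols',    '2025-10-03'),
        ('draft-b', 'a2a-protocols',    '2025-10-17'),
        ('draft-c', 'safety-alignment', '2025-10-21'),
        ('draft-d', 'a2a-protocols',    '2026-02-02');
""")

# New drafts per month per category; growth rates compare
# consecutive months of the same category downstream.
rows = conn.execute("""
    SELECT strftime('%Y-%m', submitted) AS month, category, COUNT(*) AS n
    FROM drafts
    GROUP BY month, category
    ORDER BY month, category
""").fetchall()
```

Dividing each month's count by the prior month's (per category) gives the growth-rate columns the command reports.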
### Cross-Org Idea Overlap (`ietf idea-overlap`)
**What it does**: Groups similar ideas using `SequenceMatcher` (threshold 0.75), then checks which ideas span drafts from multiple organizations. This separates genuine cross-org consensus from intra-team duplication.
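The post names `SequenceMatcher` at a 0.75 threshold; the clustering strategy below (greedy first-match against each group's first title) is a guessed simplification, not the tool's actual algorithm:

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.75):
    """True if two idea titles are fuzzy-similar by difflib ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def group_ideas(titles, threshold=0.75):
    """Greedy single pass: each title joins the first group whose
    representative (first member) it matches, else starts a new group."""
    groups = []
    for t in titles:
        for g in groups:
            if similar(t, g[0], threshold):
                g.append(t)
                break
        else:
            groups.append([t])
    return groups

titles = [
    "A2A communication paradigm",
    "A2A communication paradigms",
    "Agent capability discovery",
]
```

Each resulting group is then checked against the organizations behind its member drafts; groups spanning 2+ organizations count as cross-org convergence.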
-**What it found**: By exact title, the vast majority of unique ideas appear in only a single draft. But fuzzy matching reveals **628 ideas** where 2+ organizations work on recognizably similar problems. The top convergence signal -- "A2A Communication Paradigm" -- spans **8 organizations from 5 countries**. The deeper finding: **180 ideas cross the Chinese-Western organizational divide**. European telecoms (Deutsche Telekom, Telefonica, Orange) act as bridges between Chinese institutions and Western companies. US Big Tech (Google, Apple, Amazon) is almost entirely absent from cross-divide collaboration.
+**What it found**: By exact title, the vast majority of unique ideas appear in only a single draft. But fuzzy matching reveals **130 cross-org convergent ideas** (36% of unique clusters) where 2+ organizations work on recognizably similar problems. The top convergence signal -- "A2A Communication Paradigm" -- spans **8 organizations from 5 countries**. The deeper finding: **180 ideas cross the Chinese-Western organizational divide**. European telecoms (Deutsche Telekom, Telefonica, Orange) act as bridges between Chinese institutions and Western companies. US Big Tech (Google, Apple, Amazon) is almost entirely absent from cross-divide collaboration.
### WG Adoption Status (`ietf status`)
@@ -229,6 +231,8 @@ For context: analyzing 434 IETF drafts -- fetching full text, rating quality on
## Limitations
**A note on IETF IPR policy**: Internet-Drafts may be subject to intellectual property rights (IPR) claims. Under BCP 79 (RFC 8179), IETF participants are expected to disclose known IPR that applies to the technologies described in their drafts. Implementers considering building on any of the drafts discussed in this series should check the [IETF IPR disclosure database](https://datatracker.ietf.org/ipr/) before proceeding.
This analysis is exploratory, not peer-reviewed research. Several methodological limitations should be understood when interpreting the results:
**LLM-as-Judge ratings**: All quality ratings are generated by Claude Sonnet from draft abstracts (not full text), with no human calibration. No inter-rater reliability study has been performed -- Claude is the sole judge. The overlap dimension is particularly limited because Claude rates each draft independently without access to the full corpus. Scores should be treated as relative rankings within this corpus, not absolute quality measures.