Methodology — IETF Draft Analyzer

This document describes the data collection, analysis pipeline, and known limitations of the IETF Draft Analyzer project. It is intended to provide transparency for anyone evaluating the findings in the blog series or reports.


1. Data Collection

Source

All data is fetched from the IETF Datatracker API (https://datatracker.ietf.org/api/v1/doc/document/). Full draft text is retrieved from https://www.ietf.org/archive/id/{name}-{rev}.txt. Author and affiliation data comes from the /api/v1/doc/documentauthor/ and /api/v1/person/person/ endpoints.

Keyword Selection

The corpus is built by searching for drafts matching 12 keywords across both name__contains and abstract__contains fields:

agent, ai-agent, llm, autonomous, machine-learning, artificial-intelligence, mcp, agentic, inference, generative, intelligent, aipref

Only drafts with type__slug=draft and submission date >= 2024-01-01 are included.
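
For concreteness, here is a minimal sketch of the kind of query this produces against the Datatracker API. The pagination handling, helper name, and exact date-filter field are assumptions, not the project's actual code:

```python
import time
import requests

BASE = "https://datatracker.ietf.org/api/v1/doc/document/"
KEYWORDS = [
    "agent", "ai-agent", "llm", "autonomous", "machine-learning",
    "artificial-intelligence", "mcp", "agentic", "inference",
    "generative", "intelligent", "aipref",
]

def fetch_matching(keyword: str, field: str) -> list[dict]:
    """Fetch all drafts matching `keyword` in `field` ("name" or "abstract")."""
    results, offset = [], 0
    while True:
        resp = requests.get(BASE, params={
            f"{field}__contains": keyword,
            "type__slug": "draft",
            "time__gte": "2024-01-01",  # assumed field name for the date cutoff
            "format": "json",
            "limit": 100,
            "offset": offset,
        }, timeout=30)
        resp.raise_for_status()
        page = resp.json()
        results.extend(page["objects"])
        if page["meta"]["next"] is None:
            return results
        offset += 100
        time.sleep(0.5)  # polite delay between requests (see Stage 1 below)
```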

Selection Bias Acknowledgment

Keyword-based selection introduces both false positives and false negatives:

  • False positives: Keywords like "agent" match "user agent" in HTTP contexts, "autonomous" matches "autonomous systems" (AS) in routing, and "intelligent" matches "intelligent networking" unrelated to AI. We estimate 30-50 false positives remain in the corpus despite relevance filtering. Drafts with relevance score <= 2 are the most obvious, but some false positives receive relevance scores of 3-4 from the LLM judge.
  • False negatives: Relevant drafts using terminology not in our keyword list (e.g., "cognitive," "self-driving network," or domain-specific terms) are missed entirely.
  • Temporal bias: The fetch_since cutoff of 2024-01-01 excludes earlier foundational work that may inform the current landscape.

Organization Normalization

Author affiliations are normalized using a hand-curated alias table of 40+ mappings (e.g., "Huawei Technologies Co., Ltd." -> "Huawei") plus automatic suffix stripping for common patterns (", Inc.", " LLC", " AB", etc.). This normalization is essential for cross-org analysis but introduces judgment calls about organizational boundaries.
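
A minimal sketch of the two-step normalization described above; the alias entries shown and the function name are illustrative, not the project's actual table:

```python
# Hypothetical excerpt of the hand-curated alias table (~40 entries in practice).
ALIASES = {
    "Huawei Technologies Co., Ltd.": "Huawei",
}
# Common suffixes stripped automatically when no alias matches.
SUFFIXES = (", Inc.", ", Inc", " Inc.", " LLC", " Ltd.", " AB", " GmbH")

def normalize_org(raw: str) -> str:
    """Alias lookup first, then suffix stripping."""
    name = raw.strip()
    if name in ALIASES:
        return ALIASES[name]
    for suffix in SUFFIXES:
        if name.endswith(suffix):
            name = name[: -len(suffix)].rstrip(" ,.")
            break
    return name
```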


2. Analysis Pipeline

The pipeline runs in six stages, each building on the previous:

   fetch  -->  analyze  -->   embed   -->  ideas  -->   gaps   -->  report
     |            |             |            |            |            |
     v            v             v            v            v            v
Datatracker    Claude        Ollama        Claude       Claude      Markdown
    API        Sonnet     nomic-embed      Haiku        Sonnet       + rich

Stage 1: Fetch

Retrieves draft metadata, full text, and author information from the Datatracker API with a 0.5-second polite delay between requests.

Stage 2: Analyze (Rating)

Each draft is rated by Claude Sonnet on five dimensions using a compact structured prompt that includes the draft's name, title, date, page count, and abstract (truncated to 2000 characters). See "Rating Rubric" below.
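
The exact prompt lives in analyzer.py; the sketch below only illustrates the shape described above, with a hypothetical field layout and response format:

```python
# Illustrative only: wording, field layout, and JSON schema are assumptions.
RATING_PROMPT = """Rate this IETF draft on five dimensions (1-5 integers):
novelty, maturity, overlap, momentum, relevance.

Name: {name}
Title: {title}
Date: {date}
Pages: {pages}
Abstract: {abstract}

Respond as JSON: {{"novelty": n, "maturity": n, "overlap": n, "momentum": n, "relevance": n}}"""

def build_rating_prompt(draft: dict) -> str:
    return RATING_PROMPT.format(
        name=draft["name"],
        title=draft["title"],
        date=draft["date"],
        pages=draft["pages"],
        abstract=draft["abstract"][:2000],  # truncation matches the pipeline description
    )
```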

Stage 3: Embed

Vector embeddings are generated using Ollama with the nomic-embed-text model. The input combines the draft's title, abstract, and first 4000 characters of full text. Embeddings are 768-dimensional vectors stored as binary blobs in SQLite. See "Embedding Model" below.
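
A sketch of this stage under stated assumptions: Ollama's local HTTP embeddings endpoint, float32 packing, and a hypothetical drafts table. Input construction and truncation limits follow Section 6:

```python
import sqlite3
import struct
import requests

def embedding_input(title: str, abstract: str, full_text: str | None) -> str:
    """Title + abstract + first 4000 chars of body, joined with double newlines,
    truncated to ~32,000 characters (Section 6)."""
    parts = [title, abstract, (full_text or "")[:4000]]
    return "\n\n".join(p for p in parts if p)[:32000]

def embed(text: str) -> list[float]:
    """768-dim vector from a local Ollama server running nomic-embed-text."""
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

def store(db: sqlite3.Connection, name: str, vec: list[float]) -> None:
    """Store as a binary blob; float32 packing and the schema are assumptions."""
    blob = struct.pack(f"{len(vec)}f", *vec)
    db.execute("UPDATE drafts SET embedding = ? WHERE name = ?", (blob, name))
    db.commit()
```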

Stage 4: Ideas

Technical ideas are extracted by Claude. Individual extraction uses Sonnet with abstract + first 3000 characters of full text. Batch extraction uses Haiku with abstract only (truncated to 800 characters). The prompt requests 1-4 top-level novel contributions per draft.

Stage 5: Gaps

A single Claude Sonnet call receives compressed landscape statistics (category counts, top ideas, overlap summary) and identifies 8-15 standardization gaps. See "Gap Analysis" below.

Stage 6: Report

Markdown reports are generated from database queries. No LLM is involved in report generation.

Caching

All Claude API calls are cached in an llm_cache table keyed by SHA-256 hash of the full prompt. Re-runs return cached results, making the pipeline idempotent. This also means that intra-rater consistency cannot be measured from cached results.
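
The cache amounts to a content-addressed lookup. A minimal sketch, with hypothetical table and column names:

```python
import hashlib
import sqlite3

def cached_completion(db: sqlite3.Connection, prompt: str, call_llm) -> str:
    """Return the cached response for this exact prompt, or call the model and cache it.
    Keying by SHA-256 of the full prompt matches the description above."""
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    row = db.execute(
        "SELECT response FROM llm_cache WHERE prompt_hash = ?", (key,)
    ).fetchone()
    if row is not None:
        return row[0]
    response = call_llm(prompt)
    db.execute(
        "INSERT INTO llm_cache (prompt_hash, response) VALUES (?, ?)", (key, response)
    )
    db.commit()
    return response
```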


3. Rating Rubric

Each draft is scored on five dimensions, each on a 1-5 integer scale:

| Dimension | 1 | 2 | 3 | 4 | 5 |
|-----------|---|---|---|---|---|
| Novelty | Trivial/obvious extension | Incremental | Useful contribution | Notable originality | Genuinely novel approach |
| Maturity | Problem statement only | Early sketch | Defined protocol/mechanism | Detailed spec with examples | Implementation-ready with test vectors |
| Overlap | Unique approach | Minor similarities | Shares concepts with 1-2 drafts | Significant overlap | Near-duplicate of existing work |
| Momentum | Inactive/abandoned | Single revision | Active development | WG interest/adoption | Strong community momentum |
| Relevance | Not about AI/agents (false positive) | Tangentially related | Partially relevant | Directly relevant | Core AI agent topic |

Composite Score

The composite score used in reports is a 4-dimension average of novelty, maturity, momentum, and relevance (excluding overlap, since overlap measures redundancy rather than quality). Exact decimal values are used; rounding is avoided.
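
Expressed in code (the function name is illustrative):

```python
def composite_score(ratings: dict) -> float:
    """4-dimension mean; overlap is excluded because it measures redundancy, not quality."""
    dims = ("novelty", "maturity", "momentum", "relevance")
    return sum(ratings[d] for d in dims) / len(dims)
```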

Scale Interpretation

Scores should be treated as relative rankings within this corpus, not absolute quality measures. Key limitations:

  • Abstract-only input: Ratings are based on the draft's abstract (truncated to 2000 characters), not the full text. Maturity and overlap scores are particularly affected, since the abstract may not convey the full technical depth or specificity of the draft.
  • Single LLM judge: Claude Sonnet is the sole rater. No human calibration study has been performed. No second-model comparison has been conducted. Even a small calibration set (20-30 drafts rated by domain experts) would substantially strengthen confidence.
  • No consistency measurement: Each draft is rated once. The caching mechanism prevents re-rating, so Claude's self-consistency on these drafts is untested.
  • Overlap score limitations: Claude rates each draft independently without access to the full corpus. The overlap dimension reflects Claude's general knowledge of the field, not corpus-specific similarity analysis. For corpus-level overlap, use the embedding-based similarity analysis instead.
  • Relevance inflation: Keyword-matched drafts tend to score high on relevance by construction. The distribution is right-skewed (most drafts at 4-5).
  • Batch effects: When rated in batches of 5 (using BATCH_PROMPT), position effects and comparison effects between drafts in the same batch are uncontrolled. Abstracts are truncated more aggressively (1500 chars) in batch mode.

4. Clustering

Method

Greedy single-linkage clustering on the pairwise cosine similarity matrix of draft embeddings.

Algorithm: For each unvisited draft (seed), find all unvisited drafts with cosine similarity >= threshold to the seed. Add them to the seed's cluster and mark them visited. This produces disjoint clusters where every member is similar to the seed, but members are not guaranteed to be similar to each other (single-linkage property).
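
A compact NumPy sketch of this algorithm (the actual implementation may differ in detail):

```python
import numpy as np

def greedy_clusters(sim: np.ndarray, threshold: float) -> list[list[int]]:
    """Greedy seed-based clustering on a pairwise cosine similarity matrix.
    Every member is similar to its seed, but not necessarily to other members."""
    n = sim.shape[0]
    visited = np.zeros(n, dtype=bool)
    clusters = []
    for seed in range(n):
        if visited[seed]:
            continue
        # All unvisited drafts similar enough to the seed (the seed itself
        # is included, since sim[seed, seed] == 1.0).
        members = np.where(~visited & (sim[seed] >= threshold))[0]
        visited[members] = True
        clusters.append(members.tolist())
    return clusters
```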

Thresholds

| Threshold | Label | Justification |
|-----------|-------|---------------|
| 0.85 | Topically overlapping | Empirical: 0.80 produced too many false groupings; 0.90 missed obvious clusters; 0.85 yielded groups that looked reasonable on manual spot-checking |
| 0.90 | Near-duplicates | Empirical: pairs above 0.90 consistently covered the same topic with similar approaches |
| 0.98 | Functionally identical | Empirical: pairs above 0.98 were essentially the same document under different names |

None of these thresholds are derived from a principled analysis. A sensitivity analysis (running clustering at 0.80, 0.85, 0.90 and comparing results) would strengthen confidence. Different embedding models would produce different similarity distributions, potentially requiring different thresholds.

Limitations

  • Single-linkage chaining: A chain of pairwise-similar drafts can produce clusters containing semantically distant drafts connected through intermediaries.
  • No comparison to alternatives: The clustering has not been compared against k-means, DBSCAN, hierarchical agglomerative clustering, or other standard methods.
  • General-purpose embeddings: The nomic-embed-text model was not trained specifically for technical standards document similarity. Domain-specific or fine-tuned embeddings might produce significantly different cluster structures.
  • Inconsistent embedding input: Drafts with full text available are embedded from title + abstract + 4000 chars of body. Drafts without full text are embedded from title + abstract only. This creates systematic quality differences in embeddings.

5. Gap Analysis

The gap analysis sends Claude Sonnet a compressed landscape summary containing:

  • Category distribution (category name and draft count)
  • Top 20 most frequently occurring idea titles
  • Overlap summary (top 5 categories by count, labeled "high internal overlap")

Claude is instructed to identify 8-15 gaps with topic, description, category, severity (critical/high/medium/low), and evidence.

Limitations

  • Single-shot generation: Gaps are identified in one LLM call, not through systematic comparison against a reference taxonomy.
  • No reference architecture: A rigorous gap analysis would compare the corpus against an explicit agent ecosystem reference model (e.g., NIST AI RMF, FIPA agent platform model). The current approach relies on Claude's general knowledge.
  • Circular overlap summary: The overlap information fed to Claude is just category-level counts, not specific technical areas of overlap within categories.
  • Variable evidence quality: Some gap evidence cites specific data ("only N drafts address X"), while other evidence is based on Claude's inference about what is missing.
  • Ungrounded severity: The distinction between critical, high, medium, and low severity is assigned by Claude without defined thresholds.

Strengthening Options

  • Ground against a reference architecture (FIPA, NIST AI RMF, or a custom agent ecosystem model)
  • Run multiple independent gap analyses and intersect results
  • Have domain experts validate and rank gaps
  • Cite specific drafts that partially address each gap

6. Embedding Model

Model: nomic-embed-text (Nomic AI), run locally via Ollama.

Properties:

  • 768-dimensional embeddings
  • Context window: ~8192 tokens
  • General-purpose text embedding model trained on diverse English text
  • Not fine-tuned for technical/standards document similarity

Input: Title + abstract + first 4000 characters of full text (when available), concatenated with double newlines. Input is truncated to ~32,000 characters before embedding.

Similarity metric: Cosine similarity, computed as dot product divided by product of L2 norms.
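
In NumPy, this is a direct transcription of the formula:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product divided by the product of L2 norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```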

Limitations: As a general-purpose model, nomic-embed-text may not capture domain-specific semantic relationships in standards documents as well as a model fine-tuned on technical/legal/standards text. The embeddings have not been evaluated against a gold-standard similarity judgment set for IETF drafts.


7. Known Limitations Summary

| Limitation | Impact | Mitigation |
|------------|--------|------------|
| Abstract-only rating | Maturity/overlap scores may be unreliable | Could re-rate with full text for a validation sample |
| No human calibration | Rating validity is unknown | Calibration study with 5 experts on 25 drafts |
| Keyword selection bias | ~30-50 false positives, unknown false negatives | Relevance filtering, manual review of low-scoring drafts |
| Empirical clustering thresholds | Cluster boundaries may be arbitrary | Sensitivity analysis at multiple thresholds |
| Single-shot gap analysis | Gaps may be incomplete or misprioritized | Ground against a reference architecture |
| General-purpose embeddings | Domain-specific similarity may be missed | Evaluate against expert similarity judgments |
| Batch vs. individual extraction quality | Idea counts and quality may vary by extraction method | Compare batch (Haiku) vs. individual (Sonnet) on a sample |
| Organization normalization | Cross-org analysis depends on alias accuracy | Publish and review the normalization table |

8. Related Standards and Research

The IETF's AI agent standardization effort exists within a broader ecosystem of agent-related standards and research. This analysis would benefit from comparison against:

FIPA (Foundation for Intelligent Physical Agents)

The original agent communication standards body (1996-2005). FIPA's Agent Communication Language (ACL), Agent Management Specification, and Agent Platform specifications are the direct ancestors of modern A2A protocols. Key specifications include:

  • FIPA ACL (SC00061): Message structure and performatives for agent communication
  • FIPA Agent Management (SC00023): Agent platform architecture with Agent Management System (AMS), Directory Facilitator (DF), and Message Transport Service (MTS)
  • FIPA Interaction Protocols (SC00026-SC00036): Request, query, contract net, brokering, and other standard interaction patterns

FIPA's work is relevant because many of the "novel" A2A protocol proposals in the IETF corpus address problems FIPA solved (or attempted to solve) 20+ years ago. The absence of FIPA references in most current drafts suggests a lack of awareness of prior art.

IEEE P3394 — Standard for Trustworthy Autonomous and Semi-Autonomous Systems

An active IEEE working group developing standards for trustworthy AI agents, addressing trust, transparency, and accountability. Relevant to the IETF's AI safety/alignment and policy/governance categories. The IETF's safety deficit (4:1 capability-to-safety ratio) should be evaluated in the context of IEEE's complementary safety-focused standardization.

W3C Web of Things (WoT)

The W3C WoT Architecture and Thing Description specifications address agent/device discovery and interoperability in IoT contexts:

  • WoT Architecture (W3C REC): Defines servients, protocol bindings, and discovery mechanisms applicable to agent systems
  • WoT Thing Description (W3C REC): Machine-readable metadata for capabilities, interfaces, and security -- analogous to agent capability description proposals in the IETF corpus

Several IETF drafts build on or compete with WoT concepts for agent discovery and description.

Academic Multi-Agent Systems (MAS) Research

The multi-agent systems research community (AAMAS, JAIR, JAAMAS) has decades of work on problems the IETF drafts are now addressing at the protocol level:

  • Agent coordination: Consensus, negotiation, auction mechanisms (relevant to IETF gap "Multi-Agent Consensus Protocols")
  • Trust and reputation: Computational trust models, reputation systems (relevant to agent identity/auth drafts)
  • Agent verification: Model checking, runtime verification of agent behavior (relevant to IETF gap "Agent Behavioral Verification")
  • MAS security: Secure agent platforms, malicious agent detection (relevant to AI safety/alignment drafts)

Key survey references:

  • Wooldridge, M. (2009). "An Introduction to MultiAgent Systems" -- Standard textbook covering agent architectures, communication, coordination
  • Dorri, A., Kanhere, S.S., Jurdak, R. (2018). "Multi-Agent Systems: A Survey" -- IEEE Access, covering modern MAS challenges
  • The AAMAS conference proceedings (annual since 2002) -- primary venue for MAS research

Other Relevant Standards Bodies

  • OASIS: TOSCA (Topology and Orchestration Specification for Cloud Applications) and prior work on service-oriented agent architectures
  • ITU-T: Y.3170 series on machine learning in future networks, relevant to the autonomous netops and ML traffic management categories
  • ETSI: ENI (Experiential Networked Intelligence) and ZSM (Zero-touch network and Service Management), addressing autonomous network management

This document was created 2026-03-08 in response to scientific and statistical review findings. It should be updated as the analysis pipeline evolves.