ietf-draft-analyzer/data/reports/methodology.md
Christian Nennemann 439424bd04 Fix security, data integrity, and accuracy issues from 4-perspective review
Security fixes:
- Fix SQL injection in db.py:update_generation_run (column name whitelist)
- Flask SECRET_KEY from env var instead of hardcoded
- Add LLM rating bounds validation (_clamp_rating, 1-10)
- Fix JSON extraction trailing whitespace handling

Data integrity:
- Normalize 21 legacy category names to 11 canonical short forms
- Add false_positive column, flag 73 non-AI drafts (361 relevant remain)
- Document verified counts: 434 total/361 relevant drafts, 557 authors, 419 ideas, 11 gaps

Code quality:
- Fix version string 0.1.0 → 0.2.0
- Add close()/context manager to Embedder class
- Dynamic matrix size instead of hardcoded "260x260"

Blog accuracy:
- Fix EU AI Act timeline (enforcement Aug 2026, not "18 months")
- Distinguish OAuth consent from GDPR Einwilligung
- Add EU AI Act Annex III context to hospital scenario
- Add FIPA, eIDAS 2.0 references where relevant

Methodology:
- Add methodology.md documenting pipeline, limitations, rating rubric
- Add LLM-as-judge caveats to analyzer.py
- Document clustering threshold rationale

Reviews from: legal (German/EU law), statistics, development, science perspectives.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 10:52:33 +01:00


# Methodology — IETF Draft Analyzer
*This document describes the data collection, analysis pipeline, and known limitations of the IETF Draft Analyzer project. It is intended to provide transparency for anyone evaluating the findings in the blog series or reports.*
---
## 1. Data Collection
### Source
All data is fetched from the IETF Datatracker API (`https://datatracker.ietf.org/api/v1/doc/document/`). Full draft text is retrieved from `https://www.ietf.org/archive/id/{name}-{rev}.txt`. Author and affiliation data comes from the `/api/v1/doc/documentauthor/` and `/api/v1/person/person/` endpoints.
### Keyword Selection
The corpus is built by searching for drafts matching 12 keywords across both `name__contains` and `abstract__contains` fields:
`agent`, `ai-agent`, `llm`, `autonomous`, `machine-learning`, `artificial-intelligence`, `mcp`, `agentic`, `inference`, `generative`, `intelligent`, `aipref`
Only drafts with `type__slug=draft` and submission date >= 2024-01-01 are included.
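The query construction can be sketched as follows. The keyword list and the `name__contains`/`abstract__contains`/`type__slug` filters come from the description above; the `time__gte` parameter name and the helper itself are illustrative assumptions, not the project's actual fetch code:

```python
from urllib.parse import urlencode

BASE = "https://datatracker.ietf.org/api/v1/doc/document/"
KEYWORDS = [
    "agent", "ai-agent", "llm", "autonomous", "machine-learning",
    "artificial-intelligence", "mcp", "agentic", "inference",
    "generative", "intelligent", "aipref",
]

def build_queries(since: str = "2024-01-01") -> list[str]:
    """One Datatracker query URL per (keyword, field) pair."""
    urls = []
    for kw in KEYWORDS:
        for field in ("name__contains", "abstract__contains"):
            params = {
                field: kw,
                "type__slug": "draft",
                "time__gte": since,  # assumed name for the date filter
                "format": "json",
            }
            urls.append(BASE + "?" + urlencode(params))
    return urls
```

With 12 keywords and two fields, this yields 24 queries whose results are deduplicated by draft name before further processing.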
### Selection Bias Acknowledgment
Keyword-based selection introduces both false positives and false negatives:
- **False positives**: Keywords like "agent" match "user agent" in HTTP contexts, "autonomous" matches "autonomous systems" (AS) in routing, and "intelligent" matches "intelligent networking" unrelated to AI. We estimate 30-50 false positives remain in the corpus despite relevance filtering. Drafts with a relevance score <= 2 are the most obvious cases, but some false positives receive relevance scores of 3-4 from the LLM judge.
- **False negatives**: Relevant drafts using terminology not in our keyword list (e.g., "cognitive," "self-driving network," or domain-specific terms) are missed entirely.
- **Temporal bias**: The `fetch_since` cutoff of 2024-01-01 excludes earlier foundational work that may inform the current landscape.
### Organization Normalization
Author affiliations are normalized using a hand-curated alias table of 40+ mappings (e.g., "Huawei Technologies Co., Ltd." -> "Huawei") plus automatic suffix stripping for common patterns (", Inc.", " LLC", " AB", etc.). This normalization is essential for cross-org analysis but introduces judgment calls about organizational boundaries.
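A minimal sketch of this normalization, with a two-entry alias table for illustration (the project's real table has 40+ mappings, and its exact suffix list may differ):

```python
# Illustrative alias table and suffix list; not the project's full versions.
ALIASES = {
    "huawei technologies co., ltd.": "Huawei",
    "huawei technologies": "Huawei",
}
SUFFIXES = (", inc.", ", inc", " llc", " ab", " gmbh", " ltd.", " ltd")

def normalize_org(raw: str) -> str:
    """Map a raw affiliation string to a canonical organization name."""
    name = " ".join(raw.split())  # collapse internal whitespace
    key = name.lower()
    if key in ALIASES:
        return ALIASES[key]
    for suffix in SUFFIXES:
        if key.endswith(suffix):
            name = name[: len(name) - len(suffix)].rstrip(" ,.")
            break
    return name
```

The alias lookup runs before suffix stripping, so hand-curated mappings always win over the automatic rules.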
---
## 2. Analysis Pipeline
The pipeline runs in six stages, each building on the previous:
```
fetch ------> analyze ----> embed ------> ideas ------> gaps -------> report
  |             |             |             |             |             |
  v             v             v             v             v             v
Datatracker   Claude        Ollama        Claude        Claude        Markdown
API           Sonnet        nomic-embed   Haiku         Sonnet        + rich
```
### Stage 1: Fetch
Retrieves draft metadata, full text, and author information from the Datatracker API with a 0.5-second polite delay between requests.
### Stage 2: Analyze (Rating)
Each draft is rated by Claude Sonnet on five dimensions using a compact structured prompt that includes the draft's name, title, date, page count, and abstract (truncated to 2000 characters). See "Rating Rubric" below.
### Stage 3: Embed
Vector embeddings are generated using Ollama with the `nomic-embed-text` model. The input combines the draft's title, abstract, and first 4000 characters of full text. Embeddings are 768-dimensional vectors stored as binary blobs in SQLite. See "Embedding Model" below.
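The blob storage can be sketched as a round-trip through SQLite. The project's exact serialization is not documented here; little-endian float32 is an assumption for illustration:

```python
import sqlite3
import struct

DIM = 768  # nomic-embed-text output dimensionality

def to_blob(vec: list[float]) -> bytes:
    """Serialize an embedding as little-endian float32 (assumed format)."""
    return struct.pack(f"<{len(vec)}f", *vec)

def from_blob(blob: bytes) -> list[float]:
    """Inverse of to_blob: 4 bytes per float32 component."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE embeddings (draft TEXT PRIMARY KEY, vec BLOB)")

vec = [0.0] * DIM
vec[0], vec[1] = 1.0, -0.5
con.execute("INSERT INTO embeddings VALUES (?, ?)",
            ("draft-example-00", to_blob(vec)))
(blob,) = con.execute("SELECT vec FROM embeddings").fetchone()
restored = from_blob(blob)
```

Storing float32 halves the blob size relative to Python's native float64 at no meaningful cost to cosine-similarity precision.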
### Stage 4: Ideas
Technical ideas are extracted by Claude. Individual extraction uses Sonnet with abstract + first 3000 characters of full text. Batch extraction uses Haiku with abstract only (truncated to 800 characters). The prompt requests 1-4 top-level novel contributions per draft.
### Stage 5: Gaps
A single Claude Sonnet call receives compressed landscape statistics (category counts, top ideas, overlap summary) and identifies 8-15 standardization gaps. See "Gap Analysis" below.
### Stage 6: Report
Markdown reports are generated from database queries. No LLM is involved in report generation.
### Caching
All Claude API calls are cached in an `llm_cache` table keyed by SHA-256 hash of the full prompt. Re-runs return cached results, making the pipeline idempotent. This also means that intra-rater consistency cannot be measured from cached results.
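A minimal sketch of this caching scheme. The `llm_cache` table name and SHA-256 keying come from the description above; the column names and the `cached_call` wrapper are assumptions:

```python
import hashlib
import sqlite3

def cache_key(prompt: str) -> str:
    """SHA-256 hex digest of the full prompt text."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE llm_cache (key TEXT PRIMARY KEY, response TEXT)")

def cached_call(prompt: str, call) -> str:
    """Return the cached response if present, otherwise call the LLM once."""
    key = cache_key(prompt)
    row = con.execute(
        "SELECT response FROM llm_cache WHERE key = ?", (key,)
    ).fetchone()
    if row:
        return row[0]  # cache hit: re-runs are idempotent
    response = call(prompt)
    con.execute("INSERT INTO llm_cache VALUES (?, ?)", (key, response))
    return response
```

Because the key covers the entire prompt, any change to the prompt template invalidates the cache for all drafts, which is the desired behavior.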
---
## 3. Rating Rubric
Each draft is scored on five dimensions, each on a 1-5 integer scale:
| Dimension | 1 | 2 | 3 | 4 | 5 |
|-----------|---|---|---|---|---|
| **Novelty** | Trivial/obvious extension | Incremental | Useful contribution | Notable originality | Genuinely novel approach |
| **Maturity** | Problem statement only | Early sketch | Defined protocol/mechanism | Detailed spec with examples | Implementation-ready with test vectors |
| **Overlap** | Unique approach | Minor similarities | Shares concepts with 1-2 drafts | Significant overlap | Near-duplicate of existing work |
| **Momentum** | Inactive/abandoned | Single revision | Active development | WG interest/adoption | Strong community momentum |
| **Relevance** | Not about AI/agents (false positive) | Tangentially related | Partially relevant | Directly relevant | Core AI agent topic |
### Composite Score
The composite score used in reports is a 4-dimension average of novelty, maturity, momentum, and relevance (excluding overlap, since overlap measures redundancy rather than quality). Exact decimal values are used; rounding is avoided.
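The composite is then simply:

```python
def composite(novelty: int, maturity: int, momentum: int, relevance: int) -> float:
    """Mean of four dimensions; overlap is deliberately excluded."""
    return (novelty + maturity + momentum + relevance) / 4
```

For example, a draft scored 4/3/2/5 gets a composite of 3.5, reported without rounding.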
### Scale Interpretation
Scores should be treated as **relative rankings within this corpus**, not absolute quality measures. Key limitations:
- **Abstract-only input**: Ratings are based on the draft's abstract (truncated to 2000 characters), not the full text. Maturity and overlap scores are particularly affected, since the abstract may not convey the full technical depth or specificity of the draft.
- **Single LLM judge**: Claude Sonnet is the sole rater. No human calibration study has been performed. No second-model comparison has been conducted. Even a small calibration set (20-30 drafts rated by domain experts) would substantially strengthen confidence.
- **No consistency measurement**: Each draft is rated once. The caching mechanism prevents re-rating, so Claude's self-consistency on these drafts is untested.
- **Overlap score limitations**: Claude rates each draft independently without access to the full corpus. The overlap dimension reflects Claude's general knowledge of the field, not corpus-specific similarity analysis. For corpus-level overlap, use the embedding-based similarity analysis instead.
- **Relevance inflation**: Keyword-matched drafts tend to score high on relevance by construction. The distribution is right-skewed (most drafts at 4-5).
- **Batch effects**: When rated in batches of 5 (using `BATCH_PROMPT`), position effects and comparison effects between drafts in the same batch are uncontrolled. Abstracts are truncated more aggressively (1500 chars) in batch mode.
---
## 4. Clustering
### Method
Greedy single-linkage clustering on the pairwise cosine similarity matrix of draft embeddings.
**Algorithm**: For each unvisited draft (seed), find all unvisited drafts with cosine similarity >= threshold to the seed. Add them to the seed's cluster and mark them visited. This produces disjoint clusters where every member is similar to the seed, but members are not guaranteed to be similar to each other (single-linkage property).
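The algorithm above can be sketched directly; a toy pairwise-similarity matrix stands in for the embedding-derived one:

```python
def greedy_clusters(sim: list[list[float]], threshold: float) -> list[list[int]]:
    """Greedy seed-based clustering on a symmetric similarity matrix.

    Each unvisited draft seeds a cluster and absorbs every unvisited
    draft whose similarity to the seed meets the threshold.
    """
    n = len(sim)
    visited = [False] * n
    clusters = []
    for seed in range(n):
        if visited[seed]:
            continue
        visited[seed] = True
        cluster = [seed]
        for other in range(n):
            if not visited[other] and sim[seed][other] >= threshold:
                visited[other] = True
                cluster.append(other)
        clusters.append(cluster)
    return clusters
```

Note that membership is decided against the seed only, so two members of the same cluster may themselves fall below the threshold.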
### Thresholds
| Threshold | Label | Justification |
|-----------|-------|---------------|
| 0.85 | Topically overlapping | Empirical: 0.80 produced too many false groupings; 0.90 missed obvious clusters; 0.85 yielded groups that looked reasonable on manual spot-checking |
| 0.90 | Near-duplicates | Empirical: pairs above 0.90 consistently covered the same topic with similar approaches |
| 0.98 | Functionally identical | Empirical: pairs above 0.98 were essentially the same document under different names |
**None of these thresholds are derived from a principled analysis.** A sensitivity analysis (running clustering at 0.80, 0.85, 0.90 and comparing results) would strengthen confidence. Different embedding models would produce different similarity distributions, potentially requiring different thresholds.
### Limitations
- **Single-linkage chaining**: A chain of pairwise-similar drafts can produce clusters containing semantically distant drafts connected through intermediaries.
- **No comparison to alternatives**: The clustering has not been compared against k-means, DBSCAN, hierarchical agglomerative clustering, or other standard methods.
- **General-purpose embeddings**: The `nomic-embed-text` model was not trained specifically for technical standards document similarity. Domain-specific or fine-tuned embeddings might produce significantly different cluster structures.
- **Inconsistent embedding input**: Drafts with full text available are embedded from title + abstract + 4000 chars of body. Drafts without full text are embedded from title + abstract only. This creates systematic quality differences in embeddings.
---
## 5. Gap Analysis
The gap analysis sends Claude Sonnet a compressed landscape summary containing:
- Category distribution (category name and draft count)
- Top 20 most frequently occurring idea titles
- Overlap summary (top 5 categories by count, labeled "high internal overlap")
Claude is instructed to identify 8-15 gaps with topic, description, category, severity (critical/high/medium/low), and evidence.
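A hedged sketch of validating such gap records after the LLM call. The field names follow the prompt description above; the validation helpers themselves are hypothetical, not part of the pipeline:

```python
SEVERITIES = {"critical", "high", "medium", "low"}
REQUIRED_FIELDS = {"topic", "description", "category", "severity", "evidence"}

def valid_gap(gap: dict) -> bool:
    """Accept a gap record only if all fields are present and severity is known."""
    return REQUIRED_FIELDS <= gap.keys() and gap["severity"] in SEVERITIES

def filter_gaps(gaps: list[dict]) -> list[dict]:
    """Drop malformed records; a real pipeline might re-prompt instead."""
    return [g for g in gaps if valid_gap(g)]
```

Validation like this catches structural failures (missing fields, invented severity labels) but cannot, of course, check whether a gap is substantively real.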
### Limitations
- **Single-shot generation**: Gaps are identified in one LLM call, not through systematic comparison against a reference taxonomy.
- **No reference architecture**: A rigorous gap analysis would compare the corpus against an explicit agent ecosystem reference model (e.g., NIST AI RMF, FIPA agent platform model). The current approach relies on Claude's general knowledge.
- **Circular overlap summary**: The overlap information fed to Claude is just category-level counts, not specific technical areas of overlap within categories.
- **Variable evidence quality**: Some gap evidence cites specific data ("only N drafts address X"), while other evidence is based on Claude's inference about what is missing.
- **Ungrounded severity**: The distinction between critical, high, medium, and low severity is assigned by Claude without defined thresholds.
### Strengthening Options
- Ground against a reference architecture (FIPA, NIST AI RMF, or a custom agent ecosystem model)
- Run multiple independent gap analyses and intersect results
- Have domain experts validate and rank gaps
- Cite specific drafts that partially address each gap
---
## 6. Embedding Model
**Model**: `nomic-embed-text` (Nomic AI), run locally via Ollama.
**Properties**:
- 768-dimensional embeddings
- Context window: ~8192 tokens
- General-purpose text embedding model trained on diverse English text
- Not fine-tuned for technical/standards document similarity
**Input**: Title + abstract + first 4000 characters of full text (when available), concatenated with double newlines. Input is truncated to ~32,000 characters before embedding.
**Similarity metric**: Cosine similarity, computed as dot product divided by product of L2 norms.
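In code, this is the standard definition:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product over the product of L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```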
**Limitations**: As a general-purpose model, nomic-embed-text may not capture domain-specific semantic relationships in standards documents as well as a model fine-tuned on technical/legal/standards text. The embeddings have not been evaluated against a gold-standard similarity judgment set for IETF drafts.
---
## 7. Known Limitations Summary
| Limitation | Impact | Mitigation |
|------------|--------|------------|
| Abstract-only rating | Maturity/overlap scores may be unreliable | Could re-rate with full text for a validation sample |
| No human calibration | Rating validity is unknown | Calibration study with 5 experts on 25 drafts |
| Keyword selection bias | ~30-50 false positives, unknown false negatives | Relevance filtering, manual review of low-scoring drafts |
| Empirical clustering thresholds | Cluster boundaries may be arbitrary | Sensitivity analysis at multiple thresholds |
| Single-shot gap analysis | Gaps may be incomplete or misprioritized | Ground against reference architecture |
| General-purpose embeddings | Domain-specific similarity may be missed | Evaluate against expert similarity judgments |
| Batch vs. individual extraction quality | Idea counts and quality may vary by extraction method | Compare batch (Haiku) vs. individual (Sonnet) on sample |
| Organization normalization | Cross-org analysis depends on alias accuracy | Publish and review normalization table |
---
## 8. Related Work
The IETF's AI agent standardization effort exists within a broader ecosystem of agent-related standards and research. This analysis would benefit from comparison against:
### FIPA (Foundation for Intelligent Physical Agents)
The original agent communication standards body (1996-2005). FIPA's Agent Communication Language (ACL), Agent Management Specification, and Agent Platform specifications are the direct ancestors of modern A2A protocols. Key specifications include:
- **FIPA ACL** (SC00061): Message structure and performatives for agent communication
- **FIPA Agent Management** (SC00023): Agent platform architecture with Agent Management System (AMS), Directory Facilitator (DF), and Message Transport Service (MTS)
- **FIPA Interaction Protocols** (SC00026-SC00036): Request, query, contract net, brokering, and other standard interaction patterns
FIPA's work is relevant because many of the "novel" A2A protocol proposals in the IETF corpus address problems FIPA solved (or attempted to solve) 20+ years ago. The absence of FIPA references in most current drafts suggests a lack of awareness of prior art.
### IEEE P3394 — Standard for Trustworthy Autonomous and Semi-Autonomous Systems
An active IEEE working group developing standards for trustworthy AI agents, addressing trust, transparency, and accountability. Relevant to the IETF's AI safety/alignment and policy/governance categories. The IETF's safety deficit (4:1 capability-to-safety ratio) should be evaluated in the context of IEEE's complementary safety-focused standardization.
### W3C Web of Things (WoT)
The W3C WoT Architecture and Thing Description specifications address agent/device discovery and interoperability in IoT contexts:
- **WoT Architecture** (W3C REC): Defines servients, protocol bindings, and discovery mechanisms applicable to agent systems
- **WoT Thing Description** (W3C REC): Machine-readable metadata for capabilities, interfaces, and security -- analogous to agent capability description proposals in the IETF corpus
Several IETF drafts build on or compete with WoT concepts for agent discovery and description.
### Academic Multi-Agent Systems (MAS) Research
The multi-agent systems research community (AAMAS, JAIR, JAAMAS) has decades of work on problems the IETF drafts are now addressing at the protocol level:
- **Agent coordination**: Consensus, negotiation, auction mechanisms (relevant to IETF gap "Multi-Agent Consensus Protocols")
- **Trust and reputation**: Computational trust models, reputation systems (relevant to agent identity/auth drafts)
- **Agent verification**: Model checking, runtime verification of agent behavior (relevant to IETF gap "Agent Behavioral Verification")
- **MAS security**: Secure agent platforms, malicious agent detection (relevant to AI safety/alignment drafts)
Key survey references:
- Wooldridge, M. (2009). "An Introduction to MultiAgent Systems" -- Standard textbook covering agent architectures, communication, coordination
- Dorri, A., Kanhere, S.S., Jurdak, R. (2018). "Multi-Agent Systems: A Survey" -- IEEE Access, covering modern MAS challenges
- The AAMAS conference proceedings (annual since 2002) -- primary venue for MAS research
### Other Relevant Standards Bodies
- **OASIS**: TOSCA (Topology and Orchestration Specification for Cloud Applications) and prior work on service-oriented agent architectures
- **ITU-T**: Y.3170 series on machine learning in future networks, relevant to the autonomous netops and ML traffic management categories
- **ETSI**: ENI (Experiential Networked Intelligence) and ZSM (Zero-touch network and Service Management), addressing autonomous network management
---
*This document was created 2026-03-08 in response to scientific and statistical review findings. It should be updated as the analysis pipeline evolves.*