Idea quality pipeline, web UI features, academic paper
- Tighten idea extraction prompts (1-4 ideas, no sub-features) reducing 1,907 ideas to 468 across 434 drafts (78% reduction)
- Add embedding-based dedup (ietf dedup-ideas) for same-draft similarity
- Add novelty scoring (ietf ideas score) and filtering (ietf ideas filter) using Claude to rate ideas 1-5, removing 49 generic building blocks
- Final count: 419 high-quality ideas (avg 1.1/draft)
- Web UI: gap explorer with live draft generation and pre-generated demos
- Web UI: D3.js author collaboration network (498 nodes, 1142 edges, 68 clusters, org filtering, interactive zoom/pan)
- Academic paper: 15-page LaTeX workshop paper analyzing the 434-draft AI agent standards landscape
- Save improvement ideas backlog to data/reports/improvement-ideas.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
paper/main.tex
@@ -11,13 +11,12 @@
 \usepackage{xcolor}
 \usepackage{amsmath}
 \usepackage{natbib}
 % \usepackage{microtype} % Uncomment if texlive-fonts-extra is installed
 \usepackage{float}
 \usepackage{caption}
 \usepackage{subcaption}
 % \usepackage{multirow} % Uncomment if texlive-latex-extra is installed
 \usepackage{tabularx}
 \usepackage{enumitem}
-% \usepackage{multirow} % Uncomment if texlive-latex-extra is installed
 
 \hypersetup{
     colorlinks=true,
@@ -31,9 +30,8 @@
 % ── Title ─────────────────────────────────────────────────────────────────
 
 \title{%
-  \textbf{The AI Agent Standardization Wave:\\
-  A Quantitative Analysis of 260 IETF Internet-Drafts\\
-  on Autonomous Agents and Artificial Intelligence}%
+  \textbf{The AI Agent Standards Gold Rush:\\
+  A Systematic Analysis of 434 IETF Internet-Drafts}%
 }
 
 \author{
@@ -42,7 +40,7 @@
   \texttt{[email]}
 }
 
-\date{February 2026}
+\date{March 2026}
 
 \begin{document}
 \maketitle
@@ -50,10 +48,10 @@
 % ── Abstract ──────────────────────────────────────────────────────────────
 
 \begin{abstract}
-The Internet Engineering Task Force (IETF) is experiencing an unprecedented surge in standardization activity related to artificial intelligence and autonomous agents. Between June 2025 and February 2026, we identified and analyzed 260 Internet-Drafts addressing AI agent protocols, identity, discovery, safety, and interoperability. Using a mixed-methods approach combining Datatracker API harvesting, LLM-assisted multi-dimensional rating (Claude), local embedding-based similarity analysis (Ollama/nomic-embed-text), and author network mapping, we provide the first systematic quantitative survey of this emerging standardization landscape. Our analysis reveals significant thematic overlap (7.9\% of draft pairs exceed 0.80 cosine similarity), strong organizational concentration (top 5 organizations contribute 35\% of drafts), rapid category growth (2 to 72 submissions per month in 9 months), and notable gaps in safety-focused proposals relative to protocol-focused ones. We extract 1,262 discrete technical ideas across six types and identify structural patterns in the co-authorship network spanning 403 contributors. Our open-source analysis toolkit and dataset are released to support further research into standards evolution and AI governance.
+The Internet Engineering Task Force (IETF) is experiencing an unprecedented surge in standardization activity related to artificial intelligence and autonomous agents. We present the first systematic quantitative survey of this landscape, analyzing 434 Internet-Drafts from 557 authors across 230 organizations submitted between 2024 and early 2026. Using a hybrid LLM-assisted pipeline---Anthropic Claude for multi-dimensional rating and idea extraction, Ollama/nomic-embed-text for semantic embedding and similarity analysis---we assess each draft on five dimensions (novelty, maturity, overlap, momentum, relevance), extract 1,907 discrete technical ideas, identify 11 standardization gaps (2 critical), and map the co-authorship network. Our analysis reveals three headline findings: (1) a 4:1 ratio of capability-building drafts to safety-focused ones, indicating a systemic safety deficit; (2) significant thematic redundancy, with 42 overlap clusters and 120 competing agent-to-agent protocol proposals; and (3) concentrated organizational authorship, with a single company contributing 18\% of all drafts. We identify critical gaps in agent behavior verification, human override protocols, and cross-protocol interoperability. The methodology itself---using LLMs to systematically analyze a standards corpus---represents a novel contribution applicable to other standards bodies. Our open-source toolkit and dataset are released for reproducibility.
 \end{abstract}
 
-\noindent\textbf{Keywords:} IETF, Internet-Drafts, AI agents, standardization, protocol analysis, NLP, embedding similarity, author networks
+\noindent\textbf{Keywords:} IETF, Internet-Drafts, AI agents, standardization, protocol analysis, LLM-assisted analysis, embedding similarity, safety deficit, author networks
 
 % ── 1. Introduction ──────────────────────────────────────────────────────
 
@@ -61,7 +59,7 @@ The Internet Engineering Task Force (IETF) is experiencing an unprecedented surg
 
 The rapid deployment of large language models (LLMs) and autonomous AI agents has created urgent demand for interoperability standards. Unlike previous technology waves where standardization followed deployment by years, the AI agent ecosystem is seeing concurrent development of both technology and standards. The IETF, as the primary venue for Internet protocol standardization, has become a focal point for this activity.
 
-Between June 2025 and February 2026, we observed a dramatic acceleration: from 2 AI-related Internet-Drafts per month to 72, representing a 36$\times$ increase in 9 months. This ``standardization wave'' spans diverse topics including agent-to-agent communication protocols, identity and authentication frameworks, discovery mechanisms, safety guardrails, and data format interoperability.
+The acceleration is dramatic. In 2024, just 9 AI/agent-related Internet-Drafts were submitted to the IETF---0.5\% of all submissions. By Q1 2026, AI/agent drafts account for 9.3\% of all new Internet-Drafts: nearly 1 in 10. This ``gold rush'' spans diverse topics including agent-to-agent (A2A) communication protocols, identity and authentication frameworks, discovery mechanisms, safety guardrails, and data format interoperability.
 
 However, the speed and volume of this activity raise important questions:
 \begin{itemize}[nosep]
@@ -73,18 +71,20 @@ However, the speed and volume of this activity raises important questions:
 
 To answer these questions, we built an automated analysis pipeline that:
 \begin{enumerate}[nosep]
-\item Harvests draft metadata and full text from the IETF Datatracker API (260 drafts, 403 authors).
+\item Harvests draft metadata and full text from the IETF Datatracker API (434 drafts, 557 authors).
 \item Rates each draft on five dimensions---novelty, maturity, overlap, momentum, and relevance---using LLM-assisted analysis (Anthropic Claude).
-\item Generates semantic embeddings (Ollama/nomic-embed-text) and computes pairwise cosine similarity across all 33,670 draft pairs.
-\item Extracts 1,262 discrete technical ideas classified into six types.
-\item Maps the co-authorship network and organizational affiliations.
+\item Generates semantic embeddings (Ollama/nomic-embed-text) and computes pairwise cosine similarity across all $\binom{434}{2} = 93{,}961$ draft pairs.
+\item Extracts 1,907 discrete technical ideas classified into six primary types.
+\item Identifies 11 standardization gaps through systematic comparison of coverage.
+\item Maps the co-authorship network and organizational affiliations across 557 contributors.
 \end{enumerate}
 
 \noindent Our contributions are:
 \begin{itemize}[nosep]
-\item \textbf{First systematic survey} of AI/agent-related IETF drafts at scale.
-\item \textbf{Multi-dimensional quantitative analysis} revealing overlap, quality distribution, and category dynamics.
-\item \textbf{Reproducible methodology} combining LLM-assisted rating with embedding-based similarity.
+\item \textbf{First systematic survey} of AI/agent-related IETF drafts at scale, covering 434 drafts.
+\item \textbf{Quantitative evidence of a safety deficit}: a 4:1 ratio of capability-building to safety proposals.
+\item \textbf{Gap analysis} identifying 11 underserved areas, including 2 critical gaps with near-zero coverage.
+\item \textbf{Reproducible LLM-assisted methodology} combining Claude-based rating with embedding-based similarity, applicable to other standards corpora.
 \item \textbf{Open-source toolkit} and dataset for ongoing monitoring of AI standardization.
 \end{itemize}
 
@@ -94,73 +94,90 @@ To answer these questions, we built an automated analysis pipeline that:
 
 \subsection{IETF Standardization Process}
 
-The IETF develops Internet standards through an open, consensus-based process~\citep{rfc2026}. Internet-Drafts (I-Ds) are the primary input to this process: working documents that may evolve into Requests for Comments (RFCs) or expire without adoption. The Datatracker system\footnote{\url{https://datatracker.ietf.org}} provides programmatic API access to draft metadata, author information, and lifecycle states.
+The IETF develops Internet standards through an open, consensus-based process~\citep{rfc2026}. Internet-Drafts (I-Ds) are the primary input: working documents that may evolve into Requests for Comments (RFCs) or expire without adoption. The Datatracker system\footnote{\url{https://datatracker.ietf.org}} provides programmatic API access to draft metadata, author information, and lifecycle states. I-Ds have a six-month expiry and can be submitted by any individual or working group.
 
-\subsection{AI Agent Standardization}
+\subsection{AI Agent Standardization Landscape}
 
-Several parallel efforts address AI agent interoperability. Google's Agent-to-Agent (A2A) protocol~\citep{a2a2025}, Anthropic's Model Context Protocol (MCP)~\citep{mcp2025}, and various IETF working group proposals each take different architectural approaches. The IETF's focus spans identity (OAuth extensions, agentic JWTs), discovery (agent URIs, capability advertisement), communication protocols, and safety frameworks.
+Several parallel efforts address AI agent interoperability. Google's Agent-to-Agent (A2A) protocol~\citep{a2a2025} defines a framework for agent discovery and task execution. Anthropic's Model Context Protocol (MCP)~\citep{mcp2025} specifies how LLMs connect to external tools and data sources. Within the IETF, the newly formed AIPREF working group addresses AI content usage preferences, while proposals span identity (OAuth extensions, agentic JWTs), discovery (agent URIs, DNS-based registration), communication protocols (over QUIC, SIP, HTTP), and safety frameworks (accountability protocols, verifiable conversations).
 
 \subsection{Automated Analysis of Standards Documents}
 
-Prior work on automated standards analysis has focused on RFC evolution~\citep{arkko2019}, IETF participation patterns~\citep{simmons2019}, and working group dynamics. To our knowledge, no prior study has applied LLM-assisted analysis and embedding similarity to quantitatively assess Internet-Draft content at scale.
+Prior work on automated standards analysis has focused on RFC evolution~\citep{arkko2019}, IETF participation patterns~\citep{simmons2019}, and working group dynamics. Bibliometric studies of standards bodies~\citep{baron2019} have examined citation networks and organizational influence. To our knowledge, no prior study has applied LLM-assisted analysis and embedding similarity to quantitatively assess Internet-Draft content at scale.
 
 \subsection{LLM-Assisted Document Analysis}
 
-Recent work demonstrates the effectiveness of LLMs for document classification~\citep{brown2020}, technical summarization, and multi-dimensional assessment. We extend this by combining LLM rating with local embedding models for similarity computation, providing both semantic understanding and quantitative comparability.
+Recent work demonstrates the effectiveness of LLMs for document classification~\citep{brown2020}, technical summarization, and multi-dimensional assessment. The use of LLMs as ``judges'' for evaluating text quality has gained traction in NLP research~\citep{zheng2023}. We extend this paradigm by combining LLM-based rating with local embedding models for similarity computation, providing both semantic understanding and quantitative comparability across a large technical corpus.
 
 % ── 3. Methodology ──────────────────────────────────────────────────────
 
 \section{Methodology}
 
 Figure~\ref{fig:pipeline} illustrates our five-stage analysis pipeline. Each stage is described below.
 
 \begin{figure}[H]
 \centering
 \fbox{\parbox{0.9\textwidth}{\centering
 \textbf{Pipeline Overview}\\[6pt]
 \texttt{Fetch} $\rightarrow$ \texttt{Analyze/Rate} $\rightarrow$ \texttt{Embed} $\rightarrow$ \texttt{Extract Ideas} $\rightarrow$ \texttt{Find Gaps}\\[4pt]
 {\small Datatracker API \quad Claude (Sonnet 4) \quad Ollama/nomic-embed-text \quad Claude \quad Claude}
 }}
 \caption{Five-stage analysis pipeline. All intermediate results are cached in SQLite for reproducibility.}
 \label{fig:pipeline}
 \end{figure}
 
 \subsection{Data Collection}
 
-We queried the IETF Datatracker API v1\footnote{\url{https://datatracker.ietf.org/api/v1/doc/document/}} using six seed keywords: \texttt{agent}, \texttt{ai-agent}, \texttt{llm}, \texttt{autonomous}, \texttt{machine-learning}, and \texttt{artificial-intelligence}. For each matching draft (type \texttt{draft}), we retrieved:
+We queried the IETF Datatracker API v1\footnote{\url{https://datatracker.ietf.org/api/v1/doc/document/}} using twelve seed keywords: \texttt{agent}, \texttt{ai-agent}, \texttt{llm}, \texttt{autonomous}, \texttt{machine-learning}, \texttt{artificial-intelligence}, \texttt{mcp}, \texttt{agentic}, \texttt{inference}, \texttt{generative}, \texttt{intelligent}, and \texttt{aipref}. Keywords were matched against both draft names (\texttt{name\_\_contains}) and abstracts (\texttt{abstract\_\_contains}). For each matching draft (type \texttt{draft}), we retrieved:
 \begin{itemize}[nosep]
-\item Metadata: title, abstract, date, revision, pages, working group, states
-\item Full text: downloaded from \texttt{ietf.org/archive/id/}
-\item Author information: via the \texttt{/api/v1/doc/documentauthor/} and \texttt{/api/v1/person/person/} endpoints
+\item Metadata: title, abstract, submission date, revision number, page count, working group, states
+\item Full text: downloaded from \texttt{ietf.org/archive/id/\{name\}-\{rev\}.txt}
+\item Author information: via the \texttt{documentauthor} and \texttt{person} API endpoints
 \end{itemize}
-All data was stored in a SQLite database with FTS5 full-text search indexing.
+All data was stored in a SQLite database with FTS5 full-text search indexing, enabling efficient querying across the corpus.
 
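The keyword harvest described in this subsection can be sketched as follows. This is a minimal illustration, not the paper's tooling: the endpoint path and the `name__contains`/`abstract__contains` filters come from the text, while the exact parameter set (`format`, `limit`) and the name-keyed dedup step are assumptions of the sketch.

```python
from urllib.parse import urlencode

API = "https://datatracker.ietf.org/api/v1/doc/document/"

# The twelve seed keywords named in the Data Collection subsection.
KEYWORDS = ["agent", "ai-agent", "llm", "autonomous", "machine-learning",
            "artificial-intelligence", "mcp", "agentic", "inference",
            "generative", "intelligent", "aipref"]

def query_urls(keyword, limit=100):
    """One query URL per matched field (draft name and abstract)."""
    urls = []
    for field in ("name__contains", "abstract__contains"):
        params = {"type": "draft", field: keyword,
                  "format": "json", "limit": limit}
        urls.append(API + "?" + urlencode(params))
    return urls

def dedup(drafts):
    """Merge hits across keywords/fields, keyed by draft name."""
    seen = {}
    for d in drafts:
        seen.setdefault(d["name"], d)
    return list(seen.values())
```

Fetching each URL (and following the API's pagination) would then yield the raw corpus to store in SQLite.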
 \subsection{LLM-Assisted Rating}
 
 Each draft was assessed using Anthropic Claude (Sonnet 4) on five dimensions, each scored 1--5:
 
 \begin{itemize}[nosep]
-\item \textbf{Novelty}: Originality of the proposed approach relative to existing standards.
+\item \textbf{Novelty}: Originality of the proposed approach relative to existing standards and other drafts.
 \item \textbf{Maturity}: Completeness of specification (protocol details, data formats, security considerations).
 \item \textbf{Overlap}: Degree of redundancy with other drafts in the corpus.
 \item \textbf{Momentum}: Evidence of community engagement (revisions, working group adoption, co-authors).
-\item \textbf{Relevance}: Importance to the AI/agent ecosystem.
+\item \textbf{Relevance}: Importance to the AI/agent ecosystem specifically.
 \end{itemize}
 
-\noindent Drafts were rated in batches of 5 (abstract-only input, $\sim$400 tokens output per draft) with response caching to ensure reproducibility. A composite score was computed as:
+\noindent The prompt provided each draft's abstract and, where available, the first 4,000 characters of full text. Responses were cached by prompt SHA-256 hash to ensure reproducibility. A composite score was computed as:
 \begin{equation}
 S = 0.30 \cdot \text{novelty} + 0.25 \cdot \text{relevance} + 0.20 \cdot \text{maturity} + 0.15 \cdot \text{momentum} + 0.10 \cdot (6 - \text{overlap})
 \end{equation}
 
-\noindent The weighting prioritizes novelty and relevance while penalizing overlap (inverted, so less overlap yields higher scores).
+\noindent The weighting prioritizes novelty and relevance while penalizing overlap (inverted, so less overlap yields higher scores). We validated robustness by testing alternative weighting schemes (Section~\ref{app:sensitivity}).
 
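The composite-score equation above translates directly into code; a one-function sketch with the weights as stated:

```python
def composite_score(novelty, relevance, maturity, momentum, overlap):
    """Composite score S from the equation above. All inputs are on a
    1-5 scale; overlap is inverted (6 - overlap) so that less redundancy
    yields a higher score."""
    return (0.30 * novelty + 0.25 * relevance + 0.20 * maturity
            + 0.15 * momentum + 0.10 * (6 - overlap))
```

A draft with perfect scores on every dimension and minimal overlap (all 5s, overlap 1) reaches the maximum of 5.0, since the weights sum to 1.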
 \subsection{Embedding and Similarity Analysis}
 
-We generated embeddings for each draft using Ollama with the \texttt{nomic-embed-text} model, encoding a combination of title, abstract, and the first 4,000 characters of full text. Pairwise cosine similarity was computed across all $\binom{260}{2} = 33{,}670$ draft pairs:
+We generated 768-dimensional embeddings for each draft using Ollama with the \texttt{nomic-embed-text} model, encoding a combination of title, abstract, and the first 4,000 characters of full text. Pairwise cosine similarity was computed across all $\binom{434}{2} = 93{,}961$ draft pairs:
 \begin{equation}
 \text{sim}(a, b) = \frac{\mathbf{v}_a \cdot \mathbf{v}_b}{\|\mathbf{v}_a\| \cdot \|\mathbf{v}_b\|}
 \end{equation}
 
-\noindent Hierarchical clustering (Ward's method) was applied to the distance matrix ($1 - \text{sim}$) for heatmap visualization, and greedy clustering at threshold 0.85 identified groups of near-duplicate drafts.
+\noindent Greedy clustering at thresholds of 0.85 and 0.90 identified groups of near-duplicate and highly similar drafts. Hierarchical clustering (Ward's method) was applied to the distance matrix ($1 - \text{sim}$) for visualization.
 
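The pairwise computation can be sketched without any embedding dependency (pure-Python cosine over toy vectors; in the actual pipeline the inputs would be the 768-dimensional nomic-embed-text outputs):

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity: dot product over the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pairwise_similarities(vectors):
    """All C(n, 2) unordered pairs -- for n = 434 drafts that is
    math.comb(434, 2) == 93,961 pairs, matching the text."""
    return {(i, j): cosine(vectors[i], vectors[j])
            for i, j in combinations(range(len(vectors)), 2)}
```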
 \subsection{Idea Extraction}
 
-Claude was used to extract 3--8 discrete technical ideas per draft, each classified as one of: \textit{mechanism}, \textit{protocol}, \textit{pattern}, \textit{requirement}, \textit{architecture}, or \textit{extension}. Fuzzy string matching (SequenceMatcher, threshold 0.75) grouped similar ideas across drafts to identify convergent concepts.
+Claude was used to extract 3--8 discrete technical ideas per draft, each classified into one of six primary types: \textit{mechanism}, \textit{architecture}, \textit{pattern}, \textit{protocol}, \textit{requirement}, or \textit{extension}. Fuzzy string matching (SequenceMatcher, threshold 0.75) grouped similar ideas across drafts to identify convergent concepts---ideas that multiple teams arrived at independently.
 
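The fuzzy grouping step uses the stdlib `difflib.SequenceMatcher` named in the text; the greedy first-match policy (each idea joins the first group whose representative matches at the 0.75 threshold) is an assumption of this sketch:

```python
from difflib import SequenceMatcher

def group_ideas(ideas, threshold=0.75):
    """Greedily group idea strings: an idea joins the first group whose
    representative (first member) matches at >= threshold similarity
    ratio; otherwise it starts a new group."""
    groups = []  # each group is a list; element 0 is the representative
    for idea in ideas:
        for g in groups:
            if SequenceMatcher(None, idea.lower(), g[0].lower()).ratio() >= threshold:
                g.append(idea)
                break
        else:
            groups.append([idea])
    return groups
```

Groups with members drawn from several drafts are the "convergent concepts" the text describes.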
 \subsection{Gap Analysis}
 
 Gaps were identified by comparing the idea coverage across categories against the requirements implied by the drafts themselves. Claude analyzed the full set of ideas and categories to identify areas where standardization work is missing or inadequate, assigning severity ratings (critical, high, medium) based on the breadth of the shortfall and the consequences of leaving it unfilled.
 
 \subsection{Author Network Analysis}
 
-Author and affiliation data were retrieved from Datatracker, yielding a bipartite graph of 403 authors across 260 drafts (742 author--draft edges). We projected this to a co-authorship network and computed organizational collaboration metrics.
+Author and affiliation data were retrieved from Datatracker, yielding a bipartite graph of 557 authors across 434 drafts. We identified persistent co-author teams (``team blocs'') using a pairwise draft overlap threshold of $\geq$70\% with $\geq$3 shared drafts. Cross-organizational collaboration was measured by counting shared drafts between organizations.
 
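The team-bloc rule (pairwise draft overlap of at least 70% with at least 3 shared drafts) might be implemented as below. Note one assumption: the text does not say which denominator the overlap ratio uses, so this sketch divides by the smaller author's draft count.

```python
def team_bloc_pairs(author_drafts, overlap=0.70, min_shared=3):
    """Author pairs whose draft sets overlap enough to form a 'team bloc'.

    author_drafts: dict mapping author -> set of draft names.
    Overlap is shared drafts over the smaller author's total; the exact
    denominator used in the paper is an assumption of this sketch.
    """
    authors = sorted(author_drafts)
    pairs = []
    for i, a in enumerate(authors):
        for b in authors[i + 1:]:
            shared = author_drafts[a] & author_drafts[b]
            smaller = min(len(author_drafts[a]), len(author_drafts[b]))
            if len(shared) >= min_shared and len(shared) / smaller >= overlap:
                pairs.append((a, b, len(shared)))
    return pairs
```

Connected components over these qualifying pairs would then give the blocs themselves.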
 \subsection{Reproducibility and Cost}
 
-The entire analysis consumed 472,900 API tokens (329,629 input + 143,271 output). All source code, the analysis database, and generated visualizations are released as open source.\footnote{Repository URL: [TODO]}
+The entire analysis pipeline is implemented as a Python CLI tool (\texttt{ietf}) using Click, with all results stored in a SQLite database. LLM responses are cached to ensure reproducibility. The total API cost was approximately \$3.16 for initial analysis (330K input + 144K output tokens, Sonnet 4). All source code, the analysis database, and generated reports are released as open source.\footnote{Repository: \url{https://github.com/[redacted]/ietf-draft-analyzer}}
 
 % ── 4. Dataset ──────────────────────────────────────────────────────────
 
@@ -174,15 +191,37 @@ The entire analysis consumed 472,900 API tokens (329,629 input + 143,271 output)
 \toprule
 \textbf{Metric} & \textbf{Value} \\
 \midrule
-Internet-Drafts analyzed & 260 \\
-Unique authors & 403 \\
-Author--draft relationships & 742 \\
-Technical ideas extracted & 1,262 \\
-Distinct categories & 19 \\
-Time span & Jun 2025 -- Feb 2026 \\
+Internet-Drafts analyzed & 434 \\
+Unique authors & 557 \\
+Organizations represented & 230 \\
+Technical ideas extracted & 1,907 \\
+Standardization gaps identified & 11 \\
+Drafts with ratings & 434 \\
+Overlap clusters ($\geq$0.85 threshold) & 42 \\
+Near-duplicate pairs ($\geq$0.90 threshold) & 34 \\
+Time span & 2024 -- Mar 2026 \\
 Embedding dimension & 768 (nomic-embed-text) \\
-Pairwise similarity pairs & 33,670 \\
-Total API tokens used & 472,900 \\
+Pairwise similarity pairs & 93,961 \\
 \bottomrule
 \end{tabular}
 \end{table}
 
+The corpus spans drafts submitted from early 2024 through March 2026, with the overwhelming majority (425 of 434) submitted after June 2025. Table~\ref{tab:growth} shows the acceleration in AI/agent-related submissions relative to total IETF activity.
+
+\begin{table}[h]
+\centering
+\caption{Growth of AI/agent Internet-Drafts relative to total IETF submissions.}
+\label{tab:growth}
+\begin{tabular}{rrrr}
+\toprule
+\textbf{Year} & \textbf{Total IETF Drafts} & \textbf{AI/Agent Drafts} & \textbf{AI Share} \\
+\midrule
+2021 & 1,108 & $\sim$0 & $\sim$0\% \\
+2022 & 1,121 & $\sim$0 & $\sim$0\% \\
+2023 & 1,241 & $\sim$0 & $\sim$0\% \\
+2024 & 1,651 & 9 & 0.5\% \\
+2025 & 2,696 & 190 & 7.0\% \\
+2026 (Q1) & 1,748 & 162 & 9.3\% \\
+\bottomrule
+\end{tabular}
+\end{table}
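The AI-share column in the growth table above is simple arithmetic on the two count columns; a tiny sketch reproduces the published percentages from the raw counts:

```python
def ai_share(total_drafts, ai_drafts):
    """AI/agent share of all IETF submissions, as a percentage
    rounded to one decimal place (matching the table)."""
    return round(100 * ai_drafts / total_drafts, 1)
```

For example, the 2024 row is 9 of 1,651 drafts, which rounds to 0.5%.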
@@ -191,114 +230,93 @@ Total API tokens used & 472,900 \\
 
 \section{Findings}
 
-\subsection{Temporal Dynamics: A Rapid Acceleration}
+\subsection{Category Distribution: The Safety Deficit}
 
-Figure~\ref{fig:timeline} shows monthly submission volume. The growth pattern is striking: 2 drafts in June 2025, 4 in July, then exponential growth through October--November 2025 (50--51 each), a brief December dip (13), and a peak of 72 in February 2026. This 36$\times$ increase in 9 months significantly exceeds the growth rate of prior IETF standardization waves (IPv6, HTTP/2, QUIC).
-
-\begin{figure}[H]
-\centering
-\includegraphics[width=\textwidth]{timeline-placeholder.pdf}
-\caption{Monthly IETF AI/agent draft submissions by category (June 2025 -- February 2026). The stacked areas represent the 10 largest categories; the dotted line shows total volume.}
-\label{fig:timeline}
-\end{figure}
-
-\subsection{Category Distribution}
-
-We identified 19 semantic categories through LLM-assisted classification. Table~\ref{tab:categories} shows the top 10 by draft count.
+Our LLM-assisted classification assigned each draft to one or more of ten semantic categories (drafts may belong to multiple categories). Table~\ref{tab:categories} shows the distribution.
 
 \begin{table}[h]
 \centering
-\caption{Top 10 categories by draft count (multi-assignment: drafts may appear in multiple categories).}
+\caption{Draft distribution across categories. Percentages exceed 100\% due to multi-assignment.}
 \label{tab:categories}
-\begin{tabular}{lrcc}
+\begin{tabular}{lrr}
 \toprule
-\textbf{Category} & \textbf{Drafts} & \textbf{Avg Score} & \textbf{Avg Novelty} \\
+\textbf{Category} & \textbf{Drafts} & \textbf{Share} \\
 \midrule
-Data formats / interop & 102 & 3.3 & 3.2 \\
-Agent identity / auth & 98 & 3.4 & 3.5 \\
-A2A protocols & 92 & 3.4 & 3.5 \\
-Policy / governance & 60 & 3.3 & 3.2 \\
-Autonomous netops & 60 & 3.3 & 3.1 \\
-Agent discovery / reg & 57 & 3.5 & 3.5 \\
-AI safety / alignment & 36 & 3.4 & 3.4 \\
-ML traffic mgmt & 23 & 3.3 & 3.2 \\
-Human-agent interaction & 22 & 3.3 & 3.3 \\
-Other AI/agent & 21 & 3.4 & 3.4 \\
+Data formats / interoperability & 145 & 33\% \\
+A2A protocols & 120 & 28\% \\
+Agent identity / authentication & 108 & 25\% \\
+Autonomous network operations & 93 & 21\% \\
+Policy / governance & 91 & 21\% \\
+ML traffic management & 73 & 17\% \\
+Agent discovery / registration & 65 & 15\% \\
+AI safety / alignment & 44 & 10\% \\
+Model serving / inference & 42 & 10\% \\
+Human-agent interaction & 30 & 7\% \\
 \bottomrule
 \end{tabular}
 \end{table}
 
-\noindent A notable imbalance emerges: protocol-focused categories (data formats, identity, A2A) collectively account for over 290 category assignments, while AI safety/alignment---arguably the most consequential area---has only 36. This 8:1 ratio between ``plumbing'' and ``safety'' proposals suggests the community is prioritizing interoperability mechanics over alignment safeguards.
+The most striking finding is the \textbf{safety deficit}. Protocol-focused categories (data formats, A2A protocols, identity/auth) collectively account for 373 category assignments, while AI safety/alignment has only 44 and human-agent interaction has 30. This yields a \textbf{4:1 ratio of capability-building to safety proposals}. For every draft about keeping agents safe, approximately four are building new capabilities. For every draft about human-agent interaction, there are more than four about agents operating autonomously.
+
+The safety drafts that \emph{do} exist are often among the highest-rated. \texttt{draft-aylward-daap-v2} (a comprehensive accountability protocol) and \texttt{draft-cowles-volt} (a tamper-evident execution trace format) each scored 4.8/5.0---the highest in the entire corpus. The quality is there; the quantity is not.
 
 \subsection{Rating Distributions}
 
-Across all 260 drafts, the composite score distribution is approximately normal ($\mu = 3.38$, $\sigma = 0.59$, range $[1.65, 4.80]$). Figure~\ref{fig:distributions} breaks this down by dimension:
+Across all 434 rated drafts, Table~\ref{tab:ratings} summarizes the five rating dimensions.
 
-\begin{figure}[H]
-\centering
-\includegraphics[width=\textwidth]{score-distributions.png}
-\caption{Rating distributions by dimension across the 8 largest categories. Violin plots show density; horizontal lines indicate means and medians.}
-\label{fig:distributions}
-\end{figure}
+\begin{table}[h]
+\centering
+\caption{Average scores across five rating dimensions ($n = 434$, scale 1--5).}
+\label{tab:ratings}
+\begin{tabular}{lcc}
+\toprule
+\textbf{Dimension} & \textbf{Mean} & \textbf{Interpretation} \\
+\midrule
+Relevance & 3.81 & High: keyword selection captured genuinely AI-relevant drafts \\
+Novelty & 3.27 & Moderate: mix of innovative and derivative proposals \\
+Momentum & 3.02 & Moderate: many early-stage drafts without WG adoption \\
+Maturity & 2.99 & Low--moderate: most proposals are early-stage \\
+Overlap & 2.59 & Moderate: substantial redundancy in the corpus \\
+\bottomrule
+\end{tabular}
+\end{table}
 
 \noindent Key observations:
 \begin{itemize}[nosep]
-\item \textbf{Relevance} is consistently high ($\mu = 3.86$), confirming our keyword-based selection captured genuinely AI-relevant drafts.
-\item \textbf{Maturity} is the lowest-scoring dimension ($\mu = 2.98$), reflecting the early stage of most proposals.
-\item \textbf{Novelty} varies widely ($\sigma = 0.83$), with clear separation between innovative and derivative drafts.
-\item \textbf{Overlap} ($\mu = 2.52$) indicates moderate-to-low self-assessed redundancy, though embedding analysis (Section~\ref{sec:overlap}) reveals higher actual overlap.
+\item \textbf{Relevance} is consistently high ($\mu = 3.81$), confirming that the keyword-based selection captured genuinely AI-relevant drafts rather than false positives.
+\item \textbf{Maturity} is the lowest-scoring dimension ($\mu = 2.99$), reflecting the early stage of most proposals---many lack complete protocol specifications, security considerations, or reference implementations.
+\item \textbf{Overlap} ($\mu = 2.59$) indicates moderate self-assessed redundancy. However, the embedding-based similarity analysis (Section~\ref{sec:overlap}) reveals that actual topical overlap is significantly higher than LLM-assessed overlap, suggesting that many drafts do not adequately acknowledge related work.
 \end{itemize}
 
\subsection{Semantic Overlap and Redundancy}
|
||||
\label{sec:overlap}
|
||||
|
||||
The pairwise cosine similarity analysis reveals substantial redundancy in the corpus. Of 33,670 pairs:
\begin{itemize}[nosep]
\item 56 pairs (0.2\%) exceed 0.90 similarity (near-duplicates)
\item 344 pairs (1.0\%) exceed 0.85 (highly similar)
\item 2,668 pairs (7.9\%) exceed 0.80 (significantly overlapping)
\end{itemize}

At a 0.85 similarity threshold, we identify \textbf{42 overlap clusters}---groups of drafts addressing essentially the same technical problem. At a 0.90 threshold, \textbf{34 clusters} remain, representing near-duplicates or same-author variants.

\noindent The mean pairwise similarity of 0.721 ($\sigma = 0.056$) indicates a generally cohesive corpus in which most drafts address related concerns. Figure~\ref{fig:heatmap} shows the clustered similarity matrix, revealing several distinct clusters of near-identical proposals. Table~\ref{tab:clusters} shows the three largest competing clusters.

\begin{figure}[H]
\centering
\includegraphics[width=0.85\textwidth]{similarity-heatmap.png}
\caption{Hierarchically clustered pairwise similarity matrix (260 $\times$ 260). Color bars on the left indicate primary category. Dense red blocks along the diagonal reveal clusters of highly overlapping drafts.}
\label{fig:heatmap}
\end{figure}

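For concreteness, the thresholded overlap clustering described above can be sketched as a union-find pass over pairwise cosine similarities. This is an illustrative minimal version, not the pipeline's actual code; the function names and toy vectors are ours.

```python
from itertools import combinations
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two raw (not pre-normalized) vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def overlap_clusters(embeddings, threshold=0.85):
    """Group drafts whose pairwise cosine similarity exceeds `threshold`,
    using union-find over the implied similarity graph. Returns only
    multi-draft groups (the ``overlap clusters'' of the analysis)."""
    parent = list(range(len(embeddings)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(embeddings)), 2):
        if cosine(embeddings[i], embeddings[j]) > threshold:
            parent[find(i)] = find(j)  # merge the two drafts' groups

    groups = {}
    for i in range(len(embeddings)):
        groups.setdefault(find(i), []).append(i)
    return [g for g in groups.values() if len(g) > 1]
```

Raising the threshold from 0.85 to 0.90 shrinks the clusters toward the near-duplicate pairs, matching the 42-versus-34 cluster counts reported above.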
\begin{table}[h]
\centering
\caption{Three largest overlap clusters by draft count.}
\label{tab:clusters}
\begin{tabularx}{\textwidth}{clX}
\toprule
\textbf{Drafts} & \textbf{Cluster Topic} & \textbf{Description} \\
\midrule
13 & OAuth for AI Agents & All solving agent authentication/authorization via OAuth 2.0 extensions. Approaches range from Agentic JWTs to scope aggregation to accountability protocols. \\
10 & Agent Gateway / Multi-Agent Collaboration & Addressing cross-platform agent collaboration through gateway architectures, with competing semantic routing, task protocol, and infrastructure designs. \\
6 & Agent Discovery & DNS-based, URI-based, and custom protocol approaches to finding and invoking AI agents. \\
\bottomrule
\end{tabularx}
\end{table}

\noindent We also identified 25 near-duplicate draft pairs ($>$0.98 cosine similarity)---functionally identical proposals submitted under different names, in different working groups, or as renamed versions. Notable examples include \texttt{draft-rosenberg-aiproto} and \texttt{draft-rosenberg-aiproto-nact} (same N-ACT protocol, renamed), and \texttt{draft-abbey-scim-agent-extension} and \texttt{draft-scim-agent-extension} (same SCIM extension, different submission path).

Figure~\ref{fig:quality} maps each draft's composite score against its maximum similarity to any other draft, creating a quality--uniqueness quadrant view. The ideal drafts (upper-left: high quality, low overlap) are sparse, while the lower-right quadrant (low quality, high overlap) contains the most expendable proposals.

\begin{figure}[H]
\centering
\includegraphics[width=0.9\textwidth]{quality-placeholder.pdf}
\caption{Draft quality (composite score) vs.\ uniqueness (max pairwise similarity). Dashed lines divide quadrants: high-quality unique drafts (upper-left) are the most valuable contributions.}
\label{fig:quality}
\end{figure}

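The quadrant assignment itself reduces to two cut-offs. The sketch below is illustrative only---the dividing lines shown as dashed lines in the figure are not necessarily the values used here.

```python
def quality_quadrant(composite, max_sim, score_cut=3.5, sim_cut=0.80):
    """Classify a draft in the quality--uniqueness quadrant view.

    composite: weighted 1-5 rating score; max_sim: highest cosine
    similarity to any other draft. The cut-off values are assumptions
    for illustration, not the paper's exact dividing lines."""
    high_quality = composite >= score_cut
    unique = max_sim < sim_cut
    if high_quality and unique:
        return "high-quality unique"       # upper-left: most valuable
    if high_quality:
        return "high-quality overlapping"
    if unique:
        return "low-quality unique"
    return "low-quality overlapping"       # lower-right: most expendable
```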
\subsection{Category Profiles}

Figure~\ref{fig:radar} compares the rating profiles of the eight largest categories using radar charts. Distinct profiles emerge:
\begin{itemize}[nosep]
\item \textbf{Agent identity/auth}: High novelty and relevance, moderate maturity---an active innovation frontier.
\item \textbf{Data formats/interop}: High maturity but lower novelty---many proposals build on well-understood patterns.
\item \textbf{AI safety/alignment}: High relevance but lower maturity---critical problems without mature solutions.
\item \textbf{Autonomous netops}: Balanced profile, reflecting established network management practices adapted for AI.
\end{itemize}

\begin{figure}[H]
\centering
\includegraphics[width=0.7\textwidth]{radar-placeholder.pdf}
\caption{Average rating profiles per category (top 8). Each axis represents a rating dimension (1--5 scale); ``Low Overlap'' inverts the overlap score so that outward = better.}
\label{fig:radar}
\end{figure}

This fragmentation has practical consequences. The most common recurring technical idea---``Multi-Agent Communication Protocol''---appears independently in 8 separate drafts from different teams. Yet of the 1,907 technical ideas extracted from the corpus, \textbf{96\% appear in exactly one draft}. Everyone is solving the same problems; nobody is solving them together.
\subsection{Technical Ideas Landscape}

The 1,907 extracted ideas distribute across six primary types (Table~\ref{tab:ideas}).

\begin{table}[h]
\centering
\caption{Distribution of the 1,907 extracted technical ideas by type.}
\label{tab:ideas}
\begin{tabular}{lrr}
\toprule
\textbf{Idea Type} & \textbf{Count} & \textbf{\%} \\
\midrule
Mechanism & 694 & 36.4 \\
Architecture & 301 & 15.8 \\
Pattern & 273 & 14.3 \\
Protocol & 237 & 12.4 \\
Extension & 201 & 10.5 \\
Requirement & 182 & 9.5 \\
Other & 19 & 1.0 \\
\midrule
\textbf{Total} & \textbf{1,907} & \textbf{100.0} \\
\bottomrule
\end{tabular}
\end{table}


\noindent \textit{Mechanisms} (concrete technical constructs like ``Pseudonymous Key Generation'' or ``Context-Aware Task Scheduling'') dominate at 36.4\%, followed by \textit{architectures} (system-level designs) and \textit{patterns} (reusable design approaches). The most frequently recurring convergent ideas---those appearing independently in 3+ drafts---include:

\begin{itemize}[nosep]
\item Multi-Agent Communication Protocol (8 drafts)
\item Agentic Network Architecture (7 drafts)
\item Cross-Domain Agent Coordination (6 drafts)
\item Agent-to-Agent Communication Paradigm (5 drafts)
\item Action-Based Authorization (5 drafts)
\item Agent Registration Process (5 drafts)
\end{itemize}

\noindent These convergent ideas represent areas of implicit community consensus---problems that multiple independent teams consider important enough to address. They are strong candidates for working group formation.

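A simplified version of the convergent-idea detection can be sketched as counting distinct drafts per normalized idea title. The actual pipeline uses fuzzy matching; this sketch substitutes simple normalization, and the data is illustrative.

```python
from collections import defaultdict

def convergent_ideas(ideas, min_drafts=3):
    """ideas: (draft_name, idea_title) pairs. Returns {normalized idea:
    draft count} for ideas appearing in >= min_drafts distinct drafts.
    Title normalization stands in for the fuzzier matching used in the
    real pipeline."""
    def norm(title):
        # lowercase, treat hyphens as spaces, collapse whitespace
        return " ".join(title.lower().replace("-", " ").split())

    drafts_per_idea = defaultdict(set)
    for draft, title in ideas:
        drafts_per_idea[norm(title)].add(draft)

    return {idea: len(drafts) for idea, drafts in drafts_per_idea.items()
            if len(drafts) >= min_drafts}
```

Counting distinct drafts (a set, not a list) matters: the same idea extracted twice from one draft must not inflate its convergence count.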
\subsection{Author and Organizational Dynamics}
\label{sec:authors}

\subsubsection{Organizational Concentration}

The authorship landscape shows significant organizational concentration. Table~\ref{tab:orgs} lists the top ten contributing organizations.

\begin{table}[h]
\centering
\caption{Top ten contributing organizations by draft count.}
\label{tab:orgs}
\begin{tabular}{lrr}
\toprule
\textbf{Organization} & \textbf{Authors} & \textbf{Drafts} \\
\midrule
Huawei & 53 & 66 \\
China Mobile & 24 & 35 \\
Cisco & 24 & 26 \\
Independent & 19 & 25 \\
China Telecom & 24 & 24 \\
China Unicom & 22 & 21 \\
Tsinghua University & 13 & 16 \\
ZTE Corporation & 12 & 12 \\
Five9 & 1 & 10 \\
Ericsson & 4 & 9 \\
\bottomrule
\end{tabular}
\end{table}


\noindent Huawei dominates with 53 authors contributing to 66 drafts---\textbf{18\% of the entire corpus} from a single company. Chinese technology organizations collectively (Huawei, China Mobile, China Telecom, China Unicom, ZTE, Tsinghua) contribute approximately 40\% of all drafts. Western participation is led by Cisco (26 drafts) and independent contributors (25 drafts), with notable concentrated contributions from Five9 (10 drafts from a single prolific author, Jonathan Rosenberg) and Ericsson (9 drafts from 4 authors).

\subsubsection{Team Blocs}

We identified 18 persistent co-author teams (``team blocs'') with $\geq$70\% pairwise draft overlap and $\geq$3 shared drafts. The largest is a 12-member Huawei team responsible for 23 drafts with 96\% internal cohesion---meaning team members almost always co-author together. Other notable blocs include a 5-member Cisco/Five9 team (13 drafts, 100\% cohesion) and a 5-member Ericsson team (6 drafts, 100\% cohesion). Figure~\ref{fig:network} visualizes the full co-authorship network.

\begin{figure}[H]
\centering
\includegraphics[width=0.9\textwidth]{network-placeholder.pdf}
\caption{Author collaboration network. Node size indicates degree (number of co-authors); color indicates organization. Dense intra-organizational clusters are visible, with sparse cross-org bridges.}
\label{fig:network}
\end{figure}

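The pairwise team-bloc test can be made concrete as follows. The overlap definition (shared drafts as a fraction of each author's output) is our reading of the criterion, so treat this as an illustrative sketch rather than the pipeline's exact rule.

```python
def is_team_bloc_pair(drafts_a, drafts_b, min_shared=3, min_overlap=0.70):
    """Pairwise team-bloc test: two authors qualify when they share at
    least `min_shared` drafts AND the shared set covers >= `min_overlap`
    of each author's total output. drafts_a/drafts_b are sets of draft
    names."""
    shared = drafts_a & drafts_b
    if len(shared) < min_shared:
        return False
    return (len(shared) / len(drafts_a) >= min_overlap and
            len(shared) / len(drafts_b) >= min_overlap)
```

Blocs then emerge as connected components over qualifying pairs; a bloc's cohesion is the fraction of member pairs that co-author every bloc draft.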
\subsubsection{Cross-Organizational Collaboration}

Cross-organizational collaboration exists but is weaker than intra-organizational ties. The strongest cross-org links are between Chinese organizations: China Telecom--Huawei (8 shared drafts), China Unicom--Huawei (7), and China Mobile--ZTE (7). Western cross-org collaboration is led by Cisco--Google (5 shared drafts) and Bitwave--Five9 (6). Notably, cross-regional collaboration (Chinese--Western) is minimal in the dataset.

\subsection{Top-Ranked Proposals}

Table~\ref{tab:top} lists the five highest-scored drafts, representing the strongest proposals in the corpus.

\begin{table}[h]
\centering
\caption{Five highest-scored drafts (N = novelty, M = maturity, O = overlap, Mom = momentum, R = relevance).}
\label{tab:top}
\begin{tabularx}{\textwidth}{cllX}
\toprule
\textbf{Score} & \textbf{N/M/O/Mom/R} & \textbf{Draft} & \textbf{Summary} \\
\midrule
4.80 & 5/5/1/4/5 & draft-cowles-volt & Tamper-evident execution trace format for AI agent workflows using hash chains and cryptographic signatures \\
4.80 & 5/4/1/5/5 & draft-aylward-daap-v2 & Comprehensive protocol for AI agent accountability including authentication, monitoring, and audit \\
4.60 & 5/4/2/4/5 & draft-guy-bary-stamp & STAMP protocol for cryptographic delegation and proof in AI agent systems \\
4.60 & 5/5/2/3/5 & draft-drake-email-tpm & Hardware attestation for email using TPM verification chains \\
4.50 & 5/4/2/4/5 & draft-goswami-agentic-jwt & Extends OAuth 2.0 with Agentic JWT for autonomous agent authorization \\
\bottomrule
\end{tabularx}
\end{table}

\noindent It is notable that 3 of the top 5 drafts are safety/accountability-focused, suggesting that while the community underinvests in safety proposals, the ones that do exist tend to be of high quality.

% ── 6. Gap Analysis ─────────────────────────────────────────────────────

\section{Gap Analysis}

Our systematic gap analysis identified 11 areas where standardization work is missing or inadequate. Table~\ref{tab:gaps} summarizes these gaps by severity.

\begin{table}[h]
\centering
\caption{Identified standardization gaps by severity, with the number of existing technical ideas partially addressing each gap.}
\label{tab:gaps}
\begin{tabularx}{\textwidth}{clXr}
\toprule
\textbf{Sev.} & \textbf{Gap} & \textbf{Description} & \textbf{Ideas} \\
\midrule
CRIT & Behavior Verification & No mechanism to verify agents behave per declared policies at runtime & 53 \\
CRIT & Human Override Protocols & No standard for emergency stop, takeover, or constraint of running agents & 7 \\
\midrule
HIGH & Resource Exhaustion & No agent-specific resource quotas or enforcement mechanisms & 40 \\
HIGH & Data Provenance & Insufficient tracking of agent-generated data lineage & 4 \\
HIGH & Capability Degradation & No graceful degradation protocols for model drift or corruption & 45 \\
HIGH & Coordination Deadlocks & No deadlock detection/resolution for multi-agent circular dependencies & 11 \\
HIGH & Privacy Preservation & Lack of differential privacy or secure MPC for agent interactions & 11 \\
\midrule
MED & Cross-Protocol Migration & No state/context migration between different A2A protocols & 3 \\
MED & Real-time Debugging & No standard interfaces for production agent introspection & 23 \\
MED & Model Update Security & Missing cryptographically verified, rollback-capable agent updates & 79 \\
MED & Energy Optimization & No energy-aware agent deployment or energy budget enforcement & 17 \\
\bottomrule
\end{tabularx}
\end{table}


\subsection{Critical Gap: Agent Behavior Verification}

While 108 drafts address agent identity and authentication---establishing \emph{who} an agent is---only 44 address AI safety/alignment, and none provides a real-time mechanism to verify that an agent is behaving according to its declared capabilities and policies \emph{while it is operating}. The gap is between policy declaration and policy enforcement: the difference between a speed limit sign and a speed camera.

Some drafts approach the problem from adjacent angles. \texttt{draft-aylward-daap-v2} (score 4.8) defines a behavioral monitoring framework with cryptographic identity verification. \texttt{draft-birkholz-verifiable-agent-conversations} (score 4.5) proposes verifiable conversation records using COSE signing. \texttt{draft-berlinai-vera} (score 3.9) introduces a zero-trust architecture with five enforcement pillars. But all focus on \emph{recording} behavior for post-hoc audit rather than \emph{detecting deviation in real time}.

\subsection{Critical Gap: Human Override Protocols}

Only 30 of 434 drafts address human-agent interaction, compared to 120 A2A protocol drafts and 93 autonomous-operations drafts. Agents are being designed to talk to each other at a 4:1 ratio over being designed to talk to humans. The CHEQ protocol (\texttt{draft-rosenberg-aiproto-cheq}, score 3.9) is a rare exception---it defines human confirmation \emph{before} agent execution. But CHEQ is opt-in and pre-execution. No draft standardizes what happens \emph{during} execution: how a human pauses a running workflow, constrains an agent's scope, takes over a task, or issues an emergency stop.

\subsection{The Zero-Coverage Gap: Cross-Protocol Translation}

With 120 competing A2A protocols and no translation layer, agents speaking different protocols cannot interoperate. This is the starkest absence in the corpus: essentially zero technical ideas address how agents using MCP, the A2A Protocol, SLIM, and other competing frameworks could communicate through a translation layer. If the IETF does not build this, the market will---and the result will be vendor-locked ecosystems rather than open interoperability.

% ── 7. Discussion ────────────────────────────────────────────────────────

\section{Discussion}

\subsection{The Capability--Safety Asymmetry}

The 4:1 ratio of capability-building to safety proposals is the most consequential finding of this analysis. It mirrors a broader pattern observed across AI development: capabilities consistently outpace governance~\citep{amodei2016}. In the IETF context, this asymmetry has structural causes. Safety proposals must address harder, cross-cutting problems (behavior verification spans all protocol categories), while capability proposals can focus narrowly on a single well-defined problem (e.g., extending OAuth with an agent-specific claim). Additionally, the organizations contributing drafts are primarily technology vendors with incentives to ship interoperable products, not safety researchers.

The quality signal offers a counterpoint: the highest-scored drafts in the corpus (\texttt{draft-cowles-volt} and \texttt{draft-aylward-daap-v2}, both 4.8/5.0) are safety-focused. The IETF community clearly values safety work when it appears. The deficit is one of \emph{volume}, not \emph{receptivity}. Targeted calls for safety-focused submissions, similar to IETF BOF sessions on specific topics, could help rebalance the ratio.

\subsection{The Redundancy Problem}

With 42 overlap clusters and 120 competing A2A protocol proposals, the IETF AI/agent space shows significant coordination failure. The OAuth-for-agents cluster alone contains 13 independent proposals, none compatible with the others. This fragmentation wastes engineering effort, confuses implementers, and risks incompatible deployments that entrench rather than resolve the problem.

We observe that redundancy is partly a natural consequence of the IETF's open submission process---anyone can submit a draft---and partly a reflection of ``gold rush'' dynamics in which organizations race to establish their preferred approach as the standard. The embedding-based similarity tools developed here could help IETF area directors flag duplicates during triage and actively encourage consolidation.

\subsection{Geopolitical Dimensions}

The concentration of contributions---approximately 40\% from Chinese organizations, led by Huawei's 18\%---raises questions about geographic diversity in AI standardization. Our collaboration network analysis reveals two largely separate clusters: Chinese organizations collaborate heavily with each other (China Telecom--Huawei: 8 shared drafts; China Unicom--Huawei: 7; China Mobile--ZTE: 7), while Western organizations form a smaller, separate cluster (Cisco--Google: 5; Bitwave--Five9: 6). Cross-regional bridges are sparse, suggesting that many Western AI companies may be focusing their standardization efforts elsewhere (e.g., OASIS, W3C, or proprietary protocols).

This bifurcation extends to the technical foundations. The Chinese bloc tends to build on YANG/NETCONF for network management, while Western proposals favor COSE/CBOR/CoAP for IoT security and OAuth/JWT for identity. The only shared foundation is OAuth 2.0. Any architectural unification must be genuinely protocol-agnostic to bridge this divide.

\subsection{Methodological Contributions}

The LLM-assisted analysis pipeline itself represents a methodological contribution. Using Claude to systematically rate, categorize, and extract ideas from 434 technical documents would be infeasible manually, yet it achieves results that are internally consistent and reproducible (via caching). Several design choices merit discussion:

\begin{itemize}[nosep]
\item \textbf{LLM rating validity}: Claude rates based on abstracts and partial full text, which may not capture implementation depth. We mitigate this by using five orthogonal dimensions that capture different quality facets, and by validating that alternative weighting schemes produce highly correlated rankings (Appendix~\ref{app:sensitivity}, Spearman $\rho \geq 0.93$).
\item \textbf{Embedding similarity}: Cosine similarity between nomic-embed-text embeddings captures topical similarity but not functional equivalence. Two drafts may address the same problem with different approaches (low similarity, high functional overlap). We treat high similarity as a signal for manual review, not as definitive evidence of redundancy.
\item \textbf{Cost efficiency}: The entire analysis cost approximately \$3.16 in API fees---orders of magnitude cheaper than equivalent expert analysis, enabling continuous monitoring as new drafts appear.
\end{itemize}
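The weighting-sensitivity check behind the Spearman figure can be sketched as follows. The weight vectors and ratings below are illustrative, not the paper's actual values; the real result comes from the appendix.

```python
def rank(values):
    """Rank positions (0 = smallest). Assumes no ties, which holds for
    the distinct composite scores in this toy example."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks

def spearman_rho(x, y):
    """Spearman rank correlation via the classic sum-of-d^2 formula."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

def composite(ratings, weights):
    """Weighted composite score per draft; each ratings row holds the
    five dimension scores (overlap pre-inverted so higher = better)."""
    return [sum(r * w for r, w in zip(row, weights)) for row in ratings]
```

Comparing the draft ranking under two weighting schemes then reduces to `spearman_rho(composite(R, w1), composite(R, w2))`; values near 1 indicate the ranking is insensitive to the choice of weights.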

\subsection{Toward an Architectural Vision}

Our analysis suggests that the 11 gaps are not random absences but structurally related. They point to four missing architectural pillars for the AI agent ecosystem:

\begin{enumerate}[nosep]
\item \textbf{DAG-based execution model}: Multi-agent workflows as directed acyclic graphs with checkpoints, rollback, and blast-radius containment---addressing the error recovery, resource management, and coordination gaps.

\item \textbf{Human-in-the-loop as first class}: Approval gates, override commands, escalation paths, and explainability tokens as native constructs in the execution model---addressing the human override and explainability gaps.

\item \textbf{Protocol-agnostic interoperability}: A translation layer letting agents that use different A2A protocols communicate through gateways---addressing the cross-protocol gap, for which no existing ideas were found.

\item \textbf{Assurance profiles}: Named configurations that dial the proof requirements up or down (from best-effort to per-task cryptographic attestation)---addressing the behavior verification, data provenance, and dynamic trust gaps.
\end{enumerate}

\noindent These pillars build on existing IETF work rather than competing with it: SPIFFE/WIMSE for identity, Execution Context Tokens for evidence, OAuth 2.0 for authorization, and the various A2A protocols for communication.

\subsection{Limitations}
\label{sec:limitations}

\begin{itemize}[nosep]
\item \textbf{Keyword bias}: Our twelve seed keywords may miss relevant drafts that use different terminology (e.g., ``cognitive computing'' or ``neural network'' in draft names).
\item \textbf{Single-LLM assessment}: Ratings from Claude may carry systematic biases. Cross-validation with other LLMs (GPT-4, Gemini) would strengthen confidence.
\item \textbf{Snapshot analysis}: The dataset reflects a point in time; drafts expire, evolve, and merge continuously.
\item \textbf{Author disambiguation}: Datatracker affiliations are self-reported and may be inconsistent (e.g., ``Huawei'' and ``Huawei Technologies'' appear as separate entities).
\item \textbf{No citation analysis}: We do not track inter-draft references, which would reveal influence networks beyond topical similarity.
\item \textbf{Abstract-level assessment}: Rating from abstracts may miss implementation depth present in full-text specifications.
\end{itemize}


% ── 8. Related Work ─────────────────────────────────────────────────────

\section{Related Work}

\textbf{Standards landscape analysis.} Baron and Spulber~\citep{baron2019} provide bibliometric analysis of standards organizations but focus on patents and firm-level strategy rather than technical content. Simmons and Thaler~\citep{simmons2019} study IETF participation diversity but do not assess draft content or topical overlap. Our work extends this line by applying NLP techniques to the document content itself.

\textbf{AI governance and safety.} Amodei et al.~\citep{amodei2016} articulate the challenge of aligning AI systems with human values, a concern our safety-deficit finding quantifies in the standards context. The EU AI Act~\citep{euaiact2024} and the NIST AI Risk Management Framework~\citep{nist2023} provide regulatory perspectives on AI governance, but neither addresses Internet protocol standardization specifically.

\textbf{LLM-assisted evaluation.} Zheng et al.~\citep{zheng2023} demonstrate that LLM judges can match human evaluation quality for text assessment. Our pipeline extends this approach from evaluating model outputs to evaluating standards documents, using structured prompts for multi-dimensional rating.

\textbf{Multi-agent systems.} The AAMAS community has long studied multi-agent coordination~\citep{wooldridge2009}. Our analysis reveals that the IETF is now addressing many of the same problems (coordination, trust, resource allocation), but from a protocol standardization perspective rather than an algorithmic one.

% ── 9. Future Work ──────────────────────────────────────────────────────

\section{Future Work}
\label{sec:future}

\begin{enumerate}[nosep]
\item \textbf{Human validation}: Compare LLM ratings against expert assessments for a stratified sample of 30--50 drafts to quantify LLM-judge accuracy in this domain.
\item \textbf{Longitudinal monitoring}: Deploy the pipeline for continuous analysis as new drafts appear, tracking the evolution of the safety ratio, overlap clusters, and gap coverage over time.
\item \textbf{Citation network}: Extract inter-draft references to build a citation graph, enabling influence analysis beyond topical similarity.
\item \textbf{Gap-driven standardization}: Use the identified gaps to propose new Internet-Drafts---we have already generated five experimental drafts addressing the architectural pillars described in Section~7.4.
\item \textbf{Cross-venue analysis}: Extend the methodology to W3C, OASIS, ISO/IEC JTC 1, and 3GPP AI standardization activities for a comprehensive view of the global AI standards landscape.
\item \textbf{Multi-LLM validation}: Cross-validate ratings using multiple LLM judges (Claude, GPT-4, Gemini) to assess systematic bias.
\end{enumerate}


% ── 10. Conclusion ───────────────────────────────────────────────────────

\section{Conclusion}

The IETF AI/agent standardization wave represents a unique moment in Internet governance: the community is attempting to standardize the infrastructure for autonomous agents concurrently with their deployment. Our analysis of 434 Internet-Drafts from 557 authors reveals a landscape characterized by both extraordinary energy and significant structural problems.

Three findings demand attention. First, the \textbf{4:1 safety deficit}: the community is building agent capabilities four times faster than safety mechanisms, despite the highest-quality proposals being safety-focused. Second, \textbf{extreme fragmentation}: 120 competing A2A protocol proposals, 13 independent OAuth-for-agents drafts, and 96\% of technical ideas appearing in only one draft indicate that coordination mechanisms are failing to keep pace with submission volume. Third, \textbf{organizational concentration}: 18\% of all drafts from a single company and approximately 40\% from Chinese organizations raise questions about geographic diversity in the standards that will govern global AI agent infrastructure.

The 1,907 technical ideas we extract represent a rich but disorganized design space. The 11 gaps we identify---from behavior verification to human override protocols to cross-protocol translation---highlight where the community's collective blind spots lie. The architectural vision we sketch, building on existing IETF primitives (WIMSE, ECT, OAuth), suggests a path from fragmentation toward coherence.

The methodology demonstrated here---combining LLM-assisted multi-dimensional rating with embedding-based similarity analysis---is itself a contribution. At \$3.16 in API costs, it provides a scalable, reproducible approach to standards landscape analysis that could be applied to any standards body facing a surge in submissions. As AI standardization accelerates globally, such tools become essential for maintaining coherence and directing limited community attention to the areas that matter most.

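The embedding half of this methodology reduces, at its core, to cosine similarity over draft embeddings. A minimal sketch with toy 4-dimensional vectors and hypothetical draft names (the actual pipeline embeds full abstracts with nomic-embed-text):

```python
# Sketch of the embedding-based overlap check: flag draft pairs whose
# embeddings exceed a cosine-similarity threshold. Vectors and names
# are toy stand-ins for real nomic-embed-text embeddings.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

drafts = {  # hypothetical draft names and toy embeddings
    "draft-a2a-core": [0.9, 0.1, 0.2, 0.0],
    "draft-a2a-alt":  [0.8, 0.2, 0.3, 0.1],
    "draft-ml-route": [0.1, 0.9, 0.1, 0.4],
}

THRESHOLD = 0.85  # illustrative cutoff, not the paper's tuned value
names = list(drafts)
overlaps = [
    (a, b, cosine(drafts[a], drafts[b]))
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if cosine(drafts[a], drafts[b]) >= THRESHOLD
]
for a, b, s in overlaps:
    print(f"overlap: {a} ~ {b} (cos = {s:.2f})")
```

The two A2A drafts exceed the threshold while the unrelated routing draft does not, which is exactly the signal used to surface redundant proposals for human review.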
The gold rush will not slow down. The question is whether the safety inspectors can catch up.

% ── Acknowledgments ──────────────────────────────────────────────────────

\section*{Acknowledgments}

Analysis was performed using Anthropic Claude (Sonnet 4) for rating, categorization, and idea extraction, and Ollama with nomic-embed-text for embedding generation. We thank the IETF community for maintaining the open Datatracker API that made this analysis possible.

% ── References ───────────────────────────────────────────────────────────

\bibliographystyle{plainnat}

\begin{thebibliography}{12}

\bibitem[RFC2026(1996)]{rfc2026}
S.~Bradner.
\newblock The Internet Standards Process---Revision 3.
\newblock RFC 2026, October 1996.

\bibitem[Simmons \& Thaler(2019)]{simmons2019}
J.~Simmons and D.~Thaler.
\newblock IETF Participation Trends and Diversity.
\newblock Presented at IETF 106, 2019.

\bibitem[Baron \& Spulber(2019)]{baron2019}
J.~Baron and D.~Spulber.
\newblock Technology Standards and Standard Setting Organizations: Introduction to the Searle Center Database.
\newblock \emph{Journal of Economics \& Management Strategy}, 27(3):462--503, 2019.

\bibitem[Brown et~al.(2020)]{brown2020}
T.~Brown, B.~Mann, N.~Ryder, et~al.
\newblock Language Models are Few-Shot Learners.
\newblock In \emph{Advances in Neural Information Processing Systems}, 2020.

\bibitem[Zheng et~al.(2023)]{zheng2023}
L.~Zheng, W.-L.~Chiang, Y.~Sheng, et~al.
\newblock Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.
\newblock In \emph{Advances in Neural Information Processing Systems}, 2023.

\bibitem[Amodei et~al.(2016)]{amodei2016}
D.~Amodei, C.~Olah, J.~Steinhardt, et~al.
\newblock Concrete Problems in AI Safety.
\newblock \emph{arXiv:1606.06565}, 2016.

\bibitem[Wooldridge(2009)]{wooldridge2009}
M.~Wooldridge.
\newblock \emph{An Introduction to MultiAgent Systems}.
\newblock John Wiley \& Sons, 2nd edition, 2009.

\bibitem[EU(2024)]{euaiact2024}
European Parliament and Council.
\newblock Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence (AI Act).
\newblock \emph{Official Journal of the European Union}, 2024.

\bibitem[NIST(2023)]{nist2023}
National Institute of Standards and Technology.
\newblock Artificial Intelligence Risk Management Framework (AI RMF 1.0).
\newblock NIST AI 100-1, January 2023.

\bibitem[Google(2025)]{a2a2025}
Google.
\newblock Agent-to-Agent (A2A) Protocol Specification.

\end{thebibliography}

\appendix

\section{Full Category List}
\label{app:categories}

\begin{table}[H]
\centering
\small
\begin{tabular}{lr}
\toprule
\textbf{Category} & \textbf{Draft Count} \\
\midrule
Data formats / interop & 102 \\
Agent identity / auth & 98 \\
A2A protocols & 92 \\
Policy / governance & 60 \\
Autonomous netops & 60 \\
Agent discovery / registration & 57 \\
AI safety / alignment & 36 \\
ML traffic management & 23 \\
Human-agent interaction & 22 \\
Other AI/agent & 21 \\
Agent-to-agent communication protocols & 16 \\
Agent discovery / registration (variant) & 14 \\
Model serving / inference & 13 \\
Identity / auth for AI agents (variant) & 13 \\
Autonomous network operations (variant) & 5 \\
Data formats / semantics (variant) & 3 \\
Policy / governance (variant) & 2 \\
AI safety / guardrails (variant) & 1 \\
ML-based traffic mgmt (variant) & 1 \\
\bottomrule
\end{tabular}
\caption{Complete list of 19 categories. Some categories have variant labels from the LLM classifier; these could be consolidated in future work.}
\label{tab:all-categories}
\end{table}
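The consolidation the caption suggests can be sketched as folding variant labels into canonical categories via an alias map and re-summing draft counts. The alias pairs below are illustrative guesses at the intended merges, and only a subset of the table's rows is shown:

```python
# Sketch of category-label consolidation: map the classifier's variant
# labels onto canonical categories, then merge the draft counts.
# The alias pairings are illustrative, not confirmed mappings.

ALIASES = {
    "Agent discovery / registration (variant)": "Agent discovery / registration",
    "Identity / auth for AI agents (variant)": "Agent identity / auth",
    "Autonomous network operations (variant)": "Autonomous netops",
    "AI safety / guardrails (variant)": "AI safety / alignment",
}

counts = {  # a subset of the rows in the table above
    "Agent discovery / registration": 57,
    "Agent discovery / registration (variant)": 14,
    "Agent identity / auth": 98,
    "Identity / auth for AI agents (variant)": 13,
}

merged = {}
for label, n in counts.items():
    canonical = ALIASES.get(label, label)
    merged[canonical] = merged.get(canonical, 0) + n

for label, n in sorted(merged.items(), key=lambda kv: -kv[1]):
    print(f"{label}: {n}")
```

Under these assumed merges the 19 labels would collapse to 14 canonical categories, with the variant counts absorbed into their parents.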

\section{Composite Score Formula Sensitivity}
\label{app:sensitivity}

Novelty-only & 0.50 & 0.20 & 0.10 & 0.10 & 0.10 & 0.93 \\
\label{tab:sensitivity}
\end{table}
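The sensitivity analysis amounts to rescoring drafts under alternative weight vectors and comparing the resulting rankings. A minimal sketch; the draft names and five per-dimension ratings are invented, and only the weighted-sum form mirrors the composite score:

```python
# Sketch of the weight-sensitivity check: recompute composite scores
# under a baseline and a novelty-heavy weighting and compare rankings.
# Ratings are illustrative; the weighted sum is the assumed score form.

def composite(ratings, weights):
    return sum(r * w for r, w in zip(ratings, weights))

drafts = {  # hypothetical per-dimension ratings (each 1-5)
    "draft-x": [5, 4, 3, 2, 2],
    "draft-y": [3, 5, 4, 3, 3],
    "draft-z": [2, 3, 5, 5, 4],
}

baseline = [0.20, 0.20, 0.20, 0.20, 0.20]
novelty_heavy = [0.50, 0.20, 0.10, 0.10, 0.10]  # weights from the row above

def ranking(weights):
    return sorted(drafts, key=lambda d: composite(drafts[d], weights), reverse=True)

print("baseline ranking:     ", ranking(baseline))
print("novelty-heavy ranking:", ranking(novelty_heavy))
```

In this toy case the novelty-heavy weighting inverts the baseline order, which is the kind of rank instability the sensitivity table is designed to surface.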

\section{Keyword Search Terms}
\label{app:keywords}

\begin{table}[H]
\centering
\begin{tabular}{ll}
\toprule
\textbf{Keyword} & \textbf{Rationale} \\
\midrule
\texttt{agent} & Core term for AI agent drafts \\
\texttt{ai-agent} & Specific AI agent proposals \\
\texttt{llm} & Large language model infrastructure \\
\texttt{autonomous} & Self-operating systems and agents \\
\texttt{machine-learning} & ML-related protocol work \\
\texttt{artificial-intelligence} & General AI drafts \\
\texttt{mcp} & Model Context Protocol ecosystem \\
\texttt{agentic} & Agentic AI paradigm \\
\texttt{inference} & AI inference infrastructure \\
\texttt{generative} & Generative AI protocols \\
\texttt{intelligent} & Intelligent networking/systems \\
\texttt{aipref} & AI preference signaling (AIPREF WG) \\
\bottomrule
\end{tabular}
\caption{Twelve seed keywords used for Datatracker API queries, with rationale for inclusion.}
\end{table}
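The collection step issues one Datatracker query per seed keyword. A sketch that only builds the query URLs; the \texttt{/api/v1/doc/document/} endpoint and \texttt{title\_\_icontains} filter are assumptions about the Datatracker REST API, not details taken from the paper:

```python
# Sketch of corpus collection: construct one Datatracker API query URL
# per seed keyword. Endpoint path and filter names are assumptions.

from urllib.parse import urlencode

BASE = "https://datatracker.ietf.org/api/v1/doc/document/"
KEYWORDS = [
    "agent", "ai-agent", "llm", "autonomous", "machine-learning",
    "artificial-intelligence", "mcp", "agentic", "inference",
    "generative", "intelligent", "aipref",
]

def query_url(keyword, limit=100):
    params = {"title__icontains": keyword, "format": "json", "limit": limit}
    return f"{BASE}?{urlencode(params)}"

for kw in KEYWORDS:
    print(query_url(kw))
```

A real collector would page through results, union the hits across keywords, and deduplicate by draft name before downstream rating.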

\section{Top Convergent Ideas}
\label{app:convergent}

\begin{table}[H]
\centering
\small
\begin{tabularx}{\textwidth}{Xrl}
\toprule
\textbf{Idea} & \textbf{Drafts} & \textbf{Primary Type} \\
\midrule
Multi-Agent Communication Protocol & 8 & protocol \\
Agentic Network Architecture & 7 & architecture \\
Cross-Domain Agent Coordination & 6 & mechanism \\
ELA Protocol (EDHOC Lightweight Auth) & 6 & protocol \\
Agent-to-Agent Communication Paradigm & 5 & protocol \\
Action-Based Authorization & 5 & mechanism \\
AI Agent Communication Network & 5 & architecture \\
Agent Registration Process & 5 & protocol \\
AI Gateway & 4 & architecture \\
MCP Session Establishment over MOQT & 4 & protocol \\
Network Equipment as MCP Servers & 4 & mechanism \\
Multi-Agent Interaction Model & 4 & pattern \\
Distributed AI Inference Architecture & 4 & architecture \\
\bottomrule
\end{tabularx}
\caption{Most frequently occurring convergent ideas (appearing in $\geq$4 drafts independently). These represent areas of implicit community consensus.}
\end{table}
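Convergence of this kind can be detected by clustering near-identical idea embeddings and counting how many distinct drafts each cluster spans. A minimal greedy sketch with toy embeddings and hypothetical draft names (the real pipeline uses nomic-embed-text vectors and a tuned threshold):

```python
# Sketch of convergent-idea detection: greedily cluster idea embeddings
# by cosine similarity, then count distinct source drafts per cluster.
# Embeddings, draft names, and the threshold are illustrative.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# (draft, idea embedding) pairs; three near-identical ideas, one outlier
ideas = [
    ("draft-1", [0.90, 0.10, 0.10]),
    ("draft-2", [0.88, 0.12, 0.08]),
    ("draft-3", [0.91, 0.09, 0.12]),
    ("draft-4", [0.10, 0.90, 0.20]),
]

clusters = []  # each cluster: (representative vector, set of draft names)
for draft, vec in ideas:
    for rep, members in clusters:
        if cosine(rep, vec) >= 0.95:  # illustrative similarity cutoff
            members.add(draft)
            break
    else:
        clusters.append((vec, {draft}))

convergent = [members for _, members in clusters if len(members) >= 2]
print("clusters:", len(clusters), "convergent:", [sorted(m) for m in convergent])
```

Counting distinct drafts rather than raw idea occurrences is what makes the table's "Drafts" column a measure of independent convergence instead of repetition within one document.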

\end{document}