\documentclass[11pt,a4paper]{article} \usepackage[utf8]{inputenc} \usepackage[T1]{fontenc} \usepackage{lmodern} \usepackage[margin=2.5cm]{geometry} \usepackage{amsmath,amssymb} \usepackage{graphicx} \usepackage{booktabs} \usepackage{tabularx} \usepackage{hyperref} \usepackage{xcolor} \usepackage{natbib} \usepackage{enumitem} \usepackage{float} \usepackage{caption} \hypersetup{ colorlinks=true, linkcolor=blue!60!black, citecolor=green!50!black, urlcolor=blue!70!black, } \setlength{\parskip}{0.4em} \setlength{\parindent}{0em} \title{% Mapping the AI-Agent Standardization Landscape:\\ An LLM-Assisted Analysis of IETF Internet-Drafts% } \author{ Christian Nennemann\\ Independent Researcher\\ \texttt{write@nennemann.de} } \date{April 2026} \begin{document} \maketitle \begin{abstract} The Internet Engineering Task Force (IETF) is experiencing an unprecedented surge in standardization activity around AI agents. Between January~2024 and March~2026, AI- and agent-related Internet-Drafts grew from 0.5\% to 9.3\% of all IETF submissions. We present a systematic, LLM-assisted analysis of this landscape, covering 475 drafts from 713 authors across more than 230 organizations. Our pipeline combines keyword-based corpus construction from the IETF Datatracker API, multi-dimensional quality rating via Claude (Anthropic) as an LLM-as-judge, semantic embedding and clustering via a local embedding model (nomic-embed-text), LLM-based extraction of 501 discrete technical ideas, and gap analysis against the assembled corpus. 
Key findings include: (1)~a persistent capability-to-safety deficit, with roughly four capability-building drafts for every safety-oriented one; (2)~extreme protocol fragmentation, including 14~competing OAuth-for-agents proposals and 157~agent-to-agent protocol drafts with no interoperability layer; (3)~high organizational concentration, with a single vendor contributing approximately 16\% of all drafts; (4)~132 convergent ideas independently proposed by two or more organizations, signaling latent consensus beneath the fragmentation; and (5)~11 identified standardization gaps, three rated critical, centered on agent legal liability, capability degradation detection, and emergency override protocols. The total analysis cost approximately \$6.50--9.00\,USD in API fees. We discuss implications for AI-agent standardization strategy, the limitations of LLM-as-judge methodologies applied to technical document corpora, and organizational dynamics shaping the standards landscape. \end{abstract} \textbf{Keywords:} IETF, Internet-Drafts, AI agents, standardization, LLM-as-judge, landscape analysis, multi-agent systems, protocol fragmentation % ========================================================================= \section{Introduction} \label{sec:intro} % ========================================================================= The deployment of autonomous AI agents---software systems that perceive their environment, make decisions, and take actions with limited human supervision---has accelerated dramatically since 2023. Commercial offerings from Anthropic, Google, OpenAI, and others have moved AI agents from research prototypes to production systems that browse the web, execute code, manage cloud infrastructure, and interact with external services on behalf of users. This proliferation raises fundamental questions about identity, authentication, delegation, safety, and interoperability that fall squarely within the purview of Internet standards bodies.
The IETF, responsible for the core protocols of the Internet, has responded with an extraordinary burst of activity. In 2024, just 9 AI- or agent-related Internet-Drafts were submitted---0.5\% of all submissions. By the first quarter of 2026, that figure reached 9.3\%: nearly one in ten new drafts addressed AI agents in some capacity. Monthly submissions surged from 5 in June~2025 to 85 in February~2026, a growth rate without precedent in the IETF's recent history. This rapid expansion creates an analytical challenge. The volume of drafts, the diversity of working groups involved, the overlapping scope of competing proposals, and the speed of new submissions make manual tracking infeasible. A standards participant seeking to understand the landscape---which problems are being addressed, which are being neglected, where proposals converge and where they conflict---faces a corpus of hundreds of technical documents evolving on a weekly basis. We address this challenge with an LLM-assisted analysis pipeline that automates the collection, rating, clustering, idea extraction, and gap identification for the full corpus of AI-agent-related IETF Internet-Drafts. The pipeline combines three complementary analytical approaches: (1)~LLM-as-judge rating of drafts on five quality dimensions, using Claude (Anthropic) with structured prompts; (2)~embedding-based semantic similarity and clustering, using a locally hosted nomic-embed-text model via Ollama; and (3)~LLM-based extraction of discrete technical ideas and identification of landscape gaps. Our contributions are: \begin{itemize}[nosep] \item A comprehensive, quantitative map of the IETF's AI-agent standardization landscape as of March~2026, covering 475 drafts, 713 authors, 501 extracted technical ideas, and 11 identified gaps. \item A replicable, cost-effective methodology for LLM-assisted standards corpus analysis (\$6.50--9.00 total), with explicit documentation of limitations and methodological caveats.
\item Empirical findings on organizational concentration, protocol fragmentation, cross-organization convergence, and the capability-to-safety imbalance in the current landscape. \item An open-source tool (the IETF Draft Analyzer) that makes the pipeline, database, and all derived reports available for independent verification and extension. \end{itemize} The remainder of this paper is organized as follows. Section~\ref{sec:related} reviews related work on standards landscape analysis, NLP for technical documents, and technology mapping. Section~\ref{sec:method} describes the data collection and analysis pipeline in detail. Section~\ref{sec:results} presents our findings across five analytical dimensions. Section~\ref{sec:discussion} discusses implications, limitations, and organizational dynamics. Section~\ref{sec:conclusion} concludes. % ========================================================================= \section{Related Work} \label{sec:related} % ========================================================================= Our work sits at the intersection of three research areas: standards ecosystem analysis, NLP applied to technical document corpora, and technology landscape mapping. \subsection{Standards Analysis} The economics and dynamics of technical standardization have been studied extensively. \citet{simcoe2012} analyzes consensus governance in standard-setting committees, showing how committee structure influences the trajectory of shared technology platforms. \citet{blind2017} examine the impact of standards and regulation on innovation in uncertain markets, a framing directly applicable to the nascent AI-agent ecosystem where both the technology and the regulatory environment are in flux. \citet{lerner2014} study standard-essential patents, a concern that is beginning to surface in the AI-agent space as organizations file IPR declarations on agent-related protocols. 
Prior quantitative analyses of IETF activity have typically focused on participation patterns, working group dynamics, or the trajectory of individual RFCs through the standards process. Our work differs in scope: rather than analyzing the IETF as an institution, we analyze a specific cross-cutting topic (AI agents) that spans multiple working groups and is evolving too rapidly for traditional manual survey methods. \subsection{NLP for Technical Documents} The application of natural language processing to technical and legal document corpora has expanded significantly with the advent of large language models. \citet{devlin2019} introduced BERT-based approaches that enabled transfer learning for domain-specific text classification. More recently, \citet{brown2020} demonstrated that large language models exhibit strong few-shot and zero-shot performance on diverse text understanding tasks, opening the possibility of using LLMs as automated annotators for technical documents. The ``LLM-as-judge'' paradigm---using language models to evaluate or rate text artifacts---has been systematically studied by \citet{zheng2023}, who introduced MT-Bench and Chatbot Arena to evaluate LLM judges against human preferences. Their work establishes both the promise (high correlation with human judgment on structured evaluation tasks) and the limitations (position bias, verbosity bias, self-enhancement bias) of LLM-based evaluation. Our use of Claude as a rater for IETF drafts follows this paradigm, with the specific limitation that no human calibration study has been performed on our rating outputs (see Section~\ref{sec:limitations}). Embedding-based document similarity using models such as Sentence-BERT~\citep{nussbaumer2024} and its successors has become standard practice for document clustering and retrieval. We use nomic-embed-text~\citep{nomic2024}, a general-purpose text embedding model, for computing pairwise cosine similarity across the draft corpus. 
The resulting similarity matrix enables both cluster detection and visualization via t-SNE~\citep{vandermaaten2008}. \subsection{Technology Landscape Surveys} Technology landscape mapping---the systematic identification and organization of technical activities within a domain---has a long history in foresight and innovation studies. \citet{porter2005} introduced ``tech mining'' as a methodology for extracting competitive intelligence from patent and publication databases. \citet{roper2011} extended these methods to broader technology management contexts. Our work adapts these approaches to the standards domain, replacing patent databases with the IETF Datatracker and augmenting keyword-based search with LLM-driven semantic analysis. The AI agent research community has produced several recent surveys. \citet{wang2024} and \citet{xi2023} survey the rapidly growing literature on LLM-based autonomous agents, covering architectures, capabilities, and evaluation. These academic surveys focus on research contributions; our work complements them by mapping the parallel standardization effort, where research ideas meet the engineering constraints of Internet protocol design. The multi-agent systems (MAS) research tradition, surveyed comprehensively by \citet{wooldridge2009} and \citet{dorri2018}, provides historical context. The FIPA Agent Communication Language~\citep{fipa-acl} and Agent Management Specification~\citep{fipa-ams}, developed between 1996 and 2005, addressed many of the same problems---agent discovery, communication protocols, platform interoperability---that the current IETF drafts tackle. The near-complete absence of FIPA references in the contemporary IETF corpus suggests limited awareness of this prior art, a finding we quantify in Section~\ref{sec:results}. 
% ========================================================================= \section{Methodology} \label{sec:method} % ========================================================================= The analysis pipeline consists of six sequential stages, each building on the output of the previous. All intermediate results are stored in a SQLite database (28\,MB) with FTS5 full-text search, enabling both pipeline idempotency and ad-hoc querying. The complete pipeline is implemented as a Python CLI tool (approximately 6,100 lines across 12 modules) using Click, httpx, the Anthropic SDK, and Ollama. \subsection{Data Collection} \label{sec:datacollection} \subsubsection{Corpus Construction} Drafts were retrieved from the IETF Datatracker API\footnote{\url{https://datatracker.ietf.org/api/v1/doc/document/}} using keyword search across both draft names (\texttt{name\_\_contains}) and abstracts (\texttt{abstract\_\_contains}). The twelve search terms included: \textit{agent}, \textit{ai-agent}, \textit{agentic}, \textit{autonomous}, \textit{mcp}, \textit{inference}, \textit{generative}, \textit{intelligent}, \textit{large language model}, \textit{multi-agent}, and \textit{trustworth}. Only drafts with \texttt{type\_\_slug=draft} and submission date $\geq$~2024-01-01 were included. Full text was downloaded from the IETF archive.\footnote{\url{https://www.ietf.org/archive/id/}} The keyword set was expanded iteratively. An initial set of 6 keywords yielded 260 drafts; adding 6 further terms captured 174 additional drafts in categories initially underrepresented, including MCP-related work, generative AI infrastructure, and the nascent \texttt{aipref} working group. A polite delay of 0.5\,seconds was applied between API requests. The resulting corpus contains 475 drafts.
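To make the collection step concrete, the following sketch shows how a single keyword query can be assembled and issued with the politeness delay described above. It uses only the Python standard library (the actual tool uses httpx), and the \texttt{time\_\_gte} parameter name is an assumption, since the paper does not name the Datatracker's date filter:

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen

DATATRACKER = "https://datatracker.ietf.org/api/v1/doc/document/"

def build_params(term: str, field: str, since: str = "2024-01-01") -> dict:
    """Query parameters for one keyword search.

    `field` is "name" or "abstract", matching the Datatracker's
    name__contains / abstract__contains filters; `time__gte` is an
    assumed name for the submission-date filter.
    """
    return {
        f"{field}__contains": term,
        "type__slug": "draft",   # Internet-Drafts only
        "time__gte": since,      # submission date >= 2024-01-01
        "format": "json",
    }

def fetch_term(term: str, delay: float = 0.5) -> list:
    """Query both the name and abstract fields for one keyword, politely."""
    results = []
    for field in ("name", "abstract"):
        url = DATATRACKER + "?" + urlencode(build_params(term, field))
        with urlopen(url) as resp:           # network call
            results.extend(json.load(resp).get("objects", []))
        time.sleep(delay)                    # 0.5 s politeness delay
    return results
```

Running \texttt{fetch\_term} once per search term, then deduplicating by draft name, yields the raw corpus that the relevance filter subsequently prunes.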
After false-positive filtering (removing drafts about ``user agents,'' ``autonomous systems'' in routing, and other non-AI uses of matched keywords), 361 drafts were retained as AI/agent-relevant based on a relevance rating threshold. \subsubsection{Supplementary Standards Bodies} To contextualize the IETF landscape, we ingested a supplementary corpus of standards and specifications from five additional bodies: ISO/IEC (including ISO~22989~\citep{iso22989} and ISO~42001~\citep{iso42001}), ITU-T (including Y.3172~\citep{itu-y3172}), ETSI (ENI, ZSM), W3C (Web of Things, Verifiable Credentials, WebNN), and NIST (AI RMF~\citep{nist-ai-rmf}). These documents were included in the gap analysis (Section~\ref{sec:gaps}) to identify areas where non-IETF bodies provide coverage that the IETF corpus lacks, and vice versa. \subsubsection{Author and Affiliation Data} Author records were fetched from the Datatracker's \texttt{documentauthor} and \texttt{person} endpoints. Organizational affiliations were normalized using a hand-curated alias table of 40+ mappings (e.g., ``Huawei Technologies Co., Ltd.'' $\rightarrow$~``Huawei'') supplemented by automatic suffix stripping for common corporate suffixes. \subsection{LLM-Based Analysis} \label{sec:llm-analysis} \subsubsection{Multi-Dimensional Rating} Each draft was rated by Claude (Anthropic; Sonnet model) on five dimensions using a structured prompt containing the draft's name, title, submission date, page count, and abstract (truncated to 2,000 characters). The five rating dimensions are: \begin{itemize}[nosep] \item \textbf{Novelty} (1--5): Originality relative to existing standards and proposals. \item \textbf{Maturity} (1--5): Completeness of the technical specification. \item \textbf{Overlap} (1--5): Redundancy with other known drafts (5 indicates near-duplication). \item \textbf{Momentum} (1--5): Community engagement, revisions, and working group adoption signals. 
\item \textbf{Relevance} (1--5): Importance to the AI/agent ecosystem specifically. \end{itemize} The prompt instructs Claude to return structured JSON with integer scores and brief justification notes for each dimension, plus a 2--3 sentence summary and one or more category labels drawn from a predefined taxonomy of 11 categories (Table~\ref{tab:categories}). A composite quality score is computed as the arithmetic mean of novelty, maturity, momentum, and relevance (excluding overlap, which measures redundancy rather than quality). To reduce API costs, drafts were rated in batches of five using a batch prompt variant. Each draft's abstract was truncated to 1,500 characters in batch mode. All API responses were cached in an \texttt{llm\_cache} table keyed by SHA-256 hash of the full prompt, making the pipeline idempotent on re-runs. \subsubsection{Idea Extraction} Discrete technical ideas---mechanisms, protocols, architectural patterns, extensions, and requirements---were extracted from each draft using Claude. For individual extraction, the prompt included the abstract and the first 3,000 characters of full text (Sonnet model). For batch extraction, groups of five drafts were processed per API call using the cheaper Haiku model with abstracts truncated to 800 characters. The prompt requested 1--4 top-level novel contributions per draft, with explicit instructions to merge sub-features into parent ideas and to return an empty array for drafts lacking substantive technical content. Extracted ideas were deduplicated within each draft using embedding-based cosine similarity (threshold~0.85), removing ideas that were restatements of the same concept. Cross-draft idea overlap was analyzed using Python's \texttt{SequenceMatcher} with a fuzzy matching threshold of~0.75 on idea titles, enabling detection of convergent ideas across organizational boundaries. 
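The caching that makes the pipeline idempotent can be sketched as follows. This is a minimal illustration of an \texttt{llm\_cache} table keyed by the SHA-256 hash of the full prompt; the two-column schema and function names are assumptions, not the tool's actual interface:

```python
import hashlib
import sqlite3

def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the prompt-response cache table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS llm_cache "
        "(prompt_hash TEXT PRIMARY KEY, response TEXT)"
    )
    return db

def cached_call(db: sqlite3.Connection, prompt: str, call_model) -> str:
    """Return the cached response for `prompt`; call the model only on a miss.

    Keying on the SHA-256 of the full prompt makes re-runs idempotent:
    an unchanged prompt never triggers a second API call.
    """
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    row = db.execute(
        "SELECT response FROM llm_cache WHERE prompt_hash = ?", (key,)
    ).fetchone()
    if row is not None:
        return row[0]                 # cache hit
    response = call_model(prompt)     # e.g. the Anthropic SDK call
    db.execute("INSERT INTO llm_cache VALUES (?, ?)", (key, response))
    db.commit()
    return response
```

Because the hash covers the entire prompt, any change to the prompt template or the embedded abstract invalidates the cache entry automatically.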
\subsubsection{Gap Analysis} A single Claude Sonnet call received a compressed landscape summary containing category distribution counts, the 20 most frequently occurring idea titles, overlap cluster statistics, and summaries of relevant non-IETF standards. The prompt instructed the model to identify 8--15 standardization gaps---areas, problems, or technical challenges not adequately addressed by the existing corpus---with structured output including topic, description, severity rating (critical/high/medium/low), evidence, and partial coverage from existing standards. \subsection{Embedding and Clustering} \label{sec:embedding} Vector embeddings were generated locally using Ollama with the nomic-embed-text model~\citep{nomic2024}. For each draft, the input combined the title, abstract, and first 4,000 characters of full text (when available), producing a 768-dimensional vector stored as a binary blob in SQLite. Pairwise cosine similarity was computed across all embedded drafts, producing an $n \times n$ similarity matrix (cached to disk as a NumPy array). Clustering used a greedy single-linkage algorithm: for each unvisited draft, all unvisited drafts with cosine similarity $\geq \tau$ to the seed were added to its cluster. Three empirically determined thresholds were applied: \begin{itemize}[nosep] \item $\tau = 0.85$: Topically overlapping drafts (42 clusters). \item $\tau = 0.90$: Near-duplicates or same-author variants (34 clusters). \item $\tau = 0.98$: Functionally identical drafts (25+ pairs). \end{itemize} These thresholds were selected by manual inspection of draft pairs at each level; no systematic sensitivity analysis was performed (see Section~\ref{sec:limitations}). 
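The similarity computation and the greedy single-linkage pass described above can be sketched as follows (function names are illustrative; the actual tool caches the similarity matrix to disk as a NumPy array):

```python
import numpy as np

def cosine_matrix(emb: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity for row-wise embedding vectors."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return normed @ normed.T

def greedy_clusters(sim: np.ndarray, tau: float) -> list:
    """Greedy single-linkage pass: each unvisited draft seeds a cluster
    of all unvisited drafts whose similarity to the seed is >= tau."""
    n = sim.shape[0]
    visited = [False] * n
    clusters = []
    for seed in range(n):
        if visited[seed]:
            continue
        members = [j for j in range(n)
                   if not visited[j] and sim[seed, j] >= tau]
        for j in members:
            visited[j] = True
        clusters.append(members)
    return clusters
```

At $\tau = 0.98$ this pass isolates near-identical pairs; lowering $\tau$ to 0.85 merges topically related drafts into broader clusters. Note that the result depends on seed order, one reason a sensitivity analysis would be worthwhile.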
\subsection{Supplementary Analyses} Three additional analysis passes operate on the stored data with zero API cost: \begin{enumerate}[nosep] \item \textbf{RFC cross-references}: Regex-based extraction of RFC, BCP, and draft citations from full text, yielding 4,231 cross-references across 360 drafts. \item \textbf{Category trends}: SQL-based monthly breakdown of new drafts per category with growth rates. \item \textbf{Co-authorship network}: Team bloc detection via pairwise author overlap ($\geq$70\% shared drafts, $\geq$2 shared drafts), with connected components forming blocs. \end{enumerate} \subsection{Cost} Table~\ref{tab:cost} summarizes the total pipeline cost for 475 drafts. \begin{table}[H] \centering \caption{Pipeline cost breakdown.} \label{tab:cost} \begin{tabular}{llrr} \toprule \textbf{Stage} & \textbf{Model} & \textbf{Items} & \textbf{Cost (USD)} \\ \midrule Rating & Claude Sonnet & 475 drafts & \$5.50--8.00 \\ Idea extract. & Claude Haiku & 475 drafts & \$0.80 \\ Gap analysis & Claude Sonnet & 1 call & \$0.20 \\ Embeddings & Ollama (local) & 475 drafts & \$0.00 \\ RFC refs & Regex (local) & 475 drafts & \$0.00 \\ Trends & SQL (local) & 475 drafts & \$0.00 \\ Idea overlap & SequenceMatcher & 501 ideas & \$0.00 \\ \midrule \textbf{Total} & & & \textbf{\$6.50--9.00} \\ \bottomrule \end{tabular} \end{table} % ========================================================================= \section{Results} \label{sec:results} % ========================================================================= \subsection{Corpus Overview and Growth Trajectory} The final corpus comprises 475 Internet-Drafts submitted between January~2024 and March~2026. After false-positive filtering (drafts with relevance score $\leq$~2 or manually flagged), 361 drafts were retained as substantively related to AI agents. The growth trajectory is striking. In 2024, 9 AI/agent drafts were submitted (0.5\% of 1,651 total IETF drafts). In 2025, 190 were submitted (7.0\% of 2,696). 
In Q1~2026 alone, 162 were submitted (9.3\% of 1,748). Monthly submissions followed a step function: 5~drafts in June~2025, 61 in October~2025, 85 in February~2026. The acceleration has not plateaued as of March~2026. \begin{table}[H] \centering \caption{Growth of AI/agent-related IETF Internet-Drafts.} \label{tab:growth} \begin{tabular}{rrrr} \toprule \textbf{Year} & \textbf{Total IETF} & \textbf{AI/Agent} & \textbf{Share (\%)} \\ \midrule 2024 & 1,651 & 9 & 0.5 \\ 2025 & 2,696 & 190 & 7.0 \\ 2026 (Q1) & 1,748 & 162 & 9.3 \\ \bottomrule \end{tabular} \end{table} \subsection{Thematic Distribution} \label{sec:categories} Drafts were classified into 11 non-exclusive categories (Table~\ref{tab:categories}). A single draft may belong to multiple categories; percentages therefore exceed 100\%. \begin{table}[H] \centering \caption{Category distribution across 475 drafts. Drafts may appear in multiple categories.} \label{tab:categories} \begin{tabular}{lrr} \toprule \textbf{Category} & \textbf{Drafts} & \textbf{Share (\%)} \\ \midrule Data formats / interoperability & 214 & 45 \\ Policy / governance & 214 & 45 \\ Agent identity / authentication & 160 & 34 \\ A2A protocols & 157 & 33 \\ Autonomous network operations & 124 & 26 \\ AI safety / alignment & 112 & 24 \\ Agent discovery / registration & 89 & 19 \\ ML traffic management & 79 & 17 \\ Human--agent interaction & 57 & 12 \\ Model serving / inference & 42 & 9 \\ Other AI/agent & -- & -- \\ \bottomrule \end{tabular} \end{table} The dominance of infrastructure categories---data formats, identity, communication protocols---is expected for an early-stage standards effort. The comparatively low representation of safety/alignment and human--agent interaction categories is a structural finding we examine in Section~\ref{sec:safety-deficit}.
\subsection{The Capability-to-Safety Deficit} \label{sec:safety-deficit} The ratio of capability-building drafts (A2A protocols, autonomous network operations, agent discovery, model serving) to safety-oriented drafts (AI safety/alignment, human--agent interaction) is approximately 4:1 on aggregate. This ratio varies significantly by month, ranging from 1.5:1 in months with concentrated safety submissions to over 20:1 in months dominated by protocol proposals. The drafts that do address safety are among the highest-rated in the corpus. The Verifiable Observation Logging for Transparency (VOLT)~\citep{draft-cowles-volt} protocol scored 4.75/5.0 on the four-dimension composite (excluding overlap), as did the Distributed AI Accountability Protocol (DAAP)~\citep{draft-aylward-daap}. The STAMP protocol~\citep{draft-guy-bary-stamp} for cryptographic delegation and proof scored 4.5. The quality of safety-focused work is high; the quantity is not. An analysis of RFC cross-references reinforces this finding. Across 4,231 parsed citations, the most-referenced standards after the boilerplate RFC~2119/8174 conventions are TLS~1.3~\citep{rfc8446} (42 citations), OAuth~2.0~\citep{rfc6749} (36), HTTP Semantics~\citep{rfc9110} (34), and JWT~\citep{rfc7519} (22). The agent standards ecosystem is being constructed on the web's existing security infrastructure---OAuth, TLS, HTTP, JWT---yet the safety layer that should accompany this security foundation remains underdeveloped. \subsection{Protocol Fragmentation} \label{sec:fragmentation} Embedding-based similarity analysis reveals extensive duplication and fragmentation across the corpus. \subsubsection{Near-Duplicates} At the 0.98 cosine similarity threshold, 25+ draft pairs are functionally identical---the same proposal submitted under different names, to different working groups, or as renamed revisions. 
A taxonomy of near-duplicates includes: same draft submitted to different working groups (14 pairs), renamed drafts (5), evolutionary versions (3), and genuinely competing proposals from different organizations (2+). \subsubsection{Competing Clusters} At the 0.85 threshold, 42 topical clusters emerge. The most crowded is OAuth for AI agents, with 14 distinct proposals all addressing how AI agents authenticate and receive authorization via the OAuth framework. These range from broad profile proposals to narrow scope extensions to comprehensive accountability systems. None are interoperable. The A2A protocol space encompasses 157 drafts with no interoperability layer. The most common technical idea in the entire extracted corpus---``Multi-Agent Communication Protocol''---appears independently in 8 drafts from different teams. A 10-draft cluster addresses agent gateway and multi-agent collaboration, with approaches ranging from semantic routing gateways to cross-domain interoperability frameworks. \subsubsection{Causes of Fragmentation} The data distinguishes three causes: (1)~working group shopping, where authors submit the same draft to multiple working groups seeking adoption; (2)~parallel invention, where isolated teams independently solve the same problem; and (3)~strategic surface-area expansion, where organizations submit multiple related drafts to maximize presence in the standards landscape. \subsection{Organizational Dynamics} \label{sec:orgs} \subsubsection{Concentration} Authorship is heavily concentrated. Huawei leads with 53 authors contributing to 69 drafts---approximately 16\% of the entire corpus across all Huawei entities. China Mobile (24~authors, 35~drafts), Cisco (24~authors, 26~drafts), and China Telecom (24~authors, 24~drafts) follow. Chinese-linked institutions (Huawei, China Mobile, China Telecom, China Unicom, Tsinghua University, ZTE, BUPT, and associated laboratories) collectively account for over 160 authors. 
Western technology companies are dramatically underrepresented relative to their market positions. Google is present with 5 authors on 9 drafts. Microsoft, Apple, and Meta have minimal direct participation. Amazon's 6 authors focus on post-quantum cryptography rather than agent-specific work. \subsubsection{Team Blocs} Co-authorship analysis identifies 18 team blocs among the 713 authors, covering approximately 25\% of all authors. The largest bloc is a 13-person Huawei team sharing 22 drafts with 94\% average cohesion (measured as pairwise overlap of draft portfolios). The team's core of 7 members each appear on 13--23 drafts. Cross-organizational collaboration is sparse. The most productive cross-team pair shares only 3 drafts. Chinese organizations form a tightly linked ecosystem: Huawei--China Unicom shares 6 drafts, Tsinghua--Zhongguancun Lab shares 5, China Mobile--ZTE shares 4. European telecoms (Deutsche Telekom, Telef\'onica, Orange) act as bridges between Chinese and Western institutions. \subsection{Cross-Organization Convergence} \label{sec:convergence} Despite the fragmentation, significant latent consensus exists. Using fuzzy title matching (\texttt{SequenceMatcher} at 0.75 threshold) on the 501 extracted ideas, 132 ideas (approximately 33\% of unique idea clusters) have been independently proposed by two or more organizations. The strongest convergence signals include ``A2A Communication Paradigm'' (proposed by 8 organizations from 5 countries), ``AI Agent Network Architecture'' (8 organizations), and ``Multi-Agent Communication Protocol'' (7 organizations). An examination of organizational pairs reveals 180 convergent idea pairings (counted per pair of proposing organizations, hence exceeding the 132 unique ideas) that cross the boundary between Chinese-linked and Western organizations, indicating genuine cross-cultural consensus on technical directions despite the sparse direct collaboration noted in Section~\ref{sec:orgs}.
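A minimal sketch of the convergence detection follows. The paper specifies only the \texttt{SequenceMatcher} threshold of 0.75 on idea titles; the greedy first-match grouping and the function names here are illustrative assumptions:

```python
from difflib import SequenceMatcher

def similar(a: str, b: str, threshold: float = 0.75) -> bool:
    """Fuzzy title match used for cross-draft idea overlap."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def convergent_ideas(ideas):
    """Group (title, organization) pairs by fuzzy title match and keep
    clusters proposed by two or more distinct organizations.

    The grouping is greedy: each idea joins the first cluster whose
    representative title it matches, else it seeds a new cluster.
    """
    clusters = []  # each entry: (representative title, set of orgs)
    for title, org in ideas:
        for rep, orgs in clusters:
            if similar(title, rep):
                orgs.add(org)
                break
        else:
            clusters.append((title, {org}))
    return [(rep, orgs) for rep, orgs in clusters if len(orgs) >= 2]
```

Applied to the 501 extracted ideas tagged with their authors' normalized affiliations, this kind of pass surfaces titles such as ``Multi-Agent Communication Protocol'' that recur across organizational boundaries.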
The coexistence of convergence and fragmentation has a specific structure: organizations agree on \textit{what} needs building (the convergent ideas) but disagree on \textit{how} to build it (the competing protocol proposals). This gap between problem consensus and solution divergence is where architectural coordination is most needed. \subsection{Gap Analysis} \label{sec:gaps} The gap analysis identified 11 standardization gaps, distributed across severity levels as shown in Table~\ref{tab:gaps}. \begin{table}[H] \centering \caption{Identified standardization gaps by severity.} \label{tab:gaps} \begin{tabularx}{\textwidth}{llX} \toprule \textbf{Severity} & \textbf{Topic} & \textbf{Description} \\ \midrule Critical & Agent legal liability & No standard addresses liability assignment when autonomous agents cause harm or make binding commitments across creators, operators, and users. \\ Critical & Capability degradation detection & No standard defines detection mechanisms for gradual capability degradation due to concept drift, adversarial inputs, or model corruption. \\ Critical & Emergency override protocols & No standard defines distributed emergency-stop mechanisms for autonomous agents exhibiting dangerous behavior across multi-system deployments. \\ \midrule High & Cross-domain identity portability & Agents cannot maintain consistent identity across organizational domains with different identity systems. \\ High & Real-time behavior explanation & No standard for interactive, real-time explanations of agent decision-making during operation. \\ High & Multi-agent conflict resolution & No protocol for resolving conflicts when multiple agents have competing objectives or contend for shared resources. \\ High & Inter-standards-body bridging & Protocols from IETF, ITU-T, and ISO cannot interoperate, creating silos across network, internet, and industrial domains. 
\\ High & Behavioral audit trails & Missing standards for immutable, decision-level audit logs supporting forensic analysis and regulatory compliance. \\ \midrule Medium & Resource consumption limits & No self-regulation standards for agent computational, network, and energy resource usage. \\ Medium & Training data provenance & Missing standards for tracking data lineage as it flows between agents in federated learning scenarios. \\ Medium & Content attribution & No cryptographic attribution standards for agent-generated content.\\ \bottomrule \end{tabularx} \end{table} The three critical gaps share a common theme: they address what happens when autonomous agents fail or misbehave. The capability-building majority of the corpus assumes cooperative, well-functioning agent systems; the critical gaps expose the absence of standards for the adversarial, degraded, and emergency cases that inevitably arise in production deployment. Cross-referencing gaps with extracted ideas quantifies the coverage deficit. The ``emergency override'' gap has only 15 partially addressing ideas across the corpus. The ``multi-agent conflict resolution'' and ``inter-standards-body bridging'' gaps have zero directly related extracted ideas---they are entirely unaddressed. % ========================================================================= \section{Discussion} \label{sec:discussion} % ========================================================================= \subsection{Implications for Standardization Strategy} The landscape reveals a standards ecosystem in a characteristic early-stage pattern: rapid expansion, parallel invention, and insufficient coordination. The IETF has navigated such patterns before---the early web, IoT, DNS security---and the historical resolution involves convergence of competing proposals, working group consolidation, and the emergence of a small number of lasting standards from a large initial field. 
Three strategic priorities emerge from the data: \textbf{Safety-first coordination.} The 4:1 capability-to-safety ratio is a structural risk. The critical gaps---agent legal liability, capability degradation detection, emergency override---are precisely the areas where standardization failure has the highest real-world consequence. Unlike protocol fragmentation, which causes confusion and implementation cost, safety gaps create liability and harm. The EU AI Act~\citep{eu-ai-act}, which mandates real-time explainability and human oversight for high-risk AI systems, will make several of these gaps regulatory obligations rather than optional best practices. \textbf{Architectural connective tissue.} The landscape needs not more protocols but a shared execution model. The convergence data shows that organizations agree on the components; they disagree on the integration. Proposals like VOLT~\citep{draft-cowles-volt} (execution traces), DAAP~\citep{draft-aylward-daap} (accountability), STAMP~\citep{draft-guy-bary-stamp} (cryptographic delegation), and Verifiable Agent Conversations~\citep{draft-birkholz-vac} (signed conversation records) address complementary parts of the same architectural problem. An overarching agent execution architecture that composes these components would accelerate convergence more effectively than continued parallel invention. \textbf{Cross-organization coordination.} The team bloc structure produces drafts that are internally consistent but externally incompatible. The 18 detected blocs function as islands; the bridges between them are thin. Mechanisms that encourage cross-bloc collaboration---joint design teams, interop testing events, shared reference implementations---are more likely to produce lasting standards than the current pattern of parallel submission.
\subsection{Relationship to Prior Agent Standards} A notable finding is the near-complete absence of references to FIPA (Foundation for Intelligent Physical Agents) in the contemporary IETF corpus. FIPA's Agent Communication Language~\citep{fipa-acl} and Agent Management Specification~\citep{fipa-ams}, developed between 1996 and 2005, addressed agent discovery, communication, platform interoperability, and interaction protocols---the same problem space that the current wave of IETF drafts tackles. The absence of FIPA references does not necessarily indicate ignorance; the web-native technical context of 2025 differs substantially from the Java/CORBA context of 2002. However, the recurrence of problems FIPA addressed (agent naming, message semantics, directory services, interaction protocols) suggests that explicit engagement with the FIPA legacy could help the IETF community avoid re-learning lessons from two decades ago. \subsection{Limitations} \label{sec:limitations} The methodology has several limitations that affect the confidence and generalizability of the findings. \textbf{LLM-as-judge validity.} All quality ratings are generated by a single LLM (Claude Sonnet) from draft abstracts truncated to 2,000 characters. No human calibration study has been performed; no inter-rater reliability is established. The ratings should be treated as relative rankings within this corpus, not absolute quality measures. Maturity scores are particularly affected by abstract-only input, as abstracts may not convey the full technical depth of a specification. The overlap dimension is limited because Claude rates each draft independently without access to the full corpus, meaning it reflects the model's general knowledge rather than corpus-specific similarity. A validation study using domain expert ratings on a sample of 25--30 drafts would substantially strengthen confidence. 
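To make the single-judge setup concrete, the rating step can be sketched as follows. The 2,000-character truncation and the maturity and overlap dimensions follow the description above; the JSON response format, the remaining dimension names, and the 1--10 scale are illustrative assumptions, not the pipeline's actual interface:

```python
import json

# Sketch of the abstract-only, single-judge rating step. The truncation
# limit and the "maturity"/"overlap" dimensions follow the paper; the JSON
# schema, the other dimension names, and the 1-10 scale are assumptions.
TRUNCATE_AT = 2_000
DIMENSIONS = ("clarity", "maturity", "novelty", "overlap")

def prepare_abstract(abstract: str) -> str:
    """Truncate the abstract before it is sent to the LLM judge."""
    return abstract[:TRUNCATE_AT]

def parse_rating(raw_response: str) -> dict[str, int]:
    """Validate a judge response: every dimension present, integer score in 1-10."""
    data = json.loads(raw_response)
    rating = {}
    for dim in DIMENSIONS:
        score = data.get(dim)
        if not isinstance(score, int) or not 1 <= score <= 10:
            raise ValueError(f"bad score for {dim!r}: {score!r}")
        rating[dim] = score
    return rating

judge_reply = '{"clarity": 7, "maturity": 4, "novelty": 6, "overlap": 5}'
print(parse_rating(judge_reply))
```

Because each draft is rated independently, nothing in this step sees the rest of the corpus, which is exactly why the overlap scores reflect model priors rather than measured inter-draft similarity.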
\textbf{Corpus selection bias.} Keyword-based selection introduces both false positives (``agent'' matching ``user agent,'' ``autonomous'' matching ``autonomous systems'' in routing) and false negatives (relevant drafts using terminology outside the keyword set). We estimate 30--50 false positives remain despite relevance filtering. The temporal cutoff of January~2024 excludes earlier foundational work. \textbf{Clustering thresholds.} The similarity thresholds (0.85, 0.90, 0.98) are empirically chosen by manual inspection, not derived from principled analysis. The embedding model (nomic-embed-text) is a general-purpose model not fine-tuned for standards document similarity. Sensitivity analysis across thresholds and comparison with alternative clustering methods (DBSCAN, hierarchical agglomerative) would strengthen the clustering results. \textbf{Gap analysis methodology.} Gap identification relies on a single-shot LLM analysis of compressed landscape statistics, not systematic comparison against a reference taxonomy. A rigorous approach would compare the corpus against an explicit reference architecture such as NIST AI RMF~\citep{nist-ai-rmf}, the FIPA agent platform model, or a purpose-built agent ecosystem reference model. Gap severity is assigned by Claude without defined quantitative thresholds. \textbf{Idea extraction consistency.} Batch extraction using Haiku with abstract-only input produces different results from individual extraction using Sonnet with full text. No precision/recall measurement has been performed. The extraction prompt limits output to 1--4 ideas per draft, potentially under-counting contributions from comprehensive specifications. \textbf{Organizational normalization.} Cross-organization analysis depends on the accuracy of a hand-curated alias table. Boundary cases (e.g., joint ventures, university--industry affiliations, subsidiary relationships) introduce judgment calls that affect concentration statistics. 
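The alias-normalization step and its effect on concentration statistics can be illustrated with a small sketch; the alias entries and organization names below are invented examples, not entries from the actual curated table:

```python
from collections import Counter

# Invented alias entries for illustration; the real table is hand-curated.
ALIASES = {
    "example corp.": "Example Corp",
    "example corporation": "Example Corp",
    "example labs": "Example Corp",  # subsidiary mapping: a judgment call
}

def normalize_org(raw: str) -> str:
    """Map a free-text affiliation string to its canonical organization name."""
    return ALIASES.get(raw.strip().lower(), raw.strip())

def concentration(affiliations: list[str]) -> dict[str, float]:
    """Share of drafts attributed to each canonical organization."""
    counts = Counter(normalize_org(a) for a in affiliations)
    total = sum(counts.values())
    return {org: n / total for org, n in counts.items()}

drafts = ["Example Corp.", "example labs", "Other Org", "Example Corporation"]
print(concentration(drafts))  # → {'Example Corp': 0.75, 'Other Org': 0.25}
```

In this toy case, treating ``Example Labs'' as a subsidiary lifts the parent's share from 0.5 to 0.75---precisely the kind of judgment call the text flags as affecting the reported concentration figures.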
Despite these limitations, the findings are robust in their broad contours: the growth trajectory, the safety deficit, the protocol fragmentation, and the organizational concentration are visible across multiple analytical methods and are not sensitive to the specific threshold or model choices within reasonable ranges. \subsection{Reproducibility and Openness} The complete pipeline, database, and derived reports are released as open-source software (the IETF Draft Analyzer). The SQLite database contains all raw data, ratings, embeddings, ideas, gaps, author records, and cached LLM responses, enabling independent verification of every finding reported in this paper. The caching mechanism ensures that re-running the pipeline produces identical results without additional API cost. % ========================================================================= \section{Conclusion} \label{sec:conclusion} % ========================================================================= We have presented a systematic, LLM-assisted analysis of the IETF's AI-agent standardization landscape, covering 475 Internet-Drafts from 713 authors across more than 230 organizations. The analysis reveals a standards ecosystem experiencing unprecedented growth---from 0.5\% to 9.3\% of all IETF submissions in fifteen months---accompanied by significant structural challenges. The capability-to-safety ratio of approximately 4:1, the extreme protocol fragmentation (14 competing OAuth proposals, 155 A2A drafts with no interoperability layer), and the concentration of authorship (one vendor contributing $\sim$16\% of all drafts) are findings that have direct implications for the trajectory of AI-agent standardization. The 11 identified gaps, with three critical gaps centered on what happens when agents fail, highlight the areas where standardization effort is most urgently needed. 
At the same time, the 132 cross-organization convergent ideas demonstrate that latent consensus exists beneath the fragmentation. Organizations agree on the problems; they disagree on the solutions. This gap between problem consensus and solution divergence defines the current phase of the standards race and points toward the needed intervention: not more protocol proposals, but architectural connective tissue that composes the existing high-quality components into a coherent ecosystem. The methodology itself contributes a replicable, cost-effective approach to standards landscape analysis. At \$9--15 total, the pipeline demonstrates that LLM-assisted document analysis at scale is practical for research and policy applications. The explicit documentation of limitations---no human calibration, empirical thresholds, single-judge ratings---provides a template for the responsible use of LLM-as-judge methodologies in technical document analysis. The IETF has navigated standardization sprints before, and the lasting standards have consistently emerged from efforts that prioritized interoperability and safety alongside capability. Whether the current AI-agent wave follows this historical pattern depends on whether the community can shift from parallel invention to coordinated architecture before the capability work ships without the safety work that should accompany it. % ========================================================================= % References % ========================================================================= \bibliographystyle{plainnat} \bibliography{ietf-refs} \end{document}