New LaTeX paper analyzing the AI-agent standardization landscape across IETF Internet-Drafts. Includes bibliography, updated Makefile for pdflatex+bibtex build, and gitignore entries for build artifacts.
900 lines
41 KiB
TeX
900 lines
41 KiB
TeX
\documentclass[11pt,a4paper]{article}
|
|
|
|
\usepackage[utf8]{inputenc}
|
|
\usepackage[T1]{fontenc}
|
|
\usepackage{lmodern}
|
|
\usepackage[margin=2.5cm]{geometry}
|
|
\usepackage{amsmath,amssymb}
|
|
\usepackage{graphicx}
|
|
\usepackage{booktabs}
|
|
\usepackage{tabularx}
|
|
\usepackage{hyperref}
|
|
\usepackage{xcolor}
|
|
\usepackage{natbib}
|
|
\usepackage{enumitem}
|
|
\usepackage{float}
|
|
\usepackage{caption}
|
|
|
|
\hypersetup{
|
|
colorlinks=true,
|
|
linkcolor=blue!60!black,
|
|
citecolor=green!50!black,
|
|
urlcolor=blue!70!black,
|
|
}
|
|
|
|
\setlength{\parskip}{0.4em}
|
|
\setlength{\parindent}{0em}
|
|
|
|
\title{%
|
|
Mapping the AI-Agent Standardization Landscape:\\
|
|
An LLM-Assisted Analysis of IETF Internet-Drafts%
|
|
}
|
|
\author{
|
|
Christian Nennemann\\
|
|
Independent Researcher\\
|
|
\texttt{write@nennemann.de}
|
|
}
|
|
\date{April 2026}
|
|
|
|
\begin{document}
|
|
\maketitle
|
|
|
|
\begin{abstract}
|
|
The Internet Engineering Task Force (IETF) is experiencing an unprecedented
|
|
surge in standardization activity around AI agents. Between January~2024 and
|
|
March~2026, AI- and agent-related Internet-Drafts grew from 0.5\% to 9.3\%
|
|
of all IETF submissions. We present a systematic, LLM-assisted analysis of
|
|
this landscape, covering 475 drafts from 713 authors across more than 230
|
|
organizations. Our pipeline combines keyword-based corpus construction from
|
|
the IETF Datatracker API, multi-dimensional quality rating via Claude
|
|
(Anthropic) as an LLM-as-judge, semantic embedding and clustering via a
|
|
local embedding model (nomic-embed-text), LLM-based extraction of 501
|
|
discrete technical ideas, and gap analysis against the assembled corpus.
|
|
Key findings include: (1)~a persistent capability-to-safety deficit, with
|
|
roughly four capability-building drafts for every safety-oriented one;
|
|
(2)~extreme protocol fragmentation, including 14~competing OAuth-for-agents
|
|
proposals and 155~agent-to-agent protocol drafts with no interoperability
|
|
layer; (3)~high organizational concentration, with a single vendor
|
|
contributing approximately 16\% of all drafts; (4)~132 cross-organization
|
|
convergent ideas independently proposed by multiple organizations, signaling
|
|
latent consensus beneath the fragmentation; and (5)~11 identified
|
|
standardization gaps, three rated critical, centered on behavioral
|
|
verification, capability degradation detection, and emergency override
|
|
protocols. The total analysis cost approximately \$9--15\,USD in API fees.
|
|
We discuss implications for AI-agent standardization strategy, the
|
|
limitations of LLM-as-judge methodologies applied to technical document
|
|
corpora, and organizational dynamics shaping the standards landscape.
|
|
\end{abstract}
|
|
|
|
\textbf{Keywords:} IETF, Internet-Drafts, AI agents, standardization,
|
|
LLM-as-judge, landscape analysis, multi-agent systems, protocol
|
|
fragmentation
|
|
|
|
|
|
% =========================================================================
|
|
\section{Introduction}
|
|
\label{sec:intro}
|
|
% =========================================================================
|
|
|
|
The deployment of autonomous AI agents---software systems that perceive
|
|
their environment, make decisions, and take actions with limited human
|
|
supervision---has accelerated dramatically since 2023. Commercial
|
|
offerings from Anthropic, Google, OpenAI, and others have moved AI agents
|
|
from research prototypes to production systems that browse the web,
|
|
execute code, manage cloud infrastructure, and interact with external
|
|
services on behalf of users. This proliferation raises fundamental
|
|
questions about identity, authentication, delegation, safety, and
|
|
interoperability that fall squarely within the purview of Internet
|
|
standards bodies.
|
|
|
|
The IETF, responsible for the core protocols of the Internet, has
|
|
responded with an extraordinary burst of activity. In 2024, just 9
|
|
AI- or agent-related Internet-Drafts were submitted---0.5\% of all
|
|
submissions. By the first quarter of 2026, that figure reached 9.3\%:
|
|
nearly one in ten new drafts addressed AI agents in some capacity.
|
|
Monthly submissions surged from 5 in June~2025 to 85 in February~2026,
|
|
a growth rate without precedent in the IETF's recent history.
|
|
|
|
This rapid expansion creates an analytical challenge. The volume of
|
|
drafts, the diversity of working groups involved, the overlapping scope
|
|
of competing proposals, and the speed of new submissions make manual
|
|
tracking infeasible. A standards participant seeking to understand the
|
|
landscape---which problems are being addressed, which are being
|
|
neglected, where proposals converge and where they conflict---faces a
|
|
corpus of hundreds of technical documents evolving on a weekly basis.
|
|
|
|
We address this challenge with an LLM-assisted analysis pipeline that
|
|
automates the collection, rating, clustering, idea extraction, and gap
|
|
identification for the full corpus of AI-agent-related IETF
|
|
Internet-Drafts. The pipeline combines three complementary analytical
|
|
approaches: (1)~LLM-as-judge rating of drafts on five quality
|
|
dimensions, using Claude (Anthropic) with structured prompts;
|
|
(2)~embedding-based semantic similarity and clustering, using a locally
|
|
hosted nomic-embed-text model via Ollama; and (3)~LLM-based extraction
|
|
of discrete technical ideas and identification of landscape gaps.
|
|
|
|
Our contributions are:
|
|
|
|
\begin{itemize}[nosep]
|
|
\item A comprehensive, quantitative map of the IETF's AI-agent
|
|
standardization landscape as of March~2026, covering 475 drafts,
|
|
713 authors, 501 extracted technical ideas, and 11 identified gaps.
|
|
\item A replicable, cost-effective methodology for LLM-assisted
|
|
standards corpus analysis (\$9--15 total), with explicit
|
|
documentation of limitations and methodological caveats.
|
|
\item Empirical findings on organizational concentration,
|
|
protocol fragmentation, cross-organization convergence, and
|
|
the capability-to-safety imbalance in the current landscape.
|
|
\item An open-source tool (the IETF Draft Analyzer) that makes the
|
|
pipeline, database, and all derived reports available for
|
|
independent verification and extension.
|
|
\end{itemize}
|
|
|
|
The remainder of this paper is organized as follows.
|
|
Section~\ref{sec:related} reviews related work on standards landscape
|
|
analysis, NLP for technical documents, and technology mapping.
|
|
Section~\ref{sec:method} describes the data collection and analysis
|
|
pipeline in detail. Section~\ref{sec:results} presents our findings
|
|
across five analytical dimensions. Section~\ref{sec:discussion}
|
|
discusses implications, limitations, and organizational dynamics.
|
|
Section~\ref{sec:conclusion} concludes.
|
|
|
|
|
|
% =========================================================================
|
|
\section{Related Work}
|
|
\label{sec:related}
|
|
% =========================================================================
|
|
|
|
Our work sits at the intersection of three research areas: standards
|
|
ecosystem analysis, NLP applied to technical document corpora, and
|
|
technology landscape mapping.
|
|
|
|
\subsection{Standards Analysis}
|
|
|
|
The economics and dynamics of technical standardization have been
|
|
studied extensively. \citet{simcoe2012} analyzes consensus governance
|
|
in standard-setting committees, showing how committee structure
|
|
influences the trajectory of shared technology platforms.
|
|
\citet{blind2017} examine the impact of standards and regulation on
|
|
innovation in uncertain markets, a framing directly applicable to the
|
|
nascent AI-agent ecosystem where both the technology and the regulatory
|
|
environment are in flux. \citet{lerner2014} study standard-essential
|
|
patents, a concern that is beginning to surface in the AI-agent space
|
|
as organizations file IPR declarations on agent-related protocols.
|
|
|
|
Prior quantitative analyses of IETF activity have typically focused on
|
|
participation patterns, working group dynamics, or the trajectory of
|
|
individual RFCs through the standards process. Our work differs in
|
|
scope: rather than analyzing the IETF as an institution, we analyze a
|
|
specific cross-cutting topic (AI agents) that spans multiple working
|
|
groups and is evolving too rapidly for traditional manual survey methods.
|
|
|
|
\subsection{NLP for Technical Documents}
|
|
|
|
The application of natural language processing to technical and legal
|
|
document corpora has expanded significantly with the advent of large
|
|
language models. \citet{devlin2019} introduced BERT-based approaches
|
|
that enabled transfer learning for domain-specific text
|
|
classification. More recently, \citet{brown2020} demonstrated that
|
|
large language models exhibit strong few-shot and zero-shot performance
|
|
on diverse text understanding tasks, opening the possibility of using
|
|
LLMs as automated annotators for technical documents.
|
|
|
|
The ``LLM-as-judge'' paradigm---using language models to evaluate or
|
|
rate text artifacts---has been systematically studied by
|
|
\citet{zheng2023}, who introduced MT-Bench and Chatbot Arena to
|
|
evaluate LLM judges against human preferences. Their work establishes
|
|
both the promise (high correlation with human judgment on structured
|
|
evaluation tasks) and the limitations (position bias, verbosity bias,
|
|
self-enhancement bias) of LLM-based evaluation. Our use of Claude as a
|
|
rater for IETF drafts follows this paradigm, with the specific
|
|
limitation that no human calibration study has been performed on our
|
|
rating outputs (see Section~\ref{sec:limitations}).
|
|
|
|
Embedding-based document similarity using models such as
|
|
Sentence-BERT~\citep{nussbaumer2024} and its successors has become
|
|
standard practice for document clustering and retrieval. We use
|
|
nomic-embed-text~\citep{nomic2024}, a general-purpose text embedding
|
|
model, for computing pairwise cosine similarity across the draft corpus.
|
|
The resulting similarity matrix enables both cluster detection and
|
|
visualization via t-SNE~\citep{vandermaaten2008}.
|
|
|
|
\subsection{Technology Landscape Surveys}
|
|
|
|
Technology landscape mapping---the systematic identification and
|
|
organization of technical activities within a domain---has a long
|
|
history in foresight and innovation studies.
|
|
\citet{porter2005} introduced ``tech mining'' as a methodology for
|
|
extracting competitive intelligence from patent and publication
|
|
databases. \citet{roper2011} extended these methods to broader
|
|
technology management contexts. Our work adapts these approaches to
|
|
the standards domain, replacing patent databases with the IETF
|
|
Datatracker and augmenting keyword-based search with LLM-driven
|
|
semantic analysis.
|
|
|
|
The AI agent research community has produced several recent surveys.
|
|
\citet{wang2024} and \citet{xi2023} survey the rapidly growing
|
|
literature on LLM-based autonomous agents, covering architectures,
|
|
capabilities, and evaluation. These academic surveys focus on
|
|
research contributions; our work complements them by mapping the
|
|
parallel standardization effort, where research ideas meet the
|
|
engineering constraints of Internet protocol design.
|
|
|
|
The multi-agent systems (MAS) research tradition, surveyed
|
|
comprehensively by \citet{wooldridge2009} and \citet{dorri2018},
|
|
provides historical context. The FIPA Agent Communication
|
|
Language~\citep{fipa-acl} and Agent Management
|
|
Specification~\citep{fipa-ams}, developed between 1996 and 2005,
|
|
addressed many of the same problems---agent discovery, communication
|
|
protocols, platform interoperability---that the current IETF drafts
|
|
tackle. The near-complete absence of FIPA references in the
|
|
contemporary IETF corpus suggests limited awareness of this prior art,
|
|
a finding we quantify in Section~\ref{sec:results}.
|
|
|
|
|
|
% =========================================================================
|
|
\section{Methodology}
|
|
\label{sec:method}
|
|
% =========================================================================
|
|
|
|
The analysis pipeline consists of six sequential stages, each building
|
|
on the output of the previous. All intermediate results are stored in
|
|
a SQLite database (28\,MB) with FTS5 full-text search, enabling both
|
|
pipeline idempotency and ad-hoc querying. The complete pipeline is
|
|
implemented as a Python CLI tool (approximately 6,100 lines across 12
|
|
modules) using Click, httpx, the Anthropic SDK, and Ollama.
|
|
|
|
\subsection{Data Collection}
|
|
\label{sec:datacollection}
|
|
|
|
\subsubsection{Corpus Construction}
|
|
|
|
Drafts were retrieved from the IETF Datatracker
|
|
API\footnote{\url{https://datatracker.ietf.org/api/v1/doc/document/}}
|
|
using keyword search across both draft names
|
|
(\texttt{name\_\_contains}) and abstracts
|
|
(\texttt{abstract\_\_contains}). Twelve search terms were used:
|
|
\textit{agent}, \textit{ai-agent}, \textit{agentic},
|
|
\textit{autonomous}, \textit{mcp}, \textit{inference},
|
|
\textit{generative}, \textit{intelligent}, \textit{large language
|
|
model}, \textit{multi-agent}, and \textit{trustworth}.
|
|
Only drafts with \texttt{type\_\_slug=draft} and submission date
|
|
$\geq$~2024-01-01 were included. Full text was downloaded from the
|
|
IETF archive.\footnote{\url{https://www.ietf.org/archive/id/}}
|
|
|
|
The keyword set was expanded iteratively. An initial set of 6 keywords
|
|
yielded 260 drafts; adding 6 further terms captured 174 additional
|
|
drafts in categories initially underrepresented, including MCP-related
|
|
work, generative AI infrastructure, and the nascent \texttt{aipref}
|
|
working group. A polite delay of 0.5\,seconds was applied between API
|
|
requests.
|
|
|
|
The resulting corpus contains 475 drafts. After false-positive
|
|
filtering (removing drafts about ``user agents,'' ``autonomous
|
|
systems'' in routing, and other non-AI uses of matched keywords), 361
|
|
drafts were retained as AI/agent-relevant based on a relevance
|
|
rating threshold.
|
|
|
|
\subsubsection{Supplementary Standards Bodies}
|
|
|
|
To contextualize the IETF landscape, we ingested a supplementary
|
|
corpus of standards and specifications from five additional bodies:
|
|
ISO/IEC (including ISO~22989~\citep{iso22989} and
|
|
ISO~42001~\citep{iso42001}), ITU-T (including
|
|
Y.3172~\citep{itu-y3172}), ETSI (ENI, ZSM), W3C (Web of Things,
|
|
Verifiable Credentials, WebNN), and NIST (AI RMF~\citep{nist-ai-rmf}).
|
|
These documents were included in the gap analysis (Section~\ref{sec:gaps})
|
|
to identify areas where non-IETF bodies provide coverage that the IETF
|
|
corpus lacks, and vice versa.
|
|
|
|
\subsubsection{Author and Affiliation Data}
|
|
|
|
Author records were fetched from the Datatracker's
|
|
\texttt{documentauthor} and \texttt{person} endpoints. Organizational
|
|
affiliations were normalized using a hand-curated alias table of 40+
|
|
mappings (e.g., ``Huawei Technologies Co., Ltd.''
|
|
$\rightarrow$~``Huawei'') supplemented by automatic suffix stripping
|
|
for common corporate suffixes.
|
|
|
|
\subsection{LLM-Based Analysis}
|
|
\label{sec:llm-analysis}
|
|
|
|
\subsubsection{Multi-Dimensional Rating}
|
|
|
|
Each draft was rated by Claude (Anthropic; Sonnet model) on five
|
|
dimensions using a structured prompt containing the draft's name,
|
|
title, submission date, page count, and abstract (truncated to 2,000
|
|
characters). The five rating dimensions are:
|
|
|
|
\begin{itemize}[nosep]
|
|
\item \textbf{Novelty} (1--5): Originality relative to existing
|
|
standards and proposals.
|
|
\item \textbf{Maturity} (1--5): Completeness of the technical
|
|
specification.
|
|
\item \textbf{Overlap} (1--5): Redundancy with other known drafts
|
|
(5 indicates near-duplication).
|
|
\item \textbf{Momentum} (1--5): Community engagement, revisions,
|
|
and working group adoption signals.
|
|
\item \textbf{Relevance} (1--5): Importance to the AI/agent
|
|
ecosystem specifically.
|
|
\end{itemize}
|
|
|
|
The prompt instructs Claude to return structured JSON with integer
|
|
scores and brief justification notes for each dimension, plus a 2--3
|
|
sentence summary and one or more category labels drawn from a
|
|
predefined taxonomy of 11 categories (Table~\ref{tab:categories}).
|
|
A composite quality score is computed as the arithmetic mean of
|
|
novelty, maturity, momentum, and relevance (excluding overlap, which
|
|
measures redundancy rather than quality).
|
|
|
|
To reduce API costs, drafts were rated in batches of five using a
|
|
batch prompt variant. Each draft's abstract was truncated to 1,500
|
|
characters in batch mode. All API responses were cached in an
|
|
\texttt{llm\_cache} table keyed by SHA-256 hash of the full prompt,
|
|
making the pipeline idempotent on re-runs.
|
|
|
|
\subsubsection{Idea Extraction}
|
|
|
|
Discrete technical ideas---mechanisms, protocols, architectural
|
|
patterns, extensions, and requirements---were extracted from each
|
|
draft using Claude. For individual extraction, the prompt included
|
|
the abstract and the first 3,000 characters of full text (Sonnet
|
|
model). For batch extraction, groups of five drafts were processed
|
|
per API call using the cheaper Haiku model with abstracts truncated
|
|
to 800 characters. The prompt requested 1--4 top-level novel
|
|
contributions per draft, with explicit instructions to merge
|
|
sub-features into parent ideas and to return an empty array for
|
|
drafts lacking substantive technical content.
|
|
|
|
Extracted ideas were deduplicated within each draft using
|
|
embedding-based cosine similarity (threshold~0.85), removing ideas
|
|
that were restatements of the same concept. Cross-draft idea overlap
|
|
was analyzed using Python's \texttt{SequenceMatcher} with a fuzzy
|
|
matching threshold of~0.75 on idea titles, enabling detection of
|
|
convergent ideas across organizational boundaries.
|
|
|
|
\subsubsection{Gap Analysis}
|
|
|
|
A single Claude Sonnet call received a compressed landscape summary
|
|
containing category distribution counts, the 20 most frequently
|
|
occurring idea titles, overlap cluster statistics, and summaries of
|
|
relevant non-IETF standards. The prompt instructed the model to
|
|
identify 8--15 standardization gaps---areas, problems, or technical
|
|
challenges not adequately addressed by the existing corpus---with
|
|
structured output including topic, description, severity rating
|
|
(critical/high/medium/low), evidence, and partial coverage from
|
|
existing standards.
|
|
|
|
\subsection{Embedding and Clustering}
|
|
\label{sec:embedding}
|
|
|
|
Vector embeddings were generated locally using Ollama with the
|
|
nomic-embed-text model~\citep{nomic2024}. For each draft, the input
|
|
combined the title, abstract, and first 4,000 characters of full text
|
|
(when available), producing a 768-dimensional vector stored as a
|
|
binary blob in SQLite.
|
|
|
|
Pairwise cosine similarity was computed across all embedded drafts,
|
|
producing an $n \times n$ similarity matrix (cached to disk as a
|
|
NumPy array). Clustering used a greedy single-linkage algorithm: for
|
|
each unvisited draft, all unvisited drafts with cosine similarity
|
|
$\geq \tau$ to the seed were added to its cluster. Three empirically
|
|
determined thresholds were applied:
|
|
|
|
\begin{itemize}[nosep]
|
|
\item $\tau = 0.85$: Topically overlapping drafts (42 clusters).
|
|
\item $\tau = 0.90$: Near-duplicates or same-author variants (34
|
|
clusters).
|
|
\item $\tau = 0.98$: Functionally identical drafts (25+ pairs).
|
|
\end{itemize}
|
|
|
|
These thresholds were selected by manual inspection of draft pairs at
|
|
each level; no systematic sensitivity analysis was performed (see
|
|
Section~\ref{sec:limitations}).
|
|
|
|
\subsection{Supplementary Analyses}
|
|
|
|
Three additional analysis passes operate on the stored data with zero
|
|
API cost:
|
|
|
|
\begin{enumerate}[nosep]
|
|
\item \textbf{RFC cross-references}: Regex-based extraction of
|
|
RFC, BCP, and draft citations from full text, yielding 4,231
|
|
cross-references across 360 drafts.
|
|
\item \textbf{Category trends}: SQL-based monthly breakdown of new
|
|
drafts per category with growth rates.
|
|
\item \textbf{Co-authorship network}: Team bloc detection via
|
|
pairwise author overlap ($\geq$70\% shared drafts, $\geq$2 shared
|
|
drafts), with connected components forming blocs.
|
|
\end{enumerate}
|
|
|
|
\subsection{Cost}
|
|
|
|
Table~\ref{tab:cost} summarizes the total pipeline cost for 475 drafts.
|
|
|
|
\begin{table}[H]
|
|
\centering
|
|
\caption{Pipeline cost breakdown.}
|
|
\label{tab:cost}
|
|
\begin{tabular}{llrr}
|
|
\toprule
|
|
\textbf{Stage} & \textbf{Model} & \textbf{Items} & \textbf{Cost (USD)} \\
|
|
\midrule
|
|
Rating & Claude Sonnet & 475 drafts & \$5.50--8.00 \\
|
|
Idea extract. & Claude Haiku & 475 drafts & \$0.80 \\
|
|
Gap analysis & Claude Sonnet & 1 call & \$0.20 \\
|
|
Embeddings & Ollama (local) & 475 drafts & \$0.00 \\
|
|
RFC refs & Regex (local) & 475 drafts & \$0.00 \\
|
|
Trends & SQL (local) & 475 drafts & \$0.00 \\
|
|
Idea overlap & SequenceMatcher & 501 ideas & \$0.00 \\
|
|
\midrule
|
|
\textbf{Total} & & & \textbf{\$6.50--9.00} \\
|
|
\bottomrule
|
|
\end{tabular}
|
|
\end{table}
|
|
|
|
|
|
% =========================================================================
|
|
\section{Results}
|
|
\label{sec:results}
|
|
% =========================================================================
|
|
|
|
\subsection{Corpus Overview and Growth Trajectory}
|
|
|
|
The final corpus comprises 475 Internet-Drafts submitted between
|
|
January~2024 and March~2026. After false-positive filtering (drafts
|
|
with relevance score $\leq$~2 or manually flagged), 361 drafts were
|
|
retained as substantively related to AI agents.
|
|
|
|
The growth trajectory is striking. In 2024, 9 AI/agent drafts were
|
|
submitted (0.5\% of 1,651 total IETF drafts). In 2025, 190 were
|
|
submitted (7.0\% of 2,696). In Q1~2026 alone, 162 were submitted
|
|
(9.3\% of 1,748). Monthly submissions followed a step function:
|
|
5~drafts in June~2025, 61 in October~2025, 85 in February~2026.
|
|
The acceleration has not plateaued as of March~2026.
|
|
|
|
\begin{table}[H]
|
|
\centering
|
|
\caption{Growth of AI/agent-related IETF Internet-Drafts.}
|
|
\label{tab:growth}
|
|
\begin{tabular}{rrrr}
|
|
\toprule
|
|
\textbf{Year} & \textbf{Total IETF} & \textbf{AI/Agent} & \textbf{Share (\%)} \\
|
|
\midrule
|
|
2024 & 1,651 & 9 & 0.5 \\
|
|
2025 & 2,696 & 190 & 7.0 \\
|
|
2026 (Q1) & 1,748 & 162 & 9.3 \\
|
|
\bottomrule
|
|
\end{tabular}
|
|
\end{table}
|
|
|
|
\subsection{Thematic Distribution}
|
|
\label{sec:categories}
|
|
|
|
Drafts were classified into 11 non-exclusive categories
|
|
(Table~\ref{tab:categories}). A single draft may belong to multiple
|
|
categories; percentages therefore exceed 100\%.
|
|
|
|
\begin{table}[H]
|
|
\centering
|
|
\caption{Category distribution across 475 drafts. Drafts may appear in
|
|
multiple categories.}
|
|
\label{tab:categories}
|
|
\begin{tabular}{lrr}
|
|
\toprule
|
|
\textbf{Category} & \textbf{Drafts} & \textbf{Share (\%)} \\
|
|
\midrule
|
|
Data formats / interoperability & 214 & 45 \\
|
|
Policy / governance & 214 & 45 \\
|
|
Agent identity / authentication & 160 & 34 \\
|
|
A2A protocols & 157 & 33 \\
|
|
Autonomous network operations & 124 & 26 \\
|
|
Agent discovery / registration & 89 & 19 \\
|
|
ML traffic management & 79 & 17 \\
|
|
Human--agent interaction & 57 & 12 \\
|
|
AI safety / alignment & 112 & 24 \\
|
|
Model serving / inference & 42 & 9 \\
|
|
Other AI/agent & -- & -- \\
|
|
\bottomrule
|
|
\end{tabular}
|
|
\end{table}
|
|
|
|
The dominance of infrastructure categories---data formats, identity,
|
|
communication protocols---is expected for an early-stage standards
|
|
effort. The comparatively low representation of safety/alignment and
|
|
human--agent interaction categories is a structural finding we examine
|
|
in Section~\ref{sec:safety-deficit}.
|
|
|
|
\subsection{The Capability-to-Safety Deficit}
|
|
\label{sec:safety-deficit}
|
|
|
|
The ratio of capability-building drafts (A2A protocols, autonomous
|
|
network operations, agent discovery, model serving) to safety-oriented
|
|
drafts (AI safety/alignment, human--agent interaction) is
|
|
approximately 4:1 on aggregate. This ratio varies significantly by
|
|
month, ranging from 1.5:1 in months with concentrated safety
|
|
submissions to over 20:1 in months dominated by protocol proposals.
|
|
|
|
The drafts that do address safety are among the highest-rated in the
|
|
corpus. The Verifiable Observation Logging for Transparency
|
|
(VOLT)~\citep{draft-cowles-volt} protocol scored 4.75/5.0 on the
|
|
four-dimension composite (excluding overlap), as did the Distributed
|
|
AI Accountability Protocol (DAAP)~\citep{draft-aylward-daap}. The
|
|
STAMP protocol~\citep{draft-guy-bary-stamp} for cryptographic
|
|
delegation and proof scored 4.5. The quality of safety-focused work
|
|
is high; the quantity is not.
|
|
|
|
An analysis of RFC cross-references reinforces this finding. Across
|
|
4,231 parsed citations, the most-referenced standards after the
|
|
boilerplate RFC~2119/8174 conventions are TLS~1.3~\citep{rfc8446}
|
|
(42 citations), OAuth~2.0~\citep{rfc6749} (36), HTTP
|
|
Semantics~\citep{rfc9110} (34), and JWT~\citep{rfc7519} (22). The
|
|
agent standards ecosystem is being constructed on the web's existing
|
|
security infrastructure---OAuth, TLS, HTTP, JWT---yet the safety
|
|
layer that should accompany this security foundation remains
|
|
underdeveloped.
|
|
|
|
\subsection{Protocol Fragmentation}
|
|
\label{sec:fragmentation}
|
|
|
|
Embedding-based similarity analysis reveals extensive duplication and
|
|
fragmentation across the corpus.
|
|
|
|
\subsubsection{Near-Duplicates}
|
|
|
|
At the 0.98 cosine similarity threshold, 25+ draft pairs are
|
|
functionally identical---the same proposal submitted under different
|
|
names, to different working groups, or as renamed revisions. A
|
|
taxonomy of near-duplicates includes: same draft submitted to
|
|
different working groups (14 pairs), renamed drafts (5), evolutionary
|
|
versions (3), and genuinely competing proposals from different
|
|
organizations (2+).
|
|
|
|
\subsubsection{Competing Clusters}
|
|
|
|
At the 0.85 threshold, 42 topical clusters emerge. The most crowded
|
|
is OAuth for AI agents, with 14 distinct proposals all addressing
|
|
how AI agents authenticate and receive authorization via the OAuth
|
|
framework. These range from broad profile proposals to narrow scope
|
|
extensions to comprehensive accountability systems. None are
|
|
interoperable.
|
|
|
|
The A2A protocol space encompasses 157 drafts with no
|
|
interoperability layer. The most common technical idea in the entire
|
|
extracted corpus---``Multi-Agent Communication Protocol''---appears
|
|
independently in 8 drafts from different teams. A 10-draft cluster
|
|
addresses agent gateway and multi-agent collaboration, with
|
|
approaches ranging from semantic routing gateways to cross-domain
|
|
interoperability frameworks.
|
|
|
|
\subsubsection{Causes of Fragmentation}
|
|
|
|
The data distinguishes three causes: (1)~working group shopping, where
|
|
authors submit the same draft to multiple working groups seeking
|
|
adoption; (2)~parallel invention, where isolated teams independently
|
|
solve the same problem; and (3)~strategic surface-area expansion,
|
|
where organizations submit multiple related drafts to maximize
|
|
presence in the standards landscape.
|
|
|
|
\subsection{Organizational Dynamics}
|
|
\label{sec:orgs}
|
|
|
|
\subsubsection{Concentration}
|
|
|
|
Authorship is heavily concentrated. Huawei leads with 53 authors
|
|
contributing to 69 drafts---approximately 16\% of the entire corpus
|
|
across all Huawei entities. China Mobile (24~authors, 35~drafts),
|
|
Cisco (24~authors, 26~drafts), and China Telecom (24~authors,
|
|
24~drafts) follow. Chinese-linked institutions (Huawei, China
|
|
Mobile, China Telecom, China Unicom, Tsinghua University, ZTE, BUPT,
|
|
and associated laboratories) collectively account for over 160
|
|
authors.
|
|
|
|
Western technology companies are dramatically underrepresented
|
|
relative to their market positions. Google is present with 5 authors
|
|
on 9 drafts. Microsoft, Apple, and Meta have minimal direct
|
|
participation. Amazon's 6 authors focus on post-quantum cryptography
|
|
rather than agent-specific work.
|
|
|
|
\subsubsection{Team Blocs}
|
|
|
|
Co-authorship analysis identifies 18 team blocs among the 713 authors,
|
|
covering approximately 25\% of all authors. The largest bloc is a
|
|
13-person Huawei team sharing 22 drafts with 94\% average cohesion
|
|
(measured as pairwise overlap of draft portfolios). The team's core
|
|
of 7 members each appear on 13--23 drafts.
|
|
|
|
Cross-organizational collaboration is sparse. The most productive
|
|
cross-team pair shares only 3 drafts. Chinese organizations form a
|
|
tightly linked ecosystem: Huawei--China Unicom shares 6 drafts,
|
|
Tsinghua--Zhongguancun Lab shares 5, China Mobile--ZTE shares 4.
|
|
European telecoms (Deutsche Telekom, Telef\'onica, Orange) act as
|
|
bridges between Chinese and Western institutions.
|
|
|
|
\subsection{Cross-Organization Convergence}
|
|
\label{sec:convergence}
|
|
|
|
Despite the fragmentation, significant latent consensus exists. Using
|
|
fuzzy title matching (\texttt{SequenceMatcher} at 0.75 threshold) on
|
|
the 501 extracted ideas, 132 ideas (approximately 33\% of unique idea
|
|
clusters) have been independently proposed by two or more organizations.
|
|
|
|
The strongest convergence signals include ``A2A Communication
|
|
Paradigm'' (proposed by 8 organizations from 5 countries),
|
|
``AI Agent Network Architecture'' (8 organizations), and
|
|
``Multi-Agent Communication Protocol'' (7 organizations). An
|
|
examination of organizational pairs reveals that 180 convergent ideas
|
|
cross the boundary between Chinese-linked and Western organizations,
|
|
indicating genuine cross-cultural consensus on technical directions
|
|
despite the sparse direct collaboration noted in
|
|
Section~\ref{sec:orgs}.
|
|
|
|
The coexistence of convergence and fragmentation has a specific
|
|
structure: organizations agree on \textit{what} needs building (the
|
|
convergent ideas) but disagree on \textit{how} to build it (the
|
|
competing protocol proposals). This gap between problem consensus and
|
|
solution divergence is where architectural coordination is most needed.
|
|
|
|
\subsection{Gap Analysis}
|
|
\label{sec:gaps}
|
|
|
|
The gap analysis identified 11 standardization gaps, distributed across
|
|
severity levels as shown in Table~\ref{tab:gaps}.
|
|
|
|
\begin{table}[H]
|
|
\centering
|
|
\caption{Identified standardization gaps by severity.}
|
|
\label{tab:gaps}
|
|
\begin{tabularx}{\textwidth}{llX}
|
|
\toprule
|
|
\textbf{Severity} & \textbf{Topic} & \textbf{Description} \\
|
|
\midrule
|
|
Critical & Agent legal liability &
|
|
No standard addresses liability assignment when autonomous agents
|
|
cause harm or make binding commitments across creators, operators,
|
|
and users. \\
|
|
Critical & Capability degradation detection &
|
|
No standard defines detection mechanisms for gradual capability
|
|
degradation due to concept drift, adversarial inputs, or model
|
|
corruption. \\
|
|
Critical & Emergency override protocols &
|
|
No standard defines distributed emergency-stop mechanisms for
|
|
autonomous agents exhibiting dangerous behavior across
|
|
multi-system deployments. \\
|
|
\midrule
|
|
High & Cross-domain identity portability &
|
|
Agents cannot maintain consistent identity across organizational
|
|
domains with different identity systems. \\
|
|
High & Real-time behavior explanation &
|
|
No standard for interactive, real-time explanations of agent
|
|
decision-making during operation. \\
|
|
High & Multi-agent conflict resolution &
|
|
No protocol for resolving conflicts when multiple agents have
|
|
competing objectives or contend for shared resources. \\
|
|
High & Inter-standards-body bridging &
|
|
Protocols from IETF, ITU-T, and ISO cannot interoperate, creating
|
|
silos across network, internet, and industrial domains. \\
|
|
High & Behavioral audit trails &
|
|
Missing standards for immutable, decision-level audit logs
|
|
supporting forensic analysis and regulatory compliance. \\
|
|
\midrule
|
|
Medium & Resource consumption limits &
|
|
No self-regulation standards for agent computational, network, and
|
|
energy resource usage. \\
|
|
Medium & Training data provenance &
|
|
Missing standards for tracking data lineage as it flows between
|
|
agents in federated learning scenarios. \\
|
|
Medium & Content attribution &
|
|
No cryptographic attribution standards for agent-generated content.\\
|
|
\bottomrule
|
|
\end{tabularx}
|
|
\end{table}
|
|
|
|
The three critical gaps share a common theme: they address what happens
|
|
when autonomous agents fail or misbehave. The capability-building
|
|
majority of the corpus assumes cooperative, well-functioning agent
|
|
systems; the critical gaps expose the absence of standards for the
|
|
adversarial, degraded, and emergency cases that inevitably arise in
|
|
production deployment.
|
|
|
|
Cross-referencing gaps with extracted ideas quantifies the coverage
|
|
deficit. The ``emergency override'' gap has only 15 partially
|
|
addressing ideas across the corpus. The ``multi-agent conflict
|
|
resolution'' and ``inter-standards-body bridging'' gaps have zero
|
|
directly related extracted ideas---they are entirely unaddressed.
|
|
|
|
|
|
% =========================================================================
|
|
\section{Discussion}
|
|
\label{sec:discussion}
|
|
% =========================================================================
|
|
|
|
\subsection{Implications for Standardization Strategy}
|
|
|
|
The landscape reveals a standards ecosystem in a characteristic
|
|
early-stage pattern: rapid expansion, parallel invention, and
|
|
insufficient coordination. The IETF has navigated such patterns
|
|
before---the early web, IoT, DNS security---and the historical
|
|
resolution involves convergence of competing proposals, working group
|
|
consolidation, and the emergence of a small number of lasting
|
|
standards from a large initial field.
|
|
|
|
Three strategic priorities emerge from the data:
|
|
|
|
\textbf{Safety-first coordination.} The 4:1 capability-to-safety
|
|
ratio is a structural risk. The critical gaps---behavioral verification,
|
|
capability degradation detection, emergency override---are precisely
|
|
the areas where standardization failure has the highest real-world
|
|
consequence. Unlike protocol fragmentation, which causes confusion and
|
|
implementation cost, safety gaps create liability and harm. The
|
|
EU AI Act~\citep{eu-ai-act}, which mandates real-time explainability
|
|
and human oversight for high-risk AI systems, will make several of
|
|
these gaps regulatory obligations rather than optional best practices.
|
|
|
|
\textbf{Architectural connective tissue.} The landscape needs not more
|
|
protocols but a shared execution model. The convergence data shows that
|
|
organizations agree on the components; they disagree on the
|
|
integration. Proposals like VOLT~\citep{draft-cowles-volt} (execution
|
|
traces), DAAP~\citep{draft-aylward-daap} (accountability),
|
|
STAMP~\citep{draft-guy-bary-stamp} (cryptographic delegation), and
|
|
Verifiable Agent Conversations~\citep{draft-birkholz-vac} (signed
|
|
conversation records) address complementary parts of the same
|
|
architectural problem. An overarching agent execution architecture
|
|
that composes these components would accelerate convergence more
|
|
effectively than continued parallel invention.
|
|
|
|
\textbf{Cross-organization coordination.} The team bloc structure
|
|
produces drafts that are internally consistent but externally
|
|
incompatible. The 18 detected blocs function as islands; the bridges
|
|
between them are thin. Mechanisms that encourage cross-bloc
|
|
collaboration---joint design teams, interop testing events,
|
|
shared reference implementations---are more likely to produce lasting
|
|
standards than the current pattern of parallel submission.
|
|
|
|
\subsection{Relationship to Prior Agent Standards}
|
|
|
|
A notable finding is the near-complete absence of references to FIPA
|
|
(Foundation for Intelligent Physical Agents) in the contemporary IETF
|
|
corpus. FIPA's Agent Communication Language~\citep{fipa-acl} and Agent
|
|
Management Specification~\citep{fipa-ams}, developed between 1996 and
|
|
2005, addressed agent discovery, communication, platform
|
|
interoperability, and interaction protocols---the same problem space
|
|
that the current wave of IETF drafts tackles.
|
|
|
|
The absence of FIPA references does not necessarily indicate ignorance;
|
|
the web-native technical context of 2025 differs substantially from the
|
|
Java/CORBA context of 2002. However, the recurrence of problems
|
|
FIPA addressed (agent naming, message semantics, directory services,
|
|
interaction protocols) suggests that explicit engagement with the
|
|
FIPA legacy could help the IETF community avoid re-learning lessons
|
|
from two decades ago.
|
|
|
|
\subsection{Limitations}
|
|
\label{sec:limitations}
|
|
|
|
The methodology has several limitations that affect the confidence and
|
|
generalizability of the findings.
|
|
|
|
\textbf{LLM-as-judge validity.} All quality ratings are generated by a
|
|
single LLM (Claude Sonnet) from draft abstracts truncated to 2,000
|
|
characters. No human calibration study has been performed; no
|
|
inter-rater reliability is established. The ratings should be treated
|
|
as relative rankings within this corpus, not absolute quality measures.
|
|
Maturity scores are particularly affected by abstract-only input, as
|
|
abstracts may not convey the full technical depth of a specification.
|
|
The overlap dimension is limited because Claude rates each draft
|
|
independently without access to the full corpus, meaning it reflects
|
|
the model's general knowledge rather than corpus-specific similarity.
|
|
A validation study using domain expert ratings on a sample of 25--30
|
|
drafts would substantially strengthen confidence.
|
|
|
|
\textbf{Corpus selection bias.} Keyword-based selection introduces both
|
|
false positives (``agent'' matching ``user agent,'' ``autonomous''
|
|
matching ``autonomous systems'' in routing) and false negatives
|
|
(relevant drafts using terminology outside the keyword set). We
|
|
estimate 30--50 false positives remain despite relevance filtering.
|
|
The temporal cutoff of January~2024 excludes earlier foundational work.
|
|
|
|
\textbf{Clustering thresholds.} The similarity thresholds (0.85, 0.90,
|
|
0.98) are empirically chosen by manual inspection, not derived from
|
|
principled analysis. The embedding model (nomic-embed-text) is a
|
|
general-purpose model not fine-tuned for standards document similarity.
|
|
Sensitivity analysis across thresholds and comparison with alternative
|
|
clustering methods (DBSCAN, hierarchical agglomerative) would
|
|
strengthen the clustering results.
|
|
|
|
\textbf{Gap analysis methodology.} Gap identification relies on a
|
|
single-shot LLM analysis of compressed landscape statistics, not
|
|
systematic comparison against a reference taxonomy. A rigorous
|
|
approach would compare the corpus against an explicit reference
|
|
architecture such as NIST AI RMF~\citep{nist-ai-rmf}, the FIPA agent
|
|
platform model, or a purpose-built agent ecosystem reference model.
|
|
Gap severity is assigned by Claude without defined quantitative
|
|
thresholds.
|
|
|
|
\textbf{Idea extraction consistency.} Batch extraction using Haiku
|
|
with abstract-only input produces different results from individual
|
|
extraction using Sonnet with full text. No precision/recall measurement
|
|
has been performed. The extraction prompt limits output to 1--4 ideas
|
|
per draft, potentially under-counting contributions from comprehensive
|
|
specifications.
|
|
|
|
\textbf{Organizational normalization.} Cross-organization analysis
|
|
depends on the accuracy of a hand-curated alias table. Boundary cases
|
|
(e.g., joint ventures, university--industry affiliations, subsidiary
|
|
relationships) introduce judgment calls that affect concentration
|
|
statistics.
|
|
|
|
Despite these limitations, the findings are robust in their broad
|
|
contours: the growth trajectory, the safety deficit, the protocol
|
|
fragmentation, and the organizational concentration are visible
|
|
across multiple analytical methods and are not sensitive to the
|
|
specific threshold or model choices within reasonable ranges.
|
|
|
|
\subsection{Reproducibility and Openness}
|
|
|
|
The complete pipeline, database, and derived reports are released as
|
|
open-source software (the IETF Draft Analyzer). The SQLite database
|
|
contains all raw data, ratings, embeddings, ideas, gaps, author
|
|
records, and cached LLM responses, enabling independent verification
|
|
of every finding reported in this paper. The caching mechanism ensures
|
|
that re-running the pipeline produces identical results without
|
|
additional API cost.
|
|
|
|
|
|
% =========================================================================
|
|
\section{Conclusion}
|
|
\label{sec:conclusion}
|
|
% =========================================================================
|
|
|
|
We have presented a systematic, LLM-assisted analysis of the IETF's
|
|
AI-agent standardization landscape, covering 475 Internet-Drafts from
|
|
713 authors across more than 230 organizations. The analysis reveals a
|
|
standards ecosystem experiencing unprecedented growth---from 0.5\% to
|
|
9.3\% of all IETF submissions in fifteen months---accompanied by
|
|
significant structural challenges.
|
|
|
|
The capability-to-safety ratio of approximately 4:1, the extreme
|
|
protocol fragmentation (14 competing OAuth proposals, 155 A2A drafts
|
|
with no interoperability layer), and the concentration of authorship
|
|
(one vendor contributing $\sim$16\% of all drafts) are findings that
|
|
have direct implications for the trajectory of AI-agent
|
|
standardization. The 11 identified gaps, with three critical gaps
|
|
centered on what happens when agents fail, highlight the areas where
|
|
standardization effort is most urgently needed.
|
|
|
|
At the same time, the 132 cross-organization convergent ideas
|
|
demonstrate that latent consensus exists beneath the fragmentation.
|
|
Organizations agree on the problems; they disagree on the solutions.
|
|
This gap between problem consensus and solution divergence defines the
|
|
current phase of the standards race and points toward the needed
|
|
intervention: not more protocol proposals, but architectural
|
|
connective tissue that composes the existing high-quality components
|
|
into a coherent ecosystem.
|
|
|
|
The methodology itself contributes a replicable, cost-effective
|
|
approach to standards landscape analysis. At \$9--15 total, the
|
|
pipeline demonstrates that LLM-assisted document analysis at scale is
|
|
practical for research and policy applications. The explicit
|
|
documentation of limitations---no human calibration, empirical
|
|
thresholds, single-judge ratings---provides a template for the
|
|
responsible use of LLM-as-judge methodologies in technical document
|
|
analysis.
|
|
|
|
The IETF has navigated standardization sprints before, and the lasting
|
|
standards have consistently emerged from efforts that prioritized
|
|
interoperability and safety alongside capability. Whether the current
|
|
AI-agent wave follows this historical pattern depends on whether the
|
|
community can shift from parallel invention to coordinated
|
|
architecture before the capability work ships without the safety work
|
|
that should accompany it.
|
|
|
|
|
|
% =========================================================================
|
|
% References
|
|
% =========================================================================
|
|
\bibliographystyle{plainnat}
|
|
\bibliography{ietf-refs}
|
|
|
|
\end{document}
|