ietf-draft-analyzer/paper/ietf-landscape.tex

\documentclass[11pt,a4paper]{article}

\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{lmodern}
\usepackage[margin=2.5cm]{geometry}
\usepackage{amsmath,amssymb}
\usepackage{graphicx}
\usepackage{booktabs}
\usepackage{tabularx}
\usepackage{hyperref}
\usepackage{xcolor}
\usepackage{natbib}
\usepackage{enumitem}
\usepackage{float}
\usepackage{caption}

\hypersetup{
  colorlinks=true,
  linkcolor=blue!60!black,
  citecolor=green!50!black,
  urlcolor=blue!70!black,
}

\setlength{\parskip}{0.4em}
\setlength{\parindent}{0em}

\title{%
  Mapping the AI-Agent Standardization Landscape:\\
  An LLM-Assisted Analysis of IETF Internet-Drafts%
}
\author{
  Christian Nennemann\\
  Independent Researcher\\
  \texttt{write@nennemann.de}
}
\date{April 2026}

\begin{document}
\maketitle

\begin{abstract}
The Internet Engineering Task Force (IETF) is experiencing an unprecedented
surge in standardization activity around AI agents. Between January~2024 and
March~2026, AI- and agent-related Internet-Drafts grew from 0.5\% to 9.3\%
of all IETF submissions. We present a systematic, LLM-assisted analysis of
this landscape, covering 475 drafts from 713 authors across more than 230
organizations. Our pipeline combines keyword-based corpus construction from
the IETF Datatracker API, multi-dimensional quality rating via Claude
(Anthropic) as an LLM-as-judge, semantic embedding and clustering via a
local embedding model (nomic-embed-text), LLM-based extraction of 501
discrete technical ideas, and gap analysis against the assembled corpus.
Key findings include: (1)~a persistent capability-to-safety deficit, with
roughly four capability-building drafts for every safety-oriented one;
(2)~extreme protocol fragmentation, including 14~competing OAuth-for-agents
proposals and 155~agent-to-agent protocol drafts with no interoperability
layer; (3)~high organizational concentration, with a single vendor
contributing approximately 16\% of all drafts; (4)~132 cross-organization
convergent ideas independently proposed by multiple organizations, signaling
latent consensus beneath the fragmentation; and (5)~11 identified
standardization gaps, three rated critical, centered on behavioral
verification, capability degradation detection, and emergency override
protocols. The total analysis cost approximately \$9--15\,USD in API fees.
We discuss implications for AI-agent standardization strategy, the
limitations of LLM-as-judge methodologies applied to technical document
corpora, and organizational dynamics shaping the standards landscape.
\end{abstract}

\textbf{Keywords:} IETF, Internet-Drafts, AI agents, standardization,
LLM-as-judge, landscape analysis, multi-agent systems, protocol
fragmentation


% =========================================================================
\section{Introduction}
\label{sec:intro}
% =========================================================================

The deployment of autonomous AI agents---software systems that perceive
their environment, make decisions, and take actions with limited human
supervision---has accelerated dramatically since 2023. Commercial
offerings from Anthropic, Google, OpenAI, and others have moved AI agents
from research prototypes to production systems that browse the web,
execute code, manage cloud infrastructure, and interact with external
services on behalf of users. This proliferation raises fundamental
questions about identity, authentication, delegation, safety, and
interoperability that fall squarely within the purview of Internet
standards bodies.

The IETF, responsible for the core protocols of the Internet, has
responded with an extraordinary burst of activity. In 2024, just 9
AI- or agent-related Internet-Drafts were submitted---0.5\% of all
submissions. By the first quarter of 2026, that figure reached 9.3\%:
nearly one in ten new drafts addressed AI agents in some capacity.
Monthly submissions surged from 5 in June~2025 to 85 in February~2026,
a growth rate without precedent in the IETF's recent history.

This rapid expansion creates an analytical challenge. The volume of
drafts, the diversity of working groups involved, the overlapping scope
of competing proposals, and the speed of new submissions make manual
tracking infeasible. A standards participant seeking to understand the
landscape---which problems are being addressed, which are being
neglected, where proposals converge and where they conflict---faces a
corpus of hundreds of technical documents evolving on a weekly basis.

We address this challenge with an LLM-assisted analysis pipeline that
automates the collection, rating, clustering, idea extraction, and gap
identification for the full corpus of AI-agent-related IETF
Internet-Drafts. The pipeline combines three complementary analytical
approaches: (1)~LLM-as-judge rating of drafts on five quality
dimensions, using Claude (Anthropic) with structured prompts;
(2)~embedding-based semantic similarity and clustering, using a locally
hosted nomic-embed-text model via Ollama; and (3)~LLM-based extraction
of discrete technical ideas and identification of landscape gaps.

Our contributions are:

\begin{itemize}[nosep]
  \item A comprehensive, quantitative map of the IETF's AI-agent
    standardization landscape as of March~2026, covering 475 drafts,
    713 authors, 501 extracted technical ideas, and 11 identified gaps.
  \item A replicable, cost-effective methodology for LLM-assisted
    standards corpus analysis (\$9--15 total), with explicit
    documentation of limitations and methodological caveats.
  \item Empirical findings on organizational concentration,
    protocol fragmentation, cross-organization convergence, and
    the capability-to-safety imbalance in the current landscape.
  \item An open-source tool (the IETF Draft Analyzer) that makes the
    pipeline, database, and all derived reports available for
    independent verification and extension.
\end{itemize}

The remainder of this paper is organized as follows.
Section~\ref{sec:related} reviews related work on standards landscape
analysis, NLP for technical documents, and technology mapping.
Section~\ref{sec:method} describes the data collection and analysis
pipeline in detail. Section~\ref{sec:results} presents our findings
across five analytical dimensions. Section~\ref{sec:discussion}
discusses implications, limitations, and organizational dynamics.
Section~\ref{sec:conclusion} concludes.


% =========================================================================
\section{Related Work}
\label{sec:related}
% =========================================================================

Our work sits at the intersection of three research areas: standards
ecosystem analysis, NLP applied to technical document corpora, and
technology landscape mapping.

\subsection{Standards Analysis}

The economics and dynamics of technical standardization have been
studied extensively. \citet{simcoe2012} analyzes consensus governance
in standard-setting committees, showing how committee structure
influences the trajectory of shared technology platforms.
\citet{blind2017} examine the impact of standards and regulation on
innovation in uncertain markets, a framing directly applicable to the
nascent AI-agent ecosystem where both the technology and the regulatory
environment are in flux. \citet{lerner2014} study standard-essential
patents, a concern that is beginning to surface in the AI-agent space
as organizations file IPR declarations on agent-related protocols.

Prior quantitative analyses of IETF activity have typically focused on
participation patterns, working group dynamics, or the trajectory of
individual RFCs through the standards process. Our work differs in
scope: rather than analyzing the IETF as an institution, we analyze a
specific cross-cutting topic (AI agents) that spans multiple working
groups and is evolving too rapidly for traditional manual survey methods.

\subsection{NLP for Technical Documents}

The application of natural language processing to technical and legal
document corpora has expanded significantly with the advent of large
language models. \citet{devlin2019} introduced BERT-based approaches
that enabled transfer learning for domain-specific text
classification. More recently, \citet{brown2020} demonstrated that
large language models exhibit strong few-shot and zero-shot performance
on diverse text understanding tasks, opening the possibility of using
LLMs as automated annotators for technical documents.

The ``LLM-as-judge'' paradigm---using language models to evaluate or
rate text artifacts---has been systematically studied by
\citet{zheng2023}, who introduced MT-Bench and Chatbot Arena to
evaluate LLM judges against human preferences. Their work establishes
both the promise (high correlation with human judgment on structured
evaluation tasks) and the limitations (position bias, verbosity bias,
self-enhancement bias) of LLM-based evaluation. Our use of Claude as a
rater for IETF drafts follows this paradigm, with the specific
limitation that no human calibration study has been performed on our
rating outputs (see Section~\ref{sec:limitations}).

Embedding-based document similarity using models such as
Sentence-BERT~\citep{nussbaumer2024} and its successors has become
standard practice for document clustering and retrieval. We use
nomic-embed-text~\citep{nomic2024}, a general-purpose text embedding
model, for computing pairwise cosine similarity across the draft corpus.
The resulting similarity matrix enables both cluster detection and
visualization via t-SNE~\citep{vandermaaten2008}.

\subsection{Technology Landscape Surveys}

Technology landscape mapping---the systematic identification and
organization of technical activities within a domain---has a long
history in foresight and innovation studies.
\citet{porter2005} introduced ``tech mining'' as a methodology for
extracting competitive intelligence from patent and publication
databases. \citet{roper2011} extended these methods to broader
technology management contexts. Our work adapts these approaches to
the standards domain, replacing patent databases with the IETF
Datatracker and augmenting keyword-based search with LLM-driven
semantic analysis.

The AI agent research community has produced several recent surveys.
\citet{wang2024} and \citet{xi2023} survey the rapidly growing
literature on LLM-based autonomous agents, covering architectures,
capabilities, and evaluation. These academic surveys focus on
research contributions; our work complements them by mapping the
parallel standardization effort, where research ideas meet the
engineering constraints of Internet protocol design.

The multi-agent systems (MAS) research tradition, surveyed
comprehensively by \citet{wooldridge2009} and \citet{dorri2018},
provides historical context. The FIPA Agent Communication
Language~\citep{fipa-acl} and Agent Management
Specification~\citep{fipa-ams}, developed between 1996 and 2005,
addressed many of the same problems---agent discovery, communication
protocols, platform interoperability---that the current IETF drafts
tackle. The near-complete absence of FIPA references in the
contemporary IETF corpus suggests limited awareness of this prior art,
a finding we quantify in Section~\ref{sec:results}.


% =========================================================================
\section{Methodology}
\label{sec:method}
% =========================================================================

The analysis pipeline consists of six sequential stages, each building
on the output of the previous. All intermediate results are stored in
a SQLite database (28\,MB) with FTS5 full-text search, enabling both
pipeline idempotency and ad-hoc querying. The complete pipeline is
implemented as a Python CLI tool (approximately 6,100 lines across 12
modules) using Click, httpx, the Anthropic SDK, and Ollama.

\subsection{Data Collection}
\label{sec:datacollection}

\subsubsection{Corpus Construction}

Drafts were retrieved from the IETF Datatracker
API\footnote{\url{https://datatracker.ietf.org/api/v1/doc/document/}}
using keyword search across both draft names
(\texttt{name\_\_contains}) and abstracts
(\texttt{abstract\_\_contains}). Twelve search terms were used:
\textit{agent}, \textit{ai-agent}, \textit{agentic},
\textit{autonomous}, \textit{mcp}, \textit{inference},
\textit{generative}, \textit{intelligent}, \textit{large language
model}, \textit{multi-agent}, and \textit{trustworth}.
Only drafts with \texttt{type\_\_slug=draft} and submission date
$\geq$~2024-01-01 were included. Full text was downloaded from the
IETF archive.\footnote{\url{https://www.ietf.org/archive/id/}}

The keyword set was expanded iteratively. An initial set of 6 keywords
yielded 260 drafts; adding 6 further terms captured 174 additional
drafts in categories initially underrepresented, including MCP-related
work, generative AI infrastructure, and the nascent \texttt{aipref}
working group. A polite delay of 0.5\,seconds was applied between API
requests.

The resulting corpus contains 475 drafts. After false-positive
filtering (removing drafts about ``user agents,'' ``autonomous
systems'' in routing, and other non-AI uses of matched keywords), 361
drafts were retained as AI/agent-relevant based on a relevance
rating threshold.

\subsubsection{Supplementary Standards Bodies}

To contextualize the IETF landscape, we ingested a supplementary
corpus of standards and specifications from five additional bodies:
ISO/IEC (including ISO~22989~\citep{iso22989} and
ISO~42001~\citep{iso42001}), ITU-T (including
Y.3172~\citep{itu-y3172}), ETSI (ENI, ZSM), W3C (Web of Things,
Verifiable Credentials, WebNN), and NIST (AI RMF~\citep{nist-ai-rmf}).
These documents were included in the gap analysis (Section~\ref{sec:gaps})
to identify areas where non-IETF bodies provide coverage that the IETF
corpus lacks, and vice versa.

\subsubsection{Author and Affiliation Data}

Author records were fetched from the Datatracker's
\texttt{documentauthor} and \texttt{person} endpoints. Organizational
affiliations were normalized using a hand-curated alias table of 40+
mappings (e.g., ``Huawei Technologies Co., Ltd.''
$\rightarrow$~``Huawei'') supplemented by automatic suffix stripping
for common corporate suffixes.

\subsection{LLM-Based Analysis}
\label{sec:llm-analysis}

\subsubsection{Multi-Dimensional Rating}

Each draft was rated by Claude (Anthropic; Sonnet model) on five
dimensions using a structured prompt containing the draft's name,
title, submission date, page count, and abstract (truncated to 2,000
characters). The five rating dimensions are:

\begin{itemize}[nosep]
  \item \textbf{Novelty} (1--5): Originality relative to existing
    standards and proposals.
  \item \textbf{Maturity} (1--5): Completeness of the technical
    specification.
  \item \textbf{Overlap} (1--5): Redundancy with other known drafts
    (5 indicates near-duplication).
  \item \textbf{Momentum} (1--5): Community engagement, revisions,
    and working group adoption signals.
  \item \textbf{Relevance} (1--5): Importance to the AI/agent
    ecosystem specifically.
\end{itemize}

The prompt instructs Claude to return structured JSON with integer
scores and brief justification notes for each dimension, plus a 2--3
sentence summary and one or more category labels drawn from a
predefined taxonomy of 11 categories (Table~\ref{tab:categories}).
A composite quality score is computed as the arithmetic mean of
novelty, maturity, momentum, and relevance (excluding overlap, which
measures redundancy rather than quality).

To reduce API costs, drafts were rated in batches of five using a
batch prompt variant. Each draft's abstract was truncated to 1,500
characters in batch mode. All API responses were cached in an
\texttt{llm\_cache} table keyed by SHA-256 hash of the full prompt,
making the pipeline idempotent on re-runs.

\subsubsection{Idea Extraction}

Discrete technical ideas---mechanisms, protocols, architectural
patterns, extensions, and requirements---were extracted from each
draft using Claude. For individual extraction, the prompt included
the abstract and the first 3,000 characters of full text (Sonnet
model). For batch extraction, groups of five drafts were processed
per API call using the cheaper Haiku model with abstracts truncated
to 800 characters. The prompt requested 1--4 top-level novel
contributions per draft, with explicit instructions to merge
sub-features into parent ideas and to return an empty array for
drafts lacking substantive technical content.

Extracted ideas were deduplicated within each draft using
embedding-based cosine similarity (threshold~0.85), removing ideas
that were restatements of the same concept. Cross-draft idea overlap
was analyzed using Python's \texttt{SequenceMatcher} with a fuzzy
matching threshold of~0.75 on idea titles, enabling detection of
convergent ideas across organizational boundaries.

\subsubsection{Gap Analysis}

A single Claude Sonnet call received a compressed landscape summary
containing category distribution counts, the 20 most frequently
occurring idea titles, overlap cluster statistics, and summaries of
relevant non-IETF standards. The prompt instructed the model to
identify 8--15 standardization gaps---areas, problems, or technical
challenges not adequately addressed by the existing corpus---with
structured output including topic, description, severity rating
(critical/high/medium/low), evidence, and partial coverage from
existing standards.

\subsection{Embedding and Clustering}
\label{sec:embedding}

Vector embeddings were generated locally using Ollama with the
nomic-embed-text model~\citep{nomic2024}. For each draft, the input
combined the title, abstract, and first 4,000 characters of full text
(when available), producing a 768-dimensional vector stored as a
binary blob in SQLite.

Pairwise cosine similarity was computed across all embedded drafts,
producing an $n \times n$ similarity matrix (cached to disk as a
NumPy array). Clustering used a greedy single-linkage algorithm: for
each unvisited draft, all unvisited drafts with cosine similarity
$\geq \tau$ to the seed were added to its cluster. Three empirically
determined thresholds were applied:

\begin{itemize}[nosep]
  \item $\tau = 0.85$: Topically overlapping drafts (42 clusters).
  \item $\tau = 0.90$: Near-duplicates or same-author variants (34
    clusters).
  \item $\tau = 0.98$: Functionally identical drafts (25+ pairs).
\end{itemize}

These thresholds were selected by manual inspection of draft pairs at
each level; no systematic sensitivity analysis was performed (see
Section~\ref{sec:limitations}).

\subsection{Supplementary Analyses}

Three additional analysis passes operate on the stored data with zero
API cost:

\begin{enumerate}[nosep]
  \item \textbf{RFC cross-references}: Regex-based extraction of
    RFC, BCP, and draft citations from full text, yielding 4,231
    cross-references across 360 drafts.
  \item \textbf{Category trends}: SQL-based monthly breakdown of new
    drafts per category with growth rates.
  \item \textbf{Co-authorship network}: Team bloc detection via
    pairwise author overlap ($\geq$70\% shared drafts, $\geq$2 shared
    drafts), with connected components forming blocs.
\end{enumerate}

\subsection{Cost}

Table~\ref{tab:cost} summarizes the total pipeline cost for 475 drafts.

\begin{table}[H]
\centering
\caption{Pipeline cost breakdown.}
\label{tab:cost}
\begin{tabular}{llrr}
\toprule
\textbf{Stage} & \textbf{Model} & \textbf{Items} & \textbf{Cost (USD)} \\
\midrule
Rating        & Claude Sonnet     & 475 drafts & \$5.50--8.00 \\
Idea extract. & Claude Haiku      & 475 drafts & \$0.80 \\
Gap analysis  & Claude Sonnet     & 1 call     & \$0.20 \\
Embeddings    & Ollama (local)    & 475 drafts & \$0.00 \\
RFC refs      & Regex (local)     & 475 drafts & \$0.00 \\
Trends        & SQL (local)       & 475 drafts & \$0.00 \\
Idea overlap  & SequenceMatcher   & 501 ideas  & \$0.00 \\
\midrule
\textbf{Total} & & & \textbf{\$6.50--9.00} \\
\bottomrule
\end{tabular}
\end{table}


% =========================================================================
\section{Results}
\label{sec:results}
% =========================================================================

\subsection{Corpus Overview and Growth Trajectory}

The final corpus comprises 475 Internet-Drafts submitted between
January~2024 and March~2026. After false-positive filtering (drafts
with relevance score $\leq$~2 or manually flagged), 361 drafts were
retained as substantively related to AI agents.

The growth trajectory is striking. In 2024, 9 AI/agent drafts were
submitted (0.5\% of 1,651 total IETF drafts). In 2025, 190 were
submitted (7.0\% of 2,696). In Q1~2026 alone, 162 were submitted
(9.3\% of 1,748). Monthly submissions followed a step function:
5~drafts in June~2025, 61 in October~2025, 85 in February~2026.
The acceleration has not plateaued as of March~2026.

\begin{table}[H]
\centering
\caption{Growth of AI/agent-related IETF Internet-Drafts.}
\label{tab:growth}
\begin{tabular}{rrrr}
\toprule
\textbf{Year} & \textbf{Total IETF} & \textbf{AI/Agent} & \textbf{Share (\%)} \\
\midrule
2024          & 1,651 &   9 & 0.5 \\
2025          & 2,696 & 190 & 7.0 \\
2026 (Q1)     & 1,748 & 162 & 9.3 \\
\bottomrule
\end{tabular}
\end{table}

\subsection{Thematic Distribution}
\label{sec:categories}

Drafts were classified into 11 non-exclusive categories
(Table~\ref{tab:categories}). A single draft may belong to multiple
categories; percentages therefore exceed 100\%.

\begin{table}[H]
\centering
\caption{Category distribution across 475 drafts. Drafts may appear in
multiple categories.}
\label{tab:categories}
\begin{tabular}{lrr}
\toprule
\textbf{Category} & \textbf{Drafts} & \textbf{Share (\%)} \\
\midrule
Data formats / interoperability   & 214 & 45 \\
Policy / governance               & 214 & 45 \\
Agent identity / authentication   & 160 & 34 \\
A2A protocols                     & 157 & 33 \\
Autonomous network operations     & 124 & 26 \\
Agent discovery / registration    &  89 & 19 \\
ML traffic management             &  79 & 17 \\
Human--agent interaction          &  57 & 12 \\
AI safety / alignment             & 112 & 24 \\
Model serving / inference         &  42 &  9 \\
Other AI/agent                    &  -- & -- \\
\bottomrule
\end{tabular}
\end{table}

The dominance of infrastructure categories---data formats, identity,
communication protocols---is expected for an early-stage standards
effort. The comparatively low representation of safety/alignment and
human--agent interaction categories is a structural finding we examine
in Section~\ref{sec:safety-deficit}.

\subsection{The Capability-to-Safety Deficit}
\label{sec:safety-deficit}

The ratio of capability-building drafts (A2A protocols, autonomous
network operations, agent discovery, model serving) to safety-oriented
drafts (AI safety/alignment, human--agent interaction) is
approximately 4:1 on aggregate. This ratio varies significantly by
month, ranging from 1.5:1 in months with concentrated safety
submissions to over 20:1 in months dominated by protocol proposals.

The drafts that do address safety are among the highest-rated in the
corpus. The Verifiable Observation Logging for Transparency
(VOLT)~\citep{draft-cowles-volt} protocol scored 4.75/5.0 on the
four-dimension composite (excluding overlap), as did the Distributed
AI Accountability Protocol (DAAP)~\citep{draft-aylward-daap}. The
STAMP protocol~\citep{draft-guy-bary-stamp} for cryptographic
delegation and proof scored 4.5. The quality of safety-focused work
is high; the quantity is not.

An analysis of RFC cross-references reinforces this finding. Across
4,231 parsed citations, the most-referenced standards after the
boilerplate RFC~2119/8174 conventions are TLS~1.3~\citep{rfc8446}
(42 citations), OAuth~2.0~\citep{rfc6749} (36), HTTP
Semantics~\citep{rfc9110} (34), and JWT~\citep{rfc7519} (22). The
agent standards ecosystem is being constructed on the web's existing
security infrastructure---OAuth, TLS, HTTP, JWT---yet the safety
layer that should accompany this security foundation remains
underdeveloped.

\subsection{Protocol Fragmentation}
\label{sec:fragmentation}

Embedding-based similarity analysis reveals extensive duplication and
fragmentation across the corpus.

\subsubsection{Near-Duplicates}

At the 0.98 cosine similarity threshold, 25+ draft pairs are
functionally identical---the same proposal submitted under different
names, to different working groups, or as renamed revisions. A
taxonomy of near-duplicates includes: same draft submitted to
different working groups (14 pairs), renamed drafts (5), evolutionary
versions (3), and genuinely competing proposals from different
organizations (2+).

\subsubsection{Competing Clusters}

At the 0.85 threshold, 42 topical clusters emerge. The most crowded
is OAuth for AI agents, with 14 distinct proposals all addressing
how AI agents authenticate and receive authorization via the OAuth
framework. These range from broad profile proposals to narrow scope
extensions to comprehensive accountability systems. None are
interoperable.

The A2A protocol space encompasses 157 drafts with no
interoperability layer. The most common technical idea in the entire
extracted corpus---``Multi-Agent Communication Protocol''---appears
independently in 8 drafts from different teams. A 10-draft cluster
addresses agent gateway and multi-agent collaboration, with
approaches ranging from semantic routing gateways to cross-domain
interoperability frameworks.

\subsubsection{Causes of Fragmentation}

The data distinguishes three causes: (1)~working group shopping, where
authors submit the same draft to multiple working groups seeking
adoption; (2)~parallel invention, where isolated teams independently
solve the same problem; and (3)~strategic surface-area expansion,
where organizations submit multiple related drafts to maximize
presence in the standards landscape.

\subsection{Organizational Dynamics}
\label{sec:orgs}

\subsubsection{Concentration}

Authorship is heavily concentrated. Huawei leads with 53 authors
contributing to 69 drafts---approximately 16\% of the entire corpus
across all Huawei entities. China Mobile (24~authors, 35~drafts),
Cisco (24~authors, 26~drafts), and China Telecom (24~authors,
24~drafts) follow. Chinese-linked institutions (Huawei, China
Mobile, China Telecom, China Unicom, Tsinghua University, ZTE, BUPT,
and associated laboratories) collectively account for over 160
authors.

Western technology companies are dramatically underrepresented
relative to their market positions. Google is present with 5 authors
on 9 drafts. Microsoft, Apple, and Meta have minimal direct
participation. Amazon's 6 authors focus on post-quantum cryptography
rather than agent-specific work.

\subsubsection{Team Blocs}

Co-authorship analysis identifies 18 team blocs among the 713 authors,
covering approximately 25\% of all authors. The largest bloc is a
13-person Huawei team sharing 22 drafts with 94\% average cohesion
(measured as pairwise overlap of draft portfolios). The team's core
of 7 members each appear on 13--23 drafts.

Cross-organizational collaboration is sparse. The most productive
cross-team pair shares only 3 drafts. Chinese organizations form a
tightly linked ecosystem: Huawei--China Unicom shares 6 drafts,
Tsinghua--Zhongguancun Lab shares 5, China Mobile--ZTE shares 4.
European telecoms (Deutsche Telekom, Telef\'onica, Orange) act as
bridges between Chinese and Western institutions.

\subsection{Cross-Organization Convergence}
\label{sec:convergence}

Despite the fragmentation, significant latent consensus exists. Using
fuzzy title matching (\texttt{SequenceMatcher} at 0.75 threshold) on
the 501 extracted ideas, 132 ideas (approximately 33\% of unique idea
clusters) have been independently proposed by two or more organizations.

The strongest convergence signals include ``A2A Communication
Paradigm'' (proposed by 8 organizations from 5 countries),
``AI Agent Network Architecture'' (8 organizations), and
``Multi-Agent Communication Protocol'' (7 organizations). An
examination of organizational pairs reveals that 180 convergent ideas
cross the boundary between Chinese-linked and Western organizations,
indicating genuine cross-cultural consensus on technical directions
despite the sparse direct collaboration noted in
Section~\ref{sec:orgs}.

The coexistence of convergence and fragmentation has a specific
structure: organizations agree on \textit{what} needs building (the
convergent ideas) but disagree on \textit{how} to build it (the
competing protocol proposals). This gap between problem consensus and
solution divergence is where architectural coordination is most needed.

\subsection{Gap Analysis}
\label{sec:gaps}

The gap analysis identified 11 standardization gaps, distributed across
severity levels as shown in Table~\ref{tab:gaps}.

\begin{table}[H]
\centering
\caption{Identified standardization gaps by severity.}
\label{tab:gaps}
\begin{tabularx}{\textwidth}{llX}
\toprule
\textbf{Severity} & \textbf{Topic} & \textbf{Description} \\
\midrule
Critical & Agent legal liability &
  No standard addresses liability assignment when autonomous agents
  cause harm or make binding commitments across creators, operators,
  and users. \\
Critical & Capability degradation detection &
  No standard defines detection mechanisms for gradual capability
  degradation due to concept drift, adversarial inputs, or model
  corruption. \\
Critical & Emergency override protocols &
  No standard defines distributed emergency-stop mechanisms for
  autonomous agents exhibiting dangerous behavior across
  multi-system deployments. \\
\midrule
High & Cross-domain identity portability &
  Agents cannot maintain consistent identity across organizational
  domains with different identity systems. \\
High & Real-time behavior explanation &
  No standard for interactive, real-time explanations of agent
  decision-making during operation. \\
High & Multi-agent conflict resolution &
  No protocol for resolving conflicts when multiple agents have
  competing objectives or contend for shared resources. \\
High & Inter-standards-body bridging &
  Protocols from IETF, ITU-T, and ISO cannot interoperate, creating
  silos across network, internet, and industrial domains. \\
High & Behavioral audit trails &
  Missing standards for immutable, decision-level audit logs
  supporting forensic analysis and regulatory compliance. \\
\midrule
Medium & Resource consumption limits &
  No self-regulation standards for agent computational, network, and
  energy resource usage. \\
Medium & Training data provenance &
  Missing standards for tracking data lineage as it flows between
  agents in federated learning scenarios. \\
Medium & Content attribution &
  No cryptographic attribution standards for agent-generated content.\\
\bottomrule
\end{tabularx}
\end{table}

The three critical gaps share a common theme: they address what happens
when autonomous agents fail or misbehave. The capability-building
majority of the corpus assumes cooperative, well-functioning agent
systems; the critical gaps expose the absence of standards for the
adversarial, degraded, and emergency cases that inevitably arise in
production deployment.

Cross-referencing gaps with extracted ideas quantifies the coverage
deficit. The ``emergency override'' gap has only 15 partially
addressing ideas across the corpus. The ``multi-agent conflict
resolution'' and ``inter-standards-body bridging'' gaps have zero
directly related extracted ideas---they are entirely unaddressed.


% =========================================================================
\section{Discussion}
\label{sec:discussion}
% =========================================================================

\subsection{Implications for Standardization Strategy}

The landscape reveals a standards ecosystem in a characteristic
early-stage pattern: rapid expansion, parallel invention, and
insufficient coordination. The IETF has navigated such patterns
before---the early web, IoT, DNS security---and the historical
resolution involves convergence of competing proposals, working group
consolidation, and the emergence of a small number of lasting
standards from a large initial field.

Three strategic priorities emerge from the data:

\textbf{Safety-first coordination.} The 4:1 capability-to-safety
ratio is a structural risk. The critical gaps---behavioral verification,
capability degradation detection, emergency override---are precisely
the areas where standardization failure has the highest real-world
consequence. Unlike protocol fragmentation, which causes confusion and
implementation cost, safety gaps create liability and harm. The
EU AI Act~\citep{eu-ai-act}, which mandates real-time explainability
and human oversight for high-risk AI systems, will make several of
these gaps regulatory obligations rather than optional best practices.

\textbf{Architectural connective tissue.} The landscape needs not more
protocols but a shared execution model. The convergence data shows that
organizations agree on the components; they disagree on the
integration. Proposals like VOLT~\citep{draft-cowles-volt} (execution
traces), DAAP~\citep{draft-aylward-daap} (accountability),
STAMP~\citep{draft-guy-bary-stamp} (cryptographic delegation), and
Verifiable Agent Conversations~\citep{draft-birkholz-vac} (signed
conversation records) address complementary parts of the same
architectural problem. An overarching agent execution architecture
that composes these components would accelerate convergence more
effectively than continued parallel invention.

\textbf{Cross-organization coordination.} The team bloc structure
produces drafts that are internally consistent but externally
incompatible. The 18 detected blocs function as islands; the bridges
between them are thin. Mechanisms that encourage cross-bloc
collaboration---joint design teams, interop testing events,
shared reference implementations---are more likely to produce lasting
standards than the current pattern of parallel submission.

\subsection{Relationship to Prior Agent Standards}

A notable finding is the near-complete absence of references to FIPA
(Foundation for Intelligent Physical Agents) in the contemporary IETF
corpus. FIPA's Agent Communication Language~\citep{fipa-acl} and Agent
Management Specification~\citep{fipa-ams}, developed between 1996 and
2005, addressed agent discovery, communication, platform
interoperability, and interaction protocols---the same problem space
that the current wave of IETF drafts tackles.

The absence of FIPA references does not necessarily indicate ignorance;
the web-native technical context of 2025 differs substantially from the
Java/CORBA context of 2002. However, the recurrence of problems
FIPA addressed (agent naming, message semantics, directory services,
interaction protocols) suggests that explicit engagement with the
FIPA legacy could help the IETF community avoid re-learning lessons
from two decades ago.

\subsection{Limitations}
\label{sec:limitations}

The methodology has several limitations that affect the confidence and
generalizability of the findings.

\textbf{LLM-as-judge validity.} All quality ratings are generated by a
single LLM (Claude Sonnet) from draft abstracts truncated to 2,000
characters. No human calibration study has been performed; no
inter-rater reliability is established. The ratings should be treated
as relative rankings within this corpus, not absolute quality measures.
Maturity scores are particularly affected by abstract-only input, as
abstracts may not convey the full technical depth of a specification.
The overlap dimension is limited because Claude rates each draft
independently without access to the full corpus, meaning it reflects
the model's general knowledge rather than corpus-specific similarity.
A validation study using domain expert ratings on a sample of 25--30
drafts would substantially strengthen confidence.

\textbf{Corpus selection bias.} Keyword-based selection introduces both
false positives (``agent'' matching ``user agent,'' ``autonomous''
matching ``autonomous systems'' in routing) and false negatives
(relevant drafts using terminology outside the keyword set). We
estimate 30--50 false positives remain despite relevance filtering.
The temporal cutoff of January~2024 excludes earlier foundational work.

\textbf{Clustering thresholds.} The similarity thresholds (0.85, 0.90,
0.98) are empirically chosen by manual inspection, not derived from
principled analysis. The embedding model (nomic-embed-text) is a
general-purpose model not fine-tuned for standards document similarity.
Sensitivity analysis across thresholds and comparison with alternative
clustering methods (DBSCAN, hierarchical agglomerative) would
strengthen the clustering results.

\textbf{Gap analysis methodology.} Gap identification relies on a
single-shot LLM analysis of compressed landscape statistics, not
systematic comparison against a reference taxonomy. A rigorous
approach would compare the corpus against an explicit reference
architecture such as NIST AI RMF~\citep{nist-ai-rmf}, the FIPA agent
platform model, or a purpose-built agent ecosystem reference model.
Gap severity is assigned by Claude without defined quantitative
thresholds.

\textbf{Idea extraction consistency.} Batch extraction using Haiku
with abstract-only input produces different results from individual
extraction using Sonnet with full text. No precision/recall measurement
has been performed. The extraction prompt limits output to 1--4 ideas
per draft, potentially under-counting contributions from comprehensive
specifications.

\textbf{Organizational normalization.} Cross-organization analysis
depends on the accuracy of a hand-curated alias table. Boundary cases
(e.g., joint ventures, university--industry affiliations, subsidiary
relationships) introduce judgment calls that affect concentration
statistics.

Despite these limitations, the findings are robust in their broad
contours: the growth trajectory, the safety deficit, the protocol
fragmentation, and the organizational concentration are visible
across multiple analytical methods and are not sensitive to the
specific threshold or model choices within reasonable ranges.

\subsection{Reproducibility and Openness}

The complete pipeline, database, and derived reports are released as
open-source software (the IETF Draft Analyzer). The SQLite database
contains all raw data, ratings, embeddings, ideas, gaps, author
records, and cached LLM responses, enabling independent verification
of every finding reported in this paper. The caching mechanism ensures
that re-running the pipeline produces identical results without
additional API cost.


% =========================================================================
\section{Conclusion}
\label{sec:conclusion}
% =========================================================================

We have presented a systematic, LLM-assisted analysis of the IETF's
AI-agent standardization landscape, covering 475 Internet-Drafts from
713 authors across more than 230 organizations. The analysis reveals a
standards ecosystem experiencing unprecedented growth---from 0.5\% to
9.3\% of all IETF submissions in fifteen months---accompanied by
significant structural challenges.

The capability-to-safety ratio of approximately 4:1, the extreme
protocol fragmentation (14 competing OAuth proposals, 155 A2A drafts
with no interoperability layer), and the concentration of authorship
(one vendor contributing $\sim$16\% of all drafts) are findings that
have direct implications for the trajectory of AI-agent
standardization. The 11 identified gaps, with three critical gaps
centered on what happens when agents fail, highlight the areas where
standardization effort is most urgently needed.

At the same time, the 132 cross-organization convergent ideas
demonstrate that latent consensus exists beneath the fragmentation.
Organizations agree on the problems; they disagree on the solutions.
This gap between problem consensus and solution divergence defines the
current phase of the standards race and points toward the needed
intervention: not more protocol proposals, but architectural
connective tissue that composes the existing high-quality components
into a coherent ecosystem.

The methodology itself contributes a replicable, cost-effective
approach to standards landscape analysis. At \$9--15 total, the
pipeline demonstrates that LLM-assisted document analysis at scale is
practical for research and policy applications. The explicit
documentation of limitations---no human calibration, empirical
thresholds, single-judge ratings---provides a template for the
responsible use of LLM-as-judge methodologies in technical document
analysis.

The IETF has navigated standardization sprints before, and the lasting
standards have consistently emerged from efforts that prioritized
interoperability and safety alongside capability. Whether the current
AI-agent wave follows this historical pattern depends on whether the
community can shift from parallel invention to coordinated
architecture before the capability work ships without the safety work
that should accompany it.


% =========================================================================
% References
% =========================================================================
\bibliographystyle{plainnat}
\bibliography{ietf-refs}

\end{document}