title: Data Provenance Tracking Protocol for AI Agent Communications
draft_name: draft-nennemann-ai-agent-provenance-00
intended_wg: SECDISPATCH or WIMSE
status: outline
gaps_addressed: 84, 88, 93
camel_sections: 5.3, 5.4
date: 2026-03-09

Data Provenance Tracking Protocol for AI Agent Communications

1. Problem Statement

When AI agents process data through multi-step tool-calling pipelines, the origin and transformation history of each piece of data are lost. This creates three critical problems:

  1. No explainability (Gap #84): when an agent makes a decision, there is no standard way to trace, in real time, which data influenced it and where that data came from
  2. Incompatible audit trails (Gap #88): different agent platforms log decisions in incompatible formats, making cross-system forensics impossible
  3. Privacy leakage (Gap #93): without provenance tracking, agents cannot enforce data handling policies — private training data, user interactions, and proprietary algorithms may leak through tool calls

CaML demonstrates that tracking provenance at the individual value level (not just the message level) is both feasible and essential for security. Every variable in CaML's interpreter carries metadata about its sources and allowed readers.

2. Scope

This document defines:

  1. A provenance record format for tracking data origin and transformation chains
  2. A provenance propagation protocol for maintaining provenance across agent boundaries
  3. A provenance query interface for real-time explainability
  4. Privacy constraints on provenance metadata itself

3. Provenance Model

3.1 Provenance Record

Every data value in an agent system carries a provenance record:

{
  "prov:id": "prov-8c3a2d",
  "prov:value_ref": "val-email-body",
  "prov:origin": {
    "type": "tool_output",
    "tool": "read_email",
    "invocation_id": "inv-4f2a1b",
    "agent_id": "agent-a@org1.example",
    "timestamp": "2026-03-09T14:30:00Z",
    "inner_sources": [
      {
        "type": "external_entity",
        "identifier": "sender:bob@example.com",
        "trust_level": "untrusted"
      }
    ]
  },
  "prov:transformations": [
    {
      "type": "llm_extraction",
      "model_role": "quarantined",
      "operation": "extract_email_address",
      "input_provenance": ["prov-7b2a1c"],
      "timestamp": "2026-03-09T14:30:01Z"
    }
  ],
  "prov:classification": {
    "trust_level": "untrusted",
    "sensitivity": "pii",
    "readers": ["user", "bob@example.com"]
  }
}
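
A consumer will usually want to load such a record into a typed structure and reject incomplete records early. The following is a minimal Python sketch; the `ProvenanceRecord` class, the `parse_record` helper, and the choice of which keys count as required are assumptions made for illustration, not part of the format above.

```python
import json
from dataclasses import dataclass, field

# Assumption: these top-level keys are treated as mandatory for illustration.
REQUIRED_KEYS = ("prov:id", "prov:value_ref", "prov:origin", "prov:classification")


@dataclass
class ProvenanceRecord:
    prov_id: str
    value_ref: str
    origin: dict
    classification: dict
    transformations: list = field(default_factory=list)

    @property
    def trust_level(self) -> str:
        # Trust is read from the classification block of the record.
        return self.classification["trust_level"]


def parse_record(raw: str) -> ProvenanceRecord:
    """Parse a JSON provenance record, raising ValueError if keys are missing."""
    doc = json.loads(raw)
    missing = [key for key in REQUIRED_KEYS if key not in doc]
    if missing:
        raise ValueError(f"provenance record is missing keys: {missing}")
    return ProvenanceRecord(
        prov_id=doc["prov:id"],
        value_ref=doc["prov:value_ref"],
        origin=doc["prov:origin"],
        classification=doc["prov:classification"],
        transformations=doc.get("prov:transformations", []),
    )
```

Treating `prov:transformations` as optional reflects the idea that a value taken straight from its origin has no transformation history yet.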

3.2 Origin Types

| Origin Type | Description | Trust Default |
|---|---|---|
| user_input | Directly from the authenticated user's query | trusted |
| tool_output | Returned by a tool invocation | depends on tool |
| llm_generation | Generated by an LLM (P-LLM or Q-LLM) | depends on role |
| literal | Hardcoded in the execution plan | trusted |
| external_entity | Inner source within tool data (e.g., email sender) | untrusted |
| derived | Computed from other values | min(input trust levels) |

3.3 Transformation Types

| Transform Type | Description | Provenance Effect |
|---|---|---|
| llm_extraction | Q-LLM parses unstructured → structured | inherits all input provenance |
| computation | Deterministic operation (concat, filter) | union of input provenance |
| aggregation | Multiple values combined | union of all input provenance |
| user_approval | User explicitly approved a value | upgrades trust to "user_approved" |
| redaction | Sensitive content removed | may upgrade trust classification |
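
The trust side of these effects can be expressed as a small function. This is only a sketch: the `trust_after` name is hypothetical, and the ranking of "user_approved" between "untrusted" and "trusted" is an assumption, since the table does not say where it sits.

```python
# Assumed ordering; the draft does not say where "user_approved" ranks.
TRUST_RANK = {"untrusted": 0, "user_approved": 1, "trusted": 2}


def trust_after(transform: str, input_trusts: list[str]) -> str:
    """Return the trust level of a value produced by `transform`."""
    lowest = min(input_trusts, key=TRUST_RANK.__getitem__)
    if transform in ("llm_extraction", "computation", "aggregation"):
        # Provenance is inherited or unioned, so trust is the weakest input.
        return lowest
    if transform == "user_approval":
        # Approval upgrades trust but never downgrades a trusted value.
        return max(lowest, "user_approved", key=TRUST_RANK.__getitem__)
    if transform == "redaction":
        # The table says redaction *may* upgrade; this sketch leaves trust as is.
        return lowest
    raise ValueError(f"unknown transformation type: {transform}")
```

Rejecting unknown transformation types is a conservative choice for the sketch; an extensible vocabulary would need a defined fallback instead.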

4. Propagation Protocol

4.1 Intra-Agent Propagation

Within a single agent, the execution engine (interpreter) maintains provenance automatically:

val_a = tool_1()               → prov: {origin: tool_1}
val_b = tool_2()               → prov: {origin: tool_2}
val_c = extract(val_a)         → prov: {origin: tool_1, transform: extraction}
val_d = combine(val_b, val_c)  → prov: {origin: [tool_1, tool_2], transform: computation}

Rule: a derived value inherits the union of its inputs' provenance and the minimum of their trust levels.
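
The propagation rule can be sketched in a few lines of Python. The `Tracked` wrapper, the `derive` helper, the two-level trust ordering, and the trust levels assigned to `tool_1` and `tool_2` are all assumptions for illustration; a real interpreter would attach this metadata to every value automatically.

```python
from dataclasses import dataclass

# Assumed trust ordering: a lower rank means less trusted.
TRUST_RANK = {"untrusted": 0, "trusted": 1}


@dataclass(frozen=True)
class Tracked:
    value: object
    origins: frozenset      # origin identifiers, e.g. tool names
    trust: str              # key into TRUST_RANK
    transforms: tuple = ()  # transformation types applied so far


def derive(value, inputs, transform):
    """Build a derived value: union of input origins, minimum input trust."""
    origins = frozenset().union(*(i.origins for i in inputs))
    trust = min((i.trust for i in inputs), key=TRUST_RANK.__getitem__)
    transforms = tuple(t for i in inputs for t in i.transforms) + (transform,)
    return Tracked(value, origins, trust, transforms)


# Hypothetical values: tool_1 returns untrusted text, tool_2 a trusted name.
val_a = Tracked("raw email text", frozenset({"tool_1"}), "untrusted")
val_b = Tracked("Bob", frozenset({"tool_2"}), "trusted")
val_c = derive("bob@example.com", [val_a], "extraction")
val_d = derive("Bob <bob@example.com>", [val_b, val_c], "computation")
```

Here `val_d` ends up with both tools as origins and with trust "untrusted", because one of its inputs traces back to untrusted tool output.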

4.2 Inter-Agent Propagation

When data crosses agent boundaries (via A2A, HTTP, message queues):

Agent A                              Agent B
┌──────────┐                        ┌──────────┐
│ val_d    │                        │          │
│ prov: {  │ ──── message ────►     │ val_e    │
│   A's    │   with provenance      │ prov: {  │
│   chain  │   header/metadata      │   A's chain + │
│ }        │                        │   hop record  │
└──────────┘                        │ }        │
                                    └──────────┘

Provenance headers in inter-agent messages:

POST /agent-b/task HTTP/1.1
Content-Type: application/json
X-Agent-Provenance: eyJwcm92OmlkIjoicHJvdi04YzNhMmQi...  (base64-encoded provenance chain)
X-Agent-Provenance-Signature: <signed by agent A>

Or as a structured field in A2A messages:

{
  "a2a:message": { ... },
  "a2a:provenance": {
    "chain": [ ... ],
    "hop": {
      "agent_id": "agent-a@org1.example",
      "timestamp": "2026-03-09T14:30:02Z",
      "attestation": "<signature>"
    }
  }
}
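
As a rough illustration of the header variant, the sketch below serializes a chain, base64-encodes it, and attaches a signature. HMAC-SHA256 with a shared key stands in for whatever signature scheme the protocol eventually specifies; the function names and the key handling are hypothetical.

```python
import base64
import hashlib
import hmac
import json


def encode_provenance_headers(chain: list, key: bytes) -> dict:
    """Return the two provenance headers for an outgoing request."""
    payload = json.dumps(chain, separators=(",", ":"), sort_keys=True).encode()
    encoded = base64.b64encode(payload).decode("ascii")
    # Placeholder signature: an HMAC over the encoded header value.
    signature = hmac.new(key, encoded.encode("ascii"), hashlib.sha256).hexdigest()
    return {
        "X-Agent-Provenance": encoded,
        "X-Agent-Provenance-Signature": signature,
    }


def decode_provenance_headers(headers: dict, key: bytes) -> list:
    """Verify the signature, then decode the chain; raise ValueError on mismatch."""
    encoded = headers["X-Agent-Provenance"]
    expected = hmac.new(key, encoded.encode("ascii"), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, headers["X-Agent-Provenance-Signature"]):
        raise ValueError("provenance signature mismatch")
    return json.loads(base64.b64decode(encoded))
```

A shared-key HMAC keeps the example short, but it lets the receiver forge records too. An asymmetric signature would be the more likely choice across organizations, so that Agent B can verify Agent A's chain without holding A's signing key.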

4.3 Provenance Compaction

For long chains, provenance can be compacted:

  1. Hash chaining: replace the full chain with a Merkle tree root plus the most recent N entries
  2. Trust boundary summarization: when crossing org boundaries, summarize internal provenance as a single attested record
  3. TTL-based pruning: provenance entries older than a configurable TTL are archived (reference retained, detail available on request)
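
The first option could look roughly like the sketch below. For brevity it folds the older entries into a linear SHA-256 hash chain rather than a full Merkle tree, and the `compact` function and its output field names are hypothetical.

```python
import hashlib
import json


def compact(chain: list, keep_last: int) -> dict:
    """Replace all but the newest `keep_last` entries with a single digest."""
    if keep_last < 0:
        raise ValueError("keep_last must be non-negative")
    split = max(len(chain) - keep_last, 0)
    older, recent = chain[:split], chain[split:]
    digest = b""
    for entry in older:
        # Each step commits to the previous digest and the canonical entry.
        canonical = json.dumps(entry, sort_keys=True, separators=(",", ":"))
        digest = hashlib.sha256(digest + canonical.encode()).digest()
    return {
        "compacted_count": len(older),
        "compacted_digest": digest.hex() if older else None,
        "recent": recent,
    }
```

A linear chain only lets a verifier check the whole compacted prefix at once. A Merkle tree would additionally allow proving that a single archived entry is included without revealing the others, which may matter for the privacy controls discussed later.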

5. Real-Time Provenance Query

Directly addresses Gap #84: Real-time AI agent explainability protocols.

5.1 Query Interface

Any participant (user, operator, peer agent) can query provenance:

{
  "query:type": "explain_value",
  "query:value_ref": "val-d",
  "query:depth": "full",
  "query:format": "graph"
}

Response:

{
  "explain:value_ref": "val-d",
  "explain:summary": "Email address extracted from meeting notes retrieved from cloud storage, combined with user-specified recipient name",
  "explain:graph": {
    "nodes": [
      {"id": "user_input", "trust": "trusted", "content_hint": "user query"},
      {"id": "tool_1:search_notes", "trust": "tool", "content_hint": "meeting notes"},
      {"id": "q_llm:extract", "trust": "untrusted", "content_hint": "extracted email"}
    ],
    "edges": [
      {"from": "tool_1:search_notes", "to": "q_llm:extract"},
      {"from": "q_llm:extract", "to": "val-d"}
    ]
  },
  "explain:trust_assessment": "UNTRUSTED — depends on quarantined LLM extraction from tool output",
  "explain:timestamp": "2026-03-09T14:30:05Z"
}
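
One way a responder might compute the trust assessment is to walk the graph backwards from the queried value and look for untrusted ancestors. The sketch below assumes the node and edge shapes from the response above; the `ancestors` and `assess` helpers are hypothetical, and reducing the verdict to two outcomes is a simplifying assumption.

```python
def ancestors(target: str, edges: list) -> set:
    """Return every node id from which `target` is reachable."""
    parents = {}
    for edge in edges:
        parents.setdefault(edge["to"], []).append(edge["from"])
    seen, stack = set(), [target]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def assess(target: str, graph: dict) -> str:
    """Return "UNTRUSTED" if any ancestor of `target` is untrusted, else "TRUSTED"."""
    trust = {node["id"]: node["trust"] for node in graph["nodes"]}
    upstream = ancestors(target, graph["edges"])
    if any(trust.get(node_id) == "untrusted" for node_id in upstream):
        return "UNTRUSTED"
    return "TRUSTED"


# Graph copied from the example response above.
graph = {
    "nodes": [
        {"id": "user_input", "trust": "trusted"},
        {"id": "tool_1:search_notes", "trust": "tool"},
        {"id": "q_llm:extract", "trust": "untrusted"},
    ],
    "edges": [
        {"from": "tool_1:search_notes", "to": "q_llm:extract"},
        {"from": "q_llm:extract", "to": "val-d"},
    ],
}
```

With this graph, `assess("val-d", graph)` yields "UNTRUSTED" because the quarantined extraction step sits upstream of the value.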

5.2 Streaming Provenance

For long-running agent tasks, provenance can be streamed:

  • SSE (Server-Sent Events) or WebSocket connection
  • Each tool invocation emits a provenance event
  • Operators see the dependency graph build in real time
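
A provenance event stream over SSE could be framed as below. The event name `provenance`, the sequence-number `id`, and the payload fields are assumptions; only the `field: value` lines and the blank-line terminator come from the SSE format itself.

```python
import json
from typing import Iterable, Iterator


def sse_events(records: Iterable[dict]) -> Iterator[str]:
    """Yield one Server-Sent Events frame per provenance record."""
    for seq, record in enumerate(records, start=1):
        # Compact JSON contains no raw newlines, so one data line suffices.
        data = json.dumps(record, separators=(",", ":"))
        # A frame is a set of "field: value" lines ended by a blank line.
        yield f"id: {seq}\nevent: provenance\ndata: {data}\n\n"
```

Sending an `id` with each frame lets a reconnecting client tell the server which event it saw last, which helps an operator's dependency graph survive a dropped connection.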

6. Privacy-Preserving Provenance

Addresses Gap #93: Privacy-preserving agent-to-agent communication.

6.1 The Provenance Privacy Paradox

Provenance metadata can itself leak sensitive information:

  • Knowing which tools were called reveals the user's intent
  • Knowing inner sources (e.g., email senders) reveals the user's contacts
  • The transformation chain reveals the agent's reasoning process

6.2 Privacy Controls

  1. Selective disclosure: agents can share provenance summaries (trust level, origin type) without full chains
  2. Zero-knowledge trust: "this value is trusted" attested by a trusted third party, without revealing the full provenance
  3. Provenance redaction: when crossing privacy boundaries, inner sources are replaced with attestations
  4. Need-to-know: provenance detail levels based on the requester's authorization

A redacted origin, as seen by a requester outside the privacy boundary, might look like this:

{
  "prov:origin": {
    "type": "attested",
    "attestor": "org1.example",
    "trust_level": "trusted",
    "detail": "redacted — contact org1.example for full provenance"
  }
}
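
Selective disclosure and redaction can be combined into a single filtering step keyed on the requester's authorization. In this sketch the detail levels `full` and `summary`, the `disclose` function, and the `attestor` parameter are hypothetical.

```python
import copy


def disclose(record: dict, level: str, attestor: str) -> dict:
    """Return the view of `record` that a requester at `level` may see."""
    if level == "full":
        # Copy so the caller cannot mutate the stored record.
        return copy.deepcopy(record)
    if level == "summary":
        # Keep only coarse facts; replace the origin with an attestation.
        return {
            "prov:id": record["prov:id"],
            "prov:origin": {
                "type": "attested",
                "attestor": attestor,
                "trust_level": record["prov:classification"]["trust_level"],
                "detail": f"redacted; contact {attestor} for full provenance",
            },
        }
    raise ValueError(f"unknown detail level: {level}")
```

The summary view drops inner sources and the transformation chain entirely, which addresses the leaks listed in 6.1 at the cost of making the requester rely on the attestor's word.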

7. Relationship to ECT

Execution Context Tokens (draft-nennemann-wimse-ect) record what happened in a DAG of signed tokens. Provenance tracking records where data came from. They are complementary:

| Aspect | ECT | This Draft |
|---|---|---|
| Tracks | Task execution events | Data origin and flow |
| Granularity | Per-task | Per-value |
| Format | JWT with DAG links | JSON provenance records |
| Purpose | Audit "what was done" | Explain "why this data" |

Integration: ECT claims can reference provenance records, and provenance records can link to ECT task IDs.

8. Security Considerations

  • Provenance records must be integrity-protected (signed by the producing agent)
  • Provenance forgery (claiming a higher trust level) must be detectable via attestation chains
  • Provenance metadata size can be significant — compaction mechanisms are essential
  • Timing information in provenance can leak operational patterns

9. Open Questions

  1. Standard vocabulary: should provenance types be extensible or fixed?
  2. Cross-standard alignment: how does this relate to W3C PROV (provenance ontology)?
  3. Storage: who is responsible for storing provenance long-term? Each agent? A shared ledger?
  4. Legal implications: does provenance tracking create liability for organizations that produce it?

10. References

  • Debenedetti et al. "Defeating Prompt Injections by Design." arXiv:2503.18813, 2025.
  • Denning. "A lattice model of secure information flow." CACM, 1976.
  • W3C PROV: Provenance Data Model. W3C Recommendation, 2013.
  • draft-nennemann-wimse-ect (Execution Context Tokens)
  • draft-ietf-wimse-arch (WIMSE architecture)