---
title: "Data Provenance Tracking Protocol for AI Agent Communications"
draft_name: draft-nennemann-ai-agent-provenance-00
intended_wg: SECDISPATCH or WIMSE
status: outline
gaps_addressed: [84, 88, 93]
camel_sections: [5.3, 5.4]
date: 2026-03-09
---

# Data Provenance Tracking Protocol for AI Agent Communications

## 1. Problem Statement

When AI agents process data through multi-step tool-calling pipelines, the **origin and transformation history** of each piece of data is lost. This creates three critical problems:

1. **No explainability** (Gap #84): when an agent makes a decision, there is no standard way to trace *which data influenced it* and *where that data came from* in real time
2. **Incompatible audit trails** (Gap #88): different agent platforms log decisions in incompatible formats, making cross-system forensics impossible
3. **Privacy leakage** (Gap #93): without provenance tracking, agents cannot enforce data handling policies — private training data, user interactions, and proprietary algorithms may leak through tool calls

CaMeL demonstrates that tracking provenance at the **individual value level** (not just the message level) is both feasible and essential for security. Every variable in CaMeL's interpreter carries metadata about its sources and allowed readers.

## 2. Scope

This document defines:

1. A **provenance record format** for tracking data origin and transformation chains
2. A **provenance propagation protocol** for maintaining provenance across agent boundaries
3. A **provenance query interface** for real-time explainability
4. **Privacy constraints** on provenance metadata itself

## 3. Provenance Model

### 3.1 Provenance Record

Every data value in an agent system carries a provenance record:

```json
{
  "prov:id": "prov-8c3a2d",
  "prov:value_ref": "val-email-body",
  "prov:origin": {
    "type": "tool_output",
    "tool": "read_email",
    "invocation_id": "inv-4f2a1b",
    "agent_id": "agent-a@org1.example",
    "timestamp": "2026-03-09T14:30:00Z",
    "inner_sources": [
      {
        "type": "external_entity",
        "identifier": "sender:bob@example.com",
        "trust_level": "untrusted"
      }
    ]
  },
  "prov:transformations": [
    {
      "type": "llm_extraction",
      "model_role": "quarantined",
      "operation": "extract_email_address",
      "input_provenance": ["prov-7b2a1c"],
      "timestamp": "2026-03-09T14:30:01Z"
    }
  ],
  "prov:classification": {
    "trust_level": "untrusted",
    "sensitivity": "pii",
    "readers": ["user", "bob@example.com"]
  }
}
```
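As a rough illustration of how an implementation might consume this record, the sketch below parses the JSON form into a flat in-memory structure. The class name `ProvenanceRecord`, the function `parse_record`, and the flattened field layout are assumptions made for this example; the draft itself only defines the JSON keys shown above.

```python
import json
from dataclasses import dataclass


# Hypothetical in-memory form of a provenance record; not defined by the draft.
@dataclass(frozen=True)
class ProvenanceRecord:
    prov_id: str
    value_ref: str
    origin_type: str
    trust_level: str
    sensitivity: str
    readers: tuple[str, ...]
    inner_sources: tuple[str, ...] = ()
    transformations: tuple[str, ...] = ()


def parse_record(raw: str) -> ProvenanceRecord:
    """Parse the JSON form; raises KeyError if a required key is missing."""
    doc = json.loads(raw)
    origin = doc["prov:origin"]
    classification = doc["prov:classification"]
    return ProvenanceRecord(
        prov_id=doc["prov:id"],
        value_ref=doc["prov:value_ref"],
        origin_type=origin["type"],
        trust_level=classification["trust_level"],
        sensitivity=classification["sensitivity"],
        readers=tuple(classification["readers"]),
        # Keep only identifiers of inner sources and type names of transformations.
        inner_sources=tuple(s["identifier"] for s in origin.get("inner_sources", [])),
        transformations=tuple(t["type"] for t in doc.get("prov:transformations", [])),
    )
```

Treating `inner_sources` and `prov:transformations` as optional is a choice made for this sketch, not a statement about which fields the draft would make mandatory.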

### 3.2 Origin Types

| Origin Type | Description | Trust Default |
|-------------|-------------|---------------|
| `user_input` | Directly from the authenticated user's query | trusted |
| `tool_output` | Returned by a tool invocation | depends on tool |
| `llm_generation` | Generated by an LLM (P-LLM or Q-LLM) | depends on role |
| `literal` | Hardcoded in the execution plan | trusted |
| `external_entity` | Inner source within tool data (e.g., email sender) | untrusted |
| `derived` | Computed from other values | min(input trust levels) |

### 3.3 Transformation Types

| Transform Type | Description | Provenance Effect |
|---------------|-------------|-------------------|
| `llm_extraction` | Q-LLM parses unstructured → structured | inherits all input provenance |
| `computation` | Deterministic operation (concat, filter) | union of input provenance |
| `aggregation` | Multiple values combined | union of all input provenance |
| `user_approval` | User explicitly approved a value | upgrades trust to "user_approved" |
| `redaction` | Sensitive content removed | may upgrade trust classification |

## 4. Propagation Protocol

### 4.1 Intra-Agent Propagation

Within a single agent, the execution engine (interpreter) maintains provenance automatically:

```
val_a = tool_1()               → prov: {origin: tool_1}
val_b = tool_2()               → prov: {origin: tool_2}
val_c = extract(val_a)         → prov: {origin: tool_1, transform: extraction}
val_d = combine(val_b, val_c)  → prov: {origin: [tool_1, tool_2], transform: computation}
```

**Rule**: derived values inherit the **union** of all input provenances and the **minimum** trust level.
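The rule can be sketched in a few lines of Python. The `Tagged` wrapper, the `derive` helper, and the two-level `TRUST_RANK` ordering are hypothetical names chosen for this example; a real interpreter would attach the same metadata to every value it evaluates.

```python
from dataclasses import dataclass

# Assumed ordering for illustration: a lower rank means less trusted.
TRUST_RANK = {"untrusted": 0, "trusted": 1}


@dataclass(frozen=True)
class Tagged:
    value: object
    origins: frozenset
    trust: str


def derive(value, *inputs: Tagged) -> Tagged:
    """Tag a derived value with the union of input origins and the minimum trust."""
    if not inputs:
        raise ValueError("a derived value needs at least one input")
    origins = frozenset().union(*(i.origins for i in inputs))
    trust = min((i.trust for i in inputs), key=TRUST_RANK.__getitem__)
    return Tagged(value, origins, trust)


val_a = Tagged("notes", frozenset({"tool_1"}), "untrusted")
val_b = Tagged("name", frozenset({"tool_2"}), "trusted")
val_d = derive("name + notes", val_a, val_b)
# val_d carries both origins and is untrusted, because one input is untrusted.
```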

### 4.2 Inter-Agent Propagation

When data crosses agent boundaries (via A2A, HTTP, message queues):

```
Agent A                              Agent B
┌──────────┐                        ┌──────────┐
│ val_d    │                        │          │
│ prov: {  │ ──── message ────►     │ val_e    │
│   A's    │   with provenance      │ prov: {  │
│   chain  │   header/metadata      │   A's chain + │
│ }        │                        │   hop record  │
└──────────┘                        │ }        │
                                    └──────────┘
```

Provenance can travel in HTTP headers on inter-agent messages. The `X-Agent-Provenance` value is the base64-encoded provenance chain, and `X-Agent-Provenance-Signature` carries agent A's signature:

```http
POST /agent-b/task HTTP/1.1
Content-Type: application/json
X-Agent-Provenance: eyJwcm92OmlkIjoicHJvdi04YzNhMmQi...
X-Agent-Provenance-Signature: <signed by agent A>
```

Or as a structured field in A2A messages:

```json
{
  "a2a:message": { ... },
  "a2a:provenance": {
    "chain": [ ... ],
    "hop": {
      "agent_id": "agent-a@org1.example",
      "timestamp": "2026-03-09T14:30:02Z",
      "attestation": "<signature>"
    }
  }
}
```
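The header variant could be produced and checked roughly as follows. This is a sketch under stated assumptions: the draft does not specify a signature scheme, so HMAC-SHA256 with a shared key stands in for whatever signature agent A would actually use (most likely an asymmetric one), and the helper names `encode_provenance` and `decode_provenance` are hypothetical.

```python
import base64
import hashlib
import hmac
import json


def encode_provenance(chain: list, key: bytes) -> dict:
    """Serialize a provenance chain into the two headers shown above."""
    canonical = json.dumps(chain, sort_keys=True, separators=(",", ":")).encode()
    payload = base64.b64encode(canonical).decode()
    # Stand-in signature: HMAC over the encoded payload (assumption, see lead-in).
    signature = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    return {
        "X-Agent-Provenance": payload,
        "X-Agent-Provenance-Signature": signature,
    }


def decode_provenance(headers: dict, key: bytes) -> list:
    """Verify the signature, then decode the chain; raises ValueError on mismatch."""
    payload = headers["X-Agent-Provenance"]
    expected = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, headers["X-Agent-Provenance-Signature"]):
        raise ValueError("provenance signature mismatch")
    return json.loads(base64.b64decode(payload))
```

Verifying before decoding means agent B never parses a chain it cannot attribute to agent A.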

### 4.3 Provenance Compaction

For long chains, provenance can be compacted:

1. **Hash chaining**: replace full chain with Merkle tree root + most recent N entries
2. **Trust boundary summarization**: when crossing org boundaries, summarize internal provenance as a single attested record
3. **TTL-based pruning**: provenance entries older than a configurable TTL are archived (reference retained, detail available on request)
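A minimal sketch of the hash-chaining option, assuming the Merkle root is computed over the full chain and that entries are canonicalized as sorted-key JSON. The leaf and node prefixes and the duplicate-last-node handling of odd levels are illustrative choices, not something the draft prescribes; `merkle_root` and `compact` are hypothetical names.

```python
import hashlib
import json


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves: list) -> bytes:
    """Compute a Merkle root over byte-string leaves."""
    if not leaves:
        return _sha256(b"")
    # Distinct prefixes keep leaf hashes and interior hashes from colliding.
    level = [_sha256(b"\x00" + leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [
            _sha256(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0]


def compact(chain: list, keep: int) -> dict:
    """Replace a chain with its Merkle root plus the most recent `keep` entries."""
    canonical = [json.dumps(entry, sort_keys=True).encode() for entry in chain]
    return {
        "merkle_root": merkle_root(canonical).hex(),
        "total_entries": len(chain),
        "recent": chain[-keep:] if keep > 0 else [],
    }
```

A verifier that later obtains the archived entries can recompute the root and compare it with the compacted record.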

## 5. Real-Time Provenance Query

*Directly addresses Gap #84: Real-time AI agent explainability protocols.*

### 5.1 Query Interface

Any participant (user, operator, peer agent) can query provenance, subject to the privacy controls in Section 6.2:

```json
{
  "query:type": "explain_value",
  "query:value_ref": "val-d",
  "query:depth": "full",
  "query:format": "graph"
}
```

Response:

```json
{
  "explain:value_ref": "val-d",
  "explain:summary": "Email address extracted from meeting notes retrieved from cloud storage, combined with user-specified recipient name",
  "explain:graph": {
    "nodes": [
      {"id": "user_input", "trust": "trusted", "content_hint": "user query"},
      {"id": "tool_1:search_notes", "trust": "tool", "content_hint": "meeting notes"},
      {"id": "q_llm:extract", "trust": "untrusted", "content_hint": "extracted email"}
    ],
    "edges": [
      {"from": "tool_1:search_notes", "to": "q_llm:extract"},
      {"from": "q_llm:extract", "to": "val-d"},
      {"from": "user_input", "to": "val-d"}
    ]
  },
  "explain:trust_assessment": "UNTRUSTED — depends on quarantined LLM extraction from tool output",
  "explain:timestamp": "2026-03-09T14:30:05Z"
}
```
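One way a responder might derive the trust assessment from such a graph is to walk the edges backwards from the queried value and look for untrusted ancestors. The sketch below assumes the `nodes`/`edges` layout of the response above; the function names and the rule that only nodes marked `"untrusted"` taint a value are assumptions for illustration.

```python
def ancestors(graph: dict, target: str) -> set:
    """Return every node id that can reach `target` by following edges."""
    parents = {}
    for edge in graph["edges"]:
        parents.setdefault(edge["to"], []).append(edge["from"])
    seen, stack = set(), [target]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:  # also guards against cycles
                seen.add(parent)
                stack.append(parent)
    return seen


def untrusted_ancestors(graph: dict, target: str) -> list:
    """List the ancestors of `target` whose trust is 'untrusted', sorted by id."""
    trust = {node["id"]: node["trust"] for node in graph["nodes"]}
    return sorted(n for n in ancestors(graph, target) if trust.get(n) == "untrusted")
```

A non-empty result corresponds to an assessment like the `UNTRUSTED` one in the response; an empty list only means that no ancestor is explicitly marked untrusted.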

### 5.2 Streaming Provenance

For long-running agent tasks, provenance can be streamed:

- SSE (Server-Sent Events) or WebSocket connection
- Each tool invocation emits a provenance event
- Operators see the dependency graph build in real time
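For the SSE option, each provenance event could be framed as below. The event name `provenance` and the use of a running sequence number as the SSE `id` are assumptions of this sketch; the draft does not define an event schema.

```python
import json
from typing import Iterable, Iterator


def sse_frames(events: Iterable[dict]) -> Iterator[str]:
    """Yield one Server-Sent Events frame per provenance event."""
    for seq, event in enumerate(events, start=1):
        # Compact JSON contains no raw newlines, so one `data:` line suffices.
        data = json.dumps(event, separators=(",", ":"))
        yield f"id: {seq}\nevent: provenance\ndata: {data}\n\n"


frames = list(sse_frames([{"tool": "read_email"}, {"tool": "search_notes"}]))
```

Because the function is a generator, a server can write each frame to the open connection as soon as the corresponding tool invocation completes.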

## 6. Privacy-Preserving Provenance

*Addresses Gap #93: Privacy-preserving agent-to-agent communication.*

### 6.1 The Provenance Privacy Paradox

Provenance metadata can itself leak sensitive information:

- Knowing *which tools were called* reveals the user's intent
- Knowing *inner sources* (e.g., email senders) reveals the user's contacts
- The transformation chain reveals the agent's reasoning process

### 6.2 Privacy Controls

1. **Selective disclosure**: agents can share provenance summaries (trust level, origin type) without full chains
2. **Zero-knowledge trust**: "this value is trusted" attested by a trusted third party, without revealing the full provenance
3. **Provenance redaction**: when crossing privacy boundaries, inner sources are replaced with attestations
4. **Need-to-know**: provenance detail levels based on the requester's authorization

```json
{
  "prov:origin": {
    "type": "attested",
    "attestor": "org1.example",
    "trust_level": "trusted",
    "detail": "redacted — contact org1.example for full provenance"
  }
}
```
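The redaction and selective-disclosure controls might be applied as in the following sketch. It assumes records shaped like the example in Section 3.1; `redact_for_boundary` and `summarize` are hypothetical helpers, and dropping the transformation chain on redaction is an assumption motivated by Section 6.1 rather than a rule stated in the draft.

```python
import copy


def redact_for_boundary(record: dict, attestor: str) -> dict:
    """Return a copy whose origin is replaced by an attestation from `attestor`."""
    redacted = copy.deepcopy(record)
    redacted["prov:origin"] = {
        "type": "attested",
        "attestor": attestor,
        "trust_level": record["prov:classification"]["trust_level"],
        "detail": f"redacted; contact {attestor} for full provenance",
    }
    # Assumption: the transformation chain is also withheld, since it can
    # reveal the agent's reasoning process.
    redacted.pop("prov:transformations", None)
    return redacted


def summarize(record: dict) -> dict:
    """Selective disclosure: expose only the origin type and trust level."""
    return {
        "prov:id": record["prov:id"],
        "origin_type": record["prov:origin"]["type"],
        "trust_level": record["prov:classification"]["trust_level"],
    }
```

Both helpers return new objects and leave the full record untouched, so the originating agent can still answer authorized requests for detail.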

## 7. Relationship to ECT

Execution Context Tokens (draft-nennemann-wimse-ect) record *what happened* in a DAG of signed tokens. Provenance tracking records *where data came from*. They are complementary:

| Aspect | ECT | This Draft |
|--------|-----|-----------|
| **Tracks** | Task execution events | Data origin and flow |
| **Granularity** | Per-task | Per-value |
| **Format** | JWT with DAG links | JSON provenance records |
| **Purpose** | Audit "what was done" | Explain "why this data" |

Integration: ECT claims can reference provenance records, and provenance records can link to ECT task IDs.

## 8. Security Considerations

- Provenance records must be integrity-protected (signed by the producing agent)
- Provenance forgery (claiming a higher trust level) must be detectable via attestation chains
- Provenance metadata size can be significant — compaction mechanisms are essential
- Timing information in provenance can leak operational patterns

## 9. Open Questions

1. **Standard vocabulary**: should provenance types be extensible or fixed?
2. **Cross-standard alignment**: how does this relate to W3C PROV (provenance ontology)?
3. **Storage**: who is responsible for storing provenance long-term? Each agent? A shared ledger?
4. **Legal implications**: does provenance tracking create liability for organizations that produce it?

## 10. References

- Debenedetti et al. "Defeating Prompt Injections by Design." arXiv:2503.18813, 2025.
- Denning. "A lattice model of secure information flow." CACM, 1976.
- W3C. "PROV-DM: The PROV Data Model." W3C Recommendation, 2013.
- draft-nennemann-wimse-ect (Execution Context Tokens)
- draft-ietf-wimse-arch (WIMSE architecture)