---
title: "Data Provenance Tracking Protocol for AI Agent Communications"
draft_name: draft-nennemann-ai-agent-provenance-00
intended_wg: SECDISPATCH or WIMSE
status: outline
gaps_addressed: [84, 88, 93]
camel_sections: [5.3, 5.4]
date: 2026-03-09
---

# Data Provenance Tracking Protocol for AI Agent Communications

## 1. Problem Statement

When AI agents process data through multi-step tool-calling pipelines, the **origin and transformation history** of each piece of data is lost. This creates three critical problems:

1. **No explainability** (Gap #84): when an agent makes a decision, there is no standard way to trace *which data influenced it* and *where that data came from* in real time
2. **Incompatible audit trails** (Gap #88): different agent platforms log decisions in incompatible formats, making cross-system forensics impossible
3. **Privacy leakage** (Gap #93): without provenance tracking, agents cannot enforce data handling policies, so private training data, user interactions, and proprietary algorithms may leak through tool calls

CaMeL (Debenedetti et al., arXiv:2503.18813) demonstrates that tracking provenance at the **individual value level** (not just the message level) is both feasible and essential for security. Every variable in CaMeL's interpreter carries metadata about its sources and allowed readers.

## 2. Scope

This document defines:

1. A **provenance record format** for tracking data origin and transformation chains
2. A **provenance propagation protocol** for maintaining provenance across agent boundaries
3. A **provenance query interface** for real-time explainability
4. **Privacy constraints** on provenance metadata itself

## 3. Provenance Model

### 3.1 Provenance Record

Every data value in an agent system carries a provenance record:

```json
{
  "prov:id": "prov-8c3a2d",
  "prov:value_ref": "val-email-body",
  "prov:origin": {
    "type": "tool_output",
    "tool": "read_email",
    "invocation_id": "inv-4f2a1b",
    "agent_id": "agent-a@org1.example",
    "timestamp": "2026-03-09T14:30:00Z",
    "inner_sources": [
      {
        "type": "external_entity",
        "identifier": "sender:bob@example.com",
        "trust_level": "untrusted"
      }
    ]
  },
  "prov:transformations": [
    {
      "type": "llm_extraction",
      "model_role": "quarantined",
      "operation": "extract_email_address",
      "input_provenance": ["prov-7b2a1c"],
      "timestamp": "2026-03-09T14:30:01Z"
    }
  ],
  "prov:classification": {
    "trust_level": "untrusted",
    "sensitivity": "pii",
    "readers": ["user", "bob@example.com"]
  }
}
```

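As a rough illustration of how a consumer might load such a record, the sketch below parses the JSON into typed Python structures and rejects records that lack a top-level field. The class names, `parse_record`, and the set of fields treated as required are assumptions made for this example; the outline does not yet say which fields are mandatory.

```python
import json
from dataclasses import dataclass, field

# Assumption: these top-level keys are treated as mandatory for illustration.
REQUIRED_KEYS = ("prov:id", "prov:value_ref", "prov:origin", "prov:classification")


@dataclass
class Classification:
    trust_level: str
    sensitivity: str
    readers: list = field(default_factory=list)


@dataclass
class ProvenanceRecord:
    prov_id: str
    value_ref: str
    origin: dict
    classification: Classification
    transformations: list = field(default_factory=list)


def parse_record(raw: str) -> ProvenanceRecord:
    """Parse a JSON provenance record; raise ValueError if a required key is missing."""
    doc = json.loads(raw)
    missing = [key for key in REQUIRED_KEYS if key not in doc]
    if missing:
        raise ValueError(f"missing provenance fields: {missing}")
    cls = doc["prov:classification"]
    return ProvenanceRecord(
        prov_id=doc["prov:id"],
        value_ref=doc["prov:value_ref"],
        origin=doc["prov:origin"],
        classification=Classification(
            trust_level=cls["trust_level"],
            # Hypothetical fallback when no sensitivity label is present.
            sensitivity=cls.get("sensitivity", "unspecified"),
            readers=list(cls.get("readers", [])),
        ),
        transformations=list(doc.get("prov:transformations", [])),
    )
```

Keeping `origin` and `transformations` as plain dictionaries leaves room for the origin and transformation vocabularies to evolve without changing the loader.
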
### 3.2 Origin Types

| Origin Type | Description | Trust Default |
|-------------|-------------|---------------|
| `user_input` | Directly from the authenticated user's query | trusted |
| `tool_output` | Returned by a tool invocation | depends on tool |
| `llm_generation` | Generated by an LLM (P-LLM or Q-LLM) | depends on role |
| `literal` | Hardcoded in the execution plan | trusted |
| `external_entity` | Inner source within tool data (e.g., email sender) | untrusted |
| `derived` | Computed from other values | min(input trust levels) |

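The trust defaults in the table can be sketched as a small lookup. This is only an illustration: the numeric ordering of trust levels, the `context_trust` parameter (standing in for "depends on tool" and "depends on role"), and the fallback to `untrusted` when no such context is given are all assumptions, not part of the outline.

```python
# Assumption: trust levels form a total order, lowest first.
TRUST_ORDER = {"untrusted": 0, "user_approved": 1, "trusted": 2}

# Origin types whose default does not depend on context (from the table).
FIXED_DEFAULTS = {
    "user_input": "trusted",
    "literal": "trusted",
    "external_entity": "untrusted",
}


def min_trust(levels):
    """Return the lowest trust level among the inputs."""
    levels = list(levels)
    if not levels:
        raise ValueError("a derived value needs at least one input")
    return min(levels, key=TRUST_ORDER.__getitem__)


def default_trust(origin_type, *, input_trusts=(), context_trust=None):
    """Pick a default trust level for a value of the given origin type."""
    if origin_type in FIXED_DEFAULTS:
        return FIXED_DEFAULTS[origin_type]
    if origin_type == "derived":
        return min_trust(input_trusts)
    if origin_type in ("tool_output", "llm_generation"):
        # Depends on the tool or LLM role; assume the worst when unknown.
        return context_trust or "untrusted"
    raise ValueError(f"unknown origin type: {origin_type}")
```

For example, `default_trust("derived", input_trusts=["trusted", "untrusted"])` yields `"untrusted"`, matching the `min(input trust levels)` entry.
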
### 3.3 Transformation Types

| Transform Type | Description | Provenance Effect |
|---------------|-------------|-------------------|
| `llm_extraction` | Q-LLM parses unstructured → structured | inherits all input provenance |
| `computation` | Deterministic operation (concat, filter) | union of input provenance |
| `aggregation` | Multiple values combined | union of all input provenance |
| `user_approval` | User explicitly approved a value | upgrades trust to "user_approved" |
| `redaction` | Sensitive content removed | may upgrade trust classification |

## 4. Propagation Protocol

### 4.1 Intra-Agent Propagation

Within a single agent, the execution engine (interpreter) maintains provenance automatically:

```
val_a = tool_1()              → prov: {origin: tool_1}
val_b = tool_2()              → prov: {origin: tool_2}
val_c = extract(val_a)        → prov: {origin: tool_1, transform: extraction}
val_d = combine(val_b, val_c) → prov: {origin: [tool_1, tool_2], transform: computation}
```

**Rule**: derived values inherit the **union** of all input provenances and the **minimum** trust level.

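A minimal sketch of this rule, assuming a hypothetical `Tracked` wrapper and a two-level trust order (neither is defined by this outline), might look like the following. It mirrors the four-step trace above.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: a two-level trust order is enough for this illustration.
TRUST_ORDER = {"untrusted": 0, "trusted": 1}


@dataclass(frozen=True)
class Tracked:
    """A value bundled with the set of tools it originated from."""
    value: object
    origins: frozenset
    trust: str
    transform: Optional[str] = None


def from_tool(tool_name, value, trust):
    """Wrap a raw tool result; its only origin is the tool itself."""
    return Tracked(value, frozenset({tool_name}), trust)


def derive(transform, fn, *inputs):
    """Apply fn to the inputs; union their origins and take the minimum trust."""
    origins = frozenset().union(*(i.origins for i in inputs))
    trust = min((i.trust for i in inputs), key=TRUST_ORDER.__getitem__)
    return Tracked(fn(*(i.value for i in inputs)), origins, trust, transform)


val_a = from_tool("tool_1", "Contact: bob@example.com", "untrusted")
val_b = from_tool("tool_2", "Bob", "trusted")
val_c = derive("extraction", lambda text: text.split(": ")[1], val_a)
val_d = derive("computation", lambda name, mail: f"{name} <{mail}>", val_b, val_c)
```

Here `val_d` ends up with origins `{tool_1, tool_2}` and trust `untrusted`, because one of its inputs traces back to an untrusted tool result.
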
### 4.2 Inter-Agent Propagation

When data crosses agent boundaries (via A2A, HTTP, message queues):

```
Agent A                             Agent B
┌──────────┐                        ┌────────────────┐
│ val_d    │                        │                │
│ prov: {  │  ──── message ────►    │ val_e          │
│   A's    │  with provenance       │ prov: {        │
│   chain  │  header/metadata       │   A's chain +  │
│ }        │                        │   hop record   │
└──────────┘                        │ }              │
                                    └────────────────┘
```

Provenance headers in inter-agent messages:

```http
POST /agent-b/task HTTP/1.1
Content-Type: application/json
X-Agent-Provenance: eyJwcm92OmlkIjoicHJvdi04YzNhMmQi... (base64-encoded provenance chain)
X-Agent-Provenance-Signature: <signed by agent A>
```

Or as a structured field in A2A messages:

```json
{
  "a2a:message": { ... },
  "a2a:provenance": {
    "chain": [ ... ],
    "hop": {
      "agent_id": "agent-a@org1.example",
      "timestamp": "2026-03-09T14:30:02Z",
      "attestation": "<signature>"
    }
  }
}
```

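For the HTTP header variant above, a sender and receiver could be sketched as follows. The outline does not specify a signature scheme, so this example substitutes HMAC-SHA256 over a shared key purely as a stand-in; a real deployment would more likely use asymmetric signatures tied to agent identity. The function names are hypothetical.

```python
import base64
import hashlib
import hmac
import json


def encode_provenance_headers(chain, key: bytes) -> dict:
    """Serialize the chain, base64-encode it, and sign the encoded token."""
    payload = json.dumps(chain, separators=(",", ":"), sort_keys=True).encode()
    token = base64.b64encode(payload).decode("ascii")
    # Stand-in signature: HMAC over the token (assumption, not from the draft).
    signature = hmac.new(key, token.encode("ascii"), hashlib.sha256).hexdigest()
    return {
        "X-Agent-Provenance": token,
        "X-Agent-Provenance-Signature": signature,
    }


def decode_provenance_headers(headers: dict, key: bytes):
    """Verify the signature, then decode the chain; raise ValueError on mismatch."""
    token = headers["X-Agent-Provenance"]
    expected = hmac.new(key, token.encode("ascii"), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, headers["X-Agent-Provenance-Signature"]):
        raise ValueError("provenance signature mismatch")
    return json.loads(base64.b64decode(token))
```

Verifying before decoding means a receiver never parses a chain whose integrity it cannot confirm.
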
### 4.3 Provenance Compaction

For long chains, provenance can be compacted:

1. **Hash chaining**: replace full chain with Merkle tree root + most recent N entries
2. **Trust boundary summarization**: when crossing org boundaries, summarize internal provenance as a single attested record
3. **TTL-based pruning**: provenance entries older than a configurable TTL are archived (reference retained, detail available on request)

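The first option could be sketched roughly as below. The leaf and node hashing scheme, the handling of odd levels by duplicating the last node, and the output field names (`merkle_root`, `total_entries`, `entries`) are assumptions chosen for brevity, not a proposed wire format.

```python
import hashlib
import json


def _leaf_hash(entry) -> bytes:
    """Hash one provenance entry using a canonical JSON encoding."""
    data = json.dumps(entry, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(b"\x00" + data).digest()


def merkle_root(entries) -> str:
    """Compute a Merkle root over all entries, returned as a hex string."""
    if not entries:
        raise ValueError("cannot compute a root over an empty chain")
    level = [_leaf_hash(entry) for entry in entries]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # simplification: duplicate the last node
        level = [
            hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


def compact_chain(chain, keep_last: int) -> dict:
    """Keep the newest keep_last entries and commit to the rest via the root."""
    if keep_last < 0:
        raise ValueError("keep_last must be non-negative")
    if len(chain) <= keep_last:
        return {"compacted": False, "entries": list(chain)}
    return {
        "compacted": True,
        "merkle_root": merkle_root(chain),  # covers the full original chain
        "total_entries": len(chain),
        "entries": list(chain[len(chain) - keep_last:]),
    }
```

Computing the root over the full chain, including the retained entries, lets a holder of the archived entries later prove that the detail they disclose matches what was committed to.
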
## 5. Real-Time Provenance Query

*Directly addresses Gap #84: Real-time AI agent explainability protocols.*

### 5.1 Query Interface

Any participant (user, operator, peer agent) can query provenance, subject to the need-to-know controls in Section 6.2:

```json
{
  "query:type": "explain_value",
  "query:value_ref": "val-d",
  "query:depth": "full",
  "query:format": "graph"
}
```

Response:

```json
{
  "explain:value_ref": "val-d",
  "explain:summary": "Email address extracted from meeting notes retrieved from cloud storage, combined with user-specified recipient name",
  "explain:graph": {
    "nodes": [
      {"id": "user_input", "trust": "trusted", "content_hint": "user query"},
      {"id": "tool_1:search_notes", "trust": "tool", "content_hint": "meeting notes"},
      {"id": "q_llm:extract", "trust": "untrusted", "content_hint": "extracted email"}
    ],
    "edges": [
      {"from": "tool_1:search_notes", "to": "q_llm:extract"},
      {"from": "q_llm:extract", "to": "val-d"}
    ]
  },
  "explain:trust_assessment": "UNTRUSTED: depends on quarantined LLM extraction from tool output",
  "explain:timestamp": "2026-03-09T14:30:05Z"
}
```

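One way a responder might assemble the graph portion of such a response is to walk the dependency links upstream from the queried value. The sketch below assumes a hypothetical in-memory `store` that maps each node id to its trust level and direct inputs; the store layout and the simple "any untrusted ancestor" assessment are illustrative choices, not part of the outline.

```python
def explain_value(store: dict, value_ref: str) -> dict:
    """Collect every upstream node and edge for value_ref and assess trust."""
    nodes, edges, seen = [], [], set()
    stack = [value_ref]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # shared ancestors are only expanded once
        seen.add(current)
        entry = store[current]
        nodes.append({"id": current, "trust": entry["trust"]})
        for parent in entry["inputs"]:
            edges.append({"from": parent, "to": current})
            stack.append(parent)
    tainted = any(node["trust"] == "untrusted" for node in nodes)
    return {
        "explain:value_ref": value_ref,
        "explain:graph": {"nodes": nodes, "edges": edges},
        "explain:trust_assessment": "UNTRUSTED" if tainted else "TRUSTED",
    }


# Hypothetical store contents, loosely modeled on the example response.
store = {
    "user_input": {"trust": "trusted", "inputs": []},
    "tool_1:search_notes": {"trust": "trusted", "inputs": []},
    "q_llm:extract": {"trust": "untrusted", "inputs": ["tool_1:search_notes"]},
    "val-d": {"trust": "untrusted", "inputs": ["q_llm:extract", "user_input"]},
}
result = explain_value(store, "val-d")
```

A `query:depth` other than `full` could be supported by bounding the traversal, which this sketch leaves out.
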
### 5.2 Streaming Provenance

For long-running agent tasks, provenance can be streamed:

- SSE (Server-Sent Events) or WebSocket connection
- Each tool invocation emits a provenance event
- Operators see the dependency graph build in real time

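For the SSE option, each tool invocation could be framed as one event in the standard `id:` / `event:` / `data:` text format. The event name `provenance` and the payload shape below are assumptions made for illustration.

```python
import json


def format_sse(event_type: str, payload: dict, event_id=None) -> str:
    """Frame a JSON payload as a single Server-Sent Events message."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    lines.append(f"event: {event_type}")
    # json.dumps emits one line by default; split defensively anyway.
    for line in json.dumps(payload).splitlines():
        lines.append(f"data: {line}")
    return "\n".join(lines) + "\n\n"  # a blank line terminates the event


def provenance_events(invocations):
    """Yield one SSE message per tool invocation, numbered from 1."""
    for seq, invocation in enumerate(invocations, start=1):
        yield format_sse("provenance", invocation, event_id=seq)
```

A client that reconnects can send the last `id` it saw so the server resumes from the next invocation instead of replaying the whole graph.
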
## 6. Privacy-Preserving Provenance

*Addresses Gap #93: Privacy-preserving agent-to-agent communication.*

### 6.1 The Provenance Privacy Paradox

Provenance metadata can itself leak sensitive information:

- Knowing *which tools were called* reveals the user's intent
- Knowing *inner sources* (e.g., email senders) reveals the user's contacts
- The transformation chain reveals the agent's reasoning process

### 6.2 Privacy Controls

1. **Selective disclosure**: agents can share provenance summaries (trust level, origin type) without full chains
2. **Zero-knowledge trust**: "this value is trusted" attested by a trusted third party, without revealing the full provenance
3. **Provenance redaction**: when crossing privacy boundaries, inner sources are replaced with attestations
4. **Need-to-know**: provenance detail levels based on the requester's authorization

```json
{
  "prov:origin": {
    "type": "attested",
    "attestor": "org1.example",
    "trust_level": "trusted",
    "detail": "redacted; contact org1.example for full provenance"
  }
}
```

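The controls above could be combined into a single disclosure function keyed on the requester's authorization. In this sketch the level names (`full`, `summary`, `attested`) and the `disclose` function are hypothetical, and the attested form simply passes through the record's own trust level rather than modeling a real third-party attestation.

```python
import copy


def disclose(record: dict, level: str, attestor: str) -> dict:
    """Return a view of the record appropriate to the requester's level."""
    origin = record["prov:origin"]
    trust = record["prov:classification"]["trust_level"]
    if level == "full":
        return copy.deepcopy(record)
    if level == "summary":
        # Selective disclosure: origin type and trust level only.
        return {
            "prov:id": record["prov:id"],
            "prov:origin": {"type": origin["type"]},
            "prov:classification": {"trust_level": trust},
        }
    if level == "attested":
        # Redaction: inner sources are replaced by an attestation stub.
        return {
            "prov:origin": {
                "type": "attested",
                "attestor": attestor,
                "trust_level": trust,
                "detail": "redacted",
            }
        }
    raise ValueError(f"unknown disclosure level: {level}")
```

Building each reduced view from scratch, instead of deleting fields from a copy, keeps newly added sensitive fields from leaking by default.
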
## 7. Relationship to ECT

Execution Context Tokens (draft-nennemann-wimse-ect) record *what happened* in a DAG of signed tokens. Provenance tracking records *where data came from*. They are complementary:

| Aspect | ECT | This Draft |
|--------|-----|-----------|
| **Tracks** | Task execution events | Data origin and flow |
| **Granularity** | Per-task | Per-value |
| **Format** | JWT with DAG links | JSON provenance records |
| **Purpose** | Audit "what was done" | Explain "why this data" |

Integration: ECT claims can reference provenance records, and provenance records can link to ECT task IDs.

## 8. Security Considerations

- Provenance records must be integrity-protected (signed by the producing agent)
- Provenance forgery (claiming a higher trust level) must be detectable via attestation chains
- Provenance metadata can grow large, so the compaction mechanisms in Section 4.3 are essential
- Timing information in provenance can leak operational patterns

## 9. Open Questions

1. **Standard vocabulary**: should provenance types be extensible or fixed?
2. **Cross-standard alignment**: how does this relate to W3C PROV (provenance ontology)?
3. **Storage**: who is responsible for storing provenance long-term? Each agent? A shared ledger?
4. **Legal implications**: does provenance tracking create liability for organizations that produce it?

## 10. References

- Debenedetti et al. "Defeating Prompt Injections by Design." arXiv:2503.18813, 2025.
- Denning. "A lattice model of secure information flow." CACM, 1976.
- W3C. "PROV-DM: The PROV Data Model." W3C Recommendation, 2013.
- draft-nennemann-wimse-ect (Execution Context Tokens)
- draft-ietf-wimse-arch (WIMSE architecture)