# Draft Outline

## Abstract

State that the document defines experimental recovery semantics for multi-agent task execution: failure signaling, rollback requests, rollback results, and checkpoint references. Make clear that it is protocol-agnostic and intended to improve interoperable recovery behavior across agent ecosystems.

## Section plan

1. Introduction
2. Terminology
3. Problem Statement and Design Goals
4. Recovery Model Overview
5. Event Types and Required Fields
6. Task States and Recovery Procedures
7. Rollback Scope and Dependency Handling
8. Error Conditions and Partial Rollback
9. Security Considerations
10. Privacy Considerations
11. IANA Considerations
12. References

## Author guidance by section

### 1. Introduction

Explain why autonomous multi-agent systems need interoperable recovery behavior. Keep this grounded in failure propagation and operational safety, not generic AI rhetoric.

### 2. Terminology

Define only the core terms needed for this document: task, dependency, checkpoint, failure event, rollback set, recovery record, coordinator. Keep terms stable and conservative.

### 3. Problem Statement and Design Goals

Describe the exact gap: current drafts define communication and orchestration patterns, but no common rollback semantics. Include explicit goals such as idempotency, partial rollback transparency, and protocol-agnostic applicability.

### 4. Recovery Model Overview

Describe the model at a high level before any field-level detail. Separate local failure handling from cross-agent recovery signaling. Make clear what this document does not define.

### 5. Event Types and Required Fields

Define `checkpoint`, `failure`, `rollback-request`, and `rollback-result`. This section must specify required versus optional fields and avoid vague "metadata may include" language where interoperability depends on a field.
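
As a non-normative aid for authors, the required-versus-optional split can be prototyped before the field tables are written. The sketch below is only an illustration: every field name (`request_id`, `failure_class`, `applied_task_ids`, and so on) and the helper `missing_required` are hypothetical placeholders, not fields this outline has agreed on. Fields without a default stand in for required fields; fields with a default stand in for optional ones.

```python
from dataclasses import MISSING, dataclass, fields
from typing import List, Optional, Tuple


# All field names below are hypothetical; the draft still has to choose them.
@dataclass(frozen=True)
class Checkpoint:
    checkpoint_id: str            # required
    task_id: str                  # required
    label: Optional[str] = None   # optional


@dataclass(frozen=True)
class Failure:
    event_id: str                              # required
    task_id: str                               # required
    failure_class: str                         # required
    last_checkpoint_id: Optional[str] = None   # optional
    detail: Optional[str] = None               # optional


@dataclass(frozen=True)
class RollbackRequest:
    request_id: str                         # required
    target_checkpoint_id: str               # required
    task_ids: Tuple[str, ...]               # required
    reason_event_id: Optional[str] = None   # optional


@dataclass(frozen=True)
class RollbackResult:
    request_id: str                          # required
    outcome: str                             # required
    applied_task_ids: Tuple[str, ...]        # required
    skipped_task_ids: Tuple[str, ...] = ()   # optional


def missing_required(event_type, payload: dict) -> List[str]:
    """Return the names of required fields (no default) absent from payload."""
    return [
        f.name
        for f in fields(event_type)
        if f.default is MISSING
        and f.default_factory is MISSING
        and f.name not in payload
    ]
```

A receiver could run such a check before acting on an event, for example `missing_required(RollbackRequest, {"request_id": "r1"})`, and reject the event if the list is non-empty. Writing the section against a model like this makes it harder to leave an interoperability-critical field in "may include" territory.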

### 6. Task States and Recovery Procedures

Define the state transitions relevant to failure and rollback. Include procedure ordering: detect failure, emit failure event, decide rollback scope, send rollback request, emit rollback result. If escalation is possible, say when.
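
The procedure ordering above can be checked against a small transition table while the section is drafted. This is a sketch under stated assumptions: the state names and the allowed transitions are hypothetical choices made for illustration, including the assumption that a partial rollback may be retried or escalated.

```python
from enum import Enum


class TaskState(Enum):
    # Hypothetical state names; the draft must define the real set.
    RUNNING = "running"
    FAILED = "failed"
    ROLLBACK_PENDING = "rollback-pending"
    ROLLED_BACK = "rolled-back"
    PARTIALLY_ROLLED_BACK = "partially-rolled-back"
    ESCALATED = "escalated"


# Assumed transitions, following the ordering: detect failure, emit failure
# event, decide scope, send rollback request, emit rollback result.
ALLOWED = {
    TaskState.RUNNING: {TaskState.FAILED},
    TaskState.FAILED: {TaskState.ROLLBACK_PENDING, TaskState.ESCALATED},
    TaskState.ROLLBACK_PENDING: {
        TaskState.ROLLED_BACK,
        TaskState.PARTIALLY_ROLLED_BACK,
        TaskState.ESCALATED,
    },
    TaskState.PARTIALLY_ROLLED_BACK: {
        TaskState.ROLLBACK_PENDING,
        TaskState.ESCALATED,
    },
    TaskState.ROLLED_BACK: set(),   # terminal in this sketch
    TaskState.ESCALATED: set(),     # terminal in this sketch
}


def transition(current: TaskState, target: TaskState) -> TaskState:
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

In this sketch a task cannot jump from `RUNNING` straight to `ROLLED_BACK`; a failure must be recorded first. Whether the draft wants that constraint is a decision for the section to state explicitly.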

### 7. Rollback Scope and Dependency Handling

Define how dependencies influence rollback. Be explicit about direct versus transitive effects, what happens when scope is uncertain, and how actual applied scope is reported back.
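
One way to make the direct-versus-transitive distinction concrete is a breadth-first walk over a dependency map. The sketch below assumes a hypothetical `dependents` mapping from a task identifier to the tasks that directly consume its output; the function names and the shape of the report are illustrative only and do not reflect any agreed wire format.

```python
from collections import deque
from typing import Dict, Iterable, Set


def rollback_scope(failed: str, dependents: Dict[str, Iterable[str]]) -> dict:
    """Split the tasks affected by a failure into direct and transitive sets."""
    direct: Set[str] = set(dependents.get(failed, ())) - {failed}
    seen: Set[str] = {failed}
    queue = deque([failed])
    while queue:
        task = queue.popleft()
        for nxt in dependents.get(task, ()):
            if nxt not in seen:      # guards against dependency cycles
                seen.add(nxt)
                queue.append(nxt)
    transitive = seen - direct - {failed}
    return {"failed": failed, "direct": direct, "transitive": transitive}


def scope_report(requested: Set[str], applied: Set[str]) -> dict:
    """Report the scope actually applied alongside what was requested."""
    return {
        "requested": sorted(requested),
        "applied": sorted(applied & requested),
        "not_applied": sorted(requested - applied),
        "complete": requested <= applied,
    }
```

With `dependents = {"a": ["b", "c"], "b": ["d"]}`, a failure of `a` puts `b` and `c` in the direct set and `d` in the transitive set. The sketch does not address uncertain scope; the section still needs to say what an agent does when the dependency map is incomplete.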

### 8. Error Conditions and Partial Rollback

Handle non-reversible tasks, refusal, timeout, duplicate requests, and partial success. For each case, state the required agent behavior and the outcome that is reported. This section is important for implementability and must not collapse into generic prose.
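
Duplicate requests and partial success can be illustrated together. The sketch below assumes that each rollback request carries a unique `request_id` and that the receiver caches the first result per identifier; the class `RollbackHandler`, the `undo` callback, and the outcome strings `success`, `partial`, and `failed` are hypothetical names, not values this outline has settled on.

```python
from typing import Callable, Dict, Iterable, List


class RollbackHandler:
    """Applies a rollback at most once per request_id (idempotent replay)."""

    def __init__(self, undo: Callable[[str], bool]):
        # undo(task_id) returns True if the task was reverted, False if the
        # task is non-reversible or the attempt failed.
        self._undo = undo
        self._results: Dict[str, dict] = {}

    def handle(self, request_id: str, task_ids: Iterable[str]) -> dict:
        # A duplicate request returns the stored result without redoing work.
        if request_id in self._results:
            return self._results[request_id]

        applied: List[str] = []
        failed: List[str] = []
        for task_id in task_ids:
            (applied if self._undo(task_id) else failed).append(task_id)

        if not failed:
            outcome = "success"   # also covers an empty task list
        elif applied:
            outcome = "partial"
        else:
            outcome = "failed"

        result = {
            "request_id": request_id,
            "outcome": outcome,
            "applied": applied,
            "not_applied": failed,
        }
        self._results[request_id] = result
        return result
```

Sending the same `request_id` twice returns the identical result and calls `undo` only during the first delivery. The sketch leaves out refusal and timeout, and it keeps results in memory indefinitely; the draft would need to say how long a receiver must remember a request.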

### 9. Security Considerations

Address spoofing, replay, unauthorized rollback, false failure signaling, topology leakage, and abuse of partial rollback states. The section should be mechanism-specific: tie each threat to the event or field it targets rather than listing generic risks.

### 10. Privacy Considerations

Address exposure of task identifiers, failure causes, dependency graphs, and sensitive operational details.

### 11. IANA Considerations

Either state explicitly that the document has no IANA actions, or request small registries for failure classes and rollback outcomes. Do not hand-wave this.

### 12. References

Use placeholders where necessary, but include adjacent drafts that informed the design and any underlying execution-evidence substrate if referenced.

## Issues that must not be hand-waved

- which fields are mandatory in each event
- what counts as a successful versus partial rollback
- how rollback requests remain idempotent
- what an agent does when a requested rollback is impossible
- how dependency-driven rollback scope is determined and reported
- what security properties the mechanism relies on from lower layers