# User Spec

## Topic

Agent Error Recovery and Rollback for Multi-Agent Systems

## Goal

Produce a credible IETF-style Internet-Draft for a narrowly scoped mechanism that standardizes how cooperating agents report failures, define rollback scope, and execute coordinated recovery without cascading damage.

## Intended status

Experimental.

Rationale: the problem is real and under-specified, but the ecosystem is still young, and the mechanism should not claim deployment consensus it does not yet have.

## Problem to solve

Current AI-agent and autonomous-operations drafts define communication, identity, and orchestration patterns, but the landscape analysis shows no common mechanism for:

- signaling execution failure in a machine-actionable way
- declaring rollback boundaries and blast radius
- coordinating rollback across dependent agents
- recording recovery outcomes for audit and future trust decisions

The absence of such a mechanism creates significant interoperability and safety risks for autonomous systems that act across multiple services or domains.
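
To make "machine-actionable" concrete, the following is a minimal sketch of what a failure signal could carry. Every field name and the JSON encoding are assumptions made for illustration; the draft itself still has to decide the mandatory fields and the carrier.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class FailureSignal:
    """Hypothetical failure signal; all field names are illustrative assumptions."""

    signal_id: str                        # unique per signal, usable for de-duplication
    agent_id: str                         # agent reporting the failure
    execution_id: str                     # execution that failed
    failure_class: str                    # e.g. "timeout", "precondition", "internal"
    checkpoint_id: Optional[str] = None   # last known-good checkpoint, if any
    rollback_set: List[str] = field(default_factory=list)  # executions proposed for rollback

    def to_json(self) -> str:
        # Sorted keys give a stable encoding, which helps when comparing or signing.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "FailureSignal":
        return cls(**json.loads(text))
```

A receiver that can parse this object knows which execution failed, where it could be restored from, and which executions the sender proposes to roll back, which covers the first three gaps in the list above.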

## What must be true in the final draft

- The draft stays tightly scoped to recovery and rollback semantics, not a full agent architecture.
- The mechanism is protocol-agnostic enough to work across multiple agent ecosystems.
- The draft defines concrete states, triggers, and recovery procedures that two implementers could follow consistently.
- Security Considerations meaningfully address spoofed rollback, unauthorized override, replay, and denial-of-service by false failure signaling.
- The text is shaped like a real Internet-Draft, not a product design memo.
- The draft clearly states what is in scope now and what is deferred to later work such as richer workflow orchestration or dynamic trust scoring.
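
The spoofing and replay requirements above can be sketched as a receiver-side check. This is an illustration only: the shared-key HMAC stands in for whatever authentication the draft reuses from existing identity work, and the names `sign` and `SignalVerifier` are hypothetical.

```python
import hashlib
import hmac
from typing import Set


def sign(key: bytes, signal_id: str, payload: bytes) -> str:
    """Bind the signal identifier to the payload with an HMAC-SHA256 tag."""
    # Assumes signal_id contains no newline, so the separator is unambiguous.
    message = signal_id.encode("utf-8") + b"\n" + payload
    return hmac.new(key, message, hashlib.sha256).hexdigest()


class SignalVerifier:
    """Accepts a rollback or failure signal at most once, and only if authentic."""

    def __init__(self, key: bytes) -> None:
        self._key = key
        self._seen: Set[str] = set()  # identifiers already accepted

    def accept(self, signal_id: str, payload: bytes, tag: str) -> bool:
        expected = sign(self._key, signal_id, payload)
        if not hmac.compare_digest(expected, tag):
            return False  # spoofed or altered signal
        if signal_id in self._seen:
            return False  # replay of an earlier signal
        self._seen.add(signal_id)
        return True
```

A real deployment would also need to bound the set of remembered identifiers, for example with an expiry time, and to rate-limit senders so that false failure signals cannot be used for denial of service; neither is shown here.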

## Constraints

- Scope constraints: Keep this to rollback and recovery coordination. Do not absorb lifecycle management, full workflow DAG standardization, or human override into the core mechanism except where needed as interfaces.
- Compatibility constraints: Reuse adjacent concepts where possible from existing IETF-style work on execution evidence, attestation, or agent communication. Do not invent a full new identity or transport stack.
- Terminology constraints: Use conservative standards language. Prefer terms like agent, execution, checkpoint, rollback set, dependency, and recovery record. Avoid buzzwords and branding.

## Source materials to prioritize

- `/home/c/projects/ietf-draft-analyzer/data/reports/gaps.md`
- `/home/c/projects/ietf-draft-analyzer/data/reports/holistic-agent-ecosystem-draft-outlines.md`
- `/home/c/projects/ietf-draft-analyzer/data/reports/ideas.md`
- `/home/c/projects/ietf-draft-analyzer/data/reports/overview.md`
- `draft-yue-anima-agent-recovery-networks`
- `draft-li-dmsc-macp`
- `draft-fu-nmop-agent-communication-framework`
- `draft-srijal-agents-policy`
- related WIMSE or ECT materials when they help avoid redefining execution evidence

## Success criteria

- A reader can tell exactly what an agent must emit or process when a task fails.
- A reader can tell how rollback scope is determined and how dependent agents respond.
- The draft includes enough structure to support interoperability testing later.
- Specialist reviewers can criticize the draft on substance rather than on missing basic sections or obvious ambiguity.
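
As one possible reading of "how rollback scope is determined", the sketch below derives a rollback set from a dependency map and a declared boundary. The function name, the shape of the map, and the rule that dependents roll back before the executions they depend on are assumptions, not decisions the draft has made. It also assumes the dependency graph is acyclic.

```python
from typing import Dict, List, Set


def rollback_order(
    dependents: Dict[str, List[str]],
    failed: str,
    boundary: Set[str],
) -> List[str]:
    """Return the rollback set for `failed`, dependents first.

    `dependents` maps an execution to the executions that consumed its output.
    `boundary` is the declared rollback boundary; anything outside it is left alone.
    """
    order: List[str] = []
    seen: Set[str] = set()

    def visit(node: str) -> None:
        seen.add(node)
        for child in dependents.get(node, []):
            if child in boundary and child not in seen:
                visit(child)
        # Post-order: a node is listed only after everything that depends on it.
        order.append(node)

    visit(failed)
    return order


# Example: "b" depends on both "a" and "c"; "d" lies outside the boundary.
graph = {"a": ["b", "c"], "c": ["b"], "b": ["d"]}
print(rollback_order(graph, "a", {"a", "b", "c"}))  # ['b', 'c', 'a']
```

The example leaves "d" untouched because it is outside the declared boundary, which is exactly the case the question on partially completed downstream work has to resolve.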

## Questions for the team

- What is the smallest interoperable core for rollback semantics?
- Should checkpoints and recovery records be abstract objects, protocol messages, or profileable metadata on top of another carrier?
- What information is mandatory in a failure signal versus optional?
- How should rollback interact with partially completed downstream work?