# User Spec

## Topic

Agent Error Recovery and Rollback for Multi-Agent Systems

## Goal

Produce a credible IETF-style Internet-Draft for a narrowly scoped mechanism that standardizes how cooperating agents report failures, define rollback scope, and execute coordinated recovery without cascading damage.

## Intended status

Experimental.

Rationale: the problem is clearly real and under-specified, but the ecosystem is still young and the mechanism should not pretend to have full deployment consensus yet.

## Problem to solve

Current AI-agent and autonomous-operations drafts define communication, identity, and orchestration patterns, but the landscape analysis shows no common mechanism for:

- signaling execution failure in a machine-actionable way
- declaring rollback boundaries and blast radius
- coordinating rollback across dependent agents
- recording recovery outcomes for audit and future trust decisions

This creates high interoperability and safety risk for autonomous systems that act across multiple services or domains.

## What must be true in the final draft

- The draft stays tightly scoped to recovery and rollback semantics, not a full agent architecture.
- The mechanism is protocol-agnostic enough to work across multiple agent ecosystems.
- The draft defines concrete states, triggers, and recovery procedures that two implementers could follow consistently.
- Security Considerations meaningfully address spoofed rollback, unauthorized override, replay, and denial-of-service by false failure signaling.
- The text is shaped like a real Internet-Draft, not a product design memo.
- The draft clearly states what is in scope now and what is deferred to later work such as richer workflow orchestration or dynamic trust scoring.

## Constraints

- scope constraints: Keep this to rollback and recovery coordination. Do not absorb lifecycle management, full workflow DAG standardization, or human override into the core mechanism except where needed as interfaces.
- compatibility constraints: Reuse adjacent concepts where possible from existing IETF-style work on execution evidence, attestation, or agent communication. Do not invent a full new identity or transport stack.
- terminology constraints: Use conservative standards language. Prefer terms like agent, execution, checkpoint, rollback set, dependency, and recovery record. Avoid buzzwords and branding.

## Source materials to prioritize

- `/home/c/projects/ietf-draft-analyzer/data/reports/gaps.md`
- `/home/c/projects/ietf-draft-analyzer/data/reports/holistic-agent-ecosystem-draft-outlines.md`
- `/home/c/projects/ietf-draft-analyzer/data/reports/ideas.md`
- `/home/c/projects/ietf-draft-analyzer/data/reports/overview.md`
- `draft-yue-anima-agent-recovery-networks`
- `draft-li-dmsc-macp`
- `draft-fu-nmop-agent-communication-framework`
- `draft-srijal-agents-policy`
- related WIMSE or ECT materials when they help avoid redefining execution evidence

## Success criteria

- A reader can tell exactly what an agent must emit or process when a task fails.
- A reader can tell how rollback scope is determined and how dependent agents respond.
- The draft includes enough structure to support interoperability testing later.
- Specialist reviewers can criticize the draft on substance rather than on missing basic sections or obvious ambiguity.

## Questions for the team

- What is the smallest interoperable core for rollback semantics?
- Should checkpoints and recovery records be abstract objects, protocol messages, or profileable metadata on top of another carrier?
- What information is mandatory in a failure signal versus optional?
- How should rollback interact with partially completed downstream work?
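
To make the questions above concrete for discussion, here is a minimal Python sketch of one possible shape for a failure signal and a rollback set. Everything in it is an assumption for illustration: the `FailureSignal` class, its field names, the `rollback_set` function, and the choice to treat every transitive dependent as in scope are hypothetical and do not come from the source materials or any existing draft.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class FailureSignal:
    """Hypothetical failure signal; all field names are illustrative only."""
    execution_id: str              # execution that failed
    agent_id: str                  # agent reporting the failure
    checkpoint_id: str             # last known-good checkpoint
    reason: str                    # machine-actionable failure code
    detail: Optional[str] = None   # optional human-readable context


def rollback_set(failed: str, dependents: Dict[str, List[str]]) -> Set[str]:
    """Return the failed execution plus every execution transitively downstream.

    `dependents` maps an execution ID to the executions that consumed its
    output. This is one candidate rule for rollback scope, not a proposal
    for the normative text.
    """
    seen: Set[str] = set()
    stack = [failed]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guards against cycles and repeated dependencies
        seen.add(current)
        stack.extend(dependents.get(current, []))
    return seen
```

In this sketch the first four fields stand in for a mandatory core and `detail` for an optional extension, which is only one candidate split. The transitive-closure rule is deliberately the bluntest option: it rolls back partially completed downstream work unconditionally, so it serves as a baseline against which narrower scoping rules can be compared.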