ietf-draft-analyzer/workspace/draft-team/cycles/agent-error-recovery-rollback/00-user-spec.md
Christian Nennemann 2506b6325a
feat: add draft data, gap analysis report, and workspace config
2026-04-06 18:47:15 +02:00


# User Spec
## Topic
Agent Error Recovery and Rollback for Multi-Agent Systems
## Goal
Produce a credible IETF-style Internet-Draft for a narrowly scoped mechanism that standardizes how cooperating agents report failures, define rollback scope, and execute coordinated recovery without cascading damage.
## Intended status
Experimental.
Rationale: The problem is real and under-specified, but the ecosystem is still young, and the mechanism should not claim deployment consensus that does not yet exist.
## Problem to solve
Current AI-agent and autonomous-operations drafts define communication, identity, and orchestration patterns, but the landscape analysis shows no common mechanism for:
- signaling execution failure in a machine-actionable way
- declaring rollback boundaries and blast radius
- coordinating rollback across dependent agents
- recording recovery outcomes for audit and future trust decisions
This gap creates a high interoperability and safety risk for autonomous systems that act across multiple services or domains.
## What must be true in the final draft
- The draft stays tightly scoped to recovery and rollback semantics, not a full agent architecture.
- The mechanism is protocol-agnostic enough to work across multiple agent ecosystems.
- The draft defines concrete states, triggers, and recovery procedures that two implementers could follow consistently.
- Security Considerations meaningfully address spoofed rollback, unauthorized override, replay, and denial-of-service by false failure signaling.
- The text is shaped like a real Internet-Draft, not a product design memo.
- The draft clearly states what is in scope now and what is deferred to later work such as richer workflow orchestration or dynamic trust scoring.
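To illustrate what "concrete states, triggers, and recovery procedures" could look like in practice, here is a minimal sketch of a recovery state machine. The state names, trigger names, and transition table are all hypothetical placeholders chosen for illustration; the draft itself would have to define the real ones.

```python
from enum import Enum


class RecoveryState(Enum):
    # Hypothetical state names, not taken from any existing draft.
    RUNNING = "running"
    FAILED = "failed"
    ROLLING_BACK = "rolling_back"
    ROLLED_BACK = "rolled_back"
    ROLLBACK_FAILED = "rollback_failed"


# Allowed transitions, keyed by (current state, trigger).
# Trigger names are assumptions made for this sketch.
TRANSITIONS = {
    (RecoveryState.RUNNING, "failure_detected"): RecoveryState.FAILED,
    (RecoveryState.FAILED, "rollback_started"): RecoveryState.ROLLING_BACK,
    (RecoveryState.ROLLING_BACK, "rollback_succeeded"): RecoveryState.ROLLED_BACK,
    (RecoveryState.ROLLING_BACK, "rollback_error"): RecoveryState.ROLLBACK_FAILED,
}


def next_state(current: RecoveryState, trigger: str) -> RecoveryState:
    """Return the state reached from `current` on `trigger`, or raise ValueError."""
    try:
        return TRANSITIONS[(current, trigger)]
    except KeyError:
        raise ValueError(
            f"trigger {trigger!r} is not valid in state {current.value}"
        ) from None
```

An explicit transition table of this kind is one way to let two implementers check that they reject the same invalid sequences, for example a `rollback_succeeded` trigger arriving while the agent is still `RUNNING`.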
## Constraints
- Scope constraints: Keep this to rollback and recovery coordination. Do not absorb lifecycle management, full workflow DAG standardization, or human override into the core mechanism except where needed as interfaces.
- Compatibility constraints: Reuse adjacent concepts where possible from existing IETF-style work on execution evidence, attestation, or agent communication. Do not invent a full new identity or transport stack.
- Terminology constraints: Use conservative standards language. Prefer terms like agent, execution, checkpoint, rollback set, dependency, and recovery record. Avoid buzzwords and branding.
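As a rough illustration of how the preferred terms might relate to one another, the sketch below models checkpoint, rollback set, and recovery record as plain data objects. Every field name and the `outcome` values are assumptions made for this example, not definitions from the draft or from any referenced work.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Checkpoint:
    """A point an agent can return to for a given execution (hypothetical fields)."""
    checkpoint_id: str
    agent_id: str
    execution_id: str


@dataclass(frozen=True)
class RollbackSet:
    """The checkpoints to restore in response to one failed execution."""
    failed_execution_id: str
    checkpoints: FrozenSet[Checkpoint]


@dataclass(frozen=True)
class RecoveryRecord:
    """The outcome of one coordinated rollback, kept for audit."""
    rollback_set: RollbackSet
    outcome: str  # hypothetical values: "restored", "partial", "failed"
    detail: Optional[str] = None


cp_a = Checkpoint("cp-1", "agent-a", "exec-42")
cp_b = Checkpoint("cp-7", "agent-b", "exec-42")
scope = RollbackSet("exec-42", frozenset({cp_a, cp_b}))
record = RecoveryRecord(scope, outcome="restored")
print(sorted(cp.agent_id for cp in record.rollback_set.checkpoints))
# prints ['agent-a', 'agent-b']
```

Keeping these objects immutable is a design choice for the sketch only: a recovery record that cannot be altered after creation is easier to reason about as audit evidence.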
## Source materials to prioritize
- `/home/c/projects/ietf-draft-analyzer/data/reports/gaps.md`
- `/home/c/projects/ietf-draft-analyzer/data/reports/holistic-agent-ecosystem-draft-outlines.md`
- `/home/c/projects/ietf-draft-analyzer/data/reports/ideas.md`
- `/home/c/projects/ietf-draft-analyzer/data/reports/overview.md`
- `draft-yue-anima-agent-recovery-networks`
- `draft-li-dmsc-macp`
- `draft-fu-nmop-agent-communication-framework`
- `draft-srijal-agents-policy`
- related WIMSE or ECT materials when they help avoid redefining execution evidence
## Success criteria
- A reader can tell exactly what an agent must emit or process when a task fails.
- A reader can tell how rollback scope is determined and how dependent agents respond.
- The draft includes enough structure to support interoperability testing later.
- Specialist reviewers can criticize the draft on substance rather than on missing basic sections or obvious ambiguity.
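One possible reading of "how rollback scope is determined" is a transitive closure over agent dependencies: the rollback set contains the failed agent plus every agent that directly or indirectly consumed its output. The sketch below assumes that reading; the function name, the `dependents` mapping, and the agent names are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set


def rollback_set(failed: str, dependents: Dict[str, List[str]]) -> Set[str]:
    """Return the failed agent plus every agent that transitively depends on it."""
    affected = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, []):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


# Hypothetical topology: maps an agent to the agents that consume its output.
dependents = {
    "planner": ["booker", "notifier"],
    "booker": ["invoicer"],
    "notifier": [],
    "invoicer": [],
    "auditor": [],
}

print(sorted(rollback_set("booker", dependents)))  # prints ['booker', 'invoicer']
```

A breadth-first traversal keeps the blast radius bounded to agents actually reachable from the failure, so an unrelated agent such as `auditor` is never asked to roll back.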
## Questions for the team
- What is the smallest interoperable core for rollback semantics?
- Should checkpoints and recovery records be abstract objects, protocol messages, or profileable metadata on top of another carrier?
- What information is mandatory in a failure signal versus optional?
- How should rollback interact with partially completed downstream work?
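To make the mandatory-versus-optional question concrete, the following sketch validates a JSON-encoded failure signal against one possible field split. The split shown here, the field names, and the choice of JSON as a carrier are all assumptions for discussion; they are not a proposal from the source materials.

```python
import json
from typing import List

# Hypothetical field split; the draft itself must decide the real one.
MANDATORY_FIELDS = ("agent_id", "execution_id", "failure_code", "timestamp")
OPTIONAL_FIELDS = ("checkpoint_id", "rollback_scope", "detail")


def validate_failure_signal(raw: str) -> List[str]:
    """Return a list of problems found in a JSON-encoded failure signal."""
    try:
        signal = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(signal, dict):
        return ["signal must be a JSON object"]
    problems = [
        f"missing mandatory field: {name}"
        for name in MANDATORY_FIELDS
        if name not in signal
    ]
    known = set(MANDATORY_FIELDS) | set(OPTIONAL_FIELDS)
    problems += [
        f"unknown field: {name}" for name in sorted(signal) if name not in known
    ]
    return problems
```

An empty list means the signal satisfies this hypothetical profile. Whether unknown fields should be rejected, as here, or ignored for forward compatibility is itself one of the decisions the team would need to make.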