# User Spec

## Topic

Agent Error Recovery and Rollback for Multi-Agent Systems

## Goal

Produce a credible IETF-style Internet-Draft for a narrowly scoped mechanism that standardizes how cooperating agents report failures, define rollback scope, and execute coordinated recovery without cascading damage.

## Intended status

Experimental.

Rationale: the problem is real and under-specified, but the ecosystem is still young, so the mechanism should not claim deployment consensus that it does not yet have.

## Problem to solve

Current AI-agent and autonomous-operations drafts define communication, identity, and orchestration patterns, but the landscape analysis shows no common mechanism for:

- signaling execution failure in a machine-actionable way
- declaring rollback boundaries and blast radius
- coordinating rollback across dependent agents
- recording recovery outcomes for audit and future trust decisions

This gap creates significant interoperability and safety risk for autonomous systems that act across multiple services or domains.

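
As a hedged illustration of what "machine-actionable" failure signaling could mean, the following Python sketch models a failure signal as a small serializable object. Every field name, the `FailureClass` values, and the JSON encoding are assumptions made for illustration only. None of them comes from the source materials, and the sketch does not settle which fields are mandatory or which carrier is used.

```python
import json
from dataclasses import dataclass, asdict
from enum import Enum


class FailureClass(Enum):
    # Hypothetical categories; a real draft would define its own set.
    TRANSIENT = "transient"
    PERMANENT = "permanent"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class FailureSignal:
    # All field names are illustrative assumptions.
    signal_id: str        # unique per signal, useful for replay detection
    execution_id: str     # the execution that failed
    agent_id: str         # the agent reporting the failure
    failure_class: FailureClass
    checkpoint_id: str    # last checkpoint known to be good
    rollback_set: tuple = ()  # agents the reporter believes are affected

    def to_json(self) -> str:
        payload = asdict(self)
        # Store the enum by value so the payload is plain JSON.
        payload["failure_class"] = self.failure_class.value
        return json.dumps(payload, sort_keys=True)

    @staticmethod
    def from_json(text: str) -> "FailureSignal":
        raw = json.loads(text)
        return FailureSignal(
            signal_id=raw["signal_id"],
            execution_id=raw["execution_id"],
            agent_id=raw["agent_id"],
            failure_class=FailureClass(raw["failure_class"]),
            checkpoint_id=raw["checkpoint_id"],
            rollback_set=tuple(raw["rollback_set"]),
        )


signal = FailureSignal(
    signal_id="sig-1",
    execution_id="exec-42",
    agent_id="agent-a",
    failure_class=FailureClass.TRANSIENT,
    checkpoint_id="cp-7",
    rollback_set=("agent-b", "agent-c"),
)
wire = signal.to_json()
restored = FailureSignal.from_json(wire)
```

A receiver that can reconstruct the same object from the wire form, as `restored` does here, has something it can act on without parsing free-text error messages.
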
## What must be true in the final draft

- The draft stays tightly scoped to recovery and rollback semantics, not a full agent architecture.
- The mechanism is protocol-agnostic enough to work across multiple agent ecosystems.
- The draft defines concrete states, triggers, and recovery procedures that two implementers could follow consistently.
- Security Considerations meaningfully address spoofed rollback, unauthorized override, replay, and denial-of-service by false failure signaling.
- The text is shaped like a real Internet-Draft, not a product design memo.
- The draft clearly states what is in scope now and what is deferred to later work such as richer workflow orchestration or dynamic trust scoring.

## Constraints

- Scope constraints:
  Keep this to rollback and recovery coordination. Do not absorb lifecycle management, full workflow DAG standardization, or human override into the core mechanism except where needed as interfaces.
- Compatibility constraints:
  Reuse adjacent concepts where possible from existing IETF-style work on execution evidence, attestation, or agent communication. Do not invent a full new identity or transport stack.
- Terminology constraints:
  Use conservative standards language. Prefer terms like agent, execution, checkpoint, rollback set, dependency, and recovery record. Avoid buzzwords and branding.

## Source materials to prioritize

- `/home/c/projects/ietf-draft-analyzer/data/reports/gaps.md`
- `/home/c/projects/ietf-draft-analyzer/data/reports/holistic-agent-ecosystem-draft-outlines.md`
- `/home/c/projects/ietf-draft-analyzer/data/reports/ideas.md`
- `/home/c/projects/ietf-draft-analyzer/data/reports/overview.md`
- `draft-yue-anima-agent-recovery-networks`
- `draft-li-dmsc-macp`
- `draft-fu-nmop-agent-communication-framework`
- `draft-srijal-agents-policy`
- related WIMSE or ECT materials when they help avoid redefining execution evidence

## Success criteria

- A reader can tell exactly what an agent must emit or process when a task fails.
- A reader can tell how rollback scope is determined and how dependent agents respond.
- The draft includes enough structure to support interoperability testing later.
- Specialist reviewers can criticize the draft on substance rather than on missing basic sections or obvious ambiguity.

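
To make the second criterion concrete, here is a minimal sketch of one way rollback scope could be derived: treat the rollback set as the failed agent plus every agent that transitively depends on it, and undo dependents before the agents they depend on. The `dependents` map, the agent names, and the function `rollback_order` are hypothetical. The draft may well choose a different rule, for example one bounded by checkpoints.

```python
def rollback_order(dependents: dict[str, set[str]], failed: str) -> list[str]:
    """Return the failed agent and its transitive dependents.

    `dependents` maps an agent to the agents that consume its output.
    The result lists dependents before the agents they depend on
    (depth-first post-order), so the failed agent comes last.
    """
    seen: set[str] = set()
    order: list[str] = []

    def visit(agent: str) -> None:
        if agent in seen:
            return  # already handled; also guards against cycles
        seen.add(agent)
        # Sorted only to make the output deterministic.
        for downstream in sorted(dependents.get(agent, ())):
            visit(downstream)
        order.append(agent)

    visit(failed)
    return order


# Hypothetical dependency map: planner feeds booker and notifier,
# booker feeds payer, and auditor is independent of all of them.
deps = {
    "planner": {"booker", "notifier"},
    "booker": {"payer"},
    "payer": set(),
    "notifier": set(),
    "auditor": set(),
}

booker_scope = rollback_order(deps, "booker")
planner_scope = rollback_order(deps, "planner")
```

In this example a failure at `booker` yields `["payer", "booker"]`, leaving `planner`, `notifier`, and `auditor` untouched. That containment is what limits the blast radius.
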
## Questions for the team

- What is the smallest interoperable core for rollback semantics?
- Should checkpoints and recovery records be abstract objects, protocol messages, or profileable metadata on top of another carrier?
- What information is mandatory in a failure signal versus optional?
- How should rollback interact with partially completed downstream work?