Architecture Brief

Scope

Define an experimental, protocol-agnostic recovery model for multi-agent execution that standardizes:

failure signaling
checkpoint references
rollback request and rollback result semantics
dependency-aware rollback scope
minimum task state transitions relevant to recovery

The document should be narrow enough that an existing agent protocol or execution-evidence carrier can adopt it as a profile or extension.

Non-goals

defining a full workflow or DAG language
defining human override or approval workflows beyond a hook for escalation
defining identity, authentication, or attestation systems
defining global trust scoring or reputation exchange
defining scheduler behavior, quota fairness, or resource arbitration beyond optional future hooks

Terminology and actors

agent: autonomous software entity performing one or more tasks
task: a discrete unit of work whose execution and outcome can be referenced
dependency: another task whose outcome affects whether the current task may continue or must roll back
checkpoint: a recorded pre-action or recovery-safe state from which rollback may proceed
failure event: a machine-actionable signal that a task or dependency failed
rollback set: the set of tasks and effects that the sender requests to revert or compensate
recovery record: a record of a rollback attempt and its outcome (success, partial success, or failure)
coordinator: optional role that computes rollback scope across multiple dependent agents

Actors:

originating agent that detects failure
dependent agent that receives failure or rollback signals
optional coordination service or gateway
policy authority or operator, involved only when automatic rollback is disallowed

Protocol or data model shape

Use an abstract event model with four core event types:

checkpoint
failure
rollback-request
rollback-result

Each event should carry a minimum common envelope:

event identifier
task identifier
workflow or execution context identifier if available
sender identity reference
timestamp
referenced parent task or dependency identifiers where relevant

Event-specific content:

checkpoint: checkpoint identifier, reversibility class, optional expiry
failure: failure class, severity, reversibility indicator, blast-radius hint, failed dependency reference
rollback-request: target checkpoint or rollback boundary, requested rollback scope, reason code, urgency, idempotency token
rollback-result: outcome status, actual scope applied, partial rollback indicators, residual risk or manual follow-up required
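
As an illustration only, the common envelope and two of the event types could be modeled as below. This is a minimal sketch, not a proposed wire format: the class names, field names, and the `new_envelope` helper are hypothetical, and all vocabularies (failure class, severity, urgency) are left as free-form strings because the brief does not define them.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Tuple
import uuid


@dataclass(frozen=True)
class Envelope:
    """Minimum common envelope (field names are illustrative assumptions)."""
    event_id: str
    task_id: str
    sender_ref: str                             # sender identity reference
    timestamp: datetime
    context_id: Optional[str] = None            # workflow/execution context, if available
    related_task_ids: Tuple[str, ...] = ()      # parent task or dependency identifiers


@dataclass(frozen=True)
class FailureEvent:
    envelope: Envelope
    failure_class: str
    severity: str
    reversible: bool                            # reversibility indicator
    blast_radius_hint: Optional[str] = None
    failed_dependency: Optional[str] = None


@dataclass(frozen=True)
class RollbackRequest:
    envelope: Envelope
    target_checkpoint: str                      # target checkpoint or rollback boundary
    requested_scope: Tuple[str, ...]
    reason_code: str
    urgency: str
    idempotency_token: str


def new_envelope(task_id, sender_ref, context_id=None, related=()):
    """Hypothetical helper: fills in a fresh event identifier and a UTC timestamp."""
    return Envelope(
        event_id=str(uuid.uuid4()),
        task_id=task_id,
        sender_ref=sender_ref,
        timestamp=datetime.now(timezone.utc),
        context_id=context_id,
        related_task_ids=tuple(related),
    )
```

Keeping the envelope as a separate frozen value means a carrier binding could map it once and reuse that mapping for all four event types.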

State model:

pending
running
completed
failed
rollback-requested
rolled-back
rollback-failed
compensation-required
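
The brief lists states but does not yet fix the transitions between them. The sketch below shows one way a transition table could be enforced. The `ALLOWED` table is an assumption made for illustration, not part of the brief, and the names `TaskState` and `transition` are hypothetical.

```python
from enum import Enum


class TaskState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    ROLLBACK_REQUESTED = "rollback-requested"
    ROLLED_BACK = "rolled-back"
    ROLLBACK_FAILED = "rollback-failed"
    COMPENSATION_REQUIRED = "compensation-required"


# Assumed transition table; the brief leaves the exact edges open.
ALLOWED = {
    TaskState.PENDING: {TaskState.RUNNING},
    TaskState.RUNNING: {TaskState.COMPLETED, TaskState.FAILED},
    TaskState.COMPLETED: {TaskState.ROLLBACK_REQUESTED},
    TaskState.FAILED: {TaskState.ROLLBACK_REQUESTED},
    TaskState.ROLLBACK_REQUESTED: {
        TaskState.ROLLED_BACK,
        TaskState.ROLLBACK_FAILED,
        TaskState.COMPENSATION_REQUIRED,
    },
    TaskState.ROLLBACK_FAILED: {TaskState.COMPENSATION_REQUIRED},
    TaskState.ROLLED_BACK: set(),
    TaskState.COMPENSATION_REQUIRED: set(),
}


def transition(current: TaskState, target: TaskState) -> TaskState:
    """Return the target state, or raise ValueError if the edge is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

In this sketch a pending task has no edge to rollback-requested, which reflects the distinction between rolling back prior effects and cancelling work that has not yet executed.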

Design choice: keep the carrier abstract in this first draft, but include a section describing how the model may bind to existing execution-evidence formats if such a substrate is available and sufficiently mature.

Candidate normative requirements

Agents MUST emit a failure event when a task failure can affect dependent execution outside local process scope.
Failure events MUST identify the failed task and SHOULD identify affected dependencies when known.
Rollback requests MUST be idempotent and uniquely identifiable.
Agents receiving a rollback request MUST return a rollback result, even when rollback is refused or only partially completed.
A rollback result MUST indicate one of: success, partial success, refusal, irreversible, or failure.
Agents MUST NOT claim successful rollback unless the referenced effects were actually reverted or explicitly compensated.
If a task is not reversible, the agent MUST signal that fact explicitly rather than silently ignoring rollback.
Implementations SHOULD support checkpoint references when a task has externally visible side effects.
The specification SHOULD allow policy-controlled escalation rather than requiring automatic rollback for every failure.
The document MUST distinguish rollback of prior effects from cancellation of work that has not yet executed.
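
Several of these candidates (idempotent requests, a result for every request, explicit irreversibility, no false success) can be illustrated together. The following is a minimal in-memory sketch under stated assumptions: `RollbackHandler`, the `reverters` mapping, and the `authorized` callback are hypothetical names, a task with no registered reverter is treated as irreversible, and an empty scope is refused.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, Sequence, Tuple


class Outcome(Enum):
    SUCCESS = "success"
    PARTIAL_SUCCESS = "partial-success"
    REFUSAL = "refusal"
    IRREVERSIBLE = "irreversible"
    FAILURE = "failure"


@dataclass(frozen=True)
class RollbackResult:
    idempotency_token: str
    outcome: Outcome
    actual_scope: Tuple[str, ...]   # tasks whose effects were actually reverted


class RollbackHandler:
    def __init__(self, reverters: Dict[str, Callable[[], bool]],
                 authorized: Callable[[str], bool] = lambda sender: True):
        self._reverters = reverters     # task id -> callable returning True on revert
        self._authorized = authorized   # assumed policy hook, not defined by the brief
        self._results: Dict[str, RollbackResult] = {}

    def handle(self, token: str, sender: str, scope: Sequence[str]) -> RollbackResult:
        # Idempotency: a repeated token returns the stored result without re-running.
        if token not in self._results:
            self._results[token] = self._apply(token, sender, scope)
        return self._results[token]

    def _apply(self, token, sender, scope):
        if not scope or not self._authorized(sender):
            return RollbackResult(token, Outcome.REFUSAL, ())
        reverted, irreversible, failed = [], False, False
        for task_id in scope:
            reverter = self._reverters.get(task_id)
            if reverter is None:
                irreversible = True     # signalled explicitly, never silently ignored
                continue
            try:
                ok = bool(reverter())
            except Exception:
                ok = False
            if ok:
                reverted.append(task_id)
            else:
                failed = True
        # Success is claimed only when every requested task was actually reverted.
        if len(reverted) == len(scope):
            outcome = Outcome.SUCCESS
        elif reverted:
            outcome = Outcome.PARTIAL_SUCCESS
        elif irreversible and not failed:
            outcome = Outcome.IRREVERSIBLE
        else:
            outcome = Outcome.FAILURE
        return RollbackResult(token, outcome, tuple(reverted))
```

Every code path returns a `RollbackResult`, so a caller never has to infer the outcome from silence. A real implementation would need durable storage for seen tokens; the dictionary here only stands in for that.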

Security, privacy, and abuse considerations

unauthorized rollback requests could be used as denial-of-service
spoofed failure signals could trigger cascading rollback
replayed rollback requests could repeatedly unwind completed work
rollback metadata may expose internal topology or sensitive task relationships
partial rollback can create inconsistent downstream state that attackers can exploit
signed or otherwise authenticated event carriage is strongly preferred, but the draft should avoid redefining base authentication
the draft should require clear handling of refusal, partial rollback, and policy escalation to avoid silent unsafe states

Privacy is a secondary concern but not a negligible one: task identifiers, dependency graphs, and failure reasons can leak operational details.

IANA impact

Most likely minimal for the first version.

If the draft defines abstract event or reason-code registries, keep them compact:

rollback event types
failure classes
rollback outcome codes

If an existing registry from an underlying carrier can be reused, prefer that.

Open design questions

Should rollback scope be defined normatively as dependency closure, or left partially implementation-specific with mandatory disclosure of actual scope?
Is a separate cancellation event needed, or is that explicitly out of scope for this draft?
How much of checkpoint semantics should be mandatory versus profile-specific?
Can one draft stay both carrier-agnostic and implementable, or does it need a non-normative binding example to avoid vagueness?
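
To make the first question concrete, the sketch below shows what "rollback scope as dependency closure" could mean in practice: the failed task plus every task that transitively depends on it. It assumes the dependency graph is available as a mapping from each task to the tasks it depends on; the function name `rollback_closure` and that input shape are illustrative assumptions, not part of the brief.

```python
from collections import deque
from typing import Dict, Iterable, Set


def rollback_closure(failed_task: str,
                     depends_on: Dict[str, Iterable[str]]) -> Set[str]:
    """Return the failed task plus all tasks that transitively depend on it."""
    # Invert the graph: for each task, which tasks depend on it?
    dependents: Dict[str, Set[str]] = {}
    for task, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(task)

    # Breadth-first walk; the visited set also guards against cycles.
    scope = {failed_task}
    queue = deque([failed_task])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, ()):
            if child not in scope:
                scope.add(child)
                queue.append(child)
    return scope
```

For example, with `depends_on = {"b": ["a"], "c": ["b"], "d": ["x"]}`, a failure of `a` yields the scope `{"a", "b", "c"}` and leaves `d` untouched. If scope is instead left implementation-specific, the same set could serve as the baseline against which the actual scope applied is disclosed.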
