Christian Nennemann 2506b6325a
feat: add draft data, gap analysis report, and workspace config
2026-04-06 18:47:15 +02:00

Architecture Brief

Scope

Define an experimental, protocol-agnostic recovery model for multi-agent execution that standardizes:

  • failure signaling
  • checkpoint references
  • rollback request and rollback result semantics
  • dependency-aware rollback scope
  • minimum task state transitions relevant to recovery

The document should be narrow enough that an existing agent protocol or execution-evidence carrier can adopt it as a profile or extension.

Non-goals

  • defining a full workflow or DAG language
  • defining human override or approval workflows beyond a hook for escalation
  • defining identity, authentication, or attestation systems
  • defining global trust scoring or reputation exchange
  • defining scheduler behavior, quota fairness, or resource arbitration beyond optional future hooks

Terminology and actors

  • agent: autonomous software entity performing one or more tasks
  • task: a discrete unit of work whose execution and outcome can be referenced
  • dependency: another task whose outcome affects whether the current task may continue or must roll back
  • checkpoint: a recorded pre-action or recovery-safe state from which rollback may proceed
  • failure event: a machine-actionable signal that a task or dependency failed
  • rollback set: the set of tasks and effects that the sender requests to revert or compensate
  • recovery record: a record of rollback attempt, success, partial success, or failure
  • coordinator: optional role that computes rollback scope across multiple dependent agents

Actors:

  • originating agent that detects failure
  • dependent agent that receives failure or rollback signals
  • optional coordination service or gateway
  • policy authority or operator, involved only when automatic rollback is disallowed

Protocol or data model shape

Use an abstract event model with four core event types:

  1. checkpoint
  2. failure
  3. rollback-request
  4. rollback-result

Each event should carry a minimum common envelope:

  • event identifier
  • task identifier
  • workflow or execution context identifier if available
  • sender identity reference
  • timestamp
  • referenced parent task or dependency identifiers where relevant
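
As a minimal sketch of the envelope listed above, the following Python dataclass carries one attribute per envelope field. The class name `Envelope`, all attribute names, and the JSON encoding are assumptions made for illustration; the brief deliberately leaves the carrier abstract.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional, Tuple

# The four core event types named in this brief.
EVENT_TYPES = {"checkpoint", "failure", "rollback-request", "rollback-result"}


@dataclass(frozen=True)
class Envelope:
    """Hypothetical common envelope; field names are illustrative only."""
    event_type: str
    task_id: str
    sender: str  # sender identity reference, treated as opaque here
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    context_id: Optional[str] = None  # workflow or execution context, if available
    related_task_ids: Tuple[str, ...] = ()  # parent task or dependency identifiers

    def __post_init__(self) -> None:
        # Reject anything outside the four core event types.
        if self.event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {self.event_type}")

    def to_json(self) -> str:
        # JSON is an assumed encoding; tuples serialize as JSON arrays.
        return json.dumps(asdict(self), sort_keys=True)
```

A failure event for task `t1` that references a parent task could then be built as `Envelope(event_type="failure", task_id="t1", sender="agent-a", related_task_ids=("t0",))`.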

Event-specific content:

  • checkpoint: checkpoint identifier, reversibility class, optional expiry
  • failure: failure class, severity, reversibility indicator, blast-radius hint, failed dependency reference
  • rollback-request: target checkpoint or rollback boundary, requested rollback scope, reason code, urgency, idempotency token
  • rollback-result: outcome status, actual scope applied, partial rollback indicators, residual risk or manual follow-up required
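
The event-specific content above could be modeled roughly as follows. This is a sketch under stated assumptions: the class and attribute names are hypothetical, free-text fields such as failure class and severity are left as plain strings because the brief does not enumerate their values, and only the outcome values are taken from the requirements section of this brief.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional


class Outcome(Enum):
    # Outcome values as listed in the normative requirements of this brief.
    SUCCESS = "success"
    PARTIAL_SUCCESS = "partial success"
    REFUSAL = "refusal"
    IRREVERSIBLE = "irreversible"
    FAILURE = "failure"


@dataclass(frozen=True)
class Checkpoint:
    checkpoint_id: str
    reversibility_class: str
    expires_at: Optional[str] = None  # optional expiry


@dataclass(frozen=True)
class Failure:
    failure_class: str
    severity: str
    reversible: bool  # reversibility indicator
    blast_radius_hint: Optional[str] = None
    failed_dependency: Optional[str] = None


@dataclass(frozen=True)
class RollbackRequest:
    target: str  # target checkpoint or rollback boundary
    scope: FrozenSet[str]  # requested rollback scope, as task identifiers
    reason_code: str
    urgency: str
    idempotency_token: str


@dataclass(frozen=True)
class RollbackResult:
    outcome: Outcome
    applied_scope: FrozenSet[str]  # actual scope applied
    follow_up: Optional[str] = None  # residual risk or manual follow-up

    def unreverted(self, request: RollbackRequest) -> FrozenSet[str]:
        # Tasks that were requested but not covered by the applied scope.
        return request.scope - self.applied_scope
```

Comparing requested and applied scope, as `unreverted` does, is one way a sender might detect partial rollback without relying on the outcome status alone.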

State model:

  • pending
  • running
  • completed
  • failed
  • rollback-requested
  • rolled-back
  • rollback-failed
  • compensation-required
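
The brief lists states but not the transitions between them. The sketch below shows one plausible transition table and a guard function; every edge in `ALLOWED` is an assumption made for illustration, not something this brief specifies. Cancellation of not-yet-executed work is intentionally absent, since the brief treats it as distinct from rollback.

```python
# Hypothetical transition table: the states come from the brief,
# the edges between them are assumptions.
ALLOWED = {
    "pending": {"running"},
    "running": {"completed", "failed"},
    "completed": {"rollback-requested"},
    "failed": {"rollback-requested"},
    "rollback-requested": {"rolled-back", "rollback-failed", "compensation-required"},
    "rollback-failed": {"rollback-requested", "compensation-required"},
    "compensation-required": {"rolled-back", "rollback-failed"},
    "rolled-back": set(),  # assumed terminal
}


def transition(current: str, target: str) -> str:
    """Return the new state, or raise ValueError if the move is not allowed."""
    if current not in ALLOWED:
        raise ValueError(f"unknown state: {current}")
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target
```

With this table, `transition("running", "failed")` returns `"failed"`, while `transition("rolled-back", "running")` raises `ValueError`.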

Design choice: keep the carrier abstract in this first draft, but include a section describing how the model may bind to existing execution-evidence formats if such a substrate is available and sufficiently mature.

Candidate normative requirements

  • Agents MUST emit a failure event when a task failure can affect dependent execution outside local process scope.
  • Failure events MUST identify the failed task and SHOULD identify affected dependencies when known.
  • Rollback requests MUST be idempotent and uniquely identifiable.
  • Agents receiving a rollback request MUST return a rollback result, even when rollback is refused or only partially completed.
  • A rollback result MUST indicate one of: success, partial success, refusal, irreversible, or failure.
  • Agents MUST NOT claim successful rollback unless the referenced effects were actually reverted or explicitly compensated.
  • If a task is not reversible, the agent MUST signal that fact explicitly rather than silently ignoring rollback.
  • Implementations SHOULD support checkpoint references when a task has externally visible side effects.
  • The specification SHOULD allow policy-controlled escalation rather than requiring automatic rollback for every failure.
  • The document MUST distinguish rollback of prior effects from cancellation of work that has not yet executed.
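
Several of the candidate requirements above (idempotent requests, a result for every request, explicit signaling of irreversibility) can be illustrated with a small receiver. This is a sketch, not a proposed implementation: `RollbackHandler`, its in-memory token cache, and the dictionary result shape are all hypothetical.

```python
from typing import Dict, Iterable, List


class RollbackHandler:
    """Hypothetical receiver that answers every rollback request exactly once per token."""

    def __init__(self, reversible_tasks: Iterable[str]) -> None:
        self._reversible = set(reversible_tasks)
        self._results: Dict[str, dict] = {}  # idempotency token -> cached result
        self.reverts: List[str] = []  # stand-in for real revert side effects

    def handle(self, token: str, task_ids: Iterable[str]) -> dict:
        # A repeated token returns the cached result and reverts nothing again.
        if token in self._results:
            return self._results[token]
        requested = list(task_ids)
        applied = [t for t in requested if t in self._reversible]
        self.reverts.extend(applied)
        if applied == requested:
            outcome = "success"
        elif not applied:
            outcome = "irreversible"  # signaled explicitly, never ignored silently
        else:
            outcome = "partial success"
        result = {"token": token, "outcome": outcome, "applied_scope": applied}
        self._results[token] = result
        return result
```

Caching the result by idempotency token means a retried request gets the same answer without unwinding work a second time; a real receiver would need durable storage for that cache, which this sketch omits.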

Security, privacy, and abuse considerations

  • unauthorized rollback requests could be used for denial of service
  • spoofed failure signals could trigger cascading rollback
  • replayed rollback requests could repeatedly unwind completed work
  • rollback metadata may expose internal topology or sensitive task relationships
  • partial rollback can create inconsistent downstream state that attackers can exploit
  • signed or otherwise authenticated event carriage is strongly preferred, but the draft should avoid redefining base authentication
  • the draft should require clear handling of refusal, partial rollback, and policy escalation to avoid silent unsafe states
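
To make the spoofing and replay concerns above concrete, the sketch below authenticates events with an HMAC over a canonical JSON body and rejects event identifiers it has already seen. The shared-key HMAC is only a stand-in for whatever authentication the underlying carrier provides, which this brief does not define; `sign`, `verify`, `ReplayGuard`, and the `event_id` key are hypothetical names.

```python
import hashlib
import hmac
import json
from typing import Set


def sign(event: dict, key: bytes) -> str:
    # Canonical encoding (sorted keys, fixed separators) so both sides hash the same bytes.
    body = json.dumps(event, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, body, hashlib.sha256).hexdigest()


def verify(event: dict, signature: str, key: bytes) -> bool:
    # Constant-time comparison to avoid leaking how many characters matched.
    return hmac.compare_digest(sign(event, key), signature)


class ReplayGuard:
    """Accepts each authenticated event identifier at most once."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def accept(self, event: dict, signature: str, key: bytes) -> bool:
        if not verify(event, signature, key):
            return False  # spoofed or tampered event
        if event["event_id"] in self._seen:
            return False  # replayed event
        self._seen.add(event["event_id"])
        return True
```

Rejecting a repeated event outright is one option; a receiver could instead return the cached rollback result for a repeated idempotency token, which keeps legitimate retries working.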

Privacy is a secondary but real concern: task identifiers, dependency graphs, and failure reasons can leak operational details.

IANA impact

Most likely minimal for the first version.

If the draft defines abstract event or reason-code registries, keep them compact:

  • rollback event types
  • failure classes
  • rollback outcome codes

If an existing registry from an underlying carrier can be reused, prefer that.

Open design questions

  • Should rollback scope be defined normatively as dependency closure, or left partially implementation-specific with mandatory disclosure of actual scope?
  • Is a separate cancellation event needed, or is that explicitly out of scope for this draft?
  • How much of checkpoint semantics should be mandatory versus profile-specific?
  • Can one draft stay both carrier-agnostic and implementable, or does it need a non-normative binding example to avoid vagueness?
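
For the first question above, one reading of "dependency closure" is the failed task plus every task that transitively depends on it. The sketch below computes that set with a breadth-first traversal; the function name and the shape of the `dependents` mapping are assumptions, and the brief does not commit to this definition.

```python
from collections import deque
from typing import Dict, Iterable, Set


def rollback_closure(failed: str, dependents: Dict[str, Iterable[str]]) -> Set[str]:
    """Return the failed task plus all tasks that transitively depend on it.

    `dependents` maps a task identifier to the tasks that depend on it directly.
    """
    seen = {failed}
    queue = deque([failed])
    while queue:
        task = queue.popleft()
        for dep in dependents.get(task, ()):
            if dep not in seen:  # the visited set also guards against cycles
                seen.add(dep)
                queue.append(dep)
    return seen
```

Given `{"a": ["b", "c"], "b": ["d"], "x": ["y"]}`, a failure of `a` yields the scope `{"a", "b", "c", "d"}` and leaves the unrelated tasks `x` and `y` untouched.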