# Architecture Brief

## Scope

Define an experimental, protocol-agnostic recovery model for multi-agent execution that standardizes:

- failure signaling
- checkpoint references
- rollback request and rollback result semantics
- dependency-aware rollback scope
- minimum task state transitions relevant to recovery

The document should be narrow enough that an existing agent protocol or execution-evidence carrier can adopt it as a profile or extension.

## Non-goals

- defining a full workflow or DAG language
- defining human override or approval workflows beyond a hook for escalation
- defining identity, authentication, or attestation systems
- defining global trust scoring or reputation exchange
- defining scheduler behavior, quota fairness, or resource arbitration beyond optional future hooks

## Terminology and actors

- `agent`: autonomous software entity performing one or more tasks
- `task`: a discrete unit of work whose execution and outcome can be referenced
- `dependency`: another task whose outcome affects whether the current task may continue or must roll back
- `checkpoint`: a recorded pre-action or recovery-safe state from which rollback may proceed
- `failure event`: a machine-actionable signal that a task or dependency failed
- `rollback set`: the set of tasks and effects that the sender requests to revert or compensate
- `recovery record`: a record of rollback attempt, success, partial success, or failure
- `coordinator`: optional role that computes rollback scope across multiple dependent agents

Actors:

- originating agent that detects failure
- dependent agent that receives failure or rollback signals
- optional coordination service or gateway
- policy authority or operator only when automatic rollback is disallowed

## Protocol or data model shape

Use an abstract event model with four core event types:

1. `checkpoint`
2. `failure`
3. `rollback-request`
4. `rollback-result`

Each event should carry a minimum common envelope:

- event identifier
- task identifier
- workflow or execution context identifier if available
- sender identity reference
- timestamp
- referenced parent task or dependency identifiers where relevant

Event-specific content:

- `checkpoint`: checkpoint identifier, reversibility class, optional expiry
- `failure`: failure class, severity, reversibility indicator, blast-radius hint, failed dependency reference
- `rollback-request`: target checkpoint or rollback boundary, requested rollback scope, reason code, urgency, idempotency token
- `rollback-result`: outcome status, actual scope applied, partial rollback indicators, residual risk or manual follow-up required

State model:

- `pending`
- `running`
- `completed`
- `failed`
- `rollback-requested`
- `rolled-back`
- `rollback-failed`
- `compensation-required`

Design choice: keep the carrier abstract in this first draft, but include a section describing how the model may bind to existing execution-evidence formats if such a substrate is available and sufficiently mature.

## Normative requirements candidates

- Agents MUST emit a failure event when a task failure can affect dependent execution outside local process scope.
- Failure events MUST identify the failed task and SHOULD identify affected dependencies when known.
- Rollback requests MUST be idempotent and uniquely identifiable.
- Agents receiving a rollback request MUST return a rollback result, even when rollback is refused or only partially completed.
- A rollback result MUST indicate one of: success, partial success, refusal, irreversible, or failure.
- Agents MUST NOT claim successful rollback unless the referenced effects were actually reverted or explicitly compensated.
- If a task is not reversible, the agent MUST signal that fact explicitly rather than silently ignoring rollback.
- Implementations SHOULD support checkpoint references when a task has externally visible side effects.
- The specification SHOULD allow policy-controlled escalation rather than requiring automatic rollback for every failure.
- The document MUST distinguish rollback of prior effects from cancellation of work that has not yet executed.

## Security, privacy, and abuse considerations

- unauthorized rollback requests could be used for denial of service
- spoofed failure signals could trigger cascading rollback
- replayed rollback requests could repeatedly unwind completed work
- rollback metadata may expose internal topology or sensitive task relationships
- partial rollback can create inconsistent downstream state that attackers can exploit
- signed or otherwise authenticated event carriage is strongly preferred, but the draft should avoid redefining base authentication
- the draft should require clear handling of refusal, partial rollback, and policy escalation to avoid silent unsafe states

Privacy is probably secondary but not zero: task identifiers, dependency graphs, and failure reasons can leak operational details.

## IANA impact

Most likely minimal for the first version. If the draft defines abstract event or reason-code registries, keep them compact:

- rollback event types
- failure classes
- rollback outcome codes

If an existing registry from an underlying carrier can be reused, prefer that.

## Open design questions

- Should rollback scope be defined normatively as dependency closure, or left partially implementation-specific with mandatory disclosure of actual scope?
- Is a separate `cancellation` event needed, or is that explicitly out of scope for this draft?
- How much of checkpoint semantics should be mandatory versus profile-specific?
- Can one draft stay both carrier-agnostic and implementable, or does it need a non-normative binding example to avoid vagueness?
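
As a non-normative illustration relevant to the last question, the sketch below shows one way the `rollback-request` and `rollback-result` events, the state model, and the idempotency requirement could fit together in Python. Everything here is an assumption made for illustration: the brief lists states but does not define transitions, so the `ALLOWED` table is only one plausible reading, and the names `Agent`, `RollbackRequest`, `RollbackResult`, `Outcome`, and `handle` are hypothetical. Authentication, partial rollback, and the `rollback-failed` path are left out.

```python
from dataclasses import dataclass, field
from enum import Enum


class TaskState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    ROLLBACK_REQUESTED = "rollback-requested"
    ROLLED_BACK = "rolled-back"
    ROLLBACK_FAILED = "rollback-failed"
    COMPENSATION_REQUIRED = "compensation-required"


class Outcome(Enum):
    SUCCESS = "success"
    PARTIAL_SUCCESS = "partial-success"
    REFUSAL = "refusal"
    IRREVERSIBLE = "irreversible"
    FAILURE = "failure"


# ASSUMPTION: the brief does not define transitions; this table is illustrative.
ALLOWED = {
    TaskState.PENDING: {TaskState.RUNNING},
    TaskState.RUNNING: {TaskState.COMPLETED, TaskState.FAILED},
    TaskState.COMPLETED: {TaskState.ROLLBACK_REQUESTED},
    TaskState.FAILED: {TaskState.ROLLBACK_REQUESTED},
    TaskState.ROLLBACK_REQUESTED: {
        TaskState.ROLLED_BACK,
        TaskState.ROLLBACK_FAILED,
        TaskState.COMPENSATION_REQUIRED,
    },
}


@dataclass(frozen=True)
class RollbackRequest:
    event_id: str            # common envelope
    task_id: str
    sender: str
    timestamp: str
    checkpoint_id: str       # event-specific content
    reason_code: str
    idempotency_token: str


@dataclass(frozen=True)
class RollbackResult:
    event_id: str
    task_id: str
    outcome: Outcome
    actual_scope: tuple      # task identifiers actually reverted


@dataclass
class Agent:
    states: dict[str, TaskState]
    reversible: dict[str, bool]
    _seen: dict[str, RollbackResult] = field(default_factory=dict)

    def _move(self, task_id: str, new_state: TaskState) -> None:
        # Reject any transition that the (assumed) table does not allow.
        if new_state not in ALLOWED.get(self.states[task_id], set()):
            raise ValueError(f"illegal transition for {task_id}")
        self.states[task_id] = new_state

    def handle(self, req: RollbackRequest) -> RollbackResult:
        # Idempotency: a replayed token returns the stored result unchanged.
        if req.idempotency_token in self._seen:
            return self._seen[req.idempotency_token]
        state = self.states.get(req.task_id)
        if TaskState.ROLLBACK_REQUESTED not in ALLOWED.get(state, set()):
            # Unknown task, or work that never produced effects: refuse
            # explicitly instead of staying silent.
            outcome, scope = Outcome.REFUSAL, ()
        elif not self.reversible.get(req.task_id, False):
            self._move(req.task_id, TaskState.ROLLBACK_REQUESTED)
            self._move(req.task_id, TaskState.COMPENSATION_REQUIRED)
            outcome, scope = Outcome.IRREVERSIBLE, ()
        else:
            self._move(req.task_id, TaskState.ROLLBACK_REQUESTED)
            self._move(req.task_id, TaskState.ROLLED_BACK)
            outcome, scope = Outcome.SUCCESS, (req.task_id,)
        result = RollbackResult(f"result-for-{req.event_id}", req.task_id, outcome, scope)
        self._seen[req.idempotency_token] = result
        return result
```

In this sketch a request against a `pending` task is refused rather than treated as a rollback, which is one way to keep rollback of prior effects distinct from cancellation of work that has not yet executed. A real binding would also need to verify the sender before consulting the idempotency cache.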