feat: add draft data, gap analysis report, and workspace config
@@ -0,0 +1,121 @@
# Architecture Brief

## Scope

Define an experimental, protocol-agnostic recovery model for multi-agent execution that standardizes:

- failure signaling
- checkpoint references
- rollback request and rollback result semantics
- dependency-aware rollback scope
- minimum task state transitions relevant to recovery

The document should be narrow enough that an existing agent protocol or execution-evidence carrier can adopt it as a profile or extension.

## Non-goals

- defining a full workflow or DAG language
- defining human override or approval workflows beyond a hook for escalation
- defining identity, authentication, or attestation systems
- defining global trust scoring or reputation exchange
- defining scheduler behavior, quota fairness, or resource arbitration beyond optional future hooks

## Terminology and actors

- `agent`: autonomous software entity performing one or more tasks
- `task`: a discrete unit of work whose execution and outcome can be referenced
- `dependency`: another task whose outcome affects whether the current task may continue or must roll back
- `checkpoint`: a recorded pre-action or recovery-safe state from which rollback may proceed
- `failure event`: a machine-actionable signal that a task or dependency failed
- `rollback set`: the set of tasks and effects that the sender requests to revert or compensate
- `recovery record`: a record of rollback attempt, success, partial success, or failure
- `coordinator`: optional role that computes rollback scope across multiple dependent agents

Actors:

- originating agent that detects failure
- dependent agent that receives failure or rollback signals
- optional coordination service or gateway
- policy authority or operator only when automatic rollback is disallowed

## Protocol or data model shape

Use an abstract event model with four core event types:

1. `checkpoint`
2. `failure`
3. `rollback-request`
4. `rollback-result`

Each event should carry a minimum common envelope:

- event identifier
- task identifier
- workflow or execution context identifier if available
- sender identity reference
- timestamp
- referenced parent task or dependency identifiers where relevant
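
As a rough illustration of the common envelope listed above, the following sketch models it as an immutable Python dataclass. All field names (`event_id`, `sender_ref`, `related_task_ids`, and so on) and the `new_envelope` helper are hypothetical; the brief names the concepts but does not fix a schema or wire format.

```python
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class EventEnvelope:
    """Hypothetical common envelope; field names are illustrative only."""
    event_id: str
    task_id: str
    sender_ref: str                           # reference to an identity defined elsewhere
    timestamp: datetime
    context_id: Optional[str] = None          # workflow or execution context, if available
    related_task_ids: Tuple[str, ...] = ()    # parent task or dependency identifiers


def new_envelope(
    task_id: str,
    sender_ref: str,
    context_id: Optional[str] = None,
    related_task_ids: Iterable[str] = (),
) -> EventEnvelope:
    """Build an envelope with a fresh identifier and a UTC timestamp."""
    return EventEnvelope(
        event_id=str(uuid.uuid4()),
        task_id=task_id,
        sender_ref=sender_ref,
        timestamp=datetime.now(timezone.utc),
        context_id=context_id,
        related_task_ids=tuple(related_task_ids),
    )
```

Keeping the optional fields defaulted mirrors the "if available" and "where relevant" wording: a sender with no workflow context can still emit a valid envelope.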

Event-specific content:

- `checkpoint`: checkpoint identifier, reversibility class, optional expiry
- `failure`: failure class, severity, reversibility indicator, blast-radius hint, failed dependency reference
- `rollback-request`: target checkpoint or rollback boundary, requested rollback scope, reason code, urgency, idempotency token
- `rollback-result`: outcome status, actual scope applied, partial rollback indicators, residual risk or manual follow-up required
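
One possible way to picture the four event types and their event-specific content is a set of typed bodies keyed by event type. The sketch below is an assumption-laden illustration: the class names, field names, and the choice of plain strings for classes, severities, and outcomes are all hypothetical and not defined by the brief.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class EventType(Enum):
    CHECKPOINT = "checkpoint"
    FAILURE = "failure"
    ROLLBACK_REQUEST = "rollback-request"
    ROLLBACK_RESULT = "rollback-result"


@dataclass(frozen=True)
class CheckpointBody:
    checkpoint_id: str
    reversibility_class: str
    expires_at: Optional[str] = None          # optional expiry


@dataclass(frozen=True)
class FailureBody:
    failure_class: str
    severity: str
    reversible: bool                          # reversibility indicator
    blast_radius_hint: Optional[str] = None
    failed_dependency: Optional[str] = None


@dataclass(frozen=True)
class RollbackRequestBody:
    target: str                               # checkpoint or rollback boundary
    requested_scope: Tuple[str, ...]
    reason_code: str
    urgency: str
    idempotency_token: str


@dataclass(frozen=True)
class RollbackResultBody:
    outcome: str
    actual_scope: Tuple[str, ...]
    partial: bool = False                     # partial rollback indicator
    follow_up: Optional[str] = None           # residual risk or manual follow-up


# Hypothetical mapping from event type to the body shape it must carry.
BODY_TYPES = {
    EventType.CHECKPOINT: CheckpointBody,
    EventType.FAILURE: FailureBody,
    EventType.ROLLBACK_REQUEST: RollbackRequestBody,
    EventType.ROLLBACK_RESULT: RollbackResultBody,
}


def body_matches(event_type: EventType, body: object) -> bool:
    """Return True when the body has the shape expected for the event type."""
    return isinstance(body, BODY_TYPES[event_type])
```

A receiver could call `body_matches` before acting on an event, so that a `rollback-request` arriving with a `failure` body is rejected instead of being half-interpreted.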

State model:

- `pending`
- `running`
- `completed`
- `failed`
- `rollback-requested`
- `rolled-back`
- `rollback-failed`
- `compensation-required`
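
The brief lists the states but not the transitions between them. The sketch below shows how a transition table could be enforced; the specific edges in `ALLOWED` are an assumption made for illustration and would need to be settled by the draft itself.

```python
from enum import Enum


class TaskState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    ROLLBACK_REQUESTED = "rollback-requested"
    ROLLED_BACK = "rolled-back"
    ROLLBACK_FAILED = "rollback-failed"
    COMPENSATION_REQUIRED = "compensation-required"


# ASSUMED transition table: these edges are not specified in the brief.
ALLOWED = {
    TaskState.PENDING: {TaskState.RUNNING},
    TaskState.RUNNING: {TaskState.COMPLETED, TaskState.FAILED},
    TaskState.COMPLETED: {TaskState.ROLLBACK_REQUESTED},
    TaskState.FAILED: {TaskState.ROLLBACK_REQUESTED},
    TaskState.ROLLBACK_REQUESTED: {
        TaskState.ROLLED_BACK,
        TaskState.ROLLBACK_FAILED,
        TaskState.COMPENSATION_REQUIRED,
    },
    TaskState.ROLLBACK_FAILED: {TaskState.COMPENSATION_REQUIRED},
    TaskState.ROLLED_BACK: set(),
    TaskState.COMPENSATION_REQUIRED: set(),
}


def transition(current: TaskState, target: TaskState) -> TaskState:
    """Return the target state, or raise if the assumed table forbids the move."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

In this assumed table a `pending` task cannot enter `rollback-requested`, since no effects exist yet to revert; stopping such a task would be cancellation, not rollback.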

Design choice: keep the carrier abstract in this first draft, but include a section describing how the model may bind to existing execution-evidence formats if such a substrate is available and sufficiently mature.

## Normative requirements candidates

- Agents MUST emit a failure event when a task failure can affect dependent execution outside local process scope.
- Failure events MUST identify the failed task and SHOULD identify affected dependencies when known.
- Rollback requests MUST be idempotent and uniquely identifiable.
- Agents receiving a rollback request MUST return a rollback result, even when rollback is refused or only partially completed.
- A rollback result MUST indicate one of: success, partial success, refusal, irreversible, or failure.
- Agents MUST NOT claim successful rollback unless the referenced effects were actually reverted or explicitly compensated.
- If a task is not reversible, the agent MUST signal that fact explicitly rather than silently ignoring rollback.
- Implementations SHOULD support checkpoint references when a task has externally visible side effects.
- The specification SHOULD allow policy-controlled escalation rather than requiring automatic rollback for every failure.
- The document MUST distinguish rollback of prior effects from cancellation of work that has not yet executed.
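
Several of the candidate requirements above (idempotent requests, a result for every request, explicit signaling of irreversibility) can be sketched together in a small handler. This is a minimal illustration, not a reference implementation: `RollbackHandler`, the `revert` callback, and the in-memory result cache are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class Outcome(Enum):
    SUCCESS = "success"
    PARTIAL_SUCCESS = "partial-success"
    REFUSAL = "refusal"
    IRREVERSIBLE = "irreversible"
    FAILURE = "failure"


@dataclass(frozen=True)
class RollbackResult:
    request_id: str
    outcome: Outcome
    detail: str = ""


class RollbackHandler:
    """Hypothetical receiver that always answers and never reverts twice."""

    def __init__(self, revert: Callable[[str], bool], reversible: bool = True):
        self._revert = revert          # assumed callback: returns True if effects were reverted
        self._reversible = reversible
        self._results: Dict[str, RollbackResult] = {}

    def handle(self, request_id: str, checkpoint_id: str) -> RollbackResult:
        # Idempotency: a repeated request returns the recorded result unchanged.
        if request_id in self._results:
            return self._results[request_id]
        if not self._reversible:
            # Signal irreversibility explicitly instead of ignoring the request.
            result = RollbackResult(request_id, Outcome.IRREVERSIBLE, "no reversible effects")
        else:
            try:
                ok = self._revert(checkpoint_id)
                # Success is claimed only when the callback confirms the revert.
                result = RollbackResult(request_id, Outcome.SUCCESS if ok else Outcome.FAILURE)
            except Exception as exc:
                result = RollbackResult(request_id, Outcome.FAILURE, str(exc))
        self._results[request_id] = result
        return result
```

Caching by request identifier is one simple way to make a retry safe; a real agent would presumably need durable storage so the guarantee survives restarts.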

## Security, privacy, and abuse considerations

- unauthorized rollback requests could be used as denial-of-service
- spoofed failure signals could trigger cascading rollback
- replayed rollback requests could repeatedly unwind completed work
- rollback metadata may expose internal topology or sensitive task relationships
- partial rollback can create inconsistent downstream state that attackers can exploit
- signed or otherwise authenticated event carriage is strongly preferred, but the draft should avoid redefining base authentication
- the draft should require clear handling of refusal, partial rollback, and policy escalation to avoid silent unsafe states
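
To make the spoofing and replay concerns above concrete, the sketch below shows the kind of check an underlying carrier might perform before an event reaches rollback logic. It assumes a shared-secret HMAC over a canonical JSON encoding, chosen purely for illustration; the brief does not select an authentication mechanism, and `sign` and `EventVerifier` are hypothetical names.

```python
import hashlib
import hmac
import json


def sign(event: dict, key: bytes) -> str:
    """Compute an HMAC-SHA256 tag over a canonical JSON encoding of the event."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


class EventVerifier:
    """Hypothetical gate that rejects unauthenticated and replayed events."""

    def __init__(self, key: bytes):
        self._key = key
        self._seen = set()   # event identifiers already accepted

    def accept(self, event: dict, signature: str) -> bool:
        # Constant-time comparison avoids leaking tag bytes through timing.
        if not hmac.compare_digest(sign(event, self._key), signature):
            return False     # spoofed or tampered event
        if event["event_id"] in self._seen:
            return False     # replay of an already accepted event
        self._seen.add(event["event_id"])
        return True
```

A deployment that relies on idempotency tokens might answer a replay with the previously recorded result instead of dropping it; either way the rollback itself should not run a second time.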

Privacy is a secondary concern here, but not a negligible one: task identifiers, dependency graphs, and failure reasons can leak operational details.

## IANA impact

Most likely minimal for the first version.

If the draft defines abstract event or reason-code registries, keep them compact:

- rollback event types
- failure classes
- rollback outcome codes

If an existing registry from an underlying carrier can be reused, prefer that.

## Open design questions

- Should rollback scope be defined normatively as dependency closure, or left partially implementation-specific with mandatory disclosure of actual scope?
- Is a separate `cancellation` event needed, or is that explicitly out of scope for this draft?
- How much of checkpoint semantics should be mandatory versus profile-specific?
- Can one draft stay both carrier-agnostic and implementable, or does it need a non-normative binding example to avoid vagueness?
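
For the first question, it may help to see what "dependency closure" would mean operationally. The sketch below computes the transitive set of dependents of a failed task with a breadth-first walk; the `dependents` adjacency map and the function name are hypothetical, and whether the draft should mandate this scope is exactly what the question leaves open.

```python
from collections import deque
from typing import Dict, Iterable, Set


def rollback_closure(failed: str, dependents: Dict[str, Iterable[str]]) -> Set[str]:
    """Return the failed task plus every task that transitively depends on it.

    `dependents` maps a task identifier to the tasks that consume its outcome.
    """
    scope = {failed}
    queue = deque([failed])
    while queue:
        task = queue.popleft()
        for child in dependents.get(task, ()):
            if child not in scope:      # visit each task once, so cycles terminate
                scope.add(child)
                queue.append(child)
    return scope
```

Under the alternative in the question, an agent could apply a narrower scope than this closure and report the difference in the actual scope field of its rollback result.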