feat: add draft data, gap analysis report, and workspace config

2026-04-06 18:47:15 +02:00
parent 4f310407b0
commit 2506b6325a
189 changed files with 62649 additions and 0 deletions
--- a/workspace/draft-team/cycles/agent-error-recovery-rollback/20-architecture-brief.md
+++ b/workspace/draft-team/cycles/agent-error-recovery-rollback/20-architecture-brief.md
@@ -0,0 +1,121 @@
+# Architecture Brief
+
+## Scope
+
+Define an experimental, protocol-agnostic recovery model for multi-agent execution that standardizes:
+
+- failure signaling
+- checkpoint references
+- rollback request and rollback result semantics
+- dependency-aware rollback scope
+- minimum task state transitions relevant to recovery
+
+The document should be narrow enough that an existing agent protocol or execution-evidence carrier can adopt it as a profile or extension.
+
+## Non-goals
+
+- defining a full workflow or DAG language
+- defining human override or approval workflows beyond a hook for escalation
+- defining identity, authentication, or attestation systems
+- defining global trust scoring or reputation exchange
+- defining scheduler behavior, quota fairness, or resource arbitration beyond optional future hooks
+
+## Terminology and actors
+
+- `agent`: autonomous software entity performing one or more tasks
+- `task`: a discrete unit of work whose execution and outcome can be referenced
+- `dependency`: another task whose outcome affects whether the current task may continue or must roll back
+- `checkpoint`: a recorded pre-action or recovery-safe state from which rollback may proceed
+- `failure event`: a machine-actionable signal that a task or dependency failed
+- `rollback set`: the set of tasks and effects that the sender requests to revert or compensate
+- `recovery record`: a record of a rollback attempt and its outcome (success, partial success, or failure)
+- `coordinator`: optional role that computes rollback scope across multiple dependent agents
+
+Actors:
+
+- originating agent that detects failure
+- dependent agent that receives failure or rollback signals
+- optional coordination service or gateway
+- policy authority or operator, involved only when automatic rollback is disallowed
+
+## Protocol or data model shape
+
+Use an abstract event model with four core event types:
+
+1. `checkpoint`
+2. `failure`
+3. `rollback-request`
+4. `rollback-result`
+
+Each event should carry a minimum common envelope:
+
+- event identifier
+- task identifier
+- workflow or execution context identifier if available
+- sender identity reference
+- timestamp
+- referenced parent task or dependency identifiers where relevant
+
+Event-specific content:
+
+- `checkpoint`: checkpoint identifier, reversibility class, optional expiry
+- `failure`: failure class, severity, reversibility indicator, blast-radius hint, failed dependency reference
+- `rollback-request`: target checkpoint or rollback boundary, requested rollback scope, reason code, urgency, idempotency token
+- `rollback-result`: outcome status, actual scope applied, partial rollback indicators, residual risk or manual follow-up required
+
+State model:
+
+- `pending`
+- `running`
+- `completed`
+- `failed`
+- `rollback-requested`
+- `rolled-back`
+- `rollback-failed`
+- `compensation-required`
+
+Design choice: keep the carrier abstract in this first draft, but include a section describing how the model may bind to existing execution-evidence formats if such a substrate is available and sufficiently mature.
+
+## Normative requirements candidates
+
+- Agents MUST emit a failure event when a task failure can affect dependent execution outside local process scope.
+- Failure events MUST identify the failed task and SHOULD identify affected dependencies when known.
+- Rollback requests MUST be idempotent and uniquely identifiable.
+- Agents receiving a rollback request MUST return a rollback result, even when rollback is refused or only partially completed.
+- A rollback result MUST indicate one of: success, partial success, refusal, irreversible, or failure.
+- Agents MUST NOT claim successful rollback unless the referenced effects were actually reverted or explicitly compensated.
+- If a task is not reversible, the agent MUST signal that fact explicitly rather than silently ignoring rollback.
+- Implementations SHOULD support checkpoint references when a task has externally visible side effects.
+- The specification SHOULD allow policy-controlled escalation rather than requiring automatic rollback for every failure.
+- The document MUST distinguish rollback of prior effects from cancellation of work that has not yet executed.
+
+## Security, privacy, and abuse considerations
+
+- unauthorized rollback requests could be used as a denial-of-service vector
+- spoofed failure signals could trigger cascading rollback
+- replayed rollback requests could repeatedly unwind completed work
+- rollback metadata may expose internal topology or sensitive task relationships
+- partial rollback can create inconsistent downstream state that attackers can exploit
+- signed or otherwise authenticated event carriage is strongly preferred, but the draft should avoid redefining base authentication
+- the draft should require clear handling of refusal, partial rollback, and policy escalation to avoid silent unsafe states
+
+Privacy is a secondary concern but not a negligible one: task identifiers, dependency graphs, and failure reasons can leak operational details.
+
+## IANA impact
+
+Most likely minimal for the first version.
+
+If the draft defines abstract event or reason-code registries, keep them compact:
+
+- rollback event types
+- failure classes
+- rollback outcome codes
+
+If an existing registry from an underlying carrier can be reused, prefer that.
+
+## Open design questions
+
+- Should rollback scope be defined normatively as dependency closure, or left partially implementation-specific with mandatory disclosure of actual scope?
+- Is a separate `cancellation` event needed, or is that explicitly out of scope for this draft?
+- How much of checkpoint semantics should be mandatory versus profile-specific?
+- Can one draft stay both carrier-agnostic and implementable, or does it need a non-normative binding example to avoid vagueness?
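
Illustrative note on the brief above: the state model and the rollback requirements (idempotent requests, a result returned even on refusal, explicit signaling of irreversibility) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the brief's design. The brief lists the states but not the edges between them, so the `ALLOWED` transition table is an assumption, as are the class names, field names, outcome spellings, and the `result-for-` event identifier scheme.

```python
from dataclasses import dataclass
from enum import Enum


class TaskState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    ROLLBACK_REQUESTED = "rollback-requested"
    ROLLED_BACK = "rolled-back"
    ROLLBACK_FAILED = "rollback-failed"
    COMPENSATION_REQUIRED = "compensation-required"


S = TaskState
# Assumed transition table: the brief names the states only.
ALLOWED = {
    S.PENDING: {S.RUNNING},
    S.RUNNING: {S.COMPLETED, S.FAILED},
    S.COMPLETED: {S.ROLLBACK_REQUESTED},
    S.FAILED: {S.ROLLBACK_REQUESTED},
    S.ROLLBACK_REQUESTED: {S.ROLLED_BACK, S.ROLLBACK_FAILED, S.COMPENSATION_REQUIRED},
    S.ROLLED_BACK: set(),
    S.ROLLBACK_FAILED: {S.COMPENSATION_REQUIRED},
    S.COMPENSATION_REQUIRED: set(),
}


@dataclass(frozen=True)
class RollbackRequest:
    event_id: str
    task_id: str
    checkpoint_id: str
    idempotency_token: str
    reason_code: str


@dataclass(frozen=True)
class RollbackResult:
    event_id: str
    task_id: str
    outcome: str  # success, partial-success, refusal, irreversible, or failure


class Agent:
    """Hypothetical agent that tracks task state and answers rollback requests."""

    def __init__(self):
        self.states = {}      # task_id -> TaskState
        self.reversible = {}  # task_id -> bool (assumed reversibility flag)
        self._results = {}    # idempotency_token -> RollbackResult

    def transition(self, task_id, new_state):
        current = self.states[task_id]
        if new_state not in ALLOWED[current]:
            raise ValueError(f"{current.value} -> {new_state.value} not allowed")
        self.states[task_id] = new_state

    def handle_rollback(self, req):
        # A replayed request returns the stored result without side effects.
        if req.idempotency_token in self._results:
            return self._results[req.idempotency_token]

        state = self.states.get(req.task_id)
        if state is None or S.ROLLBACK_REQUESTED not in ALLOWED[state]:
            # Unknown task, or work that has not produced effects yet:
            # refuse explicitly instead of silently ignoring the request.
            outcome = "refusal"
        elif not self.reversible.get(req.task_id, False):
            self.transition(req.task_id, S.ROLLBACK_REQUESTED)
            self.transition(req.task_id, S.COMPENSATION_REQUIRED)
            outcome = "irreversible"
        else:
            self.transition(req.task_id, S.ROLLBACK_REQUESTED)
            self.transition(req.task_id, S.ROLLED_BACK)
            outcome = "success"

        result = RollbackResult(f"result-for-{req.event_id}", req.task_id, outcome)
        self._results[req.idempotency_token] = result
        return result
```

In this sketch a request against a `pending` task is refused rather than rolled back, which mirrors the brief's distinction between rolling back prior effects and cancelling work that has not yet executed. Whether `running` tasks should accept rollback requests is left out here and would depend on how the open questions about scope and cancellation are resolved.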