feat: add draft data, gap analysis report, and workspace config

2026-04-06 18:47:15 +02:00
parent 4f310407b0
commit 2506b6325a
189 changed files with 62649 additions and 0 deletions
--- a/workspace/drafts/new-drafts/draft-aerr-agent-error-recovery-rollback-00.txt
+++ b/workspace/drafts/new-drafts/draft-aerr-agent-error-recovery-rollback-00.txt
@@ -0,0 +1,309 @@
+Internet-Draft                                           AI/Agent WG
+Intended status: Standards Track                          March 2026
+Expires: September 15, 2026
+
+
+         Agent Error Recovery and Rollback (AERR)
+         draft-aerr-agent-error-recovery-rollback-00
+
+Abstract
+
+   This document defines the Agent Error Recovery and Rollback
+   (AERR) protocol, a lightweight standard for handling errors,
+   cascading failures, and rollback in multi-agent systems.
+   Autonomous AI agents increasingly make unsupervised decisions,
+   yet no standard exists for how agents checkpoint state, signal
+   errors to peers, contain cascading failures, or roll back
+   autonomous decisions gone wrong.  AERR defines three mechanisms:
+   state checkpoints that agents create before consequential
+   actions, a circuit breaker pattern to contain cascading failures
+   across agent networks, and a rollback protocol for reverting
+   agent actions to a known-good state.  AERR messages are JSON
+   objects that do not depend on a particular transport; this
+   document specifies their exchange using standard HTTP semantics.
+
+Status of This Memo
+
+   This Internet-Draft is submitted in full conformance with the
+   provisions of BCP 78 and BCP 79.
+
+   This document is intended to have Standards Track status.
+   Distribution of this memo is unlimited.
+
+Table of Contents
+
+   1.  Introduction
+   2.  Terminology
+   3.  Problem Statement
+   4.  Checkpoint Mechanism
+   5.  Error Signaling
+   6.  Circuit Breaker Pattern
+   7.  Rollback Protocol
+   8.  Security Considerations
+   9.  IANA Considerations
+
+1.  Introduction
+
+   The IETF AI/agent landscape includes 60 drafts on autonomous
+   network operations but none that standardize error recovery.
+   When an autonomous agent misconfigures a router, allocates
+   resources incorrectly, or triggers an unintended cascade of
+   actions across a multi-agent system, there is currently no
+   standard mechanism for detecting the failure, containing its
+   blast radius, or reverting to a safe state.
+
+   AERR borrows proven patterns from distributed systems:
+   checkpoints from database transactions, circuit breakers from
+   microservice architectures, and rollback from version control.
+   It adapts these patterns to the specific needs of AI agents,
+   where actions may be partially reversible and where the agent
+   that caused the error may not be the best one to fix it.
+
+   Design principles:
+   1. Agents that take consequential actions MUST be able to undo
+      them, or MUST declare them irreversible upfront.
+   2. Failure containment takes priority over failure diagnosis.
+   3. The protocol adds minimal overhead to the happy path.
+
+2.  Terminology
+
+   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
+   NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED",
+   "MAY", and "OPTIONAL" in this document are to be interpreted as
+   described in BCP 14 [RFC2119] [RFC8174] when, and only when, they
+   appear in all capitals, as shown here.
+
+   Checkpoint: A snapshot of an agent's state and the external
+   effects of its actions at a point in time, sufficient to
+   restore the system to that state.
+
+   Circuit Breaker: A mechanism that stops an agent from
+   propagating requests to a failing downstream agent, preventing
+   cascading failures.
+
+   Rollback: The process of reverting an agent's actions and state
+   to a previously recorded checkpoint.
+
+   Blast Radius: The set of agents and systems affected by a
+   single agent's failure.
+
+3.  Problem Statement
+
+   Consider a network operations scenario: Agent A instructs
+   Agent B to update firewall rules, which causes Agent C's
+   traffic monitoring to fail, which causes Agent D to
+   misclassify traffic patterns.  Today each agent handles errors
+   independently with no coordination.  There is no standard way
+   for Agent D to signal that the root cause is upstream, for the
+   cascade to be halted, or for the chain of actions to be rolled
+   back.
+
+   The only existing draft that partially addresses this space
+   (draft-yue-anima-agent-recovery-networks) focuses on mobile
+   network fault recovery and does not provide general-purpose
+   error recovery primitives usable across agent types.
+
+4.  Checkpoint Mechanism
+
+   An AERR-compliant agent MUST create a checkpoint before any
+   action it classifies as "consequential."  An action is
+   consequential if it modifies external state, for example by
+   changing network configuration, writing database records, or
+   making API calls with side effects.
+
+   A checkpoint is a JSON object:
+
+      {
+        "checkpoint_id": "urn:uuid:...",
+        "agent_id": "urn:uuid:...",
+        "timestamp": "2026-03-01T12:00:00Z",
+        "action": {
+          "type": "config_update",
+          "target": "router-07.example.com",
+          "description": "Update BGP peer config"
+        },
+        "reversible": true,
+        "rollback_procedure": {
+          "method": "POST",
+          "uri": "https://agent-b.example.com/aerr/rollback",
+          "payload_ref": "urn:uuid:...prior-config-snapshot"
+        },
+        "state_hash": "sha256:abcdef...",
+        "ttl": 86400
+      }
+
+   The "reversible" field MUST be present.  If false, the agent
+   declares that this action cannot be automatically undone and
+   rollback requests for this checkpoint MUST be escalated to a
+   human operator.
+
+   The "state_hash" provides integrity verification: the agent
+   hashes its relevant state at checkpoint time so that rollback
+   can verify it is restoring to an authentic prior state.
+
+   Checkpoints MUST be retained for at least the number of seconds
+   given in "ttl".  Agents SHOULD store checkpoints in durable
+   storage that survives agent restarts.
+
+   Agents MAY create hierarchical checkpoints where a parent
+   checkpoint groups multiple child checkpoints from a multi-step
+   operation.  Rolling back the parent rolls back all children.
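+
+   As a non-normative illustration, the sketch below builds a
+   checkpoint object with the fields shown above.  The helper names
+   (state_hash, make_checkpoint) and the use of sorted-key JSON as
+   the hashing input are assumptions of this example; this document
+   does not define a canonical serialization for "state_hash".
+
```python
import hashlib
import json
import uuid
from datetime import datetime, timezone


def state_hash(state: dict) -> str:
    # Assumption: sorted-key, compact JSON is used as the canonical form
    # so that equal states always hash to the same value.
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_checkpoint(agent_id, action, state, reversible,
                    rollback_procedure=None, ttl=86400):
    """Build a checkpoint dict; 'reversible' is always present."""
    checkpoint = {
        "checkpoint_id": uuid.uuid4().urn,  # e.g. "urn:uuid:..."
        "agent_id": agent_id,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "action": action,
        "reversible": reversible,
        "state_hash": state_hash(state),
        "ttl": ttl,
    }
    # Only reversible actions carry a rollback procedure in this sketch.
    if reversible and rollback_procedure is not None:
        checkpoint["rollback_procedure"] = rollback_procedure
    return checkpoint
```
+
+   Because the hash input is serialized with sorted keys, two state
+   dictionaries with the same contents yield the same "state_hash"
+   regardless of key order.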
+
+5.  Error Signaling
+
+   When an agent detects an error, it MUST emit an AERR error
+   signal to all agents in the current action chain.  The error
+   signal is an HTTP POST to each peer's AERR endpoint:
+
+      POST /aerr/error HTTP/1.1
+      Content-Type: application/json
+
+      {
+        "error_id": "urn:uuid:...",
+        "source_agent": "urn:uuid:...",
+        "severity": "critical",
+        "checkpoint_id": "urn:uuid:...",
+        "error_type": "action_failed",
+        "description": "BGP session did not establish after config update",
+        "timestamp": "2026-03-01T12:05:00Z",
+        "upstream_errors": []
+      }
+
+   Severity levels: "info", "warning", "error", "critical".
+
+   Error types: "action_failed", "timeout", "constraint_violation",
+   "resource_exhausted", "upstream_cascade", "circuit_open" (see
+   Section 6), "unknown".
+
+   When an agent receives an error signal caused by an action it
+   initiated, it MUST either:
+   (a) attempt automatic rollback to the referenced checkpoint, if
+       the action was reversible, or
+   (b) escalate to its operator, if the action was irreversible.
+
+   The "upstream_errors" array allows agents to chain error
+   context, building a causal trace from the symptom back to the
+   root cause.
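+
+   The chaining described above can be sketched as follows.  This is
+   a non-normative example: the function names (new_error,
+   root_causes) are hypothetical, and it assumes that each entry of
+   "upstream_errors" is a complete nested error signal, which this
+   document does not specify.
+
```python
import uuid
from datetime import datetime, timezone


def new_error(source_agent, checkpoint_id, error_type, description,
              severity="error", upstream_errors=None):
    """Build an AERR error signal body as a plain dict."""
    return {
        "error_id": uuid.uuid4().urn,
        "source_agent": source_agent,
        "severity": severity,
        "checkpoint_id": checkpoint_id,
        "error_type": error_type,
        "description": description,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        # Assumption: each entry is a full, nested error signal.
        "upstream_errors": list(upstream_errors or []),
    }


def root_causes(signal):
    """Walk the causal trace and return the signals with no upstream errors."""
    if not signal["upstream_errors"]:
        return [signal]
    roots = []
    for upstream in signal["upstream_errors"]:
        roots.extend(root_causes(upstream))
    return roots
```
+
+   In the scenario of Section 3, Agent D would wrap the signal it
+   received from Agent C in its own "upstream_cascade" signal, and
+   root_causes() applied to D's signal would return the original
+   failure reported for Agent B's action.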
+
+6.  Circuit Breaker Pattern
+
+   Each agent MUST implement a circuit breaker for every downstream
+   agent it communicates with.  The circuit breaker has three
+   states:
+
+   CLOSED (normal operation): Requests flow through.  The agent
+   tracks the error rate over a sliding window (default: 60s).
+
+   OPEN (failure detected): When the error rate exceeds a
+   threshold (default: 50% over the window), the circuit breaker
+   opens.  All requests to the downstream agent are immediately
+   rejected with error_type "circuit_open".  The agent MUST emit
+   an error signal to upstream peers.
+
+   HALF-OPEN (recovery probe): After a cooldown period (default:
+   30s), the circuit breaker allows a single probe request.  If
+   the probe succeeds, the breaker returns to CLOSED.  If the probe
+   fails, the breaker returns to OPEN and doubles the cooldown
+   (exponential backoff, capped at 300s).
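+
+   The three states can be sketched as a small state machine.  This
+   is a non-normative illustration: the class and method names, the
+   injectable clock, and the lowercase state strings are assumptions
+   of the example, and it opens on the first qualifying error rate
+   because this document does not specify a minimum request count.
+
```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, window_s=60.0, threshold=0.5, cooldown_s=30.0,
                 max_cooldown_s=300.0, clock=time.monotonic):
        self.window_s = window_s
        self.threshold = threshold
        self.base_cooldown_s = cooldown_s
        self.cooldown_s = cooldown_s
        self.max_cooldown_s = max_cooldown_s
        self.clock = clock            # injectable so tests can control time
        self.state = "closed"
        self.opened_at = None
        self.outcomes = deque()       # (timestamp, ok) pairs in the window

    def allow_request(self):
        """Return True if a request may be sent downstream right now."""
        if self.state == "open":
            if self.clock() - self.opened_at >= self.cooldown_s:
                self.state = "half-open"   # admit exactly one probe
                return True
            return False                   # caller rejects with "circuit_open"
        if self.state == "half-open":
            return False                   # a probe is already in flight
        return True

    def record(self, ok):
        """Record the outcome of a request that was allowed through."""
        now = self.clock()
        if self.state == "open":
            return                         # ignore late results while open
        if self.state == "half-open":
            if ok:
                self._close()
            else:
                # Failed probe: double the cooldown, capped at the maximum.
                self.cooldown_s = min(self.cooldown_s * 2, self.max_cooldown_s)
                self._open(now)
            return
        self.outcomes.append((now, ok))
        # Drop outcomes that have slid out of the window.
        while now - self.outcomes[0][0] > self.window_s:
            self.outcomes.popleft()
        failures = sum(1 for _, good in self.outcomes if not good)
        if failures / len(self.outcomes) > self.threshold:
            self._open(now)

    def _open(self, now):
        self.state = "open"
        self.opened_at = now

    def _close(self):
        self.state = "closed"
        self.cooldown_s = self.base_cooldown_s
        self.outcomes.clear()
```
+
+   A caller would check allow_request() before contacting the
+   downstream agent and pass the outcome to record() afterwards.
+   Resetting the cooldown to its default on a successful probe is a
+   choice made in this sketch, not a requirement stated here.
+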
+
+   Agents MUST expose circuit breaker state at:
+
+      GET /aerr/circuits
+
+   Response:
+      {
+        "circuits": [
+          {
+            "downstream_agent": "urn:uuid:...",
+            "state": "open",
+            "error_rate": 0.75,
+            "last_failure": "2026-03-01T12:05:00Z",
+            "cooldown_remaining_s": 22
+          }
+        ]
+      }
+
+   This enables monitoring systems and upstream agents to
+   understand the health topology of the agent network.
+
+7.  Rollback Protocol
+
+   A rollback is initiated by sending an HTTP POST to the target
+   agent's rollback endpoint:
+
+      POST /aerr/rollback HTTP/1.1
+      Content-Type: application/json
+
+      {
+        "rollback_id": "urn:uuid:...",
+        "checkpoint_id": "urn:uuid:...",
+        "reason": "Upstream action caused cascading failure",
+        "initiator": "urn:uuid:...",
+        "cascade": true
+      }
+
+   When "cascade" is true, the receiving agent MUST also initiate
+   rollback of any downstream checkpoints that were created as a
+   consequence of the checkpointed action.  This enables a single
+   rollback request to unwind an entire chain of agent actions.
+
+   The agent MUST respond with a rollback result:
+
+      {
+        "rollback_id": "urn:uuid:...",
+        "status": "partial",
+        "checkpoint_id": "urn:uuid:...",
+        "state_hash_before": "sha256:...",
+        "state_hash_after": "sha256:...",
+        "cascaded_rollbacks": [
+          {"agent_id": "urn:uuid:...", "status": "completed"},
+          {"agent_id": "urn:uuid:...", "status": "escalated"}
+        ]
+      }
+
+   Rollback status values: "completed", "partial", "escalated",
+   "failed".
+
+   "escalated" means the action was irreversible and a human
+   operator has been notified.  "partial" means some but not all
+   downstream rollbacks succeeded.
+
+   Agents MUST implement idempotent rollback: receiving the same
+   rollback_id twice MUST return the same result without re-
+   executing the rollback.
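+
+   One way to meet the idempotency requirement is to cache results
+   by "rollback_id", as in the non-normative sketch below.  The
+   class name and the restore and notify_operator callables are
+   hypothetical, the cache is in memory only, and cascading to
+   downstream agents is left out for brevity.
+
```python
class RollbackHandler:
    def __init__(self, checkpoints, restore, notify_operator):
        self.checkpoints = checkpoints          # checkpoint_id -> checkpoint dict
        self.restore = restore                  # hypothetical: undo action, return bool
        self.notify_operator = notify_operator  # hypothetical escalation hook
        self.results = {}                       # rollback_id -> cached result

    def handle(self, request):
        """Process a rollback request body and return the result body."""
        rollback_id = request["rollback_id"]
        # Idempotency: a repeated rollback_id returns the stored result
        # without executing the rollback again.
        if rollback_id in self.results:
            return self.results[rollback_id]

        checkpoint = self.checkpoints.get(request["checkpoint_id"])
        if checkpoint is None:
            status = "failed"
        elif not checkpoint["reversible"]:
            # Irreversible actions are escalated to a human operator.
            self.notify_operator(request)
            status = "escalated"
        else:
            status = "completed" if self.restore(checkpoint) else "failed"

        result = {
            "rollback_id": rollback_id,
            "checkpoint_id": request["checkpoint_id"],
            "status": status,
        }
        self.results[rollback_id] = result
        return result
```
+
+   A production agent would likely persist the result cache in the
+   same durable storage as its checkpoints, so that a retried
+   request is still recognized after a restart.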
+
+8.  Security Considerations
+
+   Rollback requests are sensitive operations.  Agents MUST
+   authenticate rollback requests using mutual TLS or signed JWTs.
+   Agents SHOULD accept rollback requests only from agents in the
+   same action chain (identified by checkpoint lineage).
+
+   Checkpoint data may contain sensitive system state.  Agents
+   MUST encrypt stored checkpoints at rest and MUST NOT include
+   checkpoint contents in error signals.
+
+   Circuit breaker state is observable information about system
+   health.  The /aerr/circuits endpoint SHOULD be access-
+   controlled to prevent adversaries from mapping system topology.
+
+   Malicious agents could send false error signals to trigger
+   unnecessary rollbacks.  Agents SHOULD verify that error signals
+   reference valid checkpoint IDs from their own action chains
+   before initiating rollback.
+
+9.  IANA Considerations
+
+   This document requests that IANA establish the following:
+
+   1. An "AERR Error Type" registry under Specification Required
+      policy.  Initial entries: "action_failed", "timeout",
+      "constraint_violation", "resource_exhausted",
+      "upstream_cascade", "circuit_open", "unknown".
+
+   2. An "AERR Severity Level" registry under Specification
+      Required policy.  Initial entries: "info", "warning",
+      "error", "critical".
+
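+   3. A well-known URI registration for the suffix "aerr" per
+      RFC 8615, covering the "error", "rollback", and "circuits"
+      resources.  (RFC 8615 suffixes are single path segments and
+      cannot contain "/".)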
+
+Author's Address
+
+   Generated by IETF Draft Analyzer
+   2026-03-01