Christian Nennemann 2506b6325a


feat: add draft data, gap analysis report, and workspace config

2026-04-06 18:47:15 +02:00


fullname: Generated by IETF Draft Analyzer
organization: Independent
email: placeholder@example.com

normative:
  RFC7519:
  RFC7515:
  RFC9110:
  I-D.nennemann-wimse-ect:
    title: "Execution Context Tokens for Distributed Agentic Workflows"
    target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/

informative:
  I-D.nennemann-agent-dag-hitl-safety:
    title: "Agent Context Policy Token: DAG Delegation with Human Override"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/

--- abstract

This document defines the Agent Error Recovery and Rollback (AERR) protocol, a standard for handling errors, cascading failures, and rollback in multi-agent systems. AERR defines three mechanisms: state checkpoints recorded as Execution Context Token (ECT) DAG nodes, a circuit breaker pattern to contain cascading failures, and a rollback protocol that walks the ECT DAG backwards to revert agent actions to a known-good state. By building on ECT, AERR inherits cryptographic audit trails, assurance levels, and DAG validation without inventing parallel infrastructure.

--- middle

Introduction

The IETF AI/agent landscape includes 60 drafts on autonomous network operations but none that standardize error recovery. When an autonomous agent misconfigures a router, allocates resources incorrectly, or triggers a cascade of failures across a multi-agent system, there is no standard mechanism for detecting the failure, containing its blast radius, or reverting to a safe state.

AERR borrows proven patterns from distributed systems -- checkpoints from database transactions, circuit breakers from microservice architectures, rollback from version control -- and adapts them for AI agent workflows. Rather than inventing its own audit and tracing layer, AERR records all checkpoints, errors, and rollbacks as ECT DAG nodes {{I-D.nennemann-wimse-ect}}, giving every recovery action a cryptographic proof chain.

Design principles:

Agents that take consequential actions MUST be able to undo them, or MUST declare them irreversible upfront.
Failure containment takes priority over failure diagnosis.
The protocol adds minimal overhead to the happy path.

Conventions and Definitions

{::boilerplate bcp14-tagged}

Checkpoint:: An ECT recording an agent's state hash before a consequential action, providing a restore point for rollback.
Circuit Breaker:: A mechanism that stops an agent from propagating requests to a failing downstream agent, preventing cascading failures.
Rollback:: The process of reverting an agent's actions and state to a previously recorded checkpoint, walking the ECT DAG backwards.
Blast Radius:: The set of agents and systems affected by a single agent's failure, determinable by traversing the ECT DAG forward from the failing node.

Problem Statement

Consider a network operations scenario: Agent A instructs Agent B to update firewall rules, which causes Agent C's traffic monitoring to fail, which in turn causes Agent D to misclassify traffic. Today, each agent handles errors independently. There is no standard way for Agent D to signal that the root cause is upstream, for the cascade to be halted, or for the chain of actions to be rolled back.

The ECT DAG {{I-D.nennemann-wimse-ect}} already records causal ordering of agent actions via par references. AERR adds checkpoint semantics, error propagation, and rollback operations on top of this existing structure.

Checkpoint Mechanism

An AERR-compliant agent MUST create a checkpoint ECT before any action it classifies as consequential. An action is consequential if it modifies external state (e.g., network config, database records, API calls with side effects).

Checkpoint as ECT

A checkpoint is an ECT with:

exec_act: "aerr:checkpoint"
par: the jti of the preceding task ECT in the workflow
out_hash: SHA-256 hash of the agent's state snapshot at checkpoint time (for rollback integrity verification)

The ext claim carries AERR-specific metadata:

{
  "ext": {
    "aerr.action_type": "config_update",
    "aerr.target": "router-07.example.com",
    "aerr.reversible": true,
    "aerr.rollback_uri": "https://agent-b.example.com/aerr/rollback",
    "aerr.ttl": 86400
  }
}

{: #fig-checkpoint title="Checkpoint ECT Extension Claims"}

The aerr.reversible field MUST be present. If false, the agent declares that this action cannot be automatically undone and rollback requests MUST be escalated to a human operator via the HITL mechanism {{I-D.nennemann-agent-dag-hitl-safety}}.

Agents MAY create hierarchical checkpoints using the ECT DAG: a parent checkpoint ECT with par references to multiple child checkpoint ECTs. Rolling back the parent rolls back all children.
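The checkpoint claims described above can be assembled as in the following minimal sketch. It is illustrative only and makes several assumptions that this document does not specify: the state snapshot is hashed over a canonical JSON encoding, out_hash is hex-encoded, par is carried as a list of jti values, and the helper names (state_hash, build_checkpoint_claims) are hypothetical. Signing and serialization of the ECT are out of scope here.

```python
import hashlib
import json
import uuid


def state_hash(snapshot: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the snapshot (encoding is an assumption)."""
    canonical = json.dumps(snapshot, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_checkpoint_claims(parent_jti: str, snapshot: dict, *, action_type: str,
                            target: str, reversible: bool, rollback_uri: str,
                            ttl: int = 86400) -> dict:
    """Assemble the unsigned claims of a checkpoint ECT (hypothetical helper)."""
    return {
        "jti": str(uuid.uuid4()),
        "exec_act": "aerr:checkpoint",
        "par": [parent_jti],               # preceding task ECT in the workflow
        "out_hash": state_hash(snapshot),  # used to verify state integrity on rollback
        "ext": {
            "aerr.action_type": action_type,
            "aerr.target": target,
            "aerr.reversible": reversible,  # MUST be present
            "aerr.rollback_uri": rollback_uri,
            "aerr.ttl": ttl,
        },
    }
```

Sorting keys before hashing keeps the hash stable across dict orderings, so the same state always yields the same out_hash.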

Checkpoint Storage

Checkpoint ECTs MUST be stored for at least the duration specified by aerr.ttl, expressed in seconds. At assurance level L3 {{I-D.nennemann-wimse-ect}}, checkpoints are automatically preserved in the audit ledger. At L1 and L2, agents MUST store checkpoints in durable local storage that survives agent restarts.

Error Signaling

When an agent detects an error, it MUST produce an error ECT and propagate it to affected agents in the DAG.

Error ECT

An error signal is an ECT with:

exec_act: "aerr:error"
par: the jti of the checkpoint ECT associated with the failing action

The ext claim carries error details:

{
  "ext": {
    "aerr.severity": "critical",
    "aerr.error_type": "action_failed",
    "aerr.description": "BGP session did not establish",
    "aerr.checkpoint_id": "550e8400-e29b-41d4-a716-446655440001",
    "aerr.upstream_errors": []
  }
}

{: #fig-error title="Error ECT Extension Claims"}

Severity levels: info, warning, error, critical.

Error types: action_failed, timeout, constraint_violation, resource_exhausted, upstream_cascade, circuit_open, unknown.

Error Propagation via DAG

When an agent receives an error ECT caused by an action it initiated, it MUST either:

(a) Attempt automatic rollback of its checkpoint ({{rollback}}), or

(b) Escalate to its operator if the action was irreversible.

The aerr.upstream_errors array allows agents to chain error context by referencing jti values of predecessor error ECTs, building a causal trace from symptom to root cause through the DAG.
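As an illustration of how such a causal trace might be consumed, the sketch below follows aerr.upstream_errors from a symptom back to the errors that have no known upstream. The function name root_causes and the in-memory mapping of jti to error ECT claims are assumptions made for this example, not part of the protocol.

```python
def root_causes(errors: dict, start_jti: str) -> list:
    """Follow aerr.upstream_errors from a symptom to errors with no known upstream.

    `errors` maps jti -> error ECT claims (hypothetical in-memory store).
    """
    roots, seen, stack = [], set(), [start_jti]
    while stack:
        jti = stack.pop()
        if jti in seen:
            continue
        seen.add(jti)
        upstream = [
            u for u in errors[jti]["ext"].get("aerr.upstream_errors", [])
            if u in errors  # ignore references we have not received
        ]
        if upstream:
            stack.extend(upstream)
        else:
            roots.append(jti)  # nothing upstream: treat as a root cause
    return roots


# Agent D's error points at Agent C's, which points at Agent B's.
errors = {
    "e-b": {"ext": {"aerr.upstream_errors": []}},
    "e-c": {"ext": {"aerr.upstream_errors": ["e-b"]}},
    "e-d": {"ext": {"aerr.upstream_errors": ["e-c"]}},
}
print(root_causes(errors, "e-d"))  # ['e-b']
```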

HITL Escalation

When an error requires human intervention, the error ECT SHOULD trigger a HITL rule per {{I-D.nennemann-agent-dag-hitl-safety}}. Example policy:

{
  "hitl": {
    "rules": [{
      "id": "r-critical-error",
      "trigger": {
        "kind": "keyword_match",
        "op": "eq",
        "value": "critical",
        "input_ref": "ext.aerr.severity"
      },
      "required_role": "operator:oncall",
      "action": "escalate",
      "allow_override": true,
      "override_action": "continue"
    }]
  }
}

{: #fig-hitl-error title="HITL Policy for Critical Errors"}

Circuit Breaker Pattern

Each agent MUST implement a circuit breaker for every downstream agent it communicates with.

States

CLOSED (normal):: Requests flow through. The agent tracks the error rate over a sliding window (default: 60 seconds).
OPEN (failure detected):: When the error rate exceeds a threshold (default: 50% over the window), the breaker opens. All requests to the downstream agent are immediately rejected with aerr.error_type: circuit_open. The agent MUST produce an error ECT and emit it to upstream peers.
HALF-OPEN (recovery probe):: After a cooldown period (default: 30 seconds), the breaker allows a single probe request. If it succeeds, the breaker returns to CLOSED. If it fails, it returns to OPEN with doubled cooldown (exponential backoff, max 300 seconds).
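The three states above can be sketched as a small state machine. This is a non-normative illustration using the defaults given in the text (60 second window, 50% threshold, 30 second cooldown, 300 second cap). The class name, the injectable clock, and the min_samples guard (which avoids opening on a single early failure) are assumptions added for the example. Emission of error and state-change ECTs is omitted.

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, window_s=60.0, threshold=0.5, cooldown_s=30.0,
                 max_cooldown_s=300.0, min_samples=5, clock=time.monotonic):
        self.window_s, self.threshold = window_s, threshold
        self.base_cooldown_s, self.max_cooldown_s = cooldown_s, max_cooldown_s
        self.cooldown_s = cooldown_s
        self.min_samples = min_samples  # assumption: not specified by AERR
        self.clock = clock
        self.state = "closed"
        self.opened_at = 0.0
        self.samples = deque()  # (timestamp, ok) pairs inside the sliding window

    def allow_request(self) -> bool:
        """Return True if a request may be sent to the downstream agent."""
        if self.state == "open":
            if self.clock() - self.opened_at < self.cooldown_s:
                return False
            self.state = "half_open"  # cooldown elapsed: allow a single probe
            return True
        if self.state == "half_open":
            return False  # the single probe is already in flight
        return True

    def record(self, ok: bool) -> None:
        """Record a request outcome and update the breaker state."""
        now = self.clock()
        if self.state == "open":
            return  # late responses do not affect an open breaker
        if self.state == "half_open":
            if ok:
                self.state, self.cooldown_s = "closed", self.base_cooldown_s
            else:
                # Failed probe: reopen with doubled cooldown, capped.
                self.cooldown_s = min(self.cooldown_s * 2, self.max_cooldown_s)
                self.state, self.opened_at = "open", now
            self.samples.clear()
            return
        self.samples.append((now, ok))
        while self.samples and now - self.samples[0][0] > self.window_s:
            self.samples.popleft()
        failures = sum(1 for _, good in self.samples if not good)
        if (len(self.samples) >= self.min_samples
                and failures / len(self.samples) > self.threshold):
            self.state, self.opened_at = "open", now
            self.samples.clear()
```

A caller would check allow_request() before each downstream call and pass the outcome to record(). Injecting the clock keeps the state machine testable without real waiting.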

State Change ECTs

Each circuit breaker state change MUST produce an ECT:

exec_act: "aerr:circuit_open", "aerr:circuit_half_open", or "aerr:circuit_closed"
par: the jti of the error ECT that triggered the transition

This records the health topology of the agent network in the ECT DAG, queryable from the audit ledger at L3.

Observability

Agents MUST expose circuit breaker state at:

GET /aerr/circuits

Response:

{
  "circuits": [{
    "downstream_agent": "spiffe://example.com/agent/router-mgr",
    "state": "open",
    "error_rate": 0.75,
    "last_failure_ect": "550e8400-e29b-41d4-a716-446655440099",
    "cooldown_remaining_s": 22
  }]
}

{: #fig-circuits title="Circuit Breaker Status"}

Rollback Protocol

Rollback Request

A rollback is initiated by sending an HTTP POST to the target agent's rollback endpoint:

POST /aerr/rollback HTTP/1.1
Content-Type: application/json
Execution-Context: <rollback-request-ECT>

{
  "rollback_id": "urn:uuid:...",
  "checkpoint_id": "550e8400-e29b-41d4-a716-446655440001",
  "reason": "Upstream action caused cascading failure",
  "cascade": true
}

{: #fig-rollback-req title="Rollback Request"}

The request MUST include an ECT in the Execution-Context header with exec_act: "aerr:rollback_request" and par referencing the error ECT that motivated the rollback.

When cascade is true, the receiving agent MUST also initiate rollback of any downstream checkpoints created as a consequence of the checkpointed action. The ECT DAG's par chain identifies these downstream actions.
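One way an agent might identify the downstream checkpoints for a cascading rollback is to invert the par references and walk the DAG forward from the checkpoint being rolled back. The sketch below is illustrative: it assumes the agent holds the relevant ECT claims in a dict keyed by jti, that par is a list of jti values, and that downstream checkpoints should be rolled back before the ones they depend on. That ordering is a design choice of this example and is not mandated by the text above. The function name is hypothetical.

```python
def downstream_checkpoints(ects: dict, checkpoint_jti: str) -> list:
    """Return checkpoint jtis reachable forward from checkpoint_jti.

    Results are in DFS post-order, so every checkpoint appears before any
    checkpoint it (transitively) descends from.
    """
    # Invert par references: parent jti -> list of child jtis.
    children = {}
    for jti, claims in ects.items():
        for parent in claims.get("par", []):
            children.setdefault(parent, []).append(jti)

    order, seen = [], set()

    def visit(jti):
        if jti in seen:
            return
        seen.add(jti)
        for child in children.get(jti, []):
            visit(child)
        order.append(jti)  # appended only after everything downstream of it

    visit(checkpoint_jti)
    return [
        j for j in order
        if j != checkpoint_jti and ects[j].get("exec_act") == "aerr:checkpoint"
    ]
```

The same forward traversal, without the exec_act filter, yields the blast radius of a failing node as defined earlier.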

Rollback Response

The agent produces a rollback result ECT with:

exec_act: "aerr:rollback_complete" (or "aerr:rollback_escalated")
par: the jti of the rollback request ECT
out_hash: SHA-256 hash of the agent's state after rollback

{
  "ext": {
    "aerr.rollback_id": "urn:uuid:...",
    "aerr.status": "completed",
    "aerr.state_hash_before": "sha256:...",
    "aerr.state_hash_after": "sha256:...",
    "aerr.cascaded": [
      {"agent": "spiffe://example.com/agent/monitor", "status": "completed"},
      {"agent": "spiffe://example.com/agent/classify", "status": "escalated"}
    ]
  }
}

{: #fig-rollback-resp title="Rollback Result ECT"}

Status values: completed, partial, escalated, failed.

escalated means the action was irreversible and a human operator has been notified via HITL. partial means some but not all downstream rollbacks succeeded.

Idempotency

Agents MUST implement idempotent rollback: receiving the same rollback_id twice MUST return the same result without re-executing the rollback.
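A minimal sketch of this requirement follows. It caches the result of each rollback by rollback_id so that a repeated request replays the stored result. The class name and the execute callable are hypothetical. A real agent would presumably keep the result map in durable storage and guard it against concurrent requests, neither of which is shown here.

```python
class RollbackHandler:
    def __init__(self, execute):
        self._execute = execute  # hypothetical callable that performs the rollback
        self._results = {}       # rollback_id -> result already returned

    def handle(self, request: dict) -> dict:
        rollback_id = request["rollback_id"]
        if rollback_id in self._results:
            # Duplicate request: return the stored result without re-executing.
            return self._results[rollback_id]
        result = self._execute(request)
        self._results[rollback_id] = result
        return result
```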

Security Considerations

Rollback requests are sensitive operations. Agents MUST authenticate rollback requests via the ECT signature chain, and SHOULD authorize a rollback only when the requesting agent's ECTs appear in the same workflow DAG (identified by wid).

Checkpoint ECTs contain out_hash of agent state but not the state itself. Agents MUST encrypt stored state snapshots at rest.

Circuit breaker status exposes system health topology. The /aerr/circuits endpoint SHOULD be access-controlled.

Malicious agents could emit false error ECTs to trigger rollbacks. Agents SHOULD verify that error ECTs reference valid checkpoint jti values from their own workflow DAG before initiating rollback. At L2 and L3, ECT signatures prevent forgery.
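The check described above might look like the following sketch. It assumes that signature verification has already succeeded, that the agent keeps its own checkpoint claims in a dict keyed by jti, and that both ECTs carry a wid claim. The function name is hypothetical.

```python
def references_own_checkpoint(error_ect: dict, own_checkpoints: dict) -> bool:
    """Accept an error ECT only if its par points at one of this agent's
    checkpoints in the same workflow (same wid)."""
    wid = error_ect.get("wid")
    if wid is None or error_ect.get("exec_act") != "aerr:error":
        return False
    for jti in error_ect.get("par", []):
        checkpoint = own_checkpoints.get(jti)
        if checkpoint is not None and checkpoint.get("wid") == wid:
            return True
    return False
```

An agent would run this before acting on an error ECT and ignore, or log and escalate, any error that fails the check.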

IANA Considerations

This document requests the following IANA registrations:

An "AERR Error Type" registry under Specification Required policy. Initial entries: action_failed, timeout, constraint_violation, resource_exhausted, upstream_cascade, circuit_open, unknown.
Registration of exec_act values aerr:checkpoint, aerr:error, aerr:rollback_request, aerr:rollback_complete, aerr:rollback_escalated, aerr:circuit_open, aerr:circuit_half_open, aerr:circuit_closed in a future ECT action type registry.

--- back

Acknowledgments

{:numbered="false"}

This document builds on the Execution Context Token specification {{I-D.nennemann-wimse-ect}} for DAG-based audit trails and the Agent Context Policy Token {{I-D.nennemann-agent-dag-hitl-safety}} for HITL escalation of irreversible actions.
