ietf-draft-analyzer/workspace/drafts/new-drafts/draft-c-hitl-human-in-the-loop-01.md
Christian Nennemann 2506b6325a
feat: add draft data, gap analysis report, and workspace config
2026-04-06 18:47:15 +02:00

---
title: "Human-in-the-Loop (HITL) Primitives for Agent Ecosystems"
abbrev: "HITL"
category: std
docname: draft-hitl-human-in-the-loop-01
submissiontype: IETF
number:
date:
v: 3
area: "OPS"
workgroup: "NMOP"
keyword:
 - human override
 - HITL
 - emergency stop
 - agentic safety
 - explainability
author:
 -
    fullname: TBD
    organization: Independent
    email: placeholder@example.com
normative:
  RFC2119:
  RFC8174:
  RFC7519:
  RFC8446:
  RFC8615:
  RFC9110:
  I-D.nennemann-wimse-ect:
    title: "Execution Context Tokens for Distributed Agentic Workflows"
    target: https://datatracker.ietf.org/doc/draft-nennemann-wimse-ect/
  I-D.nennemann-agent-dag-hitl-safety:
    title: "Agent Context Policy Token: DAG Delegation with Human Override"
    target: https://datatracker.ietf.org/doc/draft-nennemann-agent-dag-hitl-safety/
informative:
--- abstract
This document defines runtime HITL (Human-in-the-Loop) primitives
for agent ecosystems: four escalating override levels, approval
gates, timeout and fallback policies, and explainability hooks.
ACP-DAG-HITL defines WHEN humans must intervene (policy rules and
triggers). This specification defines HOW the intervention
actually happens at the protocol level: the HTTP endpoints,
override semantics, agent compliance requirements,
acknowledgment flows, and explainability tokens that allow
operators to make informed decisions. All overrides and decisions
produce ECT nodes, making human interventions part of the same
auditable DAG as agent actions.
--- middle
# Introduction
The current ratio of autonomous capability drafts to human
oversight drafts in the IETF is roughly 7:1. Agents can act but
humans cannot reliably stop them.
ACP-DAG-HITL {{I-D.nennemann-agent-dag-hitl-safety}} defines the
policy: trigger conditions, required roles, and actions (`pause`,
`escalate`, `abort`). However, it deliberately defers the runtime
protocol: how does an operator actually send a stop command? How
does the agent acknowledge it? What happens if the operator is
unreachable?
This specification fills that gap. It is the runtime enforcement
companion to ACP-DAG-HITL, inspired by industrial safety systems:
the e-stop button on factory equipment, the circuit breaker in
electrical systems, and the kill switch in robotics.
HITL is deliberately not a governance framework, policy language,
or accountability protocol. It is a panic button with a
well-defined interface.
# Conventions and Definitions
{::boilerplate bcp14-tagged}
Override:
: A human-initiated command that alters an agent's autonomous
operation, taking precedence over the agent's own decisions.
Operator:
: A human user authorized to issue override commands.
Approval Gate:
: A DAG node that blocks workflow progression until a human
approves or rejects continuation.
HITL Intensity Level:
: A deployment-wide configuration of how actively human oversight
is required. Distinct from override levels (which are runtime
commands).
# HITL Intensity Levels {#intensity}
A deployment configures a HITL intensity level that determines
the baseline human oversight requirement. This is orthogonal to
the four runtime override levels ({{levels}}): intensity levels
govern planning; override levels govern runtime intervention.
| Intensity | Label | Human requirement | When to use |
|-----------|-------|-------------------|-------------|
| I0 | Autonomous | No HITL required by default | Dev/test; fully trusted agents |
| I1 | Advisory | Notifications; no blocking | Monitoring-only production deployments |
| I2 | Selective | Approval required on critical paths only | Standard production cross-org deployments |
| I3 | Mandatory | Approval required on every consequential action | Regulated environments; EU AI Act high-risk systems |
{: #fig-intensity title="HITL Intensity Levels"}
Intensity levels are declared in ACP-DAG-HITL workflow policy.
Each intensity level requires a minimum AEM assurance level
(see {{assurance-binding}}):
| HITL Intensity | Minimum AEM Assurance Level |
|---------------|----------------------------|
| I0 | L1 |
| I1 | L1 |
| I2 | L2 |
| I3 | L3 |
{: #fig-intensity-assurance title="Intensity to Assurance Level Mapping"}
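
The mapping above can be checked mechanically. The following is a minimal sketch, assuming assurance levels L1 to L3 are represented as the integers 1 to 3; the names `MIN_ASSURANCE` and `assurance_satisfies` are hypothetical and not defined by this specification.

```python
# Hypothetical lookup mirroring the intensity-to-assurance table.
# Assumption: AEM assurance levels L1..L3 are encoded as integers 1..3.
MIN_ASSURANCE = {"I0": 1, "I1": 1, "I2": 2, "I3": 3}


def assurance_satisfies(intensity: str, assurance_level: int) -> bool:
    """Return True if the assurance level meets the minimum for the intensity."""
    if intensity not in MIN_ASSURANCE:
        raise ValueError(f"unknown HITL intensity: {intensity!r}")
    return assurance_level >= MIN_ASSURANCE[intensity]
```

A deployment tool could call `assurance_satisfies("I3", 2)` while validating a policy and reject the configuration when the result is false.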
# Relationship to ACP-DAG-HITL {#mapping}
ACP-DAG-HITL defines three HITL actions. This specification
maps them to runtime override levels and adds a fourth level,
CONSTRAIN (partial restriction), which has no ACP-DAG-HITL
equivalent:
| ACP-DAG-HITL action | HITL Override Level | Behavior |
|---------------------|---------------------|----------|
| `pause` | Level 1: PAUSE | Suspend autonomous actions, hold state |
| (no equivalent) | Level 2: CONSTRAIN | Restrict to an allowlist of actions |
| `abort` | Level 3: STOP | Cease all actions, enter inert state |
| `escalate` | Level 4: TAKEOVER | Transfer control to human operator |
{: #fig-mapping title="ACP-DAG-HITL to HITL Level Mapping"}
When ACP-DAG-HITL rules trigger, the runtime system uses the
corresponding HITL level to enforce the action.
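
As an illustration of this mapping, a runtime might hold it as a simple lookup. This is a sketch only; `OverrideLevel`, `ACTION_TO_LEVEL`, and `level_for_action` are hypothetical names chosen for the example.

```python
from enum import IntEnum


class OverrideLevel(IntEnum):
    """The four runtime override levels, numbered as in the table."""
    PAUSE = 1
    CONSTRAIN = 2
    STOP = 3
    TAKEOVER = 4


# CONSTRAIN has no ACP-DAG-HITL equivalent, so it is absent from this map.
ACTION_TO_LEVEL = {
    "pause": OverrideLevel.PAUSE,
    "abort": OverrideLevel.STOP,
    "escalate": OverrideLevel.TAKEOVER,
}


def level_for_action(action: str) -> OverrideLevel:
    """Translate an ACP-DAG-HITL action name into an override level."""
    try:
        return ACTION_TO_LEVEL[action]
    except KeyError:
        raise ValueError(f"unknown ACP-DAG-HITL action: {action!r}") from None
```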
# Override Levels {#levels}
## Level 1: PAUSE
The agent MUST suspend all autonomous actions and hold current
state. It MUST NOT initiate new actions but MAY complete
in-progress actions if stopping mid-execution would cause harm
(e.g., an in-flight database transaction). The agent resumes
when a RESUME command is received.
## Level 2: CONSTRAIN
The agent MUST restrict its actions to a specified subset. The
override command includes an allowlist of permitted action types.
The agent MUST reject any action not on the allowlist, responding
with HTTP 403 and an ECT noting the constraint violation.
## Level 3: STOP
The agent MUST immediately cease all autonomous actions and enter
an inert state. It MUST NOT take any autonomous actions until
explicitly restarted. This is the e-stop. Any in-progress
consequential actions MUST be abandoned; if abandonment would
leave external state inconsistent, the agent MUST emit an
`atd:error` ECT and the ATD rollback protocol applies.
## Level 4: TAKEOVER
The agent MUST transfer operational control to the human operator.
It enters a pass-through mode where it executes only explicit
operator commands. The agent's sensors and outputs remain
available to the operator as tools. Deployments SHOULD require
two-operator authorization for TAKEOVER (see {{security}}).
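
The per-level rules for starting new actions can be summarized as a single gate function. The sketch below is illustrative and covers only whether a new action may start; completion of in-progress actions under PAUSE and rollback under STOP are out of scope. The function name and parameters are hypothetical, and treating operator commands as refused under PAUSE and STOP is an assumption of this sketch.

```python
PAUSE, CONSTRAIN, STOP, TAKEOVER = 1, 2, 3, 4


def may_start_action(level, action_type, allowlist=(), operator_command=False):
    """Decide whether a new action may start under the active override.

    level is None when no override is active. operator_command marks an
    action explicitly requested by the human operator.
    """
    if level is None:
        return True
    if level in (PAUSE, STOP):
        # No new actions while paused or stopped (assumption: this sketch
        # also refuses operator commands in these states).
        return False
    if level == CONSTRAIN:
        # Only action types on the override's allowlist are permitted.
        return action_type in allowlist
    if level == TAKEOVER:
        # Pass-through mode: only explicit operator commands run.
        return operator_command
    raise ValueError(f"unknown override level: {level!r}")
```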
# Override Protocol {#protocol}
## Override Command
Override commands are sent as HTTP POST to the agent's well-known
endpoint:
~~~
POST /.well-known/hitl/override HTTP/1.1
Content-Type: application/json
Authorization: Bearer <operator-jwt>
Execution-Context: <override-ect>
~~~
The override ECT MUST contain:
- `exec_act`: `"hitl:override"`
- `par`: the most recent ECT from the agent being overridden
(linking the override into the workflow DAG)
~~~json
{
  "exec_act": "hitl:override",
  "par": ["agent-last-action-ect"],
  "ext": {
    "hitl.level": 3,
    "hitl.reason": "Agent blocking legitimate traffic",
    "hitl.operator_id": "user:alice",
    "hitl.scope": "*",
    "hitl.constraints": null,
    "hitl.ttl": null,
    "hitl.nonce": "a3f8b2c1"
  }
}
~~~
{: #fig-override title="Override ECT"}
Field definitions:
- `hitl.level`: Integer 1-4. MUST be present.
- `hitl.reason`: Human-readable text. MUST be logged.
- `hitl.scope`: `"*"` for all functions, or an array of function
IDs for partial override.
- `hitl.constraints`: For Level 2 only. Array of permitted action
types.
- `hitl.ttl`: Duration in seconds. If set, override auto-expires.
If null, persists until explicitly lifted.
- `hitl.nonce`: REQUIRED. A random value to prevent replay attacks.
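
A receiver might validate the `ext` claims of an override ECT before acting on them. The following sketch is one possible reading of the field definitions above. The function name is hypothetical, and two checks go beyond the text and are assumptions: that `hitl.constraints` must be a list at Level 2 and absent or null otherwise, and that a non-null `hitl.ttl` must be positive.

```python
def validate_override_ext(ext: dict) -> list:
    """Return a list of problems found in an override ECT's ext claims."""
    errors = []
    level = ext.get("hitl.level")
    if isinstance(level, bool) or not isinstance(level, int) or not 1 <= level <= 4:
        errors.append("hitl.level must be an integer 1-4")
    if not ext.get("hitl.nonce"):
        errors.append("hitl.nonce is required")
    if "hitl.scope" in ext:
        scope = ext["hitl.scope"]
        is_id_list = isinstance(scope, list) and all(isinstance(s, str) for s in scope)
        if scope != "*" and not is_id_list:
            errors.append('hitl.scope must be "*" or an array of function IDs')
    constraints = ext.get("hitl.constraints")
    if level == 2:
        # Assumption: Level 2 needs an explicit allowlist.
        if not isinstance(constraints, list):
            errors.append("hitl.constraints must be an array for Level 2")
    elif constraints is not None:
        errors.append("hitl.constraints is only valid for Level 2")
    ttl = ext.get("hitl.ttl")
    if ttl is not None:
        # Assumption: a TTL, when set, is a positive number of seconds.
        if isinstance(ttl, bool) or not isinstance(ttl, (int, float)) or ttl <= 0:
            errors.append("hitl.ttl must be a positive number of seconds or null")
    return errors
```

An empty list means the claims passed these checks; signature and replay checks are separate concerns.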
## Acknowledgment
The agent MUST respond within 1 second with an acknowledgment ECT:
- `exec_act`: `"hitl:ack"`
- `par`: the override ECT
~~~json
{
  "exec_act": "hitl:ack",
  "par": ["override-ect-uuid"],
  "ext": {
    "hitl.status": "accepted",
    "hitl.prior_state": "autonomous",
    "hitl.current_state": "stopped",
    "hitl.effective_at": "2026-03-01T12:00:00.123Z"
  }
}
~~~
{: #fig-ack title="Acknowledgment ECT"}
The override/ack ECT pair serves as the Decision Record defined
in ACP-DAG-HITL Section 6.5. No separate audit mechanism is
needed.
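
As a rough illustration, an agent could assemble the acknowledgment claims as follows. This is a sketch, not a definitive implementation: signing and JWT encoding are omitted, the state names other than `stopped` are assumptions, and `hitl.detail` is a hypothetical field used here to carry the description required for a `partial` response.

```python
from datetime import datetime, timezone

# Assumption: state names per level; only "stopped" appears in this document.
LEVEL_STATE = {1: "paused", 2: "constrained", 3: "stopped", 4: "takeover"}


def build_ack(override_jti, level, prior_state, unmet=None, now=None):
    """Build the claims of a hitl:ack ECT for a received override."""
    now = now or datetime.now(timezone.utc)
    ext = {
        # "partial" when something could not be done; never "rejected".
        "hitl.status": "partial" if unmet else "accepted",
        "hitl.prior_state": prior_state,
        "hitl.current_state": LEVEL_STATE[level],
        "hitl.effective_at": now.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
    }
    if unmet:
        ext["hitl.detail"] = list(unmet)  # hypothetical field name
    return {"exec_act": "hitl:ack", "par": [override_jti], "ext": ext}
```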
## Resume and Lift
To resume from PAUSE:
~~~
POST /.well-known/hitl/resume HTTP/1.1
Execution-Context: <resume-ect with exec_act="hitl:resume">
~~~
To lift any override:
~~~
POST /.well-known/hitl/lift HTTP/1.1
Execution-Context: <lift-ect with exec_act="hitl:lift">
~~~
Both produce ECTs linked to the original override ECT via `par`.
# Agent Compliance Requirements {#compliance}
Every HITL-compliant agent MUST:
1. Implement the `/.well-known/hitl/override` endpoint per
{{RFC8615}}.
2. Process override commands within 1 second of receipt. The
override path MUST be independent of the agent's main
processing loop and MUST NOT be blocked by ongoing tasks.
3. Acknowledge every override with an ECT response.
4. Never respond with status `rejected`; overrides are
mandatory. If the agent cannot fully comply, it MUST respond
with status `partial` and describe what it could not do.
5. Expose current override status at:
~~~
GET /.well-known/hitl/status
~~~
~~~json
{
  "agent_id": "spiffe://example.com/agent/firewall",
  "override_active": true,
  "current_level": 3,
  "override_ect": "override-ect-uuid",
  "since": "2026-03-01T12:00:00Z",
  "operator_id": "user:alice"
}
~~~
{: #fig-status title="Override Status Response"}
In addition, the override endpoint SHOULD be served on a separate
port or network interface from the agent's main API to ensure
availability under load.
# Approval Gates {#approval-gates}
An approval gate is a DAG node that blocks workflow progression
until a human approves. Unlike overrides (which interrupt running
agents), approval gates are planned checkpoints in the workflow.
Approval gates are defined as ACP-DAG-HITL nodes with HITL rules:
~~~json
{
  "dag": {
    "nodes": [
      {
        "id": "n-approve",
        "type": "hitl:approval_gate",
        "agent": "system:hitl-gateway",
        "constraints": {
          "hitl.required_role": "clinician:oncall",
          "hitl.timeout_s": 300,
          "hitl.timeout_action": "fail-closed"
        }
      }
    ]
  }
}
~~~
{: #fig-gate title="Approval Gate as DAG Node"}
When the workflow reaches an approval gate, the system:
1. Emits an ECT with `exec_act: "hitl:approval_request"`.
2. Notifies the required human role with an explainability
token (see {{explainability}}).
3. Waits for approval (ECT: `"hitl:approval_granted"`) or
rejection (ECT: `"hitl:approval_denied"`).
4. On timeout, applies `hitl.timeout_action` per {{timeout}}.
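
The decision at the end of these steps can be sketched as a small function that takes the response ECT (or `None` if the timeout expired) and returns what the workflow should do next. The function name and the return strings are hypothetical; signature verification of the response is assumed to have happened already.

```python
def gate_outcome(request_jti, response, timeout_action):
    """Resolve an approval gate from the human's response ECT, if any.

    response is the decoded response ECT, or None when hitl.timeout_s
    elapsed without a decision.
    """
    if response is None:
        # No human decision: defer to the configured timeout policy.
        return f"timeout:{timeout_action}"
    if request_jti not in response.get("par", []):
        raise ValueError("response is not linked to this approval request")
    act = response.get("exec_act")
    if act == "hitl:approval_granted":
        return "proceed"
    if act == "hitl:approval_denied":
        return "halt"
    raise ValueError(f"unexpected exec_act at approval gate: {act!r}")
```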
## Approval Request and Response ECTs
~~~json
{
  "exec_act": "hitl:approval_request",
  "par": ["pre-gate-ect-uuid"],
  "ext": {
    "hitl.required_role": "clinician:oncall",
    "hitl.context": "Medication dosage adjustment for patient P-1042",
    "hitl.timeout_s": 300,
    "hitl.explainability_ref": "expl-ect-uuid"
  }
}
~~~
{: #fig-approval-req title="Approval Request ECT"}
~~~json
{
  "exec_act": "hitl:approval_granted",
  "par": ["approval-request-ect-uuid"],
  "ext": {
    "hitl.operator_id": "user:dr-jones",
    "hitl.scope": "medication:adjust",
    "hitl.expires": "2026-03-01T13:00:00Z"
  }
}
~~~
{: #fig-approval-grant title="Approval Granted ECT"}
~~~json
{
  "exec_act": "hitl:approval_denied",
  "par": ["approval-request-ect-uuid"],
  "ext": {
    "hitl.operator_id": "user:dr-jones",
    "hitl.reason": "Dosage exceeds safe maximum for patient weight",
    "hitl.alternative": "Use standard protocol dosage"
  }
}
~~~
{: #fig-approval-deny title="Approval Denied ECT"}
# Timeout and Fallback Policy {#timeout}
When a human does not respond within `hitl.timeout_s`, the
agent applies `hitl.timeout_action`. Three policies are
supported:
fail-closed:
: Abort the workflow. The agent emits `atd:error` with
`error_type: "timeout"` and the ATD rollback protocol
applies. Use when safety requires no action over wrong action.
fail-open:
: Continue as if approved, recording an audit ECT that no human
approved. Use only when workflow continuity is more important
than human review (I0/I1 intensity deployments).
escalate:
: Move the approval request to the next operator in the
escalation chain (defined in ACP-DAG-HITL policy). If the
escalation chain is exhausted, fall back to `fail-closed`.
The timeout policy is set in ACP-DAG-HITL node constraints:
~~~json
{
  "constraints": {
    "hitl.timeout_s": 300,
    "hitl.timeout_action": "escalate"
  }
}
~~~
{: #fig-timeout title="Timeout Policy as Node Constraint"}
Timeout policy MUST be `fail-closed` at HITL intensity I3.
Timeout policy MUST NOT be `fail-open` when assurance level is L3.
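
The two restrictions above, together with the escalation fallback, can be sketched as follows. The function names are hypothetical, assurance levels are assumed to be encoded as integers 1 to 3, and the escalation chain is assumed to be an ordered list of operator identifiers not yet tried.

```python
TIMEOUT_ACTIONS = {"fail-closed", "fail-open", "escalate"}


def timeout_policy_violations(action, intensity, assurance_level):
    """Return the reasons a configured timeout action is not permitted."""
    problems = []
    if action not in TIMEOUT_ACTIONS:
        problems.append(f"unsupported timeout action: {action}")
    if intensity == "I3" and action != "fail-closed":
        problems.append("intensity I3 requires fail-closed")
    if assurance_level == 3 and action == "fail-open":
        problems.append("fail-open is not permitted at assurance L3")
    return problems


def on_timeout(action, remaining_chain):
    """Return (decision, next_operator) when hitl.timeout_s elapses."""
    if action == "escalate":
        if remaining_chain:
            return ("escalate", remaining_chain[0])
        # Escalation chain exhausted: fall back to fail-closed.
        return ("fail-closed", None)
    return (action, None)
```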
# Explainability {#explainability}
When a HITL point is triggered, the agent SHOULD provide an
explainability token that allows the operator to make an informed
decision. At AEM assurance L2+, explainability is REQUIRED for
approval gate requests.
An explainability token is an ECT:
- `exec_act`: `"hitl:explanation"`
~~~json
{
  "exec_act": "hitl:explanation",
  "par": ["last-agent-action-ect"],
  "ext": {
    "hitl.summary": "Agent proposes to reroute BGP traffic from AS64496 to AS64497 due to packet loss exceeding 15% threshold over 5-minute window.",
    "hitl.proposed_action": "update-bgp-peer router-07 neighbor 198.51.100.1 remove-private-as",
    "hitl.evidence_ects": [
      "snmp-poll-1-ect-uuid",
      "snmp-poll-2-ect-uuid",
      "loss-calc-ect-uuid"
    ],
    "hitl.confidence": 0.91,
    "hitl.risk_level": "medium",
    "hitl.reversible": true
  }
}
~~~
{: #fig-explanation title="Explainability Token ECT"}
Field definitions:
- `hitl.summary`: Human-readable description of what the agent
was doing and why HITL was reached. REQUIRED.
- `hitl.proposed_action`: What the agent proposes to do.
REQUIRED.
- `hitl.evidence_ects`: Array of `jti` values from prior ECTs
that support the proposal. SHOULD be present.
- `hitl.confidence`: Float 0.0-1.0; agent's self-assessed
confidence in the proposed action. SHOULD be present.
- `hitl.risk_level`: One of `low`, `medium`, `high`, `critical`.
SHOULD be present.
- `hitl.reversible`: Whether the proposed action can be rolled
back. REQUIRED.
The `hitl.explainability_ref` field in the approval request ECT
({{fig-approval-req}}) references the `jti` of this ECT.
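
A gateway that forwards explainability tokens to operators might first check them against the field definitions above. The sketch below is illustrative; the function name is hypothetical, and treating `hitl.reversible` as a JSON boolean is an assumption drawn from the example.

```python
RISK_LEVELS = {"low", "medium", "high", "critical"}
REQUIRED_FIELDS = ("hitl.summary", "hitl.proposed_action", "hitl.reversible")


def validate_explanation_ext(ext: dict) -> list:
    """Return a list of problems found in an explainability token's ext claims."""
    errors = [f"{name} is required" for name in REQUIRED_FIELDS if name not in ext]
    if "hitl.reversible" in ext and not isinstance(ext["hitl.reversible"], bool):
        errors.append("hitl.reversible must be a boolean")
    conf = ext.get("hitl.confidence")
    if conf is not None:
        numeric = isinstance(conf, (int, float)) and not isinstance(conf, bool)
        if not numeric or not 0.0 <= conf <= 1.0:
            errors.append("hitl.confidence must be a number in 0.0-1.0")
    risk = ext.get("hitl.risk_level")
    if risk is not None and risk not in RISK_LEVELS:
        errors.append("hitl.risk_level must be low, medium, high, or critical")
    return errors
```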
# Binding to AEM Assurance Levels {#assurance-binding}
HITL requirements vary by AEM assurance level. The following
table is normative:
| AEM Level | Required HITL Intensity | Override signing | Explainability |
|-----------|------------------------|-----------------|----------------|
| L1 | I0 (optional) | Optional | Optional |
| L2 | I2 or higher | REQUIRED (signed JWT) | REQUIRED for I2+ |
| L3 | I3 | REQUIRED (signed JWT, L3 ECT) | REQUIRED |
{: #fig-assurance-hitl title="HITL Requirements by Assurance Level"}
At L3, approval gate responses (`hitl:approval_granted`) MUST be
committed to the audit ledger.
# Broadcast Override {#broadcast}
For environments with many agents, an operator MAY send a
broadcast override to a management endpoint:
~~~
POST /hitl/broadcast HTTP/1.1
Content-Type: application/json
Execution-Context: <broadcast-override-ect>

{
  "targets": ["spiffe://example.com/agent/a",
              "spiffe://example.com/agent/b"],
  "level": 3,
  "reason": "Coordinated emergency stop"
}
~~~
The broadcast endpoint fans out individual override ECTs to each
target and returns per-agent results. Each fan-out is itself an
ECT linked to the broadcast override ECT.
Broadcast overrides MUST be authenticated at AEM assurance level
L2 or higher.
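
The fan-out step might look like the following sketch, which builds one override ECT per target, each linked to the broadcast ECT through `par`. The function name and the `last_ects` parameter (the most recent known ECT per agent, if any) are hypothetical, and signing and delivery are omitted.

```python
import secrets
import uuid


def fan_out(broadcast_jti, targets, level, reason, operator_id, last_ects=None):
    """Build one override ECT per target, linked to the broadcast ECT."""
    last_ects = last_ects or {}
    ects = {}
    for target in targets:
        par = [broadcast_jti]
        if target in last_ects:
            # Also link the agent's most recent ECT when it is known.
            par.append(last_ects[target])
        ects[target] = {
            "jti": str(uuid.uuid4()),
            "exec_act": "hitl:override",
            "par": par,
            "ext": {
                "hitl.level": level,
                "hitl.reason": reason,
                "hitl.operator_id": operator_id,
                "hitl.scope": "*",
                "hitl.constraints": None,
                "hitl.ttl": None,
                "hitl.nonce": secrets.token_hex(16),  # fresh nonce per target
            },
        }
    return ects
```

Each target receives its own `jti` and nonce so that per-agent replay checks remain independent.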
# Dead Man's Switch {#dead-man}
For maximum reliability, agents SHOULD implement a heartbeat
mechanism: the agent periodically pings an operator heartbeat
endpoint. If the heartbeat is missed for a configurable duration,
the agent automatically enters Level 1 (PAUSE), unless policy
selects Level 3 (STOP) as described below.
The heartbeat interval SHOULD be 30 seconds. The trigger
threshold SHOULD be 3 missed heartbeats.
This provides a safety net when network connectivity to the
operator is lost. The `unreachable_human` policy from
ACP-DAG-HITL governs behavior when the dead man's switch
activates: either `abort` (→ Level 3) or `safe_pause` (→ Level 1).
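
The trigger logic can be sketched with an injected clock so that it is easy to test. The class and method names below are hypothetical; the defaults follow the recommended 30-second interval and 3-heartbeat threshold, and counting one missed heartbeat per full interval since the last success is an assumption of this sketch.

```python
class DeadMansSwitch:
    """Tracks heartbeat successes and reports when the switch trips."""

    def __init__(self, interval_s=30.0, threshold=3,
                 unreachable_policy="safe_pause", now=0.0):
        self.interval_s = interval_s
        self.threshold = threshold
        self.unreachable_policy = unreachable_policy
        self.last_ok = now  # time of the last successful heartbeat

    def heartbeat_ok(self, now):
        """Record a successful ping of the operator heartbeat endpoint."""
        self.last_ok = now

    def missed(self, now):
        """Number of full heartbeat intervals elapsed since the last success."""
        return int((now - self.last_ok) // self.interval_s)

    def override_level(self, now):
        """Return the level to enter (1 or 3), or None if not tripped."""
        if self.missed(now) < self.threshold:
            return None
        return 3 if self.unreachable_policy == "abort" else 1
```

With the defaults, the switch trips 90 seconds after the last successful heartbeat.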
# Security Considerations {#security}
## Authentication of Override Commands
All override endpoints MUST require authentication via mutual
TLS ({{RFC8446}}) or signed JWTs ({{RFC7519}}). The JWT MUST
contain the operator's identity and be signed by a trusted key
(per ACP-DAG-HITL operator role configuration).
## Replay Prevention
To prevent replay attacks, agents MUST:
1. Reject override ECTs with `iat` more than 30 seconds in the
past.
2. Reject override ECTs whose `jti` value has already been seen.
3. Require the `hitl.nonce` field in override ECTs.
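
The three checks above can be combined in a small guard object. This is a sketch under stated assumptions: the ECT has already been signature-verified and decoded into a dictionary, `iat` is in seconds since the epoch, and the class name is hypothetical.

```python
class ReplayGuard:
    """Applies the freshness, uniqueness, and nonce checks to override ECTs."""

    def __init__(self, max_age_s=30):
        self.max_age_s = max_age_s
        self.seen_jti = set()

    def accept(self, ect: dict, now: float) -> bool:
        """Return True if the ECT passes all replay checks, recording its jti."""
        if not ect.get("ext", {}).get("hitl.nonce"):
            return False  # nonce is required
        iat = ect.get("iat")
        if iat is None or now - iat > self.max_age_s:
            return False  # missing or too old
        jti = ect.get("jti")
        if jti is None or jti in self.seen_jti:
            return False  # missing or already seen
        self.seen_jti.add(jti)
        return True
```

Because tokens older than the freshness window are rejected anyway, a longer-running implementation could evict `jti` entries once they age past that window instead of keeping them forever.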
## Impersonation
Override commands carry high privilege. Agents MUST verify:
- The operator JWT is signed by a trusted key in the ACP-DAG-HITL
operator registry.
- The operator role matches the `required_role` in the triggering
HITL rule.
## Two-Operator Approval for TAKEOVER
Deployments SHOULD implement multi-operator approval for Level 4
(TAKEOVER), requiring two independent operator identities. The
two approval ECTs MUST both appear as `par` in the TAKEOVER
override ECT.
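
A receiving agent could check the two-operator requirement as sketched below. The function name is hypothetical, and two points are assumptions rather than requirements of this document: that the approvals are `hitl:approval_granted` ECTs, and that the agent can look up verified ECTs by `jti`.

```python
def takeover_authorized(takeover_ect: dict, verified_ects: dict) -> bool:
    """Check that a TAKEOVER override cites approvals from two operators.

    verified_ects maps jti to already signature-verified ECT claims.
    """
    operators = set()
    for jti in takeover_ect.get("par", []):
        parent = verified_ects.get(jti)
        if parent is None:
            continue  # unknown parent, e.g. the agent's own last ECT
        # Assumption: approvals are expressed as hitl:approval_granted ECTs.
        if parent.get("exec_act") != "hitl:approval_granted":
            continue
        operator = parent.get("ext", {}).get("hitl.operator_id")
        if operator:
            operators.add(operator)
    # Two approvals from the same identity do not count as independent.
    return len(operators) >= 2
```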
## HITL Bypass Prevention
Agents that claim a HITL gate was satisfied MUST provide the
`jti` of the corresponding `hitl:approval_granted` ECT in the
ECT that follows the gate. Agents MUST NOT proceed past an
approval gate without a valid signed approval ECT.
## Escalation Chain Integrity
The escalation chain in ACP-DAG-HITL policy defines which roles
receive escalations. This chain MUST be signed as part of the
policy token to prevent tampering. Agents MUST NOT follow
escalation chains from unsigned or unverified policy tokens.
# IANA Considerations
## Well-Known URI Registrations
This document requests the following registrations per {{RFC8615}}:
| URI Suffix | Purpose |
|------------|---------|
| `hitl/override` | Override command endpoint |
| `hitl/resume` | Resume from PAUSE |
| `hitl/lift` | Lift any active override |
| `hitl/status` | Override status query |
{: #fig-wellknown title="Well-Known URI Registrations"}
## `exec_act` Values
This document requests registration in the AEM Ecosystem
Extension Registry:
| Value | Description | Reference |
|-------|-------------|-----------|
| `hitl:override` | Human override command | This document |
| `hitl:ack` | Agent acknowledgment of override | This document |
| `hitl:resume` | Resume from PAUSE state | This document |
| `hitl:lift` | Lift any active override | This document |
| `hitl:approval_request` | Workflow blocked at approval gate | This document |
| `hitl:approval_granted` | Human approved continuation | This document |
| `hitl:approval_denied` | Human denied continuation | This document |
| `hitl:explanation` | Explainability token for HITL decision | This document |
{: #fig-iana-actions title="HITL exec_act Registrations"}
--- back
# Acknowledgments
{:numbered="false"}
This specification is the runtime enforcement companion to
ACP-DAG-HITL {{I-D.nennemann-agent-dag-hitl-safety}}. Override
design is inspired by industrial safety systems (IEC 62061,
ISO 13849). The explainability token design is informed by
EU AI Act Article 13 transparency requirements.