RFC: Event-Driven Governance and Actions #230

@krokoko

Primary area: Cross-cutting / multiple

Related issue or feature request: (none — new proposal)


Summary

Today, human-in-the-loop (HITL) and most governance controls are synchronous and tool-centric: Cedar policies in PreToolUse gate Bash, Write, Read, etc. Operators configure bash patterns and tool shapes — not semantic moments like "plan ready", "PR opened", or "cumulative cost exceeded $25".

Meanwhile, observability and notifications are already event-driven via TaskEventsTable and FanOutConsumer, but that plane cannot prevent side effects unless something blocked earlier on the hot path.

This RFC proposes a unified Event Governance layer: a normative event catalog, declarative event rules (condition → action), sync (in-agent, can block) vs async (stream consumer, react only) evaluation modes, registry-native configuration (versioned event-rule-pack assets pinned by blueprints), and UX as a first-class requirement (bgagent submit governance preview, unified bgagent pending for event- and tool-sourced approvals). Tool-level Cedar HITL remains the fail-closed safety net for execution.

Use case and motivation

Who it's for: platform engineers, blueprint authors, security reviewers, and operators using bgagent, Slack, and GitHub fan-out.

Pain today:

| Need | Today | Gap |
| --- | --- | --- |
| Approve plan before code runs | Awkward Cedar on Write/Bash | Same intent reachable via many tool sequences |
| Pause at cumulative cost threshold | Tool-centric only | No aggregate rules on agent_cost_update |
| Notify on pr_created on protected branch | Fan-out filters | Cannot gate before PR without sync checkpoint |
| Escalate on high-severity approval_requested | Partial fan-out | Not unified with rule packs / audit |

Expressing lifecycle governance as Cedar on tool argv is unreliable (retry/model variation, approval fatigue, false positives). Reactive needs (notify, audit, post-hoc cancel) do not belong on the tool hot path.

After this RFC: Operators configure governance against named lifecycle moments (checkpoints, milestones, aggregates). Sync checkpoints block irreversible work; async rules notify, escalate, or cancel without pretending they prevented an action that already happened.
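
To make the "aggregate" case concrete, here is a minimal sketch of an async cost-ceiling rule. It assumes each agent_cost_update event carries a per-update cost; the field names `type`, `task_id`, and `cost_delta_usd` and the class `CostCeiling` are hypothetical and not part of any existing schema.

```python
from collections import defaultdict
from decimal import Decimal


class CostCeiling:
    """Tracks cumulative cost per task and reports the first ceiling crossing.

    Hypothetical helper: event field names are assumptions for illustration.
    """

    def __init__(self, ceiling_usd: Decimal):
        self.ceiling_usd = ceiling_usd
        self._totals = defaultdict(Decimal)  # task_id -> cumulative cost
        self._fired = set()                  # task_ids that already crossed

    def on_event(self, event: dict):
        """Return an action record on the first crossing, otherwise None."""
        if event.get("type") != "agent_cost_update":
            return None
        task_id = event["task_id"]
        # Decimal(str(...)) avoids binary float rounding on money values.
        self._totals[task_id] += Decimal(str(event["cost_delta_usd"]))
        crossed = self._totals[task_id] > self.ceiling_usd
        if crossed and task_id not in self._fired:
            self._fired.add(task_id)
            return {
                "action": "notify",
                "task_id": task_id,
                "total_usd": self._totals[task_id],
            }
        return None
```

Because this runs on the stream after the cost was incurred, it can only react (notify, escalate, or cancel); it cannot claim to have prevented the spend.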

Proposal

Core design:

Two planes, one catalog

  1. Normative event catalog — stable names/schemas for lifecycle, execution, milestone, checkpoint, and policy events (JSON Schema, additive versioning).
  2. Event rules: on event + when conditions → actions: require_approval, notify, escalate, cancel_task, inject_nudge, observe_only.
  3. Sync evaluation — in-agent at checkpoints (checkpoint:before_execution, before_open_pr, etc.); same latency class as Cedar; can transition to AWAITING_APPROVAL.
  4. Async evaluation: TaskEventsTable stream consumer for notify, aggregates (agent_cost_update cost ceiling), post-hoc cancel; tens–hundreds of ms; must not imply blocking unless UX is explicit.
  5. Precedence: tool Cedar hard-deny always wins; async never overrides sync deny; composable with existing TaskApprovalsTable / bgagent approve / deny.
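
The rule shape and precedence order above can be sketched as follows. This is an illustration under stated assumptions, not the proposed rule language (which is still an open question): `EventRule`, `matching_actions`, `resolve`, and the event key `type` are hypothetical names, and conditions are modelled as plain Python callables.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Mapping

Event = Mapping[str, Any]


@dataclass(frozen=True)
class EventRule:
    rule_id: str
    on: str                        # catalog event name
    when: Callable[[Event], bool]  # condition over the event payload
    action: str                    # e.g. "require_approval", "notify"
    mode: str                      # "sync" (can block) or "async" (react only)


def matching_actions(event: Event, rules: List[EventRule], mode: str) -> List[str]:
    """Actions of all rules in the given mode whose event and condition match."""
    return [
        r.action
        for r in rules
        if r.mode == mode and r.on == event["type"] and r.when(event)
    ]


def resolve(tool_cedar: str, sync: str, async_actions: List[str]) -> str:
    """Combine decisions in the precedence order stated in the RFC.

    tool_cedar: "allow" | "deny"
    sync:       "proceed" | "await_approval" | "deny"
    """
    if tool_cedar == "deny":   # tool Cedar hard-deny always wins
        return "deny"
    if sync == "deny":         # async never overrides a sync deny
        return "deny"
    # Assumption: an async cancel may end a task that is otherwise
    # proceeding or awaiting approval; the RFC does not specify this.
    if "cancel_task" in async_actions:
        return "cancel_task"
    return sync
```

A rule such as `EventRule("plan-gate", "checkpoint:before_execution", lambda e: True, "require_approval", "sync")` would then be picked up only by the in-agent evaluator, never by the stream consumer.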

Capability registry (configuration home)

| Asset type | Consumed by |
| --- | --- |
| cedar-policy-module | Agent PolicyEngine |
| event-rule-pack | Sync evaluator + async consumer |
| notification-profile | Fan-out / event consumer |
| checkpoint-catalog | Agent pipeline + Change Manifest L1 |

Blueprints pin semver assets instead of inlining fragile YAML. Interim: inline eventRules until registry MVP (Phase 3).
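
As a rough sketch of how pin resolution with the interim inline fallback might look: the blueprint keys `pins`, `asset`, and `version`, the registry layout, and the function `resolve_rule_packs` are all hypothetical; only `eventRules` comes from this RFC. The sketch handles exact version pins only, not semver ranges.

```python
def resolve_rule_packs(blueprint: dict, registry: dict) -> list:
    """Return the rule packs for a task.

    Pinned registry assets are used when the blueprint declares pins;
    otherwise the inline eventRules are wrapped as a single unversioned pack.
    """
    pins = blueprint.get("pins", [])
    if not pins:
        # Interim path: inline rules until the registry exists.
        return [{"id": "inline", "version": None,
                 "rules": blueprint.get("eventRules", [])}]

    packs = []
    for pin in pins:
        key = (pin["asset"], pin["version"])
        if key not in registry:
            # Fail loudly rather than silently running without governance.
            raise LookupError(f"unresolved pin: {pin['asset']}@{pin['version']}")
        packs.append(registry[key])
    return packs
```

Resolving once at task start means a running task keeps the rule versions it started with, even if a newer pack is published mid-run.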

UX (before → after)

Before: bgagent pending shows bash argv; submitters surprised by mid-run tool gates.

After:

  • Submit: governance preview from resolved registry pins (estimated interactive gates, rules that may fire).
  • Watch: human-readable moments (Plan verified — awaiting your approval).
  • Pending: unified queue — event-sourced and tool-sourced approvals differ in trigger context, not approve/deny mechanics.
  • Authoring: bgagent registry list, bgagent rules eval --fixture, observe→enforce rollout per pack.

Phased delivery

| Phase | Scope | Outcome |
| --- | --- | --- |
| 0 | Catalog + observe_only + PolicyDecisionEvent | "Would have fired" in watch stream |
| 1 | Async notify/fan-out + webhook | Ping on PR/cost without new HITL |
| 2 | Sync checkpoints + manifest | Plan review before code (primary UX win) |
| 3 | Registry-native event-rule-pack | Org-wide versioned policy rollout |
| 4 | Advanced aggregates + async cancel | Operator automation |

Phases 0–2 can ship with inline blueprint config; Phase 3 aligns with agent asset registry MVP.
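
The Phase 0 "would have fired" behaviour and the later observe→enforce rollout could be modelled roughly as below. The record fields and the function `apply_rule` are hypothetical; the actual PolicyDecisionEvent schema is a roadmap item and is not defined here.

```python
def apply_rule(action: str, pack_mode: str) -> dict:
    """Build a decision record for a rule that matched.

    In "observe" mode (or for observe_only actions) nothing is enforced;
    the record only states what would have happened.
    """
    if pack_mode not in ("observe", "enforce"):
        raise ValueError(f"unknown pack mode: {pack_mode}")
    enforced = pack_mode == "enforce" and action != "observe_only"
    return {
        "type": "policy_decision",
        "action": action,
        "enforced": enforced,
        "message": None if enforced else f"would have fired: {action}",
    }
```

Flipping a pack from observe to enforce then changes only the `enforced` flag, so operators can compare the two modes on the same audit trail before rollout.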

Data model extensions

  • TaskApprovalsTable: source (tool|event), event_id, checkpoint, rule_pack_id, rule_id.
  • TaskEventsTable: catalog event types; optional correlation_id for dedupe.
  • PolicyDecisionEvent (roadmap): unified audit for every evaluation.
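
A minimal sketch of the extended approval record and correlation_id dedupe, assuming approvals are already keyed by a `task_id`. The class `ApprovalRecord`, the helper `dedupe`, and the validation rules are illustrative assumptions rather than the existing table schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ApprovalRecord:
    task_id: str
    source: str                        # "tool" or "event"
    event_id: Optional[str] = None
    checkpoint: Optional[str] = None
    rule_pack_id: Optional[str] = None
    rule_id: Optional[str] = None

    def __post_init__(self):
        if self.source not in ("tool", "event"):
            raise ValueError(f"invalid source: {self.source}")
        # Assumption: event-sourced approvals must name their trigger.
        if self.source == "event" and not (self.event_id and self.rule_id):
            raise ValueError("event-sourced approval needs event_id and rule_id")


def dedupe(events: list) -> list:
    """Keep the first event per correlation_id; events without one pass through."""
    seen = set()
    out = []
    for event in events:
        cid = event.get("correlation_id")
        if cid is not None:
            if cid in seen:
                continue
            seen.add(cid)
        out.append(event)
    return out
```

Tool-sourced records leave the new fields unset, which keeps existing rows valid without a backfill.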

Cross-links: CEDAR_HITL_GATES.md, CHANGE_MANIFEST.md, ORCHESTRATOR.md, Roadmap — Agent asset registry & Centralized policy framework.

Out of scope

  • Replacing tool Cedar with event rules — fail-closed execution safety stays on PreToolUse.
  • Stream-only HITL — async consumer alone cannot block fast agents (race).
  • Inline blueprint YAML as the long-term config model — bootstrap only until registry ships.
  • EventBridge as the primary internal bus — complementary export; TaskEventsTable remains source of truth for task-scoped ordering (initially).
  • Full registry MVP in Phase 0–2 (designed for; not required to land catalog + observe + sync checkpoints).
  • Raw JSONPath/Cedar in default operator UX — verbose mode only.
  • Separate approve commands for event vs tool gates.

Potential challenges

| Risk | Mitigation |
| --- | --- |
| Sync vs async confusion | Explicit modes in rule schema; UX copy for reactive approvals |
| Third DDB stream consumer capacity | Plan Kinesis migration; do not multiply consumers blindly |
| Overlapping tool + event require_approval | Scope algebra TBD; idempotency key (task_id, rule_id, correlation_id) |
| Async require_approval after pr_created | UX must state PR already exists; cannot un-create |
| Evaluator down | Sync blocking rules fail-closed; async notify may degrade with telemetry |
| Registry not ready | Inline blueprint eventRules + migration to pins |
| Checkpoint trust model | Mandatory pipeline hooks vs agent-declared milestones (open question) |
| Rule language choice | Cedar-on-events vs CEL/JSONLogic (author persona TBD) |
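
Two of these mitigations, the idempotency key and fail-closed sync evaluation, can be sketched together. `ApprovalQueue`, `evaluate_fail_closed`, and the `"blocked"` result string are hypothetical names; the key tuple is the one named in the table, and the scope algebra for overlapping gates remains open.

```python
class ApprovalQueue:
    """In-memory stand-in for pending approvals, keyed for idempotency."""

    def __init__(self):
        self._pending = {}

    def request(self, task_id: str, rule_id: str, correlation_id: str,
                source: str) -> bool:
        """Create a pending approval; return False if one already exists."""
        key = (task_id, rule_id, correlation_id)
        if key in self._pending:
            return False  # duplicate request: do not prompt the approver twice
        self._pending[key] = source
        return True


def evaluate_fail_closed(evaluate, event: dict) -> str:
    """Run a sync evaluator; any failure blocks instead of letting work proceed."""
    try:
        return evaluate(event)
    except Exception:
        # Fail-closed: an unavailable evaluator must not look like "allow".
        return "blocked"
```

The async notify path would take the opposite stance: log the failure, emit telemetry, and carry on, since nothing irreversible depends on it.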

Open questions (see RFC §12): rule language, scope algebra, checkpoint emission trust, third consumer design, Change Manifest state vs manifest_verified, multi-tenant registry merge order, observe→enforce granularity.

Dependencies and integrations

| Component | Role |
| --- | --- |
| Agent (pipeline.py, runner.py, progress_writer.py, hooks) | Emit catalog events; sync checkpoint evaluation |
| CDK (fanout-task-events.ts, orchestrator, Blueprint construct) | Stream consumer; resolve registry pins at task start; extend fan-out |
| CLI (bgagent submit, watch, pending, future registry/rules eval) | Governance preview, unified pending UX |
| TaskEventsTable / TaskApprovalsTable | Event log + approval extensions |
| Agent asset registry (roadmap) | event-rule-pack, notification-profile assets |
| Centralized policy framework (roadmap) | PolicyDecisionEvent, observe/enforce modes |

Related roadmap items: Agent asset registry, Centralized policy framework.

Alternative solutions

| Alternative | Verdict |
| --- | --- |
| Extend Cedar only (synthetic action::Event types) | Defer until event context schema stable; high author burden |
| Stream-only HITL in Lambda | Rejected for blocking: race with fast agents |
| Inline blueprint YAML forever | Bootstrap only; no versioning at scale |
| Replace tool Cedar with events | Rejected |
| EventBridge as primary bus | Complementary export; internal SoT stays TaskEventsTable |

Note: Non-triaged RFCs may not get timely review. PRs on non-triaged issues might not be accepted.

  • RFC PR:
  • Approved by:
  • Reviewed by:

Metadata

Labels

  • P0: highest priority
  • RFC-proposal: Request for Comments: design proposal
  • approved: When an issue has been approved and ready
  • governance: ADRs, contribution rules, enforcement hooks, inter-agent protocol
  • security: Cedar/HITL, IAM least-privilege, secrets, PII/DLP, guardrails, supply-chain/CVE
