Architecture overview

Bridle is a runtime control plane for production AI agents. The architecture has three planes (control / data / telemetry) and one identity model that links them.

Three planes

Control plane (bridle/cp_server/)

The central FastAPI service. Owns policy authoring artifacts (bundles in Postgres), gateway registration, audit ingestion, mode-flip operations, and the two reports (shadow + trace).

POST  /v1/bundles                       publish + sign
GET   /v1/bundles/{id}                  fetch by id
GET   /v1/bundles/active                latest for gateway
POST  /v1/gateways/register             register + model_list
GET   /v1/gateways/{id}/status          active bundle + last-seen
POST  /v1/gateways/{id}/heartbeat       gateway pings on activate
POST  /v1/audit                         batch audit ingest
POST  /v1/policies/{id}/mode            shadow ↔ enforce flip
GET   /v1/reports/shadow                aggregated would-have-action
GET   /v1/reports/trace/{trace_id}      ordered obs/decision/outcome
GET   /v1/public_key                    bootstrap the signature verifier

The CP signs bundles with an ed25519 key. Gateways are bootstrapped with the matching public key and verify every bundle they activate.

Data plane (in the gateway)

Two enforcement surfaces, one GatewayInterceptor instance:

  • LLM gateway: LiteLLM Proxy + BridleLogger (a CustomLogger in callback position 0). The logger's async_pre_call_hook calls the interceptor, which evaluates policy and produces one of four outcomes: allow, a modified request dict, a block string, or a raised exception.
  • Tool surface: @bridle.tool("issue_refund") decorator wraps any async tool function. Identity (session_id, agent_id, trace_id, ...) flows via contextvars set by session_context(...). The decorator calls the same interceptor before invoking the tool.

Both surfaces share the same _pending, state_service, audit_ledger, policy_engine, and classifier objects. That sharing is Bridle's unifying invariant: "one session, two enforcement surfaces, one policy engine, one audit ledger."

The data plane connects to Postgres for audit + session state, and to the CP via HTTPBundleLoader for signed bundle distribution.

Telemetry plane (the audit ledger)

Every decision lands as one append-only row in audit_rows: tenant_id, agent_id, actor_id, session_id, trace_id, request_id, observation_type, matched_policy_ids, mode_at_evaluation, final_action, would_have_action, final_outcome, cost_at_decision_usd, record_hash, previous_record_hash.

Rows are chained by record_hash for tamper evidence. Queries:

  • Shadow report (/v1/reports/shadow): group by policy across a tenant + window, sum cost_at_decision_usd as a v0 proxy for "prevented spend."
  • Trace report (/v1/reports/trace/{trace_id}): ordered events for one agent turn — the incident-review primitive.
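
The record_hash chaining mentioned above can be illustrated with a short sketch. The hashing scheme here (SHA-256 over the previous hash plus a canonical JSON encoding of the row, with an all-zero genesis value) is an assumption made for the example; the source does not specify how Bridle computes record_hash.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed previous_record_hash for the first row


def compute_hash(fields, previous_record_hash):
    """Hash the row's fields together with the previous row's hash."""
    payload = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((previous_record_hash + payload).encode("utf-8")).hexdigest()


def append_row(ledger, fields):
    """Append a row whose hash covers its fields and the previous record_hash."""
    prev = ledger[-1]["record_hash"] if ledger else GENESIS
    row = dict(fields, previous_record_hash=prev, record_hash=compute_hash(fields, prev))
    ledger.append(row)
    return row


def verify_chain(ledger):
    """Return the index of the first inconsistent row, or None if the chain is intact."""
    prev = GENESIS
    for i, row in enumerate(ledger):
        fields = {
            k: v for k, v in row.items()
            if k not in ("record_hash", "previous_record_hash")
        }
        if row["previous_record_hash"] != prev:
            return i
        if row["record_hash"] != compute_hash(fields, prev):
            return i
        prev = row["record_hash"]
    return None


ledger = []
append_row(ledger, {"request_id": "r-1", "final_action": "allow"})
append_row(ledger, {"request_id": "r-2", "final_action": "block"})
append_row(ledger, {"request_id": "r-3", "final_action": "allow"})
intact = verify_chain(ledger)

# Editing a stored row after the fact breaks verification at that row.
ledger[1]["final_action"] = "allow"
first_bad = verify_chain(ledger)
```

Since each hash covers the previous one, rewriting a row undetected would require recomputing every later record_hash as well.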

Identity envelope

Every observation, decision, outcome, and audit row carries the same envelope. The fields that link surfaces together are:

| Field | Purpose |
| --- | --- |
| tenant_id | Customer-level isolation |
| session_id | Per-product session; joins LLM + tool events for the same agent run |
| trace_id | Per-call trace; can be set by the agent to link a turn across surfaces |
| agent_id | Which agent made the call |
| actor_id | The end-user/service the agent is acting on behalf of |
| request_id | Joins one observation + decision + outcome triple |

session_id is the v0.6 grouping primitive. trace_id was added in the v0.5.1 hardening pass. To link a whole turn in one trace report, set X-Trace-Id on the LLM call and pass the same value to session_context(trace_id=...) for the tool call.
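
As a rough illustration of how the envelope links surfaces, the sketch below models the six fields as a dataclass and selects the events for one trace_id. The Event type, its kind/surface/seq fields, and the trace_report function are hypothetical; in particular, seq is an assumed ordering key, since the source does not say how events are ordered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    tenant_id: str
    session_id: str
    trace_id: str
    agent_id: str
    actor_id: str
    request_id: str


@dataclass(frozen=True)
class Event:
    envelope: Envelope
    kind: str      # "observation", "decision", or "outcome"
    surface: str   # "llm" or "tool"
    seq: int       # hypothetical ordering key


def trace_report(events, trace_id):
    """Return the events for one trace, ordered by seq."""
    selected = [e for e in events if e.envelope.trace_id == trace_id]
    return sorted(selected, key=lambda e: e.seq)


def env(trace_id, request_id):
    return Envelope("tenant-1", "s-1", trace_id, "agent-1", "user-9", request_id)


events = [
    Event(env("t-1", "r-2"), "decision", "tool", 4),
    Event(env("t-1", "r-1"), "observation", "llm", 1),
    Event(env("t-2", "r-3"), "observation", "llm", 3),
    Event(env("t-1", "r-1"), "decision", "llm", 2),
]

report = [(e.seq, e.surface, e.kind) for e in trace_report(events, "t-1")]
```

The LLM and tool events carry different request_id values but the same trace_id, which is what lets a single report cover the whole turn.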

YAML policy authoring (v0.5)

operator                                  control plane
─────────                                 ─────────────
  edit policy.yaml
       │
  bridle policy publish *.yaml
       │
       │   POST /v1/bundles
       ▼
       (CP validates + signs + persists)
       │
       │
       ▼
  gateway HTTPBundleLoader polls
       │
       │   GET /v1/bundles/active
       ▼
       (verify signature with cached public key)
       (run bundle_validator)
       (check expires_at — refuse if past)
       (engine.set_active_bundle(bundle))
       │
       ▼
       runtime: next request evaluates against the new bundle

YAML compiles to the existing signed PolicyBundle, so the runtime is unchanged; v0.5 adds only an authoring layer. The six supported type: values map to canonical kebab-case policy IDs the engine already recognizes.
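
The gateway-side checks in the flow above (verify signature, validate, check expires_at, then activate) can be sketched as follows. DemoLoader and try_activate are hypothetical names, the bundle is modeled as a plain dict, and the signature check is a pluggable callable standing in for real ed25519 verification.

```python
from datetime import datetime, timedelta, timezone


class DemoLoader:
    """Keeps the last good bundle; a refused bundle never replaces it."""

    def __init__(self, verify_signature, validate):
        self._verify_signature = verify_signature  # stand-in for ed25519 verification
        self._validate = validate                  # stand-in for bundle_validator
        self.active = None

    def try_activate(self, bundle, now=None):
        """Return True if the bundle was activated, False if it was refused."""
        now = now or datetime.now(timezone.utc)
        if not self._verify_signature(bundle):
            return False
        if not self._validate(bundle):
            return False
        if bundle["expires_at"] <= now:
            return False
        self.active = bundle
        return True


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
loader = DemoLoader(
    verify_signature=lambda b: b["signature"] == "good",  # demo-only check
    validate=lambda b: bool(b["policies"]),
)

good = {"id": "b1", "signature": "good", "policies": ["p"],
        "expires_at": now + timedelta(hours=1)}
tampered = {"id": "b2", "signature": "bad", "policies": ["p"],
            "expires_at": now + timedelta(hours=1)}
expired = {"id": "b3", "signature": "good", "policies": ["p"],
           "expires_at": now - timedelta(hours=1)}

results = [loader.try_activate(b, now=now) for b in (good, tampered, expired)]
active_id = loader.active["id"]
```

The active bundle is assigned only after every check passes, so a bad or expired bundle leaves the cached one in place.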

Failure modes (v0.5.1 hardening)

| Failure | Behavior | Test |
| --- | --- | --- |
| Bundle signature invalid | Loader refuses; cached stays | test_http_bundle_loader.test_loader_rejects_bundle_with_bad_signature |
| Bundle expired | Loader refuses; cached stays | test_failure_modes.test_loader_rejects_expired_bundle_and_keeps_cached |
| CP unreachable | Loader returns None; cached stays | test_failure_modes.test_cp_* |
| Audit shipper unavailable | Re-buffers in memory | test_audit_shipper.* |
| Policy engine raises | Synthetic policy-engine-error decision via worst-severity fail_modes.on_engine_error; raw exception never propagates | test_failure_modes.test_engine_error_* |
| Postgres restart | All five durable tables survive | test_postgres_durability.* |
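
The "re-buffers in memory" behavior for an unavailable audit shipper might look roughly like the sketch below. DemoAuditShipper, its batch size, and the use of ConnectionError as the failure signal are assumptions for illustration; the real shipper's interface is not described in this document.

```python
from collections import deque


class DemoAuditShipper:
    """Buffers audit rows and puts a batch back if shipping fails."""

    def __init__(self, send, batch_size=2):
        self._send = send              # callable that ships a list of rows
        self._batch_size = batch_size
        self._buffer = deque()

    def enqueue(self, row):
        self._buffer.append(row)

    def flush(self):
        """Ship one batch; return how many rows were shipped."""
        count = min(self._batch_size, len(self._buffer))
        batch = [self._buffer.popleft() for _ in range(count)]
        if not batch:
            return 0
        try:
            self._send(batch)
        except ConnectionError:
            # Restore the rows at the front, preserving their original order.
            self._buffer.extendleft(reversed(batch))
            return 0
        return len(batch)

    @property
    def pending(self):
        return list(self._buffer)


shipped = []
calls = {"n": 0}


def flaky_send(batch):
    """Fails on the first call, succeeds afterwards."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("control plane unreachable")
    shipped.extend(batch)


shipper = DemoAuditShipper(flaky_send)
for row_id in ("r-1", "r-2", "r-3"):
    shipper.enqueue(row_id)

first = shipper.flush()             # fails, rows are re-buffered
pending_after_failure = shipper.pending
second = shipper.flush()            # succeeds, ships the first batch
```

Restoring the batch at the front of the buffer keeps rows in their original order, which matters if the ledger's hash chain depends on ingest order.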

Durability

All five operational stores are Postgres-backed via asyncpg:

| Table | Holds |
| --- | --- |
| audit_rows | Every decision |
| sessions | Per-session cost + counters |
| tool_intents | Loop-detector window |
| policy_bundles | Signed bundles + signature blob |
| gateway_registry | Gateway model_list + last-seen + active bundle |

Migrations are flat SQL under bridle/migrations/, mounted into the Postgres container as init scripts via docker-compose.yml.

Spike regression suite

The original LiteLLM Path-A spike that picked the architecture lives at tests/spikes/litellm_enforcement/. It pins litellm==1.86.0 and re-runs 16 tests against a live LiteLLM Proxy + mock OpenAI upstream to verify the async_pre_call_hook contract the rest of the product depends on. Run it before any LiteLLM bump:

bash tests/spikes/litellm_enforcement/run_spike.sh

What's intentionally NOT in v0.6

See ADR-006 §"What v0.5.1 deliberately does NOT do" and ADR-005 §"What v0.5 deliberately does NOT do":

  • No web UI
  • No RBAC
  • No billing
  • No arbitrary policy logic / full DSL
  • No additional providers beyond LiteLLM
  • No per-rule targeting in the runtime (bundle-level only)
  • No auto-rollback / canary on bundle activation (mode-flip endpoint is the rollout mechanism)

These wait for design-partner pain.