Four paths from "Bridle is suspect" back to "Bridle is out of the way." The paths are independent: you do not have to try an earlier path before using a later one.
Path 1: demote a single policy to shadow. For when one specific policy is misbehaving but the rest are fine.
bridle policy mode --policy <pid> --to shadow --tenant <tenant>
The CP creates a new signed bundle with that policy demoted. The gateway picks it up on the next poll (default: 5 seconds).
Effect: that one policy stops blocking; everything else keeps running as configured.
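As a rough illustration of what "a new signed bundle with that policy demoted" could mean, here is a minimal sketch. The bundle layout (`body`, `policies`, `version`, `signature`), the HMAC-SHA256 signing scheme, and the function names are all assumptions made for the example, not Bridle's actual format.

```python
import copy
import hashlib
import hmac
import json


def sign(body: dict, key: bytes) -> str:
    """Sign a canonical JSON encoding of the bundle body (assumed scheme)."""
    payload = json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def demote_policy(bundle: dict, policy_id: str, key: bytes) -> dict:
    """Return a new bundle with one policy set to shadow and the version bumped."""
    body = copy.deepcopy(bundle["body"])  # never mutate the active bundle
    if policy_id not in body["policies"]:
        raise KeyError(f"unknown policy: {policy_id}")
    body["policies"][policy_id]["mode"] = "shadow"
    body["version"] += 1  # the gateway notices the new version on its next poll
    return {"body": body, "signature": sign(body, key)}


# Hypothetical bundle with two enforce-mode policies.
key = b"example-signing-key"
old_body = {
    "version": 7,
    "policies": {"p-pii": {"mode": "enforce"}, "p-tools": {"mode": "enforce"}},
}
old = {"body": old_body, "signature": sign(old_body, key)}
new = demote_policy(old, "p-pii", key)
```

Only the named policy changes mode; every other policy in the new bundle is identical to the old one, which is why the rest of the tenant's enforcement keeps running as configured.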
Verify:
bridle gateway status --gateway-id <gw>
# active_bundle_version should have advanced
bridle report shadow --tenant <tenant> --since-minutes 5
# no new action=block rows for the demoted policy
Path 2: force-shadow everything. For when "I'm not sure which policy is the problem, just stop enforcing." Demotes EVERY enforce-mode decision at the gateway, regardless of what the bundle says.
export BRIDLE_FORCE_SHADOW=true
# restart the gateway process so the env var takes effect
Effect: the gateway reads the env var on construction. Every suppression-class action becomes shadow with would_have_action set. Audit rows record both the configured mode (enforce) and the effective mode (shadow) with force_shadow_override=true.
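The behavior described above can be sketched roughly as follows. `GatewaySketch`, `AuditRow`, and the `SUPPRESSION_ACTIONS` set are hypothetical names for illustration. Two further assumptions: the variable is compared against the string "true", and the override flag is written on every row while the variable is set, which is consistent with the row-count check in the verify step.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumption: the exact set of suppression-class actions is not listed here.
SUPPRESSION_ACTIONS = frozenset({"block"})


@dataclass(frozen=True)
class AuditRow:
    configured_mode: str
    effective_mode: str
    action: str
    would_have_action: Optional[str]
    force_shadow_override: bool


class GatewaySketch:
    def __init__(self, environ: Optional[Mapping[str, str]] = None):
        env = os.environ if environ is None else environ
        # Read once, on construction. Changing the variable afterwards does
        # not affect this instance, which is why a restart is required.
        self.force_shadow = env.get("BRIDLE_FORCE_SHADOW", "").lower() == "true"

    def record(self, action: str, configured_mode: str) -> AuditRow:
        demote = self.force_shadow and configured_mode == "enforce"
        effective_mode = "shadow" if demote else configured_mode
        if effective_mode == "shadow" and action in SUPPRESSION_ACTIONS:
            # The suppression is not applied; the row keeps what it would have been.
            return AuditRow(configured_mode, effective_mode, "shadow",
                            action, self.force_shadow)
        return AuditRow(configured_mode, effective_mode, action,
                        None, self.force_shadow)


forced = GatewaySketch({"BRIDLE_FORCE_SHADOW": "true"}).record("block", "enforce")
normal = GatewaySketch({}).record("block", "enforce")
```

With the override set, `forced` carries both modes plus `would_have_action`, so the audit trail still shows what enforcement would have done.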
Verify:
bridle report pilot --tenant <tenant> --since-minutes 30
# counts.force_shadow_rows should equal counts.llm_calls + counts.tool_calls
Path 3: bypass. The break-glass. The BridleLogger short-circuits: every hook returns without invoking the interceptor. No observation, no decision, no audit row.
export BRIDLE_BYPASS=true
# the env var is checked on every hook invocation — no restart needed
# (if the gateway process pre-fetches env, send it SIGHUP or restart)
Effect: equivalent to "Bridle is not in the path." Requests flow through LiteLLM to the upstream provider unmodified. Audit gap during the bypass window.
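A minimal sketch of the per-invocation check, in contrast to the force-shadow flag that is read once on construction. `BridleLoggerSketch` and its `on_llm_call` hook are hypothetical stand-ins, not the real BridleLogger interface, and the comparison against "true" is an assumption.

```python
import os
from typing import Any, Callable


class BridleLoggerSketch:
    """Hypothetical stand-in for the BridleLogger; not the real class."""

    def __init__(self, interceptor: Callable[[Any], Any]):
        self._interceptor = interceptor

    @staticmethod
    def _bypassed() -> bool:
        # Looked up on every call, so the hook always sees the current
        # value in this process's environment.
        return os.environ.get("BRIDLE_BYPASS", "").lower() == "true"

    def on_llm_call(self, request: Any) -> Any:
        if self._bypassed():
            # Short-circuit: no observation, no decision, no audit row.
            return request
        return self._interceptor(request)


seen = []


def interceptor(request):
    seen.append(request)  # stands in for observe + decide + audit
    return {"inspected": request}


logger = BridleLoggerSketch(interceptor)

os.environ["BRIDLE_BYPASS"] = "true"
bypassed_result = logger.on_llm_call("req-1")  # interceptor is skipped

del os.environ["BRIDLE_BYPASS"]
normal_result = logger.on_llm_call("req-2")  # interceptor runs again
```

One caveat that holds for any process: `os.environ` reflects the process's own environment, so a variable exported in a different shell does not reach a gateway that is already running. That is presumably why the SIGHUP-or-restart fallback is mentioned.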
Verify:
bridle report pilot --tenant <tenant> --since-minutes 5
# counts.llm_calls should stop growing
# but the gateway still serves traffic; check your agent's success rate
Cost: you lose the audit trail for the bypass window. The pilot report will undercount events that happened during it. Note the start/end timestamps in your runbook.
Path 4: full uninstall. The "we're done with the pilot" path.
- Edit your LiteLLM proxy config:
    litellm_settings:
      callbacks:
        # - "bridle_callback.bridle_logger"   # ← remove this line
- Restart the proxy.
- Bridle is gone. The audit rows already in Postgres are still queryable. The gateway no longer writes new ones.
If you want to keep the data but stop the writes, you can also just drop the BridleLogger module path from PYTHONPATH so LiteLLM fails to load it and runs without callbacks. Operationally equivalent.
Before letting real traffic flow, do a dry run of paths 2 and 3 on a
test request. The v0.7 acceptance test
test_bridle_bypass_short_circuits_the_plugin is the unit-level
proof. For an end-to-end rehearsal:
# 1. Healthy state
bridle health --gateway-id <gw>
# ↳ active_bundle present, last_audit_at recent
# 2. Set BRIDLE_BYPASS=true, restart gateway, send a test request
curl ... /v1/chat/completions ...
# ↳ request succeeds, NO new audit row
# 3. Unset BRIDLE_BYPASS, restart gateway, repeat
# ↳ request succeeds, NEW audit row appears
bridle health --gateway-id <gw>
# ↳ last_audit_at advanced
Do this once before kickoff. Document the timestamps. The pilot owner should know that rollback is a 30-second operation, not a meeting.
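The rehearsal can also be scripted so that the pass/fail result and the timestamps are captured in one place. The sketch below is deliberately deployment-agnostic: `send_request`, `count_audit_rows`, and `set_bypass` are hypothetical callables you would supply (for example, wrapping your own test request, an audit-row count query, and whatever sets the variable and restarts the gateway). None of them is part of Bridle.

```python
from datetime import datetime, timezone
from typing import Callable


def rehearse_bypass(send_request: Callable[[], bool],
                    count_audit_rows: Callable[[], int],
                    set_bypass: Callable[[bool], None]) -> dict:
    """Run the bypass rehearsal and report what was observed at each step."""
    started_at = datetime.now(timezone.utc).isoformat()
    checks = {}
    try:
        # Bypass on: the request must succeed and write NO audit row.
        set_bypass(True)
        before = count_audit_rows()
        checks["bypass_request_ok"] = send_request()
        checks["bypass_wrote_no_row"] = count_audit_rows() == before

        # Bypass off: the request must succeed and write a NEW audit row.
        set_bypass(False)
        before = count_audit_rows()
        checks["normal_request_ok"] = send_request()
        checks["normal_wrote_row"] = count_audit_rows() > before
    finally:
        set_bypass(False)  # never leave the rehearsal with bypass enabled
    return {
        "checks": checks,
        "passed": len(checks) == 4 and all(checks.values()),
        "started_at": started_at,
        "finished_at": datetime.now(timezone.utc).isoformat(),
    }


class FakeGateway:
    """In-memory stand-in used only to exercise the sketch."""

    def __init__(self):
        self.bypass = False
        self.rows = 0

    def send(self) -> bool:
        if not self.bypass:
            self.rows += 1  # one audit row per observed request
        return True

    def set_bypass(self, value: bool) -> None:
        self.bypass = value


gw = FakeGateway()
report = rehearse_bypass(gw.send, lambda: gw.rows, gw.set_bypass)
```

The returned `started_at` and `finished_at` values are the timestamps to record in the runbook.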
If your pilot owner pings you at 2am:
- First: is the agent still serving traffic? If yes, breathe. If no, jump to path 3 (BRIDLE_BYPASS=true) immediately.
- Second: pull the trace report for the last 30 minutes:
    bridle report pilot --tenant <tenant> --since-minutes 30
  Look at top_traces. If there's a clear culprit policy, use path 1 (flip that single policy to shadow).
- Third: if the report itself looks broken, escalate to path 2 (full force-shadow). Audit keeps flowing; you can investigate later.
- Fourth: only if Bridle itself is unstable, use path 4 (full uninstall) and re-deploy from scratch after debugging.
Never do path 4 first. You'll lose the audit trail of what was happening when the incident started.
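The triage order above can be written down as a small decision function, which is handy for a runbook checklist or a chat-ops bot. This is only a sketch: the function name, its parameters, and the `None` result for "nothing matched, keep investigating" are assumptions layered on top of the four steps.

```python
from typing import Optional


def choose_path(agent_serving: bool,
                culprit_policy: Optional[str],
                report_looks_broken: bool,
                bridle_unstable: bool) -> Optional[int]:
    """Return the rollback path (1-4) suggested by the triage order, or None."""
    # First: traffic is down, so take Bridle out of the path right away.
    if not agent_serving:
        return 3
    # Second: a trustworthy report points at one policy.
    if culprit_policy is not None and not report_looks_broken:
        return 1
    # Third: the report itself cannot be trusted.
    if report_looks_broken:
        return 2
    # Fourth: last resort, only when Bridle itself is unstable.
    if bridle_unstable:
        return 4
    # Assumption: nothing matched, so no rollback is indicated yet.
    return None


outage = choose_path(False, None, False, False)
single_policy = choose_path(True, "p-pii", False, False)
broken_report = choose_path(True, None, True, False)
```

Path 4 is only reachable after the earlier checks have been ruled out, which mirrors the rule that it is never the first move.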