Bridle rollback procedures

Four paths from "Bridle is suspect" back to "Bridle is out of the way." The paths are independent: you can take any one of them without having done the ones before it.

1. Flip one policy back to shadow (per-policy, no restart)

For when one specific policy is misbehaving but the rest are fine.

bridle policy mode --policy <pid> --to shadow --tenant <tenant>

The control plane (CP) creates a new signed bundle with that policy demoted. The gateway picks it up on the next poll (default: 5 seconds).

Effect: that one policy stops blocking; everything else keeps running as configured.

Verify:

bridle gateway status --gateway-id <gw>
# active_bundle_version should have advanced
bridle report shadow --tenant <tenant> --since-minutes 5
# no new action=block rows for the demoted policy
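Since the gateway only sees the demotion on its next poll, the first verify step is really "wait until the bundle version changes." A minimal sketch of that wait, assuming you supply your own `read_version` callable (hypothetical; for example a wrapper that extracts `active_bundle_version` from the status command) and assuming the version can be compared as a string:

```python
import time
from typing import Callable


def wait_for_bundle_change(
    read_version: Callable[[], str],
    previous: str,
    timeout_s: float = 15.0,   # a few poll intervals at the 5-second default
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once read_version() differs from `previous`, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        if read_version() != previous:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

If this returns False, the gateway is probably still on the old bundle, so a quiet shadow report does not yet prove the demotion took effect.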

2. Force-shadow everything (requires a gateway restart)

For when "I'm not sure which policy is the problem, just stop enforcing." Demotes EVERY enforce-mode decision at the gateway, regardless of what the bundle says.

export BRIDLE_FORCE_SHADOW=true
# restart the gateway process so the env var takes effect

Effect: the gateway reads the env var on construction. Every suppression-class action becomes shadow with would_have_action set. Audit rows record both the configured mode (enforce) and the effective mode (shadow) with force_shadow_override=true.
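To make the "read on construction" behavior concrete, here is a rough sketch of how such an override could work. The class name, the method, and the rule that only the literal value `true` enables the override are assumptions for illustration, not Bridle's actual code. The field names come from the description above.

```python
import os
from typing import Mapping


class ForceShadowGateway:
    """Hypothetical stand-in for the gateway's handling of enforce-mode decisions."""

    def __init__(self, env: Mapping[str, str] = os.environ) -> None:
        # Read once at construction: a later change to the variable is not
        # seen until a new instance is built, i.e. until the process restarts.
        self.force_shadow = env.get("BRIDLE_FORCE_SHADOW", "").lower() == "true"

    def audit_row(self, action: str) -> dict:
        """Audit fields for one decision whose bundle mode is 'enforce'."""
        return {
            "configured_mode": "enforce",
            "effective_mode": "shadow" if self.force_shadow else "enforce",
            # What enforcement would have done, kept for later investigation.
            "would_have_action": action if self.force_shadow else None,
            "force_shadow_override": self.force_shadow,
        }
```

Because the flag is captured in `__init__`, exporting the variable without restarting leaves a running instance enforcing as before.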

Verify:

bridle report pilot --tenant <tenant> --since-minutes 30
# counts.force_shadow_rows should equal counts.llm_calls + counts.tool_calls
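The count check above is simple arithmetic, sketched here as a helper. It assumes you have already parsed the report's `counts` into a mapping with the three keys named above; how the report is serialized is not specified here, so the parsing is left out.

```python
from typing import Mapping


def force_shadow_shortfall(counts: Mapping[str, int]) -> int:
    """Calls in the window not recorded as force-shadowed (0 means all were)."""
    expected = counts["llm_calls"] + counts["tool_calls"]
    return expected - counts["force_shadow_rows"]
```

A nonzero result is expected when the report window reaches back before the restart, because those earlier calls were made without the override. Narrow `--since-minutes` to the period after the restart before treating a shortfall as a failure.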

3. Bypass Bridle entirely (no restart, no audit)

The break-glass. The BridleLogger short-circuits — every hook returns without invoking the interceptor. No observation, no decision, no audit row.

export BRIDLE_BYPASS=true
# the env var is checked on every hook invocation — no restart needed
# (if the gateway process pre-fetches env, send it SIGHUP or restart)

Effect: equivalent to "Bridle is not in the path." Requests flow through LiteLLM to the upstream provider unmodified. Audit gap during the bypass window.
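The difference from path 2 is when the variable is read. A minimal sketch of a per-invocation check, with a hypothetical class and hook name (the real BridleLogger hooks are not shown in this document) and the assumption that only the value `true` enables the bypass:

```python
import os
from typing import Any, Callable, Optional


class BypassableLogger:
    """Hypothetical logger whose hooks short-circuit when bypass is set."""

    def __init__(self, interceptor: Callable[[dict], Any]) -> None:
        self._interceptor = interceptor

    @staticmethod
    def _bypassed() -> bool:
        # Looked up on every call rather than cached at construction.
        return os.environ.get("BRIDLE_BYPASS", "").lower() == "true"

    def on_request(self, request: dict) -> Optional[Any]:
        if self._bypassed():
            return None  # no observation, no decision, no audit row
        return self._interceptor(request)
```

Note that `os.environ` reflects the gateway process's own environment. The per-call check only helps if the variable actually reaches that process, for example through your service manager. An `export` in an unrelated shell does not change a process that is already running.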

Verify:

bridle report pilot --tenant <tenant> --since-minutes 5
# counts.llm_calls should stop growing
# but the gateway still serves traffic — check your agent's success rate

Cost: you lose the audit trail for the bypass window. The pilot report will undercount events that happened during it. Note the start/end timestamps in your runbook.
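One way to record the window is to format it the same way every time, so it can be matched against the gap in the report later. A small sketch; the helper name and the wording of the note are illustrative only:

```python
from datetime import datetime, timezone


def bypass_window_note(start: datetime, end: datetime) -> str:
    """Format a runbook line for a bypass window, given timezone-aware timestamps."""
    if end < start:
        raise ValueError("end must not be before start")
    seconds = int((end - start).total_seconds())
    return (
        f"BRIDLE_BYPASS window: {start.isoformat()} to {end.isoformat()} "
        f"({seconds}s, no audit rows expected)"
    )


# Example: capture the start when you set the variable.
started_at = datetime.now(timezone.utc)
```

Using UTC for both ends avoids ambiguity when the note is compared with audit row timestamps.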

4. Remove Bridle from the LiteLLM config (full uninstall)

The "we're done with the pilot" path.

  1. Edit your LiteLLM proxy config:
    litellm_settings:
      callbacks:
        # - "bridle_callback.bridle_logger"   # ← remove this line
  2. Restart the proxy.
  3. Bridle is gone. The audit rows already in Postgres are still queryable. The gateway no longer writes new ones.

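An alternative that leaves the config untouched is to drop the BridleLogger module path from PYTHONPATH so LiteLLM fails to load the callback. This is operationally equivalent only if your LiteLLM version starts cleanly when a configured callback cannot be imported, so confirm that in a test environment before relying on it.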

How to prove rollback before the pilot starts

Before letting real traffic flow, do a dry run of paths 2 and 3 on a test request. The v0.7 acceptance test test_bridle_bypass_short_circuits_the_plugin is the unit-level proof. For an end-to-end rehearsal:

# 1. Healthy state
bridle health --gateway-id <gw>
# ↳ active_bundle present, last_audit_at recent

# 2. Set BRIDLE_BYPASS=true (restart or SIGHUP the gateway only if it pre-fetches env), send a test request
curl ... /v1/chat/completions ...
# ↳ request succeeds, NO new audit row

# 3. Unset BRIDLE_BYPASS (same restart rule as step 2), repeat
# ↳ request succeeds, NEW audit row appears
bridle health --gateway-id <gw>
# ↳ last_audit_at advanced
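The pass criteria for the rehearsal can be written down as a checklist function, which makes the documented result unambiguous. This sketch assumes you count audit rows yourself before and after each test request (for example from the pilot report); the function and parameter names are hypothetical.

```python
from typing import List


def rehearsal_problems(
    bypass_request_ok: bool,
    rows_added_during_bypass: int,
    normal_request_ok: bool,
    rows_added_after_bypass: int,
) -> List[str]:
    """Return the failed expectations from the rehearsal; empty means it passed."""
    problems = []
    if not bypass_request_ok:
        problems.append("request failed while BRIDLE_BYPASS was set")
    if rows_added_during_bypass != 0:
        problems.append("audit rows were written during bypass")
    if not normal_request_ok:
        problems.append("request failed after bypass was unset")
    if rows_added_after_bypass < 1:
        problems.append("no audit row appeared after bypass was unset")
    return problems
```

An empty list alongside the recorded timestamps is a compact record that the bypass both engages and disengages.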

Do this once before kickoff. Document the timestamps. The pilot owner should know that rollback is a 30-second operation, not a meeting.

Order of operations during an incident

If your pilot owner pings you at 2am:

  1. First — is the agent still serving traffic? If yes, breathe. If no, jump to path 3 (BRIDLE_BYPASS=true) immediately.
  2. Second — pull the trace report for the last 30 minutes:
    bridle report pilot --tenant <tenant> --since-minutes 30
    Look at top_traces. If there's a clear culprit policy, use path 1 (flip that single policy to shadow).
  3. Third — if the report itself looks broken, escalate to path 2 (full force-shadow). Audit keeps flowing; you can investigate later.
  4. Fourth — only if Bridle itself is unstable, use path 4 (full uninstall) and re-deploy from scratch after debugging.

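Never do path 4 first. The rows already in Postgres stay queryable, but the gateway stops writing new ones, so you lose the audit trail for the rest of the incident while you are still trying to understand it.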
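The ordering above can be read as a small decision function. This is one interpretation of the list, not an official algorithm: the parameter names are hypothetical, and the rule that path 4 is only chosen after force-shadow has been tried is an assumption drawn from "never do path 4 first."

```python
from typing import Optional


def choose_path(
    serving_traffic: bool,
    culprit_policy: Optional[str] = None,
    force_shadow_tried: bool = False,
    bridle_unstable: bool = False,
) -> int:
    """Return the rollback path number (1-4) to try next."""
    if not serving_traffic:
        return 3  # break-glass bypass: restore traffic first
    if culprit_policy:
        return 1  # demote the one misbehaving policy to shadow
    if bridle_unstable and force_shadow_tried:
        return 4  # full uninstall, only as the last resort
    return 2  # force-shadow everything; audit keeps flowing
```

For example, `choose_path(True, "pid-123")` returns 1, while `choose_path(True)` with no clear culprit returns 2.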