
feat(rate-limiter): wipe Redis counter state on disable transition#77

Open
gandhipratik203 wants to merge 6 commits into main from feat/rate-limiter-wipe-on-disable-only

Conversation

@gandhipratik203
Collaborator

Summary

Closes IBM/mcp-context-forge#4576 — sub-issue of #4514 (rate-limiter runtime mode-toggle convergence).

When an operator flips the rate limiter to mode: disabled at runtime, the plugin now clears its accumulated counter state from Redis. Previously the counters survived the toggle, so an immediate re-enable (or a stale-mode worker still serving traffic) would block legitimate requests on counter values inherited from before the disable. After this change, disabled truly means "the limiter's effect is gone" — and re-enable starts every user with a fresh window.

Bumps cpex_rate_limiter 0.0.4 → 0.0.6.


What changes

When RateLimiterPlugin.shutdown() runs, the plugin reads plugin:<name>:mode from Redis. If the value is disabled, the plugin acquires a distributed single-flight lock (<prefix>-wipe-lock, 30s TTL safety net), runs SCAN + UNLINK over its configured redis_key_prefix, and releases the lock. The wipe is bounded to keys under the plugin's configured prefix; nothing outside that namespace is touched.
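
For orientation, a minimal sketch of that shutdown-time flow with redis.asyncio. The attribute names (`self._redis`, `self.config.redis_key_prefix`) are assumptions, not the plugin's actual field names; the key formats, lock single-flighting, and prefix bound follow the description above.

```python
# Minimal sketch of the shutdown-time wipe — illustrative, not the plugin source.
async def _wipe_on_disable(self) -> None:
    raw = await self._redis.get(f"plugin:{self.config.name}:mode")
    mode = raw.decode() if isinstance(raw, bytes) else raw
    if mode != "disabled":
        return                                   # restart / fresh deploy: counters survive

    if not await self._acquire_wipe_lock():      # single-flight across workers
        return
    try:
        cursor = 0
        while True:
            # bounded to the configured prefix — nothing outside it is touched
            cursor, keys = await self._redis.scan(
                cursor, match=f"{self.config.redis_key_prefix}:*", count=500
            )
            if keys:
                await self._redis.unlink(*keys)
            if cursor == 0:
                break
    finally:
        await self._release_wipe_lock()
```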

The mechanism piggybacks on the framework's existing reload_tenant path: when publish_plugin_mode_change("<name>", "disabled") lands, every gateway worker's invalidation listener fires factory.invalidate_all() → reload_tenant per cached context → manager.shutdown() → plugin.shutdown(). Single-flight via the lock ensures the SCAN+UNLINK runs once across the cluster, even if N workers all process the broadcast simultaneously.


Test coverage

TestWipeOnDisable class — 10 tests in tests/rate_limiter/test_redis_integration.py:

| Test | Target |
| --- | --- |
| `test_wipe_fires_when_mode_key_says_disabled` | Happy path |
| `test_no_wipe_when_mode_key_says_enforce` | Graceful pod restart safety |
| `test_no_wipe_when_mode_key_absent` | Fresh-deploy / no-mode-key safety |
| `test_no_wipe_when_redis_unreachable_for_mode_lookup` | Fail-safe on Redis outage |
| `test_wipe_is_idempotent_under_concurrent_shutdown` | 5-instance parallel shutdown converges |
| `test_core_shutdown_runs_even_when_wipe_raises` | Wipe failure doesn't leak Redis connections |
| `test_wipe_skipped_when_another_worker_holds_lock` | Distributed single-flight |
| `test_wipe_only_deletes_keys_under_configured_prefix` | Blast radius confined to configured prefix |
| `test_full_toggle_journey_enforce_disable_reenforce` | Single-instance journey: saturated user is no longer blocked after cycle |
| `test_full_toggle_journey_across_multiple_instances` | N=3 parallel instances: every user gets a fresh window after re-enable, not just the lock-winner's |
Full local gate:

| Command | Result |
| --- | --- |
| `cargo fmt -- --check` | clean |
| `cargo clippy -- -D warnings` | clean |
| `make test-unit` | Rust tests: 58 passed |
| `make test-integration` | pytest: 66 passed |

Deployment caveats (worth knowing before merge)

Two pre-conditions must hold for the wipe path to actually fire. Both are intrinsic to how the wipe interacts with the framework's plugin lifecycle, and worth flagging so operators aren't surprised at start-up:

  1. Plugin must be loaded for the wipe to fire. When plugins/config.yaml has mode: "disabled" for the plugin from startup, the framework loader does not instantiate the plugin under any cached manager. No plugin instance ⇒ no shutdown to fire ⇒ no wipe path to run, regardless of how many subscribers receive the broadcast. In practice this means: the wipe only takes effect after the plugin has been in enforce mode at least once.
  2. At least one cached manager required. factory.invalidate_all() iterates only cached managers; a fresh gateway with zero traffic has zero cached managers. The first request through the gateway warms a manager; subsequent toggle broadcasts then have something to act on.

Neither caveat affects production deployments after the first request. Both are worth documenting on the rate-limiter README so operators understand the start-up timing.


Migration notes

  • No new required config. Existing deployments work unchanged. The wipe fires automatically on the disable transition.
  • redis_key_prefix is the blast-radius boundary. If multiple rate-limiter instances share one Redis and use the same prefix, a disable on any one wipes counters for all. Default prefix is "rl"; operators with multiple limiters should set distinct prefixes.
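
A hypothetical plugins/config.yaml fragment showing two limiter instances with distinct prefixes. Only `redis_key_prefix` and `mode` come from this PR; the surrounding schema and plugin names are illustrative placeholders.

```yaml
# Illustrative only — field layout is a guess except for redis_key_prefix and mode.
plugins:
  - name: RateLimiterPlugin
    mode: enforce
    config:
      redis_key_prefix: "rl-public"     # blast radius for this instance only
  - name: RateLimiterPluginInternal
    mode: enforce
    config:
      redis_key_prefix: "rl-internal"   # a disable on the other limiter never wipes these keys
```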

When the operator transitions the plugin's mode to "disabled" via the
admin mode API, every Redis counter for the plugin's configured key
prefix is now deleted before the plugin shuts down. Re-enabling starts
every user with a fresh window — counters do not resume from the
pre-disable state. This makes the rate limiter behave like a stateless
plugin from the operator's perspective across mode transitions.

Detection mechanism: publish_plugin_mode_change writes
plugin:<name>:mode to Redis *before* publishing the pub/sub broadcast
that triggers shutdown. The plugin reads that key in shutdown() and
only wipes when its value is exactly "disabled". On a graceful pod
shutdown the key is unchanged, so counters survive restarts.
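
In outline (the channel name and exact framework signature here are assumptions; only the key format and the write-before-publish ordering come from this PR):

```python
# Sketch of the write-then-broadcast ordering the wipe relies on.
import redis.asyncio as aioredis

async def publish_plugin_mode_change(r: aioredis.Redis, name: str, mode: str) -> None:
    # 1. Persist the mode first, so any shutdown() triggered by the broadcast
    #    already sees the new value when it reads plugin:<name>:mode.
    await r.set(f"plugin:{name}:mode", mode)
    # 2. Then fan out the pub/sub message that drives the invalidation path.
    await r.publish("plugin-mode-changes", name)   # channel name is illustrative
```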

The wipe uses non-blocking SCAN MATCH <prefix>:* with batched DELs and
is idempotent — concurrent shutdowns across multiple workers all
converge to the same end state without raising. core.shutdown()
always runs in finally, so a wipe failure never blocks connection
release.

Known limitation: per-tenant binding-API mode changes
(POST /v1/tools/plugin_bindings) do not currently trigger a wipe —
that path stores the mode in the database, not in plugin:<name>:mode.
Documented in README, tracked for follow-up.

Adds redis>=5.0 as a runtime dependency (was previously dev-only —
the wipe path needs redis.asyncio outside of tests).

Tests: 6 new integration tests covering wipe-fires-on-disabled,
no-wipe-on-enforce, no-wipe-on-missing-mode-key (graceful-shutdown
safety), no-wipe-on-redis-unreachable (fail-safe), idempotency under
concurrent shutdowns, and core-shutdown-still-runs-when-wipe-raises.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
…efix + TLS test depth)

Addresses four P2 findings from Luca's review on PR #74. P2-3 (lifecycle
coupling — move wipe into the Rust core and add an on_mode_change hook
to the framework) is deliberately deferred as a separate follow-up; it
requires framework-level changes in mcp-context-forge that are out of
scope for this PR.

P2-1 — single-flight guard around the wipe.
  Without coordination, every worker that sees mode=disabled runs the
  full SCAN+DEL independently. With 3 replicas × 24 workers = 72
  workers all wiping the same keyspace, that's ~72× the Redis work
  and contention on the same keys. Added _acquire_wipe_lock() /
  _release_wipe_lock() using SET ... NX EX 30 — only the worker that
  wins the lock performs the wipe; the rest skip cleanly.
  Lock key uses a dash separator (`<prefix>-wipe-lock`) so the wipe
  SCAN's `<prefix>:*` pattern doesn't match it. TTL bounds starvation
  if the lock holder dies mid-wipe; the lock is released eagerly after
  a successful wipe so the next disable cycle is responsive.
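
  Roughly, the guard reduces to a SET NX EX pair. A sketch with redis.asyncio — the
  helper names come from this PR, but the bodies and the free-function form
  (client and prefix as parameters) are illustrative:

```python
# Sketch of the single-flight guard; bodies are illustrative.
import redis.asyncio as aioredis

async def _acquire_wipe_lock(r: aioredis.Redis, prefix: str) -> bool:
    # SET <prefix>-wipe-lock 1 NX EX 30 — exactly one worker in the cluster
    # gets True; the 30s TTL bounds starvation if the holder dies mid-wipe.
    return bool(await r.set(f"{prefix}-wipe-lock", "1", nx=True, ex=30))

async def _release_wipe_lock(r: aioredis.Redis, prefix: str) -> None:
    # Released eagerly after a successful wipe so the next disable cycle
    # doesn't have to wait out the TTL.
    await r.delete(f"{prefix}-wipe-lock")
```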

P2-2 — UNLINK with DEL fallback.
  SCAN avoids blocking key enumeration, but DEL still frees values on
  Redis' main thread. For sliding-window sorted sets, dropping 500 of
  those at once can stutter Redis. Switched to UNLINK (Redis 4.0+,
  defers value-free to a background thread) with a ResponseError
  fallback to DEL for pre-4.0 deployments.
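
  A sketch of that fallback with the same redis.asyncio client; the helper name
  and batching are illustrative, the UNLINK → DEL fallback on ResponseError is
  what the PR describes:

```python
# Illustrative batch-delete helper.
from redis.exceptions import ResponseError

async def _delete_batch(r, keys) -> None:
    if not keys:
        return
    try:
        # Redis >= 4.0: UNLINK defers value reclamation to a background thread.
        await r.unlink(*keys)
    except ResponseError:
        # Pre-4.0 servers don't know UNLINK; fall back to synchronous DEL.
        await r.delete(*keys)
```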

P2-4 — TLS handshake-depth test.
  The existing rediss:// regression test only verifies URL parsing.
  Added test_rediss_url_lazy_handshake_reaches_connectivity_layer that
  drives a tool_pre_invoke and asserts the failure logs are
  connectivity-shaped (refused / IO / timed out), explicitly NOT
  "feature is not enabled" / InvalidClientConfig — the latter being
  the wo-tracker #68217 regression signature. Catches a future Cargo
  feature-flag mistake or a rustls regression that the construction
  test would silently pass.

P2-5 — custom redis_key_prefix coverage.
  Every existing test uses the default prefix, so a regression that
  hard-coded "rl:*" would still pass. Added
  test_wipe_only_deletes_keys_under_configured_prefix that seeds keys
  under a custom prefix and an unrelated namespace, then asserts only
  the configured-prefix keys are deleted.
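
  The shape of that test, with hypothetical fixtures (redis_client, make_plugin)
  standing in for the real ones in test_redis_integration.py — the assertion
  pattern (custom prefix wiped, unrelated namespace untouched) is what the real
  test pins:

```python
# Illustrative only — fixture names and plugin construction are placeholders.
import pytest

@pytest.mark.asyncio
async def test_wipe_only_deletes_keys_under_configured_prefix(redis_client, make_plugin):
    plugin = make_plugin(redis_key_prefix="custom")           # hypothetical factory fixture
    await redis_client.set("custom:alice:window", "3")        # inside the blast radius
    await redis_client.set("other-ns:alice:window", "3")      # unrelated namespace
    await redis_client.set("plugin:RateLimiterPlugin:mode", "disabled")

    await plugin.shutdown()

    assert await redis_client.get("custom:alice:window") is None        # wiped
    assert await redis_client.get("other-ns:alice:window") is not None  # untouched
```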

Local gate: cargo fmt-check + clippy + 57 Rust tests; pytest 115/115
(112 prior + lock-skipped + custom-prefix + TLS-handshake-depth = 115).

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
This branch ships the wipe-on-disable feature in a separate release from
the TLS support PR (which uses 0.0.5). The 0.0.6 bump assumes TLS lands
first; if order flips the bump moves with it.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
The existing TestWipeOnDisable class asserts wipe invariants in
isolation — does the wipe fire on disable, does it skip when it
should, does it respect the prefix.  None of them assert the
*transition composes*: that a user who was being blocked stops
being blocked after a full enforce → disable → enforce cycle.
That is the user-visible contract from User Story 1 of
mcp-context-forge#4576, and this test pins it.

Phase 1 saturates alice's bucket and confirms the 4th request
under 3/m is actually blocked (proves the pre-toggle state was
indeed saturated, which the existing seed-based tests don't).
Phase 2 toggles to disabled, calls shutdown, asserts the wipe
ran.  Phase 3 toggles back to enforce, constructs a fresh plugin
instance (mirroring what the plugin manager does on re-enable),
and asserts alice's first request *passes* — the unique evidence
that the wipe + re-enable produced a fresh window for the same
user who was blocked seconds earlier.

Window sized at 3/m rather than 3/s so the test never straddles
a second boundary on slow runners — without the wipe, alice's
5th request would fall in the same minute window and be blocked,
making a passing 5th request unambiguous evidence the wipe ran
(not just a window roll-over masking the bug).
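
Condensed, the three phases look roughly like this (make_plugin and invoke are
hypothetical helpers; the real test drives tool_pre_invoke through the plugin):

```python
# Illustrative phase structure only — helpers and limit syntax are placeholders.
import pytest

@pytest.mark.asyncio
async def test_full_toggle_journey_enforce_disable_reenforce(redis_client, make_plugin, invoke):
    plugin = make_plugin(limit="3/m")
    for _ in range(3):
        assert await invoke(plugin, user="alice")     # Phase 1: fill the window
    assert not await invoke(plugin, user="alice")     # 4th request blocked: saturated

    await redis_client.set("plugin:RateLimiterPlugin:mode", "disabled")
    await plugin.shutdown()                           # Phase 2: wipe fires under the lock

    await redis_client.set("plugin:RateLimiterPlugin:mode", "enforce")
    fresh = make_plugin(limit="3/m")                  # Phase 3: fresh instance on re-enable
    assert await invoke(fresh, user="alice")          # fresh window: alice is unblocked
```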

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
…n-disable

Extends the journey coverage to the production fan-out shape: multiple
plugin instances sharing one Redis URL, exercising the distributed lock
under realistic flow.

What this adds beyond the existing tests:
  - test_wipe_is_idempotent_under_concurrent_shutdown (#5) already
    proves N parallel shutdowns converge cleanly, but with seeded
    counters and no re-enable phase.
  - test_full_toggle_journey_enforce_disable_reenforce proves the
    full enforce → disable → re-enable transition composes, but with
    a single plugin instance, so the lock isn't exercised under
    real flow.

This test is the composition of both — three plugin instances each
saturate a different user (alice/bob/carol) in parallel via real
tool_pre_invoke calls, the parallel shutdown under mode=disabled
must converge to all rl:* keys deleted (exactly one wipes, two skip
via the lock without raising), and three fresh plugin instances
must all observe a fresh window for their respective users.

The unique contract this test pins: a regression that wiped only
the user owned by the lock-winning plugin would pass test #5 but
fail Phase 3 here, where every user must get a fresh window.

Window stays at 3/m for the same reason as the single-instance
journey — a 4th-request-blocked assertion in Phase 1 makes the
"first request passes" assertion in Phase 3 non-vacuous.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
…st fixtures

Aligns the test fixtures' plugin name with the gateway's own
plugins/config.yaml convention (which has used "RateLimiterPlugin"
since the rate-limiter was added).

The plugin reads its mode key as ``plugin:<self.config.name>:mode``
— so whatever name the test passes at construction is what the test
expects to see in Redis.  Before this commit, the test fixtures
constructed plugins with name="RateLimiter" while production deploys
them with name="RateLimiterPlugin", which silently created two
disjoint mode-key namespaces and made it possible for an
end-to-end test (mcp-context-forge#4511 layer) to publish to one
namespace while the running gateway's plugin read the other.  Found
while empirically exercising the wipe-on-disable convergence path
on a multi-replica stack — the wipe never fired because the test's
PUBLISH used "RateLimiter" and the gateway's plugin looked up
"RateLimiterPlugin".

No code change to the rate-limiter plugin itself: ``self.config.name``
is just a string that gets reflected into the mode key, and the
plugin works equally well with either value as long as the
publisher and subscriber agree.  The rename is purely about
fixture-vs-production consistency.

Touches every PluginConfig construction and every
_set_plugin_mode_key call in test_redis_integration.py — bulk find/
replace, no behavioural change to any individual test.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
@gandhipratik203 force-pushed the feat/rate-limiter-wipe-on-disable-only branch from 9554798 to 52e3d43 on May 3, 2026, 11:51
@lucarlig force-pushed the feat/rate-limiter-wipe-on-disable-only branch from b662ddf to 52e3d43 on May 5, 2026, 09:09