
feat(rate-limiter): wipe Redis counter state on disable transition#77

Open
gandhipratik203 wants to merge 6 commits into main from feat/rate-limiter-wipe-on-disable-only

Conversation

@gandhipratik203
Collaborator

Summary

Closes IBM/mcp-context-forge#4576 — sub-issue of #4514 (rate-limiter runtime mode-toggle convergence).

When an operator flips the rate limiter to mode: disabled at runtime, the plugin now clears its accumulated counter state from Redis. Previously the counters survived the toggle, so an immediate re-enable (or a stale-mode worker still serving traffic) would block legitimate requests on counter values inherited from before the disable. After this change, disabled truly means "the limiter's effect is gone" — and re-enable starts every user with a fresh window.

Bumps cpex_rate_limiter 0.0.4 → 0.0.6.


What changes

When RateLimiterPlugin.shutdown() runs, the plugin reads plugin:<name>:mode from Redis. If the value is disabled, the plugin acquires a distributed single-flight lock (<prefix>-wipe-lock, 30s TTL safety net), runs SCAN + UNLINK over its configured redis_key_prefix, and releases the lock. The wipe is bounded to keys under the plugin's configured prefix; nothing outside that namespace is touched.
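
For orientation, a minimal sketch of that shutdown-time flow with redis.asyncio. The attribute names (`self._redis`, `self.config.redis_key_prefix`) are assumptions, not the plugin's actual field names; the key formats, lock single-flighting, and prefix bound follow the description above.

```python
# Minimal sketch of the shutdown-time wipe — illustrative, not the plugin source.
async def _wipe_on_disable(self) -> None:
    raw = await self._redis.get(f"plugin:{self.config.name}:mode")
    mode = raw.decode() if isinstance(raw, bytes) else raw
    if mode != "disabled":
        return                                   # restart / fresh deploy: counters survive

    if not await self._acquire_wipe_lock():      # single-flight across workers
        return
    try:
        cursor = 0
        while True:
            # bounded to the configured prefix — nothing outside it is touched
            cursor, keys = await self._redis.scan(
                cursor, match=f"{self.config.redis_key_prefix}:*", count=500
            )
            if keys:
                await self._redis.unlink(*keys)
            if cursor == 0:
                break
    finally:
        await self._release_wipe_lock()
```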

The mechanism piggybacks on the framework's existing reload_tenant path: when publish_plugin_mode_change("<name>", "disabled") lands, every gateway worker's invalidation listener fires factory.invalidate_all() → reload_tenant per cached context → manager.shutdown() → plugin.shutdown(). Single-flight via the lock ensures the SCAN+UNLINK runs once across the cluster, even if N workers all process the broadcast simultaneously.


Test coverage

TestWipeOnDisable class — 10 tests in tests/rate_limiter/test_redis_integration.py:

| Test | Target |
| --- | --- |
| `test_wipe_fires_when_mode_key_says_disabled` | Happy path |
| `test_no_wipe_when_mode_key_says_enforce` | Graceful pod restart safety |
| `test_no_wipe_when_mode_key_absent` | Fresh-deploy / no-mode-key safety |
| `test_no_wipe_when_redis_unreachable_for_mode_lookup` | Fail-safe on Redis outage |
| `test_wipe_is_idempotent_under_concurrent_shutdown` | 5-instance parallel shutdown converges |
| `test_core_shutdown_runs_even_when_wipe_raises` | Wipe failure doesn't leak Redis connections |
| `test_wipe_skipped_when_another_worker_holds_lock` | Distributed single-flight |
| `test_wipe_only_deletes_keys_under_configured_prefix` | Blast radius confined to configured prefix |
| `test_full_toggle_journey_enforce_disable_reenforce` | Single-instance journey: saturated user is no longer blocked after cycle |
| `test_full_toggle_journey_across_multiple_instances` | N=3 parallel instances: every user gets a fresh window after re-enable, not just the lock-winner's |
Full local gate:

| Command | Result |
| --- | --- |
| `cargo fmt -- --check` | clean |
| `cargo clippy -- -D warnings` | clean |
| `make test-unit` | Rust tests: 58 passed |
| `make test-integration` | pytest: 66 passed |

Deployment caveats (worth knowing before merge)

Two pre-conditions must hold for the wipe path to actually fire. Both are intrinsic to how the wipe interacts with the framework's plugin lifecycle, and worth flagging so operators aren't surprised at start-up:

  1. Plugin must be loaded for the wipe to fire. When plugins/config.yaml has mode: "disabled" for the plugin from startup, the framework loader does not instantiate the plugin under any cached manager. No plugin instance ⇒ no shutdown to fire ⇒ no wipe path to run, regardless of how many subscribers receive the broadcast. In practice this means: the wipe only takes effect after the plugin has been in enforce mode at least once.
  2. At least one cached manager required. factory.invalidate_all() iterates only cached managers; a fresh gateway with zero traffic has zero cached managers. The first request through the gateway warms a manager; subsequent toggle broadcasts then have something to act on.

Neither caveat affects production deployments after the first request. Both are worth documenting on the rate-limiter README so operators understand the start-up timing.


Migration notes

  • No new required config. Existing deployments work unchanged. The wipe fires automatically on the disable transition.
  • redis_key_prefix is the blast-radius boundary. If multiple rate-limiter instances share one Redis and use the same prefix, a disable on any one wipes counters for all. Default prefix is "rl"; operators with multiple limiters should set distinct prefixes.
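
A hypothetical plugins/config.yaml fragment showing two limiter instances with distinct prefixes. Only `redis_key_prefix` and `mode` come from this PR; the surrounding schema and plugin names are illustrative placeholders.

```yaml
# Illustrative only — field layout is a guess except for redis_key_prefix and mode.
plugins:
  - name: RateLimiterPlugin
    mode: enforce
    config:
      redis_key_prefix: "rl-public"     # blast radius for this instance only
  - name: RateLimiterPluginInternal
    mode: enforce
    config:
      redis_key_prefix: "rl-internal"   # a disable on the other limiter never wipes these keys
```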

When the operator transitions the plugin's mode to "disabled" via the
admin mode API, every Redis counter for the plugin's configured key
prefix is now deleted before the plugin shuts down. Re-enabling starts
every user with a fresh window — counters do not resume from the
pre-disable state. This makes the rate limiter behave like a stateless
plugin from the operator's perspective across mode transitions.

Detection mechanism: publish_plugin_mode_change writes
plugin:<name>:mode to Redis *before* publishing the pub/sub broadcast
that triggers shutdown. The plugin reads that key in shutdown() and
only wipes when its value is exactly "disabled". On a graceful pod
shutdown the key is unchanged, so counters survive restarts.
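
In outline (the channel name and exact framework signature here are assumptions; only the key format and the write-before-publish ordering come from this PR):

```python
# Sketch of the write-then-broadcast ordering the wipe relies on.
import redis.asyncio as aioredis

async def publish_plugin_mode_change(r: aioredis.Redis, name: str, mode: str) -> None:
    # 1. Persist the mode first, so any shutdown() triggered by the broadcast
    #    already sees the new value when it reads plugin:<name>:mode.
    await r.set(f"plugin:{name}:mode", mode)
    # 2. Then fan out the pub/sub message that drives the invalidation path.
    await r.publish("plugin-mode-changes", name)   # channel name is illustrative
```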

The wipe uses non-blocking SCAN MATCH <prefix>:* with batched DELs and
is idempotent — concurrent shutdowns across multiple workers all
converge to the same end state without raising. core.shutdown()
always runs in finally, so a wipe failure never blocks connection
release.

Known limitation: per-tenant binding-API mode changes
(POST /v1/tools/plugin_bindings) do not currently trigger a wipe —
that path stores the mode in the database, not in plugin:<name>:mode.
Documented in README, tracked for follow-up.

Adds redis>=5.0 as a runtime dependency (was previously dev-only —
the wipe path needs redis.asyncio outside of tests).

Tests: 6 new integration tests covering wipe-fires-on-disabled,
no-wipe-on-enforce, no-wipe-on-missing-mode-key (graceful-shutdown
safety), no-wipe-on-redis-unreachable (fail-safe), idempotency under
concurrent shutdowns, and core-shutdown-still-runs-when-wipe-raises.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
…efix + TLS test depth)

Addresses four P2 findings from Luca's review on PR #74. P2-3 (lifecycle
coupling — move wipe into the Rust core and add an on_mode_change hook
to the framework) is deliberately deferred as a separate follow-up; it
requires framework-level changes in mcp-context-forge that are out of
scope for this PR.

P2-1 — single-flight guard around the wipe.
  Without coordination, every worker that sees mode=disabled runs the
  full SCAN+DEL independently. With 3 replicas × 24 workers = 72
  workers all wiping the same keyspace, that's ~72× the Redis work
  and contention on the same keys. Added _acquire_wipe_lock() /
  _release_wipe_lock() using SET ... NX EX 30 — only the worker that
  wins the lock performs the wipe; the rest skip cleanly.
  Lock key uses a dash separator (`<prefix>-wipe-lock`) so the wipe
  SCAN's `<prefix>:*` pattern doesn't match it. TTL bounds starvation
  if the lock holder dies mid-wipe; the lock is released eagerly after
  a successful wipe so the next disable cycle is responsive.
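
  Roughly, the guard reduces to a SET NX EX pair. A sketch with redis.asyncio — the
  helper names come from this PR, but the bodies and the free-function form
  (client and prefix as parameters) are illustrative:

```python
# Sketch of the single-flight guard; bodies are illustrative.
import redis.asyncio as aioredis

async def _acquire_wipe_lock(r: aioredis.Redis, prefix: str) -> bool:
    # SET <prefix>-wipe-lock 1 NX EX 30 — exactly one worker in the cluster
    # gets True; the 30s TTL bounds starvation if the holder dies mid-wipe.
    return bool(await r.set(f"{prefix}-wipe-lock", "1", nx=True, ex=30))

async def _release_wipe_lock(r: aioredis.Redis, prefix: str) -> None:
    # Released eagerly after a successful wipe so the next disable cycle
    # doesn't have to wait out the TTL.
    await r.delete(f"{prefix}-wipe-lock")
```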

P2-2 — UNLINK with DEL fallback.
  SCAN avoids blocking key enumeration, but DEL still frees values on
  Redis' main thread. For sliding-window sorted sets, dropping 500 of
  those at once can stutter Redis. Switched to UNLINK (Redis 4.0+,
  defers value-free to a background thread) with a ResponseError
  fallback to DEL for pre-4.0 deployments.
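
  A sketch of that fallback with the same redis.asyncio client; the helper name
  and batching are illustrative, the UNLINK → DEL fallback on ResponseError is
  what the PR describes:

```python
# Illustrative batch-delete helper.
from redis.exceptions import ResponseError

async def _delete_batch(r, keys) -> None:
    if not keys:
        return
    try:
        # Redis >= 4.0: UNLINK defers value reclamation to a background thread.
        await r.unlink(*keys)
    except ResponseError:
        # Pre-4.0 servers don't know UNLINK; fall back to synchronous DEL.
        await r.delete(*keys)
```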

P2-4 — TLS handshake-depth test.
  The existing rediss:// regression test only verifies URL parsing.
  Added test_rediss_url_lazy_handshake_reaches_connectivity_layer that
  drives a tool_pre_invoke and asserts the failure logs are
  connectivity-shaped (refused / IO / timed out), explicitly NOT
  "feature is not enabled" / InvalidClientConfig — the latter being
  the wo-tracker #68217 regression signature. Catches a future Cargo
  feature-flag mistake or a rustls regression that the construction
  test would silently pass.

P2-5 — custom redis_key_prefix coverage.
  Every existing test uses the default prefix, so a regression that
  hard-coded "rl:*" would still pass. Added
  test_wipe_only_deletes_keys_under_configured_prefix that seeds keys
  under a custom prefix and an unrelated namespace, then asserts only
  the configured-prefix keys are deleted.
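
  The shape of that test, with hypothetical fixtures (redis_client, make_plugin)
  standing in for the real ones in test_redis_integration.py — the assertion
  pattern (custom prefix wiped, unrelated namespace untouched) is what the real
  test pins:

```python
# Illustrative only — fixture names and plugin construction are placeholders.
import pytest

@pytest.mark.asyncio
async def test_wipe_only_deletes_keys_under_configured_prefix(redis_client, make_plugin):
    plugin = make_plugin(redis_key_prefix="custom")           # hypothetical factory fixture
    await redis_client.set("custom:alice:window", "3")        # inside the blast radius
    await redis_client.set("other-ns:alice:window", "3")      # unrelated namespace
    await redis_client.set("plugin:RateLimiterPlugin:mode", "disabled")

    await plugin.shutdown()

    assert await redis_client.get("custom:alice:window") is None        # wiped
    assert await redis_client.get("other-ns:alice:window") is not None  # untouched
```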

Local gate: cargo fmt-check + clippy + 57 Rust tests; pytest 115/115
(112 prior + lock-skipped + custom-prefix + TLS-handshake-depth = 115).

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
This branch ships the wipe-on-disable feature in a separate release from
the TLS support PR (which uses 0.0.5). The 0.0.6 bump assumes TLS lands
first; if order flips the bump moves with it.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
The existing TestWipeOnDisable class asserts wipe invariants in
isolation — does the wipe fire on disable, does it skip when it
should, does it respect the prefix.  None of them assert the
*transition composes*: that a user who was being blocked stops
being blocked after a full enforce → disable → enforce cycle.
That is the user-visible contract from User Story 1 of
mcp-context-forge#4576, and this test pins it.

Phase 1 saturates alice's bucket and confirms the 4th request
under 3/m is actually blocked (proves the pre-toggle state was
indeed saturated, which the existing seed-based tests don't).
Phase 2 toggles to disabled, calls shutdown, asserts the wipe
ran.  Phase 3 toggles back to enforce, constructs a fresh plugin
instance (mirroring what the plugin manager does on re-enable),
and asserts alice's first request *passes* — the unique evidence
that the wipe + re-enable produced a fresh window for the same
user who was blocked seconds earlier.

Window sized at 3/m rather than 3/s so the test never straddles
a second boundary on slow runners — without the wipe, alice's
5th request would fall in the same minute window and be blocked,
making a passing 5th request unambiguous evidence the wipe ran
(not just a window roll-over masking the bug).
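
Condensed, the three phases look roughly like this (make_plugin and invoke are
hypothetical helpers; the real test drives tool_pre_invoke through the plugin):

```python
# Illustrative phase structure only — helpers and limit syntax are placeholders.
import pytest

@pytest.mark.asyncio
async def test_full_toggle_journey_enforce_disable_reenforce(redis_client, make_plugin, invoke):
    plugin = make_plugin(limit="3/m")
    for _ in range(3):
        assert await invoke(plugin, user="alice")     # Phase 1: fill the window
    assert not await invoke(plugin, user="alice")     # 4th request blocked: saturated

    await redis_client.set("plugin:RateLimiterPlugin:mode", "disabled")
    await plugin.shutdown()                           # Phase 2: wipe fires under the lock

    await redis_client.set("plugin:RateLimiterPlugin:mode", "enforce")
    fresh = make_plugin(limit="3/m")                  # Phase 3: fresh instance on re-enable
    assert await invoke(fresh, user="alice")          # fresh window: alice is unblocked
```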

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
…n-disable

Extends the journey coverage to the production fan-out shape: multiple
plugin instances sharing one Redis URL, exercising the distributed lock
under realistic flow.

What this adds beyond the existing tests:
  - test_wipe_is_idempotent_under_concurrent_shutdown (#5) already
    proves N parallel shutdowns converge cleanly, but with seeded
    counters and no re-enable phase.
  - test_full_toggle_journey_enforce_disable_reenforce proves the
    full enforce → disable → re-enable transition composes, but with
    a single plugin instance, so the lock isn't exercised under
    real flow.

This test is the composition of both — three plugin instances each
saturate a different user (alice/bob/carol) in parallel via real
tool_pre_invoke calls, the parallel shutdown under mode=disabled
must converge to all rl:* keys deleted (exactly one wipes, two skip
via the lock without raising), and three fresh plugin instances
must all observe a fresh window for their respective users.

The unique contract this test pins: a regression that wiped only
the user owned by the lock-winning plugin would pass test #5 but
fail Phase 3 here, where every user must get a fresh window.

Window stays at 3/m for the same reason as the single-instance
journey — a 4th-request-blocked assertion in Phase 1 makes the
"first request passes" assertion in Phase 3 non-vacuous.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
…st fixtures

Aligns the test fixtures' plugin name with the gateway's own
plugins/config.yaml convention (which has used "RateLimiterPlugin"
since the rate-limiter was added).

The plugin reads its mode key as ``plugin:<self.config.name>:mode``
— so whatever name the test passes at construction is what the test
expects to see in Redis.  Before this commit, the test fixtures
constructed plugins with name="RateLimiter" while production deploys
them with name="RateLimiterPlugin", which silently created two
disjoint mode-key namespaces and made it possible for an
end-to-end test (mcp-context-forge#4511 layer) to publish to one
namespace while the running gateway's plugin read the other.  Found
while empirically exercising the wipe-on-disable convergence path
on a multi-replica stack — the wipe never fired because the test's
PUBLISH used "RateLimiter" and the gateway's plugin looked up
"RateLimiterPlugin".

No code change to the rate-limiter plugin itself: ``self.config.name``
is just a string that gets reflected into the mode key, and the
plugin works equally well with either value as long as the
publisher and subscriber agree.  The rename is purely about
fixture-vs-production consistency.

Touches every PluginConfig construction and every
_set_plugin_mode_key call in test_redis_integration.py — bulk find/
replace, no behavioural change to any individual test.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
@gandhipratik203 force-pushed the feat/rate-limiter-wipe-on-disable-only branch from 9554798 to 52e3d43 on May 3, 2026, 11:51
@lucarlig force-pushed the feat/rate-limiter-wipe-on-disable-only branch from b662ddf to 52e3d43 on May 5, 2026, 09:09