feat(rate-limiter): wipe Redis counter state on disable transition #77
Open
gandhipratik203 wants to merge 6 commits into main from
Conversation
When the operator transitions the plugin's mode to "disabled" via the admin mode API, every Redis counter under the plugin's configured key prefix is now deleted before the plugin shuts down. Re-enabling starts every user with a fresh window — counters do not resume from the pre-disable state. This makes the rate limiter behave like a stateless plugin from the operator's perspective across mode transitions.

Detection mechanism: publish_plugin_mode_change writes plugin:<name>:mode to Redis *before* publishing the pub/sub broadcast that triggers shutdown. The plugin reads that key in shutdown() and only wipes when its value is exactly "disabled". On a graceful pod shutdown the key is unchanged, so counters survive restarts.

The wipe uses non-blocking SCAN MATCH <prefix>:* with batched DELs and is idempotent — concurrent shutdowns across multiple workers all converge to the same end state without raising. core.shutdown() always runs in a finally block, so a wipe failure never blocks connection release.

Known limitation: per-tenant binding-API mode changes (POST /v1/tools/plugin_bindings) do not currently trigger a wipe — that path stores the mode in the database, not in plugin:<name>:mode. Documented in the README, tracked for follow-up.

Adds redis>=5.0 as a runtime dependency (it was previously dev-only; the wipe path needs redis.asyncio outside of tests).

Tests: 6 new integration tests covering wipe-fires-on-disabled, no-wipe-on-enforce, no-wipe-on-missing-mode-key (graceful-shutdown safety), no-wipe-on-redis-unreachable (fail-safe), idempotency under concurrent shutdowns, and core-shutdown-still-runs-when-wipe-raises.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
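The detection-plus-wipe path described above can be sketched as follows. This is a minimal illustration, not the plugin's actual code: the function name is invented, and `r` stands for any client with redis.asyncio-shaped async `get`/`scan`/`delete` methods.

```python
# Sketch of the shutdown-time detection and wipe. Hypothetical helper; the
# real plugin's method names and error handling differ.
async def wipe_counters_if_disabled(r, plugin_name: str, prefix: str) -> None:
    # publish_plugin_mode_change writes this key *before* the pub/sub
    # broadcast, so by the time shutdown() runs the value is current.
    # (Raw bytes here; with decode_responses=True it would be the str.)
    mode = await r.get(f"plugin:{plugin_name}:mode")
    if mode != b"disabled":
        return  # graceful pod shutdown: key unchanged, counters survive

    # Non-blocking enumeration: SCAN MATCH <prefix>:* with batched deletes.
    # Idempotent: deleting already-deleted keys is a no-op, so concurrent
    # shutdowns converge to the same end state without raising.
    cursor = 0
    while True:
        cursor, keys = await r.scan(cursor=cursor, match=f"{prefix}:*", count=500)
        if keys:
            await r.delete(*keys)
        if cursor == 0:
            break
```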
…efix + TLS test depth)

Addresses four P2 findings from Luca's review on PR #74. P2-3 (lifecycle coupling — move the wipe into the Rust core and add an on_mode_change hook to the framework) is deliberately deferred as a separate follow-up; it requires framework-level changes in mcp-context-forge that are out of scope for this PR.

P2-1 — single-flight guard around the wipe. Without coordination, every worker that sees mode=disabled runs the full SCAN+DEL independently. With 3 replicas × 24 workers = 72 workers all wiping the same keyspace, that's ~72× the Redis work and contention on the same keys. Added _acquire_wipe_lock() / _release_wipe_lock() using SET ... NX EX 30 — only the worker that wins the lock performs the wipe; the rest skip cleanly. The lock key uses a dash separator (`<prefix>-wipe-lock`) so the wipe SCAN's `<prefix>:*` pattern doesn't match it. The TTL bounds starvation if the lock holder dies mid-wipe; the lock is released eagerly after a successful wipe so the next disable cycle is responsive.

P2-2 — UNLINK with DEL fallback. SCAN avoids blocking key enumeration, but DEL still frees values on Redis' main thread. For sliding-window sorted sets, dropping 500 of those at once can stutter Redis. Switched to UNLINK (Redis 4.0+, which defers value-freeing to a background thread) with a ResponseError fallback to DEL for pre-4.0 deployments.

P2-4 — TLS handshake-depth test. The existing rediss:// regression test only verifies URL parsing. Added test_rediss_url_lazy_handshake_reaches_connectivity_layer, which drives a tool_pre_invoke and asserts the failure logs are connectivity-shaped (refused / IO / timed out) and explicitly NOT "feature is not enabled" / InvalidClientConfig — the latter being the wo-tracker #68217 regression signature. This catches a future Cargo feature-flag mistake or a rustls regression that the construction test would silently pass.

P2-5 — custom redis_key_prefix coverage. Every existing test uses the default prefix, so a regression that hard-coded "rl:*" would still pass. Added test_wipe_only_deletes_keys_under_configured_prefix, which seeds keys under a custom prefix and an unrelated namespace, then asserts only the configured-prefix keys are deleted.

Local gate: cargo fmt-check + clippy + 57 Rust tests; pytest 115/115 (112 prior + lock-skipped + custom-prefix + TLS-handshake-depth = 115).

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
This branch ships the wipe-on-disable feature in a separate release from the TLS support PR (which uses 0.0.5). The 0.0.6 bump assumes TLS lands first; if order flips the bump moves with it. Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
The existing TestWipeOnDisable class asserts wipe invariants in isolation — does the wipe fire on disable, does it skip when it should, does it respect the prefix. None of them assert that the *transition composes*: that a user who was being blocked stops being blocked after a full enforce → disable → enforce cycle. That is the user-visible contract from User Story 1 of mcp-context-forge#4576, and this test pins it.

Phase 1 saturates alice's bucket and confirms the 4th request under 3/m is actually blocked (proving the pre-toggle state was indeed saturated, which the existing seed-based tests don't). Phase 2 toggles to disabled, calls shutdown, and asserts the wipe ran. Phase 3 toggles back to enforce, constructs a fresh plugin instance (mirroring what the plugin manager does on re-enable), and asserts alice's first request *passes* — the unique evidence that the wipe + re-enable produced a fresh window for the same user who was blocked seconds earlier.

The window is sized at 3/m rather than 3/s so the test never straddles a second boundary on slow runners — without the wipe, alice's 5th request would fall in the same minute window and be blocked, making a passing 5th request unambiguous evidence that the wipe ran (not just a window roll-over masking the bug).

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
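The three-phase contract can be miniaturized with an in-memory stand-in. Everything here — the ToyLimiter class, the shared dict — is invented for illustration; the real test drives the plugin against Redis. The stand-in keys counters by user only (no time window), which is enough to show why the wipe makes the Phase 3 fresh-window assertion pass.

```python
# Toy model of the enforce -> disable(wipe) -> re-enable journey.
class ToyLimiter:
    def __init__(self, shared_store: dict, limit: int):
        self.store, self.limit = shared_store, limit

    def allow(self, user: str) -> bool:
        count = self.store.get(user, 0)
        if count >= self.limit:
            return False  # saturated: request blocked
        self.store[user] = count + 1
        return True

    def shutdown_wipe(self) -> None:
        self.store.clear()  # stands in for the SCAN+UNLINK wipe

store = {}
limiter = ToyLimiter(store, limit=3)
# Phase 1: saturate alice; the 4th request must be blocked.
assert all(limiter.allow("alice") for _ in range(3))
assert limiter.allow("alice") is False
# Phase 2: disable -> shutdown wipes the shared state.
limiter.shutdown_wipe()
# Phase 3: a fresh instance over the same store gives alice a fresh window.
assert ToyLimiter(store, limit=3).allow("alice") is True
```

Without the `shutdown_wipe()` call, the Phase 3 assertion fails — which is exactly the regression the journey test exists to catch.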
…n-disable

Extends the journey coverage to the production fan-out shape: multiple plugin instances sharing one Redis URL, exercising the distributed lock under realistic flow. What this adds beyond the existing tests:

- test_wipe_is_idempotent_under_concurrent_shutdown (#5) already proves N parallel shutdowns converge cleanly, but with seeded counters and no re-enable phase.
- test_full_toggle_journey_enforce_disable_reenforce proves the full enforce → disable → re-enable transition composes, but with a single plugin instance, so the lock isn't exercised under real flow.

This test is the composition of both — three plugin instances each saturate a different user (alice/bob/carol) in parallel via real tool_pre_invoke calls, the parallel shutdown under mode=disabled must converge to all rl:* keys deleted (exactly one instance wipes, two skip via the lock without raising), and three fresh plugin instances must all observe a fresh window for their respective users.

The unique contract this test pins: a regression that wiped only the user owned by the lock-winning plugin would pass test #5 but fail Phase 3 here, where every user must get a fresh window. The window stays at 3/m for the same reason as the single-instance journey — a 4th-request-blocked assertion in Phase 1 makes the "first request passes" assertion in Phase 3 non-vacuous.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
…st fixtures

Aligns the test fixtures' plugin name with the gateway's own plugins/config.yaml convention (which has used "RateLimiterPlugin" since the rate-limiter was added). The plugin reads its mode key as ``plugin:<self.config.name>:mode`` — so whatever name the test passes at construction is what the test expects to see in Redis.

Before this commit, the test fixtures constructed plugins with name="RateLimiter" while production deploys them with name="RateLimiterPlugin", which silently created two disjoint mode-key namespaces and made it possible for an end-to-end test (mcp-context-forge#4511 layer) to publish to one namespace while the running gateway's plugin read the other. Found while empirically exercising the wipe-on-disable convergence path on a multi-replica stack — the wipe never fired because the test's PUBLISH used "RateLimiter" and the gateway's plugin looked up "RateLimiterPlugin".

No code change to the rate-limiter plugin itself: ``self.config.name`` is just a string that gets reflected into the mode key, and the plugin works equally well with either value as long as the publisher and subscriber agree. The rename is purely about fixture-vs-production consistency.

Touches every PluginConfig construction and every _set_plugin_mode_key call in test_redis_integration.py — bulk find/replace, no behavioural change to any individual test.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
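The mismatch is easy to see once the key derivation is written out. A hypothetical helper mirroring the ``plugin:<self.config.name>:mode`` format quoted above:

```python
def mode_key(plugin_name: str) -> str:
    # Mirrors the plugin's mode-key derivation: the configured name is
    # reflected verbatim into the Redis key.
    return f"plugin:{plugin_name}:mode"

# Publisher and subscriber disagreeing on the name yields disjoint keys,
# so the test's PUBLISH-side SET is invisible to the plugin's shutdown() read.
assert mode_key("RateLimiter") == "plugin:RateLimiter:mode"
assert mode_key("RateLimiterPlugin") == "plugin:RateLimiterPlugin:mode"
assert mode_key("RateLimiter") != mode_key("RateLimiterPlugin")
```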
Summary
Closes IBM/mcp-context-forge#4576 — sub-issue of #4514 (rate-limiter runtime mode-toggle convergence).
When an operator flips the rate limiter to mode: disabled at runtime, the plugin now clears its accumulated counter state from Redis. Previously the counters survived the toggle, so an immediate re-enable (or a stale-mode worker still serving traffic) would block legitimate requests on counter values inherited from before the disable. After this change, disabled truly means "the limiter's effect is gone" — and re-enable starts every user with a fresh window.

Bumps cpex_rate_limiter 0.0.4 → 0.0.6.

What changes
When RateLimiterPlugin.shutdown() runs, the plugin reads plugin:<name>:mode from Redis. If the value is disabled, the plugin acquires a distributed single-flight lock (<prefix>-wipe-lock, 30s TTL safety net), runs SCAN + UNLINK over its configured redis_key_prefix, and releases the lock. The wipe is bounded to keys under the plugin's configured prefix; nothing outside that namespace is touched.

The mechanism piggybacks on the framework's existing reload_tenant path: when publish_plugin_mode_change("<name>", "disabled") lands, every gateway worker's invalidation listener fires factory.invalidate_all() → reload_tenant per cached context → manager.shutdown() → plugin.shutdown(). Single-flight via the lock ensures the SCAN+UNLINK runs once across the cluster, even if N workers all process the broadcast simultaneously.

Test coverage
TestWipeOnDisable class — 10 tests in tests/rate_limiter/test_redis_integration.py:

- test_wipe_fires_when_mode_key_says_disabled
- test_no_wipe_when_mode_key_says_enforce
- test_no_wipe_when_mode_key_absent
- test_no_wipe_when_redis_unreachable_for_mode_lookup
- test_wipe_is_idempotent_under_concurrent_shutdown
- test_core_shutdown_runs_even_when_wipe_raises
- test_wipe_skipped_when_another_worker_holds_lock
- test_wipe_only_deletes_keys_under_configured_prefix
- test_full_toggle_journey_enforce_disable_reenforce
- test_full_toggle_journey_across_multiple_instances

Full local gate
Deployment caveats (worth knowing before merge)
Two pre-conditions for the wipe path to actually fire — intrinsic properties of how the wipe interacts with the framework's plugin lifecycle, worth flagging so operators aren't surprised at start-up:
- If plugins/config.yaml has mode: "disabled" for the plugin from startup, the framework loader does not instantiate the plugin under any cached manager. No plugin instance ⇒ no shutdown to fire ⇒ no wipe path to run, regardless of how many subscribers receive the broadcast. In practice this means the wipe only takes effect after the plugin has been in enforce mode at least once.
- factory.invalidate_all() iterates only cached managers; a fresh gateway with zero traffic has zero cached managers. The first request through the gateway warms a manager; subsequent toggle broadcasts then have something to act on.

Neither caveat affects production deployments after the first request. Both are worth documenting in the rate-limiter README so operators understand the start-up timing.
Migration notes
redis_key_prefix is the blast-radius boundary. If multiple rate-limiter instances share one Redis and use the same prefix, a disable on any one wipes counters for all. The default prefix is "rl"; operators with multiple limiters should set distinct prefixes.
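Why a distinct prefix bounds the blast radius: the wipe enumerates with the glob ``<prefix>:*``, so another limiter's namespace never matches. A sketch using Python's fnmatch as a rough stand-in for Redis MATCH glob semantics (key names invented for illustration):

```python
from fnmatch import fnmatch  # approximates Redis MATCH globbing for simple "<prefix>:*" patterns

keys = ["rl:alice", "rl:bob", "checkout-rl:alice"]
# Two limiters sharing the default "rl" prefix: a disable on either wipes both users.
wiped = [k for k in keys if fnmatch(k, "rl:*")]
assert wiped == ["rl:alice", "rl:bob"]
# A limiter configured with a distinct prefix (here "checkout-rl") is untouched.
assert not fnmatch("checkout-rl:alice", "rl:*")
```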