
[FEATURE]: Clear rate-limiter Redis counter state on disabled mode transition #4576

@gandhipratik203

Description


🧭 Type of Feature

  • Enhancement to existing functionality
  • New feature or capability
  • New MCP-compliant server
  • New component or integration
  • Developer tooling or test improvement
  • Packaging, automation and deployment (ex: pypi, docker, quay.io, kubernetes, terraform)
  • Other

🧭 Epic

Title: Clear rate-limiter persistent counter state when its mode is toggled to disabled.

Goal: When an operator flips the rate-limiter plugin to disabled at runtime, the counter state it has accumulated in Redis should be released, not left behind to silently reapply when the plugin is re-enabled.

Why now: This is the concrete follow-up to design question #3 in the parent issue (#4514): "Should disabled mode actually reset Redis state? Today it stops checking counters but doesn't clear them." The team has converged on "yes, clear them" as the right behavioural contract for the runtime toggle. This issue tracks delivering that contract.

The user-visible problem today: operators using the runtime toggle as an incident-response release valve (or part of a rollout) expect disabled to mean "the limiter's effect is gone." In practice, because the counters persist, the moment the operator flips the plugin back to enforce, traffic that was perfectly legitimate during the disabled window can be blocked instantly by counter values inherited from before the toggle. The behaviour is surprising and works against the use case the toggle was added for.
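For context, here is a minimal sketch of the kind of fixed-window counter pattern that produces this surprise. The key layout, limit, and window length are illustrative assumptions, not the plugin's actual implementation:

```python
# Illustration only: a generic fixed-window limiter in redis-py.
# Key layout, LIMIT and WINDOW_SECONDS are hypothetical.
import redis

r = redis.Redis()

LIMIT = 100            # assumed requests allowed per window
WINDOW_SECONDS = 60    # assumed window length

def allow_request(tenant: str) -> bool:
    key = f"ratelimit:{tenant}"        # hypothetical key layout
    count = r.incr(key)                # counter state persists in Redis
    if count == 1:
        r.expire(key, WINDOW_SECONDS)  # TTL only set when the key is created
    return count <= LIMIT

# If the plugin is toggled to `disabled` while such a key sits at or above
# LIMIT and nothing clears it, the first incr() after re-enable resumes from
# the inherited value, so legitimate traffic is rejected immediately.
```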


🧑🏻‍💻 User Story 1

As an operator using the runtime mode toggle to disable rate-limiting during an incident or rollout,
I want the rate-limiter's persistent counter state to be cleared when I flip the plugin to disabled,
So that when I subsequently re-enable enforcement, traffic is judged against a fresh window and is not immediately blocked by counters that accumulated before the toggle.

✅ Acceptance Criteria

Scenario: counters cleared on enforce → disabled
  Given the rate-limiter is in `enforce` mode
  And there are non-empty counter keys for the limiter in Redis
  When an operator toggles the plugin's mode to `disabled`
  Then the limiter's counter keys are no longer present in Redis once the toggle has propagated

Scenario: re-enable starts a fresh window
  Given the limiter has just been toggled to `disabled` (and its counters cleared)
  When the limiter is toggled back to `enforce`
  Then a freshly-arriving request under the configured rate is allowed
  And no request is blocked solely on the basis of counter state that existed before the disable
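A test-shaped sketch of how these two scenarios might be exercised end to end; `limiter`, `set_mode`, and `allow_request` stand in for whatever the plugin actually exposes, and the `ratelimit:` prefix is assumed:

```python
# Sketch of the acceptance criteria above; all names are placeholders.
import redis

r = redis.Redis()
PREFIX = "ratelimit:"   # assumed key prefix for the limiter's counters

def exercise_toggle_contract(limiter) -> None:
    # Given: the limiter is enforcing and has non-empty counter keys
    limiter.set_mode("enforce")
    r.set(PREFIX + "tenant-a:api", 100)

    # When: an operator toggles the plugin to disabled
    limiter.set_mode("disabled")

    # Then: no limiter counter keys remain once the toggle has propagated
    assert list(r.scan_iter(match=PREFIX + "*")) == []

    # And: re-enabling starts a fresh window, so a request under the
    # configured rate is allowed rather than blocked by inherited counters
    limiter.set_mode("enforce")
    assert limiter.allow_request("tenant-a") is True
```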

🧑🏻‍💻 User Story 2

As a tenant served by a multi-tenant gateway,
I want the cleanup that happens on disabled to respect tenant scoping and the rate-limiter's key prefix,
So that unrelated keys in the same Redis instance (other plugins, shared infra, application data) are unaffected by the toggle.

✅ Acceptance Criteria

Scenario: blast radius is confined to the rate-limiter's keyspace
  Given the rate-limiter has counter keys across multiple tenants
  And there are unrelated keys in the same Redis database that do not belong to the limiter
  When the limiter is toggled to `disabled`
  Then all of the limiter's counter keys (across every tenant) are cleared
  And every unrelated key is left untouched
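One way the confinement requirement could be met, assuming all of the limiter's counters live under a single known prefix, is a batched SCAN plus UNLINK sweep that never looks outside that prefix. The prefix name and batch size here are assumptions:

```python
# Prefix-scoped cleanup sketch: removes only the limiter's own keys.
import redis

def clear_rate_limiter_keys(r: redis.Redis,
                            prefix: str = "ratelimit:",
                            batch_size: int = 500) -> int:
    """Delete every key under `prefix` (all tenants) without touching
    unrelated keys. SCAN avoids blocking the server the way a full KEYS
    sweep would, and UNLINK reclaims memory off the main thread."""
    removed = 0
    batch: list[bytes] = []
    for key in r.scan_iter(match=prefix + "*", count=batch_size):
        batch.append(key)
        if len(batch) >= batch_size:
            removed += r.unlink(*batch)
            batch.clear()
    if batch:
        removed += r.unlink(*batch)
    return removed
```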

📐 Design Sketch (optional)

Out of scope here. The agreed direction is "clear on disable"; the approach to clearing — single-flight, batched, prefix-scoped — is being worked through on the investigation branch linked below.
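Without prejudging that work, a short illustration of the single-flight option named above: a short-lived SET NX EX lock would let whichever gateway worker sees the toggle first perform the sweep while the others skip it. The lock key, TTL, and prefix are assumptions:

```python
# Illustrative single-flight guard around the cleanup (all names assumed).
import redis

def clear_counters_once(r: redis.Redis,
                        prefix: str = "ratelimit:",
                        lock_key: str = "locks:ratelimiter-clear",
                        lock_ttl: int = 30) -> bool:
    # SET NX EX acts as a best-effort distributed mutex: only the worker
    # that creates the lock key runs the sweep; everyone else returns early.
    if not r.set(lock_key, "1", nx=True, ex=lock_ttl):
        return False
    try:
        for key in r.scan_iter(match=prefix + "*", count=500):
            r.unlink(key)   # non-blocking delete of one counter key
        return True
    finally:
        r.delete(lock_key)  # release the lock even if the sweep fails
```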


🔗 MCP Standards Check

  • Change adheres to current MCP specifications
  • No breaking changes to existing MCP-compliant integrations
  • If deviations exist, please describe them below

The runtime mode toggle is an internal gateway concern; the MCP-visible surface (tool calls, prompts) is unchanged.


🔄 Alternatives Considered

  • Leave counters in place (status quo). Operators have to wait for TTL expiry (or manually delete the limiter's key prefix) before re-enable behaves as expected. This is the source of the surprise reported in design question #3 of the parent issue, Rate limiter: runtime mode-toggle convergence behaviour and SLA decisions #4514.
  • Clear only on disabled → enforce transition. Same end-state on re-enable, but the counters continue consuming Redis memory during the entire disabled window with no functional purpose. Rejected as wasteful and as creating an opportunity for stale state to drift.
  • Operator-driven manual clear (e.g. an admin endpoint). Useful as a follow-up affordance but doesn't solve the surprise-on-re-enable problem unless operators remember to call it on every toggle.

📓 Additional Context

Labels

    COULD, P3 (nice-to-have features with minimal impact if left out; included if time permits), design (Architecture and Design), enhancement (New feature or request), plugins, triage (issues / features awaiting triage)
