
[PD] Un-blacklist mooncake sessions when probe succeeds #25287

Merged
ShangmingCai merged 3 commits into sgl-project:main from kflansburg:kflansburg/blacklist-polling on May 20, 2026

@kflansburg kflansburg commented May 14, 2026

MooncakeKVManager adds a mooncake_session_id to self.failed_sessions the first time any KV transfer to that peer fails (one strike, no retry). The session is only removed when the decode side re-registers its KV args via the bootstrap thread. If a transient failure (network blip, peer GC pause, peer briefly unhealthy) blacklists the session but the decode side never re-registers, every subsequent prefill->decode request routed through that session fails forever with 'Decode instance could be dead' until one side restarts.
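
To make the failure mode concrete, here is a toy model of the one-strike behaviour described above. It is only an illustration: the class and method names are hypothetical and are not taken from `MooncakeKVManager`; only `failed_sessions` and the error text come from the description.

```python
class OneStrikeBlacklist:
    """Toy model of a one-strike session blacklist (hypothetical names)."""

    def __init__(self):
        self.failed_sessions = set()

    def transfer(self, session_id, peer_healthy):
        # A blacklisted session is rejected without contacting the peer again.
        if session_id in self.failed_sessions:
            return "rejected: Decode instance could be dead"
        if not peer_healthy:
            # One strike, no retry: the first failure blacklists the session.
            self.failed_sessions.add(session_id)
            return "failed"
        return "ok"

    def on_decode_reregister(self, session_id):
        # The only recovery path before this change: the decode side
        # re-registers its KV args.
        self.failed_sessions.discard(session_id)
```

In this model a single transient failure leaves the session rejected even after the peer is healthy again, until `on_decode_reregister` is called.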

Add a daemon thread on the prefill side that periodically issues a lightweight session probe (engine.send_probe) against every blacklisted session. If the probe succeeds, the session is removed from failed_sessions and session_failures, the event is logged, and sglang:failed_session_recoveries_total is incremented.
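
A minimal sketch of such a probe loop is shown below. It is not the code in this PR: the class name, the lock, and the plain integer counter standing in for `sglang:failed_session_recoveries_total` are assumptions, and so is the convention that `send_probe(session_id)` returns `0` on success.

```python
import logging
import threading

logger = logging.getLogger(__name__)


class FailedSessionProber:
    """Hypothetical prober that un-blacklists sessions whose probe succeeds."""

    def __init__(self, engine, interval_s=30.0):
        self.engine = engine
        self.interval_s = interval_s
        self.failed_sessions = set()
        self.session_failures = {}
        self.recoveries = 0  # stand-in for sglang:failed_session_recoveries_total
        self._lock = threading.Lock()
        self._stop = threading.Event()

    def probe_once(self):
        # Snapshot under the lock so the probes themselves run without it.
        with self._lock:
            candidates = list(self.failed_sessions)
        for session_id in candidates:
            # ASSUMPTION: send_probe(session_id) returns 0 on success.
            if self.engine.send_probe(session_id) != 0:
                continue
            with self._lock:
                self.failed_sessions.discard(session_id)
                self.session_failures.pop(session_id, None)
                self.recoveries += 1
            logger.info("Session %s recovered; removed from blacklist", session_id)

    def start(self):
        def loop():
            # Event.wait returns True once stop() is called, which ends the loop.
            while not self._stop.wait(self.interval_s):
                self.probe_once()

        thread = threading.Thread(target=loop, daemon=True)
        thread.start()
        return thread

    def stop(self):
        self._stop.set()
```

Waiting on a `threading.Event` rather than calling `time.sleep` lets the loop exit promptly on shutdown, and marking the thread as a daemon means it cannot keep the process alive.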

Configuration:
SGLANG_FAILED_SESSION_PROBE_INTERVAL_S (default 30s)
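
As an illustration of how the interval could be read, the sketch below uses plain `os.environ`. The helper name is hypothetical, and falling back to the default on unparsable or non-positive values is an assumption of this sketch; the description does not say how the PR validates the value.

```python
import os

DEFAULT_PROBE_INTERVAL_S = 30.0


def probe_interval_s(environ=os.environ):
    """Return the probe interval in seconds (hypothetical helper)."""
    raw = environ.get("SGLANG_FAILED_SESSION_PROBE_INTERVAL_S")
    if raw is None:
        return DEFAULT_PROBE_INTERVAL_S
    try:
        value = float(raw)
    except ValueError:
        # ASSUMPTION: unparsable input falls back to the default.
        return DEFAULT_PROBE_INTERVAL_S
    # ASSUMPTION: non-positive intervals also fall back to the default.
    return value if value > 0 else DEFAULT_PROBE_INTERVAL_S
```

For example, launching the prefill server with `SGLANG_FAILED_SESSION_PROBE_INTERVAL_S=5` set in its environment would make this helper return `5.0`.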

The probe loop runs only in PREFILL mode (failed_sessions is prefill-only). Probe exceptions are swallowed with a warning so a single bad peer cannot kill the probe thread.

Requires engine.send_probe on the mooncake-transfer-engine wheel, which is added in kvcache-ai/Mooncake#2088 and must be released (and the bundled MOONCAKE_VERSION bumped) before this change takes effect. Until then, the probe call raises AttributeError, which is caught and logged, so blacklist behavior is unchanged from today.
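
The fallback described above, together with the rule that probe exceptions are swallowed, can be sketched as a single guarded call. The function name is hypothetical, and treating a return value of `0` as success is an assumption about `send_probe`, not something stated here.

```python
import logging

logger = logging.getLogger(__name__)


def safe_probe(engine, session_id):
    """Return True only if the probe clearly succeeded; never raise."""
    try:
        # ASSUMPTION: send_probe(session_id) returns 0 on success.
        return engine.send_probe(session_id) == 0
    except Exception as exc:
        # Also covers AttributeError on wheels that lack send_probe.
        logger.warning("Probe of session %s failed: %r", session_id, exc)
        return False
```

Because a missing `send_probe` and a misbehaving peer both end up as `False`, the session simply stays blacklisted and the probe thread keeps running.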

CI States

Latest PR Test (Base): ✅ Run #25943738343
Latest PR Test (Extra): ⚠️ Not enabled -- add run-ci-extra label to opt in.

Comment thread python/sglang/srt/disaggregation/mooncake/conn.py Outdated
Comment thread python/sglang/srt/disaggregation/mooncake/conn.py Outdated

@ShangmingCai ShangmingCai left a comment

Overall, I think this PR has a point and is well implemented.

However, we might need to wait for a new release of mooncake. Also, should we add an env var switch for this feature? We might only need it for specific usage, and other use cases should not pay for what they don't need (another background thread and its CPU overhead).

@ShangmingCai

/tag-and-rerun-ci


@ShangmingCai ShangmingCai left a comment


CI has passed:

Image

We plan to release a new mooncake version this week. You should be able to use this feature in the new version.

@ShangmingCai ShangmingCai merged commit 3b2178c into sgl-project:main May 20, 2026
234 of 259 checks passed
@kflansburg kflansburg deleted the kflansburg/blacklist-polling branch May 20, 2026 03:18
Shunkangz pushed a commit to Shunkangz/sglang that referenced this pull request May 27, 2026