
[TE] Expose sendProbe via Python binding #2088

Open
kflansburg wants to merge 1 commit into kvcache-ai:main from kflansburg:cf/v0.3.10.post2
Conversation

@kflansburg

Description

Exposes the existing TransferMetadata::sendProbe C++ method through the TransferEngine pybind module as engine.send_probe(peer_server_name). This enables SGLang's MooncakeKVManager to issue lightweight JSON-RPC probes against peers, used to test whether a previously-blacklisted mooncake_session_id has become reachable again so it can be removed from the failed_sessions set.

Returns 0 on success, non-zero on failure (matching the C++ contract). No behavior change for existing engine.* methods.
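
The return-code contract above could be consumed roughly as follows. This is a minimal sketch, not code from this PR or from SGLang: `FakeEngine` and `prune_recovered` are hypothetical names, the fake stands in for the real TransferEngine binding so the snippet runs without Mooncake installed, and it assumes that a blacklisted session ID can be passed directly as `peer_server_name`.

```python
from typing import Protocol, Set


class ProbeEngine(Protocol):
    """Anything exposing send_probe(peer_server_name) -> int."""

    def send_probe(self, peer_server_name: str) -> int: ...


class FakeEngine:
    """Hypothetical stand-in for the real engine, for illustration only."""

    def __init__(self, reachable: Set[str]) -> None:
        self._reachable = reachable

    def send_probe(self, peer_server_name: str) -> int:
        # Mirrors the stated contract: 0 on success, non-zero on failure.
        return 0 if peer_server_name in self._reachable else -1


def prune_recovered(engine: ProbeEngine, failed_sessions: Set[str]) -> Set[str]:
    """Probe every failed session, drop the ones that answer, return them."""
    recovered = {s for s in failed_sessions if engine.send_probe(s) == 0}
    failed_sessions.difference_update(recovered)  # mutates the caller's set
    return recovered


engine = FakeEngine(reachable={"10.0.0.1:8000"})
failed = {"10.0.0.1:8000", "10.0.0.2:8000"}
recovered = prune_recovered(engine, failed)
```

Only a return value of exactly 0 is treated as "reachable"; any non-zero value leaves the session in the failed set.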

Module

  • Transfer Engine (mooncake-transfer-engine)
  • Mooncake Store (mooncake-store)
  • Mooncake EP (mooncake-ep)
  • Integration (mooncake-integration)
  • P2P Store (mooncake-p2p-store)
  • Python Wheel (mooncake-wheel)
  • PyTorch Backend (mooncake-pg)
  • Mooncake RL (mooncake-rl)
  • CI/CD
  • Docs
  • Other

Type of Change

  • Bug fix
  • New feature
  • Refactor
  • Breaking change
  • Documentation update
  • Other

How Has This Been Tested?

  • New Python unit tests in transfer_engine_initiator_test.py covering both the reachable-peer and unknown-peer cases.
  • Manually validated end-to-end against SGLang's MooncakeKVManager.

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using ./scripts/code_format.sh before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

@kflansburg changed the title from "Expose sendProbe via Python binding" to "[TE] Expose sendProbe via Python binding" on May 12, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces the send_probe method to the Transfer Engine, enabling users to verify peer reachability through a lightweight JSON-RPC probe. The update includes the C++ implementation, Python bindings, updated documentation, and new unit tests. A review comment suggests releasing the Python Global Interpreter Lock (GIL) during the sendProbe execution to avoid blocking other threads during this potentially synchronous network operation.

Comment thread mooncake-integration/transfer_engine/transfer_engine_py.cpp
@codecov-commenter

Codecov Report

❌ Patch coverage is 0% with 10 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...integration/transfer_engine/transfer_engine_py.cpp | 0.00% | 10 Missing ⚠️ |


@stmatengss
Collaborator

Thanks for this addition. This looks useful for the recovery logic of MooncakeKVManager. I have a quick question regarding the design:

Usage Scenario: Could you clarify how the send_probe is triggered? Are you implementing a background polling mechanism or a demand-based retry strategy for the blacklisted sessions?

@stmatengss
Collaborator

@ShangmingCai, there are code changes relevant to sglang. PTAL.

@kflansburg
Author

kflansburg commented May 14, 2026

> Thanks for this addition. This looks useful for the recovery logic of MooncakeKVManager. I have a quick question regarding the design:
>
> Usage Scenario: Could you clarify how the send_probe is triggered? Are you implementing a background polling mechanism or a demand-based retry strategy for the blacklisted sessions?

I'm using a patch right now that does background polling, but I don't have a strong opinion on this. sgl-project/sglang#25287
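
For readers unfamiliar with the pattern, background polling of blacklisted sessions might look like the sketch below. It is not taken from the linked SGLang patch: `start_probe_poller` and its parameters are hypothetical, and `probe` is any callable with the same 0-on-success contract as `engine.send_probe`.

```python
import threading
from typing import Callable, Set


def start_probe_poller(
    probe: Callable[[str], int],
    failed_sessions: Set[str],
    lock: threading.Lock,
    interval_s: float,
    stop: threading.Event,
) -> threading.Thread:
    """Periodically probe failed sessions and drop those that respond."""

    def loop() -> None:
        while not stop.is_set():
            # Snapshot under the lock so probing never holds it.
            with lock:
                candidates = list(failed_sessions)
            for session in candidates:
                if probe(session) == 0:  # 0 means the peer answered
                    with lock:
                        failed_sessions.discard(session)
            # Sleeps for interval_s, but wakes early if stop is set.
            stop.wait(interval_s)

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```

The probe runs outside the lock so that a slow or unreachable peer does not block other threads that read or update the failed set. A demand-based alternative would call the same probe only when a request targets a blacklisted session.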

Collaborator

@ShangmingCai ShangmingCai left a comment


LGTM.
CC: @alogfans please help check this PR
