[TE] Expose sendProbe via Python binding #2088
Code Review
This pull request introduces the send_probe method to the Transfer Engine, enabling users to verify peer reachability through a lightweight JSON-RPC probe. The update includes the C++ implementation, Python bindings, updated documentation, and new unit tests. A review comment suggests releasing the Python Global Interpreter Lock (GIL) during the sendProbe execution to avoid blocking other threads during this potentially synchronous network operation.
Exposes the existing TransferMetadata::sendProbe C++ method through the TransferEngine pybind module as engine.send_probe(peer_server_name). This enables SGLang's MooncakeKVManager to issue lightweight JSON-RPC probes against peers, used to test whether a previously-blacklisted mooncake_session_id has become reachable again so it can be removed from the failed_sessions set.
Returns 0 on success, non-zero on failure (matching the C++ contract). No behavior change for existing engine.* methods.
Tested:
- New Python unit tests in transfer_engine_initiator_test.py covering both the reachable-peer and unknown-peer cases.
- Manually validated end-to-end against SGLang's MooncakeKVManager.
Force-pushed from ee1fc28 to e02f196.
Thanks for this addition. This looks useful for the recovery logic of MooncakeKVManager. I have a quick question regarding the design. Usage scenario: could you clarify how send_probe is triggered? Are you implementing a background polling mechanism or a demand-based retry strategy for the blacklisted sessions?
@ShangmingCai, there are code changes relevant to sglang. PTAL.
I'm using a patch right now that does background polling, but I don't have a strong opinion on this. sgl-project/sglang#25287
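For reference, the background-polling option mentioned above could look roughly like the sketch below. It is not taken from the linked sglang patch: the `SessionProber` class, its method names, and the polling interval are all hypothetical, and it assumes only that the probe is a callable taking a peer name and returning 0 on success.

```python
import threading
from typing import Callable, Set


class SessionProber:
    """Hypothetical helper: periodically re-probes blacklisted sessions."""

    def __init__(self, send_probe: Callable[[str], int], interval_s: float = 5.0) -> None:
        self._send_probe = send_probe      # e.g. engine.send_probe (assumed signature)
        self._interval_s = interval_s      # illustrative default, not from the PR
        self._failed: Set[str] = set()
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def mark_failed(self, session_id: str) -> None:
        with self._lock:
            self._failed.add(session_id)

    def is_failed(self, session_id: str) -> bool:
        with self._lock:
            return session_id in self._failed

    def poll_once(self) -> None:
        # Snapshot under the lock, then probe without holding it so that
        # mark_failed() callers are never blocked by network I/O.
        with self._lock:
            candidates = list(self._failed)
        for session_id in candidates:
            if self._send_probe(session_id) == 0:  # 0 means reachable again
                with self._lock:
                    self._failed.discard(session_id)

    def _run(self) -> None:
        # Event.wait() returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._interval_s):
            self.poll_once()
```

A demand-based variant would call `poll_once()` (or probe a single session) right before a transfer is attempted instead of running the thread. Either way, a polling thread only helps if the binding does not hold the GIL for the duration of the probe, which is the concern raised in the code review summary.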
ShangmingCai left a comment
LGTM.
CC: @alogfans please help check this PR
Description
Exposes the existing TransferMetadata::sendProbe C++ method through the TransferEngine pybind module as engine.send_probe(peer_server_name). This enables SGLang's MooncakeKVManager to issue lightweight JSON-RPC probes against peers, used to test whether a previously-blacklisted mooncake_session_id has become reachable again so it can be removed from the failed_sessions set.
Returns 0 on success, non-zero on failure (matching the C++ contract). No behavior change for existing engine.* methods.
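A minimal sketch of how a caller might use the return-code contract described above. Only `send_probe(peer_server_name)` and its 0 / non-zero result come from this PR; `prune_reachable`, `ProbeCapable`, and `FakeEngine` are hypothetical names introduced so the example runs without a real engine, and the sketch assumes the session id can be passed directly as `peer_server_name`.

```python
from typing import Iterable, Protocol, Set


class ProbeCapable(Protocol):
    """Anything exposing send_probe(peer_server_name) -> int."""

    def send_probe(self, peer_server_name: str) -> int: ...


def prune_reachable(engine: ProbeCapable, failed_sessions: Set[str]) -> Set[str]:
    """Probe each blacklisted session and drop the ones that answer.

    Assumption: the session id doubles as the peer_server_name argument.
    Mutates failed_sessions in place and returns the recovered sessions.
    """
    recovered = {s for s in failed_sessions if engine.send_probe(s) == 0}
    failed_sessions -= recovered
    return recovered


class FakeEngine:
    """Hypothetical stand-in for the real engine so the sketch runs offline."""

    def __init__(self, reachable: Iterable[str]) -> None:
        self._reachable = set(reachable)

    def send_probe(self, peer_server_name: str) -> int:
        # 0 on success, non-zero on failure, mirroring the stated contract.
        return 0 if peer_server_name in self._reachable else -1


if __name__ == "__main__":
    failed = {"10.0.0.1:12345", "10.0.0.2:12345"}
    engine = FakeEngine(reachable=["10.0.0.1:12345"])
    print(prune_reachable(engine, failed))
    print(failed)
```

Comparing against 0 explicitly, rather than relying on truthiness, keeps the success check aligned with the C++ convention where 0 means success.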
Module
- mooncake-transfer-engine
- mooncake-store
- mooncake-ep
- mooncake-integration
- mooncake-p2p-store
- mooncake-wheel
- mooncake-pg
- mooncake-rl
Type of Change
How Has This Been Tested?
Checklist
./scripts/code_format.sh before submitting.