feat: expose ParameterServer metas via HTTP for cross-process join #90
Open
ddmm2020 wants to merge 4 commits into
Add two HTTP endpoints (GET/POST /v1/checkpoints/{name}/{metas,load-metas})
and a standalone `python -m checkpoint_engine.join_cli` entrypoint, so a
new ParameterServer instance can join an existing P2P weight world over
mooncake RDMA without re-reading checkpoints from disk.
Motivation: in elastic-rollout setups (e.g. mshrl), a long-running training
job already holds pinned CPU weight buffers registered with the mooncake
P2PStore. Newly-started inference replicas should be able to pull these
weights over RDMA instead of re-converting the checkpoint from disk.
Changes:
* api.py: GET /v1/checkpoints/{name}/metas returns pickle.dumps(ps.get_metas())
as application/octet-stream; POST /v1/checkpoints/{name}/load-metas accepts
the same bytes and feeds them into ps.load_metas(). Bad pickle is rejected
with 400; PS errors are surfaced as 500.
* join_cli.py: `python -m checkpoint_engine.join_cli` -- the join() flow from
examples/update.py, packaged as a first-class CLI under the published
package so consumers can invoke it without checking out the source tree.
Accepts metas from either a local pickle file or a remote HTTP URL.
* tests/test_api.py: 6 CPU-only tests covering pickle round-trip, ps-error
propagation, bad-input rejection, and a GET-then-POST chain that validates
the new endpoints are mutually consistent.
Verified end-to-end on a 2-node launchpad job: 14.5 GiB Qwen2.5-7B weights
transferred from main to elastic in 1.49s over real RDMA (4 mlx5_bond HCAs)
vs 6.49s over TCP fallback in environments without RDMA passthrough.
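For a rough sense of scale, the reported numbers work out to about 9.7 GiB/s over RDMA versus about 2.2 GiB/s over the TCP fallback, roughly a 4.4x speedup. The sketch below only redoes that arithmetic from the figures above; the helper name is illustrative and not part of checkpoint_engine.

```python
def throughput_gib_per_s(size_gib: float, seconds: float) -> float:
    """Average transfer rate in GiB/s for a transfer of size_gib GiB."""
    return size_gib / seconds


# Figures quoted in the PR description.
size_gib = 14.5
rdma_seconds = 1.49
tcp_seconds = 6.49

rdma_rate = throughput_gib_per_s(size_gib, rdma_seconds)
tcp_rate = throughput_gib_per_s(size_gib, tcp_seconds)
speedup = tcp_seconds / rdma_seconds

print(f"RDMA: {rdma_rate:.2f} GiB/s")
print(f"TCP:  {tcp_rate:.2f} GiB/s")
print(f"speedup: {speedup:.2f}x")
```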
HubertZhang
approved these changes
Jun 24, 2026
Replace pickle with pydantic TypeAdapter(dict[int, MemoryBufferMetaList]) for the metas wire format across the HTTP endpoints, join_cli, and examples/update.py. This reuses the existing pydantic schema (torch.dtype / torch.Size already have serializers in data_types.py), removes the arbitrary-code-execution risk of pickle.loads on request bodies, and makes the metas self-describing for cross-language consumers.
- api.py: GET /metas returns application/json; POST /load-metas validates via validate_json and returns 400 on ValidationError (was broad except).
- join_cli.py / examples/update.py: read/write metas as JSON; document the --metas-url HTTP path alongside --load-metas-file.
- tests/test_api.py: use real MemoryBufferMetaList fixtures; add a schema-mismatch case (valid JSON, wrong shape -> 400).
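To illustrate the arbitrary-code-execution point, here is a minimal stdlib-only sketch. It is not the PR's code: `Malicious` and `parse_metas_json` are hypothetical names, the `name` and `shape` fields are made up, and the hand-written shape check merely stands in for the pydantic validation the commit describes.

```python
import json
import pickle


class Malicious:
    """Stand-in for an attacker-crafted request body (illustrative only)."""

    def __reduce__(self):
        # pickle records "call eval('6 * 7')" and replays that call on load.
        return (eval, ("6 * 7",))


def parse_metas_json(body: bytes) -> dict:
    """Hypothetical stand-in for schema validation of a JSON metas body."""
    data = json.loads(body)  # builds only plain JSON values, runs no callables
    if not isinstance(data, dict):
        raise ValueError("metas must be a JSON object keyed by rank")
    parsed = {}
    for key, buffers in data.items():
        if not isinstance(buffers, list):
            raise ValueError(f"rank {key!r}: expected a list of buffer metas")
        parsed[int(key)] = buffers  # JSON object keys are strings
    return parsed


# Unpickling executes the attacker-chosen callable instead of yielding metas.
pickled_result = pickle.loads(pickle.dumps(Malicious()))

# The JSON path rejects the same bytes outright.
try:
    parse_metas_json(pickle.dumps(Malicious()))
    rejected = False
except ValueError:
    rejected = True

# Well-formed input is accepted and its keys are converted to ints.
good = parse_metas_json(b'{"0": [{"name": "w", "shape": [2, 3]}]}')
```

Here `pickle.loads` returns the result of the injected `eval` call rather than a metas dict, while the JSON path either raises or returns plain data.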
d083f34 to f3392da
    media_type="application/json",
)

@app.post("/v1/checkpoints/{checkpoint_name}/load-metas")
Collaborator
We can reuse the same URL, "/v1/checkpoints/{checkpoint_name}/metas", for both methods to make this HTTP API more RESTful.
Collaborator
I think these metas endpoints should not include the checkpoint_name param. Try simplifying the path to /v1/metas.
)

@app.post("/v1/checkpoints/{checkpoint_name}/load-metas")
async def load_metas(checkpoint_name: str, raw: Request) -> Response:
Collaborator
I found a more elegant approach. You can take the request body directly in the function signature, like
`async def load_metas(checkpoint_name: str, metas: dict[int, MemoryBufferMetaList]) -> Response:`
and then call `wrap_exception(lambda: ps.load_metas(metas))` directly.
@@ -0,0 +1,173 @@
"""checkpoint_engine.join_cli
Collaborator
There seems to be a lot of code duplicated from examples/update.py. Is it necessary to add this join_cli.py file?
added 2 commits
June 26, 2026 11:30
join_cli was a duplicate of examples/update.py:join() with an extra --metas-url flag. Add the flag to update.py, remove join_cli.
The checkpoint_name path param was never read — ps.get_metas() and ps.load_metas() act on a single global field. Drop it from the URL.
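A minimal sketch of how a join script might choose between the two metas sources. The flag names `--load-metas-file` and `--metas-url` come from the commits above; everything else (making the flags mutually exclusive and required, the function names, the timeout) is an assumption for illustration and may not match examples/update.py.

```python
import argparse
import os
import tempfile
import urllib.request


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="join sketch (illustrative)")
    # Assumption: exactly one metas source must be given.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--load-metas-file", help="path to a local metas JSON file")
    source.add_argument("--metas-url", help="URL of a running server's metas endpoint")
    return parser


def read_metas_bytes(args: argparse.Namespace, timeout: float = 10.0) -> bytes:
    """Return the raw JSON metas from whichever source was selected."""
    if args.load_metas_file is not None:
        with open(args.load_metas_file, "rb") as f:
            return f.read()
    with urllib.request.urlopen(args.metas_url, timeout=timeout) as resp:
        return resp.read()


# Usage with a local file; the HTTP path goes through the same function.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "metas.json")
    with open(path, "wb") as f:
        f.write(b'{"0": []}')
    args = build_parser().parse_args(["--load-metas-file", path])
    loaded = read_metas_bytes(args)
```

Returning raw bytes keeps the file and HTTP paths identical up to the point where the metas are validated and loaded.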
Expose ParameterServer metas over HTTP for cross-process P2P join
Let a newly started inference replica pull weights from an existing replica over RDMA instead of reloading them from disk. Replicas started by checkpoint-engine already hold pinned CPU weight buffers registered with the mooncake P2PStore; these endpoints expose those metas so a new replica can RDMA-pull the weights directly.
Changes
Verified: end-to-end on a 2-node job — 14.5 GiB Qwen2.5-7B transferred from main to elastic in 1.5s over RDMA (4× mlx5_bond HCAs), vs ~6.5s over TCP fallback.