[WIP] Initial work to add llm-d-vllm framework with H200 #1660

Open
ezrasilvera wants to merge 8 commits into SemiAnalysisAI:main from ezrasilvera:llm-d-initial

ezrasilvera (Collaborator) commented Jun 4, 2026

WIP - don't merge. Initial testing of llm-d integration


Note

Medium Risk
New privileged multi-node Docker/SLURM orchestration and routing stack; limited to benchmark infra but complex failure modes and WIP image pin.

Overview
Adds a new llm-d-vllm multi-node benchmark path on H200: prefill/decode disaggregation with vLLM, llm-d EPP routing, Envoy, and NIXL KV transfer, orchestrated entirely by InferenceX SLURM scripts (similar to the AMD sglang-disagg flow, not srt-slurm).

Benchmark config: Registers dsr1-fp8-h200-llm-d-vllm-simple in nvidia-master.yaml — 1 prefill + 1 decode node, EP=8 dp-attn per side, fixed 1k/1k seq lengths, and CONFIG_FILE=dsr1-fp8-h200-1p1d-simple.yaml (no wide-EP / DeepEP / NVSHMEM ibgda for this “phase 0” shape).

Image & static config: New benchmarks/llm-d/ Dockerfile (llm-d-cuda + EPP + pd-sidecar + Envoy), default epp-config.yaml (file-discovery to /tmp/endpoints.yaml), and envoy.yaml (ext_proc → EPP, ORIGINAL_DST on x-gateway-destination-endpoint).

Runtime: benchmarks/multi_node/llm-d/ (submit.sh, job.slurm, server.sh) allocates nodes, runs one privileged Docker container per rank, starts vLLM + pd-sidecar on leaders, and on the decode leader writes endpoints.yaml, starts EPP/Envoy, runs benchmark_serving concurrency sweeps, and signals job teardown via a done marker. Recipe llm-d-recipes/dsr1-fp8-h200-1p1d-simple.yaml overrides EPP scheduling and per-role vLLM args. Wrapper dsr1_fp8_h200_llm-d-vllm.sh feeds submit.sh.
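
A minimal sketch of the done-marker handshake described above, assuming the decode leader touches a file on storage the host can see once the sweep finishes. The function name, marker path, and timeout values are hypothetical and are not taken from job.slurm or server.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll for a done-marker file until it appears or a
# timeout expires. Arguments: marker path, timeout (s), poll interval (s).
wait_for_done_marker() {
  local marker="$1" timeout_s="${2:-3600}" interval_s="${3:-5}"
  local waited=0
  while [ ! -e "$marker" ]; do
    if [ "$waited" -ge "$timeout_s" ]; then
      echo "timed out after ${timeout_s}s waiting for ${marker}" >&2
      return 1
    fi
    sleep "$interval_s"
    waited=$((waited + interval_s))
  done
  echo "done marker found: ${marker}"
}
```

The host side could call something like `wait_for_done_marker "$LOG_DIR/benchmark.done" 7200 10` before tearing the job down; `LOG_DIR` and the file name are placeholders as well.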

Runner: launch_h200-dgxc-slurm.sh exports SLURM account/partition, dispatches FRAMEWORK=llm-d-vllm to the wrapper, tails SLURM logs, copies JSON/eval artifacts, and lists llm-d-vllm among supported multinode frameworks.
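
The dispatch step can be pictured as a small case statement. This is a sketch only: the function name is hypothetical, and only the FRAMEWORK value and the wrapper file name come from the description above.

```shell
#!/usr/bin/env bash
# Hypothetical dispatch helper: map a FRAMEWORK value to the wrapper script
# that feeds submit.sh. Unknown frameworks fail loudly instead of falling
# through to some default.
resolve_wrapper() {
  local framework="$1"
  case "$framework" in
    llm-d-vllm)
      echo "dsr1_fp8_h200_llm-d-vllm.sh"
      ;;
    *)
      echo "unsupported multinode framework: ${framework}" >&2
      return 1
      ;;
  esac
}
```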

Reviewed by Cursor Bugbot for commit 95b8b2f.

…1D recipes

Adds the llm-d-vllm benchmark framework, with two H200 multi-node
DeepSeek-R1 fp8 P/D disagg recipes:

  dsr1-fp8-h200-llm-d-vllm
    Wide-EP shape mirroring the upstream llm-d wide-EP-lws guide:
    1 prefill instance + 1 decode instance, each spanning 2 H200
    nodes, DP=16 EP=16, ISL=2k/OSL=2k. Total 4 H200 nodes / 32 GPUs.

  dsr1-fp8-h200-llm-d-vllm-simple
    Phase 0 single-node-per-role 1P+1D shape (DP=8 EP=8 dp-attn,
    NIXL P-to-D KV transfer, no DeepEP / ibgda) for an apples-to-
    apples comparison vs Dynamo's H200 1P+1D entries.
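
As a quick sanity check on the node and GPU counts of the two shapes, the arithmetic can be written out. This is a sketch assuming 8 GPUs per H200 node, which matches the stated 4 nodes / 32 GPUs; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: total GPUs = instances * nodes per instance * GPUs per node.
total_gpus() {
  local instances="$1" nodes_per_instance="$2" gpus_per_node="$3"
  echo $(( instances * nodes_per_instance * gpus_per_node ))
}

# Wide-EP shape: 1 prefill + 1 decode instance, 2 nodes each.
total_gpus 2 2 8   # prints 32
# Simple shape: 1 prefill + 1 decode instance, 1 node each.
total_gpus 2 1 8   # prints 16
```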

Both recipes point at ghcr.io/ezrasilvera/llm-d-nokube-vllm:v0.7.0,
the combined image (ghcr.io/llm-d/llm-d-cuda:v0.7.0 base + EPP +
pd-sidecar + envoy on top). EPP and pd-sidecar binaries come from
the upstream dev tags
(ghcr.io/llm-d/llm-d-router-endpoint-picker-dev:main,
 ghcr.io/llm-d/llm-d-router-disagg-sidecar-dev:main).

Configs (epp-config.yaml, envoy.yaml) are NOT baked into the image.
job.slurm bind-mounts them at /etc/epp/config.yaml and
/etc/envoy/envoy.yaml so config-only iteration does not require an
image rebuild.
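
A minimal sketch of how such bind mounts might be assembled, assuming the two config files sit together in one host directory. The function name and the read-only (`:ro`) suffix are assumptions; only the container paths come from the text above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the docker arguments (one per line) that
# bind-mount the EPP and Envoy configs from a host directory.
config_mount_args() {
  local config_dir="$1"
  printf '%s\n' \
    -v "${config_dir}/epp-config.yaml:/etc/epp/config.yaml:ro" \
    -v "${config_dir}/envoy.yaml:/etc/envoy/envoy.yaml:ro"
}

# Usage sketch (not executed here; IMAGE is a placeholder):
#   mapfile -t mounts < <(config_mount_args "$PWD/benchmarks/llm-d")
#   docker run --rm "${mounts[@]}" "$IMAGE" ...
```

Because the files are mounted rather than copied, editing either file on the host and resubmitting the job is enough to pick up the change.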

EPP config mirrors the upstream well-lit-path
guides/pd-disaggregation/router/pd-disaggregation.values.yaml in
github.com/llm-d/llm-d - same plugin set (disagg-headers-handler,
always-disagg-pd-decider, disagg-profile-handler, prefill-filter,
decode-filter, prefix-cache-scorer, queue-scorer,
kv-cache-utilization-scorer, active-request-scorer,
max-score-picker), same prefill/decode profiles and weights. The
only delta vs upstream is the file-discovery plugin pointing at
/tmp/endpoints.yaml; this benchmark runs under SLURM so there is no
K8s control plane to drive endpoint discovery. The coordinator node
writes /tmp/endpoints.yaml at job start; per-recipe variants live
under benchmarks/multi_node/llm-d-recipes/.
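
To illustrate the coordinator step, here is a sketch of writing the endpoints file atomically so the EPP never reads a half-written file. The real file-discovery schema is not shown in this message, so the field names (`endpoints`, `role`, `address`, `namespace`) are placeholders, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write an endpoints file from role=host:port pairs.
# NOTE: the YAML field names are illustrative only and are not the
# file-discovery plugin's documented schema.
write_endpoints_file() {
  local out="$1" namespace="$2"
  shift 2
  local spec role addr
  {
    echo "endpoints:"
    for spec in "$@"; do
      role="${spec%%=*}"   # text before the first '='
      addr="${spec#*=}"    # text after the first '='
      echo "  - role: ${role}"
      echo "    address: ${addr}"
      echo "    namespace: ${namespace}"
    done
  } > "${out}.tmp"
  # Rename into place so readers see either the old or the new file.
  mv "${out}.tmp" "$out"
}
```

A call such as `write_endpoints_file /tmp/endpoints.yaml inferencex prefill=10.0.0.1:8000 decode=10.0.0.2:8000` would produce two entries; the addresses and ports are made up for the example.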

envoy.yaml configures ext_proc for the dev EPP. The request and
response body modes must be FULL_DUPLEX_STREAMED, and the trailer
and response-header modes must be SEND: the dev EPP does not ack
BUFFERED body mode, so Envoy times out and returns 504s.
message_timeout is 1000s to mirror the upstream guide, since
generating a single message can take many seconds.
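
The ext_proc settings above might look roughly like the fragment printed by this sketch. The field names follow Envoy's ext_proc v3 `ExternalProcessor` and `ProcessingMode` messages; the function name and the `epp` cluster name are assumptions, and setting both trailer modes to SEND is one reading of "trailer" above, not a copy of the PR's envoy.yaml.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print an ext_proc HTTP filter fragment for envoy.yaml.
# The cluster name defaults to "epp" (an assumed name).
print_ext_proc_filter() {
  local epp_cluster="${1:-epp}"
  cat <<EOF
- name: envoy.filters.http.ext_proc
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.ext_proc.v3.ExternalProcessor
    grpc_service:
      envoy_grpc:
        cluster_name: ${epp_cluster}
    processing_mode:
      response_header_mode: SEND
      request_body_mode: FULL_DUPLEX_STREAMED
      response_body_mode: FULL_DUPLEX_STREAMED
      request_trailer_mode: SEND
      response_trailer_mode: SEND
    message_timeout: 1000s
EOF
}
```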

server.sh branches on LWS_GROUP_SIZE: the cross-process DP
coordination flags (--data-parallel-hybrid-lb, size-local, address,
rpc-port, start-rank) are only set when an instance spans more than
one node. /tmp/endpoints.yaml entries carry namespace=inferencex so
that the file-discovery plugin does not drop them (it filters by
namespace; EPP runs with --pool-namespace=inferencex).
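
The LWS_GROUP_SIZE branch can be sketched as follows. The flag names are vLLM's data-parallel options, expanded from the shorthand above; the function name, argument order, and example values are hypothetical and not lifted from server.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: fill the global DP_ARGS array with the cross-process
# data-parallel flags, but only when an instance spans more than one node.
# Arguments: group size, local DP size, leader address, RPC port, start rank.
build_dp_args() {
  local group_size="${1:-1}"
  DP_ARGS=()
  if [ "$group_size" -gt 1 ]; then
    local size_local="$2" leader_addr="$3" rpc_port="$4" start_rank="$5"
    DP_ARGS+=(
      --data-parallel-hybrid-lb
      --data-parallel-size-local "$size_local"
      --data-parallel-address "$leader_addr"
      --data-parallel-rpc-port "$rpc_port"
      --data-parallel-start-rank "$start_rank"
    )
  fi
}

# Usage sketch (values are made up): second node of a 2-node instance.
#   build_dp_args "$LWS_GROUP_SIZE" 8 10.0.0.1 5555 8
#   vllm serve "$MODEL" "${DP_ARGS[@]}" ...
```

Keeping the flags in an array means the single-node case passes nothing extra, rather than passing flags with empty values.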

runners/launch_h200-dgxc-slurm.sh gains the llm-d-vllm dispatch hook
so the multinode benchmark template can launch this framework.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
@claude (bot, Contributor) left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Comment thread benchmarks/multi_node/llm-d/server.sh Outdated
Comment thread benchmarks/multi_node/llm-d/server.sh Outdated
Comment thread benchmarks/multi_node/llm-d/server.sh Outdated
…h waits, C3 NVSHMEM gate)

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
Comment thread runners/launch_h200-dgxc-slurm.sh
Comment thread benchmarks/multi_node/llm-d/submit.sh
Comment thread benchmarks/multi_node/llm-d-recipes/dsr1-fp8-h200-1p1d-wideep.yaml Outdated
Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
Comment thread benchmarks/multi_node/llm-d/job.slurm Outdated
Comment thread benchmarks/multi_node/llm-d/server.sh
…rs, C5 recipe path, C7 docker filter, C8 EPP wait)

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
Comment thread benchmarks/multi_node/llm-d/server.sh Outdated
…rdown instead of in-container scancel)

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
Comment thread benchmarks/multi_node/llm-d/server.sh Outdated
…fault NIC for NCCL/GLOO/NVSHMEM)

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
Comment thread benchmarks/multi_node/llm-d/server.sh Outdated
…to path and served-name)

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
@cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 873fe04.

Comment thread benchmarks/multi_node/llm-d/server.sh
…ing through

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>