[WIP] Initial work to add llm-d-vllm framework with H200 by ezrasilvera · Pull Request #1660 · SemiAnalysisAI/InferenceX

ezrasilvera · 2026-06-04T10:26:43Z

WIP - don't merge. Initial testing of llm-d integration

Note

Medium Risk
New privileged multi-node Docker/SLURM orchestration and routing stack; limited to benchmark infra but complex failure modes and WIP image pin.

Overview
Adds a new llm-d-vllm multi-node benchmark path on H200: prefill/decode disaggregation with vLLM, llm-d EPP routing, Envoy, and NIXL KV transfer, orchestrated entirely by InferenceX SLURM scripts (similar to the AMD sglang-disagg flow, not srt-slurm).

Benchmark config: Registers dsr1-fp8-h200-llm-d-vllm-simple in nvidia-master.yaml — 1 prefill + 1 decode node, EP=8 dp-attn per side, fixed 1k/1k seq lengths, and CONFIG_FILE=dsr1-fp8-h200-1p1d-simple.yaml (no wide-EP / DeepEP / NVSHMEM ibgda for this “phase 0” shape).

Image & static config: New benchmarks/llm-d/ Dockerfile (llm-d-cuda + EPP + pd-sidecar + Envoy), default epp-config.yaml (file-discovery to /tmp/endpoints.yaml), and envoy.yaml (ext_proc → EPP, ORIGINAL_DST on x-gateway-destination-endpoint).
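The commit message in this PR notes that the file-discovery plugin filters endpoints by namespace and that EPP runs with --pool-namespace=inferencex, so entries in /tmp/endpoints.yaml must carry namespace=inferencex or they are dropped. The sketch below illustrates that filtering rule only. The entry fields (name, address, port) and the function name are hypothetical; the real endpoints.yaml schema is not shown in this PR page.

```python
# Namespace taken from the PR text (EPP runs with --pool-namespace=inferencex).
POOL_NAMESPACE = "inferencex"


def keep_in_namespace(endpoints, namespace=POOL_NAMESPACE):
    """Return only the entries whose 'namespace' matches the pool namespace.

    Hypothetical stand-in for the filtering the file-discovery plugin is
    described as doing; entries without a namespace are dropped.
    """
    return [e for e in endpoints if e.get("namespace") == namespace]


# Hypothetical entries: field names and addresses are illustrative only.
endpoints = [
    {"name": "prefill-0", "address": "10.0.0.1", "port": 8000, "namespace": "inferencex"},
    {"name": "decode-0", "address": "10.0.0.2", "port": 8000},  # no namespace set
]

kept = keep_in_namespace(endpoints)
```

Here only the prefill-0 entry survives, which is the failure mode the namespace tag is meant to avoid: an untagged decode endpoint would silently disappear from the pool.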

Runtime: benchmarks/multi_node/llm-d/ (submit.sh, job.slurm, server.sh) allocates nodes, runs one privileged Docker container per rank, starts vLLM + pd-sidecar on leaders, and on the decode leader writes endpoints.yaml, starts EPP/Envoy, runs benchmark_serving concurrency sweeps, and signals job teardown via a done marker. Recipe llm-d-recipes/dsr1-fp8-h200-1p1d-simple.yaml overrides EPP scheduling and per-role vLLM args. Wrapper dsr1_fp8_h200_llm-d-vllm.sh feeds submit.sh.
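The done-marker handoff described above can be sketched as a polling loop with a timeout. This is a minimal illustration under stated assumptions, not the logic in job.slurm or server.sh: the function name, the marker filename, and the timeout values are hypothetical, and the PR's scripts are shell rather than Python.

```python
import time
from pathlib import Path


def wait_for_done_marker(marker: Path, timeout_s: float, poll_s: float = 1.0) -> bool:
    """Poll until `marker` exists or `timeout_s` elapses.

    Returns True if the marker appeared, False on timeout. The caller
    (for example, the job script outside the container) would then tear
    the job down in either case.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if marker.exists():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

A caller might use it as `wait_for_done_marker(Path("/some/shared/dir/benchmark.done"), timeout_s=3600)`, where the path is a placeholder for whatever location the decode leader and the job script both see. Polling a file on a shared mount keeps the signal independent of any in-container SLURM tooling.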

Runner: launch_h200-dgxc-slurm.sh exports SLURM account/partition, dispatches FRAMEWORK=llm-d-vllm to the wrapper, tails SLURM logs, copies JSON/eval artifacts, and lists llm-d-vllm among supported multinode frameworks.

^{Reviewed by Cursor Bugbot for commit 95b8b2f. Bugbot is set up for automated code reviews on this repo.}

…1D recipes

Adds the llm-d-vllm benchmark framework, with two H200 multi-node DeepSeek-R1 fp8 P/D disagg recipes:

dsr1-fp8-h200-llm-d-vllm
    Wide-EP shape mirroring the upstream llm-d wide-EP-lws guide: 1 prefill instance + 1 decode instance, each spanning 2 H200 nodes, DP=16 EP=16, ISL=2k/OSL=2k. Total 4 H200 nodes / 32 GPUs.

dsr1-fp8-h200-llm-d-vllm-simple
    Phase 0 single-node-per-role 1P+1D shape (DP=8 EP=8 dp-attn, NIXL P-to-D KV transfer, no DeepEP / ibgda) for an apples-to-apples comparison vs Dynamo's H200 1P+1D entries.

Both recipes point at ghcr.io/ezrasilvera/llm-d-nokube-vllm:v0.7.0, the combined image (ghcr.io/llm-d/llm-d-cuda:v0.7.0 base + EPP + pd-sidecar + envoy on top). EPP and pd-sidecar binaries come from the upstream dev tags (ghcr.io/llm-d/llm-d-router-endpoint-picker-dev:main, ghcr.io/llm-d/llm-d-router-disagg-sidecar-dev:main).

Configs (epp-config.yaml, envoy.yaml) are NOT baked into the image. job.slurm bind-mounts them at /etc/epp/config.yaml and /etc/envoy/envoy.yaml so config-only iteration does not require an image rebuild.

EPP config mirrors the upstream well-lit-path guides/pd-disaggregation/router/pd-disaggregation.values.yaml in github.com/llm-d/llm-d - same plugin set (disagg-headers-handler, always-disagg-pd-decider, disagg-profile-handler, prefill-filter, decode-filter, prefix-cache-scorer, queue-scorer, kv-cache-utilization-scorer, active-request-scorer, max-score-picker), same prefill/decode profiles and weights. The only delta vs upstream is the file-discovery plugin pointing at /tmp/endpoints.yaml; this benchmark runs under SLURM so there is no K8s control plane to drive endpoint discovery. The coordinator node writes /tmp/endpoints.yaml at job start; per-recipe variants live under benchmarks/multi_node/llm-d-recipes/.

envoy.yaml ext_proc is configured for the dev EPP. Setting request/response body mode to FULL_DUPLEX_STREAMED and trailer + response-header modes to SEND is required because the dev EPP does not ack BUFFERED body mode and Envoy times out with 504s. message_timeout is 1000s to mirror the upstream guide; per-message generation can take many seconds.

server.sh branches on LWS_GROUP_SIZE: the cross-process DP coordination flags (--data-parallel-hybrid-lb, size-local, address, rpc-port, start-rank) are only set when an instance spans more than one node.

/tmp/endpoints.yaml entries carry namespace=inferencex so that the file-discovery plugin does not drop them (it filters by namespace; EPP runs with --pool-namespace=inferencex).

runners/launch_h200-dgxc-slurm.sh gains the llm-d-vllm dispatch hook so the multinode benchmark template can launch this framework.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
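The LWS_GROUP_SIZE branching in server.sh can be illustrated with a small sketch. It assumes the abbreviated names in the commit message expand to vLLM's --data-parallel-* options, and that DP ranks are split evenly across the nodes of an instance so that each node's start rank is its node index times the local DP size. The function name, its parameters, the address, and the port are hypothetical; the PR's actual script is shell and may compute these values differently.

```python
def dp_coordination_args(group_size, dp_size, node_rank, leader_addr, rpc_port):
    """Build the cross-node DP flags, or nothing for a single-node instance.

    group_size  -- nodes per instance (the role LWS_GROUP_SIZE plays in the PR)
    dp_size     -- total data-parallel size of the instance
    node_rank   -- 0-based index of this node within the instance
    leader_addr -- address of the instance's leader node (hypothetical value)
    rpc_port    -- DP RPC port (hypothetical value)
    """
    if group_size <= 1:
        # Single-node instance: no cross-process DP coordination flags.
        return []
    if dp_size % group_size != 0:
        raise ValueError("dp_size must divide evenly across nodes")
    local = dp_size // group_size  # DP ranks hosted on each node (assumed even split)
    return [
        "--data-parallel-hybrid-lb",
        "--data-parallel-size-local", str(local),
        "--data-parallel-address", leader_addr,
        "--data-parallel-rpc-port", str(rpc_port),
        "--data-parallel-start-rank", str(node_rank * local),
    ]


# Wide-EP shape from the PR: DP=16 over 2 nodes, second node of the instance.
wide_ep_args = dp_coordination_args(2, 16, 1, "10.0.0.1", 13345)
# Simple shape from the PR: DP=8 on a single node, so no extra flags.
simple_args = dp_coordination_args(1, 8, 0, "10.0.0.1", 13345)
```

Under these assumptions the wide-EP recipe gets a local DP size of 8 per node and start ranks 0 and 8, while the simple 1P+1D recipe launches vLLM without any of the coordination flags.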

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

…h waits, C3 NVSHMEM gate) Signed-off-by: Ezra Silvera <ezra@il.ibm.com>

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>

ezrasilvera · 2026-06-04T12:41:05Z

WIP - don't merge. Initial testing of llm-d integration

…rs, C5 recipe path, C7 docker filter, C8 EPP wait) Signed-off-by: Ezra Silvera <ezra@il.ibm.com>

…rdown instead of in-container scancel) Signed-off-by: Ezra Silvera <ezra@il.ibm.com>

…fault NIC for NCCL/GLOO/NVSHMEM) Signed-off-by: Ezra Silvera <ezra@il.ibm.com>

…to path and served-name) Signed-off-by: Ezra Silvera <ezra@il.ibm.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


^{Reviewed by Cursor Bugbot for commit 873fe04.}

…ing through Signed-off-by: Ezra Silvera <ezra@il.ibm.com>

ezrasilvera requested a review from a team June 4, 2026 10:26

ezrasilvera requested a review from kedarpotdar-nv as a code owner June 4, 2026 10:26

github-project-automation Bot added this to InferenceMAX Board Jun 4, 2026

ezrasilvera requested a review from jgangani as a code owner June 4, 2026 10:26

claude Bot reviewed Jun 4, 2026
