[OMNIML-4740] synth_support#1496
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR adds a new vLLM offline speculative decoding pipeline configuration for Kimi-K2.5 and enhances launcher service utilities with Slurm-aware rank detection and cross-rank coordinated dependency installation to prevent race conditions in distributed execution environments. ChangesLauncher examples and service utilities
🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
Closing — OMNIML-4735 Epic wrap-up. Used as the first real-release dispatch on the fully-consolidated nmm-sandbox stack (Phase 3 production validation); not pursuing the YAML merge today. |
|
|
Re-opening — closing this PR earlier was premature cleanup. OMNIML-4735 is now being driven to an actual release; this PR is the synth_support deliverable awaiting real review + merge. |
1b02102 to
006e8b2
Compare
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
tools/launcher/examples/moonshotai/Kimi-K2.5 DFlash/hf_offline_eagle3.yaml (1)
104-104: ⚡ Quick winPin the VLLM container to an immutable version/digest.
Using
vllm/vllm-openai:latestmakes benchmark results non-reproducible across runs; pin to a tested tag or digest.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/launcher/examples/moonshotai/Kimi-K2.5` DFlash/hf_offline_eagle3.yaml at line 104, The container image is pinned to the floating tag "vllm/vllm-openai:latest" under the container key; replace this with an immutable tag or digest (for example a specific semver tag or a sha256 digest) that you've tested for benchmarks so results are reproducible, e.g. change the container value from "vllm/vllm-openai:latest" to a concrete string like "vllm/vllm-openai:<tested-tag>" or "vllm/vllm-openai@sha256:<digest>".
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/launcher/examples/moonshotai/Kimi-K2.5` DFlash/hf_offline_eagle3.yaml:
- Line 73: The training output path (training.output_dir) is set to
/scratchspace/eagle3 but later steps (task_3 / the draft weights reader) expect
artifacts at /scratchspace/export; change training.output_dir to match the
benchmark input path (/scratchspace/export) or update the task_3 draft weights
path to /scratchspace/eagle3 so both produce and consume the same directory
(refer to the training.output_dir key and the task_3 draft weights
reference/train_eagle.sh usage to locate where to update).
- Around line 39-40: The YAML is missing required environment variables for the
new model config: add MLM_MODEL_CFG (set to the HuggingFace repo ID for the
model) and QUANT_CFG (e.g., NVFP4_DEFAULT_CFG or INT8_DEFAULT_CFG) to every
environment block that defines task envs (the existing blocks that contain
HF_LOCAL) and create an environment block for task_2 that includes HF_LOCAL plus
MLM_MODEL_CFG and QUANT_CFG so the launcher YAML conforms to the model-config
contract; ensure the variable names are spelled exactly as MLM_MODEL_CFG and
QUANT_CFG in each task's environment list.
---
Nitpick comments:
In `@tools/launcher/examples/moonshotai/Kimi-K2.5` DFlash/hf_offline_eagle3.yaml:
- Line 104: The container image is pinned to the floating tag
"vllm/vllm-openai:latest" under the container key; replace this with an
immutable tag or digest (for example a specific semver tag or a sha256 digest)
that you've tested for benchmarks so results are reproducible, e.g. change the
container value from "vllm/vllm-openai:latest" to a concrete string like
"vllm/vllm-openai:<tested-tag>" or "vllm/vllm-openai@sha256:<digest>".
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 66ea5a13-65b0-4b34-81d6-f8fa01cc5581
📒 Files selected for processing (1)
tools/launcher/examples/moonshotai/Kimi-K2.5 DFlash/hf_offline_eagle3.yaml
| environment: | ||
| - HF_LOCAL: /hf-local |
There was a problem hiding this comment.
Add required MLM_MODEL_CFG and QUANT_CFG env vars for this new model config.
Line 39, Line 57, and Line 97 define task envs, but MLM_MODEL_CFG and QUANT_CFG are missing, and task_2 (Line 67-83) has no environment block at all. This violates the model-config contract for launcher YAMLs.
Proposed patch
global_vars:
hf_model: /hf-local/moonshotai/Kimi-K2.5 DFlash
+ hf_repo_id: moonshotai/Kimi-K2.5-DFlash
@@
task_0:
@@
environment:
- HF_LOCAL: /hf-local
+ - MLM_MODEL_CFG: <<global_vars.hf_repo_id>>
+ - QUANT_CFG: NVFP4_DEFAULT_CFG
@@
task_1:
@@
environment:
- HF_MODEL_CKPT: <<global_vars.hf_model>>
+ - MLM_MODEL_CFG: <<global_vars.hf_repo_id>>
+ - QUANT_CFG: NVFP4_DEFAULT_CFG
@@
task_2:
@@
+ environment:
+ - MLM_MODEL_CFG: <<global_vars.hf_repo_id>>
+ - QUANT_CFG: NVFP4_DEFAULT_CFG
@@
task_3:
@@
environment:
- HF_MODEL_CKPT: <<global_vars.hf_model>>
+ - MLM_MODEL_CFG: <<global_vars.hf_repo_id>>
+ - QUANT_CFG: NVFP4_DEFAULT_CFGAs per coding guidelines, "Set MLM_MODEL_CFG environment variable to the HuggingFace repo ID when adding a new model config" and "Set QUANT_CFG environment variable (e.g., NVFP4_DEFAULT_CFG, INT8_DEFAULT_CFG) when adding a new model config".
Also applies to: 57-58, 67-83, 97-99
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/launcher/examples/moonshotai/Kimi-K2.5` DFlash/hf_offline_eagle3.yaml
around lines 39 - 40, The YAML is missing required environment variables for the
new model config: add MLM_MODEL_CFG (set to the HuggingFace repo ID for the
model) and QUANT_CFG (e.g., NVFP4_DEFAULT_CFG or INT8_DEFAULT_CFG) to every
environment block that defines task envs (the existing blocks that contain
HF_LOCAL) and create an environment block for task_2 that includes HF_LOCAL plus
MLM_MODEL_CFG and QUANT_CFG so the launcher YAML conforms to the model-config
contract; ensure the variable names are spelled exactly as MLM_MODEL_CFG and
QUANT_CFG in each task's environment list.
| - --config modules/Model-Optimizer/modelopt_recipes/general/speculative_decoding/eagle3.yaml | ||
| - model.model_name_or_path=<<global_vars.hf_model>> | ||
| - data.offline_data_path=/scratchspace/offline_hidden_states | ||
| - training.output_dir=/scratchspace/eagle3 |
There was a problem hiding this comment.
Align the training output path with benchmark input path.
Line 73 writes artifacts to /scratchspace/eagle3, but Line 88 reads draft weights from /scratchspace/export. Unless train_eagle.sh exports to /scratchspace/export implicitly, task_3 will not find the draft model.
Proposed patch
- - --draft_model_dir /scratchspace/export
+ - --draft_model_dir /scratchspace/eagle3Also applies to: 88-88
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/launcher/examples/moonshotai/Kimi-K2.5` DFlash/hf_offline_eagle3.yaml
at line 73, The training output path (training.output_dir) is set to
/scratchspace/eagle3 but later steps (task_3 / the draft weights reader) expect
artifacts at /scratchspace/export; change training.output_dir to match the
benchmark input path (/scratchspace/export) or update the task_3 draft weights
path to /scratchspace/eagle3 so both produce and consume the same directory
(refer to the training.output_dir key and the task_3 draft weights
reference/train_eagle.sh usage to locate where to update).
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1496 +/- ##
==========================================
- Coverage 77.44% 76.93% -0.51%
==========================================
Files 473 473
Lines 51418 52473 +1055
==========================================
+ Hits 39819 40370 +551
- Misses 11599 12103 +504
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
🚧 Blocked on input checkpoint — not yet ready to merge. Status
What's blockingThis PR's YAML references
Unblock pathA human needs to:
The pre-merge mandate of "real cluster test evidence" is the policy this babysit run is enforcing — and is being codified in pensieve-intern MR !4 so the next synth_support agent run can't ship a YAML without it. (Driven by /babysit-jira on OMNIML-4740.) |
006e8b2 to
c3eef71
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
tools/launcher/examples/moonshotai/Kimi-K2.5-DFlash/hf_offline_eagle3.yaml (2)
85-85:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFix task_2 → task_3 artifact path mismatch.
Line 85 writes to
/scratchspace/eagle3, but Line 100 reads from/scratchspace/export. This can break stage 4 input resolution.Proposed patch
- - --draft_model_dir /scratchspace/export + - --draft_model_dir /scratchspace/eagle3Also applies to: 100-100
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/launcher/examples/moonshotai/Kimi-K2.5-DFlash/hf_offline_eagle3.yaml` at line 85, The artifact path for the training output is inconsistent between tasks: training.output_dir is set to /scratchspace/eagle3 while later steps expect /scratchspace/export, causing task_2 → task_3 input resolution to fail; update the value of training.output_dir (the training.output_dir key in the YAML) to match the downstream artifact path (/scratchspace/export) or change the downstream read path to /scratchspace/eagle3 so both producer (training.output_dir) and consumer use the same artifact directory.
47-49:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAdd required
MLM_MODEL_CFGandQUANT_CFGto every task env.These required model-config env vars are missing in
task_0,task_1, andtask_3, andtask_2has noenvironmentblock at all.Proposed patch
global_vars: hf_model: /hf-local/moonshotai/Kimi-K2.6 + hf_repo_id: moonshotai/Kimi-K2.6 @@ task_0: @@ environment: - HF_LOCAL: /hf-local + - MLM_MODEL_CFG: <<global_vars.hf_repo_id>> + - QUANT_CFG: NVFP4_DEFAULT_CFG - VLLM_STARTUP_TIMEOUT: "1800" # Kimi-K2.6 weight load alone is ~7.7 min @@ task_1: @@ environment: - HF_MODEL_CKPT: <<global_vars.hf_model>> + - MLM_MODEL_CFG: <<global_vars.hf_repo_id>> + - QUANT_CFG: NVFP4_DEFAULT_CFG @@ task_2: @@ + environment: + - MLM_MODEL_CFG: <<global_vars.hf_repo_id>> + - QUANT_CFG: NVFP4_DEFAULT_CFG slurm_config: @@ task_3: @@ environment: - HF_MODEL_CKPT: <<global_vars.hf_model>> + - MLM_MODEL_CFG: <<global_vars.hf_repo_id>> + - QUANT_CFG: NVFP4_DEFAULT_CFGAs per coding guidelines, "Set
MLM_MODEL_CFGenvironment variable to the HuggingFace repo ID when adding a new model config" and "SetQUANT_CFGenvironment variable (e.g.,NVFP4_DEFAULT_CFG,INT8_DEFAULT_CFG) when adding a new model config".Also applies to: 69-70, 79-95, 109-110
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/launcher/examples/moonshotai/Kimi-K2.5-DFlash/hf_offline_eagle3.yaml` around lines 47 - 49, The tasks in this YAML are missing required environment vars: add MLM_MODEL_CFG and QUANT_CFG to the environment lists for task_0, task_1, and task_3, and create an environment block for task_2 with those same vars; specifically set MLM_MODEL_CFG to the HuggingFace repo ID used by this model and QUANT_CFG to the chosen quantization profile (e.g., NVFP4_DEFAULT_CFG or INT8_DEFAULT_CFG), ensuring the variable names match exactly (MLM_MODEL_CFG, QUANT_CFG) in each task's environment array so downstream code that reads these vars (task_0, task_1, task_2, task_3) will find them.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/launcher/examples/moonshotai/Kimi-K2.5-DFlash/hf_offline_eagle3.yaml`:
- Line 22: The comment that mentions vLLM loading "Kimi-K2.5" is stale relative
to the configured checkpoint key hf_model (currently set to Kimi-K2.6); update
that comment text to reference "Kimi-K2.6" so it matches the hf_model value, and
make the same change for the other occurrences referenced (the comments around
the hf_model key and the comments at the other occurrences noted). Ensure all
comments in the file that mention Kimi-K2.5 are changed to Kimi-K2.6 to keep
configuration and documentation consistent.
---
Duplicate comments:
In `@tools/launcher/examples/moonshotai/Kimi-K2.5-DFlash/hf_offline_eagle3.yaml`:
- Line 85: The artifact path for the training output is inconsistent between
tasks: training.output_dir is set to /scratchspace/eagle3 while later steps
expect /scratchspace/export, causing task_2 → task_3 input resolution to fail;
update the value of training.output_dir (the training.output_dir key in the
YAML) to match the downstream artifact path (/scratchspace/export) or change the
downstream read path to /scratchspace/eagle3 so both producer
(training.output_dir) and consumer use the same artifact directory.
- Around line 47-49: The tasks in this YAML are missing required environment
vars: add MLM_MODEL_CFG and QUANT_CFG to the environment lists for task_0,
task_1, and task_3, and create an environment block for task_2 with those same
vars; specifically set MLM_MODEL_CFG to the HuggingFace repo ID used by this
model and QUANT_CFG to the chosen quantization profile (e.g., NVFP4_DEFAULT_CFG
or INT8_DEFAULT_CFG), ensuring the variable names match exactly (MLM_MODEL_CFG,
QUANT_CFG) in each task's environment array so downstream code that reads these
vars (task_0, task_1, task_2, task_3) will find them.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 39023861-07f0-4fbf-a6bb-da4e88f3160a
📒 Files selected for processing (1)
tools/launcher/examples/moonshotai/Kimi-K2.5-DFlash/hf_offline_eagle3.yaml
| note: | ||
|
|
||
| global_vars: | ||
| hf_model: /hf-local/moonshotai/Kimi-K2.6 |
There was a problem hiding this comment.
Update stale model reference in comments.
The comment says vLLM loads Kimi-K2.5, but current configured checkpoint is Kimi-K2.6 (Line 22). Please align the comment to avoid operator confusion.
Also applies to: 27-28
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/launcher/examples/moonshotai/Kimi-K2.5-DFlash/hf_offline_eagle3.yaml`
at line 22, The comment that mentions vLLM loading "Kimi-K2.5" is stale relative
to the configured checkpoint key hf_model (currently set to Kimi-K2.6); update
that comment text to reference "Kimi-K2.6" so it matches the hf_model value, and
make the same change for the other occurrences referenced (the comments around
the hf_model key and the comments at the other occurrences noted). Ensure all
comments in the file that mention Kimi-K2.5 are changed to Kimi-K2.6 to keep
configuration and documentation consistent.
There was a problem hiding this comment.
Should be Kimi-K2.5/hf_offline_dflash.yaml instead.
| gpus_per_node: 8 | ||
| container: vllm/vllm-openai:latest | ||
|
|
||
| # Step 2: Dump hidden states from target model |
There was a problem hiding this comment.
Remove the task is not been tested
| container: nvcr.io/nvidia/tensorrt-llm/release:1.2.0 | ||
|
|
||
| # Step 3: Train EAGLE3 draft head (offline, single task) | ||
| task_2: |
There was a problem hiding this comment.
Remove the task is not been tested
| gpus_per_node: 8 | ||
| container: nvcr.io/nvidia/tensorrt-llm/release:1.2.0 | ||
|
|
||
| # Step 4: Benchmark speculative decoding (VLLM backend) |
There was a problem hiding this comment.
Remove the task is not been tested
c3eef71 to
b661cef
Compare
|
Review addressed in the force-push (b661cef):
The CodeRabbit comments about Also rolling the lessons into the synth_support spec template via pensieve-intern MR !5 (https://gitlab-master.nvidia.com/omniml/integration/pensieve-intern/-/merge_requests/5) — that MR will be updated with the directory/filename/per-stage-scope conventions so the next agent run on a new model doesn't ship untested tasks or use the output name as the directory. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/launcher/examples/moonshotai/Kimi-K2.5/hf_offline_dflash.yaml`:
- Around line 46-51: The launcher YAML's environment block is missing required
model keys; add MLM_MODEL_CFG set to the HuggingFace repo ID for this model and
QUANT_CFG set to the chosen quantization profile (e.g., NVFP4_DEFAULT_CFG or
INT8_DEFAULT_CFG) inside the existing environment array so the new launcher
config provides both MLM_MODEL_CFG and QUANT_CFG alongside HF_LOCAL and
VLLM_STARTUP_TIMEOUT.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 5475f944-f3fe-4118-96b5-bf9384022088
📒 Files selected for processing (1)
tools/launcher/examples/moonshotai/Kimi-K2.5/hf_offline_dflash.yaml
| environment: | ||
| - HF_LOCAL: /hf-local | ||
| - VLLM_STARTUP_TIMEOUT: "1800" # Kimi-K2.6 weight load alone is ~7.7 min | ||
| # at 71 GiB/GPU; default 600s in query.sh | ||
| # is not enough to also cover KV cache | ||
| # profiling + encoder cache init |
There was a problem hiding this comment.
Add required model environment keys for new launcher configs.
environment is missing MLM_MODEL_CFG and QUANT_CFG, which are required for new model configs in this launcher YAML family.
Suggested patch
environment:
- HF_LOCAL: /hf-local
+ - MLM_MODEL_CFG: moonshotai/Kimi-K2.5
+ - QUANT_CFG: NVFP4_DEFAULT_CFG
- VLLM_STARTUP_TIMEOUT: "1800" # Kimi-K2.6 weight load alone is ~7.7 min
# at 71 GiB/GPU; default 600s in query.sh
# is not enough to also cover KV cacheAs per coding guidelines, "tools/launcher/**/*.yaml: Set MLM_MODEL_CFG environment variable to the HuggingFace repo ID when adding a new model config" and "Set QUANT_CFG environment variable (e.g., NVFP4_DEFAULT_CFG, INT8_DEFAULT_CFG) when adding a new model config".
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| environment: | |
| - HF_LOCAL: /hf-local | |
| - VLLM_STARTUP_TIMEOUT: "1800" # Kimi-K2.6 weight load alone is ~7.7 min | |
| # at 71 GiB/GPU; default 600s in query.sh | |
| # is not enough to also cover KV cache | |
| # profiling + encoder cache init | |
| environment: | |
| - HF_LOCAL: /hf-local | |
| - MLM_MODEL_CFG: moonshotai/Kimi-K2.5 | |
| - QUANT_CFG: NVFP4_DEFAULT_CFG | |
| - VLLM_STARTUP_TIMEOUT: "1800" # Kimi-K2.6 weight load alone is ~7.7 min | |
| # at 71 GiB/GPU; default 600s in query.sh | |
| # is not enough to also cover KV cache | |
| # profiling + encoder cache init |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/launcher/examples/moonshotai/Kimi-K2.5/hf_offline_dflash.yaml` around
lines 46 - 51, The launcher YAML's environment block is missing required model
keys; add MLM_MODEL_CFG set to the HuggingFace repo ID for this model and
QUANT_CFG set to the chosen quantization profile (e.g., NVFP4_DEFAULT_CFG or
INT8_DEFAULT_CFG) inside the existing environment array so the new launcher
config provides both MLM_MODEL_CFG and QUANT_CFG alongside HF_LOCAL and
VLLM_STARTUP_TIMEOUT.
b661cef to
d35357d
Compare
d35357d to
b661cef
Compare
|
Re-cleaned via /babysit-jira on OMNIML-4740. The 2026-05-15 14:26 agent re-fire (pipeline 51447198, running pre-MR-!4 pensieve-intern v0.33.48) force-pushed over the clean cluster-tested commit (b661cef) with a polluted draft — uv.lock churn + VERIFICATION_COMMENT.txt sidecar + a regressed YAML pointing at Reset to b661cef — the single clean cluster-validated commit (Slurm 11782946 on cw-dfw, COMPLETED 1:02:11, real end-to-end vLLM output verified). Paired infra fix to prevent re-occurrence: nmm-sandbox!143 — bumps This PR is back to ready-for-review state; awaiting CODEOWNERS approval from @kevalmorabia97 + modelopt-setup-codeowners. |
b661cef to
f0d20f2
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/launcher/common/service_utils.sh`:
- Around line 51-54: The current fixed marker (/tmp/.nmm_extra_dep_installed)
stored in the _marker variable can be stale across runs; change _marker in
tools/launcher/common/service_utils.sh to be job/session-scoped (e.g., include
$TMPDIR if set or append a unique token like $$, $CI_JOB_ID, or a timestamp) or
generate a unique path with mktemp and persist that same path for the check and
creation; update any code that references _marker to use the new session-scoped
marker so the if [[ -f "$_marker" ]] check and subsequent creation are
consistent.
- Around line 55-67: The install block currently may create the marker even if
installs fail and non-zero ranks keep running after the 600s wait; update the
code around the mpi_local_rank check to fail fast: in the rank 0 branch (where
pip install diskcache, git clone and pip install "${_nvrx_dir}" run) check each
command's exit status and only touch/emit "$_marker" after all installs succeed,
otherwise log the error and exit non-zero; in the non-zero branch (the while
loop using _waited and checking "$_marker") detect the timeout (when _waited >=
600) and abort with a non-zero exit and error message instead of silently
continuing so downstream steps don't run without dependencies.
- Around line 59-60: The git clone in service_utils.sh clones
nvidia-resiliency-ext without pinning, so add an environment variable (e.g.,
NVRX_REF) and use it when cloning to pin to a tag/commit; modify the clone
invocation that uses _nvrx_dir to include --branch "$NVRX_REF" (or fall back to
a sensible default like main when NVRX_REF is empty) and document/export
NVRX_REF so CI/ops can set an immutable ref before running the script.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 1759c510-8bfc-419f-8829-c46dd02aeb5e
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (3)
BLOCKED_ON_UPSTREAM.txttools/launcher/common/service_utils.shtools/launcher/examples/moonshotai/Kimi-K2.5/hf_offline_dflash.yaml
✅ Files skipped from review due to trivial changes (1)
- BLOCKED_ON_UPSTREAM.txt
🚧 Files skipped from review as they are similar to previous changes (1)
- tools/launcher/examples/moonshotai/Kimi-K2.5/hf_offline_dflash.yaml
| local _marker=/tmp/.nmm_extra_dep_installed | ||
| if [[ -f "$_marker" ]]; then | ||
| return 0 | ||
| fi |
There was a problem hiding this comment.
Scope the marker file to the job/session.
Line 51 uses a fixed /tmp marker name, so stale files from previous runs can cause this run to skip installation incorrectly.
Suggested fix
- local _marker=/tmp/.nmm_extra_dep_installed
+ local _marker="/tmp/.nmm_extra_dep_installed.${SLURM_JOB_ID:-$$}.${USER:-unknown}"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| local _marker=/tmp/.nmm_extra_dep_installed | |
| if [[ -f "$_marker" ]]; then | |
| return 0 | |
| fi | |
| local _marker="/tmp/.nmm_extra_dep_installed.${SLURM_JOB_ID:-$$}.${USER:-unknown}" | |
| if [[ -f "$_marker" ]]; then | |
| return 0 | |
| fi |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/launcher/common/service_utils.sh` around lines 51 - 54, The current
fixed marker (/tmp/.nmm_extra_dep_installed) stored in the _marker variable can
be stale across runs; change _marker in tools/launcher/common/service_utils.sh
to be job/session-scoped (e.g., include $TMPDIR if set or append a unique token
like $$, $CI_JOB_ID, or a timestamp) or generate a unique path with mktemp and
persist that same path for the check and creation; update any code that
references _marker to use the new session-scoped marker so the if [[ -f
"$_marker" ]] check and subsequent creation are consistent.
| if [[ "$mpi_local_rank" -eq 0 ]]; then | ||
| pip install diskcache | ||
| local _nvrx_dir | ||
| _nvrx_dir="$(mktemp -d)/nvidia-resiliency-ext" | ||
| git clone --depth 1 https://github.com/NVIDIA/nvidia-resiliency-ext "${_nvrx_dir}" \ | ||
| && pip install "${_nvrx_dir}" | ||
| touch "$_marker" | ||
| else | ||
| local _waited=0 | ||
| while [[ ! -f "$_marker" && $_waited -lt 600 ]]; do | ||
| sleep 1 | ||
| _waited=$((_waited + 1)) | ||
| done |
There was a problem hiding this comment.
Fail fast on install errors and wait timeout.
Line 61 can publish success even if dependency install failed, and Lines 64-67 let non-zero ranks continue after 600s timeout with no marker. Both paths can silently proceed with missing deps.
Suggested fix
if [[ "$mpi_local_rank" -eq 0 ]]; then
- pip install diskcache
+ pip install diskcache || return 1
local _nvrx_dir
_nvrx_dir="$(mktemp -d)/nvidia-resiliency-ext"
git clone --depth 1 https://github.com/NVIDIA/nvidia-resiliency-ext "${_nvrx_dir}" \
- && pip install "${_nvrx_dir}"
- touch "$_marker"
+ && pip install "${_nvrx_dir}" \
+ && touch "$_marker" \
+ || return 1
else
local _waited=0
while [[ ! -f "$_marker" && $_waited -lt 600 ]]; do
sleep 1
_waited=$((_waited + 1))
done
+ [[ -f "$_marker" ]] || return 1
fi📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if [[ "$mpi_local_rank" -eq 0 ]]; then | |
| pip install diskcache | |
| local _nvrx_dir | |
| _nvrx_dir="$(mktemp -d)/nvidia-resiliency-ext" | |
| git clone --depth 1 https://github.com/NVIDIA/nvidia-resiliency-ext "${_nvrx_dir}" \ | |
| && pip install "${_nvrx_dir}" | |
| touch "$_marker" | |
| else | |
| local _waited=0 | |
| while [[ ! -f "$_marker" && $_waited -lt 600 ]]; do | |
| sleep 1 | |
| _waited=$((_waited + 1)) | |
| done | |
| if [[ "$mpi_local_rank" -eq 0 ]]; then | |
| pip install diskcache || return 1 | |
| local _nvrx_dir | |
| _nvrx_dir="$(mktemp -d)/nvidia-resiliency-ext" | |
| git clone --depth 1 https://github.com/NVIDIA/nvidia-resiliency-ext "${_nvrx_dir}" \ | |
| && pip install "${_nvrx_dir}" \ | |
| && touch "$_marker" \ | |
| || return 1 | |
| else | |
| local _waited=0 | |
| while [[ ! -f "$_marker" && $_waited -lt 600 ]]; do | |
| sleep 1 | |
| _waited=$((_waited + 1)) | |
| done | |
| [[ -f "$_marker" ]] || return 1 | |
| fi |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/launcher/common/service_utils.sh` around lines 55 - 67, The install
block currently may create the marker even if installs fail and non-zero ranks
keep running after the 600s wait; update the code around the mpi_local_rank
check to fail fast: in the rank 0 branch (where pip install diskcache, git clone
and pip install "${_nvrx_dir}" run) check each command's exit status and only
touch/emit "$_marker" after all installs succeed, otherwise log the error and
exit non-zero; in the non-zero branch (the while loop using _waited and checking
"$_marker") detect the timeout (when _waited >= 600) and abort with a non-zero
exit and error message instead of silently continuing so downstream steps don't
run without dependencies.
| git clone --depth 1 https://github.com/NVIDIA/nvidia-resiliency-ext "${_nvrx_dir}" \ | ||
| && pip install "${_nvrx_dir}" |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
cat -n tools/launcher/common/service_utils.sh | sed -n '50,70p'Repository: NVIDIA/Model-Optimizer
Length of output: 837
🏁 Script executed:
rg "NVRX_REF|nvidia-resiliency-ext" --type shRepository: NVIDIA/Model-Optimizer
Length of output: 292
🏁 Script executed:
rg "nvidia-resiliency-ext" -iRepository: NVIDIA/Model-Optimizer
Length of output: 292
🏁 Script executed:
rg "resiliency|reproducibility|version" tools/launcher/common/service_utils.sh -B2 -A2Repository: NVIDIA/Model-Optimizer
Length of output: 569
🏁 Script executed:
grep -r "NVRX" tools/launcher/ 2>/dev/null || echo "No NVRX references found"Repository: NVIDIA/Model-Optimizer
Length of output: 91
Pin nvidia-resiliency-ext to an immutable ref.
Line 59 clones from the default branch at runtime without specifying a Git reference, creating a supply-chain and reproducibility risk for cluster jobs. The branch pointer can move unexpectedly, making builds non-deterministic.
Add an environment variable to allow pinning to a specific tag or commit:
Suggested fix
+ local _nvrx_ref="${NVRX_REF:-<tag-or-commit>}"
- git clone --depth 1 https://github.com/NVIDIA/nvidia-resiliency-ext "${_nvrx_dir}" \
+ git clone --depth 1 --branch "${_nvrx_ref}" https://github.com/NVIDIA/nvidia-resiliency-ext "${_nvrx_dir}" \
&& pip install "${_nvrx_dir}"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/launcher/common/service_utils.sh` around lines 59 - 60, The git clone
in service_utils.sh clones nvidia-resiliency-ext without pinning, so add an
environment variable (e.g., NVRX_REF) and use it when cloning to pin to a
tag/commit; modify the clone invocation that uses _nvrx_dir to include --branch
"$NVRX_REF" (or fall back to a sensible default like main when NVRX_REF is
empty) and document/export NVRX_REF so CI/ops can set an immutable ref before
running the script.
f0d20f2 to
b661cef
Compare
b661cef to
50be962
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
tools/launcher/examples/Kimi-K2.5/hf_offline_dflash.yaml (1)
38-38: ⚡ Quick winPin container image to an available versioned tag instead of
latest.Using
vllm/vllm-openai:latestbreaks reproducibility as the tag can change between runs. However, the versionv0.8.5does not exist in the vllm/vllm-openai repository. Available pinned alternatives includecu129-nightlyor model-specific tags. Check the vllm Docker Hub releases for a stable tag appropriate for your use case.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/launcher/examples/Kimi-K2.5/hf_offline_dflash.yaml` at line 38, Replace the floating image tag "vllm/vllm-openai:latest" with a stable, available versioned tag to ensure reproducibility; update the container field (the line containing container: vllm/vllm-openai:latest) to a specific tag such as vllm/vllm-openai:cu129-nightly or another model-specific release found on the vllm Docker Hub tags page, verifying the chosen tag exists before committing.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/launcher/examples/Kimi-K2.5/hf_offline_dflash.yaml`:
- Line 12: The hf_model entry incorrectly references
"/hf-local/moonshotai/Kimi-K2.6" while this pipeline targets Kimi-K2.5; update
the hf_model value to the correct K2.5 model path (e.g., replace "Kimi-K2.6"
with "Kimi-K2.5") so the hf_model key points to the matching K2.5 artifact and
remove any leftover test stand-in reference.
- Line 23: Replace the incorrect vLLM flag --trust_remote_code (underscore) with
the hyphenated form --trust-remote-code in the hf_offline_dflash.yaml example so
it matches other vLLM flags (e.g., --enforce-eager, --gpu-memory-utilization,
--max-model-len) and is accepted by the CLI; locate the entry that currently
reads --trust_remote_code and update it to --trust-remote-code.
---
Nitpick comments:
In `@tools/launcher/examples/Kimi-K2.5/hf_offline_dflash.yaml`:
- Line 38: Replace the floating image tag "vllm/vllm-openai:latest" with a
stable, available versioned tag to ensure reproducibility; update the container
field (the line containing container: vllm/vllm-openai:latest) to a specific tag
such as vllm/vllm-openai:cu129-nightly or another model-specific release found
on the vllm Docker Hub tags page, verifying the chosen tag exists before
committing.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 623468f8-1c67-4fbd-aa04-fbe320694ab4
📒 Files selected for processing (2)
tools/launcher/common/service_utils.shtools/launcher/examples/Kimi-K2.5/hf_offline_dflash.yaml
| note: | ||
|
|
||
| global_vars: | ||
| hf_model: /hf-local/moonshotai/Kimi-K2.6 |
There was a problem hiding this comment.
Model path references K2.6 but job targets K2.5.
The hf_model path points to Kimi-K2.6 while the job name and file path indicate this pipeline is for Kimi-K2.5. Per PR objectives, K2.6 was used as a stand-in during cluster testing, but this should be updated to the correct K2.5 path before merge to avoid confusion.
- hf_model: /hf-local/moonshotai/Kimi-K2.6
+ hf_model: /hf-local/moonshotai/Kimi-K2.5📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| hf_model: /hf-local/moonshotai/Kimi-K2.6 | |
| hf_model: /hf-local/moonshotai/Kimi-K2.5 |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/launcher/examples/Kimi-K2.5/hf_offline_dflash.yaml` at line 12, The
hf_model entry incorrectly references "/hf-local/moonshotai/Kimi-K2.6" while
this pipeline targets Kimi-K2.5; update the hf_model value to the correct K2.5
model path (e.g., replace "Kimi-K2.6" with "Kimi-K2.5") so the hf_model key
points to the matching K2.5 artifact and remove any leftover test stand-in
reference.
| - --tensor-parallel-size 8 | ||
| - --port 8000 | ||
| - --host 0.0.0.0 | ||
| - --trust_remote_code |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
vLLM --trust-remote-code flag syntax
💡 Result:
In vLLM, the --trust-remote-code flag is used to permit the execution of custom model code (such as configuration or modeling files) that is hosted on the Hugging Face Model Hub rather than being natively integrated into the Transformers library [1][2][3]. Usage Syntax: 1. CLI (OpenAI-compatible server): When using the vLLM server, include the flag as a command-line argument [4][3]: vllm serve <model_name> --trust-remote-code 2. Python API: When initializing the LLM class in your Python code, pass the parameter as a boolean argument [3]: from vllm import LLM llm = LLM(model="<model_name>", trust_remote_code=True) Important Considerations: - Security: This flag executes arbitrary code from the model repository on your local machine [1]. Only use it with models from trusted sources [1]. - Necessity: This flag is generally required for custom models or newer architectures that have not yet been upstreamed into the official Hugging Face Transformers library [1][2]. - Default Behavior: While the flag is often required for specific models, users should be aware that some model configurations may hardcode this requirement, and security-related updates may occasionally affect how this flag is handled [5]. Always ensure your transformers library is up to date, as native support for newer models is frequently added, which can eliminate the need for this flag [1][2].
Citations:
- 1: [Usage]: Running Phi-3-small-128k-instruct with v0.4.3 without --trust-remote-code vllm-project/vllm#5244
- 2: https://discuss.huggingface.co/t/how-to-avoid-trust-remote-code-true-for-my-models/84134
- 3: https://docs.vllm.ai/en/latest/models/supported_models/
- 4: https://docs.vllm.ai/en/v0.5.5/models/engine_args.html
- 5: https://www.reddit.com/r/LocalLLaMA/comments/1s72zog/vllm_cve202627893_trustremotecodefalse_is/
🏁 Script executed:
fd "hf_offline_dflash.yaml" tools/launcher/examples/Repository: NVIDIA/Model-Optimizer
Length of output: 123
🏁 Script executed:
cat -n tools/launcher/examples/Kimi-K2.5/hf_offline_dflash.yamlRepository: NVIDIA/Model-Optimizer
Length of output: 1441
Fix flag syntax: use --trust-remote-code with hyphen instead of underscore.
Line 23 uses --trust_remote_code with an underscore, while other vLLM flags in the same file use hyphens (--enforce-eager, --gpu-memory-utilization, --max-model-len). vLLM CLI expects the hyphenated form.
Diff
- - --trust_remote_code
+ - --trust-remote-code📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| - --trust_remote_code | |
| - --trust-remote-code |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/launcher/examples/Kimi-K2.5/hf_offline_dflash.yaml` at line 23, Replace
the incorrect vLLM flag --trust_remote_code (underscore) with the hyphenated
form --trust-remote-code in the hf_offline_dflash.yaml example so it matches
other vLLM flags (e.g., --enforce-eager, --gpu-memory-utilization,
--max-model-len) and is accepted by the CLI; locate the entry that currently
reads --trust_remote_code and update it to --trust-remote-code.
50be962 to
abbda41
Compare
Summary
Adds the EAGLE3 offline pipeline YAML for
moonshotai/Kimi-K2.5-DFlash, adapted fromQwen/Qwen3-8B's offline YAML.task_0cluster-tested green on cw-dfw (Slurm 11782946, experimentcicd_1778864959, elapsed 1:02:11).Driven by /babysit-jira on OMNIML-4740. Replaces the original pensieve-intern synth_support agent draft (1b02102) which had three structural issues:
global_vars.hf_modelto the output path (Kimi-K2.5 DFlash) instead of the input checkpoint.release:1.2.0) that doesn't registerKimiK25ForConditionalGenerationas of 2026-05-14.VERIFICATION_COMMENT.txt(a runner sidecar, not artifact code) anduv.lockregen (+1037/-81, the source of the priormergeable_state: dirty).This PR is the cleaned + cluster-validated replacement.
Changes
Kimi-K2.5 DFlash→Kimi-K2.5-DFlash(Slurm tar packaging breaks on spaces injob_name/path).global_vars.hf_model: /hf-local/moonshotai/Kimi-K2.6— the canonical Kimi-K2.5 input stand-in staged by the operator on cw-dfw.task_0migrated from TRT-LLM to vLLM:script: common/vllm/query.shcontainer: vllm/vllm-openai:latestntasks_per_node: 1(vLLM is single-process)--tensor-parallel-size 8,--trust-remote-code--enforce-eager(vllm-openai:latest is missingtorch/bin/ptxasfor inductor autotuning)--gpu-memory-utilization 0.95+--max-model-len 4096(Kimi weights are 595 GB bf16 on 8×80 GB = 93% weight occupancy; default 0.9 leaves -1.1 GiB for KV cache)VLLM_STARTUP_TIMEOUT=1800env (Kimi load is ~7.7 min, default 600s inquery.shis not enough)--dataswitched to the in-reposynthetic_conversations_1k.jsonl— the canonicalSpeculative-Decoding-Prompt-Samplesisn't on cw-dfw; the in-repo dataset is the portable, packager-shipped input for smoke testing.Cluster-test evidence (mandatory per the refined synth_support spec)
Real assistant response verified end-to-end — Kimi correctly answered the "bat and ball" CRT problem:
Test plan
slurm.py --yaml ... --dry-runexits 0task_0only (task_1/2/3.skip=true) — green, 1000 prompts processed, 10 synth-data shards writtenrun_pipelinestage of OMNIML-4735 will exercise task_1..3 end-to-end with the produced synth data🤖 Generated with Claude Code
Summary by CodeRabbit
Improvements
New Features