[OMNIML-4740] Add task_0 (synth) of DFlash offline pipeline for moonshotai/Kimi-K2.5

ChenhanYu · ChenhanYu · commit b661cef4db76 · 2026-05-15T13:47:42.000-07:00
Single-task synth-only YAML at the new canonical path:

  tools/launcher/examples/moonshotai/Kimi-K2.5/hf_offline_dflash.yaml

- Directory uses the *input* model name (Kimi-K2.5, what task_0 queries),
  not the *output* (Kimi-K2.5-DFlash, what the full pipeline produces).
  DFlash is what the pipeline does, not what it queries.
- Filename uses the algorithm name (hf_offline_dflash.yaml), not the
  generic hf_offline_eagle3.yaml.
- Only task_0 is included. Tasks 1-3 (hidden-state dump, offline
  training, benchmark) are added by their respective downstream stages
  (hidden_state_dump_support, training_support, run_pipeline) once each
  is independently cluster-tested. Per review: don't ship tasks without
  evidence.

task_0 specifics:

- vLLM (not TRT-LLM; release:1.2.0 doesn't register
  KimiK25ForConditionalGeneration as of 2026-05-14).
- Container: vllm/vllm-openai:latest.
- ntasks_per_node: 1 (vLLM is single-process; TP via flag, not MPI).
- vLLM knobs tuned for Kimi-K2 (595 GB bf16 on 8x80 GB):
    --enforce-eager (vllm-openai:latest is missing torch/bin/ptxas)
    --gpu-memory-utilization 0.95 + --max-model-len 4096 (default 0.9
      left -1.1 GiB for KV cache)
    VLLM_STARTUP_TIMEOUT=1800 env (Kimi load is ~7.7 min, default 600s
      in query.sh is too short)
- Input dataset: in-repo synthetic_conversations_1k.jsonl
  (packager-shipped to /nemo_run/code/...) rather than a cluster-mounted
  Lustre path that may not be staged on cw-dfw.

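The --gpu-memory-utilization choice can be sanity-checked with rough arithmetic. The sketch below uses only the nominal figures quoted in this message (595 GB of weights, 8 GPUs at 80 GB each). The helper name headroom_gb is hypothetical, and vLLM's own accounting also covers activations and runtime overhead, so this is not expected to reproduce the -1.1 GiB figure; it only shows why a 0.9 cap cannot fit the weights while 0.95 can.

```python
def headroom_gb(weights_gb: float, num_gpus: int, gpu_gb: float, utilization: float) -> float:
    """Aggregate memory left after weights under a utilization cap (nominal GB).

    Hypothetical helper for illustration; ignores activations and runtime overhead.
    """
    return utilization * num_gpus * gpu_gb - weights_gb


# Nominal figures taken from the commit message.
WEIGHTS_GB = 595.0
NUM_GPUS = 8
GPU_GB = 80.0

# Fraction of total GPU memory occupied by weights alone.
occupancy = WEIGHTS_GB / (NUM_GPUS * GPU_GB)

# Headroom under the vLLM default cap (0.9) and the tuned cap (0.95).
default_headroom = headroom_gb(WEIGHTS_GB, NUM_GPUS, GPU_GB, 0.90)
tuned_headroom = headroom_gb(WEIGHTS_GB, NUM_GPUS, GPU_GB, 0.95)

print(f"weight occupancy:       {occupancy:.1%}")
print(f"headroom at 0.90 (GB):  {default_headroom:.1f}")
print(f"headroom at 0.95 (GB):  {tuned_headroom:.1f}")
```

A negative headroom at 0.90 means the weights alone exceed the budget, which is consistent with the KV-cache failure described above; the small positive headroom at 0.95 is also why --max-model-len is capped at 4096.
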
Cluster-test evidence (cw-dfw, Slurm 11782946, experiment
cicd_1778864959, elapsed 1:02:11, exit 0):

  $ SLURM_CLUSTER=cw_dfw uv run slurm.py \
      --yaml '.../moonshotai/Kimi-K2.5/hf_offline_dflash.yaml' \
      pipeline.task_1.skip=true \
      pipeline.task_2.skip=true \
      pipeline.task_3.skip=true \
      --yes --detach

  Loading weights took 461.45 seconds
  Model loading took 71.44 GiB memory and 465.10 seconds
  Map (num_proc=32): 100%|##########| 100/100 [06:42<00:00, 4.02s/example]
  Saved 10 shards to /scratchspace/data/train-{1..10}-00010.jsonl
  Slurm: COMPLETED 01:02:11

Real assistant response verified end-to-end (Kimi correctly answers the
"bat and ball" CRT problem).

Replaces the prior pensieve-intern agent draft (1b02102) which had:
output-named input path, TRT-LLM container lacking KimiK25 support,
sidecar pollution, and uv.lock churn.

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
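The timing figures in the evidence log can be cross-checked against the startup timeout. This is a small arithmetic sketch using only numbers quoted in this message; the variable names are illustrative, and the time spent after model load (KV cache profiling, encoder cache init) is not in the log, so it is not estimated here.

```python
# Figures quoted in the cluster-test log and the task_0 notes.
WEIGHTS_LOAD_S = 461.45      # "Loading weights took 461.45 seconds"
MODEL_LOAD_S = 465.10        # "Model loading took ... 465.10 seconds"
DEFAULT_TIMEOUT_S = 600      # default in query.sh, per the commit message
TUNED_TIMEOUT_S = 1800       # VLLM_STARTUP_TIMEOUT set by this change
MAP_ELAPSED_S = 6 * 60 + 42  # "[06:42<00:00, ...]"
NUM_EXAMPLES = 100           # "100/100"

# Weight load expressed in minutes (the "~7.7 min" in the notes).
weights_load_min = WEIGHTS_LOAD_S / 60

# Time left for the rest of server startup once the model is loaded.
margin_default_s = DEFAULT_TIMEOUT_S - MODEL_LOAD_S
margin_tuned_s = TUNED_TIMEOUT_S - MODEL_LOAD_S

# Average synthesis latency per example during the Map step.
per_example_s = MAP_ELAPSED_S / NUM_EXAMPLES

print(f"weight load:            {weights_load_min:.1f} min")
print(f"margin at 600 s:        {margin_default_s:.1f} s")
print(f"margin at 1800 s:       {margin_tuned_s:.1f} s")
print(f"synthesis per example:  {per_example_s:.2f} s")
```

With the default timeout, a little over two minutes remain after model load for everything else the server does before it accepts requests, which is the gap the 1800 s setting is meant to cover.
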
diff --git a/tools/launcher/examples/moonshotai/Kimi-K2.5/hf_offline_dflash.yaml b/tools/launcher/examples/moonshotai/Kimi-K2.5/hf_offline_dflash.yaml
@@ -0,0 +1,57 @@
+# DFlash offline speculative decoding pipeline for moonshotai/Kimi-K2.5.
+#
+# task_0: Data synthesis — query vLLM server to generate prompt samples.
+#
+# Subsequent tasks (hidden-state dump, offline training, benchmark) are
+# added by their respective downstream stages (hidden_state_dump_support /
+# training_support / run_pipeline) once each is independently cluster-tested.
+# Don't add them here speculatively — each stage carries its own evidence.
+#
+# Usage:
+#   uv run launch.py --yaml examples/moonshotai/Kimi-K2.5/hf_offline_dflash.yaml --yes
+#   uv run slurm.py --yaml modules/Model-Optimizer/tools/launcher/examples/moonshotai/Kimi-K2.5/hf_offline_dflash.yaml --yes
+
+job_name: Kimi-K2.5_DFlash_offline
+pipeline:
+  allow_to_fail: false
+  skip: false
+  note:
+
+  global_vars:
+    hf_model: /hf-local/moonshotai/Kimi-K2.5
+
+  # Step 1: Data synthesis via vLLM server.
+  # TRT-LLM release:1.2.0 doesn't register KimiK25ForConditionalGeneration
+  # (validated on cw-dfw 2026-05-14, Slurm 11770213) — fell back to vLLM
+  # which loads the model via `--trust-remote-code`.
+  # Args before "--" go to vllm serve; args after "--" go to common/query.py.
+  task_0:
+    script: common/vllm/query.sh
+    args:
+      - --model <<global_vars.hf_model>>
+      - --tensor-parallel-size 8
+      - --max-num-seqs 32
+      - --max-model-len 4096
+      - --gpu-memory-utilization 0.95  # Kimi-K2 weights are ~595 GB bf16; on
+                                       # 8x80 GB that is 93% weight occupancy, and
+                                       # the default 0.9 leaves -1.1 GiB for KV cache.
+      - --port 8000
+      - --host 0.0.0.0
+      - --trust-remote-code
+      - --enforce-eager  # vllm-openai:latest is missing torch/bin/ptxas for
+                         # inductor autotuning; eager mode skips that path
+      - --
+      - --data /nemo_run/code/modules/Model-Optimizer/examples/dataset/synthetic_conversations_1k.jsonl
+      - --save /scratchspace/data
+    environment:
+      - HF_LOCAL: /hf-local
+      - VLLM_STARTUP_TIMEOUT: "1800"  # Kimi-K2.5 weight load alone is ~7.7 min
+                                      # at 71 GiB/GPU; default 600s in query.sh
+                                      # is not enough to also cover KV cache
+                                      # profiling + encoder cache init
+    slurm_config:
+      _factory_: "slurm_factory"
+      nodes: 1
+      ntasks_per_node: 1
+      gpus_per_node: 8
+      container: vllm/vllm-openai:latest
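
The "--" convention described in the task_0 comment (args before it go to vllm serve, args after it go to the query script) can be sketched as follows. This is an illustration only, not the actual logic in common/vllm/query.sh; split_args is a hypothetical helper, and it assumes each YAML list entry is a whitespace-separated string as in the file above.

```python
import shlex
from typing import List, Tuple


def split_args(entries: List[str]) -> Tuple[List[str], List[str]]:
    """Split launcher args into (server_args, query_args) at the first "--".

    Hypothetical helper: each entry is tokenized on whitespace, so
    "--max-num-seqs 32" becomes ["--max-num-seqs", "32"].
    """
    tokens = [tok for entry in entries for tok in shlex.split(entry)]
    if "--" not in tokens:
        # No separator: everything is treated as a server argument.
        return tokens, []
    idx = tokens.index("--")
    return tokens[:idx], tokens[idx + 1:]


# A shortened version of the task_0 args list from the YAML.
server_args, query_args = split_args([
    "--model <<global_vars.hf_model>>",
    "--tensor-parallel-size 8",
    "--trust-remote-code",
    "--",
    "--data /nemo_run/code/modules/Model-Optimizer/examples/dataset/synthetic_conversations_1k.jsonl",
    "--save /scratchspace/data",
])

print("server:", server_args)
print("query: ", query_args)
```

Keeping both argument groups in one list means a single task entry configures the server and the client together, at the cost of relying on the separator being present and unique.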