Commit b661cef

[OMNIML-4740] Add task_0 (synth) of DFlash offline pipeline for moonshotai/Kimi-K2.5
Single-task synth-only YAML at the new canonical path:

  tools/launcher/examples/moonshotai/Kimi-K2.5/hf_offline_dflash.yaml

- Directory uses the *input* model name (Kimi-K2.5, what task_0 queries), not the *output* (Kimi-K2.5-DFlash, what the full pipeline produces). DFlash is what the pipeline does, not what it queries.
- Filename uses the algorithm name (hf_offline_dflash.yaml), not the generic hf_offline_eagle3.yaml.
- Only task_0 is included. Tasks 1-3 (hidden-state dump, offline training, benchmark) are added by their respective downstream stages (hidden_state_dump_support, training_support, run_pipeline) once each is independently cluster-tested. Per review: don't ship tasks without evidence.

task_0 specifics:

- vLLM, not TRT-LLM: release:1.2.0 doesn't register KimiK25ForConditionalGeneration as of 2026-05-14.
- Container: vllm/vllm-openai:latest.
- ntasks_per_node: 1 (vLLM is single-process; TP via flag, not MPI).
- vLLM knobs tuned for Kimi-K2 (595 GB bf16 on 8x80 GB):
    --enforce-eager (vllm-openai:latest is missing torch/bin/ptxas)
    --gpu-memory-utilization 0.95 + --max-model-len 4096 (default 0.9 left -1.1 GiB for KV cache)
    VLLM_STARTUP_TIMEOUT=1800 env (Kimi load is ~7.7 min; the default 600s in query.sh is too short)
- Input dataset: in-repo synthetic_conversations_1k.jsonl (packager-shipped to /nemo_run/code/...) rather than a cluster-mounted Lustre path that may not be staged on cw-dfw.
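The memory reasoning behind --gpu-memory-utilization 0.95 can be sanity-checked with a back-of-envelope calculation. The sketch below uses only the round numbers quoted above (595 GB of weights, 8 GPUs, 80 GB each) and assumes the weights are spread evenly across GPUs with no activation or runtime overhead. The function names are illustrative, not part of vLLM or the launcher, and the result is not expected to reproduce the exact -1.1 GiB figure, which came from vLLM's own profiling.

```python
# Back-of-envelope check of the memory figures quoted in the commit message.
# Assumptions (not taken from vLLM): weights shard evenly across GPUs and
# activation / runtime overhead is ignored.


def weight_occupancy(weights_gb: float, num_gpus: int, gpu_gb: float) -> float:
    """Fraction of total GPU memory occupied by model weights."""
    return weights_gb / (num_gpus * gpu_gb)


def headroom_gb(weights_gb: float, num_gpus: int, gpu_gb: float,
                utilization: float) -> float:
    """Per-GPU memory left under the utilization cap once weights are loaded."""
    return utilization * gpu_gb - weights_gb / num_gpus


occupancy = weight_occupancy(595, 8, 80)
print(f"weight occupancy: {occupancy:.1%}")
for util in (0.90, 0.95):
    left = headroom_gb(595, 8, 80, util)
    print(f"utilization {util:.2f}: {left:+.3f} GB per GPU after weights")
```

Under these assumptions the default cap of 0.9 sits below the ~93% weight occupancy, so nothing is left for the KV cache, while 0.95 leaves a small positive margin. That margin is also why --max-model-len is held at 4096.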
Cluster-test evidence (cw-dfw, Slurm 11782946, experiment cicd_1778864959, elapsed 1:02:11, exit 0):

  $ SLURM_CLUSTER=cw_dfw uv run slurm.py \
      --yaml '.../moonshotai/Kimi-K2.5/hf_offline_dflash.yaml' \
      pipeline.task_1.skip=true \
      pipeline.task_2.skip=true \
      pipeline.task_3.skip=true \
      --yes --detach

  Loading weights took 461.45 seconds
  Model loading took 71.44 GiB memory and 465.10 seconds
  Map (num_proc=32): 100%|##########| 100/100 [06:42<00:00, 4.02s/example]
  Saved 10 shards to /scratchspace/data/train-{1..10}-00010.jsonl
  Slurm: COMPLETED 01:02:11

Real assistant response verified end-to-end (Kimi correctly answers the "bat and ball" CRT problem).

Replaces the prior pensieve-intern agent draft (1b02102) which had: output-named input path, TRT-LLM container lacking KimiK25 support, sidecar pollution, and uv.lock churn.

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
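The log above shows model loading alone taking 465.10 seconds, which is why the 600-second default startup timeout was raised. The sketch below shows how a readiness wait governed by such a timeout might behave. It is not the code in query.sh: `wait_until_ready`, `FakeClock`, and the 700-second ready time are all hypothetical, the last being an invented figure for illustration since the commit does not report total startup time.

```python
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], bool], timeout_s: float,
                     interval_s: float = 5.0,
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> float:
    """Poll `probe` until it returns True; return the elapsed seconds.

    Raises TimeoutError if the server is still not ready at the deadline.
    """
    start = clock()
    deadline = start + timeout_s
    while True:
        if probe():
            return clock() - start
        if clock() >= deadline:
            raise TimeoutError(f"server not ready after {timeout_s:.0f}s")
        sleep(interval_s)


class FakeClock:
    """Simulated clock so the example runs instantly (illustration only)."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


def simulate(timeout_s: float, ready_at_s: float) -> float:
    """Run the wait loop against a server that becomes ready at `ready_at_s`."""
    fake = FakeClock()
    return wait_until_ready(lambda: fake.now >= ready_at_s, timeout_s,
                            clock=fake, sleep=fake.sleep)


# Hypothetical total startup of 700 s (weight load plus cache init).
print(simulate(timeout_s=1800, ready_at_s=700.0))
try:
    simulate(timeout_s=600, ready_at_s=700.0)
except TimeoutError as err:
    print("default timeout:", err)
```

In a real launcher the probe would be an HTTP request against the server, and the timeout would come from the VLLM_STARTUP_TIMEOUT environment variable.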
1 parent e27f76f commit b661cef

1 file changed

Lines changed: 57 additions & 0 deletions

@@ -0,0 +1,57 @@
+# DFlash offline speculative decoding pipeline for moonshotai/Kimi-K2.5.
+#
+# task_0: Data synthesis - query vLLM server to generate prompt samples.
+#
+# Subsequent tasks (hidden-state dump, offline training, benchmark) are
+# added by their respective downstream stages (hidden_state_dump_support /
+# training_support / run_pipeline) once each is independently cluster-tested.
+# Don't add them here speculatively - each stage carries its own evidence.
+#
+# Usage:
+#   uv run launch.py --yaml examples/moonshotai/Kimi-K2.5/hf_offline_dflash.yaml --yes
+#   uv run slurm.py --yaml modules/Model-Optimizer/tools/launcher/examples/moonshotai/Kimi-K2.5/hf_offline_dflash.yaml --yes
+
+job_name: Kimi-K2.5_DFlash_offline
+pipeline:
+  allow_to_fail: false
+  skip: false
+  note:
+
+  global_vars:
+    hf_model: /hf-local/moonshotai/Kimi-K2.6
+
+  # Step 1: Data synthesis via vLLM server.
+  # TRT-LLM release:1.2.0 doesn't register KimiK25ForConditionalGeneration
+  # (validated on cw-dfw 2026-05-14, Slurm 11770213); fell back to vLLM,
+  # which loads the model via `--trust-remote-code`.
+  # Args before "--" go to vllm serve; args after "--" go to common/query.py.
+  task_0:
+    script: common/vllm/query.sh
+    args:
+      - --model <<global_vars.hf_model>>
+      - --tensor-parallel-size 8
+      - --max-num-seqs 32
+      - --max-model-len 4096
+      - --gpu-memory-utilization 0.95  # Kimi-K2 weights are ~595 GB bf16 on
+                                       # 8x80 GB = 93% weight occupancy; default
+                                       # 0.9 leaves -1.1 GiB for KV cache.
+      - --port 8000
+      - --host 0.0.0.0
+      - --trust-remote-code
+      - --enforce-eager  # vllm-openai:latest is missing torch/bin/ptxas for
+                         # inductor autotuning; eager mode skips that path
+      - --
+      - --data /nemo_run/code/modules/Model-Optimizer/examples/dataset/synthetic_conversations_1k.jsonl
+      - --save /scratchspace/data
+    environment:
+      - HF_LOCAL: /hf-local
+      - VLLM_STARTUP_TIMEOUT: "1800"  # Kimi-K2.6 weight load alone is ~7.7 min
+                                      # at 71 GiB/GPU; default 600s in query.sh
+                                      # is not enough to also cover KV cache
+                                      # profiling + encoder cache init
+    slurm_config:
+      _factory_: "slurm_factory"
+      nodes: 1
+      ntasks_per_node: 1
+      gpus_per_node: 8
+      container: vllm/vllm-openai:latest
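
The YAML comment notes that args before "--" go to vllm serve and args after it go to common/query.py. A minimal sketch of that split is shown below, assuming each YAML list entry is tokenized like a shell word list. `split_task_args` is a hypothetical helper for illustration; the real query.sh may implement the split differently.

```python
import shlex
from typing import List, Tuple


def split_task_args(entries: List[str]) -> Tuple[List[str], List[str]]:
    """Tokenize YAML arg entries and split them at the first bare "--".

    Returns (server_args, client_args). If no separator is present, every
    token is treated as a server argument.
    """
    tokens: List[str] = []
    for entry in entries:
        tokens.extend(shlex.split(entry))
    if "--" in tokens:
        cut = tokens.index("--")
        return tokens[:cut], tokens[cut + 1:]
    return tokens, []


# A shortened version of the task_0 args, with the model path already resolved.
server_args, client_args = split_task_args([
    "--model /hf-local/moonshotai/Kimi-K2.6",
    "--tensor-parallel-size 8",
    "--trust-remote-code",
    "--",
    "--data /nemo_run/code/modules/Model-Optimizer/examples/dataset/synthetic_conversations_1k.jsonl",
    "--save /scratchspace/data",
])
print("vllm serve:", server_args)
print("query.py:  ", client_args)
```

Only the first bare "--" acts as the separator, so a later "--" would be passed through to the client side unchanged.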
