Commit c3eef71

[OMNIML-4740] Add EAGLE3 offline pipeline YAML for moonshotai/Kimi-K2.5-DFlash
4-task pipeline adapted from Qwen/Qwen3-8B. The synth_support agent's original draft (1b02102) referenced an output-named input path (`Kimi-K2.5 DFlash`) on a TRT-LLM container that doesn't yet register the KimiK25 architecture, plus had uv.lock pollution + sidecar in the diff. This is the cleaned + cluster-validated version.

Key changes from the agent draft:

- Directory renamed `Kimi-K2.5 DFlash` → `Kimi-K2.5-DFlash` (Slurm tar packaging breaks on spaces in job_name / path).
- `global_vars.hf_model` points at the *input* model `Kimi-K2.6` (canonical stand-in for Kimi-K2.5 input per moonshotai), not the pipeline's *output* `Kimi-K2.5-DFlash`.
- `task_0` switched from TRT-LLM (release:1.2.0 doesn't register KimiK25ForConditionalGeneration as of 2026-05-14) to vLLM (vllm-openai:latest, which loads via `--trust-remote-code`).
- vLLM-side knobs documented in-yaml:
  - `--enforce-eager` (skip inductor: the vLLM container is missing torch/bin/ptxas);
  - `--gpu-memory-utilization 0.95` + `--max-model-len 4096` (Kimi weights are 595 GB bf16 on 8×80 GB = 93% weight occupancy, default 0.9 left -1.1 GiB for KV cache);
  - `VLLM_STARTUP_TIMEOUT=1800` env (Kimi weight load is ~7.7 min, default 600s in query.sh wasn't enough).
- `--data` switched to in-repo `synthetic_conversations_1k.jsonl` (the canonical Speculative-Decoding-Prompt-Samples isn't on cw-dfw; the in-repo dataset is the right portable input for smoke-testing the pipeline).
Cluster-test evidence (cw-dfw, Slurm 11782946, experiment cicd_1778864959, elapsed 1:02:11, exit 0):

    $ SLURM_CLUSTER=cw_dfw uv run slurm.py \
        --yaml '.../moonshotai/Kimi-K2.5-DFlash/hf_offline_eagle3.yaml' \
        pipeline.task_1.skip=true \
        pipeline.task_2.skip=true \
        pipeline.task_3.skip=true \
        --yes --detach

    Loading weights took 461.45 seconds
    Model loading took 71.44 GiB memory and 465.10 seconds
    Map (num_proc=32): 100%|██████████| 100/100 [06:42<00:00, 4.02s/example]
    Saved 10 shards to /scratchspace/data/train-{1..10}-00010.jsonl
    Slurm: COMPLETED 01:02:11

Real assistant response verified end-to-end (Kimi correctly answers the "bat and ball" CRT problem).

The previous draft (1b02102) is replaced; uv.lock churn + VERIFICATION_COMMENT.txt removed.

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
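The memory and timeout figures quoted in the message can be sanity-checked with a few lines of arithmetic. The sketch below is only a coarse check: the constants are the numbers from the commit message, the helper names are hypothetical, and it does not try to reproduce vLLM's own memory accounting (so it will not recover the exact -1.1 GiB figure).

```python
# Figures quoted in the commit message; the helpers are illustrative only.
WEIGHTS_GB = 595        # Kimi bf16 weights
NUM_GPUS = 8
GPU_GB = 80
LOAD_SECONDS = 461.45   # "Loading weights took 461.45 seconds"


def weight_occupancy(weights_gb: float, num_gpus: int, gpu_gb: float) -> float:
    """Fraction of total GPU memory taken by the weights alone."""
    return weights_gb / (num_gpus * gpu_gb)


def leaves_room_for_kv_cache(utilization: float, occupancy: float) -> bool:
    """Coarse check: the memory budget must exceed the weight footprint."""
    return utilization > occupancy


occupancy = weight_occupancy(WEIGHTS_GB, NUM_GPUS, GPU_GB)
load_minutes = LOAD_SECONDS / 60

print(f"weight occupancy: {occupancy:.1%}")
print("0.90 leaves room:", leaves_room_for_kv_cache(0.90, occupancy))
print("0.95 leaves room:", leaves_room_for_kv_cache(0.95, occupancy))
print(f"weight load: {load_minutes:.1f} min against a 600 s default timeout")
```

Under these assumptions the weights alone exceed the default 0.9 budget, which is consistent with raising `--gpu-memory-utilization` to 0.95 and capping `--max-model-len`.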
1 parent e27f76f commit c3eef71

1 file changed

Lines changed: 116 additions & 0 deletions

@@ -0,0 +1,116 @@
+# EAGLE3 offline speculative decoding pipeline for moonshotai/Kimi-K2.5-DFlash.
+#
+# 4-step pipeline:
+#   task_0: Data synthesis - query the vLLM server to generate prompt samples
+#   task_1: Dump hidden states - run the target model to capture hidden states
+#   task_2: Offline training - train the EAGLE3 draft head
+#   task_3: Benchmark - evaluate speculative decoding speedup via vLLM
+#
+# All tasks share /scratchspace to pass artifacts between steps.
+#
+# Usage:
+#   uv run launch.py --yaml examples/moonshotai/Kimi-K2.5-DFlash/hf_offline_eagle3.yaml --yes
+#   uv run slurm.py --yaml modules/Model-Optimizer/tools/launcher/examples/moonshotai/Kimi-K2.5-DFlash/hf_offline_eagle3.yaml --yes
+
+job_name: Kimi-K2.5-DFlash_EAGLE3_offline
+pipeline:
+  allow_to_fail: false
+  skip: false
+  note:
+
+  global_vars:
+    hf_model: /hf-local/moonshotai/Kimi-K2.6
+
+  # Step 1: Data synthesis via vLLM server.
+  # TRT-LLM release:1.2.0 doesn't register KimiK25ForConditionalGeneration
+  # (validated on cw-dfw 2026-05-14, Slurm 11770213), so this falls back to vLLM,
+  # which loads Kimi-K2.5 via `--trust-remote-code`.
+  # Args before "--" go to vllm serve; args after "--" go to common/query.py.
+  task_0:
+    script: common/vllm/query.sh
+    args:
+      - --model <<global_vars.hf_model>>
+      - --tensor-parallel-size 8
+      - --max-num-seqs 32
+      - --max-model-len 4096
+      - --gpu-memory-utilization 0.95  # Kimi-K2 weights are ~595 GB bf16 on
+                                       # 8x80 GB = 93% weight occupancy; default
+                                       # 0.9 leaves -1.1 GiB for KV cache.
+      - --port 8000
+      - --host 0.0.0.0
+      - --trust-remote-code
+      - --enforce-eager  # vllm-openai:latest is missing torch/bin/ptxas for
+                         # inductor autotuning; eager mode skips that path
+      - --
+      - --data /nemo_run/code/modules/Model-Optimizer/examples/dataset/synthetic_conversations_1k.jsonl
+      - --save /scratchspace/data
+    environment:
+      - HF_LOCAL: /hf-local
+      - VLLM_STARTUP_TIMEOUT: "1800"  # Kimi-K2.6 weight load alone is ~7.7 min
+                                      # at 71 GiB/GPU; default 600s in query.sh
+                                      # is not enough to also cover KV cache
+                                      # profiling + encoder cache init
+    slurm_config:
+      _factory_: "slurm_factory"
+      nodes: 1
+      ntasks_per_node: 1
+      gpus_per_node: 8
+      container: vllm/vllm-openai:latest
+
+  # Step 2: Dump hidden states from target model
+  task_1:
+    script: common/eagle3/dump_offline_data.sh
+    args:
+      - --input-data /scratchspace/data
+      - --output-dir /scratchspace/offline_hidden_states
+      - --max-seq-len 8192
+      - --tp 8
+      - --moe-ep 8
+    environment:
+      - HF_MODEL_CKPT: <<global_vars.hf_model>>
+    slurm_config:
+      _factory_: "slurm_factory"
+      nodes: 1
+      ntasks_per_node: 8
+      gpus_per_node: 8
+      container: nvcr.io/nvidia/tensorrt-llm/release:1.2.0
+
+  # Step 3: Train EAGLE3 draft head (offline, single task)
+  task_2:
+    script: common/eagle3/train_eagle.sh
+    args:
+      - --config modules/Model-Optimizer/modelopt_recipes/general/speculative_decoding/eagle3.yaml
+      - model.model_name_or_path=<<global_vars.hf_model>>
+      - data.offline_data_path=/scratchspace/offline_hidden_states
+      - training.output_dir=/scratchspace/eagle3
+      - training.training_seq_len=4096
+      - training.disable_tqdm=true
+      - training.ar_validate_steps=500000
+    slurm_config:
+      _factory_: "slurm_factory"
+      nodes: 1
+      ntasks_per_node: 1
+      gpus_per_node: 8
+      container: nvcr.io/nvidia/tensorrt-llm/release:1.2.0
+
+  # Step 4: Benchmark speculative decoding (vLLM backend)
+  task_3:
+    script: common/specdec_bench/quick_check.sh
+    args:
+      - --draft_model_dir /scratchspace/export
+      - --draft_length 3
+      - --output_length 4096
+      - --engine VLLM
+      - --tp_size 8
+      - --ep_size 1
+      - --speculative_algorithm EAGLE3
+      - --mtbench /hf-local/HuggingFaceH4/mt_bench_prompts/raw/question.jsonl
+      - --concurrency 1
+    environment:
+      - HF_MODEL_CKPT: <<global_vars.hf_model>>
+    slurm_config:
+      _factory_: "slurm_factory"
+      nodes: 1
+      ntasks_per_node: 1
+      gpus_per_node: 8
+      container: vllm/vllm-openai:latest
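
The `task_0` comment says that args before `--` go to `vllm serve` and args after it go to `common/query.py`, and several args carry a `<<global_vars.hf_model>>` placeholder. The diff does not show how the launcher or `query.sh` implement either step, so the following is only a minimal sketch of the idea; `resolve` and `split_args` are hypothetical helpers, not functions from the repository.

```python
from typing import Dict, List, Tuple


def resolve(arg: str, global_vars: Dict[str, str]) -> str:
    """Replace each <<global_vars.NAME>> placeholder with its value."""
    for name, value in global_vars.items():
        arg = arg.replace(f"<<global_vars.{name}>>", value)
    return arg


def split_args(args: List[str]) -> Tuple[List[str], List[str]]:
    """Split at the first bare "--": server args before, client args after."""
    if "--" not in args:
        return list(args), []
    cut = args.index("--")
    return args[:cut], args[cut + 1:]


# Values copied from the YAML above (task_0 args abbreviated).
global_vars = {"hf_model": "/hf-local/moonshotai/Kimi-K2.6"}
task_0_args = [
    "--model <<global_vars.hf_model>>",
    "--tensor-parallel-size 8",
    "--enforce-eager",
    "--",
    "--data /nemo_run/code/modules/Model-Optimizer/examples/dataset/synthetic_conversations_1k.jsonl",
    "--save /scratchspace/data",
]

resolved = [resolve(arg, global_vars) for arg in task_0_args]
server_args, query_args = split_args(resolved)

print("vllm serve:", server_args)
print("query.py:  ", query_args)
```

Splitting on the first bare `--` keeps the two flag namespaces apart, so a flag name used by both programs would not collide.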
