Commit b661cef
[OMNIML-4740] Add task_0 (synth) of DFlash offline pipeline for moonshotai/Kimi-K2.5
Single-task synth-only YAML at the new canonical path:
tools/launcher/examples/moonshotai/Kimi-K2.5/hf_offline_dflash.yaml
- Directory uses the *input* model name (Kimi-K2.5, what task_0
queries), not the *output* (Kimi-K2.5-DFlash, what the full
pipeline produces). DFlash is what the pipeline does, not what
it queries.
- Filename uses the algorithm name (hf_offline_dflash.yaml), not
the generic hf_offline_eagle3.yaml.
- Only task_0 is included. Tasks 1-3 (hidden-state dump, offline
training, benchmark) are added by their respective downstream
stages (hidden_state_dump_support, training_support,
run_pipeline) once each is independently cluster-tested. Per
review: don't ship tasks without evidence.
task_0 specifics:
- vLLM (not TRT-LLM — release:1.2.0 doesn't register
KimiK25ForConditionalGeneration as of 2026-05-14).
- Container: vllm/vllm-openai:latest.
- ntasks_per_node: 1 (vLLM is single-process; TP via flag, not MPI).
- vLLM knobs tuned for Kimi-K2 (595 GB bf16 on 8x80 GB):
--enforce-eager (vllm-openai:latest is missing torch/bin/ptxas)
--gpu-memory-utilization 0.95 + --max-model-len 4096 (default
0.9 left -1.1 GiB for KV cache)
VLLM_STARTUP_TIMEOUT=1800 env (Kimi load is ~7.7 min, default
600s in query.sh is too short)
- Input dataset: in-repo synthetic_conversations_1k.jsonl
(packager-shipped to /nemo_run/code/...) rather than a
cluster-mounted Lustre path that may not be staged on cw-dfw.
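
For illustration only, the task_0 knobs listed above can be collected into a single server launch line. This is a hedged sketch, not the contents of hf_offline_dflash.yaml or query.sh: the `vllm serve` entry point, the tensor-parallel size of 8 (one rank per GPU on the 8x80 GB node), and the helper name `print_vllm_cmd` are assumptions. Only the three tuned flags and the VLLM_STARTUP_TIMEOUT value come from this commit message. The script prints the command instead of running it, so it can be inspected without a GPU node.

```shell
#!/usr/bin/env bash
set -euo pipefail

MODEL="moonshotai/Kimi-K2.5"   # input model that task_0 queries
TP_SIZE=8                      # assumption: one tensor-parallel rank per GPU

# Startup timeout (seconds) raised from the 600 s default in query.sh.
export VLLM_STARTUP_TIMEOUT=1800

# Flags named in the commit message, plus the assumed TP flag.
VLLM_ARGS=(
  --tensor-parallel-size "$TP_SIZE"
  --enforce-eager
  --gpu-memory-utilization 0.95
  --max-model-len 4096
)

# Hypothetical helper: print the launch line rather than executing it.
print_vllm_cmd() {
  echo "vllm serve $MODEL ${VLLM_ARGS[*]}"
}

print_vllm_cmd
```

Keeping the flags in an array makes it easy to drop --enforce-eager again if a later container image ships ptxas.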
Cluster-test evidence (cw-dfw, Slurm 11782946, experiment
cicd_1778864959, elapsed 1:02:11, exit 0):
$ SLURM_CLUSTER=cw_dfw uv run slurm.py \
--yaml '.../moonshotai/Kimi-K2.5/hf_offline_dflash.yaml' \
pipeline.task_1.skip=true \
pipeline.task_2.skip=true \
pipeline.task_3.skip=true \
--yes --detach
Loading weights took 461.45 seconds
Model loading took 71.44 GiB memory and 465.10 seconds
Map (num_proc=32): 100%|##########| 100/100 [06:42<00:00, 4.02s/example]
Saved 10 shards to /scratchspace/data/train-{1..10}-00010.jsonl
Slurm: COMPLETED 01:02:11
Real assistant response verified end-to-end (Kimi correctly
answers the "bat and ball" CRT problem).
Replaces the prior pensieve-intern agent draft (1b02102) which
had: output-named input path, TRT-LLM container lacking
KimiK25 support, sidecar pollution, and uv.lock churn.
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
1 parent e27f76f commit b661cef
1 file changed
Lines changed: 57 additions & 0 deletions