
Commit f79ab48

ChenhanYu and claude committed

feat(examples): add Qwen3.5-4B vLLM specdec data synthesis YAML

Uses vllm/vllm-openai:qwen3_5-cu130 container with --gpu-memory-utilization 0.87 and --max-tokens 4096 to cap thinking traces.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: chenhany <chenhany@nvidia.com>

1 parent: 3b0461f

File tree: 1 file changed (+43 −0 lines)

examples/Qwen/Qwen3.5-4B/query_specdec_dataset.yaml
@@ -0,0 +1,43 @@
+# Data synthesis for Qwen3.5-4B using the Speculative-Decoding-Multilingual-Prompt-v2 dataset.
+#
+# Starts a vLLM server with Qwen3.5-4B, then runs query.py against it to generate
+# synthetic assistant responses for EAGLE3 draft model training.
+#
+# Local run (requires GPU + Docker):
+#   uv run launch.py --yaml examples/Qwen/Qwen3.5-4B/query_specdec_dataset.yaml \
+#     hf_local=/home/omniml_data_3/hf-local --yes
+#
+# Slurm run:
+#   uv run slurm.py --yaml modules/Model-Optimizer/tools/launcher/examples/Qwen/Qwen3.5-4B/query_specdec_dataset.yaml --yes
+
+job_name: Qwen3.5-4B_specdec_query
+
+pipeline:
+  global_vars:
+    hf_model: /hf-local/Qwen/Qwen3.5-4B
+
+  task_0:
+    script: common/vllm/query.sh
+    args:
+      - --model <<global_vars.hf_model>>
+      - --tensor-parallel-size 1
+      - --max-num-seqs 32
+      - --trust-remote-code
+      - --gpu-memory-utilization 0.87
+      - --
+      - --data /hf-local/nvidia/Speculative-Decoding-Multilingual-Prompt-v2/sample-1K.jsonl
+      - --save /scratchspace/data
+      - --num-shards 10
+      - --num-proc 4
+      - --max-tokens 4096
+    environment:
+      - HF_LOCAL: /hf-local
+      - LOGNAME: chenhany
+      - USER: chenhany
+      - HOME: /tmp
+    slurm_config:
+      _factory_: "slurm_factory"
+      nodes: 1
+      ntasks_per_node: 1
+      gpus_per_node: 1
+      container: vllm/vllm-openai:qwen3_5-cu130
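The lone `--` entry in `args` is the conventional separator between server flags and the flags passed on to query.py. The actual common/vllm/query.sh is not shown in this diff, but a minimal bash sketch of how a wrapper could split its argv at `--` might look like this (argument values here are illustrative):

```shell
# Simulate the argv that the launcher would pass to the wrapper script.
set -- --model /hf-local/Qwen/Qwen3.5-4B --gpu-memory-utilization 0.87 -- --data sample.jsonl --max-tokens 4096

# Collect everything before "--" as server flags.
server_args=()
while [ "$#" -gt 0 ] && [ "$1" != "--" ]; do
  server_args+=("$1")
  shift
done
# Drop the separator itself; the remaining positional params go to query.py.
[ "$1" = "--" ] && shift

echo "server: ${server_args[*]}"
echo "query:  $*"
```

Everything before the separator would be forwarded to the vLLM server command, and everything after to the query client.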
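The `--model <<global_vars.hf_model>>` entry relies on the launcher expanding `<<global_vars.*>>` placeholders before invoking the task script. The launcher's real substitution code is not part of this diff; a minimal sketch of such an expansion, using a hypothetical `expand_placeholders` helper:

```python
import re

def expand_placeholders(args, global_vars):
    """Replace <<global_vars.NAME>> tokens in each arg with the value of
    NAME from the pipeline's global_vars mapping. Hypothetical helper:
    the real launcher's substitution logic may differ."""
    pattern = re.compile(r"<<global_vars\.(\w+)>>")
    return [pattern.sub(lambda m: str(global_vars[m.group(1)]), a) for a in args]

global_vars = {"hf_model": "/hf-local/Qwen/Qwen3.5-4B"}
args = ["--model <<global_vars.hf_model>>", "--tensor-parallel-size 1"]
print(expand_placeholders(args, global_vars))
# ['--model /hf-local/Qwen/Qwen3.5-4B', '--tensor-parallel-size 1']
```

With this kind of expansion, changing `hf_model` once under `global_vars` updates every task argument that references it.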

0 commit comments