Commit 88670ef
feat(speculative): add vLLM data synthesis pipeline and Nemotron dataset preparation scripts (#1176)
### What does this PR do?
Type of change: New feature, new example, bug fix
Adds a vLLM-based synthetic data generation pipeline for speculative
decoding draft model training, along with dataset preparation scripts
for NVIDIA's Nemotron Post-Training dataset collections.
**Data synthesis pipeline** (`tools/launcher/common/vllm/query.sh` +
`common/query.py`):
- Launch a vLLM server and run multi-turn inference to synthesize
training data from input conversation skeletons
- Fork-safe OpenAI client: reinitializes HTTP connection pool after
`datasets.map()` forks worker processes, preventing 400 errors from
corrupted connections
- Clear Docker `ENTRYPOINT` so vLLM containers (which default to `vllm
serve`) work correctly under NeMo Run's executor
- `--max-tokens` argument to bound generation length
- Local file loading support (`--data /path/to/file.jsonl`)
- Re-raise connection errors so `datasets.map()` halts the shard instead
of silently producing empty rows
- Map `developer` role to `system` (OpenAI format compatibility)
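The fork-safe client idea above can be sketched as follows. This is an illustrative pattern, not the PR's actual code: `ForkSafeClient` and the factory callback are hypothetical names, and the real implementation wraps the OpenAI client inside `query.py`.

```python
import os

class ForkSafeClient:
    """Rebuild the wrapped client whenever the current PID differs from
    the one that created the cached instance, so workers forked by
    datasets.map(num_proc=N) never reuse the parent's HTTP sockets."""

    def __init__(self, factory):
        self._factory = factory  # e.g. lambda: OpenAI(base_url=server_url)
        self._client = None
        self._pid = None

    def get(self):
        pid = os.getpid()
        if self._client is None or pid != self._pid:
            self._client = self._factory()  # fresh connection pool in this process
            self._pid = pid
        return self._client
```

In the parent process `get()` returns the cached client; after a fork, the PID check fails and a fresh client (with a fresh connection pool) is built, which is what prevents the 400 errors from inherited, corrupted connections.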
**Multi-turn reasoning trace handling** (`common/query.py`):
- Strip `<think>...</think>` blocks from intermediate assistant turns
before re-feeding to the model; preserve the full trace only on the
final turn
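The trace handling above can be sketched like this; the function name and exact regex are illustrative, not the PR's actual implementation:

```python
import re

THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_intermediate_thinking(messages):
    """Drop <think>...</think> from every assistant turn except the last,
    so earlier reasoning traces are not re-fed to the model while the
    final turn keeps its full trace."""
    last = max((i for i, m in enumerate(messages) if m["role"] == "assistant"),
               default=-1)
    out = []
    for i, m in enumerate(messages):
        if m["role"] == "assistant" and i != last:
            m = {**m, "content": THINK_RE.sub("", m["content"]).strip()}
        out.append(m)
    return out
```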
**Nemotron dataset preparation** (`examples/dataset/`):
- `make_nemotron_ptv2_dataset.py` — prepares
[nvidia/Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2)
(~3.3M rows in `generate` mode, ~1.9M in `train` mode)
- `make_nemotron_ptv3_dataset.py` — prepares the [Nemotron PTv3
collection](https://huggingface.co/collections/nvidia/nemotron-post-training-v3)
of 16 datasets (~3.4M rows in `generate` mode, ~3.9M in `train` mode)
- Both support `generate` mode (strips assistant turns for synthesis
input) and `train` mode (normalizes to clean OpenAI format for SFT)
- `conversation_utils.py` — shared utilities: `strip_assistant_turns`,
`normalize_messages`, `make_augment_fn`, `AugmentationSpec`
- `augmentations.yaml` — 12 language-redirect variants + style/format
hints, cycled across dataset rows
- Scripts live in `examples/dataset/` (not under
`speculative_decoding/`) to signal reusability beyond speculative
decoding
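One way the cycled augmentation could work is sketched below. `make_augment_fn` is named in the PR, but this body is a guess at its shape: it assumes hints are prepended as system messages and cycled deterministically by row index.

```python
def make_augment_fn(hints):
    """Return a datasets.map-compatible fn (use with_indices=True) that
    cycles augmentation hints across rows, prepending each hint as a
    system message. Illustrative sketch, not the PR's implementation."""
    def augment(example, idx):
        hint = hints[idx % len(hints)]  # deterministic cycle over rows
        msgs = [{"role": "system", "content": hint}, *example["messages"]]
        return {"messages": msgs}
    return augment
```

Usage would be along the lines of `ds.map(make_augment_fn(hints), with_indices=True)`, so every row gets exactly one of the 12 language-redirect or style variants.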
**Bug fixes**:
- `strip_assistant_turns()`: return `{"messages": []}` when no user
turns remain (system-only rows were previously passed through instead of
being filtered)
- `concatenate_datasets()`: guard against empty parts list
- SSH tunnel user precedence: explicit `user` arg now correctly
overrides `slurm_config.user`
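The first fix amounts to a guard like the following. The body is illustrative (only the empty-conversation return and the role mapping reflect behavior described in this PR):

```python
def strip_assistant_turns(messages):
    """Prepare a row for synthesis input: drop assistant turns and map
    the `developer` role to `system` for OpenAI-format compatibility.
    Sketch of the described behavior, not the PR's actual code."""
    kept = []
    for m in messages:
        if m["role"] == "assistant":
            continue
        role = "system" if m["role"] == "developer" else m["role"]
        kept.append({**m, "role": role})
    # Fix: a system-only row has nothing to synthesize from; return an
    # empty conversation so the caller filters it out instead of passing
    # it through.
    if not any(m["role"] == "user" for m in kept):
        return {"messages": []}
    return {"messages": kept}
```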
### Usage
```bash
# Prepare PTv3 input conversations for synthesis (~3.4M rows):
python examples/dataset/make_nemotron_ptv3_dataset.py --output-dir /tmp/ptv3_gen
# Launch vLLM server + synthesize responses:
bash tools/launcher/common/vllm/query.sh \
--model /path/to/model \
--tensor-parallel-size 4 \
-- \
--data /tmp/ptv3_gen/default.jsonl \
--save /tmp/ptv3_responses \
--num-shards 10 --num-proc 4 --max-tokens 4096
# Prepare PTv2 for direct SFT training (~1.9M rows):
python examples/dataset/make_nemotron_ptv2_dataset.py --mode train --output-dir /tmp/ptv2_train
```
### Testing
Tested end-to-end on an NVIDIA GB10 node (119 GiB GPU memory) with
`vllm/vllm-openai:qwen3_5-cu130` container and `Qwen/Qwen3.5-4B`:
- vLLM server starts correctly with cleared Docker entrypoint
- `datasets.map(num_proc=4)` runs without connection errors (fork-safe
client)
- Multi-turn synthesis produces correct assistant responses with
thinking traces handled
### Before your PR is "*Ready for review*"
Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).
Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).
- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A (data synthesis scripts;
tested manually)
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **New Features**
* Added dataset generation and augmentation capabilities for Nemotron
post-training datasets (v2 and v3)
* Enhanced query functionality with thinking-block filtering and
improved client management for robust parallel processing
* Added support for local dataset file paths alongside HuggingFace Hub
datasets
* **Bug Fixes**
* Fixed SLURM executor user resolution and Docker container entrypoint
configuration
* Improved error handling for connection failures during dataset
synthesis
* **Documentation**
* Updated dataset preparation guide with new generation modes and
augmentation configuration details
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: Chenhan D. Yu <5185878+ChenhanYu@users.noreply.github.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>

1 parent 9594b18 · commit 88670ef
File tree (17 files changed: +1337 −39 lines)
- examples/
  - dataset/
  - speculative_decoding/
    - prepare_input_conversations/
- tests/examples/speculative_decoding/
- tools/launcher/
  - common/
    - vllm/
- examples/Qwen/Qwen3.5-4B