Commit 88670ef

ChenhanYu authored and Edwardf0t1 committed
feat(speculative): add vLLM data synthesis pipeline and Nemotron dataset preparation scripts (#1176)
### What does this PR do?

**Type of change:** New feature, new example, bug fix

Adds a vLLM-based synthetic data generation pipeline for speculative decoding draft model training, along with dataset preparation scripts for NVIDIA's Nemotron Post-Training dataset collections.

**Data synthesis pipeline** (`tools/launcher/common/vllm/query.sh` + `common/query.py`):

- Launch a vLLM server and run multi-turn inference to synthesize training data from input conversation skeletons
- Fork-safe OpenAI client: reinitializes the HTTP connection pool after `datasets.map()` forks worker processes, preventing 400 errors from corrupted connections
- Clear the Docker `ENTRYPOINT` so vLLM containers (which default to `vllm serve`) work correctly under NeMo Run's executor
- `--max-tokens` argument to bound generation length
- Local file loading support (`--data /path/to/file.jsonl`)
- Re-raise connection errors so `datasets.map()` halts the shard instead of silently producing empty rows
- Map the `developer` role to `system` (OpenAI format compatibility)

**Multi-turn reasoning trace handling** (`common/query.py`):

- Strip `<think>...</think>` blocks from intermediate assistant turns before re-feeding them to the model; preserve the full trace only on the final turn

**Nemotron dataset preparation** (`examples/dataset/`):

- `make_nemotron_ptv2_dataset.py` — prepares [nvidia/Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2) (~3.3M rows generate, ~1.9M rows train)
- `make_nemotron_ptv3_dataset.py` — prepares the [Nemotron PTv3 collection](https://huggingface.co/collections/nvidia/nemotron-post-training-v3) of 16 datasets (~3.4M rows generate, ~3.9M rows train)
- Both support `generate` mode (strips assistant turns for synthesis input) and `train` mode (normalizes to clean OpenAI format for SFT)
- `conversation_utils.py` — shared utilities: `strip_assistant_turns`, `normalize_messages`, `make_augment_fn`, `AugmentationSpec`
- `augmentations.yaml` — 12 language-redirect variants plus style/format hints, cycled across dataset rows
- Scripts live in `examples/dataset/` (not under `speculative_decoding/`) to signal reusability beyond speculative decoding

**Bug fixes**:

- `strip_assistant_turns()`: return `{"messages": []}` when no user turns remain (system-only rows were previously passed through instead of being filtered)
- `concatenate_datasets()`: guard against an empty parts list
- SSH tunnel user precedence: an explicit `user` arg now correctly overrides `slurm_config.user`

### Usage

```bash
# Prepare PTv3 input conversations for synthesis (~3.4M rows):
python examples/dataset/make_nemotron_ptv3_dataset.py --output-dir /tmp/ptv3_gen

# Launch vLLM server + synthesize responses:
bash tools/launcher/common/vllm/query.sh \
    --model /path/to/model \
    --tensor-parallel-size 4 \
    -- \
    --data /tmp/ptv3_gen/default.jsonl \
    --save /tmp/ptv3_responses \
    --num-shards 10 --num-proc 4 --max-tokens 4096

# Prepare PTv2 for direct SFT training (~1.9M rows):
python examples/dataset/make_nemotron_ptv2_dataset.py --mode train --output-dir /tmp/ptv2_train
```

### Testing

Tested end-to-end on an NVIDIA GB10 node (119 GiB GPU memory) with the `vllm/vllm-openai:qwen3_5-cu130` container and `Qwen/Qwen3.5-4B`:

- vLLM server starts correctly with the cleared Docker entrypoint
- `datasets.map(num_proc=4)` runs without connection errors (fork-safe client)
- Multi-turn synthesis produces correct assistant responses with thinking traces handled

### Before your PR is "*Ready for review*"

- Make sure you read and follow the [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`?: N/A
- Did you write any new necessary tests?: N/A (data synthesis scripts; tested manually)
- Did you update the [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A

## Summary by CodeRabbit

- **New Features**
  - Added dataset generation and augmentation capabilities for Nemotron post-training datasets (v2 and v3)
  - Enhanced query functionality with thinking-block filtering and improved client management for robust parallel processing
  - Added support for local dataset file paths alongside HuggingFace Hub datasets
- **Bug Fixes**
  - Fixed SLURM executor user resolution and Docker container entrypoint configuration
  - Improved error handling for connection failures during dataset synthesis
- **Documentation**
  - Updated dataset preparation guide with new generation modes and augmentation configuration details

Signed-off-by: Chenhan D. Yu <5185878+ChenhanYu@users.noreply.github.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
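The fork-safe client behavior described in the PR notes (reinitialize after `datasets.map()` forks worker processes) can be sketched in Python. This is a hypothetical stand-in, not the PR's actual API: `make_client`, `get_client`, and the dict-based client are illustrative names; in the real pipeline the client would be an `openai.OpenAI` instance.

```python
import os

# Cache one client per process ID so worker processes forked by
# datasets.map() never reuse the parent's HTTP connection pool
# (a reused pool can produce corrupted-connection 400 errors).
_client = None
_client_pid = None


def make_client():
    # Illustrative stand-in for OpenAI(base_url=..., api_key=...);
    # any object carrying per-process state demonstrates the pattern.
    return {"pid": os.getpid()}


def get_client():
    """Return a client owned by the *current* process, rebuilding after fork."""
    global _client, _client_pid
    if _client is None or _client_pid != os.getpid():
        _client = make_client()  # fresh connection pool for this process
        _client_pid = os.getpid()
    return _client
```

Each worker lazily rebuilds its own client on first use, so the parent's pool is never shared across the fork boundary.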
1 parent 9594b18 commit 88670ef

File tree

17 files changed: +1337 −39 lines


examples/dataset/README.md

Lines changed: 128 additions & 0 deletions
# Dataset Preparation Scripts

Utilities for building conversation datasets from NVIDIA Nemotron Post-Training
collections and other HuggingFace sources. These scripts produce datasets in
**standard OpenAI chat format** (`{"messages": [{"role": ..., "content": ...}]}`)
and can be used for any downstream fine-tuning task — SFT, distillation,
speculative decoding draft-model training, etc.

## Files

| File | Description |
|---|---|
| `make_nemotron_ptv3_dataset.py` | Build a dataset from the [Nemotron PT v3 collection](https://huggingface.co/collections/nvidia/nemotron-post-training-v3) using a configurable YAML mix |
| `make_nemotron_ptv2_dataset.py` | Build a dataset from [Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2) |
| `make_dataset.py` | General-purpose mixer for arbitrary HuggingFace datasets (mtbench, sharegpt, ultrachat, magpie, etc.) |
| `conversation_utils.py` | Shared utilities: augmentation, role normalization, assistant-turn stripping |
| `add_nemotron_chat.py` | Add Nemotron v2 chat conversations to an existing dataset |
| `augmentations.yaml` | Augmentation variants (language redirects, style hints) for `make_nemotron_pt*.py` |
| `nemotron_ptv3_datasets.yaml` | Dataset mix config for `make_nemotron_ptv3_dataset.py` |
| `example_data_config.yaml` | Example YAML config for `make_dataset.py` |
## Quick Start

### Install dependencies

```bash
pip install datasets huggingface_hub pyyaml
huggingface-cli login  # required for gated datasets
```

### Build a Nemotron PT v3 dataset

```bash
# Synthetic data generation inputs (strips last assistant turn so a model can regenerate it)
python make_nemotron_ptv3_dataset.py --output-dir /tmp/ptv3_gen

# Full conversations for direct SFT training
python make_nemotron_ptv3_dataset.py --mode train --output-dir /tmp/ptv3_train

# Use a custom dataset mix
python make_nemotron_ptv3_dataset.py --config my_mix.yaml --output-dir /tmp/ptv3_custom
```

### Build a Nemotron PT v2 dataset

```bash
python make_nemotron_ptv2_dataset.py --output-dir /tmp/ptv2_gen
python make_nemotron_ptv2_dataset.py --mode train --output-dir /tmp/ptv2_train
```

### Build a general-purpose mixed dataset

```bash
python make_dataset.py --config example_data_config.yaml --output-dir /tmp/mixed
```
## Dataset Modes

Both `make_nemotron_pt*.py` scripts support two modes:

| Mode | Description | Use case |
|---|---|---|
| `generate` (default) | Strips assistant turns, optionally augments prompts | Input data for synthetic generation (query a target model to produce training responses) |
| `train` | Keeps all turns, normalizes to clean OpenAI format | Direct SFT / distillation training |
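A minimal sketch of what `generate`-mode stripping does. The real helper is `strip_assistant_turns` in `conversation_utils.py`; this simplified stand-in may differ in details, but it reflects the documented behavior, including returning `{"messages": []}` when a row has no user turns at all:

```python
def strip_assistant_turns(row):
    """Truncate a conversation so it ends with a user turn (sketch).

    Rows with no user turns (e.g. system-only rows) come back as
    {"messages": []} so they can be filtered out downstream.
    """
    messages = row["messages"]
    # Index of the last user turn; trailing assistant replies are dropped.
    last_user = max(
        (i for i, m in enumerate(messages) if m["role"] == "user"), default=None
    )
    if last_user is None:
        return {"messages": []}  # system-only row: filter, don't pass through
    return {"messages": messages[: last_user + 1]}
```

For example, a system/user/assistant row is truncated to system/user, ready for a target model to regenerate the answer.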
## Synthetic Generation Pipeline

The `generate` mode produces conversation skeletons that are fed to a target model
via `tools/launcher/common/query.py` (vLLM or TRT-LLM). The output becomes training
data for a draft model (e.g. EAGLE3 speculative decoding) or a distilled student:

```text
make_nemotron_ptv3_dataset.py --mode generate → skeleton.jsonl
        ↓
query.py (target model generates responses turn-by-turn)
        ↓
training data for draft model / student
```
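The PR notes above describe how `query.py` handles reasoning traces during multi-turn synthesis: `<think>...</think>` blocks are stripped from intermediate assistant turns before re-feeding, and only the final turn keeps its full trace. A rough sketch of that behavior, with illustrative helper names rather than `query.py`'s actual functions:

```python
import re

# Matches a <think>...</think> block plus any trailing whitespace.
THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_think(text):
    """Remove <think>...</think> reasoning traces from an assistant message."""
    return THINK_RE.sub("", text)


def prepare_history(assistant_turns):
    """Strip traces from all but the final assistant turn (sketch)."""
    return [strip_think(t) for t in assistant_turns[:-1]] + assistant_turns[-1:]
```

Keeping intermediate turns trace-free keeps the re-fed context short, while the final turn preserves the full trace for training.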
## Augmentations

`augmentations.yaml` defines language-redirect and style-hint variants that are
applied cyclically across the dataset. Each enabled entry produces one augmented
copy of the source rows.

To customize augmentations:

- **Disable** a variant: add `enabled: false`
- **Add** a language redirect: append a `user_suffix` entry
- **Add** a system prompt: append a `system_prompt` entry

```yaml
augmentations:
  - type: user_suffix
    text: " Please reply in French instead of English."
  - type: system_prompt
    content: "You are a helpful assistant."
    enabled: false  # disable without deleting
```
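One way the cyclic application might look in code. This is a sketch with hypothetical helpers (`apply_variant`, `augment_rows`), not the actual `make_augment_fn` implementation:

```python
import itertools


def apply_variant(messages, variant):
    """Apply one augmentation variant to a conversation (sketch)."""
    if variant["type"] == "user_suffix":
        # Append the suffix to every user message's content.
        return [
            {**m, "content": m["content"] + variant["text"]}
            if m["role"] == "user" else m
            for m in messages
        ]
    if variant["type"] == "system_prompt":
        # Prepend a system message to the conversation.
        return [{"role": "system", "content": variant["content"]}] + messages
    raise ValueError(f"unknown augmentation type: {variant['type']}")


def augment_rows(rows, variants):
    """Cycle enabled variants across rows: row i gets variant i mod N."""
    enabled = [v for v in variants if v.get("enabled", True)]
    cycle = itertools.cycle(enabled)
    return [{"messages": apply_variant(r["messages"], next(cycle))} for r in rows]
```

Disabled variants (`enabled: false`) are skipped before cycling, so each row still receives exactly one active variant.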
## Dataset Mix Config (`nemotron_ptv3_datasets.yaml`)

Edit this file to add, remove, or re-weight datasets without touching the script:

```yaml
datasets:
  - repo_id: nvidia/Nemotron-Math-v2
    splits: [high_part00, high_part01]
    cap_per_split: 200000
    augment: true

  - repo_id: nvidia/OpenMathReasoning-mini
    splits: [train]
    augment: false  # multilingual — skip language-redirect augmentation
```
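To illustrate the `cap_per_split` semantics, here is a hypothetical planning helper. `plan_mix` is not part of the scripts, and `split_sizes` stands in for row counts that would be fetched from the Hub:

```python
def plan_mix(config, split_sizes):
    """Compute how many rows each split contributes, honoring cap_per_split.

    config:      parsed YAML (dict) in the format shown above
    split_sizes: maps (repo_id, split) to rows available for that split
    """
    plan = []
    for spec in config["datasets"]:
        cap = spec.get("cap_per_split")  # None means "take everything"
        for split in spec["splits"]:
            available = split_sizes[(spec["repo_id"], split)]
            taken = min(available, cap) if cap is not None else available
            plan.append((spec["repo_id"], split, taken))
    return plan
```

With the example config above, a 500k-row `high_part00` split would be capped to 200k rows, while an uncapped split contributes all of its rows.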
## Output Format

Every output row is a JSONL object with a single `messages` key:

```json
{"messages": [
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": "What is 2+2?"},
  {"role": "assistant", "content": "4"}
]}
```

In `generate` mode, assistant turns are stripped so the row ends with a user turn.
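A quick sanity check for output rows can be sketched as follows. `validate_row` is a hypothetical helper, not part of the scripts; it only encodes the contract described above:

```python
import json

VALID_ROLES = {"system", "user", "assistant"}


def validate_row(line, mode="train"):
    """Check one JSONL line against the output contract (sketch)."""
    row = json.loads(line)
    msgs = row["messages"]
    assert msgs, "row must contain at least one message"
    assert all(m["role"] in VALID_ROLES for m in msgs), "unexpected role"
    if mode == "generate":
        # generate-mode rows end with a user turn awaiting a model response
        assert msgs[-1]["role"] == "user", "generate rows must end with a user turn"
    return row
```

Running it over a few lines of an output file is a cheap way to catch rows that slipped through stripping or normalization.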

examples/speculative_decoding/prepare_input_conversations/add_nemotron_chat.py renamed to examples/dataset/add_nemotron_chat.py

File renamed without changes.
examples/dataset/augmentations.yaml

Lines changed: 93 additions & 0 deletions
# Augmentation specs for make_nemotron_ptv2_dataset.py and make_nemotron_ptv3_dataset.py
#
# Each entry defines one augmentation variant applied cyclically across the dataset.
# The augmented copy is the same size as the source — each row gets exactly one variant.
#
# Supported types:
#
#   user_suffix
#     Appends `text` to the content of every user message in the conversation.
#     Example use: language-redirect instructions, style/length hints.
#
#   system_prompt
#     Prepends a {"role": "system", "content": <content>} message to the conversation.
#     Use this for model-specific flags (e.g. /no_think) or persona instructions.
#     Set `enabled: false` for variants that are not supported by your target model.
#
# To disable an entry without deleting it, add `enabled: false`.
# To add a new variant, append a new entry following the same schema.

augmentations:

  # --- Language redirects (user_suffix) ------------------------------------

  - type: user_suffix
    text: " Please reply in French instead of English."

  - type: user_suffix
    text: " Please reply in Italian instead of English."

  - type: user_suffix
    text: " Please reply in German instead of English."

  - type: user_suffix
    text: " Please reply in Spanish instead of English."

  - type: user_suffix
    text: " Please reply in Mandarin Chinese instead of English."

  - type: user_suffix
    text: " Please reply in Japanese instead of English."

  - type: user_suffix
    text: " Please reply in Korean instead of English."

  - type: user_suffix
    text: " Please reply in Turkish instead of English."

  - type: user_suffix
    text: " Please reply in Modern Standard Arabic instead of English."

  - type: user_suffix
    text: " Please reply in Russian instead of English."

  - type: user_suffix
    text: " Please reply in Brazilian Portuguese instead of English."

  - type: user_suffix
    text: " Please reply in Vietnamese instead of English."

  # --- Style / format hints (user_suffix) ----------------------------------

  - type: user_suffix
    text: " Be concise and answer in as few words as possible."

  - type: user_suffix
    text: " Provide a detailed, step-by-step explanation."

  - type: user_suffix
    text: " Format your response using Markdown (headers, bullet points, code blocks where appropriate)."

  - type: user_suffix
    text: " Do not use Markdown formatting; reply in plain text only."

  - type: user_suffix
    text: " Explain your answer as if I am a complete beginner with no prior knowledge."

  - type: user_suffix
    text: " Assume I am an expert; skip basic explanations and go straight to the details."

  - type: user_suffix
    text: " Think step by step before giving your final answer."

  # --- System-prompt variants (system_prompt) ------------------------------

  # /no_think: suppresses chain-of-thought in models that support it (e.g. Qwen3).
  # Set enabled: false if your target model does not support this flag.
  - type: system_prompt
    content: "You are a helpful assistant. /no_think"
    enabled: false

  # Generic helpful-assistant system prompt (no special flags).
  - type: system_prompt
    content: "You are a helpful, respectful, and honest assistant."
0 commit comments

Comments
 (0)