
Commit 4dbb906

feat(speculative): add vLLM data synthesis pipeline and Nemotron dataset preparation scripts

- Use slurm_config.user for SSHTunnel user when set
- Add assets field to SandboxPipeline for pre-submission asset checks
- Add Nemotron specdec dataset preparation scripts
- Add vLLM container support for data synthesis
- Add Qwen3.5-4B vLLM specdec data synthesis YAML
- Move dataset prep scripts to examples/dataset/
- Update spec dec conftest to use examples/dataset/make_dataset.py

Signed-off-by: Chenhan D. Yu <5185878+ChenhanYu@users.noreply.github.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>

1 parent 4255bc6 commit 4dbb906

File tree

17 files changed: +1337 −39 lines


examples/dataset/README.md

Lines changed: 128 additions & 0 deletions
# Dataset Preparation Scripts

Utilities for building conversation datasets from NVIDIA Nemotron Post-Training
collections and other HuggingFace sources. These scripts produce datasets in
**standard OpenAI chat format** (`{"messages": [{"role": ..., "content": ...}]}`)
and can be used for any downstream fine-tuning task: SFT, distillation,
speculative decoding draft-model training, etc.
## Files

| File | Description |
|---|---|
| `make_nemotron_ptv3_dataset.py` | Build a dataset from the [Nemotron PT v3 collection](https://huggingface.co/collections/nvidia/nemotron-post-training-v3) using a configurable YAML mix |
| `make_nemotron_ptv2_dataset.py` | Build a dataset from [Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2) |
| `make_dataset.py` | General-purpose mixer for arbitrary HuggingFace datasets (mtbench, sharegpt, ultrachat, magpie, etc.) |
| `conversation_utils.py` | Shared utilities: augmentation, role normalization, assistant-turn stripping |
| `add_nemotron_chat.py` | Add Nemotron v2 chat conversations to an existing dataset |
| `augmentations.yaml` | Augmentation variants (language redirects, style hints) for `make_nemotron_pt*.py` |
| `nemotron_ptv3_datasets.yaml` | Dataset mix config for `make_nemotron_ptv3_dataset.py` |
| `example_data_config.yaml` | Example YAML config for `make_dataset.py` |
## Quick Start

### Install dependencies

```bash
pip install datasets huggingface_hub pyyaml
huggingface-cli login  # required for gated datasets
```

### Build a Nemotron PT v3 dataset

```bash
# Synthetic data generation inputs (strips last assistant turn so a model can regenerate it)
python make_nemotron_ptv3_dataset.py --output-dir /tmp/ptv3_gen

# Full conversations for direct SFT training
python make_nemotron_ptv3_dataset.py --mode train --output-dir /tmp/ptv3_train

# Use a custom dataset mix
python make_nemotron_ptv3_dataset.py --config my_mix.yaml --output-dir /tmp/ptv3_custom
```

### Build a Nemotron PT v2 dataset

```bash
python make_nemotron_ptv2_dataset.py --output-dir /tmp/ptv2_gen
python make_nemotron_ptv2_dataset.py --mode train --output-dir /tmp/ptv2_train
```

### Build a general-purpose mixed dataset

```bash
python make_dataset.py --config example_data_config.yaml --output-dir /tmp/mixed
```
## Dataset Modes

Both `make_nemotron_pt*.py` scripts support two modes:

| Mode | Description | Use case |
|---|---|---|
| `generate` (default) | Strips assistant turns, optionally augments prompts | Input data for synthetic generation (query a target model to produce training responses) |
| `train` | Keeps all turns, normalizes to clean OpenAI format | Direct SFT / distillation training |
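The difference between the two modes can be sketched roughly as follows. `to_generate_row` and `to_train_row` are hypothetical illustrations, not the actual functions in `conversation_utils.py`:

```python
# Sketch of the two dataset modes; the real logic lives in
# conversation_utils.py and may differ in detail.

def to_generate_row(messages):
    """generate mode: drop trailing assistant turns so the row ends with a user turn."""
    out = list(messages)
    while out and out[-1]["role"] == "assistant":
        out.pop()
    return {"messages": out}

def to_train_row(messages):
    """train mode: keep every turn, normalized to the {"messages": [...]} shape."""
    return {"messages": [{"role": m["role"], "content": m["content"]} for m in messages]}

conv = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4"},
]
gen = to_generate_row(conv)    # ends with the user turn
train = to_train_row(conv)     # keeps both turns
```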
## Synthetic Generation Pipeline

The `generate` mode produces conversation skeletons that are fed to a target model
via `tools/launcher/common/query.py` (vLLM or TRT-LLM). The output becomes training
data for a draft model (e.g. EAGLE3 speculative decoding) or a distilled student:

```text
make_nemotron_ptv3_dataset.py --mode generate → skeleton.jsonl
        ↓
query.py (target model generates responses turn-by-turn)
        ↓
training data for draft model / student
```
## Augmentations

`augmentations.yaml` defines language-redirect and style-hint variants that are
applied cyclically across the dataset. Each enabled entry produces one augmented
copy of the source rows.

To customize augmentations:

- **Disable** a variant: add `enabled: false`
- **Add** a language redirect: append a `user_suffix` entry
- **Add** a system prompt: append a `system_prompt` entry

```yaml
augmentations:
  - type: user_suffix
    text: " Please reply in French instead of English."
  - type: system_prompt
    content: "You are a helpful assistant."
    enabled: false  # disable without deleting
```
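The cyclic application described above can be sketched in a few lines. This is a minimal illustration assuming the YAML schema shown; `apply_augmentation` and `augment_dataset` are hypothetical helpers, not the scripts' actual API:

```python
import itertools

# Variants as they would be parsed from augmentations.yaml (illustrative subset).
AUGMENTATIONS = [
    {"type": "user_suffix", "text": " Please reply in French instead of English."},
    {"type": "system_prompt", "content": "You are a helpful assistant."},
]

def apply_augmentation(messages, aug):
    """Apply one variant to a conversation, returning a new message list."""
    if aug["type"] == "user_suffix":
        # Append the suffix to every user message.
        return [
            {**m, "content": m["content"] + aug["text"]} if m["role"] == "user" else m
            for m in messages
        ]
    if aug["type"] == "system_prompt":
        # Prepend a system message.
        return [{"role": "system", "content": aug["content"]}] + messages
    return messages

def augment_dataset(rows):
    """Row i receives variant i % len(enabled variants), so the augmented
    copy is the same size as the source."""
    enabled = [a for a in AUGMENTATIONS if a.get("enabled", True)]
    cycle = itertools.cycle(enabled)
    return [{"messages": apply_augmentation(r["messages"], next(cycle))} for r in rows]
```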
## Dataset Mix Config (`nemotron_ptv3_datasets.yaml`)

Edit this file to add, remove, or re-weight datasets without touching the script:

```yaml
datasets:
  - repo_id: nvidia/Nemotron-Math-v2
    splits: [high_part00, high_part01]
    cap_per_split: 200000
    augment: true

  - repo_id: nvidia/OpenMathReasoning-mini
    splits: [train]
    augment: false  # multilingual; skip language-redirect augmentation
```
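A rough sketch of consuming a mix config like the one above. Field names follow the YAML shown, but `parse_mix_config` is illustrative; the real loader in `make_nemotron_ptv3_dataset.py` may differ:

```python
import yaml  # pip install pyyaml

def parse_mix_config(text):
    """Expand the config into one (repo_id, split, cap, augment) entry per split."""
    cfg = yaml.safe_load(text)
    plan = []
    for spec in cfg["datasets"]:
        for split in spec["splits"]:
            plan.append((spec["repo_id"], split,
                         spec.get("cap_per_split"), spec.get("augment", False)))
    return plan

# Each planned entry could then be fetched with
# datasets.load_dataset(repo_id, split=split) and truncated to `cap` rows
# before mixing and (optionally) augmenting.
```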
## Output Format

Every output row is a JSONL object with a single `messages` key:

```json
{"messages": [
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": "What is 2+2?"},
  {"role": "assistant", "content": "4"}
]}
```

In `generate` mode, assistant turns are stripped so the row ends with a user turn.
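A small validation sketch for this output format (`check_row` is a hypothetical helper, not part of the repo): it parses one JSONL line, checks the message schema, and enforces the generate-mode invariant that rows end with a user turn:

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def check_row(line, generate_mode=False):
    """Validate one JSONL row against the {"messages": [...]} schema."""
    row = json.loads(line)
    msgs = row["messages"]
    assert msgs, "empty conversation"
    for m in msgs:
        assert m["role"] in VALID_ROLES, f"unknown role: {m['role']}"
        assert isinstance(m["content"], str), "content must be a string"
    if generate_mode:
        assert msgs[-1]["role"] == "user", "generate-mode rows must end with a user turn"
    return row
```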

examples/speculative_decoding/prepare_input_conversations/add_nemotron_chat.py renamed to examples/dataset/add_nemotron_chat.py

File renamed without changes.

examples/dataset/augmentations.yaml

Lines changed: 93 additions & 0 deletions
```yaml
# Augmentation specs for make_nemotron_ptv2_dataset.py and make_nemotron_ptv3_dataset.py
#
# Each entry defines one augmentation variant applied cyclically across the dataset.
# The augmented copy is the same size as the source; each row gets exactly one variant.
#
# Supported types:
#
#   user_suffix
#     Appends `text` to the content of every user message in the conversation.
#     Example use: language-redirect instructions, style/length hints.
#
#   system_prompt
#     Prepends a {"role": "system", "content": <content>} message to the conversation.
#     Use this for model-specific flags (e.g. /no_think) or persona instructions.
#     Set `enabled: false` for variants that are not supported by your target model.
#
# To disable an entry without deleting it, add `enabled: false`.
# To add a new variant, append a new entry following the same schema.

augmentations:

  # --- Language redirects (user_suffix) ------------------------------------

  - type: user_suffix
    text: " Please reply in French instead of English."

  - type: user_suffix
    text: " Please reply in Italian instead of English."

  - type: user_suffix
    text: " Please reply in German instead of English."

  - type: user_suffix
    text: " Please reply in Spanish instead of English."

  - type: user_suffix
    text: " Please reply in Mandarin Chinese instead of English."

  - type: user_suffix
    text: " Please reply in Japanese instead of English."

  - type: user_suffix
    text: " Please reply in Korean instead of English."

  - type: user_suffix
    text: " Please reply in Turkish instead of English."

  - type: user_suffix
    text: " Please reply in Modern Standard Arabic instead of English."

  - type: user_suffix
    text: " Please reply in Russian instead of English."

  - type: user_suffix
    text: " Please reply in Brazilian Portuguese instead of English."

  - type: user_suffix
    text: " Please reply in Vietnamese instead of English."

  # --- Style / format hints (user_suffix) ----------------------------------

  - type: user_suffix
    text: " Be concise and answer in as few words as possible."

  - type: user_suffix
    text: " Provide a detailed, step-by-step explanation."

  - type: user_suffix
    text: " Format your response using Markdown (headers, bullet points, code blocks where appropriate)."

  - type: user_suffix
    text: " Do not use Markdown formatting; reply in plain text only."

  - type: user_suffix
    text: " Explain your answer as if I am a complete beginner with no prior knowledge."

  - type: user_suffix
    text: " Assume I am an expert; skip basic explanations and go straight to the details."

  - type: user_suffix
    text: " Think step by step before giving your final answer."

  # --- System-prompt variants (system_prompt) ------------------------------

  # /no_think: suppresses chain-of-thought in models that support it (e.g. Qwen3).
  # Set enabled: false if your target model does not support this flag.
  - type: system_prompt
    content: "You are a helpful assistant. /no_think"
    enabled: false

  # Generic helpful-assistant system prompt (no special flags).
  - type: system_prompt
    content: "You are a helpful, respectful, and honest assistant."
```