
Commit 658a61b

docs(examples): add README for examples/dataset/ and clarify generality

Add README.md covering all scripts, quick start, the generate/train modes,
augmentations, dataset mix config, and the synthetic generation pipeline.
Also remove "speculative-decoding" scope from the nemotron_ptv3_datasets.yaml
header — these scripts are reusable for distillation and other fine-tuning
workflows, not spec-dec only. Addresses kevalmorabia97's review feedback on
PR #1176.

Signed-off-by: chenhany <chenhany@nvidia.com>

1 parent 6627980

File tree

2 files changed, +129 -2 lines changed

examples/dataset/README.md

Lines changed: 128 additions & 0 deletions

@@ -0,0 +1,128 @@
# Dataset Preparation Scripts

Utilities for building conversation datasets from NVIDIA Nemotron Post-Training
collections and other HuggingFace sources. These scripts produce datasets in
**standard OpenAI chat format** (`{"messages": [{"role": ..., "content": ...}]}`)
and can be used for any downstream fine-tuning task — SFT, distillation,
speculative decoding draft-model training, etc.
## Files

| File | Description |
|---|---|
| `make_nemotron_ptv3_dataset.py` | Build a dataset from the [Nemotron PT v3 collection](https://huggingface.co/collections/nvidia/nemotron-post-training-v3) using a configurable YAML mix |
| `make_nemotron_ptv2_dataset.py` | Build a dataset from [Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2) |
| `make_dataset.py` | General-purpose mixer for arbitrary HuggingFace datasets (mtbench, sharegpt, ultrachat, magpie, etc.) |
| `conversation_utils.py` | Shared utilities: augmentation, role normalization, assistant-turn stripping |
| `add_nemotron_chat.py` | Add Nemotron v2 chat conversations to an existing dataset |
| `augmentations.yaml` | Augmentation variants (language redirects, style hints) for `make_nemotron_pt*.py` |
| `nemotron_ptv3_datasets.yaml` | Dataset mix config for `make_nemotron_ptv3_dataset.py` |
| `example_data_config.yaml` | Example YAML config for `make_dataset.py` |

## Quick Start

### Install dependencies

```bash
pip install datasets huggingface_hub pyyaml
huggingface-cli login  # required for gated datasets
```

### Build a Nemotron PT v3 dataset

```bash
# Synthetic data generation inputs (strips last assistant turn so a model can regenerate it)
python make_nemotron_ptv3_dataset.py --output-dir /tmp/ptv3_gen

# Full conversations for direct SFT training
python make_nemotron_ptv3_dataset.py --mode train --output-dir /tmp/ptv3_train

# Use a custom dataset mix
python make_nemotron_ptv3_dataset.py --config my_mix.yaml --output-dir /tmp/ptv3_custom
```

### Build a Nemotron PT v2 dataset

```bash
python make_nemotron_ptv2_dataset.py --output-dir /tmp/ptv2_gen
python make_nemotron_ptv2_dataset.py --mode train --output-dir /tmp/ptv2_train
```

### Build a general-purpose mixed dataset

```bash
python make_dataset.py --config example_data_config.yaml --output-dir /tmp/mixed
```

## Dataset Modes

Both `make_nemotron_pt*.py` scripts support two modes:

| Mode | Description | Use case |
|---|---|---|
| `generate` (default) | Strips assistant turns, optionally augments prompts | Input data for synthetic generation (query a target model to produce training responses) |
| `train` | Keeps all turns, normalizes to clean OpenAI format | Direct SFT / distillation training |
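
The stripping that `generate` mode performs can be pictured with a minimal sketch (illustrative only: the real logic lives in `conversation_utils.py`, and `strip_trailing_assistant` is a name invented here):

```python
def strip_trailing_assistant(messages):
    """generate mode: drop trailing assistant turns so the row ends on a user
    turn and a target model can regenerate the reply."""
    out = list(messages)
    while out and out[-1]["role"] == "assistant":
        out.pop()
    return out

convo = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4"},
]
print(strip_trailing_assistant(convo))  # only the user turn remains
```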

## Synthetic Generation Pipeline

The `generate` mode produces conversation skeletons that are fed to a target model
via `tools/launcher/common/query.py` (vLLM or TRT-LLM). The output becomes training
data for a draft model (e.g. EAGLE3 speculative decoding) or a distilled student:

```
make_nemotron_ptv3_dataset.py --mode generate → skeleton.jsonl

query.py (target model generates responses turn-by-turn)

training data for draft model / student
```
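
The fill-in step can be sketched as a loop over skeleton rows. This is a single-turn simplification (the real `query.py` drives a vLLM or TRT-LLM backend and regenerates turn-by-turn); `run_pipeline` and `generate_fn` are names invented here:

```python
def run_pipeline(skeleton_rows, generate_fn):
    """Fill each skeleton (ending on a user turn) with a model-generated reply.

    generate_fn stands in for the target-model backend: any callable that maps
    a messages list to a reply string.
    """
    training_rows = []
    for row in skeleton_rows:
        messages = list(row["messages"])
        reply = generate_fn(messages)  # target model produces the assistant turn
        messages.append({"role": "assistant", "content": reply})
        training_rows.append({"messages": messages})
    return training_rows

skeletons = [{"messages": [{"role": "user", "content": "What is 2+2?"}]}]
rows = run_pipeline(skeletons, lambda msgs: "4")  # stub target model
print(rows[0]["messages"][-1])
```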

## Augmentations

`augmentations.yaml` defines language-redirect and style-hint variants that are
applied cyclically across the dataset. Each enabled entry produces one augmented
copy of the source rows.

To customize augmentations:

- **Disable** a variant: add `enabled: false`
- **Add** a language redirect: append a `user_suffix` entry
- **Add** a system prompt: append a `system_prompt` entry

```yaml
augmentations:
  - type: user_suffix
    text: " Please reply in French instead of English."
  - type: system_prompt
    content: "You are a helpful assistant."
    enabled: false  # disable without deleting
```
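
One plausible reading of "applied cyclically" is sketched below. The entries mirror `augmentations.yaml`; `apply_augmentations` is an invented name, and the real behavior in `conversation_utils.py` may differ in detail:

```python
AUGMENTATIONS = [
    {"type": "user_suffix", "text": " Please reply in French instead of English."},
    {"type": "system_prompt", "content": "You are a helpful assistant.", "enabled": False},
]

def apply_augmentations(rows, augmentations):
    """Cycle enabled variants across rows; each row keeps its original form
    and gains one augmented copy."""
    enabled = [a for a in augmentations if a.get("enabled", True)]
    out = []
    for i, row in enumerate(rows):
        out.append(row)
        if not enabled:
            continue
        aug = enabled[i % len(enabled)]
        msgs = [dict(m) for m in row["messages"]]  # copy so the original row is untouched
        if aug["type"] == "user_suffix":
            for m in reversed(msgs):  # append the redirect to the last user turn
                if m["role"] == "user":
                    m["content"] += aug["text"]
                    break
        elif aug["type"] == "system_prompt":
            msgs.insert(0, {"role": "system", "content": aug["content"]})
        out.append({"messages": msgs})
    return out

demo = apply_augmentations([{"messages": [{"role": "user", "content": "Hi"}]}], AUGMENTATIONS)
print(len(demo))  # original + one augmented copy
```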

## Dataset Mix Config (`nemotron_ptv3_datasets.yaml`)

Edit this file to add, remove, or re-weight datasets without touching the script:

```yaml
datasets:
  - repo_id: nvidia/Nemotron-Math-v2
    splits: [high_part00, high_part01]
    cap_per_split: 200000
    augment: true

  - repo_id: nvidia/OpenMathReasoning-mini
    splits: [train]
    augment: false  # multilingual — skip language-redirect augmentation
```
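
Consumed in Python, the parsed config might drive a loop like the sketch below. Assumptions are labeled: `mix` mirrors what `yaml.safe_load` would return for a config like the one above, `load_split` is a hypothetical stand-in for `datasets.load_dataset`, and `cap_per_split` is read here as a per-split row cap:

```python
mix = {
    "datasets": [
        {"repo_id": "nvidia/Nemotron-Math-v2",
         "splits": ["high_part00", "high_part01"],
         "cap_per_split": 2,   # tiny cap for the demo
         "augment": True},
    ]
}

def build_mix(config, load_split):
    """Concatenate the requested splits of each entry, capping rows per split
    (a sketch; the real script's capping semantics may differ)."""
    mixed = []
    for entry in config["datasets"]:
        cap = entry.get("cap_per_split")
        for split in entry["splits"]:
            rows = load_split(entry["repo_id"], split)
            if cap is not None:
                rows = rows[:cap]
            mixed.extend(rows)
    return mixed

# Stub loader returning five placeholder rows per split.
fake_load = lambda repo_id, split: [f"{split}:{i}" for i in range(5)]
print(build_mix(mix, fake_load))
```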

## Output Format

Every output row is a JSONL object with a single `messages` key:

```json
{"messages": [
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": "What is 2+2?"},
  {"role": "assistant", "content": "4"}
]}
```

In `generate` mode, assistant turns are stripped so the row ends with a user turn.
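
A quick sanity check on output rows might look like this (illustrative; `check_row` is a name invented here, not part of these scripts):

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def check_row(line, mode="train"):
    """Validate one JSONL line against the format described above."""
    row = json.loads(line)
    msgs = row["messages"]
    assert msgs, "empty conversation"
    assert all(m["role"] in VALID_ROLES for m in msgs), "unexpected role"
    if mode == "generate":
        # generate-mode rows must end on a user turn
        assert msgs[-1]["role"] == "user", "row does not end with a user turn"
    return row

row = check_row('{"messages": [{"role": "user", "content": "What is 2+2?"}]}',
                mode="generate")
print(row["messages"][-1]["role"])
```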

examples/dataset/nemotron_ptv3_datasets.yaml

Lines changed: 1 addition & 2 deletions

```diff
@@ -1,7 +1,6 @@
 # Dataset mix for make_nemotron_ptv3_dataset.py
 #
-# Each entry defines one HuggingFace dataset to include in the speculative-decoding
-# training mix. Fields:
+# Each entry defines one HuggingFace dataset to include in the training mix. Fields:
 #
 # repo_id — HuggingFace repo ID or local path.
 # splits — List of splits to load. All splits are concatenated before capping.
```
