# Dataset Preparation Scripts

Utilities for building conversation datasets from NVIDIA Nemotron Post-Training
collections and other HuggingFace sources. These scripts produce datasets in
**standard OpenAI chat format** (`{"messages": [{"role": ..., "content": ...}]}`)
and can be used for any downstream fine-tuning task: SFT, distillation,
speculative decoding draft-model training, etc.

## Files

| File | Description |
|---|---|
| `make_nemotron_ptv3_dataset.py` | Build a dataset from the [Nemotron PT v3 collection](https://huggingface.co/collections/nvidia/nemotron-post-training-v3) using a configurable YAML mix |
| `make_nemotron_ptv2_dataset.py` | Build a dataset from [Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2) |
| `make_dataset.py` | General-purpose mixer for arbitrary HuggingFace datasets (mtbench, sharegpt, ultrachat, magpie, etc.) |
| `conversation_utils.py` | Shared utilities: augmentation, role normalization, assistant-turn stripping |
| `add_nemotron_chat.py` | Add Nemotron v2 chat conversations to an existing dataset |
| `augmentations.yaml` | Augmentation variants (language redirects, style hints) for `make_nemotron_pt*.py` |
| `nemotron_ptv3_datasets.yaml` | Dataset mix config for `make_nemotron_ptv3_dataset.py` |
| `example_data_config.yaml` | Example YAML config for `make_dataset.py` |

## Quick Start

### Install dependencies

```bash
pip install datasets huggingface_hub pyyaml
huggingface-cli login  # required for gated datasets
```

### Build a Nemotron PT v3 dataset

```bash
# Synthetic data generation inputs (strips the last assistant turn so a model can regenerate it)
python make_nemotron_ptv3_dataset.py --output-dir /tmp/ptv3_gen

# Full conversations for direct SFT training
python make_nemotron_ptv3_dataset.py --mode train --output-dir /tmp/ptv3_train

# Use a custom dataset mix
python make_nemotron_ptv3_dataset.py --config my_mix.yaml --output-dir /tmp/ptv3_custom
```

### Build a Nemotron PT v2 dataset

```bash
python make_nemotron_ptv2_dataset.py --output-dir /tmp/ptv2_gen
python make_nemotron_ptv2_dataset.py --mode train --output-dir /tmp/ptv2_train
```

### Build a general-purpose mixed dataset

```bash
python make_dataset.py --config example_data_config.yaml --output-dir /tmp/mixed
```

## Dataset Modes

Both `make_nemotron_pt*.py` scripts support two modes:

| Mode | Description | Use case |
|---|---|---|
| `generate` (default) | Strips assistant turns, optionally augments prompts | Input data for synthetic generation (query a target model to produce training responses) |
| `train` | Keeps all turns, normalizes to clean OpenAI format | Direct SFT / distillation training |
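The core difference between the modes can be sketched as follows. `to_generate_skeleton` is a hypothetical helper for illustration, not a function shipped in these scripts:

```python
import json

def to_generate_skeleton(row):
    """Drop trailing assistant turn(s) so the conversation ends with a
    user message, ready for a target model to regenerate the reply.
    (Hypothetical sketch of what `--mode generate` does per row.)"""
    messages = list(row["messages"])
    while messages and messages[-1]["role"] == "assistant":
        messages.pop()
    return {"messages": messages}

row = {"messages": [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4"},
]}
print(json.dumps(to_generate_skeleton(row)))
# → {"messages": [{"role": "user", "content": "What is 2+2?"}]}
```

In `train` mode, by contrast, the assistant turns are kept and only role names and message structure are normalized.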
## Synthetic Generation Pipeline

The `generate` mode produces conversation skeletons that are fed to a target model
via `tools/launcher/common/query.py` (vLLM or TRT-LLM). The output becomes training
data for a draft model (e.g. EAGLE3 speculative decoding) or a distilled student:

```
make_nemotron_ptv3_dataset.py --mode generate → skeleton.jsonl
        ↓
query.py (target model generates responses turn-by-turn)
        ↓
training data for draft model / student
```

## Augmentations

`augmentations.yaml` defines language-redirect and style-hint variants that are
applied cyclically across the dataset. Each enabled entry produces one augmented
copy of the source rows.

To customize augmentations:
- **Disable** a variant: add `enabled: false`
- **Add** a language redirect: append a `user_suffix` entry
- **Add** a system prompt: append a `system_prompt` entry

```yaml
augmentations:
  - type: user_suffix
    text: " Please reply in French instead of English."
  - type: system_prompt
    content: "You are a helpful assistant."
    enabled: false  # disable without deleting
```
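The cyclic application described above can be sketched like this. `apply_variant` is a hypothetical helper (the real logic lives in `conversation_utils.py` and may differ); the variant shapes mirror the YAML entries, and only enabled variants are assumed to be in play:

```python
import itertools

def apply_variant(row, variant):
    """Produce one augmented copy of a row for the given variant."""
    messages = [dict(m) for m in row["messages"]]
    if variant["type"] == "user_suffix":
        # Append the redirect text to the last user turn.
        for m in reversed(messages):
            if m["role"] == "user":
                m["content"] += variant["text"]
                break
    elif variant["type"] == "system_prompt":
        # Prepend a system message.
        messages.insert(0, {"role": "system", "content": variant["content"]})
    return {"messages": messages}

variants = [
    {"type": "user_suffix", "text": " Please reply in French instead of English."},
    {"type": "system_prompt", "content": "You are a helpful assistant."},
]
rows = [{"messages": [{"role": "user", "content": f"Question {i}"}]} for i in range(3)]
# Cycle variants across rows: row 0 → variant 0, row 1 → variant 1, row 2 → variant 0, ...
augmented = [apply_variant(r, v) for r, v in zip(rows, itertools.cycle(variants))]
```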

## Dataset Mix Config (`nemotron_ptv3_datasets.yaml`)

Edit this file to add, remove, or re-weight datasets without touching the script:

```yaml
datasets:
  - repo_id: nvidia/Nemotron-Math-v2
    splits: [high_part00, high_part01]
    cap_per_split: 200000
    augment: true

  - repo_id: nvidia/OpenMathReasoning-mini
    splits: [train]
    augment: false  # multilingual; skip language-redirect augmentation
```
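A minimal sketch of how such a config can be consumed, assuming the field names shown above (the actual loading logic in `make_nemotron_ptv3_dataset.py` may differ):

```python
import yaml  # pyyaml, installed in Quick Start

# Inline copy of the example config above, for a self-contained demo.
config_text = """
datasets:
  - repo_id: nvidia/Nemotron-Math-v2
    splits: [high_part00, high_part01]
    cap_per_split: 200000
    augment: true
  - repo_id: nvidia/OpenMathReasoning-mini
    splits: [train]
    augment: false
"""

config = yaml.safe_load(config_text)
plan = []
for entry in config["datasets"]:
    cap = entry.get("cap_per_split")  # None means "take the whole split"
    for split in entry["splits"]:
        plan.append((entry["repo_id"], split, cap, entry.get("augment", False)))

for repo_id, split, cap, augment in plan:
    print(f"{repo_id}:{split} cap={cap} augment={augment}")
```

Each `(repo_id, split)` pair would then be passed to `datasets.load_dataset`, truncated to `cap_per_split` rows if a cap is set, and augmented only when `augment` is true.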

## Output Format

Every output row is a JSONL object with a single `messages` key:

```json
{"messages": [
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": "What is 2+2?"},
  {"role": "assistant", "content": "4"}
]}
```

In `generate` mode, assistant turns are stripped so the row ends with a user turn.
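These invariants are easy to check downstream. `validate_row` is a hypothetical validator based only on the format described above, not part of the shipped scripts:

```python
def validate_row(row, mode="train"):
    """Check one parsed JSONL row against the output format invariants."""
    msgs = row.get("messages")
    assert isinstance(msgs, list) and msgs, "row needs a non-empty 'messages' list"
    for m in msgs:
        assert m["role"] in {"system", "user", "assistant"}
        assert isinstance(m["content"], str)
    if mode == "generate":
        # generate-mode rows must end with a user turn for the target model
        assert msgs[-1]["role"] == "user"
    return True

train_row = {"messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4"},
]}
gen_row = {"messages": train_row["messages"][:-1]}
assert validate_row(train_row) and validate_row(gen_row, mode="generate")
```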