
Commit 42b3414

benchislett authored and kevalmorabia97 committed
Refactor: Clean up EAGLE training dataset preparation (#684)
## What does this PR do?

**Type of change:** Refactor

**Overview:**

- Consolidate input dataset preparation into `make_dataset.py`
- Read the dataset mix spec from a YAML file
  - Can now specify how many samples to take from each split
  - Can no longer easily split a dataset into train/test sections. I don't think this feature was very useful to begin with: most datasets are already separated into train/val/test at the split level, and those that aren't will usually be split by the training framework anyway.
- Add support for a few new dataset types: Magpie 300k/500k/1M and the Nemotron post-training dataset v2.

## Usage

See the README for a detailed example.

## Testing

Ran it locally on all dataset modes; it works successfully and the output looks good. Checked that shuffling, conversation IDs, and output contents were all unique and usable.

- **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: Yes
- **Did you write any new necessary tests?**: No
- **Did you add or update any necessary documentation?**: Yes
- **Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**: No

## Summary by CodeRabbit

- **Documentation**
  - Updated speculative decoding example documentation with new dataset references and standardized file paths.
- **New Features**
  - Introduced configuration-driven dataset preparation supporting multiple dataset sources with centralized configuration files.
- **Refactor**
  - Simplified dataset preparation workflow with unified tooling and updated default data paths throughout the training pipeline.

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
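To illustrate the configuration-driven flow this PR describes, here is a minimal, hypothetical sketch of what a YAML dataset mix spec and the per-split sampling behind it could look like. The field names (`datasets`, `name`, `split`, `num_samples`) are illustrative assumptions, not the actual schema of `example_data_config.yaml`; the supported schema is whatever `prepare_input_conversations/make_dataset.py` defines.

```python
# Hypothetical sketch of a YAML-driven dataset mix; field names are assumptions,
# not the real schema consumed by make_dataset.py.
import yaml
from datasets import load_dataset

EXAMPLE_MIX = """
datasets:
  - name: HuggingFaceH4/ultrachat_200k
    split: train_sft
    num_samples: 10000
  - name: nvidia/Daring-Anteater
    split: train
    num_samples: 5000
"""

spec = yaml.safe_load(EXAMPLE_MIX)
conversations = []
for entry in spec["datasets"]:
    ds = load_dataset(entry["name"], split=entry["split"])
    # Shuffle, then take the requested number of samples from this split.
    ds = ds.shuffle(seed=0).select(range(min(entry["num_samples"], len(ds))))
    conversations.extend(ds)
print(f"Collected {len(conversations)} conversations across all sources")
```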
1 parent 00fa5bd commit 42b3414

10 files changed

Lines changed: 585 additions & 600 deletions

examples/speculative_decoding/README.md

Lines changed: 19 additions & 19 deletions
@@ -4,7 +4,7 @@
 
 Speculative decoding accelerates auto-regressive generation in large language models (LLMs) by leveraging a lightweight draft model to predict the next γ tokens. The main LLM then verifies these candidate tokens in a single forward pass. If the draft model correctly predicts α tokens, the LLM can accept and generate α+1 tokens per verification step, significantly improving generation speed.
 
-This folder contains an end-to-end runnable speculative decoding fine‑tuning pipeline in which Llama‑3.2‑1B (Hugging Face) is trained on the Daring‑Anteater dataset.
+This folder contains an end-to-end runnable speculative decoding fine‑tuning pipeline in which Llama‑3.2‑1B (Hugging Face) is trained on the [UltraChat-200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) dataset.
 
 This example focuses on training with Hugging Face. To train with Megatron‑LM, see the [Megatron‑LM example](https://github.com/NVIDIA/Megatron-LM/tree/main/examples/post_training/modelopt).
 
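As a concrete illustration of the accept-then-bonus-token rule in the paragraph above, here is a minimal, framework-agnostic sketch of greedy verification. It assumes greedy decoding and token-level comparison; real implementations operate on batched logits, but the counting argument (α matches yield α+1 emitted tokens) is the same.

```python
# Minimal sketch of greedy speculative-decoding verification (illustrative only).
def verify_draft(draft_tokens: list[int], target_predictions: list[int]) -> list[int]:
    """draft_tokens: the gamma tokens proposed by the draft model.
    target_predictions: gamma + 1 greedy predictions from the target model's
    single verification pass (one per draft position, plus one extra).
    Returns the tokens accepted this step: alpha matching tokens + 1 bonus token.
    """
    accepted = []
    for draft_tok, target_tok in zip(draft_tokens, target_predictions):
        if draft_tok != target_tok:
            break
        accepted.append(draft_tok)
    # The target model's own prediction at the first mismatch (or after all
    # gamma matches) is always emitted, giving alpha + 1 tokens per step.
    accepted.append(target_predictions[len(accepted)])
    return accepted

# Example: gamma = 4 draft tokens, the first 2 match the target model.
print(verify_draft([11, 22, 33, 44], [11, 22, 99, 44, 55]))  # -> [11, 22, 99]
```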
@@ -45,14 +45,16 @@ pip install -r requirements.txt
 
 ### Data Preparation
 
-We use [Daring-Anteater](https://huggingface.co/datasets/nvidia/Daring-Anteater) dataset in this example. Prepare data by:
+We support a range of input datasets. In this example, we will use the [UltraChat-200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) dataset.
 
 ```bash
-python prepare_input_conversations/add_daring_anteater.py
+python prepare_input_conversations/make_dataset.py -f prepare_input_conversations/example_data_config.yaml --full-conversations
 ```
 
 See [other-datasets](#other-datasets) section for other dataset options and instruction for user-provided data.
 
+Omit `--full-conversations` if you plan to run synthetic data generation (see [data-synthesis](#data-synthesis)).
+
 ## Getting Started: Simplified Workflow
 
 ```bash
@@ -62,7 +64,7 @@ bash train_eagle3_and_export.sh --base_model meta-llama/Llama-3.2-1B-Instruct
 This one-line command runs a minimal example workflow of training and exporting an EAGLE draft model in Modelopt. Specifically, it
 
 - Initializes the draft model with [default settings](https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt/torch/speculative/eagle/default_config.py#L18)
-- Fine-tunes the model on the [Daring-Anteater](https://huggingface.co/datasets/nvidia/Daring-Anteater) dataset
+- Fine-tunes the model on the dataset
 - Evaluates the acceptance rate on [MT-Bench](https://huggingface.co/datasets/HuggingFaceH4/mt_bench_prompts)
 - Exports a checkpoint ready for deployment
 
@@ -73,7 +75,7 @@ For small base models that fit in GPU memory, we can collocate them with draft m
 ```bash
 ./launch_train.sh --model $BASE_MODEL \
     --output_dir $OUTPUT_DIR \
-    --data input_conversations/daring-anteater.jsonl \
+    --data input_conversations/train.jsonl \
     --num_epochs $NUM_EPOCH \
     --eagle_config eagle_config.json
 ```
@@ -92,7 +94,7 @@ We support two backends for generating base model hidden states. For better effc
 ```bash
 python collect_hidden_states/compute_hidden_states_trtllm.py \
     --model $BASE_MODEL \
-    --input-file input_conversations/daring-anteater.jsonl \
+    --input-file input_conversations/train.jsonl \
     --output-dir $HIDDEN_STATES_DIR
 ```
 
@@ -103,7 +105,7 @@ Alternatively, you can generate the same hidden states with HF:
 ```bash
 python collect_hidden_states/compute_hidden_states_hf.py \
     --model $BASE_MODEL \
-    --input-file input_conversations/daring-anteater.jsonl \
+    --input-file input_conversations/train.jsonl \
     --output-dir $HIDDEN_STATES_DIR
 ```
 
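For intuition about what the HF backend is doing, hidden-state collection conceptually amounts to running the base model over each conversation with hidden-state outputs enabled and saving the per-token activations for the draft model to train against. The sketch below is a simplified illustration of that idea, not the actual `compute_hidden_states_hf.py` implementation; the output file name and saved fields are assumptions.

```python
# Simplified illustration of collecting base-model hidden states with Hugging Face.
# Not the actual compute_hidden_states_hf.py; the saved format is an assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

messages = [{"role": "user", "content": "Explain speculative decoding in one sentence."}]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt")

with torch.no_grad():
    outputs = model(input_ids, output_hidden_states=True)

# outputs.hidden_states is a tuple with one tensor per layer (plus embeddings),
# each of shape [batch, seq_len, hidden_dim]; EAGLE-style training consumes these.
last_hidden = outputs.hidden_states[-1]
torch.save({"input_ids": input_ids, "hidden_states": last_hidden}, "example_hidden_states.pt")
```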
@@ -199,16 +201,14 @@ See more details on deployment of quantized model to TRTLLM [here](../llm_ptq/RE
 
 ### Other Datasets
 
-In addition to `daring-anteater`, we provide scripts for adding several other commonly used datasets in `prepare_input_conversations`:
+In addition to the default dataset, we support adding several other commonly used datasets in `prepare_input_conversations/make_dataset.py`:
 
-```text
-prepare_input_conversations/
-├── add_daring_anteater.py
-├── add_mtbench.py
-├── add_sharegpt.py
-├── add_ultrachat.py
-└── example_make_prompt_dataset.sh
-```
+- MTBench (for debugging)
+- ShareGPT
+- UltraChat
+- Daring-Anteater
+- Magpie (Full 1M, and 500k and 300k filtered)
+- Nemotron Post-Training Dataset V2
 
 To use your own datasets, please preprocess your data into a `.jsonl` file with each line in the format:
 
@@ -232,10 +232,10 @@ vllm serve meta-llama/Llama-3.2-1B-Instruct --api-key token-abc123 --port 8000
 
 Note: Add `--quantization=modelopt` flag for quantized models.
 
-Then, we generate conversations with the base model using prompts from Daring-Anteater:
+Then, we generate conversations with the base model using the prepared prompts:
 
 ```bash
-python scripts/server_generate.py --data_path input_conversations/daring-anteater.jsonl --output_path synthetic/train.jsonl
+python scripts/server_generate.py --data_path input_conversations/train.jsonl --output_path synthetic/train.jsonl
 ```
 
 To add a system prompt, use the `--system_prompt <system_prompt_text>` argument.
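Conceptually, synthetic generation replays each prepared prompt against the served base model through its OpenAI-compatible endpoint and records the model's reply as the assistant turn. The rough sketch below illustrates that loop only; `scripts/server_generate.py` is the supported tool, and the per-line `conversations` field used here is an assumption about the `.jsonl` layout.

```python
# Rough illustration of prompt replay against the vLLM OpenAI-compatible server
# started above (port 8000, api key "token-abc123"). Not the actual server_generate.py.
import json
import os

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")
os.makedirs("synthetic", exist_ok=True)

with open("input_conversations/train.jsonl") as fin, open("synthetic/train.jsonl", "w") as fout:
    for line in fin:
        record = json.loads(line)
        # Assumption: each record carries a list of chat messages; only the user
        # prompt is kept and the base model supplies the assistant reply.
        prompt = record["conversations"][0]
        reply = client.chat.completions.create(
            model="meta-llama/Llama-3.2-1B-Instruct",
            messages=[prompt],
        )
        record["conversations"] = [
            prompt,
            {"role": "assistant", "content": reply.choices[0].message.content},
        ]
        fout.write(json.dumps(record) + "\n")
```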
@@ -258,7 +258,7 @@ For EAGLE‑1 and EAGLE‑3 we provide a [default model architecture config](htt
 We can optionally use smaller vocab size for the draft model for faster training and inference. E.g. Llama3.2-1B has a vocab size of 128256. In this example, we construct a draft vocab mapping of size 32k by finding the most commonly appeared vocabs in our training set:
 
 ```bash
-python scripts/calibrate_draft_vocab.py --model meta-llama/Llama-3.2-1B-Instruct --data input_conversations/daring-anteater.jsonl --draft_vocab_size 32000 --save_dir draft_vocab_cache
+python scripts/calibrate_draft_vocab.py --model meta-llama/Llama-3.2-1B-Instruct --data input_conversations/train.jsonl --draft_vocab_size 32000 --save_dir draft_vocab_cache
 ```
 
 This will produce a `d2t.pt` file in `save_dir`, which is the mapping from draft token to target token. During inference, draft tokens can be mapped back to target tokens by `target_token = draft_token + d2t[draft_token]`.
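To make the `d2t` idea concrete, here is a small illustrative sketch. It assumes the draft vocab is simply the most frequent target-token ids seen in the training conversations and that `d2t` stores the offset table used by `target_token = draft_token + d2t[draft_token]`; the real table is produced by `scripts/calibrate_draft_vocab.py`.

```python
# Illustrative sketch of a draft-to-target (d2t) vocab mapping; this only mirrors
# the idea behind calibrate_draft_vocab.py, it is not the actual implementation.
from collections import Counter

import torch

def build_d2t(token_ids: list[int], draft_vocab_size: int) -> torch.Tensor:
    # Keep the most frequent target-token ids seen in the training conversations.
    most_common = [tok for tok, _ in Counter(token_ids).most_common(draft_vocab_size)]
    kept = sorted(most_common)
    # d2t[i] is the offset that maps draft id i back to its original target id.
    return torch.tensor([target_id - draft_id for draft_id, target_id in enumerate(kept)])

d2t = build_d2t(token_ids=[5, 5, 9, 128000, 9, 9, 42], draft_vocab_size=3)
draft_token = 1
target_token = draft_token + d2t[draft_token]  # recovers the original vocab id
print(int(target_token))  # 9, since draft id 1 corresponds to target id 9
```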

examples/speculative_decoding/prepare_input_conversations/add_daring_anteater.py

Lines changed: 0 additions & 102 deletions
This file was deleted.

examples/speculative_decoding/prepare_input_conversations/add_mtbench.py

Lines changed: 0 additions & 105 deletions
This file was deleted.
