Refactor: Clean up EAGLE training dataset preparation (#684)
## What does this PR do?
**Type of change:** Refactor
**Overview:**
- Consolidate input dataset preparation into `make_dataset.py`
- Read the dataset mix spec from a YAML file (see the illustrative sketch after this list)
  - Can now specify how many samples to take from each split
  - Can no longer easily split a dataset into train/test sections. I don't think this feature was really useful to begin with. Most datasets can already be separated into train/val/test at the split level, and those that can't will usually be split by the training framework anyway.
- Add support for a few new dataset types: Magpie 300k/500k/1M and the Nemotron post-training dataset v2
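For illustration only, here is a minimal sketch of how a YAML mix spec of this kind could look and be consumed. The field names (`name`, `split`, `num_samples`) are hypothetical and may not match the actual schema read by `make_dataset.py`:

```python
# Illustrative sketch only: the field names below are hypothetical and may not
# match the actual YAML schema consumed by make_dataset.py.
import yaml

spec_text = """
datasets:
  - name: ultrachat_200k
    split: train_sft
    num_samples: 100000
  - name: magpie_300k
    split: train
    num_samples: 50000
"""

mix_spec = yaml.safe_load(spec_text)
for entry in mix_spec["datasets"]:
    # Each entry selects a source dataset, the split to read from, and how many
    # samples to take from that split when building the training mix.
    print(f"{entry['name']}[{entry['split']}]: {entry['num_samples']} samples")
```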
## Usage
See README for detailed example
## Testing
Ran it locally on all dataset modes; it works successfully and the output looks good. Checked that shuffling, conversation IDs, and output contents were all unique and usable.
- **Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)**
and your commits are signed.
- **Is this change backward compatible?**: Yes
- **Did you write any new necessary tests?**: No
- **Did you add or update any necessary documentation?**: Yes
- **Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**:
No
## Summary by CodeRabbit
* **Documentation**
* Updated speculative decoding example documentation with new dataset
references and standardized file paths.
* **New Features**
* Introduced configuration-driven dataset preparation supporting
multiple dataset sources with centralized configuration files.
* **Refactor**
* Simplified dataset preparation workflow with unified tooling and
updated default data paths throughout the training pipeline.
---------
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
examples/speculative_decoding/README.md (19 additions & 19 deletions)
@@ -4,7 +4,7 @@
Speculative decoding accelerates auto-regressive generation in large language models (LLMs) by leveraging a lightweight draft model to predict the next γ tokens. The main LLM then verifies these candidate tokens in a single forward pass. If the draft model correctly predicts α tokens, the LLM can accept and generate α+1 tokens per verification step, significantly improving generation speed.
- This folder contains an end-to-end runnable speculative decoding fine‑tuning pipeline in which Llama‑3.2‑1B (Hugging Face) is trained on the Daring‑Anteater dataset.
+ This folder contains an end-to-end runnable speculative decoding fine‑tuning pipeline in which Llama‑3.2‑1B (Hugging Face) is trained on the [UltraChat-200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) dataset.
This example focuses on training with Hugging Face. To train with Megatron‑LM, see the [Megatron‑LM example](https://github.com/NVIDIA/Megatron-LM/tree/main/examples/post_training/modelopt).
- We use [Daring-Anteater](https://huggingface.co/datasets/nvidia/Daring-Anteater) dataset in this example. Prepare data by:
+ We support a range of input datasets. In this example, we will use the [UltraChat-200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) dataset.
This one-line command runs a minimal example workflow of training and exporting an EAGLE draft model in Modelopt. Specifically, it
- Initializes the draft model with [default settings](https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt/torch/speculative/eagle/default_config.py#L18)
- - Fine-tunes the model on the [Daring-Anteater](https://huggingface.co/datasets/nvidia/Daring-Anteater) dataset
+ - Fine-tunes the model on the dataset
- Evaluates the acceptance rate on [MT-Bench](https://huggingface.co/datasets/HuggingFaceH4/mt_bench_prompts)
- Exports a checkpoint ready for deployment
@@ -73,7 +75,7 @@ For small base models that fit in GPU memory, we can collocate them with draft m
To add a system prompt, use the `--system_prompt <system_prompt_text>` argument.
@@ -258,7 +258,7 @@ For EAGLE‑1 and EAGLE‑3 we provide a [default model architecture config](htt
We can optionally use a smaller vocab size for the draft model for faster training and inference. For example, Llama3.2-1B has a vocab size of 128256. In this example, we construct a draft vocab mapping of size 32k by finding the most frequently occurring tokens in our training set:
This will produce a `d2t.pt` file in `save_dir`, which is the mapping from draft token to target token. During inference, draft tokens can be mapped back to target tokens by `target_token = draft_token + d2t[draft_token]`.
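As a minimal sketch of how this mapping could be applied at inference time, assuming `d2t.pt` stores a 1-D tensor of per-token offsets indexed by draft token id (the exact on-disk format may differ):

```python
# Hedged sketch: assumes d2t.pt holds a 1-D LongTensor of offsets indexed by
# draft-vocab token id, following target_token = draft_token + d2t[draft_token].
import torch

d2t = torch.load("save_dir/d2t.pt")               # draft-to-target offset table
draft_tokens = torch.tensor([5, 120, 3077])       # example draft-vocab token ids
target_tokens = draft_tokens + d2t[draft_tokens]  # map back into the target vocab
print(target_tokens)
```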