96 changes: 31 additions & 65 deletions examples/speculative_decoding/README.md
@@ -73,14 +73,16 @@ This one-line command runs a minimal example workflow of training and exporting
For small base models that fit in GPU memory, we can collocate them with draft models and train with the following command:

```bash
./launch_train.sh \
    --config ../../modelopt_recipes/general/speculative_decoding/eagle3.yaml \
    model.model_name_or_path=meta-llama/Llama-3.2-1B \
    data.data_path=input_conversations/train.jsonl \
    training.output_dir=ckpts/llama-3.2-1b-online
```

All default training settings are in `eagle3.yaml`. You can adjust them by editing the YAML file or by specifying command-line overrides with OmegaConf dotlist arguments. FSDP2 is used by default.

To enable context parallelism for long-context training, add `training.cp_size=<N>`.
The saved ModelOpt checkpoint is architecturally similar to HF models and can be further optimized through **ModelOpt**, e.g., with PTQ and QAT.

## Training Draft Model with Offline Base Model
@@ -113,15 +115,14 @@ python collect_hidden_states/compute_hidden_states_hf.py \

### Train Draft Model with Dumped Hidden States

Once the hidden states are dumped, launch offline training with `data.offline_data_path` pointing to the hidden states directory:

```bash
./launch_train.sh \
    --config ../../modelopt_recipes/general/speculative_decoding/eagle3.yaml \
    model.model_name_or_path=meta-llama/Llama-3.2-1B \
    data.offline_data_path=$HIDDEN_STATES_DIR \
    training.output_dir=ckpts/llama-3.2-1b-offline
```

## Model Validation
Expand Down Expand Up @@ -244,13 +245,13 @@ For large scale data generation, please see [SLURM prepare data](SLURM_prepare_d

### Configuring Draft Model

For EAGLE‑1 and EAGLE‑3 we provide a [default model architecture config](https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt/torch/speculative/config.py#L37) in ModelOpt. You can override default settings via `eagle.eagle_architecture_config` in the YAML. E.g., to use a 2-layer EAGLE head with an 8192 intermediate size:

```yaml
eagle:
  eagle_architecture_config:
    num_hidden_layers: 2
    intermediate_size: 8192
```
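As a sanity check, the override block parses into a plain dict that updates the default architecture config. A minimal sketch, assuming PyYAML is available:

```python
import yaml  # PyYAML, assumed available

# The same override block as above, embedded as a string for the sketch.
yaml_text = """
eagle:
  eagle_architecture_config:
    num_hidden_layers: 2
    intermediate_size: 8192
"""

cfg = yaml.safe_load(yaml_text)
overrides = cfg["eagle"]["eagle_architecture_config"]
print(overrides)  # {'num_hidden_layers': 2, 'intermediate_size': 8192}
```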

### Draft Vocabulary Compression
@@ -263,61 +264,26 @@ python scripts/calibrate_draft_vocab.py --model meta-llama/Llama-3.2-1B-Instruct

This will produce a `d2t.pt` file in `save_dir`, which is the mapping from draft token to target token. During inference, draft tokens can be mapped back to target tokens by `target_token = draft_token + d2t[draft_token]`.
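The offset arithmetic can be sketched with illustrative values (the real offsets come from the calibrated `d2t.pt`):

```python
# Illustrative d2t offsets for a 4-token draft vocab. Real values come from
# calibrate_draft_vocab.py; these numbers are made up for the sketch.
d2t = [0, 3, 7, 12]

# Map each draft token id back into the target vocabulary.
draft_tokens = [0, 1, 2, 3]
target_tokens = [t + d2t[t] for t in draft_tokens]
print(target_tokens)  # [0, 4, 9, 15]
```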

Then, set `eagle_architecture_config.draft_vocab_size: 32000` and `data.draft_vocab_cache: <path_to_d2t.pt>` in your YAML. The draft model will use the provided vocab table during training and export.

### Interact with `modelopt.torch.speculative`

`main.py` provides a complete example for converting a HF base model for speculative decoding and training it. The core steps are loading the base model, converting it with an eagle config dict, and training with HF Trainer:

```python
import modelopt.torch.speculative as mtsp

# Convert base model in-place to an EAGLE speculative decoding model
eagle_cfg = {"eagle_decoder_type": "llama", ...}  # fields from EagleConfig
mtsp.convert(model, [("eagle", eagle_cfg)])

# Train with HF Trainer as usual
trainer = transformers.Trainer(model=model, ...)
trainer.train()
trainer.save_model("<output_dir>")
```

See `main.py` for the full example including tokenizer setup, dataset loading, and checkpoint handling.

## Support Matrix

2 changes: 0 additions & 2 deletions examples/speculative_decoding/eagle_config.json

This file was deleted.

6 changes: 4 additions & 2 deletions examples/speculative_decoding/eagle_utils.py
@@ -44,7 +44,9 @@

try:
    import wandb

    wandb.log  # Verify wandb is functional (not a stub module).
except (ImportError, AttributeError):
    wandb = None
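The same guarded-import pattern can be exercised with a deliberately absent module (the module name below is hypothetical), showing how the fallback branch leaves the name bound to `None`:

```python
# Sketch of the guarded-import pattern above; "totally_absent_tracker" is a
# made-up module name, so the ImportError fallback is taken.
try:
    import totally_absent_tracker as tracker

    tracker.log  # verify the module is functional, not a stub
except (ImportError, AttributeError):
    tracker = None

# Call sites then guard on the name instead of re-trying the import.
if tracker is not None:
    tracker.log({"loss": 0.0})
print(tracker)  # None
```

Catching `AttributeError` as well handles the case where a stub or namespace package imports successfully but exposes no `log` attribute.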

IGNORE_TOKEN_ID = LabelSmoother.ignore_index
@@ -194,7 +196,7 @@ class EagleTrainingPlot(TrainerCallback):

    def __init__(self, ar_validate_steps: int = 1000, estimate_ar: bool = False):
        self.ar_validate_steps = ar_validate_steps
        if hasattr(wandb, "init") and is_master():
            wandb.init()
        self.estimate_ar = estimate_ar

1 change: 0 additions & 1 deletion examples/speculative_decoding/fsdp_config.json

This file was deleted.
