Merged
47 commits
4ce7c47
add: DFlash block diffusion speculative decoding
ChenhanYu Apr 8, 2026
e236e34
refactor: simplify DFlash
ChenhanYu Apr 8, 2026
5745b39
refactor: simplify DFlash implementation and training pipeline
ChenhanYu Apr 9, 2026
68bbd03
fix: revert DFlash-specific changes, add sliding window docs
ChenhanYu Apr 9, 2026
2983baf
refactor: consolidate docs, simplify eval, online GT as default
ChenhanYu Apr 9, 2026
db79f33
add: export, PTQ, and vLLM scripts; document open items
ChenhanYu Apr 9, 2026
6be103e
add: vLLM DFlash deployment script + validated 386 tok/s
ChenhanYu Apr 9, 2026
b0af708
fix: vllm_serve.sh - add TP_SIZE, fix prompt parsing, 1024 max tokens
ChenhanYu Apr 9, 2026
749c155
remove: HTML results page (lives in nmm-sandbox, not Model-Optimizer)
ChenhanYu Apr 9, 2026
6e31e5b
update: DFlash unit tests for _apply meta fix, sliding window, mask_t…
ChenhanYu Apr 9, 2026
371b1e7
update: dflash.md with AR evaluation methods and vLLM per-category re…
ChenhanYu Apr 9, 2026
0a26031
simplify: remove fixed GT vs online GT discussion from dflash.md
ChenhanYu Apr 9, 2026
14f02b1
address PR review: clarify KV injection, anchors, decay, offline, vLLM
ChenhanYu Apr 9, 2026
b722fb5
update: categorize open items by implementation status in dflash.md
ChenhanYu Apr 9, 2026
ac017d0
fix: address CodeRabbit review comments
ChenhanYu Apr 9, 2026
da8a26d
add: illustrated anchor sampling example in dflash.md
ChenhanYu Apr 9, 2026
7c7f24e
clarify: architecture descriptions per reviewer feedback
ChenhanYu Apr 9, 2026
268b84b
add: token-level KV injection illustration in dflash.md
ChenhanYu Apr 9, 2026
9d406de
add: training vs inference with attention mask visualization
ChenhanYu Apr 9, 2026
ce9fc87
rename: AR Evaluation → HuggingFace AR Evaluation
ChenhanYu Apr 9, 2026
e8101a7
address PR review feedback
ChenhanYu Apr 10, 2026
9d45a61
address PR review: chat template, rope, export, vLLM improvements
ChenhanYu Apr 11, 2026
e08d18d
address PR review: round 3
ChenhanYu Apr 11, 2026
f23381f
add Qwen3.5-4B chat template, vLLM smoke test, deployment notes
ChenhanYu Apr 11, 2026
0064b0f
fix: code quality - inline short function call
ChenhanYu Apr 13, 2026
abbf15c
fix: revert local_files_only, lazy rotary init, loss mask simplification
ChenhanYu Apr 13, 2026
31a2934
fix: GPU tests use dflash_mask_token_id (moved from architecture_config)
ChenhanYu Apr 13, 2026
d9c8ef9
fix: support conversations field as fallback for messages
ChenhanYu Apr 13, 2026
61b926e
fix: convert EAGLE3 test data from conversations to messages format
ChenhanYu Apr 13, 2026
0c8244a
fix: use static test data instead of downloading daring-anteater
ChenhanYu Apr 14, 2026
ffc5fc5
standardize on OpenAI messages format throughout pipeline
ChenhanYu Apr 14, 2026
3539b8e
deprecate conversations field with one-time warning
ChenhanYu Apr 14, 2026
3147f82
remove duplicate chat template from modelopt_recipes
ChenhanYu Apr 14, 2026
1482893
fix: add conversations alias in test data for backward compatibility
ChenhanYu Apr 14, 2026
05f6487
fix: add missing json import in conftest (needed by tiny_conversation…
ChenhanYu Apr 14, 2026
20a491c
fix: rename make_eagle_supervised_data_module in offline test
ChenhanYu Apr 14, 2026
7568408
fix: pass chat template explicitly in generation tag tests
ChenhanYu Apr 14, 2026
59746f0
fix: match generation tag variants, revert uv.lock
ChenhanYu Apr 14, 2026
3a64a4c
fix: generation tag verification and test assertions
ChenhanYu Apr 14, 2026
f7bb9ad
fix: revert retries passthrough in core.py (breaks launcher unit test)
ChenhanYu Apr 14, 2026
dddfc6a
add: synthetic dataset, regression test, Qwen3-0.6B example
ChenhanYu Apr 14, 2026
2e88d22
fix: support messages field in calibrate_draft_vocab and compute_hidd…
ChenhanYu Apr 14, 2026
2904b16
fix: Qwen3.5-4B does not need trust_remote_code
ChenhanYu Apr 14, 2026
2f474aa
trigger CI with run-tests label
ChenhanYu Apr 14, 2026
8ab1b25
add: gpu_regression test suite with per-push CI
ChenhanYu Apr 14, 2026
760fc97
fix: export review feedback - dtype mismatch, default, validation
ChenhanYu Apr 14, 2026
2366ffc
retrigger CI
ChenhanYu Apr 14, 2026
8 changes: 8 additions & 0 deletions .github/workflows/gpu_tests.yml
@@ -42,6 +42,11 @@ jobs:
.github/workflows/gpu_tests.yml
modelopt/**
tests/gpu/**
tests/gpu_regression/**
examples/speculative_decoding/**
examples/dataset/**
modelopt_recipes/general/speculative_decoding/**
tools/launcher/**
pyproject.toml
tox.ini
fail_on_initial_diff_error: true
@@ -66,6 +71,9 @@
timeout: 45
container_image: pytorch:26.01-py3
# tests/gpu/_extensions/test_onnx_extensions.py fails for newer containers until https://github.com/tbenthompson/cppimport/pull/98
- example: gpu-regression
timeout: 15
container_image: pytorch:26.01-py3
- example: gpu-megatron
timeout: 45
container_image: pytorch:26.01-py3
13 changes: 13 additions & 0 deletions examples/dataset/README.md
@@ -219,3 +219,16 @@ python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--workers 32 \
--reasoning_content inline
```

## Synthetic Test Dataset

`synthetic_conversations_1k.jsonl` is a 1,000-sample dataset in OpenAI messages format
(900 single-turn + 100 two-turn conversations) covering writing, reasoning, math, coding,
STEM, extraction, humanities, and roleplay categories.

This dataset was synthesized by Claude (Anthropic) and is licensed under Apache-2.0.
It is intended for testing and CI regression — not for production training.

```json
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
```
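A consumer of this dataset can sanity-check each line against the messages schema before training. The sketch below (a hypothetical helper, not part of this PR) validates one JSONL line:

```python
import json


def validate_messages_line(line: str) -> bool:
    """Check that a JSONL line follows the OpenAI messages format."""
    entry = json.loads(line)
    messages = entry.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    allowed_roles = {"system", "user", "assistant"}
    return all(
        isinstance(m, dict)
        and m.get("role") in allowed_roles
        and isinstance(m.get("content"), str)
        for m in messages
    )


sample = '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]}'
print(validate_messages_line(sample))  # True
```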
5 changes: 4 additions & 1 deletion examples/dataset/make_dataset.py
@@ -522,7 +522,10 @@ async def main(args: argparse.Namespace) -> None:
)
if "conversation_id" not in entry:
entry["conversation_id"] = id_for_conversation(entry["conversations"])
f.write(json.dumps(entry, ensure_ascii=False) + "\n")
# Output in OpenAI messages format (rename conversations → messages)
output_entry = {k: v for k, v in entry.items() if k != "conversations"}
output_entry["messages"] = entry["conversations"]
f.write(json.dumps(output_entry, ensure_ascii=False) + "\n")


if __name__ == "__main__":
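The rename in the hunk above can be exercised in isolation. This standalone sketch mirrors the diff's dict-comprehension logic (the function name is assumed, not from the PR):

```python
import json


def to_messages_format(entry: dict) -> dict:
    """Rename the legacy 'conversations' key to 'messages', keeping all other keys.

    Standalone sketch mirroring the make_dataset.py hunk above.
    """
    output_entry = {k: v for k, v in entry.items() if k != "conversations"}
    output_entry["messages"] = entry["conversations"]
    return output_entry


entry = {"conversation_id": "abc", "conversations": [{"role": "user", "content": "Hi"}]}
print(json.dumps(to_messages_format(entry), ensure_ascii=False))
```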
1,000 changes: 1,000 additions & 0 deletions examples/dataset/synthetic_conversations_1k.jsonl

Large diffs are not rendered by default.

46 changes: 44 additions & 2 deletions examples/speculative_decoding/README.md
@@ -217,8 +217,7 @@ To use your own datasets, please preprocess your data into a `.jsonl` file with

```json
{
"conversation_id": <unique id>,
"conversations": [{"role":<user or assistant>, "content":<content>}]
"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]
}
```

@@ -350,3 +349,46 @@ More models coming soon!
- 💡 [Release Notes](https://nvidia.github.io/Model-Optimizer/reference/0_changelog.html)
- 🐛 [File a bug](https://github.com/NVIDIA/Model-Optimizer/issues/new?template=1_bug_report.md)
- ✨ [File a Feature Request](https://github.com/NVIDIA/Model-Optimizer/issues/new?template=2_feature_request.md)

## DFlash (Block Diffusion for Speculative Decoding)

DFlash is a parallel speculative decoding method based on [Block Diffusion](https://arxiv.org/abs/2602.06036).
Unlike autoregressive draft models such as EAGLE3, DFlash predicts an entire block of tokens in a single forward pass
using masked parallel prediction with KV injection from the target model's hidden states.

### Quick Start

For a complete end-to-end example (training + evaluation), see the
[launcher example](../../tools/launcher/examples/Qwen/Qwen3-8B/hf_online_dflash.yaml):

```bash
uv run launch.py --yaml examples/Qwen/Qwen3-8B/hf_online_dflash.yaml --yes
```

### Key Configuration ([dflash.yaml](../../modelopt_recipes/general/speculative_decoding/dflash.yaml))

| Field | Default | Description |
|-------|---------|-------------|
| `dflash.dflash_block_size` | 8 | Block size for parallel prediction |
| `dflash.dflash_num_anchors` | 512 | Number of anchor positions per sample |
| `dflash.dflash_loss_decay_factor` | 4.0 | Exponential decay gamma (0 disables) |
| `dflash.dflash_self_logit_distillation` | true | Use logit distillation from target |
| `dflash.dflash_architecture_config.num_hidden_layers` | 5 | Draft decoder layers |
| `dflash.dflash_architecture_config.mask_token_id` | auto | Token ID for masked positions |
| `training.answer_only_loss` | false | Mask loss on non-assistant tokens |

Qwen3 sliding-window attention is supported automatically: draft layers inherit
`layer_types` and `sliding_window` from the config, matching the target model's
attention pattern.
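Putting the table together, a recipe override might look like the fragment below (illustrative values taken from the defaults column; the exact file layout of `dflash.yaml` is an assumption):

```yaml
# Illustrative DFlash recipe fragment (values from the defaults table above)
dflash:
  dflash_block_size: 8
  dflash_num_anchors: 512
  dflash_loss_decay_factor: 4.0
  dflash_self_logit_distillation: true
  dflash_architecture_config:
    num_hidden_layers: 5
training:
  answer_only_loss: false
```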

### Export

```bash
python scripts/export_hf_checkpoint.py \
--model_path /path/to/training/output \
--export_path /path/to/exported/model
```

### Results

See [doc/dflash.md](doc/dflash.md) for design details, benchmark results, and open items.
@@ -201,7 +201,7 @@ async def submit_generates():
for entry in dataset:
conversation_id = entry.get("conversation_id", entry.get("uuid"))

conversations = entry["conversations"]
conversations = entry.get("messages") or entry["conversations"]
if not conversations or not isinstance(conversations, list):
num_invalid += 1
continue
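The one-line fallback in the hunk above prefers the new `messages` key and only falls back to the legacy `conversations` key. A standalone sketch of that behavior (function name assumed; the real loop counts invalid entries and continues rather than raising):

```python
def get_conversations(entry: dict) -> list:
    """Prefer the new 'messages' key, falling back to legacy 'conversations'.

    Sketch of the one-line change in the hunk above.
    """
    conversations = entry.get("messages") or entry["conversations"]
    if not conversations or not isinstance(conversations, list):
        raise ValueError("entry has no valid conversation list")
    return conversations


print(get_conversations({"messages": [{"role": "user", "content": "Hi"}]}))
```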