@@ -217,8 +217,7 @@ To use your own datasets, please preprocess your data into a `.jsonl` file with
 
 ```json
 {
-    "conversation_id": <unique id>,
-    "conversations": [{"role": <user or assistant>, "content": <content>}]
+    "messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]
 }
 ```
 
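The hunk above replaces the `conversation_id` / `conversations` record with a single `messages` list. For existing `.jsonl` files in the old layout, a migration could look roughly like the sketch below. The helper names `to_messages_record` and `convert_jsonl` are hypothetical and not part of the repository. The sketch assumes `conversation_id` can simply be dropped and that only the `user` and `assistant` roles occur; neither is confirmed by the diff.

```python
import json


def to_messages_record(old_record: dict) -> dict:
    """Map an old-style record to the new {"messages": [...]} layout.

    Assumption: conversation_id is not needed downstream and is dropped.
    """
    messages = []
    for turn in old_record["conversations"]:
        role = turn["role"]
        # Assumption: only the two roles shown in the README example are valid.
        if role not in ("user", "assistant"):
            raise ValueError(f"unexpected role: {role!r}")
        messages.append({"role": role, "content": turn["content"]})
    return {"messages": messages}


def convert_jsonl(src_path: str, dst_path: str) -> int:
    """Convert a .jsonl file one record per line; return the number written."""
    count = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            record = to_messages_record(json.loads(line))
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Calling `convert_jsonl("old.jsonl", "new.jsonl")` (placeholder paths) would write one `messages` record per input line. If conversation IDs matter for your pipeline, keep them in a separate field rather than discarding them.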
@@ -350,3 +349,46 @@ More models coming soon!
 - 💡 [Release Notes](https://nvidia.github.io/Model-Optimizer/reference/0_changelog.html)
 - 🐛 [File a bug](https://github.com/NVIDIA/Model-Optimizer/issues/new?template=1_bug_report.md)
 - ✨ [File a Feature Request](https://github.com/NVIDIA/Model-Optimizer/issues/new?template=2_feature_request.md)
+
+## DFlash (Block Diffusion for Speculative Decoding)
+
+DFlash is a parallel speculative decoding method based on [Block Diffusion](https://arxiv.org/abs/2602.06036).
+Unlike autoregressive draft models (EAGLE3), DFlash predicts an entire block of tokens in a single forward pass
+using masked parallel prediction with KV injection from the target model's hidden states.
+
+### Quick Start
+
+For a complete end-to-end example (training + evaluation), see the
+[launcher example](../../tools/launcher/examples/Qwen/Qwen3-8B/hf_online_dflash.yaml):
+
+```bash
+uv run launch.py --yaml examples/Qwen/Qwen3-8B/hf_online_dflash.yaml --yes
+```
+
+### Key Configuration ([dflash.yaml](../../modelopt_recipes/general/speculative_decoding/dflash.yaml))
+
+| Field | Default | Description |
+|-------|---------|-------------|
+| `dflash.dflash_block_size` | 8 | Block size for parallel prediction |
+| `dflash.dflash_num_anchors` | 512 | Number of anchor positions per sample |
+| `dflash.dflash_loss_decay_factor` | 4.0 | Exponential decay gamma (0 disables) |
+| `dflash.dflash_self_logit_distillation` | true | Use logit distillation from target |
+| `dflash.dflash_architecture_config.num_hidden_layers` | 5 | Draft decoder layers |
+| `dflash.dflash_architecture_config.mask_token_id` | auto | Token ID for masked positions |
+| `training.answer_only_loss` | false | Mask loss on non-assistant tokens |
+
+Qwen3 sliding window attention is automatically supported: draft layers inherit
+`layer_types` and `sliding_window` from the config, matching the target model's
+attention pattern.
+
+### Export
+
+```bash
+python scripts/export_hf_checkpoint.py \
+    --model_path /path/to/training/output \
+    --export_path /path/to/exported/model
+```
+
+### Results
+
+See [doc/dflash.md](doc/dflash.md) for design details, benchmark results, and open items.
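To make the dotted field names in the Key Configuration table concrete, the sketch below expands them into a nested dictionary and applies overrides. It is an illustration only: the defaults are copied from the table, but the nesting is inferred from the dotted names, and the real `dflash.yaml` may be laid out differently. `DFLASH_DEFAULTS`, `nest`, and `build_config` are hypothetical names, not part of Model-Optimizer.

```python
from __future__ import annotations

from typing import Any

# Defaults copied from the Key Configuration table. "auto" is kept as a literal
# string because the table does not say how it is represented in the YAML.
DFLASH_DEFAULTS: dict[str, Any] = {
    "dflash.dflash_block_size": 8,
    "dflash.dflash_num_anchors": 512,
    "dflash.dflash_loss_decay_factor": 4.0,
    "dflash.dflash_self_logit_distillation": True,
    "dflash.dflash_architecture_config.num_hidden_layers": 5,
    "dflash.dflash_architecture_config.mask_token_id": "auto",
    "training.answer_only_loss": False,
}


def nest(flat: dict[str, Any]) -> dict[str, Any]:
    """Expand dotted keys ("a.b.c") into nested dictionaries."""
    nested: dict[str, Any] = {}
    for dotted, value in flat.items():
        *parents, leaf = dotted.split(".")
        node = nested
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return nested


def build_config(overrides: dict[str, Any] | None = None) -> dict[str, Any]:
    """Apply dotted-key overrides to the table defaults; reject unknown fields."""
    flat = dict(DFLASH_DEFAULTS)
    for key, value in (overrides or {}).items():
        if key not in flat:
            raise KeyError(f"unknown field: {key}")
        flat[key] = value
    return nest(flat)
```

For example, `build_config({"dflash.dflash_block_size": 16})` returns a nested dictionary in which only the block size differs from the table defaults. Rejecting unknown keys is a design choice of this sketch, meant to catch typos in field names early.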