diff --git a/.github/workflows/gpu_tests.yml b/.github/workflows/gpu_tests.yml
Lines changed: 8 additions & 0 deletions
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
Lines changed: 6 additions & 0 deletions
diff --git a/examples/dataset/README.md b/examples/dataset/README.md
Lines changed: 13 additions & 0 deletions
diff --git a/examples/dataset/make_dataset.py b/examples/dataset/make_dataset.py
Lines changed: 4 additions & 1 deletion
diff --git a/examples/dataset/synthetic_conversations_1k.jsonl b/examples/dataset/synthetic_conversations_1k.jsonl
Lines changed: 1000 additions & 0 deletions
diff --git a/examples/speculative_decoding/README.md b/examples/speculative_decoding/README.md
Lines changed: 44 additions & 2 deletions
diff --git a/examples/speculative_decoding/collect_hidden_states/compute_hidden_states_hf.py b/examples/speculative_decoding/collect_hidden_states/compute_hidden_states_hf.py
Lines changed: 1 addition & 1 deletion
@@ -42,6 +42,11 @@ jobs:
             .github/workflows/gpu_tests.yml
             modelopt/**
             tests/gpu/**
+            tests/gpu_regression/**
+            examples/speculative_decoding/**
+            examples/dataset/**
+            modelopt_recipes/general/speculative_decoding/**
+            tools/launcher/**
             pyproject.toml
             tox.ini
           fail_on_initial_diff_error: true
@@ -66,6 +71,9 @@ jobs:
             timeout: 45
             container_image: pytorch:26.01-py3
             # tests/gpu/_extensions/test_onnx_extensions.py fails for newer containers until https://github.com/tbenthompson/cppimport/pull/98
+          - example: gpu-regression
+            timeout: 15
+            container_image: pytorch:26.01-py3
           - example: gpu-megatron
             timeout: 45
             container_image: pytorch:26.01-py3
 
@@ -57,6 +57,12 @@ repos:
 
   - repo: local
     hooks:
+      - id: normalize-yaml-ext
+        name: normalize .yml to .yaml in required places, right now only yaml files in modelopt_recipes
+        entry: python tools/precommit/normalize_yaml_ext.py
+        language: system
+        files: ^modelopt_recipes/.*\.yml$
+
       - id: check-modelopt-recipes
         name: validate modelopt recipes
         entry: python tools/precommit/check_modelopt_recipes.py
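
Reviewer note: the `normalize-yaml-ext` hook above points at `tools/precommit/normalize_yaml_ext.py`, which this diff does not show. The following is only a hypothetical sketch of what such a script might do, assuming pre-commit passes the matched `.yml` paths as command-line arguments. The function names and the choice to skip existing targets are assumptions, not the repository's implementation.

```python
from pathlib import Path


def normalize_yaml_ext(paths: list[str]) -> list[Path]:
    """Rename each existing *.yml file to *.yaml and return the new paths."""
    renamed = []
    for name in paths:
        src = Path(name)
        # Ignore anything that is not an existing .yml file.
        if src.suffix != ".yml" or not src.is_file():
            continue
        dst = src.with_suffix(".yaml")
        # Assumption: never overwrite a .yaml file that is already there.
        if dst.exists():
            continue
        src.rename(dst)
        renamed.append(dst)
    return renamed


def main(argv: list[str]) -> int:
    """Return 1 when any file was renamed so the hook run is reported as failed."""
    renamed = normalize_yaml_ext(argv)
    for path in renamed:
        print(f"renamed to {path}")
    return 1 if renamed else 0
```

A script built this way would end with `sys.exit(main(sys.argv[1:]))`; returning a non-zero status after modifying files follows the usual pre-commit convention of asking the author to re-stage the result.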
 
@@ -219,3 +219,16 @@ python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
     --workers 32 \
     --reasoning_content inline
 ```
+
+## Synthetic Test Dataset
+
+`synthetic_conversations_1k.jsonl` is a 1,000-sample dataset in OpenAI messages format
+(900 single-turn + 100 two-turn conversations) covering writing, reasoning, math, coding,
+STEM, extraction, humanities, and roleplay categories.
+
+This dataset was synthesized by Claude (Anthropic) and is licensed under Apache-2.0.
+It is intended for testing and CI regression — not for production training.
+
+```json
+{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
+```
@@ -522,7 +522,10 @@ async def main(args: argparse.Namespace) -> None:
                 )
                 if "conversation_id" not in entry:
                     entry["conversation_id"] = id_for_conversation(entry["conversations"])
-                f.write(json.dumps(entry, ensure_ascii=False) + "\n")
+                # Output in OpenAI messages format (rename conversations → messages)
+                output_entry = {k: v for k, v in entry.items() if k != "conversations"}
+                output_entry["messages"] = entry["conversations"]
+                f.write(json.dumps(output_entry, ensure_ascii=False) + "\n")
 
 
 if __name__ == "__main__":
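
Reviewer note: the rename in the hunk above can be tried in isolation. This is a minimal sketch that wraps the same two lines in a helper; `to_messages_format` and the sample record are hypothetical names used only for illustration.

```python
import json


def to_messages_format(entry: dict) -> dict:
    """Return a copy of entry with the "conversations" key renamed to "messages"."""
    # Copy every field except the old key, then attach the turns under the new key.
    output_entry = {k: v for k, v in entry.items() if k != "conversations"}
    output_entry["messages"] = entry["conversations"]
    return output_entry


# Hypothetical record in the old layout.
entry = {
    "conversation_id": "demo-0",
    "conversations": [
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello"},
    ],
}

# One JSONL line in the new layout; the input dict is left unmodified.
line = json.dumps(to_messages_format(entry), ensure_ascii=False)
```

Other fields such as `conversation_id` are carried over unchanged, so only the key that holds the turns differs between the two layouts.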
 
@@ -217,8 +217,7 @@ To use your own datasets, please preprocess your data into a `.jsonl` file with
 
 ```json
 {
-    "conversation_id": <unique id>,
-    "conversations": [{"role":<user or assistant>, "content":<content>}]
+    "messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]
 }
 ```
 
@@ -350,3 +349,46 @@ More models coming soon!
 - 💡 [Release Notes](https://nvidia.github.io/Model-Optimizer/reference/0_changelog.html)
 - 🐛 [File a bug](https://github.com/NVIDIA/Model-Optimizer/issues/new?template=1_bug_report.md)
 - ✨ [File a Feature Request](https://github.com/NVIDIA/Model-Optimizer/issues/new?template=2_feature_request.md)
+
+## DFlash (Block Diffusion for Speculative Decoding)
+
+DFlash is a parallel speculative decoding method based on [Block Diffusion](https://arxiv.org/abs/2602.06036).
+Unlike autoregressive draft models (EAGLE3), DFlash predicts an entire block of tokens in a single forward pass
+using masked parallel prediction with KV injection from the target model's hidden states.
+
+### Quick Start
+
+For a complete end-to-end example (training + evaluation), see the
+[launcher example](../../tools/launcher/examples/Qwen/Qwen3-8B/hf_online_dflash.yaml):
+
+```bash
+uv run launch.py --yaml examples/Qwen/Qwen3-8B/hf_online_dflash.yaml --yes
+```
+
+### Key Configuration ([dflash.yaml](../../modelopt_recipes/general/speculative_decoding/dflash.yaml))
+
+| Field | Default | Description |
+|-------|---------|-------------|
+| `dflash.dflash_block_size` | 8 | Block size for parallel prediction |
+| `dflash.dflash_num_anchors` | 512 | Number of anchor positions per sample |
+| `dflash.dflash_loss_decay_factor` | 4.0 | Exponential decay gamma (0 disables) |
+| `dflash.dflash_self_logit_distillation` | true | Use logit distillation from target |
+| `dflash.dflash_architecture_config.num_hidden_layers` | 5 | Draft decoder layers |
+| `dflash.dflash_architecture_config.mask_token_id` | auto | Token ID for masked positions |
+| `training.answer_only_loss` | false | Mask loss on non-assistant tokens |
+
+Qwen3 sliding window attention is automatically supported — draft layers inherit
+`layer_types` and `sliding_window` from the config, matching the target model's
+attention pattern.
+
+### Export
+
+```bash
+python scripts/export_hf_checkpoint.py \
+    --model_path /path/to/training/output \
+    --export_path /path/to/exported/model
+```
+
+### Results
+
+See [doc/dflash.md](doc/dflash.md) for design details, benchmark results, and open items.
@@ -201,7 +201,7 @@ async def submit_generates():
         for entry in dataset:
             conversation_id = entry.get("conversation_id", entry.get("uuid"))
 
-            conversations = entry["conversations"]
+            conversations = entry.get("messages") or entry["conversations"]
             if not conversations or not isinstance(conversations, list):
                 num_invalid += 1
                 continue
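
Reviewer note: as written in the hunk above, `entry.get("messages") or entry["conversations"]` raises `KeyError` when a record has neither key, or an empty `messages` list and no `conversations`, so such records never reach the `num_invalid` branch. A possible variation, sketched here under the assumption that these records should be counted as invalid instead, uses `.get` for both keys. The helper name `get_turns` is hypothetical.

```python
from typing import Optional


def get_turns(entry: dict) -> Optional[list]:
    """Return the conversation turns from either layout, or None if unusable."""
    # Prefer the new "messages" key and fall back to the legacy "conversations" key.
    turns = entry.get("messages") or entry.get("conversations")
    if not turns or not isinstance(turns, list):
        return None
    return turns


turns_new = get_turns({"messages": [{"role": "user", "content": "Hi"}]})
turns_old = get_turns({"conversations": [{"role": "user", "content": "Hi"}]})
turns_missing = get_turns({"conversation_id": "demo-1"})
```

With this shape the caller can increment `num_invalid` and `continue` whenever the helper returns `None`, which keeps one malformed record from aborting the whole run.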