Commit 5f8d004

ChenhanYu and claude committed
add: DFlash block diffusion speculative decoding
DFlash (Block Diffusion for Flash Speculative Decoding) predicts an entire block of tokens in a single forward pass using masked parallel prediction with KV injection from the target model's hidden states.

Key features:
- Feature fusion (multi-layer hidden states -> FC + RMSNorm)
- KV injection (fused features as K/V in every draft layer with QK-norm)
- Random anchor sampling with bidirectional intra-block attention
- Logit distillation with exponential loss decay (gamma weighting)
- Multi-node DDP training with checkpoint resume
- Export to z-lab compatible HF format
- Online validation (context-dependent ground truth)

Training recipe: modelopt_recipes/general/speculative_decoding/dflash.yaml
Results: examples/speculative_decoding/doc/dflash_results.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
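The core claim above, "predicts an entire block of tokens in a single forward pass", can be sketched as a toy contrast with an autoregressive drafter. This is an illustrative sketch only: `MASK`, `draft_block_parallel`, and the `predict` callbacks are hypothetical stand-ins, not the actual DFlash code, and the real draft model additionally consumes fused target hidden states as injected K/V.

```python
MASK = -1  # hypothetical mask token id (DFlash reads mask_token_id from its config)

def draft_block_parallel(context, block_size, predict):
    """DFlash-style draft step (toy): append block_size mask tokens to the
    context and fill them all in a single call to the draft model."""
    block = [MASK] * block_size
    return predict(context, block)  # one forward pass fills every position

def draft_block_autoregressive(context, block_size, predict_next):
    """EAGLE-style baseline (toy): one forward pass per drafted token."""
    out = []
    for _ in range(block_size):
        out.append(predict_next(context + out))
    return out

# Toy stand-ins that just count forward passes (no real model here).
calls = {"parallel": 0, "autoregressive": 0}

def toy_predict(context, block):
    calls["parallel"] += 1
    return [0] * len(block)

def toy_predict_next(seq):
    calls["autoregressive"] += 1
    return 0

draft_block_parallel([1, 2, 3], 8, toy_predict)          # 1 forward pass
draft_block_autoregressive([1, 2, 3], 8, toy_predict_next)  # 8 forward passes
```

The draft-side work per block drops from `block_size` sequential calls to one parallel call, which is where the speedup over autoregressive drafters comes from.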
1 parent 4a70040 commit 5f8d004
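The "exponential loss decay (gamma weighting)" feature listed in the commit message plausibly amounts to down-weighting the loss at later block positions. A minimal sketch, assuming the decay form `exp(-j / gamma)`; the helper name and exact formula are assumptions, not the DFlash implementation:

```python
import math

def position_weights(block_size: int, gamma: float) -> list[float]:
    """Hypothetical per-position loss weights for a predicted block.

    Position 0 (adjacent to the known context) keeps full weight; later,
    harder-to-predict positions are down-weighted exponentially.
    gamma=0 is taken to disable the decay, matching the config table.
    """
    if gamma == 0:
        return [1.0] * block_size
    # Assumed decay form: w_j = exp(-j / gamma) for block position j.
    return [math.exp(-j / gamma) for j in range(block_size)]

weights = position_weights(8, 4.0)
# weights[0] is 1.0 and the weights decrease monotonically across the block
```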

File tree

25 files changed: +3234 -47 lines

doc/results/dflash_results.html

Whitespace-only changes.

examples/speculative_decoding/README.md

Lines changed: 38 additions & 0 deletions

````diff
@@ -350,3 +350,41 @@ More models coming soon!
 - 💡 [Release Notes](https://nvidia.github.io/Model-Optimizer/reference/0_changelog.html)
 - 🐛 [File a bug](https://github.com/NVIDIA/Model-Optimizer/issues/new?template=1_bug_report.md)
 - [File a Feature Request](https://github.com/NVIDIA/Model-Optimizer/issues/new?template=2_feature_request.md)
+
+## DFlash (Block Diffusion for Speculative Decoding)
+
+DFlash is a parallel speculative decoding method based on [Block Diffusion](https://arxiv.org/abs/2602.06036).
+Unlike autoregressive draft models (EAGLE3), DFlash predicts an entire block of tokens in a single forward pass
+using masked parallel prediction with KV injection from the target model's hidden states.
+
+### Quick Start
+
+```bash
+./launch_train.sh --config ../../modelopt_recipes/general/speculative_decoding/dflash.yaml \
+    model.model_name_or_path=/path/to/Qwen3-8B \
+    data.data_path=/path/to/train.jsonl \
+    training.output_dir=/path/to/output
+```
+
+### Key Configuration (dflash.yaml)
+
+| Field | Default | Description |
+|-------|---------|-------------|
+| `dflash.dflash_block_size` | 8 | Block size for parallel prediction |
+| `dflash.dflash_num_anchors` | 512 | Number of anchor positions per sample |
+| `dflash.dflash_loss_decay_factor` | 4.0 | Exponential decay gamma (0 disables) |
+| `dflash.dflash_self_logit_distillation` | true | Use logit distillation from target |
+| `dflash.dflash_architecture_config.num_hidden_layers` | 5 | Draft decoder layers |
+| `dflash.dflash_architecture_config.mask_token_id` | auto | Token ID for masked positions |
+
+### Export
+
+```bash
+python scripts/export_hf_checkpoint.py \
+    --model_path /path/to/training/output \
+    --export_path /path/to/exported/model
+```
+
+### Results
+
+See [doc/dflash_results.md](doc/dflash_results.md) for benchmark results on Qwen3-8B.
````
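Pieced together from the configuration table above, the `dflash` section of the recipe would look roughly like the following. This is a hypothetical reconstruction from the documented defaults; the actual dflash.yaml may order or nest these fields differently.

```yaml
# Hypothetical sketch of the dflash section of
# modelopt_recipes/general/speculative_decoding/dflash.yaml,
# built from the defaults in the configuration table.
dflash:
  dflash_block_size: 8                   # tokens predicted per draft forward pass
  dflash_num_anchors: 512                # anchor positions sampled per training sample
  dflash_loss_decay_factor: 4.0          # exponential decay gamma (0 disables)
  dflash_self_logit_distillation: true   # distill from the target model's logits
  dflash_architecture_config:
    num_hidden_layers: 5                 # draft decoder layers
    mask_token_id: auto                  # token ID used for masked positions
```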
examples/speculative_decoding/doc/dflash_results.md

Lines changed: 85 additions & 0 deletions

````diff
@@ -0,0 +1,85 @@
+# DFlash Block Diffusion — ModelOpt Training Results
+
+Qwen3-8B target model, trained on nvidia/Nemotron-Post-Training-Dataset-v2 (2M samples)
+
+## Key Metrics
+
+| Benchmark | Acceptance Rate |
+|-----------|-----------------|
+| **gsm8k** | **5.19** |
+| **MT-Bench** | **4.36** |
+
+> Online validation, block_size=8, osl=512
+
+## Training Configuration
+
+| Parameter | Value |
+|-----------|-------|
+| Target Model | Qwen3-8B |
+| Draft Layers | 5 |
+| Block Size | 8 |
+| Sequence Length | 4096 |
+| Anchors per Sample | 512 |
+| Loss | KD (logit distillation) + exponential decay (gamma=4) |
+| Learning Rate | 6e-4 (linear decay) |
+| Epochs | 10 |
+| GPUs | 64 (8 nodes x 8 H100) |
+| Total Steps | 306,620 |
+| Final Loss | 1.129 |
+| Final Per-Token Acc | 67.0% |
+
+## MT-Bench Per-Category AR (Online Validation)
+
+80 prompts, block_size=8, osl=512, steps=7
+
+| Category | 80K | 150K | 306K (final) |
+|----------|-----|------|--------------|
+| math | 5.44 | 5.54 | **5.52** |
+| extraction | 4.81 | 4.82 | **4.88** |
+| coding | 4.40 | 4.53 | **4.60** |
+| reasoning | 4.34 | 4.41 | **4.44** |
+| stem | 4.05 | 4.15 | **4.17** |
+| writing | 3.76 | 3.79 | **3.84** |
+| roleplay | 3.58 | 3.73 | **3.78** |
+| humanities | 3.55 | 3.62 | **3.65** |
+| **ALL** | **4.24** | **4.32** | **4.36** |
+
+## Comparison with z-lab/Qwen3-8B-DFlash-b16
+
+### ModelOpt Eval (online validation, osl=512)
+
+| Dataset | z-lab | ModelOpt (306K) | Diff |
+|---------|-------|-----------------|------|
+| gsm8k | 4.10 | **5.19** | **+1.09** |
+| MT-Bench | 3.58 | **4.36** | **+0.78** |
+
+### z-lab Official Eval (dflash.benchmark, osl=512)
+
+| Dataset | z-lab | ModelOpt (306K) | Diff |
+|---------|-------|-----------------|------|
+| gsm8k | **5.00** | 4.08 | -0.92 |
+| MT-Bench | **3.28** | 2.99 | -0.29 |
+
+> z-lab model trained with block_size=16. ModelOpt trained with block_size=8.
+
+## Evaluation Method Impact (gsm8k)
+
+| Eval Method | z-lab checkpoint | ModelOpt (306K) |
+|-------------|------------------|-----------------|
+| Fixed GT (ModelOpt eval) | 2.95 | 4.23 |
+| Online GT (ModelOpt eval) | 4.10 | **5.19** |
+| z-lab official eval | **5.00** | 4.08 |
+
+- **Fixed GT**: pre-compute greedy ground truth, check draft against it.
+- **Online GT**: recompute ground truth after each accepted draft (context-dependent).
+- **z-lab official**: actual speculative decoding with draft KV cache.
+
+## Key Findings
+
+| Finding | Evidence |
+|---------|----------|
+| Loss decay boosts AR | +0.12 AR at 55K steps (gamma=7, bs16); consistent across all checkpoints |
+| Longer sequences help | seq=4096 vs 512: +0.49 AR on AA-Synthetic at same checkpoint |
+| Online validation essential | Fixed GT underestimates by ~1.0 AR; context-dependent GT matches actual spec-decode |
+| Forward pass identical to z-lab | Max diff 0.5 (bf16 noise) on same mask_token_id; 6/7 draft tokens match |
+| sdpa vs flash_attn: negligible | Overall AR 3.31 vs 3.31; hidden states identical, logits differ <2% |
````
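The three evaluation modes above differ only in where the ground-truth tokens come from; the acceptance count itself is, in each case, the length of the longest draft prefix matching the target's greedy tokens. A minimal sketch with a hypothetical helper (not the ModelOpt eval code):

```python
def accepted_prefix_len(draft_block: list[int], target_tokens: list[int]) -> int:
    """Number of draft tokens accepted under greedy verification: the length
    of the longest matching prefix (the first mismatch rejects the rest)."""
    n = 0
    for d, t in zip(draft_block, target_tokens):
        if d != t:
            break
        n += 1
    return n

# Fixed GT compares every block against one pre-computed continuation;
# online GT recomputes target_tokens from the actually-accepted context
# before each block, which matches real speculative decoding behavior.
accepted = accepted_prefix_len([5, 9, 2, 7], [5, 9, 4, 7])
# accepted == 2: positions 0 and 1 match, position 2 mismatches
```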

examples/speculative_decoding/eagle_utils.py

Lines changed: 50 additions & 19 deletions

````diff
@@ -141,6 +141,7 @@ def make_eagle_supervised_data_module(
     tokenizer: transformers.PreTrainedTokenizer,
     data_args,
     train_len=None,
+    answer_only_loss=False,
 ) -> dict:
     if data_args.offline_data_path is None:
         train_dataset = ShardedDataset("json", data_files=data_args.data_path)
@@ -150,6 +151,7 @@
             tokenizer=tokenizer,
             train_len=train_len,
             return_labels=True,
+            answer_only_loss=answer_only_loss,
         )
     else:
         data_collator = VisionLanguageDataCollator(
@@ -205,6 +207,12 @@ def on_log(self, args, state, control, **kwargs):
         if not hasattr(state, "training_accs") or len(state.training_accs) == 0:
             return control
         average_acc = np.mean(state.training_accs, axis=0)
+        # Always print accuracy to console
+        try:
+            acc_str = ", ".join(f"{a:.4f}" for a in np.array(average_acc).flatten())
+            print_rank_0(f"Step {state.global_step} Training Acc: [{acc_str}]")
+        except Exception:
+            print_rank_0(f"Step {state.global_step} Training Acc: {average_acc}")
         if self.estimate_ar:
             # Calculate mean training AR since last log
             # NOTE: This is only an estimate of the real AR.
@@ -219,41 +227,64 @@
             est_ar += acc_cumprod
         print_rank_0(f"Step {state.global_step} Estimated Training AR: {est_ar:.4f}")
 
+        # Log accuracy to HF Trainer's logs dict (picked up by TensorBoard)
+        logs = kwargs.get("logs") or {}
+        for i, draft_acc in enumerate(average_acc):
+            for j, step_acc in enumerate(draft_acc):
+                logs[f"train_acc/parallel_{i}_step_{j}"] = float(step_acc)
+        if self.estimate_ar:
+            logs["estimated_training_ar"] = est_ar
+
         # log to wandb
-        if wandb and is_master():
-            logs = kwargs.get("logs") or {}
+        if hasattr(wandb, "init") and is_master():
             if logs:
                 wandb.log({k: v for k, v in logs.items() if v is not None}, step=state.global_step)
-            for i, draft_acc in enumerate(average_acc):
-                for j, step_acc in enumerate(draft_acc):
-                    wandb.log(
-                        {f"parallel_{i}_step_{j}_train_acc": step_acc}, step=state.global_step
-                    )
-            if self.estimate_ar:
-                wandb.log({"estimated_training_ar": est_ar}, step=state.global_step)
 
         # reset training_accs
         state.training_accs = []
         return control
 
     def on_step_end(self, args, state, control, **kwargs):
-        """Run AR validation periodically, if available."""
+        """Run AR validation periodically (single-GPU only).
+
+        AR validation with DDP is not supported because pseudo_speculative_generate
+        runs only on rank 0 while other ranks deadlock waiting for collective ops.
+        When world_size > 1, AR validation is skipped with a one-time warning.
+        Use post-training AR validation instead (online_training.sh runs it after training).
+        """
         if self.ar_validate_steps <= 0:
             return control
         if state.global_step % self.ar_validate_steps == 0 and state.global_step > 0:
+            if torch.distributed.is_initialized() and torch.distributed.get_world_size() > 1:
+                if not hasattr(self, "_ar_ddp_warned"):
+                    self._ar_ddp_warned = True
+                    print_rank_0(
+                        "=== WARNING === AR validation during training is not supported with "
+                        "DDP (world_size > 1). Skipping. Use post-training AR validation."
+                    )
+                return control
+
+            model = kwargs["model"]
+            raw_model = model.module if hasattr(model, "module") else model
+            was_training = raw_model.training
+            raw_model.eval()
             print_rank_0("Running AR validation...")
             try:
-                ars = validate_ar(
-                    model=kwargs["model"],
-                    tokenizer=kwargs["processing_class"],
-                    ds=load_dataset("HuggingFaceH4/mt_bench_prompts")["train"],
-                    device=kwargs["model"].device,
-                )
+                with torch.no_grad():
+                    ars = validate_ar(
+                        model=raw_model,
+                        tokenizer=kwargs["processing_class"],
+                        ds=load_dataset("/hf-local/HuggingFaceH4/mt_bench_prompts")["train"],
+                        device=next(raw_model.parameters()).device,
+                        num_samples=8,
+                    )
                 print_rank_0(f"Step {state.global_step} AR: {sum(ars) / len(ars):.4f}")
-                if wandb and is_master():
+                if wandb:
                     wandb.log({"validate_ar": sum(ars) / len(ars)}, step=state.global_step)
-            except Exception:
-                print_rank_0("AR validation not available.")
+            except Exception as e:
+                print_rank_0(f"AR validation failed: {e}")
+            if was_training:
+                raw_model.train()
         return control
````
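The `Estimated Training AR` printed by the `on_log` callback above is built from per-draft-step token accuracies via a cumulative product (`est_ar += acc_cumprod` in the loop). A sketch of that arithmetic, assuming the estimate counts 1 for the token the target model emits itself plus the expected accepted draft length; the initialization is not visible in the diff, so the leading 1.0 is an assumption:

```python
def estimated_ar(step_accs: list[float]) -> float:
    """Expected tokens per verification step from per-draft-step accuracies.

    Draft token k is accepted only if every earlier draft token was, so the
    expected accepted length is the sum of cumulative products of the
    per-step accuracies. The leading 1.0 (an assumption here) counts the
    token the target model produces itself on every step.
    """
    est = 1.0
    cum = 1.0
    for acc in step_accs:
        cum *= acc  # probability the draft prefix is still unbroken here
        est += cum
    return est

# e.g. accuracies [0.8, 0.7] give 1 + 0.8 + 0.8*0.7 = 2.36 expected tokens/step
```

As the docstring in the diff notes, this is only an estimate of the real AR: it assumes per-step acceptances are independent, which actual speculative decoding does not guarantee.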