Commit 715ca50

[Docs] flesh out DFlash pipeline + scheduler pages
1 parent 7dd7d9c commit 715ca50

2 files changed

Lines changed: 86 additions & 4 deletions

docs/source/en/api/pipelines/dflash.md

Lines changed: 73 additions & 2 deletions
@@ -12,8 +12,79 @@ specific language governing permissions and limitations under the License.
# DFlash

-`DFlashPipeline` performs block-diffusion speculative decoding using a diffusion draft model and a target causal LM.
-The draft model is conditioned on target hidden features extracted during prefill and verification steps.

[DFlash](https://huggingface.co/collections/z-lab/dflash) is a block-diffusion speculative decoding scheme. A small
diffusion *draft* model proposes a block of tokens conditioned on hidden features extracted from intermediate layers
of a frozen *target* causal LM; the target then verifies the proposed block in a single forward pass and accepts the
longest matching prefix. The draft model shares the target's tokenizer, so no calibration is needed.

`DFlashPipeline` ties the two models together: prefill on the target, draft a block, verify against the target's
posterior via [`DFlashTokenDiffusionScheduler`], commit the accepted prefix and the next-token resample, and repeat
until `max_new_tokens` or a stop token. Compatible draft/target pairs include `z-lab/Qwen3-8B-DFlash-b16` with
`Qwen/Qwen3-8B`, and `z-lab/Qwen3.5-4B-DFlash` with `Qwen/Qwen3.5-4B` (the latter is a hybrid-attention target — see
the rollback note below).
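At a glance, the loop looks like the sketch below. The callables `target_prefill`, `draft_propose`, and
`target_verify`, and the `scheduler.step` signature, are illustrative stand-ins for pipeline internals, not public
API:

```py
def dflash_generate(target_prefill, draft_propose, target_verify, scheduler,
                    prompt_ids, max_new_tokens, eos_token_id):
    # Sketch only: the callables stand in for the pipeline's internal
    # prefill / draft / verify steps.
    hidden = target_prefill(prompt_ids)        # prefill fills the target KV cache
    output_ids = []
    while len(output_ids) < max_new_tokens:
        block = draft_propose(hidden)          # diffusion draft proposes a block
        logits, hidden = target_verify(block)  # one target forward over the block
        accepted, next_token = scheduler.step(block, logits)
        # commit the accepted prefix plus the resampled token, then repeat
        output_ids += list(block[:accepted]) + [next_token]
        if next_token == eos_token_id:
            break
    return output_ids
```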
## Usage

```py
import torch
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

from diffusers import DFlashPipeline

draft = AutoModel.from_pretrained(
    "z-lab/Qwen3.5-4B-DFlash", trust_remote_code=True, dtype=torch.bfloat16, device_map="auto"
)
target = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-4B", trust_remote_code=True, dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-4B", trust_remote_code=True)

pipe = DFlashPipeline(draft_model=draft, target_model=target, tokenizer=tokenizer)
output = pipe(
    prompt="What is 2 + 2? Answer in one sentence.",
    max_new_tokens=128,
    temperature=0.0,
    chat_template_kwargs={"enable_thinking": False},
)
print(output.texts[0])
```

`DFlashPipeline` currently runs `batch_size=1` only. Multi-prompt batching requires per-row partial-accept tracking
and is not yet supported.

## Hybrid-attention targets

For target models with linear-attention layers (e.g. Qwen3.5's gated-delta-net), `DynamicCache.crop()` silently
no-ops on those layers, so a partial-accept block would otherwise leak rejected speculative tokens into the
recurrent state. The pipeline detects linear-attention caches via
[`DFlashTokenDiffusionScheduler.cache_has_linear_attention`] and uses a snapshot/restore + accepted-prefix
re-forward pattern to advance both layer types cleanly. This adds one extra target forward per partial-accept
block but is required for correctness.
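A rough sketch of that rollback pattern, assuming the scheduler helpers documented on the
[`DFlashTokenDiffusionScheduler`] page; `target_forward`, `block`, and `committed_len` are hypothetical stand-ins
for pipeline internals:

```py
def verify_block(scheduler, target_forward, block, past_key_values, committed_len):
    # Sketch of the hybrid-cache rollback; `target_forward` stands in for
    # the target model call, not the pipeline's real helper.
    hybrid = scheduler.cache_has_linear_attention(past_key_values)
    if hybrid:
        snapshot = scheduler.snapshot_cache(past_key_values)  # before speculating
    logits = target_forward(block, past_key_values)           # verify forward
    accepted, next_token = scheduler.step(block, logits)
    if accepted < len(block):                                 # partial accept
        if hybrid:
            # crop() silently skips linear-attention layers, so restore the
            # snapshot and re-advance on just the accepted prefix (the extra
            # target forward mentioned above).
            scheduler.restore_cache(past_key_values, snapshot)
            target_forward(block[:accepted], past_key_values)
        else:
            past_key_values.crop(committed_len + accepted)    # attention-only path
    return accepted, next_token
```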

## Fast path

When the draft model exposes a `spec_generate(...)` method (e.g. `z-lab/Qwen3-8B-DFlash-b16`), the pipeline
delegates to it — that loop is the upstream-canonical implementation and avoids re-running the rollback bookkeeping.
Newer drafts (`z-lab/Qwen3.5-4B-DFlash`) drop `spec_generate`; the pipeline falls back to its explicit verify loop.
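The dispatch itself is just a capability check, roughly as below; the `spec_generate` arguments are defined by the
draft checkpoint's remote code, so the call shown here is an assumption, not the real signature:

```py
# Hypothetical dispatch; `run_explicit_verify_loop` is an illustrative name.
if hasattr(draft, "spec_generate"):
    out = draft.spec_generate(...)        # upstream-canonical fast path
else:
    out = run_explicit_verify_loop(...)   # pipeline's fallback verify loop
```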

## Callbacks

Callbacks run after each block-verify step. Pass `callback_on_step_end_tensor_inputs` to select which tensors are
included in `callback_kwargs`. Allowed keys: `block_output_ids` (the drafted block), `draft_logits`,
`accepted_length`, `next_token`, and `output_ids` (the running output buffer). Return `{"output_ids": ...}` from the
callback to replace the buffer.
```py
def on_step_end(pipe, step, timestep, callback_kwargs):
    # `output_ids` is the running output buffer requested via
    # `callback_on_step_end_tensor_inputs`; returning it (optionally
    # modified) replaces the pipeline's buffer for subsequent steps.
    output_ids = callback_kwargs["output_ids"]
    return {"output_ids": output_ids}

out = pipe(
    prompt="...",
    callback_on_step_end=on_step_end,
    callback_on_step_end_tensor_inputs=["output_ids"],
)
```
## DFlashPipeline

[[autodoc]] DFlashPipeline

docs/source/en/api/schedulers/dflash_token_diffusion.md

Lines changed: 13 additions & 2 deletions
@@ -12,8 +12,19 @@ specific language governing permissions and limitations under the License.
# DFlashTokenDiffusionScheduler

-`DFlashTokenDiffusionScheduler` implements the acceptance and posterior sampling logic used in DFlash-style block
-diffusion speculative decoding.

[`DFlashTokenDiffusionScheduler`] implements the verification step for DFlash-style block-diffusion speculative
decoding. It samples a posterior block from the target logits, computes the acceptance length as the longest prefix
where the draft proposal matches the posterior, and exposes the resampled `next_token` for the first rejected
position. Used by [`DFlashPipeline`].
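For intuition, a minimal sketch of the greedy (temperature 0) acceptance rule, assuming the target logits are
already aligned with the drafted block; the function name and alignment are illustrative, not the scheduler's API:

```py
import torch

def greedy_acceptance(draft_block: torch.Tensor, target_logits: torch.Tensor):
    # draft_block: (block_size,) proposed token ids
    # target_logits: (block_size, vocab) target scores for the same positions
    posterior = target_logits.argmax(dim=-1)          # greedy posterior block
    still_matching = (draft_block == posterior).long().cumprod(dim=-1)
    accepted = int(still_matching.sum().item())       # longest matching prefix
    # Resample at the first rejected position; the fully-accepted case is
    # handled separately by the pipeline, so the index is clamped here.
    next_token = int(posterior[min(accepted, len(posterior) - 1)].item())
    return accepted, next_token
```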

The scheduler also owns three helpers used by the pipeline's verify loop on hybrid-attention targets:

- `cache_has_linear_attention(cache)` — detect whether a `DynamicCache` contains any linear-attention layers.
- `snapshot_cache(cache)` / `restore_cache(cache, snapshot)` — clone and restore the full per-layer state so a
  partial-accept block can be rolled back and the target re-advanced on just the accepted prefix.

These exist because `DynamicCache.crop()` silently no-ops on linear-attention layers, which would otherwise let
rejected speculative tokens permanently contaminate the recurrent state.
## DFlashTokenDiffusionScheduler

[[autodoc]] DFlashTokenDiffusionScheduler
