[discrete diffusion] Add dflash pipeline#13699
Conversation
Adds DFlashPipeline + DFlashTokenDiffusionScheduler for block-diffusion speculative decoding with a draft DFlash model and a target causal LM. Verified against the six bug patterns surfaced in the LLaDA2 review (huggingface#13598). DFlash sidesteps most of them by being batch_size=1 only and relying on the causal default for attention; the applicable patterns (huggingface#3 callback bindings, huggingface#4 EOS at first generated position, huggingface#6 inner progress-bar config preservation) are pinned by regression tests. Public surface mirrors the LLaDA2 / SDAR / IDLM conventions: lazy import, dummy objects, scheduler + output dataclass, pipeline + output dataclass, fast tests for both, scheduler doc page, pipeline doc page. Sample/train scripts under examples/discrete_diffusion/.
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
- Training: `position_ids` must span `[0, start + block_size)` so the draft's attention RoPE cos/sin covers both `k_ctx` (target_hidden, length `start`) and `k_noise` (noise_embedding, length `block_size`). Previously we passed only `arange(start, start + block_size)` which triggered a K-side broadcast mismatch on the very first batch. - Docs/examples: target loads as plain Qwen3 / Qwen3.5 (no remote code), but the draft's custom DFlashDraftModel class lives in the Hub repo's `auto_map`, so `trust_remote_code=True` is required for draft loads only. Updated the example docstring, pipeline doc page, sample script, train script, and the GPU verify script. Smoke-tested via srun on z-lab/Qwen3.5-4B-DFlash + Qwen/Qwen3.5-4B (H100): 3 steps complete, final checkpoint saved.
…rgets
The pipeline previously short-circuited to `draft.spec_generate(...)` when
the draft model exposed it (e.g. z-lab/Qwen3-8B-DFlash-b16). That path is
the upstream `dflash_generate` loop, which calls `past_key_values_target.crop()`
unconditionally — fine for full-attention targets, but on hybrid targets it
silently corrupts the linear-attention recurrent state.
Confirmed in transformers 5.8.0.dev0 at cache_utils.py:759-761:
def crop(self, max_length: int):
# We don't crop the linear attention cache, so simply do nothing here
pass
`LinearAttentionCacheLayerMixin.crop` is documented as a no-op, so any
verify loop that relies on `cache.crop()` for rollback is wrong on hybrid
attention targets. Our explicit loop already handles this via
`DFlashTokenDiffusionScheduler.snapshot_cache` / `restore_cache` plus an
accepted-prefix re-forward, and reduces to a plain `.crop()` on full-attn
targets.
Verified end-to-end on GPU after the removal:
- z-lab/Qwen3.5-4B-DFlash + Qwen/Qwen3.5-4B (hybrid attn): "2 + 2 equals 4."
- z-lab/Qwen3-8B-DFlash-b16 + Qwen/Qwen3-8B (full attn): "2 + 2 equals 4."
Fast tests: 43 passed.
e97f7ae to
a70e329
Compare
- Use `self._execution_device` instead of device detection via `parameters()` in `DFlashPipeline.__call__`; remove redundant draft-device check/warning - Remove `add_noise` from `DFlashTokenDiffusionScheduler` — it implemented MDLM-style uniform block masking (wrong algorithm for DFlash training) and was never called at inference; DFlash training uses anchor-block masking in the training recipe - Remove the four `add_noise` unit tests that covered the deleted method Co-Authored-By: Kashif Rasul <kashif@huggingface.co>
Speculative decoding with sliding-window and Mamba/linear-attention caches has no efficient general solution: snapshot/restore requires re-running the target on accepted tokens, and crop() silently no-ops on recurrent states. Removing the snapshot_cache / restore_cache / cache_has_linear_attention scheduler methods and the associated pipeline rollback logic; DFlash now requires a standard full-attention DynamicCache target model. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DynamicCache() without a config creates an empty shell with no layer objects. Models like Qwen3.5 that call has_previous_state() on the passed cache raise ValueError when they find no LinearAttentionLayer entries. Passing config=target_model.config (and config=draft_model.config) causes DynamicCache.__init__ to pre-build the correct layer types (LinearAttentionLayer for linear_attention, DynamicLayer for full_attention) from config.layer_types, matching what the model would create if it initialized the cache itself. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Hi, when I try out the example in import torch
from diffusers import DFlashPipeline
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer
draft = AutoModel.from_pretrained(
"z-lab/Qwen3-8B-DFlash-b16", trust_remote_code=True, dtype=torch.bfloat16
)
target = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
pipe = DFlashPipeline(draft_model=draft, target_model=target, tokenizer=tokenizer)
out = pipe(prompt="How many positive whole-number divisors does 196 have?")
print(out.texts[0])I get the following output: Is the output being cut off expected? |
|
Also, I saw that huggingface/transformers#45846 was closed, will the KV cache work correctly without it? |
|
thanks @dg845 checking |
…sers into add-dflash-pipeline
The single-anchor sampler caps training signal at ~1/512 of paper Appendix A.1,
which makes the resulting draft model accept far fewer tokens at inference than
the paper reports.
This change brings the training script in line with paper §4.2 / Appendix A.1:
- `--num_anchors` (default 512, paper) — N anchor blocks per sequence, processed
in a single forward via a sparse block-diagonal attention pattern.
- `--attention_backend {sdpa, flex_attention}` (default flex_attention) — flex
is required for N=512 (sdpa materialises a dense [B,1,N*bs,S+N*bs] mask that
OOMs even on 80GB H100s at N=512, seq=4096).
- `--no_overlap_anchors` to opt out of the paper's independent (overlapping)
anchor sampling and use stars-and-bars non-overlapping anchors instead.
- New inline helpers `sample_anchor_starts` and `build_dflash_mask`. The latter
returns a FlexAttention BlockMask or a dense additive mask depending on
backend; both encode "block b sees context < anchor[b] and its own noise only".
- `draft.config._attn_implementation` is set from the CLI flag so the draft's
per-layer attention routes through transformers' ALL_ATTENTION_FUNCTIONS.
- Loss decay weights are tiled across N blocks; the existing Eq. 4 weights
computation stays put.
Module docstring now points users at vLLM/SGLang for the §5.1 target-regenerated
training data step, which is a prerequisite for paper-comparable acceptance
length.
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
mask_mod closes over fresh anchor tensors every step but the create_block_mask machinery itself is the same, so wrap it in a module-level torch.compile once. End-to-end draft-model compile is left to Accelerate's --dynamo_backend so the user can opt in/out without script changes. Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Matches the pattern in examples/text_to_image/train_text_to_image_sdxl.py and other diffusers training examples: when a user enables Accelerate's --dynamo_backend, accelerator.unwrap_model returns the compiled wrapper and save_pretrained needs ._orig_mod instead. The helper handles both cases. Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Concrete pipeline (matching SpecForge's prepare_data.py -> regenerate_train_data.py workflow): standardise to JSONL Conversation Format, serve target via SGLang or vLLM (same OpenAI-compatible API), re-roll assistant turns with temperature 0.7-0.8 (NOT greedy — diversity helps the draft generalise), concurrency 64-128 per server. Lists three concrete tooling options users already have available. Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
… pipeline Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
There was a problem hiding this comment.
Pull request overview
Adds experimental DFlash support to diffusers by introducing a DFlash speculative-decoding pipeline (draft diffusion model + target causal LM) and a companion token-diffusion scheduler, along with tests, docs, and example scripts under examples/discrete_diffusion.
Changes:
- Introduces
DFlashPipelineandDFlashTokenDiffusionScheduler(plus outputs) and wires them into lazy imports/dummy objects. - Adds unit tests for the new scheduler and pipeline behavior (including callback/progress-bar regression checks).
- Adds documentation pages and discrete-diffusion example scripts (sampling, training, and data regeneration).
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
src/diffusers/schedulers/scheduling_dflash_token_diffusion.py |
New scheduler implementing DFlash verification (posterior sampling + acceptance length). |
src/diffusers/schedulers/__init__.py |
Exposes the new scheduler via lazy import structure. |
src/diffusers/pipelines/dflash/pipeline_dflash.py |
New DFlash pipeline implementing block-wise speculative decoding. |
src/diffusers/pipelines/dflash/__init__.py |
Adds lazy import + optional dependency gating for the DFlash pipeline module. |
src/diffusers/pipelines/__init__.py |
Exposes DFlashPipeline at the pipelines package level. |
src/diffusers/__init__.py |
Exposes the new scheduler and pipeline at the top-level diffusers namespace. |
src/diffusers/utils/dummy_pt_objects.py |
Adds dummy (torch-only) objects for the new scheduler types. |
src/diffusers/utils/dummy_torch_and_transformers_objects.py |
Adds dummy objects for the new pipeline types (torch+transformers). |
tests/schedulers/test_scheduler_dflash_token_diffusion.py |
New scheduler unit tests (timesteps, sampling, step outputs, stop condition). |
tests/pipelines/dflash/test_dflash.py |
New pipeline unit tests + regression tests (callbacks/progress-bar). |
tests/pipelines/dflash/__init__.py |
Adds the test package directory for DFlash pipeline tests. |
docs/source/en/api/schedulers/dflash_token_diffusion.md |
New scheduler API docs page. |
docs/source/en/api/pipelines/dflash.md |
New pipeline API docs page. |
docs/source/en/_toctree.yml |
Registers the new DFlash docs pages in the docs navigation. |
examples/discrete_diffusion/train_dflash.py |
Example training script for DFlash draft models. |
examples/discrete_diffusion/sample_dflash.py |
Example sampling script using DFlashPipeline. |
examples/discrete_diffusion/regenerate_dflash_data.py |
Example async helper to regenerate training data via an OpenAI-compatible server. |
examples/discrete_diffusion/README.md |
Adds DFlash usage + training/data-prep documentation to the discrete diffusion examples README. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@dg845 yep, expected -- you hit the default |
|
@dg845 that transformers PR was for hybrid-attention targets (Qwen3.5's gated-delta-net). For plain Qwen3-8B (full-attention), |
- accepted_length: .int() -> .long() so gather gets int64 indices - sample(): .view() -> .reshape(), cast to float before multinomial - remove undocumented cache helpers from scheduler docs - hybrid-attention: mark as unsupported; switch usage example to Qwen3-8B
Previously the panel showed all max_new_tokens worth of ░ upfront. Now DFlash and LLaDA2 both show only committed tokens + one block of ░ ahead, so the display grows block by block. LLaDA2 uses block_size passed at construction to trim the window. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ing style transfer_index shape is (batch, block_length) not (batch, total_seq_len), so the old [0, num_prompt_tokens:] slice always returned empty and the committed window never grew. Fix maps indices into gen_full correctly accounting for the first block straddling the prompt boundary. Also removes the broken visible_end trimming -- block_x already grows one block at a time, so gen_full naturally shows committed + active block with no future noise tokens. Non-mask tokens in the current block that aren't yet committed are shown dim (still being refined, movie-style). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Block size is now read directly from transfer_index.shape[1] and block_output_ids shape, so the constructor arg was unused. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- add token_display.py alongside the sample scripts (proper home) - add --visualize / --draft_pause to sample_dflash.py - add --visualize to sample_llada2.py - drop dev scripts/sample_*_gpu.py and scripts/token_display.py Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
What does this PR do?
Adds
DFlashPipelineandDFlashTokenDiffusionSchedulerto diffusers -- extracted from #12911 to keep the scope manageable.DFlash is a speculative decoding framework that uses a lightweight block diffusion model as the draft. The draft generates a full block of tokens in one forward pass; the target LLM verifies them in parallel. This gives 6x+ lossless speedup vs. standard autoregressive decoding (paper: https://huggingface.co/papers/2602.06036).
What's included:
DFlashPipeline-- prefill on target, draft a block, verify with the scheduler, commit accepted prefix + resample, repeatDFlashTokenDiffusionScheduler-- verification step: samples target posterior, computes acceptance length, returns next token at first rejectionPretrained pairs from the z-lab/dflash collection work out of the box. Canonical pair:
z-lab/Qwen3-8B-DFlash-b16+Qwen/Qwen3-8B.Before submitting
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?
@yiyixuxu @dg845