[discrete diffusion] Add dflash pipeline #13699

Open
kashif wants to merge 36 commits into huggingface:main from kashif:add-dflash-pipeline

Conversation

@kashif

@kashif kashif commented May 8, 2026

Contributor

What does this PR do?

Adds DFlashPipeline and DFlashTokenDiffusionScheduler to diffusers -- extracted from #12911 to keep the scope manageable.

DFlash is a speculative decoding framework that uses a lightweight block diffusion model as the draft. The draft generates a full block of tokens in one forward pass; the target LLM verifies them in parallel. The paper reports a lossless speedup of more than 6x over standard autoregressive decoding (paper: https://huggingface.co/papers/2602.06036).

What's included:

  • DFlashPipeline -- prefill on target, draft a block, verify with the scheduler, commit accepted prefix + resample, repeat
  • DFlashTokenDiffusionScheduler -- verification step: samples target posterior, computes acceptance length, returns next token at first rejection
  • Tests for both pipeline and scheduler
  • Docs and discrete diffusion examples (sampling, training, data regen)
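
The verification step listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the scheduler's actual code: `accept_block` is a hypothetical helper, tokens are plain ints rather than tensors, and `target_tokens[i]` stands for the token the target sampled at position `i` after seeing the committed prefix plus the first `i` draft tokens.

```python
from typing import List, Tuple


def accept_block(draft_tokens: List[int], target_tokens: List[int]) -> Tuple[int, List[int]]:
    """Hypothetical helper: accept the longest draft prefix the target agrees with.

    `target_tokens` must hold one more entry than `draft_tokens`: the extra
    entry is the token the target would emit after a fully accepted block.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have len(draft_tokens) + 1 entries")

    # Acceptance length: number of leading positions where draft and target agree.
    accepted = 0
    for draft_tok, target_tok in zip(draft_tokens, target_tokens):
        if draft_tok != target_tok:
            break
        accepted += 1

    # Commit the accepted prefix plus the target's own token at the first
    # rejection (or the extra token when every draft token was accepted).
    committed = draft_tokens[:accepted] + [target_tokens[accepted]]
    return accepted, committed


accepted, committed = accept_block([5, 7, 9, 2], [5, 7, 4, 8, 1])
print(accepted, committed)
```

Because every committed token is one the target itself sampled, the output matches what the target would have produced alone, which is why the speedup is described as lossless.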

Pretrained pairs from the z-lab/dflash collection work out of the box. Canonical pair: z-lab/Qwen3-8B-DFlash-b16 + Qwen/Qwen3-8B.

Before submitting

Who can review?

@yiyixuxu @dg845

kashif added 2 commits May 8, 2026 10:03
Adds DFlashPipeline + DFlashTokenDiffusionScheduler for block-diffusion
speculative decoding with a draft DFlash model and a target causal LM.

Verified against the six bug patterns surfaced in the LLaDA2 review
(huggingface#13598). DFlash sidesteps most of them by being batch_size=1 only and
relying on the causal default for attention; the applicable patterns
(huggingface#3 callback bindings, huggingface#4 EOS at first generated position, huggingface#6 inner
progress-bar config preservation) are pinned by regression tests.

Public surface mirrors the LLaDA2 / SDAR / IDLM conventions: lazy import,
dummy objects, scheduler + output dataclass, pipeline + output dataclass,
fast tests for both, scheduler doc page, pipeline doc page.

Sample/train scripts under examples/discrete_diffusion/.
@github-actions github-actions Bot added the size/L, documentation, tests, utils, pipelines, examples, and schedulers labels and removed the size/L label May 8, 2026
@kashif kashif requested a review from DN6 May 8, 2026 10:14
@HuggingFaceDocBuilderDev


The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

- Training: `position_ids` must span `[0, start + block_size)` so the
  draft's attention RoPE cos/sin covers both `k_ctx` (target_hidden,
  length `start`) and `k_noise` (noise_embedding, length `block_size`).
  Previously we passed only `arange(start, start + block_size)` which
  triggered a K-side broadcast mismatch on the very first batch.
- Docs/examples: target loads as plain Qwen3 / Qwen3.5 (no remote
  code), but the draft's custom DFlashDraftModel class lives in the
  Hub repo's `auto_map`, so `trust_remote_code=True` is required for
  draft loads only. Updated the example docstring, pipeline doc page,
  sample script, train script, and the GPU verify script.
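
The `position_ids` fix in the first bullet comes down to a length check, sketched below in plain Python. The helper name `draft_position_ids` and the toy sizes are assumptions for illustration; the training script works with tensors.

```python
from typing import List


def draft_position_ids(start: int, block_size: int) -> List[int]:
    """Hypothetical helper: positions for the context keys plus the noise block."""
    return list(range(0, start + block_size))


start, block_size = 6, 4

# The key side concatenates k_ctx (length `start`) and k_noise (length `block_size`).
key_length = start + block_size

fixed = draft_position_ids(start, block_size)
old = list(range(start, start + block_size))  # what was passed before the fix

print(len(fixed) == key_length)      # covers every key position
print(len(old) == key_length)        # too short: only covers the noise block
print(fixed[-block_size:] == old)    # the noise block keeps the same positions
```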

Smoke-tested via srun on z-lab/Qwen3.5-4B-DFlash + Qwen/Qwen3.5-4B
(H100): 3 steps complete, final checkpoint saved.
@github-actions github-actions Bot added the size/L label May 8, 2026
kashif added 3 commits May 8, 2026 11:19
…rgets

The pipeline previously short-circuited to `draft.spec_generate(...)` when
the draft model exposed it (e.g. z-lab/Qwen3-8B-DFlash-b16). That path is
the upstream `dflash_generate` loop, which calls `past_key_values_target.crop()`
unconditionally — fine for full-attention targets, but on hybrid targets it
silently corrupts the linear-attention recurrent state.

Confirmed in transformers 5.8.0.dev0 at cache_utils.py:759-761:

    def crop(self, max_length: int):
        # We don't crop the linear attention cache, so simply do nothing here
        pass

`LinearAttentionCacheLayerMixin.crop` is documented as a no-op, so any
verify loop that relies on `cache.crop()` for rollback is wrong on hybrid
attention targets. Our explicit loop already handles this via
`DFlashTokenDiffusionScheduler.snapshot_cache` / `restore_cache` plus an
accepted-prefix re-forward, and reduces to a plain `.crop()` on full-attn
targets.
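
The failure mode can be reproduced with toy stand-ins. The two classes below are hypothetical and do not use the transformers cache API; they only mirror the behaviour quoted above, where a key/value layer truncates on `crop` and a recurrent layer does nothing.

```python
class ToyFullAttentionLayer:
    """Stores one entry per token, so rollback is a simple truncation."""

    def __init__(self):
        self.keys = []

    def update(self, token: int) -> None:
        self.keys.append(token)

    def crop(self, max_length: int) -> None:
        self.keys = self.keys[:max_length]


class ToyRecurrentLayer:
    """Folds every token into one fixed-size state; there is nothing to truncate."""

    def __init__(self):
        self.state = 0

    def update(self, token: int) -> None:
        self.state = self.state * 31 + token

    def crop(self, max_length: int) -> None:
        pass  # no-op, mirroring the quoted linear-attention behaviour


def feed(layer, tokens):
    for token in tokens:
        layer.update(token)
    return layer


accepted, rejected = [1, 2], [3, 4]

# Verify the whole block, then try to roll back to the accepted prefix.
full = feed(ToyFullAttentionLayer(), accepted + rejected)
recurrent = feed(ToyRecurrentLayer(), accepted + rejected)
full.crop(len(accepted))
recurrent.crop(len(accepted))

# Reference state: what the recurrent layer should hold after the accepted prefix only.
reference = feed(ToyRecurrentLayer(), accepted)

print(full.keys == accepted)               # rollback worked
print(recurrent.state == reference.state)  # rollback silently did nothing
```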

Verified end-to-end on GPU after the removal:
- z-lab/Qwen3.5-4B-DFlash + Qwen/Qwen3.5-4B (hybrid attn): "2 + 2 equals 4."
- z-lab/Qwen3-8B-DFlash-b16 + Qwen/Qwen3-8B (full attn):    "2 + 2 equals 4."

Fast tests: 43 passed.
@kashif kashif force-pushed the add-dflash-pipeline branch from e97f7ae to a70e329 May 9, 2026 10:33
@kashif kashif requested review from dg845 and removed request for DN6 May 10, 2026 15:38
kashif and others added 5 commits May 11, 2026 10:38
- Use `self._execution_device` instead of device detection via
  `parameters()` in `DFlashPipeline.__call__`; remove redundant
  draft-device check/warning
- Remove `add_noise` from `DFlashTokenDiffusionScheduler` — it
  implemented MDLM-style uniform block masking (wrong algorithm for
  DFlash training) and was never called at inference; DFlash training
  uses anchor-block masking in the training recipe
- Remove the four `add_noise` unit tests that covered the deleted method

Co-Authored-By: Kashif Rasul <kashif@huggingface.co>
Speculative decoding with sliding-window and Mamba/linear-attention caches
has no efficient general solution: snapshot/restore requires re-running the
target on accepted tokens, and crop() silently no-ops on recurrent states.
Removing the snapshot_cache / restore_cache / cache_has_linear_attention
scheduler methods and the associated pipeline rollback logic; DFlash now
requires a standard full-attention DynamicCache target model.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DynamicCache() without a config creates an empty shell with no layer
objects. Models like Qwen3.5 that call has_previous_state() on the
passed cache raise ValueError when they find no LinearAttentionLayer
entries.

Passing config=target_model.config (and config=draft_model.config)
causes DynamicCache.__init__ to pre-build the correct layer types
(LinearAttentionLayer for linear_attention, DynamicLayer for
full_attention) from config.layer_types, matching what the model
would create if it initialized the cache itself.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@dg845

dg845 commented May 20, 2026

Collaborator

Hi, when I try out the example in src/diffusers/pipelines/dflash/pipeline_dflash.py:

    import torch
    from diffusers import DFlashPipeline
    from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

    draft = AutoModel.from_pretrained(
        "z-lab/Qwen3-8B-DFlash-b16", trust_remote_code=True, dtype=torch.bfloat16
    )
    target = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", dtype=torch.bfloat16)
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

    pipe = DFlashPipeline(draft_model=draft, target_model=target, tokenizer=tokenizer)
    out = pipe(prompt="How many positive whole-number divisors does 196 have?")
    print(out.texts[0])

I get the following output:

<think>
Okay, so I need to figure out how many positive whole-number divisors 196 has. Hmm, divisors... right. Divisors are numbers that divide into 196 without leaving a remainder. So, I guess I need to find all the numbers that can divide 196 evenly.

First, maybe I should start by recalling how to find the number of divisors. I think there's a method involving prime factorization. Let me remember... if you break down a number into its prime factors, then you can use the exponents of those primes to calculate the total number of divisors.

Let me try that. So, first step: find the prime factors of 196. Let me start dividing 196 by the smallest prime numbers.

196 is even, so it's divisible by 2. Let me divide 196 by 2.

196 ÷ 2 = 98.

Okay, so 2 is a prime factor. Now, 98 is also even, so divide by 2 again.

98 ÷ 2 = 49.

So now we have 2 × 2 × 49. Now, 49 is not even, so let's check the next prime number, which is 3. Does 3 divide into 49? Let me check: 3 × 16 is 48, so 49 - 48 is 1. So no, 3 doesn't divide into 49.

Next prime number is 5. 49 ends with a 9, so it's not divisible by 5. Next is 7. Let me try dividing

Is the output being cut off expected?

@dg845

dg845 commented May 20, 2026

Collaborator

Also, I saw that huggingface/transformers#45846 was closed, will the KV cache work correctly without it? DFlashTokenDiffusionScheduler has some cache methods (e.g. snapshot_cache, restore_cache, etc.) but I'm not sure if they need the transformers PR to work.

@kashif

kashif commented May 20, 2026

Contributor Author

thanks @dg845 checking

kashif added 10 commits May 26, 2026 10:59
The single-anchor sampler caps training signal at ~1/512 of paper Appendix A.1,
which makes the resulting draft model accept far fewer tokens at inference than
the paper reports.

This change brings the training script in line with paper §4.2 / Appendix A.1:

- `--num_anchors` (default 512, paper) — N anchor blocks per sequence, processed
  in a single forward via a sparse block-diagonal attention pattern.
- `--attention_backend {sdpa, flex_attention}` (default flex_attention) — flex
  is required for N=512 (sdpa materialises a dense [B,1,N*bs,S+N*bs] mask that
  OOMs even on 80GB H100s at N=512, seq=4096).
- `--no_overlap_anchors` to opt out of the paper's independent (overlapping)
  anchor sampling and use stars-and-bars non-overlapping anchors instead.
- New inline helpers `sample_anchor_starts` and `build_dflash_mask`. The latter
  returns a FlexAttention BlockMask or a dense additive mask depending on
  backend; both encode "block b sees context < anchor[b] and its own noise only".
- `draft.config._attn_implementation` is set from the CLI flag so the draft's
  per-layer attention routes through transformers' ALL_ATTENTION_FUNCTIONS.
- Loss decay weights are tiled across N blocks; the existing Eq. 4 weights
  computation stays put.

Module docstring now points users at vLLM/SGLang for the §5.1 target-regenerated
training data step, which is a prerequisite for paper-comparable acceptance
length.
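
The rule "block b sees context < anchor[b] and its own noise only" can be written out as a dense boolean mask. The sketch below is an illustration in plain Python, not the script's `build_dflash_mask`: the function name `build_dense_dflash_mask` is hypothetical, it returns nested lists rather than an additive tensor or a FlexAttention BlockMask, and it assumes attention inside a block is unrestricted.

```python
from typing import List


def build_dense_dflash_mask(anchors: List[int], block_size: int, seq_len: int) -> List[List[bool]]:
    """Hypothetical dense mask of shape [N * block_size, seq_len + N * block_size].

    Rows are noise-token queries, grouped by anchor block. Columns are the
    `seq_len` context keys followed by all noise keys. True means "may attend".
    """
    num_queries = len(anchors) * block_size
    mask = []
    for query in range(num_queries):
        block = query // block_size
        # Context keys: only positions strictly before this block's anchor.
        row = [key < anchors[block] for key in range(seq_len)]
        # Noise keys: only the keys that belong to the same block.
        row += [(noise // block_size) == block for noise in range(num_queries)]
        mask.append(row)
    return mask


mask = build_dense_dflash_mask(anchors=[2, 5], block_size=2, seq_len=6)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The mask has one row per noise token and `seq_len + N * block_size` columns, so the dense form grows with the square of `N * block_size`; that is the memory cost the commit message cites as the reason for defaulting to the sparse backend.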

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
mask_mod closes over fresh anchor tensors every step but the create_block_mask
machinery itself is the same, so wrap it in a module-level torch.compile once.
End-to-end draft-model compile is left to Accelerate's --dynamo_backend so the
user can opt in/out without script changes.

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Matches the pattern in examples/text_to_image/train_text_to_image_sdxl.py
and other diffusers training examples: when a user enables Accelerate's
--dynamo_backend, accelerator.unwrap_model returns the compiled wrapper and
save_pretrained needs ._orig_mod instead. The helper handles both cases.
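
A minimal sketch of such a helper, assuming only what the message above states (a compiled wrapper exposes the original module as `_orig_mod`). The names `unwrap_compiled`, `PlainModel`, and `FakeCompiledWrapper` are hypothetical stand-ins so the snippet runs without torch or Accelerate.

```python
def unwrap_compiled(model):
    """Return the wrapped module if `model` carries `_orig_mod`, else `model` itself."""
    return getattr(model, "_orig_mod", model)


class PlainModel:
    """Stand-in for an uncompiled module."""


class FakeCompiledWrapper:
    """Stand-in for a compiled wrapper that keeps the original module on `_orig_mod`."""

    def __init__(self, module):
        self._orig_mod = module


plain = PlainModel()
compiled = FakeCompiledWrapper(plain)

# Both paths resolve to the same underlying object, so saving works either way.
print(unwrap_compiled(plain) is plain)
print(unwrap_compiled(compiled) is plain)
```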

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Concrete pipeline (matching SpecForge's prepare_data.py -> regenerate_train_data.py
workflow): standardise to JSONL Conversation Format, serve target via
SGLang or vLLM (same OpenAI-compatible API), re-roll assistant turns with
temperature 0.7-0.8 (NOT greedy — diversity helps the draft generalise),
concurrency 64-128 per server. Lists three concrete tooling options users
already have available.
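
The bounded-concurrency part of that workflow can be sketched with `asyncio`. This is an illustration, not the contents of `regenerate_dflash_data.py`: `fake_regenerate` is a hypothetical stand-in for a request to an OpenAI-compatible server, and the prompts are made up.

```python
import asyncio
from typing import Awaitable, Callable, List


async def fake_regenerate(prompt: str, temperature: float) -> str:
    """Hypothetical stand-in for one chat-completion request to the target server."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"[T={temperature}] reply to: {prompt}"


async def regenerate_all(
    prompts: List[str],
    concurrency: int = 64,
    temperature: float = 0.7,
    generate: Callable[[str, float], Awaitable[str]] = fake_regenerate,
) -> List[str]:
    """Re-roll every prompt with at most `concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def one(prompt: str) -> str:
        async with semaphore:
            return await generate(prompt, temperature)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(one(p) for p in prompts))


results = asyncio.run(regenerate_all(["a", "b", "c"], concurrency=2))
print(results)
```

Swapping `generate` for a coroutine that calls the served target would keep the same concurrency limit and output ordering.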

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
… pipeline

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>

Copilot AI left a comment

Contributor


Pull request overview

Adds experimental DFlash support to diffusers by introducing a DFlash speculative-decoding pipeline (draft diffusion model + target causal LM) and a companion token-diffusion scheduler, along with tests, docs, and example scripts under examples/discrete_diffusion.

Changes:

  • Introduces DFlashPipeline and DFlashTokenDiffusionScheduler (plus outputs) and wires them into lazy imports/dummy objects.
  • Adds unit tests for the new scheduler and pipeline behavior (including callback/progress-bar regression checks).
  • Adds documentation pages and discrete-diffusion example scripts (sampling, training, and data regeneration).

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `src/diffusers/schedulers/scheduling_dflash_token_diffusion.py` | New scheduler implementing DFlash verification (posterior sampling + acceptance length). |
| `src/diffusers/schedulers/__init__.py` | Exposes the new scheduler via lazy import structure. |
| `src/diffusers/pipelines/dflash/pipeline_dflash.py` | New DFlash pipeline implementing block-wise speculative decoding. |
| `src/diffusers/pipelines/dflash/__init__.py` | Adds lazy import + optional dependency gating for the DFlash pipeline module. |
| `src/diffusers/pipelines/__init__.py` | Exposes DFlashPipeline at the pipelines package level. |
| `src/diffusers/__init__.py` | Exposes the new scheduler and pipeline at the top-level diffusers namespace. |
| `src/diffusers/utils/dummy_pt_objects.py` | Adds dummy (torch-only) objects for the new scheduler types. |
| `src/diffusers/utils/dummy_torch_and_transformers_objects.py` | Adds dummy objects for the new pipeline types (torch+transformers). |
| `tests/schedulers/test_scheduler_dflash_token_diffusion.py` | New scheduler unit tests (timesteps, sampling, step outputs, stop condition). |
| `tests/pipelines/dflash/test_dflash.py` | New pipeline unit tests + regression tests (callbacks/progress-bar). |
| `tests/pipelines/dflash/__init__.py` | Adds the test package directory for DFlash pipeline tests. |
| `docs/source/en/api/schedulers/dflash_token_diffusion.md` | New scheduler API docs page. |
| `docs/source/en/api/pipelines/dflash.md` | New pipeline API docs page. |
| `docs/source/en/_toctree.yml` | Registers the new DFlash docs pages in the docs navigation. |
| `examples/discrete_diffusion/train_dflash.py` | Example training script for DFlash draft models. |
| `examples/discrete_diffusion/sample_dflash.py` | Example sampling script using DFlashPipeline. |
| `examples/discrete_diffusion/regenerate_dflash_data.py` | Example async helper to regenerate training data via an OpenAI-compatible server. |
| `examples/discrete_diffusion/README.md` | Adds DFlash usage + training/data-prep documentation to the discrete diffusion examples README. |

Comment thread src/diffusers/schedulers/scheduling_dflash_token_diffusion.py Outdated
Comment thread src/diffusers/schedulers/scheduling_dflash_token_diffusion.py
Comment thread docs/source/en/api/schedulers/dflash_token_diffusion.md Outdated
Comment thread docs/source/en/api/pipelines/dflash.md Outdated
@kashif

kashif commented Jun 9, 2026

Contributor Author

@dg845 yep, expected -- you hit the default max_new_tokens=2048 and Qwen3's <think> trace runs well over that. Pass chat_template_kwargs={"enable_thinking": False} to skip thinking mode, or bump max_new_tokens to 8k+ for full traces. Updated the docs to call this out.
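
A sketch of the two options as call arguments, for readers following along. It assumes, based only on the reply above, that `chat_template_kwargs` and `max_new_tokens` are accepted as keyword arguments by the pipeline call; the value 8192 is an illustrative reading of "8k", and the actual call is left commented out because it needs the loaded models.

```python
prompt = "How many positive whole-number divisors does 196 have?"

# Option 1: skip Qwen3's thinking mode so the answer fits the default budget.
no_thinking_kwargs = {
    "prompt": prompt,
    "chat_template_kwargs": {"enable_thinking": False},
}

# Option 2: keep the <think> trace and raise the generation budget instead.
# 8192 is an illustrative value, not a documented recommendation.
long_trace_kwargs = {
    "prompt": prompt,
    "max_new_tokens": 8192,
}

# With `pipe` built as in the example earlier in this thread (assumed usage):
# out = pipe(**no_thinking_kwargs)
# print(out.texts[0])
print(sorted(no_thinking_kwargs), sorted(long_trace_kwargs))
```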

@kashif

kashif commented Jun 9, 2026

Contributor Author

@dg845 that transformers PR was for hybrid-attention targets (Qwen3.5's gated-delta-net). For plain Qwen3-8B (full-attention), DynamicCache.crop() works fine and the closed PR doesn't affect anything. The snapshot/restore cache methods were planned for hybrid-attention support but aren't in this PR -- updated the docs to say hybrid targets aren't supported yet.

kashif and others added 9 commits June 9, 2026 12:59
- accepted_length: .int() -> .long() so gather gets int64 indices
- sample(): .view() -> .reshape(), cast to float before multinomial
- remove undocumented cache helpers from scheduler docs
- hybrid-attention: mark as unsupported; switch usage example to Qwen3-8B
Previously the panel showed all max_new_tokens worth of ░ upfront.
Now DFlash and LLaDA2 both show only committed tokens + one block
of ░ ahead, so the display grows block by block. LLaDA2 uses
block_size passed at construction to trim the window.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ing style

transfer_index shape is (batch, block_length) not (batch, total_seq_len),
so the old [0, num_prompt_tokens:] slice always returned empty and the
committed window never grew. Fix maps indices into gen_full correctly
accounting for the first block straddling the prompt boundary.

Also removes the broken visible_end trimming -- block_x already grows
one block at a time, so gen_full naturally shows committed + active block
with no future noise tokens. Non-mask tokens in the current block that
aren't yet committed are shown dim (still being refined, movie-style).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Block size is now read directly from transfer_index.shape[1] and
block_output_ids shape, so the constructor arg was unused.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- add token_display.py alongside the sample scripts (proper home)
- add --visualize / --draft_pause to sample_dflash.py
- add --visualize to sample_llada2.py
- drop dev scripts/sample_*_gpu.py and scripts/token_display.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Labels

documentation, examples, pipelines, schedulers, size/L, tests, utils

4 participants