[Feature] support DFlash: Block Diffusion for Flash Speculative Decoding #4530

@hicofeng

Motivation

https://github.com/z-lab/dflash

DFlash is a lightweight block diffusion model for speculative decoding.
It can be applied to models such as Qwen3.5 27B; vLLM, SGLang, and MLX already support the required model modifications.

The following is taken from the project's promotional introduction:

🚀 Core Breakthrough
DFlash is a lightweight Block Diffusion Model purpose-built for speculative decoding. It predicts an entire token block in a single forward pass, delivering unprecedented inference acceleration.

Limitations of Traditional Methods

Conventional speculative decoding approaches (e.g., EAGLE-3) still generate drafts in an autoregressive manner, where each token must wait for the completion of the previous one. This caps the practical speedup at only 2–3×.
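To make that cap concrete, here is a rough cost-model sketch. The function name `cycle_speedup` and every number in it (a 16-token block, 6 accepted tokens per cycle, a draft pass costing 10% of a target pass) are illustrative assumptions, not DFlash or EAGLE-3 measurements.

```python
def cycle_speedup(accepted_per_cycle, draft_passes, draft_pass_cost, verify_cost=1.0):
    """Estimate speedup over plain decoding for one draft-then-verify cycle.

    Costs are expressed in units of one target-model forward pass, and plain
    decoding is assumed to produce exactly one token per target pass.
    """
    cycle_time = draft_passes * draft_pass_cost + verify_cost
    return accepted_per_cycle / cycle_time


# Hypothetical numbers: 16-token block, 6 tokens accepted per cycle,
# one draft pass costs 0.1 target passes.
autoregressive = cycle_speedup(6, draft_passes=16, draft_pass_cost=0.1)
parallel = cycle_speedup(6, draft_passes=1, draft_pass_cost=0.1)

print(f"autoregressive draft: {autoregressive:.2f}x")  # 2.31x
print(f"single-pass draft:    {parallel:.2f}x")        # 5.45x
```

With the same acceptance, the only difference is how many draft passes are paid per cycle, which is the bottleneck the paragraph above describes.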

DFlash Innovations

DFlash adopts a fundamentally different strategy:

  • Parallel Draft Generation: Generates a full token block in one forward pass
  • KV Injection Mechanism: Injects hidden layer features from the target model as contextual conditions into every layer of the draft model
  • Feature Fusion: Fuses multi-layer hidden states via FC + RMSNorm to provide highly consistent contextual information
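As a minimal sketch of the feature-fusion step (FC + RMSNorm), assuming the multi-layer hidden states are simply concatenated before the fully connected projection. The function names, shapes, and weights here are hypothetical and are not taken from the DFlash code; the KV injection step is not modelled.

```python
import math


def linear(x, weight, bias):
    """Fully connected layer: y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: divide x by its root mean square, then apply a learned gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]


def fuse_features(layer_states, weight, bias, gain):
    """Concatenate per-layer hidden states (an assumption), project, normalize."""
    concat = [v for state in layer_states for v in state]
    return rms_norm(linear(concat, weight, bias), gain)


# Toy example: two target-model layers with hidden size 2, fused back to size 2.
layer_states = [[1.0, 2.0], [3.0, 4.0]]
weight = [[0.5, 0.0, 0.5, 0.0],   # averages the first component across layers
          [0.0, 0.5, 0.0, 0.5]]   # averages the second component across layers
bias = [0.0, 0.0]
gain = [1.0, 1.0]

fused = fuse_features(layer_states, weight, bias, gain)
print(fused)
```

The normalization keeps the fused vector at a consistent scale regardless of how many layers are combined, which is one plausible reason to pair the projection with RMSNorm.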

📊 Remarkable Acceleration Results

Qwen3-8B: 6× Lossless Acceleration

On the Qwen3-8B model, DFlash achieves:

  • 6× lossless speedup
  • 2.5× faster than EAGLE-3
  • Acceptance rate as high as 89%+
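"Lossless" and "acceptance rate" can be illustrated with a generic greedy verification step, as used in speculative decoding in general. This is a sketch under the assumption of greedy decoding; `verify_block` is a hypothetical helper, not DFlash's actual verifier.

```python
def verify_block(draft_tokens, target_tokens):
    """Accept the longest draft prefix that matches the target model.

    target_tokens[i] is the target model's greedy choice at position i given
    the context plus draft_tokens[:i], so it has one more entry than the draft.
    The returned tokens are exactly what the target alone would have produced.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have len(draft_tokens) + 1 entries")
    output = []
    for drafted, expected in zip(draft_tokens, target_tokens):
        if drafted != expected:
            output.append(expected)  # replace the first mismatch and stop
            return output
        output.append(drafted)
    output.append(target_tokens[-1])  # whole block accepted: keep the bonus token
    return output


# Third drafted token is wrong, so two are accepted and one is corrected.
print(verify_block([5, 7, 9, 2], [5, 7, 8, 2, 4]))  # [5, 7, 8]
# Every drafted token matches, so the extra target token is appended.
print(verify_block([1, 2], [1, 2, 3]))              # [1, 2, 3]
```

Because rejected tokens are always replaced by the target model's own choice, a higher acceptance rate changes only the speed, not the output.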

Qwen3.5-9B: 4.1× Acceleration

On Apple Silicon platforms, the Qwen3.5-9B model delivers:

  • 4.1× speedup
  • Verifies 16 drafted tokens in a single batch
  • Optimized with custom Metal kernels

🔥 Qwen3.5-27B: 5× Inference Speed Surge

Performance Comparison (Configuration)

| Setup | 1024 tokens | 2048 tokens | Speedup (at 1024 tokens) |
|---|---|---|---|
| Baseline | 14 tok/s | 11 tok/s | 1× |
| 8-bit Quantization | 35 tok/s | 26 tok/s | 2.5× |
| 4-bit Quantization | 28 tok/s | 20 tok/s | 2.0× |
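The speedup figures can be re-derived from the throughput numbers in the table. The sketch below uses only the values quoted there; the `speedups` helper is a hypothetical name for illustration.

```python
# Throughput in tok/s as quoted above: (1024 tokens, 2048 tokens).
throughput = {
    "Baseline": (14, 11),
    "8-bit Quantization": (35, 26),
    "4-bit Quantization": (28, 20),
}


def speedups(table, baseline="Baseline"):
    """Divide each setup's throughput by the baseline at the same length."""
    base = table[baseline]
    return {
        name: tuple(round(value / ref, 2) for value, ref in zip(values, base))
        for name, values in table.items()
    }


for name, (at_1024, at_2048) in speedups(throughput).items():
    print(f"{name}: {at_1024}x at 1024 tokens, {at_2048}x at 2048 tokens")
```

This gives 2.5× and 2.0× at 1024 tokens, matching the Speedup column, and roughly 2.36× and 1.82× at 2048 tokens.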

Key Findings

1. 8-bit quantization outperforms 4-bit: Delivers superior speedup while maintaining higher precision
2. Strong long-sequence performance: Retains a roughly 2.3× speedup with 8-bit quantization when generating 2048 tokens (26 vs. 11 tok/s)
3. Lossless decoding: Fully preserves model output quality with zero accuracy degradation

Related resources

https://github.com/z-lab/dflash
