K3 smoke: stop misreporting the DFlash drafter (spec-decode-only, not a transformers model)
The DFlash drafter (z-lab/gemma-4-26B-A4B-it-DFlash) declares
architectures=['DFlashDraftModel'] but ships no modeling file and no
auto_map, and DFlashDraftModel is not a built-in transformers class.
AutoModelForCausalLM therefore silently fell back to the base
model_type=qwen3, dropping the DFlash weights (fc/hidden_norm) and
newly-initialising lm_head/embed_tokens — then ran a standalone forward
and reported drafter_forward_ok=true. That signal was meaningless: the
block-diffusion drafting protocol was never exercised. Per the model
card, DFlash runs only via vLLM (PR #41703) or SGLang speculative decoding.
Fix:
* _detect_drafter_loadability(): flags spec-decode-only drafters
  (dflash_config / DFlashDraftModel arch not importable, no auto_map).
* _load_drafter(): for such drafters, load the qwen3 backbone ONLY as a
  labeled memory probe (kind=dflash_backbone_memory_probe, faithful=False).
* main(): SKIP the standalone drafter forward for spec-decode-only
  drafters (stage drafter_forward_skipped + validation_path), instead of
  running garbage through a misloaded backbone.
* summary: drafter_forward_ok=null (n/a) + drafter_faithful_transformers_load
  + drafter_note + drafter_validation_path, instead of a false true.
Faithful DFlash speedup validation is deferred to the vLLM/SGLang path
(Block A part 1).
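
For illustration only, the detection and summary steps in the Fix list might look roughly like the sketch below. It is not the code in this commit: the function names, the KNOWN_ARCHITECTURES stand-in for "importable from transformers", the exact flagging rule, and the drafter_note / drafter_validation_path strings are all assumptions. Only the config keys and summary field names come from the message above.

```python
from typing import Any, Mapping, Optional

# Hypothetical stand-in for "architectures transformers can import".
# A real check would query the installed transformers package instead.
KNOWN_ARCHITECTURES = frozenset({"Qwen3ForCausalLM"})


def detect_drafter_loadability(
    config: Mapping[str, Any],
    known_archs: frozenset = KNOWN_ARCHITECTURES,
) -> dict:
    """Flag drafters that cannot be loaded faithfully as a standalone model."""
    archs = list(config.get("architectures") or [])
    unknown_archs = [a for a in archs if a not in known_archs]
    has_auto_map = bool(config.get("auto_map"))
    has_dflash_config = "dflash_config" in config
    # Assumed rule: with no auto_map to supply custom code, a dflash_config
    # block or an unknown architecture means the drafter is spec-decode-only.
    spec_decode_only = (not has_auto_map) and (has_dflash_config or bool(unknown_archs))
    return {
        "spec_decode_only": spec_decode_only,
        "unknown_architectures": unknown_archs,
        "has_auto_map": has_auto_map,
        "fallback_model_type": config.get("model_type"),
    }


def summarize_drafter(report: Mapping[str, Any], forward_ok: Optional[bool]) -> dict:
    """Build the summary fields; never report a forward pass that was skipped."""
    if report["spec_decode_only"]:
        return {
            "drafter_forward_ok": None,  # serialises to JSON null (n/a)
            "drafter_faithful_transformers_load": False,
            "drafter_note": "spec-decode-only drafter; standalone forward skipped",
            "drafter_validation_path": "vllm_or_sglang_speculative_decoding",
        }
    return {
        "drafter_forward_ok": forward_ok,
        "drafter_faithful_transformers_load": True,
    }


# Config shaped like the one described above: custom arch, qwen3 model_type,
# no auto_map. The dflash_config contents are left empty on purpose.
dflash_cfg = {
    "architectures": ["DFlashDraftModel"],
    "model_type": "qwen3",
    "dflash_config": {},
}
dflash_summary = summarize_drafter(detect_drafter_loadability(dflash_cfg), forward_ok=None)
```

The point of the None value is that a skipped check is distinguishable from both a pass and a failure, so downstream consumers cannot mistake the memory probe for a validated drafter.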
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>