From fea0e1133332571862baea2c5d7dc0f610f3e069 Mon Sep 17 00:00:00 2001
From: Eric Lake
Date: Sun, 26 Apr 2026 11:39:00 -0700
Subject: [PATCH] docs(README): remove degenerate DFlash perf row, add honest
 disclaimer
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Follow-up to #85 (just merged). Subsequent benchmarking discovered that
the 70 tok/s DFlash medium/long numbers in that PR were ALWAYS degenerate
output ("and and and...", "**UMA** **UMA**...") — high acceptance because
draft and target both committed to the same locked-in token every block.

Root cause: DFlash uses argMax greedy decoding regardless of the request
temperature. Vanilla samples stochastically at temp=0.6, which breaks
ties; DFlash has no tie-breaker and locks into low-entropy attractors.

Mitigation experiments (rep-penalty 1.1, 1.3) only partially help: 1.1 is
too weak to dislodge hard attractors (1/5 prompts clean), while 1.3 fixes
the attractors but crashes acceptance from 80% to 18-46%, so DFlash
becomes net-negative relative to vanilla. The proper fix is stochastic
posterior sampling with rejection-based accept (Leviathan/Chen), tracked
at z-lab/dflash#91.

Replaces the misleading row with a clear warning so users do not adopt a
degenerate codepath as the recommended config. See z-lab/dflash#91
(issuecomment 4322584783) for the full diagnosis.
---
 README.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 5f8465f..6b9d5eb 100644
--- a/README.md
+++ b/README.md
@@ -80,13 +80,12 @@ Benchmark results for full-RAM (no SSD streaming) MoE inference on M1 Ultra. The
 
 | Configuration | Short (~126 tok) | Medium (~400 tok) | Long (~800 tok) |
 |---|---|---|---|
 | **Vanilla full-GPU** | **61.7 tok/s** | **62.3 tok/s** | **62.1 tok/s** |
-| `--dflash` (block_size=16) † | 52.3 tok/s | **70.3 tok/s** (+13%) | **69.9 tok/s** (+13%) |
 
 > *Hardware:* Apple M1 Ultra, 64 GB unified memory, macOS 26.x. Model ~20 GB on disk, ~21.6 GB resident weight + ~2.1 GB KV at runtime.
 > *Flags:* `--repeat-penalty 1.1 --max-tokens 2000`, `temperature: 0.6`, single-stream `/v1/chat/completions`.
 > *Vanilla baseline before* `needsMoeFlush` *gate (for reference):* 19.2 / 18.1 / 18.3 tok/s — see #84.
 
-† DFlash uses [`z-lab/Qwen3.6-35B-A3B-DFlash`](https://huggingface.co/z-lab/Qwen3.6-35B-A3B-DFlash) (~948 MB) as the block-diffusion draft model. DFlash gives a clean +13% on medium/long generations but regresses short prompts (block overhead doesn't amortize at low token counts) and changes stop-condition behavior (`finish_reason=null` vs `stop`/`length`). Recommend a quality eval before using as default.
+> ⚠️ **DFlash on this model is currently unsuitable for production.** DFlash uses pure greedy (`argMax`) decoding regardless of `temperature`, which on Qwen3.6-35B-A3B + the [`z-lab/Qwen3.6-35B-A3B-DFlash`](https://huggingface.co/z-lab/Qwen3.6-35B-A3B-DFlash) draft locks into low-entropy attractors (`"and and and..."`, `"**UMA** **UMA**..."`). Earlier 70 tok/s DFlash numbers were degenerate output that scored high acceptance because draft and target both committed to the same locked-in token. Repetition-penalty mitigation works on some prompts but tanks acceptance on others — the proper fix is stochastic posterior sampling with rejection-based accept ([Leviathan/Chen](https://arxiv.org/abs/2211.17192) formulation), which is a DFlash architecture change tracked at [z-lab/dflash#91](https://github.com/z-lab/dflash/issues/91).
 
 ### DeepSeek-V4-Flash (126 GB, Q3-mixed-gs128-affine) — M5 Pro 64 GB
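Note (not part of the patch): a minimal sketch of the rejection-based accept step the commit message points to as the proper fix, per the standard Leviathan/Chen formulation. `p` (target probs), `q` (draft probs), and the function name are illustrative assumptions, not the DFlash API; how z-lab/dflash#91 will actually wire this in is undecided.

```python
import random

def accept_or_resample(p, q, drafted, rng=random.random):
    """Accept the drafted token with prob min(1, p/q); on reject,
    resample from the residual distribution max(0, p - q), renormalized.
    This preserves the target distribution exactly, so a repeated token
    only survives if the target model itself keeps choosing it --
    unlike argMax commit, which gives attractors free 100% acceptance."""
    if rng() < min(1.0, p[drafted] / max(q[drafted], 1e-12)):
        return drafted, True  # draft token accepted
    # Rejected: sample a corrected token from max(0, p - q), renormalized.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    z = sum(residual) or 1.0  # degenerate case: p == q everywhere
    r, acc = rng() * z, 0.0   # inverse-CDF sample over the residual mass
    for tok, w in enumerate(residual):
        acc += w
        if w > 0.0 and r <= acc:
            return tok, False
    return len(p) - 1, False  # numerical fallback
```

The key property for the attractor bug: when the draft is overconfident in a token (`q > p`), the accept probability `p/q` drops below 1, so the lock-in is broken stochastically instead of being rubber-stamped every block.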