Skip to content

Qwen3.5-122B-A10B-DFlash + SwiftLM SSD-stream on M1 Ultra: low acceptance + server crash #91

@ericjlake

Description

@ericjlake

First, thank you for the very fast turnaround on releasing z-lab/Qwen3.5-122B-A10B-DFlash (per #81 — released 2026-04-25). Sharing day-one feedback from M1 Ultra + SwiftLM SSD-streaming, since this is the most memory-constrained large-MoE setup the draft is likely to land on.

Setup

  • Hardware: Apple M1 Ultra, 64 GB unified memory, macOS 26.x, internal NVMe
  • Backend: SwiftLM b598-era + recent local fixes (cherry-picked off SharpAI/main), DFlash code path active. The target's Qwen3MoE+DFlash extension picks up Qwen3.5-122B-A10B-4bit correctly:
    [SwiftLM] DFlash: target model supports DFlashTargetModel
    [SwiftLM] DFlash draft model loaded (block_size=16, 6 target layers, mask_token=248077)
    [SwiftLM] Draft model loaded successfully (16 block size, DFlash mode)
    [SwiftLM] Using speculative decoding: …Qwen3.5-122B-A10B-DFlash → …Qwen3.5-122B-A10B-4bit (DFlash block-diffusion)
    
  • Target: mlx-community/Qwen3.5-122B-A10B-4bit (69.6 GB, 48 layers, A10B active per token)
  • SwiftLM flags: --stream-experts --dflash --draft-model …Qwen3.5-122B-A10B-DFlash, SWIFTLM_TOP_K=6 TEND_MOE_CACHE_SLOTS=16 (our standard SSD-stream config; used for the baseline below as well)

Results — generation throughput

Streaming bench via /v1/chat/completions, single request, temperature: 0.6. Same three prompts as the baseline measurement.

Configuration Short (~126 tok in) Medium (~400 tok in) Long (~800 tok in)
--stream-experts baseline (no DFlash) 6.30 tok/s · 153 tok generated · stop 6.11 tok/s · 246 tok · stop 6.22 tok/s · 800 tok · length
--stream-experts --dflash …-DFlash 6.30 tok/s · 200 tok · finish_reason=null 2.78 tok/s · 395 tok · finish_reason=null server crashed mid-run

DFlash is net-negative on this hardware: parity on short, −55% on medium, server crash on long.

Acceptance pattern (this is the interesting part)

The DFlash cycle log shows a clear pathological pattern across hundreds of cycles:

[DFlash] Cycle 180: blockLen=16, verifyLen=16, acceptanceLen=1, commitCount=2
[DFlash] Cycle 181: blockLen=16, verifyLen=16, acceptanceLen=1, commitCount=2
[DFlash] Cycle 182: blockLen=16, verifyLen=16, acceptanceLen=1, commitCount=2
[DFlash] Cycle 183: blockLen=16, verifyLen=16, acceptanceLen=0, commitCount=1
[DFlash] Cycle 184: blockLen=16, verifyLen=16, acceptanceLen=0, commitCount=1
[DFlash] Cycle 185: blockLen=16, verifyLen=16, acceptanceLen=1, commitCount=2
…repeated 200+ cycles, acceptanceLen always 0 or 1

acceptanceLen is consistently 0 or 1 out of block_size=16. The expected/healthy range for a well-aligned DFlash draft is much higher (your README's MLX section implies the draft is meant to commit several tokens per block on average).

So we're paying a 17-position verify pass — which on --stream-experts means SSD reads for 17 positions × 8 experts each = ~136 expert weight reads per layer per cycle, vs ~8 reads per layer per token in vanilla — for ~1 committed token. That fan-out is most of the regression.

Crash on long prompt

The server crashed silently somewhere in the long-prompt bench. The DFlash cycle log abruptly stops at Cycle ~202 followed by a single . and then the process is gone. No backtrace, no OOM marker visible in stdout/stderr. vm_stat immediately after showed plenty of free pages, so it doesn't look like classic system-wide memory pressure — could be MLX-internal (KV cache + draft block buffers compounding under sustained verify) or a DFlash-specific edge case. Happy to instrument and re-run if useful.

Possible directions

Two hypotheses I'd love your read on:

  1. Draft–target distribution mismatch on the MoE routing layer. If the 4-layer draft's hidden state at the routing boundaries differs slightly from the target's, the draft's predicted block routes through "wrong" experts, target rejects almost everything. Is the published draft a stable checkpoint, or is there a known iteration coming with better acceptance?

  2. --stream-experts interaction, similar in spirit to SwiftLM's #72 (SSD streaming + vanilla --draft-model causing fan-out, ultimately auto-capped to 1 draft token). DFlash bypasses that auto-cap because it's a different code path. Worth knowing whether you've validated the 122B draft on a swap-bound / out-of-core path, or only fully-resident-in-RAM setups.

If a fresh draft checkpoint or recommended config (smaller block_size, sliding-window cap, etc.) would change the picture, happy to re-bench and report back.

Repro

SWIFTLM_TOP_K=6 TEND_MOE_CACHE_SLOTS=16 \
  SwiftLM \
    --model <path>/Qwen3.5-122B-A10B-4bit \
    --port 8002 \
    --stream-experts \
    --dflash \
    --draft-model <path>/Qwen3.5-122B-A10B-DFlash

Bench script (single-request streaming, three prompts at 200/400/800 max-tokens) is the same one used for the M1 Ultra baseline numbers in SwiftLM #84. Happy to share JSON / logs if useful.

Cross-reference: also flagging this on the SwiftLM side since it intersects with their SSD-stream + DFlash code path.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions