Qwen3.5-122B-A10B-DFlash + SwiftLM SSD-stream on M1 Ultra: low acceptance + server crash

First, thank you for the very fast turnaround on releasing [`z-lab/Qwen3.5-122B-A10B-DFlash`](https://huggingface.co/z-lab/Qwen3.5-122B-A10B-DFlash) (per #81 — released 2026-04-25). Sharing day-one feedback from M1 Ultra + SwiftLM SSD-streaming, since this is the most memory-constrained large-MoE setup the draft is likely to land on.

## Setup

- **Hardware:** Apple M1 Ultra, 64 GB unified memory, macOS 26.x, internal NVMe
- **Backend:** [SwiftLM](https://github.com/SharpAI/SwiftLM) `b598`-era + recent local fixes (cherry-picked off `SharpAI/main`), DFlash code path active. The target's `Qwen3MoE+DFlash` extension picks up `Qwen3.5-122B-A10B-4bit` correctly:
  ```
  [SwiftLM] DFlash: target model supports DFlashTargetModel
  [SwiftLM] DFlash draft model loaded (block_size=16, 6 target layers, mask_token=248077)
  [SwiftLM] Draft model loaded successfully (16 block size, DFlash mode)
  [SwiftLM] Using speculative decoding: …Qwen3.5-122B-A10B-DFlash → …Qwen3.5-122B-A10B-4bit (DFlash block-diffusion)
  ```
- **Target:** `mlx-community/Qwen3.5-122B-A10B-4bit` (69.6 GB, 48 layers, A10B active per token)
- **SwiftLM flags:** `--stream-experts --dflash --draft-model …Qwen3.5-122B-A10B-DFlash`, `SWIFTLM_TOP_K=6 TEND_MOE_CACHE_SLOTS=16` (our standard SSD-stream config; used for the baseline below as well)

## Results — generation throughput

Streaming bench via `/v1/chat/completions`, single request, `temperature: 0.6`. Same three prompts as the baseline measurement.

| Configuration | Short (~126 tok in) | Medium (~400 tok in) | Long (~800 tok in) |
|---|---|---|---|
| `--stream-experts` baseline (no DFlash) | 6.30 tok/s · 153 tok generated · stop | 6.11 tok/s · 246 tok · stop | 6.22 tok/s · 800 tok · length |
| `--stream-experts --dflash …-DFlash` | 6.30 tok/s · 200 tok · `finish_reason=null` | **2.78 tok/s** · 395 tok · `finish_reason=null` | **server crashed mid-run** |

DFlash is **net-negative** on this hardware: parity on short, −55% on medium, server crash on long.

## Acceptance pattern (this is the interesting part)

The DFlash cycle log shows a clear pathological pattern across hundreds of cycles:

```
[DFlash] Cycle 180: blockLen=16, verifyLen=16, acceptanceLen=1, commitCount=2
[DFlash] Cycle 181: blockLen=16, verifyLen=16, acceptanceLen=1, commitCount=2
[DFlash] Cycle 182: blockLen=16, verifyLen=16, acceptanceLen=1, commitCount=2
[DFlash] Cycle 183: blockLen=16, verifyLen=16, acceptanceLen=0, commitCount=1
[DFlash] Cycle 184: blockLen=16, verifyLen=16, acceptanceLen=0, commitCount=1
[DFlash] Cycle 185: blockLen=16, verifyLen=16, acceptanceLen=1, commitCount=2
…repeated 200+ cycles, acceptanceLen always 0 or 1
```

**`acceptanceLen` is consistently 0 or 1 out of `block_size=16`.** The expected/healthy range for a well-aligned DFlash draft is much higher (your README's MLX section implies the draft is meant to commit several tokens per block on average).

So we're paying a 17-position verify pass — which on `--stream-experts` means SSD reads for 17 positions × 8 experts each = ~136 expert weight reads per layer per cycle, **vs ~8 reads per layer per token in vanilla** — for ~1 committed token. That fan-out is most of the regression.

## Crash on long prompt

The server crashed silently somewhere in the long-prompt bench. The DFlash cycle log abruptly stops at `Cycle ~202` followed by a single `.` and then the process is gone. No backtrace, no OOM marker visible in stdout/stderr. `vm_stat` immediately after showed plenty of free pages, so it doesn't look like classic system-wide memory pressure — could be MLX-internal (KV cache + draft block buffers compounding under sustained verify) or a DFlash-specific edge case. Happy to instrument and re-run if useful.

## Possible directions

Two hypotheses I'd love your read on:

1. **Draft–target distribution mismatch on the MoE routing layer.** If the 4-layer draft's hidden state at the routing boundaries differs slightly from the target's, the draft's predicted block routes through "wrong" experts, target rejects almost everything. Is the published draft a stable checkpoint, or is there a known iteration coming with better acceptance?

2. **`--stream-experts` interaction**, similar in spirit to SwiftLM's [#72](https://github.com/SharpAI/SwiftLM/issues/72) (SSD streaming + vanilla `--draft-model` causing fan-out, ultimately auto-capped to 1 draft token). DFlash bypasses that auto-cap because it's a different code path. Worth knowing whether you've validated the 122B draft on a swap-bound / out-of-core path, or only fully-resident-in-RAM setups.

If a fresh draft checkpoint or recommended config (smaller `block_size`, sliding-window cap, etc.) would change the picture, happy to re-bench and report back.

## Repro

```bash
SWIFTLM_TOP_K=6 TEND_MOE_CACHE_SLOTS=16 \
  SwiftLM \
    --model <path>/Qwen3.5-122B-A10B-4bit \
    --port 8002 \
    --stream-experts \
    --dflash \
    --draft-model <path>/Qwen3.5-122B-A10B-DFlash
```

Bench script (single-request streaming, three prompts at 200/400/800 max-tokens) is the same one used for the M1 Ultra baseline numbers in [SwiftLM #84](https://github.com/SharpAI/SwiftLM/issues/84). Happy to share JSON / logs if useful.

Cross-reference: also flagging this on the SwiftLM side since it intersects with their SSD-stream + DFlash code path.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Qwen3.5-122B-A10B-DFlash + SwiftLM SSD-stream on M1 Ultra: low acceptance + server crash #91

Setup

Results — generation throughput

Acceptance pattern (this is the interesting part)

Crash on long prompt

Possible directions

Repro

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Configuration	Short (~126 tok in)	Medium (~400 tok in)	Long (~800 tok in)
`--stream-experts` baseline (no DFlash)	6.30 tok/s · 153 tok generated · stop	6.11 tok/s · 246 tok · stop	6.22 tok/s · 800 tok · length
`--stream-experts --dflash …-DFlash`	6.30 tok/s · 200 tok · `finish_reason=null`	2.78 tok/s · 395 tok · `finish_reason=null`	server crashed mid-run

Qwen3.5-122B-A10B-DFlash + SwiftLM SSD-stream on M1 Ultra: low acceptance + server crash #91

Description

Setup

Results — generation throughput

Acceptance pattern (this is the interesting part)

Crash on long prompt

Possible directions

Repro

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions