diff --git a/README.md b/README.md
index 3a8d377..5f8465f 100644
--- a/README.md
+++ b/README.md
@@ -73,6 +73,21 @@ Benchmark results for `gemma-4-26b-a4b-it-4bit` (26B MoE, 4-bit) on M5 Pro 64 GB
 
 > Run `./run_benchmark.sh` to generate these metrics on your own device. (See **Benchmarks & Testing** below).
 
+### Qwen3.6-35B-A3B-UD-MLX-4bit (Full-RAM) — M1 Ultra 64 GB
+
+Benchmark results for full-RAM (no SSD streaming) MoE inference on M1 Ultra. The 3.4× improvement over earlier vanilla builds comes from the `needsMoeFlush` gate in `mlx-swift-lm` (see [SwiftLM #84](https://github.com/SharpAI/SwiftLM/issues/84)) — the per-layer GPU sync barrier required for SSD streaming was firing unconditionally on the full-RAM path and flushing MLX's kernel-batching pipeline.
+
+| Configuration | Short (~126 tok) | Medium (~400 tok) | Long (~800 tok) |
+|---|---|---|---|
+| **Vanilla full-GPU** | **61.7 tok/s** | **62.3 tok/s** | **62.1 tok/s** |
+| `--dflash` (block_size=16) † | 52.3 tok/s | **70.3 tok/s** (+13%) | **69.9 tok/s** (+13%) |
+
+> *Hardware:* Apple M1 Ultra, 64 GB unified memory, macOS 26.x. Model ~20 GB on disk, ~21.6 GB resident weights + ~2.1 GB KV at runtime.
+> *Flags:* `--repeat-penalty 1.1 --max-tokens 2000`, `temperature: 0.6`, single-stream `/v1/chat/completions`.
+> *Vanilla baseline before the* `needsMoeFlush` *gate (for reference):* 19.2 / 18.1 / 18.3 tok/s — see #84.
+
+† DFlash uses [`z-lab/Qwen3.6-35B-A3B-DFlash`](https://huggingface.co/z-lab/Qwen3.6-35B-A3B-DFlash) (~948 MB) as the block-diffusion draft model. DFlash gives a clean +13% on medium/long generations but regresses short prompts (the block overhead doesn't amortize at low token counts) and changes stop-condition behavior (`finish_reason=null` vs `stop`/`length`). We recommend a quality eval before making it the default.
+
 ### DeepSeek-V4-Flash (126 GB, Q3-mixed-gs128-affine) — M5 Pro 64 GB
 
 Model: [`Thump604/DeepSeek-V4-Flash-MLX-Q3-mixed-gs128-affine`](https://huggingface.co/Thump604/DeepSeek-V4-Flash-MLX-Q3-mixed-gs128-affine)
diff --git a/Sources/SwiftLM/Server.swift b/Sources/SwiftLM/Server.swift
index 9bf56ec..012d38d 100644
--- a/Sources/SwiftLM/Server.swift
+++ b/Sources/SwiftLM/Server.swift
@@ -1180,7 +1180,24 @@ actor PromptCache {
         if cache.contains(where: { $0 is MambaCache }) { return }
-        let states = cache.map { $0.state }
+        let P = tokens.count
+        // For attention KVCacheSimple layers, the state tensor is [B, H, T, D] with a
+        // pre-allocated T that can exceed the actual prompt length P. If we store the
+        // full over-sized buffer, restore()'s trim() by (cached.tokens.count - matchLen)
+        // still leaves T - P slots of garbage beyond the valid prefix. Slice T to P at
+        // save time so cached.tokens.count === cached state's T.
+        let states: [[MLXArray]] = cache.map { layer -> [MLXArray] in
+            let s = layer.state
+            if layer is KVCacheSimple {
+                return s.map { arr -> MLXArray in
+                    guard arr.ndim >= 3 else { return arr }
+                    let T = arr.dim(2)
+                    if T > P { return arr[.ellipsis, ..
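The invariant the `PromptCache` comment describes — slice the pre-allocated T axis down to the prompt length P at save time, so that `restore()`'s length-based trim lines up with the stored state — can be sketched without MLX. The following is a minimal toy, not SwiftLM code: `FakeKVBuffer`, `saveUnsliced`, `saveSliced`, and `restore` are illustrative names, and a 1-D `[Int]` stands in for the real `[B, H, T, D]` tensor's T axis.

```swift
// Toy model of the PromptCache save/restore mismatch. A -1 marks a
// pre-allocated-but-unwritten ("garbage") slot past the valid prefix.
struct FakeKVBuffer {
    var tokens: [Int]   // token ids the cache claims to cover (count == P)
    var state: [Int]    // backing buffer along the T axis (count == T)
}

// Save WITHOUT slicing: the stored state keeps its capacity T > P.
func saveUnsliced(prompt: [Int], capacity: Int) -> FakeKVBuffer {
    var state = Array(repeating: -1, count: capacity)
    for (i, t) in prompt.enumerated() { state[i] = t }
    return FakeKVBuffer(tokens: prompt, state: state)
}

// Save WITH slicing (what the diff above does): trim T down to P.
func saveSliced(prompt: [Int], capacity: Int) -> FakeKVBuffer {
    var buf = saveUnsliced(prompt: prompt, capacity: capacity)
    buf.state = Array(buf.state.prefix(prompt.count))
    return buf
}

// restore()'s trim: drop (tokens.count - matchLen) entries from the end,
// which only yields the matched prefix if state.count == tokens.count.
func restore(_ buf: FakeKVBuffer, matchLen: Int) -> [Int] {
    let trim = buf.tokens.count - matchLen
    return Array(buf.state.dropLast(trim))
}

let prompt = [10, 11, 12, 13]
let bad  = restore(saveUnsliced(prompt: prompt, capacity: 8), matchLen: 3)
let good = restore(saveSliced(prompt: prompt, capacity: 8), matchLen: 3)
print(bad)   // trim removed too little: garbage -1 slots survive past the prefix
print(good)  // exactly the matched prefix
```

With capacity 8 and a 4-token prompt, the unsliced path trims only `4 - 3 = 1` slot off an 8-slot buffer, leaving garbage beyond the valid prefix; the sliced path returns exactly the 3 matched tokens — the same arithmetic as `T - P` in the diff's comment.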