fix(server): prompt-cache bleed fixes + Qwen3-A3B perf table (resolves #84) #86
```diff
@@ -73,6 +73,21 @@ Benchmark results for `gemma-4-26b-a4b-it-4bit` (26B MoE, 4-bit) on M5 Pro 64 GB
 
 > Run `./run_benchmark.sh` to generate these metrics on your own device. (See **Benchmarks & Testing** below.)
 
+### Qwen3.6-35B-A3B-UD-MLX-4bit (Full-RAM) — M1 Ultra 64 GB
+
+Benchmark results for full-RAM (no SSD streaming) MoE inference on M1 Ultra. The 3.4× improvement on the vanilla path vs. earlier builds comes from the `needsMoeFlush` gate in `mlx-swift-lm` (see [SwiftLM #84](https://github.com/SharpAI/SwiftLM/issues/84)): the per-layer GPU sync barrier required for SSD streaming was firing unconditionally on the full-RAM path and flushing MLX's kernel-batching pipeline.
+
+| Configuration | Short (~126 tok) | Medium (~400 tok) | Long (~800 tok) |
+|---|---|---|---|
+| **Vanilla full-GPU** | **61.7 tok/s** | **62.3 tok/s** | **62.1 tok/s** |
+| `--dflash` (block_size=16) † | 52.3 tok/s | **70.3 tok/s** (+13%) | **69.9 tok/s** (+13%) |
+
+> *Hardware:* Apple M1 Ultra, 64 GB unified memory, macOS 26.x. Model ~20 GB on disk, ~21.6 GB resident weights + ~2.1 GB KV at runtime.
```

Suggested change:

```diff
-> *Hardware:* Apple M1 Ultra, 64 GB unified memory, macOS 26.x. Model ~20 GB on disk, ~21.6 GB resident weights + ~2.1 GB KV at runtime.
+> *Hardware:* Apple M1 Ultra, 64 GB unified memory, macOS (Apple Silicon). Model ~20 GB on disk, ~21.6 GB resident weights + ~2.1 GB KV at runtime.
```
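To make the `needsMoeFlush` fix above concrete, here is a minimal, language-agnostic sketch (in Python, with hypothetical names) of the gating logic: the per-layer sync barrier that SSD streaming requires must not fire on the full-RAM path, or every layer flushes the kernel-batching queue.

```python
# Hedged sketch of the flush gate described above; `decode_one_token`,
# `needs_moe_flush`, and the barrier model are illustrative, not the real API.
def decode_one_token(num_layers: int, streaming_from_ssd: bool) -> int:
    """Return how many GPU sync barriers one decode step would issue."""
    barriers = 0
    for _ in range(num_layers):
        # The gate: only the SSD-streaming path needs the per-layer barrier
        # (it must wait for the layer's weights to finish loading).
        needs_moe_flush = streaming_from_ssd
        if needs_moe_flush:
            barriers += 1  # models an eval() barrier flushing batched kernels
    return barriers
```

With the gate in place, a 48-layer full-RAM decode step issues zero barriers and the framework can keep batching kernels across layers, which is where the reported vanilla-path speedup comes from.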
```diff
@@ -1180,7 +1180,24 @@ actor PromptCache {
         if cache.contains(where: { $0 is MambaCache }) {
             return
         }
-        let states = cache.map { $0.state }
+        let P = tokens.count
+        // For attention KVCacheSimple layers, the state tensor is [B, H, T, D] with a
+        // pre-allocated T that can exceed the actual prompt length P. If we store the
+        // full over-sized buffer, restore()'s trim() by (cached.tokens.count - matchLen)
+        // still leaves T - P slots of garbage beyond the valid prefix. Slice T to P at
+        // save time so cached.tokens.count equals the cached state's T.
+        let states: [[MLXArray]] = cache.map { layer -> [MLXArray] in
+            let s = layer.state
+            if layer is KVCacheSimple {
+                return s.map { arr -> MLXArray in
+                    guard arr.ndim >= 3 else { return arr }
+                    let T = arr.dim(2)
+                    if T > P { return arr[.ellipsis, ..<P, 0...] }
+                    return arr
+                }
+            }
+            return s
+        }
         let metaStates = cache.map { $0.metaState }
         // Materialize all lazy MLX arrays so they survive cache mutations
         let allArrays = states.flatMap { $0 }
```
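The save-time trim above can be modeled in a few lines. This is a hedged NumPy sketch of the invariant (not the actual MLX-Swift code): the attention cache buffer is pre-allocated to `T` time steps, only the first `P` (prompt length) are valid, so the `T` axis is sliced down to `P` before caching.

```python
import numpy as np

# Sketch of the KVCacheSimple save-time trim: slice a pre-allocated
# [B, H, T, D] buffer to the valid prefix length P before storing it.
def save_state(state: np.ndarray, prompt_len: int) -> np.ndarray:
    if state.ndim < 3:             # state with no T axis: keep as-is
        return state
    T = state.shape[2]             # [B, H, T, D] layout, T on axis 2
    if T > prompt_len:
        return state[..., :prompt_len, :]
    return state

buf = np.zeros((1, 8, 256, 64))    # pre-allocated T = 256
trimmed = save_state(buf, 100)     # only P = 100 prompt tokens are valid
```

After the trim, the cached tensor's `T` equals the cached token count, so a later `restore()` that trims by `cached.tokens.count - matchLen` lands exactly on the valid prefix with no garbage slots left over.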
```diff
@@ -1206,6 +1223,20 @@ actor PromptCache {
             misses += 1
             return nil
         }
+        // ── Recurrent-layer safety gate ──
+        // MambaCache (and other recurrent caches) store a 2-D hidden state with no
+        // T dimension, so the dim(2) read below would crash. Hybrid Mamba/attention
+        // models (Qwen-Next, Mamba-2, etc.) can't be safely prefix-restored because
+        // the recurrent hidden state was computed over the WHOLE previous sequence
+        // and there is no trim(excess) operator for it. Treat any cache containing
+        // a recurrent layer as a miss before we touch anything.
+        let hasRecurrentLayer = cache.contains { layer in
+            !(layer is KVCacheSimple) && !(String(describing: type(of: layer)).contains("Rotating"))
+        }
+        if hasRecurrentLayer {
```
**Comment on lines +1232 to +1236**

Suggested change:

```diff
-        // a recurrent layer as a miss before we touch anything.
-        let hasRecurrentLayer = cache.contains { layer in
-            !(layer is KVCacheSimple) && !(String(describing: type(of: layer)).contains("Rotating"))
-        }
-        if hasRecurrentLayer {
+        // an unsupported cache implementation as a miss before we touch anything.
+        let hasUnsupportedCacheLayer = cache.contains { layer in
+            !(layer is KVCacheSimple) && !(layer is RotatingKVCache)
+        }
+        if hasUnsupportedCacheLayer {
```
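The reason the gate treats recurrent layers as a hard miss can be shown with a toy model (a hedged Python sketch, modeling the recurrent state as a running sum; this is an analogy, not Mamba's actual recurrence): a KV cache stores one entry per token, so the state after the first `k` tokens is recoverable by slicing, while a recurrent state folds the whole sequence into a fixed-size value that no slice can restore to a prefix.

```python
import numpy as np

# KV cache: per-token entries, sliceable back to any prefix.
# Recurrent state: one fold over the WHOLE sequence, not sliceable.
tokens = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0])

kv_cache = tokens.copy()           # one cached entry per token
recurrent_state = tokens.sum()     # cumulative fold over all 6 tokens

k = 4
prefix_kv = kv_cache[:k]           # exact cached state for the 4-token prefix
prefix_state = tokens[:k].sum()    # the recurrent state a prefix restore needs
```

Since `recurrent_state` retains no per-token structure, there is no `trim(excess)` that turns it into `prefix_state`; restoring it for a shorter matched prefix would silently bleed state from the old suffix into the new request.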
**Copilot AI** · Apr 26, 2026
There's an integration test suite (`tests/test-server.sh`), but it doesn't appear to exercise prompt-cache hit/partial-hit behavior or the new "bypass prompt cache when a draft model is enabled" branch. Adding a regression test that sends the same prompt twice (and checks that the second response is not the historical 1-token-then-EOS failure) would help prevent these prompt-cache bleed issues from reappearing.
The table row label `--dflash` (block_size=16) doesn't correspond to the actual CLI flag name used by the server. `Server.swift` declares this as `--dflash-block-size` (separate from `--dflash`), so readers won't be able to reproduce the benchmark as written.