fix(server): prompt-cache bleed fixes + Qwen3-A3B perf table (resolves #84) #86
solderzzc wants to merge 3 commits
Conversation
Three fixes, now riding on upstream 116ee91:

1. `save()`: slice the `KVCacheSimple` state T-dim down to `P = tokens.count` so the cached state's T matches `cached.tokens.count`. Prevents the over-allocated prefill buffer from carrying uninitialized tokens past the valid prefix (a hedged sketch follows this list).
2. `restore()`: gate out recurrent-state layers (`MambaCache` and friends) up front. Their state is 2-D with no T dimension, so the `dim(2)` read in the pre-flight check would crash; there is also no `trim(excess)` operator for a recurrent hidden state, so we can't partial-restore one safely. An `ndim >= 3` guard inside the min-length scan adds belt-and-suspenders protection.
3. `handleChatCompletion()`: reorder the decision branch so speculative decoding is checked BEFORE the prompt-cache restore. A cache-hit rollback corrupts the draft model's KV state (draft and main cycle tokens in lock-step), so when `draftModelRef` is set we bypass the cache entirely and pay the full prefill. Partial-match restores stay available on the non-spec path, where they still pay off.
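For illustration, a minimal sketch of the `save()`-side slice from fix 1. The `[B, H, T, D]` layout, the standalone helper, and its parameter names are assumptions for the sketch, not the actual SwiftLM code:

```swift
import MLX

// Hedged sketch: slice a KVCacheSimple-style snapshot down to the valid
// prefix before caching it. Assumes keys/values are [B, H, T, D] MLXArrays
// whose T (dim 2) was over-allocated by prefill; real SwiftLM names differ.
func sliceToValidPrefix(keys: MLXArray, values: MLXArray, tokenCount P: Int)
    -> (keys: MLXArray, values: MLXArray)
{
    // Everything past P along T is uninitialized prefill scratch; keep [0, P).
    let k = keys[0..., 0..., ..<P, 0...]
    let v = values[0..., 0..., ..<P, 0...]
    return (k, v)
}
```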
Adds a new Performance subsection covering full-RAM Qwen3.6-35B-A3B-UD-MLX-4bit inference on an M1 Ultra with 64 GB:

- Vanilla full-GPU (62 tok/s), post-`needsMoeFlush` gate (SwiftLM #84)
- DFlash spec decode with z-lab/Qwen3.6-35B-A3B-DFlash: +13% medium/long, -15% short due to block overhead, and changed `finish_reason` behavior

Includes the 19→62 tok/s before/after reference for the gate fix.
Merges ericjlake's prompt-cache fixes from PR #85, resolving conflicts with the DFlash integration (PR #78).

Changes from ericjlake:
- `MambaCache` safety gate + `KVCacheSimple` T-dim slice in `save()`
- `ndim >= 3` guard in the `minCachedSeqLen` scan (sketched just below)
- Spec-decode short-circuit ordering (check before cache restore)
- README: Qwen3-A3B full-RAM perf table (M1 Ultra 64 GB)

Conflict resolution:
- README.md: kept both the Qwen3-A3B and DeepSeek-V4 perf tables
- Server.swift `save()`: kept the existing `MambaCache` early return + the new T-dim slice
- Server.swift decision branch: combined spec-decode-first + `skipPromptCache` (kvBits)

Closes #84.

Co-authored-by: Eric Lake <ericjlake@users.noreply.github.com>
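For context on that `ndim >= 3` guard, here is a loose sketch of what the guarded scan could look like. The `keys` accessor and the scan's shape are assumptions, not Eric's exact code:

```swift
import MLX
import MLXLMCommon

// Hedged sketch: find the shortest cached sequence length across layers,
// skipping any state tensor without a T dimension (recurrent layers are
// 2-D, so reading dim(2) on them would crash).
func minCachedSeqLen(of cache: [KVCache]) -> Int? {
    cache
        .compactMap { layer -> Int? in
            guard let state = (layer as? KVCacheSimple)?.keys,  // assumed accessor
                  state.ndim >= 3
            else { return nil }
            return state.dim(2)  // T dimension of a [B, H, T, D] tensor
        }
        .min()
}
```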
Closing in favor of the cross-fork approach: sent the conflict resolution as a PR to Eric's fork (ericjlake#1) so his original PR #85 stays the canonical one. Once he merges the resolution, #85 will be conflict-free and we can land it directly.
Pull request overview
This PR merges prompt-cache correctness fixes in the SwiftLM server (to prevent cache “bleed”/drift in several edge cases) and adds a new README performance table for full-RAM Qwen3-A3B benchmarks, aligning with the resolution of issue #84.
Changes:
- Fixes prompt-cache snapshot/restore safety by slicing KVCacheSimple T-dimension on save, adding defensive guards during restore, and adding additional cache-type safety gating.
- Reorders the generation decision branch to short-circuit speculative decoding before attempting prompt-cache restore.
- Adds a README performance subsection/table for Qwen3.6-35B-A3B full-RAM results on M1 Ultra.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| Sources/SwiftLM/Server.swift | Prompt-cache save/restore correctness fixes + spec-decode ordering changes. |
| README.md | Adds Qwen3-A3B (full-RAM) benchmark table and explanatory notes. |
```swift
// a recurrent layer as a miss before we touch anything.
let hasRecurrentLayer = cache.contains { layer in
    !(layer is KVCacheSimple) && !(String(describing: type(of: layer)).contains("Rotating"))
}
if hasRecurrentLayer {
```
`hasRecurrentLayer` relies on `String(describing: type(of: layer)).contains("Rotating")` to classify cache layers. This is brittle (renames / different type names will silently change behavior) and can incorrectly disable prompt caching for new non-recurrent cache types. Prefer a capability-based check (e.g. `layer.isTrimmable` / an explicit protocol for prefix-restorable caches) or an explicit allowlist of known-safe cache types instead of string matching.
Suggested change:
```diff
-// a recurrent layer as a miss before we touch anything.
-let hasRecurrentLayer = cache.contains { layer in
-    !(layer is KVCacheSimple) && !(String(describing: type(of: layer)).contains("Rotating"))
-}
-if hasRecurrentLayer {
+// an unsupported cache implementation as a miss before we touch anything.
+let hasUnsupportedCacheLayer = cache.contains { layer in
+    !(layer is KVCacheSimple) && !(layer is RotatingKVCache)
+}
+if hasUnsupportedCacheLayer {
```
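The capability-based alternative the review mentions could look like the sketch below; `PrefixRestorableCache` and `trim(to:)` are hypothetical names for this sketch, not existing SwiftLM or MLX API:

```swift
// Hypothetical protocol: cache layers that can be partially restored to a
// token prefix opt in explicitly, so classification never depends on
// String(describing:) name matching.
protocol PrefixRestorableCache: AnyObject {
    /// Roll the cached sequence back to `length` tokens (partial-match restore).
    func trim(to length: Int)
}

// Known-safe cache types would declare conformance at their definition site:
//   extension KVCacheSimple: PrefixRestorableCache { ... }
//   extension RotatingKVCache: PrefixRestorableCache { ... }
//
// restore() then treats anything without the capability as a cache miss:
//   let hasUnsupportedCacheLayer = cache.contains { !($0 is PrefixRestorableCache) }
```

A rename of `RotatingKVCache` then becomes a compile-time conformance question instead of a silent behavior change.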
| Configuration | Short (~126 tok) | Medium (~400 tok) | Long (~800 tok) |
|---|---|---|---|
| **Vanilla full-GPU** | **61.7 tok/s** | **62.3 tok/s** | **62.1 tok/s** |
| `--dflash` (block_size=16) † | 52.3 tok/s | **70.3 tok/s** (+13%) | **69.9 tok/s** (+13%) |
The table row label `--dflash` (block_size=16) doesn't correspond to the actual CLI flag name used by the server. Server.swift declares this as `--dflash-block-size` (separate from `--dflash`), so readers won't be able to reproduce the benchmark as written.
Suggested change:
```diff
-| `--dflash` (block_size=16) † | 52.3 tok/s | **70.3 tok/s** (+13%) | **69.9 tok/s** (+13%) |
+| `--dflash --dflash-block-size 16` † | 52.3 tok/s | **70.3 tok/s** (+13%) | **69.9 tok/s** (+13%) |
```
```markdown
> *Hardware:* Apple M1 Ultra, 64 GB unified memory, macOS 26.x. Model ~20 GB on disk, ~21.6 GB resident weight + ~2.1 GB KV at runtime.
```
macOS 26.x looks like a typo / non-existent major version (Apple’s macOS major versions haven’t used 2-digit numbering historically). Consider using the actual macOS version used for the benchmark (or a generic “macOS (Apple Silicon)” if you don’t want to pin it) to avoid confusing readers.
Suggested change:
```diff
-> *Hardware:* Apple M1 Ultra, 64 GB unified memory, macOS 26.x. Model ~20 GB on disk, ~21.6 GB resident weight + ~2.1 GB KV at runtime.
+> *Hardware:* Apple M1 Ultra, 64 GB unified memory, macOS (Apple Silicon). Model ~20 GB on disk, ~21.6 GB resident weight + ~2.1 GB KV at runtime.
```
```diff
 // ── Decision branch ──
 // Speculative decoding is CHECKED FIRST because a cache-hit rollback
 // corrupts the draft model's KV state (draft and main model cycle tokens
 // in lock-step). We'd rather pay the prefill than emit garbage.
 //
 // Skip prompt cache for quantized-KV requests: the prompt cache stores KV state
 // produced with KVCacheSimple; restoring it into a QuantizedKVCache (or vice-versa)
 // is unsafe and produces incorrect results or runtime failures.
 let skipPromptCache = isMultimodalRequest || params.kvBits != nil
 var stream: AsyncStream<Generation>
-if !skipPromptCache, let cachedCount = await promptCache.restore(newTokens: promptTokens, into: cache) {
+if let draftRef = draftModelRef {
+    // Speculative decoding path: draft model generates candidates, main model verifies.
+    // Bypass prompt cache to avoid draft/main KV drift on partial-match restores.
+    print("[SwiftLM] Using speculative decoding (\(numDraftTokens) draft tokens/round)")
+    stream = try MLXLMCommon.generate(
+        input: lmInput, cache: cache, parameters: params, context: context,
+        draftModel: draftRef.model, numDraftTokens: numDraftTokens
+    )
+} else if !skipPromptCache, let cachedCount = await promptCache.restore(newTokens: promptTokens, into: cache) {
```
There’s an integration test suite (`tests/test-server.sh`) but it doesn’t appear to exercise prompt-cache hit/partial-hit behavior or the new “bypass prompt cache when draft model is enabled” branch. Adding a regression test that sends the same prompt twice (and ensures the second response is not the historical 1-token-then-EOS failure) would help prevent these prompt-cache bleed issues from reappearing.
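As a rough illustration of such a regression test, a Swift sketch is below. The endpoint, port, and model id are placeholders, the length threshold is a loose heuristic for the 1-token-then-EOS failure, and a real version would likely live in `tests/test-server.sh` or the test target:

```swift
import Foundation

// Hedged sketch: send the identical prompt twice so the second request
// exercises the prompt-cache restore path, then check the warm reply did
// not collapse to the historical 1-token-then-EOS failure.
// Run as a script: swift prompt_cache_regression.swift
struct ChatResponse: Decodable {
    struct Choice: Decodable {
        struct Message: Decodable { let content: String }
        let message: Message
    }
    let choices: [Choice]
}

func complete(_ prompt: String) async throws -> ChatResponse {
    // Placeholder URL and model id; point these at the server under test.
    var req = URLRequest(url: URL(string: "http://localhost:8080/v1/chat/completions")!)
    req.httpMethod = "POST"
    req.setValue("application/json", forHTTPHeaderField: "Content-Type")
    let body: [String: Any] = [
        "model": "qwen3-a3b",
        "messages": [["role": "user", "content": prompt]],
        "max_tokens": 64,
    ]
    req.httpBody = try JSONSerialization.data(withJSONObject: body)
    let (data, _) = try await URLSession.shared.data(for: req)
    return try JSONDecoder().decode(ChatResponse.self, from: data)
}

let prompt = "Explain the difference between a stack and a queue."
let cold = try await complete(prompt)   // full prefill, populates the cache
let warm = try await complete(prompt)   // cache hit: the path that used to bleed
let warmText = warm.choices[0].message.content
precondition(warmText.count > 10,       // loose heuristic, not an exact oracle
             "prompt-cache regression: warm response collapsed (\(warmText))")
print("ok: cold=\(cold.choices[0].message.content.count) chars, warm=\(warmText.count) chars")
```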
This PR lands the changes from @ericjlake's PR #85 with merge conflicts resolved against current `main` (post-DFlash merge, PR #78).

Why a separate PR? Eric's fork has "Allow edits from maintainers" disabled, so we couldn't push the conflict resolution directly to his `perf/combined` branch. This PR cherry-picks his exact changes and resolves the three conflicts:

Conflict resolution
- `README.md`: kept both the Qwen3-A3B and DeepSeek-V4 perf tables
- `Server.swift` `save()`: kept the existing `MambaCache` early return + the new T-dim slice
- `Server.swift` decision branch: combined the `skipPromptCache` guard (kvBits) with the spec-decode-first `skipPromptCache` gate

Eric's changes (unchanged)
- `save()` T-dim slice — prevents garbage beyond valid prefix on cache restore
- `ndim >= 3` guard in `minCachedSeqLen` scan — defensive guard against non-3D state tensors
- `restore()` — invalidates cache hits for hybrid Mamba models

Companion PR (already merged): SharpAI/mlx-swift-lm#34 (`needsMoeFlush` gate)

Closes #84.

Co-authored-by: Eric Lake <ericjlake@users.noreply.github.com>