fix(server): prompt-cache bleed fixes + Qwen3-A3B perf table (resolves #84) #86
solderzzc wants to merge 3 commits
Conversation
Three fixes, now riding on upstream 116ee91:

1. `save()`: slice the `KVCacheSimple` state T-dim down to `P = tokens.count` so the cached state's T matches `cached.tokens.count`. Prevents the over-allocated prefill buffer from carrying uninitialized tokens past the valid prefix (a hedged sketch follows this list).
2. `restore()`: gate out recurrent-state layers (`MambaCache` and friends) up front. Their state is 2-D with no T dimension, so the `dim(2)` read in the pre-flight check would crash; there is also no `trim(excess)` operator for a recurrent hidden state, so we can't partial-restore one safely. An `ndim >= 3` guard inside the min-length scan adds belt-and-suspenders protection.
3. `handleChatCompletion()`: reorder the decision branch so speculative decoding is checked BEFORE the prompt-cache restore. A cache-hit rollback corrupts the draft model's KV state (draft and main cycle tokens in lock-step), so when `draftModelRef` is set we bypass the cache entirely and pay the full prefill. Partial-match restores stay available on the non-spec path, where they still pay off.
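For illustration, a minimal sketch of the `save()`-side slice from fix 1. The `[B, H, T, D]` layout, the standalone helper, and its parameter names are assumptions for the sketch, not the actual SwiftLM code:

```swift
import MLX

// Hedged sketch: slice a KVCacheSimple-style snapshot down to the valid
// prefix before caching it. Assumes keys/values are [B, H, T, D] MLXArrays
// whose T (dim 2) was over-allocated by prefill; real SwiftLM names differ.
func sliceToValidPrefix(keys: MLXArray, values: MLXArray, tokenCount P: Int)
    -> (keys: MLXArray, values: MLXArray)
{
    // Everything past P along T is uninitialized prefill scratch; keep [0, P).
    let k = keys[0..., 0..., ..<P, 0...]
    let v = values[0..., 0..., ..<P, 0...]
    return (k, v)
}
```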
Adds a new Performance subsection covering full-RAM Qwen3.6-35B-A3B-UD-MLX-4bit inference on an M1 Ultra with 64 GB:

- Vanilla full-GPU (62 tok/s), post-`needsMoeFlush` gate (SwiftLM #84)
- DFlash spec decode with z-lab/Qwen3.6-35B-A3B-DFlash: +13% medium/long, -15% short due to block overhead, and changed `finish_reason` behavior

Includes the 19→62 tok/s before/after reference for the gate fix.
Merges ericjlake's prompt-cache fixes from PR #85, resolving conflicts with the DFlash integration (PR #78).

Changes from ericjlake:
- `MambaCache` safety gate + `KVCacheSimple` T-dim slice in `save()`
- `ndim >= 3` guard in the `minCachedSeqLen` scan (sketched just below)
- Spec-decode short-circuit ordering (check before cache restore)
- README: Qwen3-A3B full-RAM perf table (M1 Ultra 64 GB)

Conflict resolution:
- README.md: kept both the Qwen3-A3B and DeepSeek-V4 perf tables
- Server.swift `save()`: kept the existing `MambaCache` early return + the new T-dim slice
- Server.swift decision branch: combined spec-decode-first + `skipPromptCache` (kvBits)

Closes #84.

Co-authored-by: Eric Lake <ericjlake@users.noreply.github.com>
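For context on that `ndim >= 3` guard, here is a loose sketch of what the guarded scan could look like. The `keys` accessor and the scan's shape are assumptions, not Eric's exact code:

```swift
import MLX
import MLXLMCommon

// Hedged sketch: find the shortest cached sequence length across layers,
// skipping any state tensor without a T dimension (recurrent layers are
// 2-D, so reading dim(2) on them would crash).
func minCachedSeqLen(of cache: [KVCache]) -> Int? {
    cache
        .compactMap { layer -> Int? in
            guard let state = (layer as? KVCacheSimple)?.keys,  // assumed accessor
                  state.ndim >= 3
            else { return nil }
            return state.dim(2)  // T dimension of a [B, H, T, D] tensor
        }
        .min()
}
```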
Closing in favor of the cross-fork approach: sent the conflict resolution as a PR to Eric's fork (ericjlake#1) so his original PR #85 stays the canonical one. Once he merges the resolution, #85 will be conflict-free and we can land it directly.
Pull request overview
This PR merges prompt-cache correctness fixes in the SwiftLM server (to prevent cache “bleed”/drift in several edge cases) and adds a new README performance table for full-RAM Qwen3-A3B benchmarks, aligning with the resolution of issue #84.
Changes:
- Fixes prompt-cache snapshot/restore safety by slicing KVCacheSimple T-dimension on save, adding defensive guards during restore, and adding additional cache-type safety gating.
- Reorders the generation decision branch to short-circuit speculative decoding before attempting prompt-cache restore.
- Adds a README performance subsection/table for Qwen3.6-35B-A3B full-RAM results on M1 Ultra.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| Sources/SwiftLM/Server.swift | Prompt-cache save/restore correctness fixes + spec-decode ordering changes. |
| README.md | Adds Qwen3-A3B (full-RAM) benchmark table and explanatory notes. |
```swift
// a recurrent layer as a miss before we touch anything.
let hasRecurrentLayer = cache.contains { layer in
    !(layer is KVCacheSimple) && !(String(describing: type(of: layer)).contains("Rotating"))
}
if hasRecurrentLayer {
```
`hasRecurrentLayer` relies on `String(describing: type(of: layer)).contains("Rotating")` to classify cache layers. This is brittle (renames / different type names will silently change behavior) and can incorrectly disable prompt caching for new non-recurrent cache types. Prefer a capability-based check (e.g. `layer.isTrimmable` / an explicit protocol for prefix-restorable caches) or an explicit allowlist of known-safe cache types instead of string matching.
Suggested change:
```diff
-// a recurrent layer as a miss before we touch anything.
-let hasRecurrentLayer = cache.contains { layer in
-    !(layer is KVCacheSimple) && !(String(describing: type(of: layer)).contains("Rotating"))
-}
-if hasRecurrentLayer {
+// an unsupported cache implementation as a miss before we touch anything.
+let hasUnsupportedCacheLayer = cache.contains { layer in
+    !(layer is KVCacheSimple) && !(layer is RotatingKVCache)
+}
+if hasUnsupportedCacheLayer {
```
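The capability-based alternative the review mentions could look like the sketch below; `PrefixRestorableCache` and `trim(to:)` are hypothetical names for this sketch, not existing SwiftLM or MLX API:

```swift
// Hypothetical protocol: cache layers that can be partially restored to a
// token prefix opt in explicitly, so classification never depends on
// String(describing:) name matching.
protocol PrefixRestorableCache: AnyObject {
    /// Roll the cached sequence back to `length` tokens (partial-match restore).
    func trim(to length: Int)
}

// Known-safe cache types would declare conformance at their definition site:
//   extension KVCacheSimple: PrefixRestorableCache { ... }
//   extension RotatingKVCache: PrefixRestorableCache { ... }
//
// restore() then treats anything without the capability as a cache miss:
//   let hasUnsupportedCacheLayer = cache.contains { !($0 is PrefixRestorableCache) }
```

A rename of `RotatingKVCache` then becomes a compile-time conformance question instead of a silent behavior change.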
| Configuration | Short (~126 tok) | Medium (~400 tok) | Long (~800 tok) |
|---|---|---|---|
| **Vanilla full-GPU** | **61.7 tok/s** | **62.3 tok/s** | **62.1 tok/s** |
| `--dflash` (block_size=16) † | 52.3 tok/s | **70.3 tok/s** (+13%) | **69.9 tok/s** (+13%) |
The table row label `--dflash` (block_size=16) doesn't correspond to the actual CLI flag name used by the server. Server.swift declares this as `--dflash-block-size` (separate from `--dflash`), so readers won't be able to reproduce the benchmark as written.
Suggested change:
```diff
-| `--dflash` (block_size=16) † | 52.3 tok/s | **70.3 tok/s** (+13%) | **69.9 tok/s** (+13%) |
+| `--dflash --dflash-block-size 16` † | 52.3 tok/s | **70.3 tok/s** (+13%) | **69.9 tok/s** (+13%) |
```
```markdown
> *Hardware:* Apple M1 Ultra, 64 GB unified memory, macOS 26.x. Model ~20 GB on disk, ~21.6 GB resident weight + ~2.1 GB KV at runtime.
```
macOS 26.x looks like a typo / non-existent major version (Apple’s macOS major versions haven’t used 2-digit numbering historically). Consider using the actual macOS version used for the benchmark (or a generic “macOS (Apple Silicon)” if you don’t want to pin it) to avoid confusing readers.
Suggested change:
```diff
-> *Hardware:* Apple M1 Ultra, 64 GB unified memory, macOS 26.x. Model ~20 GB on disk, ~21.6 GB resident weight + ~2.1 GB KV at runtime.
+> *Hardware:* Apple M1 Ultra, 64 GB unified memory, macOS (Apple Silicon). Model ~20 GB on disk, ~21.6 GB resident weight + ~2.1 GB KV at runtime.
```
```diff
 // ── Decision branch ──
 // Speculative decoding is CHECKED FIRST because a cache-hit rollback
 // corrupts the draft model's KV state (draft and main model cycle tokens
 // in lock-step). We'd rather pay the prefill than emit garbage.
 //
 // Skip prompt cache for quantized-KV requests: the prompt cache stores KV state
 // produced with KVCacheSimple; restoring it into a QuantizedKVCache (or vice-versa)
 // is unsafe and produces incorrect results or runtime failures.
 let skipPromptCache = isMultimodalRequest || params.kvBits != nil
 var stream: AsyncStream<Generation>
-if !skipPromptCache, let cachedCount = await promptCache.restore(newTokens: promptTokens, into: cache) {
+if let draftRef = draftModelRef {
+    // Speculative decoding path: draft model generates candidates, main model verifies.
+    // Bypass prompt cache to avoid draft/main KV drift on partial-match restores.
+    print("[SwiftLM] Using speculative decoding (\(numDraftTokens) draft tokens/round)")
+    stream = try MLXLMCommon.generate(
+        input: lmInput, cache: cache, parameters: params, context: context,
+        draftModel: draftRef.model, numDraftTokens: numDraftTokens
+    )
+} else if !skipPromptCache, let cachedCount = await promptCache.restore(newTokens: promptTokens, into: cache) {
```
There’s an integration test suite (`tests/test-server.sh`) but it doesn’t appear to exercise prompt-cache hit/partial-hit behavior or the new “bypass prompt cache when draft model is enabled” branch. Adding a regression test that sends the same prompt twice (and ensures the second response is not the historical 1-token-then-EOS failure) would help prevent these prompt-cache bleed issues from reappearing.
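As a rough illustration of such a regression test, a Swift sketch is below. The endpoint, port, and model id are placeholders, the length threshold is a loose heuristic for the 1-token-then-EOS failure, and a real version would likely live in `tests/test-server.sh` or the test target:

```swift
import Foundation

// Hedged sketch: send the identical prompt twice so the second request
// exercises the prompt-cache restore path, then check the warm reply did
// not collapse to the historical 1-token-then-EOS failure.
// Run as a script: swift prompt_cache_regression.swift
struct ChatResponse: Decodable {
    struct Choice: Decodable {
        struct Message: Decodable { let content: String }
        let message: Message
    }
    let choices: [Choice]
}

func complete(_ prompt: String) async throws -> ChatResponse {
    // Placeholder URL and model id; point these at the server under test.
    var req = URLRequest(url: URL(string: "http://localhost:8080/v1/chat/completions")!)
    req.httpMethod = "POST"
    req.setValue("application/json", forHTTPHeaderField: "Content-Type")
    let body: [String: Any] = [
        "model": "qwen3-a3b",
        "messages": [["role": "user", "content": prompt]],
        "max_tokens": 64,
    ]
    req.httpBody = try JSONSerialization.data(withJSONObject: body)
    let (data, _) = try await URLSession.shared.data(for: req)
    return try JSONDecoder().decode(ChatResponse.self, from: data)
}

let prompt = "Explain the difference between a stack and a queue."
let cold = try await complete(prompt)   // full prefill, populates the cache
let warm = try await complete(prompt)   // cache hit: the path that used to bleed
let warmText = warm.choices[0].message.content
precondition(warmText.count > 10,       // loose heuristic, not an exact oracle
             "prompt-cache regression: warm response collapsed (\(warmText))")
print("ok: cold=\(cold.choices[0].message.content.count) chars, warm=\(warmText.count) chars")
```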
This PR lands the changes from @ericjlake's PR #85 with merge conflicts resolved against current `main` (post-DFlash merge, PR #78).

Why a separate PR? Eric's fork has "Allow edits from maintainers" disabled, so we couldn't push the conflict resolution directly to his `perf/combined` branch. This PR cherry-picks his exact changes and resolves the three conflicts:

Conflict resolution
- `README.md`: kept both the Qwen3-A3B and DeepSeek-V4 perf tables
- `Server.swift` `save()`: kept the existing `MambaCache` early return + the new T-dim slice
- `Server.swift` decision branch: combined the `skipPromptCache` guard (kvBits) with the spec-decode-first `skipPromptCache` gate

Eric's changes (unchanged)
- `save()` T-dim slice — prevents garbage beyond valid prefix on cache restore
- `ndim >= 3` guard in `minCachedSeqLen` scan — defensive guard against non-3D state tensors
- `restore()` — invalidates cache hits for hybrid Mamba models

Companion PR (already merged): SharpAI/mlx-swift-lm#34 (`needsMoeFlush` gate)

Closes #84.

Co-authored-by: Eric Lake <ericjlake@users.noreply.github.com>