
fix(server): prompt-cache bleed fixes + Qwen3-A3B perf table (resolves #84) #86

Closed

solderzzc wants to merge 3 commits into main from merge/eric-pr85-resolved

Conversation

@solderzzc
Member

This PR lands the changes from @ericjlake's PR #85 with merge conflicts resolved against current main (post-DFlash merge, PR #78).

Why a separate PR? Eric's fork has "Allow edits from maintainers" disabled, so we couldn't push the conflict resolution directly to his perf/combined branch. This PR cherry-picks his exact changes and resolves the three conflicts:

Conflict resolution

| File | Eric's change | Our main | Resolution |
|---|---|---|---|
| README.md | Qwen3-A3B perf table | DeepSeek-V4 perf table | Keep both |
| Server.swift `save()` | T-dim slice fix | MambaCache early return | Keep both — MambaCache guard runs first, T-dim slice runs after (sketched below) |
| Server.swift decision branch | Spec-decode checked first | skipPromptCache guard (kvBits) | Combined — spec-decode first, then skipPromptCache gate |
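
For concreteness, a minimal sketch of the combined `save()` order described in the table. The function name, the `[B, H, T, D]` state layout, and the MLX range subscripting are assumptions for illustration, not the repo's actual signatures:

```swift
import MLX

// Sketch only: illustrates the merged save() order, not the real Server.swift code.
func savePromptCache(tokens: [Int], cache: [Any], cacheStates: [MLXArray]) -> [MLXArray]? {
    // 1. MambaCache early return (from main): recurrent hidden state has no
    //    T dimension, so a token-indexed snapshot is meaningless; bail out.
    guard !cache.contains(where: { String(describing: type(of: $0)).contains("Mamba") }) else {
        return nil
    }
    // 2. T-dim slice (from PR #85): trim each state tensor to tokens.count so
    //    the over-allocated prefill buffer can't leak uninitialized positions
    //    past the valid prefix. Assumes state shape [B, H, T, D].
    let P = tokens.count
    return cacheStates.map { $0[0..., 0..., ..<P, 0...] }
}
```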

Eric's changes (unchanged)

  1. KVCacheSimple T-dim slice in save() — prevents garbage beyond valid prefix on cache restore
  2. ndim >= 3 guard in minCachedSeqLen scan — defensive guard against non-3D state tensors
  3. Spec-decode short-circuit ordering — prevents draft/main KV drift on partial-match restores
  4. Recurrent-layer safety gate in restore() — invalidates cache hits for hybrid Mamba models
  5. README perf table — Qwen3-A3B full-RAM benchmarks on M1 Ultra 64 GB

Companion PR (already merged): SharpAI/mlx-swift-lm#34 (needsMoeFlush gate)

Closes #84.
Co-authored-by: Eric Lake <ericjlake@users.noreply.github.com>

Eric and others added 3 commits April 22, 2026 04:05
Three fixes, now riding on upstream 116ee91:

1. save(): slice KVCacheSimple state T-dim down to P=tokens.count so the
   cached states' T matches cached.tokens.count. Prevents the over-allocated
   prefill buffer from carrying uninitialized tokens past the valid prefix.

2. restore(): gate out recurrent-state layers (MambaCache and friends) up
   front. Their state is 2-D with no T dimension, so the dim(2) read in the
   pre-flight check would crash; also there's no trim(excess) operator for
   a recurrent hidden state — we can't partial-restore one safely. Guard
   with ndim >= 3 inside the min-length scan too, for belt-and-suspenders (sketched below).

3. handleChatCompletion(): reorder the decision branch so speculative
   decoding is checked BEFORE the prompt cache restore. A cache-hit rollback
   corrupts the draft model's KV state (draft and main cycle tokens in
   lock-step), so when draftModelRef is set we bypass the cache entirely
   and pay the full prefill. Partial-match restores stay available on the
   non-spec path where they still pay off.
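
A minimal sketch of the guard in (2) above, assuming each cache layer exposes its tensors through a `state` array; the `CacheLayer` protocol here is a stand-in, not the repo's actual API:

```swift
import MLX

// Illustrative stand-in for however the real cache layers expose their state.
protocol CacheLayer { var state: [MLXArray] { get } }

// Returns the shortest cached sequence length across all layers, or nil if
// any tensor lacks a T dimension (e.g. a 2-D recurrent hidden state).
func minCachedSeqLen(_ cache: [CacheLayer]) -> Int? {
    var minLen = Int.max
    for layer in cache {
        for tensor in layer.state {
            // ndim >= 3 guard: reading dim(2) on a 2-D Mamba state would crash,
            // and a recurrent state can't be partially restored anyway.
            guard tensor.ndim >= 3 else { return nil }
            minLen = min(minLen, tensor.dim(2))
        }
    }
    return minLen == Int.max ? nil : minLen
}
```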
Adds a new Performance subsection covering full-RAM Qwen3.6-35B-A3B-UD-MLX-4bit
inference on M1 Ultra 64 GB:
- Vanilla full-GPU (62 tok/s) — post needsMoeFlush gate (SwiftLM #84)
- DFlash spec decode with z-lab/Qwen3.6-35B-A3B-DFlash (+13% medium/long,
  -15% short due to block overhead, finish_reason behavior changes)

Includes 19→62 tok/s before/after reference for the gate fix.
Merges ericjlake's prompt-cache fixes from PR #85, resolving conflicts
with the DFlash integration (PR #78).

Changes from ericjlake:
- MambaCache safety gate + KVCacheSimple T-dim slice in save()
- ndim >= 3 guard in minCachedSeqLen scan
- Spec-decode short-circuit ordering (check before cache restore)
- README: Qwen3-A3B full-RAM perf table (M1 Ultra 64 GB)

Conflict resolution:
- README.md: kept both Qwen3-A3B and DeepSeek-V4 perf tables
- Server.swift save(): kept existing MambaCache early return + new T-dim slice
- Server.swift decision branch: combined spec-decode-first + skipPromptCache (kvBits)

Closes #84.
Co-authored-by: Eric Lake <ericjlake@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 26, 2026 06:43
@solderzzc
Member Author

Closing in favor of the cross-fork approach — sent the conflict resolution as a PR to Eric's fork (ericjlake#1) so his original PR #85 stays the canonical one. Once he merges the resolution, #85 will be conflict-free and we can land it directly.

@solderzzc solderzzc closed this Apr 26, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR merges prompt-cache correctness fixes in the SwiftLM server (to prevent cache “bleed”/drift in several edge cases) and adds a new README performance table for full-RAM Qwen3-A3B benchmarks, aligning with the resolution of issue #84.

Changes:

  • Fixes prompt-cache snapshot/restore safety by slicing KVCacheSimple T-dimension on save, adding defensive guards during restore, and adding additional cache-type safety gating.
  • Reorders the generation decision branch to short-circuit speculative decoding before attempting prompt-cache restore.
  • Adds a README performance subsection/table for Qwen3.6-35B-A3B full-RAM results on M1 Ultra.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| Sources/SwiftLM/Server.swift | Prompt-cache save/restore correctness fixes + spec-decode ordering changes. |
| README.md | Adds Qwen3-A3B (full-RAM) benchmark table and explanatory notes. |


Comment on lines +1232 to +1236
```swift
// a recurrent layer as a miss before we touch anything.
let hasRecurrentLayer = cache.contains { layer in
    !(layer is KVCacheSimple) && !(String(describing: type(of: layer)).contains("Rotating"))
}
if hasRecurrentLayer {
```

Copilot AI Apr 26, 2026


hasRecurrentLayer relies on String(describing: type(of: layer)).contains("Rotating") to classify cache layers. This is brittle (renames / different type names will silently change behavior) and can incorrectly disable prompt caching for new non-recurrent cache types. Prefer a capability-based check (e.g. layer.isTrimmable / an explicit protocol for prefix-restorable caches) or an explicit allowlist of known-safe cache types instead of string matching.

Suggested change

```diff
-// a recurrent layer as a miss before we touch anything.
-let hasRecurrentLayer = cache.contains { layer in
-    !(layer is KVCacheSimple) && !(String(describing: type(of: layer)).contains("Rotating"))
-}
-if hasRecurrentLayer {
+// an unsupported cache implementation as a miss before we touch anything.
+let hasUnsupportedCacheLayer = cache.contains { layer in
+    !(layer is KVCacheSimple) && !(layer is RotatingKVCache)
+}
+if hasUnsupportedCacheLayer {
```
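
For reference, the capability-based alternative the comment suggests could look like this; it is entirely hypothetical, and neither the protocol nor the conformances exist in the repo:

```swift
// Hypothetical capability protocol; nothing here is existing repo API.
protocol PrefixRestorableCache {
    /// Roll the cache back to `length` valid timesteps.
    func trim(to length: Int)
}

// Conformances would be declared only on cache types that genuinely support
// prefix rollback, e.g.:
//   extension KVCacheSimple: PrefixRestorableCache { ... }
//   extension RotatingKVCache: PrefixRestorableCache { ... }

// The miss gate then needs no type-name strings:
//   let hasUnsupportedCacheLayer = cache.contains { !($0 is PrefixRestorableCache) }
```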

Comment thread README.md
| Configuration | Short (~126 tok) | Medium (~400 tok) | Long (~800 tok) |
|---|---|---|---|
| **Vanilla full-GPU** | **61.7 tok/s** | **62.3 tok/s** | **62.1 tok/s** |
| `--dflash` (block_size=16) † | 52.3 tok/s | **70.3 tok/s** (+13%) | **69.9 tok/s** (+13%) |

Copilot AI Apr 26, 2026


The table row label --dflash (block_size=16) doesn't correspond to the actual CLI flag name used by the server. Server.swift declares this as --dflash-block-size (separate from --dflash), so readers won't be able to reproduce the benchmark as written.

Suggested change

```diff
-| `--dflash` (block_size=16) † | 52.3 tok/s | **70.3 tok/s** (+13%) | **69.9 tok/s** (+13%) |
+| `--dflash --dflash-block-size 16` † | 52.3 tok/s | **70.3 tok/s** (+13%) | **69.9 tok/s** (+13%) |
```

Comment thread README.md
| **Vanilla full-GPU** | **61.7 tok/s** | **62.3 tok/s** | **62.1 tok/s** |
| `--dflash` (block_size=16) † | 52.3 tok/s | **70.3 tok/s** (+13%) | **69.9 tok/s** (+13%) |

> *Hardware:* Apple M1 Ultra, 64 GB unified memory, macOS 26.x. Model ~20 GB on disk, ~21.6 GB resident weight + ~2.1 GB KV at runtime.

Copilot AI Apr 26, 2026


macOS 26.x looks like a typo / non-existent major version (Apple’s macOS major versions haven’t used 2-digit numbering historically). Consider using the actual macOS version used for the benchmark (or a generic “macOS (Apple Silicon)” if you don’t want to pin it) to avoid confusing readers.

Suggested change

```diff
-> *Hardware:* Apple M1 Ultra, 64 GB unified memory, macOS 26.x. Model ~20 GB on disk, ~21.6 GB resident weight + ~2.1 GB KV at runtime.
+> *Hardware:* Apple M1 Ultra, 64 GB unified memory, macOS (Apple Silicon). Model ~20 GB on disk, ~21.6 GB resident weight + ~2.1 GB KV at runtime.
```

Comment on lines +1555 to +1573
```swift
// ── Decision branch ──
// Speculative decoding is CHECKED FIRST because a cache-hit rollback
// corrupts the draft model's KV state (draft and main model cycle tokens
// in lock-step). We'd rather pay the prefill than emit garbage.
//
// Skip prompt cache for quantized-KV requests: the prompt cache stores KV state
// produced with KVCacheSimple; restoring it into a QuantizedKVCache (or vice-versa)
// is unsafe and produces incorrect results or runtime failures.
let skipPromptCache = isMultimodalRequest || params.kvBits != nil
var stream: AsyncStream<Generation>
if let draftRef = draftModelRef {
    // Speculative decoding path: draft model generates candidates, main model verifies.
    // Bypass prompt cache to avoid draft/main KV drift on partial-match restores.
    print("[SwiftLM] Using speculative decoding (\(numDraftTokens) draft tokens/round)")
    stream = try MLXLMCommon.generate(
        input: lmInput, cache: cache, parameters: params, context: context,
        draftModel: draftRef.model, numDraftTokens: numDraftTokens
    )
} else if !skipPromptCache, let cachedCount = await promptCache.restore(newTokens: promptTokens, into: cache) {
```

Copilot AI Apr 26, 2026


There’s an integration test suite (tests/test-server.sh) but it doesn’t appear to exercise prompt-cache hit/partial-hit behavior or the new “bypass prompt cache when draft model is enabled” branch. Adding a regression test that sends the same prompt twice (and ensures the second response is not the historical 1-token-then-EOS failure) would help prevent these prompt-cache bleed issues from reappearing.
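
A sketch of such a regression check as a standalone Swift script. The port, route, and response shape are assumptions about the local server's OpenAI-style endpoint, and this file is not part of the repo:

```swift
// prompt-cache-regression.swift (hypothetical): send the identical prompt
// twice; the second reply exercises the cache-hit restore path and must not
// degenerate into the historical 1-token-then-EOS failure.
import Foundation

func chat(_ prompt: String) async throws -> String {
    var req = URLRequest(url: URL(string: "http://127.0.0.1:8080/v1/chat/completions")!)
    req.httpMethod = "POST"
    req.setValue("application/json", forHTTPHeaderField: "Content-Type")
    req.httpBody = try JSONSerialization.data(withJSONObject: [
        "messages": [["role": "user", "content": prompt]],
        "max_tokens": 64,
    ] as [String: Any])
    let (data, _) = try await URLSession.shared.data(for: req)
    let json = try JSONSerialization.jsonObject(with: data) as! [String: Any]
    let message = (json["choices"] as! [[String: Any]])[0]["message"] as! [String: Any]
    return message["content"] as! String
}

let prompt = "Explain the Fermi paradox in two sentences."
let first = try await chat(prompt)   // cold prefill; populates the prompt cache
let second = try await chat(prompt)  // cache-hit path under test
precondition(second.count > 2, "cache-hit reply degenerated to ~1 token: \(second)")
print("prompt-cache regression check passed (\(first.count) / \(second.count) chars)")
```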



Development

Successfully merging this pull request may close these issues.

Qwen3-A3B full-RAM generation: 18 → 63 tok/s on M1 Ultra 64GB after needsMoeFlush gate + prompt-cache bleed fixes

3 participants