merge: resolve conflicts with SharpAI/SwiftLM main (DFlash integration) #1
Conversation
Critical bug fix and performance optimizations for DFlash speculative decoding. Acceptance rate improved from 25% to 89% (matching the Python reference), throughput from 6.7 to 42 tok/s.

Root cause: hiddenNorm was declared without @ModuleInfo, so its RMSNorm weight was never loaded from safetensors. The key "hidden_norm.weight" didn't match the reflected key "hiddenNorm.weight", leaving the weight at all-ones instead of the trained values (~0.98). This single missing weight distorted every draft prediction, compounding through all 5 draft layers.

Fix: Added the @ModuleInfo(key: "hidden_norm") annotation, matching the safetensors key (see the sketch at the end of this note). Also added @ModuleInfo for norm and fc for consistency.

Performance optimizations:
- Streaming: replaced generateSync + buffered array with generateStreaming + Continuation, yielding tokens immediately
- Draft prefetch: launch next cycle's draft with asyncEval before rollback, overlapping GPU work
- Batched asyncEval: changed blocking eval() to asyncEval() for verify logits and hidden states
- asyncEval(committedHidden): unblocks prefetch window
- Stop token Set: precomputed O(1) lookup
- Removed double fflush, added DFlashDumper call-site guards

Submodule updates:
- mlx-swift-lm: exactSmallProjPad for quantized linear at small seq_len (<16), DFlash protocols, open MambaCache/ArraysCache
- mlx-swift: remove stale .air kernel files

Benchmark (Qwen3.5-27B-4bit, thinking mode, 2048 tokens): 41.9 tok/s, 89.4% acceptance, 216 cycles
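The root-cause fix is a one-line annotation change. A minimal sketch (the neighbouring norm/fc properties are shown only because the note above mentions them; exact types are assumptions):

```swift
// Before: no @ModuleInfo — the reflected key "hiddenNorm.weight" never
// matched the safetensors key "hidden_norm.weight", so the norm weight
// stayed at its all-ones initialization.
// var hiddenNorm: RMSNorm

// After: @ModuleInfo maps the Swift property onto the checkpoint key,
// so the trained weight (~0.98) is actually loaded.
@ModuleInfo(key: "hidden_norm") var hiddenNorm: RMSNorm
@ModuleInfo(key: "norm") var norm: RMSNorm   // added for consistency
@ModuleInfo(key: "fc") var fc: Linear        // added for consistency
```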
… streaming

When SSD expert streaming is active, expert weight tensors (.weight) are replaced with zero-filled placeholders of the correct shape/dtype during loading (see the sketch below). Only scales and biases are loaded into RAM — the actual expert weight data is read from SSD at runtime via pread/mmap.

RAM savings for MoE models:
- Qwen3.6-35B-A3B: 18.4 GB → 5.1 GB (73% reduction)
- Expert weights skipped: 16.1 GB (weight only, not scales/biases)
- Expert scales+biases loaded: ~2 GB (needed for dequantization)

Performance on Qwen3.6-35B-A3B (512 tokens, math prompt):
- No SSD streaming: 11.5 tok/s, 18.4 GB RAM
- SSD streaming only: 11.5 tok/s, 5.1 GB RAM
- SSD + DFlash: 32.2 tok/s, 5.1 GB RAM
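Conceptually, the placeholder substitution looks something like this. A sketch only: it assumes weights arrive as a key → MLXArray map and that expert tensors can be identified by key; ExpertStreamingConfig.shared is this repo's type, but its isActive accessor is an assumption.

```swift
import MLX

// Sketch: with expert streaming on, keep scales/biases in RAM but replace
// quantized expert weight tensors with zero-filled placeholders of the same
// shape/dtype. The real weight data is read from SSD (pread/mmap) at runtime.
func prepareWeightsForExpertStreaming(_ weights: [String: MLXArray]) -> [String: MLXArray] {
    guard ExpertStreamingConfig.shared.isActive else { return weights }   // isActive: assumed accessor
    var out = weights
    for (key, tensor) in weights {
        // Only the expert *weights* are skipped; scales/biases stay resident
        // because they are needed for dequantization.
        if key.contains(".experts.") && key.hasSuffix(".weight") {
            out[key] = MLXArray.zeros(tensor.shape, dtype: tensor.dtype)
        }
    }
    return out
}
```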
Both streaming and non-streaming chat/text completion responses now include a 'timings' object with:
- predicted_per_second: generation speed in tokens/second
- predicted_n: number of completion tokens
- predicted_ms: total generation wall-clock time in ms

This matches llama-server's timing convention and allows clients to see generation speed directly from the API response without external measurement.
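For reference, the shape of that object can be modeled as a small Codable struct (a sketch; the field names are the ones listed above, the JSON values in the trailing comment are illustrative):

```swift
/// Sketch of the llama-server-style `timings` block attached to both
/// streaming and non-streaming completion responses.
struct Timings: Codable {
    let predictedPerSecond: Double   // generation speed in tokens/second
    let predictedN: Int              // number of completion tokens
    let predictedMs: Double          // total generation wall-clock time in ms

    enum CodingKeys: String, CodingKey {
        case predictedPerSecond = "predicted_per_second"
        case predictedN = "predicted_n"
        case predictedMs = "predicted_ms"
    }
}

// Encodes to e.g. {"predicted_per_second": 41.9, "predicted_n": 512, "predicted_ms": 12220}
```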
Tests 4 configurations for Qwen3.6-35B-A3B-4bit with the same math prompt:
- Baseline (no SSD, no DFlash)
- SSD Streaming only
- SSD Streaming + DFlash
- DFlash only

Results (512 tokens, 3 runs each):
- Baseline: 26.3 tok/s, 18.8 GB RAM
- SSD Streaming: 12.5 tok/s, 5.4 GB RAM
- SSD + DFlash: 33.3 tok/s, 7.4 GB RAM ← best tradeoff
- DFlash only: 125.4 tok/s, 20.0 GB RAM
- Add StreamableMoE conformance to Qwen3NextModelInner
- Add LayerPartitionable conformance to Qwen3NextModelInner
- Add DFlashTargetModel conformance to Qwen3NextModel
  - dflashEmbedTokens, dflashLmHeadLogits, dflashForwardWithCapture
  - dflashGatedDeltaForward with tape recording for GDN rollback
- Add dflashForwardWithTape to Qwen3NextGatedDeltaNet
- Add bridge file Qwen3Next+DFlash.swift
- Short prompt works: 68.8% acceptance, 9.8 GB RAM (vs 45 GB full load)
- Longer runs crash — likely Metal watchdog on 512-expert SSD reads
…b440)

- Bumps mlx-swift-lm submodule to b440 (tag) / 63707c0: fix(Gemma4Text): dispatch QuantizedKVCache correctly in LLM attention (merges PR #29, closes #71)
- Server.swift: expose `kv_bits` as a per-request API field (ChatCompletionRequest.kvBits -> GenerateParameters.kvBits) enabling native MLX QuantizedKVCache without a server restart.
- run_benchmark.sh: add Test 9 — QuantizedKVCache regression suite
  - [1/4] kv_bits=4 short
  - [2/4] kv_bits=8 short
  - [3/4] kv_bits=4 long (KV-sharing path)
  - [4/4] baseline

Test 9 passed on mlx-community/gemma-4-26b-a4b-it-4bit.
README.md:
- Added '🔧 Per-Request API Parameters' section with kv_bits table, kv_bits vs --turbo-kv comparison table, and curl usage example
- Clarified --turbo-kv CLI entry: 'activates after 2048 tokens, server-wide'

Server.swift:
- Added kv_bits input validation (only nil/4/8 accepted; returns 400 otherwise)
- Bypass prompt cache restore when kv_bits is set (prevents unsafe mixing of QuantizedKVCache and KVCacheSimple states across requests)
- Bypass prompt cache save when kv_bits is set (same safety reason)

run_benchmark.sh (Test 9):
- Corrected header comment to match actual assertions (removed false ≥20 token and multi-turn claims; stated actual ≥3 token / non-empty checks)
- Added explicit SERVER_READY flag + post-loop failure with log dump
- Widened thinking-block regex to handle both <|channel|>thought and <|channel>thought
- Replace 🧠 with 📡 heading emoji
- Rewrite as structured tables (Text / Vision / Audio) with all 50+ model families derived from the actual MLXLLM + MLXVLM model file inventory
- LLM table: Gemma, Qwen, Phi, Mistral, Llama, GLM, DeepSeek, Falcon, LFM2, OLMo, Granite, SmolLM3, InternLM2, Cohere, Jamba, Exaone, MiMo, Ernie, Baichuan, Bailing, NemotronH, Starcoder2, OpenELM, BitNet, MiniMax, Apertus/AfMoE, MiniCPM, Qwen3Next
- VLM table: Gemma4, Gemma3, Qwen3-VL, Qwen2-VL/2.5-VL, LFM2-VL, Pixtral, PaliGemma, Idefics3, Mistral3, FastVLM, SmolVLM2, GlmOcr, QwenVL
- ALM table: Gemma-4-e4b only (factually correct — Qwen2-Audio removed; it was never wired into the audio pipeline here)
…reverted from main in 50c3732)
fix: Gemma-4 QuantizedKVCache + kv_bits API + Test 9 (mlx-swift-lm b440)
… fix CORS/parallel test gaps

- Server.swift: add defer-based heartbeat cleanup in both handleChatStreaming and handleTextStreaming so heartbeatTask is always cancelled on any exit path (client disconnect during prefill no longer leaks the heartbeat task)
- ServerSSETests.swift: add missing import Foundation for Data/JSONSerialization
- test-server.sh Test 32: fail on empty curl response instead of false-passing
- test-server.sh Test 33: use conditional curl; fail if request fails entirely
- test-server.sh Test 34: redirect CORS preflight to CORS_PORT (--cors server) instead of the main server which has no CORS middleware
- test-server.sh Test 35: spin up a dedicated --parallel 2 server so concurrent requests actually overlap and stress the global hook under real parallelism
- test-opencode.sh: capture opencode exit code separately; classify parse errors vs acceptable non-zero exits to prevent false passes
…ch in Tests 32-33

The new conditional curl patterns in Tests 32 and 33, combined with the existing set -euo pipefail, caused the script to abort when grep found no match (exit 1) in the EVENT_DATA pipeline. All grep/jq calls that may produce no output now use || true or are wrapped in if/else to prevent premature script exit.
[codex] make OpenAI streaming strict by default
…experts are combined

Fixes #72: on a 16GB Mac Mini M4, adding --draft-model alongside --stream-experts caused RAM to spike to the physical limit and trigger swap, even though the draft model is only a 4B (~3.5GB) model.

Root causes and fixes (a sketch of the resulting load ordering follows at the end of this note):
1. [Bug] draftConfig.lazyLoad was never set — draft weights were eagerly paged into unified RAM.
   Fix: set draftConfig.lazyLoad = true when --stream-experts is active, mirroring what already happens for the main model config.
2. [Bug] Memory.cacheLimit / Memory.memoryLimit were applied after both model loads, so neither the main nor draft model loaded under a cache budget.
   Fix: apply the SSD memory cap immediately after ExpertStreamingConfig.shared.activate() — before any LLMModelFactory.loadContainer() calls — so both models respect the page-cache limit throughout loading.
3. [Bug] physicalBudget did not account for the draft model's resident footprint, leaving the cap 3–4 GB too high.
   Fix: profile the draft model directory before loading and subtract its weightMemoryGB from physicalBudget in all three affected strategy branches (swapAssisted, layerPartitioned, early cap). A 2 GB floor guard prevents the budget going negative on very constrained machines.

Expected result on 16GB M4:
- Draft model weights are mmap'd (lazy) — only accessed pages in RAM
- Both models load under the ~6GB effective page-cache budget (9.6GB - 3.5GB draft)
- No swap; total RAM stays within the SSD streaming budget
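In code terms, the corrected ordering is roughly as follows. This is a sketch, not the literal Server.swift diff: ExpertStreamingConfig.shared.activate(), Memory.cacheLimit, lazyLoad, and LLMModelFactory.loadContainer are the names used in the description above; the variable names and budget helper are illustrative.

```swift
// Sketch of the corrected --stream-experts + --draft-model load ordering.
if streamExperts {
    // Fix 2: activate streaming and apply the SSD page-cache budget BEFORE
    // either model is loaded, so both loads respect it.
    ExpertStreamingConfig.shared.activate()
    Memory.cacheLimit = ssdMemoryBudgetBytes   // Fix 3: budget already subtracts the draft footprint

    // Fix 1: the draft config now mirrors the main config's lazy loading,
    // so draft weights stay mmap'd instead of being eagerly paged in.
    mainConfig.lazyLoad = true
    draftConfig?.lazyLoad = true
}

// Only now load the containers.
let mainContainer = try await LLMModelFactory.shared.loadContainer(configuration: mainConfig)
var draftContainer: ModelContainer? = nil
if let draftConfig {
    draftContainer = try await LLMModelFactory.shared.loadContainer(configuration: draftConfig)
}
```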
…draft model
- Extract computeSSDMemoryBudget() from inline formula so it can be unit tested
without loading a real model or touching Memory.cacheLimit
- Wire all three budget call sites to use the extracted function (no behaviour change)
- Add SSDMemoryBudgetTests.swift with 8 tests covering:
* Baseline 16 GB / no draft (formula correctness)
* Issue #72 regression: 16 GB + 3.5 GB draft → budget reduced by exact footprint
* Floor guard: deeply negative raw result clamped to 2 GB
* Floor value: confirmed at exactly 2 GB
* Default-arg == 0 (no silent reduction without a draft model)
* Monotonicity: larger draft → smaller or equal budget
* Typical fleet: 24 GB and 64 GB with 3.5 GB draft
Two correctness issues flagged in inline review:
1. GiB/GB unit mismatch — weightMemoryGB is computed as bytes/1e9 (decimal GB), but was multiplied back to bytes using 1_073_741_824 (GiB), causing ~7% budget drift. Fix: use draftProfile.weightFileSizeBytes directly (exact bytes, no conversion needed — see the sketch below).
2. Repeated ModelProfiler.profile() filesystem walks — the draft model directory was enumerated once in the early cap block and again in each strategy branch (swapAssisted, layerPartitioned). Fix: compute draftFootprintBytes once before the streamExperts block and reuse it everywhere.

Also addresses a third Copilot comment: the early SSD cap was only applied when modelDirectory != nil, so first-run downloads were unprotected. Now the cap is applied whenever --stream-experts is set, even if the model isn't cached yet (handled via the else-if branch).

All 8 SSDMemoryBudgetTests still pass.
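For illustration, a budget helper consistent with the points above could look like this. The function name, the exact-bytes draft subtraction, and the 2 GB floor come from this PR; the base formula (how much of physical RAM is used before subtracting the draft) is an assumption chosen to match the ~9.6 GB figure quoted earlier for a 16 GB machine.

```swift
/// Sketch of the extracted SSD budget helper. All quantities are exact byte
/// counts — no GB/GiB round-trip — and the result is clamped to a 2 GB floor.
func computeSSDMemoryBudget(physicalRAMBytes: UInt64,
                            draftFootprintBytes: UInt64 = 0) -> UInt64 {
    let floorBytes: UInt64 = 2 << 30                  // 2 GB floor guard

    // Assumed base: ~60% of physical RAM for the page cache (matches the
    // 9.6 GB figure for a 16 GB machine mentioned in the load-time fix).
    let base = UInt64(Double(physicalRAMBytes) * 0.6)
    let afterDraft = base > draftFootprintBytes ? base - draftFootprintBytes : 0
    return max(afterDraft, floorBytes)
}

// Issue #72 regression shape: a 16 GB machine with a 3.5 GB draft gets a
// budget reduced by exactly the draft footprint, and never below 2 GB.
```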
fix(ssd-stream): prevent RAM explosion when --draft-model + --stream-experts combined (#72)
…odel (#72 follow-up)

Reporter confirmed the original fix addressed load-time RAM, but swap still explodes during inference: OS_RAM=20.7GB / MEM_DEMAND=40.2GB on a 16GB machine.

Root cause (inference-time): The 200GB memoryLimit sentinel is necessary for SSD streaming alone — it bypasses MLX eval_impl's spin-wait loop when expert pages are evicted mid-graph. However, with speculative decoding the draft model (4B / 3GB) and main model (35B / 20GB) alternate forward passes in tight succession. Both models' expert pages are demanded within the same inference cycle; combined demand ~23GB >> 16GB physical. The 200GB sentinel provides zero back-pressure, so macOS swaps aggressively (10+ GB observed in Activity Monitor).

Fix: When --stream-experts + --draft-model are both set AND combinedFootprint > 70% of physical RAM, lower memoryLimit from 200GB to physicalRAM × 1.1. This forces MLX to hit its hard limit sooner and evict stale expert pages more aggressively rather than extending into swap. A clear startup warning is also printed:

⚠️ SSD + draft-model RAM pressure warning:
  Main model: 20.4GB  Draft: 3.0GB  Combined: 23.4GB  Physical RAM: 16.0GB
  Speculative decoding alternates both models' forward passes. On this machine the combined weight exceeds physical RAM, causing page-cache thrashing and swap during inference.
  → Recommendation: remove --draft-model on this machine, or use a smaller draft model whose weights fit in remaining RAM after the main model's page budget (6GB).
  Memory limit set to 17GB (tight cap for MLX eviction pressure)

When the combined footprint fits in RAM (e.g. a smaller draft on a 32GB machine), the 200GB sentinel is still used as before — no regression for capable hardware.
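A rough sketch of the cap selection described above (the 70% threshold, the ×1.1 tight cap, and the 200GB sentinel are from this description; everything else — variable names, byte bookkeeping — is illustrative):

```swift
// Sketch: choose the MLX memoryLimit when --stream-experts and --draft-model
// are combined. Not the literal Server.swift code.
// mainWeightBytes / draftWeightBytes / physicalRAMBytes: byte counts gathered at startup.
let combinedFootprintBytes = mainWeightBytes + draftWeightBytes
let sentinelBytes: UInt64 = 200 << 30          // "no back-pressure" sentinel for SSD streaming alone

let limitBytes: UInt64
if Double(combinedFootprintBytes) > 0.7 * Double(physicalRAMBytes) {
    // Combined weights don't fit: tight cap so MLX evicts stale expert
    // pages instead of letting macOS extend into swap.
    limitBytes = UInt64(Double(physicalRAMBytes) * 1.1)
} else {
    // Combined weights fit (e.g. a smaller draft on a 32GB machine): keep
    // the sentinel so SSD streaming alone isn't throttled.
    limitBytes = sentinelBytes
}
Memory.memoryLimit = limitBytes                // property name taken from this PR's description
```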
…el cache

- Replace if-branch masking with metal::select for zero warp-divergence state updates
- Reorganize KernelCache from 8 flat named vars to tapeReplay[vec][msk] and gatedDeltaTape[vec][msk] 2D arrays
- Simplify dispatch call sites to one-liner index lookups
- Minor whitespace cleanup in DFlashIntermediateDumper
… property

- Add MambaSnapshotCache: lightweight O(1) snapshot-based rollback (lazy reference capture, no GPU copy) as an alternative to RecurrentRollbackCache's innovation-tape replay (sketched below)
- Add dflashUseTapeRollback Bool to DFlashTargetModel (default true) so models can opt in to either strategy
- Update makeTargetCache and arm/rollback helpers with clearer comments
- Also switch RecurrentRollbackCache.armRollback to lazy reference capture (removes unnecessary MLX.contiguous copies on arm path)
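The snapshot idea, reduced to a self-contained sketch (the real class extends the now-open MambaCache from mlx-swift-lm; the `states` storage below merely stands in for its internals):

```swift
import MLX

// Sketch of snapshot-based rollback: capture references to the recurrent
// state before the verify pass, restore them if draft tokens are rejected.
final class SnapshotRollbackCacheSketch {
    var states: [MLXArray?] = []             // stand-in for MambaCache's storage

    /// Snapshot of the cache state before the verify pass.
    private var snapshotState: [MLXArray?]?

    func armRollback() {
        // Lazy reference capture: updates replace the stored array references
        // rather than mutating them in place, so no GPU copy (and no
        // MLX.contiguous call) is needed on the arm path.
        snapshotState = states
    }

    func rollback() {
        // O(1): drop everything written since armRollback().
        if let snapshot = snapshotState { states = snapshot }
        snapshotState = nil
    }
}
```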
Add DFlashKernelBench executable for isolated kernel timing. Exclude DFlashKernelsOptimized.swift from the DFlash library target (work-in-progress alternative kernel implementations kept for reference).
…next.sh

- bench_35b.sh: save per-run raw response JSON; extract structured results into bench_results.json (tok/s, RAM, timing per config) for downstream tooling; use slug variable consistently for log file naming
- Add bench_coder_next.sh for benchmarking Qwen3-Coder-Next model variants
Move comparison tests from tests/DFlashComparison/ to tests/DFlash/, adding DFlashBenchmark.swift, DFlashProfiler.swift, updated cosine similarity comparison tools, and a README. Update .gitignore intermediates path.
…-draft-model (#72)

Git history audit (mlx-swift-lm):
- e6ba580 — 8.5x speedup (0.58→4.95 tok/s) from cross-projection batching (Eric Lake, M1 Ultra)
- 2c71c6c — ssd-opt-v2: +4% more via persistent expert buffers (asyncEval warm path)
- 2b1c653 — PAPPS N+1 prefetch permanently disabled (hurt Apple-native TPS)

README (line 245) explicitly states: 'Speculative decoding is counterproductive for SSD-streaming MoE specifically. The verify pass sends N+1 tokens, each routing to *different* experts — SSD I/O scales with the *union* of all positions' expert selections.'

Strategy (not a hard error) when --stream-experts + --draft-model are combined:
- Auto-cap --num-draft-tokens to 1 (verify pass = 2 positions, not N+1; see the sketch below)
- At 1 draft token: fan-out is 2× SSD I/O (vs 5× at default 4 tokens)
- If acceptance rate ≥ 50% (typical for same-family models), net TPS is positive
- Print a clear advisory so users understand the tradeoff
- Persistent expert buffers (~5 GB warm path, ssd-opt-v2) are PRESERVED — no regression to Eric Lake's M1 Ultra benchmark

What is NOT changed:
- SwitchLayers.swift warm path: untouched (idx.size <= 32 guard intact)
- ExpertStreamingConfig: no new flags added (reverted failed hasDraftModel attempt)
- computeSSDMemoryBudget() + cacheLimit logic from load-time fix: intact
- Tight memoryLimit sentinel (physicalRAM × 1.1) when combined > 70% RAM: intact

Test coverage (18 tests, 0 failures):

SSDDraftStrategyTests (10 new):
- Fan-out arithmetic: 4 draft tokens → 5× I/O, 1 token → 2× I/O
- Auto-cap fires only when streamExperts + draftModel + numDraftTokens > 1
- Auto-cap does NOT fire for solo SSD streaming or pure RAM speculative decoding
- Net throughput model: 70% acceptance at 2× fan-out is net positive
- memoryLimit sentinel selection: tight cap on 16 GB, sentinel on 64 GB

SSDMemoryBudgetTests (8 existing): all pass, no regressions
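A minimal sketch of the auto-cap itself (flag and parameter names mirror the description above; the exact option plumbing is an assumption):

```swift
// Sketch: cap speculative fan-out when expert streaming and a draft model are
// combined, so the verify pass touches 2 positions' experts instead of N+1.
if args.streamExperts && args.draftModel != nil && args.numDraftTokens > 1 {
    print("""
    ⚠️ --stream-experts + --draft-model: auto-capping --num-draft-tokens from \
    \(args.numDraftTokens) to 1 (verify fan-out becomes 2× SSD I/O instead of \
    \(args.numDraftTokens + 1)×). Net speedup needs roughly ≥50% acceptance.
    """)
    args.numDraftTokens = 1
}
```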
…sion
Three-check E2E test for the --stream-experts + --draft-model fix:
[1/3] Auto-cap guard: verifies server log contains the 'auto-capping'
warning, proving numDraftTokens was reduced from 4 to 1 at startup
[2/3] RAM guard: measures vm_stat peak RAM during inference and fails
if it exceeds 80% of physical RAM (the indicator that exposed the
original swap explosion on reporter's 16GB M4 Mini)
[3/3] Inference: verifies the combination still produces valid content
(not crashed/empty), proving functional correctness
Uses small models (Qwen3.5-4B main + Qwen3.5-0.8B draft) — same
parameter-class proportions as the reporter's 35B+4B scenario but
runnable on any machine without 35B weights.
Run: ./run_benchmark.sh → option 10
Prompt cache save/restore was incorrectly applied to Qwen3Next, which uses a hybrid KVCache+MambaCache architecture. Unlike KVCacheSimple, MambaCache RNN states cannot be trimmed or replayed at arbitrary token boundaries, so attempting to restore a partial match would corrupt the linear attention state and cause spurious 1-token outputs.

Fix: PromptCache.save() and PromptCache.restore() now skip immediately if any layer in the cache is a MambaCache instance (sketched below).

Also fixes run_benchmark.sh Test 0 (automated matrix) to pass MODEL via environment variable instead of feeding it through stdin, so the model selection prompt is correctly bypassed when MODEL is pre-set.
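The gate itself is small; a sketch (assuming, as in mlx-swift-lm, that the per-layer caches are held behind a common KVCache protocol — the helper name is illustrative):

```swift
// Sketch of the safety gate: prompt-cache reuse is only valid when every
// per-layer cache supports prefix trimming. MambaCache recurrent state does
// not, so hybrid Qwen3Next caches are skipped entirely.
func promptCacheIsReusable(_ layers: [KVCache]) -> Bool {
    !layers.contains { $0 is MambaCache }
}
```

PromptCache.save() and PromptCache.restore() would each call this check and return early when it is false.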
Replacing the stdin pipe approach with an env var so child invocations from Test 0's automated matrix loop skip the interactive menu entirely. The previous echo-pipe was consumed by the 'read suite_opt' prompt but any subsequent reads (model selection) had no input, causing the script to fall through to option 3 by default.
When SUITE_OPT is set (automated matrix mode), skip all menu echoes and the read prompt entirely. Child processes now run silently with only test-relevant output.
Both test-speculative.sh and test-dflash.sh grep for 'Using speculative decoding' in the server log to confirm the speculative path was activated. This string was never emitted — the tests were checking a log line that didn't exist, causing the speculative-decoding and dflash-speculative-decoding CI jobs to always fail on Test 1.

Fix: emit the exact expected log line:
- Standard spec: after draft model is loaded successfully
- DFlash spec: at generation dispatch in Server.swift

Server log now contains all strings the tests grep for:
- ✅ 'Draft model loaded successfully'
- ✅ 'Using speculative decoding'
- ✅ 'speculative decoding' (for test-speculative-eval.sh)
test-dflash.sh grepped for:
1. 'Draft model loaded successfully' — only emitted by standard draft path,
not DFlash path which has its own 'DFlash draft model loaded' message
2. 'Using speculative decoding' — not emitted by DFlash path at all
3. 'speculative decoding' — was present but test was failing on (1)
Add both required lines immediately after DFlash draft model weights load,
mirroring the standard speculative decoding path. The streaming failures
('missing [DONE] sentinel') were downstream of the model-not-found state
caused by the load log mismatch, not an inference bug.
Adds Sources/SwiftLM/{Qwen3,Qwen3MoE,Llama}+DFlash.swift — each
declares the DFlashTargetModel protocol conformance and delegates to
the model's public callCapturing / embedTokens / lmHead
(now on *ModelInner via mlx-swift-lm b453).
Coverage:
Qwen3Model → Qwen3-8B and similar dense Qwen3 variants
Qwen3MoEModel → Qwen3-Coder-30B-A3B and other Qwen3 MoE variants
LlamaModel → Meta-Llama-3.x, Mistral, and Llama-family models
Qwen35MoEModel → already covered via Qwen35Model inheritance
Qwen36MoE → no separate Swift class found; uses Qwen35MoE path
Co-authored-by: clandestine.eth <96172957+0xClandestine@users.noreply.github.com>
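For illustration, one of these bridge files might look roughly like this. The member names (dflashEmbedTokens, dflashLmHeadLogits, dflashForwardWithCapture) are taken from the Qwen3Next bridge described earlier and the delegation targets from this commit; the exact DFlashTargetModel signatures are assumptions.

```swift
// Sketch of a bridge file (e.g. Qwen3+DFlash.swift): the conformance only
// delegates to members the model already exposes on its *ModelInner.
extension Qwen3Model: DFlashTargetModel {
    public func dflashEmbedTokens(_ tokens: MLXArray) -> MLXArray {
        model.embedTokens(tokens)             // embedTokens lives on the inner model as of b453
    }

    public func dflashLmHeadLogits(_ hidden: MLXArray) -> MLXArray {
        lmHead(hidden)
    }

    public func dflashForwardWithCapture(_ tokens: MLXArray, cache: [KVCache])
        -> (logits: MLXArray, hidden: MLXArray)
    {
        callCapturing(tokens, cache: cache)   // existing capture-enabled forward
    }
}
```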
Gemma4 omni (5.2GB) on a 7.5GB runner is tight. After other CI jobs have run and filled the model cache, available RAM can drop below the threshold needed for stable Metal command buffer execution, causing sporadic GPU timeout crashes (kIOGPUCommandBufferCallbackErrorTimeout). Add a vm_stat-based preflight check: if available+inactive RAM < 2.5GB, exit 0 (skip) instead of crashing the whole run.
This reverts commit 9fc993c.
Own DeepSeek V3 (deepseek_v3 / kimi_k25) and Kimi Linear (kimi_linear) model implementations directly in SwiftLM so DFlashTargetModel conformance is available without any upstream submodule changes.
- DeepseekV3DFlash.swift: full DSV3Config + model with callCapturing
- KimiLinearDFlash.swift: hybrid KDA/MLA Kimi 2.6 model with DFlash
- DFlashModelRegistry.swift: registers all three model types via LLMTypeRegistry.shared.registerModelType() at startup
- Server.swift: call registerDFlashModelTypes() before model loading
Use @ModuleInfo(key: "model") on the inner model property so weights at model.* paths are found correctly. Also use @ModuleInfo(key: "norm") for norm layers initialized in init() so their weights are tracked.
… limit

DeepseekV3DFlash.sanitize() (sketched below):
- Strip 'language_model.' wrapper prefix present in kimi_k25 and some other HuggingFace exports so weight keys resolve to model.* paths
- After stacking per-expert weights into switch_mlp, remove the original experts.N.* keys to prevent verify: .noUnusedKeys crash
- Generalize layer filter to use numHiddenLayers instead of hardcoded 61

Server.run():
- Raise RLIMIT_NOFILE to 4096 at startup; large sharded models (kimi_k25 has 182 safetensor shards) exhaust the default macOS limit of 256
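The key-normalization part of sanitize() might look roughly like this (a sketch: the stacking of per-expert weights into switch_mlp is elided, and the dictionary-based weight map is an assumption):

```swift
import MLX

// Sketch of sanitize(): normalize HuggingFace export keys before loading.
func sanitize(weights: [String: MLXArray]) -> [String: MLXArray] {
    var out: [String: MLXArray] = [:]
    for (key, value) in weights {
        // Strip the 'language_model.' wrapper used by kimi_k25-style exports
        // so keys resolve to the expected model.* paths.
        let stripped = key.hasPrefix("language_model.")
            ? String(key.dropFirst("language_model.".count))
            : key
        // Per-expert keys are stacked into switch_mlp elsewhere (elided here);
        // the originals must be dropped or verification with .noUnusedKeys fails.
        if stripped.contains(".experts.") { continue }
        out[stripped] = value
    }
    return out
}
```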
…prevent GPU timeouts
- Move MLX_MAX_OPS_PER_BUFFER=50 to top of run() before Metal init
- Enable --stream-experts automatically on <12GB machines in test-dflash.sh so weights are paged via mmap/pread instead of macOS VM swap
- Auto-cap draft tokens to 1 under SSD streaming (minimal fan-out)
- Always compute draftFootprintBytes regardless of --stream-experts flag
* feat: bump mlx-swift-lm submodule for DeepSeek-V4 support
  Points mlx-swift-lm to feat/deepseek-v4 branch (SharpAI/mlx-swift-lm#33) which adds DeepseekV4.swift and registers the deepseek_v4 model type.
* feat: DeepSeek-V4-Flash benchmark results + profiler improvements
  - README: add DeepSeek-V4-Flash (126GB Q3) benchmark table for M5 Pro 64GB — SSD+TurboQuant delivers 4.16 tok/s at 40K context (13x vs plain SSD Stream)
  - profile_runner.py: track peak GPU InUse via background polling thread (0.5s) instead of single post-generation snapshot; rename gpu_in_use → gpu_in_use_peak throughout; add separate GPU_InUse peak visualization section
  - run_benchmark.sh: add Thump604/DeepSeek-V4-Flash-MLX-Q3-mixed-gs128-affine to Test 1 model list (option 11)
  - mlx-swift-lm: bump submodule to 8a8da29 (attn_sink dtype fix)
* chore: bump mlx-swift-lm submodule to b463 (DeepSeek-V4 merged to main)
feat: add DFlash speculative decoding
Merges ericjlake's prompt-cache fixes from PR #85, resolving conflicts with the DFlash integration (PR #78).

Changes from ericjlake:
- MambaCache safety gate + KVCacheSimple T-dim slice in save()
- ndim >= 3 guard in minCachedSeqLen scan
- Spec-decode short-circuit ordering (check before cache restore)
- README: Qwen3-A3B full-RAM perf table (M1 Ultra 64 GB)

Conflict resolution:
- README.md: kept both Qwen3-A3B and DeepSeek-V4 perf tables
- Server.swift save(): kept existing MambaCache early return + new T-dim slice
- Server.swift decision branch: combined spec-decode-first + skipPromptCache (kvBits)

Closes #84.

Co-authored-by: Eric Lake <ericjlake@users.noreply.github.com>
Pull request overview
This PR resolves post-fork conflicts by merging SharpAI/SwiftLM:main into the target branch while preserving DFlash integration changes, and expands test/CI coverage around SSE streaming strictness, OpenAI SDK compatibility, and DFlash/SSD+draft regressions.
Changes:
- Adds a new DFlash SwiftPM library/module (runtime, draft model, rollback caches, model bridges) and related benchmarking/profiling utilities.
- Extends integration + unit tests for SSE strict streaming / opt-in heartbeat, OpenAI SDK parsing compatibility, DFlash E2E, and SSD+draft memory guard (Issue SharpAI#72).
- Updates profiling scripts/docs and CI workflows to run the new test suites and regression jobs.
Reviewed changes
Copilot reviewed 44 out of 46 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test-server.sh | Adds SSE strict-streaming and opt-in heartbeat/CORS/concurrency integration tests. |
| tests/test-opencode.sh | New OpenAI SDK + OpenCode CLI compatibility integration test. |
| tests/test-dflash.sh | New DFlash speculative decoding E2E test script (dual-model, streaming, stability). |
| tests/SwiftLMTests/ServerSSETests.swift | New XCTest coverage for SSE prefill chunks, header parsing, and PrefillState invariants. |
| tests/SwiftLMTests/SSDPersistentBufferGuardTests.swift | New regression tests for SSD streaming + draft auto-cap/memory-limit strategy (Issue SharpAI#72). |
| tests/DFlash/dump_python_intermediates.py | Python reference dumper for DFlash intermediate tensors (.npy + meta). |
| tests/DFlash/compare_swift_python.py | Compares Python vs Swift dumps via cosine similarity to localize divergence. |
| tests/DFlash/compare_cosine.py | Python self-consistency + “Swift-equivalent” path comparison helper. |
| tests/DFlash/README.md | Documentation for DFlash benchmarking/profiling/comparison tooling. |
| tests/DFlash/DFlashProfiler.swift | Swift micro-profiler for kernel performance and numerical consistency. |
| tests/DFlash/DFlashCosSimComparison.swift | Swift-side comparison tool scaffolding for Python ↔ Swift intermediates. |
| scripts/profiling/profile_runner.py | Adds background polling to capture peak GPU “in use” memory during requests. |
| scripts/profiling/bench_coder_next.sh | Adds benchmark runner for Qwen3-Coder-Next across SSD/DFlash configs. |
| scripts/profiling/bench_35b.sh | Adds benchmark runner + JSON extraction for 35B DFlash/SSD configs. |
| run_benchmark.sh | Adds headless invocation support, new test menu entries, and Issue SharpAI#71/SharpAI#72 regression suites. |
| docs/profiling/profiling_results_simbas-MacBook-Pro.md | Updates recorded profiling results/table schema for new GPU peak metric. |
| Sources/SwiftLM/Qwen3Next+DFlash.swift | Adds DFlashTargetModel bridge for Qwen3Next models. |
| Sources/SwiftLM/Qwen3MoE+DFlash.swift | Adds DFlashTargetModel bridge for Qwen3 MoE models. |
| Sources/SwiftLM/Qwen35+DFlash.swift | Adds DFlashTargetModel conformance bridges for Qwen3.5 models. |
| Sources/SwiftLM/Qwen3+DFlash.swift | Adds DFlashTargetModel bridge for Qwen3 dense models. |
| Sources/SwiftLM/Llama+DFlash.swift | Adds DFlashTargetModel bridge for Llama/Mistral-style models. |
| Sources/SwiftLM/DeepseekV3DFlash.swift | Adds SwiftLM-owned DeepSeek V3 model implementation with DFlash support. |
| Sources/SwiftLM/DFlashModelRegistry.swift | Registers SwiftLM-owned DFlash-capable model types in the global registry. |
| Sources/SwiftLM/ModelProfiler.swift | Accounts for draft-model weight bytes and adjusts swap-assisted memoryLimit sentinel. |
| Sources/DFlash/RecurrentRollbackCache.swift | Adds rollback-capable MambaCache subclasses (tape replay + snapshot rollback). |
| Sources/DFlash/DFlashRuntime.swift | Core DFlash runtime: prefill, draft/verify loop, accept/reject, rollback, event streaming. |
| Sources/DFlash/DFlashKernelProvider.swift | Adds global provider registry for specialized DFlash kernels. |
| Sources/DFlash/DFlashEngine.swift | Adds engine abstraction for verify/rollback strategies (full-attn vs hybrid-GDN). |
| Sources/DFlash/DFlashDraftRegistry.swift | Maps known target model names to draft model IDs (auto-resolution). |
| Sources/DFlash/DFlashDraftModel.swift | Implements the DFlash block-diffusion draft model + context-only KV cache. |
| Sources/DFlash/DFlashDraftBackend.swift | Implements greedy draft-token generation backend using target embed/lm_head. |
| Sources/DFlash/DFlashIntermediateDumper.swift | Adds .npy dump utility for Swift intermediates to compare with Python reference. |
| README.md | Updates supported models/methodologies and adds DeepSeek V4 Flash profiling results + notes. |
| Package.swift | Adds DFlash library, DFlashKernelBench executable, and SwiftLMTests test target. |
| Package.resolved | Updates dependency lock revisions/versions. |
| .gitignore | Ignores generated DFlash intermediates directory. |
| .github/workflows/ci.yml | Runs new SwiftLMTests, adds opencode modality, and introduces DFlash + Issue SharpAI#72 guard jobs. |
| .agents/workflows/review-github-pr.md | Adds/updates internal workflow guidance for reviewing SharpAI/SwiftLM PRs. |
| print("[DFlash] Cycle \(cyclesCompleted + 1): blockLen=\(blockLen), verifyLen=\(verifyTokenIDs.dim(0)), acceptanceLen=\(acceptanceLen), commitCount=\(1 + acceptanceLen)") | ||
| fflush(stdout) | ||
|
|
```swift
/// Snapshot of the cache state before the verify pass.
private var snapshotState: [MLXArray?]?

public init(convKernelSize: Int = 4) {
```
| if [ -z "$STRICT_STREAM" ] || ! echo "$STRICT_STREAM" | grep -q 'data: \[DONE\]'; then | ||
| # Only fail if it was a curl failure (empty), not a missing event | ||
| [ -z "$STRICT_STREAM" ] && fail "Strict mode: stream was empty" | ||
| elif echo "$STRICT_STREAM" | grep -q "^event:"; then | ||
| fail "Strict mode: unexpected named SSE event without opt-in header" | ||
| else | ||
| pass "Strict mode: no named SSE events in default streaming" | ||
| fi |
```swift
if targetHidden == nil {
    targetHidden = MLXArray.zeros(
        [feat.dim(0), promptLen, feat.dim(-1)],
        dtype: feat.dtype
    )
}
targetHidden![0..., chunkStart ..< chunkEnd, 0...] = feat
eval(targetHidden!)
```
```swift
/// Registry to allow models to use DFlash kernels without module circular dependencies.
public struct DFlashKernelRegistry: Sendable {
    public nonisolated(unsafe) static var provider: DFlashKernelProvider? = nil
```
````markdown
### 3. DFlashCosSimComparison.swift
Compares intermediate values between Python and Swift implementations.

**Usage:**
```bash
swift run DFlashCompare --dir tests/DFlashComparison/intermediates
```

## Python Comparison

The benchmark format is compatible with `dflash-mlx/benchmark/` results:
- Same JSON structure
- Same metrics (TPS, TTFT, acceptance ratio)
- Same hardware info collection

You can compare Swift vs Python results by loading both JSON files and comparing the `summary` sections.

## Results Directory

Create a `results/` directory here or specify custom output paths:
```bash
mkdir -p tests/DFlashComparison/results
swift run DFlashBenchmark --output tests/DFlashComparison/results/benchmark.json
```
````
| log "Installing opencode-ai in isolated directory..." | ||
| mkdir -p /tmp/opencode_cli_test | ||
| cd /tmp/opencode_cli_test | ||
| npm install opencode-ai@latest --silent >/dev/null 2>&1 | ||
|
|
```bash
# test-speculative.sh — Speculative decoding E2E verification
#
# Uses a small draft model (Qwen3.5-0.8B) to accelerate a larger main model
# (Qwen3.5-4B) via speculative decoding. Verifies:
#   1. Dual-model loading (draft + main)
#   2. Speculative decoding path activation
#   3. Correct token generation
#   4. Server stability under dual-model memory pressure
#
# Usage:
#   ./tests/test-speculative.sh [binary_path] [port]
#
```
```swift
) -> AsyncStream<DFlashEvent> {
    // Streaming: yield events from inside the generation loop
    // via a Continuation, avoiding the buffered-array bottleneck.
    AsyncStream(bufferingPolicy: .unbounded) { continuation in
        let task = Task {
```
|
Heads up — the diff looks massive (~10k lines), but that's just because it's merging our current `main` into your branch. The only thing that needs your attention is the conflict resolution in 2 files:
Once you merge this, your PR SharpAI#85 on SharpAI/SwiftLM will be conflict-free. 👍
Hey Eric — we tried to push this directly to your `perf/combined` branch but the "Allow edits from maintainers" permission blocked it. So here's a PR instead!

This merges `SharpAI/SwiftLM:main` into your branch to resolve the three conflicts from our DFlash integration (PR SharpAI#78) that landed after you forked.

Conflict resolution:
- `README.md`
- `Server.swift` `save()`
- `Server.swift` decision branch — `skipPromptCache` guard (kvBits) / `skipPromptCache` gate

Once you merge this, your PR SharpAI#85 on SharpAI/SwiftLM will be conflict-free and we can land it. 🚀