
Commit 9533e45

feat: DeepSeek-V4 support via mlx-swift-lm b463
* feat: bump mlx-swift-lm submodule for DeepSeek-V4 support. Points mlx-swift-lm to the feat/deepseek-v4 branch (SharpAI/mlx-swift-lm#33), which adds DeepseekV4.swift and registers the deepseek_v4 model type.
* feat: DeepSeek-V4-Flash benchmark results + profiler improvements
  - README: add a DeepSeek-V4-Flash (126 GB Q3) benchmark table for M5 Pro 64 GB; SSD + TurboQuant delivers 4.16 tok/s at 40K context (13× vs plain SSD Stream).
  - profile_runner.py: track peak GPU InUse via a background polling thread (0.5 s) instead of a single post-generation snapshot; rename gpu_in_use → gpu_in_use_peak throughout; add a separate GPU_InUse peak visualization section.
  - run_benchmark.sh: add Thump604/DeepSeek-V4-Flash-MLX-Q3-mixed-gs128-affine to the Test 1 model list (option 11).
  - mlx-swift-lm: bump submodule to 8a8da29 (attn_sink dtype fix).
* chore: bump mlx-swift-lm submodule to b463 (DeepSeek-V4 merged to main)
1 parent b33801a · commit 9533e45

5 files changed

Lines changed: 116 additions & 36 deletions


README.md

Lines changed: 19 additions & 0 deletions
```diff
@@ -73,6 +73,25 @@ Benchmark results for `gemma-4-26b-a4b-it-4bit` (26B MoE, 4-bit) on M5 Pro 64 GB
 
 > Run `./run_benchmark.sh` to generate these metrics on your own device. (See **Benchmarks & Testing** below).
 
+### DeepSeek-V4-Flash (126 GB, Q3-mixed-gs128-affine) — M5 Pro 64 GB
+
+Model: [`Thump604/DeepSeek-V4-Flash-MLX-Q3-mixed-gs128-affine`](https://huggingface.co/Thump604/DeepSeek-V4-Flash-MLX-Q3-mixed-gs128-affine)
+
+> Dense/Vanilla and TurboQuant (non-SSD) configurations are skipped automatically — the 126 GB model exceeds physical RAM.
+
+| Configuration | 512 ctx | 40K ctx |
+|---|---|---|
+| SSD Stream | 4.65 tok/s · 28.4 GB | 0.32 tok/s · 60.5 GB |
+| **SSD + TurboQuant** | **4.78 tok/s · 29.5 GB** | **4.16 tok/s · 40.6 GB** |
+| SSD + 16-Worker Prefetch | 4.43 tok/s · 29.3 GB | 0.32 tok/s · 60.9 GB |
+
+> Values shown as `generation speed · GPU memory allocated (virtual, incl. SSD-backed pages)`
+
+**Key takeaways:**
+- 🏆 **SSD + TurboQuant dominates at long context** — 4.16 tok/s at 40K vs 0.32 tok/s for plain SSD Stream (**13× faster**), with 33% lower GPU allocation (40.6 GB vs 60.5 GB).
+- At 512-token context all configurations perform similarly (~4.4–4.8 tok/s); TurboQuant's advantage is KV-cache compression at long context.
+- Peak physical RAM (GPU InUse) stays ≤ 17 GB across all configurations — the rest streams from NVMe SSD.
+
 ---
 
 ## 🚀 Features
```
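A quick check of the derived figures in the key takeaways above, using only the numbers from the new README table:

```python
# Derived figures from the 40K-context column of the README table above.
ssd_stream_tps, turbo_tps = 0.32, 4.16
ssd_stream_alloc, turbo_alloc = 60.5, 40.6  # GB, GPU_Alloc (virtual)

print(f"speedup:      {turbo_tps / ssd_stream_tps:.1f}x")         # 13.0x
print(f"alloc saving: {1 - turbo_alloc / ssd_stream_alloc:.0%}")  # 33%
```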
Lines changed: 13 additions & 6 deletions
```diff
@@ -1,9 +1,16 @@
-### `mlx-community/gemma-4-26b-a4b-it-4bit` — Context & Memory Profile
+### `Thump604/DeepSeek-V4-Flash-MLX-Q3-mixed-gs128-affine` — Context & Memory Profile
 
-Context depths tested: 512
+Context depths tested: 512,40000
 
-| Configuration | Context Size | TTFT | Generation Speed | Model Size | Active RAM (Physical) | GPU Memory Allocated |
-|---|---|---|---|---|---|---|
+| Configuration | Context Size | TTFT | Generation Speed | Model Size | Active RAM (OS) | GPU_Alloc (virtual) | GPU_InUse peak (physical) |
+|---|---|---|---|---|---|---|---|
+| SSD Stream | 512 | 6.80s | 4.65 tok/s | N/A | 17.0 GB | 28.4 GB | 16.7 GB |
+| SSD Stream | 40000 | 565.02s | 0.32 tok/s | N/A | 48.3 GB | 60.5 GB | 12.5 GB |
+| SSD + TurboQuant | 512 | 6.35s | 4.78 tok/s | N/A | 16.9 GB | 29.5 GB | 16.8 GB |
+| SSD + TurboQuant | 40000 | 363.76s | 4.16 tok/s | N/A | 28.3 GB | 40.6 GB | 16.8 GB |
+| SSD + 16-Worker Prefetch | 512 | 5.84s | 4.43 tok/s | N/A | 16.9 GB | 29.3 GB | 16.6 GB |
+| SSD + 16-Worker Prefetch | 40000 | 565.50s | 0.32 tok/s | N/A | 48.3 GB | 60.9 GB | 13.6 GB |
 
-> **Active RAM (Physical)**: Real memory wired into RAM by macOS (capped by device RAM).
-> **GPU Memory Allocated**: Total memory requested by the GPU — includes data swapped to SSD. This shows the TRUE memory demand and reveals TurboQuant compression benefits even when Active RAM is saturated.
+> **Active RAM (OS)**: Memory wired into physical RAM by macOS (from server log).
+> **GPU_Alloc (virtual)**: Total GPU address-space allocation including SSD-backed pages — the TRUE memory demand; can exceed physical RAM.
+> **GPU_InUse peak (physical)**: Peak physical RAM occupied by the GPU during the entire request (prefill + generation), sampled every 0.5 s. This is the real active footprint — for SSD-streaming configs it reflects the high-water mark while layers are being read, not a post-generation snapshot.
```
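The GPU_Alloc and GPU_InUse columns are produced by `get_gpu_alloc_gb()` in `scripts/profiling/profile_runner.py`, whose body is not part of this diff. For orientation, here is a minimal sketch of how such counters can be read on Apple Silicon; the `ioreg` call and the `PerformanceStatistics` key names are assumptions about a typical implementation, not the repo's actual code:

```python
import re
import subprocess

def get_gpu_alloc_gb():
    """Hypothetical sketch: read GPU memory counters from the Apple GPU driver.

    Parses `ioreg` IOAccelerator PerformanceStatistics, where (on typical
    Apple Silicon Macs) 'Alloc system memory' is the total GPU allocation
    (virtual, incl. SSD-backed pages) and 'In use system memory' is the
    physical RAM currently held by the GPU. Returns (alloc_gb, in_use_gb).
    """
    out = subprocess.run(
        ["ioreg", "-r", "-d", "1", "-c", "IOAccelerator"],
        capture_output=True, text=True,
    ).stdout

    def grab(key):
        # Counter values are reported in bytes; convert to GB.
        m = re.search(rf'"{key}"\s*=\s*(\d+)', out)
        return int(m.group(1)) / 1024**3 if m else 0.0

    return grab("Alloc system memory"), grab("In use system memory")
```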

run_benchmark.sh

Lines changed: 1 addition & 0 deletions
```diff
@@ -235,6 +235,7 @@ else
     "mlx-community/phi-4-mlx-4bit"
     "baa-ai/GLM-5.1-RAM-270GB-MLX"
     "baa-ai/GLM-5.1-4bit"
+    "Thump604/DeepSeek-V4-Flash-MLX-Q3-mixed-gs128-affine"
     "Custom (Enter your own Hub ID)"
     "Quit"
 )
```

scripts/profiling/profile_runner.py

Lines changed: 82 additions & 29 deletions
```diff
@@ -1,5 +1,6 @@
 import argparse
 import subprocess
+import threading
 import time
 import urllib.request
 import urllib.error
@@ -176,20 +177,40 @@ def get_gpu_alloc_gb():
         return 0, 0
 
 def make_request_stream(prompt_len, max_tokens, port=5422):
+    """Run a streaming inference request and return (ok, ttft, tps, peak_gpu_in_use_gb).
+    GPU 'In use system memory' is polled every 0.5s in a background thread so we
+    capture the PEAK physical RAM usage during the full prefill+generation window,
+    not a post-generation snapshot after macOS has evicted layer weights back to SSD.
+    """
     prompt = "apple " * int(prompt_len * 0.75)
     data = json.dumps({
         "messages": [{"role": "user", "content": prompt}],
         "max_tokens": max_tokens,
         "temperature": 0.0,
         "stream": True
     }).encode('utf-8')
-
+
     req = urllib.request.Request(
         f"http://127.0.0.1:{port}/v1/chat/completions",
         data=data,
         headers={'Content-Type': 'application/json'}
     )
-
+
+    # ── Background GPU-memory poller ──────────────────────────────────────────
+    peak_in_use = [0.0]
+    poller_stop = threading.Event()
+
+    def _poll_gpu():
+        while not poller_stop.is_set():
+            _, in_use = get_gpu_alloc_gb()
+            if in_use > peak_in_use[0]:
+                peak_in_use[0] = in_use
+            poller_stop.wait(timeout=0.5)
+
+    poller = threading.Thread(target=_poll_gpu, daemon=True)
+    poller.start()
+    # ─────────────────────────────────────────────────────────────────────────
+
     ttft = None
     start = time.time()
     tokens = 0
```
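The poller added above is an instance of a reusable pattern: a daemon thread samples a reading at a fixed interval and keeps the high-water mark, with `Event.wait(timeout=...)` serving as both the sleep and the shutdown signal (it returns early once the event is set). A self-contained sketch with illustrative names, not code from the repo:

```python
import threading
import time

def sample_peak(read_value, interval=0.5):
    """Track the high-water mark of read_value() on a background daemon thread."""
    peak = [0.0]                 # single-element list: mutable from the closure
    stop = threading.Event()

    def _poll():
        while not stop.is_set():
            peak[0] = max(peak[0], read_value())
            stop.wait(timeout=interval)  # sleeps, but wakes immediately on stop.set()

    threading.Thread(target=_poll, daemon=True).start()
    return peak, stop

# Toy usage with a synthetic reading that oscillates between 10 and 15:
peak, stop = sample_peak(lambda: 10 + 5 * abs((time.time() % 2) - 1))
time.sleep(2)
stop.set()
print(f"peak ≈ {peak[0]:.1f}")   # high-water mark of the oscillating reading
```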
```diff
@@ -205,13 +226,17 @@ def make_request_stream(prompt_len, max_tokens, port=5422):
                 if ttft is None:
                     ttft = time.time() - start
                 tokens += 1
-        total_time = time.time() - start
-        gen_time = total_time - ttft if ttft else 0
-        tps = (tokens - 1) / gen_time if gen_time > 0 and tokens > 1 else 0
-        return True, ttft, tps
+        total_time = time.time() - start
+        gen_time = total_time - ttft if ttft else 0
+        tps = (tokens - 1) / gen_time if gen_time > 0 and tokens > 1 else 0
+        poller_stop.set()
+        poller.join(timeout=2)
+        return True, ttft, tps, peak_in_use[0]
     except Exception as e:
         print(f"Request failed: {e}")
-        return False, 0, 0
+        poller_stop.set()
+        poller.join(timeout=2)
+        return False, 0, 0, 0.0
 
 def extract_base_memory(log_path):
     try:
```
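One note on the throughput formula above: the first streamed token is what defines TTFT, so it is excluded from the generation count and the rate is measured over the post-TTFT window only. A worked example with hypothetical timings:

```python
# Hypothetical stream: first token after 6.35 s, 60 tokens total, done at 20.5 s.
ttft, total_time, tokens = 6.35, 20.5, 60

gen_time = total_time - ttft       # 14.15 s of pure generation
tps = (tokens - 1) / gen_time      # first token belongs to TTFT, not generation
print(f"{tps:.2f} tok/s")          # ≈ 4.17
```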
```diff
@@ -323,16 +348,20 @@ def main():
 
     for ctx_size in context_sizes:
         print(f"\n>> Running {ctx_size}-token context test (max generation 60)...")
-        ok, ttft, tps = make_request_stream(prompt_len=ctx_size, max_tokens=60)
-
+        ok, ttft, tps, peak_in_use = make_request_stream(prompt_len=ctx_size, max_tokens=60)
+
         # Wait for server to flush post-generation logs
         time.sleep(1)
-
+
         os_ram = extract_os_ram(log_path)
-
-        # Query Apple GPU driver for the TOTAL allocated memory (physical + swapped)
-        gpu_alloc, gpu_in_use = get_gpu_alloc_gb()
-
+
+        # Query Apple GPU driver for the TOTAL allocated (physical + SSD-swapped) memory.
+        # This is a post-generation snapshot — accurate for GPU_Alloc (virtual) but NOT
+        # for GPU_InUse (physical): by the time generation finishes, SSD-streaming configs
+        # have already evicted layer weights back to SSD. We use the peak value captured
+        # during the request by the background poller instead.
+        gpu_alloc, _ = get_gpu_alloc_gb()
+
         if ok:
             results.append({
                 "config": config["name"],
@@ -342,9 +371,9 @@ def main():
                 "static_mem": static_mem,
                 "os_ram": os_ram,
                 "gpu_alloc": f"{gpu_alloc:.1f}",
-                "gpu_in_use": f"{gpu_in_use:.1f}",
+                "gpu_in_use_peak": f"{peak_in_use:.1f}",
             })
-            print(f" TTFT={ttft:.2f}s TPS={tps:.2f} OS_RAM={os_ram}GB GPU_Alloc={gpu_alloc:.1f}GB GPU_InUse={gpu_in_use:.1f}GB")
+            print(f" TTFT={ttft:.2f}s TPS={tps:.2f} OS_RAM={os_ram}GB GPU_Alloc={gpu_alloc:.1f}GB GPU_InUse(peak)={peak_in_use:.1f}GB")
         else:
             print(f" FAILED / OOM")
 
@@ -357,13 +386,14 @@ def main():
     with open(args.out, "w") as f:
         f.write(f"### `{args.model}` — Context & Memory Profile\n\n")
         f.write(f"Context depths tested: {args.contexts}\n\n")
-        f.write("| Configuration | Context Size | TTFT | Generation Speed | Model Size | Active RAM (Physical) | GPU Memory Allocated |\n")
-        f.write("|---|---|---|---|---|---|---|\n")
+        f.write("| Configuration | Context Size | TTFT | Generation Speed | Model Size | Active RAM (OS) | GPU_Alloc (virtual) | GPU_InUse peak (physical) |\n")
+        f.write("|---|---|---|---|---|---|---|---|\n")
         for r in results:
-            f.write(f"| {r['config']} | {r['context']} | {r['ttft']}s | {r['tps']} tok/s | {r['static_mem']} | {r['os_ram']} GB | {r['gpu_alloc']} GB |\n")
-
-        f.write(f"\n> **Active RAM (Physical)**: Real memory wired into RAM by macOS (capped by device RAM).\n")
-        f.write(f"> **GPU Memory Allocated**: Total memory requested by the GPU — includes data swapped to SSD. This shows the TRUE memory demand and reveals TurboQuant compression benefits even when Active RAM is saturated.\n")
+            f.write(f"| {r['config']} | {r['context']} | {r['ttft']}s | {r['tps']} tok/s | {r['static_mem']} | {r['os_ram']} GB | {r['gpu_alloc']} GB | {r['gpu_in_use_peak']} GB |\n")
+
+        f.write(f"\n> **Active RAM (OS)**: Memory wired into physical RAM by macOS (from server log).\n")
+        f.write(f"> **GPU_Alloc (virtual)**: Total GPU address-space allocation including SSD-backed pages — the TRUE memory demand; can exceed physical RAM.\n")
+        f.write(f"> **GPU_InUse peak (physical)**: Peak physical RAM occupied by the GPU during the entire request (prefill + generation), sampled every 0.5 s. This is the real active footprint — for SSD-streaming configs it reflects the high-water mark while layers are being read, not a post-generation snapshot.\n")
 
     print(f"\nDone. Matrix saved to {args.out}")
 
@@ -464,10 +494,10 @@ def print_visualization(results, model_name, baseline_alloc):
             crown = f" {C.YELLOW}👑{C.RESET}" if ttft_val == best_in_ctx and len(ctx_results) > 1 else ""
             print(f"{label} {b} {val_str}{crown}")
 
-    # ── 3) GPU Memory Demand ──
-    print(f"\n{C.BOLD} 💾 GPU Memory Allocated (GB) — lower is better{C.RESET}")
+    # ── 3) GPU Memory Allocated (virtual, includes SSD) ──
+    print(f"\n{C.BOLD} 💾 GPU_Alloc (GB, virtual incl. SSD) — lower is better{C.RESET}")
     print(f"{C.DIM} {'─' * (W - 4)}{C.RESET}")
-
+
     all_gpu = [float(r["gpu_alloc"]) for r in results if r["gpu_alloc"] != "N/A"]
     max_gpu = max(all_gpu) if all_gpu else 1
 
@@ -485,7 +515,29 @@ def print_visualization(results, model_name, baseline_alloc):
             crown = f" {C.YELLOW}👑{C.RESET}" if gpu_val == best_in_ctx and len(ctx_results) > 1 else ""
             print(f"{label} {b} {val_str}{crown}")
 
-    # ── 4) Summary scoreboard ──
+    # ── 4) GPU InUse peak (physical RAM high-water mark) ──
+    print(f"\n{C.BOLD} 💡 GPU_InUse peak (GB, physical RAM) — lower is better{C.RESET}")
+    print(f"{C.DIM} Polled every 0.5s during prefill+generation; reflects real RAM pressure{C.RESET}")
+    print(f"{C.DIM} {'─' * (W - 4)}{C.RESET}")
+
+    all_peak = [float(r["gpu_in_use_peak"]) for r in results if r.get("gpu_in_use_peak", "N/A") != "N/A"]
+    max_peak = max(all_peak) if all_peak else 1
+
+    for ctx in ctx_sizes:
+        ctx_results = [r for r in results if r["context"] == ctx]
+        ctx_label = f"{ctx:,} tokens"
+        print(f"\n {C.BOLD}{C.WHITE}{ctx_label}{C.RESET}")
+        for r in ctx_results:
+            peak_val = float(r.get("gpu_in_use_peak", 0))
+            color = CONFIG_COLORS.get(r["config"], "")
+            label = f" {r['config']:<20}"
+            b = bar(peak_val, max_peak, width=28, color=color)
+            val_str = f"{C.BOLD}{peak_val:>6.1f}{C.RESET} GB"
+            best_in_ctx = min(float(x.get("gpu_in_use_peak", 0)) for x in ctx_results)
+            crown = f" {C.YELLOW}👑{C.RESET}" if peak_val == best_in_ctx and len(ctx_results) > 1 else ""
+            print(f"{label} {b} {val_str}{crown}")
+
+    # ── 5) Summary scoreboard ──
     print(f"\n{C.CYAN}{'─' * W}{C.RESET}")
     print(f"{C.BOLD} 🏆 Configuration Ranking (by avg TPS across all contexts){C.RESET}")
     print(f"{C.DIM} {'─' * (W - 4)}{C.RESET}")
@@ -497,12 +549,13 @@ def print_visualization(results, model_name, baseline_alloc):
 
     ranked = sorted(config_avg.items(), key=lambda x: x[1], reverse=True)
     medals = ["🥇", "🥈", "🥉", " "]
-
+
     for i, (cfg_name, avg_tps) in enumerate(ranked):
         medal = medals[min(i, 3)]
        color = CONFIG_COLORS.get(cfg_name, "")
-        avg_gpu = sum(float(r["gpu_alloc"]) for r in results if r["config"] == cfg_name) / max(1, len([r for r in results if r["config"] == cfg_name]))
-        print(f" {medal} {color}{C.BOLD}{cfg_name:<22}{C.RESET} avg {avg_tps:>5.1f} tok/s | avg {avg_gpu:>5.1f} GB GPU")
+        avg_gpu_alloc = sum(float(r["gpu_alloc"]) for r in results if r["config"] == cfg_name) / max(1, len([r for r in results if r["config"] == cfg_name]))
+        avg_peak = sum(float(r.get("gpu_in_use_peak", 0)) for r in results if r["config"] == cfg_name) / max(1, len([r for r in results if r["config"] == cfg_name]))
+        print(f" {medal} {color}{C.BOLD}{cfg_name:<22}{C.RESET} avg {avg_tps:>5.1f} tok/s | alloc {avg_gpu_alloc:>5.1f} GB | peak {avg_peak:>5.1f} GB RAM")
 
     print(f"\n{C.CYAN}{'═' * W}{C.RESET}")
     print()
```
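The new visualization section reuses the script's existing `bar()` helper and `CONFIG_COLORS` map, neither of which appears in this diff. A minimal sketch of a `bar(value, max_value, width, color)` compatible with the call sites above; the interface is inferred from those call sites, so treat it as an assumption rather than the repo's implementation:

```python
RESET = "\033[0m"

def bar(value, max_value, width=28, color=""):
    """Render a fixed-width horizontal bar scaled to value/max_value."""
    filled = int(round(width * value / max_value)) if max_value else 0
    filled = max(0, min(width, filled))  # clamp to the bar width
    return f"{color}{'█' * filled}{'░' * (width - filled)}{RESET}"

# Example: 16.8 GB peak against a 60.9 GB maximum
print(bar(16.8, 60.9))  # roughly the first quarter of the bar filled
```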
