@@ -51,8 +51,48 @@ Then start the server (models download automatically if not cached):
5151
5252*(Add `--stream-experts` when running oversized MoE models to bypass macOS virtual memory swapping and stream expert layers directly from NVMe SSD.)*
5353
54+ ## 📊 Performance: MTP Speculative Decoding — Gemma 4-26B (MacBook Pro M5 Pro 64 GB)
55+
56+ Benchmarked with `gemma-4-26b-a4b-it-4bit` running three configurations across 512 / 40K / 100K token contexts.
57+
58+ ### Generation Speed (tok/s) — higher is better
59+
60+ | Configuration | 512 tokens | 40K tokens | 100K tokens | Avg TPS* |
61+ | ---| ---| ---| ---| ---|
62+ | Baseline | 70.8 | 34.3 | 25.8 | 36.6 |
63+ | **MTP Speculative** | 71.5 (1.01×) | 38.4 (1.12×) | 29.1 (1.13×) | **40.3** |
64+ | **MTP + TurboQuant** ⭐ | **72.1 (1.02×)** | **65.2 (1.90×)** | **62.1 (2.41×)** | **66.2** |
65+
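66+ *\* Time-weighted average: `total_tokens / sum(60/TPS)`, i.e. total tokens divided by total generation time (the harmonic mean when every run generates the same number of tokens). This reflects wall-clock throughput better than the arithmetic mean, which overweights the fast short-context runs.*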
67+
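As a sanity check, the Avg TPS column can be recomputed from the per-context figures. The sketch below assumes every run generates the same number of tokens and reads the `60` in the footnote formula as that per-run count; both are assumptions, and the function name `time_weighted_tps` is illustrative rather than part of the profiling scripts.

```python
def time_weighted_tps(tps_values, tokens_per_run=60):
    """Total tokens divided by total generation time.

    tokens_per_run=60 is an assumption taken from the footnote formula;
    with equal tokens per run the value cancels out and the result is
    the harmonic mean of the TPS values.
    """
    total_tokens = tokens_per_run * len(tps_values)
    total_seconds = sum(tokens_per_run / tps for tps in tps_values)
    return total_tokens / total_seconds


# Generation speeds (tok/s) at 512 / 40K / 100K tokens, copied from the table.
rows = {
    "Baseline": [70.8, 34.3, 25.8],
    "MTP Speculative": [71.5, 38.4, 29.1],
    "MTP + TurboQuant": [72.1, 65.2, 62.1],
}

averages = {name: round(time_weighted_tps(v), 1) for name, v in rows.items()}

for name, avg in averages.items():
    print(f"{name}: {avg} tok/s")
```

Under these assumptions the rounded results match the table (36.6, 40.3, 66.2). An arithmetic mean of the baseline row would give about 43.6 tok/s instead, which overstates throughput because the slow long-context runs account for most of the wall-clock time.
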
68+
69+ ### Time to First Token (seconds) — lower is better
70+
71+ | Configuration | 512 tokens | 40K tokens | 100K tokens |
72+ | ---| ---| ---| ---|
73+ | Baseline | 0.64s | 22.85s | 63.11s |
73+ | **MTP Speculative** | 0.34s | 20.45s | 55.17s |
74+ | **MTP + TurboQuant** ⭐ | **0.33s** | **13.17s** | **33.95s** |
76+
77+ ### GPU Memory (allocated / peak physical RAM)
78+
79+ | Configuration | 512 tokens | 40K tokens | 100K tokens |
80+ | ---| ---| ---| ---|
81+ | Baseline | 20.4 GB / 15.8 GB | 54.8 GB / 19.3 GB | 54.3 GB / 23.3 GB |
82+ | MTP Speculative | 20.4 GB / 16.0 GB | 54.6 GB / 20.8 GB | 54.3 GB / 23.1 GB |
83+ | **MTP + TurboQuant** ⭐ | **20.3 GB / 15.8 GB** | **23.9 GB / 17.3 GB** | **26.4 GB / 18.2 GB** |
84+
85+ **Key takeaways:**
86+ - 🚀 **1.81× average throughput**: MTP + TurboQuant delivers 66.2 tok/s time-weighted vs 36.6 tok/s for the baseline
87+ - 🏎️ **1.86× faster TTFT at 100K context**: 33.95s vs 63.11s for the baseline (46% reduction)
88+ - 💾 **Massive memory savings at long context**: GPU allocation drops from 54.8 GB to 23.9 GB at 40K tokens (TurboQuant KV compression)
89+ - 🔬 **MTP alone is nearly free**: 1.10× time-weighted TPS and lower TTFT with no increase in GPU allocation; peak physical RAM stays within 1.5 GB of the baseline
90+
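The headline ratios follow directly from the tables above. A small sketch of that arithmetic, with the figures copied from the tables; the helper names `speedup` and `reduction_pct` are illustrative only and not part of the repository:

```python
def speedup(new, old):
    """Ratio of new to old; use for metrics where higher is better (tok/s)."""
    return new / old


def reduction_pct(new, old):
    """Percentage drop from old to new; use where lower is better (s, GB)."""
    return (1 - new / old) * 100


# Time-weighted average TPS (tok/s) from the generation-speed table.
avg_tps_gain = speedup(66.2, 36.6)    # MTP + TurboQuant vs baseline
mtp_only_gain = speedup(40.3, 36.6)   # MTP alone vs baseline

# TTFT at 100K tokens (seconds): baseline 63.11, MTP + TurboQuant 33.95.
ttft_cut = reduction_pct(33.95, 63.11)
ttft_ratio = 63.11 / 33.95

# GPU allocation at 40K tokens (GB): baseline 54.8, MTP + TurboQuant 23.9.
mem_cut = reduction_pct(23.9, 54.8)

print(f"avg throughput: {avg_tps_gain:.2f}x, MTP alone: {mtp_only_gain:.2f}x")
print(f"TTFT at 100K: {ttft_cut:.0f}% lower ({ttft_ratio:.2f}x faster)")
print(f"GPU allocation at 40K: {mem_cut:.0f}% lower")
```

Keeping "times faster" and "percent lower" as separate helpers avoids mixing the two conventions: a 46% reduction in TTFT corresponds to roughly a 1.86× speedup, not 2×.
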
91+ > Run `python3 -u scripts/profiling/profile_runner.py --model gemma-4-26b-a4b-it-4bit --contexts "512,40000,100000"` to reproduce on your device.
92+
5493## 📊 Performance: Gemma 4-26B on Apple Silicon
5594
95+
5696Benchmark results for `gemma-4-26b-a4b-it-4bit` (26B MoE, 4-bit) on M5 Pro 64 GB.
5797
5898### Headline Numbers