
Commit d5a9d11

docs: add MTP speculative decoding benchmark results (M5 Pro 64GB) (#105)
* docs: add MTP speculative decoding benchmark results (M5 Pro 64GB)

  Gemma 4-26B-A4B benchmarks across Baseline / MTP Speculative / MTP+TurboQuant:
  - MTP + TurboQuant: 66.5 tok/s avg (+53% vs baseline)
  - TTFT at 100K context: 33.95s vs 63.11s (-46%)
  - GPU alloc at 40K context: 23.9 GB vs 54.8 GB (-56%)
  - MTP alone: +6% TPS, lower TTFT, zero memory overhead

* docs: fix avg TPS to use time-weighted harmonic mean

* docs: show per-context speedup multipliers in benchmark table

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
1 parent 79f0ef3 commit d5a9d11

1 file changed

Lines changed: 40 additions & 0 deletions

README.md

@@ -51,8 +51,48 @@ Then start the server (models download automatically if not cached):

*(Add `--stream-experts` when running oversized MoE models to bypass macOS virtual memory swapping and stream expert layers directly from NVMe SSD.)*

## 📊 Performance: MTP Speculative Decoding — Gemma 4-26B (MacBook Pro M5 Pro 64 GB)

Benchmarked with `gemma-4-26b-a4b-it-4bit` in three configurations (Baseline, MTP Speculative, and MTP + TurboQuant) at 512, 40K, and 100K token contexts.

### Generation Speed (tok/s) — higher is better

| Configuration | 512 tokens | 40K tokens | 100K tokens | Avg TPS* |
|---|---|---|---|---|
| Baseline | 70.8 | 34.3 | 25.8 | 36.6 |
| **MTP Speculative** | 71.5 (1.01×) | 38.4 (1.12×) | 29.1 (1.13×) | **40.3** |
| **MTP + TurboQuant** | **72.1 (1.02×)** | **65.2 (1.90×)** | **62.1 (2.41×)** | **66.2** |

*\* Time-weighted average: `total_tokens / sum(60/TPS)`, i.e. the harmonic mean of the three per-context rates. It reflects total wall-clock generation time, which an arithmetic mean overstates.*
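
The difference between the two averages can be rechecked from the table values. The following is a minimal Python sketch, not the project's own tooling. It assumes the `60` in the footnote formula is the number of tokens generated per run, which the README does not state; the result is the same for any equal per-run token count. The function names are illustrative.

```python
# Per-context generation speeds (tok/s) at 512 / 40K / 100K tokens,
# copied from the Generation Speed table.
tps = {
    "Baseline": [70.8, 34.3, 25.8],
    "MTP Speculative": [71.5, 38.4, 29.1],
    "MTP + TurboQuant": [72.1, 65.2, 62.1],
}


def arithmetic_mean(rates):
    """Plain average of the rates; ignores how long each run takes."""
    return sum(rates) / len(rates)


def time_weighted_mean(rates, tokens_per_run=60):
    """Total tokens divided by total generation time.

    tokens_per_run=60 is an assumption read off the footnote formula.
    With equal tokens per run this reduces to the harmonic mean.
    """
    total_tokens = tokens_per_run * len(rates)
    total_seconds = sum(tokens_per_run / r for r in rates)
    return total_tokens / total_seconds


for name, rates in tps.items():
    a = arithmetic_mean(rates)
    h = time_weighted_mean(rates)
    print(f"{name:18s} arithmetic={a:5.1f}  time-weighted={h:5.1f}")

baseline = time_weighted_mean(tps["Baseline"])
best = time_weighted_mean(tps["MTP + TurboQuant"])
print(f"time-weighted speedup: {best / baseline:.2f}x")
```

The time-weighted values come out at 36.6, 40.3, and 66.2 tok/s, matching the Avg TPS column, while the arithmetic means are higher (43.6, 46.3, and 66.5) because they give the fast 512-token run the same weight as the much slower long-context runs.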
### Time to First Token (seconds) — lower is better

| Configuration | 512 tokens | 40K tokens | 100K tokens |
|---|---|---|---|
| Baseline | 0.64s | 22.85s | 63.11s |
| **MTP Speculative** | 0.34s | 20.45s | 55.17s |
| **MTP + TurboQuant** | **0.33s** | **13.17s** | **33.95s** |
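
To put the TTFT figures in relative terms, the sketch below computes the reduction and speedup against Baseline for each context. It also derives an implied prompt-processing rate as context length divided by TTFT, which is only a rough estimate: it assumes TTFT is dominated by prompt processing and that "40K" and "100K" mean 40000 and 100000 tokens, as in the reproduce command. The helper names are illustrative.

```python
# Context sizes in tokens; 40K / 100K are taken as 40000 / 100000 (assumption).
contexts = [512, 40_000, 100_000]

# Time to first token in seconds, copied from the TTFT table.
ttft = {
    "Baseline": [0.64, 22.85, 63.11],
    "MTP Speculative": [0.34, 20.45, 55.17],
    "MTP + TurboQuant": [0.33, 13.17, 33.95],
}


def reduction_pct(baseline_s, value_s):
    """Percent of baseline latency that was removed."""
    return 100.0 * (baseline_s - value_s) / baseline_s


def speedup(baseline_s, value_s):
    """How many times faster than baseline."""
    return baseline_s / value_s


def implied_prompt_rate(context_tokens, seconds):
    """Rough tokens/s during prompt processing, assuming TTFT is mostly prefill."""
    return context_tokens / seconds


for name, times in ttft.items():
    for ctx, base_s, t in zip(contexts, ttft["Baseline"], times):
        print(
            f"{name:18s} ctx={ctx:>6d}  "
            f"reduction={reduction_pct(base_s, t):5.1f}%  "
            f"speedup={speedup(base_s, t):4.2f}x  "
            f"~{implied_prompt_rate(ctx, t):7.0f} prompt tok/s"
        )
```

At 100K context, MTP + TurboQuant works out to a 46.2% reduction, or about 1.86× faster than Baseline.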
### GPU Memory (allocated / peak physical RAM)

| Configuration | 512 tokens | 40K tokens | 100K tokens |
|---|---|---|---|
| Baseline | 20.4 GB / 15.8 GB | 54.8 GB / 19.3 GB | 54.3 GB / 23.3 GB |
| MTP Speculative | 20.4 GB / 16.0 GB | 54.6 GB / 20.8 GB | 54.3 GB / 23.1 GB |
| **MTP + TurboQuant** | **20.3 GB / 15.8 GB** | **23.9 GB / 17.3 GB** | **26.4 GB / 18.2 GB** |

**Key takeaways:**
- 🚀 **1.81× avg throughput**: MTP + TurboQuant delivers 66.2 tok/s time-weighted vs 36.6 tok/s baseline
- 🏎️ **1.86× faster TTFT at 100K context**: 33.95s vs 63.11s baseline (46% reduction)
- 💾 **56% lower GPU allocation at 40K context**: 54.8 GB → 23.9 GB with TurboQuant KV compression; at 100K tokens it drops from 54.3 GB to 26.4 GB
- 🔬 **MTP alone adds no GPU allocation**: 1.10× time-weighted TPS and lower TTFT, with allocation matching baseline and peak physical RAM within 1.5 GB of it

> Run `python3 -u scripts/profiling/profile_runner.py --model gemma-4-26b-a4b-it-4bit --contexts "512,40000,100000"` to reproduce on your device.
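
The memory claims in the key takeaways can be rechecked from the GPU Memory table. This is a small Python sketch with illustrative helper names, not part of `profile_runner.py`; it computes the drop in allocated memory against Baseline and the per-context difference for each configuration.

```python
# (allocated GB, peak physical RAM GB) at 512 / 40K / 100K tokens,
# copied from the GPU Memory table.
memory = {
    "Baseline": [(20.4, 15.8), (54.8, 19.3), (54.3, 23.3)],
    "MTP Speculative": [(20.4, 16.0), (54.6, 20.8), (54.3, 23.1)],
    "MTP + TurboQuant": [(20.3, 15.8), (23.9, 17.3), (26.4, 18.2)],
}
labels = ["512", "40K", "100K"]


def alloc_reduction_pct(config):
    """Percent drop in allocated memory versus Baseline, per context."""
    return [
        round(100.0 * (base[0] - cur[0]) / base[0], 1)
        for base, cur in zip(memory["Baseline"], memory[config])
    ]


def delta_vs_baseline(config):
    """(allocated, peak) difference in GB versus Baseline, per context."""
    return [
        (round(cur[0] - base[0], 1), round(cur[1] - base[1], 1))
        for base, cur in zip(memory["Baseline"], memory[config])
    ]


for config in ("MTP Speculative", "MTP + TurboQuant"):
    reductions = alloc_reduction_pct(config)
    deltas = delta_vs_baseline(config)
    for label, pct, (d_alloc, d_peak) in zip(labels, reductions, deltas):
        print(
            f"{config:18s} {label:>4s}: alloc {pct:5.1f}% lower, "
            f"delta alloc {d_alloc:+.1f} GB, delta peak {d_peak:+.1f} GB"
        )
```

For MTP + TurboQuant the allocation drop is 56.4% at 40K and 51.4% at 100K. For MTP alone, allocation never exceeds Baseline, while peak physical RAM differs by between -0.2 GB and +1.5 GB.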
## 📊 Performance: Gemma 4-26B on Apple Silicon

Benchmark results for `gemma-4-26b-a4b-it-4bit` (26B MoE, 4-bit) on M5 Pro 64 GB.

### Headline Numbers
