Commit 5a05548 (parent: d770bce)

chore: Update Gemma 4 benchmark metrics and add comprehensive testing suite

13 files changed: 835 additions & 111 deletions

README.md

Lines changed: 35 additions & 7 deletions
@@ -49,16 +49,42 @@ Reference implementations: [`turboquant-mlx`](https://github.com/sharpner/turboq
 
 ## 💻 Tested Hardware & Benchmarks
 
-To reliably run massive 122B parameter MoE models over SSD streaming, `SwiftLM` was designed and benchmarked natively on the following hardware:
+SwiftLM is designed to push Apple Silicon Unified Memory to its limits. All first-party benchmarks are generated on an **Apple M5 Pro (64 GB Unified Memory)**.
 
-- **Machine**: MacBook Pro, Apple M5 Pro
-- **Memory**: 64 GB Unified Memory
-- **Model**: Qwen3.5-122B-A10B-4bit
-- **SSD**: Internal Apple NVMe (Zero-Copy Streaming)
+Since we have limited access to the full spectrum of Apple hardware, **we welcome pull requests with benchmark results** from other devices (especially base M1/M2/M3 chips and Mac Studios).
 
-> **⚠️ Quantization Disclaimer**: While heavier quantization shrinks the required memory footprint, **4-bit quantization** remains the strict production standard for MoE models. Our metrics indicated that aggressive 2-bit quantization heavily destabilizes JSON grammars—routinely producing broken keys like `\name\` instead of `"name"`—which systematically breaks OpenAI-compatible tool calling.
+To run the extreme context benchmark suite on your device, execute:
+```bash
+bash tests/run_extreme_context.sh <model-id>
+```
 
----
+### Extreme Context Performance (100K Tokens)
+Tested on an M5 Pro (64 GB) handling a monolithic **100,000 token** system prompt with **TurboKV Acceleration** enabled.
+
+| Model | Configuration | Time To First Token (TTFT) | Peak GPU Memory (w/ TurboKV) |
+|---|---|---|---|
+| `gemma-4-e4b-it-8bit` | Dense (4B) | 60.92s | 11.83 GB |
+| `gemma-4-26b-a4b-it-4bit` | MoE (26B) | 66.99s | 16.86 GB |
+| `gemma-4-31b-it-4bit` | MoE (31B) | 533.37s | 29.23 GB |
+
+### Throughput & Inference Memory Profile
+Tested by generating exactly 20 tokens under standard conversational evaluation (`--prefill-size 512`) to capture token generation speed (tok/s) and the peak Apple Metal memory footprint:
+
+| Model | Time To First Token (s) | Generation Speed (tok/s) | Peak GPU Memory (GB) |
+|---|---|---|---|
+| `gemma-4-e2b-it-4bit` | 0.08s | 116.27 tok/s | 1.37 GB |
+| `gemma-4-e4b-it-8bit` | 0.33s | 48.21 tok/s | 7.64 GB |
+| `gemma-4-26b-a4b-it-4bit` | 0.14s | 85.49 tok/s | 13.46 GB |
+| `gemma-4-31b-it-4bit` | 0.55s | 14.82 tok/s | 16.83 GB |
+
+To run the automated suite on your machine for these models, execute:
+```bash
+python3 tests/run_4models_benchmark.py
+```
+
+> **🧠 How it works:** SwiftLM implements **Chunked Prefill** (controlled via `--prefill-size`, defaulting to 512). This is functionally equivalent to `llama.cpp`'s `--batch-size` parameter and mirrors the [`mlx-lm` Python library](https://github.com/ml-explore/mlx/tree/main/mlx_lm)'s reference approach to preventing $O(N^2)$ Unified Memory over-allocation when prefilling very long sequences.
+
+> **⚠️ Quantization Disclaimer**: While heavier quantization shrinks the required memory footprint, **4-bit quantization** remains the strict production standard for MoE models. Our metrics indicate that aggressive 2-bit quantization heavily destabilizes JSON grammars—routinely producing broken keys like `\name\` instead of `"name"`—which systematically breaks OpenAI-compatible tool calling.
 
 ---
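The Chunked Prefill idea described in the README note above can be sketched in a few lines of Python. This is a toy illustration, not SwiftLM's actual Swift implementation; only the `--prefill-size` default of 512 is taken from the document.

```python
# Toy sketch of chunked prefill: the prompt is fed to the model in fixed-size
# chunks so the attention workspace scales with the chunk size rather than the
# full prompt length. The real implementation lives in Swift; this just shows
# the chunking arithmetic.
def chunked_prefill(token_ids, prefill_size=512):
    """Yield successive prompt chunks for incremental KV-cache construction."""
    for start in range(0, len(token_ids), prefill_size):
        yield token_ids[start:start + prefill_size]

prompt = list(range(100_000))           # stand-in for a 100K-token prompt
chunks = list(chunked_prefill(prompt))
print(len(chunks), len(chunks[-1]))     # 196 chunks; the last holds 160 tokens
```

Each chunk extends the KV cache before the next is processed, which is why peak memory in the 100K benchmark stays bounded even though the prompt is monolithic.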

@@ -173,6 +199,7 @@ curl http://localhost:5413/v1/chat/completions \
 | `--port` | `5413` | Port to listen on |
 | `--host` | `127.0.0.1` | Host to bind |
 | `--max-tokens` | `2048` | Max tokens limit per generation |
+| `--prefill-size` | `512` | Prompt prefill chunk size (micro-batching for long contexts) |
 | `--gpu-layers` | `model_default` | Restrict the number of layers allocated to GPU hardware |
 | `--stream-experts` | `false` | Enable experimental SSD streaming for MoE model expert matrices |

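For reference, a client request against the server options documented above might be built like this. The endpoint path and defaults come from the options table and the `curl` example in the diff context; the model id is a placeholder.

```python
import json

# Sketch of an OpenAI-compatible chat request body for the SwiftLM server.
# Endpoint path and defaults taken from the options table above; the model
# id is a placeholder, not a required value.
url = "http://127.0.0.1:5413/v1/chat/completions"   # default --host / --port
payload = {
    "model": "mlx-community/gemma-4-e4b-it-8bit",   # placeholder model id
    "max_tokens": 2048,                             # server default
    "stream": True,
    "messages": [{"role": "user", "content": "Hello!"}],
}
body = json.dumps(payload).encode("utf-8")          # POST this to `url`
```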
@@ -187,6 +214,7 @@ curl http://localhost:5413/v1/chat/completions \
 
 Built entirely on the hard work of the Apple MLX community.
 - [mlx-swift](https://github.com/ml-explore/mlx-swift) — Apple MLX framework for Swift
+- [mlx-lm](https://github.com/ml-explore/mlx/tree/main/mlx_lm) — Python reference implementation for MLX language models (inspiration for the prompt chunking architecture)
 - [Hummingbird](https://github.com/hummingbird-project/hummingbird) — Event-driven Swift HTTP server
 - [flash-moe](https://github.com/danveloper/flash-moe) — Reference for SSD Expert Streaming

tests/benchmark.sh

Lines changed: 0 additions & 103 deletions
This file was deleted.

tests/run_4models_100k.py

Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
```python
import subprocess
import time
import urllib.request
import json

models = [
    "mlx-community/gemma-4-e4b-it-8bit",
    "mlx-community/gemma-4-26b-a4b-it-4bit",
    "mlx-community/gemma-4-31b-it-4bit"
]

port = 5440
# 1 word ~ 1.3 tokens. "test " repeated 80,000 times gives ~100k tokens.
prompt_text = "Please write a story about a little bird. " + ("test " * 80000)
results = []

print("==========================================================")
print(" 🚀 EXTREME CONTEXT ISOLATION (100K TOKENS) MATRIX")
print("==========================================================\n")

for idx, model in enumerate(models):
    print("\n======================================")
    print(f"[{idx+1}/3] Benchmarking: {model}")
    print("======================================")

    # Enable turbo-kv for extreme context
    server_cmd = [
        ".build/release/SwiftLM",
        "--model", model,
        "--port", str(port),
        "--turbo-kv"
    ]

    with open("benchmark_100k_server.log", "w") as log_file:
        server_proc = subprocess.Popen(server_cmd, stdout=log_file, stderr=subprocess.STDOUT)

    # Wait for the server to load the model (poll /health for up to 20 minutes)
    loaded = False
    for _ in range(1200):
        try:
            req = urllib.request.Request(f"http://127.0.0.1:{port}/health", method="GET")
            with urllib.request.urlopen(req) as response:
                if response.status == 200:
                    loaded = True
                    break
        except Exception:
            pass
        time.sleep(1.0)

    if not loaded:
        print(f"Error: Server failed to start for {model}")
        server_proc.terminate()
        server_proc.wait()
        continue

    print(f"Model {model} loaded. Submitting 100K context...")
    time.sleep(2)  # Stabilize memory

    # Fire off evaluation
    payload = {
        "model": model,
        "stream": True,
        "max_tokens": 10,
        "messages": [{"role": "user", "content": prompt_text}]
    }

    req = urllib.request.Request(
        f"http://127.0.0.1:{port}/v1/chat/completions",
        data=json.dumps(payload).encode('utf-8'),
        headers={'Content-Type': 'application/json'},
        method="POST"
    )

    start_time = time.time()
    ttft = None
    tokens = 0

    try:
        # This blocks until the first byte arrives (prefill duration)
        with urllib.request.urlopen(req, timeout=1200) as response:
            for line in response:
                line = line.decode('utf-8').strip()
                if not line or line == "data: [DONE]":
                    continue
                if line.startswith("data: "):
                    data_str = line[6:]
                    try:
                        data = json.loads(data_str)
                        if data.get('choices') and data['choices'][0]['delta'].get('content'):
                            if ttft is None:
                                ttft = time.time() - start_time
                                print(f" -> First Token received in {ttft:.2f}s!")
                            tokens += 1
                    except json.JSONDecodeError:
                        continue
    except Exception as e:
        print(f"Generation failed: {e}")

    duration = time.time() - start_time
    if ttft is None:
        ttft = duration

    # Get peak memory from the health endpoint
    peak_gb = 0
    try:
        req = urllib.request.Request(f"http://127.0.0.1:{port}/health", method="GET")
        with urllib.request.urlopen(req) as response:
            data = json.loads(response.read().decode('utf-8'))
            peak_mb = data.get("memory", {}).get("peak_mb", 0)
            peak_gb = peak_mb / 1024.0
    except Exception as e:
        print(f"Failed to fetch memory: {e}")

    # Teardown
    server_proc.terminate()
    try:
        server_proc.wait(timeout=5)
    except subprocess.TimeoutExpired:
        server_proc.kill()

    # Decode throughput excludes the prefill phase (TTFT)
    tps = tokens / (duration - ttft) if (duration - ttft) > 0 and tokens > 1 else 0
    print(f"--- Results for {model} ---")
    print(f"TTFT: {ttft:.2f}s | TPS: {tps:.2f} tok/s | Peak RAM: {peak_gb:.2f} GB | Tokens: {tokens}")

    results.append({
        "Model": model.split("/")[-1],
        "TTFT (s)": round(ttft, 2),
        "TPS": round(tps, 2),
        "Peak Mem (GB)": round(peak_gb, 2)
    })

print("\n\n=== FINAL 100K CONTEXT MARKDOWN TABLE ===")
print("| Model | 100K Time To First Token | Generation Speed | Peak GPU Memory (w/ TurboKV) |")
print("|---|---|---|---|")
for r in results:
    print(f"| `{r['Model']}` | {r['TTFT (s)']}s | {r['TPS']} tok/s | {r['Peak Mem (GB)']} GB |")
```
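As a sanity check on the script's prompt-size heuristic (using the script's own ~1.3 tokens-per-word estimate; real counts depend on each model's tokenizer):

```python
# The script repeats "test " 80,000 times; at the commented ~1.3 tokens/word
# heuristic this targets roughly 100K prompt tokens (tokenizer-dependent).
words = 80_000
est_tokens = int(words * 1.3)
print(est_tokens)  # 104000
```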
