@@ -89,25 +89,76 @@ Benchmark results for `gemma-4-26b-a4b-it-4bit` (26B MoE, 4-bit) on M5 Pro 64 GB

 ---

-## 🧠 Supported Models & Methodologies
+## 📡 Supported Models & Methodologies

-`SwiftLM` dynamically maps Apple MLX primitives to standard HuggingFace architectures, enabling complete support for the latest frontier open-weights models across modalities (Text, Vision, Audio).
+`SwiftLM` dynamically maps Apple MLX primitives to standard HuggingFace architectures, enabling native Metal inference across the latest frontier open-weights models.

-### Text (LLMs)
-- **Gemma 4**: Fully supports both Dense (`gemma-4-e4b`) and Sparse Mixture of Experts (MoE) architectures (`gemma-4-26b`, `gemma-4-31b`).
-- **Qwen 2.5 & 3**: Robust support for sliding window attention limits and custom RoPE scaling.

 | `--draft-model` | (none) | Draft model path/ID for speculative decoding (in-RAM models only) |
 | `--num-draft-tokens` | `4` | Number of draft tokens per speculation round |

+## 🔧 Per-Request API Parameters
+
+In addition to the standard OpenAI fields (`temperature`, `top_p`, `max_tokens`, etc.), SwiftLM accepts the following **SwiftLM-specific** fields on `POST /v1/chat/completions`:
+
+| Field | Type | Description |
+|---|---|---|
+| `kv_bits` | `int` (4 or 8) | Enable **MLX-native quantized KV cache** for this request. Uses `QuantizedKVCache` (standard group quantization) instead of `KVCacheSimple`. Separate from `--turbo-kv`. Reduces KV memory ~2–4× at mild quality cost. |
+| `enable_thinking` | `bool` | Force-enable or disable chain-of-thought thinking blocks for Gemma-4 / Qwen3. |
+| `kv_group_size` | `int` | Group size for `kv_bits` quantization (default: `64`). |
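
As a rough illustration of the table above, a request body carrying the three SwiftLM-specific fields could be built as follows. This is a minimal client-side sketch, not code from the repository: the `ChatMessage` and `ChatRequest` types and the `encodeChatRequest` helper are hypothetical names, and only the JSON field names (`kv_bits`, `kv_group_size`, `enable_thinking`) and the model ID come from the diff.

```swift
import Foundation

// Hypothetical client-side model of the request body. Only the three
// SwiftLM-specific JSON field names are taken from the README table.
struct ChatMessage: Codable {
    let role: String
    let content: String
}

struct ChatRequest: Codable {
    let model: String
    let messages: [ChatMessage]
    var kvBits: Int? = nil           // 4 or 8; left out of the JSON when nil
    var kvGroupSize: Int? = nil      // README states the default is 64
    var enableThinking: Bool? = nil  // left out of the JSON when nil

    enum CodingKeys: String, CodingKey {
        case model, messages
        case kvBits = "kv_bits"
        case kvGroupSize = "kv_group_size"
        case enableThinking = "enable_thinking"
    }
}

// Encodes the request as compact JSON with sorted keys so the output is stable.
func encodeChatRequest(_ request: ChatRequest) throws -> String {
    let encoder = JSONEncoder()
    encoder.outputFormatting = [.sortedKeys]
    let data = try encoder.encode(request)
    return String(decoding: data, as: UTF8.self)
}

let request = ChatRequest(
    model: "gemma-4-26b-a4b-it-4bit",
    messages: [ChatMessage(role: "user", content: "Hello")],
    kvBits: 8,
    kvGroupSize: 64,
    enableThinking: false
)
let body = try encodeChatRequest(request)
print(body)
```

Keeping the optional fields as `nil` by default means a request that does not mention them serializes exactly like a standard OpenAI-style body, which seems to match the "in addition to the standard fields" wording.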
Sources/SwiftLM/Server.swift (26 additions & 2 deletions)
@@ -1048,9 +1048,20 @@ func handleChatCompletion(
         // These are accepted but may not affect generation if MLX doesn't support them
     }

+    // ── Validate kv_bits: only nil, 4, and 8 are supported ──
+    if let kb = chatReq.kvBits, kb != 4 && kb != 8 {
+        let errBody = "{\"error\":{\"message\":\"Invalid kv_bits value \(kb). Supported values are 4 and 8.\",\"type\":\"invalid_request_error\",\"code\":\"invalid_kv_bits\"}}"
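
The validation rule in this hunk (accept `nil`, 4, or 8 and reject everything else) can be sketched in isolation. The helper below is an assumption for illustration, not the code in `Server.swift`: `APIError`, `APIErrorEnvelope`, and `validateKVBits` are hypothetical names, and the sketch swaps the hand-escaped JSON string for `Codable` types so that `JSONEncoder` handles the quoting.

```swift
import Foundation

// Hypothetical error types mirroring the JSON shape in the diff:
// {"error": {"message": ..., "type": ..., "code": ...}}
struct APIError: Codable, Equatable {
    let message: String
    let type: String
    let code: String
}

struct APIErrorEnvelope: Codable, Equatable {
    let error: APIError
}

// Returns nil when kv_bits is absent, 4, or 8; otherwise returns the error
// envelope the server would send back.
func validateKVBits(_ kvBits: Int?) -> APIErrorEnvelope? {
    guard let kb = kvBits, kb != 4 && kb != 8 else { return nil }
    return APIErrorEnvelope(error: APIError(
        message: "Invalid kv_bits value \(kb). Supported values are 4 and 8.",
        type: "invalid_request_error",
        code: "invalid_kv_bits"
    ))
}

// Usage: an unsupported value yields an encodable error body.
if let envelope = validateKVBits(2) {
    let encoder = JSONEncoder()
    encoder.outputFormatting = [.sortedKeys]
    if let data = try? encoder.encode(envelope) {
        print(String(decoding: data, as: UTF8.self))
    }
}
```

Returning a typed value instead of a pre-built string keeps the check easy to unit-test, at the cost of one extra encoding step at the call site.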

-    if !isMultimodalRequest, let cachedCount = await promptCache.restore(newTokens: promptTokens, into: cache) {
+    if !skipPromptCache, let cachedCount = await promptCache.restore(newTokens: promptTokens, into: cache) {
         // Cache hit: KV state is pre-populated up to cachedCount tokens.
         // Only compute the remaining (new) tokens.
         var startIndex = cachedCount
@@ -1251,6 +1266,10 @@ func handleChatCompletion(
     let onPrefillDone: (() async -> Void)? = {
         if turboHasCompressed {
             print("[SwiftLM] 🧠 Skipping prompt cache save — TurboQuant has compressed \(cache.compactMap { ($0 as? KVCacheSimple)?.compressedOffset }.max() ?? 0) tokens. Saving would decode ~37 GB back to fp16.")
+        } else if params.kvBits != nil {
+            // kv_bits is set: the cache contains QuantizedKVCache layers whose token
+            // format is incompatible with the FP16 KVCacheSimple format expected by
+            // promptCache.save. Skip saving to prevent unsafe mixed-format restores.
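
The branch order in this hunk can be summarized as a small decision function. This is a sketch under stated assumptions: `PromptCacheSaveDecision` and `promptCacheSaveDecision` are hypothetical names, and the final `.save` case assumes that the truncated remainder of the closure saves the prompt cache when neither condition holds, which the visible diff does not show.

```swift
// Hypothetical result type; the real closure prints a message or calls
// promptCache.save rather than returning a value.
enum PromptCacheSaveDecision: Equatable {
    case save
    case skip(reason: String)
}

// Mirrors the visible branch order: TurboQuant compression is checked first,
// then a per-request kv_bits setting.
func promptCacheSaveDecision(turboHasCompressed: Bool, kvBits: Int?) -> PromptCacheSaveDecision {
    if turboHasCompressed {
        return .skip(reason: "TurboQuant has compressed tokens")
    } else if kvBits != nil {
        return .skip(reason: "cache holds QuantizedKVCache layers")
    }
    // Assumption: the elided branch saves the FP16 KVCacheSimple state.
    return .save
}

print(promptCacheSaveDecision(turboHasCompressed: false, kvBits: 8))
```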