@@ -17,14 +17,15 @@ No Python runtime, no Global Interpreter Lock (GIL), no unnecessary memory copie

 ## 🆚 Why `mlx-server`? (vs. llama.cpp & python mlx-lm)

-| Feature | `mlx-server` (Swift/C++) | `llama.cpp` (Metal) | `python mlx-lm` |
-| :--- | :--- | :--- | :--- |
-| **Backend Math** | Official Apple MLX (Metal) | Custom Metal Shaders | Official Apple MLX (Metal) |
-| **Concurrency / GIL** | 🟢 **Zero GIL** (Swift async) | 🟢 **Zero GIL** (C++) | 🔴 **GIL Bottlenecked** (Python) |
-| **Model Format** | Native HuggingFace (Safetensors) | GGUF (Requires Conversion) | Native HuggingFace (Safetensors) |
-| **MoE Memory Footprint** | 🟢 **Direct SSD Streaming** | 🟡 CPU `mmap` Swapping | 🔴 OS Swap (High memory pressure) |
-| **KV Cache** | 🟢 **TurboQuantization** | 🟢 Aggressive Quantization | 🟡 Standard Python Hooks |
-| **Dependencies** | None (Single Native Binary) | None (Single Native Binary) | Python Runtime, `pip` packages |
+| Feature | `mlx-server` (Swift) | `llama.cpp` (Metal) | `python mlx-lm` | `vLLM` (Flash-MoE) |
+| :--- | :--- | :--- | :--- | :--- |
+| **Backend Math** | Official Apple MLX (Metal) | Custom Metal Shaders | Official Apple MLX | NVIDIA CUDA / Triton |
+| **Target Hardware** | Consumer Apple Silicon | Universal (CPU/Mac) | Consumer Apple Silicon | Datacenter NVIDIA GPUs |
+| **Concurrency / GIL** | 🟢 **Zero GIL** (Swift async) | 🟢 **Zero GIL** (C++) | 🔴 **GIL Bottlenecked** (Python) | 🟢 **Zero GIL** (C++/Python) |
+| **Model Format** | Native HF (Safetensors) | GGUF (Requires Conversion) | Native HF (Safetensors) | Native HF (Safetensors) |
+| **MoE Memory Footprint** | 🟢 **Direct SSD Streaming** | 🟡 CPU `mmap` Swapping | 🔴 OS Swap (High pressure) | 🟢 **Flash-MoE** (High VRAM required) |
+| **KV Cache** | 🟢 **TurboQuantization** | 🟢 Aggressive Quantization | 🟡 Standard Python Hooks | 🟢 PagedAttention |
+| **Dependencies** | None (Single Native Binary) | None (Single Native Binary) | Python Runtime, `pip` | Heavy CUDA Python Environment |

 **The TL;DR:**
 - Use **`llama.cpp`** if you prefer GGUF formats and are running cross-platform on Windows/Linux.