Commit 64df241

simba authored and committed
docs: add Flash-MoE and vLLM to comparison table
1 parent db32d1b commit 64df241

2 files changed

Lines changed: 10 additions & 13 deletions

LocalPackages/mlx-swift/Source/Cmlx/mlx/mlx/core/moe_stream_op.cpp

Lines changed: 1 addition & 5 deletions
@@ -141,11 +141,7 @@ kernel void streamed_moe_gemm(
   encoder.set_input_array(w, 1);

   // Ensure memory is allocated for output BEFORE adding to the Metal encoder
-  o.set_data(
-      allocator::malloc(o.size() * o.itemsize()),
-      o.size(),
-      o.strides(),
-      o.flags());
+  o.set_data(allocator::malloc(o.nbytes()));
   encoder.set_output_array(o, 2);

   uint M = static_cast<uint>(x.size() / x.shape().back());
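
The hunk above replaces the explicit `o.size() * o.itemsize()` byte count with `o.nbytes()`. A minimal sketch of why the two allocation sizes should match, assuming `nbytes()` is defined as element count times element size. `ToyArray` is a hypothetical stand-in written for illustration, not the MLX `array` class:

```cpp
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical stand-in for an array type; NOT the MLX `array` class.
struct ToyArray {
  std::vector<std::size_t> shape;  // dimensions, e.g. {4, 8, 16}
  std::size_t item_size;           // bytes per element

  // Number of elements: product of all dimensions (1 for an empty shape).
  std::size_t size() const {
    return std::accumulate(shape.begin(), shape.end(), std::size_t{1},
                           std::multiplies<std::size_t>());
  }

  std::size_t itemsize() const { return item_size; }

  // Total byte count, assumed here to equal size() * itemsize().
  std::size_t nbytes() const { return size() * itemsize(); }
};
```

Under that assumption the buffer size requested from the allocator is unchanged by the commit. What does change is that the size, strides, and flags arguments are no longer passed to `set_data`, so the shorter call relies on whatever defaults the single-argument form applies.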

README.md

Lines changed: 9 additions & 8 deletions
@@ -17,14 +17,15 @@ No Python runtime, no Global Interpreter Lock (GIL), no unnecessary memory copie

 ## 🆚 Why `mlx-server`? (vs. llama.cpp & python mlx-lm)

-| Feature | `mlx-server` (Swift/C++) | `llama.cpp` (Metal) | `python mlx-lm` |
-| :--- | :--- | :--- | :--- |
-| **Backend Math** | Official Apple MLX (Metal) | Custom Metal Shaders | Official Apple MLX (Metal) |
-| **Concurrency / GIL** | 🟢 **Zero GIL** (Swift async) | 🟢 **Zero GIL** (C++) | 🔴 **GIL Bottlenecked** (Python) |
-| **Model Format** | Native HuggingFace (Safetensors)| GGUF (Requires Conversion) | Native HuggingFace (Safetensors)|
-| **MoE Memory Footprint**| 🟢 **Direct SSD Streaming** | 🟡 CPU `mmap` Swapping | 🔴 OS Swap (High memory pressure) |
-| **KV Cache** | 🟢 **TurboQuantization** | 🟢 Aggressive Quantization | 🟡 Standard Python Hooks |
-| **Dependencies** | None (Single Native Binary) | None (Single Native Binary) | Python Runtime, `pip` packages |
+| Feature | `mlx-server` (Swift) | `llama.cpp` (Metal) | `python mlx-lm` | `vLLM` (Flash-MoE) |
+| :--- | :--- | :--- | :--- | :--- |
+| **Backend Math** | Official Apple MLX (Metal) | Custom Metal Shaders | Official Apple MLX | NVIDIA CUDA / Triton |
+| **Target Hardware** | Consumer Apple Silicon | Universal (CPU/Mac) | Consumer Apple Silicon | Datacenter NVIDIA GPUs |
+| **Concurrency / GIL** | 🟢 **Zero GIL** (Swift async) | 🟢 **Zero GIL** (C++) | 🔴 **GIL Bottlenecked** (Python) | 🟢 **Zero GIL** (C++/Python) |
+| **Model Format** | Native HF (Safetensors)| GGUF (Requires Conversion) | Native HF (Safetensors)| Native HF (Safetensors) |
+| **MoE Memory Footprint**| 🟢 **Direct SSD Streaming** | 🟡 CPU `mmap` Swapping | 🔴 OS Swap (High pressure) | 🟢 **Flash-MoE** (High VRAM required) |
+| **KV Cache** | 🟢 **TurboQuantization** | 🟢 Aggressive Quantization | 🟡 Standard Python Hooks | 🟢 PagedAttention |
+| **Dependencies** | None (Single Native Binary) | None (Single Native Binary) | Python Runtime, `pip` | Heavy CUDA Python Environment |

 **The TL;DR:**
 - Use **`llama.cpp`** if you prefer GGUF formats and are running cross-platform on Windows/Linux.
