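docs: add Flash-MoE and vLLM to comparison table

Also simplify the output allocation in moe_stream_op.cpp to a single-argument
o.set_data(allocator::malloc(o.nbytes())) call.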

simba · simba · commit 64df24155b4a · 2026-03-29T11:05:17.000-07:00
diff --git a/LocalPackages/mlx-swift/Source/Cmlx/mlx/mlx/core/moe_stream_op.cpp b/LocalPackages/mlx-swift/Source/Cmlx/mlx/mlx/core/moe_stream_op.cpp
@@ -141,11 +141,7 @@ kernel void streamed_moe_gemm(
         encoder.set_input_array(w, 1);
         
         // Ensure memory is allocated for output BEFORE adding to the Metal encoder
-        o.set_data(
-            allocator::malloc(o.size() * o.itemsize()),
-            o.size(),
-            o.strides(),
-            o.flags());
+        o.set_data(allocator::malloc(o.nbytes()));
         encoder.set_output_array(o, 2);
 
         uint M = static_cast<uint>(x.size() / x.shape().back());
diff --git a/README.md b/README.md
@@ -17,14 +17,15 @@ No Python runtime, no Global Interpreter Lock (GIL), no unnecessary memory copie
 
 ## 🆚 Why `mlx-server`? (vs. llama.cpp & python mlx-lm)
 
-| Feature | `mlx-server` (Swift/C++) | `llama.cpp` (Metal) | `python mlx-lm` |
-| :--- | :--- | :--- | :--- |
-| **Backend Math** | Official Apple MLX (Metal) | Custom Metal Shaders | Official Apple MLX (Metal) |
-| **Concurrency / GIL** | 🟢 **Zero GIL** (Swift async) | 🟢 **Zero GIL** (C++) | 🔴 **GIL Bottlenecked** (Python) |
-| **Model Format** | Native HuggingFace (Safetensors)| GGUF (Requires Conversion) | Native HuggingFace (Safetensors)|
-| **MoE Memory Footprint**| 🟢 **Direct SSD Streaming** | 🟡 CPU `mmap` Swapping | 🔴 OS Swap (High memory pressure) |
-| **KV Cache** | 🟢 **TurboQuantization** | 🟢 Aggressive Quantization | 🟡 Standard Python Hooks |
-| **Dependencies** | None (Single Native Binary) | None (Single Native Binary) | Python Runtime, `pip` packages |
+| Feature | `mlx-server` (Swift) | `llama.cpp` (Metal) | `python mlx-lm` | `vLLM` (Flash-MoE) |
+| :--- | :--- | :--- | :--- | :--- |
+| **Backend Math** | Official Apple MLX (Metal) | Custom Metal Shaders | Official Apple MLX | NVIDIA CUDA / Triton |
+| **Target Hardware** | Consumer Apple Silicon | Cross-platform (CPU, Metal, CUDA) | Consumer Apple Silicon | Datacenter NVIDIA GPUs |
+| **Concurrency / GIL** | 🟢 **Zero GIL** (Swift async) | 🟢 **Zero GIL** (C++) | 🔴 **GIL Bottlenecked** (Python) | 🟡 Python engine (GIL), C++/CUDA kernels |
+| **Model Format** | Native HF (Safetensors) | GGUF (Requires Conversion) | Native HF (Safetensors) | Native HF (Safetensors) |
+| **MoE Memory Footprint** | 🟢 **Direct SSD Streaming** | 🟡 CPU `mmap` Swapping | 🔴 OS Swap (High pressure) | 🟢 **Flash-MoE** (High VRAM required) |
+| **KV Cache** | 🟢 **TurboQuantization** | 🟢 Aggressive Quantization | 🟡 Standard Python Hooks | 🟢 PagedAttention |
+| **Dependencies** | None (Single Native Binary) | None (Single Native Binary) | Python Runtime, `pip` | Heavy CUDA Python Environment |
 
 **The TL;DR:**
 - Use **`llama.cpp`** if you prefer GGUF formats and are running cross-platform on Windows/Linux.