You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
> **Note:** Requires `mlx.metallib` next to the binary for GPU compute. See [README](https://github.com/SharpAI/mlx-server#metal-shader-library) for setup.
116
+
> **Note:** Requires `mlx.metallib` next to the binary for GPU compute. See [README](https://github.com/SharpAI/SwiftLM#metal-shader-library) for setup.
Copy file name to clipboardExpand all lines: AEGIS_INTEGRATION.md
+14-14Lines changed: 14 additions & 14 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -1,24 +1,24 @@
1
1
# 🛡️ Aegis-AI Integration Guide
2
2
3
-
`mlx-server` is designed to be a **completely transparent, drop-in replacement** for `llama-server` or any cloud VLM gateway within Aegis-AI, delivering dramatically faster zero-latency inference on Apple Silicon.
3
+
`SwiftLM` is designed to be a **completely transparent, drop-in replacement** for `llama-server` or any cloud VLM gateway within Aegis-AI, delivering dramatically faster zero-latency inference on Apple Silicon.
4
4
5
5
---
6
6
7
7
## 🚀 Quick Start for Aegis-AI
8
8
9
9
### 1. Download the Binary
10
10
11
-
Download the latest pre-built binary from the [Releases page](https://github.com/SharpAI/mlx-server/releases) — no Xcode required:
11
+
Download the latest pre-built binary from the [Releases page](https://github.com/SharpAI/SwiftLM/releases) — no Xcode required:
12
12
13
13
```bash
14
14
# Extract and make executable
15
-
tar -xzf mlx-server-*-macos-arm64.tar.gz
16
-
chmod +x mlx-server
15
+
tar -xzf SwiftLM-*-macos-arm64.tar.gz
16
+
chmod +x SwiftLM
17
17
```
18
18
19
19
### 2. Point Aegis-AI at the Server
20
20
21
-
In your `~/.aegis-ai/llm-config.json`, set the base URL to the mlx-server endpoint:
21
+
In your `~/.aegis-ai/llm-config.json`, set the base URL to the SwiftLM endpoint:
22
22
23
23
```json
24
24
{
@@ -30,10 +30,10 @@ In your `~/.aegis-ai/llm-config.json`, set the base URL to the mlx-server endpoi
30
30
31
31
### 3. Launch the Sidecar
32
32
33
-
Aegis-AI should spin up `mlx-server` as a managed subprocess:
33
+
Aegis-AI should spin up `SwiftLM` as a managed subprocess:
34
34
35
35
```bash
36
-
/path/to/mlx-server \
36
+
/path/to/SwiftLM \
37
37
--model mlx-community/Qwen2.5-7B-Instruct-4bit \
38
38
--host 127.0.0.1 \
39
39
--port 5413
@@ -54,7 +54,7 @@ Aegis-AI should **wait for this event** before routing any requests to the serve
54
54
If you are running a Mixture of Experts (MoE) model — such as `Qwen3.5-122B-A10B` — you **must** pass the `--stream-experts true` flag.
55
55
56
56
```bash
57
-
/path/to/mlx-server \
57
+
/path/to/SwiftLM \
58
58
--model mlx-community/Qwen3.5-122B-A10B-4bit \
59
59
--host 127.0.0.1 \
60
60
--port 5413 \
@@ -66,7 +66,7 @@ If you are running a Mixture of Experts (MoE) model — such as `Qwen3.5-122B-A1
66
66
67
67
### Why `--stream-experts` Works
68
68
69
-
MoE models like Qwen3.5-122B have 122B *total* parameters, but only ~10B are **active** on any single forward pass. `mlx-server` exploits this sparsity:
69
+
MoE models like Qwen3.5-122B have 122B *total* parameters, but only ~10B are **active** on any single forward pass. `SwiftLM` exploits this sparsity:
70
70
71
71
- The 60GB+ of expert weight matrices are `mmap`'d directly from your NVMe SSD
72
72
- Only the **2-4 specific expert shards** selected by the router for the current token (~1.5MB each) are streamed into GPU RAM via a zero-copy DMA path
@@ -85,13 +85,13 @@ Due to SSD streaming, TTFT is higher than a fully in-memory model. This is **exp
85
85
| Long (1000+ tokens) | 1–3 minutes |
86
86
87
87
> [!TIP]
88
-
> **Aegis-AI Prompt Cache**: `mlx-server` automatically caches the KV state for repeated system prompts. After the first request with a given system prompt, subsequent requests with the same system prompt will skip the expensive prefill phase and start streaming almost immediately.
88
+
> **Aegis-AI Prompt Cache**: `SwiftLM` automatically caches the KV state for repeated system prompts. After the first request with a given system prompt, subsequent requests with the same system prompt will skip the expensive prefill phase and start streaming almost immediately.
89
89
90
90
---
91
91
92
92
## 📡 API Reference
93
93
94
-
`mlx-server` is **fully OpenAI-compatible** — any client using the OpenAI SDK works without modification.
94
+
`SwiftLM` is **fully OpenAI-compatible** — any client using the OpenAI SDK works without modification.
On Apple Silicon, GPU and system RAM are the **same physical chips** (Unified Memory Architecture). `mlx-server` uses a layered strategy to fit the largest possible models:
177
+
On Apple Silicon, GPU and system RAM are the **same physical chips** (Unified Memory Architecture). `SwiftLM` uses a layered strategy to fit the largest possible models:
178
178
179
179
| Model Size vs. RAM | Strategy | Notes |
180
180
|---|---|---|
@@ -186,7 +186,7 @@ On Apple Silicon, GPU and system RAM are the **same physical chips** (Unified Me
186
186
You can always inspect the computed memory plan before loading a model:
Copy file name to clipboardExpand all lines: README.md
+6-6Lines changed: 6 additions & 6 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -1,4 +1,4 @@
1
-
# ⚡️ mlx-server
1
+
# ⚡️ SwiftLM
2
2
3
3
A blazingly fast, native Swift inference server that serves [MLX](https://github.com/ml-explore/mlx) models with a strict **OpenAI-compatible API**.
4
4
@@ -17,7 +17,7 @@ No Python runtime, no Global Interpreter Lock (GIL), no unnecessary memory copie
17
17
18
18
## ⚡️ TurboQuantization: KV Cache Compression
19
19
20
-
`mlx-server` implements **TurboQuant** (AISTATS/ICLR 2026) for on-the-fly KV cache compression, enabling long-context inference with drastically reduced memory. At 3 bits/coordinate, the KV cache is compressed ~5.8× vs FP16 with near-zero accuracy loss.
20
+
`SwiftLM` implements **TurboQuant** (AISTATS/ICLR 2026) for on-the-fly KV cache compression, enabling long-context inference with drastically reduced memory. At 3 bits/coordinate, the KV cache is compressed ~5.8× vs FP16 with near-zero accuracy loss.
To reliably run massive 122B parameter MoE models over SSD streaming, `mlx-server` was designed and benchmarked natively on the following hardware:
48
+
To reliably run massive 122B parameter MoE models over SSD streaming, `SwiftLM` was designed and benchmarked natively on the following hardware:
49
49
50
50
-**Machine**: MacBook Pro, Apple M5 Pro
51
51
-**Memory**: 64 GB Unified Memory
@@ -59,7 +59,7 @@ To reliably run massive 122B parameter MoE models over SSD streaming, `mlx-serve
59
59
## 🛠️ Quick Start
60
60
61
61
### Fastest: Download Pre-built Binary
62
-
The absolute fastest way to get started is to [download the latest pre-compiled macOS binary](https://github.com/SharpAI/mlx-server/releases) directly from the Releases page. Just extract it and run!
62
+
The absolute fastest way to get started is to [download the latest pre-compiled macOS binary](https://github.com/SharpAI/SwiftLM/releases) directly from the Releases page. Just extract it and run!
63
63
64
64
### Build from Source
65
65
@@ -70,7 +70,7 @@ swift build -c release
70
70
### Run (Downloads model natively on first launch)
71
71
72
72
```bash
73
-
.build/release/mlx-server \
73
+
.build/release/SwiftLM \
74
74
--model Qwen3.5-122B-A10B-4bit \
75
75
--stream-experts true \
76
76
--port 5413
@@ -133,7 +133,7 @@ Built entirely on the hard work of the Apple MLX community.
133
133
134
134
### 🙏 TurboQuant Credits
135
135
136
-
The TurboQuant KV cache compression implemented in `mlx-server` is directly based on the following open-source work and research:
136
+
The TurboQuant KV cache compression implemented in `SwiftLM` is directly based on the following open-source work and research:
137
137
138
138
-**[TheTom/llama-cpp-turboquant](https://github.com/TheTom/llama-cpp-turboquant/tree/feature/turboquant-kv-cache)** — The primary reference for the C and Metal GPU implementation. The `turbo-wht.h` Fast Walsh-Hadamard kernel, WHT sign arrays (seed=42), Lloyd-Max centroid tables, and the `ggml-turbo-quant.c` quantize/dequantize logic were ported directly from this repository into our MLX C++ and Metal backend.
0 commit comments