Commit 781ae91

docs: add M5 to requirements and highlight pre-built binary usage
1 parent 0cc53b9 commit 781ae91

1 file changed

Lines changed: 31 additions & 3 deletions

README.md

@@ -49,7 +49,10 @@ To reliably run massive 122B parameter MoE models over SSD streaming, `mlx-serve

 ## 🛠️ Quick Start

-### Build
+### Fastest: Download Pre-built Binary
+The fastest way to get started is to [download the latest pre-compiled macOS binary](https://github.com/SharpAI/mlx-server/releases) from the Releases page, extract it, and run it.
+
+### Build from Source

 ```bash
 swift build -c release
@@ -92,9 +95,34 @@ curl http://localhost:5413/v1/chat/completions \
 ]
 }'
 ```
-
 ---

+## 🛡️ Aegis-AI & System Integration
+
+`mlx-server` is designed as a transparent, drop-in replacement for `llama-server` or cloud VLM gateways within local intelligence platforms like **Aegis-AI**, offering low-latency local inference on macOS.
+
+When configuring local inference workflows (e.g., within `~/.aegis-ai/llm-config.json`), apply the following integration details:
+
+### 1. Gateway Emulation
+`mlx-server` exposes an OpenAI-compatible API layer:
+- **`GET /health`**: Returns JSON containing GPU metrics and VRAM allocations.
+- **`GET /v1/models`**: Lists the currently loaded models.
+- **`POST /v1/chat/completions`**: Supports both `stream: true` and `stream: false`. Handles tool calls, system prompts, and temperature settings.
+
+### 2. Process Configuration (The Sidecar)
+When your agent engine spins up `mlx-server` using standard subprocess management, be aware of the memory requirements of *Mixture of Experts* (MoE) models.
+
+> [!CAUTION]
+> **Critical Memory Routing (`--stream-experts`)**
+> When managing MoE models (e.g., `Qwen3.5-122B-A10B`, where the active parameters are significantly smaller than the total payload), you **must explicitly append** the `--stream-experts true` flag to the process arguments. If it is omitted, macOS will hit a `Data Abort` memory-mapping fault when mapping more than 100 GB of tensors onto the unified GPU hardware.
+
+### 3. Client Reliability Overrides (First-Request Lock)
+> [!WARNING]
+> Because large zero-copy models (such as 122B-parameter graphs) block for **3-5+ minutes** on their *very first request* while the specialized Apple Metal compilation graphs are built, **standard short timeouts (e.g., 60s) will fail the first request.**
+>
+> Ensure that your system's `node-fetch` metrics checks, network controllers, or `curl` abort signals extend their standard timeouts to `> 300s`.
+
+---
 ## ⚙️ CLI Options

 | Option | Default | Description |
@@ -109,7 +137,7 @@ curl http://localhost:5413/v1/chat/completions \
 ## 📦 Requirements

 - macOS 14.0+
-- Apple Silicon (M1/M2/M3/M4)
+- Apple Silicon (M1/M2/M3/M4/M5)
 - Xcode Command Line Tools
 - Metal Toolchain (`xcodebuild -downloadComponent MetalToolchain`)

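The sidecar and timeout guidance added in this diff could be applied on the client side roughly as follows. This is a minimal sketch, not code from the repository: the helper names (`sidecar_args`, `request_timeout`, `chat_completion`) and the 360-second value are assumptions chosen for illustration. Only the `--stream-experts true` flag, the `> 300s` requirement, the 60s short-timeout example, and the `localhost:5413` URL come from the README. The request body is deliberately minimal, and a real deployment may need additional fields such as a model name.

```python
import json
import urllib.request

# Illustrative value only: the README asks for "> 300s"; 360 is an assumption.
FIRST_REQUEST_TIMEOUT_S = 360.0
# The README cites 60s as a typical short timeout.
DEFAULT_TIMEOUT_S = 60.0

# URL taken from the README's curl example.
CHAT_URL = "http://localhost:5413/v1/chat/completions"


def sidecar_args(base_args, is_moe):
    """Return process arguments, adding `--stream-experts true` for MoE models."""
    args = list(base_args)  # copy so the caller's list is not mutated
    if is_moe and "--stream-experts" not in args:
        args += ["--stream-experts", "true"]
    return args


def request_timeout(is_first_request):
    """Pick a timeout long enough to survive the first-request compilation."""
    return FIRST_REQUEST_TIMEOUT_S if is_first_request else DEFAULT_TIMEOUT_S


def chat_completion(messages, is_first_request=False, url=CHAT_URL):
    """POST a non-streaming chat request and return the decoded JSON body."""
    body = json.dumps({"messages": messages, "stream": False}).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    timeout = request_timeout(is_first_request)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

The list returned by `sidecar_args` can be passed to `subprocess.Popen` unchanged. Keeping the flag logic in one helper means the agent engine cannot forget the flag when it launches an MoE model, and the check for an existing `--stream-experts` entry avoids appending it twice.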