Skip to content

Commit 85a5004

Browse files
committed
docs: remove outdated Metal compile lockup warning as MoE streamed inference primitives resolve the delay
1 parent 781ae91 commit 85a5004

2 files changed

Lines changed: 37 additions & 5 deletions

File tree

README.md

Lines changed: 0 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -116,11 +116,6 @@ When your agent engine spins up `mlx-server` using standard subprocess managemen
116116
> **Critical Memory Routing (`--stream-experts`)**
117117
> When managing MoE models (e.g., `Qwen3.5-122B-A10B` where active parameters are significantly smaller than the total payload), you **must artificially append** the `--stream-experts true` flag to the process arguments. If omitted, macOS will inevitably suffer a `Data Abort` memory-mapping fault when mapping > 100GB of tensors onto the unified GPU hardware.
118118
119-
### 3. Client Reliability Overrides (First-Request Lock)
120-
> [!WARNING]
121-
> Because zero-copy heavy matrices (like 122B parameter graphs) require **3-5+ minutes of pure compile-time lockup** on their *very first request* to build the specialized Apple Metal compilation graphs, **standard short-timeouts (e.g., 60s) will fail the first request.**
122-
>
123-
> Ensure your system's `node-fetch` metrics checks, network controllers, or `curl` abort signals extend standard timeouts to `> 300s`.
124119

125120
---
126121
## ⚙️ CLI Options

mlx_integration_prompt.md

Lines changed: 37 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,37 @@
1+
# Aegis-AI Integration Prompt: mlx-server Backend
2+
3+
Aegis-AI can seamlessly substitute `llama-server` or cloud VLM gateways with `mlx-server` for dramatically faster zero-latency inference on macOS instances.
4+
5+
When configuring local inference within the `~/.aegis-ai/llm-config.json` configuration layer, apply the following integration details:
6+
7+
## 1. Gateway Emulation
8+
`mlx-server` exposes a fully standard OpenAI-compatible API layer.
9+
- **Endpoint**: `http://127.0.0.1:5413/v1` (or whatever `port` is provided)
10+
- **Supported Routes**:
11+
- `GET /health` (Exposes advanced JSON containing GPU metrics/VRAM allocations)
12+
- `GET /v1/models`
13+
- `POST /v1/chat/completions` (Supports both `stream: true` and `stream: false`, natively handles tool-calls, system prompts, and temperature checks)
14+
15+
## 2. Process Configuration (The Sidecar)
16+
17+
The Aegis-AI deployment engine should spin up the `mlx-server` sidecar using standard subprocess management.
18+
19+
### Launch Command
20+
```bash
21+
/path/to/mlx-server/.build/release/mlx-server \
22+
--model {MODEL_IDENTIFIER_OR_PATH} \
23+
--host 127.0.0.1 \
24+
--port {PORT_NUMBER}
25+
```
26+
27+
### 🧠 Critical Memory Routing Flag (`--stream-experts`)
28+
When managing MoE models (e.g. `Qwen3.5-122B-A10B` or any model where active parameters (`A10B`) are substantially smaller than totally memory payload), Aegis-AI **must artificially append** the `--stream-experts true` flag to the process arguments.
29+
If this flag is omitted, macOS will inevitably suffer a `Data Abort` memory-mapping fault when loading > 100GB of tensors onto the unified GPU hardware.
30+
31+
Example:
32+
```bash
33+
.build/release/mlx-server --model mlx-community/Qwen3.5-122B-A10B-4bit --stream-experts true --port 5413
34+
```
35+
36+
37+
_You can safely feed this document block sequentially into the Aegis zero-assumption onboarding agent to establish permanent MLX local pipelines._

0 commit comments

Comments
 (0)