Merged
92 commits
337ec6d
feat: Phase 1 API parity with mlx-lm
Mar 23, 2026
6589cbe
feat: Phase 2 — JSON mode, VLM vision support, multipart content, ext…
Mar 23, 2026
e4ebecb
feat: Phase 3 — Memory limit, /metrics, enhanced /health, graceful sh…
Mar 23, 2026
9fe2175
feat: API key authentication (--api-key flag)
Mar 23, 2026
20e1ce8
feat: Prompt caching — reduce TTFT by reusing system prompt KV state
Mar 24, 2026
30c06d6
fix: CI — install mlx.metallib from Python mlx package
Mar 24, 2026
fd4a5e3
feat: GPU yield — prevent Metal from starving macOS WindowServer
Mar 24, 2026
9c89cfc
feat: add memory-aware model partitioning framework
Mar 27, 2026
837ced0
feat: wire GPU/CPU layer partitioning to --gpu-layers flag
Mar 27, 2026
ebef9b0
fix: restore Package.resolved and add CI retry for HF downloads
Mar 27, 2026
8dd340e
chore: update fork to f8f315b (20 model architectures with LayerParti…
Mar 28, 2026
3aae8e2
feat: add auto-calibration 'Wisdom' system
Mar 28, 2026
7210980
feat(moe): Expose --stream-experts flag to enable SSD inference strea…
Mar 28, 2026
5255cd6
feat: add download speed and progress bar UX
Mar 28, 2026
3472cba
fix(ux): progress bar fraction handling for non-byte counts
Mar 28, 2026
441cb2b
feat(ux): add robust caching bandwidth speedometer
Mar 28, 2026
7d23eba
fix(ux): add autonomous task-driven progress bar and restore MB counts
Mar 28, 2026
eced528
feat: localize MLX frameworks to write C++ turboquant and ssd streame…
Mar 28, 2026
86bcee8
feat(mlx): integrate core C++ unified memory SSD streaming primitives
Mar 29, 2026
b5e6ade
feat(mlx-c): expose SSD streamed_gather_mm primitive to c-api
Mar 29, 2026
e29155c
feat(server): auto-wire safetensors resolution and stream environment…
Mar 29, 2026
9131ff7
feat(mlx-swift): expose MLXFast.streamedGatherMM and update c-api sig…
Mar 29, 2026
ed2d2b0
docs: recreate README with mlx-server comparisons and architecture de…
Mar 29, 2026
91852f8
docs: add Flash-MoE and vLLM to comparison table
Mar 29, 2026
e0da633
test: structure test scripts into tests directory and ignore artifacts
solderzzc Mar 29, 2026
8082313
docs: fix hardware specs and document 4-bit JSON quantization caveat
solderzzc Mar 29, 2026
c6078ad
fix(ci): relocate standalone test scripts to scripts/ to prevent impl…
solderzzc Mar 29, 2026
310b940
docs: remove vLLM column and correctly attribute Flash-MoE features
solderzzc Mar 29, 2026
b2ccae9
docs: revert incorrect Flash-MoE designation for mlx-server and resto…
solderzzc Mar 29, 2026
8b1d723
docs: fix test hardware to M5 Pro 64GB
solderzzc Mar 29, 2026
5b31a29
docs: remove vLLM completely from matrix
solderzzc Mar 29, 2026
349fc53
docs: update quick start and curl snippet to demonstrate 122B model d…
solderzzc Mar 29, 2026
9f70969
fix: enforce SIGKILL in e2e tests and expand HF timeout
solderzzc Mar 29, 2026
1aaa13b
fix(ci): add mlx-swift-lm git submodule and checkout submodules in CI
solderzzc Mar 29, 2026
e6a421a
ci: re-trigger workflow after mlx-swift-lm submodule push
solderzzc Mar 29, 2026
7f3911e
test(e2e): extend test suite from 21 to 31 tests
solderzzc Mar 29, 2026
d40e9e4
fix: correct finish_reason=length and tool_calls test robustness
solderzzc Mar 29, 2026
5d2ea4b
fix(Server): update mlx-swift-lm submodule to receive Evaluate.swift …
solderzzc Mar 29, 2026
2de99c9
docs: add M5 to requirements and highlight pre-built binary usage
solderzzc Mar 29, 2026
0c3288b
docs: remove outdated Metal compile lockup warning as MoE streamed in…
solderzzc Mar 29, 2026
f27da83
docs: remove Aegis-AI integration block temporarily to prepare for ne…
solderzzc Mar 29, 2026
3c879db
build(ci): add .gitmodules mapping to fix mlx-swift-lm cloning failur…
solderzzc Mar 29, 2026
b0b3b9b
feat(mlx-swift): implement 1-second interval aggregated SSD read metr…
solderzzc Mar 30, 2026
b19abdb
docs: add AEGIS_INTEGRATION.md with complete Aegis-AI sidecar setup g…
solderzzc Mar 30, 2026
df7d154
fix(memory): use physical RAM budget for SSD streaming instead of Met…
solderzzc Mar 30, 2026
d3da36e
feat: add llama-server style generation logging + API response format…
solderzzc Mar 30, 2026
4c5e54b
feat: log full JSON response body matching llama-server log_server_r …
solderzzc Mar 30, 2026
54a619f
feat: add thinking and ssd_stream to Config log line for observability
solderzzc Mar 30, 2026
40d65d0
fix: replace SWAP-ASSISTED warning with SSD STREAMING label when stre…
solderzzc Mar 30, 2026
4dc61a6
fix(metrics): correct gpu_layers, strategy, and estimated_tok_s for S…
solderzzc Mar 30, 2026
95a126d
fix(build): capture streamExperts as local let before escaping health…
solderzzc Mar 30, 2026
8d3e15f
feat: per-request chat_template_kwargs.enable_thinking support
solderzzc Mar 30, 2026
da21efe
feat: real-time token streaming to stdout + fflush
solderzzc Mar 30, 2026
82bcc4b
feat: llama-server style logging + SSE CRLF fix
solderzzc Mar 30, 2026
65e5497
fix(ssd): Rewrite streamed_gather_mm primitive to load directly into …
solderzzc Mar 30, 2026
231c62c
fix(mlx-server): Restore SSD streaming throughput and mem-limit enfor…
solderzzc Mar 30, 2026
5849c5a
fix(mlx-server): Restore correct output by using prefault+slice inste…
solderzzc Mar 30, 2026
2f655a1
feat(ssd): Restore SSD stream metrics around prefault() call
solderzzc Mar 30, 2026
3ea2afb
feat(ssd): Add mlx_fast_pread_into for direct NVMe reads into evaluat…
solderzzc Mar 30, 2026
1c1ded9
feat(ssd): Wire mlx_fast_pread_into for high-throughput SSD weight st…
solderzzc Mar 30, 2026
b814c88
fix(ssd): Key expert offset cache by tensor_name not E — gate/up/down…
solderzzc Mar 30, 2026
95edfdc
fix(ssd): Restore 5 GB/s throughput + correct output via tensor_name …
solderzzc Mar 30, 2026
7c20227
fix(server): Remove debug prompt_debug print from slot_launch log
solderzzc Mar 30, 2026
ab35fd2
docs: Add TurboQuant KV cache algorithm description to README
solderzzc Mar 30, 2026
2bb7017
feat: Implement real TurboQuant KV cache compression (ported from lla…
solderzzc Mar 30, 2026
511a59b
docs: remove llama.cpp VLM comparison table from README
solderzzc Mar 30, 2026
a042879
test: Add TurboQuant unit tests to CI/CD pipeline
solderzzc Mar 30, 2026
dfa9fba
feat: Add thinking/reasoning support (ThinkingStateTracker + prefill …
solderzzc Mar 30, 2026
7dd655f
fix: Correct buffer range removal in ThinkingStateTracker (use ..<upp…
solderzzc Mar 30, 2026
70ac5e8
fix(mlx-c): stub out turbo_encode to fix CI build
solderzzc Mar 30, 2026
8286492
feat(turboquant): implement turbo_encode_k/v CPU encode path
solderzzc Mar 30, 2026
2ca5b02
feat(turboquant): wire --turbo-kv flag into server and KVCache
solderzzc Mar 30, 2026
480e349
refactor: rename project from mlx-server to SwiftLM
solderzzc Mar 30, 2026
1f7087b
docs: add MIGRATION_NOTE.md for Aegis-AI mlx-server → SwiftLM rename
solderzzc Mar 30, 2026
60cc3e3
fix(metrics): rename Prometheus metrics from mlx_server_ to swiftlm_ …
solderzzc Mar 30, 2026
ce626c1
fix(ci): Fix stale PCH module cache error after mlx-server→SwiftLM re…
solderzzc Mar 30, 2026
60d538b
fix(ssd-stream): route metrics to stderr, throttle to 10s, fix MB/s calc
solderzzc Mar 30, 2026
26d2319
feat(metrics): expose SSD Flash-Stream stats to /metrics endpoint
solderzzc Mar 30, 2026
a83fa7d
docs(engine): add TurboQuant C++ architecture notes
solderzzc Mar 30, 2026
6c6d62c
docs: clarify TurboQuant hybrid architecture in README
solderzzc Mar 30, 2026
3f4cc09
fix(server): support Qwen <thinking> tags in state tracker
solderzzc Mar 30, 2026
e444985
fix(server): support top-level enable_thinking parameter
solderzzc Mar 30, 2026
c323412
chore(deps): bump mlx-swift-lm for SSD background telemetry fixes
solderzzc Mar 31, 2026
4143d3b
fix(server): null content in tool_calls log; bump mlx-swift-lm for tu…
solderzzc Mar 31, 2026
48fd996
feat(turbo-kv): support head_dim=256 via two 128-dim sub-groups
solderzzc Mar 31, 2026
957f763
feat: add SwiftLM Chat multiplatform app (iOS + macOS)
solderzzc Mar 31, 2026
e141627
feat(turbo-kv): add turbo_decode_k/v — batch dequantize for SDPA atte…
solderzzc Mar 31, 2026
28a00f9
feat(turbo-kv): stage 2 — activate compressed KV attention pipeline
solderzzc Mar 31, 2026
dc6af72
feat(turbo-kv): add mlx_turbo_kv_record C atomic + 10s log hook
solderzzc Mar 31, 2026
3df7430
fix(turbo-kv): drop token count from log, show ratio+MB saved (layer-…
solderzzc Mar 31, 2026
6ede853
docs(turbo-kv): add implementation status, hot-window design rational…
solderzzc Mar 31, 2026
01df003
feat(prompt-cache): token-by-token prefix match (llama-server style)
solderzzc Mar 31, 2026
37 changes: 28 additions & 9 deletions .github/workflows/build.yml
@@ -2,7 +2,7 @@ name: Build

on:
push:
branches: [main]
branches: [main, develop, feature/*]
pull_request:
branches: [main]

@@ -11,24 +11,43 @@ jobs:
runs-on: macos-15
steps:
- uses: actions/checkout@v4
with:
submodules: recursive

- name: Install Metal Toolchain
run: xcodebuild -downloadComponent MetalToolchain || true

- name: Cache Swift packages
uses: actions/cache@v4
with:
path: .build
# Key includes product name so any rename (e.g. mlx-server→SwiftLM)
# automatically busts the cache and prevents stale PCH errors.
key: ${{ runner.os }}-spm-SwiftLM-${{ hashFiles('Package.resolved') }}
restore-keys: |
${{ runner.os }}-spm-SwiftLM-

- name: Resolve dependencies
run: swift package resolve

- name: Clear stale module cache
# Prevents: "PCH was compiled with module cache path '…mlx-server…'
# but the path is currently '…SwiftLM…'" after repo rename.
run: find .build -type d -name ModuleCache -exec rm -rf {} + 2>/dev/null || true

- name: Build (Release)
run: swift build -c release

- name: Verify binary
run: |
ls -lh .build/release/mlx-server
file .build/release/mlx-server
ls -lh .build/release/SwiftLM
file .build/release/SwiftLM

- name: Upload binary
uses: actions/upload-artifact@v4
with:
name: mlx-server-arm64
path: .build/release/mlx-server
retention-days: 30
- name: TurboQuant unit tests
run: |
# Compile and run standalone C++ unit tests for the TurboQuant
# KV cache compression algorithm (ported from TheTom/llama-cpp-turboquant).
# Tests: centroids, WHT self-inverse, rotation orthogonality,
# 3-bit pack/unpack, V-cache SNR, K-cache IP SNR, fp16 round-trip.
clang++ -std=c++17 -O2 -o /tmp/tq_test tests/test_turbo_quant.cpp
/tmp/tq_test
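
Review note on the `Clear stale module cache` step: the one-liner above hides all `find` errors with `2>/dev/null || true`. A minimal sketch of a stricter variant follows, assuming the only goal is to delete every `ModuleCache` directory under `.build`. The function name `clear_module_caches` is hypothetical and not part of this PR.

```shell
#!/usr/bin/env bash
# Sketch only: clear_module_caches is a hypothetical helper, not part of this PR.
# Removes every directory named ModuleCache under the given root (default: .build).
clear_module_caches() {
  local root="${1:-.build}"
  # Nothing to do on a cold cache, where the build directory does not exist yet.
  [ -d "$root" ] || return 0
  # -prune stops find from descending into a directory that is about to be
  # deleted, which should avoid "No such file or directory" noise on stderr.
  find "$root" -type d -name ModuleCache -prune -exec rm -rf {} +
}
```

In a workflow step this would be called as `clear_module_caches .build`. Real failures (for example a permissions error) would then fail the step instead of being swallowed.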
85 changes: 85 additions & 0 deletions .github/workflows/e2e-test.yml
@@ -0,0 +1,85 @@
name: E2E Tests

on:
push:
branches: [main, feature/*]
pull_request:
branches: [main]

concurrency:
group: e2e-${{ github.ref }}
cancel-in-progress: true

jobs:
e2e:
runs-on: macos-15
timeout-minutes: 30

steps:
- uses: actions/checkout@v4
with:
submodules: recursive

- name: Cache Swift packages
uses: actions/cache@v4
with:
path: .build
key: ${{ runner.os }}-spm-SwiftLM-${{ hashFiles('Package.resolved') }}
restore-keys: |
${{ runner.os }}-spm-SwiftLM-

- name: Clear stale module cache
# Prevents: "PCH was compiled with module cache path '…mlx-server…'
# but the path is currently '…SwiftLM…'" after repo rename.
run: find .build -type d -name ModuleCache -exec rm -rf {} + 2>/dev/null || true

- name: Build (Release)
run: swift build -c release

- name: Install MLX Metal library
run: |
python3 -m venv /tmp/mlx_venv
/tmp/mlx_venv/bin/pip install --quiet mlx
cp /tmp/mlx_venv/lib/python*/site-packages/mlx/lib/mlx.metallib .build/release/

- name: Cache MLX model
uses: actions/cache@v4
with:
path: ~/.cache/huggingface
key: mlx-model-qwen2.5-0.5b-4bit

- name: TurboQuant unit tests
run: |
# Fast pre-flight: verify compression math before expensive model download.
# Tests: Lloyd-Max centroids, WHT correctness, rotation orthogonality,
# 3-bit pack/unpack, V-cache SNR (14.6 dB), K-cache IP SNR (13.7 dB), fp16.
# No external deps — compiles standalone with clang++.
clang++ -std=c++17 -O2 -o /tmp/tq_test tests/test_turbo_quant.cpp
/tmp/tq_test

- name: Run E2E tests
env:
HF_HUB_DOWNLOAD_TIMEOUT: "600"
run: |
chmod +x tests/test-server.sh
# Retry up to 2 times for transient HuggingFace download failures
for attempt in 1 2 3; do
echo "Attempt $attempt of 3..."
if tests/test-server.sh .build/release/SwiftLM 15413; then
exit 0
fi
if [ "$attempt" -lt 3 ]; then
echo "Test failed, retrying in 10s..."
sleep 10
fi
done
echo "All attempts failed"
exit 1

- name: Upload test logs on failure
if: failure()
uses: actions/upload-artifact@v4
with:
name: e2e-test-logs
path: /tmp/SwiftLM-test-*.log
retention-days: 7
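
Review note on the `Run E2E tests` step: the inline loop makes three attempts with a 10-second pause between them. The same logic can be sketched as a reusable function, assuming bash is the step shell. The name `retry` and its argument order are hypothetical and not part of this PR.

```shell
#!/usr/bin/env bash
# Sketch only: retry is a hypothetical helper, not part of this PR.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    echo "Attempt $attempt of $max_attempts..."
    # Run the command exactly as given; succeed as soon as it does.
    if "$@"; then
      return 0
    fi
    # Only sleep when another attempt is still to come.
    if [ "$attempt" -lt "$max_attempts" ]; then
      echo "Command failed, retrying in ${delay}s..."
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  echo "All attempts failed"
  return 1
}
```

The step body would then reduce to `retry 3 10 tests/test-server.sh .build/release/SwiftLM 15413`, with the attempt count and delay visible in one place.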
29 changes: 15 additions & 14 deletions .github/workflows/release.yml
@@ -31,6 +31,7 @@ jobs:
- uses: actions/checkout@v4
with:
fetch-depth: 0 # Full history for build number
submodules: recursive

- name: Install Metal Toolchain
run: xcodebuild -downloadComponent MetalToolchain || true
@@ -64,31 +65,31 @@ jobs:

- name: Verify binary
run: |
ls -lh .build/release/mlx-server
file .build/release/mlx-server
.build/release/mlx-server --help || true
ls -lh .build/release/SwiftLM
file .build/release/SwiftLM
.build/release/SwiftLM --help || true

- name: Package binary
run: |
mkdir -p release
cp .build/release/mlx-server release/
cp .build/release/SwiftLM release/
cp LICENSE README.md release/
cd release
tar -czvf ../mlx-server-${{ steps.tag.outputs.name }}-macos-arm64.tar.gz .
tar -czvf ../SwiftLM-${{ steps.tag.outputs.name }}-macos-arm64.tar.gz .

- name: Upload artifact
uses: actions/upload-artifact@v4
with:
name: mlx-server-${{ steps.tag.outputs.name }}-macos-arm64
path: mlx-server-${{ steps.tag.outputs.name }}-macos-arm64.tar.gz
name: SwiftLM-${{ steps.tag.outputs.name }}-macos-arm64
path: SwiftLM-${{ steps.tag.outputs.name }}-macos-arm64.tar.gz
retention-days: 90

- name: Prepare release notes
id: notes
run: |
CHANGELOG=$(cat /tmp/changelog.txt)
cat > /tmp/release_notes.md << 'RELEASE_EOF'
## mlx-server ${{ steps.tag.outputs.full }}
## SwiftLM ${{ steps.tag.outputs.full }}

<details open>

@@ -104,25 +105,25 @@ jobs:

### Download

- [macOS Apple Silicon (arm64)](https://github.com/SharpAI/mlx-server/releases/download/${{ steps.tag.outputs.name }}/mlx-server-${{ steps.tag.outputs.name }}-macos-arm64.tar.gz)
- [macOS Apple Silicon (arm64)](https://github.com/SharpAI/SwiftLM/releases/download/${{ steps.tag.outputs.name }}/SwiftLM-${{ steps.tag.outputs.name }}-macos-arm64.tar.gz)

### Quick Start
```bash
tar -xzf mlx-server-${{ steps.tag.outputs.name }}-macos-arm64.tar.gz
./mlx-server --model mlx-community/Qwen2.5-3B-Instruct-4bit --port 5413
tar -xzf SwiftLM-${{ steps.tag.outputs.name }}-macos-arm64.tar.gz
./SwiftLM --model mlx-community/Qwen2.5-3B-Instruct-4bit --port 5413
```

> **Note:** Requires `mlx.metallib` next to the binary for GPU compute. See [README](https://github.com/SharpAI/mlx-server#metal-shader-library) for setup.
> **Note:** Requires `mlx.metallib` next to the binary for GPU compute. See [README](https://github.com/SharpAI/SwiftLM#metal-shader-library) for setup.
RELEASE_EOF

- name: Create release
if: ${{ github.event_name == 'push' || github.event.inputs.create_release == 'true' }}
uses: softprops/action-gh-release@v2
with:
tag_name: ${{ steps.tag.outputs.name }}
name: "mlx-server ${{ steps.tag.outputs.name }}"
name: "SwiftLM ${{ steps.tag.outputs.name }}"
body_path: /tmp/release_notes.md
files: |
mlx-server-${{ steps.tag.outputs.name }}-macos-arm64.tar.gz
SwiftLM-${{ steps.tag.outputs.name }}-macos-arm64.tar.gz
draft: false
prerelease: false
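
Review note on the rename in `release.yml`: the product name is repeated in the binary path, the tarball name, the artifact name, and the release title, so each occurrence had to be edited by hand. A minimal sketch of deriving the archive name from the binary path is shown below. The function `package_release` and its arguments are hypothetical and not part of this PR; the `-macos-arm64.tar.gz` suffix is taken from the workflow above.

```shell
#!/usr/bin/env bash
# Sketch only: package_release is a hypothetical helper, not part of this PR.
# Usage: package_release <binary> <tag> <out_dir> [extra files...]
package_release() {
  local binary="$1" tag="$2" out_dir="$3"
  shift 3
  local product staging archive
  # The product name comes from the binary itself, so a rename needs no edits here.
  product="$(basename "$binary")"
  archive="${out_dir}/${product}-${tag}-macos-arm64.tar.gz"
  staging="$(mktemp -d)"
  # Stage the binary plus any extra files (for example LICENSE and README.md).
  cp "$binary" "$@" "$staging/"
  tar -czf "$archive" -C "$staging" .
  rm -rf "$staging"
  # Print the archive path so later steps can reuse it.
  echo "$archive"
}
```

For example, `package_release .build/release/SwiftLM "$TAG" "$PWD" LICENSE README.md` would produce `SwiftLM-$TAG-macos-arm64.tar.gz` in the current directory, where `TAG` stands in for the workflow's tag output.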
7 changes: 7 additions & 0 deletions .gitignore
@@ -13,3 +13,10 @@ DerivedData/
# IDE
.vscode/
.idea/

# Temporary Artifacts & Logs
*.log
*.metallib
*.pid
curl_out.txt
sample.txt
3 changes: 3 additions & 0 deletions .gitmodules
@@ -0,0 +1,3 @@
[submodule "mlx-swift-lm"]
path = mlx-swift-lm
url = https://github.com/SharpAI/mlx-swift-lm.git