Commit 1416ac6

simba authored and committed

feat: add memory-aware model partitioning framework

- New ModelProfiler.swift: reads config.json, measures weight files (following HF Hub symlinks), computes memory requirements (weights + KV cache + 20% overhead), and outputs a PartitionPlan with a strategy (fullGPU/swapAssisted/layerPartitioned/tooLarge)
- New --info flag: dry run in which the profiler prints a formatted memory analysis report and exits without loading the model
- New --gpu-layers option: accepts 'auto' or an integer; ready for future GPU/CPU layer splitting (Phase 2)
- Pre-load profiling: automatically detects the overcommit ratio and sets MLX cache limits (a 2MB cache in swap-assisted mode, so the OS manages page caching; inspired by Flash-MoE research)
- Enhanced /health endpoint: includes partition data (strategy, overcommit_ratio, weight/kv/total GB, GPU layers, estimated tok/s)
- Ready event JSON: includes partition data for downstream integration
- Rename main.swift -> Server.swift (required by the Swift compiler once a second source file is added to a target whose entry point uses the @main attribute)
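The commit message describes the profiler's arithmetic but not its code. What follows is a minimal, self-contained sketch of that arithmetic under stated assumptions; it is not the contents of ModelProfiler.swift. Every type and function name here (ModelShape, measureWeightBytes, estimateKVCacheBytes, MemoryEstimate, makePlan, GPULayers) is hypothetical, as are the fp16/bf16 KV-cache element size, the `*.safetensors` file filter, the `availableBytes` input, and the overcommit cut-offs used to pick a strategy. Only the weights + KV cache + 20% overhead formula, the four strategy names, the symlink-following weight measurement, the 2MB swap-assisted cache, and the 'auto'/integer form of --gpu-layers come from the commit.

```swift
import Foundation

// Hypothetical mirror of the four strategies named in the commit message.
enum PartitionStrategy: String {
    case fullGPU, swapAssisted, layerPartitioned, tooLarge
}

// Hypothetical subset of config.json fields needed for a KV-cache estimate.
struct ModelShape {
    let numLayers: Int
    let numKVHeads: Int
    let headDim: Int
}

// Sum the sizes of *.safetensors files in a model directory. HF Hub snapshot
// directories hold symlinks into blobs/, and FileManager.attributesOfItem reports
// on the link itself, so each path is resolved before its size is read.
func measureWeightBytes(in directory: URL) throws -> Double {
    let fm = FileManager.default
    let entries = try fm.contentsOfDirectory(at: directory, includingPropertiesForKeys: nil)
    var total: UInt64 = 0
    for url in entries where url.pathExtension == "safetensors" {
        let target = url.resolvingSymlinksInPath()
        let attrs = try fm.attributesOfItem(atPath: target.path)
        total += (attrs[.size] as? NSNumber)?.uint64Value ?? 0
    }
    return Double(total)
}

// KV cache: one K and one V tensor per layer, each contextLength x numKVHeads x headDim.
// bytesPerElement = 2 assumes an fp16/bf16 cache (an assumption, not from the commit).
func estimateKVCacheBytes(shape: ModelShape, contextLength: Int, bytesPerElement: Int = 2) -> Double {
    Double(2 * shape.numLayers * contextLength * shape.numKVHeads * shape.headDim * bytesPerElement)
}

struct MemoryEstimate {
    let weightBytes: Double
    let kvCacheBytes: Double
    // Weights + KV cache, plus the 20% overhead described in the commit message.
    var totalBytes: Double { (weightBytes + kvCacheBytes) * 1.2 }
}

struct PartitionPlan {
    let strategy: PartitionStrategy
    let overcommitRatio: Double
    let cacheLimitBytes: Int?  // nil = leave the MLX default cache limit alone
}

// Overcommit ratio = estimated need / available memory.
// The cut-offs below (1.0 / 1.5 / 4.0) are invented for illustration only.
func makePlan(estimate: MemoryEstimate, availableBytes: Double) -> PartitionPlan {
    let ratio = estimate.totalBytes / availableBytes
    let strategy: PartitionStrategy
    switch ratio {
    case ..<1.0: strategy = .fullGPU
    case ..<1.5: strategy = .swapAssisted
    case ..<4.0: strategy = .layerPartitioned
    default: strategy = .tooLarge
    }
    // Swap-assisted mode keeps MLX's buffer cache at 2 MB so the OS page cache does the work.
    let cacheLimit: Int? = strategy == .swapAssisted ? 2 * 1024 * 1024 : nil
    return PartitionPlan(strategy: strategy, overcommitRatio: ratio, cacheLimitBytes: cacheLimit)
}

// Parses a --gpu-layers value: "auto" or a non-negative integer.
enum GPULayers: Equatable {
    case auto
    case count(Int)

    init?(argument: String) {
        if argument.lowercased() == "auto" { self = .auto; return }
        guard let n = Int(argument), n >= 0 else { return nil }
        self = .count(n)
    }
}
```

As a worked example of the formula: a 32-layer model with 8 KV heads of dimension 128 at a 4096-token context needs 2 × 32 × 4096 × 8 × 128 × 2 bytes = 512 MiB of KV cache; with 4 GiB of weights the estimate is (4 GiB + 0.5 GiB) × 1.2 ≈ 5.4 GiB. In a real server the resulting cache limit would be handed to MLX before the weights are loaded; in mlx-swift that is presumably done through `GPU.set(cacheLimit:)`, though the commit does not say which call it uses.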

1 parent 4d4ade2, commit 1416ac6

2 files changed

Sources/mlx-server
- ModelProfiler.swift
- Server.swift
