Commit 1416ac6
feat: add memory-aware model partitioning framework
- New ModelProfiler.swift: reads config.json, measures weight files
(follows HF Hub symlinks), computes memory requirements (weights +
KV cache + 20% overhead), and outputs a PartitionPlan with strategy
(fullGPU/swapAssisted/layerPartitioned/tooLarge)
- New --info flag: dry-run profiler prints formatted memory analysis
report and exits without loading the model
- New --gpu-layers option: accepts 'auto' or integer, ready for future
GPU/CPU layer splitting (Phase 2)
- Pre-load profiling: automatically detects overcommit ratio and sets
MLX cache limits (2 MB cache in swap-assisted mode so the OS manages
page caching, inspired by Flash-MoE research)
- Enhanced /health endpoint: includes partition data (strategy,
overcommit_ratio, weight/kv/total GB, GPU layers, estimated tok/s)
- Ready event JSON: includes partition data for downstream integration
- Rename main.swift -> Server.swift (required by Swift compiler when
adding second source file with @main attribute)

1 parent 4d4ade2 commit 1416ac6
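
For readers without the diff at hand, the memory estimate and strategy selection described in the ModelProfiler bullet might look roughly like the sketch below. It is not the code from this commit. All type and function names are hypothetical, the 20% overhead is assumed to apply to the sum of weights and KV cache, and the overcommit thresholds (1.0, 2.0, 4.0) are illustrative placeholders, since the commit message does not state them. Only the four strategy names and the "weights + KV cache + 20% overhead" formula come from the message. The KV cache size uses the standard estimate of 2 (keys and values) x layers x KV heads x head dimension x context length x bytes per element.

```swift
import Foundation

// Strategy names are taken from the commit message; everything else is hypothetical.
enum PartitionStrategy: String {
    case fullGPU, swapAssisted, layerPartitioned, tooLarge
}

// Standard KV cache estimate: keys and values for every layer and position.
func kvCacheBytes(layers: Int, kvHeads: Int, headDim: Int,
                  contextLength: Int, bytesPerElement: Int = 2) -> Double {
    return Double(2 * layers * kvHeads * headDim * contextLength * bytesPerElement)
}

// Assumption: the 20% overhead is applied to (weights + KV cache).
func totalRequiredBytes(weightBytes: Double, kvBytes: Double) -> Double {
    return (weightBytes + kvBytes) * 1.2
}

// Assumption: thresholds are placeholders, not the values used in the commit.
func chooseStrategy(requiredBytes: Double,
                    availableBytes: Double) -> (strategy: PartitionStrategy, overcommitRatio: Double) {
    let ratio = requiredBytes / availableBytes
    if ratio <= 1.0 { return (.fullGPU, ratio) }
    if ratio <= 2.0 { return (.swapAssisted, ratio) }
    if ratio <= 4.0 { return (.layerPartitioned, ratio) }
    return (.tooLarge, ratio)
}

// Example: 4 GiB of weights, a 32-layer model with 8 KV heads of dimension 128,
// a 4096-token context in 16-bit precision, and 8 GiB of available memory.
let gib = 1024.0 * 1024.0 * 1024.0
let exampleKV = kvCacheBytes(layers: 32, kvHeads: 8, headDim: 128, contextLength: 4096)
let exampleTotal = totalRequiredBytes(weightBytes: 4 * gib, kvBytes: exampleKV)
let examplePlan = chooseStrategy(requiredBytes: exampleTotal, availableBytes: 8 * gib)
print(examplePlan.strategy.rawValue, examplePlan.overcommitRatio)
```

Returning the overcommit ratio alongside the strategy mirrors the fields the commit says are exposed on /health (strategy, overcommit_ratio), so one computation can feed both the cache-limit decision and the reporting.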
2 files changed
Lines changed: 677 additions & 13 deletions
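
The --gpu-layers option is described as accepting 'auto' or an integer. A minimal sketch of how such a value could be parsed in plain Swift follows. The GPULayers type and its initializer are hypothetical and are not taken from this commit. Rejecting negative integers and matching 'auto' case-sensitively are assumptions.

```swift
// Hypothetical representation of the --gpu-layers value.
enum GPULayers: Equatable {
    case auto          // let the profiler decide
    case count(Int)    // explicit number of layers to place on the GPU

    // Returns nil for anything that is neither "auto" nor a non-negative integer.
    init?(argument: String) {
        if argument == "auto" {
            self = .auto
            return
        }
        guard let n = Int(argument), n >= 0 else { return nil }
        self = .count(n)
    }
}

// Usage: an unparseable value yields nil so the caller can print a usage error.
if let parsed = GPULayers(argument: "24") {
    print("gpu layers:", parsed)
}
```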