The documentation says the local inference server can load a HuggingFace model, but it is not clear what "load" means. Does it download the model again? If I already have a model saved on disk, how do I tell the local inference server to use it instead of downloading another copy? I am actually interested in loading mlx-optiq models. I know I can use the OpenAI API from mlx-optiq, but if the local inference server can load the model from disk directly, it saves a layer of communication.
Replies: 1 comment
Hi @rguiscard, let me clarify both points:

**1. Where models are stored**

OptiLLM uses HuggingFace's standard cache (`~/.cache/huggingface/hub/`). When you pass a model ID like `mlx-community/Qwen3-8B-4bit`, it downloads once and reuses it on subsequent runs, so it does not re-download. You can override the location with the `HF_HOME` environment variable.

**2. Loading MLX models (including from local disk)**

Yes, OptiLLM has native MLX support on Apple Silicon. Any MLX model is auto-detected and routed through the MLX inference pipeline (via `mlx_lm`) when the model name matches the `mlx-community/`, `mlx-`, or `-mlx-` patterns. See `optillm/inference.py` (`should_use_mlx`, `MLXInferencePipeline`).
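As a rough illustration of that name-based routing, here is a minimal sketch; the helper name is hypothetical, and the real check is `should_use_mlx` in `optillm/inference.py`, which may differ in detail:

```python
def looks_like_mlx_model(model_name: str) -> bool:
    """Hypothetical helper mirroring the documented MLX name patterns."""
    name = model_name.lower()
    # `mlx-` covers `mlx-community/...`; `-mlx-` catches markers inside a local path.
    return name.startswith("mlx-") or "-mlx-" in name

print(looks_like_mlx_model("mlx-community/Qwen3-8B-4bit"))     # True
print(looks_like_mlx_model("/Users/you/models/my-mlx-model"))  # True ("-mlx-" in the directory name)
print(looks_like_mlx_model("meta-llama/Llama-3.1-8B"))         # False, handled by the regular pipeline
```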
Quick start:

```bash
# Install MLX support
pip install mlx-lm

# Enable local inference and start the server
export OPTILLM_API_KEY=optillm
python optillm.py
```

Then call it with any MLX model (either a HF repo ID or a local path):

```python
from openai import OpenAI

client = OpenAI(api_key="optillm", base_url="http://localhost:8000/v1")
response = client.chat.completions.create(
    # Option A: HF repo ID (downloaded once, cached locally)
    model="mlx-community/Qwen3-8B-4bit",
    # Option B: local path to your MLX model directory
    # model="/Users/you/models/my-mlx-model",
    messages=[{"role": "user", "content": "Hello!"}],
)
```

For a local path to be detected as MLX, make sure the directory name contains one of the MLX markers above (for example `-mlx-`, as in `/Users/you/models/my-mlx-model`). You can also combine optimization approaches with MLX models, e.g. by prefixing an approach slug to the model name; see the sketch below. Let me know if you hit any issues!
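For instance, assuming the usual OptiLLM approach-prefix convention (approach slugs such as `moa` are prepended to the model name; check the README for the slugs your version supports), combining an approach with an MLX model would look like this:

```python
from openai import OpenAI

client = OpenAI(api_key="optillm", base_url="http://localhost:8000/v1")

# Illustrative example: the `moa-` prefix selects the mixture-of-agents approach,
# and the rest of the name is resolved as a normal MLX model.
response = client.chat.completions.create(
    model="moa-mlx-community/Qwen3-8B-4bit",
    messages=[{"role": "user", "content": "Explain KV caching in two sentences."}],
)
print(response.choices[0].message.content)
```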