The documentation says the local inference server can load a HuggingFace model, but it is not clear what "load" means. Does it download the model again? If I already have a model saved on disk, how do I tell the local inference server to use it instead of downloading another copy? I am actually interested in loading mlx-optiq models. I know I can use the OpenAI API from mlx-optiq, but if the local inference server can load the model from disk directly, it saves a layer of communication.
Replies: 1 comment
Hi @rguiscard, let me clarify both points:

**1. Where models are stored**

OptiLLM uses HuggingFace's standard cache (`~/.cache/huggingface/hub/`). When you pass a model ID like `mlx-community/Qwen3-8B-4bit`, it downloads once and reuses it on subsequent runs, so it does not re-download. You can override the location with the `HF_HOME` environment variable.

**2. Loading MLX models (including from local disk)**

Yes, OptiLLM has native MLX support on Apple Silicon. Any MLX model is auto-detected and routed through the MLX inference pipeline (via `mlx_lm`) when the model name matches the `mlx-community/`, `mlx-`, or `-mlx-` patterns. See `optillm/inference.py` (`should_use_mlx`, `MLXInferencePipeline`).
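As a rough illustration of that name-based routing, here is a minimal sketch; the helper name is hypothetical, and the real check is `should_use_mlx` in `optillm/inference.py`, which may differ in detail:

```python
def looks_like_mlx_model(model_name: str) -> bool:
    """Hypothetical helper mirroring the documented MLX name patterns."""
    name = model_name.lower()
    # `mlx-` covers `mlx-community/...`; `-mlx-` catches markers inside a local path.
    return name.startswith("mlx-") or "-mlx-" in name

print(looks_like_mlx_model("mlx-community/Qwen3-8B-4bit"))     # True
print(looks_like_mlx_model("/Users/you/models/my-mlx-model"))  # True ("-mlx-" in the directory name)
print(looks_like_mlx_model("meta-llama/Llama-3.1-8B"))         # False, handled by the regular pipeline
```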
Quick start:

```bash
# Install MLX support
pip install mlx-lm

# Enable local inference and start the server
export OPTILLM_API_KEY=optillm
python optillm.py
```

Then call it with any MLX model (either a HF repo ID or a local path):

```python
from openai import OpenAI

client = OpenAI(api_key="optillm", base_url="http://localhost:8000/v1")
response = client.chat.completions.create(
    # Option A: HF repo ID (downloaded once, cached locally)
    model="mlx-community/Qwen3-8B-4bit",
    # Option B: local path to your MLX model directory
    # model="/Users/you/models/my-mlx-model",
    messages=[{"role": "user", "content": "Hello!"}],
)
```

For a local path to be detected as MLX, make sure the directory name contains one of the MLX markers above (for example `-mlx-`, as in `/Users/you/models/my-mlx-model`). You can also combine optimization approaches with MLX models, e.g. by prefixing an approach slug to the model name; see the sketch below. Let me know if you hit any issues!
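For instance, assuming the usual OptiLLM approach-prefix convention (approach slugs such as `moa` are prepended to the model name; check the README for the slugs your version supports), combining an approach with an MLX model would look like this:

```python
from openai import OpenAI

client = OpenAI(api_key="optillm", base_url="http://localhost:8000/v1")

# Illustrative example: the `moa-` prefix selects the mixture-of-agents approach,
# and the rest of the name is resolved as a normal MLX model.
response = client.chat.completions.create(
    model="moa-mlx-community/Qwen3-8B-4bit",
    messages=[{"role": "user", "content": "Explain KV caching in two sentences."}],
)
print(response.choices[0].message.content)
```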