
Stop inference testing on macOS and adjust the "hot" models #1675

Closed

jeremyfowers wants to merge 4 commits into main from 1674-update-the-hot-models-list

Conversation

jeremyfowers (Member) commented Apr 20, 2026

Closes #1674
Closes #1525

This PR unblocks the automated llamacpp upgrade workflow in two ways.

macOS

GitHub-hosted macOS runners are virtualized and fail Metal kernel execution on newer llama.cpp builds. This PR disables any test that would load a model or run inference on macOS.

Also bumps llamacpp metal b8460 -> b8766 to match vulkan/cpu.

Hot Models

Removes qwen3.5-122b from the hot models list because:

  1. I don't think it's actually that hot
  2. it's huge and kinda borderline for fitting into memory on Windows, which makes CI flaky

GitHub-hosted macOS runners are virtualized and fail Metal kernel
execution on newer llama.cpp builds. Set LLAMA_ARG_N_GPU_LAYERS=0 so
the Metal artifact still loads and is smoke-tested, but inference
stays on CPU. Real Metal coverage must be validated manually on
Apple Silicon before bumping metal in backend_versions.json.

Also bumps llamacpp metal b8460 -> b8766 to match vulkan/cpu.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
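
For context, llama.cpp maps LLAMA_ARG_* environment variables onto its CLI flags, so LLAMA_ARG_N_GPU_LAYERS=0 should be equivalent to passing --n-gpu-layers 0. A minimal Python sketch of launching the server that way (illustrative only; the binary path and model name are placeholders):

import os
import subprocess

# Placeholder binary and model paths; see the install-location
# discussion later in this thread for the real locations.
env = dict(os.environ, LLAMA_ARG_N_GPU_LAYERS="0")  # keep every layer on CPU
subprocess.run(
    ["llama-server", "-m", "model.gguf", "--port", "8080"],
    env=env,
    check=True,
)

As the later commit message explains, this turned out not to help on virtualized runners: the crash happens during ggml Metal device init, before any CLI or environment argument is consumed.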
@jeremyfowers jeremyfowers self-assigned this Apr 20, 2026
@jeremyfowers jeremyfowers linked an issue Apr 20, 2026 that may be closed by this pull request
@jeremyfowers jeremyfowers changed the title from "Force CPU-only inference in macOS Metal CI smoke test" to "Unblock the automated llamacpp upgrade workflow" Apr 20, 2026
Geramy (Member) commented Apr 21, 2026

I think this works. Has the CI run successfully? I'm not able to find backend_versions.json on Mac. Once I do, I'll test the latest llama.

Geramy (Member) commented Apr 21, 2026

@claude on macOS, where is the backend_versions.json install path?

github-actions bot (Contributor) commented Apr 21, 2026

Claude finished @Geramy's task in 55s


backend_versions.json on macOS

  • Read source files to find the macOS installation path

The file is loaded by BackendManager via get_resource_path("resources/backend_versions.json") (backend_manager.cpp:19). The path resolution logic in path_utils.cpp:137-173 checks these locations in order:

  1. Next to the lemond executable (dev builds / local build): <path-to-lemond>/resources/backend_versions.json
  2. User local install: ~/.local/share/lemonade-server/resources/backend_versions.json
  3. System install (macOS): /Library/Application Support/Lemonade/resources/backend_versions.json
  4. /usr/local/share/lemonade-server/resources/backend_versions.json
  5. /opt/share/lemonade-server/resources/backend_versions.json

So for a standard macOS installation, it will be at:

/Library/Application Support/Lemonade/resources/backend_versions.json

If you built from source, it will be in your build output directory next to the lemond binary (e.g. build/resources/backend_versions.json).

Note: The file in this PR (src/cpp/resources/backend_versions.json) is the source file that gets copied into the install location at build/package time. Editing it there and rebuilding is how you'd test a version bump locally.
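
For illustration, the lookup order above can be mirrored in a few lines of Python (a sketch only; the real logic lives in path_utils.cpp and also handles the root-user case):

import sys
from pathlib import Path

# exe_dir is the directory containing the lemond executable.
def find_backend_versions(exe_dir: Path) -> Path | None:
    candidates = [
        exe_dir / "resources/backend_versions.json",
        Path.home() / ".local/share/lemonade-server/resources/backend_versions.json",
        Path("/usr/local/share/lemonade-server/resources/backend_versions.json"),
        Path("/opt/share/lemonade-server/resources/backend_versions.json"),
    ]
    if sys.platform == "darwin":
        # macOS system install, checked after the user-local path.
        candidates.insert(
            2,
            Path("/Library/Application Support/Lemonade/resources/backend_versions.json"),
        )
    for path in candidates:
        if path.is_file():
            return path  # first existing file wins
    return None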

Geramy (Member) commented Apr 21, 2026

I would rename this to "Unblock macOS CI pipeline by using CPU and not Metal" :)

jfowers and others added 2 commits April 21, 2026 14:00
GitHub-hosted macOS runners are virtualized and Apple's Virtualization
Framework blocks Metal GPU access, so llama-server fails to start when
loaded via the metal artifact. The earlier LLAMA_ARG_N_GPU_LAYERS=0
attempt didn't help because the crash occurs during ggml Metal backend
device init, before any llama-server CLI arg is parsed. llama.cpp's
own CI faces this same limitation and only build-tests Metal.

Changes:
- Delete the test-dmg-inference job entirely (every test in it loaded a
  model — no salvageable coverage on virtualized macOS).
- Add skip_on_macos_ci decorator in test/utils/capabilities.py, gated
  on sys.platform==darwin AND LEMONADE_CI_MODE set. Real Apple Silicon
  dev machines still run everything.
- Apply the decorator per test function across server_cli2.py,
  server_endpoints.py, test_ollama.py, server_streaming_errors.py —
  only to tests that trigger a model load (chat/completions, /load,
  /run, /generate, etc.). Non-loading tests (pull/list/status/help/
  launch-with-fake-agents/501 stubs/validation errors) still run on
  macOS CI.

Windows and Linux test coverage is unchanged. Metal inference must be
validated manually on Apple Silicon before bumping metal in
backend_versions.json.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
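
A minimal sketch of what such a decorator can look like, assuming pytest (the actual implementation in test/utils/capabilities.py may differ in detail):

import os
import sys

import pytest

# Hypothetical sketch; the real decorator lives in test/utils/capabilities.py.
# Skip only when on macOS AND running in CI (LEMONADE_CI_MODE set), so real
# Apple Silicon dev machines still run everything.
skip_on_macos_ci = pytest.mark.skipif(
    sys.platform == "darwin" and bool(os.environ.get("LEMONADE_CI_MODE")),
    reason="virtualized macOS CI runners cannot execute Metal kernels",
)

@skip_on_macos_ci
def test_chat_completion_loads_model():
    ...  # any test that triggers a model load

Gating on the environment variable rather than on the platform alone is what keeps local Apple Silicon runs unaffected.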
@jeremyfowers jeremyfowers changed the title from "Unblock the automated llamacpp upgrade workflow" to "Stop inference testing on macOS and adjust the 'hot' models" Apr 21, 2026
@jeremyfowers jeremyfowers requested a review from Geramy April 21, 2026 18:46
@jeremyfowers jeremyfowers marked this pull request as ready for review April 21, 2026 18:46
Geramy (Member) left a comment

Looks good.

Geramy (Member) commented Apr 21, 2026

@claude where does llama-server install to on macOS? Where can I find it locally when it gets unzipped.

github-actions bot (Contributor) commented Apr 21, 2026

Claude finished @Geramy's task in 56s


llama-server install path on macOS

  • Read source files to find where llama-server is installed/extracted on macOS

llama-server is downloaded from GitHub Releases and extracted into the Lemonade cache directory. On macOS, the path is:

~/.cache/lemonade/bin/llamacpp/metal/llama-server

(If running as root, it goes to /Library/Application Support/lemonade/.cache/bin/llamacpp/metal/llama-server instead.)

What gets downloaded

The artifact pulled for macOS is a .tar.gz from the llama.cpp GitHub releases:

https://github.com/ggml-org/llama.cpp/releases/download/b8766/llama-b8766-bin-macos-arm64.tar.gz

It's extracted with tar --strip-components=1 into the directory above, so the binary lands at:

~/.cache/lemonade/bin/llamacpp/metal/llama-server

A version.txt file is also written there to track the installed version.

Installation flow

  1. On first model load, LlamaCppServer::install() is called
  2. It invokes BackendUtils::get_install_directory("llamacpp", "metal"), which resolves to ~/.cache/lemonade/bin/llamacpp/metal
  3. Downloads the .tar.gz to a temp dir
  4. Extracts it into the install directory
  5. Sets chmod 0755 on the binary
  6. Cleans up the temp archive

Key source files

  • src/cpp/server/backends/backend_utils.cpp — download + extract logic
  • src/cpp/server/utils/path_utils.cpp — get_cache_dir() path resolution
  • src/cpp/resources/backend_versions.json — current pinned version (b8766 on this branch)
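
The same flow, sketched in Python for illustration (the real implementation is the C++ in backend_utils.cpp; the URL and paths are taken from the comment above):

import tarfile
import tempfile
import urllib.request
from pathlib import Path

URL = ("https://github.com/ggml-org/llama.cpp/releases/download/"
       "b8766/llama-b8766-bin-macos-arm64.tar.gz")
DEST = Path.home() / ".cache/lemonade/bin/llamacpp/metal"

def install() -> None:
    DEST.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:      # temp dir for the archive
        archive = Path(tmp) / "llama.tar.gz"
        urllib.request.urlretrieve(URL, archive)    # download the release tarball
        with tarfile.open(archive) as tar:
            for member in tar.getmembers():
                parts = member.name.split("/", 1)
                if len(parts) < 2:
                    continue                        # skip the top-level dir entry
                member.name = parts[1]              # emulate tar --strip-components=1
                tar.extract(member, DEST)
    (DEST / "llama-server").chmod(0o755)            # make the binary executable
    (DEST / "version.txt").write_text("b8766\n")    # record the installed version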

Geramy (Member) commented Apr 21, 2026

@claude if I want to run llama-server manually and load a model, where do I put the path to the model?

github-actions bot (Contributor) commented Apr 21, 2026

Claude finished @Geramy's task in 59s


Running llama-server manually with a model

  • Found how llama-server is invoked and where model files live

The key flag is -m (or --model), which takes the full path to a .gguf file.

Basic command

~/.cache/lemonade/bin/llamacpp/metal/llama-server \
  -m /path/to/your/model.gguf \
  --port 8080 \
  --ctx-size 4096 \
  --jinja

Where Lemonade stores downloaded models

Models downloaded via lemonade pull land in the HuggingFace hub cache — the same place the huggingface_hub Python library uses:

~/.cache/huggingface/hub/models--<org>--<model-name>/snapshots/<hash>/<file>.gguf

For example, after pulling llama3.2-3b-instruct, the GGUF would be somewhere like:

~/.cache/huggingface/hub/models--amd--Llama-3.2-3B-Instruct-GGUF/snapshots/<hash>/Llama-3.2-3B-Instruct.Q4_K_M.gguf

You can find the exact path by running:

find ~/.cache/huggingface/hub -name "*.gguf"

What Lemonade actually passes

From llamacpp_server.cpp:277, the full command Lemonade builds is roughly:

llama-server \
  -m <resolved_gguf_path> \
  --ctx-size <ctx_size> \
  --port <port> \
  --jinja \
  --keep 16 \
  --reasoning-format auto \
  --no-webui

So for a quick manual test, the minimum you need is:

~/.cache/lemonade/bin/llamacpp/metal/llama-server -m ~/.cache/huggingface/hub/.../model.gguf --port 8080
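
Putting the pieces together, a hedged Python sketch that resolves the GGUF through huggingface_hub (the same cache described above) and launches the server; repo_id and filename are the example names from above and may need adjusting for your model:

import subprocess
from pathlib import Path

from huggingface_hub import hf_hub_download  # pip install huggingface_hub

# repo_id and filename are assumptions based on the example above.
gguf_path = hf_hub_download(
    repo_id="amd/Llama-3.2-3B-Instruct-GGUF",
    filename="Llama-3.2-3B-Instruct.Q4_K_M.gguf",
)

server = Path.home() / ".cache/lemonade/bin/llamacpp/metal/llama-server"
subprocess.run([str(server), "-m", gguf_path, "--port", "8080"], check=True)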

Geramy (Member) commented Apr 21, 2026

@claude how does ggml-org/llama.cpp#21936 affect upgrading to the latest llama.cpp revision on Mac?

@lemonade-sdk lemonade-sdk deleted a comment from github-actions Bot Apr 21, 2026

Development

Successfully merging this pull request may close these issues.

Update the hot models list
Rev llamacpp metal to b8648
