
Stop inference testing on macOS and adjust the "hot" models #1675

Closed

jeremyfowers wants to merge 4 commits into main from 1674-update-the-hot-models-list

Conversation

jeremyfowers (Member) commented Apr 20, 2026

Closes #1674
Closes #1525

This PR unblocks the automated llamacpp upgrade workflow in two ways.

macOS

GitHub-hosted macOS runners are virtualized and fail Metal kernel execution on newer llama.cpp builds. This PR disables any test that would load a model or run inference on macOS.

Also bumps llamacpp metal b8460 -> b8766 to match vulkan/cpu.

Hot Models

Removes qwen3.5-122b from the hot models list because:

  1. I don't think it's actually that hot
  2. it's huge and kinda borderline for fitting into memory on Windows, which makes CI flaky

GitHub-hosted macOS runners are virtualized and fail Metal kernel
execution on newer llama.cpp builds. Set LLAMA_ARG_N_GPU_LAYERS=0 so
the Metal artifact still loads and is smoke-tested, but inference
stays on CPU. Real Metal coverage must be validated manually on
Apple Silicon before bumping metal in backend_versions.json.

Also bumps llamacpp metal b8460 -> b8766 to match vulkan/cpu.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
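
For context, llama.cpp maps LLAMA_ARG_* environment variables onto its CLI flags, so LLAMA_ARG_N_GPU_LAYERS=0 should be equivalent to passing --n-gpu-layers 0. A minimal Python sketch of launching the server that way (illustrative only; the binary path and model name are placeholders):

import os
import subprocess

# Placeholder binary and model paths; see the install-location
# discussion later in this thread for the real locations.
env = dict(os.environ, LLAMA_ARG_N_GPU_LAYERS="0")  # keep every layer on CPU
subprocess.run(
    ["llama-server", "-m", "model.gguf", "--port", "8080"],
    env=env,
    check=True,
)

As the later commit message explains, this turned out not to help on virtualized runners: the crash happens during ggml Metal device init, before any CLI or environment argument is consumed.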
@jeremyfowers jeremyfowers self-assigned this Apr 20, 2026
@jeremyfowers jeremyfowers linked an issue Apr 20, 2026 that may be closed by this pull request
@jeremyfowers jeremyfowers changed the title from "Force CPU-only inference in macOS Metal CI smoke test" to "Unblock the automated llamacpp upgrade workflow" Apr 20, 2026
Geramy (Member) commented Apr 21, 2026

I think this works. Has the CI run successfully? I'm not able to find backend_versions.json on Mac. Once I do, I'll test the latest llama.

Geramy (Member) commented Apr 21, 2026

@claude on macOS, where is the backend_versions.json install path?

github-actions bot (Contributor) commented Apr 21, 2026

Claude finished @Geramy's task in 55s


backend_versions.json on macOS

  • Read source files to find the macOS installation path

The file is loaded by BackendManager via get_resource_path("resources/backend_versions.json") (backend_manager.cpp:19). The path resolution logic in path_utils.cpp:137-173 checks these locations in order:

  1. Next to the lemond executable (dev builds / local build): <path-to-lemond>/resources/backend_versions.json
  2. User local install: ~/.local/share/lemonade-server/resources/backend_versions.json
  3. System install (macOS): /Library/Application Support/Lemonade/resources/backend_versions.json
  4. /usr/local/share/lemonade-server/resources/backend_versions.json
  5. /opt/share/lemonade-server/resources/backend_versions.json

So for a standard macOS installation, it will be at:

/Library/Application Support/Lemonade/resources/backend_versions.json

If you built from source, it will be in your build output directory next to the lemond binary (e.g. build/resources/backend_versions.json).

Note: The file in this PR (src/cpp/resources/backend_versions.json) is the source file that gets copied into the install location at build/package time. Editing it there and rebuilding is how you'd test a version bump locally.
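
For illustration, the lookup order above can be mirrored in a few lines of Python (a sketch only; the real logic lives in path_utils.cpp and also handles the root-user case):

import sys
from pathlib import Path

# exe_dir is the directory containing the lemond executable.
def find_backend_versions(exe_dir: Path) -> Path | None:
    candidates = [
        exe_dir / "resources/backend_versions.json",
        Path.home() / ".local/share/lemonade-server/resources/backend_versions.json",
        Path("/usr/local/share/lemonade-server/resources/backend_versions.json"),
        Path("/opt/share/lemonade-server/resources/backend_versions.json"),
    ]
    if sys.platform == "darwin":
        # macOS system install, checked after the user-local path.
        candidates.insert(
            2,
            Path("/Library/Application Support/Lemonade/resources/backend_versions.json"),
        )
    for path in candidates:
        if path.is_file():
            return path  # first existing file wins
    return None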

Geramy (Member) commented Apr 21, 2026

I would rename this to "Unblock macOS CI pipeline by using CPU and not Metal" :)

jfowers and others added 2 commits April 21, 2026 14:00
GitHub-hosted macOS runners are virtualized and Apple's Virtualization
Framework blocks Metal GPU access, so llama-server fails to start when
loaded via the metal artifact. The earlier LLAMA_ARG_N_GPU_LAYERS=0
attempt didn't help because the crash occurs during ggml Metal backend
device init, before any llama-server CLI arg is parsed. llama.cpp's
own CI faces this same limitation and only build-tests Metal.

Changes:
- Delete the test-dmg-inference job entirely (every test in it loaded a
  model — no salvageable coverage on virtualized macOS).
- Add skip_on_macos_ci decorator in test/utils/capabilities.py, gated
  on sys.platform==darwin AND LEMONADE_CI_MODE set. Real Apple Silicon
  dev machines still run everything.
- Apply the decorator per test function across server_cli2.py,
  server_endpoints.py, test_ollama.py, server_streaming_errors.py —
  only to tests that trigger a model load (chat/completions, /load,
  /run, /generate, etc.). Non-loading tests (pull/list/status/help/
  launch-with-fake-agents/501 stubs/validation errors) still run on
  macOS CI.

Windows and Linux test coverage is unchanged. Metal inference must be
validated manually on Apple Silicon before bumping metal in
backend_versions.json.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
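
A minimal sketch of what such a decorator can look like, assuming pytest (the actual implementation in test/utils/capabilities.py may differ in detail):

import os
import sys

import pytest

# Hypothetical sketch; the real decorator lives in test/utils/capabilities.py.
# Skip only when on macOS AND running in CI (LEMONADE_CI_MODE set), so real
# Apple Silicon dev machines still run everything.
skip_on_macos_ci = pytest.mark.skipif(
    sys.platform == "darwin" and bool(os.environ.get("LEMONADE_CI_MODE")),
    reason="virtualized macOS CI runners cannot execute Metal kernels",
)

@skip_on_macos_ci
def test_chat_completion_loads_model():
    ...  # any test that triggers a model load

Gating on the environment variable rather than on the platform alone is what keeps local Apple Silicon runs unaffected.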
@jeremyfowers jeremyfowers changed the title from "Unblock the automated llamacpp upgrade workflow" to "Stop inference testing on macOS and adjust the 'hot' models" Apr 21, 2026
@jeremyfowers jeremyfowers requested a review from Geramy April 21, 2026 18:46
@jeremyfowers jeremyfowers marked this pull request as ready for review April 21, 2026 18:46
Geramy (Member) left a comment

Looks good.

Geramy (Member) commented Apr 21, 2026

@claude where does llama-server install to on macOS? Where can I find it locally when it gets unzipped.

github-actions bot (Contributor) commented Apr 21, 2026

Claude finished @Geramy's task in 56s


llama-server install path on macOS

  • Read source files to find where llama-server is installed/extracted on macOS

llama-server is downloaded from GitHub Releases and extracted into the Lemonade cache directory. On macOS, the path is:

~/.cache/lemonade/bin/llamacpp/metal/llama-server

(If running as root, it goes to /Library/Application Support/lemonade/.cache/bin/llamacpp/metal/llama-server instead.)

What gets downloaded

The artifact pulled for macOS is a .tar.gz from the llama.cpp GitHub releases:

https://github.com/ggml-org/llama.cpp/releases/download/b8766/llama-b8766-bin-macos-arm64.tar.gz

It's extracted with tar --strip-components=1 into the directory above, so the binary lands at:

~/.cache/lemonade/bin/llamacpp/metal/llama-server

A version.txt file is also written there to track the installed version.

Installation flow

  1. On first model load, LlamaCppServer::install() is called
  2. It invokes BackendUtils::get_install_directory("llamacpp", "metal"), which resolves to ~/.cache/lemonade/bin/llamacpp/metal
  3. Downloads the .tar.gz to a temp dir
  4. Extracts it into the install directory
  5. Sets chmod 0755 on the binary
  6. Cleans up the temp archive

Key source files

  • src/cpp/server/backends/backend_utils.cpp — download + extract logic
  • src/cpp/server/utils/path_utils.cpp — get_cache_dir() path resolution
  • src/cpp/resources/backend_versions.json — current pinned version (b8766 on this branch)
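
The same flow, sketched in Python for illustration (the real implementation is the C++ in backend_utils.cpp; the URL and paths are taken from the comment above):

import tarfile
import tempfile
import urllib.request
from pathlib import Path

URL = ("https://github.com/ggml-org/llama.cpp/releases/download/"
       "b8766/llama-b8766-bin-macos-arm64.tar.gz")
DEST = Path.home() / ".cache/lemonade/bin/llamacpp/metal"

def install() -> None:
    DEST.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:      # temp dir for the archive
        archive = Path(tmp) / "llama.tar.gz"
        urllib.request.urlretrieve(URL, archive)    # download the release tarball
        with tarfile.open(archive) as tar:
            for member in tar.getmembers():
                parts = member.name.split("/", 1)
                if len(parts) < 2:
                    continue                        # skip the top-level dir entry
                member.name = parts[1]              # emulate tar --strip-components=1
                tar.extract(member, DEST)
    (DEST / "llama-server").chmod(0o755)            # make the binary executable
    (DEST / "version.txt").write_text("b8766\n")    # record the installed version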

Geramy (Member) commented Apr 21, 2026

@claude if I want to run llama-server manually and load a model, where do I put the path to the model?

github-actions bot (Contributor) commented Apr 21, 2026

Claude finished @Geramy's task in 59s


Running llama-server manually with a model

  • Found how llama-server is invoked and where model files live

The key flag is -m (or --model), which takes the full path to a .gguf file.

Basic command

~/.cache/lemonade/bin/llamacpp/metal/llama-server \
  -m /path/to/your/model.gguf \
  --port 8080 \
  --ctx-size 4096 \
  --jinja

Where Lemonade stores downloaded models

Models downloaded via lemonade pull land in the HuggingFace hub cache — the same place the huggingface_hub Python library uses:

~/.cache/huggingface/hub/models--<org>--<model-name>/snapshots/<hash>/<file>.gguf

For example, after pulling llama3.2-3b-instruct, the GGUF would be somewhere like:

~/.cache/huggingface/hub/models--amd--Llama-3.2-3B-Instruct-GGUF/snapshots/<hash>/Llama-3.2-3B-Instruct.Q4_K_M.gguf

You can find the exact path by running:

find ~/.cache/huggingface/hub -name "*.gguf"

What Lemonade actually passes

From llamacpp_server.cpp:277, the full command Lemonade builds is roughly:

llama-server \
  -m <resolved_gguf_path> \
  --ctx-size <ctx_size> \
  --port <port> \
  --jinja \
  --keep 16 \
  --reasoning-format auto \
  --no-webui

So for a quick manual test, the minimum you need is:

~/.cache/lemonade/bin/llamacpp/metal/llama-server -m ~/.cache/huggingface/hub/.../model.gguf --port 8080
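
Putting the pieces together, a hedged Python sketch that resolves the GGUF through huggingface_hub (the same cache described above) and launches the server; repo_id and filename are the example names from above and may need adjusting for your model:

import subprocess
from pathlib import Path

from huggingface_hub import hf_hub_download  # pip install huggingface_hub

# repo_id and filename are assumptions based on the example above.
gguf_path = hf_hub_download(
    repo_id="amd/Llama-3.2-3B-Instruct-GGUF",
    filename="Llama-3.2-3B-Instruct.Q4_K_M.gguf",
)

server = Path.home() / ".cache/lemonade/bin/llamacpp/metal/llama-server"
subprocess.run([str(server), "-m", gguf_path, "--port", "8080"], check=True)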

Geramy (Member) commented Apr 21, 2026

@claude how does ggml-org/llama.cpp#21936 affect upgrading to the latest llama.cpp revision on Mac?

@lemonade-sdk lemonade-sdk deleted a comment from github-actions Bot Apr 21, 2026

Development

Successfully merging this pull request may close these issues.

Update the hot models list
Rev llamacpp metal to b8648
