Stop inference testing on macOS and adjust the "hot" models #1675
jeremyfowers wants to merge 4 commits into main from
Conversation
GitHub-hosted macOS runners are virtualized and fail Metal kernel execution on newer llama.cpp builds. Set LLAMA_ARG_N_GPU_LAYERS=0 so the Metal artifact still loads and is smoke-tested, but inference stays on CPU. Real Metal coverage must be validated manually on Apple Silicon before bumping metal in backend_versions.json. Also bumps llamacpp metal b8460 -> b8766 to match vulkan/cpu. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
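For context, LLAMA_ARG_N_GPU_LAYERS is one of llama.cpp's LLAMA_ARG_* environment variables that mirror llama-server CLI flags. A minimal sketch of the intended mechanism, with hypothetical binary and model paths:

```python
import os
import subprocess

# With zero layers offloaded, the Metal build still loads but all
# compute stays on CPU (binary and model paths are hypothetical).
env = dict(os.environ, LLAMA_ARG_N_GPU_LAYERS="0")
subprocess.run(
    ["./llama-server", "-m", "models/example.gguf", "--port", "8080"],
    env=env,
    check=True,
)
```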
I think this works, but has the CI run successfully? I'm not able to find backend_versions.json on my Mac. Once I do, I'll test the latest llama.cpp.
@claude on macOS, what is the install path for backend_versions.json?
Claude finished @Geramy's task in 55s
I would rename this to "Unblock macOS CI pipeline by using CPU and not Metal" :)
GitHub-hosted macOS runners are virtualized and Apple's Virtualization Framework blocks Metal GPU access, so llama-server fails to start when loaded via the metal artifact. The earlier LLAMA_ARG_N_GPU_LAYERS=0 attempt didn't help because the crash occurs during ggml Metal backend device init, before any llama-server CLI arg is parsed. llama.cpp's own CI faces this same limitation and only build-tests Metal.

Changes:
- Delete the test-dmg-inference job entirely (every test in it loaded a model, so there is no salvageable coverage on virtualized macOS).
- Add a skip_on_macos_ci decorator in test/utils/capabilities.py, gated on sys.platform == "darwin" AND LEMONADE_CI_MODE being set. Real Apple Silicon dev machines still run everything.
- Apply the decorator per test function across server_cli2.py, server_endpoints.py, test_ollama.py, and server_streaming_errors.py, only to tests that trigger a model load (chat/completions, /load, /run, /generate, etc.). Non-loading tests (pull/list/status/help/launch-with-fake-agents/501 stubs/validation errors) still run on macOS CI.

Windows and Linux test coverage is unchanged. Metal inference must be validated manually on Apple Silicon before bumping metal in backend_versions.json.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
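A minimal sketch of what skip_on_macos_ci could look like, assuming pytest and the gating described in the commit message (the actual implementation in test/utils/capabilities.py may differ):

```python
import os
import sys

import pytest

# Skip model-loading tests only on virtualized macOS CI runners: both
# conditions must hold, so local Apple Silicon dev machines (where
# LEMONADE_CI_MODE is unset) still run the full suite.
skip_on_macos_ci = pytest.mark.skipif(
    sys.platform == "darwin" and bool(os.environ.get("LEMONADE_CI_MODE")),
    reason="Virtualized macOS CI runners cannot execute Metal kernels, "
    "so tests that load or run a model are skipped",
)
```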
@claude where does llama-server install to on macOS? Where can I find it locally once it gets unzipped?
Claude finished @Geramy's task in 56s
@claude if I want to run llama-server manually and load a model, where do I put the folder path to the model?
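For reference, llama-server expects the path to the .gguf model file itself via -m/--model, not a containing folder; a minimal sketch with a hypothetical local path:

```python
import subprocess

# Pass the model file directly with -m; pointing -m at a directory
# will fail (the path shown is hypothetical).
subprocess.run(
    ["llama-server", "-m", "/Users/me/models/example.gguf"],
    check=True,
)
```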
Claude finished @Geramy's task in 59s
@claude how does ggml-org/llama.cpp#21936 affect upgrading to the latest llama.cpp revision on Mac?
Closes #1674
Closes #1525
This PR unblocks the automated llamacpp upgrade workflow in two ways.
macOS
GitHub-hosted macOS runners are virtualized and fail Metal kernel execution on newer llama.cpp builds. This PR disables every test that would load a model or run inference on macOS.
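For illustration, a model-loading test might be marked like this (the test body, endpoint, and client fixture are hypothetical; the decorator is the one added in test/utils/capabilities.py):

```python
from utils.capabilities import skip_on_macos_ci  # import path assumed

@skip_on_macos_ci
def test_chat_completion_loads_model(client):
    # Hitting chat/completions forces a model load, which fails on
    # virtualized macOS runners, so this test is skipped there.
    response = client.post(
        "/api/v1/chat/completions",
        json={
            "model": "example-model",
            "messages": [{"role": "user", "content": "hi"}],
        },
    )
    assert response.status_code == 200
```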
Also bumps llamacpp metal b8460 -> b8766 to match vulkan/cpu.
Hot Models
Removes qwen3.5-122b from the hot models list because: