
common: fix split model loading by sorting file list #21535

Closed

brettp wants to merge 1 commit into ggml-org:master from brettp:comm-split-file-sort

Conversation

@brettp

@brettp brettp commented Apr 6, 2026

Overview

Fix split model loading from cache in offline mode.

Refs: #21019 #21016

Additional information

File order is not guaranteed when listing directories. In offline mode, files are not sorted when read from the cache, which can result in the wrong part being loaded first if the files' metadata has changed (e.g., by moving or symlinking them).

This PR takes a minimal approach: always sort the model parts so the correct part is loaded first, whether downloaded or loaded from the cache, in online or offline mode.

A test is included; it can be relocated if tests/test-gguf-model-data.cpp is not the best spot, since it introduces additional dependencies there.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES. AI was used to familiarize myself with the code base when deciding the best injection point, to ask about sorting functions, and to write the test.

@brettp brettp requested review from a team and ggerganov as code owners April 6, 2026 22:59
@github-actions github-actions bot added the testing Everything test related label Apr 6, 2026
@brettp
Author

brettp commented Apr 7, 2026

@ggerganov - Just a heads up: the github-actions bot mislabeled this as testing-only. This PR does include a test, but primarily it fixes a reproducible bug in split model loading.

@angt
Member

angt commented Apr 8, 2026

Hi @brettp,

What issue are you solving exactly?
The one mentioned is solved here: #21019

@brettp
Author

brettp commented Apr 8, 2026

Hi @angt - Thanks for the response.

#21019 did not address the case where llama-server is started in offline mode and the FS returns the split files out of order. In my case, this happened after I had to relocate the HF cache dir.

Here's a quick demo built from this morning's latest master:

# dir order of the faulty models returns 00002 first

Mac:~ brett$ ls -lU /Users/brett/.cache/huggingface/hub/models--unsloth--gemma-4-26B-A4B-it-GGUF/snapshots/80bdc5e5210f6abe797a0cd0388bef5a7f9b240b/BF16
total 0
lrwxr-xr-x 1 brett staff 79 Apr  4 00:15 gemma-4-26B-A4B-it-BF16-00002-of-00002.gguf -> ../../../blobs/c79ca03db75b9a8644cf7dca80c248f4957324410547a88cfb5b0c07875516da
lrwxr-xr-x 1 brett staff 79 Apr  4 00:25 gemma-4-26B-A4B-it-BF16-00001-of-00002.gguf -> ../../../blobs/230cfdee23fc55e9d5c7488af7a1e4d1310ab80fc259cb91cab988bfd6bf2666

# offline mode fails

Mac:llama.cpp brett$ ./build/bin/llama-server -v -hf unsloth/gemma-4-26B-A4B-it-GGUF:BF16 --offline
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.009 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 115448.73 MB
migrate_old_cache_to_hf_cache: skipping migration in offline mode (will run when online)
common_download_file_single: required file is not available in cache (offline mode): /Users/brett/Library/Caches/llama.cpp/unsloth_gemma-4-26B-A4B-it-GGUF_preset.ini
no remote preset found, skipping
common_download_file_single: using cached file (offline mode): /Users/brett/.cache/huggingface/hub/models--unsloth--gemma-4-26B-A4B-it-GGUF/snapshots/80bdc5e5210f6abe797a0cd0388bef5a7f9b240b/BF16/gemma-4-26B-A4B-it-BF16-00002-of-00002.gguf
common_download_file_single: using cached file (offline mode): /Users/brett/.cache/huggingface/hub/models--unsloth--gemma-4-26B-A4B-it-GGUF/snapshots/80bdc5e5210f6abe797a0cd0388bef5a7f9b240b/BF16/gemma-4-26B-A4B-it-BF16-00001-of-00002.gguf
common_download_file_single: using cached file (offline mode): /Users/brett/.cache/huggingface/hub/models--unsloth--gemma-4-26B-A4B-it-GGUF/snapshots/80bdc5e5210f6abe797a0cd0388bef5a7f9b240b/mmproj-BF16.gguf
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build_info: b8714-3ba12fed0
system_info: n_threads = 12 (n_threads_batch = 12) / 16 | MTL : EMBED_LIBRARY = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | DOTPROD = 1 | SME = 1 | ACCELERATE = 1 | OPENMP = 1 | REPACK = 1 |
Running without SSL
init: using 15 threads for HTTP server
start: binding port with default address family
main: loading model
srv    load_model: loading model '/Users/brett/.cache/huggingface/hub/models--unsloth--gemma-4-26B-A4B-it-GGUF/snapshots/80bdc5e5210f6abe797a0cd0388bef5a7f9b240b/BF16/gemma-4-26B-A4B-it-BF16-00002-of-00002.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: getting device memory data for initial parameters:
llama_model_load_from_file_impl: using device MTL0 (Apple M4 Max) (unknown id) - 110100 MiB free
llama_model_load: error loading model: illegal split file idx: 1 (file: /Users/brett/.cache/huggingface/hub/models--unsloth--gemma-4-26B-A4B-it-GGUF/snapshots/80bdc5e5210f6abe797a0cd0388bef5a7f9b240b/BF16/gemma-4-26B-A4B-it-BF16-00002-of-00002.gguf), model must be loaded with the first split
llama_model_load_from_file_impl: failed to load model
llama_params_fit: encountered an error while trying to fit params to free device memory: failed to load model
llama_params_fit: fitting params to free memory took 0.00 seconds
llama_model_load_from_file_impl: using device MTL0 (Apple M4 Max) (unknown id) - 110100 MiB free
llama_model_load: error loading model: illegal split file idx: 1 (file: /Users/brett/.cache/huggingface/hub/models--unsloth--gemma-4-26B-A4B-it-GGUF/snapshots/80bdc5e5210f6abe797a0cd0388bef5a7f9b240b/BF16/gemma-4-26B-A4B-it-BF16-00002-of-00002.gguf), model must be loaded with the first split
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/Users/brett/.cache/huggingface/hub/models--unsloth--gemma-4-26B-A4B-it-GGUF/snapshots/80bdc5e5210f6abe797a0cd0388bef5a7f9b240b/BF16/gemma-4-26B-A4B-it-BF16-00002-of-00002.gguf'
srv    load_model: failed to load model, '/Users/brett/.cache/huggingface/hub/models--unsloth--gemma-4-26B-A4B-it-GGUF/snapshots/80bdc5e5210f6abe797a0cd0388bef5a7f9b240b/BF16/gemma-4-26B-A4B-it-BF16-00002-of-00002.gguf'
srv    operator(): operator(): cleaning up before exit...
main: exiting due to model loading error

This could be a quirk of macOS, but the included test demonstrates this behavior by writing the files out of order and forcing offline mode. It currently fails on master.

File order is not guaranteed when listing dirs. In offline mode, files are
not sorted when read from cache, which can result in the wrong part loading
first if the files have changed metadata (e.g., by moving, symlinking, etc.).

This is a minimal approach to ensure model files are correctly sorted
when downloaded or loaded from cache, in online or offline mode. Tests are
included.

Disclaimer: An AI agent was used to refine the approach and write the test.

refs: ggml-org#21019 ggml-org#21016
@brettp brettp force-pushed the comm-split-file-sort branch from ede0917 to c620659 on April 10, 2026 13:07
@angt
Member

angt commented Apr 10, 2026

Can you test with master to see if it's still an issue?

@ngxson
Contributor

ngxson commented Apr 10, 2026

IMO it might be more convenient if libllama supported loading from a non-first shard. It should not be too complicated to implement.

@brettp
Author

brettp commented Apr 11, 2026

@angt - Confirmed this is fixed by fb38d6f. Thank you!

@brettp brettp closed this Apr 11, 2026