
common: fix split model loading by sorting file list #21535

Closed

brettp wants to merge 1 commit into ggml-org:master from brettp:comm-split-file-sort

Conversation

@brettp

@brettp brettp commented Apr 6, 2026

Overview

Fix split model loading from cache in offline mode.

Refs: #21019 #21016

Additional information

File order is not guaranteed when listing directories. In offline mode, files are not sorted when read from the cache, which can result in the wrong part being loaded first if the files' metadata has changed (e.g., by moving or symlinking them).

This PR takes a minimal approach: always sort the model parts so the correct part is loaded first, whether downloaded or loaded from the cache, in online or offline mode.

A test is included; it can be relocated if tests/test-gguf-model-data.cpp is not the best spot, since it introduces additional dependencies there.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES. AI was used to familiarize myself with the code base when deciding the best injection point, to ask about sorting functions, and to write the test.

@brettp brettp requested review from a team and ggerganov as code owners April 6, 2026 22:59
@github-actions github-actions bot added the testing Everything test related label Apr 6, 2026
@brettp
Author

brettp commented Apr 7, 2026

@ggerganov - Just a heads up: the github-actions bot mislabeled this as testing-only. This PR does include a test, but primarily it fixes a reproducible bug in split model loading.

@angt
Member

angt commented Apr 8, 2026

Hi @brettp,

What issue are you solving exactly?
The one mentioned is solved here: #21019

@brettp
Author

brettp commented Apr 8, 2026

Hi @angt - Thanks for the response.

#21019 did not address the case where llama-server is started in offline mode and the FS returns the split files out of order. In my case, this happened after I had to relocate the HF cache dir.

Here's a quick demo built from this morning's latest master:

# dir order of the faulty models returns 00002 first

Mac:~ brett$ ls -lU /Users/brett/.cache/huggingface/hub/models--unsloth--gemma-4-26B-A4B-it-GGUF/snapshots/80bdc5e5210f6abe797a0cd0388bef5a7f9b240b/BF16
total 0
lrwxr-xr-x 1 brett staff 79 Apr  4 00:15 gemma-4-26B-A4B-it-BF16-00002-of-00002.gguf -> ../../../blobs/c79ca03db75b9a8644cf7dca80c248f4957324410547a88cfb5b0c07875516da
lrwxr-xr-x 1 brett staff 79 Apr  4 00:25 gemma-4-26B-A4B-it-BF16-00001-of-00002.gguf -> ../../../blobs/230cfdee23fc55e9d5c7488af7a1e4d1310ab80fc259cb91cab988bfd6bf2666

# offline mode fails

Mac:llama.cpp brett$ ./build/bin/llama-server -v -hf unsloth/gemma-4-26B-A4B-it-GGUF:BF16 --offline
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.009 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 115448.73 MB
migrate_old_cache_to_hf_cache: skipping migration in offline mode (will run when online)
common_download_file_single: required file is not available in cache (offline mode): /Users/brett/Library/Caches/llama.cpp/unsloth_gemma-4-26B-A4B-it-GGUF_preset.ini
no remote preset found, skipping
common_download_file_single: using cached file (offline mode): /Users/brett/.cache/huggingface/hub/models--unsloth--gemma-4-26B-A4B-it-GGUF/snapshots/80bdc5e5210f6abe797a0cd0388bef5a7f9b240b/BF16/gemma-4-26B-A4B-it-BF16-00002-of-00002.gguf
common_download_file_single: using cached file (offline mode): /Users/brett/.cache/huggingface/hub/models--unsloth--gemma-4-26B-A4B-it-GGUF/snapshots/80bdc5e5210f6abe797a0cd0388bef5a7f9b240b/BF16/gemma-4-26B-A4B-it-BF16-00001-of-00002.gguf
common_download_file_single: using cached file (offline mode): /Users/brett/.cache/huggingface/hub/models--unsloth--gemma-4-26B-A4B-it-GGUF/snapshots/80bdc5e5210f6abe797a0cd0388bef5a7f9b240b/mmproj-BF16.gguf
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build_info: b8714-3ba12fed0
system_info: n_threads = 12 (n_threads_batch = 12) / 16 | MTL : EMBED_LIBRARY = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | DOTPROD = 1 | SME = 1 | ACCELERATE = 1 | OPENMP = 1 | REPACK = 1 |
Running without SSL
init: using 15 threads for HTTP server
start: binding port with default address family
main: loading model
srv    load_model: loading model '/Users/brett/.cache/huggingface/hub/models--unsloth--gemma-4-26B-A4B-it-GGUF/snapshots/80bdc5e5210f6abe797a0cd0388bef5a7f9b240b/BF16/gemma-4-26B-A4B-it-BF16-00002-of-00002.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: getting device memory data for initial parameters:
llama_model_load_from_file_impl: using device MTL0 (Apple M4 Max) (unknown id) - 110100 MiB free
llama_model_load: error loading model: illegal split file idx: 1 (file: /Users/brett/.cache/huggingface/hub/models--unsloth--gemma-4-26B-A4B-it-GGUF/snapshots/80bdc5e5210f6abe797a0cd0388bef5a7f9b240b/BF16/gemma-4-26B-A4B-it-BF16-00002-of-00002.gguf), model must be loaded with the first split
llama_model_load_from_file_impl: failed to load model
llama_params_fit: encountered an error while trying to fit params to free device memory: failed to load model
llama_params_fit: fitting params to free memory took 0.00 seconds
llama_model_load_from_file_impl: using device MTL0 (Apple M4 Max) (unknown id) - 110100 MiB free
llama_model_load: error loading model: illegal split file idx: 1 (file: /Users/brett/.cache/huggingface/hub/models--unsloth--gemma-4-26B-A4B-it-GGUF/snapshots/80bdc5e5210f6abe797a0cd0388bef5a7f9b240b/BF16/gemma-4-26B-A4B-it-BF16-00002-of-00002.gguf), model must be loaded with the first split
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/Users/brett/.cache/huggingface/hub/models--unsloth--gemma-4-26B-A4B-it-GGUF/snapshots/80bdc5e5210f6abe797a0cd0388bef5a7f9b240b/BF16/gemma-4-26B-A4B-it-BF16-00002-of-00002.gguf'
srv    load_model: failed to load model, '/Users/brett/.cache/huggingface/hub/models--unsloth--gemma-4-26B-A4B-it-GGUF/snapshots/80bdc5e5210f6abe797a0cd0388bef5a7f9b240b/BF16/gemma-4-26B-A4B-it-BF16-00002-of-00002.gguf'
srv    operator(): operator(): cleaning up before exit...
main: exiting due to model loading error

This could be a quirk of macOS, but the included test demonstrates this behavior by writing the files out of order and forcing offline mode. It currently fails on master.

File order is not guaranteed when listing dirs. In offline mode, files are
not sorted when read from cache, which can result in the wrong part loading
first if the files have changed metadata (e.g., by moving, symlinking, etc.).

This is a minimal approach to ensure model files are correctly sorted
when downloaded or loaded from cache, in online or offline mode. Tests are
included.

Disclaimer: An AI agent was used to refine the approach and write the test.

refs: ggml-org#21019 ggml-org#21016
@brettp brettp force-pushed the comm-split-file-sort branch from ede0917 to c620659 on April 10, 2026 13:07
@angt
Member

angt commented Apr 10, 2026

Can you test with master to see if it's still an issue?

@ngxson
Contributor

ngxson commented Apr 10, 2026

IMO it might be more convenient if libllama supported loading from a non-first shard. It should not be too complicated to implement.

@brettp
Author

brettp commented Apr 11, 2026

@angt - Confirmed this is fixed by fb38d6f. Thank you!

@brettp brettp closed this Apr 11, 2026