
Update llama.cpp and support arm64 #19

Open
rodydavis wants to merge 3 commits into asg017:main from rodydavis:update-llama-cpp

Conversation

@rodydavis

  • Updates the llama.cpp submodule to the latest version.
  • Adapts the code to the new llama.cpp API.
  • Fixes the build process.
  • Updates the tests to reflect the changes in the embeddings.
  • Adds support for arm64.
  • Updates the .gitignore file.

@rodydavis
Author

Fixes #18

@O-J1

O-J1 commented Sep 5, 2025

Worked for me on Ubuntu in a minimal test script. If anyone else sees the size difference and is confused, that seems(?) to be expected (2 MB -> 76 KB).

I wasn't able to figure out cross-compiling it to Windows, sadly.

Thanks for this @rodydavis

@vlasky

vlasky commented Dec 4, 2025

Excellent work on this PR, @rodydavis! The API adaptations are spot-on.

One thing worth noting: the llama.cpp update changes the embedding values produced by BERT models. The test uses "alex garcia" with all-MiniLM-L6-v2:

Old llama.cpp (2b33896): first float ≈ -0.092
New llama.cpp (4fd1242): first float ≈ +0.005

I investigated the cause. It appears to be due to ggml-org/llama.cpp@6562e5a ("context: allow cache-less context for embeddings") which optimizes BERT models to skip KV cache allocation. A side effect is that llama_decode() now redirects to llama_encode() for these models, which returns the [CLS] token embedding instead of the last token embedding.
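To make the pooling change concrete, here is a toy sketch (the numbers are made up, not real model outputs) of why switching from last-token to [CLS] pooling changes the returned vector even when the per-token hidden states are identical:

```python
# Hypothetical per-token embeddings for a 3-token sequence (made-up data).
# Row 0 is the [CLS] token; the final row is the last token.
token_embeddings = [
    [0.005, 0.10, -0.20, 0.30],   # [CLS]
    [0.50, 0.60, 0.70, 0.80],
    [-0.092, 0.01, 0.02, 0.03],   # last token
]

# Old llama.cpp behavior for these models: return the last token's row.
old_style = token_embeddings[-1]
# New behavior (llama_decode() redirecting to llama_encode()): the [CLS] row.
new_style = token_embeddings[0]

print(old_style[0])  # -0.092
print(new_style[0])  # 0.005
```

Same model state, different row selected, hence entirely different absolute values in the stored embedding.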

Claude advises me that "this new behavior is actually more correct for BERT models; [CLS] is the designated sentence-level representation". Absolute values have changed, but semantic similarity is preserved. Nonetheless, embeddings from old and new versions aren't directly comparable.

The lesson is: users need to be aware that updates to llama.cpp have the potential to affect embeddings, which may require stored embeddings to be regenerated.
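If you do regenerate stored embeddings, a cosine-similarity check is a simple way to sanity-check that neighbor rankings are preserved within one llama.cpp version. A minimal plain-Python sketch (the function name is mine, not part of sqlite-lembed):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors with the same direction but different scale: similarity ~ 1.0,
# even though every absolute component differs.
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]
print(round(cosine_similarity(v1, v2), 6))  # 1.0
```

The point is that similarity is about direction, not magnitude, so comparisons are only meaningful between embeddings produced by the same llama.cpp version and pooling behavior.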

Here is a full breakdown

@vlasky

vlasky commented Dec 4, 2025

@rodydavis you might also be interested in checking out PR #21

vlasky added a commit to vlasky/sqlite-lembed that referenced this pull request Dec 4, 2025
Updates the llama.cpp submodule and adapts code to the new API:
- llama_tokenize() now takes vocab from llama_model_get_vocab()
- llama_n_embd() -> llama_model_n_embd()
- llama_kv_cache_clear() -> llama_memory_clear(llama_get_memory(), false)
- llama_token_get_score() -> llama_vocab_get_score()
- llama_token_to_piece() now takes vocab and additional parameter
- llama_load_model_from_file() -> llama_model_load_from_file()
- llama_new_context_with_model() -> llama_init_from_model()
- llama_free_model() -> llama_model_free()
- ggml_static -> ggml in CMakeLists.txt
- Remove seed from context_options (no longer supported)

Based on PR asg017#19 by @rodydavis.

Co-Authored-By: Rody Davis <rody.davis.jr@gmail.com>
Co-Authored-By: Claude <noreply@anthropic.com>
vlasky added a commit to vlasky/sqlite-lembed that referenced this pull request Dec 4, 2025
- Add Darwin arm64/x86_64 architecture detection in Makefile
- Add tests/__pycache__/ to .gitignore

From PR asg017#19 by @rodydavis.

Co-Authored-By: Rody Davis <rody.davis.jr@gmail.com>
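The Darwin arm64/x86_64 detection mentioned in that commit can be sketched as follows (a hypothetical Python mirror of `uname -s` / `uname -m` style Makefile checks; the real logic lives in the referenced commit):

```python
import platform

# Equivalent of a Makefile's `uname -s` / `uname -m` probes.
system = platform.system()    # e.g. "Darwin" or "Linux"
machine = platform.machine()  # e.g. "arm64" or "x86_64"

# Pick a build target string based on the detected platform
# (target names here are illustrative, not from the Makefile).
if system == "Darwin" and machine == "arm64":
    target = "macos-arm64"
elif system == "Darwin" and machine == "x86_64":
    target = "macos-x86_64"
else:
    target = f"{system.lower()}-{machine}"

print(target)
```

On Apple Silicon this selects the arm64 path; on Intel Macs the x86_64 path; everything else falls through to a generic label.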
