server: (anthropic API) fix prefix caching by kvc0 · Pull Request #21793 · ggml-org/llama.cpp

kvc0 · 2026-04-12T05:55:49Z

Overview

When testing Claude Code against llama.cpp, I noticed that only about
18577 tokens of the prompt were reused (n_past 18577) even when the
context was 60k or more. The log in llama-server says:

slot update_slots: id  3 | task 10342 | old: ... ; cch= | defa0;You are
slot update_slots: id  3 | task 10342 | new: ... ; cch= | 1c8b4;

I observed that the cch value changed on every request. From what I've read,
the x-anthropic-billing-header system message seems to be specially
handled inside the Anthropic API. I could remove it, but a meaningful
string is sometimes included at the end. So instead, I just replace
the changing cch checksum with fffff.

I'm treating this as an anthropic message body API detail - I think this
is the right way to do this, but by all means please correct me!

It's always 5 hexadecimal characters, but I've written the replacement
defensively in case they change the protocol.
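
For readers who want a concrete picture of the replacement described above, here is a minimal sketch. The function name and signature match the declaration quoted in the review thread below; the body, the exact prefix constant, and the requirement of a trailing `;` after the checksum are assumptions for illustration, not the merged code.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Sketch only: name and signature come from this PR's diff, the body is an
// illustration and may differ from what was merged.
void normalize_anthropic_billing_header(std::string & system_text) {
    // Assumed constants, based on the header seen in the logs.
    static const std::string prefix = "x-anthropic-billing-header:";
    static const std::string key    = "cch=";

    // Cheap early-out: system messages that do not start with the prefix
    // are left untouched.
    if (system_text.compare(0, prefix.size(), prefix) != 0) {
        return;
    }

    // Find the first "cch=" after the prefix.
    const size_t key_pos = system_text.find(key, prefix.size());
    if (key_pos == std::string::npos) {
        return;
    }

    // Scan the run of hexadecimal digits that follows "cch=".
    const size_t val_begin = key_pos + key.size();
    size_t val_end = val_begin;
    while (val_end < system_text.size() &&
           std::isxdigit(static_cast<unsigned char>(system_text[val_end]))) {
        ++val_end;
    }

    // Defensive: accept any non-zero number of hex digits, but only rewrite
    // when the value is terminated by ';' as in the observed header.
    if (val_end == val_begin || val_end >= system_text.size() || system_text[val_end] != ';') {
        return;
    }

    // Replace the changing checksum with a constant so the prompt prefix is stable.
    system_text.replace(val_begin, val_end - val_begin, "fffff");
}
```

Called on a system message that starts with the billing header, this rewrites only the checksum and keeps the rest of the text, including whatever follows the `;`. Any other system message is returned unchanged.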

Additional information

When asking "explain this repo to me" on a different repo, using a freshly started llama-server, the second request logs the following.
Before:

selected slot by LCP similarity, sim_best = 0.566 (> 0.050 thold), f_keep = 0.704

This is the best case. It gets progressively worse as the conversation grows, because the
matched length never goes beyond about 18577 tokens (up to 18580 theoretically, but I never saw higher than 18578).

After:

selected slot by LCP similarity, sim_best = 0.805 (> 0.050 thold), f_keep = 1.000

And further along, I see prefixes that only differ in tool call details, as you would expect:

selected slot by LCP similarity, sim_best = 0.994 (> 0.050 thold), f_keep = 0.999
[...]
slot update_slots: id  1 | task 449 | old: ... =command>
 | cd /home/kenny/g
slot update_slots: id  1 | task 449 | new: ... =command>
 | git status
</parameter>

After this change, similarity looks normal and caching is performing well.

While debugging this, I dumped the /slots api a couple times on subsequent requests.
The diffs in the prompt field were like:

diff prompt1 prompt2 --unchanged-line-format="" --old-line-format="< :%dn: %L" --new-line-format="> :%dn: %L"
< :62: x-anthropic-billing-header: cc_version=2.1.101.e51; cc_entrypoint=cli; cch=a5145;You are Claude Code, Anthropic's official CLI for Claude.
> :62: x-anthropic-billing-header: cc_version=2.1.101.e51; cc_entrypoint=cli; cch=4a1a8;You are Claude Code, Anthropic's official CLI for Claude.
> :5130: </tool_response><|im_end|>
> :5131: <|im_start|>assistant
> :5132: <think>
> :5133: The diagnostics still show an issue with [...]

You can see line 62 has a cch diff, followed by over 5000 common lines before the next diff.
This should have been a total cache hit, because everything new starts at line 5130. But
because of the line 62 diff, it had to re-ingest nearly the whole thing. Without this
change, llama-server does this on every request because of Anthropic's magic "header."
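
To illustrate why one early difference is so costly, here is a toy sketch of prefix matching on token sequences. The helper names are hypothetical, and the `sim` and `f_keep` formulas are my reading of the log fields (matched prefix divided by the incoming prompt length and by the cached prompt length, respectively), not a quote of the server code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: length of the longest common prefix of two token sequences.
size_t common_prefix_len(const std::vector<int> & cached, const std::vector<int> & incoming) {
    size_t n = 0;
    while (n < cached.size() && n < incoming.size() && cached[n] == incoming[n]) {
        ++n;
    }
    return n;
}

// Hypothetical summary of how much of a cached prompt can be reused.
struct reuse_stats {
    size_t lcp;    // number of leading tokens shared by both prompts
    float  sim;    // assumed: lcp / incoming.size()
    float  f_keep; // assumed: lcp / cached.size()
};

reuse_stats compute_reuse(const std::vector<int> & cached, const std::vector<int> & incoming) {
    const size_t lcp = common_prefix_len(cached, incoming);
    // Guard against empty prompts to avoid dividing by zero.
    const float sim    = incoming.empty() ? 0.0f : static_cast<float>(lcp) / static_cast<float>(incoming.size());
    const float f_keep = cached.empty()   ? 0.0f : static_cast<float>(lcp) / static_cast<float>(cached.size());
    return { lcp, sim, f_keep };
}
```

With a cached prompt `{1,2,3,4,5,6,7,8}` and an incoming prompt that differs only at the third token and appends two new ones, the matched prefix is 2 tokens, even though almost everything after it is identical. If the third token is made stable, the whole cached prompt matches and only the two appended tokens need processing.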

Performance:
For users who aren't using Claude to send messages to the Anthropic API, the cost
of this change is a single O(1) string prefix check per system message. I don't
imagine too many system messages start with x, so in the usual case it will early-out
after 1 character's worth of comparison.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: NO. I read and wrote all of this myself.

ngxson · 2026-04-12T20:53:58Z

Please address issues from automated review: ngxson#98

ngxson · 2026-04-12T22:09:27Z

+// re-format an x-anthropic-billing-header system prompt's cch section for prompt caching friendliness
+void normalize_anthropic_billing_header(std::string & system_text);


This function is never used outside of the file, so it should not be in the header.

kvc0 · 2026-04-12

Okay - I got a warning from cmake that it wasn't declared first. I'll just declare it in the cpp then! I don't see a pattern for pre-declaring at the top of the file, so I'll just put the declaration next to the helper :-)

kvc0 · 2026-04-12

Fixed, thank you.

drrros · 2026-04-13T20:55:41Z

Tried this branch, and it feels like there are fewer recomputations now.

kvc0 · 2026-04-16T19:54:07Z

@ngxson Thank you for tolerating a ping - am I missing anything to get this merged? I'd love to address any gaps; this is my first contribution to llama.cpp, so if there's a process I missed or something please let me know!

drrros · 2026-04-17T17:55:41Z

The PR has no conflicts yet, so it seems applying the patch to an up-to-date master is the way for now :( Hope this will be merged at some point.

ngxson · 2026-04-17T19:37:12Z

@ServeurpersoCom can you give the 2nd approval? Thanks!

drrros · 2026-04-22T20:58:34Z

@ServeurpersoCom PTAL, this is very useful with claude code

ServeurpersoCom · 2026-04-23T04:04:14Z

I had a week without GitHub notifications due to a bad configuration on my side. You did well to notify me.

ServeurpersoCom · 2026-04-23T04:10:00Z

There's now a conflict on tools/server/server-common.cpp from master moving; a rebase should unblock the merge.

kvc0 · 2026-04-23T04:21:50Z

@ServeurpersoCom A rebase wasn't quite clean. I just did a manual migration to master a half hour ago, and I'm testing now to ensure I got the new location right. It was just 1 helper function and like 5 lines of diff in the newly-moved code. It is doing the right thing, so I'll update soon! Thanks for your patience <3

ServeurpersoCom · 2026-04-23T04:26:31Z

Thanks for handling the manual migration! Ping me when it's up and I'll take another look!

kvc0 · 2026-04-23T04:26:39Z

@ServeurpersoCom I don't like force-pushing after people have reviewed, but I just migrated the change from the place it went yesterday to the place it goes today.

ServeurpersoCom

Force-pushing is the right call here; it's just a code relocation, not a logic change. Re-approving now :)

ServeurpersoCom · 2026-04-23T15:45:27Z

(CI errors are independent of this PR)

kvc0 requested a review from a team as a code owner April 12, 2026 05:55

github-actions Bot added examples server labels Apr 12, 2026

ngxson mentioned this pull request Apr 12, 2026

[Mirror] anthropic: fix prefix caching ngxson/llama.cpp#98

Open

ngxson approved these changes Apr 12, 2026

ngxson changed the title from "anthropic: fix prefix caching" to "server: (anthropic API) fix prefix caching" Apr 12, 2026

ngxson requested a review from a team April 12, 2026 22:03

ngxson reviewed Apr 12, 2026

ngxson requested a review from a team April 13, 2026 21:00

ServeurpersoCom approved these changes Apr 23, 2026

kvc0 force-pushed the claude branch from 1c814b0 to 67f8219 Compare April 23, 2026 04:24

ServeurpersoCom approved these changes Apr 23, 2026

ServeurpersoCom merged commit c807c6e into ggml-org:master Apr 23, 2026
46 of 49 checks passed
