server: (anthropic API) fix prefix caching#21793

Merged
ServeurpersoCom merged 1 commit into
ggml-org:masterfrom
kvc0:claude
Apr 23, 2026
@kvc0

@kvc0 kvc0 commented Apr 12, 2026

Contributor

Overview

When testing Claude Code against llama.cpp, I noticed that only
n_past 18577 was used even when the context was 60k or more. The
llama-server log says:

slot update_slots: id  3 | task 10342 | old: ... ; cch= | defa0;You are
slot update_slots: id  3 | task 10342 | new: ... ; cch= | 1c8b4;

I observed that the cch value changed on every request. From what I've read,
the x-anthropic-billing-header system message seems to be handled
specially inside the Anthropic API. I could remove it, but a meaningful
string is sometimes included at the end. So instead, I just replace
the changing cch checksum with fffff.

I'm treating this as an Anthropic message-body API detail. I think this
is the right way to do it, but by all means please correct me!

The checksum is always 5 hexadecimal characters, but I've written the
replacement defensively in case they change the protocol.
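
As a minimal sketch of the idea (not the code merged in this PR), the replacement could look like the following. Only the `x-anthropic-billing-header` prefix, the `cch=` key, and the `fffff` placeholder come from the description above. The function name `normalize_billing_header_sketch`, the restriction to the first line, and the rule that the value ends at the next `;` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical sketch, not the code merged in this PR.
// Rewrites the cch value of a leading x-anthropic-billing-header to a fixed
// placeholder so that otherwise identical prompts share a common prefix.
void normalize_billing_header_sketch(std::string & system_text) {
    const std::string prefix      = "x-anthropic-billing-header:";
    const std::string key         = "cch=";
    const std::string placeholder = "fffff";

    // Cheap early out: a text that does not start with the header is untouched.
    if (system_text.compare(0, prefix.size(), prefix) != 0) {
        return;
    }

    // Assumption: the header lives on the first line only.
    std::size_t line_end = system_text.find('\n');
    if (line_end == std::string::npos) {
        line_end = system_text.size();
    }

    const std::size_t key_pos = system_text.find(key, prefix.size());
    if (key_pos == std::string::npos || key_pos >= line_end) {
        return; // no cch field in the header line
    }

    // Assumption: the value runs up to the next ';' (or the end of the line),
    // so a checksum of any length is replaced, not only 5 hex characters.
    const std::size_t val_begin = key_pos + key.size();
    std::size_t val_end = system_text.find(';', val_begin);
    if (val_end == std::string::npos || val_end > line_end) {
        val_end = line_end;
    }

    system_text.replace(val_begin, val_end - val_begin, placeholder);
}

// Small self-check using the header shape quoted in this PR.
void check_normalize_billing_header_sketch() {
    std::string a = "x-anthropic-billing-header: cc_version=2.1.101.e51; cc_entrypoint=cli; cch=a5145;You are Claude Code";
    std::string b = "x-anthropic-billing-header: cc_version=2.1.101.e51; cc_entrypoint=cli; cch=4a1a8;You are Claude Code";
    normalize_billing_header_sketch(a);
    normalize_billing_header_sketch(b);
    assert(a == b); // both now carry cch=fffff
}
```

Matching up to the next delimiter instead of exactly five characters is one way to stay defensive: a longer or shorter checksum would still be normalized, and text without the header is left alone.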

Additional information

When asking "explain this repo to me" on a different repo, using a freshly started llama-server, the second request logged:
Before:

selected slot by LCP similarity, sim_best = 0.566 (> 0.050 thold), f_keep = 0.704

This is the best case; it gets progressively worse because the matched length never
exceeds 18577 (up to 18580 theoretically, but I never saw higher than 18578).

After:

selected slot by LCP similarity, sim_best = 0.805 (> 0.050 thold), f_keep = 1.000

And further along, I see prefixes that only differ in tool call details, as you would expect:

selected slot by LCP similarity, sim_best = 0.994 (> 0.050 thold), f_keep = 0.999
[...]
slot update_slots: id  1 | task 449 | old: ... =command>
 | cd /home/kenny/g
slot update_slots: id  1 | task 449 | new: ... =command>
 | git status
</parameter>

After this change, similarity looks normal and caching is performing well.
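
The two ratios in these log lines can be read as follows, under an assumption that this PR does not confirm: `sim_best` is the common prefix length divided by the length of the incoming prompt, and `f_keep` is the common prefix length divided by the length of the cached prompt. The struct and function names below are hypothetical and exist only to illustrate that reading.

```cpp
#include <cassert>
#include <cstddef>

// Assumed interpretation of the logged ratios, not taken from the llama.cpp source.
struct slot_match_sketch {
    float sim;    // common prefix length / length of the incoming prompt
    float f_keep; // common prefix length / length of the cached prompt
};

// Hypothetical helper computing both ratios from token counts.
slot_match_sketch score_slot_sketch(std::size_t lcp_len, std::size_t n_cached, std::size_t n_incoming) {
    slot_match_sketch m = {0.0f, 0.0f};
    if (n_incoming > 0) {
        m.sim = static_cast<float>(lcp_len) / static_cast<float>(n_incoming);
    }
    if (n_cached > 0) {
        m.f_keep = static_cast<float>(lcp_len) / static_cast<float>(n_cached);
    }
    return m;
}

void check_score_slot_sketch() {
    // Toy numbers: the whole cached prompt (50 tokens) is a prefix of a
    // 100-token request, so everything cached is kept and half the request is new.
    const slot_match_sketch m = score_slot_sketch(50, 50, 100);
    assert(m.f_keep == 1.0f);
    assert(m.sim == 0.5f);
}
```

Under that reading, `f_keep = 1.000` means the entire cached prompt was reused, while `f_keep = 0.704` means part of the cache was discarded on the request.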

While debugging this, I dumped the /slots API a couple of times on subsequent requests.
The diffs in the prompt field looked like this:

diff prompt1 prompt2 --unchanged-line-format="" --old-line-format="< :%dn: %L" --new-line-format="> :%dn: %L"
< :62: x-anthropic-billing-header: cc_version=2.1.101.e51; cc_entrypoint=cli; cch=a5145;You are Claude Code, Anthropic's official CLI for Claude.
> :62: x-anthropic-billing-header: cc_version=2.1.101.e51; cc_entrypoint=cli; cch=4a1a8;You are Claude Code, Anthropic's official CLI for Claude.
> :5130: </tool_response><|im_end|>
> :5131: <|im_start|>assistant
> :5132: <think>
> :5133: The diagnostics still show an issue with [...]

You can see that line 62 has a cch diff, followed by over 5000 identical lines before the next diff.
This should have been a full cache hit, because everything new starts at line 5130. But
because of the line 62 diff, llama-server had to re-ingest nearly the whole prompt. Without this
change, it does so on every request because of Anthropic's magic "header."
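
A toy illustration of why one early difference is so costly: prefix caching can only reuse the tokens before the first mismatch. The helper name `common_prefix_len` and the token values are hypothetical; this is a sketch of the general technique, not the server's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: length of the longest common prefix of two token sequences.
// Only this many leading tokens of a cached prompt can be reused.
std::size_t common_prefix_len(const std::vector<int> & cached, const std::vector<int> & incoming) {
    std::size_t n = 0;
    while (n < cached.size() && n < incoming.size() && cached[n] == incoming[n]) {
        ++n;
    }
    return n;
}

void check_common_prefix_len() {
    // Toy prompts: the third token stands in for the changing cch value.
    const std::vector<int> cached   = {10, 11, 900, 12, 13, 14};
    const std::vector<int> incoming = {10, 11, 901, 12, 13, 14, 15, 16};
    // One early mismatch caps the reusable prefix at 2 tokens, even though
    // every later token of the cached prompt reappears unchanged.
    assert(common_prefix_len(cached, incoming) == 2);

    // With the changing token normalized, the whole cached prompt is a prefix.
    const std::vector<int> cached_norm   = {10, 11, 999, 12, 13, 14};
    const std::vector<int> incoming_norm = {10, 11, 999, 12, 13, 14, 15, 16};
    assert(common_prefix_len(cached_norm, incoming_norm) == cached_norm.size());
}
```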

Performance:
The impact of this change on users who aren't using Claude to send messages to the
Anthropic API is one O(1) string prefix check per system message. I don't
imagine many system messages start with x, so in the usual case the check exits
after comparing a single character.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: NO. I read and wrote all of this myself.

@ngxson

ngxson commented Apr 12, 2026

Collaborator

Please address issues from automated review: ngxson#98

@ngxson ngxson changed the title from "anthropic: fix prefix caching" to "server: (anthropic API) fix prefix caching" Apr 12, 2026
@ngxson ngxson requested a review from a team April 12, 2026 22:03
Comment thread tools/server/server-common.h Outdated
Comment on lines +350 to +351
// re-format an x-anthropic-billing-header system prompt's cch section for prompt caching friendliness
void normalize_anthropic_billing_header(std::string & system_text);

@ngxson ngxson Apr 12, 2026

Collaborator

this function is never used outside of the file, it should not be in the header

Contributor Author

Okay - I got a warning from cmake that it wasn't declared first. I'll just declare it in the cpp then! I don't see a pattern for pre-declaring at the top of the file, so I'll just put the declaration next to the helper :-)

Contributor Author

fixed, thank you
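
For readers unfamiliar with the pattern discussed in this thread, here is a minimal sketch of two ways to keep a helper private to one .cpp file. It assumes the warning mentioned above is of the kind GCC reports under `-Wmissing-declarations`, which the thread does not state. Both function names are hypothetical, and the second option is an alternative, not what the thread says was done.

```cpp
#include <cassert>
#include <string>

// Option 1: a declaration placed right next to the definition in the .cpp file,
// as described in the thread. The function keeps external linkage, but the
// header no longer advertises it.
void helper_with_local_declaration(std::string & text);
void helper_with_local_declaration(std::string & text) {
    text += "!";
}

// Option 2 (alternative): internal linkage via `static`, which makes the
// helper invisible to other translation units.
static void helper_with_internal_linkage(std::string & text) {
    text += "?";
}

void check_file_local_helpers() {
    std::string s = "a";
    helper_with_local_declaration(s);
    helper_with_internal_linkage(s);
    assert(s == "a!?");
}
```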

@drrros

drrros commented Apr 13, 2026

Contributor

Tried this branch, and it feels like there are fewer recomputations now.

@ngxson ngxson requested a review from a team April 13, 2026 21:00
@kvc0

kvc0 commented Apr 16, 2026

Contributor Author

@ngxson Thank you for tolerating a ping - am I missing anything to get this merged? I'd love to address any gaps; this is my first contribution to llama.cpp, so if there's a process I missed or something please let me know!

@drrros

drrros commented Apr 17, 2026

Contributor

The PR has no conflicts yet, so applying the patch to an up-to-date master seems to be the way for now :( I hope this will be merged at some point.

@ngxson

ngxson commented Apr 17, 2026

Collaborator

@ServeurpersoCom can you give the 2nd approval? Thanks!

@drrros

drrros commented Apr 22, 2026

Contributor

@ServeurpersoCom PTAL, this is very useful with Claude Code

@ServeurpersoCom

Contributor

I went a week without GitHub notifications due to a bad configuration on my side; you did well to notify me.

@ServeurpersoCom

Contributor

There's now a conflict on tools/server/server-common.cpp because master has moved; a rebase should unblock the merge.

@kvc0

kvc0 commented Apr 23, 2026

Contributor Author

@ServeurpersoCom A rebase wasn't quite clean. I just did a manual migration to master a half hour ago, and I'm testing now to ensure I got the new location right. It was just 1 helper function and like 5 lines of diff in the newly-moved code. It is doing the right thing, so I'll update soon! Thanks for your patience <3

@ServeurpersoCom

Contributor

Thanks for handling the manual migration! Ping me when it's up and I'll take another look!

@kvc0

kvc0 commented Apr 23, 2026

Contributor Author

@ServeurpersoCom I don't like force-pushing after people have reviewed, but I just migrated the change from the place it went yesterday to the place it goes today.

@ServeurpersoCom ServeurpersoCom left a comment

Contributor

Force-pushing is the right call here: it's just a code relocation, not a logic change. Re-approving now :)

@ServeurpersoCom ServeurpersoCom merged commit c807c6e into ggml-org:master Apr 23, 2026
46 of 49 checks passed
@ServeurpersoCom

Contributor

(CI errors are independent of this PR)
