When running llama-server with `-np 1` and executing more than one agent at a time, it's common for one agent's prompt to overwrite and invalidate the cache of another agent's prompt, leading to significant and unnecessary recomputation.

Running llama-server with `-np 4` creates four slots and resolves this issue. However, `-np` also controls the number of concurrent inferences, which reduces the overall token generation speed from 20 to 5 tokens/s (glm-4.7-flash).

Why isn't there a separate argument (e.g., `-ns`) to control the number of slots independently? What we actually want is something like:

`-ns 4 -np 1`

Also, `llama-server --help` is not very clear about the `-np` argument.
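To make the request concrete, here is a sketch of the current workaround versus the proposed invocation. The model path is a placeholder, and `-ns` is the hypothetical flag requested above, not an existing option:

```shell
# Current workaround: -np 4 gives each agent its own slot, so prompts
# no longer evict each other's cache -- but it also permits 4 concurrent
# inferences, which slows per-request token generation.
llama-server -m model.gguf -np 4

# Proposed: 4 slots for cache isolation, but only 1 concurrent inference.
# NOTE: -ns is hypothetical and does not exist in llama-server today.
llama-server -m model.gguf -ns 4 -np 1
```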