When running llama-server with `-np 1` and executing more than one agent at a time, it's common for one agent's prompt to overwrite and invalidate the cache of another agent's prompt, leading to significant and unnecessary recomputation.

Running llama-server with `-np 4` creates four slots and resolves this issue. However, `-np` also controls the number of concurrent inferences, which reduces the overall token generation speed from 20 to 5 tokens/s (glm-4.7-flash).

Why isn't there a separate argument (e.g., `-ns`) to control the number of slots independently? What we actually want is something like:

`-ns 4 -np 1`

Also, `llama-server --help` is not very clear about the `-np` argument.
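To make the request concrete, here is a sketch of the current workaround versus the proposed invocation. The model path is a placeholder, and `-ns` is the hypothetical flag requested above, not an existing option:

```shell
# Current workaround: -np 4 gives each agent its own slot, so prompts
# no longer evict each other's cache -- but it also permits 4 concurrent
# inferences, which slows per-request token generation.
llama-server -m model.gguf -np 4

# Proposed: 4 slots for cache isolation, but only 1 concurrent inference.
# NOTE: -ns is hypothetical and does not exist in llama-server today.
llama-server -m model.gguf -ns 4 -np 1
```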