Hi there, I've been playing around a lot with llama.cpp lately. I have two GPUs (RTX 4060 Ti and RTX 5060 Ti) and tuned my models.ini so that models load smoothly on both cards with llama-server. But when I want to switch models on one card, I first have to unload the current model manually before loading another. And since my two cards don't have the same amount of VRAM, each of my models is tied to one GPU or the other.
Here's an idea I think would be a really nice addition:
In models.ini, we could add a "group" parameter. When a request comes in for a model in a group, llama-server would first check which model is loaded in that group; if it is not the requested model, it would automatically unload the other models from that group. We could even push this a little further with a group-max-models setting for anyone who wants a group to hold 2 or 3 models at the same time (I personally have no use for this, but it may be useful for others).
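To make the proposed behavior concrete, here is a minimal Python sketch of the bookkeeping. It is only an illustration of the idea, not llama-server code: the `GroupRouter` class, its method names, and the group names are all hypothetical, and "group" / "group-max-models" are the settings proposed in this post, not existing options. I also assume the oldest-used model in a group is the one to unload first.

```python
from collections import OrderedDict


class GroupRouter:
    """Hypothetical tracker: which models are loaded in each group."""

    def __init__(self, model_groups, group_max_models=None):
        # model_groups: model name -> group name (the proposed "group" key)
        self.model_groups = model_groups
        # group_max_models: group name -> limit (the proposed "group-max-models" key)
        self.group_max_models = group_max_models or {}
        # group name -> loaded models, least recently used first
        self.loaded = {}

    def request(self, model):
        """Mark `model` as loaded and return the models to unload first."""
        group = self.model_groups[model]
        members = self.loaded.setdefault(group, OrderedDict())
        if model in members:
            # Already loaded: just refresh its position, nothing to unload.
            members.move_to_end(model)
            return []
        # Default to one model per group; never allow a limit below 1.
        limit = max(1, self.group_max_models.get(group, 1))
        evicted = []
        while len(members) >= limit:
            name, _ = members.popitem(last=False)  # drop least recently used
            evicted.append(name)
        members[model] = True
        return evicted


router = GroupRouter({
    "qwen": "rtx5060ti",
    "glm": "rtx5060ti",
    "gemma": "rtx4060ti",
})
print(router.request("qwen"))   # []
print(router.request("gemma"))  # []
print(router.request("glm"))    # ['qwen']
```

Requesting "glm" only unloads "qwen" because they share a group; "gemma" on the other card is left alone.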
What do you think of this idea?
I know about llama-swap, but this would make it possible to use llama-server directly, without depending on a third-party app.
It would also rely on users to tune their own config, and it avoids complex changes to llama-server to track the actual state of available resources on the host: the server would simply check a setting before executing the request.
I already know about the --models-max parameter, but this is different: --models-max doesn't take into account which GPU is in use. For example, on my RTX 5060 Ti I load qwen3.6 35b A3B or GLM4.7, while on my RTX 4060 Ti I run smaller models like gemma4 E4B. I can't rely on --models-max = 2, because if only qwen is loaded I can't run GLM4.7 without unloading qwen first. If they were marked as part of the same group, qwen would be unloaded when I try to "talk" to GLM, since only one model of the group may be loaded at a time.
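The difference can be shown with a toy comparison in Python. This is a simplified model of a count-only cap as I describe it above, not llama-server's actual implementation; both function names and the group names are made up for illustration.

```python
def global_cap_evictions(loaded, requested, models_max):
    """Toy count-only cap: unload the oldest model only once the total hits the cap."""
    if requested in loaded:
        return []
    remaining = list(loaded)  # oldest first
    evicted = []
    while remaining and len(remaining) >= models_max:
        evicted.append(remaining.pop(0))
    return evicted


def group_evictions(loaded, requested, groups):
    """Proposed rule: unload every loaded model in the requested model's group."""
    if requested in loaded:
        return []
    return [m for m in loaded if groups[m] == groups[requested]]


# Hypothetical group assignment: one group per GPU.
groups = {"qwen": "rtx5060ti", "glm": "rtx5060ti", "gemma": "rtx4060ti"}

# Only qwen is loaded and glm is requested.
print(global_cap_evictions(["qwen"], "glm", models_max=2))  # []
print(group_evictions(["qwen"], "glm", groups))             # ['qwen']
```

With a cap of 2 nothing is unloaded, so glm would have to fit next to qwen on the same card; with the group rule qwen is unloaded first.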
If this is seen as a good idea, I can try to put together a PR for it if someone points me in the right direction.