This would be huge for me. llama-server has a lot of features and ease of use that I'd lose by writing a wrapper (I often use LlamaSharp, which doesn't even wrap common), but it doesn't allow much control over sampling--just some built-in parameters and a grammar.
The thing I wanted most is to be able to break loops without affecting intelligence (e.g., still letting the model write code that uses the same variable on every line). Those loops happen fairly often with quantized models, and they can cost a lot of time when the client doesn't stream all the tokens, so you can't even tell it's happening. So I had Qwen3.6-27B-UD-Q6_K_XL try implementing it via OpenCode (patch attached below).
Not too complex, but I'm sure it'd need a lot more work to function correctly in various cases--speculative decoding, multiple chained samplers, optimization. But making this customization available via an extension library would mean not having to maintain and repeatedly rebase our own llama.cpp branches when we want a custom sampling method in llama-server.
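For anyone who wants the gist without reading the patch, here's a minimal sketch of what the loop-breaking part could look like against the `llama_sampler_i` interface in llama.h. This is not the attached patch, just an illustration: names like `loop_break_ctx`, `loop_break_sampler_init`, and the tuning parameters are made up, and the `llama_sampler_i` field order and `llama_sampler_init` signature are from my reading of current llama.h, so double-check against your tree. The idea is to only intervene when the tail of the history is an exact repeating cycle, which is what keeps it from hurting "legitimate" repetition like reusing the same variable name.

```cpp
// Illustrative only -- not the attached patch. A custom sampler that detects an
// exact repeating token cycle and discourages the token that would continue it.
#include "llama.h"

#include <deque>

struct loop_break_ctx {
    std::deque<llama_token> history;   // recently accepted tokens
    size_t max_history = 256;
    size_t min_cycle   = 2;            // single-token repeats are left to the normal repeat penalty
    size_t max_cycle   = 64;
    int    min_repeats = 3;            // cycle must occur this many times before we intervene
    float  penalty     = 8.0f;         // subtracted from the offending logit (discourage, don't forbid)
};

// If the tail of the history is one cycle repeated min_repeats times, return the
// token that would extend the cycle; otherwise return -1.
static llama_token find_loop_continuation(const loop_break_ctx & c) {
    const auto & h = c.history;
    for (size_t p = c.min_cycle; p <= c.max_cycle; ++p) {
        const size_t need = p * (size_t) c.min_repeats;
        if (h.size() < need) {
            break;
        }
        bool cycling = true;
        for (size_t i = h.size() - need; cycling && i + p < h.size(); ++i) {
            cycling = h[i] == h[i + p];
        }
        if (cycling) {
            return h[h.size() - p];
        }
    }
    return -1;
}

static const char * loop_break_name(const struct llama_sampler * /*smpl*/) { return "loop-break"; }

static void loop_break_accept(struct llama_sampler * smpl, llama_token token) {
    auto * c = (loop_break_ctx *) smpl->ctx;
    c->history.push_back(token);
    if (c->history.size() > c->max_history) {
        c->history.pop_front();
    }
}

static void loop_break_apply(struct llama_sampler * smpl, llama_token_data_array * cur_p) {
    const auto * c = (const loop_break_ctx *) smpl->ctx;
    const llama_token bad = find_loop_continuation(*c);
    if (bad < 0) {
        return;
    }
    for (size_t i = 0; i < cur_p->size; ++i) {
        if (cur_p->data[i].id == bad) {
            cur_p->data[i].logit -= c->penalty;
            cur_p->sorted = false;
            break;
        }
    }
}

static void loop_break_reset(struct llama_sampler * smpl) { ((loop_break_ctx *) smpl->ctx)->history.clear(); }
static void loop_break_free (struct llama_sampler * smpl) { delete (loop_break_ctx *) smpl->ctx; }

static struct llama_sampler * loop_break_clone(const struct llama_sampler * smpl);

static struct llama_sampler_i loop_break_iface = {
    /*.name   =*/ loop_break_name,
    /*.accept =*/ loop_break_accept,
    /*.apply  =*/ loop_break_apply,
    /*.reset  =*/ loop_break_reset,
    /*.clone  =*/ loop_break_clone,
    /*.free   =*/ loop_break_free,
};

static struct llama_sampler * loop_break_clone(const struct llama_sampler * smpl) {
    return llama_sampler_init(&loop_break_iface, new loop_break_ctx(*(const loop_break_ctx *) smpl->ctx));
}

struct llama_sampler * loop_break_sampler_init() {
    return llama_sampler_init(&loop_break_iface, new loop_break_ctx());
}
```

Hooking something like this in is just `llama_sampler_chain_add(chain, loop_break_sampler_init())` wherever the server builds its sampler chain; the extension-library part of the proposal is about being able to do that from outside the tree instead of patching llama-server itself.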
Speaking of rebasing, I had it implement this on #22673 at 5d5f1b4 (before that PR was recently force-pushed), merged with master at 9dcf835: server-sampler-extension.patch
Edit: I also reapplied it to that pull request and merged llama.cpp master again, in case someone wants to play with it: https://github.com/ggml-org/llama.cpp/compare/master...dpmm99:llama.cpp:mtp-clean-with-extensions-merged-23020?expand=1
Edit again: reapplied to master now that MTP is merged, and made it support speculative decoding: https://github.com/ggml-org/llama.cpp/compare/master...dpmm99:llama.cpp:master-with-sampling-extensions?expand=1