ROCm uses more memory than Vulkan. I have the same setup as you, and you're also running at F16, so I can't imagine how much it's trying to fit. You need to set a specific context size manually: within that memory budget and with that model, I'd say you can fit about 30k of context at most. Also, llama.cpp has a problem on Windows where its performance throttles down over time until you restart (not saying this is the case for you).
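As a rough illustration of why the context size has to be capped, here is a minimal sketch of the usual KV-cache size estimate: per token, each attention layer stores one K and one V vector of `n_kv_heads * head_dim` elements. The layer/head numbers in the usage example are hypothetical placeholders, not the real Qwen3.6-27B configuration. The sketch also ignores compute buffers, the MTP draft cache, and any non-attention (recurrent) layers, so treat the result as an upper bound on context rather than a precise figure.

```python
import math

# Bytes per cached element. f16 is 2 bytes; ggml's Q8_0 stores 32 values
# in 34 bytes (one f16 scale + 32 int8 values), included only for comparison.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32}


def kv_bytes_per_token(n_attn_layers, n_kv_heads, head_dim, cache_type="f16"):
    """Bytes of K plus V cache stored for one token of context."""
    # Factor 2: one K tensor and one V tensor per attention layer.
    elems = 2 * n_attn_layers * n_kv_heads * head_dim
    return elems * BYTES_PER_ELEM[cache_type]


def max_context(budget_bytes, n_attn_layers, n_kv_heads, head_dim, cache_type="f16"):
    """Largest context length whose KV cache fits in budget_bytes."""
    per_token = kv_bytes_per_token(n_attn_layers, n_kv_heads, head_dim, cache_type)
    return math.floor(budget_bytes / per_token)


if __name__ == "__main__":
    # HYPOTHETICAL model shape and a hypothetical 2 GiB of VRAM left after weights.
    budget = 2 * 1024**3
    print(max_context(budget, n_attn_layers=16, n_kv_heads=4, head_dim=256))  # → 32768
    print(max_context(budget, 16, 4, 256, cache_type="q8_0"))  # → 61680
```

The point of the comparison is that the cache footprint scales linearly with both context length and bytes per element, so an F16 cache roughly halves the context you can fit compared with an 8-bit one in the same budget.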
Using

`llama-cli.exe -hf unsloth/Qwen3.6-27B-MTP-GGUF:Q5_K_XL -ngl all --fit on --no-mmap --spec-type draft-mtp --spec-draft-n-max n --cache-type-k f16 --cache-type-v f16 --spec-draft-type-k f16 --spec-draft-type-v f16 -fa on`

(omitting the `--spec-*` flags for the non-MTP runs) on a 7900 XTX under Windows 11. In every test case, the model and context fit entirely in VRAM.

HIP, fa on

Vulkan, fa on

fa off
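To make the test matrix above reproducible, here is a minimal sketch that builds the argument list for each run: the `--spec-*` flags are added only for MTP runs, and `-fa` is switched between `on` and `off`. The flags are copied verbatim from the command above and have not been checked against any particular llama.cpp version. HIP and Vulkan are separate llama.cpp builds rather than a flag, so the executable path is a parameter; the paths in the usage example are hypothetical.

```python
from typing import List, Optional

MODEL = "unsloth/Qwen3.6-27B-MTP-GGUF:Q5_K_XL"


def build_args(exe: str, flash_attn: bool, draft_n_max: Optional[int]) -> List[str]:
    """Return the llama-cli argument list for one test case.

    draft_n_max=None means a non-MTP run, so every --spec-* flag is omitted.
    """
    mtp = draft_n_max is not None
    args = [exe, "-hf", MODEL, "-ngl", "all", "--fit", "on", "--no-mmap"]
    if mtp:
        args += ["--spec-type", "draft-mtp", "--spec-draft-n-max", str(draft_n_max)]
    # Main-model KV cache stays at f16 in every case, as in the post.
    args += ["--cache-type-k", "f16", "--cache-type-v", "f16"]
    if mtp:
        # The MTP draft also keeps its KV cache at f16.
        args += ["--spec-draft-type-k", "f16", "--spec-draft-type-v", "f16"]
    args += ["-fa", "on" if flash_attn else "off"]
    return args


if __name__ == "__main__":
    # HYPOTHETICAL build directories for the two backends.
    for exe in (r"build-hip\bin\llama-cli.exe", r"build-vulkan\bin\llama-cli.exe"):
        for n in (None, 1, 2, 3):
            print(build_args(exe, flash_attn=True, draft_n_max=n))
```

Each list can be passed directly to `subprocess.run`, which avoids Windows shell-quoting issues; keeping the flags in one function also guarantees that the MTP and non-MTP runs differ only in the `--spec-*` flags.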