Replies: 13 comments 20 replies
-
|
I'm the sole user of these models (at most an embedding and llm model at the same time). Usually just chatting back and forth with basic tool calls via MCP and the llama-server webui or Pi for cli. Request: MoE models that are larger than VRAM (let's focus on
I know file sizes and active parameters all impact the numbers and you can't expect apples to apples between gpt-oss model architecture and others; just want to make sure I'm not missing any dials to tweak. Thanks everyone for this amazing project! Additional Questions
System SetupRTX 3090 (24 GB VRAM) | 128 DDR5 6400 MT/s RAM | Intel Core Ultra 265k I use
Portions of my For embedding models I add this at each model level Llama-Bench Command |
Beta Was this translation helpful? Give feedback.
-
|
Great thread! I'm running Qwen 35B A3B on my RTX 2060 laptop (32GB RAM, 6 GB VRAM, i7 9750H, Windows 11) and this is the first model I am able to run at ridiculous amounts of context at great speeds.
This is the best configuration I have come up with. With that, the experts are running on the CPU while the other layers run on the GPU. ubatch 2048 gives a huge speedup to prompt processing which is sorely needed. This way I'm getting around 350-400 token/s prefill on 102K context and a text generation of around 15 token/s. So I'm very pleased how well it runs. So pure text generation is great. However, as you have noticed, I have offloaded the mmproj entirely on the CPU. Why? Because it needs around 600 MB VRAM and that would greatly reduce the effective context I am able to run. On the CPU it can be very slow, it takes up to 300 seconds on decently sized images like browser snapshots. I wonder if there is any way to either make the mmproj more efficient on the CPU or switch layers from GPU to RAM to free up VRAM for the vision encoder just before the vision processing automatically or with a command and then load them back on the GPU after the vision processing has been completed. |
Beta Was this translation helpful? Give feedback.
-
|
I might be using one of the more exotic setups :) But speeds aren't really that good despite the connectivity.
I typically execute like this: |
Beta Was this translation helpful? Give feedback.
-
|
i have an old crypto miner, its basically e-waste but it does run smaller models okeyish. i'am building a garbage multi agent chat interface, it sort of works as long as you only trigger 1-2 model at the same time hw: i currently run it like this : it can idle multiple models at the same time, and run active inference on 2 models without too much slowdown but as soon as you hit the third model everything slows down significantly. it's running different 4b models in Q4, or whatever fits in a single gpu's memory i was wondering if there is a way to reduce cpu load to be-able to run multiple models concurrently |
Beta Was this translation helpful? Give feedback.
-
Beta Was this translation helpful? Give feedback.
-
|
Hey guys. I have the pleasure of being able to use an on prem GH200 server my company purchased a while back and wanted to optimize inference further on this particular platform. One things I have seen across a lot of llm inference engines is that they don't fully utilize these NVIDIA grace platforms well. Specifically the NVIDIA NVLink-C2C - Chip Interconnect. So one big win I think for these type of systems is for MOE models by placing experts in a CUDA-owned mapped host buffer that lives inside the grace memory. I'm a little out of my area on expertise on this though so with GPT 5.5 i was able to get something working (i think) that tries to take advantage of this. the diff is here. 56900d0 To test i picked Qwen3.5 A122B FP8 which is larger than the 94gb of vram on my GH200 system and with the normal --moe-cpu offloading we get these results: build-c2c/bin/llama-server \
-m /bartowski--Qwen_Qwen3.5-122B-A10B-GGUF
-ngl all \
--cpu-moe \
--no-mmap \
--host 0.0.0.0 \
--port 8080 \
--ctx-size 12288 \
--reasoning off
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 97280 MiB):
Device 0: NVIDIA GH200 480GB, compute capability 9.0, VMM: yes, VRAM: 97280 MiB
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build_info: b9108-928b486b0
system_info: n_threads = 72 (n_threads_batch = 72) / 72 | CUDA : ARCHS = 900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | SVE = 1 | DOTPROD = 1 | SVE_CNT = 16 | OPENMP = 1 | REPACK = 1 |
Running without SSL
init: using 71 threads for HTTP server
load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 47 repeating layers to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: CUDA0 model buffer size = 5607.73 MiB
load_tensors: CUDA_Host model buffer size = 118276.97 MiB
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
slot print_timing: id 3 | task 0 |
prompt eval time = 2075.52 ms / 18 tokens ( 115.31 ms per token, 8.67 tokens per second)
eval time = 18013.80 ms / 128 tokens ( 140.73 ms per token, 7.11 tokens per second)
total time = 20089.32 ms / 146 tokensand with our c2c-moe flag we get: build-c2c/bin/llama-server \
-m /bartowski--Qwen_Qwen3.5-122B-A10B-GGUF
-ngl all \
--c2c-moe \
--no-mmap \
--host 0.0.0.0 \
--port 8080 \
--ctx-size 12288 \
--reasoning off
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 97280 MiB):
Device 0: NVIDIA GH200 480GB, compute capability 9.0, VMM: yes, VRAM: 97280 MiB
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build_info: b9108-928b486b0
system_info: n_threads = 72 (n_threads_batch = 72) / 72 | CUDA : ARCHS = 900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | SVE = 1 | DOTPROD = 1 | SVE_CNT = 16 | OPENMP = 1 | REPACK = 1 |
Running without SSL
init: using 71 threads for HTTP serve
load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 47 repeating layers to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: CUDA0 model buffer size = 5607.73 MiB
load_tensors: CUDA0_C2C_Host model buffer size = 117504.00 MiB
load_tensors: CUDA_Host model buffer size = 772.97 MiB
slot print_timing: id 3 | task 0 |
prompt eval time = 215.34 ms / 18 tokens ( 11.96 ms per token, 83.59 tokens per second)
eval time = 2067.43 ms / 128 tokens ( 16.15 ms per token, 61.91 tokens per second)
total time = 2282.77 ms / 146 tokensI am posting this here because I am wondering if this could be improved further by somebody that actually knows the llamacpp code base better than me and GPT 5.5 as i would love to see optimizations for this particular hardware in the llama.cpp codebase. |
Beta Was this translation helpful? Give feedback.
-
|
HW specs (HP Elitebook):
I would be happy to hear potential speedup tuning params, also something more robust benchmarking ideas. Thanks in advance! |
Beta Was this translation helpful? Give feedback.
-
|
Hi everyone, I'm running a Framework laptop with an AMD 7840U with iGPU 780m + dGPU 7700S with 8GB VRAM + 64Go DDR5. Here is my lmaunch parameters: Here are some other parameters I tried without having a difference in output: I benchmarked the If someone can point to me some optimizations I missed or the methodology to find a better sweet spot (in case I'm doing wrong), I would be really thankful 🙏 |
Beta Was this translation helpful? Give feedback.
-
|
Hello all, this is mine using the "AI marketed ASUS NUC" that I bought on an uninformed impulse.. Hardware spec:
Host OS: Ubuntu 26 Llama-server docker compose: services:
llama-server:
image: ghcr.io/ggml-org/llama.cpp:server-intel
container_name: llama-server
restart: unless-stopped
ports:
- "8090:8080"
volumes:
- ~/docker/llm/models:/models:ro
- models-cache:/root/.cache/llama.cpp
- ~/docker/llm/models/models-preset-default.ini:/models-preset.ini:ro
devices:
- /dev/dri:/dev/dri
command:
- --models-dir
- /models
- --models-preset
- /models-preset.ini
- --host
- 0.0.0.0
- --port
- "8080"
- --jinja
- --chat-template-kwargs
- '{"enable_thinking":false}'
- --cont-batching
volumes:
models-cache:Model: Qwen3-4B-Instruct-2507-Q4_K_M Use-case: 3 users, I'm using Home Assistant and am using llm to power its voice assistant. Objective way to evaluate current performance: It's working now, but the more speed I could get out of it, the better of course. |
Beta Was this translation helpful? Give feedback.
-
|
Hello everyone! I’m really excited to see this thread. I’ve run into quite a few issues while setting up llama.cpp, so I’d like to share my setup and runtime details below. Hope the experienced users here can give me some advice. Hardware Specs First llama-server launch command Second llama-server launch command (running concurrently on the same machine) My English is not perfect, so I translated this post myself. Any feedback or suggestions are highly appreciated! |
Beta Was this translation helpful? Give feedback.
-
|
Late to the party but this is an awesome thread! @ggerganov your software has changed my life for the better. Major thank you to all the Llama.cpp contributors. I'm running Unsloth Qwen 3.6 27B MTP in Q8 on an RTX Pro 4500 and a modded 4090D (80GB VRAM combined). My use case for the inference is Hermes Agent writing reports and maintaining my k8s cluster. Here are my params. I'm seeing between 50 and 10 t/s for generation and between 1700 and 500 t/s for prompt eval. Generation averaged 33 t/s over the last 24 hours. |
Beta Was this translation helpful? Give feedback.
-
|
That's a great idea. I was running with b = 4096 ub = 1024 probably seen somewhere on the internet, and was getting around 70k ctx and 15tks with Gemma 4 31B, until I came across ServeurpersoCom config here #23502 and tried his config with b = 128 ub = 512 and I got 80k ctx and 18tks, prefill is faster too. |
Beta Was this translation helpful? Give feedback.
-
|
Hello, I'm running Qwen3.5-9B on M4 24GB and consistently the decode speed is at ~10 tok/s with 82s TTFT for large prompts, despite efficient prefill (~174 tok/s). I got the TTFT and decode metrics from the llama-server logs and I'm not sure what the difference is between TTFT and what the actual time from prompt to first character is but it takes anywhere from 3-6 minutes for fist character from first prompt in fresh session. After the first prompt it responds in <30s or faster depending on prompt and context. But that first prompt eats up to as much as 50k tokens and up to 6 minutes before first character. I was trying to Optimize llama.cpp performance for Qwen3.5-9B-Q5_K_M.gguf with 65,536 context length Hermes v0.15.1 on M4 24GB. Running Hermes Agent directly (no Docker). Decode speed bottlenecked at ~10 tok/s regardless of optimization attempts Current llama-server configuration: report - https://paste.rs/KH5Hh Is this expected behavior on M4 24GB, or should decode be faster? Note: Thanks! |
Beta Was this translation helpful? Give feedback.

Uh oh!
There was an error while loading. Please reload this page.
-
Overview
Are you using
llama.cppand wondering if you are getting the most out of your hardware?Post your parameters below and get some help from the community to improve the performance. Sometimes, adjusting a few parameters can make a big difference in terms of speed and/or quality.
Information needed:
llama-servercommand that you are currently usingllama-bench, but could be something else depending on the use case)Beta Was this translation helpful? Give feedback.
All reactions