Skip to content

Commit a83c73a

Browse files
authored
[CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full (ggml-org#19042)
* [CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full With pipeline parallelism, during prompt processing, the CPU-side CUDA command buffer gets full, stalling the CPU. Due to this, enough work doesn't get submitted to the GPU, causing bubbles in the GPU timeline. Fix this by setting the CUDA environment variable CUDA_SCALE_LAUNCH_QUEUES to 4x to increase the command buffer size. * Set the env variable in the CUDA backend registry allocation * Add link to PR in code comment * Remove warning logs and update documentation
1 parent fc3cdf3 commit a83c73a

2 files changed

Lines changed: 18 additions & 0 deletions

File tree

docs/build.md

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -248,6 +248,14 @@ You may set the [cuda environmental variables](https://docs.nvidia.com/cuda/cuda
248248
CUDA_VISIBLE_DEVICES="-0" ./build/bin/llama-server --model /srv/models/llama.gguf
249249
```
250250

251+
#### CUDA_SCALE_LAUNCH_QUEUES
252+
253+
The environment variable [`CUDA_SCALE_LAUNCH_QUEUES`](https://docs.nvidia.com/cuda/cuda-programming-guide/05-appendices/environment-variables.html#cuda-scale-launch-queues) controls the size of CUDA's command buffer, which determines how many GPU operations can be queued before the CPU must wait for the GPU to catch up. A larger buffer reduces CPU-side stalls and allows more work to be queued on a GPU.
254+
255+
**Default behavior:** llama.cpp automatically sets `CUDA_SCALE_LAUNCH_QUEUES=4x`, which increases the CUDA command buffer to 4 times its default size. This optimization is particularly beneficial for **Multi-GPU setups with pipeline parallelism**, where it significantly improves prompt processing throughput by allowing more operations to be enqueued across GPUs.
256+
257+
See PR [#19042](https://github.com/ggml-org/llama.cpp/pull/19042) for performance benchmarks and technical details.
258+
251259
### Unified Memory
252260

253261
The environment variable `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` can be used to enable unified memory in Linux. This allows swapping to system RAM instead of crashing when the GPU VRAM is exhausted. In Windows this setting is available in the NVIDIA control panel as `System Memory Fallback`.

ggml/src/ggml-cuda/ggml-cuda.cu

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4876,6 +4876,16 @@ ggml_backend_reg_t ggml_backend_cuda_reg() {
48764876
static std::mutex mutex;
48774877
std::lock_guard<std::mutex> lock(mutex);
48784878
if (!initialized) {
4879+
// Set CUDA_SCALE_LAUNCH_QUEUES before any CUDA API call to improve multi-GPU pipeline parallelism performance
4880+
// PR: https://github.com/ggml-org/llama.cpp/pull/19042
4881+
if (getenv("CUDA_SCALE_LAUNCH_QUEUES") == nullptr) {
4882+
#ifdef _WIN32
4883+
_putenv_s("CUDA_SCALE_LAUNCH_QUEUES", "4x");
4884+
#else
4885+
setenv("CUDA_SCALE_LAUNCH_QUEUES", "4x", 0); // don't overwrite if already set
4886+
#endif // _WIN32
4887+
}
4888+
48794889
ggml_backend_cuda_reg_context * ctx = new ggml_backend_cuda_reg_context;
48804890
const int min_batch_size = getenv("GGML_OP_OFFLOAD_MIN_BATCH") ? atoi(getenv("GGML_OP_OFFLOAD_MIN_BATCH")) : 32;
48814891

0 commit comments

Comments
 (0)