
Slow decode on RTX 4090 and Windows #55

@gtrak


I see inconsistent GPU usage, 100% CPU usage across all 16 cores of a 7950X3D, and gallocr_realloc log spam. I got 90 tok/s at 16k context and 55 tok/s at 32k context, running like this: python scripts/server.py --port 1234 --max-ctx 32768
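To put the two throughput figures on a per-token basis, here is a small conversion sketch. It only uses the 90 tok/s and 55 tok/s numbers reported above; the helper name ms_per_token is made up for illustration and is not part of the project.

```python
def ms_per_token(tok_per_s: float) -> float:
    """Convert a throughput in tokens/second to a latency in ms/token."""
    return 1000.0 / tok_per_s


# Figures reported in this issue (not re-measured here).
lat_16k = ms_per_token(90)  # 16k context
lat_32k = ms_per_token(55)  # 32k context

print(f"16k context: {lat_16k:.1f} ms/token")            # 11.1
print(f"32k context: {lat_32k:.1f} ms/token")            # 18.2
print(f"extra cost:  {lat_32k - lat_16k:.1f} ms/token")  # 7.1
```

So going from 16k to 32k context costs roughly 7 ms more per decoded token on this setup.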

What MinMax 2.5 thinks is the issue:

Root cause: Every speculative decoding step builds a new ggml compute graph with a different number of nodes, which triggers ggml_gallocr_reserve and reallocates the GPU buffer (~1.2 GB) each time. This reallocation, not the attention computation itself, is the performance bottleneck.
Why it happens: The DDTree structure varies per iteration based on the accepted tokens, so the graph topology changes constantly, which prevents buffer reuse.
The fix would be: Pre-reserve the maximum buffer size once (call ggml_gallocr_reserve with the largest possible graph) rather than re-reserving on every step. This requires estimating the maximum graph size upfront.
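The difference between the two strategies described above can be illustrated with a toy model. This is a sketch only: ToyGraphAllocator is a hypothetical stand-in and does not reproduce ggml's allocator, the per-step node counts are invented, and whether ggml can actually reuse one reservation across differing graph topologies is an assumption this issue has not confirmed.

```python
from dataclasses import dataclass


@dataclass
class ToyGraphAllocator:
    """Toy stand-in for a graph allocator; NOT ggml's gallocr."""
    reserved_nodes: int = 0  # node count the current plan was built for
    reserve_calls: int = 0   # number of (costly) re-plans so far

    def reserve(self, n_nodes: int) -> None:
        self.reserved_nodes = n_nodes
        self.reserve_calls += 1

    def alloc_exact(self, n_nodes: int) -> None:
        # Policy A: re-plan whenever the node count differs from the plan.
        if n_nodes != self.reserved_nodes:
            self.reserve(n_nodes)

    def alloc_within_max(self, n_nodes: int) -> None:
        # Policy B: re-plan only if the graph exceeds the reserved maximum.
        if n_nodes > self.reserved_nodes:
            self.reserve(n_nodes)


# Hypothetical node counts for seven speculative decoding steps.
steps = [1200, 1350, 1100, 1500, 1275, 1500, 1420]

per_step = ToyGraphAllocator()
for n in steps:
    per_step.alloc_exact(n)

reserve_once = ToyGraphAllocator()
reserve_once.reserve(max(steps))  # one upfront reservation for the largest graph
for n in steps:
    reserve_once.alloc_within_max(n)

print(per_step.reserve_calls, reserve_once.reserve_calls)  # 7 1
```

In this model the per-step policy re-plans on every step because no two consecutive graphs have the same size, while the upfront reservation is paid once. The real cost depends on what a reserve actually does on the CUDA backend, which would need profiling to confirm.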

-- Selecting Windows SDK version 10.0.22621.0 to target Windows 10.0.26200.
-- The C compiler identification is MSVC 19.44.35222.0
-- The CXX compiler identification is MSVC 19.44.35222.0
-- The CUDA compiler identification is NVIDIA 13.0.48 with host compiler MSVC 19.44.35222.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: C:/Program Files/Microsoft Visual Studio/2022/Community/VC/Tools/MSVC/14.44.35207/bin/Hostx64/x64/cl.exe - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: C:/Program Files/Microsoft Visual Studio/2022/Community/VC/Tools/MSVC/14.44.35207/bin/Hostx64/x64/cl.exe - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v13.0/bin/nvcc.exe - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- The ASM compiler identification is GNU
-- Found assembler: C:/ProgramData/mingw64/mingw64/bin/gcc.exe
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - not found
-- Found Threads: TRUE
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: AMD64
-- CMAKE_GENERATOR_PLATFORM: x64
-- GGML_SYSTEM_ARCH: x86
-- Including CPU backend
-- Found OpenMP_C: -openmp (found version "2.0")
-- Found OpenMP_CXX: -openmp (found version "2.0")
CMake Warning at deps/llama.cpp/ggml/src/ggml-cpu/CMakeLists.txt:84 (message):
  OpenMP not found
Call Stack (most recent call first):
  deps/llama.cpp/ggml/src/CMakeLists.txt:446 (ggml_add_cpu_backend_variant_impl)
-- Could NOT find OpenMP_CUDA (missing: OpenMP_CUDA_FLAGS OpenMP_CUDA_LIB_NAMES) 
-- Could NOT find OpenMP (missing: OpenMP_CUDA_FOUND) (found version "2.0")
-- x86 detected
-- Performing Test HAS_AVX_1
-- Performing Test HAS_AVX_1 - Success
-- Performing Test HAS_AVX2_1
-- Performing Test HAS_AVX2_1 - Success
-- Performing Test HAS_FMA_1
-- Performing Test HAS_FMA_1 - Success
-- Performing Test HAS_AVX512_1
-- Performing Test HAS_AVX512_1 - Success
-- Adding CPU backend variant ggml-cpu: /arch:AVX512 GGML_AVX512
-- Found CUDAToolkit: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v13.0/include;C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v13.0/include/cccl (found version "13.0.48")
-- CUDA Toolkit found
-- Using CMAKE_CUDA_ARCHITECTURES=89 CMAKE_CUDA_ARCHITECTURES_NATIVE=89-real
-- Could NOT find NCCL (missing: NCCL_LIBRARY NCCL_INCLUDE_DIR) 
-- Warning: NCCL not found, performance for multiple CUDA GPUs will be suboptimal
-- Including CUDA backend
-- ggml version: 0.9.11
-- ggml commit:  b6ffab4a9
-- Could NOT find OpenMP_CUDA (missing: OpenMP_CUDA_FLAGS OpenMP_CUDA_LIB_NAMES) 
-- Could NOT find OpenMP (missing: OpenMP_CUDA_FOUND) (found version "2.0")


Labels: bug (Something isn't working), question (Further information is requested)
