I see inconsistent GPU usage, 100% CPU usage across all 16 cores of a 7950X3D, and gallocr_realloc log spam. I get 90 tok/s at 16k context and 55 tok/s at 32k context when running: python scripts/server.py --port 1234 --max-ctx 32768
What MinMax 2.5 thinks is the issue:
Root cause: Every speculative decoding step builds a new ggml compute graph with a different number of nodes, triggering ggml_gallocr_reserve which reallocates the GPU buffer (~1.2GB) each time. This is the performance bottleneck, not the attention computation itself.
Why it happens: The DDTree structure varies per iteration based on accepted tokens, so the graph topology changes constantly, preventing buffer reuse.
The fix would be: Reserve the maximum buffer size once (call ggml_gallocr_reserve with the largest possible graph) rather than reserving again on every step. This requires estimating the maximum graph size upfront.
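To make the suggested fix concrete, here is a toy model of the idea in Python. It is not ggml's API: ToyGraphAllocator, run, the node counts, and the bytes-per-node figure are all made up for illustration. It also assumes a grow-only buffer, meaning it is only replaced when a request exceeds the current capacity; whether ggml_gallocr_reserve behaves that way should be checked against the source.

```python
class ToyGraphAllocator:
    """Hypothetical stand-in for a graph allocator; not ggml's real API."""

    def __init__(self):
        self.capacity = 0        # bytes currently backing the buffer
        self.reallocations = 0   # how many times the buffer was replaced

    def reserve(self, nbytes):
        # Assumption: grow-only. Replace the buffer only if the request
        # does not fit in what is already allocated.
        if nbytes > self.capacity:
            self.capacity = nbytes
            self.reallocations += 1

    def alloc_graph(self, n_nodes, bytes_per_node):
        # Each decoding step asks for room for its own graph.
        self.reserve(n_nodes * bytes_per_node)


def run(step_node_counts, bytes_per_node, prereserve_nodes=None):
    """Return the number of buffer replacements that happen during the steps."""
    alloc = ToyGraphAllocator()
    if prereserve_nodes is not None:
        # One-time reservation for the worst-case graph, before decoding.
        alloc.reserve(prereserve_nodes * bytes_per_node)
    setup = alloc.reallocations
    for n in step_node_counts:
        alloc.alloc_graph(n, bytes_per_node)
    return alloc.reallocations - setup


# Made-up per-step node counts standing in for a varying tree shape.
steps = [40, 64, 52, 80, 72, 96]
lazy = run(steps, 1024)                                  # reserve on demand
eager = run(steps, 1024, prereserve_nodes=max(steps))    # reserve max once
print("reallocations without pre-reserve:", lazy)
print("reallocations with pre-reserve:", eager)
```

In this model the on-demand run replaces the buffer every time a step needs more room than any earlier step, while the pre-reserved run never replaces it during decoding. It says nothing about the cost of re-planning the graph itself, which may be a separate source of CPU load.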
-- Selecting Windows SDK version 10.0.22621.0 to target Windows 10.0.26200.
-- The C compiler identification is MSVC 19.44.35222.0
-- The CXX compiler identification is MSVC 19.44.35222.0
-- The CUDA compiler identification is NVIDIA 13.0.48 with host compiler MSVC 19.44.35222.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: C:/Program Files/Microsoft Visual Studio/2022/Community/VC/Tools/MSVC/14.44.35207/bin/Hostx64/x64/cl.exe - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: C:/Program Files/Microsoft Visual Studio/2022/Community/VC/Tools/MSVC/14.44.35207/bin/Hostx64/x64/cl.exe - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v13.0/bin/nvcc.exe - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- The ASM compiler identification is GNU
-- Found assembler: C:/ProgramData/mingw64/mingw64/bin/gcc.exe
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - not found
-- Found Threads: TRUE
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: AMD64
-- CMAKE_GENERATOR_PLATFORM: x64
-- GGML_SYSTEM_ARCH: x86
-- Including CPU backend
-- Found OpenMP_C: -openmp (found version "2.0")
-- Found OpenMP_CXX: -openmp (found version "2.0")
CMake Warning at deps/llama.cpp/ggml/src/ggml-cpu/CMakeLists.txt:84 (message):
  OpenMP not found
Call Stack (most recent call first):
  deps/llama.cpp/ggml/src/CMakeLists.txt:446 (ggml_add_cpu_backend_variant_impl)
-- Could NOT find OpenMP_CUDA (missing: OpenMP_CUDA_FLAGS OpenMP_CUDA_LIB_NAMES)
-- Could NOT find OpenMP (missing: OpenMP_CUDA_FOUND) (found version "2.0")
-- x86 detected
-- Performing Test HAS_AVX_1
-- Performing Test HAS_AVX_1 - Success
-- Performing Test HAS_AVX2_1
-- Performing Test HAS_AVX2_1 - Success
-- Performing Test HAS_FMA_1
-- Performing Test HAS_FMA_1 - Success
-- Performing Test HAS_AVX512_1
-- Performing Test HAS_AVX512_1 - Success
-- Adding CPU backend variant ggml-cpu: /arch:AVX512 GGML_AVX512
-- Found CUDAToolkit: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v13.0/include;C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v13.0/include/cccl (found version "13.0.48")
-- CUDA Toolkit found
-- Using CMAKE_CUDA_ARCHITECTURES=89 CMAKE_CUDA_ARCHITECTURES_NATIVE=89-real
-- Could NOT find NCCL (missing: NCCL_LIBRARY NCCL_INCLUDE_DIR)
-- Warning: NCCL not found, performance for multiple CUDA GPUs will be suboptimal
-- Including CUDA backend
-- ggml version: 0.9.11
-- ggml commit: b6ffab4a9
-- Could NOT find OpenMP_CUDA (missing: OpenMP_CUDA_FLAGS OpenMP_CUDA_LIB_NAMES)
-- Could NOT find OpenMP (missing: OpenMP_CUDA_FOUND) (found version "2.0")