fix: server /inference fails to decode in-memory audio (regression) by ServeurpersoCom · Pull Request #3818 · ggml-org/whisper.cpp

ServeurpersoCom · 2026-05-21T18:53:21Z

fix: server /inference fails to decode in-memory audio (regression)

whisper-server /inference returns HTTP 400 on any valid WAV (including samples/jfk.wav) for builds without WHISPER_FFMPEG and without --convert: "error: failed to read audio data from (RIFF...)".

Cause: the handler passes the uploaded bytes audio_file.content to read_audio_data, which since the miniaudio migration only handles a filename via ma_decoder_init_file. It tries to open a path starting with RIFF and fails.

Regression introduced by 7d3da68 (#2759, "examples : use miniaudio for direct decoding"), which dropped the is_wav_buffer + drwav_init_memory in-memory path that read_wav previously used. #3707 later added the 400 status that surfaced it as Invalid request.

Fix: factor the PCM extraction into a shared helper and add a read_audio_data(const char* buffer, size_t size, ...) overload that decodes from memory via ma_decoder_init_memory (already used by the function for the stdin path). server.cpp calls it with the upload content. The filename overload is unchanged.

Verified: samples/jfk.wav now returns 200 on the /inference endpoint instead of 400, and transcription quality checked against my production setup.

AI Usage: Opus 4.7 + MCP computer use (rootless container with GPU)

whisper-server /inference without --convert passed the uploaded file bytes to read_audio_data as a filename, so ma_decoder_init_file tried to open a path starting with "RIFF" and failed. every request returned HTTP 400 "Invalid request" on builds without WHISPER_FFMPEG, which is the default. factor the PCM extraction into a shared helper and add an overload that decodes straight from a memory buffer via ma_decoder_init_memory, which the function already used for the stdin path. server now calls it with the upload content. the filename overload behavior is unchanged.

danbev · 2026-05-22T06:26:32Z

@ServeurpersoCom Thanks!
Merging without running CI again (verified the changes locally).

ServeurpersoCom · 2026-05-22T06:34:20Z

I use it daily on a Telegram bot to transcribe all audio messages, and as a source for GGML voice cloning pipeline and llama.cpp input.

With the following WAV format:

Audio
Format                                   : PCM
Paramètres du format                     : Little / Signed
Identifiant du codec                     : 1
Durée                                    : 44s 485 ms
Type de débit                            : Constant
Débit                                    : 256 kb/s
Canaux                                   : 1 canal
Echantillonnage                          : 16,0 kHz
Profondeur des couleurs                  : 16 bits
Taille du flux                           : 1,36 Mio (100%)

…ml to ggml-org upstream to (#33) * ruby : transcribe without GVL, accept more MemoryViews, Windows support, fix memory size report, improve document (#3775) * Change MemoryView example using NDAV * Add note on audio attributes for #full and #full_parallel * Support more variants of MemoryView * Use IO.popen instead of Kernel.` for Windows compatibility * Use cmake's -C option instead of multiple -D options * Fix memsize calculation * Remove unused argument * Add is_interrupted field to abort callback container * Fix RBS syntax * Address document comment for RDoc * Add .document for RDoc * Add .rdoc_options * Run #full without GVL * Initialize callbacks with nil * Specify implicity Whisper::Params to distinguish from Whisper::Context::Params * Run callbacks without GVL * Call log callback with GVL * Run full_parallel without GVL * Run transcribe without GVL * Fix ruby_whisper_lock_gvl and ruby_whisper_unlock_gvl * Fix return value of encoder_begin_callback * Report GVL unlocking from transcribe * Remove unused interface * Restore overload of full_parallel * Close process * Fix struct name * Make is_without_gvl thread local * Use rb_thread_call_with_gvl instead of global variable * Retrieve instance variable in GVL * Narrow acceptable MemoryView format * Fix option cache path * Reduce files in package * Use append_cflags * Add ext/*.rb to task dependencies * Use copy instead of cp * Make TestPackage more portable * Patch for lower version Ruby * Make build scripts more portable * Add Windows support * Don't raise exceptions * whisper : fix incorrect timestamps, usually near silences (#2279) * Incorrect timetstamps Fixes #2271 - Adds consecutive timestamps after end of last segment as the new starting ts - Add these timestamp to output when "print-special" enabled - Fixes fflush usage in live reporting I was not able to test this with the special "token_timestamps" option. * Skip initial timestamp * server: Add support for controlling token_timestamps directly (#3785) * whisper : fix max_tokens skipping remaining audio (#3798) * whisper: fix max_tokens skipping remaining audio * add PR reference comment as suggested Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * fix(ci): enable artifact overwrite * server: fix params leak between requests (#3784) * server : fix no_speech_thold not being read (#3783) * opencl: Adreno optimization for MoE - MxFP4 (llama/22301) * MoE Mxfp4 CLC kernel added, router reorder on GPU * Pass test-backend-ops for MoE mxfp4 Adreno CLC * remove putenv in llama-model.cpp * fix indent style and whitespace * opencl: remove unnecessary headers * opencl: do not save cl_program objects * opencl: remove unnecessary assert * fix precision issue --------- Co-authored-by: Li He <lih@qti.qualcomm.com> * ggml-virtgpu: fix circular dependency in headers (llama/22557) * fix: CUDA device PCI bus ID de-dupe OOMing (ignoring other 3 gpus entirely) (llama/22533) * fix: CUDA device PCI bus ID detection for multi-GPU de-dupe * HIP, MUSA macros --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * ggml-webgpu: add layer norm ops (llama/22406) * shader(norm): add layer norm ops * shader(norm): stablize floating point computation with Kahan summation and handle mixed types * shader(norm): remove the non-contiguous strides * shader(norm): use the original implementation rather than the kahan summation * vulkan: delete dead GGML_VK_MAX_NODES def (llama/22621) * CUDA: use fastdiv for batch index split in get_rows (llama/22650) * kleidiai : update to v1.24.0 and use release archive (llama/22549) * ggml : implement fast walsh-hadamard transform for kv rotation (#21352) (llama/22631) * llama : add option to save memory in device buffers (llama/22679) * llama : add option to save memory in device buffers * tests : extend llama-save-load-state * ggml : bump version to 0.11.0 (ggml/1478) * rpc : use graph uid instead of graph cache (llama/22701) Store the last graph uid and compare against it to determine if the same graph is being computed. * opencl: refactor Adreno q4_0 (llama/22335) * Hexagon: Process M-tail rows on HMX instead of HVX (llama/22724) * hex-mm: process m-tail rows on HMX instead of HVX * hmx-mm: unroll and optimize padded activation loop --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com> * ggml : use `CL_DEVICE_GLOBAL_MEM_SIZE` as memory estimate for OpenCL --fit (llama/22688) * ggml : report estimated OpenCL memory for --fit Signed-off-by: Florian Reinle <f.reinle@otec.de> * ggml : estimated OpenCL memory backend integrated Signed-off-by: Florian Reinle <f.reinle@otec.de> --------- Signed-off-by: Florian Reinle <f.reinle@otec.de> * ggml-cpu: fuse RMS_NORM + MUL on CPU backend (llama/22423) * ggml-cpu: Optimized risc-v cpu q1_0 dot * sycl: add FILL, CUMSUM, DIAG, SOLVE_TRI, SSM_SCAN, GATED_DELTA_NET (llama/22149) * sycl: add FILL, CUMSUM, DIAG, SOLVE_TRI, SSM_SCAN, GATED_DELTA_NET Signed-off-by: Chun Tao <chun.tao@intel.com> * Fix abort during test-backend-ops Signed-off-by: Todd Malsbary <todd.malsbary@intel.com> * Regenerate ops.md Signed-off-by: Todd Malsbary <todd.malsbary@intel.com> * Add scope_dbg_print to newly added SYCL ops. Also add scope_dbg_print to existing ssm_conv op. Signed-off-by: Todd Malsbary <todd.malsbary@intel.com> --------- Signed-off-by: Chun Tao <chun.tao@intel.com> Signed-off-by: Todd Malsbary <todd.malsbary@intel.com> Co-authored-by: Chun Tao <chun.tao@intel.com> Co-authored-by: Todd Malsbary <todd.malsbary@intel.com> * opencl: add opfilter regex for debugging (llama/22782) * llama : fix device state save/load (llama/22805) * CUDA: batch out_prod inner loop with cublasSgemmStridedBatched (llama/22651) * CUDA: batch out_prod inner loop with cublasSgemmStridedBatched * CUDA: batch out_prod inner loop with cublasSgemmStridedBatched * CUDA: add cublasSgemmStridedBatched mapping for HIP and MUSA backends * opencl: add q4_0 MoE GEMM for Adreno (llama/22731) * Q4_0 MoE CLC pass sanity check * release program * opencl: fix whitespace * opencl: remove unused cl_program * opencl: break #if block to make it more clear * opencl: adjust format --------- Co-authored-by: Li He <lih@qti.qualcomm.com> * ggml: update SCHED_DEBUG output to use ggml_op_desc() (llama/22825) * vulkan: fix spv shadowing (llama/22760) * CUDA: lower-case PCI bus id, standardize for ggml (llama/22820) * cuda: fuse snake activation (mul, sin, sqr, mul, add) (llama/22667) * cuda: fuse snake activation (mul, sin, sqr, mul, add) Add ggml_cuda_op_snake_fused with F32 / F16 / BF16 templates. The matcher recognizes the naive 5 op decomposition emitted by audio decoders (BigVGAN, Vocos) for snake activation y = x + sin(a*x)^2 * inv_b and rewrites it to a single elementwise kernel. Add test_snake_fuse comparing CPU naive vs CUDA fused across F32 / F16 / BF16. * cuda: address review feedback from @am17an Use ggml_cuda_cast for F32/F16/BF16 conversions and rename kernel_snake to snake_kernel to match upstream conventions. * cuda: snake fusion fastdiv on T_len, Suggested-by: @am17an * Update tests/test-backend-ops.cpp Co-authored-by: Aman Gupta <amangupta052@gmail.com> * cuda: snake fusion check add->type matches x->type Address review feedback from @am17an * cuda: snake fusion check add->type matches x->type Moved for readability (equivalent) Address review feedback from @am17an --------- Co-authored-by: Aman Gupta <amangupta052@gmail.com> * Feature hexagon l2 norm (llama/22816) * L2_NORM Updates * Addressed PR Comments * ggml-hexagon: add L2_NORM HVX kernel for Hexagon backend * hex-unary: remove supported_unary_nc since the outer loop is the same for all unary ops --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com> * sycl: support non-contiguous input in PAD op (llama/22148) Signed-off-by: Chun Tao <chun.tao@intel.com> Co-authored-by: Chun Tao <chun.tao@intel.com> Co-authored-by: Todd Malsbary <todd.malsbary@intel.com> * hexagon: add HTP kernel for GGML_OP_GATED_DELTA_NET (llama/22837) Implement the Gated Delta Net recurrence on HVX with: - 4-row fused kernels for PP (prompt processing) path - 8-row fused kernels for TG (token generation) path, reducing K/Q/gate vector reload overhead by 2x - Separate PP/TG thread functions for I-cache isolation - VTCM state scratchpad with DMA in/out for TG single-cycle access - Vectorized gate exp via hvx_exp_f32 * Add flash attention MMA / Tiles to support MiMo-V2.5 (llama/22812) * mimo-v2.5: add flash attention mma/tiles for for d_kq=192 d_v=128 * mimo-v2.5: follow (256, 256) fattn templates * mimo-v2.5: cleanup comments * mimo-v2.5: further comment cleanup * mimo-v2.5: address PR feedback fix GQA handling check for other dangling 320/576 carveouts and mirror them for 192 Add to backend ops test so new paths are covered * sycl: Battlemage AOT build via spir64_gen + MMQ subgroup annotations (llama/22147) * sycl: Battlemage AOT build via spir64_gen + MMQ subgroup annotations Signed-off-by: Chun Tao <chun.tao@intel.com> * Remove unneeded/unnecessary comments and annotations The MMQ subgroup annotations added are on functions gated behind ggml_sycl_supports_mmq(). Revisit the need for these annotations when that function changes. --------- Signed-off-by: Chun Tao <chun.tao@intel.com> Co-authored-by: Chun Tao <chun.tao@intel.com> Co-authored-by: Todd Malsbary <todd.malsbary@intel.com> * sycl: Q5_K reorder MMVQ/dequant + Q8_0 reorder MMVQ path (llama/22152) * sycl: Q5_K reorder MMVQ/dequant + Q8_0 reorder MMVQ path Signed-off-by: Chun Tao <chun.tao@intel.com> * Remove duplicate definitions --------- Signed-off-by: Chun Tao <chun.tao@intel.com> Co-authored-by: Chun Tao <chun.tao@intel.com> Co-authored-by: Todd Malsbary <todd.malsbary@intel.com> * Add BF16 support to GET_ROWS operation (llama/21391) Add GGML_TYPE_BF16 to the SYCL backend's GET_ROWS operation, both in supports_op and in the kernel dispatch. This fixes a performance regression where models using BF16 embedding tensors (e.g., Gemma4's per_layer_token_embd.weight) fall back to CPU for the GET_ROWS op, causing a full GPU-to-CPU tensor transfer every token. The fix reuses the existing get_rows_sycl_float template with sycl::ext::oneapi::bfloat16, matching the pattern already used for sycl::half (F16) and float (F32). * SYCL: reduce allocation overhead during flash attention (llama/22732) * SYCL: reduce allocation overhead during flash attention * tidy up whitespace * add a note about the flag * move ggml_sycl_fattn_* into fattn-buffers.hpp * refactor implementation into fattn-buffers.cpp * move new_fattn_kv_buffers back into ggml-sycl.cpp * internal AllReduce kernel for CUDA provider (llama/22299) * ggml-cuda: add internal AllReduce provider for tensor parallelism Introduces a NCCL-free AllReduce implementation for LLAMA_SPLIT_MODE_TENSOR using a single-phase CUDA kernel that pipelines D2H copy, cross-GPU handshake via pinned-memory volatile flags, and the reduction in one kernel launch per GPU. New files: - ggml/src/ggml-cuda/comm.cuh — ggml_cuda_allreduce_provider enum - ggml/src/ggml-cuda/allreduce.cuh — pipeline API declarations - ggml/src/ggml-cuda/allreduce.cu — kernel + pipeline init/dispatch ggml-cuda.cu changes: - ggml_backend_cuda_comm_context gains ar_pipeline field - Provider selection via GGML_CUDA_ALLREDUCE env var ("nccl" / "internal") - INTERNAL provider initialises the pipeline at comm_init time - Dispatch routes to ggml_cuda_ar_allreduce(); falls back to meta-backend CPU reduce for unsupported sizes or GPU counts (> 2) Current scope: 2 GPUs, FP32, tensors <= 256 KB. Notes in NOTES-allreduce.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * llama-bench: add --allreduce flag to select AllReduce provider Adds --allreduce <auto|nccl|internal> to llama-bench (and via the shared field pattern, consistent with other multi-value flags). Useful for isolating hangs or regressions in tensor-parallel mode: pass --allreduce nccl to force NCCL and bypass the internal provider. Also fixes ggml_cuda_select_allreduce_provider() to treat an empty GGML_CUDA_ALLREDUCE env var the same as unset (avoids spurious warning when llama-bench sets it to "" for the "auto" case). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> xt gains ar_pipeline field - Provider selection via GGML_CUDA_ALLREDUCE env var ("nccl" / "internal") - INTERNAL provider initialises the pipeline at comm_init time - Dispatch routes to ggml_cuda_ar_allreduce(); falls back to meta-backend CPU reduce for unsupported sizes or GPU counts (> 2) Current scope: 2 GPUs, FP32, tensors <= 256 KB. Notes in NOTES-allreduce.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * llama-bench: rename --allreduce to --reduction-provider / -rp Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> via the shared field pattern, consistent with other multi-value flags). Useful for isolating hangs or regressions in tensor-parallel mode: pass --allreduce nccl to force NCCL and bypass the internal provider. Also fixes ggml_cuda_select_allreduce_provider() to treat an empty GGML_CUDA_ALLREDUCE env var the same as unset (avoids spurious warning when llama-bench sets it to "" for the "auto" case). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> xt gains ar_pipeline field - Provider selection via GGML_CUDA_ALLREDUCE env var ("nccl" / "internal") - INTERNAL provider initialises the pipeline at comm_init time - Dispatch routes to ggml_cuda_ar_allreduce(); falls back to meta-backend CPU reduce for unsupported sizes or GPU counts (> 2) Current scope: 2 GPUs, FP32, tensors <= 256 KB. Notes in NOTES-allreduce.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * llama-bench: pass WARN/ERROR log messages through in non-verbose mode The null log callback was silently dropping all messages. WARN and ERROR should always be visible since they indicate legitimate issues (e.g. a requested reduction provider not being available). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> vider. Also fixes ggml_cuda_select_allreduce_provider() to treat an empty GGML_CUDA_ALLREDUCE env var the same as unset (avoids spurious warning when llama-bench sets it to "" for the "auto" case). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> xt gains ar_pipeline field - Provider selection via GGML_CUDA_ALLREDUCE env var ("nccl" / "internal") - INTERNAL provider initialises the pipeline at comm_init time - Dispatch routes to ggml_cuda_ar_allreduce(); falls back to meta-backend CPU reduce for unsupported sizes or GPU counts (> 2) Current scope: 2 GPUs, FP32, tensors <= 256 KB. Notes in NOTES-allreduce.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * cmake: improve NCCL detection for source-tree builds, add static/dynamic switch FindNCCL.cmake now searches the cmake source-build layout used by the Windows NCCL port (cmake/lib/Release for static, cmake/src/Release for dynamic import lib) and also checks src/include for the generated nccl.h header. New option GGML_CUDA_NCCL_STATIC (default OFF) selects static vs dynamic linking and controls which paths and library names are searched. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> for the "auto" case). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> xt gains ar_pipeline field - Provider selection via GGML_CUDA_ALLREDUCE env var ("nccl" / "internal") - INTERNAL provider initialises the pipeline at comm_init time - Dispatch routes to ggml_cuda_ar_allreduce(); falls back to meta-backend CPU reduce for unsupported sizes or GPU counts (> 2) Current scope: 2 GPUs, FP32, tensors <= 256 KB. Notes in NOTES-allreduce.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * ggml-cuda: add AllReduce hang watchdog (GGML_CUDA_AR_WATCHDOG) When compiled with -DGGML_CUDA_AR_WATCHDOG=ON, uses a debug kernel variant that writes per-GPU spin diagnostics to pinned host memory. A host-side blocking poll (cudaEventQuery + volatile reads) detects hangs and logs WARN with the last observed arrival counters and spin counts, controlled by GGML_CUDA_AR_WATCHDOG (ms timeout) and GGML_CUDA_AR_MAX_SPIN (kernel bailout) env vars at runtime. Zero overhead on the production path — all debug code is behind #ifdef. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> ar_pipeline field - Provider selection via GGML_CUDA_ALLREDUCE env var ("nccl" / "internal") - INTERNAL provider initialises the pipeline at comm_init time - Dispatch routes to ggml_cuda_ar_allreduce(); falls back to meta-backend CPU reduce for unsupported sizes or GPU counts (> 2) Current scope: 2 GPUs, FP32, tensors <= 256 KB. Notes in NOTES-allreduce.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * ggml-cuda: fix intermittent AllReduce hang on Blackwell PCIe Add __threadfence_system() before the arrival signal write in signal_set to ensure D2H data is globally visible before the peer observes the arrival flag. Without this fence, the peer could enter Phase 3 host reads before the data had fully landed, causing an intermittent deadlock on RTX 5090 (Blackwell, PCIe-only). Also redesign the watchdog from a blocking dispatch-thread poll to a non-blocking background thread, eliminating the ~20ms per-slot latency the old design added. Verified: 30/30 soak test runs clean at ~50 t/s (previously ~1-in-15 hang rate). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> - INTERNAL provider initialises the pipeline at comm_init time - Dispatch routes to ggml_cuda_ar_allreduce(); falls back to meta-backend CPU reduce for unsupported sizes or GPU counts (> 2) Current scope: 2 GPUs, FP32, tensors <= 256 KB. Notes in NOTES-allreduce.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * ggml-cuda: fix watchdog shutdown ordering and pipeline_free drain - Stop watchdog thread BEFORE destroying GPU resources (events, streams) to prevent polling destroyed handles → spurious "busy" readings - Add cudaStreamSynchronize in pipeline_free to drain in-flight kernels before freeing pinned host buffers they may still be reading - Sleep-first watchdog polling: no +0ms noise, only logs when a kernel is genuinely stuck past the poll interval - Check wdog_stop in both outer and inner loops so join() returns promptly instead of draining the entire queue - Add Phase 3 breadcrumbs to debug[3] for hang localization Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> RNAL provider initialises the pipeline at comm_init time - Dispatch routes to ggml_cuda_ar_allreduce(); falls back to meta-backend CPU reduce for unsupported sizes or GPU counts (> 2) Current scope: 2 GPUs, FP32, tensors <= 256 KB. Notes in NOTES-allreduce.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * ggml-cuda: replace event-based watchdog with per-GPU ring buffer Completely rework the GGML_CUDA_AR_WATCHDOG system: - Replace the shared debug_buf + event-polling + queue design with per-GPU ring buffers in pinned host memory - Kernel writes a debug record only on spin-limit bailout: claims a ring slot via atomicAdd (single-GPU host atomics work on RTX 5090), writes fields, fences, sets completion flag, then all threads exit - Watchdog thread simply polls ring head counters every 1ms and prints any new complete records — no CUDA event queries, no mutex, no queue - Zero overhead on the dispatch path (no queue posting, no memset) - Watchdog shutdown returns within ~1ms (atomic bool, no drain) - On bailout the kernel skips Phase 3 entirely and exits cleanly Verified: 20/20 prefill soak test clean at ~1112 t/s, no hangs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> P32, tensors <= 256 KB. Notes in NOTES-allreduce.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: normalize line endings to LF (undo Windows CRLF conversion) Five files were inadvertently converted to CRLF by the Windows development environment, causing every line to show as changed in diffs against master. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> imit bailout: claims a ring slot via atomicAdd (single-GPU host atomics work on RTX 5090), writes fields, fences, sets completion flag, then all threads exit - Watchdog thread simply polls ring head counters every 1ms and prints any new complete records — no CUDA event queries, no mutex, no queue - Zero overhead on the dispatch path (no queue posting, no memset) - Watchdog shutdown returns within ~1ms (atomic bool, no drain) - On bailout the kernel skips Phase 3 entirely and exits cleanly Verified: 20/20 prefill soak test clean at ~1112 t/s, no hangs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> P32, tensors <= 256 KB. Notes in NOTES-allreduce.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * .gitattributes: force LF line endings to prevent Windows CRLF conversion Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> elopment environment, causing every line to show as changed in diffs against master. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> imit bailout: claims a ring slot via atomicAdd (single-GPU host atomics work on RTX 5090), writes fields, fences, sets completion flag, then all threads exit - Watchdog thread simply polls ring head counters every 1ms and prints any new complete records — no CUDA event queries, no mutex, no queue - Zero overhead on the dispatch path (no queue posting, no memset) - Watchdog shutdown returns within ~1ms (atomic bool, no drain) - On bailout the kernel skips Phase 3 entirely and exits cleanly Verified: 20/20 prefill soak test clean at ~1112 t/s, no hangs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> P32, tensors <= 256 KB. Notes in NOTES-allreduce.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * ggml-cuda: move GGML_CUDA_AR_WATCHDOG from CMake option to local define The watchdog is development-only; a global CMake option is overkill. Move the toggle to a #define at the top of allreduce.cu (set to 0 by default) and remove the option from ggml/CMakeLists.txt and the CUDA CMakeLists.txt add_compile_definitions block. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> fences, sets completion flag, then all threads exit - Watchdog thread simply polls ring head counters every 1ms and prints any new complete records — no CUDA event queries, no mutex, no queue - Zero overhead on the dispatch path (no queue posting, no memset) - Watchdog shutdown returns within ~1ms (atomic bool, no drain) - On bailout the kernel skips Phase 3 entirely and exits cleanly Verified: 20/20 prefill soak test clean at ~1112 t/s, no hangs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> P32, tensors <= 256 KB. Notes in NOTES-allreduce.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * unify kernel debug paths * use __threadfence_system explicitly (not in ggml_cuda_ar_signal_set) * preferentially use internal reduction for <=2 GPUs * templatize the main kernel to support fp16/bf16 * restore llama-bench.cpp changes * revert CMakeLists changes * remove notes from repo * remove dead warmup code * fix comments * improve reduction provider fallback code * add messages for allreduce fallback * rework reduction provider init to not call ncclCommInitAll if using the internal provider * fix case where a given tensor has not been computed * add chunked mode to the kernel for unlimited vector size * rework a few checks/fallbacks * various small cleanups * allow disabling CUDA reductions completely (falling back to the non-CUDA butterfly mode) * simplify reduction provider selection * minor simplifications * more cleanups/fixes * prototype alternate path for large reductions * chunked version of large reduction path * use bf16 for large reductions * experimental reduction using cudaMemcpyPeerAsync (slightly slower) * revert experimental change * add combined conversion/reduction kernel * add bf16 wire format for single kernel mode * experimental on-stream small reduction kernel * double buffer arrival slots, use token (incrementing) method * double buffer host_buf for small reductions * put in waits for use of host_mem in large reduction case (prevents stomping on in-use memory * remove watchdog code * various cleanups / dead code removal * fix fp16 mode * fix some comments/logging statements * use increasing token scheme for arrival signals * add top-level comment to allreduce.cu * improve top-level comment in allreduce.cu * fix comments in ggml_cuda_ar_kernel * improve event handling for hostmem buffer usage tracking * change ev_pool to fixed 2D array * add chunked memcpy fallback for extra-large reductions (>32 MB) * change thresholds for copy-engine path and bf16 demotion * multi-block kernel test * more fine-tuning for chukn-size, etc. * various fixes for PR review * more PR fixes * fix semantics of all host mappings * require ampere+ * small cleanups * properly use host pointer for src/dst in cudaMemcpy calls * allreduce: lazy-init the internal pipeline on first use A config that lives entirely on NCCL never needs the chunked-kernel pipeline (host_buf, host_large, dev_tmp, streams, events, arrival ring). Defer pipeline creation to the first try_allreduce_internal call using the same std::call_once pattern as ensure_nccl, so those resources stay unallocated when only NCCL is in use. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * allreduce: assert n_backends == 2 instead of soft-fallback ar_pipeline_init already requires n_devices == 2 and bails before any AR can get here, so by the time we reach try_allreduce_internal we know we have exactly two backends. Replace the runtime-debug-log fallback with a hard assert. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> NCCL is in use. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * rework reduction provider selection. internal/nccl is OS dependent; most fallbacks are removed * remove unneeded Turing arch check (llama.cpp doesn't even compile pre-Turing anyway) * allreduce: ASCII-only comments and ggml_cuda_cast for value conversions Replace non-ASCII characters in comments (em dashes, right arrows) with ASCII equivalents (--, ->) so the source stays in the ggml/upstream norm. In the kernel-side code, replace static_cast<Twire>/static_cast<Tdst> with ggml_cuda_cast<...> so the BF16 conversions go through the fast __float2bfloat16 / __bfloat162float intrinsics from convert.cuh. Pure pointer and integer casts stay as static_cast. Also drops two stray garbage tokens that snuck in from earlier merges (a duplicated 'return ok; }' tail in allreduce.cu and a leftover '_reg)' fragment in ggml-cuda.cu). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * allreduce: use ggml_cuda_memcpy_1 for the chunked-kernel vector copies The chunked kernel's two 16-byte register<->host transfers (Phase 1 store and Phase 3 load) used reinterpret_cast<float4 *> on both sides. Replace with ggml_cuda_memcpy_1<sizeof(wire)>, which is the canonical helper for this pattern and emits the same int4 LD/ST under the hood. Conformance passes; 5x reruns of 70b internal pp512 show 1832-1836 t/s, matching the prior matrix value of 1831 t/s -- no perf change as expected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> ok; }' tail in allreduce.cu and a leftover '_reg)' fragment in ggml-cuda.cu). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * allreduce: assert cuda_ctx->device matches the pipeline's device Both ggml_cuda_ar_pipeline and ggml_backend_cuda_context carry the device they were created for; if they ever disagree, every cuda call that follows runs on the wrong device. Add GGML_ASSERT at each cuda_ctx retrieval site in the AR path so the misuse fails fast rather than silently corrupting. Also: rename __nv_bfloat16 -> nv_bfloat16 (typedef alias) for consistency with the rest of the file, and tighten one cudaGetLastError check to fire only after the to_bf16 call that can actually fail. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> gml-cuda.cu). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * allreduce: expand one-liner for loops to braced bodies Code-style preference -- match the rest of the file by writing every for loop with the body on its own braced line. Three sites in the copy-engine typed dispatch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> in the AR path so the misuse fails fast rather than silently corrupting. Also: rename __nv_bfloat16 -> nv_bfloat16 (typedef alias) for consistency with the rest of the file, and tighten one cudaGetLastError check to fire only after the to_bf16 call that can actually fail. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> gml-cuda.cu). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * allreduce: rename template parameters Tdst/Twire/Tsrc -> T_dst/T_wire/T_src Code-style preference per PR review -- T_dst/T_wire/T_src is more consistent with surrounding code. Whole-word rename across all 58 sites in allreduce.cu (kernel definitions, internal uses, and comment text). Realigned the parameter columns in three function signatures whose T_src/T_dst lines shifted by 1 char relative to their non-templated neighbors. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> to fire only after the to_bf16 call that can actually fail. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> gml-cuda.cu). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * allreduce: drop hyphen in 'chunked-kernel' across comments Per PR review feedback -- 'chunked kernel' (no hyphen) reads more naturally in running prose, especially for ESL readers. Pure comment-only change; all 10 occurrences in allreduce.cu updated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> three function signatures whose T_src/T_dst lines shifted by 1 char relative to their non-templated neighbors. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> to fire only after the to_bf16 call that can actually fail. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> gml-cuda.cu). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * allreduce: use ggml_cuda_get_max_cpy_bytes() instead of hardcoded 16 The chunked kernel hardcoded a 16-byte vector unit; replace with the ggml_cuda_get_max_cpy_bytes() helper that fattn-common.cuh uses for the same purpose, so ELEMS_PER_VEC self-adjusts to the arch's widest single-instruction copy. Perf-neutral on supported targets (Volta+ returns 16). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> hbors. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> to fire only after the to_bf16 call that can actually fail. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> gml-cuda.cu). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ggml-cuda: PR review fixes -- annotate #endif, fix stale comment, assert nbytes alignment Three separate but minor changes from PR #22299 review feedback: 1. Annotate the five GGML_USE_NCCL #endif lines with the matching condition so the pairing is visible without scrolling back. 2. The comment block on ggml_backend_cuda_comm_context claimed NCCL is lazy-initialised; that was true at one point but the dispatch refactor (727b141c0) made both NCCL and the internal pipeline eager. Rewrite the comment to match current behaviour. 3. Assert in ggml_backend_cuda_comm_allreduce_internal that the tensor's byte size is a 16-byte multiple. The chunked-kernel issues full-width vector loads/stores, so this is a precondition; tensor-parallel splits of hidden-dim-multiples satisfy it trivially, but a hard assert turns any caller-side bug into a clear failure rather than UB. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> device's new AR records its ev.ker -- otherwise the second device's wait sees the first device's just-recorded event (the in-flight new AR) and creates a circular dependency with the in-kernel peer signal. Two-pass dispatch (all waits, then all launches) avoids this. Bump POOL_SIZE 2 -> 8 (small memory cost, more breathing room for the GPU's view of the event chain) and add a runtime env override for the hybrid kernel chunk size (GGML_CUDA_AR_HYBRID_CHUNK_BYTES) for tuning. One-shot stderr diagnostic at first AR prints the chosen path + sizing. Result on 2x RTX 5090 Linux, 70b ub_sweep: ub=64 (1 MB AR): 913 -> 1036 t/s (+13.5% vs old, +1.8% vs NCCL) ub=128 (2 MB AR): 1056 -> 1181 (+11.9%, +3.7% vs NCCL) ub=256 (4 MB AR): 1212 -> 1424 (+17.5%, +3.5% vs NCCL) Internal now beats NCCL at every size (+1.8% to +15.6%), recovering all ground in the 1-4 MB regime that was previously a 10-12% loss. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * simplify the init logic * address some other PR requests * ggml-cuda: stub internal AllReduce on HIP/MUSA, drop pre-Ampere mention, gate NCCL fallback warning on !HIP The internal AllReduce relies on cudaHostAllocPortable/Mapped, cudaHostGetDevicePointer, and __nanosleep -- none of which the HIP or MUSA shims expose -- so wrap the implementation in !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA) and provide nullptr/no-op/false stubs in the #else branch. The dispatcher already treats a null pipeline as init failure and silently falls back to the meta backend's generic AllReduce, so HIP/MUSA builds compile clean and behave correctly without further call-site changes. PR review follow-ups: - drop "or pre-Ampere?" from the internal-init failure warning -- the kernel doesn't require Ampere or newer. - guard the "NCCL not compiled in" fallback warning behind !defined(GGML_USE_HIP); the suggestion to install NCCL only makes sense on NVIDIA builds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> hind, now +6-8% ahead at ub=1024-4096. Perplexity (32 chunks) matches NCCL bit-for-bit (3.4044 vs 3.4043). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * allreduce: guard __nanosleep on Volta+ and reject pre-Volta devices at init __nanosleep is the only Volta-specific intrinsic in the kernel; wrap it in #if __CUDA_ARCH__ >= GGML_CUDA_CC_VOLTA / NO_DEVICE_CODE so the file still compiles cleanly when targeting older arches (the dispatcher's init check below ensures the kernel is never actually launched on pre-Volta). Add a per-device compute-capability check in pipeline_init that returns nullptr if any device is below sm70. The dispatcher already treats nullptr as init failure and silently falls back to the meta backend's generic AllReduce. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> rom the internal-init failure warning -- the kernel doesn't require Ampere or newer. - guard the "NCCL not compiled in" fallback warning behind !defined(GGML_USE_HIP); the suggestion to install NCCL only makes sense on NVIDIA builds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> hind, now +6-8% ahead at ub=1024-4096. Perplexity (32 chunks) matches NCCL bit-for-bit (3.4044 vs 3.4043). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * allreduce: fix CI -Werror warnings (sign-compare, format, restrict alias, maybe-uninitialized) The CUDA CI builds with -Werror -Wsign-compare -Wformat -Wrestrict -Wmaybe-uninitialized. Address each: - n_devices is size_t; change `int i; i < n_devices` to size_t in the three init loops, and the matching GGML_LOG_INFO format from %d to %zu. - ggml_cuda_ar_kernel was launched with sendbuf == recvbuf (in-place reduction), so the __restrict__ qualifiers on those parameters were technically UB. Drop __restrict__ from sendbuf and recvbuf; an A/B sweep showed <0.6% perf delta (within noise) on Linux. - The buf/src/dst pointer arrays in ggml_cuda_ar_allreduce and the per-iteration arrays in ggml_cuda_ar_allreduce_copy_outer were declared with size GGML_CUDA_MAX_DEVICES but the loop only writes indices [0, n_devices); zero-initialise so the compiler sees the tail elements as defined. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> now +6-8% ahead at ub=1024-4096. Perplexity (32 chunks) matches NCCL bit-for-bit (3.4044 vs 3.4043). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ggml-cuda: drop unused-function warning by guarding try_allreduce_nccl behind GGML_USE_NCCL The only call site (in init_nccl) is already inside #ifdef GGML_USE_NCCL, so the function is unreferenced in non-NCCL builds and trips nvcc's -Werror=unused-function check. Move the guard from inside the function body to around the entire definition. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> ce reduction), so the __restrict__ qualifiers on those parameters were technically UB. Drop __restrict__ from sendbuf and recvbuf; an A/B sweep showed <0.6% perf delta (within noise) on Linux. - The buf/src/dst pointer arrays in ggml_cuda_ar_allreduce and the per-iteration arrays in ggml_cuda_ar_allreduce_copy_outer were declared with size GGML_CUDA_MAX_DEVICES but the loop only writes indices [0, n_devices); zero-initialise so the compiler sees the tail elements as defined. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> now +6-8% ahead at ub=1024-4096. Perplexity (32 chunks) matches NCCL bit-for-bit (3.4044 vs 3.4043). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * ggml : bump version to 0.11.1 (ggml/1484) * sync : ggml * talk-llama : sync llama.cpp * try to fix window cublas CI failure Refs: https://github.com/ggml-org/whisper.cpp/actions/runs/25631391231/job/75237266964?pr=3803 * Revert "try to fix window cublas CI failure" This reverts commit a4d91768aa2ae8cf7083650b3e4dc214413f92b7. * try using CCCL 12.4.127 with cuda 11.8.0 to fix CI failure * Revert "try using CCCL 12.4.127 with cuda 11.8.0 to fix CI failure" This reverts commit be867eadf553801eb7d1c383ed47a90fdd3d4b18. Sorry about this noise, I thought it was worth a try. * devops : add spirv-headers to vulkan dockerfile * ggml-cuda : add explicit casts to -INFINITY for float and half2 types This commit adds explicit casts to float for -INFINITY. The motivation for this is that in CUDA 11.8.0, the -INFINITY macro is defined as a double (a header provided NVCC). This triggers a warning and hence causes a CI failure in whisper.cpp. I belive that this header might have been updated in CUDA 12 which is why we don't see this warning. Refs: https://github.com/ggml-org/whisper.cpp/actions/runs/25713948217/job/75500081939?pr=3803 Refs: https://github.com/ggml-org/llama.cpp/issues/22824 * ggml-cuda : add ar_add() to avoid ambiguous operator+ for half/bfloat16 in CUDA 11.8 * ci : update ONEAPI version to 2025.3.3-0-devel-ubuntu24.04 * squash! ci : update ONEAPI version to 2025.3.3-0-devel-ubuntu24.04 * Revert "ggml-cuda : add ar_add() to avoid ambiguous operator+ for half/bfloat16 in CUDA 11.8" This reverts commit 5cd228494af3973294e90aad95b58c2ede400f43. Reverting in favor of: https://github.com/ggml-org/llama.cpp/pull/22994 * Revert "ggml-cuda : add explicit casts to -INFINITY for float and half2 types" This reverts commit a2839b4404de473bc7af127b7b308d530afda024. Reverting this as after closer inspection these only warnings and not errors. * ggml: install ggml.pc in <libdir>/pkgconfig (ggml/1480) That's always how it's done: https://github.com/search?q=path%3ACMakeLists.txt%20%22%24%7BCMAKE_INSTALL_LIBDIR%7D%2Fpkgconfig%22&type=code * metal : tighten input-position loop in kernel_conv_transpose_1d (ggml/1477) For a given output position j on the time axis, only input positions i such that i*s0 <= j < i*s0 + K contribute -- i.e. i in [ceil((j - K + 1)/s0), floor(j/s0)] intersected with [0, IL-1]. That's at most ceil(K/s0) values (typically 2 for stride==K/2 transposed convs). The current kernel iterates the full IL range and filters with an `if`, amplifying per-thread work by IL/ceil(K/s0) (~160x for IL=320, K=10, s0=5 -- a representative codec-decoder shape). On Apple M1 the wasted work trips the macOS GPU watchdog (kIOGPUCommandBufferCallbackErrorImpactingInteractivity) on long graphs. Compute i_min, i_max analytically before the inner loop and iterate only [i_min, i_max]. Output is bit-identical (same multiplies and adds in the same order); loop bound shrinks by IL/ceil(K/s0). Tested on M1 with a downstream consumer running a TTS codec at full T_codec; end-to-end codec decode ~3-4x faster, zero watchdog hits across long synthesis runs vs ~30% pre-patch. * ggml-virtgpu : include missing mutex header (llama/22810) Add missing `#include <mutex>` in ggml-backend-device.cpp. Fixes: #22809 Signed-off-by: Oliver Walsh <owalsh@redhat.com> * Add OP im2col_3d (llama/22903) * add im2col_3d * format code * update the ops.md * CUDA: directly include cuda/iterator (llama/22936) Before, we relied on a transient import from `cub/cub.cuh`, which is bad practice to do as cub may not always expose cuda/iterator * vulkan: Support asymmetric FA in scalar/mmq/coopmat1 paths (llama/22589) * Ggml/cuda snake fusion hardening (llama/22912) * cuda: tighten snake fusion type checks for all operands (defensive, sync vulkan) * cuda: reject snake fusion when ne[2] or ne[3] > 1 (mirror vulkan PR review) * cuda: merge type_ok and types_ok into a single types_ok (address am17an review) * cuda: filter ADD/SUB/MUL/DIV in supports_op to F32/F16 bin_bcast only dispatches F32/F16 type triplets, mirror the vulkan filter so unsupported types fall back through cpy instead of aborting. * test-backend-ops: extend snake_fuse to rank-4 with ne[2]/ne[3] > 1 cases * CUDA: handle OW > 65535 in im2col (2D and 3D) (llama/22944) `im2col_cuda` and `im2col_3d_cuda` both dispatch with `block_nums.y = OW`. CUDA caps grid Y at 65535. Conv1d encoders on raw 16 kHz audio with T > 65535 (~ 4 s) trip the limit -- e.g. SEANet at 11 s lands at OW = 176000 -- and the launch returns `invalid configuration argument`. Clamp `block_nums.y` to `MIN(OW, MAX_GRIDDIM_Y)` and loop inside the kernel with stride `MAX_GRIDDIM_Y`. Same in-kernel stride pattern already used for the z axis (`MAX_GRIDDIM_Z`). Both 2D `im2col_kernel` and 3D `im2col_3d_kernel` need the same fix. Bit-identical for OW <= 65535 (single iteration of the new outer loop). Tested on T4 / Jetson Orin with a SEANet encoder running on 11 s / 16 kHz audio (im2col reaching OW ~ 176000); pre-fix launch returns `invalid configuration argument`, post-fix runs to completion. Existing test-backend-ops im2col cases unchanged. * opencl: add q4_1 MoE for Adreno (llama/22856) * Q4_1 MoE CLC pass sanity check * remove unnecessary code * opencl: remove unnecessary asserts and reformat * opencl: fix supports_op for q4_1 moe * q4_1 moe is supported by Adreno with certain shapes --------- Co-authored-by: Li He <lih@qti.qualcomm.com> * metal : promote mul_mv/mul_mm batch divisors to function constants (llama/22711) * metal : promote mul_mv/mul_mm batch divisors to function constants * metal : take op directly in get_pipeline_mul_mv_ext * vulkan: Check shared memory size for mmq shaders (llama/22693) * vulkan: Fix Windows performance regression on Intel GPU BF16 workloads for Xe2 and newer (llama/22461) * refactor * Use l_warptile only when coopamt is available for BF16 * ggml-webgpu: address precision issues for multimodal (llama/22808) * fix(mixed-types): use f32 for precision and update the shared memory calculation logic for f32 * fix(unary): correct the gelu, gelu quick and gelu erf functions * fix(flash-attn-tile): fix the hardcode v type * fix(flash_attn): fix tile path * fix: pass editorconfig and address the type conflicts * fix: remove reduant pipeline keys * fix: remove inline min/max group size functions and revert the flash attn path order * fix: use clamp to avoid NaN for GELU * fix: use the right range for exp, 80 is safer for f32 exp * ggml-webgpu: Enables running gpt-oss-20b (llama/22906) * Enable to run gpt-oss-20b and refactor mulmat-q * disable test-backend-ops in ubuntu-24-webgpu * opencl: add opt-in Adreno xmem F16xF32 GEMM for prefill (llama/22755) * ggml-opencl: add Adreno xmem F16xF32 GEMM for prefill * ggml-opencl: address Adreno xmem review comments * ggml-opencl: align xmem gemm kernel naming --------- Co-authored-by: Your Name <your@email.com> * hexagon: eliminate scalar VTCM loads via HVX splat helpers (llama/22993) * hexagon: add hvx_vec_repl helpers and use those for splat-from-vtcm usecase * hmx-mm: optimize per-group scale handling * hmx-fa: optimize slope load from vtcm * hmx-fa: use aligned access where possible in hmx-utils * hexagon: add hvx_vec_repl_2x_f16 helper and consolidate repl helpers --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com> * ggml-zendnn : adaptive fallback to CPU backend for small batch sizes (llama/22681) * ggml-zendnn : add runtime env var GGML_ZENDNN_ADAPTIVE_FALLBACK to control adaptive fallback (default: enabled) * ggml-zendnn : restore original fallback logic when adaptive fallback is disabled * hexagon: add unary tanh op (llama/22999) * flush the gpu profile timestamp before the queryset is overflowed (llama/22995) * opencl: fix crash when warming up MoE on Adreno (llama/22876) * opencl: add q5_0 and q5_1 MoE for Adreno (llama/22985) * opencl: add q5_0 moe support * opencl: add q5_1 moe support * opencl: avoid potential leak * opencl: suppress unused var warning when building for non-Adreno --------- Co-authored-by: Li He <lih@qti.qualcomm.com> * Fix for issue #22974. Cast intermediate results to float before adding and casting the result to the destination type. Avoids half+half operator ambiguity. (llama/22994) * ggml-webgpu: only use subgroup-matrix path when head dims are divisible by sg_mat_k / sg_mat_n (llama/23020) * sync : ggml * talk-llama : sync llama.cpp * server: add support for carry_initial_prompt (#3781) * Add support for carry_initial_prompt on the server * Update README * server : Return speaker information in JSON (#3782) * examples : fix memory leak in read_audio_data (#3810) This commit addresses a memory leak in the `read_audio_data` function where it is currently possible that a call to `ma_decoder_init_file` succeeds and the function returns early without calling `ma_decoder_uninit`. A similar situation can occur with `ma_decoder_init_memory`. Refs: https://bugs.debian.org/1124796 Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com> * whisper : set bench data for each iteration (#3812) * whisper : set bench data for each iteration This commit updates whisper_bench_ggml_mul_mat_str to intialize the tensors data for each iteration. The motivation for this is that is currently possible for a previous run's results, F32 values, to leak into the next run. When it is time for the F16 iteration then F32 results can cause NaN values to appear in the tensor values causing the F16 iteration to fail. Refs:https://github.com/ggml-org/whisper.cpp/actions/runs/25901678402/job/76152894644?pr=3735 * ci : set GGML_NATIVE=OFF if x86_64 This commit sets GGML_NATIVE=OFF for x86_64 architectures. The motivation for this is to try to get CI to pass and the theory is that the libggml-cpu.so library in the ccache might have been built by a runner that supports a different instruction set. When another runner that does not support that instruction set tries to use it, it will fail with a segmentation fault. I'm not sure about this yet but going to try this out and if it does not work I'll ssh into the runner to debug further. * ci : use github ubuntu-22.04-arm runner instead of qemu (#3815) * ci : use github ubuntu-22.04-arm runner instead of qemu This commit updates the ubuntu-22-gcc-arm64 job to use a arm github runner instead of QEMU. The motivation for this is that we get intermittent failure specifically related to QEMU. For example: ```console Segmentation fault (core dumped) qemu: uncaught target signal 11 (Segmentation fault) - core dumped Segmentation fault (core dumped) dpkg: error processing package libc-bin (--configure): installed libc-bin package post-installation script subprocess returned error exit status 139 Processing triggers for ca-certificates (20240203~22.04.1) ... Updating certificates in /etc/ssl/certs... 0 added, 0 removed; done. Running hooks in /etc/ca-certificates/update.d... done. Errors were encountered while processing: libc-bin E: Sub-process /usr/bin/dpkg returned an error code (1) ``` This is an attempt to try to avoid QEMU and hence avoid this issue. * ci : remove QEMU where possible * common : fix server /inference fails to decode in-memory audio (regression) (#3818) * common: add memory buffer overload of read_audio_data whisper-server /inference without --convert passed the uploaded file bytes to read_audio_data as a filename, so ma_decoder_init_file tried to open a path starting with "RIFF" and failed. every request returned HTTP 400 "Invalid request" on builds without WHISPER_FFMPEG, which is the default. factor the PCM extraction into a shared helper and add an overload that decodes straight from a memory buffer via ma_decoder_init_memory, which the function already used for the stdin path. server now calls it with the upload content. the filename overload behavior is unchanged. * fix: in bindings/ruby/test/jfk_reader/jfk_reader in jfk_reader.c (#3756) * fix: V-002 security vulnerability Automated security fix generated by Orbis Security AI * fix(ruby): use Ruby allocator macros in jfk_reader and fix memory leak - Replace calloc/free with ALLOC_N/xfree to match Ruby binding conventions (ALLOC_N handles overflow checking and raises NoMemoryError on failure) - Free temporary samples buffer after conversion loop (was leaked) - Add NULL check for fopen return value with rb_raise - Add comment clarifying n_samples is a compile-time constant Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix(ruby): return false instead of rb_raise in memory_view callback rb_memory_view_get_func_t callbacks should communicate errors via return value (false), not exceptions. rb_memory_view_get has no exception-handling wrapper around get_func calls. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * replacing ALLOC_N with rb_protect as ALLOC_N raises Ruby exceptions --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * cmake : add CMakePresets.json [no ci] (#3808) This commit adds a CMakePresets.json file similar to the one in llama.cpp. The motivation for this is that this provides sharable named configuration which can be used with cmake --preset <name>. It also allows for extendins these preset with a CMakeUserPresets.json for specific hardware (like CPUs), architectures, and toolchains etc. * SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations (llama/21597) * SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations Replace sycl::malloc_device with zeMemAllocDevice for GPU memory allocation in the SYCL backend. sycl::malloc_device triggers the xe kernel driver's DMA-buf/TTM path which mirrors every VRAM allocation 1:1 in system RAM. zeMemAllocDevice uses the SVM/P2P path with no host staging. On a dual Intel Arc Pro B70 system (64GB VRAM, 64GB RAM), a 15.6 GiB model consumed 60 GiB of system RAM via sycl::malloc_device, causing OOM crashes. With zeMemAllocDevice, the same workload uses ~6.7 GiB of system RAM with no performance regression. All Level Zero calls include automatic fallback to the original SYCL allocation path if Level Zero interop is unavailable. * SYCL: address review feedback - remove try/catch, check device types, deduplicate - Remove try/catch from malloc/free/memcpy helpers, check backend and device type upfront instead (ggml_sycl_is_level_zero, ggml_sycl_is_dgpu) - Move shared helpers (is_level_zero, is_dgpu, free_device) to common.cpp and declare in common.hpp to eliminate code duplication - Use SYCL_CHECK(CHECK_TRY_ERROR()) for fallback sycl::free calls - Guard dev2dev_memcpy L0 path to dGPU-to-dGPU only, preserving the host-staged path for iGPU-to-dGPU transfers - Add Windows Level Zero SDK path detection (LEVEL_ZERO_V1_SDK_PATH) in CMakeLists.txt (co-authored with @arthw) * SYCL: add build/runtime flags for Level Zero, address review feedback Implements the architecture suggested by @arthw: compile-time and runtime flags to cleanly separate Level Zero and SYCL memory API paths. - Add GGML_SYCL_SUPPORT_LEVEL_ZERO cmake option (default ON). All Level Zero code is wrapped in #ifdef so the build works on systems without the Level Zero SDK installed (e.g. CPU-only CI servers). Both the loader library and headers are checked before enabling. - Add GGML_SYCL_ENABLE_LEVEL_ZERO runtime env var (default 1). Controls whether Level Zero or SYCL memory APIs are used. Only one API style is used per session, no mixing. If Level Zero is enabled but the devices don't support the Level Zero backend, it auto-disables with a warning. - Remove Level Zero code from dpct_malloc. It was unused (dpct::device_memory is not called anywhere in the backend) and used try/catch for flow control. - Update SYCL.md with documentation for both new parameters. Tested on Intel Arc Pro B70 (32GB), single-GPU and dual-GPU, with both GGML_SYCL_SUPPORT_LEVEL_ZERO=ON and OFF builds. AI-assisted development (Claude). Code reviewed and tested on my hardware. * SYCL: unify Level Zero malloc/free call sites, address review feedback Move ggml_sycl_malloc_device to common.cpp alongside ggml_sycl_free_device. Both functions are now unconditionally available — Level Zero code is #ifdef'd inside the functions, not at call sites. All call sites use uniform SYCL_CHECK(CHECK_TRY_ERROR()) wrapping with no #ifdef blocks. Addresses arthw's review: wrap all malloc/free in SYCL_CHECK for stack traces on failure, eliminate duplicated #ifdef/else patterns at 6 call sites (-29 lines net). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * SYCL: add Level Zero SDK to CI, fix device check and missed alloc paths Add Level Zero SDK installation to Ubuntu and Windows SYCL CI jobs so the Level Zero code path is compiled and tested in CI. Fix two bugs found during extended dual-GPU testing (no ONEAPI_DEVICE_SELECTOR set): - The Level Zero backend check was iterating all SYCL devices including CPU. The OpenCL CPU device caused Level Zero to be disabled for the GPUs, defeating the fix on multi-GPU systems. Added is_gpu() filter so only GPU devices are checked. - sycl_ext_malloc_device/sycl_ext_free (tensor reorder temp buffers) were still calling sycl::malloc/sycl::free directly, bypassing the Level Zero path. Routed through ggml_sycl_malloc_device/free_device for consistency with the other device memory call sites. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * SYCL: address arthw review feedback on Level Zero memory API structure - Move ggml_sycl_malloc_device to static function in ggml-sycl.cpp; only ggml_sycl_free_device (used by common.cpp) stays in common.cpp - Switch both helpers to use g_ggml_sycl_enable_level_zero global instead of per-call queue backend checks - Remove #ifdef wrapper from global definition; always declare at 0, add #else branch in init block so it stays 0 when L0 not compiled in - Update init loop comment to explain GPU-only device check - CMakeLists: message(STATUS) before the if block; align option wording AI-assisted implementation. Reviewed and tested on dual Intel Arc Pro B70 (32 GB each): test-backend-ops OK on both GPUs, single/dual-GPU Q4_K_M and Q8_0 bench correct, zeMemAllocDevice GTT delta confirmed <5 MiB per 4 GiB allocation (vs ~4 GiB shadow with sycl::malloc_device). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * SYCL: remove unused cstdio/cstdlib includes from common.cpp Leftover from the deleted ggml_sycl_queue_supports_level_zero helper. Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * Apply suggestions from code review Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com> * SYCL: preserve Level Zero allocation path during early malloc * ci: fix Level Zero package conflict in Intel Docker build * ci: find Level Zero loader in oneAPI package step * ci: allow Windows SYCL package without Level Zero DLL --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com> * vulkan: fix matmul integer pipeline selection (llama/23005) * vulkan: fix matmul integer pipeline selection * gate pipeline creation with the right bools * ggml-cpu: Add IME2 Instruction Support for the SpacemiT Backend (llama/22863) * logs : reduce (llama/23021) * logs : reduce * args : fix envs * server : fix build * common : print verbosity level at start * server : clean-up logs * server : print prompt processing timings + sampling params * minor : whitespaces * ggml-webgpu: makes the flash attn vec path subgroup-aware (llama/23040) * ggml-webgpu: makes the flash attn vec path compile and size its split/reduce work from the device’s reported subgroup range instead of assuming 32 subgroup size. * ggml-webgpu: remove the extra max_wg_size >= max_subgroup_size guard. Remove hardcoded 32 when determine the value of reduce_wg_size and vec_nwg_cap * HIP: RDNA3 mma FA, faster AMD transpose, tune AMD (llama/22880) Adds RDNA3 support to the CUDA mma FA kernel. To make the RDNA3 tensor cores work with the FP16 accumulation for VKQ the tiles they need to be 32 logical units long in direction of the attention head; for head sizes 80 and 112 that are not exactly divided by 32 the regular length of 16 with FP32 accumulation is used instead. The longer tiles also enable more efficient transposition for a warp size of 32 which is why it's also used for RDNA4. However, this scrambles the data layout of the accumulators along the attention head dimension. To prevent accidental misuse I added another entry to ggml_cuda_mma::data_layout. I also tuned the kernel parameters for RDNA3, RDNA4, and CDNA1 in general, during which I discovered that the kernel can be made to work for head sizes up to 256 for CDNA. For RDNA3/4 I was not able to get better performance that the tile kernel for head sizes > 128. * ggml-hexagon: cpy: add contiguous fast-path in reshape copy (llama/23076) * llama + spec: MTP Support (llama/22673) * spec: support MTP * fix batch size * rename files * cont : simplify (llama/7) * MTP: clean-up (llama/9) * MTP: clean-up * review: use llama_context_type instead of llama_graph_type * review: remove llama_model_has_mtp * review: fix convert issues * convert: fix pycheck * review: formatting * use `mtp-` for identifying mtp models * convert: fix mtp conversion * mtp -> draft-mtp * remove unused llama_arch * add need_embd in speculative * llama: allow partial seq_rm for GDN models for speculative decoding Currently speculative checkpoint needs to restart from a checkpoint after some draft tokens are not accepted, this leads to some wastage in running the target again. This PR adds the ability to rollback upto `draft_max` by storing the GDN intermediates. * fix pending state * vulkan: add GDN partial rollback * meta: extend check to axis 1 * metal: add GDN partial rollback Extend the gated delta net kernel to store intermediate states for partial rollback support on the Metal backend. - Add K (snapshot slot count) as a function constant - Read input state from slot 0 of the 3D state tensor - Write intermediate states to different slots during token loop - For K=1, maintain backward-compatible single-slot behavior Ref: https://github.com/ggml-org/llama.cpp/commit/8c05923630110223669f069af2000e9cf10c02bc Assisted-by: llama.cpp:local pi * delta_net_base: use ggml_pad instead of new_tensor * review: add need_rs_seq * review: rename part_bounded to n_rs * review: deslop comments * review: rename, add asserts * server : adjust checkpoint logic (llama/11) * server : adjust checkpoint logic * cont : rm asserts * server-context: fix early exit * spec : fix compatibility with n-gram and add TODOs (llama/13) * metal : cleanup * llama : fix faulty bitwise check in recurrent memory * server : disable RS-based MTP in combination with other spec types * spec : add TODOs * cont : fix comment * cont : update comment * common : fix logic for ngram + mtp compat * llama-memory: enable checkpointing with partial rollback * cont: add test-case for loading into a dirty ctx * llama-memory-recurrent: clear rs_idx in clear * download: fix mtp path * llama-arch: fix enorm op * docs: update docs * conversion: fix type annotations --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * ggml : bump version to 0.12.0 (ggml/1494) * ggml-alloc: fix out-of-bounds read in ggml_dyn_tallocr_remove_block (ggml/1492) * ggml.h: correct ggml_silu_back arg docstring (a=dy, b=x) (ggml/1500) * vulkan: removed duplicate #include <memory> in headers (llama/23144) * vulkan: fuse SSM_CONV + BIAS + SILU (llama/22653) * vulkan: Support unaligned tensors for ROPE (llama/22637) * vulkan: add cpy bf16 -> f32 pipelines (llama/22677) * ggml-vulkan/CMakeLists: add a check for SPIRV-Headers (llama/22009) * ci/run: set explicit SPIR-V Headers search path for macOS vulkan CI For whatever reason, the files are under additional sub-path `vulkan/` under the cmake directory, which does not match either current LunarG macOS Vulkan SDK structure (`lib/cmake/SPIRV-Headers`), nor what gets installed when you run the cmake build+install for SPIRV-Headers itself on at least Linux (`share/cmake/SPIRV-Headers`). This allows for SPIRV-Headers to be found, as currently the CI runner's setup does not seem to include the relevant path in list of search locations. * ggml-vulkan/CMakeLists: add a check for SPIRV-Headers This is installed by the project if it is built and installed. Receiving an error during the configuration step is generally preferred to receiving an error in the middle of a build. * CUDA: Continue directly including cuda/iterator (llama/23102) Cont of #22936, forgot to update one site * feat: Support d_conv=15 for ssm-conv.cu (llama/23017) Branch: ModalityConditionalAdapters AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * sycl: route small f32 matmuls to oneMKL, bypass oneDNN (llama/22150) Signed-off-by: Chun Tao <chun.tao@intel.com> Co-authored-by: Chun Tao <chun.tao@intel.com> * sycl: scalar SWAR byte-subtract in Q6_K MMVQ dot product (llama/22156) Signed-off-by: Chun Tao <chun.tao@intel.com> Co-authored-by: Chun Tao <chun.tao@intel.com> * ggml-hexagon: add PAD op HVX kernel (llama/23078) * ggml-hexagon: add PAD op HVX kernel Implements GGML_OP_PAD on the Hexagon HTP backend using HVX vectorized kernels. Supports zero-padding and circular padding across all 4 tensor dimensions. * hex-ggml: remove duplicate op cases (merge conflict) * hex-pad: fix editorconfig checks and macro alignment --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com> * hexagon: add support for TRI op (llama/22822) * Hexagon: TRI HVX Kernel addition to ggml hexagon HTP ops and context * addressed PR review comments for TRI op * hexagon: clang format * hex-unary: remove merge conflict markers * hex-ggml: remove duplicate op cases (merge conflict) * hex-ggml: fix editor config errors --------- Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com> Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com> * rpc : keep last_graph_uid in the device context (llama/23273) With the introduction of MTP we can have multiple compute contexts for the same RPC device. In this case last_graph_uid is not updated properly when contexts are being switched. This patch fixes this by moving last_graph_uid to the device context, making sure it is always updated. closes: #23242 * sycl: add GGML_SYCL_USE_ASYNC_MEM_OP env toggle (llama/22153) * sycl: add GGML_SYCL_USE_ASYNC_MEM_OP env toggle Signed-off-by: Chun Tao <chun.tao@intel.com> * Use async mem ops for correctness when SYCL graphs are explicitly on. Signed-off-by: Tao, Chun <chun.tao@intel.com> --------- Signed-off-by: Chun Tao <chun.tao@intel.com> Signed-off-by: Tao, Chun <chun.tao@intel.com> Co-authored-by: Chun Tao <chun.tao@intel.com> * ggml-webgpu : extend GDN for K>1 (llama/23299) * hexagon: enable support for NORM op (llama/23319) * hexagon: add MROPE and IMROPE support in HTP rope op (llama/23317) * opencl: add MoE support for q4_k, q5_k, q6_k on Adreno (llama/23303) * opencl: add q4_k moe support * opencl: add q5_k moe support * op…

* sync : ggml * talk-llama : sync llama.cpp * try to fix window cublas CI failure Refs: https://github.com/ggml-org/whisper.cpp/actions/runs/25631391231/job/75237266964?pr=3803 * Revert "try to fix window cublas CI failure" This reverts commit a4d91768aa2ae8cf7083650b3e4dc214413f92b7. * try using CCCL 12.4.127 with cuda 11.8.0 to fix CI failure * Revert "try using CCCL 12.4.127 with cuda 11.8.0 to fix CI failure" This reverts commit be867eadf553801eb7d1c383ed47a90fdd3d4b18. Sorry about this noise, I thought it was worth a try. * devops : add spirv-headers to vulkan dockerfile * ggml-cuda : add explicit casts to -INFINITY for float and half2 types This commit adds explicit casts to float for -INFINITY. The motivation for this is that in CUDA 11.8.0, the -INFINITY macro is defined as a double (a header provided NVCC). This triggers a warning and hence causes a CI failure in whisper.cpp. I belive that this header might have been updated in CUDA 12 which is why we don't see this warning. Refs: https://github.com/ggml-org/whisper.cpp/actions/runs/25713948217/job/75500081939?pr=3803 Refs: https://github.com/ggml-org/llama.cpp/issues/22824 * ggml-cuda : add ar_add() to avoid ambiguous operator+ for half/bfloat16 in CUDA 11.8 * ci : update ONEAPI version to 2025.3.3-0-devel-ubuntu24.04 * squash! ci : update ONEAPI version to 2025.3.3-0-devel-ubuntu24.04 * Revert "ggml-cuda : add ar_add() to avoid ambiguous operator+ for half/bfloat16 in CUDA 11.8" This reverts commit 5cd228494af3973294e90aad95b58c2ede400f43. Reverting in favor of: https://github.com/ggml-org/llama.cpp/pull/22994 * Revert "ggml-cuda : add explicit casts to -INFINITY for float and half2 types" This reverts commit a2839b4404de473bc7af127b7b308d530afda024. Reverting this as after closer inspection these only warnings and not errors. * ggml: install ggml.pc in <libdir>/pkgconfig (ggml/1480) That's always how it's done: https://github.com/search?q=path%3ACMakeLists.txt%20%22%24%7BCMAKE_INSTALL_LIBDIR%7D%2Fpkgconfig%22&type=code * metal : tighten input-position loop in kernel_conv_transpose_1d (ggml/1477) For a given output position j on the time axis, only input positions i such that i*s0 <= j < i*s0 + K contribute -- i.e. i in [ceil((j - K + 1)/s0), floor(j/s0)] intersected with [0, IL-1]. That's at most ceil(K/s0) values (typically 2 for stride==K/2 transposed convs). The current kernel iterates the full IL range and filters with an `if`, amplifying per-thread work by IL/ceil(K/s0) (~160x for IL=320, K=10, s0=5 -- a representative codec-decoder shape). On Apple M1 the wasted work trips the macOS GPU watchdog (kIOGPUCommandBufferCallbackErrorImpactingInteractivity) on long graphs. Compute i_min, i_max analytically before the inner loop and iterate only [i_min, i_max]. Output is bit-identical (same multiplies and adds in the same order); loop bound shrinks by IL/ceil(K/s0). Tested on M1 with a downstream consumer running a TTS codec at full T_codec; end-to-end codec decode ~3-4x faster, zero watchdog hits across long synthesis runs vs ~30% pre-patch. * ggml-virtgpu : include missing mutex header (llama/22810) Add missing `#include <mutex>` in ggml-backend-device.cpp. Fixes: #22809 Signed-off-by: Oliver Walsh <owalsh@redhat.com> * Add OP im2col_3d (llama/22903) * add im2col_3d * format code * update the ops.md * CUDA: directly include cuda/iterator (llama/22936) Before, we relied on a transient import from `cub/cub.cuh`, which is bad practice to do as cub may not always expose cuda/iterator * vulkan: Support asymmetric FA in scalar/mmq/coopmat1 paths (llama/22589) * Ggml/cuda snake fusion hardening (llama/22912) * cuda: tighten snake fusion type checks for all operands (defensive, sync vulkan) * cuda: reject snake fusion when ne[2] or ne[3] > 1 (mirror vulkan PR review) * cuda: merge type_ok and types_ok into a single types_ok (address am17an review) * cuda: filter ADD/SUB/MUL/DIV in supports_op to F32/F16 bin_bcast only dispatches F32/F16 type triplets, mirror the vulkan filter so unsupported types fall back through cpy instead of aborting. * test-backend-ops: extend snake_fuse to rank-4 with ne[2]/ne[3] > 1 cases * CUDA: handle OW > 65535 in im2col (2D and 3D) (llama/22944) `im2col_cuda` and `im2col_3d_cuda` both dispatch with `block_nums.y = OW`. CUDA caps grid Y at 65535. Conv1d encoders on raw 16 kHz audio with T > 65535 (~ 4 s) trip the limit -- e.g. SEANet at 11 s lands at OW = 176000 -- and the launch returns `invalid configuration argument`. Clamp `block_nums.y` to `MIN(OW, MAX_GRIDDIM_Y)` and loop inside the kernel with stride `MAX_GRIDDIM_Y`. Same in-kernel stride pattern already used for the z axis (`MAX_GRIDDIM_Z`). Both 2D `im2col_kernel` and 3D `im2col_3d_kernel` need the same fix. Bit-identical for OW <= 65535 (single iteration of the new outer loop). Tested on T4 / Jetson Orin with a SEANet encoder running on 11 s / 16 kHz audio (im2col reaching OW ~ 176000); pre-fix launch returns `invalid configuration argument`, post-fix runs to completion. Existing test-backend-ops im2col cases unchanged. * opencl: add q4_1 MoE for Adreno (llama/22856) * Q4_1 MoE CLC pass sanity check * remove unnecessary code * opencl: remove unnecessary asserts and reformat * opencl: fix supports_op for q4_1 moe * q4_1 moe is supported by Adreno with certain shapes --------- Co-authored-by: Li He <lih@qti.qualcomm.com> * metal : promote mul_mv/mul_mm batch divisors to function constants (llama/22711) * metal : promote mul_mv/mul_mm batch divisors to function constants * metal : take op directly in get_pipeline_mul_mv_ext * vulkan: Check shared memory size for mmq shaders (llama/22693) * vulkan: Fix Windows performance regression on Intel GPU BF16 workloads for Xe2 and newer (llama/22461) * refactor * Use l_warptile only when coopamt is available for BF16 * ggml-webgpu: address precision issues for multimodal (llama/22808) * fix(mixed-types): use f32 for precision and update the shared memory calculation logic for f32 * fix(unary): correct the gelu, gelu quick and gelu erf functions * fix(flash-attn-tile): fix the hardcode v type * fix(flash_attn): fix tile path * fix: pass editorconfig and address the type conflicts * fix: remove reduant pipeline keys * fix: remove inline min/max group size functions and revert the flash attn path order * fix: use clamp to avoid NaN for GELU * fix: use the right range for exp, 80 is safer for f32 exp * ggml-webgpu: Enables running gpt-oss-20b (llama/22906) * Enable to run gpt-oss-20b and refactor mulmat-q * disable test-backend-ops in ubuntu-24-webgpu * opencl: add opt-in Adreno xmem F16xF32 GEMM for prefill (llama/22755) * ggml-opencl: add Adreno xmem F16xF32 GEMM for prefill * ggml-opencl: address Adreno xmem review comments * ggml-opencl: align xmem gemm kernel naming --------- Co-authored-by: Your Name <your@email.com> * hexagon: eliminate scalar VTCM loads via HVX splat helpers (llama/22993) * hexagon: add hvx_vec_repl helpers and use those for splat-from-vtcm usecase * hmx-mm: optimize per-group scale handling * hmx-fa: optimize slope load from vtcm * hmx-fa: use aligned access where possible in hmx-utils * hexagon: add hvx_vec_repl_2x_f16 helper and consolidate repl helpers --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com> * ggml-zendnn : adaptive fallback to CPU backend for small batch sizes (llama/22681) * ggml-zendnn : add runtime env var GGML_ZENDNN_ADAPTIVE_FALLBACK to control adaptive fallback (default: enabled) * ggml-zendnn : restore original fallback logic when adaptive fallback is disabled * hexagon: add unary tanh op (llama/22999) * flush the gpu profile timestamp before the queryset is overflowed (llama/22995) * opencl: fix crash when warming up MoE on Adreno (llama/22876) * opencl: add q5_0 and q5_1 MoE for Adreno (llama/22985) * opencl: add q5_0 moe support * opencl: add q5_1 moe support * opencl: avoid potential leak * opencl: suppress unused var warning when building for non-Adreno --------- Co-authored-by: Li He <lih@qti.qualcomm.com> * Fix for issue #22974. Cast intermediate results to float before adding and casting the result to the destination type. Avoids half+half operator ambiguity. (llama/22994) * ggml-webgpu: only use subgroup-matrix path when head dims are divisible by sg_mat_k / sg_mat_n (llama/23020) * sync : ggml * talk-llama : sync llama.cpp * server: add support for carry_initial_prompt (#3781) * Add support for carry_initial_prompt on the server * Update README * server : Return speaker information in JSON (#3782) * examples : fix memory leak in read_audio_data (#3810) This commit addresses a memory leak in the `read_audio_data` function where it is currently possible that a call to `ma_decoder_init_file` succeeds and the function returns early without calling `ma_decoder_uninit`. A similar situation can occur with `ma_decoder_init_memory`. Refs: https://bugs.debian.org/1124796 Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com> * whisper : set bench data for each iteration (#3812) * whisper : set bench data for each iteration This commit updates whisper_bench_ggml_mul_mat_str to intialize the tensors data for each iteration. The motivation for this is that is currently possible for a previous run's results, F32 values, to leak into the next run. When it is time for the F16 iteration then F32 results can cause NaN values to appear in the tensor values causing the F16 iteration to fail. Refs:https://github.com/ggml-org/whisper.cpp/actions/runs/25901678402/job/76152894644?pr=3735 * ci : set GGML_NATIVE=OFF if x86_64 This commit sets GGML_NATIVE=OFF for x86_64 architectures. The motivation for this is to try to get CI to pass and the theory is that the libggml-cpu.so library in the ccache might have been built by a runner that supports a different instruction set. When another runner that does not support that instruction set tries to use it, it will fail with a segmentation fault. I'm not sure about this yet but going to try this out and if it does not work I'll ssh into the runner to debug further. * ci : use github ubuntu-22.04-arm runner instead of qemu (#3815) * ci : use github ubuntu-22.04-arm runner instead of qemu This commit updates the ubuntu-22-gcc-arm64 job to use a arm github runner instead of QEMU. The motivation for this is that we get intermittent failure specifically related to QEMU. For example: ```console Segmentation fault (core dumped) qemu: uncaught target signal 11 (Segmentation fault) - core dumped Segmentation fault (core dumped) dpkg: error processing package libc-bin (--configure): installed libc-bin package post-installation script subprocess returned error exit status 139 Processing triggers for ca-certificates (20240203~22.04.1) ... Updating certificates in /etc/ssl/certs... 0 added, 0 removed; done. Running hooks in /etc/ca-certificates/update.d... done. Errors were encountered while processing: libc-bin E: Sub-process /usr/bin/dpkg returned an error code (1) ``` This is an attempt to try to avoid QEMU and hence avoid this issue. * ci : remove QEMU where possible * common : fix server /inference fails to decode in-memory audio (regression) (#3818) * common: add memory buffer overload of read_audio_data whisper-server /inference without --convert passed the uploaded file bytes to read_audio_data as a filename, so ma_decoder_init_file tried to open a path starting with "RIFF" and failed. every request returned HTTP 400 "Invalid request" on builds without WHISPER_FFMPEG, which is the default. factor the PCM extraction into a shared helper and add an overload that decodes straight from a memory buffer via ma_decoder_init_memory, which the function already used for the stdin path. server now calls it with the upload content. the filename overload behavior is unchanged. * fix: in bindings/ruby/test/jfk_reader/jfk_reader in jfk_reader.c (#3756) * fix: V-002 security vulnerability Automated security fix generated by Orbis Security AI * fix(ruby): use Ruby allocator macros in jfk_reader and fix memory leak - Replace calloc/free with ALLOC_N/xfree to match Ruby binding conventions (ALLOC_N handles overflow checking and raises NoMemoryError on failure) - Free temporary samples buffer after conversion loop (was leaked) - Add NULL check for fopen return value with rb_raise - Add comment clarifying n_samples is a compile-time constant Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix(ruby): return false instead of rb_raise in memory_view callback rb_memory_view_get_func_t callbacks should communicate errors via return value (false), not exceptions. rb_memory_view_get has no exception-handling wrapper around get_func calls. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * replacing ALLOC_N with rb_protect as ALLOC_N raises Ruby exceptions --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * cmake : add CMakePresets.json [no ci] (#3808) This commit adds a CMakePresets.json file similar to the one in llama.cpp. The motivation for this is that this provides sharable named configuration which can be used with cmake --preset <name>. It also allows for extendins these preset with a CMakeUserPresets.json for specific hardware (like CPUs), architectures, and toolchains etc. * SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations (llama/21597) * SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations Replace sycl::malloc_device with zeMemAllocDevice for GPU memory allocation in the SYCL backend. sycl::malloc_device triggers the xe kernel driver's DMA-buf/TTM path which mirrors every VRAM allocation 1:1 in system RAM. zeMemAllocDevice uses the SVM/P2P path with no host staging. On a dual Intel Arc Pro B70 system (64GB VRAM, 64GB RAM), a 15.6 GiB model consumed 60 GiB of system RAM via sycl::malloc_device, causing OOM crashes. With zeMemAllocDevice, the same workload uses ~6.7 GiB of system RAM with no performance regression. All Level Zero calls include automatic fallback to the original SYCL allocation path if Level Zero interop is unavailable. * SYCL: address review feedback - remove try/catch, check device types, deduplicate - Remove try/catch from malloc/free/memcpy helpers, check backend and device type upfront instead (ggml_sycl_is_level_zero, ggml_sycl_is_dgpu) - Move shared helpers (is_level_zero, is_dgpu, free_device) to common.cpp and declare in common.hpp to eliminate code duplication - Use SYCL_CHECK(CHECK_TRY_ERROR()) for fallback sycl::free calls - Guard dev2dev_memcpy L0 path to dGPU-to-dGPU only, preserving the host-staged path for iGPU-to-dGPU transfers - Add Windows Level Zero SDK path detection (LEVEL_ZERO_V1_SDK_PATH) in CMakeLists.txt (co-authored with @arthw) * SYCL: add build/runtime flags for Level Zero, address review feedback Implements the architecture suggested by @arthw: compile-time and runtime flags to cleanly separate Level Zero and SYCL memory API paths. - Add GGML_SYCL_SUPPORT_LEVEL_ZERO cmake option (default ON). All Level Zero code is wrapped in #ifdef so the build works on systems without the Level Zero SDK installed (e.g. CPU-only CI servers). Both the loader library and headers are checked before enabling. - Add GGML_SYCL_ENABLE_LEVEL_ZERO runtime env var (default 1). Controls whether Level Zero or SYCL memory APIs are used. Only one API style is used per session, no mixing. If Level Zero is enabled but the devices don't support the Level Zero backend, it auto-disables with a warning. - Remove Level Zero code from dpct_malloc. It was unused (dpct::device_memory is not called anywhere in the backend) and used try/catch for flow control. - Update SYCL.md with documentation for both new parameters. Tested on Intel Arc Pro B70 (32GB), single-GPU and dual-GPU, with both GGML_SYCL_SUPPORT_LEVEL_ZERO=ON and OFF builds. AI-assisted development (Claude). Code reviewed and tested on my hardware. * SYCL: unify Level Zero malloc/free call sites, address review feedback Move ggml_sycl_malloc_device to common.cpp alongside ggml_sycl_free_device. Both functions are now unconditionally available — Level Zero code is #ifdef'd inside the functions, not at call sites. All call sites use uniform SYCL_CHECK(CHECK_TRY_ERROR()) wrapping with no #ifdef blocks. Addresses arthw's review: wrap all malloc/free in SYCL_CHECK for stack traces on failure, eliminate duplicated #ifdef/else patterns at 6 call sites (-29 lines net). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * SYCL: add Level Zero SDK to CI, fix device check and missed alloc paths Add Level Zero SDK installation to Ubuntu and Windows SYCL CI jobs so the Level Zero code path is compiled and tested in CI. Fix two bugs found during extended dual-GPU testing (no ONEAPI_DEVICE_SELECTOR set): - The Level Zero backend check was iterating all SYCL devices including CPU. The OpenCL CPU device caused Level Zero to be disabled for the GPUs, defeating the fix on multi-GPU systems. Added is_gpu() filter so only GPU devices are checked. - sycl_ext_malloc_device/sycl_ext_free (tensor reorder temp buffers) were still calling sycl::malloc/sycl::free directly, bypassing the Level Zero path. Routed through ggml_sycl_malloc_device/free_device for consistency with the other device memory call sites. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * SYCL: address arthw review feedback on Level Zero memory API structure - Move ggml_sycl_malloc_device to static function in ggml-sycl.cpp; only ggml_sycl_free_device (used by common.cpp) stays in common.cpp - Switch both helpers to use g_ggml_sycl_enable_level_zero global instead of per-call queue backend checks - Remove #ifdef wrapper from global definition; always declare at 0, add #else branch in init block so it stays 0 when L0 not compiled in - Update init loop comment to explain GPU-only device check - CMakeLists: message(STATUS) before the if block; align option wording AI-assisted implementation. Reviewed and tested on dual Intel Arc Pro B70 (32 GB each): test-backend-ops OK on both GPUs, single/dual-GPU Q4_K_M and Q8_0 bench correct, zeMemAllocDevice GTT delta confirmed <5 MiB per 4 GiB allocation (vs ~4 GiB shadow with sycl::malloc_device). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * SYCL: remove unused cstdio/cstdlib includes from common.cpp Leftover from the deleted ggml_sycl_queue_supports_level_zero helper. Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * Apply suggestions from code review Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com> * SYCL: preserve Level Zero allocation path during early malloc * ci: fix Level Zero package conflict in Intel Docker build * ci: find Level Zero loader in oneAPI package step * ci: allow Windows SYCL package without Level Zero DLL --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com> * vulkan: fix matmul integer pipeline selection (llama/23005) * vulkan: fix matmul integer pipeline selection * gate pipeline creation with the right bools * ggml-cpu: Add IME2 Instruction Support for the SpacemiT Backend (llama/22863) * logs : reduce (llama/23021) * logs : reduce * args : fix envs * server : fix build * common : print verbosity level at start * server : clean-up logs * server : print prompt processing timings + sampling params * minor : whitespaces * ggml-webgpu: makes the flash attn vec path subgroup-aware (llama/23040) * ggml-webgpu: makes the flash attn vec path compile and size its split/reduce work from the device’s reported subgroup range instead of assuming 32 subgroup size. * ggml-webgpu: remove the extra max_wg_size >= max_subgroup_size guard. Remove hardcoded 32 when determine the value of reduce_wg_size and vec_nwg_cap * HIP: RDNA3 mma FA, faster AMD transpose, tune AMD (llama/22880) Adds RDNA3 support to the CUDA mma FA kernel. To make the RDNA3 tensor cores work with the FP16 accumulation for VKQ the tiles they need to be 32 logical units long in direction of the attention head; for head sizes 80 and 112 that are not exactly divided by 32 the regular length of 16 with FP32 accumulation is used instead. The longer tiles also enable more efficient transposition for a warp size of 32 which is why it's also used for RDNA4. However, this scrambles the data layout of the accumulators along the attention head dimension. To prevent accidental misuse I added another entry to ggml_cuda_mma::data_layout. I also tuned the kernel parameters for RDNA3, RDNA4, and CDNA1 in general, during which I discovered that the kernel can be made to work for head sizes up to 256 for CDNA. For RDNA3/4 I was not able to get better performance that the tile kernel for head sizes > 128. * ggml-hexagon: cpy: add contiguous fast-path in reshape copy (llama/23076) * llama + spec: MTP Support (llama/22673) * spec: support MTP * fix batch size * rename files * cont : simplify (llama/7) * MTP: clean-up (llama/9) * MTP: clean-up * review: use llama_context_type instead of llama_graph_type * review: remove llama_model_has_mtp * review: fix convert issues * convert: fix pycheck * review: formatting * use `mtp-` for identifying mtp models * convert: fix mtp conversion * mtp -> draft-mtp * remove unused llama_arch * add need_embd in speculative * llama: allow partial seq_rm for GDN models for speculative decoding Currently speculative checkpoint needs to restart from a checkpoint after some draft tokens are not accepted, this leads to some wastage in running the target again. This PR adds the ability to rollback upto `draft_max` by storing the GDN intermediates. * fix pending state * vulkan: add GDN partial rollback * meta: extend check to axis 1 * metal: add GDN partial rollback Extend the gated delta net kernel to store intermediate states for partial rollback support on the Metal backend. - Add K (snapshot slot count) as a function constant - Read input state from slot 0 of the 3D state tensor - Write intermediate states to different slots during token loop - For K=1, maintain backward-compatible single-slot behavior Ref: https://github.com/ggml-org/llama.cpp/commit/8c05923630110223669f069af2000e9cf10c02bc Assisted-by: llama.cpp:local pi * delta_net_base: use ggml_pad instead of new_tensor * review: add need_rs_seq * review: rename part_bounded to n_rs * review: deslop comments * review: rename, add asserts * server : adjust checkpoint logic (llama/11) * server : adjust checkpoint logic * cont : rm asserts * server-context: fix early exit * spec : fix compatibility with n-gram and add TODOs (llama/13) * metal : cleanup * llama : fix faulty bitwise check in recurrent memory * server : disable RS-based MTP in combination with other spec types * spec : add TODOs * cont : fix comment * cont : update comment * common : fix logic for ngram + mtp compat * llama-memory: enable checkpointing with partial rollback * cont: add test-case for loading into a dirty ctx * llama-memory-recurrent: clear rs_idx in clear * download: fix mtp path * llama-arch: fix enorm op * docs: update docs * conversion: fix type annotations --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * ggml : bump version to 0.12.0 (ggml/1494) * ggml-alloc: fix out-of-bounds read in ggml_dyn_tallocr_remove_block (ggml/1492) * ggml.h: correct ggml_silu_back arg docstring (a=dy, b=x) (ggml/1500) * vulkan: removed duplicate #include <memory> in headers (llama/23144) * vulkan: fuse SSM_CONV + BIAS + SILU (llama/22653) * vulkan: Support unaligned tensors for ROPE (llama/22637) * vulkan: add cpy bf16 -> f32 pipelines (llama/22677) * ggml-vulkan/CMakeLists: add a check for SPIRV-Headers (llama/22009) * ci/run: set explicit SPIR-V Headers search path for macOS vulkan CI For whatever reason, the files are under additional sub-path `vulkan/` under the cmake directory, which does not match either current LunarG macOS Vulkan SDK structure (`lib/cmake/SPIRV-Headers`), nor what gets installed when you run the cmake build+install for SPIRV-Headers itself on at least Linux (`share/cmake/SPIRV-Headers`). This allows for SPIRV-Headers to be found, as currently the CI runner's setup does not seem to include the relevant path in list of search locations. * ggml-vulkan/CMakeLists: add a check for SPIRV-Headers This is installed by the project if it is built and installed. Receiving an error during the configuration step is generally preferred to receiving an error in the middle of a build. * CUDA: Continue directly including cuda/iterator (llama/23102) Cont of #22936, forgot to update one site * feat: Support d_conv=15 for ssm-conv.cu (llama/23017) Branch: ModalityConditionalAdapters AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * sycl: route small f32 matmuls to oneMKL, bypass oneDNN (llama/22150) Signed-off-by: Chun Tao <chun.tao@intel.com> Co-authored-by: Chun Tao <chun.tao@intel.com> * sycl: scalar SWAR byte-subtract in Q6_K MMVQ dot product (llama/22156) Signed-off-by: Chun Tao <chun.tao@intel.com> Co-authored-by: Chun Tao <chun.tao@intel.com> * ggml-hexagon: add PAD op HVX kernel (llama/23078) * ggml-hexagon: add PAD op HVX kernel Implements GGML_OP_PAD on the Hexagon HTP backend using HVX vectorized kernels. Supports zero-padding and circular padding across all 4 tensor dimensions. * hex-ggml: remove duplicate op cases (merge conflict) * hex-pad: fix editorconfig checks and macro alignment --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com> * hexagon: add support for TRI op (llama/22822) * Hexagon: TRI HVX Kernel addition to ggml hexagon HTP ops and context * addressed PR review comments for TRI op * hexagon: clang format * hex-unary: remove merge conflict markers * hex-ggml: remove duplicate op cases (merge conflict) * hex-ggml: fix editor config errors --------- Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com> Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com> * rpc : keep last_graph_uid in the device context (llama/23273) With the introduction of MTP we can have multiple compute contexts for the same RPC device. In this case last_graph_uid is not updated properly when contexts are being switched. This patch fixes this by moving last_graph_uid to the device context, making sure it is always updated. closes: #23242 * sycl: add GGML_SYCL_USE_ASYNC_MEM_OP env toggle (llama/22153) * sycl: add GGML_SYCL_USE_ASYNC_MEM_OP env toggle Signed-off-by: Chun Tao <chun.tao@intel.com> * Use async mem ops for correctness when SYCL graphs are explicitly on. Signed-off-by: Tao, Chun <chun.tao@intel.com> --------- Signed-off-by: Chun Tao <chun.tao@intel.com> Signed-off-by: Tao, Chun <chun.tao@intel.com> Co-authored-by: Chun Tao <chun.tao@intel.com> * ggml-webgpu : extend GDN for K>1 (llama/23299) * hexagon: enable support for NORM op (llama/23319) * hexagon: add MROPE and IMROPE support in HTP rope op (llama/23317) * opencl: add MoE support for q4_k, q5_k, q6_k on Adreno (llama/23303) * opencl: add q4_k moe support * opencl: add q5_k moe support * opencl: add q6_k moe support * opencl: adjust format --------- Co-authored-by: Li He <lih@qti.qualcomm.com> * ggml-cuda: tune RDNA3 Q6_K MMVQ nwarps (llama/23349) * metal : optimize pad + cpy (llama/23354) * metal : optimize pad * metal : optinmize cpy * cont : better row packing in threadgroup * Programmatic Dependent Launch (PDL) for more performance on newer NVIDIA GPUs (Hopper+) (llama/22522) * Adds initial PDL setup. * Adds PDL barriers based on simple heuristic: place "sync" before first input pointer access, and "launch" after last write, e.g. to tensors like dst. * Further optimization pass of the first half of kernels * Optimized PDL barriers for the second batch of kernels * Further refinements after rebase. * Moves pdl logic to separate function, removes some whitespace * Strips post-hoc PDL logic * Adds stream capture PDL setup. Enrolls quantize_q8_1 to leverage pdl to overlap execution with previous kernels * Enrolls mul_mat_vec_q, rms_norm_f32 and k_bin_bcast (partly) into PDL * Enrolls mmvf, rope, set-rows and topk kernels for gpt-oss into PDL * Introduce ggml_cuda_kernel_launch, to abstract away cudaLaunchKernelEx, to enable hip/musa compatibility * Enrolls cpy_scalar_contiguous, k_get_rows_float and rms_norm_f32 * Enrolls flash_attn_combine_results * Fix: Drops needless and broken check of CUDA arch for PDL. PDL either works or is without effect. * Enrolls flash-attention kernels to pdl * Fix: inlines ggml_cuda_kernel_launch, and uses perfect forwarding for kernels args. This fixes PDL. * Perf: Enrolls k_bin_bcast variadic template invocation into PDL, via and template alias and template expansion * Enrolls all remaining kernels for qwen3-coder-next into PDL * Remove all PDL LC calls to create a baseline * Added LC according to internal guidance and tested kernel performance. * Enrols missing qwen3-5 kernels passively into PDL. * Kernel optimizations (LC signals) for qwen3.5 * Enrolls ssm-scan kernels into PDL * Adds GGML_CUDA_PDL command line option to toggle PDL. * Fix: Ada and lower compilation by guarding PDL calls correctly * Cleanup: Removes commented out GGML_CUDA_PDL_LC * Cleanup: Removes experimental comments * Adds 90-virtual to build script so that Hopper GPUs can leverage PDL. * Adds stricter checks to enable PDL, adds env-check to disable it, and removes now superfluous compile option to enable PDL. * Fix: Correct PDL en/disablement based on device-side arch check. Host side check is UB. Required moving from macros to inlined functions * Fix: default-disable PDL. Enable by setting GGML_CUDA_ENABLE_PDL=1 * Enable PDL by default for Hopper+ devices * Enrolls softcap_f32 and two flash_attn kernels into PDL. * Improves flash attn PDL barrier placement * Fix: Perf regression on ada; excludes ada and below from PDL launches * Improves some sync barrier placements * Drops superfluous constructor * Adds #endif guard comments * Reverts experimental change to top-k-moe.cu, which moved expensive allocations in front of the PDL barrier. It did not have a meaningful impact. * Exchanges GGML_CUDA_DISABLE_PDL with GGML_CUDA_PDL. IFF GGML_CUDA_PDL=0 PDL is disabled * Revert "Drops superfluous constructor". Adds const to remaining arguments This reverts commit 12b1d250da0089ae02a9bb71bbb3fd6d70f6f2f1. * Cleanup: Removes and fixes some comments and whitespace * Clarifies comment of sync-barrier position * Relocates and refactors PDL launch functions and accessories * Adds error checking to the regular kernel launch path * Drops "auto" in favor of "ggml_cuda_kernel_params" * Adds "const" to ggml_cuda_kernel_launch_params * [Whitespace] Adds final newline to common.cuh to make editorconfig CI job happy * hexagon: HMX quantized matmul rework (llama/23368) * hmx-mm: update debug logging in hmx-mm * hmx-mm: update dequant logic to use HVX_vector_x2/4 * hmx-mm: remove non-pipelined version of the quantize matmul It seems that we don't reall need non-pipelined version * hmx-mm: use activation depth mode and update naming Co-authored-by: Kim-Chyan Gan <kgan@qti.qualcomm.com> * hex-mm: minor hmx matmul naming updates * hmx-mm: remove unused vars * snapdragon: scripts bump default ubatch-size to 1K * hexagon: combine HMX and power and clock settings into a single set_power call * hmx-mm: remove leftover of the scale repl helper * hexagon: fix editconf error --------- Co-authored-by: Kim-Chyan Gan <kgan@qti.qualcomm.com> * vulkan: optimize operations in the IM2COL shader (llama/22685) * vulkan: optimize operations in the IM2COL shader * Add comments and improve the code formatting * opencl: refactor backend initilization (llama/23318) * opencl: refactor initialization * opencl: refactor GPU identification * opencl: rename for consistency * opencl: cache global mem size in dev_ctx * opencl: adjust log level * opencl: load argsort and flash_attn kernels in supports_op * argsort kernel must be built for supports_op for querying the max workgroups * flash_attn kernel has many variants, only load them when needed * hexagon: ssm-conv fix for large prompts (llama/23307) * hexagon: remove gathers and better handling of vtcm in ssm-conv * hexagon: relax ssm-conv gating requirements * hexagon: add new prefill ssm-conv backend test * hexagon: remove trailing white space * hex-rope: uninline rope_cache_init, otherwise it breaks after rebaseing with SSM_CONV changes --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com> * ggml : Check the right iface method before using the fallback 2d get (llama/23306) Probably no backends implement only one of 2d get/set, but this might be annoying for some future backend developer trying to add 2d get/set. * metal : optimize concat kernel and fix set kernel threads (llama/23411) * metal : fix GGML_OP_SET kernel threads * tests : extend test_cpy to support different src/dst shapes Extend test_cpy to support different source and destination tensor shapes for CPY operations (reshaping), where the total number of elements must match. - Renamed ne -> ne_src, added ne_dst parameter (default: use src shape) - Added 50 new reshaping test cases covering 1D<->2D<->3D<->4D conversions - Tests exercise 1024 boundary, small shapes, and large dimensionality changes - Fixed dangling reference bug (storing & to temporary std::array) - Updated all existing test calls with permute/transpose args for compatibility Assisted-by: llama.cpp:local pi * metal : optimize concat kernel with row batching for small widths When ne0 < 256, batch multiple rows into a single threadgroup to improve occupancy. This avoids underutilizing the GPU when processing narrow tensors. - Dispatch nth = min(256, ne0) threads per group - Calculate nrptg (rows per threadgroup) to fill up to 256 threads - Update kernel index calculation to handle the row batching - Add boundary check for i1 >= ne1 Assisted-by: llama.cpp:local pi * tests : clean-up * tests : refactor CPY shape tests to use dimension permutations Replace 75 hardcoded test cases with a loop over permutations of {3, 5, 7, 32} (total elements: 3360). Each src permutation is tested against canonical sorted and reverse dst, skipping identical shapes. Covers F32, F16, and Q4_0 (when both src and dst ne0 == 32). Assisted-by: llama.cpp:local pi * fix(flash-attn): replace f32 with kv_type and q_type (llama/23372) * vulkan: fuse snake activation (mul, sin, sqr, mul, add) (llama/22855) * vulkan: fuse snake activation (mul, sin, sqr, mul, add) Add snake.comp shader with F32 / F16 / BF16 pipelines and ggml_vk_snake_dispatch_fused. The matcher recognizes the naive 5 op decomposition emitted by audio decoders (BigVGAN, Vocos) for snake activation y = x + sin(a*x)^2 * inv_b and rewrites it to a single elementwise kernel. test_snake_fuse from the CUDA PR now also compares CPU naive vs Vulkan fused across F32 / F16 / BF16. * vulkan: address jeffbolznv review for fused snake activation Rename T / C to ne0 / ne1 in the shader and push constants to match the standard naming convention used across the Vulkan backend. Tighten ggml_vk_can_fuse_snake: require x and dst to be contiguous (the shader uses idx = i0 + i1 * ne0) and require a / inv_b to be tightly packed on the broadcast dim (the shader reads data_a[i1]). * vulkan: tighten snake fusion type checks for all operands (address jeffbolznv review) * vulkan: reject snake fusion when ne[2] or ne[3] > 1 (address jeffbolznv review) * vulkan: address 0cc4m review for fused snake activation snake.comp is renamed to follow the ggml DATA_A_* / A_TYPE convention. A_TYPE now applies to the activation tensor data_a instead of the broadcast multiplier, and the bindings become data_a (A_TYPE), data_b (float), data_c (float) and data_d (D_TYPE). A header at the top of the shader maps each buffer to its role in y = x + sin(b * x)^2 * c. On the C++ side, ggml_vk_can_fuse_snake reuses the existing snake_pattern constant instead of duplicating the op list, sin_node is extracted as a named local alongside the other chain nodes, and the broadcast operands a and inv_b are now required to be GGML_TYPE_F32 to match the hardcoded float bindings on data_b and data_c (the previous a->type == x->type would silently reject any future BF16 or F16 chain once the supports_op gate for SIN / SQR is lifted). ggml_vk_snake_dispatch_fused gets an explicit GGML_TYPE_F32 case and GGML_ABORT on default in place of the silent f32 fallback, and a stale comment about data_a[i1] / data_inv_b[i1] is refreshed to match the new binding names. * CUDA: fix PDL CC check for JIT compilation (llama/23471) * ggml-zendnn : add Q8_0 quantization support (llama/23414) * ggml-zendnn : add Q8_0 quantization support * ggml-zendnn : sync with latest ZenDNN * ggml-zendnn : address review comments for Q8_0 * SYCL: add BF16 to DMMV kernel path (~4x tg speedup on Intel Arc) (llama/21580) * SYCL: add BF16 to DMMV kernel path for ~4x token generation speedup BF16 models had no dedicated token generation kernel — they fell through to the generic full-GEMM path, resulting in ~14% memory bandwidth utilization on Intel Arc GPUs. This adds BF16 support to the DMMV (dequantize mul-mat-vec) path, matching the existing F16 implementation. Fixes #20478 * SYCL: fix BF16 DMMV out-of-bounds when ncols % 64 != 0 The qk=1 kernel (used for F16 and BF16) iterates with stride 2*GGML_SYCL_DMMV_X (= 64 on Intel targets where WARP_SIZE=16). When ncols is a multiple of DMMV_X (32) but not of 2*DMMV_X (64), the last warp iteration accesses elements at col >= ncols, producing NaN for the final row and wrong values for interior rows. Fix: tighten can_use_dequantize_mul_mat_vec to require ne[0] % (2*DMMV_X) == 0 for F16/BF16 types, and update the ASSERT in the BF16 launcher to match. Quantized types use block-structured kernels with different access patterns and keep the existing DMMV_X check. Verified: test-backend-ops MUL_MAT passes 913/913 on Intel Arc Pro B70. Previously failing: m=128/129 n=1 k=1056 cases (NaN and ERR > 0.0005). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * SYCL : gated_delta_net K>1 (llama/23174) * sycl_gated_delta_net K>1 * editor_config * sycl : Level Zero detection in ggml_sycl_init (llama/23097) * [SYCL] Centralize Level Zero detection in ggml_sycl_init * use the same wording * get back the warning * SYCL: improve MoE prefill throughput (llama/23142) - change `k_copy_src1_to_contiguous` so that uses a precomputed contiguous mapping where all rows "owned" by an expert are in one slice with a know starts and ends - switch the `O(n_as * n_routed_rows)` contraption to a counting sort-based procedure with `O(n_as + n_routed_rows)` complexity * opencl: generalize Adreno MoE kernels on M (llama/23449) * vulkan: fix windows find_package of SPIRV-Headers (llama/23215) * vulkan: fix windows find_package of SPIRV-Headers * not windows-only * ggml : Check the right iface method before using the fallback 2d get (llama/23514) * hexagon: apply repl optimization in flash attn softmax as #22993 (llama/23455) * opencl: batch profiling to improve speed and prevent memory leaks (llama/23495) * TP: fix entirely zero-sized slices per device (llama/23525) * ggml : Parallelize quant LUT init (llama/23595) - Use OpenMP to parallelize iq2xs_init_impl and iq3xs_init_impl. - Move the OpenMP detection from ggml-cpu to ggml-base. - Update OpenMP dependencies in ggml-config.cmake.in. * ggml : bump version to 0.12.1 (ggml/1508) * sync : ggml * talk-llama : sync llama.cpp * readme : add AMD ROCm/HIP GPU build instructions (#3823) Signed-off-by: Kaihui-AMD <Kaihui.Tang@amd.com> * ggml: `gguf_init_from_callback` and `gguf_init_from_buffer` (llama/22341) * ggml: implement `gguf_init_from_buffer` * test: `gguf_init_from_buffer` * fix: memory breakdown for a model loaded with `no_alloc` from a file is consistent with being loaded from a buffer * fix: use `GGML_UNUSED` Co-authored-by: Copilot <copilot@github.com> * fix: remove `total_size` from `gguf_reader` * fix: file offset calculation, rename `offset` to `data_offset` Co-authored-by: Copilot <copilot@github.com> * refactor: extract model loader bug fixes to another PR * feat: add `gguf_init_from_callback` * fix: always require a max expected size * fix: change `gguf_reader_callback_t`'s `output` type to `void *`, change `max_expected_size` and offsets to `uint64_t` * fix: harden against offset overflow in buffer read * fix: remove seek behavior from the callback * feat: `max_chunk_read == 0` means `SIZE_MAX` * fix: seeking in a gguf file with no tensors --------- Co-authored-by: Copilot <copilot@github.com> * TP: fix ggml context size calculation (llama/22616) * TP: fix ggml context size calculation, memory leak * move split state cache back into the context * revert to constant ggml context size for cgraphs * increase headroom for statically allocated tensors * remove obsolete include * ggml : bump version to 0.13.0 (ggml/1510) * sync : ggml * benches : update * release : v1.8.5 * cli : merge tokens split across UTF-8 boundaries in JSON output (#3751) * cli : merge tokens split across UTF-8 boundaries in JSON output When a multi-byte UTF-8 codepoint (most commonly a CJK character, 3 bytes) is split across multiple whisper tokens, the -ojf/--output-json-full writer emitted each token's partial bytes as its own JSON string, producing invalid UTF-8 that chokes downstream parsers. Merge adjacent tokens in output_json whenever the accumulated text still ends on an incomplete UTF-8 sequence. The merged entry keeps the first token's id/p/t_dtw and extends t1 to the last absorbed token, which matches how segment text is assembled elsewhere. Refs #1798 * fix: address review — add braces for consistency, use full issue URL - Add braces to if/else chain for codebase consistency - Use full URL for issue #1798 reference Review: @danbev --------- Co-authored-by: texasich <texasich@users.noreply.github.com> Co-authored-by: texasich <texasich@gmail.com> * docs : add AGENTS.md and CONTRIBUTING.md [no ci] (#3826) * docs : add AGENTS.md and CONTRIBUTING.md [no ci] This commit add AGENTS.md and CONTRIBUTING.md which are based on the same files in llama.cpp. They have been modified slightly to fit with whisper.cpp. The motivation for this is to clarify the contribution policy in whisper.cpp so that contributers can have a better understanding of the expectations and requirements for contributing to the project. * ci : only run docker jobs when pushed to master [no ci] (#3828) * ci : set GGML_NATIVE=OFF for bindings-java (#3830) * ci : set GGML_NATIVE=OFF for bindings-java This commit attempts to address an issue with the bindings-java job which is currently failing. I've not been able to reproduce this locally my windows machine and I suspect that what might be happning is that windows job compiles on a runner where it has different CPU features, for example AVX512 and when this dll is used on a different runner that does not have that feature it will crash. Refs: https://github.com/ggml-org/whisper.cpp/actions/runs/26496174929/job/78059073255?pr=3829 * ci : also disable BMI2 * ci : renable arm64 docker builds (#3832) This commit re-enables the arm64 docker images builds which were removed in Commit 9366544991bfee59c927e7c23b1861c6c762e708 ("ci : fix arm builds"). It also uses ubuntu-24.04-arm as the runner which enables us to avoid QEMU. Resolves: https://github.com/ggml-org/whisper.cpp/issues/2859 * ci : add on push/pull_request paths ruby job (#3833) * ci : add on push/pull_request paths ruby job This commit adds paths to bindings-ruby to only build if changes where made to bindings/ruby or to include/whisper.h. * ci : add additional paths [no ci] * ci : fix include paths for bindings-go job [no ci] (#3835) * ci : add ignore for bindings/{ruby, go} in build.yml [no ci] (#3837) This commit adds an ignore for bindings-ruby and bindings-go in build.yml as these are handled by separate .yml file (separate jobs) and don't need to trigger a full CI build. * CUDA: add fast walsh-hadamard transform (llama/23615) * CUDA: add fast walsh-hadamard transform * review: add unrolls + change size_t -> int * warp size 64 --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * metal : add apple device id (llama/23566) Co-authored-by: lvyichen <lvyichen@stepfun.com> * CUDA: missing PDL sync for FWHT, better fallback (llama/23690) * Check batch_compute_passes before sending passes when not doing GPU profiling (llama/23457) * Only run webgpu CI on my fork * Add webgpu only workflow * refactor batch_compute_passes to a per-thread variable, and submit individual passes when it is set to false and no GPU profiling is enabled * restore build.yml * ggml-webgpu: Add MMVQ path for Q4/Q8/Q2_K/Q4_K and clean up legacy MUL_MAT pipeline (llama/23594) * ggml-webgpu: Add MMVQ path for Q4/Q8/Q2_K/Q4_K * Fix to editorconfig checking pass * Remove mul-mat-legacy pipeline * Fix to use vendor name as is and add dot_product/vendor to shader_lib_ctx * SYCL: implement ggml_sycl_pool_vmm (llama/22862) * SYCL: implement ggml_sycl_pool_vmm * Add an option to bypass VMM with GGML_SYCL_DISABLE_VMM * Clean up debugging logging * document GGML_SYCL_DISABLE_VMM * Multi-stream MoE optimization * Revert "Multi-stream MoE optimization" This reverts commit 938929c3f13a562ec67c59e87cc5d38595444cce. * Update common.hpp Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com> * Flip GGML_SYCL_DISABLE_VMM to GGML_SYCL_ENABLE_VMM * add logging for GGML_SYCL_ENABLE_VMM when extension is not available (SYCL_EXT_ONEAPI_VIRTUAL_MEM macro) * Apply suggestions from code review Co-authored-by: Alexey Kopytko <alexey@kopytko.com> * Apply suggestion from @sanmai * Apply suggestion from @sanmai --------- Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com> * hexagon: add support for CONCAT op (llama/23648) * hexagon: add support for CONCAT with optimized concat_2d_transposed qwen3.5 models are quite heavy on the CONCAT with large and transposed src1. * hex-concat: use fastdiv in generic version * hex-concat: make checks for transposed a bit more readable * hex-concat: reoder dma ops for better pipelining * hex-cont/cpy: optimize CPY and CONT ops The primary change is to avoid scalar divs in the inner loops. We were calling hvx_copy_uu(... type_size) where type_size is non a constexpr. This causes runtime divs by that value which is normally just 4 or 2 (f32/f16). * hex-get-rows: optimize GET_ROWS for large rows We now use DMA for larger rows and also split them into chunks to improve perf for Qwen3.5 and other models that do lots of GET_ROWS with huge (2MB+ rows). Also bump the DMA queue depth now that we can take advantage of it. * hex-concat: unroll the inner loops of concat_2d * hex-concat: more updates to concat_2d to improve perf a bit further * hex-cpy: fixed n_rows per thread checks in the copy ops * hmx-fa: fix alignment issues while computing dma sizes * hex-set-rows: add early returns for idle threads * hvx-rope: minor optimization to replace loops with fastdiv logic * hex-rope: replace scalar tail processing with HVX * hex-rope: optimize rope cache init with HVX Add hvx-utils sin/cos helpers that use an aprox method (similar to rsqrt, inverse, etc) Use the helpers to optimize ROPE. * vulkan: optimize conv2d and implement coopmat1 support (llama/22620) * vulkan: add CONV_SHAPE_64x128 for medium-K conv2d * vulkan: skip conv2d bounds checks when shapes align with tile sizes * vulkan: use WG_SIZE=128 for CONV_SHAPE_64x32 conv2d * vulkan: stage cm2 conv2d accumulator through shmem before global store * vulkan: add coopmat1 conv2d path * fallback when using too much shared memory. clean up comments * Require 16x16x16 and subgroup size 32 or 64 * check whether shared memory is sufficient before overwriting conv2d params with coopmat1 values * ggml-zendnn : fixed naming of matmul function (llama/20964) * ggml-zendnn: fixed naming of matmul function * ggml-zendnn: fixed naming of mul_mat_id function * ggml-zendnn: fixed print in mul_mat_id --------- Co-authored-by: plotnikov.v10 <plotnikov.v10@wb.ru> * vulkan: avoid preferring transfer queue on AMD UMA devices (llama/22455) * CUDA: restrict PDL to CTK >= 12.3 due to MSVC issues (llama/23742) * vulkan: add REPEAT op support for f16 to f16. (llama/23298) * feat: extend repeat op for vulkan * feat: add repeat_f16 vulkan pipeline * fix: ensure same dst and src types * fix: use type_size instead of data types * fix: use int16 and int32 for repeat shader op * chore: rename repeat_f* to repeat_i* * chore: rename repeat vulkan pipelines * vulkan: use GL_NV_cooperative_matrix_decode_vector for faster matmul (llama/23541) * vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32 (llama/22887) * vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32 Against mesa git, this shows a 4.8% performance improvement for tg128 on Qwen3.5-9B:BF16 on Intel BMG. Note that this breaks some tests until the last commit which fixes OOB A reads. * vulkan: Use aligned loads in mul_mat_vec when available Against mesa git, this shows a 3.3% performance improvement for tg128 on Qwen3.5-9B:BF16 on Intel BMG. * Make explicit that `num_rows` is <= `NUM_ROWS` in mul_mat_vec Mesa's UUB logic can't see through conditionals, limiting its ability to understand the bounds on the `num_rows` field in the cleanup run. Making it explicit that `num_rows` is, indeed, always <= `NUM_ROWS` helps mesa make slightly better codegen. Against mesa git, this currently shows a 1% performance improvement in tg128 on Qwen3.5-9B:BF16 on Intel BMG. * vulkan: Fix OOB A reads in MUL_MAT_VEC for odd sizes There was a TODO to fix the OOB reads from the A matrix which we do here. It is within performance noise (+<0.1%) in tg128 for Qwen3.5-9B:BF16 on Intel BMG. * ggml-webgpu: Fix how to dispatch WG to some ops (llama/23750) * hexagon: add support for Q4_1 in MUL_MAT and MUL_MAT_ID (llama/23647) * hex-mm: add support for Q4_1 matmul/matvec, hvx-only for now * hmx-mm: add support for Q4_1 * hex-mm: use Q8_1 dynamic quantization to avoid having to compute sums in the vec_dot * hexagon: fix repack scratch buffer overflow * hex-mm: fix Q4_1 repack buffer sizing * hexagon: flip the build order for mm and fa (seems to help LTO) * hex-mm: add vec_dot 4x1s and minor HMX cleanup after adding Q4_1 * hex-mm: fix fp16 vec_dot fallback to 2x1 and another issue that could cause incorrect output * hexagon: resurrect early-wake and add support for polling for op-batch completions With Q4_1 ggml-hexagon now claims pretty much the entire graphs which gives the CPU more time to chilax. This is a good thing! But it does add extra latency for the pure benchmark runs. Early wakeup helps recover the latency a bit in the normals runs and op-batch polling is just for benchmarking. --------- Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com> * ggml-webgpu: remove legacy constants (llama/23672) * opencl: OP_GATED_DELTA_NET (llama/23312) * OP_GATED_DELTA_NET impl * add back lanes_per_column declaration * removed has_subgroup_arithmetic and has_subgroup_clustered_reduce * removed trailing spaces and fixes indentation. Hard coded subgroup size for Adreno and Intel. Return not supported when K>1 state snapshot * support for K>1 state snapshot * removed picky indent multiple of 4 fixes * removed return that won\'t be executed * Hexagon: OP_GATED_DELTA_NET K>1 support (llama/23531) * K>1 state snapshot support * removed picky indent multiple of 4 fixes * ggml: fixed Arm SVE usage bug in vec.h, vec.cpp (llama/22841) * Updated vec.h/vec.cpp code to accumulate to F32 rather than F16 Change-Id: I0cb789347f2bf60ffaf9047319f727e788c825f8 Signed-off-by: Martin Klacer <martin.klacer@arm.com> Co-authored-by: Milos Puzovic <Milos.Puzovic@arm.com> * cuda : fix KQ mask offset integer overflow in fattn MMA kernel (llama/23610) Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> * vulkan: Fix memory logger unsafe iterator access (llama/23667) * vulkan: fix wrong index variable in inner loop (llama/23665) * vulkan: fast path for walsh-hadamard transform (llama/23687) * vulkan: fast path for walsh-hadamard transform * disable for intel due to segfault * hexagon: minor refresh for HMX FA and MM (llama/23796) * hex-fa: clean up qf32/fp32 handling and stride handling * hex-fa: fix corner case fp NAN issues that were cause bad output from gemma4 on v79 * hex-fa: vectorize leftover handling * hex-fa: avoid HVX fallback during token gen HMX has more FP16 compute capacity * hmx-mm: remove dead code * hmx-mm: use fastdiv in x4x2 dequant * hmx-mm: sandwich dequant and scatter to improve perf * hmx-mm: fixed rebase conflicts * hmx-mm: further improve weight dequant by doing early type dispatch and precomputing fastdiv * hmx-mm: an even earlier dispatch for per-type dequant * hmx-mm: dequant linear types like q4_0 and q4_1 without the LUTs This is a bit faster than LUT. * hex-cmake: one more tweak for lto --------- Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com> * CUDA: route batch>=4 quantized matmul to MMQ on AMD MFMA hardware (llama/23227) * CUDA: per-quant MMVQ/MMQ batch threshold on AMD MFMA hardware The dispatcher uses a single global threshold (MMVQ_MAX_BATCH_SIZE = 8) to choose between mul_mat_vec_q (per-row GEMV) and mul_mat_q (MFMA-tiled GEMM) for quantized matmul. On AMD CDNA, the optimal crossover differs substantially by quant family because the per-row GEMV cost is dominated by dequantisation, not the dot-product itself: K-quants pay a heavier super-block decode and so MMQ wins sooner; legacy and IQ quants have lean decode and stay ahead until the batch fully populates an MFMA tile. This patch introduces ggml_cuda_should_use_mmvq(type, cc, ne11) -> bool, mirroring the existing ggml_cuda_should_use_mmq, and gates per-quant thresholds on amd_mfma_available(cc): Q3_K, Q4_K, Q5_K : MMVQ <= 3 (MMQ wins from batch=4: +5% .. +76%) Q2_K, Q6_K : MMVQ <= 5 (MMQ wins from batch=6: +8% .. +35%) others : MMVQ <= 8 (legacy & IQ regress under MMQ; unchanged) Non-AMD-MFMA paths (NVIDIA, RDNA, CDNA1 without MFMA) are byte-identical to master. GGML_CUDA_FORCE_MMVQ=1 restores the original global threshold for A/B testing. Measured on MI250X (gfx90a, ROCm 7.2.1) with Llama-3.2-3B-Instruct, llama-bench pp512 across all 20 supported quants, ubatch 1..8, 10 reps. Full table in PR description. Selected pp512 throughput (tok/s, ub=8): Q4_K_S: 559 -> 940 (+68%) Q5_K_S: 503 -> 884 (+76%) Q3_K_S: 629 -> 879 (+40%) Q2_K : 615 -> 809 (+32%) Q6_K : 582 -> 776 (+33%) Selected pp512 throughput (tok/s, ub=4): Q4_K_S: 444 -> 480 (+ 8%) Q4_0 : 682 -> 685 (+ 0%) (no regression - retains MMVQ) IQ4_XS: 706 -> 698 (- 1%) (no regression - retains MMVQ) * CUDA: address review — inline MMVQ batch table, drop env hatch & doc block * tune kernel selection logic for CDNA1 --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * mmvq Optim: add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for … (#23729) * mmvq Optim: add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for SM75 TURING * avoid a mismatch for JIT compilation of Turing device code for Ampere or newer Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Copilot <copilot@github.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * ggml: auto apply iGPU flag CUDA/HIP if integrated device (llama/23007) * opencl: move backend info printing into its own function (llama/23702) * opencl: move backend info print into its own function * opencl: move new log line * opencl: fix for non adreno path * hexagon: basic/generic op fusion support and RMS_NORM+MUL fusion (llama/23835) Updating infra to enable op fusion and using RMS_NORM+MUL as the use-case. * meta : Add missing `buffer` set in allreduce fallback !COMPUTE clear (llama/23480) Without this at least the vulkan backend will skip the `* 0` for !COMPUTE tensors, causing corrupt output. * cuda : disables launch_fattn PDL enrollment due to compiler bug (llama/23825) * sync : ggml * talk-llama : sync llama.cpp * ggml : bump version to 0.13.1 (ggml/1523) * sync : ggml * common : re-implement `ffmpeg-transcode.cpp` + clarify ffmpeg usage (#3846) * examples : remove ffmpeg-transcode.cpp * examples : implement ffmpeg-transcode.cpp Assisted-by: llama.cpp:local pi * common : switch from WHISPER_FFMPEG -> WHISPER_COMMON_FFMPEG * common : pass sample rate to `ffmpeg_decode_audio()` * ci : remove obsolete self-hosted label * pi : add config [no ci] * ci : fix self-hosted paths to mnt * ci : fix path to whisper.h in examples.yml [no ci] (#3842) This commit updates the include path to whisper.h and also ensures that this is only built on pushes to master. * release : v1.8.6 * cmake : do not assume /usr/lib library installation. (#3693) Current `pkgconfig` configuration file installation path and its contents assume libraries are installed under `/usr/lib` and this is not always the case, for instance `/usr/lib64` is quite possible under Gentoo Linux. Thus use the `CMAKE_INSTALL_LIBDIR` variable instead of a hardcoded `lib`. * server : merge split utf-8 token text in verbose json (#3850) * whisper : catch C++ exceptions in whisper_init_with_params_no_state (#3831) whisper_model_load() can throw instead of returning false: std::runtime_error from this file (failed ggml context / no compatible buffer type), or vk::SystemError / vk::OutOfDeviceMemoryError from the ggml-vulkan backend during device/buffer allocation. whisper_init_* are extern "C", so a C++ exception unwinding across that boundary aborts non-C++ callers (Rust via whisper-rs, Go via cgo) -- on Windows STATUS_STACK_BUFFER_OVERRUN (0xC0000409) -- even though the function already returns NULL on failure. Wrap whisper_model_load() in try/catch and route any throw into the existing NULL-return path. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> * ci : refactor + optimize (#3847) * ci : add ccache clear action * ci : split self-hosted GPU jobs into build-self-hosted.yml Extract self-hosted runner jobs from build.yml into a dedicated build-self-hosted.yml following the llama.cpp pattern: - gpu-cuda (NVIDIA Linux) - gpu-vulkan-nvidia-cm (NVIDIA Linux) - gpu-vulkan-nvidia-cm2 (NVIDIA Linux + COOPMAT2) - gpu-metal (macOS ARM64) - gpu-vulkan (macOS ARM64) GitHub-hosted CPU jobs remain in build.yml. Assisted-by: llama.cpp:local pi * ci : split release jobs into release.yml Extract release-related jobs from build.yml into a dedicated release.yml following the llama.cpp pattern: - determine-tag - windows (Win32/x64, SDL2) - windows-blas (Win32/x64, OpenBLAS) - windows-cublas (x64, CUDA 11.8/12.4) - ios-xcode-build - bindings-java (depends on windows) - release (artifact aggregation + GitHub release) CoreML job stays in build.yml with its own local tag calculation. Assisted-by: llama.cpp:local pi * ci : remove bindings-java job from release.yml Assisted-by: llama.cpp:local pi * cont : add manual trigger for build.yml * cont : remove obsolete ifs * ci : extract sanitizer job to bild-sanitize.yml * ci : extract linux jobs into build-linux.yml * ci : extract macos jobs to build-macos.yml * ci : extract gcc jobs to build-gcc.yml * ci : extract clang jobs to build-clang.yml * ci : extract sycl jobs to build-sycl.yml * ci : extract windows jobs to build-windows.yml * ci : extract emscripten job to build-wasm.yml * ci : extract android jobs into build-android.yml * ci : extract quantize job to quantize.yml * ci : extract coreml job into coreml.yml * ci : extract vad job to vad.yml * ci : extract cpu jobs to build-cpu.yml * ci : make naming of yml files consistent * ci : add --fail to curl download and propagate This commit adds the --fail option to the model download scripts so that if the model download returns a server error this is picked up. This is then detected in run.sh and a error message is displayed and the script stops and returns an error. The motivation for this is that currently it is possible for the model download to fail but this script proceeds and instead of a model file the contents will be an html page probably with the error. This will then cause the model to not be able to load due to a missing magic number. I'm not sure we can do much about the downloading failing, perhaps a retry but at least this will give a clearer error message. Refs: https://github.com/danbev/whisper.cpp/actions/runs/26866349389/job/79230794512 * ci : enable command traces to see download command in use * ci : add retry functionality to download model script This commit adds curl retry options to the model download script. The motivation is that currently when CI jobs run huggingface rate limit the requests and return: ```console curl: (22) The requested URL returned error: 429 ``` This is an attempt to work around this and if it does not work then we can an authorization token. * ci : extract freebsd job to build-freebsd.yml This job has been commented out as it has been flaky in the past. I'll monitor this and if it continues to be unreliable we can disable it in the github actions GUI instead of commenting it out like we did before. * ci : add ccache to jobs (non-docker builds) The ccache will only be saved on pushed to master. * ci : bump ccache-action version to v1.2.21 The motivation for this is that the save parameter does not seem to work with the current version. * ci : add ccache to docker jobs in build-linux.yml * ci : add debug statements to linux docker build * ci : set CCACHE_DIR for build-linux.yml * ci : add ccache to the remaining docker jobs * ci : remove build-linux.yml This commit remove build-linux.yml as the same jobs are also run by build-gcc.yml, with the exception that build-gcc.yml also run ctest). So keeping build-gcc.yml and removing the redundant build-linux.yml. * ci : add linux build artifacts to release * ci : revert to hendrikmuhs/ccache-action for win job This is currently causing the following failure: ```console sccache C:\PROGRA~1\NVIDIA~1\CUDA\v\bin\nvcc.exe -forward-unknown-to-host-compiler -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_CRT_SECURE_NO_WARNINGS -D_XOPEN_SOURCE=600 -Dggml_cuda_EXPORTS -DCMAKE_INTDIR=\"Release\" -ID:\a\whisper.cpp\whisper.cpp\ggml\src\ggml-cuda\.. -ID:\a\whisper.cpp\whisper.cpp\ggml\src\..\include -isystem "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v\include" -Xcompiler="-MD -O2 -Ob2" -DNDEBUG -std=c++17 -arch=native -use_fast_math -extended-lambda -Xcompiler /Zc:preprocessor -MD -MT ggml\src\ggml-cuda\CMakeFiles\ggml-cuda.dir\Release\allreduce.cu.obj -MF ggml\src\ggml-cuda\CMakeFiles\ggml-cuda.dir\Release\allreduce.cu.obj.d -x cu -c D:\a\whisper.cpp\whisper.cpp\ggml\src\ggml-cuda\allreduce.cu -o ggml\src\ggml-cuda\CMakeFiles\ggml-cuda.dir\Release\allreduce.cu.obj -Xcompiler=-Fdggml\src\ggml-cuda\CMakeFiles\ggml-cuda.dir\Release\,-FS sccache: encountered fatal error sccache: error: Could not parse shell line sccache: caused by: Could not parse shell line ``` Refs: https://github.com/danbev/whisper.cpp/actions/runs/26883673904/job/79290017353 * ci : make static linux artifacts * ci : make linux release artifact names consistent This commit removes the tag form the linux release artifacts to be consistent with the existing artifacts. If we want to include the tag then we can do that in a follow-up PR. * ci : fix linux zip files to have a directory * ci : add HF_TOKEN secret for HF download authorization This is to avoid the HR rate limiting when downloading model. --------- Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com> * ci : only publish/push docker images daily (#3854) This commit updates the docker workflow to be triggered on a schedule or manually. * ci : use ccache instead of sccache for windows-cublas [no ci] (#3855) This commit updates the Install cache step to use ggml-org/ccache-action and switched to use ccache instead of sccache. The motivation for switching to ccache is that this is what llama.cpp does and also there is an issue with later version of sscache: ```console sccache C:\PROGRA~1\NVIDIA~1\CUDA\v\bin\nvcc.exe -forward-unknown-to-host-compiler -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_SCHED_MAX_COPIES=4 -DGGML_SHARED -D_CRT_SECURE_NO_WARNINGS -D_XOPEN_SOURCE=600 -Dggml_cuda_EXPORTS -DCMAKE_INTDIR=\"Release\" -ID:\a\whisper.cpp\whisper.cpp\ggml\src\ggml-cuda\.. -ID:\a\whisper.cpp\whisper.cpp\ggml\src\..\include -isystem "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v\include" -Xcompiler="-MD -O2 -Ob2" -DNDEBUG -std=c++17 -arch=native -use_fast_math …

danbev approved these changes May 22, 2026

View reviewed changes

Comment thread examples/common-whisper.cpp Outdated

Comment thread examples/common-whisper.cpp Outdated

Comment thread examples/server/server.cpp Outdated

nits

ee0e40a

danbev merged commit 0ccd896 into ggml-org:master May 22, 2026
64 checks passed

marcoStocchi mentioned this pull request May 22, 2026

Without --convert, whisper-server fails to process any input file #3774

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: server /inference fails to decode in-memory audio (regression)#3818

fix: server /inference fails to decode in-memory audio (regression)#3818
danbev merged 2 commits into
ggml-org:masterfrom
ServeurpersoCom:fix/server-inmemory-audio-decode

ServeurpersoCom commented May 21, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

danbev commented May 22, 2026

Uh oh!

Uh oh!

ServeurpersoCom commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

ServeurpersoCom commented May 21, 2026