Four fixes from code audit of commit e211e4d:
1. CUDA WHT out-of-bounds (critical): kvarn_materialize_swa_kernel re-implemented
the 128-dim inverse WHT inline, running the butterfly on all 128 threads without
the threadIdx.x < 64 guard. Threads 64-127 read/wrote sh[128..255] on a
float[128] shared array — UB that produced correct results in sh[0..127] (only
threads 0-63 touch those) but could corrupt neighboring blocks under high
occupancy. Replaced with kvarn_wht_128(sh), the guarded store-path WHT. Since
H_128 is symmetric and H_128*H_128 = 128*I, the forward WHT also serves as the
inverse (up to the normalization factor).
2. force-materialize null mat_idxs (major): self_kvarn_mat_idxs_swa was only built
under !kvarn_force_materialize_enabled(), but the non-rotated (force-materialize)
path still calls get_k/get_v -> materialize(swa=true, mat_idxs=nullptr), which
derefs indices->type and crashes. Now built whenever the SWA cache is KVarN,
independent of force-materialize.
3. SWA ring undersized (major): n_groups_per_stream = ceil(kv_size/128) was too
small. The metadata window of kv_size cells can span up to ceil(kv_size/128)+1
tiles (the sliding window is rarely tile-aligned), so the oldest in-window tile's
record slot collided with a newer tile, silently zeroing it. Now
ceil(kv_size/128)+2 for SWA, with a backstop assert documenting the invariant.
4. Vulkan SWA path (gap): kvarn_store.comp and kvarn_materialize.comp had no SWA
support (linear group decode, group==0 sink, no swa push-constant). Vulkan
advertises kvarn_native_ops, so SWA KVarN layers could offload to Vulkan and run
the non-SWA shaders on absolute-position indices -> silent garbage. Added swa
push-constant, ring slot math, per-cell position decode, and empty-cell zeroing
to both shaders, mirroring CPU/CUDA. Host dispatch reads op_params[4] (store)
and [6] (materialize) and asserts single-stream for SWA.
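Sketch for fix 1, as a host-side C++ model rather than the actual kernel: the inner loop stands in for threads 0..63, and the index formula is an assumed (common) butterfly layout, not copied from kvarn_wht_128. The names wht_128, max_index_touched and round_trip are hypothetical. Under that assumption it shows why 64 threads are enough for a 128-element transform, where an unguarded thread would land, and that applying the unnormalized transform twice returns 128 times the input.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kDim = 128;

// Highest shared-array index that "thread" tid would touch across all
// butterfly stages, using the assumed layout i = (tid / h) * 2h + (tid % h).
inline std::size_t max_index_touched(std::size_t tid) {
    std::size_t hi = 0;
    for (std::size_t h = 1; h < kDim; h <<= 1) {
        std::size_t j = (tid / h) * 2 * h + (tid % h) + h;
        if (j > hi) hi = j;
    }
    return hi;
}

// Unnormalized in-place Walsh-Hadamard transform on 128 floats.
// Each stage performs 64 butterflies, one per tid in 0..63; the pairs within
// a stage are disjoint, so serial order here matches a synchronized GPU stage.
inline void wht_128(std::array<float, kDim>& sh) {
    for (std::size_t h = 1; h < kDim; h <<= 1) {
        for (std::size_t tid = 0; tid < kDim / 2; ++tid) {  // the tid < 64 guard
            std::size_t i = (tid / h) * 2 * h + (tid % h);
            std::size_t j = i + h;
            float a = sh[i];
            float b = sh[j];
            sh[i] = a + b;
            sh[j] = a - b;
        }
    }
}

// Forward transform applied twice, then scaled by 1/128: should give back x.
inline std::array<float, kDim> round_trip(std::array<float, kDim> x) {
    wht_128(x);
    wht_128(x);
    for (float& v : x) v /= static_cast<float>(kDim);
    return x;
}
```

In this model tid 63 never goes past index 127, while tids 64..127 only ever touch indices 128..255, which matches the out-of-bounds range described in fix 1.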
Verified: test-kvarn green (CPU+CUDA SWA parity, GPU SWA path now uses guarded
WHT); llama-perplexity KLD on Gemma 4 31B Q5/16k/kvarn4 = 0.7296 (statistically
identical to pre-fix 0.7305 — fixes resolve latent bugs without changing validated
quality); GGML_KVARN_FORCE_MATERIALIZE=1 smoke test on Gemma 4 31B generates coherent
text (no crash). The Vulkan path is untested (not compiled in this CUDA-only build).
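Sketch for fix 3, illustrating the tile-count argument only. The tile size of 128 is from the message above; the helper names are hypothetical, and the collision check assumes consecutive tiles map to ring slots by simple modulo, which may differ from the real slot math.

```cpp
#include <cstdint>

constexpr int64_t kTile = 128;

inline int64_t ceil_div(int64_t a, int64_t b) { return (a + b - 1) / b; }

// Number of distinct 128-cell tiles touched by the window of kv_size cells
// whose newest cell is at absolute position last_pos.
inline int64_t tiles_spanned(int64_t last_pos, int64_t kv_size) {
    int64_t first_pos = last_pos - kv_size + 1;
    if (first_pos < 0) first_pos = 0;
    return last_pos / kTile - first_pos / kTile + 1;
}

// Worst case over every alignment of the window start within a tile.
inline int64_t max_tiles_spanned(int64_t kv_size) {
    int64_t worst = 0;
    for (int64_t off = 0; off < kTile; ++off) {
        int64_t n = tiles_spanned(kv_size - 1 + kTile + off, kv_size);
        if (n > worst) worst = n;
    }
    return worst;
}

// Assumption: slot = tile % n_slots. Consecutive tiles then share a slot
// exactly when the window spans more tiles than there are slots.
inline bool window_collides(int64_t last_pos, int64_t kv_size, int64_t n_slots) {
    return tiles_spanned(last_pos, kv_size) > n_slots;
}
```

For kv_size = 1024, a window ending at position 1152 covers tiles 1 through 9, i.e. 9 tiles, one more than the old ring size of 8. The bound ceil(kv_size/128)+1 is a worst case: a window of 129 cells, for instance, never spans more than 2 tiles.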