Commit 7c2c763

docs: sync CLAUDE.md/AGENTS.md with v0.3.2 cache types and KVarN
Add turbo4_tcq and the new low-bit standard quant KV types (q2_0/q2_1/q3_0/q3_1/q6_1) to the cache-types overview, add a KVarN feature bullet, and list the KVarN source files in the fork-specific files section. These were missing after the v0.3.2 KVarN and quant additions.
1 parent d9edb8d commit 7c2c763

2 files changed

Lines changed: 14 additions & 2 deletions

AGENTS.md

Lines changed: 7 additions & 1 deletion
@@ -10,7 +10,8 @@ BeeLlama.cpp is Anbeeld's fork of llama.cpp. Fork-specific work is concentrated
 - **Adaptive draft-max**: server controllers that adjust the active DFlash draft horizon. The default controller is `profit`; `fringe` is also available.
 - **DDTree**: tree speculative verification with GPU `parent_ids` and recurrent tree kernels.
 - **CopySpec**: model-free speculation through rolling-hash suffix matching.
-- **TurboQuant / TCQ KV cache types**: `turbo2`, `turbo3`, `turbo4`, `turbo2_tcq`, and `turbo3_tcq`.
+- **TurboQuant / TCQ KV cache types**: `turbo2`, `turbo3`, `turbo4`, `turbo2_tcq`, `turbo3_tcq`, and `turbo4_tcq`, plus newly added low-bit standard quantized KV types `q2_0`/`q2_1`/`q3_0`/`q3_1`/`q6_1`.
+- **KVarN KV-cache compression** (experimental): structured KV compression via `--cache-type-k`/`--cache-type-v` pseudo names `kvarn2``kvarn8` (all nine 2/3/4/5/6/8 K/V bit combinations); target-context only, wired for Qwen3.6 and Gemma 4, with non-KVarN layers falling back to bit-width-matched standard cache types.
 - **Reasoning loop guard**: server-side detection and intervention for repeated hidden reasoning output.
 
 Treat the local codebase as the source of truth for implementation behavior.
@@ -65,6 +66,11 @@ Key binaries: `build/bin/llama-server`, `build/bin/llama-cli`, `build/bin/llama-
 - `tools/server/server-loop-guard.cpp` / `.h` - reasoning loop guard.
 - `ggml/src/ggml-turbo-quant.c` - CPU reference quantize/dequantize for turbo2/3/4 and TCQ types.
 - `ggml/src/ggml-cuda/turbo-quant-cuda.cuh` - CUDA set-rows/dequantize kernels for all TurboQuant types.
+- `src/llama-kvarn.cpp` / `.h` - KVarN type descriptors, tile layout, bit-width presets, and runtime validation.
+- `src/llama-kv-cache-kvarn.cpp` / `.h` - target-context KVarN KV cache and group-range state serialization.
+- `ggml/src/ggml-cuda/kvarn.cu` / `.cuh` - CUDA KVarN store/materialize ops.
+- `ggml/src/ggml-cuda/fattn-kvarn.cuh` - KVarN FlashAttention kernels.
+- `ggml/src/ggml-vulkan/vulkan-shaders/kvarn_store.comp` / `kvarn_materialize.comp` - Vulkan KVarN store/materialize shaders.
 - `ggml/src/ggml-cuda/cross-ring-interleave.cu` - GPU cross-ring management and interleave kernel.
 - `ggml/src/ggml-cuda/gated_delta_net.cu` - DeltaNet CUDA kernels.
 - `ggml/src/ggml-cuda/ssm-conv.cu` - SSM convolution CUDA kernels.

CLAUDE.md

Lines changed: 7 additions & 1 deletion
@@ -7,7 +7,8 @@ This file gives code assistants local context for this repository.
 BeeLlama.cpp is Anbeeld's fork of llama.cpp. Fork-specific work is concentrated around:
 
 - **DFlash**: cross-attention speculative decoding with DFlash draft GGUFs, target hidden-state capture, CPU/GPU ring buffers, and server verification paths.
-- **TurboQuant / TCQ KV cache types**: `turbo2`, `turbo3`, `turbo4`, `turbo2_tcq`, and `turbo3_tcq`.
+- **TurboQuant / TCQ KV cache types**: `turbo2`, `turbo3`, `turbo4`, `turbo2_tcq`, `turbo3_tcq`, and `turbo4_tcq`, plus newly added low-bit standard quantized KV types `q2_0`/`q2_1`/`q3_0`/`q3_1`/`q6_1`.
+- **KVarN KV-cache compression** (experimental): structured KV compression via `--cache-type-k`/`--cache-type-v` pseudo names `kvarn2``kvarn8` (all nine 2/3/4/5/6/8 K/V bit combinations); target-context only, wired for Qwen3.6 and Gemma 4, with non-KVarN layers falling back to bit-width-matched standard cache types.
 - **Adaptive draft-max**: server controllers that adjust the active DFlash draft horizon. The default controller is `profit`; `fringe` is also available.
 - **DDTree**: tree speculative verification with GPU `parent_ids` and recurrent tree kernels.
 - **CopySpec**: model-free speculation through rolling-hash suffix matching.
@@ -65,6 +66,11 @@ Key binaries: `build/bin/llama-server`, `build/bin/llama-cli`, `build/bin/llama-
 - `tools/server/server-loop-guard.cpp` / `.h` - reasoning loop guard.
 - `ggml/src/ggml-turbo-quant.c` - CPU reference quantize/dequantize for turbo2/3/4 and TCQ types.
 - `ggml/src/ggml-cuda/turbo-quant-cuda.cuh` - CUDA set-rows/dequantize kernels for all TurboQuant types.
+- `src/llama-kvarn.cpp` / `.h` - KVarN type descriptors, tile layout, bit-width presets, and runtime validation.
+- `src/llama-kv-cache-kvarn.cpp` / `.h` - target-context KVarN KV cache and group-range state serialization.
+- `ggml/src/ggml-cuda/kvarn.cu` / `.cuh` - CUDA KVarN store/materialize ops.
+- `ggml/src/ggml-cuda/fattn-kvarn.cuh` - KVarN FlashAttention kernels.
+- `ggml/src/ggml-vulkan/vulkan-shaders/kvarn_store.comp` / `kvarn_materialize.comp` - Vulkan KVarN store/materialize shaders.
 - `ggml/src/ggml-cuda/cross-ring-interleave.cu` - GPU cross-ring management and interleave kernel.
 - `ggml/src/ggml-cuda/gated_delta_net.cu` - DeltaNet CUDA kernels.
 - `ggml/src/ggml-cuda/ssm-conv.cu` - SSM convolution CUDA kernels.
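
As a rough illustration of how the cache types listed in the updated overview bullet might be passed to `build/bin/llama-server`, here is a minimal Python sketch that builds an argument list. The type names are copied from the diff. `build_server_command` and the model path are hypothetical and not part of the repository. It also assumes the TurboQuant and low-bit types are selected through the same `--cache-type-k`/`--cache-type-v` flags that the KVarN bullet names; the diff does not confirm this. The KVarN pseudo names are left out because the diff does not spell them out individually.

```python
import shlex

# Cache type names copied from the updated overview bullet in this commit.
TURBO_TYPES = ("turbo2", "turbo3", "turbo4", "turbo2_tcq", "turbo3_tcq", "turbo4_tcq")
LOW_BIT_TYPES = ("q2_0", "q2_1", "q3_0", "q3_1", "q6_1")
DOCUMENTED_TYPES = frozenset(TURBO_TYPES + LOW_BIT_TYPES)


def build_server_command(model_path, type_k, type_v, binary="build/bin/llama-server"):
    """Return an argv list for llama-server (hypothetical helper, not in the repo).

    Assumption: the listed cache types are chosen with --cache-type-k and
    --cache-type-v, the flags the KVarN bullet mentions.
    """
    for name in (type_k, type_v):
        if name not in DOCUMENTED_TYPES:
            raise ValueError(f"cache type not listed in this commit: {name!r}")
    return [binary, "-m", model_path,
            "--cache-type-k", type_k,
            "--cache-type-v", type_v]


if __name__ == "__main__":
    # The model path is a placeholder; nothing is executed, the command is only printed.
    argv = build_server_command("models/example.gguf", "turbo4_tcq", "turbo4_tcq")
    print(shlex.join(argv))
```

The helper only assembles and validates names; whether a given backend or model accepts a particular type is decided by the local codebase, which the docs in this commit treat as the source of truth.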
