docs: sync CLAUDE.md/AGENTS.md with v0.3.2 cache types and KVarN
Add turbo4_tcq and the new low-bit standard quant KV types
(q2_0/q2_1/q3_0/q3_1/q6_1) to the cache-types overview, add a KVarN
feature bullet, and list the KVarN source files in the fork-specific
files section. These were missing after the v0.3.2 KVarN and quant
additions.
AGENTS.md: 7 additions & 1 deletion
@@ -10,7 +10,8 @@ BeeLlama.cpp is Anbeeld's fork of llama.cpp. Fork-specific work is concentrated
- **Adaptive draft-max**: server controllers that adjust the active DFlash draft horizon. The default controller is `profit`; `fringe` is also available.
- **DDTree**: tree speculative verification with GPU `parent_ids` and recurrent tree kernels.
- **CopySpec**: model-free speculation through rolling-hash suffix matching.
- **TurboQuant / TCQ KV cache types**: `turbo2`, `turbo3`, `turbo4`, `turbo2_tcq`, `turbo3_tcq`, and `turbo4_tcq`, plus newly added low-bit standard quantized KV types `q2_0`/`q2_1`/`q3_0`/`q3_1`/`q6_1`.
- **KVarN KV-cache compression** (experimental): structured KV compression via `--cache-type-k`/`--cache-type-v` pseudo names `kvarn2`…`kvarn8` (all nine 2/3/4/5/6/8 K/V bit combinations); target-context only, wired for Qwen3.6 and Gemma 4, with non-KVarN layers falling back to bit-width-matched standard cache types.
- **Reasoning loop guard**: server-side detection and intervention for repeated hidden reasoning output.

Treat the local codebase as the source of truth for implementation behavior.
CLAUDE.md: 7 additions & 1 deletion
@@ -7,7 +7,8 @@ This file gives code assistants local context for this repository.
BeeLlama.cpp is Anbeeld's fork of llama.cpp. Fork-specific work is concentrated around:

- **DFlash**: cross-attention speculative decoding with DFlash draft GGUFs, target hidden-state capture, CPU/GPU ring buffers, and server verification paths.
- **TurboQuant / TCQ KV cache types**: `turbo2`, `turbo3`, `turbo4`, `turbo2_tcq`, `turbo3_tcq`, and `turbo4_tcq`, plus newly added low-bit standard quantized KV types `q2_0`/`q2_1`/`q3_0`/`q3_1`/`q6_1`.
- **KVarN KV-cache compression** (experimental): structured KV compression via `--cache-type-k`/`--cache-type-v` pseudo names `kvarn2`…`kvarn8` (all nine 2/3/4/5/6/8 K/V bit combinations); target-context only, wired for Qwen3.6 and Gemma 4, with non-KVarN layers falling back to bit-width-matched standard cache types.
- **Adaptive draft-max**: server controllers that adjust the active DFlash draft horizon. The default controller is `profit`; `fringe` is also available.
- **DDTree**: tree speculative verification with GPU `parent_ids` and recurrent tree kernels.
- **CopySpec**: model-free speculation through rolling-hash suffix matching.
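
The CopySpec bullet describes model-free speculation through rolling-hash suffix matching. The following is only a generic sketch of that idea, not the fork's implementation: the function name `propose_draft`, the parameters `suffix_len` and `max_draft`, and the hash constants are all assumptions made for illustration. It hashes every window of `suffix_len` tokens with a polynomial rolling hash, looks for the most recent earlier window equal to the current suffix, and proposes the tokens that followed it as a draft.

```python
from typing import Dict, List

# Hypothetical constants for a polynomial rolling hash (illustrative only).
BASE = 1_000_003
MOD = (1 << 61) - 1


def propose_draft(tokens: List[int], suffix_len: int = 3, max_draft: int = 4) -> List[int]:
    """Propose draft tokens by matching the current suffix against earlier context."""
    n = len(tokens)
    if n <= suffix_len:
        return []

    # BASE^(suffix_len - 1): weight of the oldest token in a full window.
    top = pow(BASE, suffix_len - 1, MOD)
    # Maps a window hash to the exclusive end index of its latest earlier occurrence.
    last_seen: Dict[int, int] = {}

    h = 0
    for i, tok in enumerate(tokens):
        if i >= suffix_len:
            # Slide the window: drop the oldest token's contribution.
            h = (h - (tokens[i - suffix_len] + 1) * top) % MOD
        h = (h * BASE + tok + 1) % MOD
        end = i + 1
        if end < suffix_len:
            continue  # window not full yet
        if end == n:
            break  # h now holds the hash of the final suffix
        last_seen[h] = end

    start = last_seen.get(h)
    if start is None:
        return []
    # Compare the actual tokens to rule out a hash collision.
    if tokens[start - suffix_len:start] != tokens[n - suffix_len:]:
        return []
    return tokens[start:start + max_draft]
```

In a speculative decoding loop, a draft produced this way would still need to be verified by the target model; the sketch only covers the matching step.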