FluffyAIcode
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 32 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 19 additions & 6 deletions
diff --git a/benchmarks/bitpack_vs_tq/gemma4_hetero_check.py b/benchmarks/bitpack_vs_tq/gemma4_hetero_check.py
Lines changed: 98 additions & 0 deletions
diff --git a/benchmarks/bitpack_vs_tq/verify_packed_e2e.py b/benchmarks/bitpack_vs_tq/verify_packed_e2e.py
Lines changed: 25 additions & 8 deletions
diff --git a/kakeyalattice/pyproject.toml b/kakeyalattice/pyproject.toml
Lines changed: 1 addition & 1 deletion
diff --git a/kakeyalattice/python/kakeyalattice/hf/cache.py b/kakeyalattice/python/kakeyalattice/hf/cache.py
Lines changed: 43 additions & 25 deletions
@@ -1,5 +1,37 @@
 # Changelog
 
+## v1.6.1 — 2026-06-15
+
+**Drop-in support for heterogeneous per-layer head_dim (Gemma-4) + bit-packing
+adopted as the unified compression-ratio standard.**
+
+### Fixed
+- **Per-layer head_dim.** Models whose layers expose different K/V head dims now
+  work drop-in. Gemma-4-26B mixes `sliding_attention` (head_dim=256) and
+  `full_attention` (head_dim=512) layers, which raised
+  `AssertionError: expected last dim 256, got 512`. Each layer's codec is now
+  built lazily from the head_dim actually observed at that layer
+  (`KakeyaLatticeQuantizedCache`, `KakeyaLatticeCache`, `TurboQuantPackedCache`).
+- **Attention-mask sizes.** The int-storage caches keep their compressed state
+  outside `self.layers`, so transformers-5's `DynamicCache.get_mask_sizes` fell
+  through to `(query_length, 0)` and corrupted Gemma-4's sliding-window /
+  multimodal blockwise mask during multi-step decode (CUDA device-side assert).
+  `get_mask_sizes` is now overridden to report the true cache length.
+- Verified on H200: **Gemma-4-26B generates end-to-end** with
+  `KakeyaLatticePackedCache` (E8 Q=38), real CR **2.44×**, pack→unpack lossless;
+  per-layer codecs 256 (sliding) / 512 (full). Qwen3-4B regression results unchanged.
+
+### Changed
+- **Bit-packing + iso-quality is now the unified comparison standard.** All
+  codec-vs-codec comparisons (KakeyaLattice and the TurboQuant baseline) use the
+  bit-packed caches (`KakeyaLatticePackedCache`, `TurboQuantPackedCache`) **and**
+  match quality (each codec taken at the operating point meeting a fixed |Δppl|
+  threshold, then real bytes compared). Raw CR at unmatched bit budgets is never
+  used to rank codecs. Iso-ppl result on Qwen3-4B (|Δppl| ≤ 2 %): **E8 +7.7 %,
+  D4 +5.0 %** real-byte advantage over TurboQuant. The int8
+  `KakeyaLatticeQuantizedCache` (1.94×) remains as the simpler, dependency-free
+  storage option. README and reports updated accordingly.
+
 ## v1.6.0 — 2026-06-15
 
 **fix codec.roundtrip bug — contiguous, directly-SDPA-feedable K/V decode.**
 
@@ -3,20 +3,33 @@
 > **A D4 / E8 nested-lattice codec that realises a discrete *Kakeya
 > cover* over the direction sphere of transformer KV activations.**
 >
-> Two `transformers.DynamicCache` subclasses ship together:
+> Three `transformers.DynamicCache` subclasses ship together:
 >
+> - **`KakeyaLatticePackedCache`** — stores **bit-packed lattice codes**.
+>   **Real ~2.4× HBM compression** (D4 Q=38 ≈ 2.46×, E8 Q=38 ≈ 2.37×;
+>   measured end-to-end on Qwen3-4B / H200). **This is the unified
+>   comparison standard** for all reported compression ratios as of v1.6.
 > - **`KakeyaLatticeQuantizedCache`** — stores **int8 lattice indices**.
->   **Real ~1.94× HBM compression** (measured at the tensor-byte
->   level; see [`reports/v1_5_release/hbm_savings/REAL_HBM_PROOF.md`](reports/v1_5_release/hbm_savings/REAL_HBM_PROOF.md)).
+>   Simpler, dependency-free storage; **real ~1.94× HBM compression**
+>   (each ~6.3-bit code occupies a full int8). Reconstruction is bit-identical
+>   to the packed cache; use it when you prefer the simplest storage type.
 > - **`KakeyaLatticeCache`** — stores reconstructed bf16. **Zero HBM
 >   savings**; use as a reconstruction-quality probe.
 >
 > At the **codec bit-rate level** (a Q=38 lattice vector needs ~6.3
 > bits per coordinate, vs 16 bits for bf16), the achievable ceiling
 > is **2.4×–2.8× compression at <1 % perplexity loss** on Qwen3,
-> Llama-3, DeepSeek, GLM-4, and Gemma. The current int8
-> implementation hits **1.94×** of that ceiling; the gap to 2.4× is
-> bit-packed int storage, the v1.6 work item.
+> Llama-3, DeepSeek, GLM-4, and Gemma. **`KakeyaLatticePackedCache`
+> realises that ceiling as real bytes** (v1.6); the int8 cache trades
+> ~25 % of it for a plain storage type. **All compression-ratio
+> comparisons in this repo use (1) the bit-packed caches** (both for
+> KakeyaLattice and the TurboQuant baseline) **and (2) iso-quality
+> matching** — each codec is taken at the operating point meeting a fixed
+> |Δppl| threshold, then real bytes are compared. (Raw CR at unmatched bit
+> budgets is never used to rank codecs — a lower-bit point trivially shows a
+> higher CR at worse quality.) Iso-ppl result on Qwen3-4B (|Δppl| ≤ 2 %):
+> **E8 +7.7 %, D4 +5.0 %** real-byte advantage over TurboQuant; see
+> [`reports/v1_5_release/bitpack_vs_tq_2026-06-15/`](reports/v1_5_release/bitpack_vs_tq_2026-06-15/).
 >
 > `pip install kakeyalattice`.
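
The iso-quality rule described in the README text can be sketched as follows. This is a minimal illustration under stated assumptions, not the repo's `compare_real_cr.py`: the `OperatingPoint` type, both helper functions, the codec names, and every number in `sweep` are hypothetical placeholders rather than measured results. The advantage is computed here as a ratio of real bytes, which is one plausible reading of "real-byte advantage".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    codec: str           # placeholder codec name
    setting: str         # placeholder label for the bit budget
    abs_dppl_pct: float  # |delta ppl| in percent versus the bf16 baseline
    kv_bytes: int        # real packed KV bytes at this setting


def best_point_at_threshold(points, max_abs_dppl_pct):
    """Smallest-footprint point that meets the quality threshold, or None."""
    ok = [p for p in points if p.abs_dppl_pct <= max_abs_dppl_pct]
    return min(ok, key=lambda p: p.kv_bytes) if ok else None


def real_byte_advantage_pct(ours, baseline):
    """Percent by which the baseline needs more bytes at matched quality."""
    return (baseline.kv_bytes / ours.kv_bytes - 1.0) * 100.0


# Placeholder sweep: two codecs, two settings each (NOT measured data).
sweep = [
    OperatingPoint("codec_a", "fine", 0.5, 1000),
    OperatingPoint("codec_a", "coarse", 3.0, 700),
    OperatingPoint("codec_b", "fine", 1.5, 1100),
    OperatingPoint("codec_b", "coarse", 6.0, 500),
]

threshold = 2.0  # percent
a = best_point_at_threshold([p for p in sweep if p.codec == "codec_a"], threshold)
b = best_point_at_threshold([p for p in sweep if p.codec == "codec_b"], threshold)
print(a.setting, b.setting, f"{real_byte_advantage_pct(a, b):.1f}%")

# Ranking by raw bytes alone would pick the lowest-quality point instead.
raw_winner = min(sweep, key=lambda p: p.kv_bytes)
print(raw_winner.codec, raw_winner.setting, raw_winner.abs_dppl_pct)
```

In this toy sweep the unmatched ranking picks `codec_b` at its coarse setting, which misses the threshold, while the matched comparison favours `codec_a`. That reversal is the reason raw CR at unmatched bit budgets is not used for ranking.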
 
 
@@ -0,0 +1,98 @@
+"""Gemma-4-26B heterogeneous-head_dim check / repro / fix verification.
+
+Gemma-4 uses head_dim=256 (sliding_attention layers) and global_head_dim=512
+(full_attention layers). This script:
+  1. loads the model (text-only generate),
+  2. inspects the per-layer K head_dim from a bf16 DynamicCache,
+  3. tries KakeyaLatticePackedCache (E8 Q=38) and reports success/CR/coherence
+     or the assertion (pre-fix repro).
+"""
+from __future__ import annotations
+import argparse, json, os, traceback
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache
+from kakeyalattice.hf import KakeyaLatticePackedCache
+
+
+def layer_kv_dims(cache):
+    dims = []
+    if hasattr(cache, "layers"):
+        for layer in cache.layers:
+            k = getattr(layer, "keys", None)
+            dims.append(None if k is None else int(k.shape[-1]))
+    return dims
+
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--model", default="google/gemma-4-26B-A4B-it")
+    ap.add_argument("--max-new", type=int, default=24)
+    ap.add_argument("--out", default="/root/kakeyalattice-test/reports/v1_5_release/gemma4_hetero_headdim_2026-06-15/gemma4_check.json")
+    args = ap.parse_args()
+    dev = "cuda"
+    tok = AutoTokenizer.from_pretrained(args.model)
+    model = AutoModelForCausalLM.from_pretrained(args.model, dtype=torch.bfloat16, device_map=dev).eval()
+    cfg = model.config
+    tcfg = getattr(cfg, "text_config", cfg)
+    L = tcfg.num_hidden_layers
+    hd = getattr(tcfg, "head_dim", None)
+    ghd = getattr(tcfg, "global_head_dim", None)
+    print(f"[cfg] layers={L} head_dim={hd} global_head_dim={ghd}", flush=True)
+    print(f"[cfg] layer_types={getattr(tcfg,'layer_types',None)}", flush=True)
+
+    msgs = [{"role": "user", "content": "In one sentence, what is lattice quantization?"}]
+    enc = tok.apply_chat_template(msgs, add_generation_prompt=True, return_tensors="pt", return_dict=True)
+    ids = enc["input_ids"].to(dev)
+    in_len = ids.shape[1]
+    gen = dict(max_new_tokens=args.max_new, do_sample=False, use_cache=True)
+
+    report = {"model": args.model, "layers": L, "head_dim": hd, "global_head_dim": ghd}
+
+    # 1) bf16 baseline + per-layer K dims
+    cacheA = DynamicCache()
+    with torch.inference_mode():
+        outA = model.generate(ids, past_key_values=cacheA, **gen)
+    dimsA = layer_kv_dims(cacheA)
+    base_bytes = sum(
+        (layer.keys.element_size()*layer.keys.numel() + layer.values.element_size()*layer.values.numel())
+        for layer in cacheA.layers if getattr(layer, "keys", None) is not None)
+    textA = tok.decode(outA[0][in_len:], skip_special_tokens=True)
+    print(f"[bf16] per-layer K head_dim = {dimsA}", flush=True)
+    print(f"[bf16] distinct dims = {sorted(set(d for d in dimsA if d))}", flush=True)
+    print(f"[bf16] text: {textA[:160]}", flush=True)
+    report["per_layer_kv_dim"] = dimsA
+    report["distinct_dims"] = sorted(set(d for d in dimsA if d))
+    report["bf16_text"] = textA
+    report["bf16_kv_bytes"] = base_bytes
+    seqA = int(outA.shape[1]); del cacheA, outA; torch.cuda.empty_cache()
+
+    # 2) packed cache (E8 Q=38)
+    try:
+        cacheB = KakeyaLatticePackedCache(variant="e8", q_range=38,
+                                          num_hidden_layers=L, head_dim=hd or 256, device=dev)
+        with torch.inference_mode():
+            outB = model.generate(ids, past_key_values=cacheB, **gen)
+        textB = tok.decode(outB[0][in_len:], skip_special_tokens=True)
+        kb = cacheB.kv_storage_bytes()
+        cr = base_bytes / kb if kb else None
+        codec_dims = {li: (c.D_shape if c is not None else None)
+                      for li, c in enumerate(cacheB._codecs)}
+        print(f"[packed] OK seq={int(outB.shape[1])} kv={kb/2**20:.2f}MiB realCR={'n/a' if cr is None else f'{cr:.3f}x'} lossless={cacheB.packed_pack_unpack_ok()}", flush=True)
+        print(f"[packed] per-layer codec D_shape = {codec_dims}", flush=True)
+        print(f"[packed] text: {textB[:160]}", flush=True)
+        report.update({"packed_ok": True, "packed_kv_bytes": kb, "packed_real_cr": cr,
+                       "packed_text": textB, "codec_dims": codec_dims,
+                       "lossless": cacheB.packed_pack_unpack_ok()})
+    except Exception as e:
+        print(f"[packed] FAILED: {type(e).__name__}: {e}", flush=True)
+        traceback.print_exc()
+        report.update({"packed_ok": False, "error": f"{type(e).__name__}: {e}"})
+
+    os.makedirs(os.path.dirname(args.out), exist_ok=True)
+    with open(args.out, "w") as f:
+        json.dump(report, f, indent=2, default=str)
+    print(f"[out] {args.out}", flush=True)
+
+
+if __name__ == "__main__":
+    main()
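
The behaviour this script verifies, one codec per layer built from the head_dim actually observed, can be pictured with a small torch-free sketch. It is an illustration only: `LazyPerLayerCodecs` is a hypothetical class, the codec is a plain dict stand-in, and `block_dim=8` is an assumed block size for the E8 variant rather than a value taken from the library.

```python
class LazyPerLayerCodecs:
    """Hypothetical stand-in for a cache's lazily built per-layer codec table."""

    def __init__(self, num_layers, block_dim, make_codec, strict=True):
        self._codecs = [None] * num_layers  # filled on first use per layer
        self._raw_layers = set()            # layers kept uncompressed
        self._block_dim = block_dim
        self._make_codec = make_codec       # callable: head_dim -> codec
        self._strict = strict

    def get(self, layer_idx, observed_dim):
        """Return the layer's codec, building it from the observed head_dim."""
        if layer_idx in self._raw_layers:
            return None
        codec = self._codecs[layer_idx]
        if codec is not None:
            if codec["D"] != observed_dim:
                raise ValueError(
                    f"layer {layer_idx} head_dim changed "
                    f"{codec['D']} -> {observed_dim}"
                )
            return codec
        # Compatible dims are powers of two that the block size divides.
        is_pow2 = observed_dim > 0 and (observed_dim & (observed_dim - 1)) == 0
        if observed_dim % self._block_dim != 0 or not is_pow2:
            if self._strict:
                raise ValueError(f"head_dim={observed_dim} is incompatible")
            self._raw_layers.add(layer_idx)
            return None
        codec = self._make_codec(observed_dim)
        self._codecs[layer_idx] = codec
        return codec


# Usage: four layers alternating sliding (256) and full (512) head dims.
table = LazyPerLayerCodecs(num_layers=4, block_dim=8,
                           make_codec=lambda d: {"D": d})
observed = [256, 512, 256, 512]
built = [table.get(i, d)["D"] for i, d in enumerate(observed)]
print(built)
```

Because nothing is built until a layer is first seen, a single declared `head_dim` no longer has to fit every layer. A layer whose dimension later changes is still rejected.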
@@ -1,11 +1,21 @@
-"""End-to-end REAL bit-packed storage on a live model.
+"""End-to-end REAL bit-packed storage on a live model (per-operating-point CR).
 
-Generates with the bit-packed caches on Qwen3-4B and reports the real packed
-KV footprint vs bf16 DynamicCache:
-  * KakeyaLatticePackedCache (D4 Q=38)  -> ~2.46x
-  * KakeyaLatticePackedCache (E8 Q=38)  -> ~2.42x
-  * TurboQuantPackedCache    (b=4)      -> ~3.76x  (lower quality; see iso-ppl)
-Also verifies the pack->unpack cycle is lossless (so quality == unpacked cache).
+Generates with the bit-packed caches on Qwen3-4B and reports the real packed KV
+footprint vs bf16 DynamicCache, and verifies pack->unpack is lossless.
+
+!!! NOT A FAIR HEAD-TO-HEAD !!!
+The points below (D4/E8 @ Q=38, TurboQuant @ b=4) are at DIFFERENT bit budgets /
+quality, so their raw CRs are NOT comparable: TurboQuant b=4 shows a higher CR
+ONLY because it is a much more aggressive, much lower-quality point
+(|Δppl| ~4.8% vs ~0.2% for KakeyaLattice Q=38). Comparing CR across unmatched
+quality is meaningless.
+
+>>> The canonical KakeyaLattice-vs-TurboQuant comparison is ISO-QUALITY (matched
+    |Δppl|) and lives in `compare_real_cr.py`. At |Δppl| <= 2% on Qwen3-4B the
+    real-byte winners are E8 +7.7% / D4 +5.0% over TurboQuant. <<<
+
+This script is only a sanity check that each packed cache works end-to-end and
+hits its expected real CR at its own operating point.
 """
 from __future__ import annotations
 import argparse, json, time
@@ -80,6 +90,8 @@ def run(make_cache, name):
 
     print(f"model={args.model} layers={L} head_dim={hd} seq={seqA}")
     print(f"bf16 DynamicCache KV bytes = {base_bytes:,} ({base_bytes/2**20:.2f} MiB)")
+    print("NOTE: per-operating-point raw CR — NOT quality-matched. Do not rank "
+          "codecs by these numbers. Iso-quality comparison: compare_real_cr.py.")
     print(f"{'cache':<26} {'KV MiB':>9} {'real CR':>9} {'lossless':>9} {'time(s)':>8}")
     rows = []
     for r in runs:
@@ -96,7 +108,12 @@ def run(make_cache, name):
     with open(args.out, "w") as f:
         json.dump({"model": args.model, "gpu": torch.cuda.get_device_name(0),
                    "head_dim": hd, "layers": L, "seq": seqA,
-                   "bf16_kv_bytes": base_bytes, "runs": rows}, f, indent=2)
+                   "bf16_kv_bytes": base_bytes, "runs": rows,
+                   "note": ("per-operating-point raw CR, NOT quality-matched; "
+                            "ranking codecs by these is meaningless. Iso-quality "
+                            "(matched |Dppl|) comparison is in compare_real_cr.py."),
+                   "iso_quality_comparison": "benchmarks/bitpack_vs_tq/compare_real_cr.py"},
+                  f, indent=2)
     print(f"[out] {args.out}")
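
A short arithmetic sketch shows why the raw CRs printed by this script differ across operating points: the bit budget alone caps the ratio. The `ideal_cr` helper is hypothetical. It assumes bf16 at 16 bits per coordinate, ~6.3 bits for a Q=38 lattice vector (the README figure), and that TurboQuant `b=4` means 4 bits per coordinate. It ignores any per-vector side information, which is why measured CRs sit below these bounds.

```python
BF16_BITS = 16.0


def ideal_cr(bits_per_coord: float) -> float:
    """Upper bound on compression ratio from the bit budget alone."""
    return BF16_BITS / bits_per_coord


budgets = [
    ("lattice Q=38 (~6.3 bits)", 6.3),
    ("int8 storage", 8.0),
    ("TurboQuant b=4 (assumed 4 bits)", 4.0),
]
for name, bits in budgets:
    print(f"{name:<34} ideal CR <= {ideal_cr(bits):.2f}x")
```

The bounds come out at about 2.54×, 2.00× and 4.00×, against the reported real values of roughly 2.46×, 1.94× and 3.76×. The 4-bit point has a higher bound simply because it spends fewer bits, so it says nothing about quality at that budget.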
 
 
 
@@ -8,7 +8,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "kakeyalattice"
-version = "1.6.0"
+version = "1.6.1"
 description = "Nested-lattice KV-cache compression for LLM inference: Zamir-Feder D4 and E8 variants with shaping gain over scalar quantisation."
 readme = "README.md"
 # NOTE: we intentionally declare the license only via classifier
 
@@ -173,36 +173,58 @@ def __init__(
                 warnings.warn(msg, UserWarning, stacklevel=2)
                 logger.warning(msg)
 
-        # One codec instance per layer. Codec has no cross-layer state
-        # but per-layer instantiation allows future per-layer Q sweeps
-        # without re-architecting.
-        self._codecs: list[Any | None] = []
-        if self._supports_lattice:
-            self._init_codecs()
+        # Per-layer codecs are built LAZILY on first update(), keyed by the
+        # head_dim actually observed at each layer — so models with
+        # heterogeneous per-layer head_dim (e.g. Gemma-4 sliding=256 / full=512)
+        # work drop-in. ``head_dim`` is the declared default for back-compat.
+        self._codecs: list[Any | None] = [None] * self.num_hidden_layers
+        self._raw_layers: set[int] = set()
 
         # Fire counters for sanity / audit.
         self.codec_fired_per_layer: dict[int, int] = {}
         self.skip_fired_per_layer: dict[int, int] = {}
 
     # ----- codec management -----
 
-    def _init_codecs(self) -> None:
+    def _codec_cls(self):
         if self.variant == "d4":
             from kakeyalattice import V14KakeyaZamirLatticeGPU as CodecCls
         else:
             from kakeyalattice import V15KakeyaZamirE8GPU as CodecCls
+        return CodecCls
 
-        self._codecs = []
-        for layer_idx in range(self.num_hidden_layers):
-            if self._is_boundary_layer(layer_idx):
-                self._codecs.append(None)
-            else:
-                codec = CodecCls(
-                    D=self.head_dim,
-                    q_range=self.q_range,
-                    device=str(self.device),
+    def _get_codec(self, layer_idx: int, observed_dim: int):
+        """Lazily build/return the per-layer codec from the observed head_dim.
+        Returns None for raw bf16 (boundary / incompatible-with-strict=False)."""
+        if self._is_boundary_layer(layer_idx) or layer_idx in self._raw_layers:
+            return None
+        codec = self._codecs[layer_idx]
+        if codec is not None:
+            if codec.D_shape != observed_dim:
+                raise ValueError(
+                    f"layer {layer_idx} head_dim changed "
+                    f"{codec.D_shape} -> {observed_dim} between updates"
                 )
-                self._codecs.append(codec)
+            return codec
+        bd = self._block_dim
+        is_pow2 = observed_dim > 0 and (observed_dim & (observed_dim - 1)) == 0
+        if (observed_dim % bd != 0) or not is_pow2:
+            msg = (
+                f"KakeyaLatticeCache(variant={self.variant!r}): layer "
+                f"{layer_idx} head_dim={observed_dim} is incompatible "
+                f"(need a power of 2 divisible by {bd})."
+            )
+            if self.strict:
+                raise ValueError(msg + " Pass strict=False to keep raw bf16.")
+            warnings.warn(msg + " strict=False: layer kept as raw bf16.",
+                          UserWarning, stacklevel=2)
+            self._raw_layers.add(layer_idx)
+            return None
+        codec = self._codec_cls()(
+            D=observed_dim, q_range=self.q_range, device=str(self.device),
+        )
+        self._codecs[layer_idx] = codec
+        return codec
 
     def _is_boundary_layer(self, layer_idx: int) -> bool:
         if self.boundary <= 0:
@@ -239,21 +261,17 @@ def update(
         """Roundtrip K and V through the per-layer codec, then delegate
         to ``DynamicCache.update`` to concat with existing cache state.
         """
-        # Fast path: codec disabled (strict=False on incompatible model,
-        # or boundary layer).
-        if (
-            not self._supports_lattice
-            or layer_idx >= len(self._codecs)
-            or self._codecs[layer_idx] is None
-        ):
+        # Lazily resolve the per-layer codec from the observed head_dim so
+        # heterogeneous-head_dim models work drop-in. None => raw bf16.
+        codec = self._get_codec(layer_idx, key_states.shape[-1])
+        if codec is None:
             self.skip_fired_per_layer[layer_idx] = (
                 self.skip_fired_per_layer.get(layer_idx, 0) + 1
             )
             return super().update(
                 key_states, value_states, layer_idx, *args, **kwargs
             )
 
-        codec = self._codecs[layer_idx]
         k_rt = self._roundtrip(key_states, codec)
         v_rt = self._roundtrip(value_states, codec)
         self.codec_fired_per_layer[layer_idx] = (