D1.1 CodecKernelCache — scaffold-before-codegen (80/80 tests, 9 new)

claude · claude · commit 58d7b2c6a362 · 2026-04-20T23:10:00.000Z
First Phase 1 deliverable from codec-sweep-via-lab-infra-v1. Ships the
structural cache layer NOW; Cranelift IR emission (D1.1b) defers.

Design: generic over kernel handle type.
CodecKernelCache<H: Clone> hosts the signature → kernel map with
concurrent read-many / single-writer semantics via RwLock. The same cache
hosts StubKernel (tests) AND KernelHandle (production).

This separates TWO concerns that are usually tangled:
- Cache semantics: signature-keyed insertion, double-checked locking
  under concurrent miss, counters for hit-ratio measurement. Testable in
  microseconds without a JIT engine.
- IR emission: Cranelift / jitson code generation. Heavy, defers.

Public API:
  CodecKernelCache<H> {
    new(), default(),
    get_or_compile(&self, &CodecParams, FnOnce() -> H) -> H,
    try_get_or_compile(&self, &CodecParams,
        FnOnce() -> Result<H, CodecParamsError>)
        -> Result<H, CodecParamsError>,
    len() / is_empty() / compile_count() / hit_count() / hit_ratio(),
    has_signature(u64) -> bool,
    clear(),
  }
  StubKernel { signature, is_matmul_heavy, backend }: deterministic fake
  for testing; captures what the kernel WOULD be (including tier
  selection) without compiling.
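To illustrate the generic-over-handle design described above, here is a minimal single-threaded sketch. It is not the shipped type: `MiniCache`, `Stub`, `FnHandle`, and `double` are hypothetical names, the key is passed as a raw `u64` instead of a `CodecParams`, and `&mut self` replaces the real cache's interior mutability to keep the example short.

```rust
use std::collections::HashMap;

/// Illustrative miniature of a handle-generic cache (assumption: not the real API).
struct MiniCache<H: Clone> {
    map: HashMap<u64, H>,
    compiles: u64,
    hits: u64,
}

impl<H: Clone> MiniCache<H> {
    fn new() -> Self {
        Self { map: HashMap::new(), compiles: 0, hits: 0 }
    }

    /// Return the cached handle for `sig`, running `compile` only on a miss.
    fn get_or_compile(&mut self, sig: u64, compile: impl FnOnce() -> H) -> H {
        if let Some(h) = self.map.get(&sig).cloned() {
            self.hits += 1;
            return h;
        }
        let h = compile();
        self.map.insert(sig, h.clone());
        self.compiles += 1;
        h
    }
}

/// Fake handle for tests: records which signature it stands for.
#[derive(Clone, Debug, PartialEq)]
struct Stub {
    signature: u64,
}

/// Stand-in for a compiled kernel: a plain function pointer.
type FnHandle = fn(f32) -> f32;

fn double(x: f32) -> f32 {
    x * 2.0
}

/// Same cache code, stub handle: the second call must be a hit.
fn stub_demo() -> (Stub, u64, u64) {
    let mut c = MiniCache::new();
    let first = c.get_or_compile(7, || Stub { signature: 7 });
    let second = c.get_or_compile(7, || panic!("must not recompile on a hit"));
    assert_eq!(first, second);
    (second, c.compiles, c.hits)
}

/// Same cache code, callable handle: only the type parameter changes.
fn fn_pointer_demo() -> f32 {
    let mut c: MiniCache<FnHandle> = MiniCache::new();
    let kernel = c.get_or_compile(7, || double as FnHandle);
    kernel(21.0)
}

fn main() {
    let (stub, compiles, hits) = stub_demo();
    println!("{stub:?}: {compiles} compile, {hits} hit");
    println!("fn-pointer kernel returned {}", fn_pointer_demo());
}
```

The cache logic never inspects `H`, which is why the stub-backed tests can exercise the same code path a real handle would use.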
Rule compliance:
- Rule A/B/C/D: n/a at the cache layer (defers to IR emission)
- Rule E: kernel_signature IS the key. The CodecParams method returns a
  stable hash; the cache is keyed by it directly
- Rule F: no serialisation anywhere in the cache

Concurrency:
- fast path: RwLock read, clone on hit, increment hit_count
- slow path: RwLock write, double-check (for concurrent miss), run
  compile closure, insert, clone, increment compile_count
- prevents duplicate compilation under concurrent load
- hit_count + compile_count sit behind their own RwLocks, so reading a
  counter never takes the cache lock; increments still run while the
  cache guard is held

Tests (9 new, all under --features serve):
- cache_starts_empty
- first_call_compiles_second_is_cache_hit
  (cached closure must NOT re-invoke on hit; enforced via panic)
- different_params_produce_different_kernels
- seed_changes_do_not_invalidate_cache
  (kernel_signature excludes seed: different sample, same IR)
- matmul_heavy_params_select_amx_backend_in_stub
  (OPQ+BF16x32 → backend="amx"; identity+F32x16 → backend="avx512")
- clear_resets_cache_and_counters
- try_get_or_compile_propagates_errors
  (failed compile does NOT populate cache)
- has_signature_checks_without_compiling
- sweep_grid_warms_cache_deterministically
  (5 candidates, 4 unique signatures, seed collision proven by counter)

Board hygiene (CLAUDE.md Mandatory rule):
  STATUS_BOARD.md: D1.1 Queued → In PR (scaffold)
                   D1.1b added as new row: Queued (Cranelift IR emission
                   follow-up)
  EPIPHANIES.md PREPEND: "D1.1 scaffold-before-codegen": cache semantics
    testable without Cranelift. Generic-over-handle-type is the wedge
    that separates the hard-to-change contract (cache) from the
    hard-to-build implementation (IR emission). Generalises: any JIT
    pipeline should split at this seam.

https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
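The fast-path / slow-path behaviour listed under Concurrency can be sketched in isolation. This is a hedged miniature, not the shipped code: `MiniCache` and `race` are hypothetical names, the key is a raw `u64`, and the hit/compile counters are replaced by an `AtomicUsize` owned by the caller.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, RwLock};
use std::thread;

/// Illustrative miniature of the double-checked cache (assumption: not the real type).
struct MiniCache<H: Clone> {
    map: RwLock<HashMap<u64, H>>,
}

impl<H: Clone> MiniCache<H> {
    fn new() -> Self {
        Self { map: RwLock::new(HashMap::new()) }
    }

    fn get_or_compile<F: FnOnce() -> H>(&self, sig: u64, compile: F) -> H {
        // Fast path: shared read lock. The inner block drops the guard
        // before the write lock below is requested.
        {
            let r = self.map.read().unwrap();
            if let Some(h) = r.get(&sig) {
                return h.clone();
            }
        }
        // Slow path: exclusive lock, then re-check, because another
        // thread may have inserted between the two lock acquisitions.
        let mut w = self.map.write().unwrap();
        if let Some(h) = w.get(&sig) {
            return h.clone();
        }
        let h = compile();
        w.insert(sig, h.clone());
        h
    }
}

/// Race `threads` workers on one signature; return how often `compile` ran.
fn race(threads: usize) -> usize {
    let cache = Arc::new(MiniCache::<u64>::new());
    let compiles = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let cache = Arc::clone(&cache);
            let compiles = Arc::clone(&compiles);
            thread::spawn(move || {
                cache.get_or_compile(42, || {
                    compiles.fetch_add(1, Ordering::SeqCst);
                    4242
                })
            })
        })
        .collect();
    for h in handles {
        // Every worker sees the same handle value.
        assert_eq!(h.join().unwrap(), 4242);
    }
    compiles.load(Ordering::SeqCst)
}

fn main() {
    println!("compile closure ran {} time(s)", race(8));
}
```

Because the closure runs while the write lock is held, the re-check guarantees a single compile per signature. The cost is that misses on different signatures also serialize behind that lock.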
diff --git a/.claude/board/EPIPHANIES.md b/.claude/board/EPIPHANIES.md
@@ -65,6 +65,41 @@ stay as historical references.
 
 ## Entries (reverse chronological)
 
+## 2026-04-20 — D1.1 scaffold-before-codegen: cache semantics testable without Cranelift
+
+**Status:** FINDING
+
+`CodecKernelCache<H>` is generic over the kernel-handle type. The same
+cache hosts `StubKernel` (deterministic fake, no compilation) for tests
+AND `KernelHandle` (real Cranelift function pointer) for production.
+
+This separates TWO concerns that are usually tangled:
+
+1. **Cache semantics** — signature-keyed insertion, double-checked
+   locking under concurrent miss, counters for hit-ratio measurement.
+   Testable in microseconds without a JIT engine.
+2. **IR emission** — the actual Cranelift / jitson code generation
+   that takes `CodecParams` and produces a callable function pointer.
+   Heavy; takes minutes per build; requires ndarray's jitson surface
+   to be finalized.
+
+By shipping the cache layer with `StubKernel` NOW, Phase 1's cache
+semantics are verified + CI-gated before the Cranelift work starts.
+When D1.1b lands, the only change is `H = KernelHandle`; all 9 cache
+tests remain valid. This is the **scaffold-before-codegen** pattern:
+test the hard-to-change contract first, defer the hard-to-build
+implementation.
+
+Generalises: any JIT pipeline should separate cache-keying from IR
+emission at the type level. Generic over handle type is the wedge
+that makes this possible.
+
+Cross-ref: D1.1 `crates/cognitive-shader-driver/src/codec_kernel_cache.rs`;
+D0.3 sweep-grid-IS-cache-warmer epiphany (same signature-as-identity
+insight); PR #225 `CodecParams::kernel_signature()`.
+
+---
+
 ## 2026-04-20 — D0.3 sweep grid IS the JIT cache warmer
 
 **Status:** FINDING
diff --git a/.claude/board/STATUS_BOARD.md b/.claude/board/STATUS_BOARD.md
@@ -57,11 +57,12 @@ afterwards is a JIT kernel, not a rebuild. Plan path:
 | D0.6 | `CodecParamsBuilder` fluent API | **Shipped** | #225 — `contract::cam` +290 LOC of codec-params types, 14 tests (CODING_PRACTICES gap 3) |
 | D0.7 | Precision-ladder validation (OPQ↔BF16x32, Hadamard pow2, overfit guard) | **Shipped** | #225 — `CodecParamsError` at `.build()` BEFORE JIT compile |
 
-### Phase 1 — JIT codec kernels — Queued
+### Phase 1 — JIT codec kernels
 
 | D-id | Title | Status | PR / Evidence |
 |---|---|---|---|
-| D1.1 | `CodecKernelCache` via `JitCompiler` (Cranelift) | **Queued** | target ~180 LOC |
+| D1.1 | `CodecKernelCache` — structural cache layer (generic over handle) | **In PR** | branch — `CodecKernelCache<H>` + `StubKernel` + `get_or_compile` / `try_get_or_compile` with RwLock concurrent-safe double-check + compile/hit/ratio counters + 9 tests. Scaffold ships NOW; D1.1b Cranelift IR emission follows. |
+| D1.1b | Cranelift IR emission (plugs the real `KernelHandle` into the cache from D1.1) | **Queued** | target ~180 LOC once ndarray's jitson engine exposes the compile entry |
 | D1.2 | Rotation primitives: Identity / Hadamard / OPQ as JIT kernels | **Queued** | target ~190 LOC |
 | D1.3 | Residual PQ via JIT composition | **Queued** | target ~150 LOC |
 
diff --git a/crates/cognitive-shader-driver/src/codec_kernel_cache.rs b/crates/cognitive-shader-driver/src/codec_kernel_cache.rs
@@ -0,0 +1,339 @@
+//! **LAB-ONLY.** D1.1 — `CodecKernelCache`: JIT kernel cache keyed by
+//! `CodecParams::kernel_signature()`.
+//!
+//! The structural layer of Phase 1 — independent of the underlying
+//! Cranelift / jitson implementation. This module defines the cache
+//! semantics; D1.2 (rotation primitives), D1.3 (residual composition),
+//! and D1.1b (actual Cranelift IR emission) plug into it.
+//!
+//! The insight this module captures: **kernel signature and sweep grid
+//! axis are the same object viewed from two sides** (EPIPHANIES 2026-04-20
+//! "D0.3 sweep grid IS the JIT cache warmer"). Every unique
+//! `(subspaces, centroids, residual_depth, rotation_kind, distance,
+//! lane_width)` tuple maps to exactly one `kernel_signature()` — so the
+//! grid traversal order determines how fast the cache warms.
+//!
+//! ## Design — generic over handle type
+//!
+//! `CodecKernelCache<H>` is generic over `H: Clone` so this scaffold can
+//! host:
+//!
+//! - **Production:** `H = KernelHandle` from `lance-graph-contract::jit`
+//!   (raw function pointer to Cranelift-emitted code).
+//! - **Stub / testing:** `H = StubKernel` (deterministic fake — what the
+//!   kernel WOULD be, without compilation).
+//! - **Future variants:** e.g., a GPU-kernel handle when/if that lands.
+//!
+//! The cache itself doesn't know or care what a kernel IS — it only
+//! manages the `kernel_signature() → H` map with concurrent read-many /
+//! single-writer semantics. Per ndarray/.claude/rules/data-flow.md:
+//! "No `&mut self` during computation" — cache uses interior mutability.
+
+use lance_graph_contract::cam::{CodecParams, CodecParamsError};
+use std::collections::HashMap;
+use std::sync::RwLock;
+
+/// JIT kernel cache keyed by `CodecParams::kernel_signature()`.
+///
+/// Generic over kernel handle type. Concurrent-safe via `RwLock`; multiple
+/// readers can hit cache simultaneously; exactly one writer at a time for
+/// insert.
+pub struct CodecKernelCache<H: Clone> {
+    cache: RwLock<HashMap<u64, H>>,
+    compile_count: RwLock<u64>,
+    hit_count: RwLock<u64>,
+}
+
+impl<H: Clone> CodecKernelCache<H> {
+    /// Create an empty cache.
+    pub fn new() -> Self {
+        Self {
+            cache: RwLock::new(HashMap::new()),
+            compile_count: RwLock::new(0),
+            hit_count: RwLock::new(0),
+        }
+    }
+
+    /// Get the kernel for `params`, compiling if missing.
+    ///
+    /// The `compile` closure runs **only on cache miss**; for the typical
+    /// sweep where overlapping grid tuples share a kernel signature, most
+    /// calls are cheap cache reads (read lock, clone, counter bump).
+    ///
+    /// Returns a cloned handle. The caller drives the kernel; the cache
+    /// retains its own copy until `clear()` is called.
+    pub fn get_or_compile<F>(&self, params: &CodecParams, compile: F) -> H
+    where
+        F: FnOnce() -> H,
+    {
+        let sig = params.kernel_signature();
+        // Fast path: read-lock check for cache hit.
+        if let Some(h) = self.cache.read().unwrap().get(&sig).cloned() {
+            *self.hit_count.write().unwrap() += 1;
+            return h;
+        }
+        // Slow path: compile + insert. Double-check inside write-lock to
+        // prevent duplicate compilation under concurrent misses.
+        let mut w = self.cache.write().unwrap();
+        if let Some(h) = w.get(&sig).cloned() {
+            *self.hit_count.write().unwrap() += 1;
+            return h;
+        }
+        let h = compile();
+        w.insert(sig, h.clone());
+        *self.compile_count.write().unwrap() += 1;
+        h
+    }
+
+    /// Same as `get_or_compile`, but `compile` may fail; an `Err` is returned as-is and nothing is cached or counted.
+    pub fn try_get_or_compile<F>(
+        &self,
+        params: &CodecParams,
+        compile: F,
+    ) -> Result<H, CodecParamsError>
+    where
+        F: FnOnce() -> Result<H, CodecParamsError>,
+    {
+        let sig = params.kernel_signature();
+        if let Some(h) = self.cache.read().unwrap().get(&sig).cloned() {
+            *self.hit_count.write().unwrap() += 1;
+            return Ok(h);
+        }
+        let mut w = self.cache.write().unwrap();
+        if let Some(h) = w.get(&sig).cloned() {
+            *self.hit_count.write().unwrap() += 1;
+            return Ok(h);
+        }
+        let h = compile()?;
+        w.insert(sig, h.clone());
+        *self.compile_count.write().unwrap() += 1;
+        Ok(h)
+    }
+
+    /// Number of unique kernels in the cache (= unique signatures seen).
+    pub fn len(&self) -> usize {
+        self.cache.read().unwrap().len()
+    }
+
+    /// `true` if the cache is empty.
+    pub fn is_empty(&self) -> bool {
+        self.len() == 0
+    }
+
+    /// Number of successful `compile()` invocations since the last `clear()`: one per cached signature.
+    pub fn compile_count(&self) -> u64 {
+        *self.compile_count.read().unwrap()
+    }
+
+    /// Number of cache hits (compile closure NOT invoked).
+    pub fn hit_count(&self) -> u64 {
+        *self.hit_count.read().unwrap()
+    }
+
+    /// Cache hit ratio: `hit_count / (hit_count + compile_count)`.
+    /// Returns 0.0 when no calls have been made.
+    pub fn hit_ratio(&self) -> f64 {
+        let hits = self.hit_count() as f64;
+        let compiles = self.compile_count() as f64;
+        let total = hits + compiles;
+        if total < 0.5 { 0.0 } else { hits / total }
+    }
+
+    /// Check whether a specific signature is cached without calling compile.
+    pub fn has_signature(&self, signature: u64) -> bool {
+        self.cache.read().unwrap().contains_key(&signature)
+    }
+
+    /// Clear the cache (and reset counters). Useful for test isolation.
+    pub fn clear(&self) {
+        self.cache.write().unwrap().clear();
+        *self.compile_count.write().unwrap() = 0;
+        *self.hit_count.write().unwrap() = 0;
+    }
+}
+
+impl<H: Clone> Default for CodecKernelCache<H> {
+    fn default() -> Self {
+        Self::new()
+    }
+}
+
+/// Deterministic stub kernel handle — for testing the cache without
+/// invoking the real Cranelift / jitson compilation path.
+///
+/// Captures what the kernel WOULD be (the signature it was compiled for +
+/// whether AMX would be used). D1.1b's Cranelift path replaces the
+/// stub with a real `KernelHandle`.
+#[derive(Debug, Clone, PartialEq, Eq)]
+pub struct StubKernel {
+    /// `CodecParams::kernel_signature()` this stub represents.
+    pub signature: u64,
+    /// `params.is_matmul_heavy()` at compile time — drives Tier-1 AMX dispatch.
+    pub is_matmul_heavy: bool,
+    /// SIMD tier name this stub claims ("amx" | "vnni" | "avx512" | "avx2").
+    /// Never "scalar" on a SoA path — iron rule.
+    pub backend: &'static str,
+}
+
+impl StubKernel {
+    /// Build a stub from the current `CodecParams`, selecting a tier label
+    /// under the assumption that AMX is available for matmul-heavy paths.
+    /// The actual per-process capability query is
+    /// `ndarray::simd_amx::amx_available()`; this stub pretends it's true.
+    pub fn from_params(params: &CodecParams) -> Self {
+        Self {
+            signature: params.kernel_signature(),
+            is_matmul_heavy: params.is_matmul_heavy(),
+            backend: if params.is_matmul_heavy() { "amx" } else { "avx512" },
+        }
+    }
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+    use lance_graph_contract::cam::{CodecParamsBuilder, LaneWidth, Rotation};
+
+    #[test]
+    fn cache_starts_empty() {
+        let c: CodecKernelCache<StubKernel> = CodecKernelCache::new();
+        assert_eq!(c.len(), 0);
+        assert!(c.is_empty());
+        assert_eq!(c.compile_count(), 0);
+        assert_eq!(c.hit_count(), 0);
+        assert_eq!(c.hit_ratio(), 0.0);
+    }
+
+    #[test]
+    fn first_call_compiles_second_is_cache_hit() {
+        let c: CodecKernelCache<StubKernel> = CodecKernelCache::new();
+        let p = CodecParamsBuilder::new().centroids(1024).build().unwrap();
+
+        let k1 = c.get_or_compile(&p, || StubKernel::from_params(&p));
+        let k2 = c.get_or_compile(&p, || panic!("must not recompile on cache hit"));
+
+        assert_eq!(k1, k2);
+        assert_eq!(c.compile_count(), 1);
+        assert_eq!(c.hit_count(), 1);
+        assert_eq!(c.len(), 1);
+        assert_eq!(c.hit_ratio(), 0.5);
+    }
+
+    #[test]
+    fn different_params_produce_different_kernels() {
+        let c: CodecKernelCache<StubKernel> = CodecKernelCache::new();
+        let p1 = CodecParamsBuilder::new().centroids(256).build().unwrap();
+        let p2 = CodecParamsBuilder::new().centroids(1024).build().unwrap();
+
+        let k1 = c.get_or_compile(&p1, || StubKernel::from_params(&p1));
+        let k2 = c.get_or_compile(&p2, || StubKernel::from_params(&p2));
+
+        assert_ne!(k1.signature, k2.signature);
+        assert_eq!(c.compile_count(), 2);
+        assert_eq!(c.hit_count(), 0);
+        assert_eq!(c.len(), 2);
+    }
+
+    #[test]
+    fn seed_changes_do_not_invalidate_cache() {
+        // CodecParams::kernel_signature() excludes `seed` (PR #225).
+        // Same IR-shaping fields → same signature → cache hit.
+        let c: CodecKernelCache<StubKernel> = CodecKernelCache::new();
+        let p1 = CodecParamsBuilder::new().seed(1).build().unwrap();
+        let p2 = CodecParamsBuilder::new().seed(2).build().unwrap();
+
+        let k1 = c.get_or_compile(&p1, || StubKernel::from_params(&p1));
+        let k2 = c.get_or_compile(&p2, || panic!("seed change must not invalidate cache"));
+
+        assert_eq!(k1, k2);
+        assert_eq!(c.compile_count(), 1);
+        assert_eq!(c.hit_count(), 1);
+    }
+
+    #[test]
+    fn matmul_heavy_params_select_amx_backend_in_stub() {
+        let opq = CodecParamsBuilder::new()
+            .lane_width(LaneWidth::BF16x32)
+            .rotation(Rotation::Opq { matrix_blob_id: 42, dim: 4096 })
+            .build()
+            .unwrap();
+        let identity = CodecParamsBuilder::new().build().unwrap();
+
+        let k_opq = StubKernel::from_params(&opq);
+        let k_id = StubKernel::from_params(&identity);
+
+        assert_eq!(k_opq.backend, "amx");
+        assert!(k_opq.is_matmul_heavy);
+        assert_eq!(k_id.backend, "avx512");
+        assert!(!k_id.is_matmul_heavy);
+    }
+
+    #[test]
+    fn clear_resets_cache_and_counters() {
+        let c: CodecKernelCache<StubKernel> = CodecKernelCache::new();
+        let p = CodecParamsBuilder::new().build().unwrap();
+        c.get_or_compile(&p, || StubKernel::from_params(&p));
+        c.get_or_compile(&p, || StubKernel::from_params(&p));
+
+        assert_eq!(c.len(), 1);
+        assert_eq!(c.compile_count(), 1);
+        assert_eq!(c.hit_count(), 1);
+
+        c.clear();
+
+        assert_eq!(c.len(), 0);
+        assert_eq!(c.compile_count(), 0);
+        assert_eq!(c.hit_count(), 0);
+        assert!(c.is_empty());
+    }
+
+    #[test]
+    fn try_get_or_compile_propagates_errors() {
+        let c: CodecKernelCache<StubKernel> = CodecKernelCache::new();
+        let p = CodecParamsBuilder::new().build().unwrap();
+        let result: Result<StubKernel, _> = c.try_get_or_compile(&p, || {
+            Err(CodecParamsError::ZeroDimension { field: "test" })
+        });
+        assert!(result.is_err());
+        // Failed compile doesn't populate cache.
+        assert_eq!(c.len(), 0);
+    }
+
+    #[test]
+    fn has_signature_checks_without_compiling() {
+        let c: CodecKernelCache<StubKernel> = CodecKernelCache::new();
+        let p = CodecParamsBuilder::new().centroids(512).build().unwrap();
+        let sig = p.kernel_signature();
+
+        assert!(!c.has_signature(sig));
+        c.get_or_compile(&p, || StubKernel::from_params(&p));
+        assert!(c.has_signature(sig));
+    }
+
+    #[test]
+    fn sweep_grid_warms_cache_deterministically() {
+        // Simulate the D0.3 insight: a sweep grid with 4 distinct kernel
+        // signatures + 1 repeat (seed difference) compiles exactly 4 kernels.
+        let c: CodecKernelCache<StubKernel> = CodecKernelCache::new();
+        let candidates: Vec<CodecParams> = vec![
+            CodecParamsBuilder::new().centroids(256).build().unwrap(),
+            CodecParamsBuilder::new().centroids(512).build().unwrap(),
+            CodecParamsBuilder::new().centroids(1024).build().unwrap(),
+            CodecParamsBuilder::new().centroids(256).seed(999).build().unwrap(), // same sig as first
+            CodecParamsBuilder::new()
+                .lane_width(LaneWidth::BF16x32)
+                .rotation(Rotation::Opq { matrix_blob_id: 1, dim: 4096 })
+                .build().unwrap(),
+        ];
+
+        for p in &candidates {
+            c.get_or_compile(p, || StubKernel::from_params(p));
+        }
+
+        // 4 unique signatures (seed=999 collides with the first).
+        assert_eq!(c.len(), 4);
+        assert_eq!(c.compile_count(), 4);
+        assert_eq!(c.hit_count(), 1);
+        assert!((c.hit_ratio() - 0.2).abs() < 1e-9);
+    }
+}
diff --git a/crates/cognitive-shader-driver/src/lib.rs b/crates/cognitive-shader-driver/src/lib.rs