Merge pull request #234 from AdaWorldAPI/claude/teleport-session-setup-wMZfb

AdaWorldAPI · web-flow · commit a3529ff58e39 · 2026-04-21T01:54:54.000+02:00
D1.2 rotation primitives + taxonomy/ladder/shader-vs-engine epiphanies (95/95 tests)
diff --git a/.claude/board/EPIPHANIES.md b/.claude/board/EPIPHANIES.md
@@ -65,6 +65,180 @@ stay as historical references.
 
 ## Entries (reverse chronological)
 
+## 2026-04-20 — Shader vs engine: statelessness is the boundary
+
+**Status:** FINDING (sharpens the three-level taxonomy)
+
+**Cognitive shader** = stateless atomic compute. Given `ShaderDispatch`
++ `BindSpace` columns, returns `ShaderHit`s + `MetaWord`. Knows nothing
+of why it fires. Output is one-cycle-wide, no history.
+
+**Thinking engine** = stateful orchestrator. Calls `shader.dispatch()`
+many times per cognitive cycle; composes per-lens hits into
+persona/qualia/world_model/ghost state; revises beliefs for the next
+cycle. The cognitive stack IS the state.
+
+**The engine_bridge is where they meet** —
+`cognitive-shader-driver/src/engine_bridge.rs` is the seam. Shader
+side: `ShaderDriver::dispatch` stateless. Engine side:
+`cognitive_stack::cycle` accumulates dispatches through
+`bf16_engine` / `signed_engine` / `composite_engine` / `dual_engine` /
+`layered` / `domino`, folds into persona/qualia, emits state for next
+cycle.
+
+**Analogy:** shader = eye (no memory, reports the current frame);
+engine = mind (memory, assembles frames into narrative, counterfactually
+imagines alternatives).
+
+**Where codec-flexibility-as-thinking lands:** the **engine** level,
+not the shader level. A "new thinking style" = a new engine
+configuration (lens composition, persona, qualia-update rule) that
+picks DIFFERENT shader configs per cycle. Shader stays the same; the
+engine's orchestration changes. That's why Phase 5+ "production-grade
+thinking tissue" drops into mid (engine), not L2 (shader).
+
+**Concrete Phase 1-5 shipping:** codec-sweep D1.x work = shader layer
+(tensor decode primitives). Engine-level codec-flexibility (swap
+lenses via YAML) = D5 / Phase 5+, plugging INTO the codec infrastructure.
+
+Cross-ref: three-level taxonomy above; resolution-ladder entry
+`64×64 > 256×257 >> 4096×4096 > 16k`; `engine_bridge.rs` seam.
+
+---
+
+## 2026-04-20 — Resolution hierarchy: `64×64 > 256×257 >> 4096×4096 > 16k` (user-named)
+
+**Status:** FINDING (capstone of the three-level taxonomy from earlier this session)
+
+The 5-layer stack is a **resolution ladder**, not a layer cake. Each
+level operates at its own granularity and has its own "shader" /
+"kernel cache" / "distance table" at that scale:
+
+| Size | Role | Where | HHTL stage (I10) |
+|---|---|---|---|
+| **64×64** | p64 topology mask — 8 predicate planes × 64 rows × u64 — "which archetype blocks relate via predicate z" | `p64_bridge::cognitive_shader::CognitiveShader` | HEEL (coarse basin) |
+| **256×257** | bgz17 palette distance table — 256 archetypes × 256 + 1 sentinel — O(1) lookup `semiring.distance(a, b)` | `bgz17::PaletteSemiring` | HIP (family sharpen) |
+| **4096×4096** | Cross-vocabulary / cross-context correlation — COCA × COCA, or 4096 τ-prefix × 4096 slot space | ndarray `ScanParams` JIT (`jitson_cranelift`) | BRANCH / TWIG |
+| **16 K** | Individual fingerprint bit identity — 16384-bit `Fingerprint<256>` | `ndarray::simd::Fingerprint<256>` + codec decoder (D1.x) | LEAF (exact member) |
+
+**The `>>` between 256×257 and 4096×4096 is the big jump** (~64×)
+matching HIP → BRANCH refinement. That's where palette-level (one
+row of the codebook) meets vocabulary-level (COCA 4096). Below that
+jump, everything is O(1) table lookup; above it, JIT kernels become
+worth the compile cost.
+
+**Each JIT targets its own resolution — no overlap:**
+
+- p64 cascade: 64×64 bitmask ops. Not JIT'd (bit tricks in hot loop
+  already optimal under AVX-512).
+- bgz17 palette: 256×256 precomputed. Not JIT'd (memory-bound).
+- ndarray ScanParams: 4096×4096 scan kernels. **JIT'd via
+  `jitson_cranelift::JitEngine`** — shipped.
+- Codec kernels (D1.x): 16k bit-level tensor decode. **Will be JIT'd
+  via D1.1b `CodecKernelEngine` adapter**. Scaffold (D1.1) + rotation
+  primitives (D1.2) landed; Cranelift IR emission deferred to D1.1b.
+
+**Three-level taxonomy (from earlier this session) maps onto the
+resolution ladder:**
+
+- **L2 small-precision cognitive shaders** (ns budget) →
+  64×64 + 256×257 (p64 + bgz17 palette). Pure table lookups.
+- **mid thinking-engine layers** (µs-ms) →
+  4096×4096 (cross-vocab, persona-aware lens composition). JIT'd
+  scan kernels.
+- **L4 thinking styles / NARS / JIT** (ms) →
+  orchestrates traversal ACROSS resolutions (starts at 64×64 cascade
+  to find candidates, narrows to 256×257 for family, drops to
+  4096×4096 for context, verifies at 16k fingerprint identity).
+
+**p64::CognitiveShader double-check conclusion:** architecturally
+clean. Operates at the coarsest (64×64) level; codec-sweep work at
+finest (16k); they compose in `cognitive_shader_driver::ShaderDriver`
+without overlap. Different layers of the ladder, different
+operations, different JIT targets (if any).
+
+Cross-ref: I10 (HEEL/HIP/BRANCH/TWIG/LEAF); three-level taxonomy entry
+above; `p64_bridge::cognitive_shader::CognitiveShader::cascade`;
+D1.1 `CodecKernelCache`; D1.2 `RotationKernel`; bgz17 `PaletteSemiring`.
+
+---
+
+## 2026-04-20 — Thinking styles ARE codecs over the semantic field (north star)
+
+**Status:** FINDING (forward-looking deposit — not a current work item; reference when Phase 5+ generalises)
+
+A codec compresses tensor content into fingerprints; a thinking style
+compresses reasoning trajectories into NARS-revised beliefs. Same
+underlying operation — structure-preserving compression on a binary
+Hamming substrate. Different input/output domains, same substrate
+guarantees (E-SUBSTRATE-1, I-SUBSTRATE-MARKOV), same compile-and-swap
+machinery.
+
+**The codec infrastructure IS the template for production-grade
+thinking tissue.** When Phase 5+ activates:
+
+| Codec (shipped D0.1–D1.2, D1.1b queued) | Thinking-style analog |
+|---|---|
+| `CodecParams` | `ThinkingStyleParams { style, modulation_7d, nars_priors, fallback_chain, sigma_priority, semiring_choice }` |
+| `kernel_signature()` — excludes runtime drift | `style_signature()` — excludes per-cycle modulation drift |
+| `CodecKernelCache<H>` | `ThinkingStyleKernelCache<H>` — same generic scaffold |
+| JIT kernel = Cranelift-compiled decode | JIT kernel = compiled scan-walk on 36-node topology (already shipped ndarray-side via `scan_jit.rs` + `ScanParams`) |
+| **Token agreement** (I11 cert gate) | **Conclusion agreement** — same NARS-revised conclusions as reference style? |
+| Sweep grid = N codec candidates | Sweep grid = N (style × modulation × NARS fallback) candidates |
+| `/v1/shader/calibrate` | `/v1/shader/think-calibrate` |
+| `[FORMAL-SCAFFOLD]` 5 pillars | **Same scaffold** — E-SUBSTRATE-1 covers any transition under bundle |
+
+**Generalisation isn't "port codec pattern to thinking"** — it's
+recognising thinking styles as a SPECIAL CASE of the codec pattern we
+just built. When Phase 5+ lands, `WireThinkCalibrate` +
+`ThinkingStyleKernelCache` + `conclusion_agreement` metric drop in
+alongside the codec versions. Same JIT engine, same tests, same
+board-hygiene discipline.
+
+**The phrase "production-grade thinking tissue"** names the telos
+cleanly: once codec infra is at Phase 3 token-agreement pass rates,
+cloning to thinking styles yields production-grade swappable
+reasoning — YAML-configured, JIT-compiled, sweep-certified. No
+rebuild per new style, no black box, signature-keyed reproducibility.
+
+**Cross-ref:** D0.6 `CodecParams` (the parameter-shape template);
+D1.1 `CodecKernelCache<H>` (the cache pattern — generic-over-H is the
+wedge for reuse); I5 (thinking IS an AdjacencyStore — already
+topologically unified with data graph); codec-sweep-via-lab-infra-v1.
+
+---
+
+## 2026-04-20 — D1.2 Hadamard is pure-Rust, not a JIT-necessary primitive
+
+**Status:** FINDING
+
+D1.2's HadamardRotation is implemented as a plain Rust in-place
+Sylvester butterfly (O(N log N) add/sub, no allocations). It does NOT
+need JIT compilation or Cranelift code emission because:
+
+1. **Fixed shape** — the butterfly structure is identical across all
+   power-of-two dims. Rust's compiler (under `target-cpu=x86-64-v4`)
+   already emits AVX-512 add/sub from the straight-line loop.
+2. **Not matmul** — Hadamard is a pattern of adds and subtracts,
+   never a dot product. Per Rule C polyfill hierarchy, matmul-heavy
+   paths benefit from AMX (Tier 1); add/sub stays at Tier 3 F32x16.
+   AMX gives no speedup here — confirmed in plan Appendix §12 C.
+
+**Consequence for D1.1b (Cranelift wiring):** only OPQ rotation needs
+the JIT path — it's the one that's actually a learned matmul. The
+Cranelift integration scope narrows: we don't need to JIT-compile
+Identity (no-op) or Hadamard (butterfly); just OPQ (matmul) and the
+main codec decode loop (ADC distance with palette lookup).
+
+This reduces D1.1b scope by maybe 30-40% — fewer kernel shapes to
+emit, only the ones that actually benefit.
+
+Cross-ref: D1.2 `rotation_kernel.rs::HadamardRotation`; Rule C
+(polyfill hierarchy); plan Appendix B (CartanCascade harmonic
+compression ratios rely on real Hadamard, so this matters).
+
+---
+
 ## 2026-04-20 — CORRECTION to D1.1 scaffold: ndarray::hpc::jitson_cranelift already ships JitEngine
 
 **Status:** FINDING / CORRECTION
diff --git a/.claude/board/STATUS_BOARD.md b/.claude/board/STATUS_BOARD.md
@@ -63,7 +63,7 @@ afterwards is a JIT kernel, not a rebuild. Plan path:
 |---|---|---|---|
 | D1.1 | `CodecKernelCache` — structural cache layer (generic over handle) | **In PR** | branch — `CodecKernelCache<H>` + `StubKernel` + `get_or_compile` / `try_get_or_compile` with RwLock concurrent-safe double-check + compile/hit/ratio counters + 9 tests. Scaffold ships NOW; D1.1b Cranelift IR emission follows. |
 | D1.1b | Adapter: `CodecKernelEngine` wrapping `ndarray::hpc::jitson_cranelift::JitEngine` with two-phase BUILD/RUN lifecycle (Arc-freeze). CodecParams → CodecScanParams adapter + codec-specific IR emission in jitson_cranelift/scan_jit analog | **Queued** | target ~250 LOC; `JitEngine` already ships (`/home/user/ndarray/src/hpc/jitson_cranelift/engine.rs`); the work is the CodecParams adapter + codec-specific JITSON template |
-| D1.2 | Rotation primitives: Identity / Hadamard / OPQ as JIT kernels | **Queued** | target ~190 LOC |
+| D1.2 | Rotation primitives: Identity / Hadamard / OPQ as `RotationKernel` impls | **In PR** | branch — `RotationKernel` trait (Send+Sync+Debug, object-safe) + `IdentityRotation` (no-op) + `HadamardRotation` (real Sylvester butterfly, O(N log N) in-place, norm²-scaling verified) + `OpqRotationStub` (matrix-blob-id placeholder for D1.1b) + `build(&Rotation, dim)` factory + `RotationError` typed errors + 15 tests. Hadamard stays at Tier-3 F32x16 (add/sub, not matmul → no AMX benefit per Rule C). |
 | D1.3 | Residual PQ via JIT composition | **Queued** | target ~150 LOC |
 
 ### Phase 2 — Token-agreement harness (I11 cert gate) — Queued
diff --git a/crates/cognitive-shader-driver/src/lib.rs b/crates/cognitive-shader-driver/src/lib.rs
@@ -125,6 +125,12 @@ pub mod auto_detect;
 #[cfg(feature = "serve")]
 pub mod codec_kernel_cache;
 
+// D1.2 — rotation primitives (Identity / Hadamard / OPQ-stub). LAB-ONLY.
+// Hadamard is real (in-place butterfly); OPQ is stub pending D1.1b's
+// ndarray::hpc::jitson_cranelift::JitEngine adapter + matrix-blob loader.
+#[cfg(feature = "serve")]
+pub mod rotation_kernel;
+
 // Axum REST server. LAB-ONLY.
 #[cfg(feature = "serve")]
 pub mod serve;
diff --git a/crates/cognitive-shader-driver/src/rotation_kernel.rs b/crates/cognitive-shader-driver/src/rotation_kernel.rs