AdaWorldAPI · AdaWorldAPI · Jun 14, 2026 · Jun 13, 2026 · Jun 13, 2026 · Jun 14, 2026
diff --git a/.claude/board/AGENT_LOG.md b/.claude/board/AGENT_LOG.md
@@ -1,3 +1,15 @@
+## 2026-06-13 — turbovec ⇄ ndarray integration: fork-wired + ndarray::simd polyfill GEMM + measured AMX-vs-LUT
+
+**Main thread (Opus 4.8 1M) + 1 Opus general-purpose agent (bgz-tensor synergy map).** User: "create a crate in lance-graph for turbovec and check synergies; route SIMD through ndarray::simd (simd.rs→simd_amx/avx512/ops/soa); the polyfill does the work, ndarray ships AMX via byte-asm dispatch; pin rust 1.95." Cross-repo, branch `claude/wonderful-hawking-lodtql` in all three repos.
+
+**Shipped:**
+- **turbovec** (the AdaWorldAPI fork of Google TurboQuant, arXiv 2504.19874): re-pointed `ndarray = "0.17"` (crates.io) → the AdaWorldAPI fork (`path = ../../ndarray`, `default-features=false, features=["std"]`) — P0 forks-only; the fork IS rust-ndarray 0.17.2 + HPC/SIMD so the array API is unchanged AND `ndarray::simd` is reachable. `blas` made opt-in (build.rs gates the OpenBLAS link on `CARGO_FEATURE_BLAS`; default uses pure-Rust matrixmultiply for the one encode `.dot()`). Added `rust-toolchain.toml` = 1.95.0. New `src/search_polyfill.rs` (feature `ndarray-simd`): TurboQuant scoring as a batched int8 GEMM `Q·X̂ᵀ` via `ndarray::simd::matmul_i8_to_i32` — zero raw intrinsics; ndarray picks AMX tile / VPDPBUSD / AVX-VNNI / scalar. `FORCE_SCALAR_FALLBACK` exposed under new `bench-internals` feature. `examples/kernel_speed.rs` (native vs polyfill vs scalar + recall). 2 polyfill tests green.
+- **ndarray**: re-exported `hpc::amx_matmul::{matmul_i8_to_i32, amx_available}` through `simd.rs` (std-gated) so the AMX int8-GEMM ladder is reachable via the canonical `ndarray::simd::*` consumer surface (W1a). Additive; no behaviour change.
+- **lance-graph**: new excluded standalone crate `crates/lance-graph-turbovec` (path-deps both forks) — `TurboVec` bridge with a `Kernel::{NativeLut, PolyfillGemm}` A/B switch + lazy reconstruction cache + `polyfill_backend()` report; 2 tests green. `KNOWLEDGE.md` = full synergy map. Root Cargo.toml `exclude` updated. EPIPHANIES E-TURBOVEC-AMX-WRONG-TOOL-1 + this entry + LATEST_STATE.
+
+**Measured (AVX-512+VNNI host, no AMX tiles; n=20k dim=512 k=10 4-bit):** native LUT-ADC 76 µs/q (recall 0.785) ; polyfill GEMM 867 µs/q (recall 0.764) ; scalar 6 267 µs/q. **polyfill 11.4× slower than native** → TurboQuant deliberately trades the matmul away (LUT gather, not dot), so AMX accelerates the op it removed. Native LUT stays the production kernel; polyfill retained as AMX-ready baseline. Placement verdict: index → spine (lance-graph), kernel-math → ndarray (already owns clam/cam_pq/cascade/amx_matmul). The promising synergy is a Belichtungsmesser σ-gate on the LUT scan, NOT AMX.
+
+**Verification:** `cargo build --lib -p turbovec` (fork-wired) green; `cargo test -p turbovec --features ndarray-simd search_polyfill` 2/2 green; `cargo test --manifest-path crates/lance-graph-turbovec/Cargo.toml` green; benchmark ran. Pre-existing upstream turbovec dead-code warning (`avx2_block_epilogue`) silenced minimally. Commits: one per repo on the branch.
 ## 2026-06-13 — SoaEnvelope binding for canonical NodeRow (the canon-as-substrate keystone)
 
 **bardioc cross-session.** Closes punchlist item §7.2 of the 2026-06-13 SoA migration diff resolution doc — the canonical row layout is now bound to the envelope ABI. New `NodeRowPacket<'a>` wrapper in `canonical_node.rs` zero-copy-views a `&[NodeRow]` (each row `#[repr(C, align(64))]` at 512 bytes) as a row-strided LE byte packet through `SoaEnvelope`. Three-column descriptor table (`NODE_ROW_COLUMNS`): key (16 × u8 at offset 0), edges (16 × u8 at offset 16), value (480 × u8 at offset 32) — sums to `NODE_ROW_STRIDE = 512`. Internal structure within each slot stays canon-described (`NodeGuid` for the key, `EdgeBlock` for the edges, registry `ClassView` for the value carve-out) — the envelope contract is at the row-stride level, not the field-decomposition level. `NodeRowColumn` enum exports the column ordinals as `pub enum { Key=0, Edges=1, Value=2 }` for type-safe `column_le` access. `as_le_bytes()` is unsafe-free at the API but uses `core::slice::from_raw_parts` internally with a documented SAFETY note (NodeRow `#[repr(C)]` + locked size + canon-LE field accessors). +9 tests covering column-table layout, empty-packet verification, single-row zero-copy (pointer equality), multi-row byte length, `row_le`/`column_le` LE byte ranges, canon-LE key end-to-end, and `LAYOUT_VERSION` parity. `cargo test -p lance-graph-contract --lib`: **603/603 green** (+9); `cargo clippy -p lance-graph-contract --all-targets -- -D warnings`: clean. **No public-API drift in existing code** — `NodeRowPacket`, `NodeRowColumn`, `NODE_ROW_COLUMNS`, `NODE_ROW_STRIDE` are pure additions. This is the keystone the BindSpace dissolution sequence S1-S4 has been blocked behind: Lance's columnar I/O can now read the canonical row packet directly. Next step: MailboxSoA migrating from its column-major `[T; N]` layout to a row-strided `[NodeRow; N]` backing store that impls `SoaEnvelope` through this wrapper.

diff --git a/.claude/board/EPIPHANIES.md b/.claude/board/EPIPHANIES.md
@@ -1,3 +1,47 @@
+## 2026-06-13 — E-TURBOVEC-AMX-WRONG-TOOL-1 — AMX accelerates the operation TurboQuant deliberately removed
+
+**Status:** FINDING (benchmarked; AVX-512+VNNI host, `amx_available=false`).
+**Confidence:** High — measured, with a mechanistic explanation that holds across the tier ladder.
+
+**The finding.** turbovec (Google TurboQuant, arXiv 2504.19874) was brought
+onto the spine as `crates/lance-graph-turbovec` (excluded standalone, path-deps
+the AdaWorldAPI turbovec + ndarray forks). Its scan was *also* expressed as a
+batched int8 GEMM through `ndarray::simd::matmul_i8_to_i32` (the polyfill that
+ships AMX `TDPBUSD` → AVX-512 VPDPBUSD → AVX-VNNI → scalar). Measured
+(`n=20 000, dim=512, k=10, 4-bit`):
+
+| kernel | ns/query | recall@10 |
+|---|---|---|
+| native nibble-LUT ADC (AVX-512BW) | 76 073 | 0.785 |
+| polyfill int8 GEMM (VPDPBUSD-zmm) | 866 899 | 0.764 |
+| scalar reference | 6 267 279 | — |
+
+The polyfill GEMM is **11.4× slower** than the native LUT, and native is 82×
+faster than scalar. **Mechanism:** TurboQuant's design *trades the matmul away*
+— LUT-ADC is an O(1) table gather per coordinate; the GEMM does the full
+`dim`-length dot per (query,vector) pair. AMX is a tile *matrix-multiply* unit,
+so it accelerates exactly the operation TurboQuant removed. The AMX tile (256
+MAC/instr, ~4× VNNI) would bring the polyfill from 11.4× → ~3× slower — still a
+loss. **A gather is not a matmul; no tile engine makes it one.**
+
+**Consequences.**
+- Keep the native LUT kernel as turbovec's production path. The polyfill is
+  retained only as (a) proof the index is `ndarray::simd`-clean / AMX-ready and
+  (b) a measured baseline. AMX is the right tool only where the workload is
+  genuinely matmul-shaped (e.g. an exact-rerank LEAF over a tiny survivor set).
+- Generalises the I-VSA-IDENTITIES register lesson to *kernels*: match the SIMD
+  primitive to the algorithm's operation, not to peak MAC/instr. "Ship AMX via
+  dispatch" is correct *plumbing* (the polyfill does ship it), but plumbing
+  doesn't make the wrong-shaped op fast.
+- The genuinely promising turbovec⇄bgz-tensor wiring is NOT AMX: it is a
+  Belichtungsmesser σ-gated block reject on the LUT scan (turbovec has only a
+  heap-min prune, no statistical threshold). See
+  `crates/lance-graph-turbovec/KNOWLEDGE.md` §3B.
+
+Cross-ref: `crates/lance-graph-turbovec/KNOWLEDGE.md` (full synergy map +
+reproduce); `ndarray::hpc::amx_matmul::matmul_i8_to_i32` (the 4-tier ladder);
+I-NOISE-FLOOR-JIRAK (the σ-threshold path inherits the Jirak obligation).
+
 ## 2026-06-12 — E-OUTER-BOUNDARY-IS-ORM-1 — there is only one boundary, and it is ontology-mediated
 
 **Status:** FINDING (PR #487 tombstone commit makes this source-true; OGAR class + `SoaEnvelope` + Lance columnar I/O is the realized triangle).

diff --git a/.claude/board/LATEST_STATE.md b/.claude/board/LATEST_STATE.md
@@ -10,6 +10,8 @@
 
 ---
 
+> **2026-06-13 — shipped (autoattended, cross-repo)** (turbovec ⇄ ndarray): new excluded standalone crate **`crates/lance-graph-turbovec`** — Google TurboQuant (arXiv 2504.19874, the AdaWorldAPI `turbovec` fork) bridged onto the spine. `TurboVec` wraps `turbovec::TurboQuantIndex` with a `Kernel::{NativeLut, PolyfillGemm}` A/B switch. **Cross-repo (branch `claude/wonderful-hawking-lodtql` in turbovec + ndarray + lance-graph):** turbovec re-pointed from crates.io `ndarray 0.17` → the AdaWorldAPI fork (path, P0 forks-only; `blas` opt-in so default builds BLAS-free; `rust-toolchain.toml` = 1.95.0); new `turbovec::search_polyfill` (feature `ndarray-simd`) expresses scoring as a batched int8 GEMM via **`ndarray::simd::matmul_i8_to_i32`** (re-exported through `simd.rs` — AMX `TDPBUSD` tile → AVX-512 VPDPBUSD → AVX-VNNI → scalar, dispatched inside ndarray, zero intrinsics in turbovec). **Measured finding (E-TURBOVEC-AMX-WRONG-TOOL-1):** the polyfill GEMM is 11.4× SLOWER than the native nibble-LUT (TurboQuant trades the matmul away → AMX accelerates the op it removed); native LUT stays production, polyfill is the AMX-ready baseline. Placement: index → spine, kernel-math → ndarray (already owns clam/cam_pq/cascade/amx_matmul). Synergy map (HDR popcount stacking early-exit, Belichtungsmesser σ thresholds, preheating vs palette256) in `crates/lance-graph-turbovec/KNOWLEDGE.md`. Tests green in all three repos; benchmark via `examples/kernel_speed.rs`. NOT a merged PR yet (branch work).
+>
 > **2026-06-03 — hardened (follow-up after #460)** (D-HELIX-1 wiring): `crates/helix` now takes **ndarray as a MANDATORY, non-optional git dependency** (`git = AdaWorldAPI/ndarray @ master`), replacing the optional `path` dep + `ndarray-hpc` feature. Why: (1) codex P2 — an optional *path* dep still forces Cargo to read the local sibling manifest at resolution, so a clean checkout failed before feature selection; (2) directive "ndarray is mandatory for lance-graph". `simd.rs` always uses `ndarray::simd` (no scalar fallback); the self-contained fork → no import cycle. 63 unit + 6 doctests green; clippy/fmt clean. See E-HELIX-NDARRAY-MANDATORY.
 >
 > **2026-06-03 — shipped (autoattended)** (D-HELIX-1): new standalone crate `crates/helix` — the golden-spiral **Place/Residue** codec from the user's `KNOWLEDGE.md`. HHTL = deterministic PLACE; helix = orthogonal RESIDUE. Pipeline: equal-area `√u` hemisphere placement (`HemispherePoint`) → stride-4-over-17 `CurveRuler` coupling → Fisher-Z/arctanh `Similarity` alignment → EULER_GAMMA hand-off → 256-palette `RollingFloor` quantise (occupancy-drift + version stamp) → 3-byte `ResidueEdge` endpoint pair; metric-safe L1 via 256×256 `DistanceLut` (`distance_adaptive`) + non-metric byte-Hamming `distance_heuristic`. `prove()` closes the 2-D discrepancy Open Item (companion to `jc::weyl`). Zero-dep default (`edition 2021`, empty `[workspace]`, root `exclude`); optional `ndarray-hpc` feature routes batch Fisher-Z through `ndarray::simd::simd_ln_f32`. **61 unit + 6 doctests green** on BOTH feature configs; clippy -D warnings + fmt clean. ~80% overlaps existing CERTIFIED primitives by design (clean-room, user-directed) — see `crates/helix/KNOWLEDGE.md` § Overlap & Consolidation + E-HELIX-OVERLAP + TD-HELIX-OVERLAP-1. Branch claude/gallant-rubin-Y9pQd.

diff --git a/Cargo.toml b/Cargo.toml
@@ -60,6 +60,13 @@ exclude = [
     # machinery to a single-tier unigram pipeline (see crate README).
     # Verified via `cargo test --manifest-path crates/quasicryth-research/Cargo.toml`.
     "crates/quasicryth-research",
+    # TurboQuant ANN index (Google arXiv 2504.19874) bridged onto the spine —
+    # standalone, path-deps the AdaWorldAPI turbovec + ndarray forks. Kept out
+    # of the main graph so turbovec's faer/statrs tree never enters the
+    # deterministic lance-graph compile path. Both scoring kernels (native
+    # nibble-LUT ADC + ndarray::simd::matmul_i8_to_i32 polyfill GEMM) compiled.
+    # Verify via `cargo test --manifest-path crates/lance-graph-turbovec/Cargo.toml`.
+    "crates/lance-graph-turbovec",
 ]
 resolver = "2"
 

diff --git a/crates/bgz17/src/simd.rs b/crates/bgz17/src/simd.rs
@@ -55,6 +55,11 @@ pub fn batch_palette_distance(
     let level = detect_simd();
 
     match level {
+        #[cfg(target_arch = "x86_64")]
+        SimdLevel::Avx512 => {
+            // Safety: detect_simd() confirmed AVX-512F is available.
+            unsafe { avx512_batch(dm_data, k, query, candidates, out) };
+        }
         #[cfg(target_arch = "x86_64")]
         SimdLevel::Avx2 => {
             // Safety: detect_simd() confirmed AVX2 is available.
@@ -138,6 +143,79 @@ unsafe fn avx2_batch(dm_data: &[u16], k: usize, query: u8, candidates: &[u8], ou
     }
 }
 
+/// AVX-512 gather batch lookup: process 16 lookups at a time using _mm512_i32gather_epi32.
+///
+/// Widened analogue of `avx2_batch`. The distance matrix stores u16 values; we
+/// gather i32 words from the u16 base pointer with byte-scale 2 (each u16 is 2
+/// bytes), then mask off the high u16 of each lane. The low u16 of each gathered
+/// i32 is exactly `dm[query][candidate]`, so the result is identical to
+/// `scalar_batch` (and to `avx2_batch`). The 16-wide remainder falls back to
+/// scalar, matching the AVX2 path's tail handling.
+///
+/// # Safety
+/// Caller must ensure AVX-512F is available (checked via `is_x86_feature_detected!("avx512f")`).
+#[cfg(target_arch = "x86_64")]
+#[target_feature(enable = "avx512f")]
+unsafe fn avx512_batch(dm_data: &[u16], k: usize, query: u8, candidates: &[u8], out: &mut [u16]) {
+    use core::arch::x86_64::*;
+
+    let row_offset = query as usize * k;
+    let row_ptr = dm_data.as_ptr().add(row_offset);
+    let n = candidates.len();
+
+    // Process 16 candidates at a time
+    let chunks = n / 16;
+    let remainder = n % 16;
+
+    for chunk in 0..chunks {
+        let base = chunk * 16;
+
+        // Build index vector: candidate indices as i32 (lane 0 = candidates[base]).
+        let indices = _mm512_set_epi32(
+            candidates[base + 15] as i32,
+            candidates[base + 14] as i32,
+            candidates[base + 13] as i32,
+            candidates[base + 12] as i32,
+            candidates[base + 11] as i32,
+            candidates[base + 10] as i32,
+            candidates[base + 9] as i32,
+            candidates[base + 8] as i32,
+            candidates[base + 7] as i32,
+            candidates[base + 6] as i32,
+            candidates[base + 5] as i32,
+            candidates[base + 4] as i32,
+            candidates[base + 3] as i32,
+            candidates[base + 2] as i32,
+            candidates[base + 1] as i32,
+            candidates[base] as i32,
+        );
+
+        // Gather u16 values via i32 gather on the u16 array. With scale=2 on the
+        // u16 base pointer, lane j reads the i32 at byte offset candidates[..]*2,
+        // i.e. the target u16 (low half) plus the next u16 (high half). Identical
+        // trick to avx2_batch, widened to 16 lanes.
+        let gathered = _mm512_i32gather_epi32::<2>(indices, row_ptr as *const i32);
+
+        // Mask to extract only the low u16 from each i32 lane.
+        let mask = _mm512_set1_epi32(0x0000FFFF);
+        let masked = _mm512_and_si512(gathered, mask);
+
+        // Extract and store individually (no direct i32→u16 pack across 16 lanes).
+        let mut tmp = [0i32; 16];
+        _mm512_storeu_si512(tmp.as_mut_ptr() as *mut __m512i, masked);
+
+        for i in 0..16 {
+            out[base + i] = tmp[i] as u16;
+        }
+    }
+
+    // Scalar fallback for remaining elements
+    let tail_start = chunks * 16;
+    for i in 0..remainder {
+        out[tail_start + i] = dm_data[row_offset + candidates[tail_start + i] as usize];
+    }
+}
+
 /// Batch SPO distance: combined S+P+O distance for multiple candidates.
 ///
 /// For each candidate i:
-Original file line number
+Diff line change
@@ Expand Up / @@ -10,6 +10,8 @@ @@
     ---
+    > **2026-06-13 — shipped (autoattended, cross-repo)** (turbovec ⇄ ndarray): new excluded standalone crate **`crates/lance-graph-turbovec`** — Google TurboQuant (arXiv 2504.19874, the AdaWorldAPI `turbovec` fork) bridged onto the spine. `TurboVec` wraps `turbovec::TurboQuantIndex` with a `Kernel::{NativeLut, PolyfillGemm}` A/B switch. **Cross-repo (branch `claude/wonderful-hawking-lodtql` in turbovec + ndarray + lance-graph):** turbovec re-pointed from crates.io `ndarray 0.17` → the AdaWorldAPI fork (path, P0 forks-only; `blas` opt-in so default builds BLAS-free; `rust-toolchain.toml` = 1.95.0); new `turbovec::search_polyfill` (feature `ndarray-simd`) expresses scoring as a batched int8 GEMM via **`ndarray::simd::matmul_i8_to_i32`** (re-exported through `simd.rs` — AMX `TDPBUSD` tile → AVX-512 VPDPBUSD → AVX-VNNI → scalar, dispatched inside ndarray, zero intrinsics in turbovec). **Measured finding (E-TURBOVEC-AMX-WRONG-TOOL-1):** the polyfill GEMM is 11.4× SLOWER than the native nibble-LUT (TurboQuant trades the matmul away → AMX accelerates the op it removed); native LUT stays production, polyfill is the AMX-ready baseline. Placement: index → spine, kernel-math → ndarray (already owns clam/cam_pq/cascade/amx_matmul). Synergy map (HDR popcount stacking early-exit, Belichtungsmesser σ thresholds, preheating vs palette256) in `crates/lance-graph-turbovec/KNOWLEDGE.md`. Tests green in all three repos; benchmark via `examples/kernel_speed.rs`. NOT a merged PR yet (branch work).
+    >
     > **2026-06-03 — hardened (follow-up after #460)** (D-HELIX-1 wiring): `crates/helix` now takes **ndarray as a MANDATORY, non-optional git dependency** (`git = AdaWorldAPI/ndarray @ master`), replacing the optional `path` dep + `ndarray-hpc` feature. Why: (1) codex P2 — an optional *path* dep still forces Cargo to read the local sibling manifest at resolution, so a clean checkout failed before feature selection; (2) directive "ndarray is mandatory for lance-graph". `simd.rs` always uses `ndarray::simd` (no scalar fallback); the self-contained fork → no import cycle. 63 unit + 6 doctests green; clippy/fmt clean. See E-HELIX-NDARRAY-MANDATORY.
     >
     > **2026-06-03 — shipped (autoattended)** (D-HELIX-1): new standalone crate `crates/helix` — the golden-spiral **Place/Residue** codec from the user's `KNOWLEDGE.md`. HHTL = deterministic PLACE; helix = orthogonal RESIDUE. Pipeline: equal-area `√u` hemisphere placement (`HemispherePoint`) → stride-4-over-17 `CurveRuler` coupling → Fisher-Z/arctanh `Similarity` alignment → EULER_GAMMA hand-off → 256-palette `RollingFloor` quantise (occupancy-drift + version stamp) → 3-byte `ResidueEdge` endpoint pair; metric-safe L1 via 256×256 `DistanceLut` (`distance_adaptive`) + non-metric byte-Hamming `distance_heuristic`. `prove()` closes the 2-D discrepancy Open Item (companion to `jc::weyl`). Zero-dep default (`edition 2021`, empty `[workspace]`, root `exclude`); optional `ndarray-hpc` feature routes batch Fisher-Z through `ndarray::simd::simd_ln_f32`. **61 unit + 6 doctests green** on BOTH feature configs; clippy -D warnings + fmt clean. ~80% overlaps existing CERTIFIED primitives by design (clean-room, user-directed) — see `crates/helix/KNOWLEDGE.md` § Overlap & Consolidation + E-HELIX-OVERLAP + TD-HELIX-OVERLAP-1. Branch claude/gallant-rubin-Y9pQd.
@@ Expand Down @@