Commit 8624cf3

Merge pull request #493 from AdaWorldAPI/claude/wonderful-hawking-lodtql
ndarray-SIMD consumer integration: turbovec ANN bridge + bgz17 AVX-512 + blasgraph Hamming dedup
2 parents 2f9c3ca + 31d7757 commit 8624cf3

11 files changed

Lines changed: 1850 additions & 111 deletions

.claude/board/AGENT_LOG.md

Lines changed: 12 additions & 0 deletions
@@ -1,3 +1,15 @@
## 2026-06-13 — turbovec ⇄ ndarray integration: fork-wired + ndarray::simd polyfill GEMM + measured AMX-vs-LUT

**Main thread (Opus 4.8 1M) + 1 Opus general-purpose agent (bgz-tensor synergy map).** User: "create a crate in lance-graph for turbovec and check synergies; route SIMD through ndarray::simd (simd.rs→simd_amx/avx512/ops/soa); the polyfill does the work, ndarray ships AMX via byte-asm dispatch; pin rust 1.95." Cross-repo, branch `claude/wonderful-hawking-lodtql` in all three repos.

**Shipped:**
- **turbovec** (the AdaWorldAPI fork of Google TurboQuant, arXiv 2504.19874): re-pointed `ndarray = "0.17"` (crates.io) → the AdaWorldAPI fork (`path = ../../ndarray`, `default-features=false, features=["std"]`) — P0 forks-only; the fork IS rust-ndarray 0.17.2 + HPC/SIMD so the array API is unchanged AND `ndarray::simd` is reachable. `blas` made opt-in (build.rs gates the OpenBLAS link on `CARGO_FEATURE_BLAS`; default uses pure-Rust matrixmultiply for the one encode `.dot()`). Added `rust-toolchain.toml` = 1.95.0. New `src/search_polyfill.rs` (feature `ndarray-simd`): TurboQuant scoring as a batched int8 GEMM `Q·X̂ᵀ` via `ndarray::simd::matmul_i8_to_i32` — zero raw intrinsics; ndarray picks AMX tile / VPDPBUSD / AVX-VNNI / scalar. `FORCE_SCALAR_FALLBACK` exposed under new `bench-internals` feature. `examples/kernel_speed.rs` (native vs polyfill vs scalar + recall). 2 polyfill tests green.
- **ndarray**: re-exported `hpc::amx_matmul::{matmul_i8_to_i32, amx_available}` through `simd.rs` (std-gated) so the AMX int8-GEMM ladder is reachable via the canonical `ndarray::simd::*` consumer surface (W1a). Additive; no behaviour change.
- **lance-graph**: new excluded standalone crate `crates/lance-graph-turbovec` (path-deps both forks) — `TurboVec` bridge with a `Kernel::{NativeLut, PolyfillGemm}` A/B switch + lazy reconstruction cache + `polyfill_backend()` report; 2 tests green. `KNOWLEDGE.md` = full synergy map. Root Cargo.toml `exclude` updated. EPIPHANIES E-TURBOVEC-AMX-WRONG-TOOL-1 + this entry + LATEST_STATE.

**Measured (AVX-512+VNNI host, no AMX tiles; n=20k dim=512 k=10 4-bit):** native LUT-ADC 76 µs/q (recall 0.785) ; polyfill GEMM 867 µs/q (recall 0.764) ; scalar 6 267 µs/q. **polyfill 11.4× slower than native** → TurboQuant deliberately trades the matmul away (LUT gather, not dot), so AMX accelerates the op it removed. Native LUT stays the production kernel; polyfill retained as AMX-ready baseline. Placement verdict: index → spine (lance-graph), kernel-math → ndarray (already owns clam/cam_pq/cascade/amx_matmul). The promising synergy is a Belichtungsmesser σ-gate on the LUT scan, NOT AMX.

**Verification:** `cargo build --lib -p turbovec` (fork-wired) green; `cargo test -p turbovec --features ndarray-simd search_polyfill` 2/2 green; `cargo test --manifest-path crates/lance-graph-turbovec/Cargo.toml` green; benchmark ran. Pre-existing upstream turbovec dead-code warning (`avx2_block_epilogue`) silenced minimally. Commits: one per repo on the branch.
## 2026-06-13 — SoaEnvelope binding for canonical NodeRow (the canon-as-substrate keystone)

**bardioc cross-session.** Closes punchlist item §7.2 of the 2026-06-13 SoA migration diff resolution doc — the canonical row layout is now bound to the envelope ABI. New `NodeRowPacket<'a>` wrapper in `canonical_node.rs` zero-copy-views a `&[NodeRow]` (each row `#[repr(C, align(64))]` at 512 bytes) as a row-strided LE byte packet through `SoaEnvelope`. Three-column descriptor table (`NODE_ROW_COLUMNS`): key (16 × u8 at offset 0), edges (16 × u8 at offset 16), value (480 × u8 at offset 32) — sums to `NODE_ROW_STRIDE = 512`. Internal structure within each slot stays canon-described (`NodeGuid` for the key, `EdgeBlock` for the edges, registry `ClassView` for the value carve-out) — the envelope contract is at the row-stride level, not the field-decomposition level. `NodeRowColumn` enum exports the column ordinals as `pub enum { Key=0, Edges=1, Value=2 }` for type-safe `column_le` access. `as_le_bytes()` is unsafe-free at the API but uses `core::slice::from_raw_parts` internally with a documented SAFETY note (NodeRow `#[repr(C)]` + locked size + canon-LE field accessors). +9 tests covering column-table layout, empty-packet verification, single-row zero-copy (pointer equality), multi-row byte length, `row_le`/`column_le` LE byte ranges, canon-LE key end-to-end, and `LAYOUT_VERSION` parity. `cargo test -p lance-graph-contract --lib`: **603/603 green** (+9); `cargo clippy -p lance-graph-contract --all-targets -- -D warnings`: clean. **No public-API drift in existing code** — `NodeRowPacket`, `NodeRowColumn`, `NODE_ROW_COLUMNS`, `NODE_ROW_STRIDE` are pure additions. This is the keystone the BindSpace dissolution sequence S1-S4 has been blocked behind: Lance's columnar I/O can now read the canonical row packet directly. Next step: MailboxSoA migrating from its column-major `[T; N]` layout to a row-strided `[NodeRow; N]` backing store that impls `SoaEnvelope` through this wrapper.
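
The column arithmetic in the SoaEnvelope entry above can be checked with a small sketch. This is not the code from `canonical_node.rs`: the `ColumnDesc` struct and the free function `column_le` are hypothetical stand-ins, and only the offsets, lengths, stride, and ordinals are taken from the log entry.

```rust
// Hypothetical stand-ins for the names in the log entry; the real
// definitions live in canonical_node.rs and may look different.
const NODE_ROW_STRIDE: usize = 512;

#[derive(Clone, Copy)]
struct ColumnDesc {
    offset: usize, // byte offset inside one row
    len: usize,    // byte length of the column slot
}

// key, edges, value: offsets and lengths as listed in the entry.
const NODE_ROW_COLUMNS: [ColumnDesc; 3] = [
    ColumnDesc { offset: 0, len: 16 },
    ColumnDesc { offset: 16, len: 16 },
    ColumnDesc { offset: 32, len: 480 },
];

#[derive(Clone, Copy)]
#[allow(dead_code)]
enum NodeRowColumn {
    Key = 0,
    Edges = 1,
    Value = 2,
}

/// Bytes of one column of one row inside a row-strided packet.
fn column_le(packet: &[u8], row: usize, col: NodeRowColumn) -> &[u8] {
    let d = NODE_ROW_COLUMNS[col as usize];
    let start = row * NODE_ROW_STRIDE + d.offset;
    &packet[start..start + d.len]
}

fn main() {
    // The three slots tile the row exactly: 16 + 16 + 480 = 512.
    let total: usize = NODE_ROW_COLUMNS.iter().map(|c| c.len).sum();
    assert_eq!(total, NODE_ROW_STRIDE);

    // Two zeroed rows; mark the first edges byte of row 1.
    let mut packet = vec![0u8; 2 * NODE_ROW_STRIDE];
    packet[NODE_ROW_STRIDE + 16] = 0xAB;
    assert_eq!(column_le(&packet, 1, NodeRowColumn::Edges)[0], 0xAB);
    assert_eq!(column_le(&packet, 1, NodeRowColumn::Value).len(), 480);
}
```

The sketch works on a plain byte buffer, which is the view the entry says `as_le_bytes()` exposes; it does not model the `#[repr(C, align(64))]` row type or the zero-copy cast.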

.claude/board/EPIPHANIES.md

Lines changed: 44 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,47 @@
## 2026-06-13 — E-TURBOVEC-AMX-WRONG-TOOL-1 — AMX accelerates the operation TurboQuant deliberately removed

**Status:** FINDING (benchmarked; AVX-512+VNNI host, `amx_available=false`).
**Confidence:** High — measured, with a mechanistic explanation that holds across the tier ladder.

**The finding.** turbovec (Google TurboQuant, arXiv 2504.19874) was brought
onto the spine as `crates/lance-graph-turbovec` (excluded standalone, path-deps
the AdaWorldAPI turbovec + ndarray forks). Its scan was *also* expressed as a
batched int8 GEMM through `ndarray::simd::matmul_i8_to_i32` (the polyfill that
ships AMX `TDPBUSD` → AVX-512 VPDPBUSD → AVX-VNNI → scalar). Measured
(`n=20 000, dim=512, k=10, 4-bit`):

| kernel | ns/query | recall@10 |
|---|---|---|
| native nibble-LUT ADC (AVX-512BW) | 76 073 | 0.785 |
| polyfill int8 GEMM (VPDPBUSD-zmm) | 866 899 | 0.764 |
| scalar reference | 6 267 279 | — |

The polyfill GEMM is **11.4× slower** than the native LUT, and native is 82×
faster than scalar. **Mechanism:** TurboQuant's design *trades the matmul away*
— LUT-ADC is an O(1) table gather per coordinate; the GEMM does the full
`dim`-length dot per (query,vector) pair. AMX is a tile *matrix-multiply* unit,
so it accelerates exactly the operation TurboQuant removed. The AMX tile (256
MAC/instr, ~4× VNNI) would bring the polyfill from 11.4× → ~3× slower — still a
loss. **A gather is not a matmul; no tile engine makes it one.**

**Consequences.**
- Keep the native LUT kernel as turbovec's production path. The polyfill is
  retained only as (a) proof the index is `ndarray::simd`-clean / AMX-ready and
  (b) a measured baseline. AMX is the right tool only where the workload is
  genuinely matmul-shaped (e.g. an exact-rerank LEAF over a tiny survivor set).
- Generalises the I-VSA-IDENTITIES register lesson to *kernels*: match the SIMD
  primitive to the algorithm's operation, not to peak MAC/instr. "Ship AMX via
  dispatch" is correct *plumbing* (the polyfill does ship it), but plumbing
  doesn't make the wrong-shaped op fast.
- The genuinely promising turbovec⇄bgz-tensor wiring is NOT AMX: it is a
  Belichtungsmesser σ-gated block reject on the LUT scan (turbovec has only a
  heap-min prune, no statistical threshold). See
  `crates/lance-graph-turbovec/KNOWLEDGE.md` §3B.

Cross-ref: `crates/lance-graph-turbovec/KNOWLEDGE.md` (full synergy map +
reproduce); `ndarray::hpc::amx_matmul::matmul_i8_to_i32` (the 4-tier ladder);
I-NOISE-FLOOR-JIRAK (the σ-threshold path inherits the Jirak obligation).

## 2026-06-12 — E-OUTER-BOUNDARY-IS-ORM-1 — there is only one boundary, and it is ontology-mediated

**Status:** FINDING (PR #487 tombstone commit makes this source-true; OGAR class + `SoaEnvelope` + Lance columnar I/O is the realized triangle).
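
The mechanism behind E-TURBOVEC-AMX-WRONG-TOOL-1 in this hunk (a table read per coordinate versus a multiply-accumulate per coordinate) can be illustrated with a toy scalar model. Everything here is an assumption made for illustration: the linear 16-entry codebook, the `i32` arithmetic, and all function names are hypothetical, and turbovec's real nibble-LUT kernel is SIMD code that this sketch does not reproduce. The last lines only redo the ratio arithmetic from the table's measured figures.

```rust
/// Toy 4-bit codebook: code c decodes to c - 8. Hypothetical values.
fn centroids() -> [i32; 16] {
    let mut c = [0i32; 16];
    for (i, v) in c.iter_mut().enumerate() {
        *v = i as i32 - 8;
    }
    c
}

/// GEMM-shaped scoring: decode each coordinate, then multiply-accumulate.
fn score_dot(query: &[i32], codes: &[u8]) -> i32 {
    let cb = centroids();
    query.iter().zip(codes).map(|(&q, &c)| q * cb[c as usize]).sum()
}

/// Per-query table: lut[j][c] = query[j] * centroid[c], built once.
fn build_lut(query: &[i32]) -> Vec<[i32; 16]> {
    let cb = centroids();
    query
        .iter()
        .map(|&q| {
            let mut row = [0i32; 16];
            for (c, slot) in row.iter_mut().enumerate() {
                *slot = q * cb[c];
            }
            row
        })
        .collect()
}

/// LUT-ADC-shaped scoring: one table read per coordinate, no multiply.
fn score_lut(lut: &[[i32; 16]], codes: &[u8]) -> i32 {
    lut.iter().zip(codes).map(|(row, &c)| row[c as usize]).sum()
}

fn main() {
    let query = [3, -1, 4, 1];
    let codes = [0u8, 15, 7, 8];
    let lut = build_lut(&query);
    // Both formulations produce the same score on this toy input.
    assert_eq!(score_dot(&query, &codes), score_lut(&lut, &codes));

    // Ratios recomputed from the measured ns/query figures in the table.
    let (native, polyfill, scalar) = (76_073.0_f64, 866_899.0, 6_267_279.0);
    println!("polyfill / native = {:.1}", polyfill / native);
    println!("scalar / native   = {:.0}", scalar / native);
}
```

The multiplies move into `build_lut`, which runs once per query, so the per-vector loop is left with reads and adds only. That is the part a matrix-multiply tile has nothing to accelerate.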

.claude/board/LATEST_STATE.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -10,6 +10,8 @@
---

> **2026-06-13 — shipped (autoattended, cross-repo)** (turbovec ⇄ ndarray): new excluded standalone crate **`crates/lance-graph-turbovec`** — Google TurboQuant (arXiv 2504.19874, the AdaWorldAPI `turbovec` fork) bridged onto the spine. `TurboVec` wraps `turbovec::TurboQuantIndex` with a `Kernel::{NativeLut, PolyfillGemm}` A/B switch. **Cross-repo (branch `claude/wonderful-hawking-lodtql` in turbovec + ndarray + lance-graph):** turbovec re-pointed from crates.io `ndarray 0.17` → the AdaWorldAPI fork (path, P0 forks-only; `blas` opt-in so default builds BLAS-free; `rust-toolchain.toml` = 1.95.0); new `turbovec::search_polyfill` (feature `ndarray-simd`) expresses scoring as a batched int8 GEMM via **`ndarray::simd::matmul_i8_to_i32`** (re-exported through `simd.rs` — AMX `TDPBUSD` tile → AVX-512 VPDPBUSD → AVX-VNNI → scalar, dispatched inside ndarray, zero intrinsics in turbovec). **Measured finding (E-TURBOVEC-AMX-WRONG-TOOL-1):** the polyfill GEMM is 11.4× SLOWER than the native nibble-LUT (TurboQuant trades the matmul away → AMX accelerates the op it removed); native LUT stays production, polyfill is the AMX-ready baseline. Placement: index → spine, kernel-math → ndarray (already owns clam/cam_pq/cascade/amx_matmul). Synergy map (HDR popcount stacking early-exit, Belichtungsmesser σ thresholds, preheating vs palette256) in `crates/lance-graph-turbovec/KNOWLEDGE.md`. Tests green in all three repos; benchmark via `examples/kernel_speed.rs`. NOT a merged PR yet (branch work).
>
> **2026-06-03 — hardened (follow-up after #460)** (D-HELIX-1 wiring): `crates/helix` now takes **ndarray as a MANDATORY, non-optional git dependency** (`git = AdaWorldAPI/ndarray @ master`), replacing the optional `path` dep + `ndarray-hpc` feature. Why: (1) codex P2 — an optional *path* dep still forces Cargo to read the local sibling manifest at resolution, so a clean checkout failed before feature selection; (2) directive "ndarray is mandatory for lance-graph". `simd.rs` always uses `ndarray::simd` (no scalar fallback); the self-contained fork → no import cycle. 63 unit + 6 doctests green; clippy/fmt clean. See E-HELIX-NDARRAY-MANDATORY.
>
> **2026-06-03 — shipped (autoattended)** (D-HELIX-1): new standalone crate `crates/helix` — the golden-spiral **Place/Residue** codec from the user's `KNOWLEDGE.md`. HHTL = deterministic PLACE; helix = orthogonal RESIDUE. Pipeline: equal-area `√u` hemisphere placement (`HemispherePoint`) → stride-4-over-17 `CurveRuler` coupling → Fisher-Z/arctanh `Similarity` alignment → EULER_GAMMA hand-off → 256-palette `RollingFloor` quantise (occupancy-drift + version stamp) → 3-byte `ResidueEdge` endpoint pair; metric-safe L1 via 256×256 `DistanceLut` (`distance_adaptive`) + non-metric byte-Hamming `distance_heuristic`. `prove()` closes the 2-D discrepancy Open Item (companion to `jc::weyl`). Zero-dep default (`edition 2021`, empty `[workspace]`, root `exclude`); optional `ndarray-hpc` feature routes batch Fisher-Z through `ndarray::simd::simd_ln_f32`. **61 unit + 6 doctests green** on BOTH feature configs; clippy -D warnings + fmt clean. ~80% overlaps existing CERTIFIED primitives by design (clean-room, user-directed) — see `crates/helix/KNOWLEDGE.md` § Overlap & Consolidation + E-HELIX-OVERLAP + TD-HELIX-OVERLAP-1. Branch claude/gallant-rubin-Y9pQd.

Cargo.toml

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -60,6 +60,13 @@ exclude = [
    # machinery to a single-tier unigram pipeline (see crate README).
    # Verified via `cargo test --manifest-path crates/quasicryth-research/Cargo.toml`.
    "crates/quasicryth-research",
    # TurboQuant ANN index (Google arXiv 2504.19874) bridged onto the spine —
    # standalone, path-deps the AdaWorldAPI turbovec + ndarray forks. Kept out
    # of the main graph so turbovec's faer/statrs tree never enters the
    # deterministic lance-graph compile path. Both scoring kernels (native
    # nibble-LUT ADC + ndarray::simd::matmul_i8_to_i32 polyfill GEMM) compiled.
    # Verify via `cargo test --manifest-path crates/lance-graph-turbovec/Cargo.toml`.
    "crates/lance-graph-turbovec",
]
resolver = "2"

crates/bgz17/src/simd.rs

Lines changed: 78 additions & 0 deletions
@@ -55,6 +55,11 @@ pub fn batch_palette_distance(
    let level = detect_simd();

    match level {
        #[cfg(target_arch = "x86_64")]
        SimdLevel::Avx512 => {
            // Safety: detect_simd() confirmed AVX-512F is available.
            unsafe { avx512_batch(dm_data, k, query, candidates, out) };
        }
        #[cfg(target_arch = "x86_64")]
        SimdLevel::Avx2 => {
            // Safety: detect_simd() confirmed AVX2 is available.
@@ -138,6 +143,79 @@ unsafe fn avx2_batch(dm_data: &[u16], k: usize, query: u8, candidates: &[u8], ou
    }
}

/// AVX-512 gather batch lookup: process 16 lookups at a time using _mm512_i32gather_epi32.
///
/// Widened analogue of `avx2_batch`. The distance matrix stores u16 values; we
/// gather i32 words from the u16 base pointer with byte-scale 2 (each u16 is 2
/// bytes), then mask off the high u16 of each lane. The low u16 of each gathered
/// i32 is exactly `dm[query][candidate]`, so the result is identical to
/// `scalar_batch` (and to `avx2_batch`). The 16-wide remainder falls back to
/// scalar, matching the AVX2 path's tail handling.
///
/// # Safety
/// Caller must ensure AVX-512F is available (checked via `is_x86_feature_detected!("avx512f")`).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx512f")]
unsafe fn avx512_batch(dm_data: &[u16], k: usize, query: u8, candidates: &[u8], out: &mut [u16]) {
    use core::arch::x86_64::*;

    let row_offset = query as usize * k;
    let row_ptr = dm_data.as_ptr().add(row_offset);
    let n = candidates.len();

    // Process 16 candidates at a time
    let chunks = n / 16;
    let remainder = n % 16;

    for chunk in 0..chunks {
        let base = chunk * 16;

        // Build index vector: candidate indices as i32 (lane 0 = candidates[base]).
        let indices = _mm512_set_epi32(
            candidates[base + 15] as i32,
            candidates[base + 14] as i32,
            candidates[base + 13] as i32,
            candidates[base + 12] as i32,
            candidates[base + 11] as i32,
            candidates[base + 10] as i32,
            candidates[base + 9] as i32,
            candidates[base + 8] as i32,
            candidates[base + 7] as i32,
            candidates[base + 6] as i32,
            candidates[base + 5] as i32,
            candidates[base + 4] as i32,
            candidates[base + 3] as i32,
            candidates[base + 2] as i32,
            candidates[base + 1] as i32,
            candidates[base] as i32,
        );

        // Gather u16 values via i32 gather on the u16 array. With scale=2 on the
        // u16 base pointer, lane j reads the i32 at byte offset candidates[..]*2,
        // i.e. the target u16 (low half) plus the next u16 (high half). Identical
        // trick to avx2_batch, widened to 16 lanes.
        let gathered = _mm512_i32gather_epi32::<2>(indices, row_ptr as *const i32);

        // Mask to extract only the low u16 from each i32 lane.
        let mask = _mm512_set1_epi32(0x0000FFFF);
        let masked = _mm512_and_si512(gathered, mask);

        // Extract and store individually (no direct i32→u16 pack across 16 lanes).
        let mut tmp = [0i32; 16];
        _mm512_storeu_si512(tmp.as_mut_ptr() as *mut __m512i, masked);

        for i in 0..16 {
            out[base + i] = tmp[i] as u16;
        }
    }

    // Scalar fallback for remaining elements
    let tail_start = chunks * 16;
    for i in 0..remainder {
        out[tail_start + i] = dm_data[row_offset + candidates[tail_start + i] as usize];
    }
}

/// Batch SPO distance: combined S+P+O distance for multiple candidates.
///
/// For each candidate i:
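
The gather-then-mask trick in `avx512_batch` can be modelled per lane in safe scalar Rust. This is a sketch for reasoning about the technique, not code from the commit: `gather_lane_model` is a hypothetical helper, and it assumes the little-endian layout of x86_64, where the target u16 occupies the low half of the loaded 32-bit word.

```rust
/// Scalar model of one gather lane: read the u16 at `idx` together with its
/// right-hand neighbour as one little-endian 32-bit word, then keep only the
/// low 16 bits. Returns None when the 4-byte read would leave the slice.
fn gather_lane_model(data: &[u16], idx: usize) -> Option<u16> {
    let lo = *data.get(idx)? as u32;
    // The 32-bit load also touches the next u16, even though it is masked off.
    let hi = *data.get(idx + 1)? as u32;
    let word = lo | (hi << 16);
    Some((word & 0x0000_FFFF) as u16)
}

fn main() {
    let data: Vec<u16> = vec![10, 20, 30, 40];

    // Wherever a neighbour exists, the masked word equals the plain lookup.
    for idx in 0..data.len() - 1 {
        assert_eq!(gather_lane_model(&data, idx), Some(data[idx]));
    }

    // The final element has no neighbour inside the slice.
    assert_eq!(gather_lane_model(&data, data.len() - 1), None);
}
```

In the real kernel the neighbour of a row's last entry is normally the first entry of the next row, so the extra two bytes stay inside `dm_data`. The exception is the final element of the whole matrix: if `dm_data` holds exactly `k * k` entries, a lookup with `query = k - 1` and candidate `k - 1` would load two bytes past the end. The diff does not show whether the buffer is padded or that index is excluded, so this is worth confirming for both the AVX2 and AVX-512 paths.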
