Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 12 additions & 0 deletions .claude/board/AGENT_LOG.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,15 @@
## 2026-06-13 — turbovec ⇄ ndarray integration: fork-wired + ndarray::simd polyfill GEMM + measured AMX-vs-LUT

**Main thread (Opus 4.8 1M) + 1 Opus general-purpose agent (bgz-tensor synergy map).** User: "create a crate in lance-graph for turbovec and check synergies; route SIMD through ndarray::simd (simd.rs→simd_amx/avx512/ops/soa); the polyfill does the work, ndarray ships AMX via byte-asm dispatch; pin rust 1.95." Cross-repo, branch `claude/wonderful-hawking-lodtql` in all three repos.

**Shipped:**
- **turbovec** (the AdaWorldAPI fork of Google TurboQuant, arXiv 2504.19874): re-pointed `ndarray = "0.17"` (crates.io) → the AdaWorldAPI fork (`path = ../../ndarray`, `default-features=false, features=["std"]`) — P0 forks-only; the fork IS rust-ndarray 0.17.2 + HPC/SIMD so the array API is unchanged AND `ndarray::simd` is reachable. `blas` made opt-in (build.rs gates the OpenBLAS link on `CARGO_FEATURE_BLAS`; default uses pure-Rust matrixmultiply for the one encode `.dot()`). Added `rust-toolchain.toml` = 1.95.0. New `src/search_polyfill.rs` (feature `ndarray-simd`): TurboQuant scoring as a batched int8 GEMM `Q·X̂ᵀ` via `ndarray::simd::matmul_i8_to_i32` — zero raw intrinsics; ndarray picks AMX tile / VPDPBUSD / AVX-VNNI / scalar. `FORCE_SCALAR_FALLBACK` exposed under new `bench-internals` feature. `examples/kernel_speed.rs` (native vs polyfill vs scalar + recall). 2 polyfill tests green.
- **ndarray**: re-exported `hpc::amx_matmul::{matmul_i8_to_i32, amx_available}` through `simd.rs` (std-gated) so the AMX int8-GEMM ladder is reachable via the canonical `ndarray::simd::*` consumer surface (W1a). Additive; no behaviour change.
- **lance-graph**: new excluded standalone crate `crates/lance-graph-turbovec` (path-deps both forks) — `TurboVec` bridge with a `Kernel::{NativeLut, PolyfillGemm}` A/B switch + lazy reconstruction cache + `polyfill_backend()` report; 2 tests green. `KNOWLEDGE.md` = full synergy map. Root Cargo.toml `exclude` updated. EPIPHANIES E-TURBOVEC-AMX-WRONG-TOOL-1 + this entry + LATEST_STATE.

**Measured (AVX-512+VNNI host, no AMX tiles; n=20k dim=512 k=10 4-bit):** native LUT-ADC 76 µs/q (recall 0.785) ; polyfill GEMM 867 µs/q (recall 0.764) ; scalar 6 267 µs/q. **polyfill 11.4× slower than native** → TurboQuant deliberately trades the matmul away (LUT gather, not dot), so AMX accelerates the op it removed. Native LUT stays the production kernel; polyfill retained as AMX-ready baseline. Placement verdict: index → spine (lance-graph), kernel-math → ndarray (already owns clam/cam_pq/cascade/amx_matmul). The promising synergy is a Belichtungsmesser σ-gate on the LUT scan, NOT AMX.

**Verification:** `cargo build --lib -p turbovec` (fork-wired) green; `cargo test -p turbovec --features ndarray-simd search_polyfill` 2/2 green; `cargo test --manifest-path crates/lance-graph-turbovec/Cargo.toml` green; benchmark ran. Pre-existing upstream turbovec dead-code warning (`avx2_block_epilogue`) silenced minimally. Commits: one per repo on the branch.
## 2026-06-13 — SoaEnvelope binding for canonical NodeRow (the canon-as-substrate keystone)

**bardioc cross-session.** Closes punchlist item §7.2 of the 2026-06-13 SoA migration diff resolution doc — the canonical row layout is now bound to the envelope ABI. New `NodeRowPacket<'a>` wrapper in `canonical_node.rs` zero-copy-views a `&[NodeRow]` (each row `#[repr(C, align(64))]` at 512 bytes) as a row-strided LE byte packet through `SoaEnvelope`. Three-column descriptor table (`NODE_ROW_COLUMNS`): key (16 × u8 at offset 0), edges (16 × u8 at offset 16), value (480 × u8 at offset 32) — sums to `NODE_ROW_STRIDE = 512`. Internal structure within each slot stays canon-described (`NodeGuid` for the key, `EdgeBlock` for the edges, registry `ClassView` for the value carve-out) — the envelope contract is at the row-stride level, not the field-decomposition level. `NodeRowColumn` enum exports the column ordinals as `pub enum { Key=0, Edges=1, Value=2 }` for type-safe `column_le` access. `as_le_bytes()` is unsafe-free at the API but uses `core::slice::from_raw_parts` internally with a documented SAFETY note (NodeRow `#[repr(C)]` + locked size + canon-LE field accessors). +9 tests covering column-table layout, empty-packet verification, single-row zero-copy (pointer equality), multi-row byte length, `row_le`/`column_le` LE byte ranges, canon-LE key end-to-end, and `LAYOUT_VERSION` parity. `cargo test -p lance-graph-contract --lib`: **603/603 green** (+9); `cargo clippy -p lance-graph-contract --all-targets -- -D warnings`: clean. **No public-API drift in existing code** — `NodeRowPacket`, `NodeRowColumn`, `NODE_ROW_COLUMNS`, `NODE_ROW_STRIDE` are pure additions. This is the keystone the BindSpace dissolution sequence S1-S4 has been blocked behind: Lance's columnar I/O can now read the canonical row packet directly. Next step: MailboxSoA migrating from its column-major `[T; N]` layout to a row-strided `[NodeRow; N]` backing store that impls `SoaEnvelope` through this wrapper.
Expand Down
44 changes: 44 additions & 0 deletions .claude/board/EPIPHANIES.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,47 @@
## 2026-06-13 — E-TURBOVEC-AMX-WRONG-TOOL-1 — AMX accelerates the operation TurboQuant deliberately removed

**Status:** FINDING (benchmarked; AVX-512+VNNI host, `amx_available=false`).
**Confidence:** High — measured, with a mechanistic explanation that holds across the tier ladder.

**The finding.** turbovec (Google TurboQuant, arXiv 2504.19874) was brought
onto the spine as `crates/lance-graph-turbovec` (excluded standalone, path-deps
the AdaWorldAPI turbovec + ndarray forks). Its scan was *also* expressed as a
batched int8 GEMM through `ndarray::simd::matmul_i8_to_i32` (the polyfill that
ships AMX `TDPBUSD` → AVX-512 VPDPBUSD → AVX-VNNI → scalar). Measured
(`n=20 000, dim=512, k=10, 4-bit`):

| kernel | ns/query | recall@10 |
|---|---|---|
| native nibble-LUT ADC (AVX-512BW) | 76 073 | 0.785 |
| polyfill int8 GEMM (VPDPBUSD-zmm) | 866 899 | 0.764 |
| scalar reference | 6 267 279 | — |

The polyfill GEMM is **11.4× slower** than the native LUT, and native is 82×
faster than scalar. **Mechanism:** TurboQuant's design *trades the matmul away*
— LUT-ADC is an O(1) table gather per coordinate; the GEMM does the full
`dim`-length dot per (query,vector) pair. AMX is a tile *matrix-multiply* unit,
so it accelerates exactly the operation TurboQuant removed. The AMX tile (256
MAC/instr, ~4× VNNI) would bring the polyfill from 11.4× → ~3× slower — still a
loss. **A gather is not a matmul; no tile engine makes it one.**

**Consequences.**
- Keep the native LUT kernel as turbovec's production path. The polyfill is
retained only as (a) proof the index is `ndarray::simd`-clean / AMX-ready and
(b) a measured baseline. AMX is the right tool only where the workload is
genuinely matmul-shaped (e.g. an exact-rerank LEAF over a tiny survivor set).
- Generalises the I-VSA-IDENTITIES register lesson to *kernels*: match the SIMD
primitive to the algorithm's operation, not to peak MAC/instr. "Ship AMX via
dispatch" is correct *plumbing* (the polyfill does ship it), but plumbing
doesn't make the wrong-shaped op fast.
- The genuinely promising turbovec⇄bgz-tensor wiring is NOT AMX: it is a
Belichtungsmesser σ-gated block reject on the LUT scan (turbovec has only a
heap-min prune, no statistical threshold). See
`crates/lance-graph-turbovec/KNOWLEDGE.md` §3B.

Cross-ref: `crates/lance-graph-turbovec/KNOWLEDGE.md` (full synergy map +
reproduce); `ndarray::hpc::amx_matmul::matmul_i8_to_i32` (the 4-tier ladder);
I-NOISE-FLOOR-JIRAK (the σ-threshold path inherits the Jirak obligation).

## 2026-06-12 — E-OUTER-BOUNDARY-IS-ORM-1 — there is only one boundary, and it is ontology-mediated

**Status:** FINDING (PR #487 tombstone commit makes this source-true; OGAR class + `SoaEnvelope` + Lance columnar I/O is the realized triangle).
Expand Down
2 changes: 2 additions & 0 deletions .claude/board/LATEST_STATE.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,8 @@

---

> **2026-06-13 — shipped (autoattended, cross-repo)** (turbovec ⇄ ndarray): new excluded standalone crate **`crates/lance-graph-turbovec`** — Google TurboQuant (arXiv 2504.19874, the AdaWorldAPI `turbovec` fork) bridged onto the spine. `TurboVec` wraps `turbovec::TurboQuantIndex` with a `Kernel::{NativeLut, PolyfillGemm}` A/B switch. **Cross-repo (branch `claude/wonderful-hawking-lodtql` in turbovec + ndarray + lance-graph):** turbovec re-pointed from crates.io `ndarray 0.17` → the AdaWorldAPI fork (path, P0 forks-only; `blas` opt-in so default builds BLAS-free; `rust-toolchain.toml` = 1.95.0); new `turbovec::search_polyfill` (feature `ndarray-simd`) expresses scoring as a batched int8 GEMM via **`ndarray::simd::matmul_i8_to_i32`** (re-exported through `simd.rs` — AMX `TDPBUSD` tile → AVX-512 VPDPBUSD → AVX-VNNI → scalar, dispatched inside ndarray, zero intrinsics in turbovec). **Measured finding (E-TURBOVEC-AMX-WRONG-TOOL-1):** the polyfill GEMM is 11.4× SLOWER than the native nibble-LUT (TurboQuant trades the matmul away → AMX accelerates the op it removed); native LUT stays production, polyfill is the AMX-ready baseline. Placement: index → spine, kernel-math → ndarray (already owns clam/cam_pq/cascade/amx_matmul). Synergy map (HDR popcount stacking early-exit, Belichtungsmesser σ thresholds, preheating vs palette256) in `crates/lance-graph-turbovec/KNOWLEDGE.md`. Tests green in all three repos; benchmark via `examples/kernel_speed.rs`. NOT a merged PR yet (branch work).
>
> **2026-06-03 — hardened (follow-up after #460)** (D-HELIX-1 wiring): `crates/helix` now takes **ndarray as a MANDATORY, non-optional git dependency** (`git = AdaWorldAPI/ndarray @ master`), replacing the optional `path` dep + `ndarray-hpc` feature. Why: (1) codex P2 — an optional *path* dep still forces Cargo to read the local sibling manifest at resolution, so a clean checkout failed before feature selection; (2) directive "ndarray is mandatory for lance-graph". `simd.rs` always uses `ndarray::simd` (no scalar fallback); the self-contained fork → no import cycle. 63 unit + 6 doctests green; clippy/fmt clean. See E-HELIX-NDARRAY-MANDATORY.
>
> **2026-06-03 — shipped (autoattended)** (D-HELIX-1): new standalone crate `crates/helix` — the golden-spiral **Place/Residue** codec from the user's `KNOWLEDGE.md`. HHTL = deterministic PLACE; helix = orthogonal RESIDUE. Pipeline: equal-area `√u` hemisphere placement (`HemispherePoint`) → stride-4-over-17 `CurveRuler` coupling → Fisher-Z/arctanh `Similarity` alignment → EULER_GAMMA hand-off → 256-palette `RollingFloor` quantise (occupancy-drift + version stamp) → 3-byte `ResidueEdge` endpoint pair; metric-safe L1 via 256×256 `DistanceLut` (`distance_adaptive`) + non-metric byte-Hamming `distance_heuristic`. `prove()` closes the 2-D discrepancy Open Item (companion to `jc::weyl`). Zero-dep default (`edition 2021`, empty `[workspace]`, root `exclude`); optional `ndarray-hpc` feature routes batch Fisher-Z through `ndarray::simd::simd_ln_f32`. **61 unit + 6 doctests green** on BOTH feature configs; clippy -D warnings + fmt clean. ~80% overlaps existing CERTIFIED primitives by design (clean-room, user-directed) — see `crates/helix/KNOWLEDGE.md` § Overlap & Consolidation + E-HELIX-OVERLAP + TD-HELIX-OVERLAP-1. Branch claude/gallant-rubin-Y9pQd.
Expand Down
7 changes: 7 additions & 0 deletions Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -60,6 +60,13 @@ exclude = [
# machinery to a single-tier unigram pipeline (see crate README).
# Verified via `cargo test --manifest-path crates/quasicryth-research/Cargo.toml`.
"crates/quasicryth-research",
# TurboQuant ANN index (Google arXiv 2504.19874) bridged onto the spine —
# standalone, path-deps the AdaWorldAPI turbovec + ndarray forks. Kept out
# of the main graph so turbovec's faer/statrs tree never enters the
# deterministic lance-graph compile path. Both scoring kernels (native
# nibble-LUT ADC + ndarray::simd::matmul_i8_to_i32 polyfill GEMM) compiled.
# Verify via `cargo test --manifest-path crates/lance-graph-turbovec/Cargo.toml`.
"crates/lance-graph-turbovec",
]
resolver = "2"

Expand Down
78 changes: 78 additions & 0 deletions crates/bgz17/src/simd.rs
Original file line number Diff line number Diff line change
Expand Up @@ -55,6 +55,11 @@ pub fn batch_palette_distance(
let level = detect_simd();

match level {
#[cfg(target_arch = "x86_64")]
SimdLevel::Avx512 => {
// Safety: detect_simd() confirmed AVX-512F is available.
unsafe { avx512_batch(dm_data, k, query, candidates, out) };
}
#[cfg(target_arch = "x86_64")]
SimdLevel::Avx2 => {
// Safety: detect_simd() confirmed AVX2 is available.
Expand Down Expand Up @@ -138,6 +143,79 @@ unsafe fn avx2_batch(dm_data: &[u16], k: usize, query: u8, candidates: &[u8], ou
}
}

/// AVX-512 gather batch lookup: process 16 lookups at a time using _mm512_i32gather_epi32.
///
/// Widened analogue of `avx2_batch`. The distance matrix stores u16 values; we
/// gather i32 words from the u16 base pointer with byte-scale 2 (each u16 is 2
/// bytes), then mask off the high u16 of each lane. The low u16 of each gathered
/// i32 is exactly `dm[query][candidate]`, so the result is identical to
/// `scalar_batch` (and to `avx2_batch`). The 16-wide remainder falls back to
/// scalar, matching the AVX2 path's tail handling.
///
/// # Safety
/// Caller must ensure AVX-512F is available (checked via `is_x86_feature_detected!("avx512f")`).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx512f")]
unsafe fn avx512_batch(dm_data: &[u16], k: usize, query: u8, candidates: &[u8], out: &mut [u16]) {
use core::arch::x86_64::*;

let row_offset = query as usize * k;
let row_ptr = dm_data.as_ptr().add(row_offset);
let n = candidates.len();

// Process 16 candidates at a time
let chunks = n / 16;
let remainder = n % 16;

for chunk in 0..chunks {
let base = chunk * 16;

// Build index vector: candidate indices as i32 (lane 0 = candidates[base]).
let indices = _mm512_set_epi32(
candidates[base + 15] as i32,
candidates[base + 14] as i32,
candidates[base + 13] as i32,
candidates[base + 12] as i32,
candidates[base + 11] as i32,
candidates[base + 10] as i32,
candidates[base + 9] as i32,
candidates[base + 8] as i32,
candidates[base + 7] as i32,
candidates[base + 6] as i32,
candidates[base + 5] as i32,
candidates[base + 4] as i32,
candidates[base + 3] as i32,
candidates[base + 2] as i32,
candidates[base + 1] as i32,
candidates[base] as i32,
);

// Gather u16 values via i32 gather on the u16 array. With scale=2 on the
// u16 base pointer, lane j reads the i32 at byte offset candidates[..]*2,
// i.e. the target u16 (low half) plus the next u16 (high half). Identical
// trick to avx2_batch, widened to 16 lanes.
let gathered = _mm512_i32gather_epi32::<2>(indices, row_ptr as *const i32);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid reading past the distance matrix row

On AVX-512 hosts this newly enabled path runs for candidate batches of at least 16. Because each lane performs a 32-bit gather from a u16 row and then masks off the high half, a lookup for the final entry of the final row (for example query == k - 1 and candidate == k - 1) reads two bytes past dm_data; the previous scalar fallback did not. This can fault at a page boundary or invoke UB despite only using the low 16 bits, so the boundary entry needs scalar handling/padding or a true 16-bit-safe load path.

Useful? React with 👍 / 👎.


// Mask to extract only the low u16 from each i32 lane.
let mask = _mm512_set1_epi32(0x0000FFFF);
let masked = _mm512_and_si512(gathered, mask);

// Extract and store individually (no direct i32→u16 pack across 16 lanes).
let mut tmp = [0i32; 16];
_mm512_storeu_si512(tmp.as_mut_ptr() as *mut __m512i, masked);

for i in 0..16 {
out[base + i] = tmp[i] as u16;
}
}

// Scalar fallback for remaining elements
let tail_start = chunks * 16;
for i in 0..remainder {
out[tail_start + i] = dm_data[row_offset + candidates[tail_start + i] as usize];
}
}

/// Batch SPO distance: combined S+P+O distance for multiple candidates.
///
/// For each candidate i:
Expand Down
Loading
Loading