Commit 6311b52

Merge pull request #269 from AdaWorldAPI/claude/distance-trait-and-simd-hamming
feat: Distance trait + SIMD Hamming/cosine wiring + PaletteDistanceTable + Dockerfile docs
2 parents 4ee2172 + 6899390 commit 6311b52

15 files changed

Lines changed: 531 additions & 53 deletions

.claude/board/EPIPHANIES.md

Lines changed: 87 additions & 0 deletions
@@ -2973,3 +2973,90 @@ The architecture's five consumer perspectives are not layers — they're project
**SoA vs Functional is not a choice — it's a WHERE.** BindSpace is SoA (columnar storage for SIMD). The algebra on it is Functional (methods on carriers). The SoA carries the state; the Functional methods transform it. Both exist simultaneously on the same data. The "struct of arrays vs object thinks for itself" tension resolves as: the ARRAY is the SoA, the ELEMENT (row, trajectory, fingerprint) thinks for itself via methods.

Cross-ref: CLAUDE.md §The Stance (AGI-as-glove, SoA columns ARE the AGI surface), lab-vs-canonical-surface.md (I1-I11 invariants), ExternalMembrane (contract::external_membrane), BindSpace (cognitive-shader-driver::bindspace).

## 2026-04-26 — FINDING: distance dispatch must be type-intrinsic, not crate-boundary-crossing

**Status:** FINDING
**Owner scope:** @family-codec-smith, @truth-architect, @host-glove-designer

The struct-of-arrays (BindSpace, RenderFrame, Arrow columns) carries heterogeneous
fingerprint types that each need a DIFFERENT distance function:

| Type | Distance | Where it lives | Notes |
|---|---|---|---|
| `Binary16K = [u64; 256]` | Hamming (popcount of XOR) | `ndarray::hpc::bitwise::hamming_distance_raw` | 16384-bit, SIMD VPOPCNTDQ |
| `Vsa16kF32 = [f32; 16_384]` | Cosine → FisherZ transform | `ndarray::hpc::heel_f64x8::cosine_f64_simd` | f32 dot/norm via F32x16 FMA |
| `CamPqCode = [u8; 6]` | ADC (asymmetric distance computation) | `ndarray::hpc::cam_pq::adc_distance` | Precomputed distance tables, O(1) |
| `PaletteEdge = [u8; 3]` | Palette L1 (lookup table) | `ndarray::hpc::palette_distance::SpoDistanceMatrices::distance` | bgz17 256×256 table, 1.8 ns |
| `Base17 = [u8; 17]` | Palette nearest (codebook search) | `bgz17::Palette::nearest` | 256 centroids, should use precomputed table |
| `HighHeelBGZ` container | Cascade (HHTL skip → palette → ADC fallback) | `ndarray::hpc::cascade` + `bgz-tensor::hhtl_cache` | Multi-level, route by `RouteAction` |

**The problem:** When the columns of a SoA hold different types (e.g., one column is
Binary16K, another is CamPqCode), the distance dispatch currently happens at the call
site: the caller must know which distance function to use. This works inside a single
crate, but when the SoA lives in crate A (e.g., `cognitive-shader-driver::BindSpace`)
and the distance kernel lives in crate B (e.g., `ndarray::hpc::bitwise`), every call
crosses a crate boundary. That boundary is zero-cost for `#[inline]` functions, but NOT
zero-cost if the function is called through a trait object (`dyn DistanceFn`) or any
other form of dynamic dispatch.

**The solution — type-intrinsic dispatch, not dynamic dispatch:**

The distance function should be a method ON the carrier type, not a free function
called FROM the SoA consumer. This follows the "object speaks for itself" doctrine
(CLAUDE.md §The Click):

```rust
// WRONG — caller must know the distance type:
let d = hamming_distance_raw(fp_a.as_bytes(), fp_b.as_bytes());  // crate boundary

// RIGHT — the type carries its own distance:
let d = fp_a.distance(&fp_b);  // monomorphized, inlined, zero boundary tax
```

The contract already has `CodecRoute: Passthrough | CamPq` which names the regime.
What's missing is a `Distance` trait that each carrier implements:

```rust
pub trait Distance: Sized {
    fn distance(&self, other: &Self) -> u32;
    fn similarity(&self, other: &Self) -> f32 {
        1.0 - (self.distance(other) as f32 / Self::MAX_DISTANCE as f32)
    }
    const MAX_DISTANCE: u32;
}
```

Implementations:
- `impl Distance for [u64; 256]` → `hamming_distance_raw` (inline, SIMD)
- `impl Distance for CamPqCode` → ADC lookup (precomputed table ref)
- `impl Distance for PaletteEdge` → palette L1 table lookup
- `impl Distance for Vsa16kF32` → cosine → FisherZ (F32x16 FMA)

The trait monomorphizes at compile time: no dynamic dispatch, no crate boundary
tax. A consumer sweeping two columns `a` and `b` of the same carrier type iterates
with `a.iter().zip(b).map(|(x, y)| x.distance(y))`, and the correct distance
function is selected by TYPE, not by runtime enum match.

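The trait and the column sweep described above can be sketched in a self-contained form. This is an illustration only, not code from the commit: the Hamming kernel is a portable `count_ones` loop standing in for `hamming_distance_raw`, and `ByteCode3` and `sweep` are hypothetical names introduced here (the real `PaletteEdge` distance is a table lookup, not a byte difference).

```rust
/// Sketch of the proposed trait; shape taken from the note above.
pub trait Distance: Sized {
    /// Largest value `distance` can return for this carrier.
    const MAX_DISTANCE: u32;

    fn distance(&self, other: &Self) -> u32;

    /// 1.0 for identical values, 0.0 at `MAX_DISTANCE`.
    fn similarity(&self, other: &Self) -> f32 {
        1.0 - (self.distance(other) as f32 / Self::MAX_DISTANCE as f32)
    }
}

/// 16_384-bit binary fingerprint: 256 words of 64 bits.
impl Distance for [u64; 256] {
    const MAX_DISTANCE: u32 = 16_384;

    #[inline]
    fn distance(&self, other: &Self) -> u32 {
        // Hamming distance: popcount of XOR, word by word.
        // Portable stand-in for a SIMD kernel (assumption of this sketch).
        self.iter().zip(other.iter()).map(|(a, b)| (a ^ b).count_ones()).sum()
    }
}

/// Hypothetical 3-byte carrier, compared by summed per-byte absolute difference.
pub struct ByteCode3(pub [u8; 3]);

impl Distance for ByteCode3 {
    const MAX_DISTANCE: u32 = 3 * 255;

    #[inline]
    fn distance(&self, other: &Self) -> u32 {
        self.0.iter().zip(other.0.iter()).map(|(a, b)| a.abs_diff(*b) as u32).sum()
    }
}

/// Generic sweep over two columns of the same carrier type.
/// Monomorphized per `T`, so no `dyn` and no runtime match on a type tag.
pub fn sweep<T: Distance>(left: &[T], right: &[T]) -> Vec<u32> {
    left.iter().zip(right).map(|(a, b)| a.distance(b)).collect()
}
```

Calling `sweep` on a `&[[u64; 256]]` column and on a `&[ByteCode3]` column compiles to two separate specialized functions, which is the property the note relies on.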
**Where this trait should live:** `lance-graph-contract` (zero deps). The
implementations live in ndarray (for SIMD kernels) or in the carrier crate
(for precomputed tables). The contract defines the interface; ndarray provides
the hardware acceleration; the SoA consumer never needs to know which distance
kernel runs.

**Hard-coded dispatch within the same crate is fine** — when `BindSpace` calls
`hamming_distance_raw` on its `content` column, that's a direct function call
into ndarray, monomorphized and inlined. The problem only arises if we try to
make the SoA generic over distance type via `dyn` trait objects. Don't do that.
Keep the dispatch compile-time via generics or type-specific methods. The SoA
pays zero boundary tax because Rust's monomorphization erases the crate boundary.

**FisherZ note:** Cosine similarity ∈ [-1, 1] is not on a linear scale, so
averaging raw values is biased. The FisherZ transform `z = atanh(r)` maps it to
an approximately normally distributed variable that can be averaged, then
`r = tanh(z)` maps back. This matters when the SoA accumulates similarities
across columns (e.g., weighted multi-column distance).
The `Distance` trait should expose `fn similarity_z(&self, other: &Self) -> f32`
for the FisherZ-transformed variant, defaulting to `atanh(similarity())`.
Because `atanh` diverges at r = ±1, the input has to be clamped just inside
that range (identical values otherwise yield an infinite `z`).

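A minimal sketch of the averaging step described in the FisherZ note, written against plain `f32` values. The names `fisher_z_inverse` and `mean_similarity_fisher` appear in this commit's debt log, but the signatures, the `fisher_z` helper, and the clamp limit below are assumptions of this sketch, not the shipped API.

```rust
/// Fisher z-transform. The input is clamped so that r = ±1 stays finite
/// (the clamp limit is an arbitrary choice for this sketch).
pub fn fisher_z(r: f32) -> f32 {
    const LIMIT: f32 = 0.999_999;
    r.clamp(-LIMIT, LIMIT).atanh()
}

/// Inverse transform: maps an averaged z back to a similarity in (-1, 1).
pub fn fisher_z_inverse(z: f32) -> f32 {
    z.tanh()
}

/// Averages similarities in z-space and maps the mean back.
/// Returns `None` for an empty slice.
pub fn mean_similarity_fisher(similarities: &[f32]) -> Option<f32> {
    if similarities.is_empty() {
        return None;
    }
    let sum_z: f32 = similarities.iter().map(|&r| fisher_z(r)).sum();
    Some(fisher_z_inverse(sum_z / similarities.len() as f32))
}
```

For inputs that differ, the z-space mean is pulled toward the stronger similarity: averaging 0.0 and 0.9 this way gives roughly 0.63, where the arithmetic mean would give 0.45.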
Cross-ref: CLAUDE.md §The Click ("object speaks for itself"), I1 Codec Regime
Split (`CodecRoute`), `contract::cam::DistanceTableProvider` (existing trait for
ADC), `ndarray::hpc::bitwise::hamming_distance_raw`, `ndarray::hpc::palette_distance`.

.claude/board/TECH_DEBT.md

Lines changed: 53 additions & 0 deletions
@@ -1071,3 +1071,56 @@ Cross-ref: `container_bs/dn_redis.rs`; `callcenter-membrane-v1.md` §§595–803
| Diagnostic | TD-INT-11 |

All 14 items are additive (add call site). Zero items require type creation or code deletion.

## 2026-04-26 — TD-DIST-1: Distance trait missing from contract (type-intrinsic dispatch)

**Status:** Open
**Severity:** Medium (no runtime cost today — hard-coded dispatch works — but blocks
generic SoA distance sweeps)

The contract has `CodecRoute` (Passthrough | CamPq) naming the regime and
`DistanceTableProvider` for ADC, but no unified `Distance` trait that each
carrier type implements. Today each call site hard-codes which distance
function to use (`hamming_distance_raw` for Binary16K, `adc_distance` for
CamPq, `cosine_f64_simd` for Vsa16kF32). This works but prevents writing
generic distance sweeps over mixed SoA columns.

**Fix:** Add `pub trait Distance` to `contract::cam` (or a new `contract::distance`
module). Implement for `[u64; 256]`, `CamPqCode`, `PaletteEdge`, `Vsa16kF32`.
Include `similarity_z()` for FisherZ-transformed cosine averaging.
See EPIPHANIES.md 2026-04-26 distance-dispatch entry for full design.

**Blocked by:** nothing — pure additive.
**Unblocks:** generic SoA distance accumulation, multi-column weighted distance,
render-frame similarity for force-directed layout (CAM-PQ pruning + HHTL cascade).

## 2026-04-26 — TD-DIST-2: vector_ops.rs still has scalar dot/norm/cosine (4 loops)

**Status:** Open
**Severity:** High (hot path in DataFusion UDF — L2/cosine queries)

`vector_ops.rs` lines 140, 160, 179, 189 have 4 independent scalar
`.iter().map().sum()` loops for dot product, norm², cosine similarity.
Should swap for `ndarray::hpc::heel_f64x8::{dot_f64_simd, cosine_f64_simd}`.
Estimated 8-12× speedup (chunked F64x8 FMA vs scalar).

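To make the "scalar loop vs chunked lanes" distinction in TD-DIST-2 concrete, here is a portable sketch. It does not call the ndarray kernels and makes no claim about their speedup; `dot_scalar`, `dot_chunked`, and this `cosine_similarity` are names invented for the illustration. The chunked version keeps eight independent accumulators, which is the shape an F64x8 kernel uses.

```rust
/// Scalar reference: a single accumulator, one dependency chain.
pub fn dot_scalar(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Chunked variant: eight independent accumulators (one per lane),
/// plus a scalar pass over the tail that does not fill a chunk.
pub fn dot_chunked(a: &[f64], b: &[f64]) -> f64 {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");
    let mut acc = [0.0f64; 8];
    let a_chunks = a.chunks_exact(8);
    let b_chunks = b.chunks_exact(8);
    let (a_tail, b_tail) = (a_chunks.remainder(), b_chunks.remainder());
    for (ca, cb) in a_chunks.zip(b_chunks) {
        for lane in 0..8 {
            acc[lane] += ca[lane] * cb[lane];
        }
    }
    let tail: f64 = a_tail.iter().zip(b_tail).map(|(x, y)| x * y).sum();
    acc.iter().sum::<f64>() + tail
}

/// Cosine similarity built on the chunked dot product.
/// Returns 0.0 when either vector has zero norm.
pub fn cosine_similarity(a: &[f64], b: &[f64]) -> f64 {
    let denom = (dot_chunked(a, a) * dot_chunked(b, b)).sqrt();
    if denom == 0.0 { 0.0 } else { dot_chunked(a, b) / denom }
}
```

Because the chunked version sums in a different order, its result can differ from the scalar one in the last bits for general floating-point inputs; tests comparing the two should use a tolerance.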
## 2026-04-26 — TD-DIST-3: bgz17 Palette::nearest() uses brute-force 256×17 L1

**Status:** Open
**Severity:** Medium (build-time hot path for palette construction)

`bgz17/palette.rs` lines 56-65 iterate all 256 centroids per query.
Should use precomputed distance table from `ndarray::hpc::palette_distance`.
Estimated 100× speedup for encoding (O(1) table lookup vs O(256) L1 per query).

## 2026-04-26 — Paid Debt: TD-DIST-1/2/3 all shipped in commit 8603148

- **TD-DIST-1** (Distance trait): `contract::distance` module with `Distance` trait,
  `fisher_z_inverse`, `mean_similarity_fisher`. Impls for `[u64; 256]`, `[u8; 6]`, `[u8; 3]`.
  11 tests. Status: **PAID**.
- **TD-DIST-2** (vector_ops scalar→SIMD): `cosine_distance`, `cosine_similarity`,
  `dot_product_distance`, `dot_product_similarity` all now delegate to
  `ndarray::hpc::heel_f64x8::cosine_f32_to_f64_simd` / `dot_f64_simd`. Status: **PAID**.
- **TD-DIST-3** (Palette distance table): `Palette::build_distance_table()` →
  `PaletteDistanceTable` with O(1) `distance(a, b)` and `edge_distance(a, b)`.
  128 KB table, L2-resident. Status: **PAID**.

.github/workflows/build.yml

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ concurrency:
 env:
   CARGO_TERM_COLOR: always
-  RUSTFLAGS: "-C debuginfo=1"
+  RUSTFLAGS: "-C debuginfo=1 -C target-cpu=x86-64-v3"
   RUST_BACKTRACE: "1"
   CARGO_INCREMENTAL: "0"

.github/workflows/rust-publish.yml

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ on:
 env:
   CARGO_TERM_COLOR: always
-  RUSTFLAGS: "-C debuginfo=1"
+  RUSTFLAGS: "-C debuginfo=1 -C target-cpu=x86-64-v3"
   RUST_BACKTRACE: "1"
   CARGO_INCREMENTAL: "0"
   CARGO_BUILD_JOBS: "1"

.github/workflows/rust-test.yml

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ concurrency:
 env:
   CARGO_TERM_COLOR: always
-  RUSTFLAGS: "-C debuginfo=1"
+  RUSTFLAGS: "-C debuginfo=1 -C target-cpu=x86-64-v3"
   RUST_BACKTRACE: "1"
   CARGO_INCREMENTAL: "0"

.github/workflows/style.yml

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ concurrency:
 env:
   CARGO_TERM_COLOR: always
-  RUSTFLAGS: "-C debuginfo=1"
+  RUSTFLAGS: "-C debuginfo=1 -C target-cpu=x86-64-v3"

 jobs:
   format:

Dockerfile

Lines changed: 9 additions & 1 deletion
@@ -1,7 +1,10 @@
-# lance-graph — Railway compile-test image
+# lance-graph — Railway compile-test image (AVX2 default)
 # Verifies the workspace builds cleanly (core + bgz17 + planner + contract)
 # Requires Rust 1.94.0 (LazyLock, modern std APIs)
 #
+# CPU detection & SIMD dispatch documentation: see Dockerfile.md
+# AVX-512 pinned variant: see Dockerfile.avx512
+#
 # Build: docker build -t lance-graph-test .
 # Run: docker run --rm lance-graph-test

@@ -38,6 +41,11 @@ COPY crates/bgz17/Cargo.toml crates/bgz17/Cargo.toml
 # Copy source
 COPY crates/ crates/

+# Default target: x86-64-v3 (AVX2) — runs on GitHub CI and most servers.
+# Use Dockerfile.avx512 for x86-64-v4 (AVX-512) on Skylake-X / Ice Lake / Sapphire Rapids.
+# The .cargo/config.toml pins x86-64-v4 for LOCAL builds; override here for portability.
+ENV RUSTFLAGS="-C target-cpu=x86-64-v3"
+
 # Build bgz17 standalone (zero deps, fast check)
 RUN cargo build --release --manifest-path crates/bgz17/Cargo.toml 2>&1 \
     && echo "=== BGZ17 BUILD OK ==="

Dockerfile.avx512

Lines changed: 3 additions & 0 deletions
@@ -4,6 +4,9 @@
 #
 # ONLY deploy on AVX-512 hardware.
 #
+# CPU detection & SIMD dispatch documentation: see Dockerfile.md
+# Portable (AVX2) variant: see Dockerfile
+#
 # Build: docker build -f Dockerfile.avx512 -t lance-graph-avx512 .
 # Run: docker run --rm lance-graph-avx512

Dockerfile.md

Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
# lance-graph Docker CPU Detection & SIMD Dispatch

## Three-Tier Build Strategy

| Target | Dockerfile | RUSTFLAGS | Use case |
|---|---|---|---|
| **Portable (AVX2)** | `Dockerfile` | `-C target-cpu=x86-64-v3` | GitHub CI, general servers |
| **AVX-512 pinned** | `Dockerfile.avx512` | `-C target-cpu=x86-64-v4` | Production (Skylake-X+) |
| **HHTL-D TTS** | `Dockerfile.hhtld` | (inherits) | TTS inference container |
| **Local dev** | `.cargo/config.toml` | `-C target-cpu=x86-64-v4` | Developer machines |

## How lance-graph Uses SIMD

lance-graph delegates all SIMD work to **ndarray** (mandatory dependency).
ndarray's `src/simd.rs` polyfill provides the dispatch:

```
Consumer code (lance-graph):
  ndarray::hpc::bitwise::hamming_distance_raw(a, b)
  ndarray::simd::F32x16::mul_add(b, c)
  ndarray::hpc::renderer::integrate_simd(pos, vel, dt, damp)

Polyfill (ndarray simd.rs):
  ┌─────────────────────────┐
  │ compile-time target_cpu │
  ├─────────┬───────────────┤
  │ v4      │ v3 / lower    │
  ├─────────┼───────────────┤
  │ __m512  │ 2× __m256 or  │
  │ native  │ scalar loop   │
  └─────────┴───────────────┘
              +
  ┌──────────────────────────────┐
  │ runtime LazyLock<Tier>       │
  │ is_x86_feature_detected!()   │
  │ → per-function AVX-512 even  │
  │   when compiled at v3        │
  └──────────────────────────────┘
```

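The two-stage scheme in the diagram (compile-time `target-cpu` plus a runtime tier cached in a `LazyLock`) can be sketched with the standard library alone. This is a hedged illustration of the pattern, not ndarray's implementation: the `Tier` enum, `detect`, `tier`, and `compiled_with_avx2` are names assumed for this sketch.

```rust
use std::sync::LazyLock;

/// Hypothetical SIMD tiers, ordered from least to most capable.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Tier {
    Scalar,
    Avx2,
    Avx512,
}

/// Runtime probe. On non-x86 targets the x86 checks are compiled out
/// and the function falls through to `Scalar`.
fn detect() -> Tier {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::arch::is_x86_feature_detected!("avx512f") {
            return Tier::Avx512;
        }
        if std::arch::is_x86_feature_detected!("avx2") {
            return Tier::Avx2;
        }
    }
    Tier::Scalar
}

/// Detected once on first use, then cached for the life of the process.
static TIER: LazyLock<Tier> = LazyLock::new(detect);

/// Runtime answer: what the CPU running this binary supports.
pub fn tier() -> Tier {
    *TIER
}

/// Compile-time answer: whether the build itself assumed AVX2
/// (true for `-C target-cpu=x86-64-v3` and above).
pub fn compiled_with_avx2() -> bool {
    cfg!(target_feature = "avx2")
}
```

A binary built at `x86-64-v3` reports `compiled_with_avx2() == true` on every machine it can run on, while `tier()` may still report `Avx512` on newer hardware; that gap is what per-function runtime dispatch exploits.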
### What lance-graph calls from ndarray SIMD

| lance-graph location | ndarray function | What it does |
|---|---|---|
| `driver.rs` (shader hot loop) | `bitwise::hamming_distance_raw` | Content-plane Hamming pre-pass (16K-bit fingerprints) |
| `vector_ops.rs` (DataFusion UDF) | `bitwise::hamming_distance_raw` | SQL `hamming_distance()` function |
| `fingerprint.rs` (graph) | `bitwise::hamming_distance_raw` | Graph fingerprint similarity |
| `blasgraph/types.rs` | Own AVX-512/AVX2 Hamming | Hand-rolled (predates ndarray integration) |

### `.cargo/config.toml` vs CI RUSTFLAGS

**Important:** `RUSTFLAGS` env var **replaces** (not appends to) the `rustflags`
array in `.cargo/config.toml`. This is a Cargo design decision.

lance-graph's `.cargo/config.toml` sets `target-cpu=x86-64-v4` for local dev.
CI workflows set `RUSTFLAGS="-C debuginfo=1 -C target-cpu=x86-64-v3"` which
**overrides** config.toml entirely. The CI binary targets AVX2.

This is intentional:
- Local dev: maximum SIMD (AVX-512, everything inlined)
- CI: portable (AVX2, runtime detection for anything higher)
- Production Docker: choose `Dockerfile` (AVX2) or `Dockerfile.avx512`

## AMX Detection

Intel AMX (Sapphire Rapids+) is detected at runtime by ndarray:
`ndarray::hpc::amx_matmul::amx_available()` checks CPUID + OS XSAVE support.
AMX kernels are always compiled in and gated at call sites. No Dockerfile
or RUSTFLAGS change needed — it works with any `target-cpu`.

## NEON (ARM / aarch64 / Raspberry Pi)

ndarray detects NEON automatically on aarch64 (it's mandatory). The `dotprod`
extension (Pi 5 / A76+) is runtime-detected for 4× int8 throughput.
lance-graph inherits this via ndarray; no ARM-specific configuration needed.

## Choosing the Right Dockerfile

```
GitHub CI / PR checks  → Dockerfile (AVX2, -C target-cpu=x86-64-v3)
Railway / production   → Dockerfile.avx512 (-C target-cpu=x86-64-v4)
TTS inference          → Dockerfile.hhtld (downloads codebooks + runs decoder)
Raspberry Pi / ARM     → Dockerfile (NEON auto-detected at runtime)
Maximum compatibility  → docker build --build-arg RUSTFLAGS="-C target-cpu=x86-64"
```

## Verifying CPU Features

```bash
# Inside the container:
grep -oP 'avx512\w+' /proc/cpuinfo | sort -u

# From Rust (ndarray), inside a program rather than the shell:
#   use ndarray::hpc::simd_caps::simd_caps;
#   println!("{:?}", simd_caps());  // CpuCaps { avx512: true, avx2: true, fma: true, ... }
```

## Build Examples

```bash
# Default (AVX2) — safe everywhere
docker build -t lance-graph-test .

# AVX-512 pinned — production servers
docker build -f Dockerfile.avx512 -t lance-graph-avx512 .

# TTS inference
docker build -f Dockerfile.hhtld \
  --build-arg RELEASE_TAG=v0.1.0 \
  -t lance-graph-tts:v0.1.0 .
```

crates/bgz17/src/palette.rs

Lines changed: 53 additions & 0 deletions
@@ -65,6 +65,24 @@ impl Palette {
         best_idx
     }

+    /// Build a precomputed distance table for O(1) inter-centroid distance.
+    ///
+    /// Returns a 256×256 u16 table where `table[i][j]` = L1 distance between
+    /// `entries[i]` and `entries[j]`. Used by the renderer and cascade skip
+    /// for fast palette-edge distance without recomputing L1 per query.
+    pub fn build_distance_table(&self) -> PaletteDistanceTable {
+        let k = self.entries.len();
+        let mut table = vec![0u16; 256 * 256];
+        for i in 0..k {
+            for j in i..k {
+                let d = self.entries[i].l1(&self.entries[j]) as u16;
+                table[i * 256 + j] = d;
+                table[j * 256 + i] = d;
+            }
+        }
+        PaletteDistanceTable { table, size: k }
+    }
+
     /// Encode an SpoBase17 edge to palette indices.
     pub fn encode_edge(&self, edge: &SpoBase17) -> PaletteEdge {
         PaletteEdge {
@@ -226,6 +244,41 @@ impl Palette {
     }
 }

+/// Precomputed 256×256 L1 distance table for O(1) inter-centroid lookup.
+///
+/// Built once from a `Palette` via `palette.build_distance_table()`.
+/// Used by the cascade skip (HHTL), renderer force-directed layout, and
+/// any path that needs repeated palette-edge distance without recomputing L1.
+///
+/// Memory: 256×256×2 = 128 KB (fits L2 cache). Build cost: O(k²×17).
+#[derive(Clone)]
+pub struct PaletteDistanceTable {
+    table: Vec<u16>,
+    size: usize,
+}
+
+impl PaletteDistanceTable {
+    /// O(1) distance between two palette indices.
+    #[inline]
+    pub fn distance(&self, a: u8, b: u8) -> u16 {
+        self.table[a as usize * 256 + b as usize]
+    }
+
+    /// Number of active entries (≤ 256).
+    pub fn size(&self) -> usize { self.size }
+
+    /// Distance between two PaletteEdges (sum of S + P + O distances).
+    #[inline]
+    pub fn edge_distance(&self, a: PaletteEdge, b: PaletteEdge) -> u32 {
+        self.distance(a.s_idx, b.s_idx) as u32
+            + self.distance(a.p_idx, b.p_idx) as u32
+            + self.distance(a.o_idx, b.o_idx) as u32
+    }
+
+    /// Memory footprint in bytes.
+    pub fn byte_size(&self) -> usize { self.table.len() * 2 }
+}
+
 /// Palette resolution: trade compression vs accuracy.
 ///
 /// Edge count determines optimal palette size:
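
The table construction in the hunk above can be exercised in isolation with a small stand-in. This sketch is not the bgz17 code: `Centroid`, `l1`, and `DistanceTable` are hypothetical names, the centroid is assumed to be a plain `[u8; 17]`, and edges are passed as `[u8; 3]` index triples instead of `PaletteEdge`.

```rust
/// Hypothetical 17-byte centroid standing in for a palette entry.
pub type Centroid = [u8; 17];

/// L1 distance between two centroids. The maximum is 17 * 255 = 4335,
/// so the sum fits in u16 without overflow.
fn l1(a: &Centroid, b: &Centroid) -> u16 {
    a.iter().zip(b).map(|(x, y)| x.abs_diff(*y) as u16).sum()
}

/// Flat 256 x 256 table of pairwise centroid distances.
pub struct DistanceTable {
    table: Vec<u16>,
    size: usize,
}

impl DistanceTable {
    /// Fills only the upper triangle and mirrors it, since L1 is symmetric.
    pub fn build(entries: &[Centroid]) -> Self {
        assert!(entries.len() <= 256, "palette indices are u8");
        let k = entries.len();
        let mut table = vec![0u16; 256 * 256];
        for i in 0..k {
            for j in i..k {
                let d = l1(&entries[i], &entries[j]);
                table[i * 256 + j] = d;
                table[j * 256 + i] = d;
            }
        }
        Self { table, size: k }
    }

    /// O(1) lookup; any pair of u8 indices is in bounds for the flat table.
    pub fn distance(&self, a: u8, b: u8) -> u16 {
        self.table[a as usize * 256 + b as usize]
    }

    /// Sum of the three per-slot distances of two index triples.
    pub fn edge_distance(&self, a: [u8; 3], b: [u8; 3]) -> u32 {
        (0..3).map(|i| self.distance(a[i], b[i]) as u32).sum()
    }

    /// Number of centroids the table was built from.
    pub fn size(&self) -> usize {
        self.size
    }

    /// Memory footprint of the table in bytes.
    pub fn byte_size(&self) -> usize {
        self.table.len() * std::mem::size_of::<u16>()
    }
}
```

Allocating the full 256 × 256 grid even for a smaller palette keeps the index arithmetic branch-free; slots for unused indices simply read as zero.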
