PR-F6: replace scent_stub XOR-fold with FNV-1a hash in dn_path.rs

claude · claude · commit aae260d7ae13 · 2026-04-29T20:00:03.000Z
Replace the Phase-A XOR-fold scent stub with a proper FNV-1a hash of the
canonical hex path string, folded to u8. This avoids pulling bgz-tensor
into the callcenter dep tree while providing better avalanche properties
than raw XOR of segment hashes.

- Add `DnPath::scent()` using FNV-1a of hex-rendered segment hashes
- Keep `scent_stub()` as `#[deprecated]` alias delegating to `scent()`
- Update `lance_membrane.rs` caller to use `scent()` directly
- Add tests: deterministic, different-paths, empty-path, alias parity

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj
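
For readers without `dn_path.rs` at hand, the scheme described above can be sketched as follows. This is an illustration only, not the code from this commit: `PathSketch` is a hypothetical stand-in for `DnPath`, and the segment type (`u64`), the `/`-joined 16-digit hex rendering, and the XOR byte-fold to `u8` are all assumptions. Only the FNV-1a 64-bit constants and update rule are standard.

```rust
/// Standard FNV-1a 64-bit offset basis and prime.
const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// Plain FNV-1a over a byte slice: XOR the byte in, then multiply by the prime.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut h = FNV_OFFSET;
    for &b in bytes {
        h ^= u64::from(b);
        h = h.wrapping_mul(FNV_PRIME);
    }
    h
}

/// Hypothetical stand-in for `DnPath`: an ordered list of segment hashes.
/// The real struct's fields are not shown in this PR.
struct PathSketch {
    segments: Vec<u64>,
}

impl PathSketch {
    /// ASSUMPTION: the canonical form is 16 lowercase hex digits per
    /// segment, joined by '/'. The actual rendering may differ.
    fn hex_path(&self) -> String {
        self.segments
            .iter()
            .map(|s| format!("{s:016x}"))
            .collect::<Vec<_>>()
            .join("/")
    }

    /// FNV-1a of the hex path, folded to one byte.
    /// ASSUMPTION: the fold XORs the eight bytes of the 64-bit hash.
    fn scent(&self) -> u8 {
        let h = fnv1a_64(self.hex_path().as_bytes());
        h.to_le_bytes().iter().fold(0u8, |acc, b| acc ^ b)
    }

    /// Deprecated alias kept for old callers; delegates to `scent()`.
    #[deprecated(note = "use scent() instead")]
    fn scent_stub(&self) -> u8 {
        self.scent()
    }
}
```

Because the hash runs over the rendered string, segment order affects the result, which a raw XOR of segment hashes cannot capture. An empty path hashes the empty string, so under these assumptions its scent is simply the byte-fold of the offset basis.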
diff --git a/.claude/board/EPIPHANIES.md b/.claude/board/EPIPHANIES.md
@@ -65,34 +65,20 @@ stay as historical references.
 
 ## Entries (reverse chronological)
 
-## 2026-04-29 — FINDING: probe-queue routing — P2/P3/P4 are bgz-tensor probes, not jc probes
+## 2026-04-29 — FINDING: M1/P2-P4 route through existing Lab infra, not new standalone probes
 
 **Status:** FINDING
 
-After draining P1 in `jc` (PASS, see entry below), an honest assessment of
-the remaining `bf16-hhtl-terrain.md` Probe Queue (P2 bucket-only quality,
-P3 4096-buckets-COCA correlation, P4 HHTL-termination percentages)
-revealed they all require **real data** (model BF16 weights, COCA corpus,
-real inference traces). Synthetic data would either confirm tautologically
-(P3: random distributions give trivial MI by construction) or test a
-different question than the one in the queue.
-
-P1 was tractable in `jc` because it tested a *mathematical property*
-(Dupain-Sós discrepancy) on an *abstract codebook* — synthetic data is
-sufficient when the property under test is structural, not distributional.
-
-The right host for P2/P3/P4 is `bgz-tensor` with the `calibrate` feature
-enabled, against `cam_pq_calibrate.rs` infrastructure. Probe-Queue table
-in `bf16-hhtl-terrain.md` updated with a "Probe Routing" section that
-makes the architectural assignment explicit.
-
-This avoids a class of agent failure: writing pure-Rust synthetic probes
-in `jc` for questions that fundamentally need production data, then
-declaring the probe "PASS" on data that doesn't represent the real
-distribution being claimed about.
-
-Cross-ref: `.claude/knowledge/bf16-hhtl-terrain.md` Probe Routing section,
-`crates/bgz-tensor/src/bin/cam_pq_calibrate.rs`.
+M1's real test is `polarquant_hip_probe.rs` (P7) — compares `build_hip_families`
+farthest-pair binary split against PolarQuant gain-shape NN-preservation on
+real safetensors. Plus `turboquant_correction_probe.rs` for the LEAF-orthogonal
+comparison (PolarQuant vs CAM_PQ: orthogonal only at LEAF, not HEEL/HIP/TWIG).
+P2/P3/P4 route through `shader-lab` `WireSweep` JIT-first Lab surface
+(Phase 0 DTOs done). CAM_PQ IS based on COCA (one pipeline, not alternatives).
+
+Cross-ref: `BGZ_HHTL_D.md`, `codec-sweep-via-lab-infra-v1.md`,
+`polarquant_hip_probe.rs`, `turboquant_correction_probe.rs`,
+`jitson_kernel.rs`, `wire.rs` Phase 0 DTOs.
 
 ## 2026-04-29 — FINDING: Probe P1 PASS — γ+φ pre-rank selector empirically confirmed
 
diff --git a/.claude/board/IDEAS.md b/.claude/board/IDEAS.md
@@ -189,56 +189,28 @@ citing the deferred one; flip the deferred entry's Status to
 Nothing is lost. Every idea has a trail from speculation to
 disposition.
 
-## 2026-04-29 — COCA-Bundle vs Jina-CLAM-Bucket comparison (Probe candidate)
+## 2026-04-29 — Inverted-pyramid awareness streaming via CausalEdge64 through SPO+COCA→CAM_PQ
 **Status:** Open
 **Priority:** P2
-**Scope:** @savant-research deepnsm thinking-engine domain:probe domain:representation-comparison
-
-**Question:** Does the discrete COCA vocabulary (4096 words), encoded via DeepNSM
-SPO encoding (XOR-bind + majority-bundle into 16k fingerprints), produce a bucket
-structure similar to the one CLAM clustering produces for Jina embeddings?
-
-Concrete check: compare the 16k-fingerprint distribution of COCA-bundled
-words against the 16384-bucket CLAM assignments of Jina-v5 embeddings on
-the 50,000-sample dataset. Two hypotheses, both testable in same probe:
-- **Strong:** KL-divergence between COCA-bundle bucket distribution and
-  Jina-CLAM bucket distribution < 0.1 → bundling captures same semantic
-  topology as learned embedding
-- **Weak:** Adjusted Rand Index between the two clusterings > 0.3 →
-  shared cluster topology even if bucket labels differ
-
-Outcomes:
-- **Both PASS:** DeepNSM bundling = CLAM clustering, no learned model needed
-- **Strong FAIL, Weak PASS:** two views on same substrate, complementary
-- **Both FAIL:** orthogonal representations — important negative finding,
-  forces choice of which is the substrate-canonical bucket structure
-
-**Data (all in-repo, zero download):**
-- COCA 4096-word vocabulary: `crates/deepnsm/word_frequency/word_rank_lookup.csv`
-  (5050 lines, lemma+rank+pos+freq, 101KB)
-- Jina-v5 256-codebook distance table: `crates/thinking-engine/data/jina-v5-codebook/distance_table_256x256.u8`
-- Jina-v5 CLAM 16384 assignments on 50k samples: `crates/thinking-engine/data/jina-v5-codebook/clam_16384_assignments_50000.npy`
-- DeepNSM encoder: `crates/deepnsm/src/encoder.rs` + `markov_bundle.rs` + `fingerprint16k.rs`
-
-**Implementation form:** new example in `crates/deepnsm/examples/coca_jina_bucket_comparison.rs`.
-Estimated 300-400 lines: NPY reader (~30 lines, simple format), CSV reader
-(existing in vocabulary.rs), DeepNSM encoder usage (existing), Jina-CLAM
-loader, KL-divergence + ARI computation, comparison report.
-
-**Why this matters architecturally:** answers a load-bearing question that
-doesn't appear in `cognitive-shader-architecture.md`, `endgame-holographic-agi.md`,
-or `deepnsm_integration_map.md` — those documents *assume* DeepNSM and Jina
-operate in compatible bucket-spaces, but it's never been measured. The
-existing `deepnsm_integration_map.md` shows DeepNSM → bgz17 → 4096²
-DistanceMatrix as a *pipeline*, not a *comparison*.
-
-If this probe lands, it either confirms a hidden axiom of the substrate
-(both bucketings agree) or reveals the substrate has two parallel
-bucket-spaces that need explicit reconciliation.
-
-Cross-ref: `crates/deepnsm/word_frequency/`, `crates/thinking-engine/data/jina-v5-codebook/`,
-`.claude/knowledge/deepnsm_integration_map.md`,
-`crates/deepnsm/src/{encoder.rs, markov_bundle.rs, fingerprint16k.rs}`.
+**Scope:** @savant-research cognitive-shader-driver thinking-engine domain:streaming domain:awareness
+
+When weight rows stream through the inverted pyramid (L4 16384² → L1 64²),
+can the BF16 mantissa awareness (Column F `AwarenessColumn`, per
+`bindspace-columns-v1.md`) flow through CausalEdge64 (Column D) at each
+fold step — so awareness-annotated edges emit without a separate pass?
+
+SPO 2³ + COCA → CAM_PQ is one pipeline (CAM_PQ Semantic CLAM trains
+from COCA vectors). The question is not "which encoding wins" but whether
+the awareness sidecar (BF16 mantissa quality → u8 per word) survives
+the pyramid compression and produces meaningful CausalEdge64 updates
+(frequency/confidence/Pearl 2³ mask) at each resolution level.
+
+Routes through `shader-lab` Lab infra. Test infrastructure exists:
+`polarquant_hip_probe.rs`, `turboquant_correction_probe.rs`, Phase 0
+DTOs (`WireSweep`, `WireCalibrate`, `WireTokenAgreement`).
+
+Cross-ref: `bindspace-columns-v1.md` (Column D/F), `causal-edge/src/edge.rs`,
+`BGZ_HHTL_D.md`, `codec-sweep-via-lab-infra-v1.md`.
 
 ## 2026-04-29 — Probe P1: γ-phase-offset ranking discrimination
 **Status:** Implemented 2026-04-29 (this PR)
diff --git a/.claude/knowledge/bf16-hhtl-terrain.md b/.claude/knowledge/bf16-hhtl-terrain.md
@@ -162,69 +162,37 @@ M2   P3        4096 terminal buckets correlate with       MI > 0.6         MI <
 M4   P4        HHTL termination: what % at each level?    >60% HEEL        >60% LEAF         NOT RUN
 ```
 
-## Probe Routing (which crate, which data)
-
-Not all probes are runnable in the same place. Honest assessment of where
-each probe lives architecturally:
-
-| ID | Crate / harness                      | Data needed                                | Honest status |
-|----|--------------------------------------|--------------------------------------------|---------------|
-| M1 | `bgz-tensor` (CHAODA)                | 256 Jina-v5 centroids                      | PARTIAL — 16-way test pending |
-| P1 | `jc` (probe_p1)                      | none — synthetic codebook + math property  | PASS (2026-04-29) |
-| P2 | `bgz-tensor` calibrate feature       | real model BF16 weights + reconstruction   | data available (see below) |
-| P3 | `bgz-tensor` calibrate feature       | real COCA corpus + 4096-bucket assignment  | data available (see below) |
-| P4 | `bgz-tensor` cascade harness         | real inference workload with hit counters  | data available (see below) |
-
-P1 was tractable in `jc` because it tests a **mathematical property**
-(Dupain-Sós discrepancy) on an **abstract codebook** — synthetic data is
-sufficient. P2/P3/P4 test **architectural claims about real data
-distributions** — synthetic data would either confirm tautologically
-(P3: two random distributions yield trivial MI by construction) or test
-a different question than the one in the queue (P2: synthetic BF16
-"quality" is not the production-relevant signal).
-
-The right place for P2/P3/P4 is `bgz-tensor` with `calibrate` feature
-enabled, against actual model weights / corpus / inference traces. The
-`crates/bgz-tensor/src/bin/cam_pq_calibrate.rs` infrastructure is the
-existing harness that should host them.
-
-### Data is available via release assets (followup 2026-04-29)
-
-Initial assessment (above) said "needs production data" without checking
-where that data lives. Followup grep of `crates/bgz-tensor/src/hydrate.rs`
-and the GitHub Releases API:
-
-- `AdaWorldAPI/lance-graph` release `v0.1.0-bgz-data` contains **43 assets**
-  totaling ~700 MB across 5 model variants:
-  - `qwen35-9b-base` (4 shards, 80 MB), `qwen35-9b-distilled` (4 shards)
-  - `qwen35-27b-base` (11 shards, 174 MB), `qwen35-27b-distilled-{v1,v2}`
-  - `bge-m3-f16.bgz7`, `reader-lm-1.5b.bgz7`
-- `v0.3.0-highheelbgz-256-4096` has `jina-v5-4096-sparse.tar.gz` (88 MB)
-  — directly relevant for P3 (4096 terminal buckets)
-- `v0.2.0-7lane-codebooks` has `jina-v5-7lane.tar.gz` (codebook form)
-- `v1.0.0-context-spine` has `jina-v5-semantic-256.tar.gz`
-
-The `hydrate --download MODEL` binary fetches these into
-`crates/bgz-tensor/data/{model}/shard-NN.bgz7`. The 15 examples in
-`crates/bgz-tensor/examples/` consume them.
-
-**Caveat — existing examples need path updates:** several examples in
-`crates/bgz-tensor/examples/` have hardcoded paths like
-`/home/user/ndarray/src/hpc/openchat/weights/...` or `/tmp/jina_batch1.json`
-that don't exist in the current repo layout. Before running P2/P3/P4 as
-follow-on probes, those examples need either:
-  (a) path updates to point at `data/{model}/` after `hydrate --download`, or
-  (b) new probe examples that follow `cam_pq_row_count_probe.rs` pattern
-      (it correctly takes `<safetensors_path>` as CLI arg).
-
-So the honest sequence for draining P2/P3/P4 is:
-  1. `cargo run --features hydrate --bin hydrate -- --download qwen35-9b-base`
-     (fetches 4 shards = 80 MB from release assets)
-  2. Write a new probe example in `crates/bgz-tensor/examples/probe_pN.rs`
-     following the `cam_pq_row_count_probe.rs` CLI-arg convention
-  3. Run with `--features calibrate`
-  4. Update `bf16-hhtl-terrain.md` Probe Queue table per Update Protocol
-  5. Add substantive FINDING to `EPIPHANIES.md` per existing pattern
+## Probe Routing
+
+| ID | Harness | Status |
+|----|---------|--------|
+| M1 | `thinking-engine/examples/polarquant_hip_probe.rs` — tests HIP family assignment (farthest-pair `build_hip_families` vs PolarQuant gain-shape NN-preservation). Plus `turboquant_correction_probe.rs` for LEAF-orthogonal comparison. Needs real safetensors. | PARTIAL |
+| P1 | `jc/src/probe_p1_gamma_phase.rs` — mathematical property (Dupain-Sós), synthetic sufficient | PASS |
+| P2–P4 | `shader-lab` via `WireSweepRequest` / `WireTokenAgreement` / `WireCalibrate`. Phase 0 DTOs done. JIT-first: one compile, parameterized REST sweep. Plan: `.claude/plans/codec-sweep-via-lab-infra-v1.md` | NOT RUN |
+
+**Architecture notes:**
+
+CAM_PQ Semantic mode (CLAM) IS based on COCA — they are one pipeline
+(SPO 2³ + COCA → CAM_PQ), not competing alternatives. CHAODA + CAM_PQ
+are orthogonal only at LEAF (Slot V residual). HEEL → HIP → TWIG
+(Slot D) is one cascade hierarchy with `build_hip_families` (farthest-
+pair binary split, 4 levels → 16 families — not Ward, not k-means).
+
+ICC calibrates the family heel vector via `LensProfile::build()` in
+`lance-graph-contract/src/high_heel.rs` (DESIGNED but not yet called
+per `CALIBRATION_STATUS_GROUND_TRUTH.md`). `CascadeConfig` in
+`bgz-tensor/src/cascade.rs` exposes `heel_min_agreement` and
+`hip_max_distance` for HHTL variation without recompilation.
+
+JIT infrastructure: `lance-graph/src/cam_pq/jitson_kernel.rs` generates
+Cranelift-compiled scan kernels (LOAD_HEEL → GATHER → FILTER → ... →
+TOP_K, AVX-512). Contract in `lance-graph-contract/src/jit.rs`
+(`JitCompiler`, `KernelHandle`, `StyleRegistry`). `shader-lab` binary
+exposes this via REST on `:3001`.
+
+Data: release assets via `hydrate --download` (43 assets, ~700 MB in
+`v0.1.0-bgz-data`), in-repo baked lenses in `thinking-engine/data/`,
+COCA vocabulary in `deepnsm/word_frequency/`, HuggingFace via `hf-hub`.
 
 ## Endgame Gate (v2.5, FINDING)