diff --git a/.claude/handovers/2026-06-16-genetic-research-headstone-exploration.md b/.claude/handovers/2026-06-16-genetic-research-headstone-exploration.md new file mode 100644 index 00000000..40700762 --- /dev/null +++ b/.claude/handovers/2026-06-16-genetic-research-headstone-exploration.md @@ -0,0 +1,293 @@ +# Genetic Research Headstone Exploration — lance-graph + +## Purpose + +This document is a headstone exploration for the full line of thought connecting: + +```text +upstream domain corpora (FASTA / VCF / BAM / GFF / 1000-Genomes / ClinVar / GO / Reactome / htslib) + ndarray::hpc::clam + CHAODA (3-level 16-way clustering + LFD anomaly) + ndarray::hpc::cam_pq (6 × 256 = 48-bit Lloyd-Max fingerprint; 94 % of Jina 1024-D) + ndarray::hpc::activations (exp / log / softmax / matrix exp / Lie-algebra Lyndon log-signature) + ndarray::hpc::amx_matmul (197 GMAC/s int8 GEMM on Emerald Rapids) + bgz17 (11/17 X-Trans stride for anti-moiré k-mer sampling) + lance-graph-contract (canonical NodeGuid · EdgeBlock · NodeRow; HHTL nibble-trie; MailboxSoA; CounterfactualMailbox) + lance-graph-ontology (OntologyRegistry + Pattern D OwlHydrator/MetaStructureHydrator + hydrate_*() glue (dolce/owltime/provo/qudt/schemaorg/skos/fibo_fnd/fibo_be/odoo/zugferd/skr03/skr04) + 47 KB Lance dictionary cache + wikidata_hhtl) + lance-graph-arm-discovery (reliability suite: Pearson / Spearman / Cronbach α / ICC(2,1)) + deepnsm (sentence-level AriGraph reader; P64 / Cam4096 / Crystal4096) + rubicon (§14 oracle: compare_normalised with provenance fields) + adapter-genetics-experimental (NEW — thin domain wiring; not yet built) +``` + +The goal is to preserve the architectural synthesis for genetic research *before* implementation details scatter it into separate plans — so a domain expert walking up to the substrate lands on the destination shape, not on the next tactical PR. + +--- + +## Capstone thesis + +```text +Bioinformatics ships pipelines. +Each pipeline is a tool with its own grammar, its own confidence calibration, +its own provenance discipline (usually informal), and its own failure modes. + +The substrate ships a SHAPE. +Every variant, every annotation, every counterfactual, every cohort summary +takes the same shape: a row in MailboxSoA, with an entropy×energy plane +classification derived from its Hebbian plasticity counter and its +last-active-cycle stamp, with adjacency in EdgeBlock, with class-resolved +value tenants in the 480-byte slab, with NARS truth (frequency, confidence) +calibrating its provenance, with Pearl-2³ subset addressing in CausalEdge64 +for do-calculus, with counterfactual minority poles preserved in a separate +lane via InferenceType::Counterfactual. + +One graph, every domain layer composes: + sequence sketches as CAM-PQ fingerprints; + genomic coordinates as HHTL cascade addresses; + novel variants as CHAODA anomaly scores against LFD distribution; + pathway propagation as MailboxSoA active-inference fan-out; + counterfactual driver-mutation histories as CounterfactualMailbox lanes; + causality as Pearl-2³ subset queries; + literature evidence as DeepNSM SPO triples; + cross-pipeline equivalence as §14 oracle verdicts. + +The genetic researcher's job is the domain mint: + which classid for which entity? + which ontology hydrator for which vocabulary? + which counterfactual gates the next study question? + +Everything below the domain mint is shipped. +``` + +--- + +## The four-layer architecture (from a geneticist's vantage) + +### Layer 0 — Domain corpora (upstream, never vendored) + +FASTA / FASTQ / VCF / BAM / GFF / annotation databases. They live at their canonical homes: +- 1000-Genomes / GIAB at NCBI / EBI. +- ClinVar at NCBI ClinVar releases. +- GO / Reactome / Sequence Ontology at OBO Foundry. +- Reference genome assemblies at UCSC / Ensembl / NCBI. + +The adapter `hydrate_*()` glue (Pattern D — Meta-Structure Hydration, `crates/lance-graph-ontology/src/hydrators/mod.rs:1-57`) points at canonical releases and pins a version; the corpora do not move into the substrate. **No new hydrator trait** — `hydrate_go` / `hydrate_reactome` / `hydrate_clinvar` are *data + ~50 LOC of glue each* over the shipped `OwlHydrator` / `MetaStructureHydrator`, mirroring the proven shape of `hydrate_dolce` / `hydrate_provo` / `hydrate_skos`. + +This layer answers: + +```text +which release of which reference is pinned +where the truth set / ontology / annotation lives +who owns its evolution +how the substrate consumer treats version cadence (GO monthly, ClinVar weekly, Reactome quarterly) +``` + +### Layer 1 — Adapter (`adapter-genetics-experimental`, proposed) + +Thin domain wiring. **Zero new substrate primitives.** Mirrors `OcrProvider` engine-agnostic boundary from `lance-graph` #498's `LayoutBlock::to_node_row` transcode. Defines: + +- `GenomicSubstrate` trait (the seam). +- Parsers (host `noodles-fasta`, `noodles-vcf`, `noodles-bam` — pure-Rust bioinformatics ecosystem). +- `Cam6x8` k-mer fingerprint function (calls into shipped CAM-PQ codec). +- `VcfRecordTranscoder` (mirrors OCR `LayoutBlock → NodeRow`). +- Class-mint registry sync with `lance-graph-ontology`. +- Per-class `ClassView::value_schema` selection (rides existing `Full` / `Compressed` presets — no new variant per #500's contract test). + +This layer answers: + +```text +how a VCF record becomes a NodeRow +how a k-mer becomes a 48-bit fingerprint +which classid identifies a Variant / Gene / Pathway / Cell / IntegrationSite +which ValueSchema preset materialises which tenant for which class +``` + +### Layer 2 — Substrate primitives (shipped) + +The CAM-PQ codec, CLAM tree, CHAODA anomaly scoring, bgz17 stride, ndarray AMX int8 GEMM, lance-graph-contract canonical NodeGuid + EdgeBlock + NodeRow, MailboxSoA, CounterfactualMailbox, Pearl-2³ in CausalEdge64, OntologyRegistry, DeepNSM sentence reader, §14 oracle. All file:line-grounded in `docs/GENETIC_RESEARCH_VIA_STACK.md` §1 + §3. + +This layer answers: + +```text +what every substrate primitive provides at the kernel level +which file:line carries the canonical implementation +which test suite proves green on main +which gating probe has been run (where green) vs. is queued (where speculative) +``` + +### Layer 3 — Research consumer + +The geneticist's queries, in the substrate's native vocabulary: + +- *"Find every variant in gene X within population Y in the bootstrap basin."* → one HHTL prefix scan. +- *"Score variant V against population local manifold for novelty."* → one CHAODA call on the CLAM tree. +- *"Sketch a 100 kb genomic region for cohort-wide similarity search."* → one bgz17-strided CAM-PQ fingerprint. +- *"Compare GATK calls vs. DeepVariant calls for sample S with provenance preserved."* → one §14 oracle invocation. +- *"Simulate KRAS-G12D-vs-WT counterfactual propagation in a 1024-cell tumor lattice."* → one `MailboxSoA<1024>` instantiation + `CounterfactualMailbox` for the G12D-vs-WT split. +- *"Extract gene-disease associations from PubMed abstracts."* → DeepNSM sentence reader over the corpus. + +This layer answers: + +```text +what queries the geneticist asks +which substrate primitive each query consumes +which ontology hydrator each query references +where the falsifiable certificate for each query result lives +``` + +--- + +## Why bioinformatics pipelines alone are not enough + +Each existing tool answers part of the question. None compose into a single counterfactual-preserving graph. + +| Tool family | What it gives | What it doesn't | +|---|---|---| +| GATK / DeepVariant / bcftools | Per-sample variant calls with caller-specific confidence | No cross-caller provenance reconciliation; no counterfactual lane; no entropy×energy substrate-state calibration | +| BLAST / Diamond / minimap2 | Sequence similarity rankings | No fingerprint-substrate integration; no SPO emission; no graph-native composition | +| Reactome / WikiPathways / KEGG | Annotated pathway membership | Static; no counterfactual propagation; no Friston-FEP evidence calibration | +| CADD / REVEL / AlphaMissense | Per-variant deleteriousness scores | Trained classifier required; no unsupervised novel-variant flag; no LFD anomaly grounding | +| CellNOpt / SCENIC / etc. | Network inference from expression | No counterfactual lane preservation; no Pearl-2³ do-calculus addressing | +| nf-core / Snakemake pipelines | Reproducibility via workflow management | Workflow-level, not graph-native; no §14 oracle equivalence checking; no provenance-normalised cross-pipeline comparison | + +The substrate's composition gives the missing piece: **one SPO graph, accumulating evidence across all of these consumers, with counterfactual lanes preserved, with entropy×energy quadrant classification per variant, with Pearl-2³ do-calculus addressing, and with §14 oracle equivalence as the falsifiable cross-tool benchmark.** + +--- + +## Why building genomics tooling from scratch is not enough + +You'd reach for the shipped substrate even if you started from scratch, because: + +- The CAM-PQ codec is mature (PR #482 ratified the canon; ndarray PR #218 measured fidelity). +- The CLAM tree + CHAODA scoring is mature (~1600 lines, validated probes). +- The AMX int8 GEMM is real silicon performance (197 GMAC/s, ndarray PR #217 measured). +- The entropy × energy plane is empirically validated (ρ(entropy, prediction accuracy) = −0.78 measured, ndarray PR #218). +- The reliability stats (Pearson / Spearman / Cronbach α / ICC) are shipped (ndarray PR #218). +- The CounterfactualMailbox is shipped with its iron invariant mechanically enforced. +- The §14 oracle is in production use for OCR caller comparison (post-#498). +- The OntologyRegistry has Pattern D hydrators (`hydrate_dolce` / `hydrate_owltime` / `hydrate_provo` / `hydrate_qudt` / `hydrate_schemaorg` / `hydrate_skos` / `hydrate_fibo_fnd` / `hydrate_fibo_be` / `hydrate_odoo` / `hydrate_zugferd` / `hydrate_skr03/04`) over shipped `OwlHydrator` + `MetaStructureHydrator` as the proven pattern. + +Building these from scratch is N person-years of work. The lift to genetic-research-via-substrate is the **domain wiring** — measured in days to weeks per deliverable in `genetic-research-substrate-integration-v1.md`. + +--- + +## Invariants + +These are what the substrate enforces; the genetic-research adapter inherits them. + +1. **§0 anti-invention guardrail** (lance-graph #496): no new `ValueSchema` variant; no new substrate types. Genetic-research-specific work is *wiring*, not new substrate. +2. **No-new-enum-variant contract test** (lance-graph #500): genomic classes ride existing `Full` / `Compressed` presets via `classid → ClassView`. **Do not propose a `ValueSchema::Genetic`.** +3. **Counterfactuals stay in their own lane** (`counterfactual.rs` iron invariant): `InferenceType::Counterfactual` mantissa = -6; never written as observed SPO. +4. **Closed-vocab discipline** (ruff PR #5 `predicate_count_locked_at_N`): new genetic predicates land in `ruff_spo_triplet::Predicate` under the locked-count gate. +5. **No C++ source vendored into Rust-target crates.** htslib stays upstream; if a transcoded version is wanted, route through `ruff_cpp_spo` (cross-repo handover at `AdaWorldAPI/ruff`). +6. **Five-specialist drift-catching pass** (lance-graph #500): `cascade-architect` / `family-codec-smith` / `palette-engineer` / `dto-soa-savant` / `truth-architect` review before any FINDING-grade claim. +7. **Gating probes before FINDING**: `PROBE-CHAODA-1000G`, `PROBE-KRAS-COUNTERFACTUAL-DET`, `PROBE-CAM-PQ-VS-BLAST` gate the substrate's claims to bioinformatics audiences. +8. **Boundary: representation + research tooling only.** No medical/diagnostic claims (per the predecessor `3DGS-genetics-4x4-fanout-plan.md`). + +--- + +## What "complete" looks like + +The headstone is reached when: + +1. **`adapter-genetics-experimental` compiles** and the locked-shape test passes (the *"shape locked"* milestone analogous to `ruff_ruby_spo` PR #4). +2. **FASTA + VCF round-trip into `NodeRow`** via the `VcfRecordTranscoder`. The first measurable artifact: load 1000-Genomes Phase 3 chromosome 22 into the substrate and round-trip back to VCF, byte-identical for the chr 22 subset. +3. **CHAODA on 1000-Genomes feature vectors** produces ROC-AUC ≥ 0.85 on the held-out novel-singleton test (PROBE-CHAODA-1000G green). +4. **CAM-PQ-vs-BLAST agreement** measured: Spearman ρ ≥ 0.7 on top-100 RefSeq similarity rankings (PROBE-CAM-PQ-VS-BLAST green). +5. **KRAS G12D 1024-cell counterfactual fan-out** simulation runs deterministically (PROBE-KRAS-COUNTERFACTUAL-DET bit-exact across runs), and the observed-lane oncogenic-transformation rate matches published outcomes within tolerance. +6. **GO / Reactome / ClinVar Pattern D hydrate_*() glue** (`hydrate_go` / `hydrate_reactome` / `hydrate_clinvar`, each data + ~50 LOC over shipped `OwlHydrator` / `MetaStructureHydrator`) load into `OntologyRegistry` and ontology cache invalidation works on Lance version bump. +7. **§14 oracle benchmarks GATK vs. DeepVariant** against GIAB HG002 truth set with F1 meeting published minima. +8. **DeepNSM genetic-language reader probe** demonstrates `P64` projection consistency on protein-coding sequences (versus structured-noise on non-coding). +9. **Histology splat extension** carries per-splat genomic profile via `Full` ValueSchema with `Fingerprint` + `HelixResidue` tenants populated. + +When these nine hold, the substrate has fulfilled its purpose as the genetic-research foundation: one graph, every domain layer composes, counterfactual lanes preserved, falsifiable certificates everywhere. + +--- + +## Headstone state — what the era closes + +```text +The era that closes: + - Per-tool pipeline silos with no cross-tool provenance reconciliation. + - Variant calling and pathway annotation as separate worlds with no + shared substrate. + - Counterfactual driver-mutation histories thrown away because no tool + preserves them. + - Reproducibility via workflow managers rather than substrate-native + provenance + §14 oracle equivalence. + - Outlier detection requiring trained classifiers (CADD / REVEL etc.) + because no unsupervised LFD-based novel-variant detector existed at + bioinformatics scale. + - "Bioinformatics builds its own tools" as the default assumption. + +The era that opens: + - One SPO graph accumulating cross-tool, cross-paper, cross-cohort + variant evidence with provenance preserved. + - Counterfactual driver-mutation lanes queryable for retrospective + analysis ("what if KRAS had mutated at codon 13 instead of 12?"). + - CHAODA unsupervised novel-variant detection on the same CLAM tree + the substrate uses for language retrieval. + - Pearl-2³ do-calculus native in CausalEdge64 for cancer-pathway + causal queries. + - The Friston entropy×energy substrate-state plane as the calibrated + confidence axis for every variant in the store. + - The same Gaussian-splat math at cm (organ), mm (lesion), and µm + (cell / histology slide) scales, with the splat carrier extending + to per-cell genomic profile. + - Cross-pipeline equivalence checking as a substrate-native operation + (§14 oracle), not a workflow manager's afterthought. + - Domain experts wiring the genetic-research adapter, not rebuilding + the substrate. +``` + +The capstone thesis at the top of this doc is the one-line restatement of the open-era state. + +--- + +## Cross-references + +### This repo (`AdaWorldAPI/lance-graph`) +- `docs/GENETIC_RESEARCH_VIA_STACK.md` — the *why* doc for a domain expert, file:line-grounded. +- `.claude/plans/genetic-research-substrate-integration-v1.md` — the implementation plan (10 deliverables + 3 probes). +- `.claude/plans/3DGS-genetics-4x4-fanout-plan.md` — predecessor static-representation plan. +- `.claude/plans/3DGS-cross-pollination-raw-field-plan.md` — sibling cross-domain plan (ultrasound + neuronal + genetics share the raw-field backbone). +- `crates/lance-graph-contract/src/canonical_node.rs` — `NodeGuid` / `EdgeBlock` / `NodeRow` (the row substrate). +- `crates/lance-graph-contract/src/counterfactual.rs` — `SplitPoles` / `CounterfactualMailbox` / `revise_if_minority_wins`. +- `crates/lance-graph-contract/src/hhtl.rs` — `NiblePath` HHTL nibble-trie. +- `crates/lance-graph-contract/src/high_heel.rs:202` — `CausalEdge64` bit layout (Pearl mask included). +- `crates/lance-graph-ontology/` — `OntologyRegistry` + TTL hydrators. +- `crates/lance-graph/src/cam_pq/storage.rs` — CAM-PQ 48-bit fingerprint storage. +- `crates/lance-graph-turbovec/KNOWLEDGE.md` — TurboQuant + LUT-ADC kernel (76 µs/query measured). +- `crates/cognitive-shader-driver/src/mailbox_soa.rs` — `MailboxSoA` + `consume_firing`. +- `crates/deepnsm/` — sentence-level AriGraph reader. + +### Sibling repo (`AdaWorldAPI/ndarray`) +- `src/hpc/clam.rs:1493-1560` — CHAODA Phase 4 anomaly scoring on LFD distribution. +- `src/hpc/amx_matmul.rs` — int8 GEMM at 197 GMAC/s. +- `src/hpc/activations.rs` — exp / log / softmax / softmax-backward. +- `src/hpc/linalg/mat_exp.rs` — matrix exponential via Padé. +- `crates/bgz17/src/lib.rs:53-60` — 11/17 X-Trans stride constants. + +### Upstream substrate context (`AdaWorldAPI/lance-graph` PRs) +- PR #491 — entropy × energy framing; SoA migration diff resolution. +- PR #494 — `EntropyRung` + `Quadrant` + `nars_entropy(f, c)`. +- PR #495 — 3-byte EdgeRef witness; reliability ⊥ causality empirically. +- PR #496 — `ValueSchema` presets + §0 anti-invention guardrail. +- PR #498 — GUID decode→read-mode keystone; helix `Signed360`; OCR→NodeRow transcode template. +- PR #500 — rebaseline + no-new-variant contract test + gating-probes pattern. + +### External cross-repo +- `AdaWorldAPI/bardioc/SUBSTRATE_STATE_FRAMINGS.md` — entropy × energy plane framing (durable bardioc-side architectural doc). +- `AdaWorldAPI/bardioc/.claude/handovers/2026-06-16-session-handover.md` — bardioc session handover (covers the preceding session's confabulation pattern + discipline lessons). +- `AdaWorldAPI/ruff/.claude/handovers/2026-06-16-ruff-cpp-headstone-exploration.md` — sibling headstone for the C++ harvester (relevant if htslib transcoding wanted). +- `OGAR/docs/CASCADE-SYNERGIES-EPIPHANY.md` — Morton-cascade × palette256 × golden-helix synthesis (foundational to the genetics fanout). +- `OGAR/docs/DISCOVERY-MAP.md` — D-CASCADE / D-MOIRE / D-BGZ17 ledger entries. + +### Workspace headstones (for shape reference) +- `lance-graph/.claude/plans/3DGS-Cesium-BindSpace4-headstone-exploration.md` — the headstone shape this document follows. +- `bardioc/ROADMAP_RUST_PRIMARY_HEADSTONE.md` — Phase A→I migration headstone. +- `AdaWorldAPI/ruff/.claude/handovers/2026-06-16-ruff-cpp-headstone-exploration.md` — sibling headstone (C++ harvester). +- `AdaWorldAPI/tesseract-rs/.claude/handovers/2026-06-16-tesseract-rs-headstone-exploration.md` — sibling headstone (Rust target). + +--- + +_Authored 2026-06-16 by external session `AdaWorldAPI/bardioc` `session_01VysoWJ6vsyg3wEGc5v7T5v`. Headstone shape — preserves the architectural synthesis for what genetic-research-via-substrate IS when complete. Companion pattern-recognition hand-off at `docs/GENETIC_RESEARCH_VIA_STACK.md` carries the *why*; companion implementation plan at `.claude/plans/genetic-research-substrate-integration-v1.md` carries the *how*. No code, no PR for substrate changes — synthesis-preservation only._ diff --git a/.claude/plans/genetic-research-substrate-integration-v1.md b/.claude/plans/genetic-research-substrate-integration-v1.md new file mode 100644 index 00000000..747ec645 --- /dev/null +++ b/.claude/plans/genetic-research-substrate-integration-v1.md @@ -0,0 +1,214 @@ +# Genetic Research Substrate Integration v1 — implementation plan + +> **Type:** PROPOSAL / integration plan. Companion to: +> - `docs/GENETIC_RESEARCH_VIA_STACK.md` — the pattern-recognition hand-off explaining *why*. +> - `.claude/handovers/2026-06-16-genetic-research-headstone-exploration.md` — the destination-state synthesis explaining *what complete looks like*. +> - `.claude/plans/3DGS-genetics-4x4-fanout-plan.md` — the predecessor exploratory plan (4×4 lane interpretation for static representation). This plan extends with **dynamics + counterfactual lane**. +> **Status:** initial proposal. 10 deliverables (D-GEN-1..10) + 3 gating probes. No code shipped yet. +> **Boundary:** representation + research tooling only. Not a clinical genomics product plan. No medical/diagnostic claims. + +--- + +## 0. Architectural decisions locked (do not re-litigate) + +Per the substrate's §0 anti-invention guardrail (lance-graph #496) and the no-new-variant contract test (#500): + +1. **No new `ValueSchema` enum variant.** Genetic-research rows ride `Full` or `Compressed` presets; specialisation is via `classid → ClassView` mint. +2. **No new `EdgeCodecFlavor` enum variant.** Genomic edges ride **`Pq32x4`** (`TurbovecResidue` tenant has shipped storage) or **`CoarseOnly`** (1-byte palette, no separate residue slab). `CoarseResidue` is **BLOCKED** until the operator mints its dedicated `ValueTenant` (`TD-COARSERESIDUE-NO-VALUE-TENANT`, `.claude/board/TECH_DEBT.md:40`) — pairing it with `Full` or `Compressed` today leaves the signed-4-bit residue unaddressable. +3. **No C++ source vendored into any `-rs` adapter crate.** htslib / bcftools / etc. stay upstream; harvested via the `ruff_cpp_spo` pattern (cross-repo handover at `AdaWorldAPI/ruff/.claude/handovers/2026-06-16-ruff-cpp-spo-handover.md`) when the harvester ships. +4. **Counterfactuals stay in their own lane.** `InferenceType::Counterfactual` mantissa = -6 is the mechanical enforcement. **NEVER written as observed SPO.** +5. **No new substrate primitives.** Every deliverable in this plan is **wiring** of existing primitives. If a deliverable feels like it needs a new substrate type, it's the wrong shape — escalate to the 5-specialist drift-catching pass before writing code. +6. **Closed-vocab discipline applies.** New genetic predicates land in `ruff_spo_triplet::Predicate` under the `predicate_count_locked_at_N` gate (`AdaWorldAPI/ruff` PR #5 pattern). + +--- + +## 1. Deliverables + +### D-GEN-1: `crates/adapter-genetics-experimental` scaffold + +**What:** new crate sibling to `lance-graph-callcenter`. Defines `GenomicSubstrate` trait analogous to `OcrProvider` (lance-graph #498) — the engine-agnostic seam that any provider (1000-Genomes / GIAB / clinical / synthetic) implements. Locks the 4×4 lane interpretation from `3DGS-genetics-4x4-fanout-plan.md` (`lane0: sequence coordinate / lane1: motif covariance / lane2: expression / methylation / lane3: time / sample / lineage`). + +**Where:** `crates/adapter-genetics-experimental/`. Cargo.toml deps on `lance-graph-contract` + `ruff_spo_triplet` (via cross-repo) + `serde` (no `clang` / `noodles` / heavy deps in the trait crate). + +**Lift:** ~1 day. Pure interface + locked-shape test (the `ruff_ruby_spo` PR #4 discipline). + +**Test gate:** `genomic_substrate_trait_compiles_and_has_default_impl`. + +### D-GEN-2: FASTA + VCF parsers (host `noodles-*`) + +**What:** opt-in feature flags `fasta`, `vcf` that host `noodles-fasta` and `noodles-vcf` (the pure-Rust bioinformatics ecosystem). Map `noodles::vcf::Record` → `lance_graph_contract::NodeRow` via a `VcfRecordTranscoder` analogous to `LayoutBlock::to_node_row` (lance-graph #498's `2fa7fcb0`). + +**Where:** `crates/adapter-genetics-experimental/src/fasta.rs` + `vcf.rs`. + +**Lift:** ~3 days. The `noodles` crate is mature; the transcode follows the OCR `LayoutBlock → NodeRow` pattern almost verbatim. + +**Test gate:** `vcf_record_transcodes_to_node_row_with_canonical_classid`. + +### D-GEN-3: k-mer → CAM-PQ 48-bit fingerprint + +**What:** `fn kmer_fingerprint(seq: &[u8], k: usize) -> Cam6x8` that computes the CAM-PQ 6-subspace × 256-centroid fingerprint of a k-mer-frequency vector. Use bgz17 11/17 stride for k-mer sampling (anti-moiré, see `BGZ17_ELEVEN_SEVENTEEN_RATIONALE.md`). + +**Where:** `crates/adapter-genetics-experimental/src/kmer.rs`. + +**Lift:** ~half-day. CAM-PQ codec is shipped (`crates/lance-graph/src/cam_pq/`); this is just feeding it. + +**Test gate:** `kmer_fingerprint_is_deterministic` + `bgz17_stride_visits_all_17_residues`. + +### D-GEN-4: Reference genome NodeGuid addressing mint + +**What:** pin the `classid → ClassView` registry entries for genomic classes: +- `classid 0x0001_0001` = `Cell::Generic` +- `classid 0x0001_0002` = `Cell::RAS_pathway_node` +- `classid 0x0001_0003` = `ViralIntegrationSite` +- `classid 0x0002_0001` = `GenomicPosition` (HEEL = organism / HIP = chromosome / TWIG = position-prefix) +- `classid 0x0002_0002` = `Variant` +- `classid 0x0003_0001` = `Gene` +- `classid 0x0003_0002` = `Transcript` +- `classid 0x0003_0003` = `Protein` + +(Specific classid values to be ratified by an OGAR mint pass — these are placeholders pinning the slot.) + +**Where:** `crates/adapter-genetics-experimental/src/class_mint.rs` + sync with `lance-graph-ontology` registry. + +**Lift:** ~1 day. Mostly registry entries + per-class `ClassView` declarations that ride existing `ValueSchema` presets. + +**Test gate:** `genomic_classids_ride_existing_presets_no_new_variant` (mirrors lance-graph #500's `ocr_schema_fit_rides_existing_preset_no_new_variant`). + +### D-GEN-5: GO / Reactome / ClinVar Pattern D hydrate_*() glue + +**What:** add three `hydrate_*()` glue functions over the shipped `OwlHydrator` / `MetaStructureHydrator` (Pattern D — Meta-Structure Hydration, `crates/lance-graph-ontology/src/hydrators/mod.rs:1-57`). **No new trait** — the substrate already ships `hydrate_dolce` / `hydrate_owltime` / `hydrate_provo` / `hydrate_qudt` / `hydrate_schemaorg` / `hydrate_skos` / `hydrate_fibo_fnd` / `hydrate_fibo_be` / `hydrate_odoo` / `hydrate_zugferd` / `hydrate_skr03` / `hydrate_skr04` as the proven pattern. Per `mod.rs` line 19: *"Each per-ontology hydrator is data + ~50 LOC of glue, never a bespoke crate."* + +- `hydrate_go(reg, source)` → Gene Ontology OBO/OWL via `OwlHydrator` with the `G` slot keyed for biological process / molecular function / cellular component. +- `hydrate_reactome(reg, source)` → Reactome pathway hierarchy (~2500 pathways) via `MetaStructureHydrator` declaring `inherits_from` for pathway-subpathway containment. +- `hydrate_clinvar(reg, source)` → ClinVar clinical-significance annotations as variant SPO edges with `Provenance::ClinicalCurated = (0.98, 0.95)` calibration. + +**Where:** `crates/lance-graph-ontology/src/hydrators/go.rs` + `reactome.rs` + `clinvar.rs`. Each file is *data + glue*: pick the parser, declare the `G` slot, name the parent, whitelist the cascade edge IRIs — mirrors the shipped `dolce.rs` / `provo.rs` / `skos.rs` shape. + +**Lift:** ~1 week (GO + Reactome are large; ClinVar is straightforward). + +**Test gate:** `go_term_count_matches_reference_release` + `reactome_pathway_hierarchy_round_trips` + `clinvar_provenance_calibration_matches_curated_pair`. + +### D-GEN-6: §14 oracle adapter for variant caller comparison + +**What:** `VariantCallerOracle` — a `OracleSubstrate` impl that compares two variant callers' output (e.g. GATK HaplotypeCaller vs. DeepVariant) on the same input, with `Provenance` annotations identifying which caller contributed which call. Uses `rubicon::oracle::compare_normalised` directly. + +**Where:** `crates/adapter-genetics-experimental/src/oracle.rs`. + +**Lift:** ~3 days. The §14 oracle is shipped; this is the genetics-domain mapping. + +**Test gate:** `caller_comparison_finds_known_discordant_calls_in_giab_subset`. + +### D-GEN-7: KRAS G12D 1024-cell counterfactual fan-out simulation + +**What:** the dynamics-axis flagship. Build a `MailboxSoA<1024>` representing a 32×32 cellular lattice; instantiate KRAS-G12D-vs-WT as `SplitPoles` at one row; fan-out via `consume_firing` over 100 cycles. Compare the observed lane (G12D propagates → MAPK cascade → tumor-suppressor loss) against the counterfactual lane (WT, no propagation). Run `revise_if_minority_wins` at cycle 100; verify the observed-lane victory aligns with published KRAS-G12D oncogenic-transformation rates. + +**Where:** `crates/adapter-genetics-experimental/src/sim/kras_propagation.rs` + `tests/kras_g12d_counterfactual_fanout.rs`. + +**Lift:** ~2 weeks. The substrate primitives are shipped; this is the integration + calibration against published outcomes. + +**Test gate:** `kras_g12d_propagation_outpredicts_wt_counterfactual_at_cycle_100` (with measured tolerance). + +### D-GEN-8: DeepNSM genetic-language reader probe + +**What:** point the existing `crates/deepnsm` sentence-level AriGraph reader at codon-triplet sequences. Treat each codon as a `NsmPrime`; verify `SentenceTransformer64` produces a `Sentence64 { p64, cam, spo_hint }` for DNA the same way it does for English. Compare reading-state convergence on protein-coding vs. non-coding sequences. + +**Where:** `crates/adapter-genetics-experimental/examples/genetic_language_probe.rs` (analogous to `lance-graph-arm-discovery/examples/coreference_rung_probe.rs`). + +**Lift:** ~1 week. Mostly fixture construction (codon-triplet NsmPrime mapping) and validation. + +**Test gate:** `deepnsm_genetic_language_p64_consistency_protein_coding_vs_noncoding`. + +### D-GEN-9: Histology splat extension — per-splat genomic profile + +**What:** mint `classid` for `HistologySplatWithGenomicProfile`. The splat carries: +- `key (16 B)`: spatial position in the histology slide. +- `edges (16 B)`: 12 in-family neighboring splats + 4 out-of-family (vascular / immune). +- `value (480 B)`: `Full` ValueSchema preset, with `Fingerprint` tenant carrying the Σ₁ SEED of the resident cell's expression vector and `HelixResidue` tenant carrying orientation. + +**Where:** `crates/adapter-genetics-experimental/src/histology_bridge.rs`. Reuses the splat-native ultrasound substrate (3DGS arc) verbatim. + +**Lift:** ~3 days. The splat substrate is shipped; this is the genomics-side extension. + +**Test gate:** `histology_splat_carries_genomic_profile_at_existing_preset_no_new_variant`. + +### D-GEN-10: §14 oracle benchmark against GIAB truth set + +**What:** run the substrate's variant-calling pipeline (D-GEN-2 + D-GEN-3 + D-GEN-6) against the Genome in a Bottle Consortium truth sets (HG001-HG004). Report sensitivity / precision / F1 with `Provenance` calibration for each call. + +**Where:** `crates/adapter-genetics-experimental/benches/giab_benchmark.rs`. + +**Lift:** ~1 week. Most of this is data wrangling. + +**Test gate:** `giab_hg002_f1_meets_published_minimum_for_call_set_v4_2_1`. + +--- + +## 2. Gating probes (before any FINDING-grade claim) + +Per the `lance-graph` PR #500 discipline (probes spec before measured claims): + +### PROBE-CHAODA-1000G + +**What it gates:** the claim *"CHAODA detects novel variants without trained classifier."* + +**Method:** build CLAM tree on 1000-Genomes Phase 3 feature vectors (AF, depth, strand bias, neighbourhood entropy, conservation score). Compute CHAODA anomaly scores on a held-out test set containing known novel singletons. Measure ROC-AUC against ground truth. + +**Pass condition:** ROC-AUC ≥ 0.85 on novel-singleton detection (calibrated against published CADD / REVEL baselines). + +**Implementation:** ~3 days. The CLAM + CHAODA kernels are shipped; this is fixture + scoring. + +### PROBE-KRAS-COUNTERFACTUAL-DET + +**What it gates:** the claim *"KRAS counterfactual fan-out is deterministic."* + +**Method:** run D-GEN-7's KRAS G12D 1024-cell simulation twice with identical seeds. Bit-compare the final `MailboxSoA.energy[]` arrays and the counterfactual-lane edge counts. + +**Pass condition:** bit-exact match across two runs (the substrate's no-randomness invariant applies; any deviation indicates an unmarked f32 nondeterminism). + +**Implementation:** ~2 days; included in D-GEN-7's test scope. + +### PROBE-CAM-PQ-VS-BLAST + +**What it gates:** the claim *"CAM-PQ 48-bit fingerprint approximates sequence similarity."* + +**Method:** on a held-out RefSeq subset, compare CAM-PQ Hamming-distance rankings against BLAST e-value rankings. Compute Spearman ρ and ICC (the substrate's reliability stats from ndarray #218). + +**Pass condition:** Spearman ρ ≥ 0.7 against BLAST e-value top-100 rankings, and ICC ≥ 0.6. (The Σ₁ SEED preserves 94% of Jina semantic similarity on language; biological-sequence similarity may calibrate differently.) + +**Implementation:** ~1 week. Includes the protein-language-model embedding pipeline (ESM/ProtBERT load via existing GGUF loader + ndarray AMX int8 GEMM at 197 GMAC/s). + +--- + +## 3. Sequencing + +| Phase | Deliverables | Cumulative time | +|---|---|---| +| **P1** | D-GEN-1, D-GEN-2, D-GEN-3, D-GEN-4 | ~2 weeks | +| **P2** | D-GEN-5, PROBE-CHAODA-1000G | ~3 weeks | +| **P3** | D-GEN-6, D-GEN-7, PROBE-KRAS-COUNTERFACTUAL-DET | ~6 weeks | +| **P4** | D-GEN-8, D-GEN-9, PROBE-CAM-PQ-VS-BLAST | ~8 weeks | +| **P5** | D-GEN-10 | ~9 weeks | + +**Minimum viable hand-off:** end of P1. The adapter crate compiles, FASTA/VCF round-trip, k-mer fingerprints are computable, the class mint pins the genomic classid range. That's the *"shape locked"* milestone analogous to `ruff_ruby_spo`'s locked-shape test in PR #4 — enough for a genomics-domain session to pick up and continue without architectural drift. + +**The flagship deliverable:** D-GEN-7 (KRAS counterfactual fan-out). When that ships and PROBE-KRAS-COUNTERFACTUAL-DET passes, the dynamics axis is proven and the substrate is differentiated from every standard bioinformatics tool. + +--- + +## 4. Open questions for the operator + +1. **Tesseract-style corpus pin:** which 1000-Genomes / GIAB / ClinVar release version pins the corpus? Pin one before P2 starts. +2. **Class IDs:** the placeholder `0x0001_0001` etc. need an OGAR mint pass before being committed to the registry. +3. **D-GEN-7 calibration target:** which published study's KRAS-G12D oncogenic-transformation rate is the canonical pass target? (Suggested: Hancock et al. 2002 or a more recent meta-analysis; operator to confirm.) +4. **Ontology release pinning:** GO release cadence is monthly; ClinVar is weekly; Reactome is quarterly. Per-hydrator pinning strategy? +5. **Clinical-vs-research boundary:** this plan stays on the research-tooling side (the existing `3DGS-genetics-4x4-fanout-plan.md` line: *"no medical/diagnostic claims are made"*). The §14 oracle's variant-caller comparison is research benchmarking, not clinical decision-support. Confirm that holds. + +--- + +## 5. Cross-references + +- **Pattern-recognition hand-off:** `docs/GENETIC_RESEARCH_VIA_STACK.md` — the *why* doc for a domain expert. +- **Headstone:** `.claude/handovers/2026-06-16-genetic-research-headstone-exploration.md` — the destination-state synthesis. +- **Predecessor plan:** `.claude/plans/3DGS-genetics-4x4-fanout-plan.md` — static representation, 4×4 lane interpretation. This plan extends with dynamics + counterfactual. +- **Cross-repo C++ harvester (relevant if htslib transcoding wanted):** `AdaWorldAPI/ruff/.claude/handovers/2026-06-16-ruff-cpp-spo-handover.md` + `AdaWorldAPI/ruff/.claude/handovers/2026-06-16-ruff-cpp-headstone-exploration.md`. +- **Upstream substrate context:** `lance-graph` PR #491 (entropy × energy framing) + PR #494 (EntropyRung / Quadrant / nars_entropy) + PR #495 (3-byte EdgeRef witness; reliability ⊥ causality empirically) + PR #496 (ValueSchema presets + §0 anti-invention) + PR #498 (GUID decode→read-mode keystone; helix Signed360; OCR→NodeRow transcode template) + PR #500 (rebaseline + no-new-variant contract test). diff --git a/docs/GENETIC_RESEARCH_VIA_STACK.md b/docs/GENETIC_RESEARCH_VIA_STACK.md new file mode 100644 index 00000000..cb27a559 --- /dev/null +++ b/docs/GENETIC_RESEARCH_VIA_STACK.md @@ -0,0 +1,237 @@ +# Genetic Research via the AdaWorld Stack — Pattern-Recognition Hand-off + +> **For:** a geneticist / bioinformatician / computational-biology researcher who knows their domain cold but has not seen this stack. +> **Reading time:** ~30 minutes. +> **Promise:** every shipped claim cites a file:line in this workspace. Nothing here is speculation about *"what we could build"* — it's *"what already runs."* Where we propose new work, it's named as proposal and the lift is estimated honestly. +> **Companions:** `.claude/plans/genetic-research-substrate-integration-v1.md` (the implementation plan), `.claude/handovers/2026-06-16-genetic-research-headstone-exploration.md` (the destination-state synthesis). + +--- + +## 0. The bet + +The stack you're about to look at was built for cognitive workloads (visual splats, ontology cascades, sentence understanding, OCR transcoding). Eight of its already-shipped primitives turn out to be **isomorphic** to bioinformatics machinery you'd otherwise have to assemble yourself: CAM-PQ fingerprints map to sequence sketching; HHTL nibble-tries map to genomic coordinate cascades; CLAM trees map to chromosome→arm→band→gene hierarchies; CHAODA anomaly scoring maps to novel-variant detection; bgz17 11/17 strides map to anti-aliased k-mer sampling; the entropy×energy substrate-state plane maps to Bayesian variant-evidence accumulation; the MailboxSoA + counterfactual-mailbox pair maps to cancer-pathway propagation simulation; and the Pearl-2³ mask in `CausalEdge64` maps directly to do-calculus on driver-mutation graphs. + +The bet: **build the genetic-research adapter as a thin domain wiring on top of these primitives instead of as a new bioinformatics tool**, and the resulting substrate accumulates evidence across callers / studies / literature / cohorts in one graph with provenance, counterfactual lanes, and falsifiable certificates — something none of GATK / DeepVariant / nf-core / single-tool pipelines ship today. + +--- + +## 1. Eight shapes you already know, eight primitives we already shipped + +Each subsection: domain shape → substrate primitive → where it lives (file:line) → why the composition makes you blink. + +### 1.1 Sequence sketches (MinHash / ntHash / MASH) ↔ CAM-PQ 48-bit fingerprint + +**You know:** a sequence-similarity sketch is a fixed-width fingerprint such that Hamming or Jaccard distance on fingerprints approximates a known sequence-distance proxy (usually Mash-style Jaccard estimating ANI). + +**We have:** **CAM-PQ 6×256 = 48-bit fingerprint** (`crates/lance-graph/src/cam_pq/storage.rs:9`, `ndarray/.claude/knowledge/pr-x12-cam-pq-sigker-dn-tree-substrate-bindings.md:22`). Six product-quantization subspaces × 256 Lloyd-Max centroids per subspace = exactly 48 bits. Validated: **48 bits captures ~94% of Jina 1024-D semantic similarity on SimLex-999** (`lance-graph/.claude/knowledge/linguistic-epiphanies-2026-04-19.md:299-312`). + +**The composition:** swap the Jina-1024D embedding for any sequence embedding (ESM, ProtBERT for proteins; or a learned DNA embedding; or even a frequency-vector of k-mer counts). The same 6-subspace PQ + 256-centroid Lloyd-Max codec gives you a Mash-compatible fingerprint at substrate-native width with measured fidelity. Hamming distance lights up the AVX-512BW LUT scan kernel (`crates/lance-graph-turbovec/KNOWLEDGE.md:60-92`): **76 µs/query at n=20,000, dim=512, recall@10 = 0.785** — that's BLAST-against-RefSeq territory on commodity silicon. + +### 1.2 Genomic coordinates ↔ HHTL nibble-trie addressing + +**You know:** `chr1:1234567` is a hierarchical address that ordinarily lives as a flat (chromosome, position) tuple, and you do range queries by binary search or interval trees. + +**We have:** **HHTL = Hierarchical Hash Trie Lattice** (`crates/lance-graph-contract/src/hhtl.rs`). A 16-ary nibble trie with depth 16 (= 64 bits of address), where every nibble is one cascade level, prefix-shift is `is_ancestor_of`, and subtree enumeration is a native range scan under `addr64 = path << 4·(16−depth)`. Live in `lance_graph_contract::NodeGuid` as the bytes `HEEL · HIP · TWIG` (3 × u16 = 12 nibbles, with classid above and family/identity below). + +**The composition:** map (organism, chromosome, region, gene, position) onto the cascade. A genomic-region range query becomes a single subtree scan; nearest-gene-by-locus becomes a parent-prefix walk. **No B-tree, no interval forest, no rebuild on insertion** — the substrate already does the cascade-walks via shift/mask arithmetic. And the **same address space holds the patient ID (in `family`) and the variant ID (in `identity`)** — so "find every variant in gene X within population Y" is one prefix scan. + +### 1.3 Chromosome→arm→band→gene ↔ CLAM 3-level 16-way tree + +**You know:** cytogenetic coordinates form a fixed multi-level hierarchy (22 + XY chromosomes → ~25 chromosomal bands per arm → ~20,000 genes → ~3 × 10⁹ bases). Your tooling tends to treat each level as a separate index. + +**We have:** **CLAM (Clustered Hierarchical Approximate matching)** ships as `ndarray::hpc::clam` (`ndarray/src/hpc/clam.rs`, ~1600 lines), with the canonical 3-level 16-way layout (`lance-graph/.claude/session_2026_04_11_bf16_hhtl_combined_research.md:127-148`): + +``` +bits 15..12 = L0: 16 coarse clusters (HEEL scan target) +bits 11..8 = L1: 256 mid-clusters (HIP, 1:1 Jina-v5 centroids) +bits 7..4 = L2: 4096 terminal (TWIG, COCA alignment) +``` + +**The composition:** the CLAM tree IS your chromosome-arm-gene cascade when populated with genomic centroids instead of language-embedding centroids. Hierarchical clustering on chromosomal-band features gives you 16 coarse → 256 mid → 4096 terminal — *a roughly gene-resolved leaf count by construction*. The substrate's `silhouette` / `Cronbach α` / `ARI` cluster-quality probes (`lance-graph/.claude/probe_m1_result_2026_04_11.md`) tell you whether your cytogenetic hierarchy actually emerges from your feature distance, or whether your distance metric is wrong. **That's a falsifiable certificate, not a hand-drawn diagram.** + +### 1.4 Novel-variant detection ↔ CHAODA anomaly on LFD distribution + +**You know:** identifying a rare or de novo variant against population catalogues is essentially an outlier-detection problem in a high-dimensional feature space (allele frequency × read-depth × strand bias × neighborhood × etc.). Tools like CADD, REVEL, AlphaMissense produce per-variant scores. + +**We have:** **CHAODA** (Clustered Hierarchical Anomaly and Outlier Detection Algorithm, Ishaq et al. 2021) shipped as Phase 4 of `ndarray::hpc::clam` (`ndarray/src/hpc/clam.rs:1493-1560`): + +```rust +pub struct AnomalyScore { + pub score: f32, // normalised in [0, 1]; higher = more anomalous + pub awareness: AwarenessState, // classification derived from anomaly level +} + +impl ClamTree { + pub fn anomaly_scores(&self, data: &[u8], vec_len: usize) -> Vec; + // Local Fractional Dimensionality (LFD) per cluster; high LFD = complex local geometry +} +``` + +**The composition:** build a CLAM tree on your per-variant feature vectors; CHAODA scores every variant against the local manifold's intrinsic dimensionality. A novel variant in a region of high LFD lights up as `AnomalyScore { score → 1.0, awareness → Salient }` because its position differs from the population's local manifold — *without you having to train a classifier or annotate a truth set first*. This is *unsupervised* outlier detection on the same tree your range queries walk. + +### 1.5 minimap2 minimizers ↔ bgz17 11/17 X-Trans stride + +**You know:** minimizers are a way to sample k-mer positions sparsely but representatively: pick the lexicographically smallest k-mer in every length-w window. Goal: dramatic sketching at minimal accuracy cost. + +**We have:** **bgz17 11/17 golden stride** (`crates/bgz17/src/lib.rs:53-60`, `lance-graph/.claude/BGZ17_ELEVEN_SEVENTEEN_RATIONALE.md`). On a prime base of 17, the stride `11 = round(17/φ) = round(10.506) = 11` is *both* coprime with 17 (full permutation guaranteed: `(i·11) mod 17` visits all 17 residues exactly once) AND closest to a true φ-rotation. Result: **maximally-irrational integer stride** that minimises periodic resonance — the same anti-aliasing principle Fujifilm uses in the X-Trans sensor CFA. Validated as anti-moiré with provable bounds (`OGAR/docs/DISCOVERY-MAP.md` D-BGZ17). + +**The composition:** minimizers reduce window-redundancy via "pick the min hash"; the bgz17 stride reduces sequence-position redundancy via "pick positions at golden-irrational offsets." Both achieve aperiodic sampling; bgz17's bound is *number-theoretic* (provable via gcd + φ-approximation), while minimizers' bound is empirical. For sketching long sequences (k-mers, codon triplets, peptide windows), the bgz17 stride is a drop-in sampler with the same accuracy-vs-density trade-off as minimizers but a falsifiable theoretical bound. + +### 1.6 Bayesian variant-evidence accumulation ↔ entropy × energy substrate-state plane + +**You know:** variant calling resembles Bayesian update — each supporting read shifts the posterior on `P(variant | evidence)`. GATK's HaplotypeCaller does this explicitly via PairHMM + active-region rescoring; DeepVariant does it implicitly via a CNN. The clinical curation step (ACMG/AMP classification) is another layer of evidence accumulation: literature, segregation, in-silico predictions, functional studies. + +**We have:** **the entropy × energy substrate-state plane** (`bardioc/SUBSTRATE_STATE_FRAMINGS.md`, lance-graph PR #491 §6, lance-graph PR #494's `EntropyRung` + `Quadrant` enums + `nars_entropy(f, c) = 1 − c · |2f − 1|`). Four quadrants: + +``` + high energy + │ + Confusion / Chaos │ Wisdom (crystalline) + (in-progress climb) │ (the integrated apex) + ──────────────────────┼────────────────────── + Staunen │ Boredom / Inert + (cognitive pressure) │ (ordered but not energised) + │ + low energy + + high entropy ←──────────────────→ low entropy +``` + +Variant-evidence accumulation as a substrate trajectory: a *novel* variant arrives in **Staunen** (high entropy: little supporting evidence; low energy: no prior accumulation). As reads accumulate and the variant survives filter passes, it crosses through **Confusion / Chaos** (energy invested but the entropy hasn't yet collapsed into a clean call). After cohort confirmation + ClinVar consistency + literature support, it settles into **Wisdom** (low entropy, high accumulated evidence — call confidence near 1.0, ACMG class Pathogenic or Likely Pathogenic). And `ρ(entropy, prediction accuracy) = −0.78` is **measured** (ndarray PR #218) — that's not a theoretical claim. Entropy IS a validated reliability proxy. + +**The composition:** every variant in your store carries a quadrant classification derived from `MailboxSoA.energy` (per-row signed spatio-temporal accumulator) + `plasticity_counter` (saturating Hebbian, = lifetime evidence count) + classid-prefix codebook hit-rate (the entropy proxy). The §14 oracle's *"provenance-normalized equivalence"* compares two pipelines' quadrant assignments; disagreements are exactly where calibration matters. + +### 1.7 Cancer-pathway propagation / virus genome integration ↔ MailboxSoA + counterfactual mailbox + +**You know:** RAS/MAPK pathway propagation, MYC-driven transcriptional cascades, viral integration site preferences (HPV, HBV, EBV oncogenic integration), and metastatic seeding all have the same shape: a discrete event triggers a cascade through a graph of cellular sites under partial observability, with *counterfactual* branches you'd love to simulate but can't easily express in standard pipelines. + +**We have:** +- **`MailboxSoA`** (`crates/cognitive-shader-driver/src/mailbox_soa.rs`) — per-row SoA with `energy: [f32; N]` (signed spatio-temporal accumulator), `plasticity_counter: [u8; N]` (saturating Hebbian = lifetime activation count), `last_active_cycle: [u32; N]` (in-place consumption stamp), `edges: [CausalEdge64; N]` (LE baton edge per row), `consume_firing(row) -> bool` (in-place active inference: row threshold-crosses, energy resets, stamp advances). +- **`CounterfactualMailbox`** (`crates/lance-graph-contract/src/counterfactual.rs:232`) — v3 split-resolution: `SplitPoles` (the alternative-pole representation), `deposit_counterfactual` (writes the minority pole as `InferenceType::Counterfactual` with `to_mantissa() = -6` into the episodic edge), `revise_if_minority_wins` (free-energy comparison flips the canonical reading). **Iron invariant in the doc comment:** *"A counterfactual stays in a separate lane — it is NEVER written as observed SPO. The `InferenceType::Counterfactual` tag is the mechanical enforcement of that invariant."* +- **`SPAWN_DISSONANCE_THRESHOLD: f32 = 0.55`** — calibrated threshold for when a substrate state diverges enough to be *"worth a counterfactual test."* + +**The composition:** map each MailboxSoA row to one cell / one transcription-factor binding site / one viral integration site: +- `energy[row]` = RAS-GTP fraction at this site / viral copy number / cytokine concentration. +- `plasticity_counter[row]` = lifetime cumulative activation (DIKW-climb). +- `last_active_cycle[row]` = last firing time (replication burst recency). +- `edges[row]` 12 in-family slots = 12 neighbouring cells in the tissue lattice; 4 out-of-family = vascular / immune / inter-organ. +- `qualia[row]` = 16-i4 affective lane for stress / hypoxia / immune-surveillance signal. + +At each `consume_firing(row)`: if local dissonance > 0.55, spawn `CounterfactualMailbox` with `SplitPoles` for the alternative cascade path (e.g. KRAS G12D vs. wild-type at this site; integration at locus A vs. B). Both poles fan out forward — majority writes observed SPO edges into the AriGraph episodic chain; minority stays as `InferenceType::Counterfactual` in a separate lane. After K cycles, `revise_if_minority_wins` flips the canonical reading if the counterfactual lane out-predicts the observed. **This is Friston active inference at the per-row level, with the counterfactual ledger preserved for retrospective re-analysis** — the substrate-native analog of "what would have happened if this driver mutation had occurred earlier / elsewhere." + +### 1.8 do-calculus / Pearl causality on driver-mutation graphs ↔ Pearl-2³ mask in CausalEdge64 + +**You know:** Judea Pearl's causal hierarchy distinguishes observation (P(Y | X)) from intervention (P(Y | do(X))) from counterfactual (P(Y_X | X', Y')). For driver-mutation networks (TP53, BRAF, KRAS, PIK3CA, ...) you want to ask all three. + +**We have:** **`CausalEdge64` bit layout includes a Pearl mask** (`crates/lance-graph-contract/src/high_heel.rs:202`): *"S/P/O palette + NARS + Pearl mask + inference + plasticity + temporal."* Plus `crate::pearl_junction` (`crates/lance-graph-contract/src/pearl_junction.rs`) classifies *"the three structural relations the Pearl-junction classifier needs"* (`hhtl.rs:231`). And **`PEARL_SUBSETS`** in `ndarray::hpc::entropy_ladder` (#494) ships the 2³ = 8 hypothesis subsets ready for Pearl do-calculus on the SPO graph — the 8 octants of {observed, do-intervened, counterfactual} × {S, P, O}. + +**The composition:** every causal-edge in your variant graph carries the 8-subset Pearl mask in-band. `decompose_spo` reads the 3 × palette-256 SPO already encoded in `CausalEdge64` — *no re-quantization needed* (#494 architecture decision). For a KRAS → MAPK → ERK → MYC cascade, the per-edge Pearl mask lets you query: *"given the observed presence of KRAS G12D and observed MYC overexpression, what's the counterfactual probability of MYC overexpression in the do(KRAS-WT) intervention?"* The substrate addresses the 8 octants natively; you write the query, not the do-calculus engine. + +--- + +## 2. The architecture + +```text +upstream domain corpora (FASTA / VCF / BAM / GFF / 1000-Genomes / ClinVar / GO / Reactome) + ↓ (parsers — proposed, see plan §1) +adapter-genetics-experimental crate (proposed; thin domain wiring, no new substrate) + ↓ +classid → ClassView resolves which ValueSchema preset materialises per-row + ↓ +ndarray::hpc primitives lance-graph-contract primitives + CAM-PQ 48-bit fingerprint canonical NodeGuid (classid·HEEL·HIP·TWIG·family·identity) + CLAM 3-level 16-way tree EdgeBlock (12 in-family + 4 out-of-family) + CHAODA anomaly score NodeRow = 512 B = key(16) | edges(16) | value(480) + bgz17 11/17 X-Trans stride MailboxSoA (energy / plasticity / edges / qualia) + matrix exp (Padé) EdgeCodecFlavor / ValueSchema (per-class via classid) + softmax / log-softmax (axis-aware SIMD) CausalEdge64 with Pearl 2³ mask + Lie-algebra Lyndon log-signature CounterfactualMailbox (SplitPoles + revise_if_minority_wins) + ↓ +lance / lancedb columnar SPO store + ↓ +§14 oracle (provenance-normalised equivalence: rubicon::oracle::compare_normalised) + ↓ +queryable, falsifiable, counterfactual-preserving variant/cohort/pathway graph +``` + +--- + +## 3. What's running today vs. what we propose + +| Layer | Status | +|---|---| +| CAM-PQ 48-bit fingerprint + Hamming LUT scan | **shipped** (`lance-graph-turbovec`, validated 76 µs/query @ n=20K) | +| CLAM 3-level 16-way tree | **shipped** (`ndarray::hpc::clam`, 1600 lines, with silhouette / ARI / Cronbach α probes) | +| CHAODA anomaly score | **shipped** (`ndarray::hpc::clam.rs:1493-1560`) | +| bgz17 11/17 stride | **shipped** (`crates/bgz17/`, X-Trans rationale documented) | +| HHTL nibble-trie | **shipped** (`lance-graph-contract::hhtl::NiblePath`) | +| `NodeGuid` / `EdgeBlock` / `NodeRow` canonical layout | **shipped** (`canonical_node.rs`, lance-graph #489/#490) | +| `MailboxSoA` + `consume_firing` | **shipped** (`mailbox_soa.rs`) | +| `CounterfactualMailbox` + `SplitPoles` + `revise_if_minority_wins` | **shipped** (`counterfactual.rs`) | +| Pearl-2³ mask in `CausalEdge64` + `pearl_junction` classifier + `PEARL_SUBSETS` | **shipped** | +| Entropy × energy substrate-state plane + `EntropyRung` + `Quadrant` + `nars_entropy` | **shipped** (#491, #494, #495) | +| `OntologyRegistry` + Pattern D hydrators (`hydrate_dolce` / `hydrate_owltime` / `hydrate_provo` / `hydrate_qudt` / `hydrate_schemaorg` / `hydrate_skos` / `hydrate_fibo_fnd` / `hydrate_fibo_be` / `hydrate_odoo` / `hydrate_zugferd` / `hydrate_skr03/04`) over shipped `OwlHydrator` + `MetaStructureHydrator` + 47 KB Lance dictionary cache | **shipped** (`lance-graph-ontology/src/hydrators/mod.rs:1-57`) | +| §14 oracle (`rubicon::oracle::compare_normalised`) | **shipped** (`bardioc/rubicon`) | +| DeepNSM sentence-level AriGraph reader | **shipped** (lance-graph #479; 200 tests) | +| ndarray AMX int8 GEMM (197 GMAC/s on Emerald Rapids) | **shipped** (ndarray #217) | +| Histology splat bridge (same Gaussian math at cm/mm/µm) | **shipped framing** (devcon flyers; splat-native ultrasound arc) | +| ───────────────────────────────────────────── | ───────────────── | +| `crates/adapter-genetics-experimental` scaffold | **proposed** (plan D-GEN-1) | +| FASTA / VCF / BAM parsers (host `noodles-*`) | **proposed** (plan D-GEN-2) | +| k-mer → CAM-PQ fingerprint function | **proposed** (plan D-GEN-3) | +| GO / Reactome / ClinVar Pattern D `hydrate_*()` glue (over shipped `OwlHydrator` / `MetaStructureHydrator` — no new trait) | **proposed** (plan D-GEN-5) | +| KRAS G12D 1024-cell counterfactual fan-out simulation | **proposed** (plan D-GEN-7) | +| §14 oracle benchmark against GIAB truth set | **proposed** (plan D-GEN-10) | + +--- + +## 4. The dynamics axis (what nobody ships today) + +The eight pattern matches above are individually compelling. The synthesis that nobody ships: + +> **Cancer-pathway propagation simulation with counterfactual lanes preserved.** + +Standard tools (GATK / DeepVariant) call variants from a single sample. Pathway tools (Reactome / WikiPathways) annotate observed pathway membership. Network-inference tools (CellNOpt / SCENIC) reconstruct pathway connections from expression data. *None of them* preserve the counterfactual branch — *"what if KRAS had mutated at codon 12 instead of codon 13 in this tumor's earliest clonal expansion?"* — as a persistent, query-visible substrate lane. + +The substrate does. `CounterfactualMailbox` is the storage; `revise_if_minority_wins` is the update rule; `InferenceType::Counterfactual` is the mechanical guarantee that counterfactual edges never leak into observed SPO. For a cancer cohort, this is the equivalent of *"every patient's tumor carries both its actual driver mutation history AND the counterfactual alternatives the substrate's Friston-FEP active inference deemed worth tracking,"* and the §14 oracle can compare counterfactual lanes across patients to spot *"this counterfactual branch was rejected in 9 out of 10 tumors but accepted in tumor X — what's structurally different about X?"* That's clinical-research signal nobody else can address natively. + +--- + +## 5. The first cargo invocation that proves it + +Today, with the shipped substrate (zero new code), you can: + +```bash +# Build the CHAODA-on-vectors substrate (already green on main) +cd /home/user/ndarray +cargo test -p ndarray --lib clam::tests::chaoda +``` + +This runs the existing CHAODA test suite. Replace the test vectors with a 1000-Genomes feature matrix loaded from VCF (manually parsed for now; the FASTA/VCF adapter is plan D-GEN-2) and you have novel-variant detection running today, against the same CLAM tree that does language-embedding retrieval. The kernel is the same; the input is genomic. + +For the dynamics axis, the runnable proof of concept is: + +```bash +cd /home/user/lance-graph +cargo test -p lance-graph-contract --lib counterfactual::tests +``` + +This exercises `SplitPoles`, `deposit_counterfactual`, `CounterfactualMailbox`, and `revise_if_minority_wins`. Replace the test poles with a KRAS-G12D-vs-WT split at codon 12 of one mailbox row, fan-out via `consume_firing` across a 1024-row cellular lattice, and you have the cancer-cascade simulation harness running today. The wiring (plan D-GEN-7) is ~1-2 days of work; the substrate is shipped. + +--- + +## 6. Why this is worth your time, in one paragraph + +You'd build the genetics tool either way. What this substrate gives you that you cannot easily reach for anywhere else is **a counterfactual-lane-preserving SPO graph with Friston-FEP-calibrated variant evidence accumulation, native Pearl-2³ do-calculus addressing, CHAODA-grade unsupervised novel-variant detection, and provenance-normalised cross-pipeline equivalence checking — all running on shipped kernels** (CAM-PQ, CLAM/CHAODA, bgz17, AMX int8 GEMM at 197 GMAC/s, the entropy × energy plane validated at ρ = −0.78 against prediction accuracy). The work to unlock genetics-specific use is mostly file-format parsers + ontology hydrators + a domain class-mint, not new substrate. Most of the eight pattern matches above are one-day to one-week tasks each; the four-week first-deliverable target in the implementation plan is honest. + +--- + +## 7. Where to go next + +- **Implementation plan:** `.claude/plans/genetic-research-substrate-integration-v1.md`. Names 10 deliverables D-GEN-1..10 + 3 gating probes (CHAODA-on-1000G, KRAS-counterfactual-determinism, CAM-PQ-vs-BLAST agreement). Per-deliverable lift estimate ranges from a-few-hours to two-weeks. +- **Headstone:** `.claude/handovers/2026-06-16-genetic-research-headstone-exploration.md`. Capstone synthesis for the destination state — what the substrate looks like *when complete* for genetic research, what era closes (per-tool pipeline silos), what era opens (one counterfactual-preserving graph). +- **The existing exploratory plan:** `.claude/plans/3DGS-genetics-4x4-fanout-plan.md`. Predates this synthesis; covers the static representation (4×4 lane interpretation: sequence coordinate / motif / expression / time). This document IS the *dynamics + counterfactual* extension that the static plan was missing. + +--- + +_Authored 2026-06-16 by external session `AdaWorldAPI/bardioc` `session_01VysoWJ6vsyg3wEGc5v7T5v`. Every file:line citation in this document points to an actual file in this workspace as of the authoring date. If a citation has rotted, treat it as a discipline failure on the author's side and report it; the substrate's no-confabulation rule applies to this document too._