Commit e192266

Merge pull request #501 from AdaWorldAPI/claude/genetic-research-substrate-v1
docs(genetics): pattern-recognition hand-off + integration plan + headstone (for a domain expert new to the stack)
2 parents adbcbdc + cd53ab5 commit e192266

3 files changed

Lines changed: 744 additions & 0 deletions

Lines changed: 293 additions & 0 deletions
@@ -0,0 +1,293 @@
# Genetic Research Headstone Exploration: lance-graph

## Purpose

This document is a headstone exploration for the full line of thought connecting:

```text
upstream domain corpora (FASTA / VCF / BAM / GFF / 1000-Genomes / ClinVar / GO / Reactome / htslib)
ndarray::hpc::clam + CHAODA (3-level 16-way clustering + LFD anomaly)
ndarray::hpc::cam_pq (6 codebooks × 256 entries = 6 × 8 = 48-bit Lloyd-Max fingerprint; 94 % of Jina 1024-D)
ndarray::hpc::activations (exp / log / softmax / matrix exp / Lie-algebra Lyndon log-signature)
ndarray::hpc::amx_matmul (197 GMAC/s int8 GEMM on Emerald Rapids)
bgz17 (11/17 X-Trans stride for anti-moiré k-mer sampling)
lance-graph-contract (canonical NodeGuid · EdgeBlock · NodeRow; HHTL nibble-trie; MailboxSoA<N>; CounterfactualMailbox)
lance-graph-ontology (OntologyRegistry + Pattern D OwlHydrator/MetaStructureHydrator + hydrate_*() glue (dolce/owltime/provo/qudt/schemaorg/skos/fibo_fnd/fibo_be/odoo/zugferd/skr03/skr04) + 47 KB Lance dictionary cache + wikidata_hhtl)
lance-graph-arm-discovery (reliability suite: Pearson / Spearman / Cronbach α / ICC(2,1))
deepnsm (sentence-level AriGraph reader; P64 / Cam4096 / Crystal4096)
rubicon (§14 oracle: compare_normalised with provenance fields)
adapter-genetics-experimental (NEW: thin domain wiring; not yet built)
```
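
The `cam_pq` entry in the listing above describes a fingerprint built from six 8-bit codebook indices. As a minimal sketch of that arithmetic only, the code below packs six indices into the low 48 bits of a `u64` and unpacks them again. The function names, the byte order, and the `differing_subspaces` helper are assumptions made for illustration; they are not taken from the shipped codec.

```rust
/// Pack six 8-bit codebook indices into the low 48 bits of a u64.
/// Byte order (index 0 in the lowest byte) is an assumption of this sketch.
pub fn pack_cam6x8(codes: [u8; 6]) -> u64 {
    codes
        .iter()
        .enumerate()
        .fold(0u64, |acc, (i, &c)| acc | ((c as u64) << (8 * i)))
}

/// Recover the six indices from a packed 48-bit fingerprint.
pub fn unpack_cam6x8(fp: u64) -> [u8; 6] {
    let mut codes = [0u8; 6];
    for (i, slot) in codes.iter_mut().enumerate() {
        *slot = ((fp >> (8 * i)) & 0xFF) as u8;
    }
    codes
}

/// Count the subspaces in which two fingerprints selected different entries.
/// This is a crude hypothetical dissimilarity, not the codec's distance.
pub fn differing_subspaces(a: u64, b: u64) -> u32 {
    let (ca, cb) = (unpack_cam6x8(a), unpack_cam6x8(b));
    ca.iter().zip(cb.iter()).filter(|(x, y)| x != y).count() as u32
}
```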

The goal is to preserve the architectural synthesis for genetic research *before* implementation details scatter it into separate plans, so that a domain expert approaching the substrate lands on the destination shape, not on the next tactical PR.

---

## Capstone thesis

```text
Bioinformatics ships pipelines.
Each pipeline is a tool with its own grammar, its own confidence calibration,
its own provenance discipline (usually informal), and its own failure modes.

The substrate ships a SHAPE.
Every variant, every annotation, every counterfactual, every cohort summary
takes the same shape: a row in MailboxSoA<N>, with an entropy×energy plane
classification derived from its Hebbian plasticity counter and its
last-active-cycle stamp, with adjacency in EdgeBlock, with class-resolved
value tenants in the 480-byte slab, with NARS truth (frequency, confidence)
calibrating its provenance, with Pearl-2³ subset addressing in CausalEdge64
for do-calculus, with counterfactual minority poles preserved in a separate
lane via InferenceType::Counterfactual.

One graph, every domain layer composes:
sequence sketches as CAM-PQ fingerprints;
genomic coordinates as HHTL cascade addresses;
novel variants as CHAODA anomaly scores against LFD distribution;
pathway propagation as MailboxSoA active-inference fan-out;
counterfactual driver-mutation histories as CounterfactualMailbox lanes;
causality as Pearl-2³ subset queries;
literature evidence as DeepNSM SPO triples;
cross-pipeline equivalence as §14 oracle verdicts.

The genetic researcher's job is the domain mint:
which classid for which entity?
which ontology hydrator for which vocabulary?
which counterfactual gates the next study question?

Everything below the domain mint is shipped.
```
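
The thesis above relies on NARS truth values (frequency, confidence) to calibrate provenance as evidence accumulates. The sketch below shows the standard NAL revision rule for pooling two independent pieces of evidence about the same statement. Whether the substrate applies exactly this rule is an assumption; the `Truth` type, the `revise` method, and the horizon constant are hypothetical names introduced here. Two equally confident but contradictory reports, `(1.0, 0.5)` and `(0.0, 0.5)`, revise to frequency 0.5 with confidence 2/3: the conflict averages the frequency while the pooled evidence raises the confidence.

```rust
/// A NARS-style truth value: frequency and confidence.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Truth {
    pub frequency: f64,  // in [0, 1]
    pub confidence: f64, // in [0, 1)
}

/// Evidential horizon `k`; 1.0 is the customary NAL default (assumed here).
const HORIZON: f64 = 1.0;

impl Truth {
    /// Convert confidence to an evidence weight: w = k * c / (1 - c).
    fn weight(self) -> f64 {
        HORIZON * self.confidence / (1.0 - self.confidence)
    }

    /// Revision: pool two independent bodies of evidence about one statement.
    pub fn revise(self, other: Truth) -> Truth {
        let (w1, w2) = (self.weight(), other.weight());
        let w = w1 + w2;
        if w == 0.0 {
            // Neither side carries evidence; nothing to pool.
            return self;
        }
        Truth {
            frequency: (w1 * self.frequency + w2 * other.frequency) / w,
            confidence: w / (w + HORIZON),
        }
    }
}
```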

---

## The four-layer architecture (from a geneticist's vantage)

### Layer 0: Domain corpora (upstream, never vendored)

FASTA / FASTQ / VCF / BAM / GFF / annotation databases. They live at their canonical homes:
- 1000-Genomes / GIAB at NCBI / EBI.
- ClinVar at NCBI ClinVar releases.
- GO / Sequence Ontology at OBO Foundry; Reactome at its own release site.
- Reference genome assemblies at UCSC / Ensembl / NCBI.

The adapter `hydrate_*()` glue (Pattern D, Meta-Structure Hydration, `crates/lance-graph-ontology/src/hydrators/mod.rs:1-57`) points at canonical releases and pins a version; the corpora do not move into the substrate. **No new hydrator trait:** `hydrate_go` / `hydrate_reactome` / `hydrate_clinvar` are *data + ~50 LOC of glue each* over the shipped `OwlHydrator` / `MetaStructureHydrator`, mirroring the proven shape of `hydrate_dolce` / `hydrate_provo` / `hydrate_skos`.

This layer answers:

```text
which release of which reference is pinned
where the truth set / ontology / annotation lives
who owns its evolution
how the substrate consumer treats version cadence (GO monthly, ClinVar weekly, Reactome quarterly)
```

### Layer 1: Adapter (`adapter-genetics-experimental`, proposed)

Thin domain wiring. **Zero new substrate primitives.** It mirrors the engine-agnostic `OcrProvider` boundary from the `LayoutBlock::to_node_row` transcode in `lance-graph` #498. It defines:

- `GenomicSubstrate` trait (the seam).
- Parsers (host `noodles-fasta`, `noodles-vcf`, `noodles-bam`, from the pure-Rust bioinformatics ecosystem).
- `Cam6x8` k-mer fingerprint function (calls into shipped CAM-PQ codec).
- `VcfRecordTranscoder` (mirrors OCR `LayoutBlock → NodeRow`).
- Class-mint registry sync with `lance-graph-ontology`.
- Per-class `ClassView::value_schema` selection (rides existing `Full` / `Compressed` presets; no new variant, per the contract test in #500).

This layer answers:

```text
how a VCF record becomes a NodeRow
how a k-mer becomes a 48-bit fingerprint
which classid identifies a Variant / Gene / Pathway / Cell / IntegrationSite
which ValueSchema preset materialises which tenant for which class
```
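
To make the "VCF record becomes a row" step concrete, here is a minimal sketch of a line-level transcode behind a sink trait. Everything in it is hypothetical: `VariantRowSketch` stands in for the real `NodeRow`, `RowSink` stands in for whatever seam the proposed `GenomicSubstrate` trait ends up exposing, and the hand-rolled field split stands in for the `noodles-vcf` parser the adapter would actually host. Calling `transcode_vcf(text, &mut rows)` with a `Vec<VariantRowSketch>` collects one row per well-formed data line and skips headers and malformed records.

```rust
/// Hypothetical stand-in for the row a transcoder would emit; NOT the real `NodeRow`.
#[derive(Debug, Clone, PartialEq)]
pub struct VariantRowSketch {
    pub chrom: String,
    pub pos: u64, // 1-based, as in VCF
    pub reference: String,
    pub alternate: String,
}

/// Hypothetical seam: anything that can accept transcoded rows.
pub trait RowSink {
    fn push_row(&mut self, row: VariantRowSketch);
}

impl RowSink for Vec<VariantRowSketch> {
    fn push_row(&mut self, row: VariantRowSketch) {
        self.push(row);
    }
}

/// Transcode one VCF data line (tab-separated: CHROM POS ID REF ALT ...).
/// Header lines and malformed records yield `None`. A multi-allelic ALT field
/// is kept as one comma-separated string in this sketch.
pub fn transcode_vcf_line(line: &str) -> Option<VariantRowSketch> {
    if line.starts_with('#') {
        return None;
    }
    let mut fields = line.split('\t');
    let chrom = fields.next()?.to_string();
    let pos = fields.next()?.parse::<u64>().ok()?;
    let _id = fields.next()?;
    let reference = fields.next()?.to_string();
    let alternate = fields.next()?.to_string();
    Some(VariantRowSketch { chrom, pos, reference, alternate })
}

/// Feed every data line of a VCF text into a sink; returns the number of rows written.
pub fn transcode_vcf<S: RowSink>(text: &str, sink: &mut S) -> usize {
    let mut written = 0;
    for row in text.lines().filter_map(transcode_vcf_line) {
        sink.push_row(row);
        written += 1;
    }
    written
}
```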

### Layer 2: Substrate primitives (shipped)

The CAM-PQ codec, CLAM tree, CHAODA anomaly scoring, bgz17 stride, ndarray AMX int8 GEMM, lance-graph-contract canonical NodeGuid + EdgeBlock + NodeRow, `MailboxSoA<N>`, CounterfactualMailbox, Pearl-2³ in CausalEdge64, OntologyRegistry, DeepNSM sentence reader, §14 oracle. All are file:line-grounded in `docs/GENETIC_RESEARCH_VIA_STACK.md` §1 + §3.

This layer answers:

```text
what every substrate primitive provides at the kernel level
which file:line carries the canonical implementation
which test suite proves green on main
which gating probe has been run (where green) vs. is queued (where speculative)
```

### Layer 3: Research consumer

The geneticist's queries, in the substrate's native vocabulary:

- *"Find every variant in gene X within population Y in the bootstrap basin."* → one HHTL prefix scan.
- *"Score variant V against population local manifold for novelty."* → one CHAODA call on the CLAM tree.
- *"Sketch a 100 kb genomic region for cohort-wide similarity search."* → one bgz17-strided CAM-PQ fingerprint.
- *"Compare GATK calls vs. DeepVariant calls for sample S with provenance preserved."* → one §14 oracle invocation.
- *"Simulate KRAS-G12D-vs-WT counterfactual propagation in a 1024-cell tumor lattice."* → one `MailboxSoA<1024>` instantiation + `CounterfactualMailbox` for the G12D-vs-WT split.
- *"Extract gene-disease associations from PubMed abstracts."* → DeepNSM sentence reader over the corpus.

This layer answers:

```text
what queries the geneticist asks
which substrate primitive each query consumes
which ontology hydrator each query references
where the falsifiable certificate for each query result lives
```
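
The first query above reduces to a prefix scan over nibble-trie addresses. The sketch below illustrates the idea with an ordered map standing in for the trie: keys are nibble paths, and a prefix scan walks the contiguous key range that starts with the prefix. The address layout (two chromosome nibbles followed by eight position nibbles) and both function names are assumptions for illustration only; the shipped HHTL addressing in `hhtl.rs` may differ. Scanning with the first four nibbles of a path returns every entry on that chromosome whose position shares the top two position nibbles.

```rust
use std::collections::BTreeMap;

/// Encode (chromosome, position) as a 10-nibble path, most significant nibble first.
/// This layout is an assumption of the sketch, not the shipped HHTL addressing.
pub fn nibble_path(chrom: u8, pos: u32) -> Vec<u8> {
    let mut path = vec![chrom >> 4, chrom & 0x0F];
    for shift in (0..8).rev() {
        path.push(((pos >> (4 * shift)) & 0x0F) as u8);
    }
    path
}

/// Return every value whose key starts with `prefix`, in key order.
/// Keys sharing a prefix are contiguous in a BTreeMap, so the scan can stop
/// at the first key that no longer matches.
pub fn prefix_scan<'a, V>(index: &'a BTreeMap<Vec<u8>, V>, prefix: &[u8]) -> Vec<&'a V> {
    index
        .range(prefix.to_vec()..)
        .take_while(|(key, _)| key.starts_with(prefix))
        .map(|(_, value)| value)
        .collect()
}
```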

---

## Why bioinformatics pipelines alone are not enough

Each existing tool answers part of the question. None of them composes into a single counterfactual-preserving graph.

| Tool family | What it gives | What it doesn't |
|---|---|---|
| GATK / DeepVariant / bcftools | Per-sample variant calls with caller-specific confidence | No cross-caller provenance reconciliation; no counterfactual lane; no entropy×energy substrate-state calibration |
| BLAST / Diamond / minimap2 | Sequence similarity rankings | No fingerprint-substrate integration; no SPO emission; no graph-native composition |
| Reactome / WikiPathways / KEGG | Annotated pathway membership | Static; no counterfactual propagation; no Friston-FEP evidence calibration |
| CADD / REVEL / AlphaMissense | Per-variant deleteriousness scores | Trained classifier required; no unsupervised novel-variant flag; no LFD anomaly grounding |
| CellNOpt / SCENIC / etc. | Network inference from expression | No counterfactual lane preservation; no Pearl-2³ do-calculus addressing |
| nf-core / Snakemake pipelines | Reproducibility via workflow management | Workflow-level, not graph-native; no §14 oracle equivalence checking; no provenance-normalised cross-pipeline comparison |

The substrate's composition gives the missing piece: **one SPO graph, accumulating evidence across all of these consumers, with counterfactual lanes preserved, with entropy×energy quadrant classification per variant, with Pearl-2³ do-calculus addressing, and with §14 oracle equivalence as the falsifiable cross-tool benchmark.**

---

## Why building genomics tooling from scratch is not enough

You'd reach for the shipped substrate even if you started from scratch, because:

- The CAM-PQ codec is mature (PR #482 ratified the canon; ndarray PR #218 measured fidelity).
- The CLAM tree and CHAODA scoring are mature (~1600 lines, validated probes).
- The AMX int8 GEMM delivers measured on-silicon performance (197 GMAC/s, measured in ndarray PR #217).
- The entropy × energy plane is empirically validated (ρ(entropy, prediction accuracy) = −0.78 measured, ndarray PR #218).
- The reliability stats (Pearson / Spearman / Cronbach α / ICC) are shipped (ndarray PR #218).
- The CounterfactualMailbox is shipped with its iron invariant mechanically enforced.
- The §14 oracle is in production use for OCR caller comparison (post-#498).
- The OntologyRegistry has Pattern D hydrators (`hydrate_dolce` / `hydrate_owltime` / `hydrate_provo` / `hydrate_qudt` / `hydrate_schemaorg` / `hydrate_skos` / `hydrate_fibo_fnd` / `hydrate_fibo_be` / `hydrate_odoo` / `hydrate_zugferd` / `hydrate_skr03/04`) over shipped `OwlHydrator` + `MetaStructureHydrator` as the proven pattern.

Building these from scratch is N person-years of work. The lift to genetic-research-via-substrate is the **domain wiring**, estimated at days to weeks per deliverable in `genetic-research-substrate-integration-v1.md`.

---

## Invariants

These are what the substrate enforces; the genetic-research adapter inherits them.

1. **§0 anti-invention guardrail** (lance-graph #496): no new `ValueSchema` variant; no new substrate types. Genetic-research-specific work is *wiring*, not new substrate.
2. **No-new-enum-variant contract test** (lance-graph #500): genomic classes ride existing `Full` / `Compressed` presets via `classid → ClassView`. **Do not propose a `ValueSchema::Genetic`.**
3. **Counterfactuals stay in their own lane** (`counterfactual.rs` iron invariant): `InferenceType::Counterfactual` mantissa = -6; never written as observed SPO.
4. **Closed-vocab discipline** (ruff PR #5 `predicate_count_locked_at_N`): new genetic predicates land in `ruff_spo_triplet::Predicate` under the locked-count gate.
5. **No C++ source vendored into Rust-target crates.** htslib stays upstream; if a transcoded version is wanted, route through `ruff_cpp_spo` (cross-repo handover at `AdaWorldAPI/ruff`).
6. **Five-specialist drift-catching pass** (lance-graph #500): `cascade-architect` / `family-codec-smith` / `palette-engineer` / `dto-soa-savant` / `truth-architect` review before any FINDING-grade claim.
7. **Gating probes before FINDING**: `PROBE-CHAODA-1000G`, `PROBE-KRAS-COUNTERFACTUAL-DET`, `PROBE-CAM-PQ-VS-BLAST` gate the substrate's claims to bioinformatics audiences.
8. **Boundary: representation + research tooling only.** No medical/diagnostic claims (per the predecessor `3DGS-genetics-4x4-fanout-plan.md`).
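
Invariant 3 is easiest to see as a routing rule: the lane is decided by the inference kind, so a counterfactual can never land among observed triples. The sketch below is a simplified illustration under that reading. `InferenceKind`, `Triple`, and `TwoLaneStore` are hypothetical names, and the sketch does not model the real `CounterfactualMailbox`, `SplitPoles`, or the mantissa encoding mentioned above.

```rust
/// Hypothetical two-valued inference tag; the real `InferenceType` has more structure.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InferenceKind {
    Observed,
    Counterfactual,
}

/// A subject-predicate-object triple tagged with how it was inferred.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Triple {
    pub subject: String,
    pub predicate: String,
    pub object: String,
    pub kind: InferenceKind,
}

/// Two append-only lanes. Callers never pick a lane; `write` routes by kind.
#[derive(Debug, Default)]
pub struct TwoLaneStore {
    observed: Vec<Triple>,
    counterfactual: Vec<Triple>,
}

impl TwoLaneStore {
    /// Route the triple to its lane based solely on its inference kind.
    pub fn write(&mut self, triple: Triple) {
        match triple.kind {
            InferenceKind::Observed => self.observed.push(triple),
            InferenceKind::Counterfactual => self.counterfactual.push(triple),
        }
    }

    /// Read-only view of the observed lane.
    pub fn observed(&self) -> &[Triple] {
        &self.observed
    }

    /// Read-only view of the counterfactual lane.
    pub fn counterfactual(&self) -> &[Triple] {
        &self.counterfactual
    }
}
```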

---

## What "complete" looks like

The headstone is reached when:

1. **`adapter-genetics-experimental` compiles** and the locked-shape test passes (the *"shape locked"* milestone analogous to `ruff_ruby_spo` PR #4).
2. **FASTA + VCF round-trip into `NodeRow`** via the `VcfRecordTranscoder`. The first measurable artifact: load 1000-Genomes Phase 3 chromosome 22 into the substrate and round-trip back to VCF, byte-identical for the chr 22 subset.
3. **CHAODA on 1000-Genomes feature vectors** produces ROC-AUC ≥ 0.85 on the held-out novel-singleton test (PROBE-CHAODA-1000G green).
4. **CAM-PQ-vs-BLAST agreement** measured: Spearman ρ ≥ 0.7 on top-100 RefSeq similarity rankings (PROBE-CAM-PQ-VS-BLAST green).
5. **KRAS G12D 1024-cell counterfactual fan-out** simulation runs deterministically (PROBE-KRAS-COUNTERFACTUAL-DET bit-exact across runs), and the observed-lane oncogenic-transformation rate matches published outcomes within tolerance.
6. **GO / Reactome / ClinVar Pattern D `hydrate_*()` glue** (`hydrate_go` / `hydrate_reactome` / `hydrate_clinvar`, each data + ~50 LOC over shipped `OwlHydrator` / `MetaStructureHydrator`) loads into `OntologyRegistry`, and ontology cache invalidation works on a Lance version bump.
7. **§14 oracle benchmarks GATK vs. DeepVariant** against GIAB HG002 truth set with F1 meeting published minima.
8. **DeepNSM genetic-language reader probe** demonstrates `P64` projection consistency on protein-coding sequences (versus structured-noise on non-coding).
9. **Histology splat extension** carries per-splat genomic profile via `Full` ValueSchema with `Fingerprint` + `HelixResidue` tenants populated.
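
Criterion 4 gates on a Spearman rank correlation of at least 0.7 between two similarity rankings. As a minimal sketch of that check, the code below computes Spearman ρ with the textbook formula ρ = 1 − 6·Σd² / (n(n² − 1)), which is exact only when neither list contains ties. The function names and the assumption that both tools' scores arrive as equal-length `f64` slices are illustrative, not part of the probe's actual harness; only the 0.7 threshold comes from the list above.

```rust
/// Rank each score (1 = smallest). Assumes no ties and no NaN values.
fn ranks(scores: &[f64]) -> Vec<f64> {
    let mut order: Vec<usize> = (0..scores.len()).collect();
    order.sort_by(|&a, &b| scores[a].partial_cmp(&scores[b]).expect("NaN in scores"));
    let mut out = vec![0.0; scores.len()];
    for (rank, &idx) in order.iter().enumerate() {
        out[idx] = (rank + 1) as f64;
    }
    out
}

/// Spearman rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)); valid when there are no ties.
/// Returns `None` for mismatched lengths or fewer than two items.
pub fn spearman_rho(x: &[f64], y: &[f64]) -> Option<f64> {
    if x.len() != y.len() || x.len() < 2 {
        return None;
    }
    let (rx, ry) = (ranks(x), ranks(y));
    let n = x.len() as f64;
    let d2: f64 = rx.iter().zip(&ry).map(|(a, b)| (a - b).powi(2)).sum();
    Some(1.0 - 6.0 * d2 / (n * (n * n - 1.0)))
}

/// Threshold from the completion criterion: rho >= 0.7.
pub fn probe_passes(rho: f64) -> bool {
    rho >= 0.7
}
```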

When these nine hold, the substrate has fulfilled its purpose as the genetic-research foundation: one graph, every domain layer composes, counterfactual lanes preserved, falsifiable certificates everywhere.

---

## Headstone state: what the era closes

```text
The era that closes:
- Per-tool pipeline silos with no cross-tool provenance reconciliation.
- Variant calling and pathway annotation as separate worlds with no
shared substrate.
- Counterfactual driver-mutation histories thrown away because no tool
preserves them.
- Reproducibility via workflow managers rather than substrate-native
provenance + §14 oracle equivalence.
- Outlier detection requiring trained classifiers (CADD / REVEL etc.)
because no unsupervised LFD-based novel-variant detector existed at
bioinformatics scale.
- "Bioinformatics builds its own tools" as the default assumption.

The era that opens:
- One SPO graph accumulating cross-tool, cross-paper, cross-cohort
variant evidence with provenance preserved.
- Counterfactual driver-mutation lanes queryable for retrospective
analysis ("what if KRAS had mutated at codon 13 instead of 12?").
- CHAODA unsupervised novel-variant detection on the same CLAM tree
the substrate uses for language retrieval.
- Pearl-2³ do-calculus native in CausalEdge64 for cancer-pathway
causal queries.
- The Friston entropy×energy substrate-state plane as the calibrated
confidence axis for every variant in the store.
- The same Gaussian-splat math at cm (organ), mm (lesion), and µm
(cell / histology slide) scales, with the splat carrier extending
to per-cell genomic profile.
- Cross-pipeline equivalence checking as a substrate-native operation
(§14 oracle), not a workflow manager's afterthought.
- Domain experts wiring the genetic-research adapter, not rebuilding
the substrate.
```

The capstone thesis at the top of this doc is the one-line restatement of the open-era state.

---

## Cross-references

### This repo (`AdaWorldAPI/lance-graph`)
- `docs/GENETIC_RESEARCH_VIA_STACK.md`: the *why* doc for a domain expert, file:line-grounded.
- `.claude/plans/genetic-research-substrate-integration-v1.md`: the implementation plan (10 deliverables + 3 probes).
- `.claude/plans/3DGS-genetics-4x4-fanout-plan.md`: predecessor static-representation plan.
- `.claude/plans/3DGS-cross-pollination-raw-field-plan.md`: sibling cross-domain plan (ultrasound + neuronal + genetics share the raw-field backbone).
- `crates/lance-graph-contract/src/canonical_node.rs`: `NodeGuid` / `EdgeBlock` / `NodeRow` (the row substrate).
- `crates/lance-graph-contract/src/counterfactual.rs`: `SplitPoles` / `CounterfactualMailbox` / `revise_if_minority_wins`.
- `crates/lance-graph-contract/src/hhtl.rs`: `NiblePath` HHTL nibble-trie.
- `crates/lance-graph-contract/src/high_heel.rs:202`: `CausalEdge64` bit layout (Pearl mask included).
- `crates/lance-graph-ontology/`: `OntologyRegistry` + TTL hydrators.
- `crates/lance-graph/src/cam_pq/storage.rs`: CAM-PQ 48-bit fingerprint storage.
- `crates/lance-graph-turbovec/KNOWLEDGE.md`: TurboQuant + LUT-ADC kernel (76 µs/query measured).
- `crates/cognitive-shader-driver/src/mailbox_soa.rs`: `MailboxSoA<N>` + `consume_firing`.
- `crates/deepnsm/`: sentence-level AriGraph reader.

### Sibling repo (`AdaWorldAPI/ndarray`)
- `src/hpc/clam.rs:1493-1560`: CHAODA Phase 4 anomaly scoring on LFD distribution.
- `src/hpc/amx_matmul.rs`: int8 GEMM at 197 GMAC/s.
- `src/hpc/activations.rs`: exp / log / softmax / softmax-backward.
- `src/hpc/linalg/mat_exp.rs`: matrix exponential via Padé.
- `crates/bgz17/src/lib.rs:53-60`: 11/17 X-Trans stride constants.

### Upstream substrate context (`AdaWorldAPI/lance-graph` PRs)
- PR #491: entropy × energy framing; SoA migration diff resolution.
- PR #494: `EntropyRung` + `Quadrant` + `nars_entropy(f, c)`.
- PR #495: 3-byte EdgeRef witness; reliability ⊥ causality empirically.
- PR #496: `ValueSchema` presets + §0 anti-invention guardrail.
- PR #498: GUID decode→read-mode keystone; helix `Signed360`; OCR→NodeRow transcode template.
- PR #500: rebaseline + no-new-variant contract test + gating-probes pattern.

### External cross-repo
- `AdaWorldAPI/bardioc/SUBSTRATE_STATE_FRAMINGS.md`: entropy × energy plane framing (durable bardioc-side architectural doc).
- `AdaWorldAPI/bardioc/.claude/handovers/2026-06-16-session-handover.md`: bardioc session handover (covers the preceding session's confabulation pattern + discipline lessons).
- `AdaWorldAPI/ruff/.claude/handovers/2026-06-16-ruff-cpp-headstone-exploration.md`: sibling headstone for the C++ harvester (relevant if htslib transcoding is wanted).
- `OGAR/docs/CASCADE-SYNERGIES-EPIPHANY.md`: Morton-cascade × palette256 × golden-helix synthesis (foundational to the genetics fanout).
- `OGAR/docs/DISCOVERY-MAP.md`: D-CASCADE / D-MOIRE / D-BGZ17 ledger entries.

### Workspace headstones (for shape reference)
- `lance-graph/.claude/plans/3DGS-Cesium-BindSpace4-headstone-exploration.md`: the headstone shape this document follows.
- `bardioc/ROADMAP_RUST_PRIMARY_HEADSTONE.md`: Phase A→I migration headstone.
- `AdaWorldAPI/ruff/.claude/handovers/2026-06-16-ruff-cpp-headstone-exploration.md`: sibling headstone (C++ harvester).
- `AdaWorldAPI/tesseract-rs/.claude/handovers/2026-06-16-tesseract-rs-headstone-exploration.md`: sibling headstone (Rust target).

---

_Authored 2026-06-16 by external session `AdaWorldAPI/bardioc` `session_01VysoWJ6vsyg3wEGc5v7T5v`. Headstone shape: preserves the architectural synthesis for what genetic-research-via-substrate IS when complete. Companion pattern-recognition hand-off at `docs/GENETIC_RESEARCH_VIA_STACK.md` carries the *why*; companion implementation plan at `.claude/plans/genetic-research-substrate-integration-v1.md` carries the *how*. No code, no PR for substrate changes; synthesis-preservation only._
