
Commit 9138d28

docs(genetics): D-GEN-CHAODA-ENSEMBLE increment 1 RUN — ensemble clears the synthetic bar (AUC 0.62 -> 0.99)
Records the ndarray #220 result: the multi-method CHAODA ensemble resolves the kernel-level blocker the #219 spike surfaced.

MEASURED (ndarray #220, same synthetic fixture as #219):
single-LFD AUC 0.6240 -> ensemble AUC 0.9906 (+0.3667, clears 0.85)

The dominant signal is the parent-child path-minority ratio (immune to the leaf-fragmentation that defeated a naive leaf-cardinality/degree attempt at AUC 0.621), averaged with connected-component cardinality.

Updates:
- Sequencing table: split P0 into P0a (ensemble, DONE, AUC 0.991) and P0b (genomic probe, unblocked at kernel level but gated on real corpora). Blocker note flipped from surfaced to resolved-at-kernel.
- Added a FOLLOW-UP block under PROBE-CHAODA-1000G with the ensemble measurement and the honest scope (synthetic only).
- D-GEN-CHAODA-ENSEMBLE: marked INCREMENT 1 DONE (ndarray #220); listed what remains (random-walk method; Step 3 wiring lands with D-GEN-1+2). Noted the ~half-day actual vs ~1-week estimate.
- GENETIC_RESEARCH_VIA_STACK.md § 1.4: caveat flipped from "NOT a working detector" to "kernel capability now demonstrated via ensemble_anomaly_scores; genomic claim still gated on D-GEN-1+2."

Honest scope preserved throughout: synthetic smoke test proves the ensemble approach; genomic novelty detection remains unproven until the VCF->feature-vector pipeline exists.

https://claude.ai/code/session_01VysoWJ6vsyg3wEGc5v7T5v
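The path-minority signal named in the commit message can be illustrated with a small sketch. It assumes a flattened tree given as a parent map plus per-node cardinalities; `ToyTree`, its fields, and `roc_auc` are hypothetical names for illustration and are not the ndarray `ClamTree` API. Turning the ratio into an anomaly score as `1 - ratio` is also an assumption, and the connected-component cardinality term of the ensemble is not modelled here.

```rust
/// Hypothetical flattened cluster tree (illustrative only, not the ndarray layout).
/// `parent[i]` is the parent of node `i` (`None` for the root);
/// `cardinality[i]` is the number of points contained in node `i`.
pub struct ToyTree {
    pub parent: Vec<Option<usize>>,
    pub cardinality: Vec<usize>,
}

impl ToyTree {
    /// Minimum child/parent cardinality ratio along the path from `leaf` to the root.
    /// A point that split off as a small minority at any level gets a tiny ratio.
    /// Returns 1.0 when `leaf` is the root, since the path contains no split.
    pub fn path_minority_ratio(&self, leaf: usize) -> f64 {
        let mut node = leaf;
        let mut min_ratio = 1.0_f64;
        while let Some(p) = self.parent[node] {
            let ratio = self.cardinality[node] as f64 / self.cardinality[p] as f64;
            min_ratio = min_ratio.min(ratio);
            node = p;
        }
        min_ratio
    }

    /// Assumed score orientation: higher means more isolated.
    pub fn anomaly_score(&self, leaf: usize) -> f64 {
        1.0 - self.path_minority_ratio(leaf)
    }
}

/// Pairwise ROC-AUC: the fraction of (outlier, inlier) pairs in which the
/// outlier has the higher score, with ties counted as one half.
/// Quadratic in the number of points, which is adequate for a smoke test.
/// Returns NaN when either class is empty.
pub fn roc_auc(scores: &[f64], is_outlier: &[bool]) -> f64 {
    let mut wins = 0.0;
    let mut pairs = 0.0;
    for (i, &si) in scores.iter().enumerate() {
        if !is_outlier[i] {
            continue;
        }
        for (j, &sj) in scores.iter().enumerate() {
            if is_outlier[j] {
                continue;
            }
            pairs += 1.0;
            if si > sj {
                wins += 1.0;
            } else if si == sj {
                wins += 0.5;
            }
        }
    }
    if pairs == 0.0 {
        f64::NAN
    } else {
        wins / pairs
    }
}
```

On a toy tree whose root (100 points) splits into a 99-point cluster and a 1-point leaf, the singleton's path ratio is 1/100, while leaves of the 99-point cluster keep a ratio near one half. That gap is what lets the signal separate isolated points regardless of how finely the dense cluster is fragmented into leaves.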
1 parent 74e04cc commit 9138d28

2 files changed

Lines changed: 54 additions & 20 deletions


.claude/plans/genetics-probes-v1.md

Lines changed: 53 additions & 19 deletions
@@ -21,15 +21,20 @@
 
 | Phase | Probe | Cost | Status | Gates |
 |---|---|---|---|---|
-| **P0** | PROBE-CHAODA-1000G | ~3 days (after D-GEN-1+2) |**spike RUN — AUC 0.624, BELOW bar** (ndarray #219) | The "CHAODA-as-novelty-detector" line of the entire plan |
+| **P0a** | D-GEN-CHAODA-ENSEMBLE (synthetic) | ~done |**ensemble RUN — AUC 0.991, CLEARS bar** (ndarray #220) | The CHAODA kernel's isolation capability |
+| **P0b** | PROBE-CHAODA-1000G (genomic) | ~3 days (after D-GEN-1+2) | ⏳ unblocked at kernel level; gated on real corpora | The "CHAODA-as-novelty-detector" line of the entire plan |
 | **P1** | PROBE-KRAS-COUNTERFACTUAL-DET | ~2 days (included in D-GEN-7) | queued | D-GEN-7 flagship dynamics-axis claim |
 | **P2** | PROBE-CAM-PQ-VS-BLAST | ~1 week | queued | D-GEN-3 sequence-fingerprint claim |
 
-**⚠ Blocker surfaced 2026-06-16:** the P0 spike (ndarray #219) shows the shipped
-single-method leaf-LFD `anomaly_scores` reaches only AUC 0.624 on ideal synthetic
-data. Porting the multi-method CHAODA ensemble (Ishaq et al. 2021) is now a
-**prerequisite for PROBE-CHAODA-1000G**, ahead of any genomic-fixture work. See
-the ⚠ FINDING under PROBE-CHAODA-1000G below.
+**⚠→✅ Blocker surfaced AND resolved at kernel level 2026-06-16:** the P0 spike
+(ndarray #219) showed the shipped single-method leaf-LFD `anomaly_scores` reaching
+only AUC 0.624 on ideal synthetic data. The multi-method CHAODA ensemble has now
+been built (ndarray #220, `ClamTree::ensemble_anomaly_scores`) and measured at
+**AUC 0.991** on the same synthetic fixture — clearing the 0.85 bar with a +0.367
+lift. **This resolves the kernel-level blocker** (the ensemble *approach* captures
+isolation where single-LFD did not). It does NOT yet prove genomic novelty
+detection: `PROBE-CHAODA-1000G` on real corpora remains gated on D-GEN-1 + D-GEN-2.
+See the FINDING under PROBE-CHAODA-1000G below.
 
 **Critical-path note:** PROBE-CHAODA-1000G is the single highest-leverage probe.
 If it fails (AUC < 0.85 on novel-variant detection against ClinVar Pathogenic
@@ -54,6 +59,30 @@ probe is a regression gate, not a discovery gate).
 > singletons from common population variants at ROC-AUC ≥ 0.85 on a
 > held-out test fold drawn from 1000-Genomes Phase 3 + ClinVar.
 
+### ✅ FOLLOW-UP (ensemble RUN 2026-06-16) — multi-method ensemble CLEARS the bar (synthetic)
+
+The blocker below has been resolved at the kernel level. The multi-method CHAODA
+ensemble is built (ndarray PR #220, `ClamTree::ensemble_anomaly_scores`) and
+measured on the *same* synthetic fixture:
+
+| signal | ROC-AUC |
+|---|---|
+| single-method leaf-LFD (baseline) | 0.6240 |
+| **multi-method ensemble** | **0.9906** |
+| lift | **+0.3667** |
+
+The dominant signal is the **parent-child path-minority ratio** — walking a leaf
+up to the root, the minimum `child/parent` cardinality ratio is tiny for a point
+that split off as a minority (an isolated outlier) and moderate for a dense-cluster
+member that always stayed in the majority. This is *immune to the leaf-fragmentation*
+that defeated the naïve first attempt (raw leaf cardinality + degree + component
+size → AUC 0.621, no lift). Averaged with connected-component cardinality.
+
+**This proves the ensemble approach, on synthetic data only.** It does NOT prove
+genomic novelty detection — `PROBE-CHAODA-1000G` on real corpora is still gated on
+D-GEN-1 + D-GEN-2. Random-walk stationary distribution remains deferred to a later
+ensemble increment.
+
 ### ⚠ FINDING (spike substitute RUN 2026-06-16) — single-method LFD is BELOW the bar
 
 The 1-day spike substitute (see §Cost below) has been **run** against the
@@ -353,21 +382,26 @@ the pattern match is sound, the single shipped signal is not yet sufficient,
 and the honest path is "port the ensemble, then re-run the spike, then build
 the fixture." A new candidate deliverable falls out of this:
 
-> **D-GEN-CHAODA-ENSEMBLE (new, prerequisite to PROBE-CHAODA-1000G):** add the
+> **D-GEN-CHAODA-ENSEMBLE (prerequisite to PROBE-CHAODA-1000G):** add the
 > multi-method CHAODA anomaly ensemble to `ndarray::hpc::clam` as a **new
-> scoring entry point** (e.g. `ensemble_anomaly_scores(...) -> Vec<AnomalyScore>`,
-> name TBD at implementation), combining the graph-based signals of Ishaq et
-> al. 2021. The existing single-method `anomaly_scores` is **kept unchanged as
-> the documented baseline / regression** (the ndarray #219 spike's `auc < 0.85`
-> tripwire stays green on it). **`PROBE-CHAODA-1000G` Step 3 must call the new
-> ensemble entry point, not `anomaly_scores`** — that wiring is part of this
+> scoring entry point**, combining the graph-based signals of Ishaq et al.
+> 2021. The existing single-method `anomaly_scores` is **kept unchanged as the
+> documented baseline / regression**. **`PROBE-CHAODA-1000G` Step 3 must call the
+> new ensemble entry point, not `anomaly_scores`** — that wiring is part of this
 > deliverable, otherwise the genomic probe would re-measure the known-bad
-> AUC-0.624 path. Re-run the ndarray #219 spike against the ensemble; gate at
-> AUC ≥ 0.85 on the synthetic mixture *before* genomic fixtures are built.
-> Lift: ~1 week (the graph-construction primitives — cluster cardinality,
-> neighbourhood, random-walk — are mostly present in the CLAM tree already;
-> the ensemble combination + per-method scoring + the probe-API wiring is the
-> new code).
+> AUC-0.624 path.
+>
+> **✅ INCREMENT 1 DONE 2026-06-16 — ndarray PR #220.** Shipped
+> `ClamTree::ensemble_anomaly_scores` = parent-child path-minority ratio ⊕
+> connected-component cardinality. Re-ran the ndarray #219 synthetic fixture:
+> **ensemble AUC 0.9906 vs single-LFD 0.6240 (+0.367), clears the 0.85 gate.**
+> Deterministic; built from shipped tree fields + public `dist()`; no new tree
+> state. **Remaining for a later increment:** (a) random-walk stationary
+> distribution method (deferred — needs power-iteration on the cluster graph);
+> (b) the actual `PROBE-CHAODA-1000G` Step 3 wiring lands with D-GEN-1+2 when the
+> VCF→feature-vector pipeline exists. Increment-1 lift came in at ~half-day, well
+> under the ~1-week estimate, because the path-minority signal needed only a
+> parent-map walk — no random-walk solver.
 
 **PROBE-CHAODA-1000G fires first, even though chronologically D-GEN-1..2 must
 ship first.** That ordering is a substrate-economic decision (cheaper to

docs/GENETIC_RESEARCH_VIA_STACK.md

Lines changed: 1 addition & 1 deletion
@@ -71,7 +71,7 @@ impl ClamTree {
 
 **The composition:** build a CLAM tree on your per-variant feature vectors; CHAODA scores every variant against the local manifold's intrinsic dimensionality. A novel variant in a region of high LFD lights up as `AnomalyScore { score → 1.0, awareness → AwarenessState::Noise }` (the `score ≥ 0.75` quartile per `clam.rs:1556`) because its position differs from the population's local manifold — *without you having to train a classifier or annotate a truth set first*. This is *unsupervised* outlier detection on the same tree your range queries walk.
 
-> **⚠ MEASURED CAVEAT (2026-06-16, ndarray PR #219):** the *shipped* `anomaly_scores` implements **only the single-method leaf-LFD signal**, not the full multi-method CHAODA ensemble of Ishaq et al. 2021. A spike on ideal synthetic data (clean Gaussian clusters + far outliers) measured **ROC-AUC = 0.624** — well below the ≥ 0.85 bar a novelty detector needs. Leaf LFD captures *intra-leaf* geometry complexity, not *inter-leaf* isolation, so isolated outliers and dense-cluster points end up in the same score band. **As shipped today, this composition is NOT a working novel-variant detector.** Realising the claim requires porting the multi-method CHAODA ensemble (relative/component cardinality, graph neighbourhood, random-walk stationary distribution, vertex degree) — see `PROBE-CHAODA-1000G` in `.claude/plans/genetics-probes-v1.md`. The pattern match is real; the *single shipped signal* is not yet sufficient.
+> **⚠→✅ MEASURED CAVEAT (2026-06-16):** the *original* `anomaly_scores` implements **only the single-method leaf-LFD signal**. A spike (ndarray PR #219) on ideal synthetic data measured **ROC-AUC = 0.624** — below the ≥ 0.85 bar — because leaf LFD captures *intra-leaf* geometry complexity, not *inter-leaf* isolation. **The multi-method ensemble has since been built** (ndarray PR #220, `ClamTree::ensemble_anomaly_scores`: parent-child path-minority ⊕ connected-component cardinality) and measured at **ROC-AUC = 0.991** on the same fixture — clearing the bar. So the *kernel* now does isolation-aware novelty detection; use `ensemble_anomaly_scores`, not the single-LFD `anomaly_scores`, for this composition. **Still gated:** this is synthetic-only proof. Genomic novelty detection (`PROBE-CHAODA-1000G` on 1000-Genomes + ClinVar) remains unproven until the VCF→feature-vector pipeline (plan D-GEN-1+2) exists. The pattern match is real and the kernel capability is now demonstrated; the genomic claim is not yet measured.
 
 ### 1.5 minimap2 minimizers ↔ bgz17 11/17 X-Trans stride
 