docs(genetics): probe spec v1 + Salient cite-rot fix #503
Promotes the three probes named in §2 of .claude/plans/genetic-research-substrate-integration-v1.md (merged in #501) from "named" to "fully-specified with file:line citations + pass/fail criteria locked." Mirrors the probe-spec discipline of ocr-probes-v1.md (PR #500).

Probe specs:

- PROBE-CHAODA-1000G (P0, ~3 days after D-GEN-1+2): novel-variant detection on 1000-Genomes Phase 3 + ClinVar held-out. Feature vector locked at 5 lanes (AF / DP / FS / 100bp Shannon entropy via bgz17 / phyloP100way). ROC-AUC >= 0.85 pass condition with a per-quartile separation sanity check. Critical path: if it fails, the unsupervised novelty story in GENETIC_RESEARCH_VIA_STACK.md §1.4 collapses.
- PROBE-KRAS-COUNTERFACTUAL-DET (P1, ~2 days inside D-GEN-7): bit-exact MailboxSoA<1024> across two seeded runs. Regression gate for the substrate's no-randomness invariant under fan-out load.
- PROBE-CAM-PQ-VS-BLAST (P2, ~1 week): Spearman rho >= 0.7 + ICC >= 0.6 against BLAST e-value top-100 rankings on a 10K RefSeq protein subset via ESM-2 small (320-D) embeddings.

Sequencing locked: PROBE-CHAODA-1000G is the single highest-leverage probe; the per-quartile separation check guards against an inverted signal regime.

§A1 cite-rot fix folded in: GENETIC_RESEARCH_VIA_STACK.md §1.4 cited a non-existent AwarenessState::Salient and an f32 score field. Shipped variants per clam.rs:1549-1557 are Crystallized / Tensioned / Uncertain / Noise; score field is f64 (clam.rs:1504). Corrected to map the score >= 0.75 quartile to AwarenessState::Noise per clam.rs:1556, with the AnomalyScore struct re-stated against the shipped layout. Also added the "gated by PROBE-CHAODA-1000G" callout so future readers see the conjecture status of the novelty claim.

https://claude.ai/code/session_01VysoWJ6vsyg3wEGc5v7T5v
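For readers unfamiliar with the ROC-AUC gate on PROBE-CHAODA-1000G, here is a minimal Python sketch of how such a threshold could be evaluated from anomaly scores and binary labels. It uses the rank-sum (Mann-Whitney) formulation and the standard library only. The function names `roc_auc` and `probe_passes` are hypothetical and are not part of the repository. The per-quartile separation check is not sketched because its exact definition is not given in this thread.

```python
AUC_GATE = 0.85  # pass threshold quoted in the PR description


def roc_auc(scores, labels):
    """Rank-based ROC-AUC; tied scores receive their average rank."""
    pairs = sorted(zip(scores, labels), key=lambda p: p[0])
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of scores tied with pairs[i].
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    u_stat = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_stat / (n_pos * n_neg)


def probe_passes(scores, labels, gate=AUC_GATE):
    """True when the AUC meets the gate (hypothetical helper)."""
    return roc_auc(scores, labels) >= gate
```

An AUC far below 0.5 means the score ranks positives below negatives, which is the kind of inverted signal the description says the separation check is meant to catch.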
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 444e371e0d
layout (`ndarray/src/hpc/clam.rs`, HEEL=16 / HIP=256 / TWIG=4096 per
`lance-graph/.claude/session_2026_04_11_bf16_hhtl_combined_research.md`).
2. Project held-out vectors through the tree (assign to leaf cluster).
3. Compute `anomaly_scores(held_out_bytes, vec_len=5)` → `Vec<AnomalyScore>`.
Score held-out variants with a real projection path
For this held-out scenario, the documented call cannot produce held-out scores: the current ClamTree::anomaly_scores implementation used by this workspace returns one score per index in the dataset used to build the tree by mapping self.reordered leaf indices to LFDs, and it does not use the supplied bytes to route new points. Passing held_out_bytes after training on chromosomes 1/7/17 will score the first N training indices or leave zero LFDs, so the ROC-AUC gate can pass/fail on meaningless labels unless the probe first adds a real projection/scoring path or builds the tree on the evaluation set.
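To make the distinction concrete, the following Python sketch shows what a projection path would need to do at minimum: route each held-out vector to a leaf of the already-built tree and read off a score there, so that the output has one entry per held-out point rather than per training index. Everything here is hypothetical. `nearest_leaf` and `score_held_out` are illustrative names, the flat nearest-center search stands in for a real tree descent, and none of this reflects the actual `ClamTree` API.

```python
import math


def nearest_leaf(point, leaf_centers):
    """Index of the leaf center closest to `point` (Euclidean distance)."""
    return min(
        range(len(leaf_centers)),
        key=lambda i: math.dist(point, leaf_centers[i]),
    )


def score_held_out(held_out, leaf_centers, leaf_scores):
    """One score per held-out point, inherited from its nearest leaf.

    Assumption: each leaf already carries a score computed on the
    training set; a real implementation might instead recompute a
    local statistic that includes the new point.
    """
    return [leaf_scores[nearest_leaf(p, leaf_centers)] for p in held_out]
```

The property that matters for the ROC-AUC gate is that the returned list is indexed by held-out variant, so labels and scores refer to the same records.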
|---|---|---|
| 0 | Allele frequency | VCF `INFO/AF` |
| 1 | Total read depth | VCF `INFO/DP` |
| 2 | Strand bias (Fisher) | VCF `INFO/FS` (GATK convention) |
Use fields present in the pinned VCF corpora
With the pinned corpora, this lane is not extractable as written: INFO/FS is a GATK callset annotation, but the 1000 Genomes Phase 3 release and ClinVar VCFs pinned below are not a paired GATK callset, and ClinVar in particular does not carry per-sample depth/strand-bias evidence. An implementation would have to impute/drop this lane or join BAM/sample-call data, so the locked 5-D feature vector is not reproducible from the stated sources.
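A quick way to test this concern before locking the feature vector is to scan records from each pinned VCF and report which of the required INFO keys are absent. The Python sketch below is a minimal, standard-library illustration; `parse_info` and `missing_lanes` are hypothetical helper names, and the sample record in the usage note is made up for demonstration rather than taken from either corpus.

```python
REQUIRED_INFO_KEYS = ("AF", "DP", "FS")  # lanes 0-2 of the spec table


def parse_info(info_field):
    """Parse a VCF INFO column into a dict; bare flags map to True."""
    parsed = {}
    if info_field in ("", "."):
        return parsed
    for item in info_field.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


def missing_lanes(vcf_line, required=REQUIRED_INFO_KEYS):
    """Return the required INFO keys absent from one VCF data line."""
    columns = vcf_line.rstrip("\n").split("\t")
    info = parse_info(columns[7])  # INFO is the 8th fixed VCF column
    return [key for key in required if key not in info]
```

For example, `missing_lanes("1\t100\t.\tA\tG\t.\tPASS\tAF=0.01;DP=1234")` returns `["FS"]` for that made-up record; running the same check over the real files would show whether the FS lane is extractable as specified.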
**Step 2 - Embed.** ESM-2 small (`esm2_t6_8M_UR50D`) → 320-D protein
embeddings via the existing GGUF loader + ndarray AMX int8 GEMM path.

**Step 3 - Fingerprint.** CAM-PQ encode each 320-D embedding → 48-bit
`Cam6x8`. (Note: SimLex-999 fidelity was measured on 1024-D Jina; ESM-2
Avoid silently dropping ESM dimensions in CAM-PQ
The shipped CAM-PQ codec splits vectors by total_dim / 6, so a 320-D ESM vector becomes six 53-D subspaces and dimensions 318-319 are ignored unless the probe explicitly pads, projects, or truncates. Since this probe claims to compare fingerprints of the full 320-D embeddings, the planned BLAST agreement metric will not measure the stated representation and may change under a future fixed codec; choose a 6-divisible representation or document the padding/projection step.
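The arithmetic behind this comment can be checked directly. The Python sketch below assumes only what the comment states, namely that the codec uses integer division `total_dim / 6` to size its subspaces. The helper names are hypothetical, and zero-padding is shown as just one of the three options the reviewer lists (pad, project, or truncate); whether zero-padding suits this codec is an open assumption.

```python
N_SUBSPACES = 6  # Cam6x8: six subspaces, per the comment above


def subspace_width(total_dim, n_sub=N_SUBSPACES):
    """Width of each subspace under integer division."""
    return total_dim // n_sub


def dropped_dims(total_dim, n_sub=N_SUBSPACES):
    """Trailing dimension indices that no subspace covers."""
    covered = subspace_width(total_dim, n_sub) * n_sub
    return list(range(covered, total_dim))


def pad_to_multiple(vec, n_sub=N_SUBSPACES, fill=0.0):
    """Append `fill` values until len(vec) is divisible by n_sub."""
    remainder = len(vec) % n_sub
    if remainder == 0:
        return list(vec)
    return list(vec) + [fill] * (n_sub - remainder)
```

With a 320-D input, `subspace_width(320)` is 53 and `dropped_dims(320)` is `[318, 319]`; padding to 324 dimensions gives six subspaces of width 54 with nothing dropped.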
Summary

Promotes the three probes named in `.claude/plans/genetic-research-substrate-integration-v1.md` §2 (merged in #501) from "named" to "fully-specified with file:line citations + pass/fail criteria locked", matching the probe-spec discipline of `ocr-probes-v1.md` (PR #500).

Probes:

- `PROBE-CHAODA-1000G`
- `PROBE-KRAS-COUNTERFACTUAL-DET`
- `PROBE-CAM-PQ-VS-BLAST`

Critical-path call: PROBE-CHAODA-1000G fires first because failure collapses the unsupervised-novelty story regardless of every other adapter deliverable. The probe locks a 5-lane feature vector (AF / DP / FS / 100bp Shannon entropy via `bgz17` / phyloP100way) and the corpus pin (1000G Phase 3 `20130502` + ClinVar `2024-12`), with a per-quartile separation sanity check on top of the AUC ≥ 0.85 threshold.

§A1: Salient cite-rot fix folded in

`docs/GENETIC_RESEARCH_VIA_STACK.md` §1.4 (just merged in #501) cited a non-existent `AwarenessState::Salient` and an `f32` score field. Shipped variants per `ndarray/src/hpc/clam.rs:1549-1557` are `Crystallized` / `Tensioned` / `Uncertain` / `Noise`; `score` is `f64` (`clam.rs:1504`). Correction maps the `score ≥ 0.75` quartile to `AwarenessState::Noise` (per `clam.rs:1556`), re-states the `AnomalyScore` struct against the shipped layout, and adds a "gated by `PROBE-CHAODA-1000G`" callout so future readers see the conjecture status of the novelty claim.

Files

- `+` `.claude/plans/genetics-probes-v1.md` (~300 lines, new)
- `~` `docs/GENETIC_RESEARCH_VIA_STACK.md` (§1.4 cite-rot fix, 4 lines changed)

Test plan

- […] `ocr-probes-v1.md` reference shape).
- `clam.rs:1498-1567` AnomalyScore citation still accurate when reviewer spot-checks.
- `clam.rs:1549-1557` AwarenessState variants still match (Crystallized/Tensioned/Uncertain/Noise).

Cross-refs

- `.claude/plans/genetic-research-substrate-integration-v1.md` (merged as "docs(genetics): pattern-recognition hand-off + integration plan + headstone (for a domain expert new to the stack)", #501): names the probes.
- `.claude/plans/ocr-probes-v1.md` (merged as "docs(plans)+test: rebaseline #497 OCR plans to #498 + gating probes (5-specialist framing)", #500): probe-spec shape this doc mirrors.
- `docs/GENETIC_RESEARCH_VIA_STACK.md` §1.4: site of the cite-rot fix.
- `ndarray/src/hpc/clam.rs:1493-1567`: CHAODA Phase 4 implementation.

https://claude.ai/code/session_01VysoWJ6vsyg3wEGc5v7T5v