docs(genetics): probe spec v1 + Salient cite-rot fix #503

Merged
AdaWorldAPI merged 1 commit into main from claude/probe-genetics-spec-v1 on Jun 16, 2026
@AdaWorldAPI

Owner

Summary

Promotes the three probes named in .claude/plans/genetic-research-substrate-integration-v1.md §2 (merged in #501) from "named" to "fully-specified with file:line citations + pass/fail criteria locked" — matching the probe-spec discipline of ocr-probes-v1.md (PR #500).

Probes:

| Probe | Phase | Cost | Gates |
|---|---|---|---|
| PROBE-CHAODA-1000G | P0 | ~3 days (after D-GEN-1+2) | The whole CHAODA-as-novelty-detector line of the plan |
| PROBE-KRAS-COUNTERFACTUAL-DET | P1 | ~2 days (in D-GEN-7) | D-GEN-7 dynamics-axis flagship |
| PROBE-CAM-PQ-VS-BLAST | P2 | ~1 week | D-GEN-3 sequence-fingerprint claim → D-GEN-10 benchmark |

Critical-path call: PROBE-CHAODA-1000G fires first because failure collapses the unsupervised-novelty story regardless of every other adapter deliverable. The probe locks a 5-lane feature vector (AF / DP / FS / 100bp Shannon entropy via bgz17 / phyloP100way) and the corpus pin (1000G Phase 3 20130502 + ClinVar 2024-12), with a per-quartile separation sanity check on top of the AUC ≥ 0.85 threshold.
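The pass condition described above can be sketched as a rank-based ROC-AUC plus a quartile check. This is a minimal illustration, not the probe's implementation: the function names are hypothetical, and the quartile check is assumed here to mean "the top score quartile must contain a higher fraction of positives than the bottom quartile", which is one plausible reading of a guard against an inverted signal.

```rust
/// ROC-AUC via pairwise comparison (Mann-Whitney): the fraction of
/// (positive, negative) pairs where the positive scores higher; ties count 0.5.
/// Returns None when either class is empty. O(n^2), kept simple for clarity.
pub fn roc_auc(scores: &[f64], labels: &[bool]) -> Option<f64> {
    assert_eq!(scores.len(), labels.len());
    let n_pos = labels.iter().filter(|&&l| l).count();
    let n_neg = labels.len() - n_pos;
    if n_pos == 0 || n_neg == 0 {
        return None;
    }
    let mut wins = 0.0;
    for (i, &si) in scores.iter().enumerate() {
        if !labels[i] {
            continue;
        }
        for (j, &sj) in scores.iter().enumerate() {
            if labels[j] {
                continue;
            }
            if si > sj {
                wins += 1.0;
            } else if si == sj {
                wins += 0.5;
            }
        }
    }
    Some(wins / (n_pos as f64 * n_neg as f64))
}

/// Fraction of positives in each score quartile, lowest scores first.
pub fn positive_rate_by_quartile(scores: &[f64], labels: &[bool]) -> [f64; 4] {
    let mut order: Vec<usize> = (0..scores.len()).collect();
    order.sort_by(|&a, &b| scores[a].total_cmp(&scores[b]));
    let n = order.len();
    let mut rates = [0.0; 4];
    for q in 0..4 {
        let bucket = &order[q * n / 4..(q + 1) * n / 4];
        if !bucket.is_empty() {
            let hits = bucket.iter().filter(|&&i| labels[i]).count();
            rates[q] = hits as f64 / bucket.len() as f64;
        }
    }
    rates
}

/// Assumed gate: AUC >= 0.85 AND top quartile richer in positives than bottom.
pub fn probe_passes(scores: &[f64], labels: &[bool]) -> bool {
    let auc_ok = roc_auc(scores, labels).map_or(false, |auc| auc >= 0.85);
    let rates = positive_rate_by_quartile(scores, labels);
    auc_ok && rates[3] > rates[0]
}
```

Here `labels[i] == true` would mark a held-out ClinVar variant and `scores[i]` its anomaly score; how those are produced is outside this sketch.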

§A1 — Salient cite-rot fix folded in

docs/GENETIC_RESEARCH_VIA_STACK.md §1.4 (just merged in #501) cited a non-existent AwarenessState::Salient and an f32 score field. Shipped variants per ndarray/src/hpc/clam.rs:1549-1557 are Crystallized / Tensioned / Uncertain / Noise; score is f64 (clam.rs:1504). Correction maps the score ≥ 0.75 quartile to AwarenessState::Noise (per clam.rs:1556), re-states the AnomalyScore struct against the shipped layout, and adds a "gated by PROBE-CHAODA-1000G" callout so future readers see the conjecture status of the novelty claim.
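To make the corrected mapping concrete, here is a small sketch of a score-to-state classifier. Only the four variant names, the `f64` score type, and the `score ≥ 0.75 → Noise` rule come from the description above. The 0.25 and 0.50 cut points and the `classify` function are assumptions for illustration and may not match `clam.rs`.

```rust
/// The four variant names listed in the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AwarenessState {
    Crystallized,
    Tensioned,
    Uncertain,
    Noise,
}

/// Hypothetical quartile mapping. Only the `>= 0.75 -> Noise` branch is
/// stated in the PR; the lower cut points are assumed for illustration.
/// A NaN score fails every comparison and falls through to Crystallized,
/// so a real implementation would likely want to reject NaN up front.
pub fn classify(score: f64) -> AwarenessState {
    if score >= 0.75 {
        AwarenessState::Noise
    } else if score >= 0.50 {
        AwarenessState::Uncertain
    } else if score >= 0.25 {
        AwarenessState::Tensioned
    } else {
        AwarenessState::Crystallized
    }
}
```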

Files

  • + .claude/plans/genetics-probes-v1.md (~300 lines, new)
  • ~ docs/GENETIC_RESEARCH_VIA_STACK.md (§1.4 cite-rot fix, 4 lines changed)

Test plan

  • Spec doc renders cleanly on GitHub (matched against ocr-probes-v1.md reference shape).
  • clam.rs:1498-1567 AnomalyScore citation still accurate when reviewer spot-checks.
  • clam.rs:1549-1557 AwarenessState variants still match (Crystallized / Tensioned / Uncertain / Noise).
  • No probe is RUN in this PR — pure docs.

Cross-refs

https://claude.ai/code/session_01VysoWJ6vsyg3wEGc5v7T5v

Promotes the three probes named in §2 of
.claude/plans/genetic-research-substrate-integration-v1.md (merged in #501)
from "named" to "fully-specified with file:line citations + pass/fail
criteria locked." Mirrors the probe-spec discipline of
ocr-probes-v1.md (PR #500).

Probe specs:
  - PROBE-CHAODA-1000G (P0, ~3 days after D-GEN-1+2): novel-variant
    detection on 1000-Genomes Phase 3 + ClinVar held-out. Feature
    vector locked at 5 lanes (AF / DP / FS / 100bp Shannon entropy
    via bgz17 / phyloP100way). ROC-AUC >= 0.85 pass condition with
    a per-quartile separation sanity check. Critical path: if it
    fails, the unsupervised novelty story in
    GENETIC_RESEARCH_VIA_STACK.md §1.4 collapses.
  - PROBE-KRAS-COUNTERFACTUAL-DET (P1, ~2 days inside D-GEN-7):
    bit-exact MailboxSoA<1024> across two seeded runs. Regression
    gate for the substrate's no-randomness invariant under fan-out
    load.
  - PROBE-CAM-PQ-VS-BLAST (P2, ~1 week): Spearman rho >= 0.7 + ICC
    >= 0.6 against BLAST e-value top-100 rankings on a 10K RefSeq
    protein subset via ESM-2 small (320-D) embeddings.
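
The Spearman rho gate in the last probe can be sketched as Pearson correlation over average ranks. This is an illustrative helper only: `spearman_rho` and `average_ranks` are hypothetical names, the inputs are assumed to be two aligned lists of per-hit values (for example a fingerprint distance and a BLAST e-value, both "lower is better"), and the ICC half of the gate is not covered.

```rust
/// 1-based ranks with ties sharing their average rank.
fn average_ranks(values: &[f64]) -> Vec<f64> {
    let mut order: Vec<usize> = (0..values.len()).collect();
    order.sort_by(|&a, &b| values[a].total_cmp(&values[b]));
    let mut ranks = vec![0.0; values.len()];
    let mut i = 0;
    while i < order.len() {
        // Extend j over the run of values tied with position i.
        let mut j = i;
        while j + 1 < order.len() && values[order[j + 1]] == values[order[i]] {
            j += 1;
        }
        let avg = (i + j) as f64 / 2.0 + 1.0;
        for &idx in &order[i..=j] {
            ranks[idx] = avg;
        }
        i = j + 1;
    }
    ranks
}

/// Spearman rank correlation; None for mismatched, too-short, or constant input.
pub fn spearman_rho(x: &[f64], y: &[f64]) -> Option<f64> {
    if x.len() != y.len() || x.len() < 2 {
        return None;
    }
    let (rx, ry) = (average_ranks(x), average_ranks(y));
    let n = x.len() as f64;
    let mean_x = rx.iter().sum::<f64>() / n;
    let mean_y = ry.iter().sum::<f64>() / n;
    let (mut cov, mut var_x, mut var_y) = (0.0, 0.0, 0.0);
    for (a, b) in rx.iter().zip(&ry) {
        cov += (a - mean_x) * (b - mean_y);
        var_x += (a - mean_x) * (a - mean_x);
        var_y += (b - mean_y) * (b - mean_y);
    }
    if var_x == 0.0 || var_y == 0.0 {
        return None;
    }
    Some(cov / (var_x * var_y).sqrt())
}
```

If one ranking is "higher is better" and the other "lower is better", one side has to be negated before comparing against the 0.7 threshold.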

Sequencing locked: PROBE-CHAODA-1000G is the single highest-leverage
probe; the per-quartile separation check guards against an inverted
signal regime.

§A1 cite-rot fix folded in:
  GENETIC_RESEARCH_VIA_STACK.md §1.4 cited a non-existent
  AwarenessState::Salient and an f32 score field. Shipped variants
  per clam.rs:1549-1557 are Crystallized / Tensioned / Uncertain /
  Noise; score field is f64 (clam.rs:1504). Corrected to map the
  score >= 0.75 quartile to AwarenessState::Noise per clam.rs:1556,
  with the AnomalyScore struct re-stated against the shipped layout.
  Also added the "gated by PROBE-CHAODA-1000G" callout so future
  readers see the conjecture status of the novelty claim.

https://claude.ai/code/session_01VysoWJ6vsyg3wEGc5v7T5v
AdaWorldAPI merged commit cb14704 into main on Jun 16, 2026
1 check passed

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 444e371e0d

layout (`ndarray/src/hpc/clam.rs`, HEEL=16 / HIP=256 / TWIG=4096 per
`lance-graph/.claude/session_2026_04_11_bf16_hhtl_combined_research.md`).
2. Project held-out vectors through the tree (assign to leaf cluster).
3. Compute `anomaly_scores(held_out_bytes, vec_len=5)` → `Vec<AnomalyScore>`.

P1: Score held-out variants with a real projection path

For this held-out scenario, the documented call cannot produce held-out scores. The current ClamTree::anomaly_scores implementation used by this workspace returns one score per index of the dataset used to build the tree, by mapping self.reordered leaf indices to LFDs; it does not use the supplied bytes to route new points. Passing held_out_bytes after training on chromosomes 1/7/17 will therefore score the first N training indices or leave zero LFDs, so the ROC-AUC gate can pass or fail on meaningless labels. The probe should first add a real projection/scoring path, or build the tree on the evaluation set.
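
One possible shape for such a projection path is nearest-leaf routing: each held-out vector inherits the score of the leaf whose center is closest. The sketch below is hypothetical and does not use the ClamTree API; `Leaf`, `score_held_out`, the Euclidean metric, and the idea that each leaf carries a single precomputed score are all assumptions.

```rust
/// Hypothetical leaf summary: a cluster center plus a score computed at
/// training time. Not the ClamTree layout.
pub struct Leaf {
    pub center: Vec<f64>,
    pub score: f64,
}

/// Squared Euclidean distance (assumed metric).
fn sq_dist(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Route each held-out vector to its nearest leaf and return that leaf's
/// score. The output has one entry per held-out vector, so it lines up with
/// the held-out labels. With no leaves, every score is NaN.
pub fn score_held_out(leaves: &[Leaf], held_out: &[Vec<f64>]) -> Vec<f64> {
    held_out
        .iter()
        .map(|v| {
            leaves
                .iter()
                .min_by(|a, b| sq_dist(&a.center, v).total_cmp(&sq_dist(&b.center, v)))
                .map(|leaf| leaf.score)
                .unwrap_or(f64::NAN)
        })
        .collect()
}
```

The property that matters for the ROC-AUC gate is that the output is indexed by held-out point rather than by training index.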

|---|---|---|
| 0 | Allele frequency | VCF `INFO/AF` |
| 1 | Total read depth | VCF `INFO/DP` |
| 2 | Strand bias (Fisher) | VCF `INFO/FS` (GATK convention) |

P2: Use fields present in the pinned VCF corpora

With the pinned corpora, this lane is not extractable as written: INFO/FS is a GATK callset annotation, but the 1000 Genomes Phase 3 release and ClinVar VCFs pinned below are not a paired GATK callset, and ClinVar in particular does not carry per-sample depth/strand-bias evidence. An implementation would have to impute/drop this lane or join BAM/sample-call data, so the locked 5-D feature vector is not reproducible from the stated sources.
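
A cheap guard against this class of problem is to check the VCF header for the required INFO IDs before extracting features. The sketch below is illustrative: the function names are hypothetical, and it only inspects `##INFO=<ID=...>` header lines rather than using a VCF parsing library.

```rust
use std::collections::BTreeSet;

/// Collect the IDs declared by `##INFO=<ID=...,...>` lines in a VCF header.
pub fn declared_info_ids(header: &str) -> BTreeSet<String> {
    header
        .lines()
        .filter_map(|line| line.strip_prefix("##INFO=<ID="))
        .filter_map(|rest| rest.split(|c: char| c == ',' || c == '>').next())
        .map(str::to_string)
        .collect()
}

/// Return the required INFO IDs that the header does not declare,
/// in the order they were requested.
pub fn missing_info_fields(header: &str, required: &[&str]) -> Vec<String> {
    let declared = declared_info_ids(header);
    required
        .iter()
        .filter(|id| !declared.contains(**id))
        .map(|id| id.to_string())
        .collect()
}
```

Running `missing_info_fields(header, &["AF", "DP", "FS"])` against each pinned corpus before locking the feature vector would surface an absent lane early. A declared header line does not guarantee every record carries the field, so a per-record check may still be needed.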

Comment on lines +214 to +218
**Step 2 — Embed.** ESM-2 small (`esm2_t6_8M_UR50D`) → 320-D protein
embeddings via the existing GGUF loader + ndarray AMX int8 GEMM path.

**Step 3 — Fingerprint.** CAM-PQ encode each 320-D embedding → 48-bit
`Cam6x8`. (Note: SimLex-999 fidelity was measured on 1024-D Jina; ESM-2

P2: Avoid silently dropping ESM dimensions in CAM-PQ

The shipped CAM-PQ codec splits vectors by total_dim / 6, so a 320-D ESM vector becomes six 53-D subspaces and dimensions 318-319 are ignored unless the probe explicitly pads, projects, or truncates. Since this probe claims to compare fingerprints of the full 320-D embeddings, the planned BLAST agreement metric will not measure the stated representation and may change under a future fixed codec; choose a 6-divisible representation or document the padding/projection step.
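
The arithmetic behind this comment, and one of the suggested remedies (zero-padding to a 6-divisible length), can be sketched as follows. The helper names are hypothetical and this is not the CAM-PQ codec; it only models the `total_dim / 6` integer split described above.

```rust
/// Number of subspaces in the split described above.
pub const SUBSPACES: usize = 6;

/// Returns (subspace width, trailing dimensions dropped) when a vector of
/// `total_dim` is split by integer division into SUBSPACES equal parts.
pub fn split_plan(total_dim: usize) -> (usize, usize) {
    let width = total_dim / SUBSPACES;
    (width, total_dim - width * SUBSPACES)
}

/// Zero-pad a vector up to the next multiple of SUBSPACES so that no
/// dimension is dropped. Zero-padding leaves Euclidean distances between
/// vectors unchanged.
pub fn pad_to_multiple(v: &[f32]) -> Vec<f32> {
    let padded_len = (v.len() + SUBSPACES - 1) / SUBSPACES * SUBSPACES;
    let mut out = v.to_vec();
    out.resize(padded_len, 0.0);
    out
}
```

For a 320-D embedding, `split_plan(320)` gives six 53-D subspaces with 2 dimensions left over; padding to 324 gives six 54-D subspaces with none dropped. Whether padding, projection, or truncation is the right choice for the probe is a design decision the spec would need to document.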
