RandomCoder-lab
diff --git a/experiments/hybrid_llm/README.md b/experiments/hybrid_llm/README.md
Lines changed: 188 additions & 0 deletions
diff --git a/experiments/hybrid_llm/experiment_0_copy_task.omc b/experiments/hybrid_llm/experiment_0_copy_task.omc
Lines changed: 186 additions & 0 deletions
@@ -0,0 +1,188 @@
+# Hybrid Harmonic / Transformer LLM
+
+This branch (`claude/phi-field-llm-evolution`) explores using OMC's φ-math
+primitives to replace or augment specific transformer components, with the
+goal of producing measurable behavior differences on real sequence tasks.
+
+The existing pure-OMC demos (`examples/phi_field_llm_demo.omc`,
+`examples/phi_field_llm_multilayer.omc`) prove that geodesic
+attention — picking the Fibonacci attractor with the highest
+`OmniWeight w = φ^(-|e|)` — runs end-to-end. They don't yet show
+**when** that's better than softmax-QK attention and **what it costs**.
+This experiment series answers that.
+
+## The substitutions we want to test
+
+Three transformer pieces map cleanly onto OMC's harmonic primitives:
+
+| Transformer piece | Harmonic replacement | What we're measuring |
+|---|---|---|
+| **Sinusoidal positional encoding** | Golden-angle rotation (`pos * 2π/φ²`) folded onto Fibonacci attractors via `phi.fold`. | Length-generalization: does a model trained on length N still work at 2N? Sinusoidal PE is known to extrapolate poorly. |
+| **Softmax attention scoring** | OmniWeight: `w(q, k) = φ^(-\|q − k\| / max(\|k\|, 1))`. Per-position; pick argmax instead of weighted average. | Sharpness vs. softness. OmniWeight is winner-take-all. Useful for copy/lookup tasks; lossy for averaging tasks. |
+| **Layer-norm + residual** | `phi.fold(residual_blend)` (already implemented in `phi_field_llm_multilayer.omc`). | Whether the φ-fold provides a useful regularizer that keeps activations on-attractor. |
+
+Phase 0 of this branch focuses on the second row, OmniWeight attention,
+because it's the most isolated and the existing demos already implement it.
+The other two come later.
+
+## Experiment 0: Copy task — OmniWeight vs softmax
+
+The simplest task that distinguishes the two approaches:
+
+- **Input:** a sequence of 8 Fibonacci-aligned tokens drawn at random
+  from `{1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233}`, plus a separator,
+  plus a "query" token that copies one of the inputs verbatim.
+  Example (shortened to 5 context tokens): `[34, 8, 89, 13, 21, |, 89]` → expected next token `89`.
+- **Models:**
+  - OmniWeight-attention head over the input (the current
+    `best_attractor` mechanism).
+  - Softmax-attention head over the same inputs, where the score is
+    `exp(-|q − k|)` normalized. Both use **no learned weights** — this
+    isolates the scoring function from training dynamics.
+- **Metric:** exact-match accuracy on 100 random instances, broken
+  down by (a) whether the query exactly matches an input, (b) how
+  many distractors share the query's nearest attractor.
+
+If OmniWeight wins on (a) and loses on (b), that confirms the
+"winner-take-all" thesis and tells us where to apply it in a larger model.
+
+**Status:** `experiment_0_copy_task.omc` runs this comparison.
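As a rough illustration of the two scoring rules compared in this experiment, here is a minimal Python sketch. It is not the OMC script: the function names and the example context are made up for illustration, and only the two formulas are taken from this README.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio


def omni_weight(q, k):
    """OmniWeight score: phi^(-|q - k| / max(|k|, 1))."""
    return PHI ** (-abs(q - k) / max(abs(k), 1))


def softmax_score(q, k):
    """Unnormalised softmax-style score: exp(-|q - k|)."""
    return math.exp(-abs(q - k))


def argmax_pos(context, query, score):
    """Index of the best-scoring key; ties keep the first occurrence."""
    best = 0
    for i in range(1, len(context)):
        if score(query, context[i]) > score(query, context[best]):
            best = i
    return best


# Illustrative context only (not taken from the experiment's seeded trials).
context = [34, 8, 89, 13, 21, 89, 5, 2]

exact = (argmax_pos(context, 89, omni_weight),
         argmax_pos(context, 89, softmax_score))
perturbed = (argmax_pos(context, 26, omni_weight),
             argmax_pos(context, 26, softmax_score))
print(exact)      # (2, 2): both pick the first of the two 89s
print(perturbed)  # (0, 4): OmniWeight picks 34, softmax picks 21
```

On the exact-match query both rules give the matching key a score of 1.0, so they land on the same (first) position. On the off-attractor query 26 they split, because OmniWeight compares relative distance (8/34 against 5/21) while the exponential score compares absolute distance (8 against 5).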
+
+## Why no torch yet
+
+The current remote environment has no torch / numpy. Pure-OMC
+experiments give us:
+
+1. Deterministic, reproducible runs inside the standalone binary.
+2. No dependency on `python-embed` for the experiment itself.
+3. A baseline that any later torch-based experiment must match
+   byte-for-byte on the harmonic side.
+
+Once we have a winning harmonic primitive, the next branch step is to
+port the same scoring rule to PyTorch (via `examples/lib/torch.omc` or
+a stand-alone Python script) and bench against a real learned model
+on a real corpus.
+
+## How to run
+
+```bash
+# Build (one time)
+PYO3_USE_ABI3_FORWARD_COMPATIBILITY=1 cargo build --release
+
+# Run experiment 0 (tree-walk)
+./target/release/omnimcode-standalone experiments/hybrid_llm/experiment_0_copy_task.omc
+
+# Same under the bytecode VM
+OMC_VM=1 ./target/release/omnimcode-standalone experiments/hybrid_llm/experiment_0_copy_task.omc
+
+# Audit: bytecode VM must match tree-walk
+./target/release/omnimcode-standalone --audit experiments/hybrid_llm/experiment_0_copy_task.omc
+```
+
+## Results so far
+
+| Experiment | Setting | Headline number |
+|---|---|---|
+| 0 | Copy task, exact-match query, 100 trials | OmniWeight 82/100, softmax 82/100, 0 disagreements. Confirms both scorers agree on exact match (the 18 "misses" are duplicate-value trials, both tie-break to first occurrence). |
+| 1 | Perturbed query (query = true_val + noise), 200 trials per noise level | Softmax wins at every noise level tested. noise=1: 189 vs 170. noise=7: 118 vs 99. noise=50: 42 vs 33. OmniWeight's \|k\|-normalised exponent shrinks the penalty for larger \|k\|, so it pulls toward larger-magnitude attractors regardless of perturbation direction (e.g. for q = 26 the exponent for k = 34 is 8/34 ≈ 0.235, lower than 5/21 ≈ 0.238 for the nearer k = 21), which hurts the "recover the original value" objective. |
+| 2 | Single-channel PE distinctness + lookup at L = 8 / 14 / 24 / 48 | Sinusoidal wins at short L (8/8 vs 6/8). At L=48 harmonic appears to overtake: 38/48 vs 26/48 (79% vs 54%). Flagged as a likely metric artefact — single-int "closest code" lookup favours monotonic over periodic encodings. |
+| 3 | 4-channel PE (harmonic primes 7/11/13/17, sin/cos periods 8/64), L2 lookup, L = 8 → 200 | **Sinusoidal regains its lead decisively at every L ≥ 16.** L=48: 48/48 vs 21/48. L=200: 72/200 vs 34/200. Harmonic saturates at 22 unique vectors by L=64; sinusoidal stays perfectly distinct up to L=64 then saturates at 64. The single-channel L=48 harmonic "win" was a metric artefact, exactly as suspected. |
+| 4A | Harmonic OOD gate vs L2-NN baseline on 4-dim synthetic vectors (N_REF=300, 150 in-dist test, 150 OOD test). OOD = uniform [1, 90]. | L2 wins. AUROC L2 0.961 vs harmonic 0.910. TPR @ FPR=10%: L2 0.91 vs harmonic 0.71. L2 has a trivial magnitude advantage — mean L2 score 87 (in-dist) vs 1313 (OOD), since OOD vectors are larger on average and harmonic gate's `phi.fold` discards magnitude. |
+| 4B | Same gates, **magnitude-matched** structural OOD (inverted attractor weights: 10%/30%/60% small/med/large vs in-dist's 60%/30%/10%). | **Harmonic edges past L2 in AUROC: 0.956 vs 0.946.** At low FPR L2 still wins (TPR@FPR=1%: L2 0.60 vs harmonic 0.48), but on overall ranking the structural rarity signal beats the L2 metric once magnitude is no longer a giveaway. |
+| 5 | HBit cross-cutting tension (no reference) + combined gate (sum of z-normalised HBit, marginal rarity, L2) on both scenarios. | **Scenario A: HBit tension AUROC = 1.0** (perfect — mean tension 0.0 in-dist vs 20.1 OOD). Combined: 0.999. **Scenario B: HBit AUROC = 0.5** (random — both sides on-manifold, tension = 0 everywhere). Combined: 0.967, beating every single gate. Each gate owns a different OOD axis: HBit→off-manifold, marginal→distribution-shift, L2→magnitude. |
+| 6 | Phi-Pi-Fib compression gate: model as `(library + chain of keys)` instead of dense weights. 12-primitive library keyed by Fibonacci attractors, gate = nearest-key lookup, chains = "parameters". | Composition: trace `[3, 8, 13, 5, 21]` on state 7 → 9. Compression: 29 ints (library+chain) vs ~1001 ints dense table over [0,1000] = ~34× smaller (extrapolates to 9 orders of magnitude at LLM scale). **Death tolerance: all 12 library deletions complete without crashing — biggest deltas: kill key=13 → +12, kill key=5 → +5, kill key=21 → +3. 8 of 12 deletions invisible to output (unused capabilities or path coincidence).** Interchangeability: 6 different chains over the same library yield 6 different outputs (9, 22, 9, 5, 5, 52). |
+| 7 | Wire `phi_pi_fib::fibonacci_search` in as four OMC builtins (`phi_pi_fib_search`, `phi_pi_fib_nearest`, `phi_pi_fib_stats`, `phi_pi_fib_reset`). Rerun exp 6's gate using the real Fibonacci-step search; measure comparison counts vs library size. | **Sublinear scaling confirmed.** N=8 → 3.8 compares/search, N=1024 → 12.6. Going 128× wider in library size grows the per-lookup work only ~3.3×, vs ~128× for a linear scan. Empirically tracks `~log₂(N)`, slightly better than `log_φ_π_fibonacci(N) ≈ 1.44·log₂(N)`. Sanity check passes (same final state as exp 6). Death tolerance preserved across all 12 library deletions. 148/148 existing tests still pass. |
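To make the comparison-count measurement in row 7 concrete, the sketch below counts probes in a textbook Fibonacci search. It is an assumption-laden stand-in, not `phi_pi_fib::fibonacci_search` (whose internals are not shown here), so its averages will generally differ from the 3.8 and 12.6 reported in the table. `fibonacci_search` and `avg_compares` are names chosen for this example.

```python
def fibonacci_search(sorted_keys, target):
    """Textbook Fibonacci search. Returns (index or -1, number of probes)."""
    n = len(sorted_keys)
    fib2, fib1 = 0, 1            # F(m-2), F(m-1)
    fib = fib2 + fib1            # F(m): smallest Fibonacci number >= n
    while fib < n:
        fib2, fib1 = fib1, fib
        fib = fib2 + fib1
    offset, compares = -1, 0     # everything at or before offset is < target
    while fib > 1:
        i = min(offset + fib2, n - 1)
        compares += 1            # one three-way comparison per probe
        if sorted_keys[i] < target:
            fib, fib1, fib2 = fib1, fib2, fib1 - fib2
            offset = i
        elif sorted_keys[i] > target:
            fib, fib1, fib2 = fib2, fib1 - fib2, 2 * fib2 - fib1
        else:
            return i, compares
    if fib1 and offset + 1 < n:
        compares += 1
        if sorted_keys[offset + 1] == target:
            return offset + 1, compares
    return -1, compares


def avg_compares(n):
    """Average probes per successful lookup over a library of n sorted keys."""
    keys = list(range(n))
    return sum(fibonacci_search(keys, k)[1] for k in keys) / n


for n in (8, 1024):
    print(n, round(avg_compares(n), 2))
```

The point of the sketch is the shape of the curve: the probe count grows with the Fibonacci index of N (roughly logarithmic), while a linear scan grows in proportion to N.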
+
+### Cumulative read across experiments 0–5
+
+The six experiments now form a complete picture. Each OOD axis has
+a gate that owns it:
+
+| Failure mode | Owning gate | Cost | Scenario A AUROC | Scenario B AUROC |
+|---|---|---|---|---|
+| Off-manifold values | **HBit cross-cutting tension** | **Reference-free** | **1.000** | 0.500 |
+| Wrong attractor distribution | Marginal log-rarity (exp 4 harmonic) | needs reference | 0.910 | 0.956 |
+| Wrong magnitude | L2 nearest-neighbour | needs reference | 0.961 | 0.946 |
+| Any of the above | Sum of z-normalised triple | needs reference | 0.999 | 0.967 |
+
+The HBit gate is the cheapest possible: `sum_d |v[d] − phi.fold(v[d])|`.
+Zero fitting, zero reference set, perfect detector when the OOD axis is
+"value isn't a Fibonacci attractor". Useless when both sides are
+on-manifold (scenario B mean tension is 0.0 on both in-dist and OOD —
+the gate can't see any difference).
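A minimal Python sketch of the HBit tension formula above. It assumes `phi.fold` snaps a value to the nearest Fibonacci attractor from the 12-value vocabulary used in experiment 0; the real OMC primitive may behave differently, and `phi_fold` / `hbit_tension` are illustrative names.

```python
# Assumption: the attractor set is the 12-value vocabulary from experiment 0.
FIB_ATTRACTORS = [1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233]


def phi_fold(x):
    """Hypothetical stand-in for phi.fold: nearest Fibonacci attractor."""
    return min(FIB_ATTRACTORS, key=lambda f: abs(f - x))


def hbit_tension(vec):
    """sum_d |v[d] - phi_fold(v[d])|: zero iff every entry is on-attractor."""
    return sum(abs(v - phi_fold(v)) for v in vec)


on_manifold = [5, 13, 34, 89]    # every entry is an attractor
off_manifold = [6, 13, 30, 89]   # 6 and 30 are off-attractor

print(hbit_tension(on_manifold))   # 0
print(hbit_tension(off_manifold))  # 5  (|6 - 5| + |30 - 34|)
```

No reference set or fitting step appears anywhere in the computation, which is why the gate is reference-free, and also why it returns 0 for any on-attractor vector whether or not that vector is in-distribution.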
+
+The combined gate is the clear winner across both scenarios. Sum of
+z-normalised per-gate scores, with the z-normalisation parameters
+fit on **in-dist scores only** (the combiner doesn't peek at OOD data).
+Scenario A: 0.999, almost perfect: it gets HBit's free wins plus L2 and
+marginal contributions. Scenario B: 0.967, ahead of the marginal gate
+(0.956) by about 1 AUROC point, L2 (0.946) by about 2, and HBit (0.500) by far more.
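The combiner described above can be sketched in a few lines of Python. Everything here is illustrative: the toy score rows are invented (they are not experiment data), the column order `(hbit, marginal, l2)` is an assumption, and AUROC is computed with the plain pairwise-ranking definition.

```python
import statistics


def fit_z(in_dist_scores):
    """Fit mean/std on in-distribution scores only."""
    mu = statistics.fmean(in_dist_scores)
    sd = statistics.pstdev(in_dist_scores) or 1.0  # guard constant gates
    return mu, sd


def combined_score(gate_scores, params):
    """Sum of z-normalised per-gate scores for one test point."""
    return sum((s - mu) / sd for s, (mu, sd) in zip(gate_scores, params))


def auroc(in_dist, ood):
    """P(OOD score > in-dist score), counting ties as one half."""
    wins = sum((o > i) + 0.5 * (o == i) for o in ood for i in in_dist)
    return wins / (len(ood) * len(in_dist))


# Toy rows of (hbit, marginal, l2) scores. Invented for illustration only.
IN_DIST = [(0.0, 1.0, 80.0), (0.0, 1.2, 90.0), (0.0, 0.8, 100.0)]
OOD = [(0.0, 2.5, 95.0), (0.0, 1.1, 140.0)]

params = [fit_z(column) for column in zip(*IN_DIST)]
per_gate = [auroc([r[g] for r in IN_DIST], [r[g] for r in OOD])
            for g in range(3)]
combined = auroc([combined_score(r, params) for r in IN_DIST],
                 [combined_score(r, params) for r in OOD])
print(per_gate, combined)
```

In this toy setup the HBit column is zero everywhere, so its AUROC is exactly 0.5, mirroring scenario B. Each of the other two gates misranks one pair, while their z-normalised sum separates the two sets completely.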
+
+What this means concretely:
+
+1. **Reference-free OOD detection is real on harmonic-structured
+   data.** If your in-distribution lives on (or near) the Fibonacci
+   attractor manifold, HBit tension is a free OOD signal you can
+   compute on a single test point with no model fitting. Cost is
+   D float subtractions per test point.
+
+2. **The "harmonic substrate is a structural detector" thesis is
+   now empirically grounded for OOD gating**, with quantified
+   contribution from each piece. Exp 0-3 ruled out using harmonic
+   primitives as drop-in replacements for transformer components.
+   Exp 4-5 found their actual home: as auxiliary detectors layered
+   onto raw features (or activations) to catch failure modes that
+   L2 alone misses.
+
+3. **The combined gate is the deployable artifact.** Three
+   complementary axes, z-normalised on the reference, summed.
+   Wins on both magnitude-shifted and structural OOD. Beats every
+   single-gate baseline.
+
+### What changed between experiment 2 and experiment 3
+
+Experiment 2 used **single-integer codes** and a **closest-int**
+lookup metric. Single-integer codes can't capture the geometric
+frequency layering that makes sinusoidal PE work in real
+transformers — once the period wraps, the encoding is dead.
+
+Experiment 3 used **4-channel vectors** and **L2 distance**. That
+gives sinusoidal a long-period channel (P=64) that stays distinct
+well past the short-period channel's wrap. Harmonic gets four
+prime-multiplier channels but they all saturate at the same
+Fibonacci ceiling, so the joint vector hits its uniqueness budget
+fast (22 unique vectors total) and stays there forever.
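The saturation behaviour of the sinusoidal side can be checked with a short Python sketch. It assumes the 4 channels are sin/cos pairs at periods 8 and 64, as listed in the results table; the harmonic side is omitted because the prime-multiplier fold is not specified here. Function names are illustrative.

```python
import math


def sinusoidal_code(pos, periods=(8, 64)):
    """One (sin, cos) pair per period. Reducing pos modulo the period first
    makes positions that differ by a full period produce identical floats."""
    code = []
    for p in periods:
        angle = 2 * math.pi * (pos % p) / p
        code += [math.sin(angle), math.cos(angle)]
    return tuple(code)


def unique_codes(length, periods=(8, 64)):
    """Number of distinct position codes among positions 0..length-1."""
    return len({sinusoidal_code(pos, periods) for pos in range(length)})


print(unique_codes(48))                # 48: still perfectly distinct
print(unique_codes(200))               # 64: capped by the longest period
print(unique_codes(48, periods=(8,)))  # 8: a single short period wraps early
```

The number of distinct codes is bounded by the longest period, so adding the P=64 channel is what keeps positions separable beyond the point where the P=8 channel alone has already wrapped.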
+
+The lesson is one of the project's existing themes spelled out
+again: **measure honestly, and let the measurement reshape the
+plan.** Experiment 2's headline number was reproducible and
+audited, but the framing was wrong. Adding experiment 3 — same
+question, fairer comparison — flipped the answer. The README is
+updated to reflect the cumulative read, not just the latest
+result.
+
+## Roadmap on this branch
+
+- **0** Copy task: OmniWeight vs softmax scoring. ✓ done
+- **1** Perturbed-query divergence study. ✓ done
+- **2** Single-channel positional-encoding distinctness + lookup. ✓ done
+- **3** Multi-channel PE with L2 lookup. ✓ done
+- **4** Harmonic OOD gate vs L2-NN baseline, two scenarios. ✓ done
+- **5** HBit cross-cutting tension + 3-gate combined detector. ✓ done
+- **6** Phi-Pi-Fib compression gate: model = library + chain. ✓ done
+- **7** Wire `omnimcode-core/src/phi_pi_fib.rs::fibonacci_search` in
+  as four OMC builtins; rerun exp 6's gate on top; measure compare
+  counts. ✓ done
+- **8** Learnable routing policy: a function `state -> chain` that
+  picks WHICH chain to run from input state. Start with a simple
+  hand-authored policy (if state on small attractor use chain A,
+  else chain B); then explore phi-folded state as a hash into a
+  policy table. This is the "compression gate as learned component"
+  half — exp 6 had only the library + nearest-key fallback.
+- **9** Layer-norm-matched OOD setup (was the old exp 6):
+  pre-normalise to unit L2 and re-run scenarios A and B from exp 4.
+  Intended to test HBit's magnitude-invariance.
+- **10** Bake the combined OOD gate into a reusable library:
+  `experiments/hybrid_llm/lib/ood_gate.omc` exposing
+  `ood_gate.fit(ref_corpus)` and `ood_gate.score(vec)`. Then once
+  torch is available, replicate on real transformer activations.
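For roadmap item 8, the "library + chain" model with a hand-authored routing policy might look roughly like the Python sketch below. All specifics are hypothetical: the primitives (`s -> (s + key) % 1000`), the contents of `CHAIN_B`, and the "small attractor" threshold of 8 are invented for illustration, so this will not reproduce the outputs reported for experiment 6.

```python
KEYS = [1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233]


def make_library():
    """Hypothetical 12-primitive library: key k maps state s to (s + k) % 1000."""
    return {k: (lambda s, k=k: (s + k) % 1000) for k in KEYS}


def nearest_key(key, library):
    """Gate: nearest surviving key, so a deleted primitive degrades gracefully."""
    return min(library, key=lambda k: abs(k - key))


def run_chain(chain, state, library):
    """Apply the primitives named by the chain, in order."""
    for key in chain:
        state = library[nearest_key(key, library)](state)
    return state


CHAIN_A = [3, 8, 13, 5, 21]   # the chain traced in experiment 6
CHAIN_B = [34, 55, 89]        # hypothetical second chain


def route(state):
    """Hand-authored policy: small nearest attractor -> chain A, else chain B."""
    attractor = min(KEYS, key=lambda k: abs(k - state))
    return CHAIN_A if attractor <= 8 else CHAIN_B


library = make_library()
full = run_chain(route(7), 7, library)

damaged = make_library()
del damaged[13]               # "kill key=13"; the gate falls back to key 8
degraded = run_chain(CHAIN_A, 7, damaged)

print(full, degraded)         # 57 52
```

Deleting a primitive changes the output but never raises, because the gate always resolves to some surviving key. Swapping `route` for a table indexed by the folded state would be one way to move from the hand-authored policy toward the policy-table variant mentioned in item 8.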
@@ -0,0 +1,186 @@
+# =============================================================================
+# Experiment 0 — OmniWeight vs softmax scoring on a copy task.
+#
+# Setup:
+#   - 8-token "context" sampled from a Fibonacci-attractor vocab.
+#   - 1 separator token (0), part of the task definition but never added to
+#     the scored context in this script.
+#   - 1 "query" token that equals one of the context tokens verbatim.
+#   - Both scoring rules must point at the matching context position.
+#
+# We are NOT training anything. We're isolating the SCORING FUNCTION:
+#   - OmniWeight (harmonic): w(q, k) = φ^(-|q - k| / max(|k|, 1))
+#   - Softmax-style:         w(q, k) = exp(-|q - k|), normalized
+# Both pick the argmax position; we measure exact-match accuracy.
+#
+# Run:
+#   ./target/release/omnimcode-standalone experiments/hybrid_llm/experiment_0_copy_task.omc
+#   OMC_VM=1 ./target/release/omnimcode-standalone experiments/hybrid_llm/experiment_0_copy_task.omc
+# =============================================================================
+
+h PHI = 1.6180339887498948;
+
+# Vocabulary: 12 Fibonacci attractors at indices 0..11. The separator (token 0)
+# is not part of the vocabulary, so it is never sampled or queried.
+h VOCAB_SIZE = 12;
+fn vocab_at(i) -> int {
+    if i == 0 { return 1; }
+    if i == 1 { return 2; }
+    if i == 2 { return 3; }
+    if i == 3 { return 5; }
+    if i == 4 { return 8; }
+    if i == 5 { return 13; }
+    if i == 6 { return 21; }
+    if i == 7 { return 34; }
+    if i == 8 { return 55; }
+    if i == 9 { return 89; }
+    if i == 10 { return 144; }
+    if i == 11 { return 233; }
+    return 1;
+}
+
+# ---------------------------------------------------------------------------
+# Scoring functions
+# ---------------------------------------------------------------------------
+
+# OmniWeight: harmonic geodesic distance scaled by max(|k|, 1); no
+# normalization across positions.
+fn omni_weight(q, k) -> float {
+    h diff = to_float(q - k);
+    if diff < 0.0 { diff = 0.0 - diff; }
+    h denom = to_float(k);
+    if denom < 0.0 { denom = 0.0 - denom; }
+    if denom < 1.0 { denom = 1.0; }
+    h e = diff / denom;
+    return pow(PHI, 0.0 - e);
+}
+
+# Softmax-style score: exp(-|q - k|). Returned unnormalized; argmax is
+# scale-invariant so the softmax denominator doesn't affect the
+# selection. (We'd need it for KL divergence or sampling, not here.)
+fn softmax_score(q, k) -> float {
+    h diff = to_float(q - k);
+    if diff < 0.0 { diff = 0.0 - diff; }
+    return exp(0.0 - diff);
+}
+
+# ---------------------------------------------------------------------------
+# Argmax over a length-N context using a chosen scoring function.
+# `which_score` = 0 → OmniWeight, 1 → softmax.
+# Returns the INDEX of the highest-scoring position.
+# ---------------------------------------------------------------------------
+fn argmax_score(context, n, query, which_score) -> int {
+    h best_idx = 0;
+    h k0 = arr_get(context, 0);
+    h best_score = 0.0;
+    if which_score == 0 {
+        best_score = omni_weight(query, k0);
+    } else {
+        best_score = softmax_score(query, k0);
+    }
+
+    h i = 1;
+    while i < n {
+        h k = arr_get(context, i);
+        h s = 0.0;
+        if which_score == 0 {
+            s = omni_weight(query, k);
+        } else {
+            s = softmax_score(query, k);
+        }
+        if s > best_score {
+            best_score = s;
+            best_idx = i;
+        }
+        i = i + 1;
+    }
+    return best_idx;
+}
+
+# ---------------------------------------------------------------------------
+# Build one trial: random 8-token context, query equals context[target_idx].
+# Returns the predicted index from each scorer.
+# We pre-seed with random_seed() so the trials are reproducible.
+# ---------------------------------------------------------------------------
+fn run_trial(target_idx, ctx_len) -> array {
+    h context = arr_new(ctx_len, 0);
+    h i = 0;
+    while i < ctx_len {
+        # random_int is inclusive on both ends in OMC, so clamp upper.
+        h v = random_int(0, VOCAB_SIZE - 1);
+        arr_set(context, i, vocab_at(v));
+        i = i + 1;
+    }
+    h query = arr_get(context, target_idx);
+
+    h omni_pick = argmax_score(context, ctx_len, query, 0);
+    h soft_pick = argmax_score(context, ctx_len, query, 1);
+
+    # Pack result as [target, omni_pick, soft_pick] for downstream tallying.
+    h out = arr_new(3, 0);
+    arr_set(out, 0, target_idx);
+    arr_set(out, 1, omni_pick);
+    arr_set(out, 2, soft_pick);
+    return out;
+}
+
+# ---------------------------------------------------------------------------
+# Main loop.
+# ---------------------------------------------------------------------------
+random_seed(42);
+
+h N_TRIALS = 100;
+h CTX_LEN = 8;
+
+print("== Experiment 0: OmniWeight vs softmax-style argmax (copy task) ==");
+print(concat_many("trials=", N_TRIALS, "  ctx_len=", CTX_LEN, "  vocab_size=", VOCAB_SIZE));
+print("");
+
+h omni_correct = 0;
+h soft_correct = 0;
+h disagreements = 0;
+h trial = 0;
+while trial < N_TRIALS {
+    # Pick a target position uniformly in [0, CTX_LEN). random_int is
+    # inclusive on both ends, so the upper bound is CTX_LEN - 1.
+    h target = random_int(0, CTX_LEN - 1);
+    h trial_out = run_trial(target, CTX_LEN);
+    h tgt = arr_get(trial_out, 0);
+    h omni = arr_get(trial_out, 1);
+    h soft = arr_get(trial_out, 2);
+
+    # A "correct" pick = the picked POSITION equals the target position.
+    # NOTE: several positions can hold the same Fibonacci value, and all
+    # of them score the maximum. Both scorers tie-break to the first
+    # occurrence, so a trial whose target is a later duplicate counts as
+    # a miss for both scorers (same count, zero disagreements).
+    if omni == tgt { omni_correct = omni_correct + 1; }
+    if soft == tgt { soft_correct = soft_correct + 1; }
+    if omni != soft { disagreements = disagreements + 1; }
+    trial = trial + 1;
+}
+
+print(concat_many("OmniWeight argmax correct:  ", omni_correct, " / ", N_TRIALS));
+print(concat_many("Softmax-style argmax correct: ", soft_correct, " / ", N_TRIALS));
+print(concat_many("Disagreements between scorers: ", disagreements, " / ", N_TRIALS));
+print("");
+
+# When q == k exactly, both score functions return their maximum:
+# OmniWeight → φ^0 = 1; softmax → e^0 = 1. Both should pick correctly
+# whenever the query exactly matches AT LEAST ONE context value. The
+# interesting case is the tie-break — both implementations pick the
+# first occurrence, so they should AGREE on every trial.
+#
+# If disagreements > 0, we've found a case where φ-geodesic distance
+# and exponential distance rank differently. That's the seed for
+# experiment 1 (perturbed-query task) — handle it there.
+
+print("== Sanity check ==");
+print("Both scorers monotonically decrease in |q - k|, so on EXACT-MATCH");
+print("queries they should always agree. Non-zero disagreement is a bug");
+print("OR an environment ambiguity (multiple positions sharing the query");
+print("value, tie-broken differently).");
+print("");
+print("Next: experiment_1_perturbed.omc — query is off-attractor.");
+print("There, OmniWeight's denominator (max(|k|, 1)) normalises the");
+print("error differently from softmax's raw |q - k|, so they diverge");
+print("in measurable ways.");
+print("== End ==");