| 1 | +# Hybrid Harmonic / Transformer LLM |
| 2 | + |
| 3 | +This branch (`claude/phi-field-llm-evolution`) explores using OMC's φ-math |
| 4 | +primitives to replace or augment specific transformer components, with the |
| 5 | +goal of producing measurable behavior differences on real sequence tasks. |
| 6 | + |
| 7 | +The existing pure-OMC demos (`examples/phi_field_llm_demo.omc`, |
| 8 | +`examples/phi_field_llm_multilayer.omc`) prove that geodesic |
| 9 | +attention — picking the Fibonacci attractor with the highest |
| 10 | +`OmniWeight w = φ^(-|e|)` — runs end-to-end. They don't yet show |
| 11 | +**when** that's better than softmax-QK attention and **what it costs**. |
| 12 | +This experiment series answers that. |
| 13 | + |
| 14 | +## The substitutions we want to test |
| 15 | + |
| 16 | +Three transformer pieces map cleanly onto OMC's harmonic primitives: |
| 17 | + |
| 18 | +| Transformer piece | Harmonic replacement | What we're measuring | |
| 19 | +|---|---|---| |
| 20 | +| **Sinusoidal positional encoding** | Golden-angle rotation (`pos * 2π/φ²`) folded onto Fibonacci attractors via `phi.fold`. | Length-generalization: does a model trained on length N still work at 2N? Sinusoidal PE is known to extrapolate poorly. | |
| 21 | +| **Softmax attention scoring** | OmniWeight: `w(q, k) = φ^(-\|q − k\| / max(\|k\|, 1))`. Per-position; pick argmax instead of weighted average. | Sharpness vs. softness. OmniWeight is winner-take-all. Useful for copy/lookup tasks; lossy for averaging tasks. | |
| 22 | +| **Layer-norm + residual** | `phi.fold(residual_blend)` (already implemented in `phi_field_llm_multilayer.omc`). | Whether the φ-fold provides a useful regularizer that keeps activations on-attractor. | |
| 23 | + |
| 24 | +Phase 0 of this branch focuses on the second row of the table, OmniWeight |
| 25 | +attention, because it is the most isolated and the existing demos already implement it. |
| 26 | +The other two come later. |
| 27 | + |
| 28 | +## Experiment 0: Copy task — OmniWeight vs softmax |
| 29 | + |
| 30 | +The simplest task that distinguishes the two approaches: |
| 31 | + |
| 32 | +- **Input:** a sequence of 8 Fibonacci-aligned tokens drawn at random |
| 33 | + from `{1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233}`, plus a separator, |
| 34 | + plus a "query" token that copies one of the inputs verbatim. |
| 35 | + Example (shortened to five inputs for readability): `[34, 8, 89, 13, 21, |, 89]` → expected next token `89`. |
| 36 | +- **Models:** |
| 37 | + - OmniWeight-attention head over the input (the current |
| 38 | + `best_attractor` mechanism). |
| 39 | + - Softmax-attention head over the same inputs, where the score is |
| 40 | + `exp(-|q − k|)` normalized. Both use **no learned weights** — this |
| 41 | + isolates the scoring function from training dynamics. |
| 42 | +- **Metric:** exact-match accuracy on 100 random instances, broken |
| 43 | + down by (a) whether the query exactly matches an input, (b) how |
| 44 | + many distractors share the query's nearest attractor. |
| 45 | + |
| 46 | +If OmniWeight wins on (a) and loses on (b), that confirms the |
| 47 | +"winner-take-all" thesis and tells us where to apply it in a larger model. |
| 48 | + |
| 49 | +**Status:** `experiment_0_copy_task.omc` runs this comparison. |
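A minimal Python sketch of the comparison, written outside OMC purely for illustration. It is not the contents of `experiment_0_copy_task.omc`: the function names, the seed, and the choice to score by position (so a query whose value also appears earlier in the sequence counts as a miss) are assumptions made here to mirror the description above.

```python
import math
import random

PHI = (1 + math.sqrt(5)) / 2
VOCAB = [1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233]


def omni_weight(q, k):
    """OmniWeight score as written in the table: phi^(-|q - k| / max(|k|, 1))."""
    return PHI ** (-abs(q - k) / max(abs(k), 1))


def exp_score(q, k):
    """Unnormalised softmax score exp(-|q - k|); normalising does not move the argmax."""
    return math.exp(-abs(q - k))


def pick(score, q, keys):
    """Index of the best-scoring key; ties go to the first occurrence."""
    best = 0
    for i in range(1, len(keys)):
        if score(q, keys[i]) > score(q, keys[best]):
            best = i
    return best


def run_copy_task(trials=100, seq_len=8, seed=0):
    """Return (omni_hits, softmax_hits, disagreements) for exact-match queries."""
    rng = random.Random(seed)  # hypothetical seed, not the one used by the .omc script
    omni_hits = soft_hits = disagreements = 0
    for _ in range(trials):
        keys = [rng.choice(VOCAB) for _ in range(seq_len)]
        target = rng.randrange(seq_len)
        q = keys[target]  # the query copies one input verbatim
        a = pick(omni_weight, q, keys)
        b = pick(exp_score, q, keys)
        omni_hits += a == target
        soft_hits += b == target
        disagreements += a != b
    return omni_hits, soft_hits, disagreements


if __name__ == "__main__":
    print(run_copy_task())
```

With exact-match queries both scores reach 1.0 only at a matching key, so the two pickers always agree and the only misses come from duplicates. On a non-matching query they can split: `pick(omni_weight, 10, [8, 13])` returns index 1, while `pick(exp_score, 10, [8, 13])` returns index 0, because dividing by `max(|k|, 1)` discounts the gap to the larger key.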
| 50 | + |
| 51 | +## Why no torch yet |
| 52 | + |
| 53 | +The current remote environment has no torch / numpy. Pure-OMC |
| 54 | +experiments give us: |
| 55 | + |
| 56 | +1. Deterministic, reproducible runs inside the standalone binary. |
| 57 | +2. No dependency on `python-embed` for the experiment itself. |
| 58 | +3. A baseline that any later torch-based experiment must match |
| 59 | + byte-for-byte on the harmonic side. |
| 60 | + |
| 61 | +Once we have a winning harmonic primitive, the next branch step is to |
| 62 | +port the same scoring rule to PyTorch (via `examples/lib/torch.omc` or |
| 63 | +a stand-alone Python script) and bench against a real learned model |
| 64 | +on a real corpus. |
| 65 | + |
| 66 | +## How to run |
| 67 | + |
| 68 | +```bash |
| 69 | +# Build (one time) |
| 70 | +PYO3_USE_ABI3_FORWARD_COMPATIBILITY=1 cargo build --release |
| 71 | + |
| 72 | +# Run experiment 0 (tree-walk) |
| 73 | +./target/release/omnimcode-standalone experiments/hybrid_llm/experiment_0_copy_task.omc |
| 74 | + |
| 75 | +# Same under the bytecode VM |
| 76 | +OMC_VM=1 ./target/release/omnimcode-standalone experiments/hybrid_llm/experiment_0_copy_task.omc |
| 77 | + |
| 78 | +# Audit: bytecode VM must match tree-walk |
| 79 | +./target/release/omnimcode-standalone --audit experiments/hybrid_llm/experiment_0_copy_task.omc |
| 80 | +``` |
| 81 | + |
| 82 | +## Results so far |
| 83 | + |
| 84 | +| Experiment | Setting | Headline number | |
| 85 | +|---|---|---| |
| 86 | +| 0 | Copy task, exact-match query, 100 trials | OmniWeight 82/100, softmax 82/100, 0 disagreements. Confirms both scorers agree on exact match (the 18 "misses" are duplicate-value trials, both tie-break to first occurrence). | |
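| 87 | +| 1 | Perturbed query (query = true_val + noise), 200 trials per noise level | Softmax wins at every noise level tested (softmax vs OmniWeight, out of 200). noise=1: 189 vs 170. noise=7: 118 vs 99. noise=50: 42 vs 33. OmniWeight divides the distance by `max(\|k\|, 1)`, so its score depends on key magnitude as well as on distance from the query, regardless of perturbation direction (as the formula is written, a larger `\|k\|` shrinks the penalty for the same gap). That bias hurts the "recover the original value" objective. | |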
| 88 | +| 2 | Single-channel PE distinctness + lookup at L = 8 / 14 / 24 / 48 | Sinusoidal wins at short L (8/8 vs 6/8). At L=48 harmonic appears to overtake: 38/48 vs 26/48 (79% vs 54%). Flagged as a likely metric artefact — single-int "closest code" lookup favours monotonic over periodic encodings. | |
| 89 | +| 3 | 4-channel PE (harmonic primes 7/11/13/17, sin/cos periods 8/64), L2 lookup, L = 8 → 200 | **Sinusoidal regains its lead decisively at every L ≥ 16.** L=48: 48/48 vs 21/48. L=200: 72/200 vs 34/200. Harmonic saturates at 22 unique vectors by L=64; sinusoidal stays perfectly distinct up to L=64 then saturates at 64. The single-channel L=48 harmonic "win" was a metric artefact, exactly as suspected. | |
| 90 | +| 4A | Harmonic OOD gate vs L2-NN baseline on 4-dim synthetic vectors (N_REF=300, 150 in-dist test, 150 OOD test). OOD = uniform [1, 90]. | L2 wins. AUROC L2 0.961 vs harmonic 0.910. TPR @ FPR=10%: L2 0.91 vs harmonic 0.71. L2 has a trivial magnitude advantage — mean L2 score 87 (in-dist) vs 1313 (OOD), since OOD vectors are larger on average and harmonic gate's `phi.fold` discards magnitude. | |
| 91 | +| 4B | Same gates, **magnitude-matched** structural OOD (inverted attractor weights: 10%/30%/60% small/med/large vs in-dist's 60%/30%/10%). | **Harmonic edges past L2 in AUROC: 0.956 vs 0.946.** At low FPR L2 still wins (TPR@FPR=1%: L2 0.60 vs harmonic 0.48), but on overall ranking the structural rarity signal beats the L2 metric once magnitude is no longer a giveaway. | |
| 92 | +| 5 | HBit cross-cutting tension (no reference) + combined gate (sum of z-normalised HBit, marginal rarity, L2) on both scenarios. | **Scenario A: HBit tension AUROC = 1.0** (perfect — mean tension 0.0 in-dist vs 20.1 OOD). Combined: 0.999. **Scenario B: HBit AUROC = 0.5** (random — both sides on-manifold, tension = 0 everywhere). Combined: 0.967, beating every single gate. Each gate owns a different OOD axis: HBit→off-manifold, marginal→distribution-shift, L2→magnitude. | |
| 93 | +| 6 | Phi-Pi-Fib compression gate: model as `(library + chain of keys)` instead of dense weights. 12-primitive library keyed by Fibonacci attractors, gate = nearest-key lookup, chains = "parameters". | Composition: trace `[3, 8, 13, 5, 21]` on state 7 → 9. Compression: 29 ints (library+chain) vs ~1001 ints dense table over [0,1000] = ~34× smaller (projected, not measured, to reach ~9 orders of magnitude at LLM scale). **Death tolerance: all 12 library deletions complete without crashing. Biggest deltas: kill key=13 → +12, kill key=5 → +5, kill key=21 → +3. 8 of 12 deletions invisible to output (unused capabilities or path coincidence).** Interchangeability: 6 different chains over the same library yield outputs 9, 22, 9, 5, 5, 52 (4 distinct values). | |
| 94 | +| 7 | Wire `phi_pi_fib::fibonacci_search` in as four OMC builtins (`phi_pi_fib_search`, `phi_pi_fib_nearest`, `phi_pi_fib_stats`, `phi_pi_fib_reset`). Rerun exp 6's gate using the real Fibonacci-step search; measure comparison counts vs library size. | **Sublinear scaling confirmed.** N=8 → 3.8 compares/search, N=1024 → 12.6. Going 128× wider in library size grows the per-lookup work only ~3.3×, vs ~128× for a linear scan. Empirically this is about `1.26·log₂(N)` at both sizes (3.8/3 and 12.6/10), slightly better than `log_φ_π_fibonacci(N) ≈ 1.44·log₂(N)`. Sanity check passes (same final state as exp 6). Death tolerance preserved across all 12 library deletions. 148/148 existing tests still pass. | |
| 95 | + |
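To make experiment 7's comparison counts concrete, here is a textbook Fibonacci search in Python with a probe counter. This is a sketch under stated assumptions, not a port of `phi_pi_fib::fibonacci_search`: the Rust implementation's probing order and the way it counts comparisons are not shown in this README, so the function names and the "one three-way comparison per probe" convention below are illustrative only, and the averages it prints need not match the 3.8 / 12.6 figures in the table.

```python
def fibonacci_search(arr, target):
    """Search a sorted list; return (index or -1, number of probes)."""
    n = len(arr)
    fib2, fib1 = 0, 1          # F(m-2), F(m-1)
    fib = fib2 + fib1          # F(m), smallest Fibonacci number >= n
    while fib < n:
        fib2, fib1 = fib1, fib
        fib = fib2 + fib1

    offset = -1                # everything at or before offset is < target
    probes = 0
    while fib > 1:
        i = min(offset + fib2, n - 1)
        probes += 1            # counted as one three-way comparison
        if arr[i] < target:
            # Discard the left part: step the Fibonacci window down by one.
            fib, fib1, fib2 = fib1, fib2, fib1 - fib2
            offset = i
        elif arr[i] > target:
            # Discard the right part: step the window down by two.
            fib, fib1, fib2 = fib2, fib1 - fib2, fib2 - (fib1 - fib2)
        else:
            return i, probes

    # At most one candidate is left after the loop.
    if fib1 and offset + 1 < n:
        probes += 1
        if arr[offset + 1] == target:
            return offset + 1, probes
    return -1, probes


def mean_probes(n):
    """Average probes over one successful lookup per key in a size-n library."""
    keys = list(range(0, 2 * n, 2))   # hypothetical sorted key set
    return sum(fibonacci_search(keys, k)[1] for k in keys) / n


if __name__ == "__main__":
    for n in (8, 64, 1024):
        print(n, round(mean_probes(n), 2), "vs linear scan average", (n + 1) / 2)
```

The worst case for this scheme is about `log_φ(N) ≈ 1.44·log₂(N)` probes, which is the bound the table row compares against.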
| 96 | +### Cumulative read across experiments 0–5 |
| 97 | + |
| 98 | +The six experiments now form a complete picture. Each OOD axis has |
| 99 | +a gate that owns it: |
| 100 | + |
| 101 | +| Failure mode | Owning gate | Cost | Scenario A AUROC | Scenario B AUROC | |
| 102 | +|---|---|---|---|---| |
| 103 | +| Off-manifold values | **HBit cross-cutting tension** | **Reference-free** | **1.000** | 0.500 | |
| 104 | +| Wrong attractor distribution | Marginal log-rarity (exp 4 harmonic) | needs reference | 0.910 | 0.956 | |
| 105 | +| Wrong magnitude | L2 nearest-neighbour | needs reference | 0.961 | 0.946 | |
| 106 | +| Any of the above | Sum of z-normalised triple | needs reference | 0.999 | 0.967 | |
| 107 | + |
| 108 | +The HBit gate is the cheapest possible: `sum_d |v[d] − phi.fold(v[d])|`. |
| 109 | +Zero fitting, zero reference set, perfect detector when the OOD axis is |
| 110 | +"value isn't a Fibonacci attractor". Useless when both sides are |
| 111 | +on-manifold (scenario B mean tension is 0.0 on both in-dist and OOD — |
| 112 | +the gate can't see any difference). |
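The tension formula is small enough to sketch directly. The Python below assumes that `phi.fold` snaps a value to its nearest Fibonacci number; that reading is inferred from the prose in this README, not taken from the OMC source, and the helper names are hypothetical.

```python
def fib_attractors(limit):
    """Fibonacci numbers 1, 2, 3, 5, ... up to the first one >= limit."""
    fibs = [1, 2]
    while fibs[-1] < limit:
        fibs.append(fibs[-1] + fibs[-2])
    return fibs


def fold(x, attractors):
    """Assumed behaviour of phi.fold: nearest attractor (ties go to the smaller one)."""
    return min(attractors, key=lambda a: abs(x - a))


def hbit_tension(vec, attractors):
    """sum_d |v[d] - fold(v[d])|: zero when every component sits on an attractor."""
    return sum(abs(x - fold(x, attractors)) for x in vec)


if __name__ == "__main__":
    attractors = fib_attractors(300)
    print(hbit_tension([8, 13, 21, 34], attractors))   # on-manifold vector
    print(hbit_tension([9, 13, 21, 40], attractors))   # two components off-manifold
```

An on-attractor vector such as `[8, 13, 21, 34]` scores 0, while `[9, 13, 21, 40]` scores 7 (1 from 9 → 8, 6 from 40 → 34). No reference set is involved at any point, which is why the gate is blind when both populations are on-manifold.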
| 113 | + |
| 114 | +The combined gate is the most robust choice across both scenarios. It is the |
| 115 | +sum of z-normalised per-gate scores, with the z-normalisation parameters |
| 116 | +fit on **in-dist scores only** (the combiner doesn't peek at OOD data). |
| 117 | +Scenario A: 0.999, within 0.001 of HBit's perfect 1.000 and ahead of |
| 118 | +L2 (0.961) and marginal (0.910). Scenario B: 0.967, ahead of marginal (0.956) |
| 119 | +and L2 (0.946) by 1 to 2 AUROC points, and far ahead of HBit (0.500). |
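The combiner itself can be sketched in a few lines of Python. The names `fit_z`, `combine`, and `auroc` are hypothetical, and the fallback to a unit standard deviation when a gate is constant on the reference set is an assumption added here (HBit tension is 0.0 everywhere in-dist, so some such guard is needed); how the OMC experiment handles that case is not stated in this README.

```python
import statistics


def fit_z(in_dist_scores):
    """Fit (mean, std) for one gate using in-distribution scores only."""
    mean = statistics.fmean(in_dist_scores)
    std = statistics.pstdev(in_dist_scores)
    return mean, (std if std > 0 else 1.0)   # assumed guard for constant gates


def combine(gate_scores, params):
    """Sum of z-normalised scores; gate_scores and params are in the same gate order."""
    return sum((s - mean) / std for s, (mean, std) in zip(gate_scores, params))


def auroc(in_scores, ood_scores):
    """P(random OOD score > random in-dist score), ties counted as 0.5."""
    wins = sum((o > i) + 0.5 * (o == i) for o in ood_scores for i in in_scores)
    return wins / (len(in_scores) * len(ood_scores))


if __name__ == "__main__":
    # Toy numbers for illustration only; these are not experiment data.
    hbit_ref, l2_ref = [0.0, 0.0, 0.0], [1.0, 2.0, 3.0]
    params = [fit_z(hbit_ref), fit_z(l2_ref)]
    print(combine([5.0, 2.0], params))
```

Because the parameters come from the reference scores alone, a test point can be scored without ever seeing OOD data; AUROC is then computed on the combined scores exactly as for a single gate.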
| 120 | + |
| 121 | +What this means concretely: |
| 122 | + |
| 123 | +1. **Reference-free OOD detection is real on harmonic-structured |
| 124 | + data.** If your in-distribution lives on (or near) the Fibonacci |
| 125 | + attractor manifold, HBit tension is a free OOD signal you can |
| 126 | + compute on a single test point with no model fitting. Cost is |
| 127 | + D `phi.fold` calls and D float subtractions per test point. |
| 128 | + |
| 129 | +2. **The "harmonic substrate is a structural detector" thesis is |
| 130 | + now empirically grounded for OOD gating**, with quantified |
| 131 | + contribution from each piece. Exp 0-3 ruled out using harmonic |
| 132 | + primitives as drop-in replacements for transformer components. |
| 133 | + Exp 4-5 found their actual home: as auxiliary detectors layered |
| 134 | + onto raw features (or activations) to catch failure modes that |
| 135 | + L2 alone misses. |
| 136 | + |
| 137 | +3. **The combined gate is the deployable artifact.** Three |
| 138 | + complementary axes, z-normalised on the reference, summed. |
| 139 | + Strong on both magnitude-shifted and structural OOD: it beats every |
| 140 | + single gate in scenario B and trails HBit by only 0.001 in scenario A. |
| 141 | + |
| 142 | +### What changed between experiment 2 and experiment 3 |
| 143 | + |
| 144 | +Experiment 2 used **single-integer codes** and a **closest-int** |
| 145 | +lookup metric. Single-integer codes can't capture the geometric |
| 146 | +frequency layering that makes sinusoidal PE work in real |
| 147 | +transformers — once the period wraps, the encoding is dead. |
| 148 | + |
| 149 | +Experiment 3 used **4-channel vectors** and **L2 distance**. That |
| 150 | +gives sinusoidal a long-period channel (P=64) that stays distinct |
| 151 | +well past the short-period channel's wrap. Harmonic gets four |
| 152 | +prime-multiplier channels but they all saturate at the same |
| 153 | +Fibonacci ceiling, so the joint vector hits its uniqueness budget |
| 154 | +fast (22 unique vectors total) and stays there forever. |
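The sinusoidal half of that argument is easy to check numerically. The sketch below assumes the four channels are `sin` and `cos` at periods 8 and 64, which is one reading of "sin/cos periods 8/64"; the exact channel layout in the OMC script may differ. The harmonic channels are omitted because their formula is not given here.

```python
import math


def sinusoidal_code(pos, periods=(8, 64)):
    """(sin, cos) per period; the phase is reduced first so wrapped positions match exactly."""
    code = []
    for p in periods:
        angle = 2 * math.pi * (pos % p) / p
        code += [math.sin(angle), math.cos(angle)]
    return tuple(code)


def unique_codes(length, periods=(8, 64)):
    """Number of distinct code vectors among positions 0 .. length-1."""
    return len({sinusoidal_code(pos, periods) for pos in range(length)})


if __name__ == "__main__":
    for length in (8, 48, 64, 200):
        print(length, unique_codes(length), unique_codes(length, periods=(8,)))
```

With both periods the codes stay distinct up to L = 64 and then cap at 64; with the period-8 channel alone they cap at 8, which is the "once the period wraps, the encoding is dead" effect from experiment 2.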
| 155 | + |
| 156 | +The lesson is one of the project's existing themes spelled out |
| 157 | +again: **measure honestly, and let the measurement reshape the |
| 158 | +plan.** Experiment 2's headline number was reproducible and |
| 159 | +audited, but the framing was wrong. Adding experiment 3 — same |
| 160 | +question, fairer comparison — flipped the answer. The README is |
| 161 | +updated to reflect the cumulative read, not just the latest |
| 162 | +result. |
| 163 | + |
| 164 | +## Roadmap on this branch |
| 165 | + |
| 166 | +- **0** Copy task: OmniWeight vs softmax scoring. ✓ done |
| 167 | +- **1** Perturbed-query divergence study. ✓ done |
| 168 | +- **2** Single-channel positional-encoding distinctness + lookup. ✓ done |
| 169 | +- **3** Multi-channel PE with L2 lookup. ✓ done |
| 170 | +- **4** Harmonic OOD gate vs L2-NN baseline, two scenarios. ✓ done |
| 171 | +- **5** HBit cross-cutting tension + 3-gate combined detector. ✓ done |
| 172 | +- **6** Phi-Pi-Fib compression gate: model = library + chain. ✓ done |
| 173 | +- **7** Wire `omnimcode-core/src/phi_pi_fib.rs::fibonacci_search` in |
| 174 | + as four OMC builtins; rerun exp 6's gate on top; measure compare |
| 175 | + counts. ✓ done |
| 176 | +- **8** Learnable routing policy: a function `state -> chain` that |
| 177 | + picks WHICH chain to run from input state. Start with a simple |
| 178 | + hand-authored policy (if state on small attractor use chain A, |
| 179 | + else chain B); then explore phi-folded state as a hash into a |
| 180 | + policy table. This is the "compression gate as learned component" |
| 181 | + half — exp 6 had only the library + nearest-key fallback. |
| 182 | +- **9** Layer-norm-matched OOD setup (was the old exp 6): |
| 183 | + pre-normalise to unit L2 and re-run scenarios A and B from exp 4. |
| 184 | + Intended to confirm HBit's magnitude-invariance. |
| 185 | +- **10** Bake the combined OOD gate into a reusable library: |
| 186 | + `experiments/hybrid_llm/lib/ood_gate.omc` exposing |
| 187 | + `ood_gate.fit(ref_corpus)` and `ood_gate.score(vec)`. Then once |
| 188 | + torch is available, replicate on real transformer activations. |