RandomCoder-lab
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 57 additions & 0 deletions
diff --git a/examples/lib/prometheus.omc b/examples/lib/prometheus.omc
Lines changed: 18 additions & 59 deletions
diff --git a/examples/tests/test_substrate_modulator_builtins.omc b/examples/tests/test_substrate_modulator_builtins.omc
Lines changed: 118 additions & 0 deletions
@@ -13,6 +13,7 @@ Read top-to-bottom for the arc; jump to any chapter for the detail.
 
 | Tag | Date | One-line |
 |---|---|---|
+| [v0.8.4-substrate-builtins](#v084-substrate-builtins--2026-05-17) | 2026-05-17 | **40× CPU / 96× GPU end-to-end speedup on Prometheus**. Fused `substrate_adamw_update` Rust builtin replaces ~15 OMC-side element-wise loops per parameter — what was 25.81 s/step at d_model=256 is now 0.65 s (CPU) / 0.27 s (GPU). The v0.8.2 GPU integration and v0.8.3 substrate-tile win finally pay out end-to-end. Identical training trajectory. |
 | [v0.8.3-substrate-gpu](#v083-substrate-gpu--2026-05-17) | 2026-05-17 | **Substrate-shaped GPU matmul wins +38% vs conventional 16×16**. Anisotropic 8×32 tile (Fib short dim, wavefront-divisor long dim) hits 114 GFLOPS at 1024² vs 71 for the standard tile. Pure-square Fib tiles (13×13, 21×21) still lose; the win comes from substrate suggesting "8 first" + hardware demanding wavefront alignment. New default tile baked into the CLI integration. |
 | [v0.8.2-gpu-prometheus](#v082-gpu-prometheus--2026-05-17) | 2026-05-17 | **GPU wired into Prometheus** via a MatmulAccelerator hook. **13× speedup on synthetic chained matmul** (512², CPU 3.47s → GPU 0.27s). End-to-end Prometheus training at d_model=256: wall-clock unchanged — OMC tree-walk overhead in substrate-shaping helpers (smod, resample, Q6) is the next bottleneck, not matmul. Integration is load-bearing for the substrate-native GPU kernels coming next. |
 | [v0.8.1-tape-primitives](#v081-tape-primitives--2026-05-17) | 2026-05-17 | **Substrate-native tape primitive precedent**: `tape_phi_log` fuses Q6's log-distance into one tape node, with `tape_abs` as the boring companion. Composed vs fused trains to within ~1e-7 — fused abstraction is free. **Pre-existing tape_div/tape_mul broadcast-backward bug fixed**, which unblocks OMC-side cross-validation of S-MOD + substrate-K. First Q6 OMC replication: −0.63% 2/3 seeds at small scale, directionally matching PyTorch's −12.15%. |
@@ -34,6 +35,62 @@ Read top-to-bottom for the arc; jump to any chapter for the detail.
 
 ---
 
+## [v0.8.4-substrate-builtins] - 2026-05-17
+
+**40× CPU / 96× GPU end-to-end speedup on Prometheus training. The v0.8.2 wall-clock bottleneck (OMC tree-walk overhead in the training loop) is removed by the fused AdamW builtin, one of three Rust builtins added here. The v0.8.2 GPU integration and v0.8.3 substrate-shaped 8×32 tile finally pay out end-to-end. The three chapters compound.**
+
+### What got built
+
+Three Rust builtins:
+
+- **`substrate_smod_matrix(scores, alpha)`** — Rust port of `_prom_smod_matrix`. Per-cell `1 / (1 + α · attractor_distance(int(s)))`. Wrapped by the OMC helper for backward compatibility.
+- **`substrate_resample_matrix(v, scale)`** — Rust port of `_prom_substrate_resample_matrix`. Per-cell `1 / (1 + attractor_distance(int(v · scale)) / scale)`.
+- **`substrate_adamw_update(cur, grad, m, v, lr, b1, b2, eps, wd, step)`** — Fused AdamW per-parameter update. **The actual bottleneck killer.** Replaces ~15 OMC-side element-wise loops per parameter with one tight Rust loop. Mutates `m` and `v` in place via Rc-shared OMC arrays.
+
+### The honest story: first round was the wrong hypothesis
+
+Initial guess from v0.8.2 was that the modulator-matrix construction (`_prom_smod_matrix`, `_prom_substrate_resample_matrix`) was the bottleneck. Both got ported to Rust first — and end-to-end wall-clock **did not move**:
+
+| | CPU s/step | GPU s/step | Note |
+|---|--:|--:|---|
+| v0.8.2 baseline | 25.81 | 25.88 | |
+| v0.8.4 (modulators only) | 26.38 | 26.28 | no change |
+
+Profiling-by-fixing found the real bottleneck: `prom_adamw_step`. It walks every parameter (6 of them at d_model=256, sizes up to 256×256) doing **15 element-wise loops per parameter** in OMC: `_prom_zip(_prom_scale(...), _prom_scale(...), "add")` chained through several stages. That is ~6M OMC ops per training step (6 × 15 × 256² ≈ 5.9M if every parameter were 256×256). Replacing the inner block with one Rust builtin:
+
+| | CPU s/step | GPU s/step | vs v0.8.2 |
+|---|--:|--:|--:|
+| **v0.8.4 (+ fused AdamW)** | **0.65** | **0.27** | **40× / 96×** |
+
+Loss agreement with v0.8.2: 6.95930 vs 6.95932, a difference of 2e-5 that is consistent with f32 GPU round-trip noise. The training trajectory is otherwise the same.
+
+### Why this matters for the chapters that came before
+
+- **v0.8.2** wired GPU into the tape autograd. End-to-end null result because OMC overhead dominated.
+- **v0.8.3** found the substrate-shaped 8×32 tile (114 GFLOPS vs 71 at 1024²). Kernel-level win, no end-to-end change for the same reason.
+- **v0.8.4** removes the OMC overhead. **Both prior chapters finally pay out**:
+  - The GPU/CPU split is now 2.4× (the actual matmul speedup at d_model=256)
+  - The 8×32 tile is doing real work in production training
+
+The three chapters are now compositional. Future scale-ups (d_model=512, batched inference, multi-block, longer sequences) get *both* the OMC-overhead-gone benefit AND the substrate-GPU acceleration.
+
+### What this unlocks immediately
+
+- **L1-MH + S-MOD α=1.0 in pure-OMC Prometheus** (task #264) — was unblocked by v0.8.1's broadcast-backward fix; *now practical to run* (seconds per step rather than minutes).
+- **Larger-scale substrate-attention** (task #265) — d_model=512+, longer sequences, multi-block stacking.
+- **Q6 cross-validation at real training length** — v0.8.1's OMC-side Q6 result was at 80 steps (the longest run we could afford). Can now run 5000+ step training and properly cross-validate the PyTorch −12.15% finding.
+
+### Files
+
+- `omnimcode-core/src/interpreter.rs` — three new builtins + helpers (`flatten_2d_or_1d`, `write_back_1d_or_2d`, `rebuild_omc_array`, `build_substrate_modulator_matrix`, `ModulatorKind`, `substrate_adamw_update`)
+- `examples/lib/prometheus.omc` — `_prom_smod_matrix` / `_prom_substrate_resample_matrix` become thin wrappers; `prom_adamw_step` inner block calls the fused builtin
+- `examples/tests/test_substrate_modulator_builtins.omc` — 8 unit tests
+- `experiments/prometheus_parity/SUBSTRATE_BUILTINS_WIN.md` — full writeup
+
+Test suite: **1111/1111 OMC tests pass**.
+
+---
+
 ## [v0.8.3-substrate-gpu] - 2026-05-17
 
 **Substrate-shaped GPU matmul kernels: anisotropic 8×32 (Fib short dim, wavefront-divisor long dim) beats the conventional 16×16 by up to 38% on the user's AMD RX 580 / Vulkan. The substrate's job here isn't to fight hardware physics — it's to direct exploration toward configurations conventional GPU programming would never test. Doing so produced 1.61× the GFLOPS at 1024².**
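As a quick sanity check, the v0.8.4 headline ratios can be re-derived from the per-step timings quoted in the v0.8.4 changelog entry above. The sketch below is only illustrative arithmetic: the helper names `speedup` and `adamw_ops_upper_bound` are hypothetical, and the op count assumes every one of the six parameters is 256×256, which the entry gives only as the maximum size.

```rust
/// Ratio of old to new wall-clock time per step.
fn speedup(before_s: f64, after_s: f64) -> f64 {
    before_s / after_s
}

/// Upper bound on element-wise OMC ops per step in the old AdamW path,
/// assuming (hypothetically) that every parameter is `rows x cols`.
fn adamw_ops_upper_bound(params: u64, loops_per_param: u64, rows: u64, cols: u64) -> u64 {
    params * loops_per_param * rows * cols
}

/// Prints the ratios derived from the timings quoted in the changelog.
fn print_headline_numbers() {
    println!("CPU speedup:   {:.1}x", speedup(25.81, 0.65));
    println!("GPU speedup:   {:.1}x", speedup(25.88, 0.27));
    println!("CPU/GPU split: {:.1}x", speedup(0.65, 0.27));
    println!("old AdamW ops/step <= {}", adamw_ops_upper_bound(6, 15, 256, 256));
}
```

Calling `print_headline_numbers()` from a `main` reproduces the rounded figures in the entry: roughly 40× on CPU, 96× on GPU, a 2.4× CPU/GPU split, and just under 6M ops per step for the old AdamW path.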
 
@@ -700,24 +700,14 @@ fn prom_attention_forward(layer, x_id) {
 # Compute the per-cell S-MOD modulation matrix from a [N, T] scores
 # value snapshot. Each cell: 1 / (1 + alpha * attractor_distance(cell)).
 # Used as a non-differentiable const inside the tape's S-MOD path.
+#
+# v0.8.4 — defers to the Rust builtin `substrate_smod_matrix`. The
+# OMC-side inner loop over the scores matrix was the first suspect for
+# the v0.8.2 wall-clock bottleneck, but porting it alone did not move
+# s/step (the fused AdamW update did). Wrapping the native call here
+# keeps the public signature stable for every caller of this helper.
 fn _prom_smod_matrix(scores_val, alpha) {
-    h rows = arr_len(scores_val);
-    h out = [];
-    h i = 0;
-    while i < rows {
-        h row = arr_get(scores_val, i);
-        h new_row = [];
-        h j = 0;
-        while j < arr_len(row) {
-            h s = arr_get(row, j);
-            h d = attractor_distance(s);
-            arr_push(new_row, 1.0 / (1.0 + alpha * d));
-            j = j + 1;
-        }
-        arr_push(out, new_row);
-        i = i + 1;
-    }
-    return out;
+    return substrate_smod_matrix(scores_val, alpha);
 }
 
 # Per-cell substrate-resample modulation matrix from a [N, D] value
@@ -726,25 +716,10 @@ fn _prom_smod_matrix(scores_val, alpha) {
 # when already on-attractor. Used as a non-differentiable const inside
 # the tape graph, same pattern as _prom_smod_matrix. Won -2.52% val on
 # top of L1-MH + S-MOD α=1.0 when applied to V (3/3 seeds).
+#
+# v0.8.4 — defers to the Rust builtin `substrate_resample_matrix`.
 fn _prom_substrate_resample_matrix(v_val, scale) {
-    h rows = arr_len(v_val);
-    h out = [];
-    h i = 0;
-    while i < rows {
-        h row = arr_get(v_val, i);
-        h new_row = [];
-        h j = 0;
-        while j < arr_len(row) {
-            h x = arr_get(row, j);
-            h scaled = x * scale;
-            h d = attractor_distance(scaled);
-            arr_push(new_row, 1.0 / (1.0 + d / scale));
-            j = j + 1;
-        }
-        arr_push(out, new_row);
-        i = i + 1;
-    }
-    return out;
+    return substrate_resample_matrix(v_val, scale);
 }
 
 # Apply post-projection substrate resampling to a tape node. Returns
@@ -1108,38 +1083,22 @@ fn prom_adamw_step(state) {
     h bias1 = 1.0 - pow(b1, step * 1.0);
     h bias2 = 1.0 - pow(b2, step * 1.0);
 
+    # v0.8.4 fused inner update: substrate_adamw_update is a Rust builtin
+    # that replaces ~15 OMC-side elementwise loops per parameter with one
+    # tight Rust loop. The m / v OMC arrays are Rc-shared, so the builtin
+    # mutates them in place; the returned value is the new parameter.
+    # bias1/bias2 are recomputed inside the builtin from (b1, b2, step),
+    # so the locals above go unused. See SUBSTRATE_BUILTINS_WIN.md.
     h i = 0;
     while i < arr_len(params) {
         h p = arr_get(params, i);
         h g = tape_grad(p);
-
-        # m_t = b1*m + (1-b1)*g
         h m_old = arr_get(m, i);
-        h m_new = _prom_zip(_prom_scale(m_old, b1, "mul"),
-                            _prom_scale(g, 1.0 - b1, "mul"), "add");
-        arr_set(m, i, m_new);
-
-        # v_t = b2*v + (1-b2)*g²
         h v_old = arr_get(v, i);
-        h gsq = _prom_zip(g, g, "mul");
-        h v_new = _prom_zip(_prom_scale(v_old, b2, "mul"),
-                            _prom_scale(gsq, 1.0 - b2, "mul"), "add");
-        arr_set(v, i, v_new);
-
-        # m_hat = m_t / bias1; v_hat = v_t / bias2
-        h m_hat = _prom_scale(m_new, 1.0 / bias1, "mul");
-        h v_hat = _prom_scale(v_new, 1.0 / bias2, "mul");
-        h denom = _prom_sqrt_eps(v_hat, eps);
-        h adam_step = _prom_zip(m_hat, denom, "div");
-
-        # θ ← θ − lr*adam_step − lr*wd*θ
         h cur = tape_value(p);
-        h wd_term = _prom_scale(cur, lr * wd, "mul");
-        h main_term = _prom_scale(adam_step, lr, "mul");
-        h decayed = _prom_zip(cur, wd_term, "sub");
-        h new_val = _prom_zip(decayed, main_term, "sub");
+        h new_val = substrate_adamw_update(cur, g, m_old, v_old,
+                                            lr, b1, b2, eps, wd, step);
         tape_set_value(p, new_val);
-
         i = i + 1;
     }
 }
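The removed OMC lines in the hunk above spell out the update that the fused builtin has to reproduce. What follows is a minimal Rust sketch of that math as a single loop, not the code in `interpreter.rs`: the names `AdamWConfig` and `adamw_update_sketch` are hypothetical, the parameter is assumed to be already flattened to 1D, and the denominator `sqrt(v_hat) + eps` is an assumption, because the body of `_prom_sqrt_eps` is not shown in this diff.

```rust
/// AdamW hyperparameters, named after the arguments at the OMC call site.
pub struct AdamWConfig {
    pub lr: f64,
    pub b1: f64,
    pub b2: f64,
    pub eps: f64,
    pub wd: f64,
}

/// One fused AdamW update over a flattened parameter.
/// `m` and `v` are updated in place; the new parameter values are returned.
pub fn adamw_update_sketch(
    cur: &[f64],
    grad: &[f64],
    m: &mut [f64],
    v: &mut [f64],
    cfg: &AdamWConfig,
    step: i32,
) -> Vec<f64> {
    assert!(cur.len() == grad.len() && cur.len() == m.len() && cur.len() == v.len());
    // Bias corrections are computed once per call, not once per element.
    let bias1 = 1.0 - cfg.b1.powi(step);
    let bias2 = 1.0 - cfg.b2.powi(step);
    let mut out = Vec::with_capacity(cur.len());
    for i in 0..cur.len() {
        let g = grad[i];
        // m_t = b1*m + (1-b1)*g ; v_t = b2*v + (1-b2)*g^2
        m[i] = cfg.b1 * m[i] + (1.0 - cfg.b1) * g;
        v[i] = cfg.b2 * v[i] + (1.0 - cfg.b2) * g * g;
        let m_hat = m[i] / bias1;
        let v_hat = v[i] / bias2;
        // ASSUMPTION: eps is added after the square root.
        let adam_step = m_hat / (v_hat.sqrt() + cfg.eps);
        // theta <- theta - lr*wd*theta - lr*adam_step
        out.push(cur[i] - cfg.lr * cfg.wd * cur[i] - cfg.lr * adam_step);
    }
    out
}
```

One pass over the elements stands in for the 15 separate `_prom_zip` / `_prom_scale` / `_prom_sqrt_eps` passes of the old path, each of which allocated an intermediate array.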
 
@@ -0,0 +1,118 @@
+# Tests for the v0.8.4 substrate-modulator Rust builtins.
+#
+# substrate_smod_matrix(scores, alpha) and substrate_resample_matrix(v, scale)
+# are Rust-native ports of the inner-loop helpers that lived in
+# prometheus.omc as `_prom_smod_matrix` / `_prom_substrate_resample_matrix`.
+#
+# The math must be identical — both helpers in prometheus.omc now just
+# wrap the corresponding builtin, so any divergence here would be a
+# semantics regression.
+
+fn assert_true(cond, msg) { if !cond { test_record_failure(msg); } }
+
+fn approx_eq(a, b, tol) {
+    h d = a - b;
+    if d < 0.0 { d = 0.0 - d; }
+    return d <= tol;
+}
+
+# -----------------------------------------------------------------
+# substrate_smod_matrix
+# -----------------------------------------------------------------
+
+fn test_smod_alpha_zero_is_identity() {
+    # alpha=0 → 1/(1+0·d) = 1 for every cell, regardless of value.
+    h m = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]];
+    h out = substrate_smod_matrix(m, 0.0);
+    h r0 = arr_get(out, 0);
+    h r1 = arr_get(out, 1);
+    assert_true(approx_eq(arr_get(r0, 0), 1.0, 1e-9), "[0][0]=1");
+    assert_true(approx_eq(arr_get(r0, 2), 1.0, 1e-9), "[0][2]=1");
+    assert_true(approx_eq(arr_get(r1, 1), 1.0, 1e-9), "[1][1]=1");
+}
+
+fn test_smod_on_attractor_cell_is_one() {
+    # 8 IS a Fibonacci attractor, so attractor_distance(8)=0, modulator=1
+    # regardless of alpha.
+    h m = [[8.0, 13.0], [21.0, 34.0]];
+    h out = substrate_smod_matrix(m, 2.0);
+    h r0 = arr_get(out, 0);
+    h r1 = arr_get(out, 1);
+    assert_true(approx_eq(arr_get(r0, 0), 1.0, 1e-9), "8 attractor");
+    assert_true(approx_eq(arr_get(r0, 1), 1.0, 1e-9), "13 attractor");
+    assert_true(approx_eq(arr_get(r1, 0), 1.0, 1e-9), "21 attractor");
+}
+
+fn test_smod_off_attractor_dampens() {
+    # 7 is 1 away from 8 (attractor). With alpha=1.0 → 1/(1+1·1) = 0.5
+    h m = [[7.0]];
+    h out = substrate_smod_matrix(m, 1.0);
+    # A 1×1 two-dimensional input keeps its 2D shape in the output, so
+    # arr_get(out, 0) returns the row [0.5].
+    h row = arr_get(out, 0);
+    assert_true(approx_eq(arr_get(row, 0), 0.5, 1e-9), "7→0.5 at α=1");
+}
+
+fn test_smod_one_d_array_returns_one_d() {
+    # 1D input → 1D output, same shape semantics as `tape_value` of a
+    # 1-row matrix.
+    h v = [0.0, 1.0, 2.0, 3.0];   # all attractors; modulator=1 everywhere
+    h out = substrate_smod_matrix(v, 1.0);
+    assert_true(arr_len(out) == 4, "1D out length matches input");
+    assert_true(approx_eq(arr_get(out, 0), 1.0, 1e-9), "0 attractor");
+    assert_true(approx_eq(arr_get(out, 3), 1.0, 1e-9), "3 attractor");
+}
+
+# -----------------------------------------------------------------
+# substrate_resample_matrix
+# -----------------------------------------------------------------
+
+fn test_resample_on_attractor_stays_one() {
+    # cell * scale = 8*1 = 8 IS attractor → d=0 → modulator=1
+    h m = [[8.0, 13.0]];
+    h out = substrate_resample_matrix(m, 1.0);
+    h row = arr_get(out, 0);
+    assert_true(approx_eq(arr_get(row, 0), 1.0, 1e-9), "8 on attractor");
+    assert_true(approx_eq(arr_get(row, 1), 1.0, 1e-9), "13 on attractor");
+}
+
+fn test_resample_off_attractor_dampens() {
+    # cell 0.7 with scale 10 → 7. d(7)=1. modulator = 1/(1+1/10) = 0.909...
+    h m = [[0.7]];
+    h out = substrate_resample_matrix(m, 10.0);
+    h row = arr_get(out, 0);
+    h expected = 1.0 / (1.0 + 1.0 / 10.0);   # 0.90909...
+    assert_true(approx_eq(arr_get(row, 0), expected, 1e-6),
+                "0.7@10 off-attractor dampens");
+}
+
+# scale=0 is rejected inside the Rust builtin, which raises an error.
+# OMC has no try/catch, so that path cannot be asserted from this file;
+# it would need a separate exit-code test or a Rust unit test. Skipped
+# here because OMC-side callers do not exercise that path in practice.
+
+# -----------------------------------------------------------------
+# Regression: existing OMC helpers (now wrappers) still work
+# -----------------------------------------------------------------
+
+import "examples/lib/prometheus.omc";
+
+fn test_wrapper_smod_matches_builtin() {
+    h scores = [[3.0, 7.0], [8.0, 12.0]];
+    h via_wrapper = _prom_smod_matrix(scores, 1.0);
+    h via_builtin = substrate_smod_matrix(scores, 1.0);
+    h r0_w = arr_get(via_wrapper, 0);
+    h r0_b = arr_get(via_builtin, 0);
+    assert_true(approx_eq(arr_get(r0_w, 0), arr_get(r0_b, 0), 1e-12), "[0][0]");
+    assert_true(approx_eq(arr_get(r0_w, 1), arr_get(r0_b, 1), 1e-12), "[0][1]");
+}
+
+fn test_wrapper_resample_matches_builtin() {
+    h v = [[0.2, 0.5], [0.8, 1.3]];
+    h via_wrapper = _prom_substrate_resample_matrix(v, 10.0);
+    h via_builtin = substrate_resample_matrix(v, 10.0);
+    h r0_w = arr_get(via_wrapper, 0);
+    h r0_b = arr_get(via_builtin, 0);
+    assert_true(approx_eq(arr_get(r0_w, 0), arr_get(r0_b, 0), 1e-12), "[0][0]");
+    assert_true(approx_eq(arr_get(r0_w, 1), arr_get(r0_b, 1), 1e-12), "[0][1]");
+}
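The expectations in the tests above (7 maps to 0.5 at α=1, and 0.7 at scale 10 maps to about 0.909) follow directly from the per-cell formulas in the changelog entry. Here is a small Rust sketch of those formulas under stated assumptions: `attractor_distance` is taken to be the distance to the nearest Fibonacci number, `int(...)` is taken to truncate toward zero, and negative inputs are handled through their absolute value. All function names are hypothetical, and none of this is the actual code in `interpreter.rs`.

```rust
/// Distance from `n` to the nearest Fibonacci number (0, 1, 2, 3, 5, 8, ...).
/// ASSUMPTION: this matches `attractor_distance`; negatives use |n|.
fn attractor_distance_sketch(n: i64) -> f64 {
    let n = n.unsigned_abs();
    let (mut lo, mut hi) = (0u64, 1u64);
    // Walk consecutive Fibonacci pairs until lo <= n <= hi.
    while hi < n {
        let next = lo + hi;
        lo = hi;
        hi = next;
    }
    (n - lo).min(hi - n) as f64
}

/// S-MOD cell: 1 / (1 + alpha * attractor_distance(int(s))).
fn smod_cell(s: f64, alpha: f64) -> f64 {
    // ASSUMPTION: int() truncates toward zero, like an `as i64` cast.
    1.0 / (1.0 + alpha * attractor_distance_sketch(s as i64))
}

/// Resample cell: 1 / (1 + attractor_distance(int(v * scale)) / scale).
fn resample_cell(v: f64, scale: f64) -> f64 {
    1.0 / (1.0 + attractor_distance_sketch((v * scale) as i64) / scale)
}

/// Applies a per-cell function to every element of a row-major matrix.
fn map_matrix(m: &[Vec<f64>], f: impl Fn(f64) -> f64) -> Vec<Vec<f64>> {
    m.iter()
        .map(|row| row.iter().map(|&x| f(x)).collect())
        .collect()
}
```

For example, `map_matrix(&scores, |s| smod_cell(s, 1.0))` plays the role of `substrate_smod_matrix(scores, 1.0)` for 2D input. Unlike the builtin exercised by the tests, the sketch does not handle 1D input.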