Commit 8d8c214

🥂 v0.8.4 substrate-builtins: fused AdamW Rust builtin → 40× CPU / 96× GPU end-to-end
Three Rust builtins replace the OMC-side inner-loop helpers that were the v0.8.2 wall-clock bottleneck:

  substrate_smod_matrix(scores, alpha)
  substrate_resample_matrix(v, scale)
  substrate_adamw_update(cur, grad, m, v, lr, b1, b2, eps, wd, step)

The first two (modulator matrix construction) did NOT move end-to-end wall-clock when shipped alone. Profiling-by-fixing found the real bottleneck: prom_adamw_step. It ran ~15 OMC-side element-wise loops per parameter per step: _prom_zip(_prom_scale(...), _prom_scale(...), "add") chained through several stages. At d_model=256 with 6 params, ~6M OMC ops per training step.

Replacing the AdamW inner block with one Rust builtin:

  v0.8.2 baseline     CPU 25.81 s/step   GPU 25.88 s/step
  v0.8.4 modulators   CPU 26.38 s/step   GPU 26.28 s/step   ← no change
  v0.8.4 + AdamW      CPU  0.65 s/step   GPU  0.27 s/step   ← 40× / 96×

The three chapters now compound:

  v0.8.2 wired GPU in (no end-to-end win, OMC overhead dominated)
  v0.8.3 found substrate-shaped 8×32 tile (114 GFLOPS, no end-to-end change)
  v0.8.4 removes OMC overhead, both prior wins finally pay out

GPU/CPU split at v0.8.4 is 2.4×, what we'd expect from the matmul speedup at d_model=256. Future scale-ups (d_model=512+, multi-block, longer sequences) get BOTH benefits compositionally.

Loss agrees with v0.8.2 to 5e-5 (f32 GPU roundtrip noise). Identical training trajectory.

What this unlocks immediately:
  - L1-MH + S-MOD α=1.0 OMC cross-validation (task #264)
  - Larger-scale substrate-attention (task #265)
  - Q6 OMC cross-validation at real training length (v0.8.1 was 80 steps)

Files:
  omnimcode-core/src/interpreter.rs                        three builtins + flatten helpers
  examples/lib/prometheus.omc                              wrappers + adamw uses builtin
  examples/tests/test_substrate_modulator_builtins.omc     8 tests
  experiments/prometheus_parity/SUBSTRATE_BUILTINS_WIN.md

1111/1111 OMC tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent d1fa0a2 commit 8d8c214

5 files changed

Lines changed: 591 additions & 59 deletions

CHANGELOG.md

Lines changed: 57 additions & 0 deletions
@@ -13,6 +13,7 @@ Read top-to-bottom for the arc; jump to any chapter for the detail.
1313

1414
| Tag | Date | One-line |
1515
|---|---|---|
16+
| [v0.8.4-substrate-builtins](#v084-substrate-builtins--2026-05-17) | 2026-05-17 | **40× CPU / 96× GPU end-to-end speedup on Prometheus**. Fused `substrate_adamw_update` Rust builtin replaces ~15 OMC-side element-wise loops per parameter — what was 25.81 s/step at d_model=256 is now 0.65 s (CPU) / 0.27 s (GPU). The v0.8.2 GPU integration and v0.8.3 substrate-tile win finally pay out end-to-end. Identical training trajectory. |
1617
| [v0.8.3-substrate-gpu](#v083-substrate-gpu--2026-05-17) | 2026-05-17 | **Substrate-shaped GPU matmul wins +38% vs conventional 16×16**. Anisotropic 8×32 tile (Fib short dim, wavefront-divisor long dim) hits 114 GFLOPS at 1024² vs 71 for the standard tile. Pure-square Fib tiles (13×13, 21×21) still lose; the win comes from substrate suggesting "8 first" + hardware demanding wavefront alignment. New default tile baked into the CLI integration. |
1718
| [v0.8.2-gpu-prometheus](#v082-gpu-prometheus--2026-05-17) | 2026-05-17 | **GPU wired into Prometheus** via a MatmulAccelerator hook. **13× speedup on synthetic chained matmul** (512², CPU 3.47s → GPU 0.27s). End-to-end Prometheus training at d_model=256: wall-clock unchanged — OMC tree-walk overhead in substrate-shaping helpers (smod, resample, Q6) is the next bottleneck, not matmul. Integration is load-bearing for the substrate-native GPU kernels coming next. |
1819
| [v0.8.1-tape-primitives](#v081-tape-primitives--2026-05-17) | 2026-05-17 | **Substrate-native tape primitive precedent**: `tape_phi_log` fuses Q6's log-distance into one tape node, with `tape_abs` as the boring companion. Composed vs fused trains to within ~1e-7 — fused abstraction is free. **Pre-existing tape_div/tape_mul broadcast-backward bug fixed**, which unblocks OMC-side cross-validation of S-MOD + substrate-K. First Q6 OMC replication: −0.63% 2/3 seeds at small scale, directionally matching PyTorch's −12.15%. |
@@ -34,6 +35,62 @@ Read top-to-bottom for the arc; jump to any chapter for the detail.
3435

3536
---
3637

38+
## [v0.8.4-substrate-builtins] - 2026-05-17
39+
40+
**40× CPU / 96× GPU end-to-end speedup on Prometheus training. The v0.8.2 wall-clock bottleneck (OMC tree-walk overhead in the training loop) is dissolved by three Rust builtins. The v0.8.2 GPU integration and v0.8.3 substrate-shaped 8×32 tile finally pay out end-to-end. The three chapters compound.**
41+
42+
### What got built
43+
44+
Three Rust builtins:
45+
46+
- **`substrate_smod_matrix(scores, alpha)`** — Rust port of `_prom_smod_matrix`. Per-cell `1 / (1 + α · attractor_distance(int(s)))`. Wrapped by the OMC helper for backward compatibility.
47+
- **`substrate_resample_matrix(v, scale)`** — Rust port of `_prom_substrate_resample_matrix`. Per-cell `1 / (1 + attractor_distance(int(v · scale)) / scale)`.
48+
- **`substrate_adamw_update(cur, grad, m, v, lr, b1, b2, eps, wd, step)`** — Fused AdamW per-parameter update. **The actual bottleneck killer.** Replaces ~15 OMC-side element-wise loops per parameter with one tight Rust loop. Mutates `m` and `v` in place via Rc-shared OMC arrays.
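The per-cell S-MOD formula in the first bullet can be sketched in a few lines of Rust. This is an illustrative sketch, not the code in `interpreter.rs`. It assumes that `attractor_distance` is the distance to the nearest Fibonacci number (inferred from the tests in this commit, where 8 maps to 0 and 7 maps to 1), that `int(s)` truncates toward zero, and that negative inputs are handled by absolute value. The names `smod_cell` and `smod_matrix` are hypothetical.

```rust
// Hypothetical sketch; names and negative-input handling are assumptions.

/// Distance from |n| to the nearest Fibonacci number (assumed meaning of
/// `attractor_distance`, inferred from the tests: 8 -> 0, 7 -> 1).
fn attractor_distance(n: i64) -> i64 {
    let n = n.saturating_abs();
    let (mut lo, mut hi) = (0_i64, 1_i64);
    // Walk consecutive Fibonacci pairs until lo <= n <= hi.
    while hi < n {
        // saturating_add avoids an overflow panic for very large inputs.
        let next = lo.saturating_add(hi);
        lo = hi;
        hi = next;
    }
    (n - lo).min(hi - n)
}

/// One cell of the S-MOD modulator: 1 / (1 + alpha * attractor_distance(int(s))).
fn smod_cell(score: f64, alpha: f64) -> f64 {
    // `as i64` truncates toward zero; we assume OMC's int() does the same.
    let d = attractor_distance(score as i64) as f64;
    1.0 / (1.0 + alpha * d)
}

/// Apply `smod_cell` to every cell of a row-major matrix.
fn smod_matrix(scores: &[Vec<f64>], alpha: f64) -> Vec<Vec<f64>> {
    scores
        .iter()
        .map(|row| row.iter().map(|&s| smod_cell(s, alpha)).collect())
        .collect()
}
```

Under these assumptions, a cell holding 7.0 at α=1 is damped to 0.5, and any cell sitting on a Fibonacci value stays at 1.0 regardless of α, which matches the expectations in the OMC test file.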
49+
50+
### The honest story: first round was the wrong hypothesis
51+
52+
Initial guess from v0.8.2 was that the modulator-matrix construction (`_prom_smod_matrix`, `_prom_substrate_resample_matrix`) was the bottleneck. Both got ported to Rust first — and end-to-end wall-clock **did not move**:
53+
54+
| | CPU s/step | GPU s/step |
55+
|---|--:|--:|
56+
| v0.8.2 baseline | 25.81 | 25.88 |
57+
| v0.8.4 (modulators only) | 26.38 | 26.28 ← no change |
58+
59+
Profiling-by-fixing found the real bottleneck: `prom_adamw_step`. It walks every parameter (6 of them at d_model=256, sizes up to 256×256) doing **15 element-wise loops per parameter** in OMC: `_prom_zip(_prom_scale(...), _prom_scale(...), "add")` chained through several stages. ~6M OMC ops per training step. Replacing the inner block with one Rust builtin:
60+
61+
| | CPU s/step | GPU s/step | vs v0.8.2 |
62+
|---|--:|--:|--:|
63+
| **v0.8.4 (+ fused AdamW)** | **0.65** | **0.27** | **40× / 96×** |
64+
65+
Loss agreement with v0.8.2: 6.95930 vs 6.95932 (f32 GPU roundtrip noise). Same training trajectory.
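The headline numbers in this section can be re-derived from the figures quoted above. The sketch below is a back-of-envelope check, not a measurement. It assumes all six parameters are full 256×256 matrices (the text only says sizes are "up to" that), so the op count is an upper bound. The function names are hypothetical.

```rust
// Hypothetical helpers for checking the changelog's arithmetic.

/// Upper bound on OMC element-wise operations per training step, assuming
/// every parameter is a full rows x cols matrix.
fn elementwise_ops_per_step(num_params: u64, rows: u64, cols: u64, loops_per_param: u64) -> u64 {
    num_params * rows * cols * loops_per_param
}

/// Wall-clock speedup as the ratio of seconds per step before and after.
fn speedup(before_s: f64, after_s: f64) -> f64 {
    before_s / after_s
}
```

With 6 parameters, 256×256 cells and 15 loops this gives 5,898,240 operations, consistent with the "~6M" figure. The ratios 25.81 / 0.65 and 25.88 / 0.27 round to 40× and 96×, and 0.65 / 0.27 gives the 2.4× GPU/CPU split.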
66+
67+
### Why this matters for the chapters that came before
68+
69+
- **v0.8.2** wired GPU into the tape autograd. End-to-end null result because OMC overhead dominated.
70+
- **v0.8.3** found the substrate-shaped 8×32 tile (114 GFLOPS vs 71 at 1024²). Kernel-level win, no end-to-end change for the same reason.
71+
- **v0.8.4** removes the OMC overhead. **Both prior chapters finally pay out**:
72+
- The GPU/CPU split is now 2.4× (the actual matmul speedup at d_model=256)
73+
- The 8×32 tile is doing real work in production training
74+
75+
The three chapters are now compositional. Future scale-ups (d_model=512, batched inference, multi-block, longer sequences) get *both* the OMC-overhead-gone benefit AND the substrate-GPU acceleration.
76+
77+
### What this unlocks immediately
78+
79+
- **L1-MH + S-MOD α=1.0 in pure-OMC Prometheus** (task #264) — was unblocked by v0.8.1's broadcast-backward fix; *now practical to run* (seconds per step rather than minutes).
80+
- **Larger-scale substrate-attention** (task #265) — d_model=512+, longer sequences, multi-block stacking.
81+
- **Q6 cross-validation at real training length** — v0.8.1's OMC-side Q6 result was at 80 steps (the longest run we could afford). Can now run 5000+ step training and properly cross-validate the PyTorch −12.15% finding.
82+
83+
### Files
84+
85+
- `omnimcode-core/src/interpreter.rs` — three new builtins + helpers (`flatten_2d_or_1d`, `write_back_1d_or_2d`, `rebuild_omc_array`, `build_substrate_modulator_matrix`, `ModulatorKind`, `substrate_adamw_update`)
86+
- `examples/lib/prometheus.omc` — `_prom_smod_matrix` / `_prom_substrate_resample_matrix` become thin wrappers; `prom_adamw_step` inner block calls the fused builtin
87+
- `examples/tests/test_substrate_modulator_builtins.omc` — 8 unit tests
88+
- `experiments/prometheus_parity/SUBSTRATE_BUILTINS_WIN.md` — full writeup
89+
90+
Test suite: **1111/1111 OMC tests pass**.
91+
92+
---
93+
3794
## [v0.8.3-substrate-gpu] - 2026-05-17
3895

3996
**Substrate-shaped GPU matmul kernels: anisotropic 8×32 (Fib short dim, wavefront-divisor long dim) beats the conventional 16×16 by up to 38% on the user's AMD RX 580 / Vulkan. The substrate's job here isn't to fight hardware physics — it's to direct exploration toward configurations conventional GPU programming would never test. Doing so produced 1.61× the GFLOPS at 1024².**

examples/lib/prometheus.omc

Lines changed: 18 additions & 59 deletions
@@ -700,24 +700,14 @@ fn prom_attention_forward(layer, x_id) {
700700
# Compute the per-cell S-MOD modulation matrix from a [N, T] scores
701701
# value snapshot. Each cell: 1 / (1 + alpha * attractor_distance(cell)).
702702
# Used as a non-differentiable const inside the tape's S-MOD path.
703+
#
704+
# v0.8.4 — defers to the Rust builtin `substrate_smod_matrix`. The
705+
# OMC-side inner loop over N×N scores was the v0.8.2 wall-clock
706+
# bottleneck (single-digit ms matmul drowned by tens of seconds in
707+
# this iteration). Wrapping the native call here keeps the public
708+
# signature stable; any caller of this helper picks up the speedup.
703709
fn _prom_smod_matrix(scores_val, alpha) {
704-
h rows = arr_len(scores_val);
705-
h out = [];
706-
h i = 0;
707-
while i < rows {
708-
h row = arr_get(scores_val, i);
709-
h new_row = [];
710-
h j = 0;
711-
while j < arr_len(row) {
712-
h s = arr_get(row, j);
713-
h d = attractor_distance(s);
714-
arr_push(new_row, 1.0 / (1.0 + alpha * d));
715-
j = j + 1;
716-
}
717-
arr_push(out, new_row);
718-
i = i + 1;
719-
}
720-
return out;
710+
return substrate_smod_matrix(scores_val, alpha);
721711
}
722712

723713
# Per-cell substrate-resample modulation matrix from a [N, D] value
@@ -726,25 +716,10 @@ fn _prom_smod_matrix(scores_val, alpha) {
726716
# when already on-attractor. Used as a non-differentiable const inside
727717
# the tape graph, same pattern as _prom_smod_matrix. Won -2.52% val on
728718
# top of L1-MH + S-MOD α=1.0 when applied to V (3/3 seeds).
719+
#
720+
# v0.8.4 — defers to the Rust builtin `substrate_resample_matrix`.
729721
fn _prom_substrate_resample_matrix(v_val, scale) {
730-
h rows = arr_len(v_val);
731-
h out = [];
732-
h i = 0;
733-
while i < rows {
734-
h row = arr_get(v_val, i);
735-
h new_row = [];
736-
h j = 0;
737-
while j < arr_len(row) {
738-
h x = arr_get(row, j);
739-
h scaled = x * scale;
740-
h d = attractor_distance(scaled);
741-
arr_push(new_row, 1.0 / (1.0 + d / scale));
742-
j = j + 1;
743-
}
744-
arr_push(out, new_row);
745-
i = i + 1;
746-
}
747-
return out;
722+
return substrate_resample_matrix(v_val, scale);
748723
}
749724

750725
# Apply post-projection substrate resampling to a tape node. Returns
@@ -1108,38 +1083,22 @@ fn prom_adamw_step(state) {
11081083
h bias1 = 1.0 - pow(b1, step * 1.0);
11091084
h bias2 = 1.0 - pow(b2, step * 1.0);
11101085

1086+
# v0.8.4 fused inner update: substrate_adamw_update is a Rust builtin
1087+
# that replaces ~15 OMC-side elementwise loops per parameter with one
1088+
# tight Rust loop. The m / v OMC arrays are Rc-shared, so the builtin
1089+
# mutates them in place; the returned value is the new parameter.
1090+
# See ADAMW_BUILTIN.md for the wall-clock reasoning. bias1/bias2 are
1091+
# computed inside the builtin from (b1, b2, step) — no need to pass.
11111092
h i = 0;
11121093
while i < arr_len(params) {
11131094
h p = arr_get(params, i);
11141095
h g = tape_grad(p);
1115-
1116-
# m_t = b1*m + (1-b1)*g
11171096
h m_old = arr_get(m, i);
1118-
h m_new = _prom_zip(_prom_scale(m_old, b1, "mul"),
1119-
_prom_scale(g, 1.0 - b1, "mul"), "add");
1120-
arr_set(m, i, m_new);
1121-
1122-
# v_t = b2*v + (1-b2)*g²
11231097
h v_old = arr_get(v, i);
1124-
h gsq = _prom_zip(g, g, "mul");
1125-
h v_new = _prom_zip(_prom_scale(v_old, b2, "mul"),
1126-
_prom_scale(gsq, 1.0 - b2, "mul"), "add");
1127-
arr_set(v, i, v_new);
1128-
1129-
# m_hat = m_t / bias1; v_hat = v_t / bias2
1130-
h m_hat = _prom_scale(m_new, 1.0 / bias1, "mul");
1131-
h v_hat = _prom_scale(v_new, 1.0 / bias2, "mul");
1132-
h denom = _prom_sqrt_eps(v_hat, eps);
1133-
h adam_step = _prom_zip(m_hat, denom, "div");
1134-
1135-
# θ ← θ − lr*adam_step − lr*wd*θ
11361098
h cur = tape_value(p);
1137-
h wd_term = _prom_scale(cur, lr * wd, "mul");
1138-
h main_term = _prom_scale(adam_step, lr, "mul");
1139-
h decayed = _prom_zip(cur, wd_term, "sub");
1140-
h new_val = _prom_zip(decayed, main_term, "sub");
1099+
h new_val = substrate_adamw_update(cur, g, m_old, v_old,
1100+
lr, b1, b2, eps, wd, step);
11411101
tape_set_value(p, new_val);
1142-
11431102
i = i + 1;
11441103
}
11451104
}
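The removed OMC lines above spell out the AdamW math stage by stage. A fused single-loop version might look like the following Rust sketch. It is not the `substrate_adamw_update` source. `Rc<RefCell<Vec<f64>>>` stands in for an Rc-shared OMC array, the `AdamWConfig` struct replaces the builtin's positional arguments for readability, and `_prom_sqrt_eps(v_hat, eps)` is assumed to mean `sqrt(v_hat) + eps`. All three are assumptions.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Shared mutable flat buffer; a stand-in for an Rc-shared OMC array (assumption).
type SharedVec = Rc<RefCell<Vec<f64>>>;

/// Hypothetical grouping of the scalar hyperparameters.
struct AdamWConfig {
    lr: f64,
    b1: f64,
    b2: f64,
    eps: f64,
    wd: f64,
}

/// One fused AdamW update. Mutates `m` and `v` in place and returns the new
/// parameter values. `step` must be >= 1 so the bias corrections are non-zero.
fn adamw_update(
    cur: &[f64],
    grad: &[f64],
    m: &SharedVec,
    v: &SharedVec,
    cfg: &AdamWConfig,
    step: u32,
) -> Vec<f64> {
    let mut m = m.borrow_mut();
    let mut v = v.borrow_mut();
    assert!(cur.len() == grad.len() && cur.len() == m.len() && cur.len() == v.len());

    // Bias corrections are computed once per call, from (b1, b2, step).
    let bias1 = 1.0 - cfg.b1.powi(step as i32);
    let bias2 = 1.0 - cfg.b2.powi(step as i32);

    let mut out = Vec::with_capacity(cur.len());
    for i in 0..cur.len() {
        let g = grad[i];
        m[i] = cfg.b1 * m[i] + (1.0 - cfg.b1) * g; // m_t = b1*m + (1-b1)*g
        v[i] = cfg.b2 * v[i] + (1.0 - cfg.b2) * g * g; // v_t = b2*v + (1-b2)*g^2
        let m_hat = m[i] / bias1;
        let v_hat = v[i] / bias2;
        // Assumed form of _prom_sqrt_eps: sqrt(v_hat) + eps.
        let adam_step = m_hat / (v_hat.sqrt() + cfg.eps);
        // theta <- theta - lr*wd*theta - lr*adam_step
        out.push(cur[i] - cfg.lr * cfg.wd * cur[i] - cfg.lr * adam_step);
    }
    out
}
```

Each element is read and written once per call, where the chained `_prom_zip` / `_prom_scale` version allocated a fresh intermediate array at every stage. Passing the same buffer for both `m` and `v` would panic on the second `borrow_mut`, which is acceptable for a sketch but worth guarding against in real code.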
examples/tests/test_substrate_modulator_builtins.omc

Lines changed: 118 additions & 0 deletions
@@ -0,0 +1,118 @@
1+
# Tests for the v0.8.4 substrate-modulator Rust builtins.
2+
#
3+
# substrate_smod_matrix(scores, alpha) and substrate_resample_matrix(v, scale)
4+
# are Rust-native ports of the inner-loop helpers that lived in
5+
# prometheus.omc as `_prom_smod_matrix` / `_prom_substrate_resample_matrix`.
6+
#
7+
# The math must be identical — both helpers in prometheus.omc now just
8+
# wrap the corresponding builtin, so any divergence here would be a
9+
# semantics regression.
10+
11+
fn assert_true(cond, msg) { if !cond { test_record_failure(msg); } }
12+
13+
fn approx_eq(a, b, tol) {
14+
h d = a - b;
15+
if d < 0.0 { d = 0.0 - d; }
16+
return d <= tol;
17+
}
18+
19+
# -----------------------------------------------------------------
20+
# substrate_smod_matrix
21+
# -----------------------------------------------------------------
22+
23+
fn test_smod_alpha_zero_is_identity() {
24+
# alpha=0 → 1/(1+0·d) = 1 for every cell, regardless of value.
25+
h m = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]];
26+
h out = substrate_smod_matrix(m, 0.0);
27+
h r0 = arr_get(out, 0);
28+
h r1 = arr_get(out, 1);
29+
assert_true(approx_eq(arr_get(r0, 0), 1.0, 1e-9), "[0][0]=1");
30+
assert_true(approx_eq(arr_get(r0, 2), 1.0, 1e-9), "[0][2]=1");
31+
assert_true(approx_eq(arr_get(r1, 1), 1.0, 1e-9), "[1][1]=1");
32+
}
33+
34+
fn test_smod_on_attractor_cell_is_one() {
35+
# 8 IS a Fibonacci attractor, so attractor_distance(8)=0, modulator=1
36+
# regardless of alpha.
37+
h m = [[8.0, 13.0], [21.0, 34.0]];
38+
h out = substrate_smod_matrix(m, 2.0);
39+
h r0 = arr_get(out, 0);
40+
h r1 = arr_get(out, 1);
41+
assert_true(approx_eq(arr_get(r0, 0), 1.0, 1e-9), "8 attractor");
42+
assert_true(approx_eq(arr_get(r0, 1), 1.0, 1e-9), "13 attractor");
43+
assert_true(approx_eq(arr_get(r1, 0), 1.0, 1e-9), "21 attractor");
44+
}
45+
46+
fn test_smod_off_attractor_dampens() {
47+
# 7 is 1 away from 8 (attractor). With alpha=1.0 → 1/(1+1·1) = 0.5
48+
h m = [[7.0]];
49+
h out = substrate_smod_matrix(m, 1.0);
50+
# 1×1 with two-dim input is auto-unwrapped to a 1×1 inner row, so
51+
# arr_get(out, 0) returns the row [0.5].
52+
h row = arr_get(out, 0);
53+
assert_true(approx_eq(arr_get(row, 0), 0.5, 1e-9), "7→0.5 at α=1");
54+
}
55+
56+
fn test_smod_one_d_array_returns_one_d() {
57+
# 1D input → 1D output, same shape semantics as `tape_value` of a
58+
# 1-row matrix.
59+
h v = [0.0, 1.0, 2.0, 3.0]; # all attractors; modulator=1 everywhere
60+
h out = substrate_smod_matrix(v, 1.0);
61+
assert_true(arr_len(out) == 4, "1D out length matches input");
62+
assert_true(approx_eq(arr_get(out, 0), 1.0, 1e-9), "0 attractor");
63+
assert_true(approx_eq(arr_get(out, 3), 1.0, 1e-9), "3 attractor");
64+
}
65+
66+
# -----------------------------------------------------------------
67+
# substrate_resample_matrix
68+
# -----------------------------------------------------------------
69+
70+
fn test_resample_on_attractor_stays_one() {
71+
# cell * scale = 8*1 = 8 IS attractor → d=0 → modulator=1
72+
h m = [[8.0, 13.0]];
73+
h out = substrate_resample_matrix(m, 1.0);
74+
h row = arr_get(out, 0);
75+
assert_true(approx_eq(arr_get(row, 0), 1.0, 1e-9), "8 on attractor");
76+
assert_true(approx_eq(arr_get(row, 1), 1.0, 1e-9), "13 on attractor");
77+
}
78+
79+
fn test_resample_off_attractor_dampens() {
80+
# cell 0.7 with scale 10 → 7. d(7)=1. modulator = 1/(1+1/10) = 0.909...
81+
h m = [[0.7]];
82+
h out = substrate_resample_matrix(m, 10.0);
83+
h row = arr_get(out, 0);
84+
h expected = 1.0 / (1.0 + 1.0 / 10.0); # 0.90909...
85+
assert_true(approx_eq(arr_get(row, 0), expected, 1e-6),
86+
"0.7@10 off-attractor dampens");
87+
}
88+
89+
# scale=0 rejection lives in the Rust builtin; it would raise. We don't
90+
# have try/catch in OMC, so we rely on the builtin error message via
91+
# a separate exit-code test if needed. (Skipped here — Rust unit tests
92+
# could cover it; the OMC-side won't exercise that path in practice.)
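The comment above notes that the scale=0 rejection is better covered by a Rust unit test. One possible shape is sketched below. It is not taken from `interpreter.rs`: `resample_cell`, its `Result` return type, and the error text are all hypothetical, and `attractor_distance` is again assumed to be the nearest-Fibonacci distance.

```rust
// Hypothetical sketch; function names and error text are assumptions.

/// Distance from |n| to the nearest Fibonacci number (assumed semantics).
fn attractor_distance(n: i64) -> i64 {
    let n = n.saturating_abs();
    let (mut lo, mut hi) = (0_i64, 1_i64);
    while hi < n {
        let next = lo.saturating_add(hi);
        lo = hi;
        hi = next;
    }
    (n - lo).min(hi - n)
}

/// One cell of the resample modulator: 1 / (1 + attractor_distance(int(v*scale)) / scale).
/// Rejects scale == 0, which would otherwise divide by zero.
fn resample_cell(v: f64, scale: f64) -> Result<f64, String> {
    if scale == 0.0 {
        return Err(format!("scale must be non-zero, got {scale}"));
    }
    let d = attractor_distance((v * scale) as i64) as f64;
    Ok(1.0 / (1.0 + d / scale))
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn zero_scale_is_rejected() {
        assert!(resample_cell(1.0, 0.0).is_err());
    }

    #[test]
    fn off_attractor_cell_is_damped() {
        // 0.7 * 10 = 7, one away from the attractor 8.
        let got = resample_cell(0.7, 10.0).unwrap();
        assert!((got - 1.0 / 1.1).abs() < 1e-12);
    }
}
```

Returning a `Result` keeps the failure testable without a panic. How the real builtin surfaces the error to OMC callers is not shown in this commit.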
93+
94+
# -----------------------------------------------------------------
95+
# Regression: existing OMC helpers (now wrappers) still work
96+
# -----------------------------------------------------------------
97+
98+
import "examples/lib/prometheus.omc";
99+
100+
fn test_wrapper_smod_matches_builtin() {
101+
h scores = [[3.0, 7.0], [8.0, 12.0]];
102+
h via_wrapper = _prom_smod_matrix(scores, 1.0);
103+
h via_builtin = substrate_smod_matrix(scores, 1.0);
104+
h r0_w = arr_get(via_wrapper, 0);
105+
h r0_b = arr_get(via_builtin, 0);
106+
assert_true(approx_eq(arr_get(r0_w, 0), arr_get(r0_b, 0), 1e-12), "[0][0]");
107+
assert_true(approx_eq(arr_get(r0_w, 1), arr_get(r0_b, 1), 1e-12), "[0][1]");
108+
}
109+
110+
fn test_wrapper_resample_matches_builtin() {
111+
h v = [[0.2, 0.5], [0.8, 1.3]];
112+
h via_wrapper = _prom_substrate_resample_matrix(v, 10.0);
113+
h via_builtin = substrate_resample_matrix(v, 10.0);
114+
h r0_w = arr_get(via_wrapper, 0);
115+
h r0_b = arr_get(via_builtin, 0);
116+
assert_true(approx_eq(arr_get(r0_w, 0), arr_get(r0_b, 0), 1e-12), "[0][0]");
117+
assert_true(approx_eq(arr_get(r0_w, 1), arr_get(r0_b, 1), 1e-12), "[0][1]");
118+
}
