Commit 5b83d76

🥂 v0.8.1 tape-primitives: substrate-native tape_phi_log + tape_abs, broadcast-backward fix

Two new tape autograd primitives, one boring (tape_abs) and one substrate-native invention (tape_phi_log), plus a latent backward-broadcast bug fix that unblocks S-MOD + substrate-K end-to-end in OMC.

tape_phi_log(x, scale=10.0) fuses Q6's ln(|x·scale|+1)/(π·ln φ) into one tape node with the substrate basis baked into the backward derivation. It is defined at zero (boring tape_log(0) returns -∞) and exposes the substrate constants at the AST level rather than burying them in scalar denominators.

A/B in pure-OMC Prometheus (3 seeds, 80 AdamW steps):

  off       2.5692
  composed  2.5530  (−0.63%, 2/3 seeds win)
  fused     2.5530  (composed − fused = 1.2e-7)

The composed-fused divergence sits near the accumulated floating-point rounding noise floor after ~80 forward+backward steps, so the substrate-native fused primitive matches the boring composed reference to within ~1e-7 under training. First OMC cross-validation of the PyTorch Q6 finding.

Pre-existing tape_div / tape_mul backward panicked with col-broadcast denominators ([N, N] / [N, 1]). The prom_substrate_softmax α>0 path ends in exactly that shape, so S-MOD + substrate-K had never actually trained end-to-end in OMC: it would panic at first backward. Both backwards now iterate the dy shape, reduce indices against each operand's actual extent, and sum contributions across broadcast axes.

Files:
  omnimcode-core/src/interpreter.rs                     TapeOp::Abs, TapeOp::PhiLog, broadcast-aware Mul/Div backward
  examples/lib/prometheus.omc                           prom_q6_modulate + q6_mode field
  examples/prometheus_q6_ab.omc                         A/B harness (3 seeds × 3 modes)
  examples/tests/test_tape_abs_phi_log.omc              12 primitive unit tests
  examples/tests/test_q6_modulate.omc                   4 modulation-dispatch tests
  experiments/prometheus_parity/TAPE_PRIMITIVES_AB.md   full writeup

1103/1103 OMC tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 1386fd2 commit 5b83d76

7 files changed

Lines changed: 878 additions & 32 deletions
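
To make the fused formula from the commit message concrete, here is a minimal Python sketch of the forward value ln(|x·scale|+1)/(π·ln φ) and the analytic derivative it implies, |scale|·sign(x) / ((|x·scale|+1)·π·ln φ), checked against a central finite difference. This is not the Rust `TapeOp::PhiLog` code, which this page does not show. The names `phi_log`, `phi_log_grad`, and `central_difference` are hypothetical, and returning 0 as the gradient at x = 0 is an assumed subgradient choice.

```python
import math

PHI = (1.0 + math.sqrt(5.0)) / 2.0
PI_LN_PHI = math.pi * math.log(PHI)  # substrate denominator, roughly 1.5118


def phi_log(x: float, scale: float = 10.0) -> float:
    """Forward: ln(|x*scale| + 1) / (pi * ln(phi)). Finite at x = 0."""
    return math.log(abs(x * scale) + 1.0) / PI_LN_PHI


def phi_log_grad(x: float, scale: float = 10.0) -> float:
    """Analytic d/dx of phi_log. Uses 0 at x = 0 (assumed subgradient)."""
    if x == 0.0:
        return 0.0
    sign = 1.0 if x > 0.0 else -1.0
    return abs(scale) * sign / ((abs(x * scale) + 1.0) * PI_LN_PHI)


def central_difference(f, x: float, eps: float = 1e-6) -> float:
    """Numerical derivative used only to cross-check the analytic one."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


if __name__ == "__main__":
    for x in (-0.7, -0.05, 0.3, 2.0):
        analytic = phi_log_grad(x)
        numeric = central_difference(phi_log, x)
        print(f"x={x:+.2f}  forward={phi_log(x):.6f}  grad={analytic:.6f}  fd={numeric:.6f}")
    print("phi_log(0) =", phi_log(0.0))
```

Because the `+1` sits inside the logarithm, the forward value at zero is ln(1) = 0 rather than -∞, which is the property the commit message highlights.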

CHANGELOG.md

Lines changed: 47 additions & 0 deletions
@@ -13,6 +13,7 @@ Read top-to-bottom for the arc; jump to any chapter for the detail.
 
 | Tag | Date | One-line |
 |---|---|---|
+| [v0.8.1-tape-primitives](#v081-tape-primitives--2026-05-17) | 2026-05-17 | **Substrate-native tape primitive precedent**: `tape_phi_log` fuses Q6's log-distance into one tape node, with `tape_abs` as the boring companion. Composed vs fused trains to within ~1e-7, so the fused abstraction is effectively free. **Pre-existing tape_div/tape_mul broadcast-backward bug fixed**, which unblocks OMC-side cross-validation of S-MOD + substrate-K. First Q6 OMC replication: −0.63% (2/3 seeds) at small scale, directionally matching PyTorch's −12.15%. |
 | [v0.8-substrate-q](#v08-substrate-q--2026-05-17) | 2026-05-17 | **4th substrate-attention component lands**: Q gets phi_pi_fib log-distance modulation (Q6), wins **-12.15% val 6/6 seeds**. Cumulative stack now -16.7% vs vanilla baseline. |
 | [v0.7-gpu-scaffold](#v07-gpu-scaffold--2026-05-17) | 2026-05-17 | GPU compute scaffold: `omnimcode-gpu` crate with wgpu (Vulkan) backend, ROCm/CUDA stubs. **4.04× speedup verified on the user's AMD RX 580** via Vulkan (no ROCm pain). |
 | [v0.6-fibtier-memory](#v06-fibtier-memory--2026-05-17) | 2026-05-17 | Fibtier-bounded eviction for memory: cap the index at fibonacci-tier capacity (default 232), evicted entries still recoverable by hash. Memory now safe for arbitrarily long agent sessions. |
@@ -31,6 +32,52 @@ Read top-to-bottom for the arc; jump to any chapter for the detail.
 
 ---
 
+## [v0.8.1-tape-primitives] - 2026-05-17
+
+**Two new tape autograd primitives + a latent backward-broadcast bug fix. The substrate-native `tape_phi_log` is mathematically equivalent to the boring composed reference and trains to within ~1e-7 of it, so the substrate-native abstraction is effectively free. The broadcast-backward fix unblocks S-MOD + substrate-K end-to-end training in OMC for the first time.**
+
+### What's new
+
+- **`tape_abs(x)`**: element-wise |x|, the obvious-but-missing PyTorch-parity primitive.
+- **`tape_phi_log(x, scale=10.0)`**: fused `ln(|x · scale| + 1) / (π · ln φ)`. One tape node instead of the composed chain of five ops (mul, abs, add, log, div). Defined at zero, where it evaluates to 0 (boring `tape_log(0)` returns -∞). Substrate basis (π·ln φ) visible at the AST level rather than buried in a scalar constant.
+- **`prom_q6_modulate(q, scale, gamma, mode)`**: dispatches Q6 modulation through `"off"`, `"composed"` (boring `tape_abs` + `tape_log` + scalar denom), or `"fused"` (`tape_phi_log`).
+- **`q6_mode` field on `prom_attention_substrate_k_*`**: opt-in for the substrate-K layer (default `"off"` for backward compat).
+
+### Broadcast-backward fix (the real load-bearing fix)
+
+`tape_div` and `tape_mul` backwards panicked with col-broadcast denominators (`bv.cols == 1`): the `prom_substrate_softmax` α>0 path ends in `tape_div(attn_unnorm[N, N], row_sums[N, 1])`, which indexed out of bounds during backward. This meant **S-MOD + substrate-K had never actually trained end-to-end in OMC**; it would panic at first backward.
+
+Fix: both backwards now iterate the output (dy) shape, reduce indices against each operand's actual extent (so a size-1 broadcast axis always maps to index 0), and sum contributions across broadcast axes. This is the standard broadcast-aware backward.
+
+### A/B result: fused primitive matches the composed reference to ~1e-7
+
+`examples/prometheus_q6_ab.omc`, substrate-K transformer, seq_len=6, d_model=8, ff_dim=16, 80 AdamW steps, 3 seeds:
+
+| | mean loss | Δ vs off | composed − fused |
+|---|--:|--:|--:|
+| off (no Q6) | 2.5692 | | |
+| composed Q6 | 2.5530 | −0.0162 (−0.63%) | |
+| fused Q6 | 2.5530 | −0.0162 (−0.63%) | **1.2 × 10⁻⁷** |
+
+Each value is the mean over the 3 seeds of the loss averaged over the last 10 training steps, which is what the harness reports. Composed and fused agree to ~1e-7 after 80 forward+backward AdamW steps, a gap attributed to floating-point accumulation noise. **The substrate-native primitive tracks the boring composed reference to that tolerance under actual training.** Q6 itself wins 2/3 seeds at this tiny scale, directionally consistent with PyTorch's −12.15% 6/6 seeds at TinyShakespeare L1-MH.
+
+### What this opens up
+
+`tape_phi_log` is the precedent. Future substrate-native primitives can be slotted in the same way: composed reference + fused alternative + A/B at the unit + training levels. Candidates: `tape_substrate_resample`, `tape_attractor_snap`, attractor-modulated-backward `tape_phi_log_v2`.
+
+### Files
+
+- `omnimcode-core/src/interpreter.rs`: `TapeOp::Abs`, `TapeOp::PhiLog(usize, f64)`, broadcast-aware Mul/Div backward
+- `examples/lib/prometheus.omc`: `prom_q6_modulate` + `q6_mode` field
+- `examples/prometheus_q6_ab.omc`: A/B harness
+- `examples/tests/test_tape_abs_phi_log.omc`: 12 primitive unit tests
+- `examples/tests/test_q6_modulate.omc`: 4 modulation-dispatch tests
+- `experiments/prometheus_parity/TAPE_PRIMITIVES_AB.md`: full writeup
+
+Test suite: **1103/1103 pass** after these additions and the broadcast-backward fix.
+
+---
+
 ## [v0.8-substrate-q] - 2026-05-17
 
 **4th substrate-attention component lands: Q gets phi_pi_fib log-distance modulation (Q6), wins -12.15% val 6/6 seeds. Cumulative substrate-attention stack now -16.7% vs vanilla baseline on TinyShakespeare.**
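
The broadcast-backward rule described in this changelog entry (iterate the dy shape, reduce indices to each operand's extent, sum over broadcast axes) can be sketched in plain Python for the `[N, N] / [N, 1]` case. This is an illustration under stated assumptions, not the code in `interpreter.rs`: nested lists stand in for tape tensors, `div_forward` and `div_backward` are hypothetical names, the numerator is assumed to already have the output shape, and the modulo index reduction assumes every axis of the denominator is either full-size or size 1.

```python
def div_forward(a, b):
    """out[i][j] = a[i][j] / b[i % rb][j % cb], broadcasting size-1 axes of b."""
    rb, cb = len(b), len(b[0])
    return [[a[i][j] / b[i % rb][j % cb] for j in range(len(a[0]))]
            for i in range(len(a))]


def div_backward(a, b, dy):
    """Return (da, db) for out = a / b, where b may broadcast over rows or columns."""
    rows, cols = len(dy), len(dy[0])
    rb, cb = len(b), len(b[0])
    da = [[0.0] * cols for _ in range(rows)]
    db = [[0.0] * cb for _ in range(rb)]
    for i in range(rows):              # iterate the output (dy) shape
        for j in range(cols):
            bi, bj = i % rb, j % cb    # reduce indices to b's actual extent
            bv = b[bi][bj]
            da[i][j] += dy[i][j] / bv
            # Every output cell that reused b[bi][bj] adds its contribution here.
            db[bi][bj] += -dy[i][j] * a[i][j] / (bv * bv)
    return da, db


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]       # [N, N] numerator
    b = [[2.0], [4.0]]                 # [N, 1] column-broadcast denominator
    dy = [[1.0, 1.0], [1.0, 1.0]]      # upstream gradient of ones
    da, db = div_backward(a, b, dy)
    print("out =", div_forward(a, b))
    print("da  =", da)
    print("db  =", db)
```

The key point is that `db` keeps the denominator's own `[N, 1]` shape: indexing it with the full output column index is what would run out of bounds, while accumulating into the reduced index gives the summed gradient.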

examples/lib/prometheus.omc

Lines changed: 69 additions & 1 deletion
@@ -801,6 +801,15 @@ fn prom_attention_substrate_k_new(d_model, seq_len, rng_state) {
     # seeds (-2.52% val on top of S-MOD α=1.0) on TinyShakespeare L1-MH.
     # scale=0.0 disables. Sweep candidate; not yet tuned.
     dict_set(layer, "v_resample_scale", 10.0);
+    # Q6 substrate-modulation: q ← q * exp(-γ · log_φπfib(|q·scale|+1)).
+    # PyTorch parity confirmed -12.15% val 6/6 seeds at L1-MH. Mode picks
+    # between the composed (tape_abs + tape_log) and fused (tape_phi_log)
+    # primitive paths. They are mathematically equivalent, so a divergence
+    # at training time pinpoints the cost of the abstraction. "off"
+    # disables (legacy behavior).
+    dict_set(layer, "q6_mode", "off");
+    dict_set(layer, "q6_scale", 10.0);
+    dict_set(layer, "q6_gamma", 0.5);
     dict_set(layer, "rng_state", dict_get(V, "state"));
     return layer;
 }
@@ -814,22 +823,81 @@ fn prom_attention_substrate_k_forward(layer, x_id) {
     # loaded from checkpoints predating substrate-V won't have it.
     h v_scale = dict_get(layer, "v_resample_scale");
     if v_scale == null { v_scale = 0.0; }
+    h q6_mode = dict_get(layer, "q6_mode");
+    if q6_mode == null { q6_mode = "off"; }
+    h q6_scale = dict_get(layer, "q6_scale");
+    if q6_scale == null { q6_scale = 10.0; }
+    h q6_gamma = dict_get(layer, "q6_gamma");
+    if q6_gamma == null { q6_gamma = 0.5; }
 
     h q = tape_matmul(x_id, Q_w);
+    # Q6 substrate-modulation on the projected Q. The two modes are
+    # mathematically equivalent; the divergence under the optimizer is
+    # what we measure.
+    h q_mod = prom_q6_modulate(q, q6_scale, q6_gamma, q6_mode);
+
     h v_raw = tape_matmul(x_id, V_w);
     # Substrate-resample post-projection (v_scale=0.0 → identity).
     h v = prom_substrate_resample(v_raw, v_scale);
 
     # K is the substrate (CRT-PE table). No learnable params on K side.
     h k = tape_const(K_const);
     h kt = tape_transpose(k);
-    h scores = tape_matmul(q, kt);
+    h scores = tape_matmul(q_mod, kt);
 
     # Substrate-modulated softmax (smod_alpha=0.0 falls back to standard).
     h attn = prom_substrate_softmax(scores, smod_alpha);
     return tape_matmul(attn, v);
 }
 
+# ---------------------------------------------------------------------------
+# Q6 modulation: q_full = q * exp(-γ · log_φπfib(|q·scale|+1))
+#
+# Three modes:
+#   "off"      → identity (legacy behavior)
+#   "composed" → tape_abs + tape_log + scalar denom (boring PyTorch-parity)
+#   "fused"    → tape_phi_log (substrate-native fused op)
+#
+# The composed and fused paths compute identical forward values (verified by
+# test_composed_equals_fused_forward) and propagate identical analytic
+# gradients. Any training-time divergence between them comes from rounding
+# accumulation, allocation patterns, or AdamW interactions, NOT from the
+# math. Measuring that divergence is exactly what the A/B is for.
+# ---------------------------------------------------------------------------
+
+fn _prom_q6_log_distance_composed(q_id, scale) {
+    # ln(|q · scale| + 1) / (π · ln φ)
+    h scale_c = tape_const(scale);
+    h qs = tape_mul(q_id, scale_c);
+    h qs_abs = tape_abs(qs);
+    h one = tape_const(1.0);
+    h qs_abs1 = tape_add(qs_abs, one);
+    h ln_qs = tape_log(qs_abs1);
+    # π · ln φ = 3.14159... · 0.481211... ≈ 1.511771...
+    h denom = tape_const(1.511771534427787);
+    return tape_div(ln_qs, denom);
+}
+
+fn _prom_q6_modulation_from_log_d(log_d_id, gamma) {
+    # exp(-γ · log_d)
+    h neg_gamma = tape_const(0.0 - gamma);
+    h scaled = tape_mul(neg_gamma, log_d_id);
+    return tape_exp(scaled);
+}
+
+fn prom_q6_modulate(q_id, scale, gamma, mode) {
+    if mode == "off" { return q_id; }
+    h log_d = null;
+    if mode == "fused" {
+        log_d = tape_phi_log(q_id, scale);
+    } else {
+        # "composed" (or any unrecognized mode falls through to the boring path)
+        log_d = _prom_q6_log_distance_composed(q_id, scale);
+    }
+    h modulation = _prom_q6_modulation_from_log_d(log_d, gamma);
+    return tape_mul(q_id, modulation);
+}
+
 # L2: substrate K + Q. Only V is learned.
 # Q is derived as: x_pos_concat * fixed projection (use CRT-PE directly).
 # In the simplest form: Q = CRT-PE (same as K) so each position queries
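
As a rough scalar illustration of what `prom_q6_modulate` computes, the following Python sketch evaluates q · exp(−γ · ln(|q·scale|+1)/(π·ln φ)) two ways. The "composed" branch mirrors the step-by-step chain in the diff. The "fused" branch is only a stand-in for `tape_phi_log` (whose Rust body is not shown here): it uses the algebraically equivalent power law q · (|q·scale|+1)^(−γ/(π·ln φ)). The function name `q6_modulate` and the scalar (non-tensor) setting are assumptions made for the example.

```python
import math

PI_LN_PHI = math.pi * math.log((1.0 + math.sqrt(5.0)) / 2.0)


def q6_modulate(q: float, scale: float = 10.0, gamma: float = 0.5,
                mode: str = "off") -> float:
    """Scalar Q6 modulation with the same three modes as the OMC dispatcher."""
    if mode == "off":
        return q
    u = abs(q * scale) + 1.0
    if mode == "fused":
        # Stand-in for a fused op: the same math rearranged as one power law.
        return q * u ** (-gamma / PI_LN_PHI)
    # "composed" (unrecognized modes also land here, as in the diff):
    # log -> divide by the substrate constant -> scale by -gamma -> exp.
    log_d = math.log(u) / PI_LN_PHI
    return q * math.exp(-gamma * log_d)


if __name__ == "__main__":
    for q in (-2.0, -0.1, 0.0, 0.1, 1.0):
        composed = q6_modulate(q, mode="composed")
        fused = q6_modulate(q, mode="fused")
        print(f"q={q:+.2f}  composed={composed:+.12f}  |composed - fused|={abs(composed - fused):.2e}")
```

Two paths that are equal on paper can still differ in the last few bits because they round at different intermediate steps, which is the kind of gap the A/B harness is designed to measure over many optimizer updates.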

examples/prometheus_q6_ab.omc

Lines changed: 216 additions & 0 deletions
@@ -0,0 +1,216 @@
+# Q6 substrate-modulation A/B in pure-OMC Prometheus.
+#
+# Trains the substrate-K transformer three ways:
+#   "off"      → Q passes through unmodulated (baseline)
+#   "composed" → Q6 via tape_abs + tape_log + scalar denom
+#   "fused"    → Q6 via tape_phi_log (substrate-native primitive)
+#
+# Composed and fused are mathematically equivalent (test_q6_modulate.omc
+# locks that in at the unit level). The end-to-end training comparison
+# answers a different question: does the fused primitive give identical
+# results once many forward/backward passes have accumulated, or
+# does abstraction cost (allocation patterns, accumulation order) show up
+# at training time?
+#
+# Comparing both Q6 paths to "off" also gives an OMC-side cross-validation
+# of the PyTorch -12.15% Q6 finding at small scale.
+
+import "examples/lib/prometheus.omc";
+
+fn build_vocab(text) {
+    h seen = dict_new();
+    h chars = [];
+    h i = 0;
+    while i < str_len(text) {
+        h ch = str_slice(text, i, i + 1);
+        if !dict_has(seen, ch) {
+            dict_set(seen, ch, arr_len(chars));
+            arr_push(chars, ch);
+        }
+        i = i + 1;
+    }
+    h v = dict_new();
+    dict_set(v, "chars", chars);
+    dict_set(v, "lookup", seen);
+    return v;
+}
+
+fn encode(text, vocab) {
+    h lookup = dict_get(vocab, "lookup");
+    h ids = [];
+    h i = 0;
+    while i < str_len(text) {
+        h ch = str_slice(text, i, i + 1);
+        arr_push(ids, dict_get(lookup, ch));
+        i = i + 1;
+    }
+    return ids;
+}
+
+fn build_model(q6_mode, vocab_size, d_model, ff_dim, seq_len, seed) {
+    h emb = prom_embedding_new(vocab_size, d_model, seed);
+    h s1 = dict_get(emb, "rng_state");
+    h attn = prom_attention_substrate_k_new(d_model, seq_len, s1 + 11);
+    dict_set(attn, "q6_mode", q6_mode);
+    h s2 = dict_get(attn, "rng_state");
+    h ln1 = prom_layernorm_new(d_model, s2);
+    h ff_up = prom_linear_new(d_model, ff_dim, s2 + 13);
+    h s3 = dict_get(ff_up, "rng_state");
+    h ff_down = prom_linear_new(ff_dim, d_model, s3);
+    h s4 = dict_get(ff_down, "rng_state");
+    h ln2 = prom_layernorm_new(d_model, s4);
+    h head = prom_linear_new(d_model, vocab_size, s4 + 17);
+    h m = dict_new();
+    dict_set(m, "q6_mode", q6_mode);
+    dict_set(m, "emb", emb);
+    dict_set(m, "attn", attn);
+    dict_set(m, "ln1", ln1);
+    dict_set(m, "ff_up", ff_up);
+    dict_set(m, "ff_down", ff_down);
+    dict_set(m, "ln2", ln2);
+    dict_set(m, "head", head);
+    return m;
+}
+
+fn forward_window(model, token_ids, pe_table) {
+    h x = prom_embedding_batch(dict_get(model, "emb"), token_ids);
+    h pe_rows = [];
+    h i = 0;
+    while i < arr_len(token_ids) {
+        arr_push(pe_rows, arr_get(pe_table, i));
+        i = i + 1;
+    }
+    h pe_const = tape_const(pe_rows);
+    x = tape_add(x, pe_const);
+    h attn_out = prom_attention_substrate_k_forward(dict_get(model, "attn"), x);
+    h x_post_attn = tape_add(x, attn_out);
+    h normed1 = prom_layernorm_forward(dict_get(model, "ln1"), x_post_attn);
+    h up = prom_linear_forward(dict_get(model, "ff_up"), normed1);
+    h activated = prom_relu(up);
+    h down = prom_linear_forward(dict_get(model, "ff_down"), activated);
+    h x_post_ff = tape_add(x_post_attn, down);
+    h normed2 = prom_layernorm_forward(dict_get(model, "ln2"), x_post_ff);
+    return prom_linear_forward(dict_get(model, "head"), normed2);
+}
+
+fn collect_all_params(model) {
+    h attn_p = prom_attention_substrate_k_params(dict_get(model, "attn"));
+    h other = prom_collect_params_v2([
+        dict_get(model, "emb"),
+        dict_get(model, "ln1"),
+        dict_get(model, "ff_up"),
+        dict_get(model, "ff_down"),
+        dict_get(model, "ln2"),
+        dict_get(model, "head"),
+    ]);
+    h out = [];
+    h i = 0;
+    while i < arr_len(attn_p) { arr_push(out, arr_get(attn_p, i)); i = i + 1; }
+    i = 0;
+    while i < arr_len(other) { arr_push(out, arr_get(other, i)); i = i + 1; }
+    return out;
+}
+
+fn train_arm(q6_mode, vocab_size, ids, seq_len, d_model, ff_dim, lr, steps, seed) {
+    tape_reset();
+    h model = build_model(q6_mode, vocab_size, d_model, ff_dim, seq_len, seed);
+    h params = collect_all_params(model);
+    h opt = prom_adamw_new(params, lr, 0.9, 0.999, 1e-8, 0.0);
+    h pe_table = prom_crt_pe_matrix(seq_len, d_model);
+    h n_windows = arr_len(ids) - seq_len - 1;
+
+    h tail_losses = [];
+    h step = 0;
+    while step < steps {
+        h start = step - (step / n_windows) * n_windows;
+        h window = [];
+        h targets = [];
+        h k = 0;
+        while k < seq_len {
+            arr_push(window, arr_get(ids, start + k));
+            arr_push(targets, arr_get(ids, start + k + 1));
+            k = k + 1;
+        }
+        h logits = forward_window(model, window, pe_table);
+        h loss = prom_cross_entropy_batch(logits, targets, vocab_size);
+        tape_backward(loss);
+        prom_adamw_step(opt);
+        if step >= steps - 10 { arr_push(tail_losses, tape_value(loss)); }
+        step = step + 1;
+    }
+    h sum = 0.0;
+    h i = 0;
+    while i < arr_len(tail_losses) { sum = sum + arr_get(tail_losses, i); i = i + 1; }
+    return sum / arr_len(tail_losses);
+}
+
+fn mean_arr(xs) {
+    h sum = 0.0;
+    h i = 0;
+    while i < arr_len(xs) { sum = sum + arr_get(xs, i); i = i + 1; }
+    return sum / arr_len(xs);
+}
+
+fn main() {
+    print("=== OMC Q6 A/B (off vs composed vs fused) ===");
+    h text = "the rain in spain falls mainly on the plain and the sun rises in the east while the moon hides behind the mountain peaks of distant lands where ancient creatures sleep in caves of silver";
+    h vocab = build_vocab(text);
+    h vocab_size = arr_len(dict_get(vocab, "chars"));
+    h ids = encode(text, vocab);
+    h seq_len = 6;
+    h d_model = 8;
+    h ff_dim = 16;
+    h lr = 0.01;
+    h steps = 80;
+    h seeds = [42, 7, 123];
+
+    print(concat_many("vocab: ", to_string(vocab_size),
+                      " seq_len: ", to_string(seq_len),
+                      " d_model: ", to_string(d_model),
+                      " ff: ", to_string(ff_dim),
+                      " steps: ", to_string(steps),
+                      " seeds: ", to_string(arr_len(seeds))));
+    print("");
+
+    h off_losses = [];
+    h composed_losses = [];
+    h fused_losses = [];
+
+    h s = 0;
+    while s < arr_len(seeds) {
+        h seed = arr_get(seeds, s);
+        print(concat_many("seed=", to_string(seed)));
+
+        h loff = train_arm("off", vocab_size, ids, seq_len, d_model, ff_dim, lr, steps, seed);
+        h lcomp = train_arm("composed", vocab_size, ids, seq_len, d_model, ff_dim, lr, steps, seed);
+        h lfus = train_arm("fused", vocab_size, ids, seq_len, d_model, ff_dim, lr, steps, seed);
+
+        print(concat_many(" off= ", to_string(loff)));
+        print(concat_many(" composed=", to_string(lcomp)));
+        print(concat_many(" fused= ", to_string(lfus)));
+
+        arr_push(off_losses, loff);
+        arr_push(composed_losses, lcomp);
+        arr_push(fused_losses, lfus);
+        s = s + 1;
+    }
+
+    print("");
+    print("=== aggregate ===");
+    h moff = mean_arr(off_losses);
+    h mcomp = mean_arr(composed_losses);
+    h mfus = mean_arr(fused_losses);
+    print(concat_many("mean off= ", to_string(moff)));
+    print(concat_many("mean composed=", to_string(mcomp), " Δ vs off: ", to_string(mcomp - moff)));
+    print(concat_many("mean fused= ", to_string(mfus), " Δ vs off: ", to_string(mfus - moff)));
+    print(concat_many("composed vs fused divergence: ", to_string(mfus - mcomp)));
+    print("");
+    print("Interpretation:");
+    print(" | composed - fused | small (~1e-6) → fused primitive matches math (expected)");
+    print(" Q6 modes vs off → does Q6 help at small scale in OMC?");
+    print(" PyTorch-side -12.15% baseline was at multi-head TinyShakespeare; small-scale");
+    print(" single-head OMC may show smaller or noisier signal. What we want to confirm");
+    print(" is that the fused path doesn't introduce its own training-time divergence.");
+}
+
+main();
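
For readers who want to see the harness bookkeeping outside OMC, this small Python sketch reproduces three pieces of it: the wrap-around choice of training window (the OMC expression `step - (step / n_windows) * n_windows` acts as `step mod n_windows`, assuming `/` is integer division there), the tail-loss mean that each arm returns, and the relative delta quoted in the changelog. The helper names `window_and_targets`, `tail_mean`, and `relative_delta_pct` are hypothetical.

```python
def window_and_targets(ids, seq_len, step):
    """Pick the input window and next-token targets for a given step."""
    n_windows = len(ids) - seq_len - 1
    start = step % n_windows                      # wrap around the token stream
    window = ids[start:start + seq_len]
    targets = ids[start + 1:start + seq_len + 1]  # inputs shifted by one token
    return window, targets


def tail_mean(losses, tail=10):
    """Mean of the last `tail` losses, the per-arm number the harness reports."""
    tail_losses = losses[-tail:]
    return sum(tail_losses) / len(tail_losses)


def relative_delta_pct(baseline, value):
    """Percentage change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100.0


if __name__ == "__main__":
    ids = list(range(10))
    print(window_and_targets(ids, seq_len=6, step=4))
    print(tail_mean([5.0, 4.0, 3.0, 2.0], tail=2))
    print(round(relative_delta_pct(2.5692, 2.5530), 2))
```

With the reported means, 2.5692 for "off" and 2.5530 for either Q6 arm, the last helper gives the −0.63% figure quoted in the changelog table.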
