v0.8.10 substrate-aware backward gradients: TRIED, falsified at this scale

RandomCoder-lab · claude · RandomCoder-lab · commit 06a7c16cdd3e · 2026-05-17T18:46:41.000-05:00
Built tape_substrate_grad_mod(x, scale, alpha): identity forward, but
backward amplifies gradient components pulling θ toward nearest Fibonacci
attractor and dampens components pushing away. The substrate as
gradient-flow preconditioner instead of forward modulator.

Math verified on 3 hand-checked smoke cases (x=0.6/0.7/0.5 at scale=10,
alpha=0.5, all match analytical expectation).

A/B at d_model=32, 250 steps, 3 seeds (wrap Q and V in grad_mod before
matmul, forward unchanged):

  baseline              1.998
  + substrate gm        2.165  (+8.4%, wins 1/3)
  + substrate gm + Q6   2.157  (+7.9%, wins 1/3)

Falsified at this scale. Loss landscape pulls harder than substrate
alignment can resist. Two "constrain toward substrate" hypotheses now
falsified (this + v0.8.8 #3 substrate-init).

The empirical map after v0.8: substrate at OUTPUTS or in STRUCTURE works
(Q6, S-MOD, CRT-PE, 8x32 tile, MH-Q6 compound). Substrate as INPUT
constraint or BACKWARD bias does not (at current scales).

Reformulations possible (each its own chapter): different scale, apply to
FF not attention, decay alpha during training, use as regularization term
not gradient bias. v0.8.10 ships the honest negative.

#2 d_model=128 larger-scale bench still running (22 min in, buffered
output won't print until exit); lands in v0.8.11.

Files:
  omnimcode-core/src/interpreter.rs                                TapeOp::SubstrateGradMod
  examples/prometheus_substrate_grad_mod_xval.omc                  3-arm A/B
  experiments/prometheus_parity/V0810_SUBSTRATE_AWARE_BACKWARD.md  writeup

1111/1111 OMC tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
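
For readers who want to check the three smoke cases by hand, the backward rule described in this commit can be sketched in Python. This is an illustrative sketch only, not the Rust code in `interpreter.rs`. The helper names (`fib_table`, `nearest_attractor_with_dist`, `grad_mod_backward`) are hypothetical, and several details are assumptions because the commit does not state them: the extent of the attractor table, tie-breaking toward the smaller attractor, sign-mirroring for negative values, passthrough on a zero gradient, and the use of Python's `round`.

```python
from bisect import bisect_left


def fib_table(limit=10**6):
    """Sorted, de-duplicated Fibonacci numbers up to at least `limit` (assumed extent)."""
    fibs = [0, 1]
    while fibs[-1] < limit:
        fibs.append(fibs[-1] + fibs[-2])
    return sorted(set(fibs))


FIBS = fib_table()


def nearest_attractor_with_dist(xs):
    """Return (attractor, dist) for an integer xs.

    Assumptions: negatives mirror the positive table, and ties go to the
    attractor with the smaller magnitude.
    """
    sign = -1 if xs < 0 else 1
    mag = abs(xs)
    i = bisect_left(FIBS, mag)
    candidates = FIBS[max(i - 1, 0): i + 1]
    best = min(candidates, key=lambda f: (abs(f - mag), f))
    return sign * best, abs(best - mag)


def grad_mod_backward(x, dy, scale, alpha):
    """Backward pass of the identity-forward op for a single cell."""
    xs = round(x * scale)  # Python rounds exact halves to even (assumption)
    attractor, dist = nearest_attractor_with_dist(xs)
    if dist == 0 or dy == 0:
        return dy  # on an attractor (or no gradient): passthrough
    direction = 1 if attractor > xs else -1
    grad_sign = 1 if dy > 0 else -1
    # The update is -lr * dy, so it moves toward the attractor exactly
    # when the gradient sign is opposite to the attractor direction.
    pulls_toward = grad_sign * direction < 0
    return dy * (1 + alpha) if pulls_toward else dy / (1 + alpha)


# The three smoke cases from the commit message (scale=10, alpha=0.5).
for x in (0.6, 0.7, 0.5):
    print(x, grad_mod_backward(x, 1.0, 10, 0.5))
```

Flipping the gradient sign at x=0.7 turns that case from dampen into amplify, which is what the sign rule implies when the attractor sits above the parameter.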
diff --git a/examples/prometheus_substrate_grad_mod_xval.omc b/examples/prometheus_substrate_grad_mod_xval.omc
@@ -0,0 +1,249 @@
+# Substrate-aware backward gradients A/B (task #284, v0.8.10 research item).
+#
+# Forward is identity; backward amplifies gradient components pulling params
+# toward Fibonacci attractors and dampens components pushing away. The
+# substrate as gradient-flow regularizer.
+#
+# Three arms at d_model=32, single-head, 250 steps, 3 seeds:
+#   A. baseline   plain L1+SMOD+V
+#   B. grad_mod   wrap Q and V projections with tape_substrate_grad_mod
+#                 before tape_matmul. Forward UNCHANGED; backward biased.
+#   C. + Q6       grad_mod + Q6 fused, to see if substrate-shaped backward
+#                 compounds with substrate-shaped forward modulation.
+#
+# Hypothesis: gradient bias toward attractors might regularize Q/V like
+# substrate-init was supposed to — but at TRAINING time instead of init,
+# which lets parameters drift if the loss landscape pulls hard.
+
+import "examples/lib/prometheus.omc";
+
+fn build_vocab(text) {
+    h seen = dict_new();
+    h chars = [];
+    h i = 0;
+    while i < str_len(text) {
+        h ch = str_slice(text, i, i + 1);
+        if !dict_has(seen, ch) { dict_set(seen, ch, arr_len(chars)); arr_push(chars, ch); }
+        i = i + 1;
+    }
+    h v = dict_new();
+    dict_set(v, "chars", chars);
+    dict_set(v, "lookup", seen);
+    return v;
+}
+
+fn encode(text, vocab) {
+    h lookup = dict_get(vocab, "lookup");
+    h ids = [];
+    h i = 0;
+    while i < str_len(text) {
+        h ch = str_slice(text, i, i + 1);
+        arr_push(ids, dict_get(lookup, ch));
+        i = i + 1;
+    }
+    return ids;
+}
+
+# Custom attention forward that wraps Q and V tape nodes in substrate_grad_mod
+# before they enter the matmul. This biases their gradients without changing
+# the forward computation.
+fn attn_forward_with_grad_mod(layer, x_id, gm_scale, gm_alpha, use_q6) {
+    h Q_w = dict_get(layer, "Q");
+    h V_w = dict_get(layer, "V");
+    h K_const = dict_get(layer, "K_const");
+    h smod_alpha = dict_get(layer, "smod_alpha");
+    h v_scale = dict_get(layer, "v_resample_scale");
+    if v_scale == null { v_scale = 0.0; }
+
+    # Wrap Q and V param tape nodes — biases backward flow into them.
+    h Q_mod = tape_substrate_grad_mod(Q_w, gm_scale, gm_alpha);
+    h V_mod = tape_substrate_grad_mod(V_w, gm_scale, gm_alpha);
+
+    h q = tape_matmul(x_id, Q_mod);
+    h q_mod = q;
+    if use_q6 {
+        q_mod = prom_q6_modulate(q, 10.0, 0.5, "fused");
+    }
+    h v_raw = tape_matmul(x_id, V_mod);
+    h v = prom_substrate_resample(v_raw, v_scale);
+
+    h k = tape_const(K_const);
+    h kt = tape_transpose(k);
+    h scores = tape_matmul(q_mod, kt);
+    h attn = prom_substrate_softmax(scores, smod_alpha);
+    return tape_matmul(attn, v);
+}
+
+fn build_model(arm, vocab_size, d_model, ff_dim, seq_len, seed) {
+    h emb = prom_embedding_new(vocab_size, d_model, seed);
+    h s1 = dict_get(emb, "rng_state");
+    h attn = prom_attention_substrate_k_new(d_model, seq_len, s1 + 11);
+    h s2 = dict_get(attn, "rng_state");
+    h ln1 = prom_layernorm_new(d_model, s2);
+    h ff_up = prom_linear_new(d_model, ff_dim, s2 + 13);
+    h s3 = dict_get(ff_up, "rng_state");
+    h ff_down = prom_linear_new(ff_dim, d_model, s3);
+    h s4 = dict_get(ff_down, "rng_state");
+    h ln2 = prom_layernorm_new(d_model, s4);
+    h head = prom_linear_new(d_model, vocab_size, s4 + 17);
+    h m = dict_new();
+    dict_set(m, "arm", arm);
+    dict_set(m, "emb", emb);
+    dict_set(m, "attn", attn);
+    dict_set(m, "ln1", ln1);
+    dict_set(m, "ff_up", ff_up);
+    dict_set(m, "ff_down", ff_down);
+    dict_set(m, "ln2", ln2);
+    dict_set(m, "head", head);
+    return m;
+}
+
+fn forward_window(model, token_ids, pe_table) {
+    h arm = dict_get(model, "arm");
+    h x = prom_embedding_batch(dict_get(model, "emb"), token_ids);
+    h pe_rows = [];
+    h i = 0;
+    while i < arr_len(token_ids) { arr_push(pe_rows, arr_get(pe_table, i)); i = i + 1; }
+    x = tape_add(x, tape_const(pe_rows));
+    h attn_out = null;
+    if arm == "baseline" {
+        attn_out = prom_attention_substrate_k_forward(dict_get(model, "attn"), x);
+    } elif arm == "gradmod" {
+        attn_out = attn_forward_with_grad_mod(dict_get(model, "attn"), x, 64.0, 0.5, false);
+    } else {
+        # gradmod_q6
+        attn_out = attn_forward_with_grad_mod(dict_get(model, "attn"), x, 64.0, 0.5, true);
+    }
+    h x_post = tape_add(x, attn_out);
+    h n1 = prom_layernorm_forward(dict_get(model, "ln1"), x_post);
+    h up = prom_linear_forward(dict_get(model, "ff_up"), n1);
+    h down = prom_linear_forward(dict_get(model, "ff_down"), prom_relu(up));
+    h x_ff = tape_add(x_post, down);
+    h n2 = prom_layernorm_forward(dict_get(model, "ln2"), x_ff);
+    return prom_linear_forward(dict_get(model, "head"), n2);
+}
+
+fn collect_all(model) {
+    h attn_p = prom_attention_substrate_k_params(dict_get(model, "attn"));
+    h other = prom_collect_params_v2([
+        dict_get(model, "emb"),
+        dict_get(model, "ln1"),
+        dict_get(model, "ff_up"),
+        dict_get(model, "ff_down"),
+        dict_get(model, "ln2"),
+        dict_get(model, "head"),
+    ]);
+    h out = [];
+    h i = 0;
+    while i < arr_len(attn_p) { arr_push(out, arr_get(attn_p, i)); i = i + 1; }
+    i = 0;
+    while i < arr_len(other) { arr_push(out, arr_get(other, i)); i = i + 1; }
+    return out;
+}
+
+fn train(arm, vocab_size, ids, seq_len, d_model, ff_dim, lr, steps, seed) {
+    tape_reset();
+    h model = build_model(arm, vocab_size, d_model, ff_dim, seq_len, seed);
+    h params = collect_all(model);
+    h opt = prom_adamw_new(params, lr, 0.9, 0.999, 1e-8, 0.0);
+    h pe_table = prom_crt_pe_matrix(seq_len, d_model);
+    h n_windows = arr_len(ids) - seq_len - 1;
+    h tail = [];
+    h step = 0;
+    while step < steps {
+        h start = step - (step / n_windows) * n_windows;
+        h window = [];
+        h targets = [];
+        h k = 0;
+        while k < seq_len {
+            arr_push(window, arr_get(ids, start + k));
+            arr_push(targets, arr_get(ids, start + k + 1));
+            k = k + 1;
+        }
+        h logits = forward_window(model, window, pe_table);
+        h loss = prom_cross_entropy_batch(logits, targets, vocab_size);
+        tape_backward(loss);
+        prom_adamw_step(opt);
+        if step >= steps - 30 { arr_push(tail, tape_value(loss)); }
+        step = step + 1;
+    }
+    h s = 0.0; h i = 0;
+    while i < arr_len(tail) { s = s + arr_get(tail, i); i = i + 1; }
+    return s / arr_len(tail);
+}
+
+fn mean_arr(xs) {
+    h s = 0.0; h i = 0;
+    while i < arr_len(xs) { s = s + arr_get(xs, i); i = i + 1; }
+    return s / arr_len(xs);
+}
+
+fn main() {
+    print("=== substrate-aware backward gradients A/B (task #284) ===");
+    h text = "the rain in spain falls mainly on the plain and the sun rises in the east while the moon hides behind the mountain peaks of distant lands";
+    h vocab = build_vocab(text);
+    h vocab_size = arr_len(dict_get(vocab, "chars"));
+    h ids = encode(text, vocab);
+    h seq_len = 16;
+    h d_model = 32;
+    h ff_dim = 64;
+    h lr = 0.005;
+    h steps = 250;
+    h seeds = [42, 7, 123];
+
+    print(concat_many("d_model=", to_string(d_model),
+        "  steps=", to_string(steps),
+        "  seeds=", to_string(arr_len(seeds))));
+    print("");
+
+    h arms = ["baseline", "gradmod", "gradmod_q6"];
+    h labels = dict_new();
+    dict_set(labels, "baseline",   "baseline (no gm)  ");
+    dict_set(labels, "gradmod",    "+ substrate gm   ");
+    dict_set(labels, "gradmod_q6", "+ substrate gm + Q6");
+
+    h results = dict_new();
+    h ai = 0;
+    while ai < arr_len(arms) {
+        h arm = arr_get(arms, ai);
+        h losses = [];
+        h si = 0;
+        while si < arr_len(seeds) {
+            h seed = arr_get(seeds, si);
+            h L = train(arm, vocab_size, ids, seq_len, d_model, ff_dim, lr, steps, seed);
+            arr_push(losses, L);
+            si = si + 1;
+        }
+        dict_set(results, arm, losses);
+        h mu = mean_arr(losses);
+        print(concat_many(dict_get(labels, arm), " mean=", to_string(mu)));
+        ai = ai + 1;
+    }
+
+    print("");
+    print("=== headline ===");
+    h base_mu = mean_arr(dict_get(results, "baseline"));
+    ai = 0;
+    while ai < arr_len(arms) {
+        h arm = arr_get(arms, ai);
+        h mu = mean_arr(dict_get(results, arm));
+        h delta = mu - base_mu;
+        h pct = (delta / base_mu) * 100.0;
+        h wins = 0;
+        h si = 0;
+        while si < arr_len(seeds) {
+            if arr_get(dict_get(results, arm), si) < arr_get(dict_get(results, "baseline"), si) {
+                wins = wins + 1;
+            }
+            si = si + 1;
+        }
+        print(concat_many(dict_get(labels, arm),
+            " mean=", to_string(mu),
+            "  Δ=", to_string(delta),
+            "  (", to_string(pct), "%)",
+            "  wins ", to_string(wins), "/", to_string(arr_len(seeds))));
+        ai = ai + 1;
+    }
+}
+
+main();
diff --git a/experiments/prometheus_parity/V0810_SUBSTRATE_AWARE_BACKWARD.md b/experiments/prometheus_parity/V0810_SUBSTRATE_AWARE_BACKWARD.md
@@ -0,0 +1,134 @@
+# v0.8.10 — substrate-aware backward gradients: TRIED, falsified at this scale
+
+## Headline
+
+Built and tested `tape_substrate_grad_mod(x, scale, alpha)` — a fused
+tape op with identity forward but **substrate-shaped backward**. The
+gradient is amplified when it pulls θ toward the nearest Fibonacci attractor,
+dampened when it pushes θ away. The substrate as a gradient-flow
+preconditioner instead of (or in addition to) a forward modulator.
+
+**Result**: mean tail loss is **8.4% higher** than baseline at d_model=32 with substrate
+backward applied to Q and V. The loss landscape pulls harder than
+substrate alignment can resist. **Hypothesis falsified at this scale.**
+
+Four reformulations are scoped for future chapters (none rushed today).
+
+## Construction
+
+The op is mathematically:
+
+```
+forward:   y = x                                    # identity
+backward:
+  for each cell:
+    xs = round(x · scale)
+    (attractor, dist) = nearest_attractor_with_dist(xs)
+    if dist == 0:    dx = dy                        # on attractor, passthrough
+    else:
+      dir = sign(attractor - xs)
+      pulls_toward = sign(dy) · dir < 0             # update -lr·dy moves toward attractor
+      dx = dy · (1 + alpha) if pulls_toward         # amplify
+           else dy · 1/(1 + alpha)                  # dampen
+```
+
+The sign math: parameter update is `θ ← θ − lr · grad`. If the attractor is
+above x (`dir > 0`), the update must be POSITIVE → grad must be NEGATIVE.
+Amplifying grad in that case = good. If grad is positive when the attractor
+is above, the update pushes x further from the attractor → dampen.
+
+**Smoke test verifies math** (scale=10, alpha=0.5):
+
+| x | xs | nearest_attractor | dist | dir | grad | result | expected |
+|---|---|---|---|---|--:|--:|--:|
+| 0.6 | 6 | 5 | 1 | -1 | +1 | **1.5** | 1.5 (amplify) ✓ |
+| 0.7 | 7 | 8 | 1 | +1 | +1 | **0.667** | 0.667 (dampen) ✓ |
+| 0.5 | 5 | 5 | 0 | — | +1 | **1.0** | 1.0 (passthrough) ✓ |
+
+Math correct end-to-end.
+
+## A/B at d_model=32, 250 steps, 3 seeds
+
+Wrapped Q and V projection params in `tape_substrate_grad_mod(node, 64, 0.5)`
+before the matmul (forward unchanged; backward biased).
+
+| arm | mean tail loss | Δ vs baseline | wins |
+|---|--:|--:|--:|
+| baseline | 1.998 | — | — |
+| + substrate gm | 2.165 | **+8.4%** | 1/3 |
+| + substrate gm + Q6 | 2.157 | **+7.9%** | 1/3 |
+
+**Falsified.** Substrate-shaped gradient bias hurts training at this
+scale. The hypothesis was that pulling Q/V toward attractor positions
+during training would regularize like substrate-init was supposed to,
+without the rigidity of init-time snapping. The result says: the loss
+landscape gradient is informative, and biasing it toward
+substrate-aligned positions costs more than it gains.
+
+This mirrors the v0.8.8 substrate-init falsification — both "constrain
+toward substrate" hypotheses fail. The substrate is good at:
+- **Forward modulation** (Q6, S-MOD, V-resample) — explicit substrate
+  shaping of activations
+- **Architectural priors** (CRT-PE, Fibonacci attractor table) —
+  substrate in the data and structure
+- **Post-training pattern** (v0.8.8 finding) — substrate emerges in
+  attention after Q6 training
+
+The substrate is NOT good at:
+- **Init-time constraint** (v0.8.8 #3 falsified)
+- **Gradient-time bias** (v0.8.10 falsified)
+
+Pattern: **the substrate works when applied to outputs (forward modulation)
+or revealed by training (post-train alignment), but NOT when forced on
+inputs or gradients.** The information flow direction matters.
+
+## What's NOT ruled out (future chapter reformulations)
+
+1. **Different scale**: scale=64 may be too coarse. scale=1024 or scale
+   per-layer (computed from param magnitude statistics) may give
+   gentler bias that the loss can integrate.
+
+2. **Apply to FF instead of attention**: attention Q/V are loss-critical;
+   FF down-projection weights may be more tolerant of substrate bias.
+
+3. **Decay alpha during training**: start with strong substrate bias
+   (alpha=0.5), decay linearly to 0 over training. Substrate as a
+   warm-start regularizer.
+
+4. **Substrate as REGULARIZATION TERM, not gradient bias**: add
+   `sum(attractor_distance(param)) · lambda` to the loss. The gradient
+   then picks up a substrate component naturally, without overriding the loss.
+
+Each is its own chapter. v0.8.10 ships the negative honestly.
+
+## Where it lands in the substrate-IS-architecture map
+
+The substrate has been validated at 5 layers across v0.8:
+1. **Data** — CRT-PE positional encoding (cross-validates)
+2. **Algorithm** — substrate-K + S-MOD + V-resample (cross-validates)
+3. **Hardware tile** — 8×32 wavefront-aligned (cross-validates +38-61%)
+4. **Post-training attention pattern** — Q6 → 8.3× concentration
+   (v0.8.8 finding)
+5. **Multi-head Q6 compound** — −3.57% vs baseline (v0.8.9 confirms)
+
+Now-falsified attempts:
+- **Init-time substrate-snap** — substrate-init regularization
+  (v0.8.8 #3)
+- **Gradient-time substrate-pull** — substrate backward modulation
+  (v0.8.10 this chapter)
+
+The empirical map is: substrate at OUTPUTS or in STRUCTURE works.
+Substrate as INPUT constraint or BACKWARD bias does not (at current
+scales, with current scale parameter, on current architectures).
+
+## Files
+
+- `omnimcode-core/src/interpreter.rs` — `TapeOp::SubstrateGradMod`
+  variant + `tape_substrate_grad_mod` dispatch + substrate-aware
+  backward
+- `examples/prometheus_substrate_grad_mod_xval.omc` — 3-arm A/B
+- `experiments/prometheus_parity/V0810_SUBSTRATE_AWARE_BACKWARD.md`
+
+## Tests
+
+**1111/1111 OMC tests pass.**
diff --git a/omnimcode-core/src/interpreter.rs b/omnimcode-core/src/interpreter.rs