Commit 06a7c16

v0.8.10 substrate-aware backward gradients: TRIED, falsified at this scale

Built tape_substrate_grad_mod(x, scale, alpha): identity forward, but
backward amplifies gradient components pulling θ toward nearest Fibonacci
attractor and dampens components pushing away. The substrate as
gradient-flow preconditioner instead of forward modulator.

Math verified on 3 hand-checked smoke cases (x=0.6/0.7/0.5 at scale=10,
alpha=0.5, all match analytical expectation).

A/B at d_model=32, 250 steps, 3 seeds (wrap Q and V in grad_mod before
matmul, forward unchanged):

  baseline              1.998
  + substrate gm        2.165  (+8.4%, wins 1/3)
  + substrate gm + Q6   2.157  (+7.9%, wins 1/3)

Falsified at this scale. Loss landscape pulls harder than substrate
alignment can resist. Two "constrain toward substrate" hypotheses now
falsified (this + v0.8.8 #3 substrate-init).

The empirical map after v0.8: substrate at OUTPUTS or in STRUCTURE works
(Q6, S-MOD, CRT-PE, 8x32 tile, MH-Q6 compound). Substrate as INPUT
constraint or BACKWARD bias does not (at current scales).

Reformulations possible (each its own chapter): different scale, apply
to FF not attention, decay alpha during training, use as regularization
term not gradient bias. v0.8.10 ships the honest negative.

#2 d_model=128 larger-scale bench still running (22 min in, buffered
output won't print until exit); lands in v0.8.11.

Files:
  omnimcode-core/src/interpreter.rs  TapeOp::SubstrateGradMod
  examples/prometheus_substrate_grad_mod_xval.omc  3-arm A/B
  experiments/prometheus_parity/V0810_SUBSTRATE_AWARE_BACKWARD.md  writeup

1111/1111 OMC tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

1 parent 6407f83 commit 06a7c16

3 files changed

Lines changed: 481 additions & 0 deletions

Lines changed: 249 additions & 0 deletions
@@ -0,0 +1,249 @@
# Substrate-aware backward gradients A/B (task #284, v0.8.10 research item).
#
# Forward is identity; backward amplifies gradient components pulling params
# toward Fibonacci attractors and dampens components pushing away. The
# substrate as gradient-flow regularizer.
#
# Three arms at d_model=32, single-head, 250 steps, 3 seeds:
#   A. baseline   plain L1+SMOD+V
#   B. grad_mod   wrap Q and V projections with tape_substrate_grad_mod
#                 before tape_matmul. Forward UNCHANGED; backward biased.
#   C. + Q6       grad_mod + Q6 fused, to see if substrate-shaped backward
#                 compounds with substrate-shaped forward modulation.
#
# Hypothesis: gradient bias toward attractors might regularize Q/V like
# substrate-init was supposed to, but at TRAINING time instead of init,
# which lets parameters drift if the loss landscape pulls hard.

import "examples/lib/prometheus.omc";

fn build_vocab(text) {
    h seen = dict_new();
    h chars = [];
    h i = 0;
    while i < str_len(text) {
        h ch = str_slice(text, i, i + 1);
        if !dict_has(seen, ch) { dict_set(seen, ch, arr_len(chars)); arr_push(chars, ch); }
        i = i + 1;
    }
    h v = dict_new();
    dict_set(v, "chars", chars);
    dict_set(v, "lookup", seen);
    return v;
}

fn encode(text, vocab) {
    h lookup = dict_get(vocab, "lookup");
    h ids = [];
    h i = 0;
    while i < str_len(text) {
        h ch = str_slice(text, i, i + 1);
        arr_push(ids, dict_get(lookup, ch));
        i = i + 1;
    }
    return ids;
}

# Custom attention forward that wraps Q and V tape nodes in substrate_grad_mod
# before they enter the matmul. This biases their gradients without changing
# the forward computation.
fn attn_forward_with_grad_mod(layer, x_id, gm_scale, gm_alpha, use_q6) {
    h Q_w = dict_get(layer, "Q");
    h V_w = dict_get(layer, "V");
    h K_const = dict_get(layer, "K_const");
    h smod_alpha = dict_get(layer, "smod_alpha");
    h v_scale = dict_get(layer, "v_resample_scale");
    if v_scale == null { v_scale = 0.0; }

    # Wrap Q and V param tape nodes; this biases backward flow into them.
    h Q_mod = tape_substrate_grad_mod(Q_w, gm_scale, gm_alpha);
    h V_mod = tape_substrate_grad_mod(V_w, gm_scale, gm_alpha);

    h q = tape_matmul(x_id, Q_mod);
    h q_mod = q;
    if use_q6 {
        q_mod = prom_q6_modulate(q, 10.0, 0.5, "fused");
    }
    h v_raw = tape_matmul(x_id, V_mod);
    h v = prom_substrate_resample(v_raw, v_scale);

    h k = tape_const(K_const);
    h kt = tape_transpose(k);
    h scores = tape_matmul(q_mod, kt);
    h attn = prom_substrate_softmax(scores, smod_alpha);
    return tape_matmul(attn, v);
}

fn build_model(arm, vocab_size, d_model, ff_dim, seq_len, seed) {
    h emb = prom_embedding_new(vocab_size, d_model, seed);
    h s1 = dict_get(emb, "rng_state");
    h attn = prom_attention_substrate_k_new(d_model, seq_len, s1 + 11);
    h s2 = dict_get(attn, "rng_state");
    h ln1 = prom_layernorm_new(d_model, s2);
    h ff_up = prom_linear_new(d_model, ff_dim, s2 + 13);
    h s3 = dict_get(ff_up, "rng_state");
    h ff_down = prom_linear_new(ff_dim, d_model, s3);
    h s4 = dict_get(ff_down, "rng_state");
    h ln2 = prom_layernorm_new(d_model, s4);
    h head = prom_linear_new(d_model, vocab_size, s4 + 17);
    h m = dict_new();
    dict_set(m, "arm", arm);
    dict_set(m, "emb", emb);
    dict_set(m, "attn", attn);
    dict_set(m, "ln1", ln1);
    dict_set(m, "ff_up", ff_up);
    dict_set(m, "ff_down", ff_down);
    dict_set(m, "ln2", ln2);
    dict_set(m, "head", head);
    return m;
}

fn forward_window(model, token_ids, pe_table) {
    h arm = dict_get(model, "arm");
    h x = prom_embedding_batch(dict_get(model, "emb"), token_ids);
    h pe_rows = [];
    h i = 0;
    while i < arr_len(token_ids) { arr_push(pe_rows, arr_get(pe_table, i)); i = i + 1; }
    x = tape_add(x, tape_const(pe_rows));
    h attn_out = null;
    if arm == "baseline" {
        attn_out = prom_attention_substrate_k_forward(dict_get(model, "attn"), x);
    } elif arm == "gradmod" {
        attn_out = attn_forward_with_grad_mod(dict_get(model, "attn"), x, 64.0, 0.5, false);
    } else {
        # gradmod_q6
        attn_out = attn_forward_with_grad_mod(dict_get(model, "attn"), x, 64.0, 0.5, true);
    }
    h x_post = tape_add(x, attn_out);
    h n1 = prom_layernorm_forward(dict_get(model, "ln1"), x_post);
    h up = prom_linear_forward(dict_get(model, "ff_up"), n1);
    h down = prom_linear_forward(dict_get(model, "ff_down"), prom_relu(up));
    h x_ff = tape_add(x_post, down);
    h n2 = prom_layernorm_forward(dict_get(model, "ln2"), x_ff);
    return prom_linear_forward(dict_get(model, "head"), n2);
}

fn collect_all(model) {
    h attn_p = prom_attention_substrate_k_params(dict_get(model, "attn"));
    h other = prom_collect_params_v2([
        dict_get(model, "emb"),
        dict_get(model, "ln1"),
        dict_get(model, "ff_up"),
        dict_get(model, "ff_down"),
        dict_get(model, "ln2"),
        dict_get(model, "head"),
    ]);
    h out = [];
    h i = 0;
    while i < arr_len(attn_p) { arr_push(out, arr_get(attn_p, i)); i = i + 1; }
    i = 0;
    while i < arr_len(other) { arr_push(out, arr_get(other, i)); i = i + 1; }
    return out;
}

fn train(arm, vocab_size, ids, seq_len, d_model, ff_dim, lr, steps, seed) {
    tape_reset();
    h model = build_model(arm, vocab_size, d_model, ff_dim, seq_len, seed);
    h params = collect_all(model);
    h opt = prom_adamw_new(params, lr, 0.9, 0.999, 1e-8, 0.0);
    h pe_table = prom_crt_pe_matrix(seq_len, d_model);
    h n_windows = arr_len(ids) - seq_len - 1;
    h tail = [];
    h step = 0;
    while step < steps {
        h start = step - (step / n_windows) * n_windows;
        h window = [];
        h targets = [];
        h k = 0;
        while k < seq_len {
            arr_push(window, arr_get(ids, start + k));
            arr_push(targets, arr_get(ids, start + k + 1));
            k = k + 1;
        }
        h logits = forward_window(model, window, pe_table);
        h loss = prom_cross_entropy_batch(logits, targets, vocab_size);
        tape_backward(loss);
        prom_adamw_step(opt);
        if step >= steps - 30 { arr_push(tail, tape_value(loss)); }
        step = step + 1;
    }
    h s = 0.0; h i = 0;
    while i < arr_len(tail) { s = s + arr_get(tail, i); i = i + 1; }
    return s / arr_len(tail);
}

fn mean_arr(xs) {
    h s = 0.0; h i = 0;
    while i < arr_len(xs) { s = s + arr_get(xs, i); i = i + 1; }
    return s / arr_len(xs);
}

fn main() {
    print("=== substrate-aware backward gradients A/B (task #284) ===");
    h text = "the rain in spain falls mainly on the plain and the sun rises in the east while the moon hides behind the mountain peaks of distant lands";
    h vocab = build_vocab(text);
    h vocab_size = arr_len(dict_get(vocab, "chars"));
    h ids = encode(text, vocab);
    h seq_len = 16;
    h d_model = 32;
    h ff_dim = 64;
    h lr = 0.005;
    h steps = 250;
    h seeds = [42, 7, 123];

    print(concat_many("d_model=", to_string(d_model),
                      " steps=", to_string(steps),
                      " seeds=", to_string(arr_len(seeds))));
    print("");

    h arms = ["baseline", "gradmod", "gradmod_q6"];
    h labels = dict_new();
    dict_set(labels, "baseline", "baseline (no gm) ");
    dict_set(labels, "gradmod", "+ substrate gm ");
    dict_set(labels, "gradmod_q6", "+ substrate gm + Q6");

    h results = dict_new();
    h ai = 0;
    while ai < arr_len(arms) {
        h arm = arr_get(arms, ai);
        h losses = [];
        h si = 0;
        while si < arr_len(seeds) {
            h seed = arr_get(seeds, si);
            h L = train(arm, vocab_size, ids, seq_len, d_model, ff_dim, lr, steps, seed);
            arr_push(losses, L);
            si = si + 1;
        }
        dict_set(results, arm, losses);
        h mu = mean_arr(losses);
        print(concat_many(dict_get(labels, arm), " mean=", to_string(mu)));
        ai = ai + 1;
    }

    print("");
    print("=== headline ===");
    h base_mu = mean_arr(dict_get(results, "baseline"));
    ai = 0;
    while ai < arr_len(arms) {
        h arm = arr_get(arms, ai);
        h mu = mean_arr(dict_get(results, arm));
        h delta = mu - base_mu;
        h pct = (delta / base_mu) * 100.0;
        h wins = 0;
        h si = 0;
        while si < arr_len(seeds) {
            if arr_get(dict_get(results, arm), si) < arr_get(dict_get(results, "baseline"), si) {
                wins = wins + 1;
            }
            si = si + 1;
        }
        print(concat_many(dict_get(labels, arm),
                          " mean=", to_string(mu),
                          " Δ=", to_string(delta),
                          " (", to_string(pct), "%)",
                          " wins ", to_string(wins), "/", to_string(arr_len(seeds))));
        ai = ai + 1;
    }
}

main();
Lines changed: 134 additions & 0 deletions
@@ -0,0 +1,134 @@
# v0.8.10 substrate-aware backward gradients: TRIED, falsified at this scale

## Headline

Built and tested `tape_substrate_grad_mod(x, scale, alpha)`, a fused
tape op with identity forward but **substrate-shaped backward**. The
gradient is amplified when it pulls θ toward the nearest Fibonacci
attractor and dampened when it pushes θ away. The substrate acts as a
gradient-flow preconditioner instead of (or in addition to) a forward
modulator.

**Result**: mean tail loss is **+8.4% worse** than baseline at d_model=32
with substrate backward applied to Q and V. The loss landscape pulls
harder than substrate alignment can resist. **Hypothesis falsified at
this scale.**

Four reformulations are scoped for future chapters (none rushed today).

## Construction

The op is mathematically:

```
forward:  y = x                                  # identity
backward:
  for each cell:
    xs = round(x · scale)
    (attractor, dist) = nearest_attractor_with_dist(xs)
    if dist == 0:  dx = dy                       # on attractor, passthrough
    else:
      dir = sign(attractor - xs)
      pulls_toward = sign(dy) · dir < 0          # update -lr·dy moves toward attractor
      dx = dy · (1 + alpha)    if pulls_toward   # amplify
           else dy · 1/(1 + alpha)               # dampen
```

The sign math: the parameter update is `θ ← θ − lr · grad`. If the
attractor is above x (`dir > 0`), the update must be POSITIVE, so grad
must be NEGATIVE; amplifying grad in that case is good. If grad is
positive when the attractor is above, the update pushes x further from
the attractor, so it is dampened. The mirror case (`dir < 0`) flips both
signs, which is why the test is `sign(grad) · dir < 0`.

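The sign rule can be written out as a short derivation. This is a sketch that assumes a plain gradient-descent step with `lr > 0`, as the paragraph above does; the symbols `a` (the nearest attractor), `x_s` (the scaled, rounded cell) and `g` (the incoming gradient `dy`) are notation introduced here for illustration.

```latex
\Delta\theta = -\,\mathrm{lr}\cdot g,
\qquad
\mathrm{dir} = \operatorname{sign}(a - x_s)

\text{update moves toward } a
\iff \operatorname{sign}(\Delta\theta) = \mathrm{dir}
\iff -\operatorname{sign}(g) = \mathrm{dir}
\iff \operatorname{sign}(g)\cdot\mathrm{dir} < 0

dx =
\begin{cases}
  dy & \text{if } \mathrm{dist} = 0 \\
  dy\,(1+\alpha) & \text{if } \operatorname{sign}(g)\cdot\mathrm{dir} < 0 \\
  dy\,/\,(1+\alpha) & \text{otherwise}
\end{cases}
```
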
**Smoke test verifies math** (scale=10, alpha=0.5):

| x | xs | nearest_attractor | dist | dir | grad | result | expected |
|---|---|---|---|---|--:|--:|--:|
| 0.6 | 6 | 5 | 1 | -1 | +1 | **1.5** | 1.5 (amplify) ✓ |
| 0.7 | 7 | 8 | 1 | +1 | +1 | **0.667** | 0.667 (dampen) ✓ |
| 0.5 | 5 | 5 | 0 | n/a | +1 | **1.0** | 1.0 (passthrough) ✓ |

All three cases match the hand-computed values.

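The three smoke cases can be reproduced outside the interpreter. The following is a minimal Python sketch of the backward rule, not the Rust code in `interpreter.rs`: the helper names `fib_attractors` and `grad_mod_backward` are hypothetical, and the handling of negative inputs (mirrored table) and of ties (smaller attractor wins) are assumptions that the writeup does not specify.

```python
def fib_attractors(limit):
    """Sorted Fibonacci values 0, 1, 2, 3, 5, 8, ... up to the first one >= limit."""
    seq = [0, 1]
    while seq[-1] < limit:
        seq.append(seq[-1] + seq[-2])
    return sorted(set(seq))


def nearest_attractor_with_dist(xs):
    """Nearest attractor to the integer xs, plus the absolute distance to it.

    Assumptions: negative inputs mirror the positive table, and ties go to
    the smaller attractor. The real attractor table may behave differently.
    """
    mag = abs(xs)
    best = min(fib_attractors(mag + 1), key=lambda a: (abs(a - mag), a))
    attractor = best if xs >= 0 else -best
    return attractor, abs(attractor - xs)


def grad_mod_backward(x, dy, scale, alpha):
    """Backward rule for one cell; the forward pass is simply y = x."""
    xs = round(x * scale)
    attractor, dist = nearest_attractor_with_dist(xs)
    if dist == 0 or dy == 0:
        return dy  # on an attractor (or zero gradient): passthrough
    direction = 1 if attractor > xs else -1
    grad_sign = 1 if dy > 0 else -1
    # The step -lr * dy moves toward the attractor when the signs differ.
    pulls_toward = grad_sign * direction < 0
    return dy * (1 + alpha) if pulls_toward else dy / (1 + alpha)


# The three rows of the smoke-test table (scale=10, alpha=0.5, grad=+1).
for x in (0.6, 0.7, 0.5):
    print(x, round(grad_mod_backward(x, 1.0, scale=10, alpha=0.5), 3))
```

Under these assumptions the loop prints 1.5, 0.667 and 1.0 for the three inputs, matching the `result` column above.
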
## A/B at d_model=32, 250 steps, 3 seeds

Wrapped Q and V projection params in `tape_substrate_grad_mod(node, 64, 0.5)`
before the matmul (forward unchanged; backward biased). Tail loss is the
mean loss over the last 30 training steps, averaged across seeds; `wins`
counts the seeds on which the arm's tail loss is below baseline's.

| arm | mean tail loss | Δ vs baseline | wins |
|---|--:|--:|--:|
| baseline | 1.998 | | |
| + substrate gm | 2.165 | **+8.4%** | 1/3 |
| + substrate gm + Q6 | 2.157 | **+7.9%** | 1/3 |

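The Δ and wins columns follow the same arithmetic as the headline loop in the `.omc` script. A small Python sketch of that arithmetic is given here; the function names are hypothetical. Only the rounded means from the table are used, so the last digit may differ slightly from the table, which was presumably computed from unrounded per-seed losses (these are not listed in the writeup).

```python
def relative_delta_pct(arm_mean, baseline_mean):
    """Percent change in mean tail loss relative to baseline (positive = worse)."""
    return (arm_mean - baseline_mean) / baseline_mean * 100.0


def count_wins(arm_losses, baseline_losses):
    """Number of seeds where the arm's tail loss is strictly below baseline's."""
    return sum(a < b for a, b in zip(arm_losses, baseline_losses))


baseline_mean = 1.998  # rounded mean from the table
for label, mean in (("+ substrate gm", 2.165), ("+ substrate gm + Q6", 2.157)):
    print(f"{label}: {relative_delta_pct(mean, baseline_mean):+.2f}%")
```

`count_wins` takes the per-seed tail losses for an arm and for baseline in the same seed order, which is how the script compares `results[arm][si]` against `results["baseline"][si]`.
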
**Falsified.** Substrate-shaped gradient bias hurts training at this
scale. The hypothesis was that pulling Q/V toward attractor positions
during training would regularize like substrate-init was supposed to,
without the rigidity of init-time snapping. The result says: the loss
landscape gradient is informative, and biasing it toward
substrate-aligned positions costs more than it gains.

This mirrors the v0.8.8 substrate-init falsification: both "constrain
toward substrate" hypotheses fail. The substrate is good at:
- **Forward modulation** (Q6, S-MOD, V-resample): explicit substrate
  shaping of activations
- **Architectural priors** (CRT-PE, Fibonacci attractor table):
  substrate in the data and structure
- **Post-training pattern** (v0.8.8 finding): substrate emerges in
  attention after Q6 training

The substrate is NOT good at:
- **Init-time constraint** (v0.8.8 #3 falsified)
- **Gradient-time bias** (v0.8.10 falsified)

Pattern: **the substrate works when applied to outputs (forward modulation)
or revealed by training (post-train alignment), but NOT when forced on
inputs or gradients.** The information flow direction matters.

## What's NOT ruled out (future chapter reformulations)

1. **Different scale**: scale=64 may be too coarse. scale=1024, or a
   per-layer scale (computed from param magnitude statistics), may give
   a gentler bias that the loss can integrate.

2. **Apply to FF instead of attention**: attention Q/V are loss-critical;
   FF down-projection weights may be more tolerant of substrate bias.

3. **Decay alpha during training**: start with strong substrate bias
   (alpha=0.5), decay linearly to 0 over training. Substrate as a
   warm-start regularizer.

4. **Substrate as REGULARIZATION TERM, not gradient bias**: add
   `sum(attractor_distance(param)) · lambda` to the loss. The gradient
   then has a substrate component naturally; it doesn't override the loss.

Each is its own chapter. v0.8.10 ships the negative honestly.

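Reformulations 3 and 4 are small enough to sketch. The Python below is illustrative only and has not been run as an experiment: `linear_alpha`, `attractor_distance` and `substrate_penalty` are hypothetical names, and the penalty uses the unrounded distance to the nearest Fibonacci value (an assumption, chosen so that the term has a non-zero gradient almost everywhere).

```python
def linear_alpha(step, total_steps, alpha0=0.5):
    """Reformulation 3: decay alpha linearly from alpha0 at step 0 to 0 at total_steps."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return alpha0 * (1.0 - frac)


def fib_attractors(limit):
    """Sorted Fibonacci values 0, 1, 2, 3, 5, 8, ... up to the first one >= limit."""
    seq = [0, 1]
    while seq[-1] < limit:
        seq.append(seq[-1] + seq[-2])
    return sorted(set(seq))


def attractor_distance(value, scale):
    """Unrounded distance from |value * scale| to the nearest Fibonacci value."""
    scaled = abs(value * scale)
    return min(abs(a - scaled) for a in fib_attractors(scaled + 1))


def substrate_penalty(params, scale, lam):
    """Reformulation 4: lam * sum of attractor distances, to be added to the loss."""
    return lam * sum(attractor_distance(p, scale) for p in params)


print(linear_alpha(125, 250))                       # halfway through a 250-step run
print(substrate_penalty([0.6, 0.7, 0.5], 10, 0.01))  # the three smoke-test inputs
```

With this schedule, `alpha` for the gradient-mod op would be recomputed each step and passed in place of the fixed 0.5; the penalty would instead leave the backward pass untouched and add a scalar to the cross-entropy loss.
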
## Where it lands in the substrate-IS-architecture map

The substrate has been validated at 5 layers across v0.8:
1. **Data**: CRT-PE positional encoding (cross-validates)
2. **Algorithm**: substrate-K + S-MOD + V-resample (cross-validates)
3. **Hardware tile**: 8×32 wavefront-aligned (cross-validates +38-61%)
4. **Post-training attention pattern**: Q6 → 8.3× concentration
   (v0.8.8 finding)
5. **Multi-head Q6 compound**: −3.57% vs baseline (v0.8.9 confirms)

Now-falsified attempts:
- **Init-time substrate-snap**: substrate-init regularization
  (v0.8.8 #3)
- **Gradient-time substrate-pull**: substrate backward modulation
  (v0.8.10, this chapter)

The empirical map is: substrate at OUTPUTS or in STRUCTURE works.
Substrate as INPUT constraint or BACKWARD bias does not (at current
scales, with current scale parameter, on current architectures).

## Files

- `omnimcode-core/src/interpreter.rs`: `TapeOp::SubstrateGradMod`
  variant + `tape_substrate_grad_mod` dispatch + substrate-aware
  backward
- `examples/prometheus_substrate_grad_mod_xval.omc`: 3-arm A/B
- `experiments/prometheus_parity/V0810_SUBSTRATE_AWARE_BACKWARD.md`

## Tests

**1111/1111 OMC tests pass.**
