Prometheus: per-row LayerNorm + broadcast-aware tape ops + multi-token attention

RandomCoder-lab · claude · RandomCoder-lab · commit 05c9e4abd156 · 2026-05-16T23:57:35.000-05:00
The plumbing needed to make multi-token transformer training real.

Rust additions (omnimcode-core/src/interpreter.rs):

(1) tape_layernorm(x, gamma, beta, eps?) — fused per-row LayerNorm
    Forward: normalize each row to zero mean / unit variance, scale
    by gamma, add beta. Backward: full LayerNorm gradient (dx with
    proper centered/scaled terms, dgamma, dbeta).
    Composing this from primitives needed broadcast sub/div that
    weren't on the tape; fused op is cleaner + faster.

(2) tape_row_mean(x) / tape_row_sum(x) — per-row reductions
    [rows, cols] → [rows, 1]. Backward spreads each row's upstream
    gradient back across that row's columns (scaled by 1/cols for
    the mean). Building blocks for any per-row scaling.

(3) tape_add / tape_sub now support row + col vector broadcast
    [N, C] + [1, C] (Linear's bias add)
    [N, C] + [N, 1] (per-row scaling)
    Forward picks the bigger shape; backward reduces upstream gradient
    back to the smaller operand's shape via new reduce_to_shape helper.

    This fixed a latent bias-gradient bug in the earlier transformer
    demo — its 11.3x loss reduction came partly from over-broadcasting
    bias grads. With correct broadcast reduction, the same demo gets
    4.15x (still real, just honest).
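
A minimal sketch of the broadcast-gradient reduction described in (3),
assuming a row-major matrix. The Mat type and this function body are
illustrative only; the real reduce_to_shape in interpreter.rs may be
structured differently.

```rust
/// Row-major matrix used only for this illustration (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct Mat {
    rows: usize,
    cols: usize,
    data: Vec<f64>,
}

/// Sum an upstream gradient of shape [N, C] down to an operand's shape.
/// Supported targets: [N, C] (copy), [1, C] (sum over rows),
/// [N, 1] (sum over columns), [1, 1] (sum over everything).
fn reduce_to_shape(grad: &Mat, rows: usize, cols: usize) -> Mat {
    assert!(rows == grad.rows || rows == 1, "row dim must match or be 1");
    assert!(cols == grad.cols || cols == 1, "col dim must match or be 1");
    let mut out = vec![0.0; rows * cols];
    for r in 0..grad.rows {
        for c in 0..grad.cols {
            // A broadcast dimension collapses to index 0 and accumulates.
            let tr = if rows == 1 { 0 } else { r };
            let tc = if cols == 1 { 0 } else { c };
            out[tr * cols + tc] += grad.data[r * grad.cols + c];
        }
    }
    Mat { rows, cols, data: out }
}
```

For a [1, C] bias added to an [N, C] activation, the bias gradient is
the column-wise sum of the upstream gradient, so it must come back
with shape [1, C] rather than [N, C].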

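For reference, the per-row math behind the fused LayerNorm in (1) can
be sketched as below. This is the standard LayerNorm forward/backward
written for a single row, not a copy of tape_layernorm; the function
names and the choice to return xhat and inv_std for reuse in the
backward pass are assumptions made for the example.

```rust
/// Forward for one row. Returns (y, xhat, inv_std); xhat and inv_std
/// are kept so the backward pass does not have to recompute them.
fn layernorm_row_forward(
    x: &[f64],
    gamma: &[f64],
    beta: &[f64],
    eps: f64,
) -> (Vec<f64>, Vec<f64>, f64) {
    let n = x.len() as f64;
    let mean = x.iter().sum::<f64>() / n;
    let var = x.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / n;
    let inv_std = 1.0 / (var + eps).sqrt();
    let xhat: Vec<f64> = x.iter().map(|v| (v - mean) * inv_std).collect();
    let y: Vec<f64> = xhat
        .iter()
        .zip(gamma)
        .zip(beta)
        .map(|((h, g), b)| h * g + b)
        .collect();
    (y, xhat, inv_std)
}

/// Backward for one row. Returns (dx, dgamma, dbeta) for this row;
/// with several rows, dgamma and dbeta are summed across rows.
fn layernorm_row_backward(
    dy: &[f64],
    xhat: &[f64],
    gamma: &[f64],
    inv_std: f64,
) -> (Vec<f64>, Vec<f64>, Vec<f64>) {
    let n = dy.len() as f64;
    let dxhat: Vec<f64> = dy.iter().zip(gamma).map(|(d, g)| d * g).collect();
    let sum_d: f64 = dxhat.iter().sum();
    let sum_dh: f64 = dxhat.iter().zip(xhat).map(|(d, h)| d * h).sum();
    // dx = inv_std / n * (n * dxhat - sum(dxhat) - xhat * sum(dxhat * xhat))
    let dx: Vec<f64> = dxhat
        .iter()
        .zip(xhat)
        .map(|(d, h)| inv_std / n * (n * d - sum_d - h * sum_dh))
        .collect();
    let dgamma: Vec<f64> = dy.iter().zip(xhat).map(|(d, h)| d * h).collect();
    let dbeta = dy.to_vec();
    (dx, dgamma, dbeta)
}
```

The two subtracted terms in dx are the "centered/scaled" corrections:
they account for the mean and the variance both depending on every
element of the row.
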
Prometheus additions (examples/lib/prometheus.omc):

(4) prom_layernorm_forward upgraded to use tape_layernorm fused op
    instead of the composed (mean, sub, exp(-0.5*log(var+eps)), ...)
    path. Cleaner, works on multi-token inputs.

(5) prom_embedding_batch(layer, token_ids[]) — multi-token lookup
    via [N, vocab] one-hot @ table. Differentiable into the table.

(6) prom_cross_entropy_batch(logits, targets, vocab) — mean of
    per-position -log(softmax) for batched LM training.
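
A plain-Rust sketch of the loss in (6), averaged over positions as
prom_cross_entropy_batch does in the diff. Note the swap: the OMC code
composes tape_softmax and tape_log, while this sketch uses the
log-sum-exp form for numerical stability. The function name and the
Vec-of-rows layout are assumptions for the example.

```rust
/// Mean over positions of -log softmax(logits[i])[targets[i]].
/// logits is [N, vocab] as a slice of rows; targets holds N class indices.
fn cross_entropy_batch(logits: &[Vec<f64>], targets: &[usize]) -> f64 {
    assert_eq!(logits.len(), targets.len());
    assert!(!logits.is_empty());
    let mut total = 0.0;
    for (row, &t) in logits.iter().zip(targets) {
        // log(sum(exp(row))) computed stably by factoring out the row max.
        let max = row.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
        let lse = max + row.iter().map(|v| (v - max).exp()).sum::<f64>().ln();
        // -log p_target = logsumexp(row) - row[target]
        total += lse - row[t];
    }
    total / logits.len() as f64
}
```

With all-equal logits the loss is ln(vocab), which is a handy sanity
check for the starting loss of an untrained model.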

A/B demo (examples/prometheus_attention_ab.omc):
    Single-block multi-token transformer on 8-token windows
    (seq_len=8), d_model=16, ff=32, AdamW, cross-entropy. Two arms:
      A: alpha=0   (vanilla softmax attention)
      B: alpha=0.5 (geodesic-bias attention)

    3 seeds × 250 steps each. Tests whether the PyTorch geodesic
    win replicates in Prometheus.

The first multi-token training run worked end-to-end through the
new plumbing. Single-seed result before extending to 3 seeds:
  vanilla=3.104  geodesic=3.119  delta=+0.46%

Geodesic is slightly worse on this seed: a genuine fail-forward.
Possible causes: single-seed noise, untuned alpha, a model that is
too small, or too few training steps. The 3-seed run is in flight;
the result will land in the next commit.

What matters infrastructure-wise: multi-token attention works in
pure OMC now. The geodesic primitive is wired correctly (numerically
identical to PyTorch). Whether it HELPS at this scale is an empirical
question we can keep iterating on without re-shipping plumbing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
diff --git a/examples/lib/prometheus.omc b/examples/lib/prometheus.omc
@@ -892,6 +892,61 @@ fn prom_embedding_params(layer) {
     return [dict_get(layer, "table")];
 }
 
+# Batched embedding lookup: token_ids[] → [N, d_model] matrix.
+# Implemented via an [N, vocab] one-hot batch then matmul with the
+# embedding table. Differentiable end-to-end.
+fn prom_embedding_batch(layer, token_ids) {
+    h vocab = dict_get(layer, "vocab");
+    h table = dict_get(layer, "table");
+    h n = arr_len(token_ids);
+    h onehot = [];
+    h i = 0;
+    while i < n {
+        h row = [];
+        h idx = arr_get(token_ids, i);
+        h j = 0;
+        while j < vocab {
+            if j == idx { arr_push(row, 1.0); }
+            else { arr_push(row, 0.0); }
+            j = j + 1;
+        }
+        arr_push(onehot, row);
+        i = i + 1;
+    }
+    h onehot_const = tape_const(onehot);
+    return tape_matmul(onehot_const, table);
+}
+
+# Batched cross-entropy: logits is [N, vocab], targets is array of N
+# integer indices. Returns scalar mean loss (averaged over positions).
+fn prom_cross_entropy_batch(logits_id, targets, vocab) {
+    h n = arr_len(targets);
+    h probs = tape_softmax(logits_id);
+    h log_probs = tape_log(probs);
+    # Build [N, vocab] mask: -1.0 at (i, targets[i]), 0 elsewhere.
+    h mask_rows = [];
+    h i = 0;
+    while i < n {
+        h row = [];
+        h tgt = arr_get(targets, i);
+        h c = 0;
+        while c < vocab {
+            if c == tgt { arr_push(row, -1.0); }
+            else { arr_push(row, 0.0); }
+            c = c + 1;
+        }
+        arr_push(mask_rows, row);
+        i = i + 1;
+    }
+    h mask = tape_const(mask_rows);
+    h selected = tape_mul(log_probs, mask);
+    # tape_sum(selected) is the total -log p_target over all N positions.
+    # A mean over all cells would divide by N * vocab; scale by 1/N instead.
+    h s = tape_sum(selected);
+    h scale = tape_const(1.0 / n);
+    return tape_mul(s, scale);
+}
+
 # ---------------------------------------------------------------------------
 # LayerNorm — normalize each row to zero mean / unit variance, then
 # scale + shift by learned gamma/beta.
@@ -923,35 +978,14 @@ fn prom_layernorm_new(d_model, rng_state) {
     return layer;
 }
 
-# Forward: x is [1, d_model] (single row); subtract mean, divide by
-# stable std, scale + shift. The Mean op already gives us per-tensor
-# mean; for per-row mean we use the same op since our inputs here are
-# single-row.
+# Forward: x is [N, d_model]; per-row layer norm via the fused
+# tape_layernorm Rust op. Works for both single-row [1, d] and
+# multi-token [seq, d] shapes — same code path.
 fn prom_layernorm_forward(layer, x_id) {
     h gamma = dict_get(layer, "gamma");
     h beta = dict_get(layer, "beta");
     h eps = dict_get(layer, "eps");
-
-    h mean_id = tape_mean(x_id);
-    # Broadcast mean as a const shaped like x; OMC's tape mul handles
-    # scalar broadcast.
-    h centered = tape_sub(x_id, mean_id);
-    h sq = tape_mul(centered, centered);
-    h variance = tape_mean(sq);
-    h std_const = tape_const(eps);
-    h denom_sq = tape_add(variance, std_const);
-    # We need sqrt(variance); use tape_pow_int(denom_sq, ...) — but
-    # pow_int can only do integer powers. Approximate sqrt via the
-    # identity sqrt(x) = x^0.5: not directly available; use exp(0.5*log(x)).
-    h log_v = tape_log(denom_sq);
-    h half = tape_const(0.5);
-    h half_log = tape_mul(log_v, half);
-    h std_inv_log = tape_neg(half_log);
-    h std_inv = tape_exp(std_inv_log);   # = 1 / sqrt(variance + eps)
-
-    h normed = tape_mul(centered, std_inv);
-    h scaled = tape_mul(normed, gamma);
-    return tape_add(scaled, beta);
+    return tape_layernorm(x_id, gamma, beta, eps);
 }
 
 fn prom_layernorm_params(layer) {
diff --git a/examples/prometheus_attention_ab.omc b/examples/prometheus_attention_ab.omc
@@ -0,0 +1,243 @@
+# Multi-token transformer with geodesic attention A/B.
+#
+# The key experiment: does the geodesic attention bias that won 3/3
+# seeds in today's PyTorch experiment also help when training a
+# Prometheus model from scratch on real text?
+#
+# Architecture (8-token sliding window):
+#   tokens[8]
+#     ↓ Embedding → [8, d_model]
+#     ↓ + CRT-PE
+#   x
+#     ↓ Attention (with OR without geodesic bias on positions)
+#     ↓ + residual
+#     ↓ LayerNorm
+#     ↓ FFN
+#     ↓ + residual
+#     ↓ LayerNorm
+#     ↓ head → [8, vocab]
+#   logits, target the next token at every position.
+#
+# A: alpha=0  → vanilla softmax attention (geodesic OFF)
+# B: alpha=0.5 → geodesic-biased attention (geodesic ON)
+#
+# Reported metric: mean loss over the last 10 steps of each arm.
+# If B < A, the geodesic win replicates in Prometheus too.
+
+import "examples/lib/prometheus.omc";
+
+fn build_vocab(text) {
+    h seen = dict_new();
+    h chars = [];
+    h i = 0;
+    while i < str_len(text) {
+        h ch = str_slice(text, i, i + 1);
+        if !dict_has(seen, ch) {
+            dict_set(seen, ch, arr_len(chars));
+            arr_push(chars, ch);
+        }
+        i = i + 1;
+    }
+    h v = dict_new();
+    dict_set(v, "chars", chars);
+    dict_set(v, "lookup", seen);
+    return v;
+}
+
+fn encode(text, vocab) {
+    h lookup = dict_get(vocab, "lookup");
+    h ids = [];
+    h i = 0;
+    while i < str_len(text) {
+        h ch = str_slice(text, i, i + 1);
+        arr_push(ids, dict_get(lookup, ch));
+        i = i + 1;
+    }
+    return ids;
+}
+
+fn build_model(vocab_size, d_model, ff_dim, seq_len, alpha, seed) {
+    h emb = prom_embedding_new(vocab_size, d_model, seed);
+    h s1 = dict_get(emb, "rng_state");
+    h attn = prom_attention_new(d_model, seq_len, s1 + 11);
+    dict_set(attn, "alpha", alpha);     # geodesic strength
+    h s2 = dict_get(attn, "rng_state");
+    h ln1 = prom_layernorm_new(d_model, s2);
+    h ff_up = prom_linear_new(d_model, ff_dim, s2 + 13);
+    h s3 = dict_get(ff_up, "rng_state");
+    h ff_down = prom_linear_new(ff_dim, d_model, s3);
+    h s4 = dict_get(ff_down, "rng_state");
+    h ln2 = prom_layernorm_new(d_model, s4);
+    h head = prom_linear_new(d_model, vocab_size, s4 + 17);
+
+    h m = dict_new();
+    dict_set(m, "emb", emb);
+    dict_set(m, "attn", attn);
+    dict_set(m, "ln1", ln1);
+    dict_set(m, "ff_up", ff_up);
+    dict_set(m, "ff_down", ff_down);
+    dict_set(m, "ln2", ln2);
+    dict_set(m, "head", head);
+    dict_set(m, "alpha", alpha);
+    return m;
+}
+
+# Forward over an 8-token window. Returns [8, vocab] logits.
+fn forward_window(model, token_ids, pe_table) {
+    h x = prom_embedding_batch(dict_get(model, "emb"), token_ids);
+
+    # Add CRT-PE rows for these positions (0..N-1).
+    h pe_rows = [];
+    h i = 0;
+    while i < arr_len(token_ids) {
+        arr_push(pe_rows, arr_get(pe_table, i));
+        i = i + 1;
+    }
+    h pe_const = tape_const(pe_rows);
+    x = tape_add(x, pe_const);
+
+    # Attention + residual.
+    h attn_out = prom_attention_forward(dict_get(model, "attn"), x);
+    h x_post_attn = tape_add(x, attn_out);
+
+    # LayerNorm.
+    h normed1 = prom_layernorm_forward(dict_get(model, "ln1"), x_post_attn);
+
+    # FFN.
+    h up = prom_linear_forward(dict_get(model, "ff_up"), normed1);
+    h activated = prom_relu(up);
+    h down = prom_linear_forward(dict_get(model, "ff_down"), activated);
+    h x_post_ff = tape_add(x_post_attn, down);
+
+    # LayerNorm + head.
+    h normed2 = prom_layernorm_forward(dict_get(model, "ln2"), x_post_ff);
+    return prom_linear_forward(dict_get(model, "head"), normed2);
+}
+
+fn collect_all_params(model) {
+    h layers = [
+        dict_get(model, "emb"),
+        dict_get(model, "attn"),
+        dict_get(model, "ln1"),
+        dict_get(model, "ff_up"),
+        dict_get(model, "ff_down"),
+        dict_get(model, "ln2"),
+        dict_get(model, "head"),
+    ];
+    return prom_collect_params_v2(layers);
+}
+
+fn train_arm(alpha, text, vocab, vocab_size, ids, seq_len, d_model,
+             ff_dim, n_windows, lr, steps, seed) {
+    tape_reset();
+    h model = build_model(vocab_size, d_model, ff_dim, seq_len, alpha, seed);
+    h params = collect_all_params(model);
+    h opt = prom_adamw_new(params, lr, 0.9, 0.999, 1e-8, 0.0);
+    h pe_table = prom_crt_pe_matrix(seq_len, d_model);
+
+    h tail_losses = [];
+    h step = 0;
+    while step < steps {
+        # Cycle deterministically through window starts: step mod n_windows.
+        h start = step - (step / n_windows) * n_windows;
+        h window = [];
+        h targets = [];
+        h k = 0;
+        while k < seq_len {
+            arr_push(window, arr_get(ids, start + k));
+            arr_push(targets, arr_get(ids, start + k + 1));
+            k = k + 1;
+        }
+        h logits = forward_window(model, window, pe_table);
+        h loss = prom_cross_entropy_batch(logits, targets, vocab_size);
+        tape_backward(loss);
+        prom_adamw_step(opt);
+        if step >= steps - 10 { arr_push(tail_losses, tape_value(loss)); }
+        step = step + 1;
+    }
+    h sum = 0.0;
+    h i = 0;
+    while i < arr_len(tail_losses) { sum = sum + arr_get(tail_losses, i); i = i + 1; }
+    return sum / arr_len(tail_losses);
+}
+
+fn main() {
+    print("=== Prometheus multi-token attention: geodesic A/B ===");
+    h text = "the quick brown fox jumps over the lazy dog and the dog sleeps in the sun";
+    h vocab = build_vocab(text);
+    h vocab_size = arr_len(dict_get(vocab, "chars"));
+    h ids = encode(text, vocab);
+    h seq_len = 8;
+    h d_model = 16;
+    h ff_dim = 32;
+    h n_windows = arr_len(ids) - seq_len - 1;
+    h lr = 0.02;
+    h steps = 250;
+    h seeds = [42, 7, 123];
+
+    print(concat_many("corpus length: ", to_string(str_len(text))));
+    print(concat_many("vocab: ", to_string(vocab_size)));
+    print(concat_many("seq_len: ", to_string(seq_len), "  windows: ", to_string(n_windows)));
+    print(concat_many("d_model: ", to_string(d_model), "  ff: ", to_string(ff_dim)));
+    print(concat_many("steps: ", to_string(steps), "  lr: ", to_string(lr), "  seeds: ", to_string(seeds)));
+    print("");
+
+    h a_results = [];
+    h b_results = [];
+    h s = 0;
+    while s < arr_len(seeds) {
+        h seed = arr_get(seeds, s);
+        h loss_a = train_arm(0.0, text, vocab, vocab_size, ids, seq_len,
+                             d_model, ff_dim, n_windows, lr, steps, seed);
+        h loss_b = train_arm(0.5, text, vocab, vocab_size, ids, seq_len,
+                             d_model, ff_dim, n_windows, lr, steps, seed);
+        arr_push(a_results, loss_a);
+        arr_push(b_results, loss_b);
+        h delta = loss_b - loss_a;
+        h tag = "(geodesic worse)";
+        if loss_b < loss_a { tag = "(geodesic better)"; }
+        print(concat_many("seed ", to_string(seed),
+            "  vanilla=", to_string(loss_a),
+            "  geodesic=", to_string(loss_b),
+            "  delta=", to_string(delta), "  ", tag));
+        s = s + 1;
+    }
+
+    h a_sum = 0.0;
+    h b_sum = 0.0;
+    h wins = 0;
+    h i = 0;
+    while i < arr_len(seeds) {
+        a_sum = a_sum + arr_get(a_results, i);
+        b_sum = b_sum + arr_get(b_results, i);
+        if arr_get(b_results, i) < arr_get(a_results, i) { wins = wins + 1; }
+        i = i + 1;
+    }
+    h a_mean = a_sum / arr_len(seeds);
+    h b_mean = b_sum / arr_len(seeds);
+    h rel = (b_mean - a_mean) / a_mean * 100.0;
+
+    print("");
+    print("=== Multi-seed verdict ===");
+    print(concat_many("  vanilla    mean: ", to_string(a_mean)));
+    print(concat_many("  geodesic   mean: ", to_string(b_mean)));
+    print(concat_many("  geodesic vs vanilla: ", to_string(rel), "%"));
+    print(concat_many("  geodesic wins: ", to_string(wins), "/", to_string(arr_len(seeds))));
+    print("");
+    if wins >= 2 {
+        print("[WIN] Geodesic attention helps Prometheus on majority of seeds.");
+        print("      Cross-platform substrate-positional-bias validation.");
+        print("      🥂");
+    } elif wins == 0 {
+        print("[FAIL-FORWARD] Geodesic won 0/3 in Prometheus.");
+        print("               Honest negative — the PyTorch -0.4% win at distractor=0.20");
+        print("               didn't replicate at this scale (single-block model, no");
+        print("               distractor mix, 250 steps). Suggests either: PyTorch result");
+        print("               was scale-specific, OR our Prometheus model needs the");
+        print("               same training setup (much longer steps, mix of clean+noise).");
+    } else {
+        print("[INCONCLUSIVE] 1/3 — noise. Need more seeds or larger model.");
+    }
+}
+
+main();
diff --git a/omnimcode-core/src/interpreter.rs b/omnimcode-core/src/interpreter.rs