RandomCoder-lab
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 60 additions & 0 deletions
diff --git a/Cargo.lock b/Cargo.lock
Lines changed: 1 addition & 0 deletions
diff --git a/examples/bench_prometheus_gpu.omc b/examples/bench_prometheus_gpu.omc
Lines changed: 144 additions & 0 deletions
diff --git a/experiments/prometheus_parity/GPU_INTEGRATION.md b/experiments/prometheus_parity/GPU_INTEGRATION.md
Lines changed: 93 additions & 0 deletions
diff --git a/omnimcode-cli/Cargo.toml b/omnimcode-cli/Cargo.toml
Lines changed: 7 additions & 0 deletions
@@ -13,6 +13,7 @@ Read top-to-bottom for the arc; jump to any chapter for the detail.
 
 | Tag | Date | One-line |
 |---|---|---|
+| [v0.8.2-gpu-prometheus](#v082-gpu-prometheus--2026-05-17) | 2026-05-17 | **GPU wired into Prometheus** via a MatmulAccelerator hook. **13× speedup on synthetic chained matmul** (512², CPU 3.47s → GPU 0.27s). End-to-end Prometheus training at d_model=256: wall-clock unchanged — OMC tree-walk overhead in substrate-shaping helpers (smod, resample, Q6) is the next bottleneck, not matmul. Integration is load-bearing for the substrate-native GPU kernels coming next. |
 | [v0.8.1-tape-primitives](#v081-tape-primitives--2026-05-17) | 2026-05-17 | **Substrate-native tape primitive precedent**: `tape_phi_log` fuses Q6's log-distance into one tape node, with `tape_abs` as the boring companion. Composed vs fused trains to within ~1e-7 — fused abstraction is free. **Pre-existing tape_div/tape_mul broadcast-backward bug fixed**, which unblocks OMC-side cross-validation of S-MOD + substrate-K. First Q6 OMC replication: −0.63% 2/3 seeds at small scale, directionally matching PyTorch's −12.15%. |
 | [v0.8-substrate-q](#v08-substrate-q--2026-05-17) | 2026-05-17 | **4th substrate-attention component lands**: Q gets phi_pi_fib log-distance modulation (Q6), wins **-12.15% val 6/6 seeds**. Cumulative stack now -16.7% vs vanilla baseline. |
 | [v0.7-gpu-scaffold](#v07-gpu-scaffold--2026-05-17) | 2026-05-17 | GPU compute scaffold: `omnimcode-gpu` crate with wgpu (Vulkan) backend, ROCm/CUDA stubs. **4.04× speedup verified on the user's AMD RX 580** via Vulkan (no ROCm pain). |
@@ -32,6 +33,65 @@ Read top-to-bottom for the arc; jump to any chapter for the detail.
 
 ---
 
+## [v0.8.2-gpu-prometheus] - 2026-05-17
+
+**GPU wired into Prometheus via a pluggable MatmulAccelerator hook. Kernel-level 13× speedup confirmed; end-to-end Prometheus training is now bottlenecked by OMC tree-walk overhead in the substrate-shaping helpers, not by matmul. The integration is load-bearing for v0.8.3+ substrate-native GPU kernels.**
+
+### What's new
+
+- **`omnimcode_core::accel::register_matmul_accelerator(f)`** — outer binaries (CLI, MCP server) install a matmul implementation at startup. `omnimcode-core` doesn't depend on `omnimcode-gpu` (which would be a cycle); the hook keeps the layering clean.
+- **`tape_matmul` checks the hook first**, falls through to the in-core triple-loop when unregistered or when the hook declines (e.g. below threshold).
+- **`omnimcode-cli --features gpu`** wires the wgpu Vulkan backend in. Tunables:
+  - `OMC_GPU_BACKEND=cpu|wgpu` — force a backend (or none).
+  - `OMC_GPU_MATMUL_MIN_FLOPS=<N>` — crossover threshold (default 1,000,000).
+  - `OMC_GPU_VERBOSE=1` — log backend + threshold at startup.
+
+### Kernel-level result: 13× on a chained matmul
+
+5 sequential 512² matmuls inside an OMC tape:
+
+| backend | wall-clock | speedup |
+|---|--:|--:|
+| `cpu` | 3.47 s | 1.00× |
+| `wgpu` (RX 580, Vulkan) | 0.27 s | **12.85×** |
+
+Parity: f64 → f32 → f64 round-trip differs at the 9th significant digit — fine for any Prometheus-scale workload.
+
+### End-to-end Prometheus result: unchanged at d_model=256
+
+`examples/bench_prometheus_gpu.omc`, substrate-K transformer, seq_len=64, d_model=256, ff_dim=512, 5 AdamW steps:
+
+| | wall-clock | per step | loss |
+|---|--:|--:|--:|
+| CPU  | 129.05 s | 25.81 s | 6.95930 |
+| wgpu | 129.39 s | 25.88 s | 6.95932 |
+
+GPU and CPU are dead even (GPU is 0.3% slower, due to f64↔f32 conversion overhead). **Matmul wall-clock is milliseconds per step (roughly 50 ms on CPU, single-digit on GPU); the surrounding OMC-side iteration in `_prom_smod_matrix`, `_prom_substrate_resample_matrix`, and Q6 modulation is tens of seconds**. GPU saves ~50ms; OMC burns ~25s. The ratio explains the 0% wall-clock movement.
+
+### What this opens up
+
+The integration is load-bearing for:
+- **v0.8.3 substrate-native GPU kernels**: Fibonacci-tile workgroups (13×13, 21×21, 34×34 vs 16×16), substrate-quantized weights, CRT-PE-keyed sparse matmul. Same composed-vs-fused protocol as `tape_phi_log` in v0.8.1, applied at the GPU layer. The substrate-native question at the kernel level.
+- **Bigger d_model**: at d_model=1024+ the matmul time grows ~64× while the OMC-side substrate ops grow ~4× — the ratio inverts and GPU starts to win end-to-end.
+- **Substrate ops as Rust builtins** (separate work): moving `_prom_smod_matrix` / `_prom_substrate_resample_matrix` into Rust would dissolve the current bottleneck and let the GPU win show through at today's scales.
+
+### Honest framing
+
+This chapter ships the **integration**, not an end-to-end speedup. The 13× kernel-level win is real and reproducible; the end-to-end null result is also real and points cleanly at the next bottleneck. Naming the wall is the chapter — the integration unlocks every direction that needs more matmul work in the time budget without paying re-integration cost later.
+
+### Files
+
+- `omnimcode-core/src/accel.rs` — new module: `MatmulAccelerator`, `register_matmul_accelerator`, `try_accelerated_matmul`
+- `omnimcode-core/src/interpreter.rs` — `tape_matmul` consults the hook first
+- `omnimcode-cli/Cargo.toml` — `gpu` feature pulls in `omnimcode-gpu` with `wgpu`
+- `omnimcode-cli/src/main.rs` — `install_gpu_matmul_accelerator()` at startup
+- `examples/bench_prometheus_gpu.omc` — wall-clock harness
+- `experiments/prometheus_parity/GPU_INTEGRATION.md` — full writeup
+
+Test suite: **1103/1103 OMC tests pass** (small tests stay below GPU threshold and run on CPU as before; broadcast-backward fix from v0.8.1 still holds).
+
+---
+
 ## [v0.8.1-tape-primitives] - 2026-05-17
 
 **Two new tape autograd primitives + a latent backward-broadcast bug fix. The substrate-native `tape_phi_log` is mathematically equivalent to the boring composed reference and trains to within ~1e-7 of it — the substrate-native abstraction is free. The broadcast-backward fix unblocks S-MOD + substrate-K end-to-end training in OMC for the first time.**
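
The v0.8.2 entry's argument that a ~50 ms matmul saving cannot move a ~25.8 s training step is an instance of Amdahl's law. A minimal Rust sketch of that arithmetic, using only the figures quoted in the entry. The names `amdahl_speedup` and `end_to_end_speedup`, and the step of backing out CPU matmul time from the quoted saving, are illustrative assumptions and not code from this PR:

```rust
/// Amdahl's law: overall speedup when a fraction `p` of the runtime
/// is accelerated by a factor `s` and the remainder is untouched.
fn amdahl_speedup(p: f64, s: f64) -> f64 {
    1.0 / ((1.0 - p) + p / s)
}

/// Best-case end-to-end speedup for one training step, given the step
/// time, the time the accelerator saved, and the kernel-level speedup.
fn end_to_end_speedup(step_s: f64, saved_s: f64, kernel_speedup: f64) -> f64 {
    // saved = t_matmul * (1 - 1/s)  =>  t_matmul = saved / (1 - 1/s)
    let matmul_s = saved_s / (1.0 - 1.0 / kernel_speedup);
    amdahl_speedup(matmul_s / step_s, kernel_speedup)
}

fn main() {
    // Figures quoted in the changelog entry.
    let kernel_speedup = 3.47 / 0.27; // synthetic chain: CPU s / GPU s
    let overall = end_to_end_speedup(25.81, 0.050, kernel_speedup);
    println!("kernel speedup : {:.2}x", kernel_speedup);
    println!("end-to-end gain: {:.2}%", (overall - 1.0) * 100.0);
}
```

Under these assumptions the best-case end-to-end gain is about 0.2%, which is consistent with the flat wall-clock numbers in the entry.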
 
@@ -0,0 +1,144 @@
+# Wall-clock bench: pure-OMC Prometheus training, CPU vs GPU.
+#
+# Trains the substrate-K transformer for N steps on a synthetic corpus,
+# measures total time, prints loss for parity. The matmul accelerator
+# (omnimcode-gpu wgpu backend) is registered by the CLI when built with
+# --features gpu; OMC_GPU_BACKEND=cpu forces the in-core CPU path so
+# we can compare wall-clock.
+#
+# Pick d_model big enough that the QKV/FF matmuls cross the
+# OMC_GPU_MATMUL_MIN_FLOPS threshold (default 1M, counted as m*k*n).
+# With d_model=128 and seq_len=32, x@Q is 32*128*128 = 524k (below);
+# FF up at (32×128)·(128×512) = 2.1M (above). With d_model=256 and
+# seq_len=64, as set in main() below: x@Q is 64*256*256 = 4.2M and
+# FF up is 64*256*512 = 8.4M (both above).
+
+import "examples/lib/prometheus.omc";
+
+fn build_random_corpus(n_tokens, vocab_size) {
+    h ids = [];
+    h s = 17;
+    h i = 0;
+    while i < n_tokens {
+        s = (s * 1103515245 + 12345) - ((s * 1103515245 + 12345) / 2147483648) * 2147483648;
+        arr_push(ids, s - (s / vocab_size) * vocab_size);
+        i = i + 1;
+    }
+    return ids;
+}
+
+fn build_model(vocab_size, d_model, ff_dim, seq_len, seed) {
+    h emb = prom_embedding_new(vocab_size, d_model, seed);
+    h attn = prom_attention_substrate_k_new(d_model, seq_len, seed + 11);
+    h ln1 = prom_layernorm_new(d_model, seed + 23);
+    h ff_up = prom_linear_new(d_model, ff_dim, seed + 31);
+    h ff_down = prom_linear_new(ff_dim, d_model, seed + 41);
+    h ln2 = prom_layernorm_new(d_model, seed + 53);
+    h head = prom_linear_new(d_model, vocab_size, seed + 61);
+    h m = dict_new();
+    dict_set(m, "emb", emb);
+    dict_set(m, "attn", attn);
+    dict_set(m, "ln1", ln1);
+    dict_set(m, "ff_up", ff_up);
+    dict_set(m, "ff_down", ff_down);
+    dict_set(m, "ln2", ln2);
+    dict_set(m, "head", head);
+    return m;
+}
+
+fn forward_window(model, token_ids, pe_table) {
+    h x = prom_embedding_batch(dict_get(model, "emb"), token_ids);
+    h pe_rows = [];
+    h i = 0;
+    while i < arr_len(token_ids) {
+        arr_push(pe_rows, arr_get(pe_table, i));
+        i = i + 1;
+    }
+    x = tape_add(x, tape_const(pe_rows));
+    h attn_out = prom_attention_substrate_k_forward(dict_get(model, "attn"), x);
+    h x_post_attn = tape_add(x, attn_out);
+    h n1 = prom_layernorm_forward(dict_get(model, "ln1"), x_post_attn);
+    h up = prom_linear_forward(dict_get(model, "ff_up"), n1);
+    h down = prom_linear_forward(dict_get(model, "ff_down"), prom_relu(up));
+    h x_post_ff = tape_add(x_post_attn, down);
+    h n2 = prom_layernorm_forward(dict_get(model, "ln2"), x_post_ff);
+    return prom_linear_forward(dict_get(model, "head"), n2);
+}
+
+fn collect_params(model) {
+    h attn_p = prom_attention_substrate_k_params(dict_get(model, "attn"));
+    h other = prom_collect_params_v2([
+        dict_get(model, "emb"),
+        dict_get(model, "ln1"),
+        dict_get(model, "ff_up"),
+        dict_get(model, "ff_down"),
+        dict_get(model, "ln2"),
+        dict_get(model, "head"),
+    ]);
+    h out = [];
+    h i = 0;
+    while i < arr_len(attn_p) { arr_push(out, arr_get(attn_p, i)); i = i + 1; }
+    i = 0;
+    while i < arr_len(other) { arr_push(out, arr_get(other, i)); i = i + 1; }
+    return out;
+}
+
+fn main() {
+    h vocab_size = 32;
+    h seq_len = 64;
+    h d_model = 256;
+    h ff_dim = 512;
+    h n_tokens = 200;
+    h steps = 5;
+    h lr = 0.005;
+    h seed = 42;
+
+    print(concat_many("== bench_prometheus_gpu =="));
+    print(concat_many("vocab=", to_string(vocab_size),
+        "  seq_len=", to_string(seq_len),
+        "  d_model=", to_string(d_model),
+        "  ff_dim=", to_string(ff_dim),
+        "  steps=", to_string(steps)));
+    print(concat_many("Per-step matmul shapes:"));
+    print(concat_many("  x @ Q  : ", to_string(seq_len), "x", to_string(d_model),
+                      " · ", to_string(d_model), "x", to_string(d_model),
+                      "  = ", to_string(seq_len * d_model * d_model), " flops"));
+    print(concat_many("  ff_up  : ", to_string(seq_len), "x", to_string(d_model),
+                      " · ", to_string(d_model), "x", to_string(ff_dim),
+                      "  = ", to_string(seq_len * d_model * ff_dim), " flops"));
+
+    h ids = build_random_corpus(n_tokens, vocab_size);
+    tape_reset();
+    h model = build_model(vocab_size, d_model, ff_dim, seq_len, seed);
+    h params = collect_params(model);
+    h opt = prom_adamw_new(params, lr, 0.9, 0.999, 1e-8, 0.0);
+    h pe_table = prom_crt_pe_matrix(seq_len, d_model);
+
+    h t0 = py_eval("__import__('time').time()");
+    h step = 0;
+    h last_loss = 0.0;
+    while step < steps {
+        h start = step - (step / (n_tokens - seq_len - 1)) * (n_tokens - seq_len - 1);
+        h window = [];
+        h targets = [];
+        h k = 0;
+        while k < seq_len {
+            arr_push(window, arr_get(ids, start + k));
+            arr_push(targets, arr_get(ids, start + k + 1));
+            k = k + 1;
+        }
+        h logits = forward_window(model, window, pe_table);
+        h loss = prom_cross_entropy_batch(logits, targets, vocab_size);
+        tape_backward(loss);
+        prom_adamw_step(opt);
+        last_loss = tape_value(loss);
+        step = step + 1;
+    }
+    h t1 = py_eval("__import__('time').time()");
+    h elapsed = t1 - t0;
+
+    print(concat_many("final_loss=", to_string(last_loss)));
+    print(concat_many("elapsed=",    to_string(elapsed), " s"));
+    print(concat_many("per_step=",   to_string(elapsed / steps), " s"));
+}
+
+main();
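
The shape arithmetic in the bench file's header comment can be checked mechanically. A small Rust sketch, assuming the hook compares the product m·k·n against the threshold and declines strictly below it, as GPU_INTEGRATION.md describes. `matmul_cost`, `crosses_threshold`, and `DEFAULT_MIN_FLOPS` are hypothetical names chosen for illustration:

```rust
/// Default crossover quoted in the docs (OMC_GPU_MATMUL_MIN_FLOPS).
const DEFAULT_MIN_FLOPS: u64 = 1_000_000;

/// Cost of an (m×k)·(k×n) matmul as the product m*k*n, the quantity
/// the hook is described as comparing against the threshold.
fn matmul_cost(m: u64, k: u64, n: u64) -> u64 {
    m * k * n
}

/// True when the call would be routed to the accelerator.
fn crosses_threshold(m: u64, k: u64, n: u64, min_flops: u64) -> bool {
    matmul_cost(m, k, n) >= min_flops
}

fn main() {
    // (label, m, k, n) for the shapes discussed in the bench comments.
    let shapes = [
        ("x@Q   seq=32 d=128", 32, 128, 128),
        ("ff_up seq=32 d=128", 32, 128, 512),
        ("x@Q   seq=64 d=256", 64, 256, 256),
        ("ff_up seq=64 d=256", 64, 256, 512),
    ];
    for (label, m, k, n) in shapes {
        println!(
            "{label}: {} -> {}",
            matmul_cost(m, k, n),
            if crosses_threshold(m, k, n, DEFAULT_MIN_FLOPS) { "gpu" } else { "cpu" }
        );
    }
}
```

Only the first shape stays on the CPU path at the default threshold. Raising the threshold to 10,000,000, as in the tuning example in the reproduction section, would keep all four on the CPU.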
@@ -0,0 +1,93 @@
+# GPU into Prometheus: tape_matmul routed through omnimcode-gpu
+
+## Headline
+
+Integration shipped: tape_matmul forwards above the CPU/GPU crossover threshold get routed through omnimcode-gpu's wgpu (Vulkan) backend. The kernel-level speedup is large (13× on a chained 512² matmul), but **end-to-end Prometheus training is now bottlenecked by OMC tree-walk overhead in the substrate-shaping helpers** (substrate_softmax, substrate_resample, Q6 modulation), not by matmul time. The honest read: the integration is correct and load-bearing for any future work that pushes matmul further into the budget — but **GPU alone doesn't accelerate today's Prometheus**.
+
+## What got wired
+
+A `MatmulAccelerator` hook in `omnimcode-core` that an outer binary can register at startup. The CLI binary now does so under the `gpu` feature, pointing it at `omnimcode-gpu::pick_backend()`. The hook:
+
+- Accepts `(m, k, n, &[f64], &[f64])`, declines (returns `None`) when `m·k·n < OMC_GPU_MATMUL_MIN_FLOPS` (default 1,000,000)
+- Converts f64 → f32 at the boundary, calls the backend, converts f32 → f64 back
+- Disabled by `OMC_GPU_BACKEND=cpu`
+- `OMC_GPU_VERBOSE=1` logs the chosen backend + threshold at startup
+
+The pre-existing `tape_matmul` implementation is unchanged when no hook is registered, so backward compatibility is total. The backward pass (`dA = dy @ B^T`, `dB = A^T @ dy`) benefits automatically because it calls the same `tape_matmul` helper.
+
+## Kernel-level win: synthetic matmul chain
+
+5 chained 512² matmuls, f64 OMC tape:
+
+```
+OMC_GPU_BACKEND=cpu   3.47 s
+OMC_GPU_BACKEND=wgpu  0.27 s    ~13× speedup
+```
+
+f64 → f32 → f64 round-trip vs pure-f64 reference: result differs at the 9th significant digit (`239899095...` vs `239899097...`), well within f32 + summation-order noise. Parity is fine for any Prometheus-scale workload.
+
+## End-to-end Prometheus training (d_model=256)
+
+`examples/bench_prometheus_gpu.omc`, substrate-K transformer, seq_len=64, d_model=256, ff_dim=512, 5 AdamW steps:
+
+| | wall-clock | per step | final loss |
+|---|--:|--:|--:|
+| `OMC_GPU_BACKEND=cpu`  | 129.05 s | 25.81 s | 6.95930 |
+| `OMC_GPU_BACKEND=wgpu` | 129.39 s | 25.88 s | 6.95932 |
+| **diff** | +0.3% slower | +0.3% | 2e-5 (f32 noise) |
+
+Per-step matmul shapes that DID cross the GPU threshold:
+- `x @ Q` : 64×256·256×256 = 4.2M flops
+- `ff_up` : 64×256·256×512 = 8.4M flops
+
+Both are well above the 1M threshold and get routed to GPU. But the wall-clock numbers don't move. Why? Because at this scale, **matmul wall-clock is at most tens of milliseconds per step**, and the surrounding OMC-side iteration is tens of seconds per step.
+
+### Where the time actually goes
+
+For seq_len=64, d_model=256:
+
+- `_prom_smod_matrix(scores_val, alpha)` — OMC loop over 64² = 4096 score cells, each calling `attractor_distance`. Per step: 1 forward + 1 backward = 8192 OMC arr_get/arith calls. At tree-walk speed (~100k ops/sec for fat dicts), that's ~80ms purely for the substrate-modulator matrix.
+- `_prom_substrate_resample_matrix(v_val, scale)` — same shape OMC loop over V projections. Another ~80ms.
+- `_prom_q6_log_distance_composed` / `_prom_q6_modulation_from_log_d` — runs at the same scale, several more OMC iterations.
+- The whole inner loop runs in OMC because it has to call `attractor_distance`, which is an OMC builtin chain.
+- Summed, the loops itemized above come to a few hundred milliseconds per step, far short of the ~25.8 s per step we measured, so most of the time is additional OMC overhead elsewhere: embedding lookup, parameter collection, AdamW state mutation, etc.
+
+The GPU saves us maybe ~50ms per step on the matmul side. The OMC interp burns ~25 seconds per step on substrate-shaping logic. The 50ms vs 25s ratio is why we see 0% wall-clock movement.
+
+## What this means
+
+The GPU integration is **architecturally complete and load-bearing for any future direction that pushes matmul further into the time budget** — bigger d_model (1024+), batched inference, scaled corpora. It also opens the door to v0.8.3+ **substrate-native GPU kernels** (Fibonacci-tile workgroups, substrate-quantized weights, CRT-PE-keyed sparse matmul) where the substrate IS the kernel architecture.
+
+But **GPU alone doesn't speed up today's Prometheus**. The next bottleneck is OMC tree-walk overhead in the substrate-shaping helpers. Three concrete options for that:
+
+1. **Move substrate modulators into Rust builtins**: `_prom_smod_matrix` / `_prom_substrate_resample_matrix` become `prom_substrate_modulator_smod` / `prom_substrate_modulator_resample` Rust ops that take a tape node id, allocate the modulator matrix natively, and return a const tape node. Estimated 100-1000× on these inner loops alone.
+2. **Bytecode VM for the OMC side**: the existing `OMC_VM=1` path already gives 2-10× on hot loops. It has not yet been tested on tape-using paths, so it is worth a measurement.
+3. **Fused substrate tape ops**: `tape_substrate_resample`, `tape_smod_softmax` as single Rust nodes (the precedent set by `tape_phi_log` in v0.8.1). This eliminates the OMC-side iteration entirely.
+
+(3) is the cleanest path and aligns with the substrate-native primitive thesis. (1) is the cheapest. (2) is free measurement.
+
+## Files
+
+- `omnimcode-core/src/accel.rs` — the `MatmulAccelerator` hook + `OnceLock` global + `try_accelerated_matmul` call site
+- `omnimcode-core/src/interpreter.rs` — `tape_matmul` consults the hook before falling back to triple-loop
+- `omnimcode-cli/Cargo.toml` — new `gpu` feature pulls in `omnimcode-gpu`
+- `omnimcode-cli/src/main.rs` — `install_gpu_matmul_accelerator()` registers wgpu backend at startup
+- `examples/bench_prometheus_gpu.omc` — wall-clock harness
+
+## Reproduction
+
+```bash
+# Build with GPU feature
+cargo build --release -p omnimcode-cli --features gpu
+
+# Synthetic matmul chain (kernel-level win)
+OMC_GPU_BACKEND=cpu  ./target/release/omnimcode-standalone /tmp/gpu_matmul_big.omc
+OMC_GPU_BACKEND=wgpu ./target/release/omnimcode-standalone /tmp/gpu_matmul_big.omc
+
+# End-to-end Prometheus training (no end-to-end win at d_model=256)
+OMC_GPU_BACKEND=cpu  ./target/release/omnimcode-standalone examples/bench_prometheus_gpu.omc
+OMC_GPU_BACKEND=wgpu ./target/release/omnimcode-standalone examples/bench_prometheus_gpu.omc
+
+# Tune the crossover threshold
+OMC_GPU_MATMUL_MIN_FLOPS=10000000 ./target/release/omnimcode-standalone ...
+```
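
The hook described under "What got wired" can be sketched in a few lines of Rust. This is a minimal illustration under stated assumptions, not the contents of `accel.rs`. The function names and the `OnceLock` global come from the writeup, but the exact signatures, the boxed-closure type, and the `f32_accelerator` stand-in (a CPU loop in f32 instead of a wgpu call) are hypothetical:

```rust
use std::sync::OnceLock;

/// Hypothetical hook type: returns `None` to decline the call.
type MatmulAccelerator =
    Box<dyn Fn(usize, usize, usize, &[f64], &[f64]) -> Option<Vec<f64>> + Send + Sync>;

/// Process-wide slot, set at most once by the outer binary.
static ACCEL: OnceLock<MatmulAccelerator> = OnceLock::new();

/// Returns false if an accelerator was already registered.
fn register_matmul_accelerator(f: MatmulAccelerator) -> bool {
    ACCEL.set(f).is_ok()
}

fn try_accelerated_matmul(m: usize, k: usize, n: usize, a: &[f64], b: &[f64]) -> Option<Vec<f64>> {
    ACCEL.get().and_then(|f| f(m, k, n, a, b))
}

/// In-core fallback: row-major (m×k)·(k×n) triple loop in f64.
fn matmul_cpu(m: usize, k: usize, n: usize, a: &[f64], b: &[f64]) -> Vec<f64> {
    let mut c = vec![0.0; m * n];
    for i in 0..m {
        for p in 0..k {
            let aip = a[i * k + p];
            for j in 0..n {
                c[i * n + j] += aip * b[p * n + j];
            }
        }
    }
    c
}

/// Consult the hook first, fall through when unset or declined.
fn matmul(m: usize, k: usize, n: usize, a: &[f64], b: &[f64]) -> Vec<f64> {
    try_accelerated_matmul(m, k, n, a, b).unwrap_or_else(|| matmul_cpu(m, k, n, a, b))
}

/// Stand-in for the GPU backend (assumption: plain f32 loop, no wgpu).
/// Declines below `min_flops`, converts f64 -> f32 -> f64 at the boundary.
fn f32_accelerator(min_flops: usize) -> MatmulAccelerator {
    Box::new(move |m: usize, k: usize, n: usize, a: &[f64], b: &[f64]| {
        if m * k * n < min_flops {
            return None; // below the crossover: let the CPU path run
        }
        let a32: Vec<f32> = a.iter().map(|&x| x as f32).collect();
        let b32: Vec<f32> = b.iter().map(|&x| x as f32).collect();
        let mut c32 = vec![0.0f32; m * n];
        for i in 0..m {
            for p in 0..k {
                for j in 0..n {
                    c32[i * n + j] += a32[i * k + p] * b32[p * n + j];
                }
            }
        }
        Some(c32.iter().map(|&x| x as f64).collect::<Vec<f64>>())
    })
}

fn main() {
    let a = vec![1.0, 2.0, 3.0, 4.0];
    let b = vec![5.0, 6.0, 7.0, 8.0];
    // Nothing registered yet: falls through to the CPU loop.
    println!("cpu : {:?}", matmul(2, 2, 2, &a, &b));
    // Threshold of 8 so this 2x2x2 product (cost 8) is accepted.
    register_matmul_accelerator(f32_accelerator(8));
    println!("hook: {:?}", matmul(2, 2, 2, &a, &b));
}
```

Keeping the slot in the core crate and the registration in the outer binary is what lets the core avoid a dependency on the GPU crate. Because the hook returns `Option`, the threshold decision stays with the accelerator and the call site needs only one fallback path.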
@@ -43,10 +43,17 @@ python-embed = ["omnimcode-core/python-embed"]
 # consults `OMC_HBIT_JIT=1` at runtime; if also set, eligible user
 # fns are routed through omnimcode-codegen instead of tree-walk/VM.
 llvm-jit = ["dep:omnimcode-codegen", "dep:inkwell"]
+# GPU matmul acceleration via omnimcode-gpu's wgpu backend. When set,
+# the CLI registers a matmul accelerator at startup that routes
+# tape_matmul calls above the CPU/GPU crossover (default 1,000,000
+# m*k*n, roughly 100³; tunable via `OMC_GPU_MATMUL_MIN_FLOPS`)
+# through Vulkan/Metal/DX12.
+# Honored if `OMC_GPU_BACKEND != "cpu"` (default = wgpu when built in).
+gpu = ["dep:omnimcode-gpu"]
 
 [dependencies]
 omnimcode-core = { path = "../omnimcode-core", default-features = false }
 omnimcode-codegen = { path = "../omnimcode-codegen", optional = true, features = ["llvm-jit"] }
+omnimcode-gpu = { path = "../omnimcode-gpu", optional = true, features = ["wgpu"] }
 serde_json = "1.0"  # used by omc-kernel for wire-format messages
 # inkwell is needed at the CLI level only because we leak the LLVM
 # Context for process-lifetime; the dispatch closure lives on the