Integration shipped: tape_matmul forwards above the CPU/GPU crossover threshold get routed through omnimcode-gpu's wgpu (Vulkan) backend. The kernel-level speedup is large (13× on a chained 512² matmul), but end-to-end Prometheus training is now bottlenecked by OMC tree-walk overhead in the substrate-shaping helpers (substrate_softmax, substrate_resample, Q6 modulation), not by matmul time. The honest read: the integration is correct and load-bearing for any future work that pushes matmul further into the budget — but GPU alone doesn't accelerate today's Prometheus.
A MatmulAccelerator hook in omnimcode-core that an outer binary can register at startup. The CLI binary now does so under the gpu feature, pointing it at omnimcode-gpu::pick_backend(). The hook:
- Accepts `(m, k, n, &[f64], &[f64])`, declines (returns `None`) when `m·k·n < OMC_GPU_MATMUL_MIN_FLOPS` (default 1,000,000)
- Converts f64 → f32 at the boundary, calls the backend, converts f32 → f64 back
- Disabled by `OMC_GPU_BACKEND=cpu`
- `OMC_GPU_VERBOSE=1` logs the chosen backend + threshold at startup
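The hook contract described above can be sketched as follows. This is a minimal illustration, not the code in `omnimcode-core`: the names `MatmulHook`, `HOOK`, `min_flops`, `backend_matmul_f32`, `example_hook`, and `try_accelerated` are hypothetical, and the "backend" is a plain f32 triple loop standing in for the real wgpu path. Only the `(m, k, n, &[f64], &[f64])` shape, the `None`-to-decline convention, the `OMC_GPU_MATMUL_MIN_FLOPS` default, and the f64 → f32 → f64 conversion come from the text.

```rust
use std::sync::OnceLock;

/// Hypothetical hook signature: `None` declines, `Some(c)` is the row-major m×n product.
type MatmulHook = fn(usize, usize, usize, &[f64], &[f64]) -> Option<Vec<f64>>;

/// Global slot that an outer binary fills once at startup (assumed layout).
static HOOK: OnceLock<MatmulHook> = OnceLock::new();

/// Reads the crossover threshold, falling back to the documented default.
fn min_flops() -> usize {
    std::env::var("OMC_GPU_MATMUL_MIN_FLOPS")
        .ok()
        .and_then(|s| s.parse().ok())
        .unwrap_or(1_000_000)
}

/// Stand-in for a real backend: a plain f32 triple loop.
fn backend_matmul_f32(m: usize, k: usize, n: usize, a: &[f32], b: &[f32]) -> Vec<f32> {
    let mut c = vec![0.0f32; m * n];
    for i in 0..m {
        for p in 0..k {
            let aip = a[i * k + p];
            for j in 0..n {
                c[i * n + j] += aip * b[p * n + j];
            }
        }
    }
    c
}

/// Example hook: decline small products, otherwise narrow, compute, widen.
fn example_hook(m: usize, k: usize, n: usize, a: &[f64], b: &[f64]) -> Option<Vec<f64>> {
    if m * k * n < min_flops() {
        return None; // caller falls back to its own CPU loop
    }
    let a32: Vec<f32> = a.iter().map(|&x| x as f32).collect();
    let b32: Vec<f32> = b.iter().map(|&x| x as f32).collect();
    let c32 = backend_matmul_f32(m, k, n, &a32, &b32);
    Some(c32.into_iter().map(f64::from).collect())
}

/// Call site: consult the hook only if one has been registered.
fn try_accelerated(m: usize, k: usize, n: usize, a: &[f64], b: &[f64]) -> Option<Vec<f64>> {
    HOOK.get().and_then(|hook| hook(m, k, n, a, b))
}
```

Registration would be a single `HOOK.set(example_hook)` at startup. With no hook set, `try_accelerated` always returns `None`, which is what keeps the unregistered path identical to the pre-existing behavior.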
Pre-existing tape_matmul implementation is unchanged when no hook is registered — backward compatibility is total. Backward pass (dA = dy @ B^T, dB = A^T @ dy) automatically benefits because it calls the same tape_matmul helper.
5 chained 512² matmuls, f64 OMC tape:
OMC_GPU_BACKEND=cpu 3.47 s
OMC_GPU_BACKEND=wgpu 0.27 s ~13× speedup
f64 → f32 → f64 round-trip vs pure-f64 reference: result differs at the 9th significant digit (239899095... vs 239899097...), well within f32 + summation-order noise. Parity is fine for any Prometheus-scale workload.
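A parity check of this kind can be reproduced in isolation. The sketch below is an assumption-laden stand-in, not the project's test: it uses a hypothetical LCG-based `fill` for inputs and CPU loops for both paths, and it reports the largest relative difference between a pure-f64 product and one that narrows to f32 at the boundary.

```rust
/// Deterministic pseudo-random values in [0, 1) from a 64-bit LCG (hypothetical helper).
fn fill(len: usize, seed: u64) -> Vec<f64> {
    let mut state = seed;
    (0..len)
        .map(|_| {
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            (state >> 11) as f64 / (1u64 << 53) as f64
        })
        .collect()
}

/// Reference product: row-major (m×k)·(k×n), accumulated in f64.
fn matmul_f64(m: usize, k: usize, n: usize, a: &[f64], b: &[f64]) -> Vec<f64> {
    let mut c = vec![0.0f64; m * n];
    for i in 0..m {
        for p in 0..k {
            for j in 0..n {
                c[i * n + j] += a[i * k + p] * b[p * n + j];
            }
        }
    }
    c
}

/// Same product, narrowed to f32 at the boundary, accumulated in f32, widened back.
fn matmul_via_f32(m: usize, k: usize, n: usize, a: &[f64], b: &[f64]) -> Vec<f64> {
    let a32: Vec<f32> = a.iter().map(|&x| x as f32).collect();
    let b32: Vec<f32> = b.iter().map(|&x| x as f32).collect();
    let mut c = vec![0.0f32; m * n];
    for i in 0..m {
        for p in 0..k {
            for j in 0..n {
                c[i * n + j] += a32[i * k + p] * b32[p * n + j];
            }
        }
    }
    c.into_iter().map(f64::from).collect()
}

/// Largest relative difference between corresponding elements.
fn max_rel_diff(reference: &[f64], other: &[f64]) -> f64 {
    reference
        .iter()
        .zip(other)
        .map(|(&r, &o)| (r - o).abs() / r.abs().max(f64::MIN_POSITIVE))
        .fold(0.0, f64::max)
}
```

Comparing `max_rel_diff(&matmul_f64(..), &matmul_via_f32(..))` against a tolerance chosen for the workload is one way to turn "parity is fine" into a regression check. The tolerance itself is a judgment call that depends on the inner dimension and the value range.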
examples/bench_prometheus_gpu.omc, substrate-K transformer, seq_len=64, d_model=256, ff_dim=512, 5 AdamW steps:
| | wall-clock | per step | final loss |
|---|---|---|---|
| `OMC_GPU_BACKEND=cpu` | 129.05 s | 25.81 s | 6.95930 |
| `OMC_GPU_BACKEND=wgpu` | 129.39 s | 25.88 s | 6.95932 |
| diff | +0.3% slower | +0.3% | 2e-5 (f32 noise) |
Per-step matmul shapes that DID cross the GPU threshold:
- `x @ Q`: 64×256·256×256 = 4.2M flops
- `ff_up`: 64×256·256×512 = 8.4M flops
Both are well above the 1M threshold and get routed to GPU. But the wall-clock numbers don't move. Why? Because at this scale, matmul wall-clock is single-digit milliseconds per step, and the surrounding OMC-side iteration is multiple seconds per step.
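The threshold arithmetic for the two shapes can be checked directly. The helper names below (`matmul_work`, `routed_to_gpu`) are hypothetical; the work estimate is the `m·k·n` product the hook compares against `OMC_GPU_MATMUL_MIN_FLOPS`, which counts multiply-adds.

```rust
/// Default crossover threshold (the documented OMC_GPU_MATMUL_MIN_FLOPS default).
const DEFAULT_MIN_FLOPS: u64 = 1_000_000;

/// Work estimate used for the threshold test: m·k·n multiply-adds.
fn matmul_work(m: u64, k: u64, n: u64) -> u64 {
    m * k * n
}

/// True when a product of this shape would be offered to the accelerator.
fn routed_to_gpu(m: u64, k: u64, n: u64, min_flops: u64) -> bool {
    matmul_work(m, k, n) >= min_flops
}
```

Under this sketch, `matmul_work(64, 256, 256)` is 4,194,304 and `matmul_work(64, 256, 512)` is 8,388,608, so both are routed at the default threshold. Raising the threshold to 10,000,000 would send both back to the CPU path.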
For seq_len=64, d_model=256:
- `_prom_smod_matrix(scores_val, alpha)`: OMC loop over 64² = 4096 score cells, each calling `attractor_distance`. Per step: 1 forward + 1 backward = 8192 OMC arr_get/arith calls. At tree-walk speed (~100k ops/sec for fat dicts), that's ~80ms purely for the substrate-modulator matrix.
- `_prom_substrate_resample_matrix(v_val, scale)`: same shape OMC loop over V projections. Another ~80ms.
- `_prom_q6_log_distance_composed` / `_prom_q6_modulation_from_log_d`: runs at the same scale, several more OMC iterations.
- The whole inner loop runs in OMC because it has to call `attractor_distance`, which is an OMC builtin chain.
- Summed over 5 steps, these itemized loops come to roughly a second or two at the quoted rate, nowhere near the ~25 s per step we measured, so there's additional OMC overhead in embedding lookup, parameter collection, AdamW state mutation, etc.
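The ~80ms figure is a back-of-envelope division of interpreted call count by the quoted tree-walk rate. A small sketch, with the hypothetical helper `tree_walk_ms` and the simplifying assumption that each cell costs one interpreted operation per pass:

```rust
/// Estimated wall-clock in milliseconds for an interpreted loop that touches
/// `cells` entries on each of `passes` passes at `ops_per_sec` operations per second.
fn tree_walk_ms(cells: u64, passes: u64, ops_per_sec: f64) -> f64 {
    (cells * passes) as f64 / ops_per_sec * 1000.0
}
```

For the 64×64 score matrix, `tree_walk_ms(64 * 64, 2, 100_000.0)` gives about 81.9 ms (forward plus backward). Any helper that spends several interpreted operations per cell scales this figure up proportionally.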
The GPU saves us maybe ~50ms per step on the matmul side. The OMC interp burns ~25 seconds per step on substrate-shaping logic. The 50ms vs 25s ratio is why we see 0% wall-clock movement.
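The same ratio can be written as an upper bound on end-to-end speedup. Taking the measured CPU step time $T \approx 25.81$ s and the rough estimate of $\Delta \approx 0.05$ s of matmul time saved per step (both figures from above; $\Delta$ is an estimate, not a measurement), the best case is:

$$
S = \frac{T}{T - \Delta} = \frac{25.81\ \text{s}}{25.81\ \text{s} - 0.05\ \text{s}} \approx 1.002
$$

That is about a 0.2% improvement, smaller than the 0.3% difference observed between the two benchmark runs, so no wall-clock movement is expected at this model size.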
The GPU integration is architecturally complete and load-bearing for any future direction that pushes matmul further into the time budget — bigger d_model (1024+), batched inference, scaled corpora. It also opens the door to v0.8.3+ substrate-native GPU kernels (Fibonacci-tile workgroups, substrate-quantized weights, CRT-PE-keyed sparse matmul) where the substrate IS the kernel architecture.
But GPU alone doesn't speed up today's Prometheus. The next bottleneck is OMC tree-walk overhead in the substrate-shaping helpers. Three concrete options for that:
1. Move substrate modulators into Rust builtins: `_prom_smod_matrix` / `_prom_substrate_resample_matrix` become `prom_substrate_modulator_smod` / `prom_substrate_modulator_resample` Rust ops that take a tape node id, allocate the modulator matrix natively, and return a const tape node. Estimated 100-1000× on these inner loops alone.
2. Bytecode VM for the OMC side: the existing `OMC_VM=1` path already gives 2-10× on hot loops. It hasn't been tested for tape-using paths yet; worth a measurement.
3. Fused substrate tape ops: `tape_substrate_resample`, `tape_smod_softmax` as single Rust nodes (the precedent set by `tape_phi_log` in v0.8.1). Eliminates the OMC-side iteration entirely.
(3) is the cleanest path and aligns with the substrate-native primitive thesis. (1) is the cheapest. (2) is free measurement.
- `omnimcode-core/src/accel.rs`: the `MatmulAccelerator` hook + `OnceLock` global + `try_accelerated_matmul` call site
- `omnimcode-core/src/interpreter.rs`: `tape_matmul` consults the hook before falling back to triple-loop
- `omnimcode-cli/Cargo.toml`: new `gpu` feature pulls in `omnimcode-gpu`
- `omnimcode-cli/src/main.rs`: `install_gpu_matmul_accelerator()` registers wgpu backend at startup
- `examples/bench_prometheus_gpu.omc`: wall-clock harness
# Build with GPU feature
cargo build --release -p omnimcode-cli --features gpu
# Synthetic matmul chain (kernel-level win)
OMC_GPU_BACKEND=cpu ./target/release/omnimcode-standalone /tmp/gpu_matmul_big.omc
OMC_GPU_BACKEND=wgpu ./target/release/omnimcode-standalone /tmp/gpu_matmul_big.omc
# End-to-end Prometheus training (no end-to-end win at d_model=256)
OMC_GPU_BACKEND=cpu ./target/release/omnimcode-standalone examples/bench_prometheus_gpu.omc
OMC_GPU_BACKEND=wgpu ./target/release/omnimcode-standalone examples/bench_prometheus_gpu.omc
# Tune the crossover threshold
OMC_GPU_MATMUL_MIN_FLOPS=10000000 ./target/release/omnimcode-standalone ...