v0.8.6 — #3 softmax accel scaffold + survey for #7-10
Scope
Item #3 in the v0.8.5 optimization plan: route more tape ops through GPU. Shipped as scaffolding — the hook is in place, the dispatch consults it, the binary registers a stub. At current Prometheus shapes the stub declines (default threshold = 1M cells); larger-scale runs and future hardware can opt in via env.
This release also includes an honest survey of items #7-10, each scoped as its own future chapter rather than rushed into a half-implementation here.
What's new
SoftmaxAccelerator hook in omnimcode-core::accel, mirroring MatmulAccelerator:
```rust
pub type SoftmaxAccelerator = Box<
    dyn Fn(usize, usize, &[f64]) -> Option<Result<Vec<f64>, String>>
        + Send + Sync,
>;
pub fn register_softmax_accelerator(f: SoftmaxAccelerator) -> Result<(), &'static str>;
```
The tape_softmax interpreter dispatch consults the hook first and falls through to the existing CPU triple-pass when the hook declines. omnimcode-cli registers a stub at startup that declines every call with fewer cells than OMC_GPU_SOFTMAX_MIN_CELLS (default 1,000,000), which is high enough that no current Prometheus path opts in.
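The hook-first dispatch described above can be sketched roughly as follows. This is a minimal illustration, not the code in omnimcode-core: the OnceLock storage, the names softmax_dispatch, softmax_cpu, and tape_softmax_sketch, the Vec<f64> output type, and the choice to propagate an accelerator error instead of retrying on CPU are all assumptions made for the example.

```rust
use std::sync::OnceLock;

// Same shape as the hook above; `Vec<f64>` as the output type is an assumption.
pub type SoftmaxAccelerator =
    Box<dyn Fn(usize, usize, &[f64]) -> Option<Result<Vec<f64>, String>> + Send + Sync>;

// Hypothetical storage for the registered hook; the real crate may differ.
static SOFTMAX_ACCEL: OnceLock<SoftmaxAccelerator> = OnceLock::new();

pub fn register_softmax_accelerator(f: SoftmaxAccelerator) -> Result<(), &'static str> {
    SOFTMAX_ACCEL
        .set(f)
        .map_err(|_| "softmax accelerator already registered")
}

/// CPU fallback: three passes per row (max, exp + sum, normalize).
fn softmax_cpu(rows: usize, cols: usize, data: &[f64]) -> Vec<f64> {
    assert_eq!(data.len(), rows * cols, "data must hold rows * cols cells");
    let mut out = Vec::with_capacity(data.len());
    for row in data.chunks(cols.max(1)) {
        // Pass 1: row maximum, for numerical stability.
        let max = row.iter().copied().fold(f64::NEG_INFINITY, f64::max);
        // Pass 2: shifted exponentials and their sum.
        let start = out.len();
        out.extend(row.iter().map(|x| (x - max).exp()));
        let sum: f64 = out[start..].iter().sum();
        // Pass 3: normalize so the row sums to 1.
        for v in &mut out[start..] {
            *v /= sum;
        }
    }
    out
}

/// Consult the hook first; `None` from the hook means "declined, use the CPU path".
/// Assumption: `Some(Err(..))` is surfaced to the caller rather than retried on CPU.
pub fn softmax_dispatch(
    hook: Option<&SoftmaxAccelerator>,
    rows: usize,
    cols: usize,
    data: &[f64],
) -> Result<Vec<f64>, String> {
    if let Some(accel) = hook {
        if let Some(result) = accel(rows, cols, data) {
            return result;
        }
    }
    Ok(softmax_cpu(rows, cols, data))
}

/// Hypothetical stand-in for the interpreter entry point.
pub fn tape_softmax_sketch(rows: usize, cols: usize, data: &[f64]) -> Result<Vec<f64>, String> {
    softmax_dispatch(SOFTMAX_ACCEL.get(), rows, cols, data)
}
```

Passing the hook as a parameter to softmax_dispatch keeps the fallback logic testable without touching the process-wide registration.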
Honest framing
At the Prometheus shapes we currently exercise (d_model=256, seq_len=64, scores 64×64), per-row softmax is memory-bound and tiny (4,096 cells, microseconds of CPU work). GPU buffer allocation and dispatch overhead would dominate any kernel speedup. The scaffold exists so that:
- Larger-scale runs (seq_len=512+, d_model=1024+) can opt in by setting OMC_GPU_SOFTMAX_MIN_CELLS lower
- Future hardware with cheaper dispatch (Apple M-series unified memory, NVIDIA persistent kernels) can register a non-stub accelerator
- The same pattern extends to LayerNorm, element-wise, etc.; accel.rs is the precedent
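To make the opt-in arithmetic concrete, here is a small sketch of how a threshold check of this kind might look. Only the variable name OMC_GPU_SOFTMAX_MIN_CELLS and the 1,000,000 default come from these notes; the helper names, the fallback to the default on an unparsable value, and treating "at or above the threshold" as eligible are assumptions for illustration.

```rust
use std::env;

/// Default from the notes above: calls below this many cells are declined.
const DEFAULT_MIN_CELLS: usize = 1_000_000;

/// Parse a raw threshold value. Assumption: absent or unparsable input
/// falls back to the default rather than failing.
fn min_cells_from(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .unwrap_or(DEFAULT_MIN_CELLS)
}

/// Read the threshold from the environment (hypothetical helper).
fn min_cells_from_env() -> usize {
    min_cells_from(env::var("OMC_GPU_SOFTMAX_MIN_CELLS").ok().as_deref())
}

/// A call is a candidate for offload only when rows * cols reaches the
/// threshold; everything smaller is declined and stays on the CPU path.
fn would_offload(rows: usize, cols: usize, min_cells: usize) -> bool {
    rows.saturating_mul(cols) >= min_cells
}
```

Under the default, a 64×64 score matrix (4,096 cells) is declined, and so is a 512×512 one (262,144 cells), which is why larger-scale runs would need to lower the variable, for example to 262144, before a registered accelerator would ever be consulted.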
This is the right size of attempt for an item whose payoff at current scales is small but whose architectural slot matters for the trajectory.
What's deferred to v0.8.7+
experiments/prometheus_parity/V086_OPTIMIZATION_SURVEY.md records the scope for each remaining item:
- #7 substrate-quantized GPU weights: own chapter (~half-day). Encode f32 as (u8 attractor_index, i16 delta), dequant on GPU. Substrate at the data layer where it actually lives.
- #8 CRT-PE-keyed sparse attention matmul: own chapter. Sparse WGSL kernel + sparse-aware backward.
- #9 omnimcode-codegen LLVM JIT for tape paths: own chapter. Needs Prometheus-fn JIT-compatibility audit.
- #10 f16/bf16 GPU paths: own chapter. New WGSL + loss-scaling logic for training stability.
Tests
1111/1111 OMC tests pass.