v0.8.6 — #3 softmax accel scaffold + survey for #7-10

@RandomCoder-lab RandomCoder-lab released this 17 May 22:05
· 283 commits to master since this release

Scope

Item #3 in the v0.8.5 optimization plan: route more tape ops through the GPU. This ships as scaffolding: the hook is in place, the dispatch consults it, and the binary registers a stub. At current Prometheus shapes the stub declines (default threshold = 1M cells); larger-scale runs and future hardware can opt in via an environment variable.

Also included: an honest survey of items #7-10, each scoped as its own future chapter rather than a rushed half-implementation in this one.

What's new

SoftmaxAccelerator hook in omnimcode-core::accel, mirroring MatmulAccelerator:

```rust
pub type SoftmaxAccelerator = Box<
    dyn Fn(usize, usize, &[f64]) -> Option<Result<Vec<f64>, String>>
        + Send + Sync,
>;
pub fn register_softmax_accelerator(f: SoftmaxAccelerator) -> Result<(), &'static str>;
```
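A register-once hook with this signature could be backed by a write-once global slot. The following is a minimal sketch under assumptions, not the code in omnimcode-core: the `OnceLock` storage, the `softmax_accelerator` accessor, the error message, and the `Vec<f64>` element type are all hypothetical and used here only to make the example compile on its own.

```rust
use std::sync::OnceLock;

/// Same shape as the hook in the release notes. The `Vec<f64>` element type
/// and the meaning of the two `usize` arguments (rows, cols) are assumptions.
pub type SoftmaxAccelerator =
    Box<dyn Fn(usize, usize, &[f64]) -> Option<Result<Vec<f64>, String>> + Send + Sync>;

// Hypothetical storage: a process-wide slot that can be filled exactly once.
static SOFTMAX_ACCEL: OnceLock<SoftmaxAccelerator> = OnceLock::new();

/// Stores the accelerator; a second registration is rejected with an error.
pub fn register_softmax_accelerator(f: SoftmaxAccelerator) -> Result<(), &'static str> {
    SOFTMAX_ACCEL
        .set(f)
        .map_err(|_| "softmax accelerator already registered")
}

/// Hypothetical accessor that a dispatch path could call before falling
/// back to the CPU implementation.
pub fn softmax_accelerator() -> Option<&'static SoftmaxAccelerator> {
    SOFTMAX_ACCEL.get()
}
```

A write-once slot is one reasonable fit for the `Result<(), &'static str>` return type: registration can fail only because something was already registered, and readers never need a lock.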

The tape_softmax interpreter dispatch consults the hook first and falls through to the existing CPU triple-pass when the hook declines. omnimcode-cli registers a stub at startup that declines every call whose cell count is below OMC_GPU_SOFTMAX_MIN_CELLS (default 1,000,000), which is high enough that no current Prometheus path opts in.
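The consult-then-fall-through flow can be sketched as follows. This is an illustration under stated assumptions rather than the shipped dispatch: `dispatch_softmax`, `softmax_cpu`, `threshold_stub`, and `min_cells_from_env` are hypothetical names, the two `usize` arguments are assumed to be rows and columns, the environment variable is assumed to hold a plain integer, and the release notes do not say what the stub does at or above the threshold, so the sketch returns a placeholder error there.

```rust
use std::env;

pub type SoftmaxAccelerator =
    Box<dyn Fn(usize, usize, &[f64]) -> Option<Result<Vec<f64>, String>> + Send + Sync>;

/// CPU fallback: three passes per row (max, exp-and-sum, normalize).
pub fn softmax_cpu(rows: usize, cols: usize, x: &[f64]) -> Vec<f64> {
    assert_eq!(x.len(), rows * cols);
    let mut out = vec![0.0; x.len()];
    if cols == 0 {
        return out; // nothing to normalize, and `chunks(0)` would panic
    }
    for (src, dst) in x.chunks(cols).zip(out.chunks_mut(cols)) {
        let max = src.iter().copied().fold(f64::NEG_INFINITY, f64::max);
        let mut sum = 0.0;
        for (d, &s) in dst.iter_mut().zip(src) {
            *d = (s - max).exp(); // subtract the row max for stability
            sum += *d;
        }
        for d in dst.iter_mut() {
            *d /= sum;
        }
    }
    out
}

/// Assumed parsing of the threshold; falls back to the documented default.
pub fn min_cells_from_env() -> usize {
    env::var("OMC_GPU_SOFTMAX_MIN_CELLS")
        .ok()
        .and_then(|v| v.trim().parse::<usize>().ok())
        .unwrap_or(1_000_000)
}

/// Hypothetical stub: declines (returns `None`) below the threshold.
pub fn threshold_stub(min_cells: usize) -> SoftmaxAccelerator {
    Box::new(
        move |rows: usize, cols: usize, _x: &[f64]| -> Option<Result<Vec<f64>, String>> {
            if rows * cols < min_cells {
                None
            } else {
                // Placeholder only: behavior above the threshold is not documented.
                Some(Err("GPU softmax kernel not implemented in this sketch".to_string()))
            }
        },
    )
}

/// Consult the hook first; use the CPU path when there is no hook or it declines.
pub fn dispatch_softmax(
    accel: Option<&SoftmaxAccelerator>,
    rows: usize,
    cols: usize,
    x: &[f64],
) -> Result<Vec<f64>, String> {
    if let Some(hook) = accel {
        if let Some(result) = hook(rows, cols, x) {
            return result;
        }
    }
    Ok(softmax_cpu(rows, cols, x))
}
```

Returning `Option<Result<..>>` lets the hook distinguish "not my job" (`None`) from "tried and failed" (`Some(Err(..))`). Whether a failure should propagate, as it does here, or fall back to the CPU is a design choice the release notes leave open.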

Honest framing

At the Prometheus shapes we currently exercise (d_model=256, seq_len=64, scores 64×64), per-row softmax is memory-bound and tiny (4k cells = microseconds of CPU work). GPU buffer allocation and dispatch overhead would dominate any kernel speedup. The scaffold exists so that:

  • Larger-scale runs (seq_len=512+, d_model=1024+) can opt in by setting OMC_GPU_SOFTMAX_MIN_CELLS lower
  • Future hardware with cheaper dispatch (Apple M-series unified memory, NVIDIA persistent kernels) can register a non-stub accelerator
  • The same pattern extends to LayerNorm, element-wise, etc. — accel.rs is the precedent
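To make the threshold concrete, the cell counts can be worked out directly. This sketch assumes that "cells" means the number of entries in a square seq_len × seq_len score matrix passed in a single call, and that a call clears the threshold when its cell count is at or above it. Both function names are hypothetical.

```rust
/// Cell count of a square attention-score matrix (seq_len x seq_len).
pub fn score_cells(seq_len: usize) -> usize {
    seq_len * seq_len
}

/// True when a threshold stub would not decline, assuming it declines
/// only calls with strictly fewer cells than `min_cells`.
pub fn clears_threshold(seq_len: usize, min_cells: usize) -> bool {
    score_cells(seq_len) >= min_cells
}
```

Under these assumptions, seq_len=64 gives 4,096 cells and seq_len=512 gives 262,144, both below the default of 1,000,000, which is why a seq_len=512 run would need a lower OMC_GPU_SOFTMAX_MIN_CELLS to opt in. At seq_len=1024 the count is 1,048,576, which clears the default.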

This is the right size of attempt for an item whose payoff at current scales is small but whose architectural slot matters for the trajectory.

What's deferred to v0.8.7+

experiments/prometheus_parity/V086_OPTIMIZATION_SURVEY.md records the scope for each remaining item:

  • #7 substrate-quantized GPU weights — own chapter (~half-day). Encode f32 as (u8 attractor_index, i16 delta), dequant on GPU. Substrate at the data layer where it actually lives.
  • #8 CRT-PE-keyed sparse attention matmul — own chapter. Sparse WGSL kernel + sparse-aware backward.
  • #9 omnimcode-codegen LLVM JIT for tape paths — own chapter. Needs Prometheus-fn JIT-compatibility audit.
  • #10 f16/bf16 GPU paths — own chapter. New WGSL + loss-scaling logic for training stability.

Tests

1111/1111 OMC tests pass.