
Commit ff46dac

v0.8.6 #3 scaffold + scope survey for v0.8.5 items #7-10
#3 (route more tape ops through GPU):
- SoftmaxAccelerator hook added to omnimcode-core::accel, mirroring MatmulAccelerator
- omnimcode-cli registers a stub at startup that declines all calls (threshold OMC_GPU_SOFTMAX_MIN_CELLS defaults to 1M)
- tape_softmax interpreter dispatch consults the hook first, falls through to the existing CPU triple-pass when declined

The honest framing: at Prometheus shapes we exercise today (d_model=256, seq_len=64, scores 64×64), per-row softmax is memory-bound and only 4k cells, so GPU buffer alloc + dispatch overhead would dominate. The scaffold lives so larger-scale runs (seq_len=512+, d_model=1024+) and future hardware can opt in by lowering the threshold; no current path benefits. Same pattern can extend to LayerNorm, element-wise ops, etc.; the accel module is the precedent.

Survey doc V086_OPTIMIZATION_SURVEY.md records:
- what shipped in v0.8.5 (#1, #2, #4 negative, #5, #6)
- what scaffolded in v0.8.6 (#3)
- what's scoped for v0.8.7+ chapters (#7 substrate-quantized weights, #8 CRT-PE sparse attention, #9 LLVM JIT for tape paths, #10 f16/bf16)

Each remaining item is sized at ~half-day to a day and deserves its own chapter with a real bench rather than a rushed half-implementation. "Fail forward" applies to attempts (we tried #3 honestly, found the scaffold is the right size of attempt); it doesn't mean rushing 10 items to ship one chapter of unmeasured code.

1111/1111 OMC tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 34f61fa commit ff46dac

4 files changed

Lines changed: 150 additions & 0 deletions


Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
# v0.8.5 / v0.8.6 optimization sweep: what shipped, what's scoped, what's next

The user's optimization roadmap had 10 items. Five shipped in v0.8.5
(#1, #2, #4 negative, #5, #6) and one in v0.8.6 (#3, scaffold only).
Items #7-10 are each their own chapter. This doc records the honest state.

## Shipped

| # | item | status | notes |
|---|---|---|---|
| 1 | tape_cross_entropy_batch fused | v0.8.5 ✓ | closed-form (p−one_hot)/N backward, 5→1 tape nodes |
| 2 | tape_embedding_lookup direct gather | v0.8.5 ✓ | skips one-hot construction |
| 3 | route more tape ops through GPU | v0.8.6 scaffold | softmax hook in place, default declines |
| 4 | OMC_VM=1 on tape paths | v0.8.5 negative | 0.662 s/step vs 0.661 tree-walk |
| 5 | multi-head substrate-K | v0.8.5 ✓ | -0.25% MH vs SH, wins 2/3 seeds, d_model=32 |
| 6 | tape_substrate_resample fused | v0.8.5 ✓ | (tape_smod_softmax fusion deferred: bigger backward chain) |
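
The closed-form backward noted for item #1 is the standard softmax + mean cross-entropy identity: for N rows of logits, the gradient with respect to the logits is (p − one_hot)/N. A minimal CPU sketch of that fusion follows. The function name and signature are hypothetical and are not the actual `tape_cross_entropy_batch` implementation.

```rust
/// Hypothetical sketch: row-wise stable softmax + mean cross-entropy in one pass.
/// `logits` is row-major with shape (n, vocab); `targets[i]` is the class of row i.
/// Returns (mean loss, dLoss/dLogits), where the gradient is (p - one_hot) / n.
/// Assumes n > 0.
pub fn cross_entropy_batch(
    logits: &[f64],
    n: usize,
    vocab: usize,
    targets: &[usize],
) -> (f64, Vec<f64>) {
    assert_eq!(logits.len(), n * vocab);
    assert_eq!(targets.len(), n);
    let inv_n = 1.0 / n as f64;
    let mut grad = vec![0.0; n * vocab];
    let mut loss = 0.0;
    for i in 0..n {
        let row = &logits[i * vocab..(i + 1) * vocab];
        // Subtract the row max before exponentiating for numerical stability.
        let max = row.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
        let sum: f64 = row.iter().map(|&x| (x - max).exp()).sum();
        let g = &mut grad[i * vocab..(i + 1) * vocab];
        for (j, &x) in row.iter().enumerate() {
            // p_j / n
            g[j] = (x - max).exp() / sum * inv_n;
        }
        let t = targets[i];
        assert!(t < vocab);
        // -log p_t = log(sum) - (x_t - max)
        loss += sum.ln() - (row[t] - max);
        // Subtract one_hot / n at the target column.
        g[t] -= inv_n;
    }
    (loss * inv_n, grad)
}
```

With two equal logits and target 0, this returns a loss of ln 2 and a gradient of [-0.5, 0.5]; each gradient row sums to zero.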

## Scoped: each its own future chapter

### #7 Substrate-quantized GPU weights

**Goal**: encode f32 weights as substrate-shaped (attractor index + small delta) for smaller buffers and more effective bandwidth.

**What needs to happen**:
1. Rust quantizer: given an f64 cell, return `(u8 attractor_index, i16 delta)` where attractor_index points into the FIBONACCI table (40 entries, 6 bits used) and delta is a signed offset from the attractor.
2. Dequantizer: the inverse. `attractor + (delta / scale)` reconstructs an approximate f64.
3. CPU-side validation: train a Prometheus model where every parameter goes through quantize→dequantize on each forward. Compare the loss curve to baseline. If quality holds, the substrate encoding is doing useful work.
4. GPU port: a WGSL shader that takes a packed u24-per-cell substrate-encoded buffer and emits f32 matmul inputs. Bench bandwidth-bound shapes (d_model=1024+).
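
Steps 1 and 2 could be sketched roughly as below. Everything specific here is an assumption made for illustration: the table contents (40 distinct Fibonacci values starting at 0), the `SCALE` constant, the nearest-attractor rule, and the 24-bit packing layout. The real FIBONACCI table and scale may differ.

```rust
/// Hypothetical scale: delta is stored in units of 1/SCALE.
pub const SCALE: f64 = 1024.0;

/// Assumed table: 0, 1, 2, 3, 5, 8, ... (40 distinct Fibonacci values).
pub fn fibonacci_table() -> [f64; 40] {
    let mut t = [0.0; 40];
    t[1] = 1.0;
    t[2] = 2.0;
    for i in 3..40 {
        t[i] = t[i - 1] + t[i - 2];
    }
    t
}

/// Pick the nearest attractor, then store the residual as a scaled i16.
/// Residuals outside the i16 range are clamped, so far-off values lose precision.
pub fn quantize(x: f64, table: &[f64; 40]) -> (u8, i16) {
    let mut best = 0usize;
    for (i, &a) in table.iter().enumerate() {
        if (x - a).abs() < (x - table[best]).abs() {
            best = i;
        }
    }
    let delta = ((x - table[best]) * SCALE)
        .round()
        .clamp(i16::MIN as f64, i16::MAX as f64) as i16;
    (best as u8, delta)
}

/// Inverse of `quantize`: attractor + delta / scale.
pub fn dequantize(index: u8, delta: i16, table: &[f64; 40]) -> f64 {
    table[index as usize] + delta as f64 / SCALE
}

/// Pack into the low 24 bits of a u32: index in bits 16..24, delta in bits 0..16.
pub fn pack(index: u8, delta: i16) -> u32 {
    ((index as u32) << 16) | (delta as u16 as u32)
}

/// Inverse of `pack`.
pub fn unpack(packed: u32) -> (u8, i16) {
    ((packed >> 16) as u8, (packed & 0xFFFF) as u16 as i16)
}
```

Under these assumptions the round-trip error is at most 0.5 / SCALE whenever the residual fits in an i16, which is the property step 3's CPU-side validation would lean on.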

**Expected payoff**: 1.3-2× on memory-bandwidth-bound matmuls. Substrate encoding has structured (not random) quant noise, which the model may train around better than uniform i8 quantization.

**Why not shipped this chapter**: substantial cross-layer work (quantizer in Rust + WGSL changes + bench harness). Each piece is straightforward; together they come to ~half a day.

### #8 CRT-PE-keyed sparse attention matmul

**Goal**: for `scores = q @ k^T` where k is the CRT-PE table, only compute output cells where the CRT-substrate distance between (row, col) is small. Skip far pairs (they softmax to ~0 anyway).

**What needs to happen**:
1. CSR or coordinate-list sparse output buffer.
2. WGSL kernel that walks the query row, computes substrate-distance to each candidate col, skips above threshold.
3. Backward needs to scatter the sparse gradient back into a dense q grad. Doable but non-trivial.
4. Bench at seq_len=512+ where the sparsity payoff is large.
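
A CPU reference for the forward half (steps 1 and 2) might look like the sketch below, using a coordinate-list output. The CRT-substrate distance is not defined in this doc, so the sketch takes it as a caller-supplied closure; `ScoreEntry`, `sparse_scores`, and the threshold semantics are hypothetical names and choices, not existing OMC APIs.

```rust
/// One computed cell of the sparse score matrix, in coordinate-list (COO) form.
#[derive(Debug, Clone, PartialEq)]
pub struct ScoreEntry {
    pub row: usize,
    pub col: usize,
    pub value: f64,
}

/// Compute q @ k^T only for (row, col) pairs whose distance is within `threshold`.
/// `q` and `k` are row-major with shape (seq_len, d). `distance` stands in for
/// the CRT-substrate distance, which this sketch does not define.
pub fn sparse_scores<D>(
    q: &[f64],
    k: &[f64],
    seq_len: usize,
    d: usize,
    distance: D,
    threshold: f64,
) -> Vec<ScoreEntry>
where
    D: Fn(usize, usize) -> f64,
{
    assert_eq!(q.len(), seq_len * d);
    assert_eq!(k.len(), seq_len * d);
    let mut out = Vec::new();
    for r in 0..seq_len {
        for c in 0..seq_len {
            // Far pairs are skipped entirely: no dot product, no output cell.
            if distance(r, c) > threshold {
                continue;
            }
            let dot: f64 = (0..d).map(|i| q[r * d + i] * k[c * d + i]).sum();
            out.push(ScoreEntry { row: r, col: c, value: dot });
        }
    }
    out
}
```

Note that the distance check still runs for every pair, which is the cost the payoff estimate for seq_len=64 refers to; a real kernel would want a candidate list per row rather than a full scan.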

**Expected payoff**: 5-20× on attention computation at long sequences; minimal/negative at seq_len=64 because the substrate-distance check costs more than the saved MACs.

**Why not shipped**: real WGSL work for a sparse kernel + the OMC tape op needs sparse-aware backward. Half-day to a day of focused work.

### #9 omnimcode-codegen LLVM JIT for hot Prometheus paths

**Goal**: JIT-compile hot OMC functions (the `forward_window`, `train_arm` outer loops) to native via the existing omnimcode-codegen crate.

**What needs to happen**:
1. Identify Prometheus orchestration functions that are JIT-eligible (no tape mutation? no closures? need to check).
2. Currently the JIT path is opt-in via OMC_HBIT_JIT=1; it needs testing on tape-using code.
3. Tape ops are already in Rust; JIT'ing the OMC orchestration loop around them would compress the 10-50% of time still spent in OMC interp.

**Expected payoff**: 1.5-3× if the OMC orchestration overhead is non-trivial; near-zero if tape ops dominate (which v0.8.4 indicated they do at d_model=256).

**Why not shipped**: needs JIT compatibility audit of the Prometheus code path. Likely several hours of debugging if JIT chokes on prom_* fns.

### #10 f16/bfloat16 GPU paths

**Goal**: a second WGSL kernel variant taking f16 inputs. Halves the memory bandwidth, may halve the latency on bandwidth-bound shapes.

**What needs to happen**:
1. New WGSL kernel using `f16` type (or `i16`/`u16` packed).
2. f64 → f16 conversion at the boundary; verify training stability.
3. wgpu may need a feature flag for f16.

**Expected payoff**: ~2× on bandwidth-bound shapes (large weight matrices). Training stability is the open question: PyTorch trains f16 with loss scaling, which we'd need to replicate.

**Why not shipped**: requires loss-scaling logic for training stability. Substantial cross-layer work.
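
For a sense of scale, the loss-scaling logic mentioned above is small on its own. The sketch below shows one common dynamic scheme: multiply the loss by a scale before backward, skip the step and halve the scale when any gradient is non-finite, and double the scale after a run of clean steps. `LossScaler` and its parameters are hypothetical; nothing like it exists in OMC yet, and the factors of 2 are an assumption.

```rust
/// Hypothetical dynamic loss scaler for reduced-precision training.
pub struct LossScaler {
    pub scale: f64,
    growth_interval: u32,
    good_steps: u32,
}

impl LossScaler {
    pub fn new(initial_scale: f64, growth_interval: u32) -> Self {
        LossScaler { scale: initial_scale, growth_interval, good_steps: 0 }
    }

    /// Multiply the loss before backward so small gradients survive f16.
    pub fn scale_loss(&self, loss: f64) -> f64 {
        loss * self.scale
    }

    /// Unscale gradients in place. Returns false when the optimizer step
    /// should be skipped because a gradient overflowed to inf or NaN.
    pub fn unscale_and_update(&mut self, grads: &mut [f64]) -> bool {
        if grads.iter().any(|g| !g.is_finite()) {
            // Overflow: back off the scale (never below 1) and skip this step.
            self.scale = (self.scale / 2.0).max(1.0);
            self.good_steps = 0;
            return false;
        }
        for g in grads.iter_mut() {
            *g /= self.scale;
        }
        self.good_steps += 1;
        if self.good_steps >= self.growth_interval {
            // A run of clean steps: try a larger scale next time.
            self.scale *= 2.0;
            self.good_steps = 0;
        }
        true
    }
}
```

The cross-layer cost is in wiring this through the tape (scaling the loss node, checking every parameter gradient) and in the f16 kernels themselves, not in the scaler.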

## What the "try → if failed, reformulate → try again" record looks like

- #1 cross-entropy: tried (cheap), shipped; small visible wall-clock gain at vocab=32 (the test scale), bigger gain expected at vocab=10k+
- #2 embedding lookup: tried, shipped; same story (small at our vocabs, big at larger)
- #3 softmax through GPU: tried with the scaffold; **reformulated** the goal once measurement showed memory-bound element-wise ops won't benefit at our shapes; shipped the scaffold so larger-scale or different-hardware runs can opt in
- #4 OMC_VM=1: tried with zero code (free experiment), **negative result**, recorded and not pursued; that's the correct "fail forward"
- #5 multi-head substrate-K: tried, shipped, -0.25% with 2/3 wins (directionally consistent with PyTorch L1-MH -8.94%)
- #6 substrate_resample fused: tried, shipped, eliminates tape_value round-trip
- #7-10: scoped honestly above. Each is its own chapter.

## Velocity

Five items + scaffold of one = 6/10 of the v0.8.5 plan in one chapter. The 4 remaining are each substantial enough to deserve focused attention rather than being rushed in this same chapter.

Rome wasn't built overnight; v0.8 was built across 6 chapters this week.

omnimcode-cli/src/main.rs

Lines changed: 23 additions & 0 deletions
@@ -1373,6 +1373,29 @@ fn install_gpu_matmul_accelerator() {
             }
         },
     ));
+
+    // v0.8.6: softmax accelerator scaffold. Per-row softmax is memory-bound
+    // and the GPU rarely wins on shapes Prometheus exercises at d_model ≤ 256
+    // (e.g. 64×64 scores ≈ 4k cells = microseconds of CPU work). Wired so
+    // larger-shape runs and future hardware can opt in via OMC_GPU_SOFTMAX_MIN_CELLS,
+    // but the default threshold is high enough that this declines on every
+    // Prometheus call today. The architectural slot exists; the win waits.
+    let softmax_threshold: usize = std::env::var("OMC_GPU_SOFTMAX_MIN_CELLS")
+        .ok()
+        .and_then(|s| s.parse().ok())
+        .unwrap_or(1_000_000); // intentionally high: opt-in scaffolding
+    let _ = omnimcode_core::accel::register_softmax_accelerator(Box::new(
+        move |_rows: usize, _cols: usize, _input: &[f64]| {
+            if _rows * _cols < softmax_threshold {
+                return None;
+            }
+            // Decline for now: actual GPU softmax kernel is a v0.8.7+
+            // candidate. Path A: WGSL with per-row threadgroup reduce.
+            // Path B: f64 → f32 → GPU softmax → f32 → f64 round-trip.
+            // Both are scoped but not in this chapter.
+            None
+        },
+    ));
 }
 
 #[cfg(test)]
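
The threshold logic in this hunk can be read in isolation as the sketch below. The helper names `parse_min_cells` and `meets_threshold` are hypothetical and do not appear in the commit, and `saturating_mul` is a defensive substitution for the plain `*` used above. Note that in the committed stub both branches return `None`, so the threshold only starts to matter once a real kernel replaces the final `None`.

```rust
/// Mirrors the env-var handling above: a missing or unparsable value
/// falls back to the intentionally high default of 1_000_000 cells.
pub fn parse_min_cells(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.parse().ok()).unwrap_or(1_000_000)
}

/// True when a (rows x cols) softmax is large enough to be offered to the GPU.
pub fn meets_threshold(rows: usize, cols: usize, min_cells: usize) -> bool {
    rows.saturating_mul(cols) >= min_cells
}
```

At today's 64×64 scores (4096 cells) the default declines; a 1024×1024 input (1,048,576 cells) would clear it, and setting the variable to 4096 would let the 64×64 case through.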

omnimcode-core/src/accel.rs

Lines changed: 25 additions & 0 deletions
@@ -26,14 +26,32 @@ pub type MatmulAccelerator = Box<
         + Send + Sync,
 >;
 
+/// Per-row softmax accelerator. Receives `(rows, cols, input_row_major)`,
+/// returns the same-shape output. Per-row stable softmax. Same hook
+/// pattern as matmul. As of v0.8.6 this exists primarily as scaffolding:
+/// per-row softmax is memory-bound and GPU rarely wins at the shapes
+/// Prometheus exercises today (e.g. 64×64 scores at d_model=256). Wired
+/// here so larger-scale runs and future hardware can opt in without
+/// touching omnimcode-core.
+pub type SoftmaxAccelerator = Box<
+    dyn Fn(usize, usize, &[f64]) -> Option<Result<Vec<f64>, String>>
+        + Send + Sync,
+>;
+
 static MATMUL_ACCELERATOR: OnceLock<MatmulAccelerator> = OnceLock::new();
+static SOFTMAX_ACCELERATOR: OnceLock<SoftmaxAccelerator> = OnceLock::new();
 
 /// Register a matmul accelerator. Idempotent: second call is a no-op,
 /// matching `OnceLock::set` semantics. Call once during binary startup.
 pub fn register_matmul_accelerator(f: MatmulAccelerator) -> Result<(), &'static str> {
     MATMUL_ACCELERATOR.set(f).map_err(|_| "matmul accelerator already registered")
 }
 
+/// Register a softmax accelerator. Same semantics as matmul registration.
+pub fn register_softmax_accelerator(f: SoftmaxAccelerator) -> Result<(), &'static str> {
+    SOFTMAX_ACCELERATOR.set(f).map_err(|_| "softmax accelerator already registered")
+}
+
 /// Internal: used by `interpreter::tape_matmul`. Returns
 /// `Some(Result<Vec<f64>, String>)` when the accelerator committed,
 /// `None` when no accelerator is registered OR the registered one
@@ -43,3 +61,10 @@ pub(crate) fn try_accelerated_matmul(
 ) -> Option<Result<Vec<f64>, String>> {
     MATMUL_ACCELERATOR.get().and_then(|f| f(m, k, n, a, b))
 }
+
+/// Internal: used by `interpreter` for `tape_softmax`.
+pub(crate) fn try_accelerated_softmax(
+    rows: usize, cols: usize, input: &[f64],
+) -> Option<Result<Vec<f64>, String>> {
+    SOFTMAX_ACCELERATOR.get().and_then(|f| f(rows, cols, input))
+}
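
To see the hook pattern end to end without the rest of the crate, here is a standalone sketch that re-declares the same type and registration functions and adds a CPU per-row stable softmax as the fallback. `softmax_rows` and `softmax_with_fallback` are hypothetical names; the real fallback is the CPU path inside the interpreter's `tape_softmax`, which this sketch only approximates.

```rust
use std::sync::OnceLock;

pub type SoftmaxAccelerator =
    Box<dyn Fn(usize, usize, &[f64]) -> Option<Result<Vec<f64>, String>> + Send + Sync>;

static SOFTMAX_ACCELERATOR: OnceLock<SoftmaxAccelerator> = OnceLock::new();

/// First registration wins; later calls get an error, per `OnceLock::set`.
pub fn register_softmax_accelerator(f: SoftmaxAccelerator) -> Result<(), &'static str> {
    SOFTMAX_ACCELERATOR.set(f).map_err(|_| "softmax accelerator already registered")
}

/// `None` means "nothing registered" or "the accelerator declined".
pub fn try_accelerated_softmax(
    rows: usize,
    cols: usize,
    input: &[f64],
) -> Option<Result<Vec<f64>, String>> {
    SOFTMAX_ACCELERATOR.get().and_then(|f| f(rows, cols, input))
}

/// CPU reference: per-row stable softmax (row max, exp and sum, normalize).
pub fn softmax_rows(rows: usize, cols: usize, input: &[f64]) -> Result<Vec<f64>, String> {
    if input.len() != rows * cols {
        return Err(format!("expected {} cells, got {}", rows * cols, input.len()));
    }
    let mut out = vec![0.0; input.len()];
    for r in 0..rows {
        let src = &input[r * cols..(r + 1) * cols];
        let dst = &mut out[r * cols..(r + 1) * cols];
        let max = src.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
        let mut sum = 0.0;
        for (o, &x) in dst.iter_mut().zip(src) {
            *o = (x - max).exp();
            sum += *o;
        }
        for o in dst.iter_mut() {
            *o /= sum;
        }
    }
    Ok(out)
}

/// Consult the hook first; fall through to the CPU path when it declines.
pub fn softmax_with_fallback(rows: usize, cols: usize, input: &[f64]) -> Result<Vec<f64>, String> {
    match try_accelerated_softmax(rows, cols, input) {
        Some(result) => result.map_err(|e| format!("tape_softmax accelerated: {}", e)),
        None => softmax_rows(rows, cols, input),
    }
}
```

The `Option<Result<..>>` return is what keeps "declined" distinct from "tried and failed": a decline silently uses the CPU path, while an accelerator error is surfaced to the caller rather than masked by a fallback.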

omnimcode-core/src/interpreter.rs

Lines changed: 14 additions & 0 deletions
@@ -6872,6 +6872,20 @@ impl Interpreter {
                 }
                 let a = self.eval_expr(&args[0])?.to_int() as usize;
                 let av = self.autograd_tape[a].value.clone();
+                // Try the GPU softmax accelerator first (v0.8.6 scaffold).
+                // The accelerator may decline (return None) for small shapes,
+                // in which case the CPU triple-pass below runs.
+                if let Some(result) = crate::accel::try_accelerated_softmax(av.rows, av.cols, &av.data) {
+                    return result.map(|data| {
+                        let out = TapeMat { rows: av.rows, cols: av.cols, data };
+                        let grad = TapeMat::zeros(av.rows, av.cols);
+                        let id = self.autograd_tape.len();
+                        self.autograd_tape.push(TapeNode {
+                            op: TapeOp::Softmax(a), value: out, grad,
+                        });
+                        Value::HInt(HInt::new(id as i64))
+                    }).map_err(|e| format!("tape_softmax accelerated: {}", e));
+                }
                 let mut out = TapeMat::zeros(av.rows, av.cols);
                 for r in 0..av.rows {
                     // Row max for numerical stability.
