Releases: RandomCoder-lab/OMC
v0.8.5 — substrate ops, embedding lookup, cross-entropy fused, multi-head substrate-K
Five v0.8.5 optimization items shipped. The compound v0.8.4 + v0.8.5 effect: training-loop hot path is now fully in Rust builtins; the math-equivalent multi-head substrate-K attention is available; the architecture is positioned for v0.8.6+ to push remaining items (substrate-quantized GPU weights, sparse attention, etc.).
What's new
#1 tape_cross_entropy_batch — fused tape op
Per-batch cross-entropy as one tape node. Closed-form (p - one_hot)/N backward replaces the chain through 5 intermediate nodes (softmax → log → mask → mul → sum). Wins materialize at large vocab.
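The closed-form backward can be illustrated with a minimal pure-Python sketch. It assumes mean reduction over the N rows of the batch; the function names are illustrative and are not the builtin's actual interface.

```python
import math

def softmax(row):
    """Numerically stable softmax of one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy_batch(logits, targets):
    """Mean cross-entropy over N rows, plus the gradient w.r.t. the logits.

    Assumption: mean reduction over the batch (hence the 1/N factor).
    """
    n = len(logits)
    loss = 0.0
    grad = []
    for row, t in zip(logits, targets):
        p = softmax(row)
        loss -= math.log(p[t])
        # (p - one_hot) / N, without materializing the one-hot row
        grad.append([(pj - (1.0 if j == t else 0.0)) / n for j, pj in enumerate(p)])
    return loss / n, grad

logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
targets = [0, 2]
loss, grad = cross_entropy_batch(logits, targets)
```

Each gradient row sums to zero, which is a quick sanity check that the softmax and one-hot terms cancel as expected.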
#2 tape_embedding_lookup — direct row gather
Replaces prom_embedding_batch's OMC-built [N, vocab] one-hot + matmul chain with a direct row gather. Backward scatters rows of dy back into the table gradient (same gradient as the one-hot @ table chain). Wins scale with vocab size.
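A small sketch of why the row gather and the one-hot matmul agree, in plain Python. The helper names are hypothetical and only mirror the description above; they are not the OMC builtins.

```python
def embedding_lookup(table, ids):
    """Forward: copy one table row per token id."""
    return [table[i][:] for i in ids]

def embedding_backward(dy, ids, vocab, dim):
    """Backward: scatter-add each row of dy into the gradient row of its id."""
    d_table = [[0.0] * dim for _ in range(vocab)]
    for row, i in zip(dy, ids):
        for j in range(dim):
            d_table[i][j] += row[j]  # repeated ids accumulate
    return d_table

def one_hot_matmul(table, ids):
    """Reference: build [N, vocab] one-hot rows and multiply by the table."""
    vocab, dim = len(table), len(table[0])
    out = []
    for i in ids:
        one_hot = [1.0 if k == i else 0.0 for k in range(vocab)]
        out.append([sum(one_hot[k] * table[k][j] for k in range(vocab))
                    for j in range(dim)])
    return out

table = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
ids = [2, 0, 2]
gathered = embedding_lookup(table, ids)
reference = one_hot_matmul(table, ids)
d_table = embedding_backward([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]], ids, 3, 2)
```

The gather touches N rows, while the reference touches N × vocab cells, which is where the vocab-size scaling comes from.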
#4 OMC_VM=1 negative finding
Measured: 0.662 s/step at d_model=256 (was 0.661 tree-walk). No win once hot paths are in Rust builtins. Not pursued further for Prometheus — the bytecode VM optimizes basic-block dispatch, but the hot work is now happening below that layer.
#5 Multi-head substrate-K attention — prom_attention_substrate_k_mh_*
Math-equivalent "sum of per-head W_O projections" form (avoids needing a tape_concat op). All single-head toggles (smod_alpha, v_resample_scale, q6_mode) honored per-head with same defaults.
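The equivalence rests on a block-matrix identity: concatenating the head outputs and multiplying by W_O gives the same result as summing each head's product with its own row block of W_O. A minimal sketch with toy shapes (2 heads, d_head=2, d_model=3; all values are made up for illustration):

```python
def matmul(a, b):
    """Plain matrix product for lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def concat_then_project(heads, w_o):
    """Reference form: concatenate along features, then one matmul."""
    concat = [sum((h[t] for h in heads), []) for t in range(len(heads[0]))]
    return matmul(concat, w_o)

def sum_of_head_projections(heads, w_o):
    """Concat-free form: each head multiplies its own row block of W_O."""
    d_head = len(heads[0][0])
    seq, d_model = len(heads[0]), len(w_o[0])
    out = [[0.0] * d_model for _ in range(seq)]
    for h, head in enumerate(heads):
        block = w_o[h * d_head:(h + 1) * d_head]
        proj = matmul(head, block)
        for t in range(seq):
            for j in range(d_model):
                out[t][j] += proj[t][j]
    return out

heads = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.5, -1.0], [2.0, 0.0]],
]
w_o = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0], [3.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
reference = concat_then_project(heads, w_o)
summed = sum_of_head_projections(heads, w_o)
```

Because the second form needs only matmul and add, it can be expressed with existing tape ops, which matches the stated reason for avoiding a concat op.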
Cross-validation at d_model=32, 4 heads (d_head=8), 400 steps, 3 seeds:
| | mean tail loss | wins |
|---|---|---|
| SH (single head) | 2.0047 | — |
| MH (4 heads) | 1.9998 | 2/3 |
Δ = −0.25%, directionally consistent with PyTorch's L1-MH −8.94%. Effect grows with capacity; same code path supports the PyTorch −12.15% Q6-MH finding once you turn on q6_mode="fused".
#6 tape_substrate_resample — fused tape op
Skips tape_value → modulator_matrix → tape_const round-trip (which was extracting 16k f64s at d_model=256 seq_len=64 per call). Pairs with the substrate_resample_matrix Rust builtin from v0.8.4. Same math.
Honest framing
Wall-clock at d_model=256 is essentially unchanged from v0.8.4 for these five items in isolation — that scale was already AdamW-bound and the OMC overhead was already removed. These wins materialize when:
- Vocab grows large — cross-entropy and embedding lookup get O(vocab) cheaper
- Multi-head trained — the architectural win + the OMC-overhead-gone substrate-attention compose
- Bigger d_model — fused substrate_resample skips proportionally more I/O
The MH cross-validation result is the load-bearing finding here: the PyTorch L1-MH win cross-validates in OMC's tape autograd.
What's still on the v0.8.5 list
- #3 Route more tape ops through GPU — modest win at current scales (memory-bound ops aren't matmul-class), scaffold to be added in v0.8.6
- #7 Substrate-quantized GPU weights — own chapter
- #8 CRT-PE-keyed sparse attention matmul — own chapter
- #9 LLVM JIT for tape paths — own chapter
- #10 f16/bf16 GPU paths — own chapter
Tests
1111/1111 OMC tests pass.
Files
- omnimcode-core/src/interpreter.rs — tape_cross_entropy_batch, tape_embedding_lookup, tape_substrate_resample builtins + backwards
- examples/lib/prometheus.omc — wrappers + prom_attention_substrate_k_mh_*
- examples/prometheus_mh_xval.omc — SH vs MH cross-validation harness
v0.8.4 — Substrate Rust builtins: 40× CPU / 96× GPU end-to-end on Prometheus
Headline
Three Rust builtins replace OMC-side inner-loop helpers. The fused substrate_adamw_update is the actual bottleneck killer — replaces ~15 element-wise loops per parameter with one tight Rust loop. Combined with v0.8.2 (GPU integration) and v0.8.3 (substrate-shaped 8×32 tile), the three chapters compound to give the first real end-to-end Prometheus training speedup.
| | CPU s/step | GPU s/step | speedup vs v0.8.2 |
|---|---|---|---|
| v0.8.2 baseline | 25.81 | 25.88 | 1.00× |
| v0.8.4 modulators only | 26.38 | 26.28 | 0.98× ← no change |
| v0.8.4 + fused AdamW | 0.65 | 0.27 | 40× / 96× |
Same d_model=256 substrate-K transformer, same 5-step training, same final loss (6.95930 ± 5e-5 GPU roundtrip noise). Identical training trajectory, 96× faster on GPU.
The honest story
Initial guess was that the substrate-modulator matrix construction (_prom_smod_matrix, _prom_substrate_resample_matrix) was the bottleneck. Both got ported to Rust first — wall-clock did not move. Useful debugging finding, not a chapter on its own.
Profiling-by-fixing found the real bottleneck in prom_adamw_step: ~15 OMC-side element-wise loops per parameter per step. At 6 params of 256×256 cells, that's ~6M OMC ops per step. Replacing the inner block with one Rust builtin produced the 40× / 96× drop.
Both ports shipped — modulators because they're architecturally cleaner and verified correct, AdamW because it's the actual win.
The compound effect
- v0.8.2 wired GPU in. End-to-end null result — OMC overhead dominated.
- v0.8.3 found the substrate-shaped 8×32 tile (114 GFLOPS vs 71 at 1024²). Kernel-level win, no end-to-end change.
- v0.8.4 removes the OMC overhead. Both prior chapters finally pay out:
  - The GPU/CPU split is now 2.4× (the actual matmul speedup at d_model=256)
  - The 8×32 substrate-shaped tile is doing real work in production training
Future scale-ups (d_model=512+, batched inference, longer sequences, multi-block) get both the OMC-overhead-gone benefit AND the substrate-GPU acceleration.
What this unlocks immediately
- L1-MH + S-MOD α=1.0 in pure-OMC Prometheus — was unblocked by v0.8.1's broadcast-backward fix; now practical to run (seconds per step rather than minutes)
- Larger-scale substrate-attention — d_model=512+, multi-block, longer sequences
- Q6 cross-validation at real training length — v0.8.1's OMC-side Q6 result was at 80 steps; can now run 5000+ step training
API
Three new builtins:
```omc
# Per-cell S-MOD modulator (alpha=0 → 1 everywhere)
substrate_smod_matrix(scores_2d, alpha)
# Per-cell substrate-V resample modulator (scale != 0)
substrate_resample_matrix(v_2d, scale)
# Fused AdamW per-parameter update — mutates m, v in place
substrate_adamw_update(cur, grad, m, v, lr, b1, b2, eps, wd, step)
```
prom_adamw_step in prometheus.omc now uses the fused builtin internally. Public AdamW interface is unchanged; any existing Prometheus training script picks up the speedup automatically.
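To make the "one tight loop" idea concrete, here is a sketch of a fused per-parameter update in Python. It assumes the textbook AdamW formulation (bias-corrected moments, decoupled weight decay) and a 1-based step; whether substrate_adamw_update uses exactly these conventions, or returns the new values this way, is not stated above, so treat both as assumptions.

```python
import math

def adamw_update(cur, grad, m, v, lr, b1, b2, eps, wd, step):
    """Single pass over a flat parameter.

    Mutates m and v in place and returns the updated parameter values.
    Assumes standard AdamW with bias correction; step starts at 1.
    """
    bc1 = 1.0 - b1 ** step
    bc2 = 1.0 - b2 ** step
    out = [0.0] * len(cur)
    for i in range(len(cur)):
        g = grad[i]
        m[i] = b1 * m[i] + (1.0 - b1) * g        # first moment
        v[i] = b2 * v[i] + (1.0 - b2) * g * g    # second moment
        m_hat = m[i] / bc1
        v_hat = v[i] / bc2
        # Adam step plus decoupled weight decay, both scaled by lr
        out[i] = cur[i] - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * cur[i])
    return out

m, v = [0.0], [0.0]
new = adamw_update([1.0], [0.5], m, v, lr=0.1, b1=0.9, b2=0.999,
                   eps=1e-8, wd=0.0, step=1)
```

Written as separate element-wise passes, the same arithmetic would need on the order of a dozen loops per parameter, which is the overhead the fused form removes.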
Files
- omnimcode-core/src/interpreter.rs — three builtins + flatten/rebuild helpers
- examples/lib/prometheus.omc — _prom_smod_matrix / _prom_substrate_resample_matrix wrappers; prom_adamw_step inner block calls the fused builtin
- examples/tests/test_substrate_modulator_builtins.omc — 8 unit tests verifying equivalence
- experiments/prometheus_parity/SUBSTRATE_BUILTINS_WIN.md — full writeup
1111/1111 OMC tests pass.
Reproduction
```bash
cargo build --release -p omnimcode-cli --features gpu
# CPU baseline (now fast)
OMC_GPU_BACKEND=cpu ./target/release/omnimcode-standalone examples/bench_prometheus_gpu.omc
# GPU (now wins)
OMC_GPU_BACKEND=wgpu ./target/release/omnimcode-standalone examples/bench_prometheus_gpu.omc
```
v0.8.3 — Substrate-shaped GPU matmul wins +38% vs conventional 16×16
Headline
Anisotropic 8×32 tiles (Fibonacci-aligned short dim, wavefront-divisor long dim) decisively beat the conventional square 16×16 tile on an AMD RX 580 over Vulkan. At 1024² matmul: 18.81 ms vs 30.31 ms, which is 38% less wall-clock time and 1.61× the GFLOPS.
The substrate's role here isn't to fight hardware physics. It's to direct exploration toward configurations conventional GPU programming would never test. Nobody writes 8×32 for matmul by convention. The substrate said "try 8 first," the 9-variant sweep found that 8 paired with a wavefront-divisor long axis dominates, and now that's the default.
The sweep
9 variants × 3 sizes on AMD RX 580 / RADV Vulkan. 1 warmup + 5 timed iterations averaged. Parity verified (max_abs_diff < 1e-2) on every cell.
1024×1024×1024 (the most decisive case)
| variant | ms | GFLOPS | vs 16×16 |
|---|---|---|---|
| 16×16 linear-K REF | 30.31 | 70.85 | ref |
| 8×32 linear-K aniso | 18.81 | 114.19 | +61% ← winner |
| 8×16 linear-K aniso | 18.99 | 113.10 | +60% |
| 8×8 linear-K (1WF, Fib) | 22.30 | 96.29 | +36% |
| 13×13 linear-K (3WF) | 37.61 | 57.11 | -19% |
| 21×21 linear-K (7WF) | 46.43 | 46.25 | -35% |
| 32×8 linear-K aniso | 42.20 | 50.89 | -28% |
| 16×16 Fib-K-stride | 29.74 | 72.20 | +1.9% |
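The GFLOPS column follows from the timings if a square N×N×N matmul is counted as 2·N³ floating-point operations (one multiply and one add per inner-product term). That counting convention is an assumption here, but it reproduces the reported figures to within rounding of the millisecond values:

```python
def matmul_gflops(n, ms):
    """GFLOPS for an n x n x n matmul that took `ms` milliseconds.

    Assumes the usual 2 * n^3 operation count.
    """
    flops = 2 * n ** 3
    return flops / (ms * 1e-3) / 1e9

ref = matmul_gflops(1024, 30.31)     # 16x16 reference tile
aniso = matmul_gflops(1024, 18.81)   # 8x32 anisotropic tile
ratio = aniso / ref
```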
The pattern
- Anisotropic 8×N (Fib-short × wavefront-long) wins decisively. 8×32 = 256 threads = exactly 4 wavefronts. Short dim is Fib-8 (= half wavefront, fits L1 cache line). Long dim is a cache-line multiple AND maps to N (the output-column axis) for coalesced writes.
- The 32×8 transpose LOSES to 16×16 by 28%: same total threads, but the wavefront-aligned axis is now M (rows) and writes become strided. Substrate wins only when it pairs with hardware constraints, not against them.
- Pure-square Fibonacci tiles LOSE. 13×13 = 3 wavefronts × 64 with 23 idle lanes (12% waste). 21×21 = 7 wavefronts hurts occupancy. Fib alone isn't enough — needs to align with wavefront divisors.
- Fib-K-stride is a wash — substrate-shaped reduction order doesn't matter; tile geometry does.
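The occupancy arithmetic in the bullets above can be checked directly. This sketch assumes a 64-lane wavefront, as stated for this hardware, and that a workgroup occupies whole wavefronts:

```python
WAVEFRONT = 64  # lanes per wavefront on the hardware discussed here

def occupancy(tx, ty, wavefront=WAVEFRONT):
    """Return (threads, wavefronts, idle lanes, idle fraction) for a tile."""
    threads = tx * ty
    waves = -(-threads // wavefront)      # ceiling division
    idle = waves * wavefront - threads
    return threads, waves, idle, idle / (waves * wavefront)

for tile in [(8, 32), (16, 16), (8, 8), (13, 13), (21, 21)]:
    threads, waves, idle, waste = occupancy(*tile)
    print(f"{tile[0]}x{tile[1]}: {threads} threads, {waves} wavefronts, "
          f"{idle} idle lanes ({waste:.0%} waste)")
```

Idle lanes alone do not explain every row of the table (21×21 wastes only 7 lanes), which is consistent with the note that its cost comes from occupancy rather than lane waste.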
The deeper thesis
The substrate-IS-the-architecture hypothesis: strong form falsified, weak form confirmed.
- Falsified: "any Fibonacci tile beats power-of-2 tiles." Wavefront geometry (64 lanes lockstep) is a hard constraint. Pure 13/21 tiles pay an occupancy tax.
- Confirmed: "substrate-aligned dimensions, when they don't fight hardware, beat conventional tiles." 8×32 has Fib-8 short dim AND wavefront-divisor long dim, and wins by 60% at 1024².
The substrate is the heuristic that directs you to configurations conventional wisdom skips. Conventional GPU programming would never test 8×32 vs 16×16 — it's "too small a tile." The substrate said try 8, and the answer came back: not 8×8 (loses at small sizes due to dispatch overhead), not 13×13 (occupancy loss), but 8×wavefront-aligned.
Adoption
omnimcode-cli's install_gpu_matmul_accelerator() now uses WgpuBackend::with_tile_xy(8, 32) by default. Tunable via OMC_GPU_TILE_X / OMC_GPU_TILE_Y for hardware-specific A/Bs:
```bash
# Use the substrate-shaped default (8×32)
./omnimcode-standalone yourcode.omc
# Try a different tile for testing
OMC_GPU_TILE_X=4 OMC_GPU_TILE_Y=16 ./omnimcode-standalone ... # NVIDIA warp=32 candidate
```
What's not yet tested
- Other anisotropic shapes (5×32, 5×40, 13×32, 8×64)
- Other GPU hardware: NVIDIA (warp=32), Apple M-series (different cache geometry). The hypothesis: 4×16 or 8×16 might win on NVIDIA
- Combined with substrate-quantized weights (data-layer substrate-shaping)
- Combined with sparse-via-substrate-distance (only computing high-value attention cells)
Files
- omnimcode-gpu/src/wgpu_backend.rs — WgpuBackend::with_tile_xy(tx, ty) and with_config(tx, ty, kernel); MatmulKernel::{Linear, FibKStride} enum; WGSL source-substitution for both tile and inner-loop body
- omnimcode-gpu/shaders/matmul.wgsl — parameterized template
- omnimcode-gpu/examples/bench_fib_tile.rs — 9-variant sweep harness
- omnimcode-cli/src/main.rs — default tile 8×32
- experiments/prometheus_parity/SUBSTRATE_GPU_WINS.md — full writeup
1103/1103 OMC tests pass.
Reproduction
```bash
cargo run --release -p omnimcode-gpu --features wgpu --example bench_fib_tile
```
v0.8.2 — GPU wired into Prometheus (kernel: 13×, end-to-end: bottleneck elsewhere)
What's new
GPU matmul acceleration wired into the OMC tape autograd via a pluggable hook. tape_matmul forward + backward now route through omnimcode-gpu's wgpu (Vulkan) backend when built with --features gpu and shapes cross the CPU/GPU crossover threshold.
Kernel-level result: 13×
5 sequential 512² matmuls in an OMC tape:
| backend | wall-clock | speedup |
|---|---|---|
| OMC_GPU_BACKEND=cpu | 3.47 s | 1.00× |
| OMC_GPU_BACKEND=wgpu (RX 580, Vulkan) | 0.27 s | 12.85× |
Parity: f64 → f32 → f64 round-trip differs at the 9th significant digit. Fine for Prometheus training.
End-to-end Prometheus result: unchanged at d_model=256
Substrate-K transformer, seq_len=64, d_model=256, ff_dim=512, 5 AdamW steps:
| | wall-clock | per step | loss |
|---|---|---|---|
| CPU | 129.05 s | 25.81 s | 6.95930 |
| wgpu | 129.39 s | 25.88 s | 6.95932 |
GPU and CPU are dead even end-to-end. Why: matmul wall-clock is single-digit milliseconds per step; OMC tree-walk iteration in the substrate-shaping helpers (_prom_smod_matrix, _prom_substrate_resample_matrix, Q6 modulation) is tens of seconds. GPU saves ~50ms; OMC burns ~25s. The ratio explains the 0% movement.
This chapter ships the integration, not an end-to-end speedup. Naming the wall IS the chapter — every future direction that needs more matmul work in the time budget now gets it for free.
Architecture
omnimcode-core can't depend on omnimcode-gpu (which already depends on -core — would be a cycle). Solved with a OnceLock MatmulAccelerator hook in core::accel that the outer binary registers at startup. The hook signature uses raw (m, k, n, &[f64], &[f64]) so no core-internal types leak.
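The shape of this hook can be sketched in Python as an analogy. The real mechanism is a Rust OnceLock in core::accel; the names below, the 2·m·k·n FLOP estimate, and the way the threshold is applied are illustrative assumptions rather than the project's code.

```python
_ACCELERATOR = None  # set at most once, mimicking OnceLock semantics

def register_matmul_accelerator(fn):
    """Install the hook; only the first registration succeeds."""
    global _ACCELERATOR
    if _ACCELERATOR is not None:
        return False
    _ACCELERATOR = fn
    return True

def cpu_matmul(m, k, n, a, b):
    """Fallback triple loop over flat row-major buffers."""
    out = [0.0] * (m * n)
    for i in range(m):
        for p in range(k):
            aip = a[i * k + p]
            for j in range(n):
                out[i * n + j] += aip * b[p * n + j]
    return out

def matmul(m, k, n, a, b, min_flops=1_000_000):
    """Consult the hook for large shapes, else fall back to the CPU loop."""
    if _ACCELERATOR is not None and 2 * m * k * n >= min_flops:
        result = _ACCELERATOR(m, k, n, a, b)
        if result is not None:
            return result
    return cpu_matmul(m, k, n, a, b)

calls = []

def fake_gpu(m, k, n, a, b):
    """Stand-in for the outer binary's backend; records that it was used."""
    calls.append((m, k, n))
    return cpu_matmul(m, k, n, a, b)

first = register_matmul_accelerator(fake_gpu)
second = register_matmul_accelerator(fake_gpu)
small = matmul(2, 2, 2, [1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0])
forced = matmul(2, 2, 2, [1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], min_flops=0)
```

Passing only dimensions and flat float buffers across the boundary is what keeps the core free of any dependency on the GPU crate's types.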
Tunables
```
OMC_GPU_BACKEND=cpu|wgpu          # force a backend
OMC_GPU_MATMUL_MIN_FLOPS=N        # crossover threshold (default 1,000,000)
OMC_GPU_VERBOSE=1                 # log backend + threshold at startup
```
What this opens up
- v0.8.3 substrate-native GPU kernels: Fibonacci-tile workgroups (13×13, 21×21, 34×34 vs the conventional 16×16), substrate-quantized weights, CRT-PE-keyed sparse attention matmul. Same composed-vs-fused protocol as tape_phi_log from v0.8.1, applied at the GPU layer. The substrate-IS-the-architecture question at kernel-level.
- Bigger d_model (1024+): matmul time grows ~64× while OMC-side substrate ops grow ~4×, so the ratio inverts and GPU starts to win end-to-end.
- Substrate ops as Rust builtins (separate work): would dissolve today's bottleneck — the substrate helpers are pure compute, fit cleanly into the tape primitive pattern.
Files
- omnimcode-core/src/accel.rs — MatmulAccelerator hook + OnceLock + try_accelerated_matmul
- omnimcode-core/src/interpreter.rs — tape_matmul consults the hook before falling back to triple-loop
- omnimcode-cli/Cargo.toml — gpu feature pulls in omnimcode-gpu
- omnimcode-cli/src/main.rs — install_gpu_matmul_accelerator() at startup
- examples/bench_prometheus_gpu.omc — wall-clock harness
- experiments/prometheus_parity/GPU_INTEGRATION.md — full writeup
1103/1103 OMC tests pass.
v0.8.1 — substrate-native tape primitives + broadcast-backward fix
What's new
Two new tape autograd primitives and a latent backward-broadcast bug fix that unblocks S-MOD + substrate-K end-to-end training in OMC.
tape_phi_log(x, scale=10.0) — substrate-native fused op
ln(|x · scale| + 1) / (π · ln φ) in one tape node. Replaces the four-op composition (tape_abs → tape_mul_scalar → tape_log → tape_div_scalar) with a single op whose backward derives directly from the substrate basis. Defined at zero (boring tape_log(0) returns −∞), exposes π·ln φ at the AST level rather than hiding it in a scalar constant.
This is the precedent-setting substrate-native primitive. The protocol — composed reference + fused alternative + unit-level equivalence proof + end-to-end training A/B — can now be applied to other substrate-native fused ops (substrate_resample, attractor_snap, attractor-modulated-backward variants).
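For reference, the forward formula quoted above and the derivative it implies can be written out in a few lines of Python. This is a sketch derived from the stated formula only; using a zero subgradient at x = 0 is an assumption, since the text does not say what the OMC backward does at that point.

```python
import math

PHI = (1.0 + math.sqrt(5.0)) / 2.0
DENOM = math.pi * math.log(PHI)  # the pi * ln(phi) normalizer

def phi_log(x, scale=10.0):
    """ln(|x * scale| + 1) / (pi * ln phi); finite at x = 0."""
    return math.log(abs(x * scale) + 1.0) / DENOM

def phi_log_grad(x, scale=10.0):
    """d/dx of phi_log; returns 0 at x = 0 (assumed subgradient choice)."""
    u = x * scale
    sign = (u > 0) - (u < 0)
    return sign * scale / ((abs(u) + 1.0) * DENOM)

value_at_zero = phi_log(0.0)
slope = phi_log_grad(0.3)
```

The +1 inside the logarithm is what keeps the value at zero finite, in contrast to a plain log.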
tape_abs(x) — boring PyTorch parity
Element-wise |x|. Filled the obvious hole — the autograd tape had tape_log, tape_exp, tape_sin, etc., but no absolute value.
Pre-existing broadcast-backward bug, fixed in the same chapter
tape_div and tape_mul backwards panicked on column-broadcast operands. The prom_substrate_softmax α>0 path ends in tape_div(attn_unnorm[N, N], row_sums[N, 1]), and the backward indexed bv.at(i, j) for j up to N−1 in an [N, 1] matrix, which is out of bounds. This means S-MOD + substrate-K had never actually trained end-to-end in OMC; it would panic at the first backward pass.
Both backwards now iterate the dy shape, reduce indices against each operand's actual extent, and accumulate gradient sums across broadcast axes. L1-MH + S-MOD α=1.0 can finally cross-validate in pure-OMC Prometheus.
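A sketch of what such a broadcast-aware backward looks like for division, in plain Python. It follows the description above (iterate the dy shape, clamp each index to the operand's extent, accumulate across the broadcast axis) but is an illustration, not the interpreter's code.

```python
def div_backward(a, b, dy):
    """Gradients of y = a / b where either operand may have extent 1
    along an axis and is broadcast to the shape of dy."""
    rows, cols = len(dy), len(dy[0])
    da = [[0.0] * len(a[0]) for _ in a]
    db = [[0.0] * len(b[0]) for _ in b]
    for i in range(rows):
        for j in range(cols):
            # Clamp indices: an extent-1 axis always maps to index 0.
            ai, aj = min(i, len(a) - 1), min(j, len(a[0]) - 1)
            bi, bj = min(i, len(b) - 1), min(j, len(b[0]) - 1)
            av, bv = a[ai][aj], b[bi][bj]
            da[ai][aj] += dy[i][j] / bv
            db[bi][bj] -= dy[i][j] * av / (bv * bv)  # sums over broadcast axis
    return da, db

a = [[2.0, 4.0], [6.0, 8.0]]   # [N, N] numerator
b = [[2.0], [4.0]]             # [N, 1] row sums, broadcast across columns
dy = [[1.0, 1.0], [1.0, 1.0]]
da, db = div_backward(a, b, dy)
```

The gradient for the [N, 1] operand keeps its own shape because every column of dy is folded into the single cell it was broadcast from.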
A/B in pure-OMC Prometheus
examples/prometheus_q6_ab.omc, substrate-K transformer, seq_len=6, d_model=8, ff_dim=16, 80 AdamW steps, 3 seeds:
| | mean val | Δ vs off | composed − fused |
|---|---|---|---|
| off (no Q6) | 2.5692 | — | — |
| composed Q6 | 2.5530 | −0.0162 (−0.63%) | — |
| fused Q6 | 2.5530 | −0.0162 (−0.63%) | 1.2 × 10⁻⁷ |
Composed and fused agree to ~1e-7 after 80 forward+backward AdamW steps — floating-point accumulation-noise floor. The substrate-native primitive matches the boring composed reference exactly under training, confirming the abstraction is free.
Q6 itself wins 2/3 seeds at this tiny scale — first OMC-side cross-validation of the PyTorch Q6 finding (−12.15% 6/6 seeds at TinyShakespeare L1-MH).
Tests
- examples/tests/test_tape_abs_phi_log.omc — 12 primitive unit tests (forward, backward, edge cases, composed-vs-fused equivalence)
- examples/tests/test_q6_modulate.omc — 4 modulation-dispatch tests
Full suite: 1103/1103 OMC tests pass.
Files
- omnimcode-core/src/interpreter.rs — TapeOp::Abs, TapeOp::PhiLog(usize, f64), broadcast-aware Mul/Div backward
- examples/lib/prometheus.omc — prom_q6_modulate(q, scale, gamma, mode) + q6_mode field on substrate-K layer
- examples/prometheus_q6_ab.omc — A/B harness
- experiments/prometheus_parity/TAPE_PRIMITIVES_AB.md — full writeup
v0.3.1 — Symbolic compression: 3.8× smaller predict default + omc_fetch_by_hash
WHAT CHANGED
- omc_predict gains a format parameter:
  - hash (NEW DEFAULT, ~50 bytes/suggestion): fn_name + file +
    canonical_hash + prefix_match_len + substrate_distance
  - signature (~100 bytes): adds the fn signature line
  - full: complete source (previous default behavior)
- omc_fetch_by_hash(paths, canonical_hash) — companion tool.
  Recovers a function body by alpha-rename-invariant canonical hash.
  Returns {found, fn_name, file, source} or {found: false}.
MEASURED COMPRESSION
Same query (fn prom_attention_ × top_k=5) against prometheus.omc:
format=hash 1253 bytes 26.2% (3.8x smaller)
format=signature 1622 bytes 33.9%
format=full 4783 bytes 100% (v0.3 behavior)
The ratio widens on longer fns — top_k=5 over fns averaging 60
lines compresses ~10x.
WHY IT MATTERS
Canonical hash is alpha-rename invariant — recovery via
fetch_by_hash works even if the fn was renamed after the predict
call. The LLM workflow becomes: predict cheaply (hash), reason
over candidates, fetch only the body it commits to using.
Branching is now ~free at the context-budget level: 50 candidates
fit in the LLM's mind for the cost of 6-7 full bodies.
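The predict-then-fetch flow depends on the hash ignoring bound names. The toy below shows that idea only: it renames the function name and parameters positionally before hashing. It is not OMC's canonicalization algorithm, and the `fn name(params) { ... }` surface syntax, the tokenizer, and the returned fields are assumptions made for illustration.

```python
import hashlib
import re

TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def canonical_hash(source):
    """Hash a function after renaming its name and parameters to v0, v1, ...

    Toy scheme: only names bound in the header are renamed, so calls to
    other functions still distinguish otherwise similar bodies.
    """
    header = re.search(r"fn\s+(\w+)\s*\(([^)]*)\)", source)
    bound = [header.group(1)]
    bound += [p.strip() for p in header.group(2).split(",") if p.strip()]
    rename = {name: f"v{i}" for i, name in enumerate(bound)}
    tokens = [rename.get(t, t) for t in TOKEN.findall(source)]
    return hashlib.sha256(" ".join(tokens).encode()).hexdigest()

def fetch_by_hash(index, wanted):
    """Look a body up by canonical hash, mirroring the found / not-found shape."""
    if wanted in index:
        return {"found": True, "source": index[wanted]}
    return {"found": False}

original = "fn add(x, y) { return x + y }"
renamed = "fn plus(a,  b) {\n  return a + b\n}"
other = "fn sub(x, y) { return x - y }"

index = {canonical_hash(renamed): renamed}
hit = fetch_by_hash(index, canonical_hash(original))
miss = fetch_by_hash(index, canonical_hash(other))
```

A hash computed before the rename still finds the body stored after it, which is the property the workflow above relies on.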
NOW POSSIBLE
- LLM agents can hold 5-10x more candidate fns "in mind" per query.
- Repeated browsing across a corpus stays cheap.
- The substrate's content-addressed identity becomes a first-class
context-compression mechanism.
TESTS
13/13 MCP integration tests pass. 231 Rust pass, 1087/1087 OMC.
DEFERRED TO v0.4
- Wire substrate codec (omc_codec_encode 10-50x ratio) into the
  predict response path for full library-lookup compression.
- Substrate-keyed conversation memory via fibtier.
- omc_compress_context(text) MCP tool.
- Cross-corpus blending.
See CHANGELOG.md#v0.3.1-symbolic-compression for the chapter index.
v0.0.6 — Prometheus: pure-OMC ML framework, first substrate-K wins
WHAT CHANGED
- Tape-based reverse-mode autograd in pure OMC: tape_var, tape_const,
  tape_add, tape_matmul, tape_softmax, ~20 ops. Substrate-preserving.
- Prometheus framework: prom_linear, prom_relu, prom_softmax,
  prom_mse_loss, prom_sgd_step. Then AdamW, Embedding, LayerNorm,
  CRT-Fibonacci PE, Sequential, TransformerBlock composition.
- Multi-token batched forward: broadcast-aware tape ops, per-row
  mean/var, multi-token attention.
- TinyShakespeare end-to-end in pure OMC.
- Cross-framework parity bench: every Prometheus result reproduced
  in PyTorch (experiments/prometheus_parity/).
- Substrate-K (L1) wins single-head at TinyShakespeare scale:
  -8% val vs vanilla, 3/3 seeds. First substrate-component win
  that survives at real scale.
- PyTorch 10-seed + multi-block reproduction of substrate-L3
  (parameter-free attention) wins (-21.5% on toy data).
- Fibonacci-tier memory primitive (fibtier) — bounded power-law
  context buffer.
- Substrate-native agent demo: two agents conversing over
  OMC-PROTOCOL with persistent fibtier memory across simulated
  process restart.
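For readers unfamiliar with tape-based reverse-mode autograd, the core mechanism fits in a few lines. This scalar sketch is generic and is not the OMC tape; the class and method names are hypothetical and only echo the tape_var / tape_add naming above.

```python
class Tape:
    """Minimal scalar tape: nodes are integer ids in creation order."""

    def __init__(self):
        self.values = []
        self.parents = []  # per node: list of (parent id, local derivative)

    def _push(self, value, parents):
        self.values.append(value)
        self.parents.append(parents)
        return len(self.values) - 1

    def var(self, x):
        return self._push(x, [])

    def add(self, a, b):
        return self._push(self.values[a] + self.values[b], [(a, 1.0), (b, 1.0)])

    def mul(self, a, b):
        va, vb = self.values[a], self.values[b]
        return self._push(va * vb, [(a, vb), (b, va)])

    def backward(self, out):
        """Walk the tape in reverse, pushing gradients to parents."""
        grads = [0.0] * len(self.values)
        grads[out] = 1.0
        for i in range(out, -1, -1):
            for parent, local in self.parents[i]:
                grads[parent] += grads[i] * local
        return grads

tape = Tape()
x = tape.var(3.0)
y = tape.var(4.0)
z = tape.add(tape.mul(x, y), x)   # z = x*y + x
grads = tape.backward(z)
```

Because nodes are appended in evaluation order, a single reverse sweep visits every node after all of its consumers, so no explicit topological sort is needed.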
WHY IT MATTERS
OMC's substrate finally produces a measurable win on a real ML
training task at real scale, in both a pure-OMC implementation and
an independent PyTorch reproduction. The autograd + Prometheus
stack is the platform that the substrate-attention chapter (v0.1)
is built on top of.
NOW POSSIBLE
- Train a transformer end-to-end in pure OMC.
- Compare substrate variants apples-to-apples in PyTorch.
- Compose substrate primitives (codec + kernel + protocol +
Prometheus + fibtier) into a single working agent demo.
See CHANGELOG.md#v0.0.6-prometheus for the chapter index.
v0.0.5 — Substrate codec, kernel, OMC-PROTOCOL v1
WHAT CHANGED
- Substrate codec (omc_codec_encode / omc_codec_decode_lookup):
  canonicalize source, tokenize, sample every Nth ID, return
  compressed payload + content hash. Library-lookup decode.
- omc-kernel: content-addressed filesystem store at
  ~/.omc/kernel/store/<hex_hash>.omc. Alpha-rename invariant — two
  processes converging on the same canonical form produce the same
  address. CLI: ingest, fetch, stat, ls, sign, verify.
- omc-grep: code archaeology via canonical hash. Found 31.7%
  redundancy in OMC's own examples tree.
- OMC-PROTOCOL v1: formalized substrate-signed wire format for
  inter-agent messaging. No PKI; integrity via canonical-hash
  recompute.
- MCP server (omnimcode-mcp): exposes OMC as a runtime to LLM clients.
- Substrate-aware tokenizer: 285+ builtins, 113 phrase-level dict
  entries, CRT-packed (kind, vocab_id, position_class) IDs,
  token_distance metric, attractor folding.
WHY IT MATTERS
The substrate gains an identity layer (canonical hash) and a wire
format. Two agents talking over OMC-PROTOCOL can verify each other's
claims by recomputing hashes — no shared keys needed. The tokenizer
turns OMC source into a substrate-typed symbol stream — the
foundation for the substrate-indexed completion engine that comes next.
NOW POSSIBLE
- Compress code by 10-50× via library-lookup codec.
- Persist Values content-addressed and dedupe across processes.
- Inter-agent messaging with cryptographic-style integrity but no
  key infrastructure.
- LLM clients can drive OMC over MCP.
See CHANGELOG.md#v0.0.5-codec-kernel-protocol for the chapter index.
v0.0.4 — LLVM JIT + dual-band SSE2 + harmony branch elision
WHAT CHANGED
- omnimcode-codegen crate: LLVM 18 lowering via inkwell.
- Scalar lowerer: allocas, CFG branches, comparisons, recursive Call,
  f64 support.
- Dual-band lowerer: i64 → <2 x i64> SSE2 vectors, packing classical
  α-band with harmonic shadow β-band in a single SSE register.
- Cross-fn calls in dual-band lowerer.
- phi_shadow(x) + harmony(x) primitives.
- Harmony-gated branch elision: high-coherence inputs skip entire
  conditional blocks at native code speed. 270× speedup on @hbit
  bench; +95% reduction when @Harmony+@predict stack.
- Array support: NewArray, ArrayLen, ArrayIndex read, ArrSetNamed.
- NSL-KDD real-world JIT measurement: honest negative — array-heavy
  code doesn't beat tree-walk by enough to justify lowering cost.
- L1.6 Array ↔ JIT bridging at dispatch boundary.
- omc-bench harness with criterion.
WHY IT MATTERS
OMC gains a credible JIT path. Dual-band SSE2 codegen is novel — no
other language packs a value's classical and harmonic bands into one
register. Harmony-gated branch elision is the first demonstration
that substrate metadata can drive native-code-level optimization.
The NSL-KDD negative result is part of the chapter — being honest
about where the JIT doesn't help is what makes the does-help claims
trustworthy.
NOW POSSIBLE
- Math-heavy hot loops can hit native code speed via @hbit fn pragma.
- Branch elision based on input coherence happens transparently.
- The bench harness can compare tree-walk vs VM vs JIT-via-dispatch
vs JIT-direct on the same workload.
See CHANGELOG.md#v0.0.4-jit-and-dual-band for the chapter index.
v0.0.3 — Substrate algorithms + self-healing + stdlib
WHAT CHANGED
- Self-healing compiler (Phase H.1-H.5): harmonic + typo +
  divide-by-singularity + parse-level recovery + safe keyword for
  runtime self-healing.
- Substrate-routed O(log_phi_pi_fib N) algorithm family: substrate_search,
  lower_bound, upper_bound, rank, count_range, slice_range, intersect,
  difference, insert, quantile, select_k, nearest, min_distance, hash.
- Zeckendorf encoding as first-class integer representation.
- Stdlib expansion: 16 + 28 + 15 new built-ins across three rounds.
- Closures + first-class functions + mutable closures + module aliasing.
- Test runner + --test / --test-all CLI modes.
- Iterative heal-to-fixpoint + heal-on-runtime-error retry.
- CLI: --check, --fmt (pretty-print canonical OMC), --help.
- HBit harmony substrate-routing across the codebase.
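As background for the Zeckendorf item, every non-negative integer can be written as a sum of non-consecutive Fibonacci numbers, and a greedy pass finds that sum. The sketch below shows the standard construction; how OMC stores or exposes the encoding is not described here, so the function name and the list-of-terms return shape are illustrative only.

```python
def zeckendorf(n):
    """Return the Fibonacci terms (largest first) of n's Zeckendorf form."""
    if n < 0:
        raise ValueError("n must be non-negative")
    fibs = [1, 2]
    while fibs[-1] + fibs[-2] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    parts = []
    for f in reversed(fibs):
        if f <= n:            # greedy: always take the largest term that fits
            parts.append(f)
            n -= f
    return parts

terms = zeckendorf(100)
```

Taking the largest fitting term at each step leaves a remainder smaller than the next lower Fibonacci number, which is why two adjacent terms never both appear.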
WHY IT MATTERS
The language gains the safety primitives (self-healing) and the
substrate-routed algorithms that make the substrate observable in
everyday code. Closures + test runner round out the ergonomics.
NOW POSSIBLE
- Programs that recover from typos / off-attractor literals at
  compile or runtime.
- O(log_phi_pi_fib N) search-family operations on substrate-typed
  arrays.
- A real test runner and a real formatter for OMC source.
See CHANGELOG.md#v0.0.3-substrate-and-stdlib for the chapter index.