17 May 22:01

34f61fa

v0.8.5 — substrate ops, embedding lookup, cross-entropy fused, multi-head substrate-K

Five v0.8.5 optimization items shipped. The compound v0.8.4 + v0.8.5 effect: training-loop hot path is now fully in Rust builtins; the math-equivalent multi-head substrate-K attention is available; the architecture is positioned for v0.8.6+ to push remaining items (substrate-quantized GPU weights, sparse attention, etc.).

What's new

#1 `tape_cross_entropy_batch` — fused tape op

Per-batch cross-entropy as one tape node. Closed-form (p - one_hot)/N backward replaces the chain through 5 intermediate nodes (softmax → log → mask → mul → sum). Wins materialize at large vocab.

#2 `tape_embedding_lookup` — direct row gather

Replaces prom_embedding_batch's OMC-built [N, vocab] one-hot + matmul chain with a direct row gather. Backward scatters rows of dy back into the table gradient (same gradient as the one-hot @ table chain). Wins scale with vocab size.

#4 OMC_VM=1 negative finding

Measured: 0.662 s/step at d_model=256 (was 0.661 tree-walk). No win once hot paths are in Rust builtins. Not pursued further for Prometheus — the bytecode VM optimizes basic-block dispatch, but the hot work is now happening below that layer.

#5 Multi-head substrate-K attention — `prom_attention_substrate_k_mh_*`

Math-equivalent "sum of per-head W_O projections" form (avoids needing a tape_concat op). All single-head toggles (smod_alpha, v_resample_scale, q6_mode) honored per-head with same defaults.

Cross-validation at d_model=32, 4 heads (d_head=8), 400 steps, 3 seeds:

	mean tail loss	wins
SH (single head)	2.0047	—
MH (4 heads)	1.9998	2/3

Δ = −0.25%, directionally consistent with PyTorch's L1-MH −8.94%. Effect grows with capacity; same code path supports the PyTorch −12.15% Q6-MH finding once you turn on q6_mode=\"fused\".

#6 `tape_substrate_resample` — fused tape op

Skips tape_value → modulator_matrix → tape_const round-trip (which was extracting 16k f64s at d_model=256 seq_len=64 per call). Pairs with the substrate_resample_matrix Rust builtin from v0.8.4. Same math.

Honest framing

Wall-clock at d_model=256 is essentially unchanged from v0.8.4 for these five items in isolation — that scale was already AdamW-bound and the OMC overhead was already removed. These wins materialize when:

Vocab grows large — cross-entropy and embedding lookup get O(vocab) cheaper
Multi-head trained — the architectural win + the OMC-overhead-gone substrate-attention compose
Bigger d_model — fused substrate_resample skips proportionally more I/O

The MH cross-validation result is the load-bearing finding here: the PyTorch L1-MH win cross-validates in OMC's tape autograd.

What's still on the v0.8.5 list

#3 Route more tape ops through GPU — modest win at current scales (memory-bound ops aren't matmul-class), scaffold to be added in v0.8.6
#7 Substrate-quantized GPU weights — own chapter
#8 CRT-PE-keyed sparse attention matmul — own chapter
#9 LLVM JIT for tape paths — own chapter
#10 f16/bf16 GPU paths — own chapter

Tests

1111/1111 OMC tests pass.

Files

omnimcode-core/src/interpreter.rs — tape_cross_entropy_batch, tape_embedding_lookup, tape_substrate_resample builtins + backwards
examples/lib/prometheus.omc — wrappers + prom_attention_substrate_k_mh_*
examples/prometheus_mh_xval.omc — SH vs MH cross-validation harness

Assets 2

17 May 21:21

RandomCoder-lab

v0.8.4-substrate-builtins

8d8c214

v0.8.4 — Substrate Rust builtins: 40× CPU / 96× GPU end-to-end on Prometheus

Headline

Three Rust builtins replace OMC-side inner-loop helpers. The fused substrate_adamw_update is the actual bottleneck killer — replaces ~15 element-wise loops per parameter with one tight Rust loop. Combined with v0.8.2 (GPU integration) and v0.8.3 (substrate-shaped 8×32 tile), the three chapters compound to give the first real end-to-end Prometheus training speedup.

	CPU s/step	GPU s/step	speedup vs v0.8.2
v0.8.2 baseline	25.81	25.88	1.00×
v0.8.4 modulators only	26.38	26.28	0.98× ← no change
v0.8.4 + fused AdamW	0.65	0.27	40× / 96×

Same d_model=256 substrate-K transformer, same 5-step training, same final loss (6.95930 ± 5e-5 GPU roundtrip noise). Identical training trajectory, 96× faster on GPU.

The honest story

Initial guess was that the substrate-modulator matrix construction (_prom_smod_matrix, _prom_substrate_resample_matrix) was the bottleneck. Both got ported to Rust first — wall-clock did not move. Useful debugging finding, not a chapter on its own.

Profiling-by-fixing found the real bottleneck in prom_adamw_step: ~15 OMC-side element-wise loops per parameter per step. At 6 params of 256×256 cells, that's ~6M OMC ops per step. Replacing the inner block with one Rust builtin produced the 40× / 96× drop.

Both ports shipped — modulators because they're architecturally cleaner and verified correct, AdamW because it's the actual win.

The compound effect

v0.8.2 wired GPU in. End-to-end null result — OMC overhead dominated.
v0.8.3 found the substrate-shaped 8×32 tile (114 GFLOPS vs 71 at 1024²). Kernel-level win, no end-to-end change.
v0.8.4 removes the OMC overhead. Both prior chapters finally pay out:
- The GPU/CPU split is now 2.4× (the actual matmul speedup at d_model=256)
- The 8×32 substrate-shaped tile is doing real work in production training

Future scale-ups (d_model=512+, batched inference, longer sequences, multi-block) get both the OMC-overhead-gone benefit AND the substrate-GPU acceleration.

What this unlocks immediately

L1-MH + S-MOD α=1.0 in pure-OMC Prometheus — was unblocked by v0.8.1's broadcast-backward fix; now practical to run (seconds per step rather than minutes)
Larger-scale substrate-attention — d_model=512+, multi-block, longer sequences
Q6 cross-validation at real training length — v0.8.1's OMC-side Q6 result was at 80 steps; can now run 5000+ step training

API

Three new builtins:

```omc

Per-cell S-MOD modulator (alpha=0 → 1 everywhere)

substrate_smod_matrix(scores_2d, alpha)

Per-cell substrate-V resample modulator (scale != 0)

substrate_resample_matrix(v_2d, scale)

Fused AdamW per-parameter update — mutates m, v in place

substrate_adamw_update(cur, grad, m, v, lr, b1, b2, eps, wd, step)
```

prom_adamw_step in prometheus.omc now uses the fused builtin internally. Public AdamW interface is unchanged; any existing Prometheus training script picks up the speedup automatically.

Files

omnimcode-core/src/interpreter.rs — three builtins + flatten/rebuild helpers
examples/lib/prometheus.omc — _prom_smod_matrix / _prom_substrate_resample_matrix wrappers; prom_adamw_step inner block calls the fused builtin
examples/tests/test_substrate_modulator_builtins.omc — 8 unit tests verifying equivalence
experiments/prometheus_parity/SUBSTRATE_BUILTINS_WIN.md — full writeup

1111/1111 OMC tests pass.

Reproduction

```bash
cargo build --release -p omnimcode-cli --features gpu

CPU baseline (now fast)

OMC_GPU_BACKEND=cpu ./target/release/omnimcode-standalone examples/bench_prometheus_gpu.omc

GPU (now wins)

OMC_GPU_BACKEND=wgpu ./target/release/omnimcode-standalone examples/bench_prometheus_gpu.omc
```

Assets 2

17 May 21:05

RandomCoder-lab

v0.8.3-substrate-gpu

d1fa0a2

v0.8.3 — Substrate-shaped GPU matmul wins +38% vs conventional 16×16

Headline

Anisotropic 8×32 tiles (Fibonacci-aligned short dim, wavefront-divisor long dim) decisively beat the conventional square 16×16 tile on the user's AMD RX 580 / Vulkan. At 1024² matmul: 18.81 ms vs 30.31 ms — 1.61× the GFLOPS.

The substrate's role here isn't to fight hardware physics. It's to direct exploration toward configurations conventional GPU programming would never test. Nobody writes 8×32 for matmul by convention. The substrate said "try 8 first," the 9-variant sweep found that 8 paired with a wavefront-divisor long axis dominates, and now that's the default.

The sweep

9 variants × 3 sizes on AMD RX 580 / RADV Vulkan. 1 warmup + 5 timed iterations averaged. Parity verified (max_abs_diff < 1e-2) on every cell.

1024×1024×1024 (the most decisive case)

variant	ms	GFLOPS	vs 16×16
16×16 linear-K REF	30.31	70.85	ref
8×32 linear-K aniso	18.81	114.19	+61% ← winner
8×16 linear-K aniso	18.99	113.10	+60%
8×8 linear-K (1WF, Fib)	22.30	96.29	+36%
13×13 linear-K (3WF)	37.61	57.11	-19%
21×21 linear-K (7WF)	46.43	46.25	-35%
32×8 linear-K aniso	42.20	50.89	-28%
16×16 Fib-K-stride	29.74	72.20	+0.2%

The pattern

Anisotropic 8×N (Fib-short × wavefront-long) wins decisively. 8×32 = 256 threads = exactly 4 wavefronts. Short dim is Fib-8 (= half wavefront, fits L1 cache line). Long dim is a cache-line multiple AND maps to N (the output-column axis) for coalesced writes.
The 32×8 transpose LOSES by 30% — same total threads, but the wavefront-aligned axis is now M (rows) and writes become strided. Substrate wins only when it pairs with hardware constraints, not against them.
Pure-square Fibonacci tiles LOSE. 13×13 = 3 wavefronts × 64 with 23 idle lanes (12% waste). 21×21 = 7 wavefronts hurts occupancy. Fib alone isn't enough — needs to align with wavefront divisors.
Fib-K-stride is a wash — substrate-shaped reduction order doesn't matter; tile geometry does.

The deeper thesis

The substrate-IS-the-architecture hypothesis: strong form falsified, weak form confirmed.

Falsified: "any Fibonacci tile beats power-of-2 tiles." Wavefront geometry (64 lanes lockstep) is a hard constraint. Pure 13/21 tiles pay an occupancy tax.
Confirmed: "substrate-aligned dimensions, when they don't fight hardware, beat conventional tiles." 8×32 has Fib-8 short dim AND wavefront-divisor long dim, and wins by 60% at 1024².

The substrate is the heuristic that directs you to configurations conventional wisdom skips. Conventional GPU programming would never test 8×32 vs 16×16 — it's "too small a tile." The substrate said try 8, and the answer came back: not 8×8 (loses at small sizes due to dispatch overhead), not 13×13 (occupancy loss), but 8×wavefront-aligned.

Adoption

omnimcode-cli's install_gpu_matmul_accelerator() now uses WgpuBackend::with_tile_xy(8, 32) by default. Tunable via OMC_GPU_TILE_X / OMC_GPU_TILE_Y for hardware-specific A/Bs:

# Use the substrate-shaped default (8×32)
./omnimcode-standalone yourcode.omc

# Try a different tile for testing
OMC_GPU_TILE_X=4 OMC_GPU_TILE_Y=16 ./omnimcode-standalone ...    # NVIDIA warp=32 candidate

What's not yet tested

Other anisotropic shapes (5×32, 5×40, 13×32, 8×64)
Other GPU hardware: NVIDIA (warp=32), Apple M-series (different cache geometry). The hypothesis: 4×16 or 8×16 might win on NVIDIA
Combined with substrate-quantized weights (data-layer substrate-shaping)
Combined with sparse-via-substrate-distance (only computing high-value attention cells)

Files

omnimcode-gpu/src/wgpu_backend.rs — WgpuBackend::with_tile_xy(tx, ty) and with_config(tx, ty, kernel); MatmulKernel::{Linear, FibKStride} enum; WGSL source-substitution for both tile and inner-loop body
omnimcode-gpu/shaders/matmul.wgsl — parameterized template
omnimcode-gpu/examples/bench_fib_tile.rs — 9-variant sweep harness
omnimcode-cli/src/main.rs — default tile 8×32
experiments/prometheus_parity/SUBSTRATE_GPU_WINS.md — full writeup

1103/1103 OMC tests pass.

Reproduction

cargo run --release -p omnimcode-gpu --features wgpu --example bench_fib_tile

Assets 2

17 May 20:50

RandomCoder-lab

v0.8.2-gpu-prometheus

f6faea8

v0.8.2 — GPU wired into Prometheus (kernel: 13×, end-to-end: bottleneck elsewhere)

What's new

GPU matmul acceleration wired into the OMC tape autograd via a pluggable hook. tape_matmul forward + backward now route through omnimcode-gpu's wgpu (Vulkan) backend when built with --features gpu and shapes cross the CPU/GPU crossover threshold.

Kernel-level result: 13×

5 sequential 512² matmuls in an OMC tape:

backend	wall-clock	speedup
`OMC_GPU_BACKEND=cpu`	3.47 s	1.00×
`OMC_GPU_BACKEND=wgpu` (RX 580, Vulkan)	0.27 s	12.85×

Parity: f64 → f32 → f64 round-trip differs at the 9th significant digit. Fine for Prometheus training.

End-to-end Prometheus result: unchanged at d_model=256

Substrate-K transformer, seq_len=64, d_model=256, ff_dim=512, 5 AdamW steps:

	wall-clock	per step	loss
CPU	129.05 s	25.81 s	6.95930
wgpu	129.39 s	25.88 s	6.95932

GPU and CPU are dead even end-to-end. Why: matmul wall-clock is single-digit milliseconds per step; OMC tree-walk iteration in the substrate-shaping helpers (_prom_smod_matrix, _prom_substrate_resample_matrix, Q6 modulation) is tens of seconds. GPU saves ~50ms; OMC burns ~25s. The ratio explains the 0% movement.

This chapter ships the integration, not an end-to-end speedup. Naming the wall IS the chapter — every future direction that needs more matmul work in the time budget now gets it for free.

Architecture

omnimcode-core can't depend on omnimcode-gpu (which already depends on -core — would be a cycle). Solved with a OnceLock MatmulAccelerator hook in core::accel that the outer binary registers at startup. The hook signature uses raw (m, k, n, &[f64], &[f64]) so no core-internal types leak.

Tunables

OMC_GPU_BACKEND=cpu|wgpu        # force a backend
OMC_GPU_MATMUL_MIN_FLOPS=N      # crossover threshold (default 1,000,000)
OMC_GPU_VERBOSE=1               # log backend + threshold at startup

What this opens up

v0.8.3 substrate-native GPU kernels: Fibonacci-tile workgroups (13×13, 21×21, 34×34 vs the conventional 16×16), substrate-quantized weights, CRT-PE-keyed sparse attention matmul. Same composed-vs-fused protocol as tape_phi_log from v0.8.1, applied at the GPU layer. The substrate-IS-the-architecture question at kernel-level.
Bigger d_model (1024+): matmul time grows ~64× while OMC-side substrate ops grow ~4×, so the ratio inverts and GPU starts to win end-to-end.
Substrate ops as Rust builtins (separate work): would dissolve today's bottleneck — the substrate helpers are pure compute, fit cleanly into the tape primitive pattern.

Files

omnimcode-core/src/accel.rs — MatmulAccelerator hook + OnceLock + try_accelerated_matmul
omnimcode-core/src/interpreter.rs — tape_matmul consults the hook before falling back to triple-loop
omnimcode-cli/Cargo.toml — gpu feature pulls in omnimcode-gpu
omnimcode-cli/src/main.rs — install_gpu_matmul_accelerator() at startup
examples/bench_prometheus_gpu.omc — wall-clock harness
experiments/prometheus_parity/GPU_INTEGRATION.md — full writeup

1103/1103 OMC tests pass.

Assets 2

17 May 20:30

RandomCoder-lab

v0.8.1-tape-primitives

5b83d76

v0.8.1 — substrate-native tape primitives + broadcast-backward fix

What's new

Two new tape autograd primitives and a latent backward-broadcast bug fix that unblocks S-MOD + substrate-K end-to-end training in OMC.

`tape_phi_log(x, scale=10.0)` — substrate-native fused op

ln(|x · scale| + 1) / (π · ln φ) in one tape node. Replaces the four-op composition (tape_abs → tape_mul_scalar → tape_log → tape_div_scalar) with a single op whose backward derives directly from the substrate basis. Defined at zero (boring tape_log(0) returns −∞), exposes π·ln φ at the AST level rather than hiding it in a scalar constant.

This is the precedent-setting substrate-native primitive. The protocol — composed reference + fused alternative + unit-level equivalence proof + end-to-end training A/B — can now be applied to other substrate-native fused ops (substrate_resample, attractor_snap, attractor-modulated-backward variants).

`tape_abs(x)` — boring PyTorch parity

Element-wise |x|. Filled the obvious hole — the autograd tape had tape_log, tape_exp, tape_sin, etc., but no absolute value.

Pre-existing broadcast-backward bug, fixed in the same chapter

tape_div and tape_mul backwards panicked with col-broadcast denominators. The prom_substrate_softmax α>0 path ends in tape_div(attn_unnorm[N, N], row_sums[N, 1]) and indexed bv.at(i, j) for j up to N−1 in a [N, 1] matrix — out of bounds. Means S-MOD + substrate-K had never actually trained end-to-end in OMC; it would panic at first backward.

Both backwards now iterate the dy shape, reduce indices against each operand's actual extent, and accumulate gradient sums across broadcast axes. L1-MH + S-MOD α=1.0 can finally cross-validate in pure-OMC Prometheus.

A/B in pure-OMC Prometheus

examples/prometheus_q6_ab.omc, substrate-K transformer, seq_len=6, d_model=8, ff_dim=16, 80 AdamW steps, 3 seeds:

	mean val	Δ vs off	composed − fused
off (no Q6)	2.5692	—	—
composed Q6	2.5530	−0.0162 (−0.63%)	—
fused Q6	2.5530	−0.0162 (−0.63%)	1.2 × 10⁻⁷

Composed and fused agree to ~1e-7 after 80 forward+backward AdamW steps — floating-point accumulation-noise floor. The substrate-native primitive matches the boring composed reference exactly under training, confirming the abstraction is free.

Q6 itself wins 2/3 seeds at this tiny scale — first OMC-side cross-validation of the PyTorch Q6 finding (−12.15% 6/6 seeds at TinyShakespeare L1-MH).

Tests

examples/tests/test_tape_abs_phi_log.omc — 12 primitive unit tests (forward, backward, edge cases, composed-vs-fused equivalence)
examples/tests/test_q6_modulate.omc — 4 modulation-dispatch tests

Full suite: 1103/1103 OMC tests pass.

Files

omnimcode-core/src/interpreter.rs — TapeOp::Abs, TapeOp::PhiLog(usize, f64), broadcast-aware Mul/Div backward
examples/lib/prometheus.omc — prom_q6_modulate(q, scale, gamma, mode) + q6_mode field on substrate-K layer
examples/prometheus_q6_ab.omc — A/B harness
experiments/prometheus_parity/TAPE_PRIMITIVES_AB.md — full writeup

Assets 2

17 May 17:43

RandomCoder-lab

v0.3.1-symbolic-compression

c2f5e0d

v0.3.1 — Symbolic compression: 3.8× smaller predict default + omc_fetch_by_hash

WHAT CHANGED

omc_predict gains a format parameter:
- hash (NEW DEFAULT, ~50 bytes/suggestion): fn_name + file +
  canonical_hash + prefix_match_len + substrate_distance
- signature (~100 bytes): adds the fn signature line
- full: complete source (previous default behavior)
omc_fetch_by_hash(paths, canonical_hash) — companion tool.
Recovers a function body by alpha-rename-invariant canonical hash.
Returns {found, fn_name, file, source} or {found: false}.

MEASURED COMPRESSION
Same query fn prom_attention_ x top_k=5 against prometheus.omc:
format=hash 1253 bytes 26.2% (3.8x smaller)
format=signature 1622 bytes 33.9%
format=full 4783 bytes 100% (v0.3 behavior)

The ratio widens on longer fns — top_k=5 over fns averaging 60
lines compresses ~10x.

WHY IT MATTERS
Canonical hash is alpha-rename invariant — recovery via
fetch_by_hash works even if the fn was renamed after the predict
call. The LLM workflow becomes: predict cheaply (hash), reason
over candidates, fetch only the body it commits to using.
Branching is now ~free at the context-budget level: 50 candidates
fit in the LLM's mind for the cost of 6-7 full bodies.

NOW POSSIBLE

LLM agents can hold 5-10x more candidate fns "in mind" per query.
Repeated browsing across a corpus stays cheap.
The substrate's content-addressed identity becomes a first-class
context-compression mechanism.

TESTS
13/13 MCP integration tests pass. 231 Rust pass, 1087/1087 OMC.

DEFERRED TO v0.4

Wire substrate codec (omc_codec_encode 10-50x ratio) into the
predict response path for full library-lookup compression.
Substrate-keyed conversation memory via fibtier.
omc_compress_context(text) MCP tool.
Cross-corpus blending.

See CHANGELOG.md#v0.3.1-symbolic-compression for the chapter index.

Assets 2

17 May 16:48

RandomCoder-lab

v0.0.6-prometheus

686fc7a

v0.0.6 — Prometheus: pure-OMC ML framework, first substrate-K wins

WHAT CHANGED

Tape-based reverse-mode autograd in pure OMC: tape_var, tape_const,
tape_add, tape_matmul, tape_softmax, ~20 ops. Substrate-preserving.
Prometheus framework: prom_linear, prom_relu, prom_softmax,
prom_mse_loss, prom_sgd_step. Then AdamW, Embedding, LayerNorm,
CRT-Fibonacci PE, Sequential, TransformerBlock composition.
Multi-token batched forward: broadcast-aware tape ops, per-row
mean/var, multi-token attention.
TinyShakespeare end-to-end in pure OMC.
Cross-framework parity bench: every Prometheus result reproduced
in PyTorch (experiments/prometheus_parity/).
Substrate-K (L1) wins single-head at TinyShakespeare scale:
-8% val vs vanilla, 3/3 seeds. First substrate-component win
that survives at real scale.
PyTorch 10-seed + multi-block reproduction of substrate-L3
(parameter-free attention) wins (-21.5% on toy data).
Fibonacci-tier memory primitive (fibtier) — bounded power-law
context buffer.
Substrate-native agent demo: two agents conversing over
OMC-PROTOCOL with persistent fibtier memory across simulated
process restart.

WHY IT MATTERS
OMC's substrate finally produces a measurable win on a real ML
training task at real scale, in both a pure-OMC implementation and
an independent PyTorch reproduction. The autograd + Prometheus
stack is the platform that the substrate-attention chapter (v0.1)
is built on top of.

NOW POSSIBLE

Train a transformer end-to-end in pure OMC.
Compare substrate variants apples-to-apples in PyTorch.
Compose substrate primitives (codec + kernel + protocol +
Prometheus + fibtier) into a single working agent demo.

See CHANGELOG.md#v0.0.6-prometheus for the chapter index.

Assets 2

17 May 16:48

RandomCoder-lab

v0.0.5-codec-kernel-protocol

586112c

v0.0.5 — Substrate codec, kernel, OMC-PROTOCOL v1

WHAT CHANGED

Substrate codec (omc_codec_encode / omc_codec_decode_lookup):
canonicalize source, tokenize, sample every Nth ID, return
compressed payload + content hash. Library-lookup decode.
omc-kernel: content-addressed filesystem store at
~/.omc/kernel/store/<hex_hash>.omc. Alpha-rename invariant — two
processes converging on the same canonical form produce the same
address. CLI: ingest, fetch, stat, ls, sign, verify.
omc-grep: code archaeology via canonical hash. Found 31.7%
redundancy in OMC's own examples tree.
OMC-PROTOCOL v1: formalized substrate-signed wire format for
inter-agent messaging. No PKI; integrity via canonical-hash
recompute.
MCP server (omnimcode-mcp): exposes OMC as a runtime to LLM clients.
Substrate-aware tokenizer: 285+ builtins, 113 phrase-level dict
entries, CRT-packed (kind, vocab_id, position_class) IDs,
token_distance metric, attractor folding.

WHY IT MATTERS
The substrate gains an identity layer (canonical hash) and a wire
format. Two agents talking over OMC-PROTOCOL can verify each other's
claims by recomputing hashes — no shared keys needed. The tokenizer
turns OMC source into a substrate-typed symbol stream — the
foundation for the substrate-indexed completion engine that comes next.

NOW POSSIBLE

Compress code by 10-50× via library-lookup codec.
Persist Values content-addressed and dedupe across processes.
Inter-agent messaging with cryptographic-style integrity but no
key infrastructure.
LLM clients can drive OMC over MCP.

See CHANGELOG.md#v0.0.5-codec-kernel-protocol for the chapter index.

Assets 2

17 May 16:48

RandomCoder-lab

v0.0.4-jit-and-dual-band

ca30037

v0.0.4 — LLVM JIT + dual-band SSE2 + harmony branch elision

WHAT CHANGED

omnimcode-codegen crate: LLVM 18 lowering via inkwell.
Scalar lowerer: allocas, CFG branches, comparisons, recursive Call,
f64 support.
Dual-band lowerer: i64 → <2 x i64> SSE2 vectors, packing classical
α-band with harmonic shadow β-band in a single SSE register.
Cross-fn calls in dual-band lowerer.
phi_shadow(x) + harmony(x) primitives.
Harmony-gated branch elision: high-coherence inputs skip entire
conditional blocks at native code speed. 270× speedup on @hbit
bench; +95% reduction when @Harmony+@predict stack.
Array support: NewArray, ArrayLen, ArrayIndex read, ArrSetNamed.
NSL-KDD real-world JIT measurement: honest negative — array-heavy
code doesn't beat tree-walk by enough to justify lowering cost.
L1.6 Array ↔ JIT bridging at dispatch boundary.
omc-bench harness with criterion.

WHY IT MATTERS
OMC gains a credible JIT path. Dual-band SSE2 codegen is novel — no
other language packs a value's classical and harmonic bands into one
register. Harmony-gated branch elision is the first demonstration
that substrate metadata can drive native-code-level optimization.

The NSL-KDD negative result is part of the chapter — being honest
about where the JIT doesn't help is what makes the does-help claims
trustworthy.

NOW POSSIBLE

Math-heavy hot loops can hit native code speed via @hbit fn pragma.
Branch elision based on input coherence happens transparently.
The bench harness can compare tree-walk vs VM vs JIT-via-dispatch
vs JIT-direct on the same workload.

See CHANGELOG.md#v0.0.4-jit-and-dual-band for the chapter index.

Contributors

hbit, predict, and Harmony

Assets 2

17 May 16:48

RandomCoder-lab

v0.0.3-substrate-and-stdlib

2a4321c

v0.0.3 — Substrate algorithms + self-healing + stdlib

WHAT CHANGED

Self-healing compiler (Phase H.1-H.5): harmonic + typo + divide-by-
singularity + parse-level recovery + safe keyword for runtime
self-healing.
Substrate-routed O(log_phi_pi_fib N) algorithm family: substrate_search,
lower_bound, upper_bound, rank, count_range, slice_range, intersect,
difference, insert, quantile, select_k, nearest, min_distance, hash.
Zeckendorf encoding as first-class integer representation.
Stdlib expansion: 16 + 28 + 15 new built-ins across three rounds.
Closures + first-class functions + mutable closures + module aliasing.
Test runner + --test / --test-all CLI modes.
Iterative heal-to-fixpoint + heal-on-runtime-error retry.
CLI: --check, --fmt (pretty-print canonical OMC), --help.
HBit harmony substrate-routing across the codebase.

WHY IT MATTERS
The language gains the safety primitives (self-healing) and the
substrate-routed algorithms that make the substrate observable in
everyday code. Closures + test runner round out the ergonomics.

NOW POSSIBLE

Programs that recover from typos / off-attractor literals at
compile or runtime.
O(log_phi_pi_fib N) search-family operations on substrate-typed
arrays.
A real test runner and a real formatter for OMC source.

See CHANGELOG.md#v0.0.3-substrate-and-stdlib for the chapter index.

Assets 2

Releases: RandomCoder-lab/OMC

v0.8.5 — substrate ops, embedding lookup, cross-entropy fused, multi-head substrate-K

What's new

#1 tape_cross_entropy_batch — fused tape op

#2 tape_embedding_lookup — direct row gather

#4 OMC_VM=1 negative finding

#5 Multi-head substrate-K attention — prom_attention_substrate_k_mh_*

#6 tape_substrate_resample — fused tape op

Honest framing

What's still on the v0.8.5 list

Tests

Files

Uh oh!

v0.8.4 — Substrate Rust builtins: 40× CPU / 96× GPU end-to-end on Prometheus

Headline

The honest story

The compound effect

What this unlocks immediately

API

Per-cell S-MOD modulator (alpha=0 → 1 everywhere)

Per-cell substrate-V resample modulator (scale != 0)

Fused AdamW per-parameter update — mutates m, v in place

Files

Reproduction

CPU baseline (now fast)

GPU (now wins)

Uh oh!

v0.8.3 — Substrate-shaped GPU matmul wins +38% vs conventional 16×16

Headline

The sweep

1024×1024×1024 (the most decisive case)

The pattern

The deeper thesis

Adoption

What's not yet tested

Files

Reproduction

Uh oh!

v0.8.2 — GPU wired into Prometheus (kernel: 13×, end-to-end: bottleneck elsewhere)

What's new

Kernel-level result: 13×

End-to-end Prometheus result: unchanged at d_model=256

Architecture

Tunables

What this opens up

Files

Uh oh!

v0.8.1 — substrate-native tape primitives + broadcast-backward fix

What's new

tape_phi_log(x, scale=10.0) — substrate-native fused op

tape_abs(x) — boring PyTorch parity

Pre-existing broadcast-backward bug, fixed in the same chapter

A/B in pure-OMC Prometheus

Tests

Files

Uh oh!

v0.3.1 — Symbolic compression: 3.8× smaller predict default + omc_fetch_by_hash

Uh oh!

v0.0.6 — Prometheus: pure-OMC ML framework, first substrate-K wins

Uh oh!

v0.0.5 — Substrate codec, kernel, OMC-PROTOCOL v1

Uh oh!

v0.0.4 — LLVM JIT + dual-band SSE2 + harmony branch elision

Contributors

Uh oh!

v0.0.3 — Substrate algorithms + self-healing + stdlib

Uh oh!

#1 `tape_cross_entropy_batch` — fused tape op

#2 `tape_embedding_lookup` — direct row gather

#5 Multi-head substrate-K attention — `prom_attention_substrate_k_mh_*`

#6 `tape_substrate_resample` — fused tape op

`tape_phi_log(x, scale=10.0)` — substrate-native fused op

`tape_abs(x)` — boring PyTorch parity