Releases: RandomCoder-lab/OMC

v0.8.5 — substrate ops, embedding lookup, cross-entropy fused, multi-head substrate-K

17 May 22:01

Five v0.8.5 optimization items shipped. The compound v0.8.4 + v0.8.5 effect: training-loop hot path is now fully in Rust builtins; the math-equivalent multi-head substrate-K attention is available; the architecture is positioned for v0.8.6+ to push remaining items (substrate-quantized GPU weights, sparse attention, etc.).

What's new

#1 tape_cross_entropy_batch — fused tape op

Per-batch cross-entropy as one tape node. Closed-form (p - one_hot)/N backward replaces the chain through 5 intermediate nodes (softmax → log → mask → mul → sum). Wins materialize at large vocab.
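
A minimal Python sketch of the closed-form backward described above, for illustration only. The function names are hypothetical and this is not the Rust builtin; it assumes mean reduction over the N rows of the batch.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_batch(logits, targets):
    """Mean cross-entropy over N rows, plus d(loss)/d(logits).

    Assumption: the loss is averaged over the batch, which is what
    makes the gradient (p - one_hot) / N.
    """
    n = len(logits)
    loss = 0.0
    grad = []
    for row, t in zip(logits, targets):
        p = softmax(row)
        loss -= math.log(p[t]) / n
        # Closed-form backward: one pass, no intermediate tape nodes.
        grad.append([(p_j - (1.0 if j == t else 0.0)) / n
                     for j, p_j in enumerate(p)])
    return loss, grad
```

Each gradient row sums to zero, which is a cheap sanity check when comparing a fused op against a composed reference.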

#2 tape_embedding_lookup — direct row gather

Replaces prom_embedding_batch's OMC-built [N, vocab] one-hot + matmul chain with a direct row gather. Backward scatters rows of dy back into the table gradient (same gradient as the one-hot @ table chain). Wins scale with vocab size.
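
The gather/scatter pair can be sketched in Python as follows. The names are hypothetical and the one-hot reference is included only to show that both paths give the same forward result; the actual builtin lives in Rust.

```python
def embedding_lookup(table, ids):
    """Forward: gather one table row per token id."""
    return [table[i][:] for i in ids]

def embedding_lookup_backward(dy, ids, vocab):
    """Backward: scatter-add each row of dy into the table gradient."""
    d_model = len(dy[0])
    d_table = [[0.0] * d_model for _ in range(vocab)]
    for row, i in zip(dy, ids):
        for j, g in enumerate(row):
            d_table[i][j] += g  # repeated ids accumulate
    return d_table

def one_hot_matmul(table, ids):
    """Reference path: build an [N, vocab] one-hot, then multiply by the table."""
    vocab, d_model = len(table), len(table[0])
    out = []
    for i in ids:
        one_hot = [1.0 if v == i else 0.0 for v in range(vocab)]
        out.append([sum(one_hot[v] * table[v][j] for v in range(vocab))
                    for j in range(d_model)])
    return out
```

The reference path does O(N · vocab · d_model) work while the gather does O(N · d_model), which is why the gap widens with vocab size.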

#4 OMC_VM=1 negative finding

Measured: 0.662 s/step at d_model=256 (was 0.661 tree-walk). No win once hot paths are in Rust builtins. Not pursued further for Prometheus — the bytecode VM optimizes basic-block dispatch, but the hot work is now happening below that layer.

#5 Multi-head substrate-K attention — prom_attention_substrate_k_mh_*

Math-equivalent "sum of per-head W_O projections" form (avoids needing a tape_concat op). All single-head toggles (smod_alpha, v_resample_scale, q6_mode) honored per-head with same defaults.

Cross-validation at d_model=32, 4 heads (d_head=8), 400 steps, 3 seeds:

| | mean tail loss | wins |
|---|---|---|
| SH (single head) | 2.0047 | |
| MH (4 heads) | 1.9998 | 2/3 |

Δ = −0.25%, directionally consistent with PyTorch's L1-MH −8.94%. Effect grows with capacity; same code path supports the PyTorch −12.15% Q6-MH finding once you turn on q6_mode="fused".

#6 tape_substrate_resample — fused tape op

Skips tape_value → modulator_matrix → tape_const round-trip (which was extracting 16k f64s at d_model=256 seq_len=64 per call). Pairs with the substrate_resample_matrix Rust builtin from v0.8.4. Same math.

Honest framing

Wall-clock at d_model=256 is essentially unchanged from v0.8.4 for these five items in isolation — that scale was already AdamW-bound and the OMC overhead was already removed. These wins materialize when:

  • Vocab grows large — cross-entropy and embedding lookup get O(vocab) cheaper
  • Multi-head trained — the architectural win + the OMC-overhead-gone substrate-attention compose
  • Bigger d_model — fused substrate_resample skips proportionally more I/O

The MH cross-validation result is the load-bearing finding here: the PyTorch L1-MH win cross-validates in OMC's tape autograd.

What's still on the v0.8.5 list

  • #3 Route more tape ops through GPU — modest win at current scales (memory-bound ops aren't matmul-class), scaffold to be added in v0.8.6
  • #7 Substrate-quantized GPU weights — own chapter
  • #8 CRT-PE-keyed sparse attention matmul — own chapter
  • #9 LLVM JIT for tape paths — own chapter
  • #10 f16/bf16 GPU paths — own chapter

Tests

1111/1111 OMC tests pass.

Files

  • omnimcode-core/src/interpreter.rs: tape_cross_entropy_batch, tape_embedding_lookup, tape_substrate_resample builtins + backwards
  • examples/lib/prometheus.omc — wrappers + prom_attention_substrate_k_mh_*
  • examples/prometheus_mh_xval.omc — SH vs MH cross-validation harness

v0.8.4 — Substrate Rust builtins: 40× CPU / 96× GPU end-to-end on Prometheus

17 May 21:21

Headline

Three Rust builtins replace OMC-side inner-loop helpers. The fused substrate_adamw_update is the actual bottleneck killer — replaces ~15 element-wise loops per parameter with one tight Rust loop. Combined with v0.8.2 (GPU integration) and v0.8.3 (substrate-shaped 8×32 tile), the three chapters compound to give the first real end-to-end Prometheus training speedup.

| | CPU s/step | GPU s/step | speedup vs v0.8.2 |
|---|---|---|---|
| v0.8.2 baseline | 25.81 | 25.88 | 1.00× |
| v0.8.4 modulators only | 26.38 | 26.28 | 0.98× (no change) |
| v0.8.4 + fused AdamW | 0.65 | 0.27 | 40× / 96× |

Same d_model=256 substrate-K transformer, same 5-step training, same final loss (6.95930 ± 5e-5 GPU roundtrip noise). Identical training trajectory, 96× faster on GPU.

The honest story

Initial guess was that the substrate-modulator matrix construction (_prom_smod_matrix, _prom_substrate_resample_matrix) was the bottleneck. Both got ported to Rust first — wall-clock did not move. Useful debugging finding, not a chapter on its own.

Profiling-by-fixing found the real bottleneck in prom_adamw_step: ~15 OMC-side element-wise loops per parameter per step. At 6 params of 256×256 cells, that's ~6M OMC ops per step. Replacing the inner block with one Rust builtin produced the 40× / 96× drop.

Both ports shipped — modulators because they're architecturally cleaner and verified correct, AdamW because it's the actual win.

The compound effect

  • v0.8.2 wired GPU in. End-to-end null result — OMC overhead dominated.
  • v0.8.3 found the substrate-shaped 8×32 tile (114 GFLOPS vs 71 at 1024²). Kernel-level win, no end-to-end change.
  • v0.8.4 removes the OMC overhead. Both prior chapters finally pay out:
    • The GPU/CPU split is now 2.4× (the actual matmul speedup at d_model=256)
    • The 8×32 substrate-shaped tile is doing real work in production training

Future scale-ups (d_model=512+, batched inference, longer sequences, multi-block) get both the OMC-overhead-gone benefit AND the substrate-GPU acceleration.

What this unlocks immediately

  • L1-MH + S-MOD α=1.0 in pure-OMC Prometheus — was unblocked by v0.8.1's broadcast-backward fix; now practical to run (seconds per step rather than minutes)
  • Larger-scale substrate-attention — d_model=512+, multi-block, longer sequences
  • Q6 cross-validation at real training length — v0.8.1's OMC-side Q6 result was at 80 steps; can now run 5000+ step training

API

Three new builtins:

```omc
# Per-cell S-MOD modulator (alpha=0 → 1 everywhere)
substrate_smod_matrix(scores_2d, alpha)

# Per-cell substrate-V resample modulator (scale != 0)
substrate_resample_matrix(v_2d, scale)

# Fused AdamW per-parameter update; mutates m, v in place
substrate_adamw_update(cur, grad, m, v, lr, b1, b2, eps, wd, step)
```

prom_adamw_step in prometheus.omc now uses the fused builtin internally. Public AdamW interface is unchanged; any existing Prometheus training script picks up the speedup automatically.
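
For readers unfamiliar with what "fused" means here, the following Python sketch runs the whole update in one loop over a flat parameter. It assumes the textbook AdamW formulation (decoupled weight decay, bias correction, step counted from 1) and that the new parameter values are returned; the builtin's exact arithmetic and return convention are not stated in these notes, so treat the function below as hypothetical.

```python
import math

def adamw_update(cur, grad, m, v, lr, b1, b2, eps, wd, step):
    """One fused AdamW pass over a flat parameter; mutates m and v in place.

    Assumption: standard AdamW with decoupled weight decay and step >= 1.
    """
    bias1 = 1.0 - b1 ** step
    bias2 = 1.0 - b2 ** step
    out = [0.0] * len(cur)
    for i, g in enumerate(grad):
        # Moment updates, bias correction, and the parameter step all
        # happen in this single loop instead of separate element-wise passes.
        m[i] = b1 * m[i] + (1.0 - b1) * g
        v[i] = b2 * v[i] + (1.0 - b2) * g * g
        m_hat = m[i] / bias1
        v_hat = v[i] / bias2
        out[i] = cur[i] - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * cur[i])
    return out
```

Written as separate element-wise passes, the same update needs roughly one loop per arithmetic step, which is the shape of the overhead the notes describe for the OMC-side version.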

Files

  • omnimcode-core/src/interpreter.rs — three builtins + flatten/rebuild helpers
  • examples/lib/prometheus.omc: _prom_smod_matrix / _prom_substrate_resample_matrix wrappers; prom_adamw_step inner block calls the fused builtin
  • examples/tests/test_substrate_modulator_builtins.omc — 8 unit tests verifying equivalence
  • experiments/prometheus_parity/SUBSTRATE_BUILTINS_WIN.md — full writeup

1111/1111 OMC tests pass.

Reproduction

```bash
cargo build --release -p omnimcode-cli --features gpu

# CPU baseline (now fast)
OMC_GPU_BACKEND=cpu ./target/release/omnimcode-standalone examples/bench_prometheus_gpu.omc

# GPU (now wins)
OMC_GPU_BACKEND=wgpu ./target/release/omnimcode-standalone examples/bench_prometheus_gpu.omc
```

v0.8.3 — Substrate-shaped GPU matmul wins +38% vs conventional 16×16

17 May 21:05

Headline

Anisotropic 8×32 tiles (Fibonacci-aligned short dim, wavefront-divisor long dim) decisively beat the conventional square 16×16 tile on the user's AMD RX 580 / Vulkan. At 1024² matmul: 18.81 ms vs 30.31 ms, which is 38% less wall-clock time and 1.61× the GFLOPS.

The substrate's role here isn't to fight hardware physics. It's to direct exploration toward configurations conventional GPU programming would never test. Nobody writes 8×32 for matmul by convention. The substrate said "try 8 first," the 9-variant sweep found that 8 paired with a wavefront-divisor long axis dominates, and now that's the default.

The sweep

9 variants × 3 sizes on AMD RX 580 / RADV Vulkan. 1 warmup + 5 timed iterations averaged. Parity verified (max_abs_diff < 1e-2) on every cell.

1024×1024×1024 (the most decisive case)

| variant | ms | GFLOPS | vs 16×16 |
|---|---|---|---|
| 16×16 linear-K REF | 30.31 | 70.85 | ref |
| 8×32 linear-K aniso | 18.81 | 114.19 | +61% (winner) |
| 8×16 linear-K aniso | 18.99 | 113.10 | +60% |
| 8×8 linear-K (1WF, Fib) | 22.30 | 96.29 | +36% |
| 13×13 linear-K (3WF) | 37.61 | 57.11 | -19% |
| 21×21 linear-K (7WF) | 46.43 | 46.25 | -35% |
| 32×8 linear-K aniso | 42.20 | 50.89 | -28% |
| 16×16 Fib-K-stride | 29.74 | 72.20 | +1.9% |
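
The GFLOPS column is consistent with the usual 2·N³ operation count for an N×N×N matmul. A small Python helper (hypothetical, not part of the bench harness) reproduces the reference and winner rows to within rounding of the reported milliseconds:

```python
def gflops(n, ms):
    """GFLOPS for an n x n x n matmul, counting 2 * n**3 floating-point ops."""
    flops = 2 * n ** 3
    return flops / (ms * 1e-3) / 1e9

ref = gflops(1024, 30.31)     # 16x16 reference tile
winner = gflops(1024, 18.81)  # 8x32 anisotropic tile
speedup = winner / ref        # same ratio as 30.31 / 18.81
```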

The pattern

  • Anisotropic 8×N (Fib-short × wavefront-long) wins decisively. 8×32 = 256 threads = exactly 4 wavefronts. Short dim is Fib-8 (= half wavefront, fits L1 cache line). Long dim is a cache-line multiple AND maps to N (the output-column axis) for coalesced writes.
  • The 32×8 transpose LOSES by 28%: same total threads, but the wavefront-aligned axis is now M (rows) and writes become strided. Substrate wins only when it pairs with hardware constraints, not against them.
  • Pure-square Fibonacci tiles LOSE. 13×13 = 3 wavefronts × 64 with 23 idle lanes (12% waste). 21×21 = 7 wavefronts hurts occupancy. Fib alone isn't enough — needs to align with wavefront divisors.
  • Fib-K-stride is a wash — substrate-shaped reduction order doesn't matter; tile geometry does.
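
The wavefront arithmetic in the bullets above can be reproduced with a short Python sketch. It assumes 64 lanes per wavefront, as stated in the notes for this AMD GPU; the helper name is hypothetical.

```python
import math

WAVEFRONT = 64  # lanes per wavefront on the AMD hardware in these notes

def occupancy(tx, ty, wavefront=WAVEFRONT):
    """Return (threads, wavefronts, idle_lanes) for a tx x ty workgroup."""
    threads = tx * ty
    waves = math.ceil(threads / wavefront)
    idle = waves * wavefront - threads
    return threads, waves, idle
```

For example, `occupancy(8, 32)` fills its wavefronts exactly, while `occupancy(13, 13)` leaves 23 of 192 lanes idle. On hardware with 32-lane warps the same helper can be called with `wavefront=32` to screen candidate tiles before benchmarking.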

The deeper thesis

The substrate-IS-the-architecture hypothesis: strong form falsified, weak form confirmed.

  • Falsified: "any Fibonacci tile beats power-of-2 tiles." Wavefront geometry (64 lanes lockstep) is a hard constraint. Pure 13/21 tiles pay an occupancy tax.
  • Confirmed: "substrate-aligned dimensions, when they don't fight hardware, beat conventional tiles." 8×32 has Fib-8 short dim AND wavefront-divisor long dim, and wins by 61% at 1024².

The substrate is the heuristic that directs you to configurations conventional wisdom skips. Conventional GPU programming would never test 8×32 vs 16×16 — it's "too small a tile." The substrate said try 8, and the answer came back: not 8×8 (loses at small sizes due to dispatch overhead), not 13×13 (occupancy loss), but 8×wavefront-aligned.

Adoption

omnimcode-cli's install_gpu_matmul_accelerator() now uses WgpuBackend::with_tile_xy(8, 32) by default. Tunable via OMC_GPU_TILE_X / OMC_GPU_TILE_Y for hardware-specific A/Bs:

```bash
# Use the substrate-shaped default (8×32)
./omnimcode-standalone yourcode.omc

# Try a different tile for testing
OMC_GPU_TILE_X=4 OMC_GPU_TILE_Y=16 ./omnimcode-standalone ...    # NVIDIA warp=32 candidate
```

What's not yet tested

  • Other anisotropic shapes (5×32, 5×40, 13×32, 8×64)
  • Other GPU hardware: NVIDIA (warp=32), Apple M-series (different cache geometry). The hypothesis: 4×16 or 8×16 might win on NVIDIA
  • Combined with substrate-quantized weights (data-layer substrate-shaping)
  • Combined with sparse-via-substrate-distance (only computing high-value attention cells)

Files

  • omnimcode-gpu/src/wgpu_backend.rs: WgpuBackend::with_tile_xy(tx, ty) and with_config(tx, ty, kernel); MatmulKernel::{Linear, FibKStride} enum; WGSL source-substitution for both tile and inner-loop body
  • omnimcode-gpu/shaders/matmul.wgsl — parameterized template
  • omnimcode-gpu/examples/bench_fib_tile.rs — 9-variant sweep harness
  • omnimcode-cli/src/main.rs — default tile 8×32
  • experiments/prometheus_parity/SUBSTRATE_GPU_WINS.md — full writeup

1103/1103 OMC tests pass.

Reproduction

```bash
cargo run --release -p omnimcode-gpu --features wgpu --example bench_fib_tile
```

v0.8.2 — GPU wired into Prometheus (kernel: 13×, end-to-end: bottleneck elsewhere)

17 May 20:50

What's new

GPU matmul acceleration wired into the OMC tape autograd via a pluggable hook. tape_matmul forward + backward now route through omnimcode-gpu's wgpu (Vulkan) backend when built with --features gpu and shapes cross the CPU/GPU crossover threshold.

Kernel-level result: 13×

5 sequential 512² matmuls in an OMC tape:

| backend | wall-clock | speedup |
|---|---|---|
| OMC_GPU_BACKEND=cpu | 3.47 s | 1.00× |
| OMC_GPU_BACKEND=wgpu (RX 580, Vulkan) | 0.27 s | 12.85× |

Parity: f64 → f32 → f64 round-trip differs at the 9th significant digit. Fine for Prometheus training.

End-to-end Prometheus result: unchanged at d_model=256

Substrate-K transformer, seq_len=64, d_model=256, ff_dim=512, 5 AdamW steps:

| | wall-clock | per step | loss |
|---|---|---|---|
| CPU | 129.05 s | 25.81 s | 6.95930 |
| wgpu | 129.39 s | 25.88 s | 6.95932 |

GPU and CPU are dead even end-to-end. Why: matmul wall-clock is single-digit milliseconds per step; OMC tree-walk iteration in the substrate-shaping helpers (_prom_smod_matrix, _prom_substrate_resample_matrix, Q6 modulation) is tens of seconds. GPU saves ~50ms; OMC burns ~25s. The ratio explains the 0% movement.

This chapter ships the integration, not an end-to-end speedup. Naming the wall IS the chapter — every future direction that needs more matmul work in the time budget now gets it for free.

Architecture

omnimcode-core can't depend on omnimcode-gpu (which already depends on -core — would be a cycle). Solved with a OnceLock MatmulAccelerator hook in core::accel that the outer binary registers at startup. The hook signature uses raw (m, k, n, &[f64], &[f64]) so no core-internal types leak.

Tunables

```
OMC_GPU_BACKEND=cpu|wgpu        # force a backend
OMC_GPU_MATMUL_MIN_FLOPS=N      # crossover threshold (default 1,000,000)
OMC_GPU_VERBOSE=1               # log backend + threshold at startup
```
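
One way to picture how these tunables might interact is the Python sketch below. It is a guess at the dispatch logic, not the code in omnimcode-core: it assumes matmul cost is counted as 2·m·k·n, that the threshold comparison is inclusive, and that a forced backend bypasses the threshold. The function names are hypothetical.

```python
import os

def matmul_flops(m, k, n):
    """Assumed cost model: 2 * m * k * n floating-point ops."""
    return 2 * m * k * n

def pick_backend(m, k, n, env=None):
    """Choose 'cpu' or 'wgpu' for an [m, k] x [k, n] matmul.

    Assumption: OMC_GPU_BACKEND, when set, wins outright; otherwise the
    shape must reach OMC_GPU_MATMUL_MIN_FLOPS to go to the GPU.
    """
    env = os.environ if env is None else env
    forced = env.get("OMC_GPU_BACKEND")
    if forced in ("cpu", "wgpu"):
        return forced
    threshold = int(env.get("OMC_GPU_MATMUL_MIN_FLOPS", "1000000"))
    return "wgpu" if matmul_flops(m, k, n) >= threshold else "cpu"
```

Under these assumptions a 512² matmul (about 268M ops) clears the default threshold easily, while an 8×8×8 matmul stays on the CPU.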

What this opens up

  • v0.8.3 substrate-native GPU kernels: Fibonacci-tile workgroups (13×13, 21×21, 34×34 vs the conventional 16×16), substrate-quantized weights, CRT-PE-keyed sparse attention matmul. Same composed-vs-fused protocol as tape_phi_log from v0.8.1, applied at the GPU layer. The substrate-IS-the-architecture question at kernel-level.
  • Bigger d_model (1024+): matmul time grows ~64× while OMC-side substrate ops grow ~4×, so the ratio inverts and GPU starts to win end-to-end.
  • Substrate ops as Rust builtins (separate work): would dissolve today's bottleneck — the substrate helpers are pure compute, fit cleanly into the tape primitive pattern.

Files

  • omnimcode-core/src/accel.rs: MatmulAccelerator hook + OnceLock + try_accelerated_matmul
  • omnimcode-core/src/interpreter.rs: tape_matmul consults the hook before falling back to triple-loop
  • omnimcode-cli/Cargo.toml: gpu feature pulls in omnimcode-gpu
  • omnimcode-cli/src/main.rs: install_gpu_matmul_accelerator() at startup
  • examples/bench_prometheus_gpu.omc — wall-clock harness
  • experiments/prometheus_parity/GPU_INTEGRATION.md — full writeup

1103/1103 OMC tests pass.

v0.8.1 — substrate-native tape primitives + broadcast-backward fix

17 May 20:30

What's new

Two new tape autograd primitives and a latent backward-broadcast bug fix that unblocks S-MOD + substrate-K end-to-end training in OMC.

tape_phi_log(x, scale=10.0) — substrate-native fused op

ln(|x · scale| + 1) / (π · ln φ) in one tape node. Replaces the four-op composition (tape_abs → tape_mul_scalar → tape_log → tape_div_scalar) with a single op whose backward derives directly from the substrate basis. Defined at zero (boring tape_log(0) returns −∞), exposes π·ln φ at the AST level rather than hiding it in a scalar constant.

This is the precedent-setting substrate-native primitive. The protocol — composed reference + fused alternative + unit-level equivalence proof + end-to-end training A/B — can now be applied to other substrate-native fused ops (substrate_resample, attractor_snap, attractor-modulated-backward variants).
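
As a reading aid, here is the forward formula and its analytic derivative in Python. The function names are hypothetical, and the choice of a zero gradient at x = 0 is an assumption; the notes do not say what the tape op returns there.

```python
import math

PHI = (1.0 + math.sqrt(5.0)) / 2.0
DENOM = math.pi * math.log(PHI)  # the π · ln φ normaliser

def phi_log(x, scale=10.0):
    """Forward: ln(|x * scale| + 1) / (π · ln φ). Finite at x = 0."""
    return math.log(abs(x * scale) + 1.0) / DENOM

def phi_log_grad(x, scale=10.0):
    """Analytic d/dx of phi_log.

    d/dx ln(|s x| + 1) = |s| * sign(x) / (|s x| + 1).
    Assumption: subgradient 0 is used at x = 0.
    """
    if x == 0.0 or scale == 0.0:
        return 0.0
    return math.copysign(abs(scale), x) / ((abs(x * scale) + 1.0) * DENOM)
```

A central finite difference against `phi_log` is a quick way to run the unit-level equivalence step of the protocol on any new fused op.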

tape_abs(x) — boring PyTorch parity

Element-wise |x|. Filled the obvious hole — the autograd tape had tape_log, tape_exp, tape_sin, etc., but no absolute value.

Pre-existing broadcast-backward bug, fixed in the same chapter

tape_div and tape_mul backwards panicked with col-broadcast denominators. The prom_substrate_softmax α>0 path ends in tape_div(attn_unnorm[N, N], row_sums[N, 1]), whose backward indexed bv.at(i, j) for j up to N−1 in a [N, 1] matrix: out of bounds. As a result, S-MOD + substrate-K had never actually trained end-to-end in OMC; it would panic at the first backward.

Both backwards now iterate the dy shape, reduce indices against each operand's actual extent, and accumulate gradient sums across broadcast axes. L1-MH + S-MOD α=1.0 can finally cross-validate in pure-OMC Prometheus.
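
The fix described above can be sketched in Python for the division case. This is an illustration under the stated rules (iterate the dy shape, clamp each operand index to its real extent, accumulate across broadcast axes), not the Rust implementation; the function name is hypothetical.

```python
def div_backward(a, b, dy):
    """Backward of y = a / b where a or b may be broadcast along rows or columns.

    Returns (da, db) with the same shapes as a and b.
    """
    da = [[0.0] * len(a[0]) for _ in a]
    db = [[0.0] * len(b[0]) for _ in b]

    def clamp(idx, extent):
        # An axis of extent 1 is broadcast: always read and write index 0.
        return idx if extent > 1 else 0

    for i in range(len(dy)):
        for j in range(len(dy[0])):
            ai, aj = clamp(i, len(a)), clamp(j, len(a[0]))
            bi, bj = clamp(i, len(b)), clamp(j, len(b[0]))
            denom = b[bi][bj]
            da[ai][aj] += dy[i][j] / denom
            # Gradients for a broadcast operand sum over the broadcast axis.
            db[bi][bj] -= dy[i][j] * a[ai][aj] / (denom * denom)
    return da, db
```

With a of shape [N, N] and b of shape [N, 1], every column j of row i contributes to the single cell db[i][0], which is the accumulation the old code skipped by indexing b[i][j] directly.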

A/B in pure-OMC Prometheus

examples/prometheus_q6_ab.omc, substrate-K transformer, seq_len=6, d_model=8, ff_dim=16, 80 AdamW steps, 3 seeds:

| | mean val | Δ vs off | composed − fused |
|---|---|---|---|
| off (no Q6) | 2.5692 | | |
| composed Q6 | 2.5530 | −0.0162 (−0.63%) | |
| fused Q6 | 2.5530 | −0.0162 (−0.63%) | 1.2 × 10⁻⁷ |

Composed and fused agree to ~1e-7 after 80 forward+backward AdamW steps — floating-point accumulation-noise floor. The substrate-native primitive matches the boring composed reference exactly under training, confirming the abstraction is free.

Q6 itself wins 2/3 seeds at this tiny scale — first OMC-side cross-validation of the PyTorch Q6 finding (−12.15% 6/6 seeds at TinyShakespeare L1-MH).

Tests

  • examples/tests/test_tape_abs_phi_log.omc — 12 primitive unit tests (forward, backward, edge cases, composed-vs-fused equivalence)
  • examples/tests/test_q6_modulate.omc — 4 modulation-dispatch tests

Full suite: 1103/1103 OMC tests pass.

Files

  • omnimcode-core/src/interpreter.rs: TapeOp::Abs, TapeOp::PhiLog(usize, f64), broadcast-aware Mul/Div backward
  • examples/lib/prometheus.omc: prom_q6_modulate(q, scale, gamma, mode) + q6_mode field on substrate-K layer
  • examples/prometheus_q6_ab.omc — A/B harness
  • experiments/prometheus_parity/TAPE_PRIMITIVES_AB.md — full writeup

v0.3.1 — Symbolic compression: 3.8× smaller predict default + omc_fetch_by_hash

17 May 17:43

WHAT CHANGED

  • omc_predict gains a format parameter:
    • hash (NEW DEFAULT, ~50 bytes/suggestion): fn_name + file +
      canonical_hash + prefix_match_len + substrate_distance
    • signature (~100 bytes): adds the fn signature line
    • full: complete source (previous default behavior)
  • omc_fetch_by_hash(paths, canonical_hash) — companion tool.
    Recovers a function body by alpha-rename-invariant canonical hash.
    Returns {found, fn_name, file, source} or {found: false}.
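
To make "alpha-rename-invariant" concrete, here is a toy Python sketch. It is not OMC's canonicalizer: the keyword set, the regex tokenization, and the use of SHA-256 are all assumptions chosen for illustration. Identifiers are replaced by placeholders in order of first appearance, whitespace is normalized, and the result is hashed.

```python
import hashlib
import re

# Hypothetical keyword set; a real canonicalizer would use the language grammar.
KEYWORDS = {"fn", "return", "if", "else", "while", "for"}

def canonical_hash(source):
    """Hash source text after renaming identifiers by first appearance."""
    names = {}

    def rename(match):
        word = match.group(0)
        if word in KEYWORDS:
            return word
        # First-seen identifier gets v0, the next new one v1, and so on.
        return names.setdefault(word, f"v{len(names)}")

    canonical = re.sub(r"[A-Za-z_]\w*", rename, source)
    canonical = " ".join(canonical.split())  # collapse whitespace
    return hashlib.sha256(canonical.encode()).hexdigest()
```

Two functions that differ only in names and layout hash identically under this scheme, while a change to an operator or literal produces a different hash. That property is what lets a lookup by hash survive a rename.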

MEASURED COMPRESSION
Same query fn prom_attention_ x top_k=5 against prometheus.omc:
format=hash 1253 bytes 26.2% (3.8x smaller)
format=signature 1622 bytes 33.9%
format=full 4783 bytes 100% (v0.3 behavior)

The ratio widens on longer fns — top_k=5 over fns averaging 60
lines compresses ~10x.

WHY IT MATTERS
Canonical hash is alpha-rename invariant — recovery via
fetch_by_hash works even if the fn was renamed after the predict
call. The LLM workflow becomes: predict cheaply (hash), reason
over candidates, fetch only the body it commits to using.
Branching is now ~free at the context-budget level: 50 candidates
fit in the LLM's mind for the cost of 6-7 full bodies.

NOW POSSIBLE

  • LLM agents can hold 5-10x more candidate fns "in mind" per query.
  • Repeated browsing across a corpus stays cheap.
  • The substrate's content-addressed identity becomes a first-class
    context-compression mechanism.

TESTS
13/13 MCP integration tests pass. 231 Rust pass, 1087/1087 OMC.

DEFERRED TO v0.4

  • Wire substrate codec (omc_codec_encode 10-50x ratio) into the
    predict response path for full library-lookup compression.
  • Substrate-keyed conversation memory via fibtier.
  • omc_compress_context(text) MCP tool.
  • Cross-corpus blending.

See CHANGELOG.md#v0.3.1-symbolic-compression for the chapter index.

v0.0.6 — Prometheus: pure-OMC ML framework, first substrate-K wins

17 May 16:48

WHAT CHANGED

  • Tape-based reverse-mode autograd in pure OMC: tape_var, tape_const,
    tape_add, tape_matmul, tape_softmax, ~20 ops. Substrate-preserving.
  • Prometheus framework: prom_linear, prom_relu, prom_softmax,
    prom_mse_loss, prom_sgd_step. Then AdamW, Embedding, LayerNorm,
    CRT-Fibonacci PE, Sequential, TransformerBlock composition.
  • Multi-token batched forward: broadcast-aware tape ops, per-row
    mean/var, multi-token attention.
  • TinyShakespeare end-to-end in pure OMC.
  • Cross-framework parity bench: every Prometheus result reproduced
    in PyTorch (experiments/prometheus_parity/).
  • Substrate-K (L1) wins single-head at TinyShakespeare scale:
    -8% val vs vanilla, 3/3 seeds. First substrate-component win
    that survives at real scale.
  • PyTorch 10-seed + multi-block reproduction of substrate-L3
    (parameter-free attention) wins (-21.5% on toy data).
  • Fibonacci-tier memory primitive (fibtier) — bounded power-law
    context buffer.
  • Substrate-native agent demo: two agents conversing over
    OMC-PROTOCOL with persistent fibtier memory across simulated
    process restart.

WHY IT MATTERS
OMC's substrate finally produces a measurable win on a real ML
training task at real scale, in both a pure-OMC implementation and
an independent PyTorch reproduction. The autograd + Prometheus
stack is the platform that the substrate-attention chapter (v0.1)
is built on top of.

NOW POSSIBLE

  • Train a transformer end-to-end in pure OMC.
  • Compare substrate variants apples-to-apples in PyTorch.
  • Compose substrate primitives (codec + kernel + protocol +
    Prometheus + fibtier) into a single working agent demo.

See CHANGELOG.md#v0.0.6-prometheus for the chapter index.

v0.0.5 — Substrate codec, kernel, OMC-PROTOCOL v1

17 May 16:48

WHAT CHANGED

  • Substrate codec (omc_codec_encode / omc_codec_decode_lookup):
    canonicalize source, tokenize, sample every Nth ID, return
    compressed payload + content hash. Library-lookup decode.
  • omc-kernel: content-addressed filesystem store at
    ~/.omc/kernel/store/<hex_hash>.omc. Alpha-rename invariant — two
    processes converging on the same canonical form produce the same
    address. CLI: ingest, fetch, stat, ls, sign, verify.
  • omc-grep: code archaeology via canonical hash. Found 31.7%
    redundancy in OMC's own examples tree.
  • OMC-PROTOCOL v1: formalized substrate-signed wire format for
    inter-agent messaging. No PKI; integrity via canonical-hash
    recompute.
  • MCP server (omnimcode-mcp): exposes OMC as a runtime to LLM clients.
  • Substrate-aware tokenizer: 285+ builtins, 113 phrase-level dict
    entries, CRT-packed (kind, vocab_id, position_class) IDs,
    token_distance metric, attractor folding.

WHY IT MATTERS
The substrate gains an identity layer (canonical hash) and a wire
format. Two agents talking over OMC-PROTOCOL can verify each other's
claims by recomputing hashes — no shared keys needed. The tokenizer
turns OMC source into a substrate-typed symbol stream — the
foundation for the substrate-indexed completion engine that comes next.

NOW POSSIBLE

  • Compress code by 10-50× via library-lookup codec.
  • Persist Values content-addressed and dedupe across processes.
  • Inter-agent messaging with cryptographic-style integrity but no
    key infrastructure.
  • LLM clients can drive OMC over MCP.

See CHANGELOG.md#v0.0.5-codec-kernel-protocol for the chapter index.

v0.0.4 — LLVM JIT + dual-band SSE2 + harmony branch elision

17 May 16:48

WHAT CHANGED

  • omnimcode-codegen crate: LLVM 18 lowering via inkwell.
  • Scalar lowerer: allocas, CFG branches, comparisons, recursive Call,
    f64 support.
  • Dual-band lowerer: i64 → <2 x i64> SSE2 vectors, packing classical
    α-band with harmonic shadow β-band in a single SSE register.
  • Cross-fn calls in dual-band lowerer.
  • phi_shadow(x) + harmony(x) primitives.
  • Harmony-gated branch elision: high-coherence inputs skip entire
    conditional blocks at native code speed. 270× speedup on @hbit
    bench; +95% reduction when @Harmony+@predict stack.
  • Array support: NewArray, ArrayLen, ArrayIndex read, ArrSetNamed.
  • NSL-KDD real-world JIT measurement: honest negative — array-heavy
    code doesn't beat tree-walk by enough to justify lowering cost.
  • L1.6 Array ↔ JIT bridging at dispatch boundary.
  • omc-bench harness with criterion.

WHY IT MATTERS
OMC gains a credible JIT path. Dual-band SSE2 codegen is novel — no
other language packs a value's classical and harmonic bands into one
register. Harmony-gated branch elision is the first demonstration
that substrate metadata can drive native-code-level optimization.

The NSL-KDD negative result is part of the chapter — being honest
about where the JIT doesn't help is what makes the does-help claims
trustworthy.

NOW POSSIBLE

  • Math-heavy hot loops can hit native code speed via @hbit fn pragma.
  • Branch elision based on input coherence happens transparently.
  • The bench harness can compare tree-walk vs VM vs JIT-via-dispatch
    vs JIT-direct on the same workload.

See CHANGELOG.md#v0.0.4-jit-and-dual-band for the chapter index.

v0.0.3 — Substrate algorithms + self-healing + stdlib

17 May 16:48

WHAT CHANGED

  • Self-healing compiler (Phase H.1-H.5): harmonic + typo + divide-by-
    singularity + parse-level recovery + safe keyword for runtime
    self-healing.
  • Substrate-routed O(log_phi_pi_fib N) algorithm family: substrate_search,
    lower_bound, upper_bound, rank, count_range, slice_range, intersect,
    difference, insert, quantile, select_k, nearest, min_distance, hash.
  • Zeckendorf encoding as first-class integer representation.
  • Stdlib expansion: 16 + 28 + 15 new built-ins across three rounds.
  • Closures + first-class functions + mutable closures + module aliasing.
  • Test runner + --test / --test-all CLI modes.
  • Iterative heal-to-fixpoint + heal-on-runtime-error retry.
  • CLI: --check, --fmt (pretty-print canonical OMC), --help.
  • HBit harmony substrate-routing across the codebase.
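
For readers who have not met Zeckendorf encoding: every non-negative integer has a unique representation as a sum of non-consecutive Fibonacci numbers, and a greedy pass finds it. The Python sketch below shows the textbook algorithm only; how OMC stores or exposes the representation is not described in these notes, and the function name is hypothetical.

```python
def zeckendorf(n):
    """Return the non-consecutive Fibonacci numbers summing to n, largest first."""
    if n < 0:
        raise ValueError("n must be non-negative")
    fibs = [1, 2]
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    parts = []
    for f in reversed(fibs):
        # Greedy choice: taking the largest fit guarantees the next
        # Fibonacci number down is too big, so no two parts are adjacent.
        if f <= n:
            parts.append(f)
            n -= f
    return parts
```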

WHY IT MATTERS
The language gains the safety primitives (self-healing) and the
substrate-routed algorithms that make the substrate observable in
everyday code. Closures + test runner round out the ergonomics.

NOW POSSIBLE

  • Programs that recover from typos / off-attractor literals at
    compile or runtime.
  • O(log_phi_pi_fib N) search-family operations on substrate-typed
    arrays.
  • A real test runner and a real formatter for OMC source.

See CHANGELOG.md#v0.0.3-substrate-and-stdlib for the chapter index.