Release v0.8.8 — Q6 training pushes attention 8.3× toward substrate positions · RandomCoder-lab/OMC

The big finding

Q6 training pushes attention 8.3× toward substrate-aligned positions. This flips the v0.8.7 #8 falsification — sparse attention via substrate distance IS viable, but only after Q6 training.

After 1000 Q6-fused training steps (d_model=32, seq_len=32):

arm	mass in substrate-close cells	cell fraction	ratio
baseline (no Q6), trained	4.82%	6.84%	0.70 (anti-correlated)
Q6 fused, trained	56.80%	6.84%	8.31×

A sparse kernel that computes only substrate-close cells captures 56.8% of attention mass with 6.84% of the score computations. This makes "substrate is the architecture" a real architecture-level claim, unlocked as a post-training inference optimization.
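
The ratio column is attention mass divided by the fraction of cells considered. A minimal sketch of that arithmetic, using the rounded percentages from the table (the function name `concentration_ratio` is illustrative and not part of the OMC codebase):

```python
def concentration_ratio(mass_in_cells: float, cell_fraction: float) -> float:
    """Attention mass captured per unit of cell budget.

    1.0 means attention is spread uniformly over cells; above 1.0 means it
    is concentrated in the selected cells, below 1.0 means it avoids them.
    """
    if not 0.0 < cell_fraction <= 1.0:
        raise ValueError("cell_fraction must be in (0, 1]")
    return mass_in_cells / cell_fraction


# Rounded values taken from the table above.
baseline = concentration_ratio(0.0482, 0.0684)
q6_fused = concentration_ratio(0.5680, 0.0684)
print(f"baseline: {baseline:.2f}x, Q6 fused: {q6_fused:.2f}x")
```

With the rounded inputs this gives roughly 0.70 and 8.30; the reported 8.31× presumably comes from the unrounded measurements.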

Mechanism

Q6 dampens large-magnitude query components via exp(-γ · log_φπfib(|q · scale| + 1)). Components whose substrate log-distance is small get less dampening, so they survive training and dominate the attention pattern. The substrate isn't directly constraining position — it's reshaping the gradient landscape so substrate-aligned positions win.
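
To make the shape of the dampening factor concrete, here is a rough sketch. The definition of log_φπfib is not given in these notes, so the code substitutes a plain base-φ logarithm as a stand-in; `gamma` and `scale` defaults are also placeholder values chosen for illustration. The real function presumably encodes distance to substrate attractors, which this stand-in ignores.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio


def log_phi(x: float) -> float:
    """Base-phi logarithm.

    ASSUMPTION: stand-in for log_φπfib, whose definition is not shown here.
    """
    return math.log(x) / math.log(PHI)


def q6_dampen_factor(q: float, scale: float = 1.0, gamma: float = 0.1) -> float:
    """exp(-gamma * log_phi(|q * scale| + 1)); lies in (0, 1]."""
    return math.exp(-gamma * log_phi(abs(q * scale) + 1.0))


# Larger-magnitude components receive a smaller multiplier.
for q in (0.0, 0.5, 2.0, 8.0):
    print(f"q={q:4.1f}  factor={q6_dampen_factor(q):.3f}")
```

Under this stand-in the factor is exactly 1 at q = 0 and shrinks monotonically as |q| grows, which matches the "large-magnitude components are dampened" half of the description only.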

Implications

Sparse inference kernel: q[i] · k[j] only for substrate_dist(i, j) ≤ τ
~10× attention compute reduction at the cost of dropping ~43% of attention mass (a defensible inference-time tradeoff)
The PyTorch Q6 −12.15% finding may partially be substrate-position alignment in disguise
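
The sparse kernel idea in the first bullet can be sketched as follows. This is not the OMC kernel: `toy_dist` is a hypothetical placeholder for `substrate_dist`, whose definition is not shown in these notes, and the sketch renormalizes the softmax over the kept cells, which the real kernel may or may not do.

```python
import math
from typing import Callable, List, Sequence


def sparse_attention_weights(
    q: Sequence[Sequence[float]],
    k: Sequence[Sequence[float]],
    dist: Callable[[int, int], float],
    tau: float,
) -> List[List[float]]:
    """Softmax attention restricted to pairs with dist(i, j) <= tau.

    Skipped pairs get weight 0 and their dot product is never computed.
    """
    weights = []
    for i, qi in enumerate(q):
        # Score only the cells the distance mask keeps.
        scores = {
            j: sum(a * b for a, b in zip(qi, kj))
            for j, kj in enumerate(k)
            if dist(i, j) <= tau
        }
        row = [0.0] * len(k)
        if scores:
            m = max(scores.values())  # subtract the max for numerical stability
            exps = {j: math.exp(s - m) for j, s in scores.items()}
            z = sum(exps.values())
            for j, e in exps.items():
                row[j] = e / z
        weights.append(row)
    return weights


def toy_dist(i: int, j: int) -> float:
    """HYPOTHETICAL placeholder distance; not OMC's substrate_dist."""
    return abs(i - j)


q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w = sparse_attention_weights(q, k, toy_dist, tau=1)
kept = sum(1 for row in w for x in row if x > 0.0)
print(f"scored {kept} of {len(q) * len(k)} cells")
```

The compute saving comes from the `if dist(i, j) <= tau` filter: the fraction of cells that pass it plays the role of the 6.84% cell fraction reported above.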

Plus 3 more findings

Infrastructure fix — JIT eligibility audit

fn_uses_collections in omnimcode-codegen skips JIT for fns touching arrays/dicts/strings. OMC_HBIT_JIT=1 no longer crashes on Prometheus. Wall-clock unchanged at d_model=256 (v0.8.4 already removed the overhead JIT would have compressed); unblocks JIT for any future tape-using workload.
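
As a toy analogue of this kind of eligibility check (written in Python, not the Rust in omnimcode-codegen), the sketch below walks a made-up AST and refuses JIT for any function that touches a collection node. The `Node` type and the node kinds are hypothetical; the real OMC AST and the real `fn_uses_collections` logic are not shown in these notes.

```python
from dataclasses import dataclass, field
from typing import List

# HYPOTHETICAL node kinds; the real OMC AST is not shown in these notes.
COLLECTION_KINDS = {"array", "dict", "string"}


@dataclass
class Node:
    kind: str
    children: List["Node"] = field(default_factory=list)


def uses_collections(node: Node) -> bool:
    """True if this node or any descendant is a collection-typed node."""
    if node.kind in COLLECTION_KINDS:
        return True
    return any(uses_collections(child) for child in node.children)


def jit_eligible(fn_body: Node) -> bool:
    """Only functions free of collection nodes are handed to the JIT."""
    return not uses_collections(fn_body)


scalar_fn = Node("fn", [Node("add", [Node("int"), Node("int")])])
array_fn = Node("fn", [Node("index", [Node("array"), Node("int")])])
print(jit_eligible(scalar_fn), jit_eligible(array_fn))
```

A conservative check like this trades some missed JIT opportunities for never compiling a function the JIT cannot handle.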

Negative — substrate-quant 6-seed verifies as noise

The v0.8.7 single-seed "lower loss" was seed noise. Mean 2.365 vs baseline 2.337 (+1.2% worse) across 6 seeds × 300 steps with OMC_GPU_SUBSTRATE_QUANT_SCALE=4096. Training-time substrate quantization is a marginal regression at this scale.
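
The +1.2% figure follows directly from the two reported means. A small check of that arithmetic (the helper name is illustrative):

```python
def relative_change(value: float, baseline: float) -> float:
    """Signed relative difference of value against baseline."""
    return (value - baseline) / baseline


# Six-seed mean losses reported above; higher loss is worse.
delta = relative_change(2.365, 2.337)
print(f"{delta:+.1%}")
```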

Negative — substrate-aware param init falsified

Snap-to-attractor at init scale 1024/4096 gives +2.6%/+4.7% worse mean loss vs uniform random init (6 seeds × 300 steps). Starting on attractors gives less gradient info per step.

Methodology: each experiment ≤ 10 min, all four genuinely tried

#	finding	result
1	Q6 post-train sparsity	POSITIVE — 8.31× substrate concentration
2	substrate-quant 6-seed	NEGATIVE — seed noise verified
3	substrate-init A/B	NEGATIVE — falsified, +2.6/+4.7% worse
4	JIT eligibility audit	POSITIVE infra — fix landed, 1111/1111 pass

Two negatives + one massive positive + one infra fix. The "fail forward" discipline keeps producing useful data either way.

Compounding architecture

v0.8.1 fixed broadcast-backward (unblocked S-MOD training)
v0.8.4 fused AdamW (dissolved 96× overhead)
v0.8.5 multi-head substrate-K (architecturally needed for parity)
v0.8.7 tried 4 deferred items
v0.8.8 four more attempts; #1 unlocks future sparse inference

Tests

1111/1111 OMC tests pass.

Files

examples/prometheus_q6_post_train_sparsity.omc — Finding 1
examples/prometheus_substrate_quant_6seed.omc — Finding 2
examples/prometheus_substrate_init_xval.omc — Finding 3
omnimcode-codegen/src/lib.rs — Finding 4 (fn_uses_collections)
omnimcode-core/src/interpreter.rs — substrate_snap_matrix builtin
experiments/prometheus_parity/V088_FOUR_FINDINGS.md — writeup
