v0.8.8 — Q6 training pushes attention 8.3× toward substrate positions

@RandomCoder-lab RandomCoder-lab released this 17 May 23:03
· 281 commits to master since this release

The big finding

Q6 training pushes attention 8.3× toward substrate-aligned positions. This flips the v0.8.7 #8 falsification — sparse attention via substrate distance IS viable, but only after Q6 training.

After 1000 Q6-fused training steps (d_model=32, seq_len=32):

| arm | mass in substrate-close cells | cell fraction | ratio |
|---|---|---|---|
| baseline (no Q6), trained | 4.82% | 6.84% | 0.70 (anti-correlated) |
| Q6 fused, trained | 56.80% | 6.84% | 8.31× |

A sparse kernel that computes only the substrate-close cells captures 56.8% of the attention mass while evaluating 6.84% of the cells. That makes "substrate is the architecture" a real architecture-level claim, available as a post-training inference optimization.
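The ratio column is the attention mass that lands in substrate-close cells divided by the fraction of cells that are substrate-close. A minimal Python sketch of that metric follows. The function name and the |i - j| <= 1 closeness mask are illustrative assumptions, not the code in examples/prometheus_q6_post_train_sparsity.omc, and the release does not define the real closeness rule here.

```python
import numpy as np

def substrate_concentration(attn, close_mask):
    """Return (mass, cell_fraction, ratio) for one attention matrix.

    attn:       (seq, seq) non-negative attention weights
    close_mask: (seq, seq) boolean, True where a cell counts as substrate-close
    """
    mass = attn[close_mask].sum() / attn.sum()   # share of attention in close cells
    cell_fraction = close_mask.mean()            # share of cells that are close
    return mass, cell_fraction, mass / cell_fraction

# Sanity check: uniform attention has no preference, so the ratio should be 1.
seq_len = 32
uniform = np.full((seq_len, seq_len), 1.0 / seq_len)

# ASSUMPTION: placeholder closeness rule used only for this illustration.
idx = np.arange(seq_len)
mask = np.abs(idx[:, None] - idx[None, :]) <= 1

mass, frac, ratio = substrate_concentration(uniform, mask)
```

A ratio above 1 means attention is concentrated in substrate-close cells, and a ratio below 1 means it avoids them. With the reported figures, 56.80 / 6.84 is roughly 8.3 and 4.82 / 6.84 is roughly 0.70.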

Mechanism

Q6 dampens large-magnitude query components via exp(-γ · log_φπfib(|q · scale| + 1)). Components whose substrate log-distance is small get less dampening, so they survive training and dominate the attention pattern. The substrate isn't directly constraining position — it's reshaping the gradient landscape so substrate-aligned positions win.

Implications

  • Sparse inference kernel: q[i] · k[j] only for substrate_dist(i, j) ≤ τ
  • ~10× attention compute reduction at the cost of dropping ~43% of attention mass (a defensible inference-time tradeoff)
  • The PyTorch Q6 −12.15% finding may partially be substrate-position alignment in disguise
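The sparse-kernel idea in the first bullet can be sketched as a dense, masked reference in Python. This is only an illustration under stated assumptions: `sparse_attention` is a hypothetical name, `dist` stands in for a precomputed substrate_dist(i, j) matrix (the release does not define it here, so |i - j| is used as a placeholder), and a real kernel would skip the masked cells rather than compute and discard them.

```python
import numpy as np

def sparse_attention(q, k, v, dist, tau):
    """Softmax attention restricted to cells with dist[i, j] <= tau.

    Assumes every row keeps at least one cell (true when dist[i, i] == 0
    and tau >= 0); otherwise the row softmax would be undefined.
    """
    keep = dist <= tau
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    scores = np.where(keep, scores, -np.inf)               # drop far cells
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)                               # exp(-inf) == 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights, keep

rng = np.random.default_rng(0)
seq_len, d_model = 32, 32
q, k, v = (rng.standard_normal((seq_len, d_model)) for _ in range(3))

# ASSUMPTION: placeholder distance, not the project's substrate distance.
idx = np.arange(seq_len)
dist = np.abs(idx[:, None] - idx[None, :])

out, weights, keep = sparse_attention(q, k, v, dist, tau=1)
print(f"cells evaluated: {keep.mean():.2%}")
```

Renormalizing over the kept cells is one design choice. It keeps each row a valid distribution, but it also means the output differs from dense attention by more than the dropped mass alone.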

Plus 3 more findings

Infrastructure fix — JIT eligibility audit

fn_uses_collections in omnimcode-codegen skips JIT for fns touching arrays/dicts/strings. OMC_HBIT_JIT=1 no longer crashes on Prometheus. Wall-clock unchanged at d_model=256 (v0.8.4 already removed the overhead JIT would have compressed); unblocks JIT for any future tape-using workload.
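For readers unfamiliar with this kind of audit, here is a toy Python sketch of an eligibility check: walk a function body and refuse JIT if any node touches a collection. The `Node` type, the node kinds, and `jit_eligible` are all hypothetical. The actual fn_uses_collections lives in omnimcode-codegen/src/lib.rs (Rust) and may work differently.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                                   # e.g. "num", "add", "array_lit"
    children: list = field(default_factory=list)

# ASSUMPTION: illustrative node kinds that imply arrays, dicts, or strings.
COLLECTION_KINDS = {"array_lit", "dict_lit", "str_lit", "index"}

def uses_collections(node):
    """True if this node or any descendant touches a collection."""
    if node.kind in COLLECTION_KINDS:
        return True
    return any(uses_collections(child) for child in node.children)

def jit_eligible(fn_body):
    """Only pure-scalar function bodies are handed to the JIT."""
    return not uses_collections(fn_body)

# x + y * z: scalars only
scalar_fn = Node("add", [Node("num"), Node("mul", [Node("num"), Node("num")])])
# x + arr[i]: touches an array, so it stays on the interpreter path
array_fn = Node("add", [Node("num"), Node("index", [Node("array_lit"), Node("num")])])

print(jit_eligible(scalar_fn), jit_eligible(array_fn))
```

The check is conservative by construction: one collection-typed node anywhere in the body disqualifies the whole function, which trades some missed JIT opportunities for not crashing.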

Negative — substrate-quant 6-seed verifies as noise

The v0.8.7 single-seed "lower loss" was seed noise. Mean 2.365 vs baseline 2.337 (+1.2% worse) across 6 seeds × 300 steps with OMC_GPU_SUBSTRATE_QUANT_SCALE=4096. Training-time substrate quantization is a marginal regression at this scale.
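The +1.2% figure is the relative change of the 6-seed mean loss against the baseline mean. A short Python check of that arithmetic, using only the two means quoted above (`relative_change` is a helper name introduced here for illustration):

```python
def relative_change(candidate, baseline):
    """Percent change of candidate relative to baseline (positive = higher loss)."""
    return (candidate - baseline) / baseline * 100.0

quant_regression = relative_change(2.365, 2.337)
print(f"{quant_regression:+.1f}%")  # prints "+1.2%"
```

This compares means only. Judging whether the gap exceeds seed-to-seed spread would need the per-seed losses, which are not listed in these notes.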

Negative — substrate-aware param init falsified

Snap-to-attractor at init scale 1024/4096 gives +2.6%/+4.7% worse mean loss vs uniform random init (6 seeds × 300 steps). Starting on attractors gives less gradient info per step.

Methodology: each experiment ≤ 10 min, all four genuinely tried

| # | finding | result |
|---|---|---|
| 1 | Q6 post-train sparsity | POSITIVE — 8.31× substrate concentration |
| 2 | substrate-quant 6-seed | NEGATIVE — seed noise verified |
| 3 | substrate-init A/B | NEGATIVE — falsified, +2.6/+4.7% worse |
| 4 | JIT eligibility audit | POSITIVE infra — fix landed, 1111/1111 pass |

Two negatives + one massive positive + one infra fix. The "fail forward" discipline keeps producing useful data either way.

Compounding architecture

  • v0.8.1 fixed broadcast-backward (unblocked S-MOD training)
  • v0.8.4 fused AdamW (dissolved 96× overhead)
  • v0.8.5 multi-head substrate-K (architecturally needed for parity)
  • v0.8.7 tried 4 deferred items
  • v0.8.8 four more attempts; #1 unlocks future sparse inference

Tests

1111/1111 OMC tests pass.

Files

  • examples/prometheus_q6_post_train_sparsity.omc — Finding 1
  • examples/prometheus_substrate_quant_6seed.omc — Finding 2
  • examples/prometheus_substrate_init_xval.omc — Finding 3
  • omnimcode-codegen/src/lib.rs — Finding 4 (fn_uses_collections)
  • omnimcode-core/src/interpreter.rs — substrate_snap_matrix builtin
  • experiments/prometheus_parity/V088_FOUR_FINDINGS.md — writeup