v0.8.8 — Q6 training pushes attention 8.3× toward substrate positions
The big finding
Q6 training pushes attention 8.3× toward substrate-aligned positions. This flips the v0.8.7 #8 falsification — sparse attention via substrate distance IS viable, but only after Q6 training.
After 1000 Q6-fused training steps (d_model=32, seq_len=32):
| arm | mass in substrate-close cells | cell fraction | ratio |
|---|---|---|---|
| baseline (no Q6), trained | 4.82% | 6.84% | 0.70 (anti-correlated) |
| Q6 fused, trained | 56.80% | 6.84% | 8.31× |
A sparse kernel that computes only the substrate-close cells captures 56.8% of attention mass with 6.84% of the attention compute. This gives the "substrate is the architecture" claim real architecture-level support, unlocked as a post-training inference optimization.
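The ratio column in the table is the share of attention mass in substrate-close cells divided by the share of cells that are substrate-close. A minimal sketch of that measurement, assuming the attention matrix and a boolean "substrate-close" mask are available as nested lists (the function name and the toy inputs are illustrative, not taken from the experiment script):

```python
def concentration_ratio(attn, close_mask):
    """Return (mass_fraction, cell_fraction, ratio) for one attention matrix.

    attn:       list of rows of non-negative attention weights
    close_mask: same shape, True where a cell counts as substrate-close
                (ASSUMPTION: how the mask is built is not shown in these notes)
    """
    total_mass = sum(sum(row) for row in attn)
    close_mass = sum(
        a for row, mrow in zip(attn, close_mask) for a, m in zip(row, mrow) if m
    )
    n_cells = sum(len(row) for row in attn)
    n_close = sum(1 for mrow in close_mask for m in mrow if m)
    mass_fraction = close_mass / total_mass
    cell_fraction = n_close / n_cells
    return mass_fraction, cell_fraction, mass_fraction / cell_fraction


# Toy 4x4 inputs; the diagonal mask is only a placeholder for "substrate-close".
diag_mask = [[i == j for j in range(4)] for i in range(4)]

# Uniform attention: mass tracks cell count, so the ratio is 1.
uniform = [[0.25] * 4 for _ in range(4)]
uniform_ratio = concentration_ratio(uniform, diag_mask)[2]

# All mass on the masked cells: ratio is 1 / cell_fraction.
peaked = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
peaked_ratio = concentration_ratio(peaked, diag_mask)[2]
```

A ratio below 1, as in the baseline row, means the masked cells receive less mass than their share of cells; a ratio above 1 means attention concentrates on them.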
Mechanism
Q6 dampens large-magnitude query components via exp(-γ · log_φπfib(|q · scale| + 1)). Components whose substrate log-distance is small get less dampening, so they survive training and dominate the attention pattern. The substrate isn't directly constraining position — it's reshaping the gradient landscape so substrate-aligned positions win.
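The shape of that dampening term can be sketched as follows. This is an illustration under loud assumptions: the definition of `log_φπfib` is not given in these notes, so a plain base-φ logarithm stands in for it, and the values of `gamma` and `scale` as well as the choice to multiply the query component by the factor are hypothetical.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio


def log_phi(x):
    """Stand-in for the substrate log-distance: plain log base phi.

    ASSUMPTION: the real log_φπfib is not defined in these notes; swap in
    the actual function via the `log_dist` parameter below.
    """
    return math.log(x) / math.log(PHI)


def dampening_factor(c, gamma=0.1, scale=1.0, log_dist=log_phi):
    """exp(-gamma * log_dist(|c * scale| + 1)) for one query component."""
    return math.exp(-gamma * log_dist(abs(c * scale) + 1.0))


def q6_dampen(q, gamma=0.1, scale=1.0, log_dist=log_phi):
    """Multiply each component by its factor (ASSUMPTION: multiplicative use)."""
    return [c * dampening_factor(c, gamma, scale, log_dist) for c in q]


small = dampening_factor(0.5)  # close to 1: little dampening
large = dampening_factor(4.0)  # smaller factor: stronger dampening
```

With any `log_dist` that returns 0 at 1, a zero component is left untouched, and components with a smaller log-distance keep a factor closer to 1, which matches the description above.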
Implications
- Sparse inference kernel: `q[i] · k[j]` only for `substrate_dist(i, j) ≤ τ`
- ~10× attention compute reduction at the cost of ~43% attention quality (a defensible inference-time tradeoff)
- The PyTorch Q6 −12.15% finding may partially be substrate-position alignment in disguise
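The sparse kernel in the first bullet could look roughly like the sketch below: scores are computed only where the distance is within the threshold, and every other position gets zero weight without a dot product. The real `substrate_dist` is not defined in these notes, so it is passed in as a callable, and the `index_dist` placeholder is purely an assumption for the demo.

```python
import math


def sparse_attention_row(q_i, keys, i, substrate_dist, tau):
    """Softmax attention weights for query i over allowed keys only.

    q_i · k_j is computed only where substrate_dist(i, j) <= tau;
    all other positions keep weight 0.0.
    """
    allowed = [j for j in range(len(keys)) if substrate_dist(i, j) <= tau]
    weights = [0.0] * len(keys)
    if not allowed:
        return weights
    scores = [sum(a * b for a, b in zip(q_i, keys[j])) for j in allowed]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    for j, e in zip(allowed, exps):
        weights[j] = e / z
    return weights


def index_dist(i, j):
    """Placeholder distance (ASSUMPTION): plain index gap, not the real metric."""
    return abs(i - j)


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
row = sparse_attention_row([1.0, 0.0], keys, i=0, substrate_dist=index_dist, tau=1)
```

Renormalizing over the allowed cells is one possible design choice; a kernel could instead keep the dense normalizer and simply drop the excluded mass.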
Plus 3 more findings
Infrastructure fix — JIT eligibility audit
fn_uses_collections in omnimcode-codegen skips JIT for fns touching arrays/dicts/strings. OMC_HBIT_JIT=1 no longer crashes on Prometheus. Wall-clock unchanged at d_model=256 (v0.8.4 already removed the overhead JIT would have compressed); unblocks JIT for any future tape-using workload.
Negative — substrate-quant 6-seed verifies as noise
The v0.8.7 single-seed "lower loss" was seed noise. Mean 2.365 vs baseline 2.337 (+1.2% worse) across 6 seeds × 300 steps with OMC_GPU_SUBSTRATE_QUANT_SCALE=4096. Training-time substrate quantization is a marginal regression at this scale.
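The +1.2% figure is the relative change of the treatment mean over the baseline mean. A small sketch of that comparison, assuming final losses are collected as one value per seed; the per-seed values are not listed in these notes, so the demo feeds in only the two reported means.

```python
from statistics import mean


def relative_change_pct(treatment_losses, baseline_losses):
    """Percent change of the mean treatment loss relative to the baseline mean.

    Positive values mean the treatment arm ended with higher (worse) loss.
    """
    t = mean(treatment_losses)
    b = mean(baseline_losses)
    return 100.0 * (t - b) / b


# Reported 6-seed means from the paragraph above, passed as single values.
reported = relative_change_pct([2.365], [2.337])
```

Running the same function on the full per-seed lists would also allow a spread estimate, which is what separates a real effect from seed noise.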
Negative — substrate-aware param init falsified
Snap-to-attractor at init scale 1024/4096 gives +2.6%/+4.7% worse mean loss vs uniform random init (6 seeds × 300 steps). Starting on attractors gives less gradient info per step.
Methodology: each experiment ≤ 10 min, all four genuinely tried
| # | finding | result |
|---|---|---|
| 1 | Q6 post-train sparsity | POSITIVE — 8.31× substrate concentration |
| 2 | substrate-quant 6-seed | NEGATIVE — seed noise verified |
| 3 | substrate-init A/B | NEGATIVE — falsified, +2.6/+4.7% worse |
| 4 | JIT eligibility audit | POSITIVE infra — fix landed, 1111/1111 pass |
Two negatives + one massive positive + one infra fix. The "fail forward" discipline keeps producing useful data either way.
Compounding architecture
- v0.8.1 fixed broadcast-backward (unblocked S-MOD training)
- v0.8.4 fused AdamW (dissolved 96× overhead)
- v0.8.5 multi-head substrate-K (architecturally needed for parity)
- v0.8.7 tried 4 deferred items
- v0.8.8 four more attempts; #1 unlocks future sparse inference
Tests
1111/1111 OMC tests pass.
Files
- `examples/prometheus_q6_post_train_sparsity.omc`: Finding 1
- `examples/prometheus_substrate_quant_6seed.omc`: Finding 2
- `examples/prometheus_substrate_init_xval.omc`: Finding 3
- `omnimcode-codegen/src/lib.rs`: Finding 4 (`fn_uses_collections`)
- `omnimcode-core/src/interpreter.rs`: `substrate_snap_matrix` builtin
- `experiments/prometheus_parity/V088_FOUR_FINDINGS.md`: writeup