Multiplying attention's softmax output by 1 / (1 + α · attractor_distance(scores))
and renormalizing beats vanilla softmax on the L1 multi-head transformer
at TinyShakespeare scale.
- S-MOD val: 2.966 (vs vanilla softmax val: 3.099, −4.27%)
- wins 2/3 seeds (same variance pattern as L1 itself)
- stacks with substrate-K: L0+softmax 3.308 → L1+smod 2.966 = −10.3% cumulative
- no parameter cost — pure modulation of softmax output
| Variant | Formula | val | vs softmax | wins |
|---|---|---|---|---|
| softmax | exp(s) / Σ exp(s) |
3.099 | — | — |
| smod | softmax(s) × 1/(1+α·d) / norm |
2.966 | −4.27% | 2/3 |
| ssnap | softmax(s + β·(snap(s) − s)) |
3.095 | −0.12% | 2/3 |
| srank | softmax(0.5·s − rank·log(φ)·5) |
3.260 | +5.21% | 1/3 |
Where d = attractor_distance(score) and snap(s) = nearest Fibonacci attractor (signed).
The mechanism: after softmax converts scores to a probability distribution, positions whose raw scores landed far from a Fibonacci attractor get dampened. The renormalization recovers a valid probability distribution.
Architecturally: the modulation is substrate-aware regularization on the attention pattern. Off-attractor positions are weighted less in the value-aggregation step. The model is encouraged to attend to positions whose attention scores naturally align with the substrate's integer lattice.
The win is consistent with the broader OMC architectural rule: substrate metric applied to a quantity that has integer-coherent structure helps; applied to learned floats with no such structure, it doesn't. Here the attention score values get nudged toward discrete substrate addresses, which acts as a soft snap-to-grid on the attention pattern.
softmax(s + β·(snap(s) - s)) pulls raw score values toward attractors
before softmax. Theoretically this should also help, but at β=0.1
the magnitude of the snap is too small relative to the variance of
scores at this scale (which span several units). The substrate signal
is present but drowned out. A higher β might help; this run kept it
conservative.
Rank-based weighting softmax(-rank · log φ) is mathematically clean
(geometric weights by attractor-distance rank) but breaks smooth
attention gradients. The model can't learn to attend to specific
content positions; it can only adjust the magnitude of all positions
simultaneously. Predicted failure mode that materialized.
| Component | Substrate variant | Status |
|---|---|---|
| Positional encoding | CRT-Fibonacci PE | WINS −5.4% (TinyShakespeare) |
| OOD detection | HBit cross-cutting tension | WINS AUROC 1.0 |
| Attention K matrix | CRT-PE addressing | WINS −6.3% val (multi-head, TinyShakespeare) |
| Attention softmax | S-MOD harmonic modulation | WINS −4.27% val (multi-head, TinyShakespeare) |
| Geodesic attention bias | additive position bias | WINS 3/3 (single-block) |
| Optimizer | Harmonic SGD | WINS −13.2% (tiny-scale tinyLM) |
Six substrate-component wins across the transformer. Two of them (K + softmax) stack at TinyShakespeare scale for a combined −10.3% val vs the vanilla baseline.
- α was fixed at 0.5. A sweep might find a better point.
- Single corpus (TinyShakespeare). Generalization to other domains unmeasured.
- 3 seeds — minimum for "majority vote"; more would tighten the variance.
- S-MOD's gradient flows through softmax + multiplicative dampening
- renormalization. Numerical stability at very large gradient magnitudes is unmeasured.
The substrate-aware attention block in Prometheus should now use:
- K = CRT-Fibonacci (substrate-K, validated)
- Q = learned (per-head)
- V = learned (per-head)
- Normalization = S-MOD softmax (new, validated)
- Output projection learned
Two component swaps, ~10% cumulative val improvement on real corpus, ~10% parameter reduction from K removal alone.
def softmax_smod(scores, dim=-1, alpha=0.5):
base = F.softmax(scores, dim=dim)
mod = 1.0 / (1.0 + alpha * attractor_distance(scores))
out = base * mod
return out / (out.sum(dim=dim, keepdim=True) + 1e-9)8 lines. Drop-in replacement for F.softmax(scores, dim=-1) anywhere in an attention path.
See experiments/prometheus_parity/torch_substrate_softmax.py for the full A/B harness.
Original run fixed α=0.5 untuned. A 3-seed sweep ([42, 7, 123]) over {0.0, 0.1, 0.3, 0.5, 1.0} reveals a stronger setting:
| α | mean val | std | vs α=0 |
|---|---|---|---|
| 0.0 | 3.3007 | 0.033 | — |
| 0.1 | 3.1220 | 0.195 | −5.41% |
| 0.3 | 3.1872 | 0.215 | −3.44% |
| 0.5 | 3.2015 | 0.174 | −3.01% |
| 1.0 | 3.0837 | 0.218 | −6.57% |
Two takeaways:
- Every α > 0 beats α = 0. The S-MOD win in the original writeup is robust across the modulation-strength axis — not just a particular setting.
- α = 1.0 is the new best. Validation drops to 3.084 (−6.57% vs vanilla, doubling the −3.01% advantage at α=0.5). Variance is high (σ=0.22), but mean is decisively best across three seeds.
Updated production default in examples/lib/prometheus.omc:
fn prom_attention_substrate_k_new(d_model, seq_len, rng_state) {
...
dict_set(layer, "smod_alpha", 1.0); # was 0.5
...
}
Raw 3-seed data: results_torch_smod_alpha_3seed.json.