Replace attention's K matrix with the CRT-Fibonacci positional encoding. Keep Q and V learned. Result: 8% lower validation loss on TinyShakespeare with ~9% fewer parameters.
Variant params train val
L0 (standard QKV) 11,617 0.110 0.113
L1 (substrate-K, learned Q+V) 10,593 0.103 0.104 ← -8.0% val
L3 (parameter-free attention) 8,545 2.555 2.584 (fails)
L5 (substrate-K, learned Q, V=id) 9,569 1.941 1.976 (unstable)
L6 (sub-K, hybrid-Q, V=id) 9,570 1.899 1.961 (unstable)
Seeds: 42, 7, 123. Corpus: TinyShakespeare 1.1MB, 90/10 train/val split. Architecture: single-block transformer, d_model=32, seq=32, ff=64. Training: 1500 steps, AdamW lr=0.005.
The transformer's K (key) matrix exists to encode "what does position
j look like when something queries for it." In standard attention,
K is learned via K = x @ W_K. We replaced it with the substrate's
CRT-Fibonacci positional encoding table — fixed, no learnable params.
The model BENEFITS from this substitution at real-corpus scale:
- 8% lower validation loss
- ~9% fewer parameters (10,593 vs 11,617)
- Train/val gap stays tight (0.001 vs 0.003)
L1's K is the substrate; L1's Q is learned content-aware projection; L1's V is learned content-aware projection. The substrate replaces the addressing scheme while leaving the content paths free.
Attention is fundamentally a SOFT INDEXING OPERATION. Three roles:
- K — "addresses" each position has
- Q — "addresses" each position is asking for
- V — "content" returned when attended to
The substrate provides a globally-structured addressing scheme via CRT-Fibonacci moduli (positions encoded with pairwise-coprime periodicity). That's a strong inductive prior for SEQUENCE TASKS.
In standard transformers, K has to learn this addressing scheme from scratch. It eventually does, but:
- It costs ~d² params per head
- It takes training time
- Until learned, attention is noisy
By making K = substrate, we hand the model a pre-built addressing scheme. The model only has to learn what to ASK (Q) and what to PROVIDE (V) — both of which are inherently content-dependent.
L3 sets K = Q = CRT-PE AND V = identity. That removes both:
- Content-aware querying (Q frozen)
- Content-aware value projection (V removed)
The model has no way to do content-keyed attention or content-mixing. It's just position-soup. At tiny scale (73 chars), there's not enough data to demand content awareness — substrate's position prior is enough. At TinyShakespeare scale (1.1MB), real linguistic structure demands content keying — L3 hits a ceiling at near-uniform loss (2.58 vs log(65)=4.17 baseline).
| Scale | L1 vs L0 | L3 vs L0 | Winner |
|---|---|---|---|
| Tiny (73 chars, 250 steps) | −3.9% wins 8/10 | −28.5% wins 10/10 | L3 |
| Multi-block tiny | (similar) | −3.1% wins 3/5 | L3 |
| TinyShakespeare val | −8.0% wins 3/3 | +2185% fails 0/3 | L1 |
The takeaway: substrate-K (L1) is the universally-winning variant. At tiny scale, fully-substrate (L3) wins by more, but L1 also wins. At scale, L1 keeps winning, L3 catastrophically fails.
L1 is the substrate-attention sweet spot. It's the architectural recommendation.
The "transformerless" framing is wrong. The substrate isn't replacing the transformer — it's improving specific components of the transformer:
| Component | Substrate substitution | Status |
|---|---|---|
| Positional encoding | CRT-PE | WINS (-5.4% to -2.9% PyTorch) |
| OOD signal | HBit tension | WINS (AUROC 1.0) |
| Attention K matrix | CRT-PE addressing | WINS (-8% val at TinyShakespeare scale) |
| Attention Q | learn it | (substrate replacement loses) |
| Attention V | learn it | (substrate replacement loses) |
| Optimizer | harmonic SGD | WINS (-13.2% vs vanilla, tiny scale) |
| Geodesic attention bias | add bias | WINS (-0.4% to -32.5% range) |
Six substrate wins across the transformer architecture. None of them replace the entire transformer; each replaces a specific component where the substrate's structural prior beats learned-from-scratch.
The right framing: "substrate-aware transformer" — keeps the transformer architecture, replaces individual components with substrate primitives where they win.
For Prometheus' transformer block, the recommended default:
fn build_substrate_transformer_block(d_model, ff_dim, seq_len, seed) {
h emb = prom_embedding_new(vocab, d_model, seed);
h attn = prom_attention_substrate_k_new(d_model, seq_len, seed); # L1
h ln1 = prom_layernorm_new(d_model, seed);
h ff = ...;
h ln2 = ...;
h head = ...;
return ...;
}
L1 — substrate-K with learned Q + V — is the architectural default. L0 (standard QKV) and L3 (parameter-free) are available as alternatives for ablations / specific regimes.
- Single architecture (single-block, d_model=32). Larger models may behave differently.
- One corpus (TinyShakespeare). Other domains (code, math, multilingual) unmeasured.
- 3 seeds at scale. More seeds would tighten variance estimates.
- Training set size (1.1MB) is "real corpus" but not foundation-model scale. The behavior at 100B+ tokens is unknown.
What's clear: at the scale where most real-world LLMs operate (fine-tunes, specialists, small foundation models), substrate-K attention is a measurable improvement. That's the actionable result.