Geodesic attention — derived from what we know

RandomCoder-lab · claude · RandomCoder-lab · commit 5275c4dce073 · 2026-05-16T22:04:29.000-05:00
After three falsifications of HBit-tension attention gates (key,
score, learned), the common failure pattern was clear: every
variant applied substrate metric to a CONTINUOUS LEARNED quantity
(key magnitudes or attention scores). Those quantities have no
architectural reason to land on Fibonacci attractors.

What we know WORKS (CRT-PE, HBit OOD) applies substrate to
INTEGER-VALUED quantities (positions, sample-aggregate signals)
— quantities that intrinsically live in the substrate's basis.

The derivation (full writeup: GEODESIC_ATTENTION_DERIVATION.md):

  scores[i, j] = (q_i · k_j) / sqrt(d) - alpha * geodesic(i, j)

  geodesic(i, j) = sum over CRT moduli {5, 8, 13, 21, 34, 55, 89, 144}
                   of circular_distance((i % m), (j % m)) / m

This is ALiBi-style additive position bias, but the position
distance is computed in the same CRT-Fibonacci lattice that
CRT-PE already lives in. Substrate signal applied to POSITIONS
(integer, native basis) instead of activations.
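
As a sanity check on the formula above, here is a minimal pure-Python
sketch of the unnormalized distance. It is not the code in models.py
(which builds a normalized torch table); the names `circular_distance`
and `geodesic` are illustrative only.

```python
# Illustrative sketch only; models.py builds a normalized torch table instead.
MODULI = (5, 8, 13, 21, 34, 55, 89, 144)  # moduli quoted in the commit message


def circular_distance(a: int, b: int, m: int) -> int:
    """Shortest distance between residues a and b on a ring of size m."""
    d = abs(a - b) % m
    return min(d, m - d)


def geodesic(i: int, j: int, moduli=MODULI) -> float:
    """Sum of per-modulus circular distances, each divided by its modulus."""
    return sum(circular_distance(i % m, j % m, m) / m for m in moduli)


if __name__ == "__main__":
    for j in (1, 4, 5, 8):
        print(f"geodesic(0, {j}) = {geodesic(0, j):.4f}")
```

Unlike plain |i - j|, this distance is not monotone in the offset:
positions 0 and 5 share a residue mod 5, so they come out closer
than positions 0 and 4.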

Properties that distinguish this from the previous three failures:
  - Substrate metric on integer quantities (vs continuous floats)
  - Same lattice as CRT-PE (which is the validated substrate win)
  - Additive pre-softmax bias (composes natively)
  - Single learnable alpha per block, init 0 (must DISCOVER bias)
  - Precomputed at construction (no per-batch substrate compute)
  - Independent of token content (geometry only)
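
The alpha-init-0 property can be illustrated on a single toy attention
row. This is a sketch under assumptions: the scores and distances are
made-up values, and `biased_attention_row` is a hypothetical helper,
not a function in this repository.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    mx = max(row)
    exps = [math.exp(x - mx) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention_row(scores, distances, alpha):
    """Pre-softmax additive bias: score - alpha * distance, then softmax."""
    return softmax([s - alpha * d for s, d in zip(scores, distances)])


scores = [0.2, -0.1, 0.4]     # hypothetical q.k / sqrt(d) values for one query
distances = [0.0, 1.5, 0.7]   # hypothetical geodesic distances to each key

baseline = softmax(scores)
at_init = biased_attention_row(scores, distances, alpha=0.0)
with_bias = biased_attention_row(scores, distances, alpha=1.0)

if __name__ == "__main__":
    print("baseline :", [round(p, 4) for p in baseline])
    print("alpha=0  :", [round(p, 4) for p in at_init])
    print("alpha=1  :", [round(p, 4) for p in with_bias])
```

With alpha = 0 the row is identical to the unbiased softmax, so the
model starts out as crt_only. With alpha > 0 the key at the largest
distance loses weight and the weights still sum to 1 without any
renormalization step.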

Training run kicked off; results in a separate commit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
diff --git a/experiments/transformerless_lm/GEODESIC_ATTENTION_DERIVATION.md b/experiments/transformerless_lm/GEODESIC_ATTENTION_DERIVATION.md
@@ -0,0 +1,116 @@
+# Geodesic attention — deriving from what we've measured
+
+## What we actually know (not what we hoped)
+
+After CRT-PE (2 wins) + HBit OOD (1 win) + three falsified attention
+gates, the empirical map is:
+
+| Where substrate applied | Basis | Result |
+|---|---|---|
+| Position → CRT-PE | integer position `i` | **WINS** −5.4% / −2.9% |
+| Reference-free OOD score | per-sample HBit tension | **WINS** AUROC 1.0 |
+| Attention KEY magnitude gate | learned float `\|k\|.mean(-1)` | FAILS 0/3 |
+| Attention SCORE gate | learned float `q @ k^T / √d` | FAILS 0/3 |
+| Same with learned threshold | same float quantity | FAILS 0/3 |
+
+**The common failure pattern**: every failed gate applied
+`attractor_distance(·)` to a *continuous, Gaussian-ish, learned*
+quantity. Those quantities have no architectural reason to land
+on Fibonacci attractors, which live in integer ID
+space (the basis that CRT-PE actually uses).
+
+**The wins share a pattern**: substrate signal applied to a
+quantity that's *intrinsically integer-valued* (positions in
+CRT-PE) or *aggregated cross-position* (HBit OOD over a sample).
+The substrate's lattice lives in those bases.
+
+## The right basis for attention bias
+
+Attention has TWO sources of structure:
+1. **The query/key activations** (continuous, learned, no substrate
+   structure → all three previous attempts)
+2. **The query/key POSITIONS** (integer, indexed 0..T, *is*
+   meaningful in substrate space — that's why CRT-PE works)
+
+We've been adding the substrate signal to source #1. The right move
+is to add it to source #2. Specifically: **attention bias should be
+a function of geodesic distance between positions i and j in the
+same CRT-Fibonacci-moduli space CRT-PE already uses.**
+
+## The formula
+
+For positions i, j and Fibonacci moduli M = {5, 8, 13, 21, 34, 55, 89, 144}:
+
+```
+d_circ(i, j, m) = min(|(i % m) − (j % m)|, m − |(i % m) − (j % m)|)
+geodesic(i, j) = Σ_{m ∈ M} d_circ(i, j, m) / m       # normalize to [0, ~|M|/2]
+```
+
+Each per-modulus term is a circular distance on a ring of size `m`,
+divided by `m` (equal residues contribute 0; antipodal residues
+contribute at most 1/2). The total is the L1 sum over moduli, i.e. the
+geodesic length in the CRT-Fibonacci lattice.
+
+Why circular: positions on a ring of size `m` should be treated as
+adjacent at the wrap. This matches CRT-PE which uses
+`sin(2π·pos%m/m)` — same circularity.
+
+## The attention modification
+
+Pre-softmax additive bias (the form that works for ALiBi):
+
+```
+scores_ij = (q_i · k_j) / √d − α · geodesic(i, j)
+attn = softmax(scores)
+```
+
+α is a learned scalar, one per attention block (initialized to 0, so
+the model can leave the substrate signal disabled if the loss says to;
+same fairness condition as `hybrid_learned`).
+
+## Why this should work where the previous three failed
+
+| Property | Previous gates | Geodesic |
+|---|:-:|:-:|
+| Substrate metric applied to integer quantities | ✗ | ✓ |
+| Same basis as CRT-PE (proven to work) | ✗ | ✓ |
+| Composes additively with softmax | partly | ✓ |
+| Model can disable via single learnable | ✓ | ✓ |
+| Computable once at init (not per-batch) | ✗ | ✓ |
+| Independent of token content | ✗ | ✓ |
+
+The last two are important: the geodesic table is `[T, T]`
+precomputed at model construction. Forward pass adds the bias
+without computing anything per-batch. This is essentially **ALiBi
+with substrate-geodesic distances instead of plain absolute
+distance** — and ALiBi itself is known to work, so the prior on
+this formulation is much stronger than another activation gate.
+
+## Falsifiable prediction
+
+- If geodesic attention WINS vs crt_only on the distractor mix:
+  substrate IS useful as an attention modulator, but the basis
+  matters. The transformerless thesis gets a third architectural
+  win.
+- If geodesic attention LOSES: attention modulation in OMC's
+  substrate is truly dead at this scale, regardless of basis.
+  An honest pivot to tokenizer-layer substrate becomes the only
+  remaining direction for the substrate.
+
+This is the final attention-side experiment unless it works.
+If it loses, we move the substrate's role away from
+attention.
+
+## Init details (matters for fair comparison)
+
+- α = 0.0 per block (bias disabled at init; the model has to
+  *find* the bias useful from gradient signal alone)
+- Geodesic table normalized so its mean over (i, j) for i ≠ j
+  is approximately 1.0 (so α has interpretable units)
+- All other hyperparameters identical to
+  `train_gate_reformulation.py` (d_model=128, n_blocks=4,
+  seq_len=128, 1500 steps, distractor_frac=0.20, 3 seeds)
+
+The only architectural variable changed from `crt_only` is the
+addition of the geodesic bias to attention scores. Everything else
+identical.
diff --git a/experiments/transformerless_lm/models.py b/experiments/transformerless_lm/models.py
@@ -103,6 +103,39 @@ def hbit_tension_gate(keys: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
     return 1.0 / (1.0 + scale * attractor_distance(keys))
 
 
+# Same Fibonacci moduli as CRT-PE. The geodesic distance is computed
+# in the same lattice the positional encoding lives in — that's the
+# architectural coherence that the previous gate formulations lacked.
+_GEODESIC_MODULI = _FIB_MODULI
+
+
+def geodesic_distance_table(seq_len: int) -> torch.Tensor:
+    """Precompute a [seq_len, seq_len] table of CRT-Fibonacci
+    geodesic distances. For each pair (i, j) and each modulus m,
+    take the circular distance between residues (i % m) and (j % m)
+    — `min(d, m - d)` so positions on a ring of size m wrap.
+    Each term is divided by m before summing over moduli, so each
+    modulus contributes at most 0.5.
+
+    Returned table is normalized so its mean over i ≠ j is ≈ 1.0,
+    giving the learned α-bias scalar interpretable units.
+    """
+    table = torch.zeros(seq_len, seq_len, dtype=torch.float32)
+    pos = torch.arange(seq_len)
+    for m in _GEODESIC_MODULI:
+        ri = (pos % m).unsqueeze(1)             # [T, 1]
+        rj = (pos % m).unsqueeze(0)             # [1, T]
+        d = (ri - rj).abs() % m                  # [T, T]
+        d_circ = torch.minimum(d, m - d)         # circular distance
+        table = table + d_circ.float() / float(m)
+    # Normalize so mean of off-diagonal ≈ 1.0.
+    n_offdiag = seq_len * seq_len - seq_len
+    mean_offdiag = (table.sum() - torch.diagonal(table).sum()) / max(n_offdiag, 1)
+    if mean_offdiag.item() > 0:
+        table = table / mean_offdiag
+    return table
+
+
 # ---------------------------------------------------------------------------
 # Attention block
 # ---------------------------------------------------------------------------
@@ -125,21 +158,32 @@ class Attention(nn.Module):
                    substrate distance is a useful signal for the task.
     """
 
-    def __init__(self, d_model: int, gate_mode: str = "none", dropout: float = 0.0):
+    def __init__(self, d_model: int, gate_mode: str = "none",
+                 seq_len: int = 128, dropout: float = 0.0):
         super().__init__()
-        if gate_mode not in ("none", "key", "score", "learned"):
+        if gate_mode not in ("none", "key", "score", "learned", "geodesic"):
             raise ValueError(f"unknown gate_mode: {gate_mode}")
         self.d_model = d_model
         self.qkv = nn.Linear(d_model, 3 * d_model)
         self.out = nn.Linear(d_model, d_model)
         self.gate_mode = gate_mode
         self.dropout = dropout
         if gate_mode == "learned":
-            # Initialize so sigmoid(W*d + b) ≈ 1/(1 + d) near d ≈ 0:
-            # picking W = -1, b = 0 gives sigmoid(-d) ∈ (0, 0.5], a
-            # softer version of the falsified gate. Both are learnable.
             self.gate_w = nn.Parameter(torch.tensor(-1.0))
             self.gate_b = nn.Parameter(torch.tensor(0.0))
+        if gate_mode == "geodesic":
+            # ALiBi-style additive position bias, but using CRT-Fibonacci
+            # geodesic distance instead of plain |i-j|. Precomputed once
+            # at construction so the forward pass adds a [T,T] tensor
+            # to scores — no per-batch substrate compute.
+            self.register_buffer(
+                "geodesic_bias", geodesic_distance_table(seq_len)
+            )
+            # α scalar — initialized to 0 so the model starts as pure
+            # crt_only and must DISCOVER the bias is useful from
+            # gradient signal alone. Same fairness condition as
+            # gate_mode="learned".
+            self.alpha = nn.Parameter(torch.tensor(0.0))
 
     def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
         B, T, D = x.shape
@@ -149,17 +193,18 @@ def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
         scores = (q @ k.transpose(-2, -1)) * scale  # [B, T, T]
 
         if self.gate_mode == "score":
-            # Gate on score VALUES (pre-mask). attractor_distance of
-            # raw scores tells us whether the (q·k) magnitude lands
-            # on a substrate attractor. Off-attractor scores get
-            # additively penalized in log-space, so softmax handles
-            # normalization natively.
-            d = attractor_distance(scores * 10.0)  # [B, T, T]
+            d = attractor_distance(scores * 10.0)
             log_gate = -torch.log1p(d)
             scores = scores + log_gate
+        elif self.gate_mode == "geodesic":
+            # Subtract α * geodesic(i, j). Larger distance → more
+            # negative bias → softmax attenuates that pair. α<0 would
+            # invert (favor distant pairs), so the sign of α is
+            # itself a learnable architectural choice.
+            scores = scores - self.alpha * self.geodesic_bias[:T, :T].unsqueeze(0)
 
         scores = scores.masked_fill(mask == 0, float('-inf'))
-        attn = F.softmax(scores, dim=-1)  # [B, T, T]
+        attn = F.softmax(scores, dim=-1)
 
         if self.gate_mode == "key":
             key_mag = k.abs().mean(dim=-1)
@@ -168,7 +213,7 @@ def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
             attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-9)
         elif self.gate_mode == "learned":
             key_mag = k.abs().mean(dim=-1)
-            d = attractor_distance(key_mag * 100.0)  # [B, T]
+            d = attractor_distance(key_mag * 100.0)
             gate = torch.sigmoid(self.gate_w * d + self.gate_b)
             attn = attn * gate.unsqueeze(1)
             attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-9)
@@ -198,9 +243,9 @@ def forward(self, x):
 
 
 class Block(nn.Module):
-    def __init__(self, d_model: int, gate_mode: str = "none"):
+    def __init__(self, d_model: int, gate_mode: str = "none", seq_len: int = 128):
         super().__init__()
-        self.attn = Attention(d_model, gate_mode=gate_mode)
+        self.attn = Attention(d_model, gate_mode=gate_mode, seq_len=seq_len)
         self.ff = FeedForward(d_model)
         self.ln1 = nn.LayerNorm(d_model)
         self.ln2 = nn.LayerNorm(d_model)
@@ -234,7 +279,7 @@ def __init__(
             raise ValueError(f"unknown pe_kind: {pe_kind}")
         self.register_buffer("pe", pe)
         self.blocks = nn.ModuleList([
-            Block(d_model, gate_mode=gate_mode) for _ in range(n_blocks)
+            Block(d_model, gate_mode=gate_mode, seq_len=seq_len) for _ in range(n_blocks)
         ])
         self.ln_f = nn.LayerNorm(d_model)
         self.head = nn.Linear(d_model, vocab_size, bias=False)
@@ -282,4 +327,10 @@ def make_model(
         return TinyLM(**common, pe_kind="crt", gate_mode="score")
     if arch == "hybrid_learned":
         return TinyLM(**common, pe_kind="crt", gate_mode="learned")
+    if arch == "hybrid_geodesic":
+        # CRT-PE + ALiBi-style additive position bias in CRT-Fibonacci
+        # geodesic distance. Substrate signal applied to POSITIONS
+        # (integer, native to the substrate's basis) instead of
+        # activations (continuous, no substrate structure).
+        return TinyLM(**common, pe_kind="crt", gate_mode="geodesic")
     raise ValueError(f"unknown arch: {arch}")
diff --git a/experiments/transformerless_lm/train_geodesic_attention.py b/experiments/transformerless_lm/train_geodesic_attention.py
@@ -0,0 +1,126 @@
+"""Geodesic attention vs crt_only on distractor-mix TinyShakespeare.
+
+The LAST attempt at substrate-as-attention-modulator. See
+GEODESIC_ATTENTION_DERIVATION.md for the derivation.
+
+The change vs the three previously falsified gates: substrate metric
+is applied to POSITION INDICES (integer, native to the substrate's
+basis), not to learned float activations. Implemented as an
+ALiBi-style additive pre-softmax bias:
+
+    scores[i, j] = (q_i · k_j) / √d − α · geodesic(i, j)
+
+where geodesic(i, j) is the CRT-Fibonacci geodesic distance using
+the SAME moduli as CRT-PE (5, 8, 13, 21, 34, 55, 89, 144). The
+table is precomputed at construction; α is one learnable scalar
+per block, initialized to 0 (model has to discover the bias is
+useful from loss gradient alone).
+"""
+
+import argparse
+import json
+import sys
+import time
+import statistics
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).parent))
+from corpus import make_dataset
+from models import make_model
+from train_distractor_mix import (
+    build_distractor_stream,
+    train_one,
+)
+
+
+ARCHS = ["crt_only", "hybrid_geodesic"]
+
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--steps", type=int, default=1500)
+    parser.add_argument("--batch-size", type=int, default=32)
+    parser.add_argument("--seq-len", type=int, default=128)
+    parser.add_argument("--d-model", type=int, default=128)
+    parser.add_argument("--n-blocks", type=int, default=4)
+    parser.add_argument("--lr", type=float, default=3e-4)
+    parser.add_argument("--eval-every", type=int, default=100)
+    parser.add_argument("--seeds", type=str, default="42,7,123")
+    parser.add_argument("--distractor-frac", type=float, default=0.20)
+    parser.add_argument("--out", type=str, default="results_geodesic_attention.json")
+    args = parser.parse_args()
+
+    seeds = [int(s) for s in args.seeds.split(",")]
+
+    chars, stoi, itos, encoded = make_dataset(
+        seq_len=args.seq_len, source="tinyshakespeare",
+    )
+    vocab_size = len(chars)
+
+    print(f"Geodesic attention — distractor_frac={args.distractor_frac:.2f}")
+    print(f"Archs: {ARCHS}")
+    print(f"Corpus: TinyShakespeare ({encoded.numel():,} chars, vocab {vocab_size})")
+    print(f"Model: d_model={args.d_model}, n_blocks={args.n_blocks}, seq_len={args.seq_len}")
+    print(f"Training: steps={args.steps}, batch={args.batch_size}, lr={args.lr}, seeds={seeds}",
+          flush=True)
+
+    all_results = {arch: [] for arch in ARCHS}
+    per_seed_logs = []
+    for seed in seeds:
+        print(f"\n=========== seed {seed} ===========", flush=True)
+        train_split, val_split = build_distractor_stream(
+            encoded, args.distractor_frac, args.seq_len, seed,
+        )
+        seed_record = {"seed": seed, "archs": {}}
+        for arch in ARCHS:
+            r = train_one(arch, train_split, val_split, vocab_size, args, seed)
+            all_results[arch].append(r["final_val"])
+            seed_record["archs"][arch] = {
+                "final_val": r["final_val"],
+                "n_params": r["n_params"],
+                "time": r["time"],
+            }
+            print(f"  [seed {seed}] {arch}: final_val={r['final_val']:.4f}", flush=True)
+        per_seed_logs.append(seed_record)
+
+    print()
+    print("=" * 70)
+    print(f"{'arch':<18} {'mean_final_val':>16} {'std':>10} {'vs crt_only':>14}")
+    print("-" * 70)
+    base = all_results["crt_only"]
+    base_mean = sum(base) / len(base)
+    summary = {"distractor_frac": args.distractor_frac, "steps": args.steps,
+               "seeds": seeds, "per_seed": per_seed_logs, "summary": {}}
+    for arch in ARCHS:
+        vals = all_results[arch]
+        mean = sum(vals) / len(vals)
+        std = statistics.stdev(vals) if len(vals) > 1 else 0.0
+        if arch == "crt_only":
+            tag = "—"
+        else:
+            wins = sum(1 for v, b in zip(vals, base) if v < b)
+            rel = (mean - base_mean) / base_mean * 100
+            tag = f"{rel:+.1f}% ({wins}/{len(vals)})"
+        print(f"{arch:<18} {mean:>16.4f} {std:>10.4f} {tag:>14}")
+        summary["summary"][arch] = {"mean": mean, "std": std, "vals": vals}
+
+    print()
+    print("Interpretation:")
+    m_geo = sum(all_results["hybrid_geodesic"]) / len(all_results["hybrid_geodesic"])
+    rel = (m_geo - base_mean) / base_mean * 100
+    wins = sum(1 for v, b in zip(all_results["hybrid_geodesic"], base) if v < b)
+    if m_geo < base_mean:
+        verdict = "GEODESIC EARNS KEEP — substrate works on positions, not activations"
+    else:
+        verdict = "GEODESIC ALSO FAILS — substrate is exhausted as attention modulator"
+    print(f"  hybrid_geodesic vs crt_only: {rel:+.1f}%, wins {wins}/{len(base)}")
+    print(f"  → {verdict}")
+
+    out_path = Path(__file__).parent / args.out
+    with open(out_path, "w") as f:
+        json.dump(summary, f, indent=2)
+    print(f"\nWrote {out_path}")
+
+
+if __name__ == "__main__":
+    main()